Killing Servers with Virtualisation and Swap

The Problem:

A problem with virtual machines is the fact that one rogue DomU can destroy the performance of all the others by inappropriate resource use. CPU scheduling is designed to allow reasonable sharing of computational resources, it is unfortunately not well documented, the XenSource wiki currently doesn’t document the “credit” scheduler which is used in Debian/Etch and CentOS 5 [1]. One interesting fact is that CPU scheduling in Xen can have a significant effect on IO performance as demonstrated in the paper by Ludmila Cherkasova, Diwaker Gupta and Amin Vahdat [2]. But they only showed a factor of two performance difference (which while bad is not THAT bad).

A more significant problem is managing virtual memory, when there is excessive paging performance can drop by a factor of 100 and even the most basic tasks become impossible.

The design of Xen is that every DomU is allocated some physical RAM and has it’s own swap space. I have previously written about my experiments to optimise swap usage on Xen systems by using a tmpfs in the Dom0 [3]. The aim was to have every Xen DomU swap data out to a tmpfs so that if one DomU was paging heavily and the other DomUs were not paging then the paging might take place in the Dom0’s RAM and not hit disk. The experiments were not particularly successful but I would be interested in seeing further research in this area as there might be some potential to do some good.

I have previously written about the issues related to swap space sizing on Linux [4]. My conclusion is that following the “twice RAM” myth will lead to systems becoming unusable due to excessive swapping in situations where they might otherwise be usable if the kernel killed some processes instead (naturally there are exceptions to my general rule due to different application memory use patterns – but I think that my general rule is a lot better than the “twice RAM” one).

One thing that I didn’t consider at the time is the implications of this problem for Xen servers. If you have 10 physical machines and one starts paging excessively then you have one machine to reboot. If you have 10 Xen DomUs on a single host and one starts paging heavily then you end up with one machine that is unusable due to thrashing and nine machines that deliver very poor disk read performance – which might make them unusable too. Read performance can particularly suffer in a situation when one process or VM is writing heavily to disk due to the way that the disk queuing works, it’s not uncommon for an application to read dozens or hundreds of blocks from disk to satisfy a single trivial request from a user, if each of these block read requests has to wait for a large amount of data to be written out from the write-back cache then performance will suck badly (I have seen this in experiments on single disks and on Linux software RAID – but have not had the opportunity to do good tests on a hardware RAID array).

Currently for Xen DomUs I am allocating swap spaces no larger than 512M, as anything larger than that is likely to cause excessive performance loss to the rest of the server if it is actually used.

A Solution for Similar Problems:

A well known optimisation technique of desktop systems is to use a separate disk for swap, in desktop machines people often use the old disk as swap after buying a new larger disk for main storage. The benefit of this is that swap use will not interfere with other disk use, for example the disk reads needed to run the “ps” and “kill” programs won’t be blocked by the memory hog that you want to kill. I believe that similar techniques can be applied to Xen servers and give even greater benefits. When a desktop machine starts paging excessively the user will eventually take a coffee break and let the machine recover, but when an Internet server such as a web server starts paging excessively the requests keep coming in and the number of active processes increases so it seems likely that using a different device for the swap will allow some processes to satisfy requests by reading data from disk while some other processes are waiting to be paged in.

Applying it to Xen Servers:

The first thing that needs to be considered for such a design is the importance of reliable swap. When it comes to low-end servers there is ongoing discussion about the relative merits of RAID-0 and RAID-1 for swap. The benefit of RAID-0 is performance (at least in perception – I can imagine some OS swapping algorithms that could potentially give better performance on RAID-1 and I am not aware of any research in this area). The benefit of RAID-1 is reliability. Now there are two issues in regard to reliability, one is continuity of service (EG being able to hot-swap a failed disk while the server is running), and the other is the absence of data loss. For some systems it may be acceptable to have a process SEGV (which I presume is the result if a page-in request fails) due to a dead disk (reserving the data loss protection of RAID for files). One issue related to this is the ability to regain control of a server after a problem. For example if the host OS of a machine had non-RAID swap then a disk failure could prevent a page-in of data related to sshd or some similar process and thus make it impossible to recover the machine without hardware access. But if the swap for a virtual machine was on a non-RAID disk and the host had RAID for it’s swap then the sysadmin could login to the host and reboot the DomU after creating a new swap space on a working disk.

Now if you have a server with 8 or 12 disks (both of which seem to be reasonably common capacities of modern 2RU servers) and if you decide that RAID is not required for the swap space of DomUs then it would be possible to assign single disks for swap spaces for groups of virtual machines. So if one client had several virtual machines they could have them share the same single disk for the swap, so a thrashing server would only affect the performance of other VMs from the same client. One possible configuration would be a 12 disk server that has a four disk RAID-5 array for main storage and 8 single disks for swap. 8 CPU cores is common for a modern 2RU server, so it would be possible to lock 8 groups of DomUs so that they share CPUs and swap spaces. Another possibility would be to have four groups of DomUs where each group had a RAID-1 array for swap and two CPU cores.

I am not sure of the aggregate performance impact of such a configuration, I suspect that a group of single disks would give better performance for swap than a single RAID array and that RAID-1 would outperform RAID-5. For a single DomU it seems most likely that using part of a large RAID array for swap space would give better performance. But the benefit in partitioning the server seems clear. An architecture where each DomU had it’s own dedicated disk for a swap space is something that I would consider a significant benefit if renting a Xen DomU. I would rather have the risk of down-time (which should be short with hot-swap disks and hardware monitoring) in the rare case of a disk failure than have bad performance regularly in the common situation of someone else’s DomU being overloaded.

Failing that, having a separate RAID array for swap would be a significant benefit. If every process that isn’t being paged out could deliver full performance while one DomU was thrashing then it would be a significant benefit over the situation where any DomU can thrash and kill the file access performance of all other DomUs. A single RAID-1 array should handle all the swap space requirements for a small or medium size Xen server

One thing that I have not tested is the operation of LVM when one disk goes bad. In the case of a disk with bad sectors it’s possible to move the LVs that are not affected to other disks and to remove the LV that was affected and re-create it after removing the bad disk. The case of a disk that is totally dead (IE the PV header can’t be read or written) might cause some additional complications.

Update Nov 2012: This post was discussed on the Linode forum:

Comments include “The whole etbe blog is pretty interesting” and “Russell Coker is a long-time Debian maintainer and all-round smart guy” [5]. Thanks for that!

6 comments to Killing Servers with Virtualisation and Swap

  • […] with Virtualisation and Swap Russel Coker blogs about a problem, which concerned me as well: Killing Servers with Virtualisation and Swap, i.e.: what happens when one domU is happily swapping the whole day? Luckily this didn’t happen […]

  • Anonymous

    Why would you use swap at all? Nowadays, if a system needs to swap, something has gone horribly wrong.

    If someone *really* wants swap on their DomU, they could always create a swap file in their disk storage. However, if the swap ever gets touched the system performance will almost certainly become unacceptably bad, so unless you have some huge single calculation that can’t fit in RAM, why would you want swap? Just scale up your system, either permanently or (if you use something like EC2 or Gandi Hosting) temporarily.

  • etbe

    Ingo comments on my post at the above URL. He suggests using a separate partition of each disk that is used in the main RAID for swap. While this will allow some partitioning of swap use (so swap of domain 1 won’t necessarily affect domain 2), it doesn’t stop a domain that swaps heavily from impacting the performance of filesystem access.

  • […] Wolf on Swapping to a Floppy DiskBerto on Swapping to a Floppy Disketbe on AppArmor is Deadetbe on Killing Servers with Virtualisation and SwapMorris on AppArmor is Deadneo on AppArmor is […]

  • etbe

    There is a short discussion of this post at the above URL. There is an interesting comment from sweh “The swap issue is well known on linodes. The UML IO tokens system helps mitigate this to some extent ( a UML linode swapping intensively will run out of I/O tokens and so no longer chew up host I/O bandwidth, freeing the other linodes on the same host to continue on as normal)”. That is really interesting, I’ll have to investigate what UML offers in this regard next time I work on such things.

    The Linode discussion concludes with SteveG saying “For those that don’t recognize the name, Russell Coker is a long-time Debian maintainer and all-round smart guy”. Thanks SteveG!