
Upgrading a server to 64bit Xen

I have access to a server in Germany that was running Debian/Etch i386 but needed to be running Xen with the AMD64 version of Debian/Lenny (well, it didn’t really need to be Lenny, but we might as well get two upgrades done at the same time). Most people would probably do a complete reinstall, but I knew that I could do the upgrade while the machine was in a server room, without any manual intervention. I didn’t achieve all my goals (I wanted to do it without having to boot the recovery system – we ended up having to boot it twice) but no dealings with the ISP staff were required.

The first thing to do is to get a 64bit kernel running. Based on past bad experiences I’m not going to use the Debian Xen kernel on a 64bit system (in all my tests it has had kernel panics in the Dom0 when doing any serious disk IO). So I chose the CentOS 5 kernel.

To get the kernel running I copied the kernel files (/boot/vmlinuz-2.6.18-92.1.13.el5xen /boot/System.map-2.6.18-92.1.13.el5xen /boot/config-2.6.18-92.1.13.el5xen) and the modules (/lib/modules/2.6.18-92.1.13.el5xen) from a CentOS machine. I just copied them as a .tgz archive as I didn’t want to bother installing alien or doing anything else that took time. Then I ran the Debian mkinitramfs program to create the initrd (the 32bit tools for creating an initrd work well with a 64bit kernel), created the GRUB configuration entry (just copied the one from the CentOS box and changed the root= kernel parameter and the root GRUB parameter), crossed my fingers and rebooted. I tested this on a machine in my own computer room to make sure it worked before deploying it in Germany, but there was still some risk.
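The sequence was roughly as follows – a sketch only, with the archive name hypothetical and the hypervisor file name and root device illustrative (the GRUB entry is based on the one from the CentOS box as described above):

tar xzf centos-kernel-2.6.18-92.1.13.el5xen.tgz -C /    # /boot files and /lib/modules from the CentOS box
depmod -a 2.6.18-92.1.13.el5xen
mkinitramfs -o /boot/initrd.img-2.6.18-92.1.13.el5xen 2.6.18-92.1.13.el5xen
# menu.lst entry:
title  Xen with CentOS kernel
root   (hd0,0)
kernel /boot/xen.gz
module /boot/vmlinuz-2.6.18-92.1.13.el5xen root=/dev/sda2 ro console=tty0
module /boot/initrd.img-2.6.18-92.1.13.el5xen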

After rebooting, the command arch reported x86_64 – so the 64bit Xen kernel was running correctly.

The next thing was to create a 64bit Lenny image. I got the Lenny Beta 2 image and used debootstrap to create the image (I consulted my blog post about creating Xen images for the syntax [1] – one of the benefits of blogging about how you solve technical problems). Then I used scp to copy a .tgz file of that to the server in Germany. Unfortunately the people who had set up that server had used all the disk space in two partitions, one for root and one for swap. While I can use regular files for Xen images (with performance that will probably suck a bit – Ext3 is not a great filesystem for big files) I can’t use them for a new root filesystem. So I formatted the swap space as ext3.
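The commands were along these lines – a sketch where the mirror, mount point, and file names are examples, and debootstrap for amd64 has to be run on a machine that can execute 64bit binaries:

debootstrap --arch amd64 lenny /mnt/lenny64 http://ftp.de.debian.org/debian/
tar czf lenny64.tgz -C /mnt/lenny64 .
scp lenny64.tgz root@server:/root/
# then on the server in Germany:
swapoff -a
mkfs.ext3 /dev/sda1     # the device formerly used for swap
mkdir /mnt/newroot
mount /dev/sda1 /mnt/newroot
tar xzf /root/lenny64.tgz -C /mnt/newroot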

Then I merely had to update the /etc/fstab, /etc/network/interfaces, and /etc/resolv.conf files to make the new installation basically functional. Of course ssh access is necessary to do anything with the server once it boots, so I chrooted into the environment and ran “apt-get update ; apt-get install openssh-server udev ; apt-get dist-upgrade“.
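A sketch of that chroot sequence (using the same illustrative /mnt/newroot mount point as above):

mount --bind /dev /mnt/newroot/dev
mount -t proc proc /mnt/newroot/proc
chroot /mnt/newroot /bin/bash
apt-get update
apt-get install openssh-server udev
apt-get dist-upgrade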

I stuffed this up and didn’t allow myself ssh access the first time, so the thing to do is to start sshd in the chroot environment and make sure that you can really login. Without udev running an ssh login will probably result in the message “stdin: is not a tty“ – that is not a problem. Working around it with the commands ‘ssh root@server “mkdir /dev/pts”‘ and ‘ssh root@server “mount -t devpts devpts /dev/pts”‘ is not a challenge, but installing udev first is a better idea.

After that I added a new GRUB entry as the default, which used the CentOS kernel and /dev/sda1 (the device formerly used for swap space) as root. I initially used the CentOS Xen kernel (all Red Hat based distributions bundle the Xen kernel with the Linux kernel – which makes some sense), but the Debian Xen utilities didn’t like that so I changed to the Debian Xen kernel.

Once I had this basically working I copied the 64bit installation to the original device and put the 32bit files in a subdirectory named “old” (so configuration could be copied). When I changed the configuration and rebooted it worked until I installed SE Linux. It seems that the Debian init scripts will in many situations quietly work when the root device is incorrectly specified in /etc/fstab. This however requires creating a device node somewhere else for fsck, and the SE Linux policy version 2:0.0.20080702-12 was not permitting this. I have since uploaded policy 2:0.0.20080702-13 to fix this bug and requested that the release team allow it in Lenny – I think that a bug which can make a server fail to boot is worthy of inclusion!

Finally to get the CentOS kernel working with Debian you need to load the following modules in the Dom0 (as discussed in my previous post about kernel issues [2]):
blktap
blkbk
netbk
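They can be loaded immediately with modprobe and added to /etc/modules so that they are loaded on every boot, e.g.:

modprobe blktap
modprobe blkbk
modprobe netbk
printf '%s\n' blktap blkbk netbk >> /etc/modules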

It seems that the Debian Xen kernel has those modules linked in and the Debian Xen utilities expect that.

Currently I’m using Debian kernels 2.6.18 and 2.6.26 for the DomUs. I have considered using the CentOS kernel, but the CentOS developers decided that /dev/console is not good enough for the console of a DomU and used something else. Gratuitous differences are annoying (every other machine, both real and virtual, has /dev/console). If I find problems with the Debian kernels in DomUs I will change to the CentOS kernel. Incidentally, one problem I have had with a CentOS kernel for a DomU (when running on a CentOS Dom0) was that the CentOS initrd seems to have some strange expectations of the root filesystem; when they are not met things go wrong – a common symptom is that the nash process will go in a loop and use 100% CPU time.

One of the problems I had was converting the configuration for the primary network device from eth0 to xenbr0. In my first attempt I had not installed the bridge-utils package and the machine booted up without network access. In future I will set up xenbr1 (a device for private networking that is not connected to an Ethernet device) first and test it; if that works then there’s a good chance that the xenbr0 device (which is connected to the main Ethernet port of the machine) will work.
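One way of creating such a test bridge is a stanza in /etc/network/interfaces (bridge-utils must be installed; the address range is an example):

auto xenbr1
iface xenbr1 inet static
    address 192.168.100.1
    netmask 255.255.255.0
    bridge_ports none
    bridge_stp off
    bridge_fd 0

After “ifup xenbr1” the output of “brctl show” should list the bridge, and DomUs can then be attached to it with entries such as vif = [ 'bridge=xenbr1' ].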

After getting the machine going I found a number of things that needed to be fixed with the Xen SE Linux policy. Hopefully the release team will let me get another version of the policy into Lenny (the current one doesn’t work).

Kernel issues with Debian Xen and CentOS Kernels

Last time I tried using a Debian 64bit Xen kernel for Dom0 I was unable to get it to work correctly – it continually gave kernel panics when doing any serious disk IO. I’ve just tried to reproduce that problem on a test machine with a single SATA disk and it seems to be working correctly, so I guess that it might be related to using software RAID and LVM (LVM is really needed for Xen and RAID is necessary for every serious server IMHO).

To solve this I am now experimenting with using a CentOS kernel on Debian systems.

There are some differences between the kernels that are relevant; the most significant one is the choice of which modules are linked in to the kernel and which have to be loaded with modprobe. The Debian choice is to have the drivers blktap, blkbk, and netbk linked in while the Red Hat / CentOS choice was to have them as modules. Therefore the Debian Xen utilities don’t try to load those modules, and when you use the CentOS kernel without them loaded Xen simply doesn’t work.

Error: Device 0 (vif) could not be connected. Hotplug scripts not working.

You will get the above error (after a significant delay) from the command “xm create -c name” if you try to start a DomU that has networking when the driver netbk is not loaded.

XENBUS: Timeout connecting to device: device/vbd/768 (state 3)

You will get the above error (or something similar with a different device number) for every block device from the kernel of the DomU if using one of the Debian 2.6.18 kernels; if using a 2.6.26 kernel then you get “XENBUS: Waiting for devices to initialise“.

Also one issue to note is that when you use a file: block device (i.e. a regular file) Xen will use a loopback device (internally it seems to only like block devices). If you are having this problem and you destroy the DomU (or have it abort after trying for 300 seconds) then it will leave the loopback device attached (it seems that the code for freeing resources in the error path is buggy). I have filed Debian bug report #503044 [1] requesting that the Xen packages change the kernel configuration to allow more loopback devices and Debian bug report #503046 [2] requesting that the resources be freed correctly.
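Stale loopback devices can be listed and released by hand (the device number below is just an example):

losetup -a            # list loop devices that are still attached
losetup -d /dev/loop3 # detach one left behind by a failed DomU start

If the loop driver is built as a module then loading it with the max_loop=64 option allows more loop devices; when it is built in to the kernel the same option can be passed on the kernel command line.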

Finally, the following messages appear in /var/log/daemon.log if you don’t have the driver blktap loaded:
BLKTAPCTRL[2150]: couldn’t find device number for ‘blktap0’
BLKTAPCTRL[2150]: Unable to start blktapctrl

It doesn’t seem to cause a problem (in my tests I can’t find something I want to do with Xen that required blktap), but I have loaded the driver – even removing error messages is enough of a benefit.

Another issue is that the CentOS kernel packages include a copy of the Xen kernel, so you have a Linux kernel matching the Xen kernel, and of course it is tempting to try running that CentOS Xen kernel on a Debian system. Unfortunately the Xen utilities in Debian/Lenny don’t match the Xen kernel used for CentOS 5 and you get messages such as the following in /var/log/xen/xend-debug.log:

sysctl operation failed — need to rebuild the user-space tool set?
Exception starting xend: (13, ‘Permission denied’)

Update: Added a reference to another Debian bug report.

Updated EC2 API Tools package

I’ve updated my package of the Amazon EC2 API Tools for Debian [1]. Now it uses the Sun JDK. Kaffe doesn’t work due to not supporting annotations; I haven’t filed a bug because Kaffe is known to be incomplete.

OpenJDK doesn’t work – apparently because it doesn’t include trusted root certificates (see Debian bug #501643) [2].

GCJ doesn’t work; I’m not sure why, so I filed Debian bug #501743 [3].

I don’t think that Java is an ideal language choice for utility programs. It seems that Perl might be a better option as it’s supported everywhere and has always been free (the Sun JVM has only just started to become free). The lack of freeness of Java results in lower quality software, and in this case several hours of my time being wasted.


Getting Started with Amazon EC2

The first thing you need to do to get started using the Amazon Elastic Compute Cloud (EC2) [1] is to install the tools to manage the service. The service is run in a client-server manner. You install the client software on your PC to manage the EC2 services that you use.

There are the AMI tools to manage the machine images [2] and the API tools to launch and manage instances [3].

The AMI tools come as both a ZIP file and an RPM package and contain Ruby code, while the API tools are written in Java and only come as a ZIP file.

There are no clear license documents that I have seen for any of the software in question; I recall seeing one mention on one of the many confusing web pages of the code being “proprietary” but nothing else. While it seems most likely (but far from certain) that Amazon owns the copyright to the code in question, there is no information on how the software may be used – apart from an implied license that if you are a paying EC2 customer then you can use the tools (as there is no other way to use EC2). If anyone can find a proper license agreement for this software then please let me know.

To get software working in the most desirable manner it needs to be packaged for the distribution on which it is going to be used; as I prefer to use Debian that means packaging it for Debian. Also when packaging the software you can fix some of the silly things that get included in software that is designed for non-packaged release (such as demanding that environment variables be set to specify where the software is installed). So I have built packages for Debian/Lenny for the benefit of myself and some friends and colleagues who use Debian and EC2.

As I can’t be sure of what Amazon would permit me to do with their code I have to assume that they don’t want me to publish Debian packages for the benefit of all Debian and Ubuntu users who are (or might become) EC2 customers. So instead I have published the .diff.gz files from my Debian/Lenny packages [4] to allow other people to build identical packages after downloading the source from Amazon. At the moment the packages are a little rough, and as I haven’t actually got an EC2 service running with them yet they may have some really bad bugs. But getting the software to basically work took more time than expected. So even if there happen to be some bugs that make it unusable in its current state (the code for determining where it looks for PEM files at best needs a feature enhancement and at worst may be broken at the moment) it would still save people some time to use my packages and fix whatever needs fixing.
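Building a package from one of the published .diff.gz files follows the usual pattern; this is a sketch with entirely hypothetical file and directory names (the real names depend on the Amazon download and the version of my diff):

unzip ec2-api-tools.zip                           # the download from Amazon
mv ec2-api-tools-* ec2-api-tools-1.3
tar czf ec2-api-tools_1.3.orig.tar.gz ec2-api-tools-1.3
cd ec2-api-tools-1.3
zcat ../ec2-api-tools_1.3-1.diff.gz | patch -p1   # apply the published diff
chmod +x debian/rules
dpkg-buildpackage -rfakeroot -us -uc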


Could we have an Open Computing Cloud?

One of the most interesting new technologies to come out recently is Cloud Computing; the most popular instance seems to be the Amazon EC2 (Elastic Compute Cloud). I think it would be good if there were some open alternatives to EC2.

Amazon charges $0.10 per compute hour for a virtual machine that has one Compute Unit (equivalent to a 1.0 to 1.2GHz 2007 Opteron core) and 1.7G of RAM. Competing with this will be hard, as it’s difficult to be cheaper than 10 cents an hour ($876.60 per annum) to a sufficient extent to compensate for the great bandwidth that Amazon has on offer.

The first alternative that seems obvious is a cooperative model. In the past I’ve run servers for the use of friends in the free software community. It would be easy for me to do such things in future, and Xen makes this a lot easier than it used to be. If anyone wants a DomU for testing something related to Debian SE Linux then I can set one up in a small amount of time. If there was free software to manage such things then it would be practical to have some sort of share system for community members.

The next possibility is a commercial model. If I could get Xen to provide a single Amazon Compute Unit to one DomU (not less or more) then I wouldn’t notice it on some of my Xen servers. 1.7G of RAM is a moderate amount, but as 3G seems to be typical for new desktop systems (Intel is still making chipsets that support a maximum of 4G of address space [2], and when you subtract the address space for video and PCI you effectively only get 3G) it would not be inconceivable to use 1.7G DomUs on idle desktop machines. But it’s probably more practical to have a model with less RAM. For my own use I run a number of DomUs with 256M of RAM for testing and development, and the largest server DomU I run is 400M (that is for ClamAV, SpamAssassin, and WordPress). While providing 1.7G of RAM and 1CU for less than 10 cents an hour may be difficult, providing an option of 256M of RAM and 0.2CU (burstable to 0.5CU) for 2 cents an hour would give the same aggregate revenue for the hardware while also offering a cheaper service for people who want that. 2 cents an hour is more than the cost of some of the Xen server plans that ISPs offer [3] but if you only need a server for part of the time then it would have the potential to save some money.
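With the credit scheduler that sort of plan can be approximated by giving the DomU 256M in its configuration file and capping its CPU use; a sketch where the domain name and numbers are examples:

memory = 256                                # in the DomU configuration file
xm sched-credit -d cheapvm -w 128 -c 50     # low weight, cap of 50% of one CPU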

For storage Amazon has some serious bandwidth inside its own network for transferring the image to the machine for booting. To do things on the cheap the way to go would be to create a binary diff of a common image. If everyone who ran virtual servers had images of the common configurations of the popular distributions then creating an image to boot would only require sending a diff (maybe something based on XDelta [4]). Transferring 1GB of filesystem image over most network links is going to be unreasonably time consuming; transferring a binary diff between a stock up to date CentOS or Debian image and a usable system image based on CentOS or Debian with all the updates applied is going to be much faster.
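As a sketch of that work flow (using xdelta3 syntax, file names are examples):

xdelta3 -e -s common-lenny.img customer.img customer.vcdiff   # small file, sent over the network
xdelta3 -d -s common-lenny.img customer.vcdiff customer.img   # recreated on the host that already has the common image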

Of course something like this would not be suitable for anything that requires security. But there are many uses for servers that don’t require much security.


Killing Servers with Virtualisation and Swap

The Problem:

A problem with virtual machines is the fact that one rogue DomU can destroy the performance of all the others by inappropriate resource use. CPU scheduling is designed to allow reasonable sharing of computational resources, but it is unfortunately not well documented – the XenSource wiki currently doesn’t document the “credit” scheduler which is used in Debian/Etch and CentOS 5 [1]. One interesting fact is that CPU scheduling in Xen can have a significant effect on IO performance as demonstrated in the paper by Ludmila Cherkasova, Diwaker Gupta and Amin Vahdat [2]. But they only showed a factor of two performance difference (which while bad is not THAT bad).

A more significant problem is managing virtual memory: when there is excessive paging, performance can drop by a factor of 100 and even the most basic tasks become impossible.

The design of Xen is that every DomU is allocated some physical RAM and has its own swap space. I have previously written about my experiments to optimise swap usage on Xen systems by using a tmpfs in the Dom0 [3]. The aim was to have every Xen DomU swap data out to a tmpfs so that if one DomU was paging heavily and the other DomUs were not paging then the paging might take place in the Dom0’s RAM and not hit disk. The experiments were not particularly successful but I would be interested in seeing further research in this area as there might be some potential to do some good.

I have previously written about the issues related to swap space sizing on Linux [4]. My conclusion is that following the “twice RAM” myth will lead to systems becoming unusable due to excessive swapping in situations where they might otherwise be usable if the kernel killed some processes instead (naturally there are exceptions to my general rule due to different application memory use patterns – but I think that my general rule is a lot better than the “twice RAM” one).

One thing that I didn’t consider at the time is the implications of this problem for Xen servers. If you have 10 physical machines and one starts paging excessively then you have one machine to reboot. If you have 10 Xen DomUs on a single host and one starts paging heavily then you end up with one machine that is unusable due to thrashing and nine machines that deliver very poor disk read performance – which might make them unusable too. Read performance can particularly suffer when one process or VM is writing heavily to disk, due to the way that the disk queuing works. It’s not uncommon for an application to read dozens or hundreds of blocks from disk to satisfy a single trivial request from a user, and if each of these block read requests has to wait for a large amount of data to be written out from the write-back cache then performance will suck badly (I have seen this in experiments on single disks and on Linux software RAID – but have not had the opportunity to do good tests on a hardware RAID array).

Currently for Xen DomUs I am allocating swap spaces no larger than 512M, as anything larger than that is likely to cause excessive performance loss to the rest of the server if it is actually used.
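For example on an LVM based Dom0 a 512M swap space can be set up as follows (volume group, LV names, and DomU device names are examples):

lvcreate -L 512M -n domu1-swap vg0
mkswap /dev/vg0/domu1-swap
# in the DomU configuration file:
disk = [ 'phy:/dev/vg0/domu1-root,sda1,w', 'phy:/dev/vg0/domu1-swap,sda2,w' ]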

A Solution for Similar Problems:

A well known optimisation technique of desktop systems is to use a separate disk for swap, in desktop machines people often use the old disk as swap after buying a new larger disk for main storage. The benefit of this is that swap use will not interfere with other disk use, for example the disk reads needed to run the “ps” and “kill” programs won’t be blocked by the memory hog that you want to kill. I believe that similar techniques can be applied to Xen servers and give even greater benefits. When a desktop machine starts paging excessively the user will eventually take a coffee break and let the machine recover, but when an Internet server such as a web server starts paging excessively the requests keep coming in and the number of active processes increases so it seems likely that using a different device for the swap will allow some processes to satisfy requests by reading data from disk while some other processes are waiting to be paged in.

Applying it to Xen Servers:

The first thing that needs to be considered for such a design is the importance of reliable swap. When it comes to low-end servers there is ongoing discussion about the relative merits of RAID-0 and RAID-1 for swap. The benefit of RAID-0 is performance (at least in perception – I can imagine some OS swapping algorithms that could potentially give better performance on RAID-1 and I am not aware of any research in this area). The benefit of RAID-1 is reliability. Now there are two issues in regard to reliability: one is continuity of service (e.g. being able to hot-swap a failed disk while the server is running), and the other is the absence of data loss. For some systems it may be acceptable to have a process SEGV (which I presume is the result if a page-in request fails) due to a dead disk (reserving the data loss protection of RAID for files). One issue related to this is the ability to regain control of a server after a problem. For example if the host OS of a machine had non-RAID swap then a disk failure could prevent a page-in of data related to sshd or some similar process and thus make it impossible to recover the machine without hardware access. But if the swap for a virtual machine was on a non-RAID disk and the host had RAID for its swap then the sysadmin could login to the host and reboot the DomU after creating a new swap space on a working disk.

Now if you have a server with 8 or 12 disks (both of which seem to be reasonably common configurations for modern 2RU servers) and if you decide that RAID is not required for the swap space of DomUs then it would be possible to assign single disks as swap space for groups of virtual machines. So if one client had several virtual machines they could have them share the same single disk for swap, and a thrashing server would then only affect the performance of other VMs from the same client. One possible configuration would be a 12 disk server that has a four disk RAID-5 array for main storage and 8 single disks for swap. 8 CPU cores is common for a modern 2RU server, so it would be possible to lock 8 groups of DomUs so that they share CPUs and swap spaces. Another possibility would be to have four groups of DomUs where each group had a RAID-1 array for swap and two CPU cores.
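In the DomU configuration files that sort of grouping can be expressed with the cpus option and a dedicated swap device; a sketch where the names and device assignments are examples:

# DomUs belonging to one client, pinned to physical CPU 4 with swap on /dev/sdg
cpus = '4'
disk = [ 'phy:/dev/vg0/client3-www,sda1,w', 'phy:/dev/sdg1,sda2,w' ]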

I am not sure of the aggregate performance impact of such a configuration; I suspect that a group of single disks would give better performance for swap than a single RAID array and that RAID-1 would outperform RAID-5. For a single DomU it seems most likely that using part of a large RAID array for swap space would give better performance. But the benefit in partitioning the server seems clear. An architecture where each DomU had its own dedicated disk for swap space is something that I would consider a significant benefit if renting a Xen DomU. I would rather have the risk of down-time (which should be short with hot-swap disks and hardware monitoring) in the rare case of a disk failure than have bad performance regularly in the common situation of someone else’s DomU being overloaded.

Failing that, having a separate RAID array for swap would be a significant benefit. If every process that isn’t being paged out could deliver full performance while one DomU was thrashing then it would be a significant improvement over the situation where any DomU can thrash and kill the file access performance of all other DomUs. A single RAID-1 array should handle all the swap space requirements for a small or medium size Xen server.

One thing that I have not tested is the operation of LVM when one disk goes bad. In the case of a disk with bad sectors it’s possible to move the LVs that are not affected to other disks and to remove the LV that was affected and re-create it after removing the bad disk. The case of a disk that is totally dead (i.e. the PV header can’t be read or written) might cause some additional complications.

Update Nov 2012: This post was discussed on the Linode forum:

Comments include “The whole etbe blog is pretty interesting” and “Russell Coker is a long-time Debian maintainer and all-round smart guy” [5]. Thanks for that!


Xen and Linux Memory Assignment Bugs

The Linux kernel has a number of code sections which look at the apparent size of the machine and determine what would be the best size for buffers. For physical hardware this makes sense as the hardware doesn’t change at runtime. There are many situations where performance can be improved by using more memory for buffers, enabling large buffers for those situations when the machine has a lot of memory makes it convenient for the sysadmin.

Virtual machines change things as the memory available to the kernel may change at run-time. For Xen the most common case is the Dom0 automatically shrinking when memory is taken by a DomU – but it also supports removing memory from a DomU via the xm mem-set command (the use of xm mem-set seems very rare).

Now a server that is purchased for the purpose of running Xen will have a moderate amount of RAM. In recent times the smallest machine I’ve seen purchased for running Xen had 4G of RAM – and it has spare DIMM slots for another 4G if necessary. While a non-virtual server with 8G of RAM would be an unusually powerful machine dedicated to some demanding application, a Xen server with 8G or 16G of RAM is not excessively big, it merely has space for more DomUs. For example one of my Xen servers has 8 CPU cores, 8G of RAM, and 14 DomUs. Each DomU has on average just over half a gig of RAM and half of a CPU core – not particularly big.

In a default configuration the Dom0 will start by using all the RAM in the machine, which in this case meant that the buffer sizes were appropriate for a machine with 8G of RAM. Then as DomUs are started memory is removed from the Dom0 and these buffers become a problem. This ended up forcing a reboot of the machine by preventing Xen virtual network access to most of the DomUs. I was seeing many messages in the Dom0 kernel message log such as “xen_net: Memory squeeze in netback driver” and most DomUs were inaccessible from the Internet (I didn’t verify that all DomUs were partially or fully unavailable or test the back-end network as I was in a hurry to shut it down and reboot before too many customers complained).

The solution to this is to have the Dom0 start by using a small amount of RAM. To do this I edited the GRUB configuration file and put “dom0_mem=256000” at the end of the Xen kernel line (that is the line starting with “kernel /xen.gz“). This gives the Dom0 kernel just under 256M of RAM from when it is first loaded and prevents allocation of bad buffer sizes; it’s the only solution to this network problem that a quick Google search (the kind you do when trying to fix a serious outage before your client notices (*)) could find.
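The resulting menu.lst entry looks something like the following (the kernel version and root device are examples; without a suffix the dom0_mem value is in kilobytes):

title  Xen
root   (hd0,0)
kernel /xen.gz dom0_mem=256000
module /vmlinuz-2.6.26-1-xen-amd64 root=/dev/md0 ro console=tty0
module /initrd.img-2.6.26-1-xen-amd64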

One thing to note is that my belief that kernel buffer sizes are the root cause of this problem is based on my knowledge of how some of the buffers are allocated plus an observation of the symptoms. I don’t have a test machine with anything near 8G of RAM so I really can’t do anything more to track this down.

There is another benefit to limiting the Dom0 memory: I have found that on smaller machines it’s impossible to reduce the Dom0 memory below a certain limit at run-time. In the past I’ve had problems in reducing the memory of a Dom0 below about 250M; while such a reduction is hardly desirable on a machine with 8G of RAM, when running an old P3 machine with 512M of RAM there are serious benefits to making the Dom0 smaller than that. As a general rule I recommend having a limit on the memory of the Dom0 on all Xen servers. If you use the model of having no services running on the Dom0 there is no benefit in having much RAM assigned to it.

(*) Hiding problems from a client is a bad idea and is not something I recommend. But being able to fix a problem and then tell the client that it’s already fixed is much better than having them call you when you don’t know how long the fix will take.

CPU vs RAM

When configuring servers the trade-offs between RAM and disk are well known. If your storage is a little slow then you can often alleviate the performance problems by installing more RAM for caching and to avoid swapping. If you have more than adequate disk IO capacity then you can over-commit memory and swap out the things that don’t get used much.

One trade-off that often doesn’t get considered is the one between RAM and CPU. I just migrated a server image from a machine with two P4 CPUs to a DomU on a machine with Opteron CPUs. The P4 system seemed lightly loaded (a maximum of 30% CPU time in use over any 5 minute period) so I figured that if two P4 CPUs are 30% busy then a single Opteron core should do the job. It seems that when running 32bit code, 30% of 2*3.0GHz P4 CPUs is close to the CPU power of one core of an Opteron 2352 (2.1GHz). I’m not sure whether this is due to hyper-threading actually doing some good or inefficiencies in running 32bit code on the Opteron – but the Opteron is not giving the performance I expected in this regard.

Now having about 90% of the power of that CPU core in use might not be a problem, except that the load came in bursts. When a burst took the machine to 100% CPU power a server process kept forking off children to answer requests. As all the CPU power was being used it took a long time to answer queries (several seconds) so the queue started growing without end. Eventually there were enough processes running that all memory was used, the machine started thrashing, and eventually the kernel out of memory handler started killing things.

I rebooted the DomU with two VCPUs (two Opteron cores) and there was no problem, performance was good, and because the load bursts last less than a minute the load average seems to stay below 1.

It seems that the use of virtual machines increases the scope of this problem. The advantage of virtual machines is that you can add extra virtual hardware more easily (up to the limit of the physical hardware of course) – I could give the DomU in question 6 Opteron cores in a matter of minutes if it were necessary. The disadvantage is that the CPU use of other virtual machines can impact the operation. As there seems to be an exponential relationship between the number of CPU cores in a system and the overall price it’s not feasible to just put in 32 core Xen servers. While CPU power has generally been increasing faster than disk performance for a long time (at least the last 20 years) it seems that virtualisation provides a way of using a lot of that CPU power. It is possible to have a 1:1 mapping of real CPUs and VCPUs in the Xen DomUs; if you were to install 8 DomUs that each had one VCPU on a server with 8 cores then there would be no competition between DomUs for CPU time – but that would significantly increase the cost of running them (some ISPs offer this for a premium price).
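The number of VCPUs is set in the DomU configuration file and can be adjusted at run time with xm (the domain name below is an example, and xm vcpu-set can only go up to the number of VCPUs the DomU booted with):

vcpus = 4                  # in /etc/xen/appserver.cfg
xm vcpu-set appserver 2    # drop to 2 VCPUs while load is low
xm vcpu-set appserver 4    # back to the boot-time maximum when needed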

In this example, if I had a burst of load for the service in question at the same time as other DomUs were using a lot of CPU time (which is a possibility as the other DomUs are the clients for the service in question) then I might end up with the same problem in spite of having assigned two VCPUs to the DomU.

The real solution is to configure the server to limit the number of children that it forks off; the limit can be high enough to guarantee 100% CPU use at times of peak load without being high enough to start swapping.
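As an illustration, if the daemon in question were Apache with the prefork MPM the limit would be set along these lines (the numbers are examples and have to be sized against the DomU’s RAM):

<IfModule mpm_prefork_module>
    StartServers            5
    MaxClients             40
    ServerLimit            40
    MaxRequestsPerChild  2000
</IfModule>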

I wonder how this goes with ISPs that offer Xen hosting. It seems that you would only need to have one customer who shares the same Xen server as you experiencing such a situation to cause enough disk IO to cripple the performance that you get.

Xen CPU use per Domain

The command “xm list” displays the number of seconds of CPU time used by each Xen domain. This makes it easy to compare the CPU use of the various domains if they were all started at the same time (usually system boot), but it is not very helpful if they were started at different times.

I wrote a little Perl program to display the percentage of one CPU that has been used by each domain, here is a sample of the output:

Domain-0 uses 7.70% of one CPU
demo uses 0.06% of one CPU
lenny32 uses 2.07% of one CPU
unstable uses 0.30% of one CPU

Now the command “xm top” can give you the amount of CPU time used at any moment (which is very useful). But it’s also good to be able to see how much is being used over the course of some days of operation. For example if a domain is using the equivalent of 34% of one CPU over the course of a week (as one domain that I run is doing) then it makes sense to allocate more than one VCPU to it so that things don’t slow down at peak times or when cron jobs are running.

I believe that it’s best to limit the number of CPUs allocated to Xen domains. For example I am running a Xen server with 8 CPU cores. I could grant each domain access to 8 VCPUs, but then any domain could use all that CPU power if it ran wild. Whereas if I give no domain more than 2 VCPUs then one domain can use all the CPU resources allocated to it without the other domains being impacted. I realise that there are scheduling algorithms in the Xen kernel that are designed to deal with such situations, but I believe that simply denying access to excessive resource use is more effective and reliable.

I have not filed a bug report requesting that my script be added to one of the Xen packages as I’m not sure which one it would belong in (and also it’s a bit of a hack). It’s licensed under the GPL so anyone who wants to use it can do what they want. Any distribution package maintainer who wants to include it in a Xen utilities package is welcome to do so. The code is below.


A New Strategy for Xen MAC Allocation

When installing Xen servers one issue that arises is how to assign MAC addresses. The Wikipedia page about MAC addresses [1] shows that all addresses that have the second least significant bit of the most significant byte set to 1 are “locally administered”. In practice people just use addresses starting with 02: for this purpose, although any first octet that is congruent to two mod four would give the same result. I prefer to use 02: because it’s best known and therefore casual observers will be more likely to realise what is happening.

Now if you have a Xen bridge that is private to one Dom0 (for communication between Xen DomU’s on the same host) or on a private network (a switch that connects servers owned by one organisation and not connected to machines owned by others) then it’s easy to just pick MAC addresses starting with 02: or 00:16:3e: (the range assigned to the Xen project). But if Xen servers run by other people are likely to be on the same network then there is a problem.

Currently I’m setting up some Xen servers that have public and private networks. The private network will either be a local bridge (that doesn’t permit sending data out any Ethernet ports) or a bridge to an Ethernet port that is connected to a private switch; for that I am using MAC addresses starting with 02:. As far as I am aware there is no issue with machine A having a particular MAC address on one VLAN while machine B has the same MAC address on another VLAN.

My strategy for dealing with the MAC addresses for the public network at the moment is to copy MAC addresses from machines that will never be in the same network. For example if I use the MAC addresses from Ethernet cards in a P3 desktop system running as a router in a small company in Australia then I can safely use them in a Xen server in a co-location center in the US (there’s no chance of someone taking the PCI ethernet cards from the machine in Australia and sending them to the US – and no-one sells servers that can use such cards anyway). Note that I only do this when I have root on the machine in question and where there is no doubt about who runs the machine, so there should not be any risk.

Of course if someone from the ISP analyses the MAC addresses on their network it will look like they have some very old machines in their server room. ;)

I wonder if there are any protocols that do anything nasty with MAC addresses. I know that IPv6 addresses can be based on the MAC address, but as long as the separate networks have separate IPv6 ranges that shouldn’t be a problem. I’m certainly not going to try bridging networks between Australia and the US!

Another possible way of solving this issue would be to have the people who run a server room assign and manage MAC addresses. One way of doing this would be to specify a mapping of IP addresses to MAC addresses, e.g. you could have the first two bytes be 02:00: and the next four be the same as the IPv4 address assigned to the DomU in question. In the vast majority of server rooms I’ve encountered the number of public IP addresses has been greater than or equal to the number of MAC addresses, with the only exception being corporate server rooms where everything runs on private IP address space (but there’s nothing wrong with 02:00:0a: as the prefix for a MAC address).
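As a worked example (using an address from the documentation range purely for illustration): the IPv4 address 192.0.2.57 is c0.00.02.39 in hex, so the DomU would get the MAC address 02:00:c0:00:02:39, which in a Xen configuration file would be written as:

vif = [ 'mac=02:00:c0:00:02:39, bridge=xenbr0' ]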

I also wonder if anyone else is thinking about the potential for MAC collisions. I’ve got Xen servers in a couple of server rooms; I told the relevant people in writing of my precise plans (and was assigned extra IP addresses for all the DomUs) but never had anyone mention any scheme for assigning MAC addresses.