Linux, politics, and other interesting things


  • Category Archives Virtualisation
  • LUV Talk about Cloud Computing

    Last week I gave a talk for the Linux Users of Victoria about Cloud Computing and Amazon EC2 [1]. I was a little nervous as I was still frantically typing the notes a matter of minutes before my talk was due to start (which isn’t ideal). But it went well. There were many good questions and the audience seemed very interested. The talk will probably appear on the web (it was recorded and I signed a release form for the video).

    Also I have just noticed that Amazon have published a license document for their AMI tools, so I have added the package ec2-ami-tools to my Debian/Lenny repository (for i386 and AMD64) with the following APT sources.list line:

    deb http://www.coker.com.au lenny misc

    Also the Amazon license [2] might permit adding it to the non-free section of Debian, if so I’m prepared to maintain it in Debian for the next release – but I would prefer that someone who knows Ruby take it over.


  • EC2 and IP Addresses

    One of the exciting things about having a cloud computing service is how to talk to the rest of the world. It’s all very well to have a varying number of machines in various locations, but you need constant DNS names at least (and sometimes constant IP addresses) to do most useful things.

    I have previously described how to start an EC2 instance and login to it – which includes discovering it’s IP address [1]. It would not be difficult (in theory at least) to use nsupdate to change DNS records after an instance is started or terminated. One problem is that there is no way of knowing when an instance is undesirably terminated (IE killed by hardware failure) apart from polling ec2-describe-instances so it seems impossible to remove a DNS name before some other EC2 customer gets a dynamic IP address. So it seems that in most cases you will want a constant IP address (which Amazon calls an Elastic IP address) if you care about this possibility. For the case of an orderly shutdown you could have a script remove the DNS record, wait for the timeout period specified by the DNS server (so that all correctly operating DNS caches have purged the record) and then terminate the instance.

    One thing that interests me is the possibility of running front-end mail servers on EC2. Mail servers that receive mail from the net can take significant amounts of CPU time and RAM for spam and virus filters. Instead of having the expense of running enough MX servers to sustain the highest possible load even while one of the servers has experienced a hardware failure there is a possibility of running an extra EC2 instance at peak times with the possibility of running a large instance for a peak time when one of the dedicated servers has experienced a problem. The idea of having a mail server die and have someone else’s server take the IP address and receive the mail is too horrible to contemplate, so an Elastic IP address is required.

    It is quite OK to have a set of mail servers of which not all servers run all the time (this is why the MX record was introduced to the DNS) so having a server run periodically at periods of high load (one of the benefits of the EC2 service) will not require changes to the DNS. I think it’s reasonably important to minimise the amount of changes to the DNS due to the possibility of accidentally breaking it (which is a real catastrophe) and the possibility of servers caching DNS data for longer than they should. The alternative is to change the MX record to not point to the hostname of the server when the instance is terminated. I will be interested to read comments on this issue.

    The command ec2-allocate-address will allocate a public IP address for your use. Once the address is allocated it will cost $0.01 per hour whenever it is unused. There are also commands ec2-describe-addresses (to list all addresses allocated to you), ec2-release-address (to release an allocated address), ec2-associate-address to associate an IP address with a running instance, and ec2-disassociate-address to remove such an association.

    The command “ec2-associate-address -i INSTANCE ADDRESS” will associate an IP address with the specified instance (replace INSTANCE with the instance ID – a code starting with “i-” that is returned from ec2-describe-instances. The command “ec2-describe-instances |grep ^INSTANCE|cut -f2” will give you a list of all instance IDs in use – this is handy if your use of EC2 involves only one active instance at a time (all the EC2 API commands give output in tab-separated lists and can be easily manipulated with grep and cut). Associating an IP address with an instance is documented as taking several minutes, while Amazon provides no guarantees or precise figures as to how long the various operations take it seems that assigning an IP address is one of the slower operations. I expect that is due to the requirement for reconfiguring a firewall device (which services dozens or maybe hundreds of nodes) while creating or terminating an instance is an operation that is limited in scope to a single Xen host.

    One result that I didn’t expect was that associating an elastic address is that the original address that was assigned to the instance is removed. I had a ssh connection open to an instance when I assigned an elastic address and my connection was broken. It makes sense to remove addresses that aren’t needed (IPv4 addresses are a precious commodity) and further reading of the documentation revealed that this is the documented behavior.

    One thing I have not yet investigated is whether assigning an IP address from one instance to another is atomic. Taking a few minutes to assign an IP address is usually no big deal, but having an IP address be unusable for a few minutes while in the process of transitioning between servers would be quite inconvenient. It seems that a reasonably common desire would be to have a small instance running and to then transition the IP address to a large (or high-CPU) instance if the load gets high, having this happen without the users noticing would be a good thing.


  • Basics of EC2

    I have previously written about my work packaging the tools to manage Amazon EC2 [1].

    First you need to login and create a certificate (you can upload your own certificate – but this is probably only beneficial if you have two EC2 accounts and want to use the same certificate for both). Download the X509 private key file (named pk-X.pem) and the public key (named cert-X.pem). My Debian package of the EC2 API tools will look for the key files in the ~/.ec2 and /etc/ec2 directories and will take the first one it finds by default.

    To override the certificate (when using my Debian package) or to just have it work when using the code without my package you set the variables EC2_PRIVATE_KEY and EC2_CERT.

    This Amazon page describes some of the basics of setting up the client software and RSA keys [2]. I will describe some of the most important things now:

    The command “ec2-add-keypair gsg-keypair > id_rsa-gsg-keypair” creates a new keypair for logging in to an EC2 instance. The public key goes to amazon and the private key can be used by any ssh client to login as root when you creat an instance. To create an instance with that key you use the “-k gsg-keypair” option, so it seems a requirement to use the same working directory for creating all instances. Note that gsg-keypair could be replaced by any other string, if you are doing something really serious with EC2 you might use one account to create instances that are run by different people with different keys. But for most people I think that a single key is all that is required. Strangely they don’t provide a way of getting access to the public key, you have to create an instance and then copy the /root/.ssh/authorized_keys file for that.

    This Amazon page describes how to set up sample images [3].

    The first thing it describes is the command ec2-describe-images -o self -o amazon which gives a list of all images owned by yourself and all public images owned by Amazon. It’s fairly clear that Amazon doesn’t expect you to use their images. The i386 OS images that they have available are Fedora Core 4 (four configurations with two versions of each) and Fedora 8 (a single configuration with two versions) as well as three other demo images that don’t indicate the version. The AMD64 OS images that they have available are Fedora Core 6 and Fedora Core 8. Obviously if they wanted customers to use their own images (which seems like a really good idea to me) they would provide images of CentOS (or one of the other recompiles of RHEL) and Debian. I have written about why I think that this is a bad idea for security [4], please make sure that you don’t use the ancient Amazon images for anything other than testing!

    To test choose an i386 image from Amazon’s list, i386 is best for testing because it allows the cheapest instances (currently $0.10 per hour).

    Before launching an instance allow ssh access to it with the command “ec2-authorize default -p 22“. Note that this command permits access for the entire world. There are options to limit access to certain IP address ranges, but at this stage it’s best to focus on getting something working. Of course you don’t want to actually use your first attempt at creating an instance, I think that setting up an instance to run in a secure and reliable manner would require many attempts and tests. As all the storage of the instance is wiped when it terminates (as we aren’t using S3 yet) and you won’t have any secret data online security doesn’t need to be the highest priority.

    A sample command to run an instance is “ec2-run-instances ami-2b5fba42 -k gsg-keypair” where ami-2b5fba42 is a public Fedora 8 image available at this moment. This will give output similar to the following:

    RESERVATION r-281fc441 999999999999 default
    INSTANCE i-0c999999 ami-2b5fba42 pending gsg-keypair 0 m1.small 2008-11-04T06:03:09+0000 us-east-1c aki-a71cf9ce ari-a51cf9cc

    The parameter after the word INSTANCE is the serial number of the instance. The command “ec2-describe-instances i-0c999999” will provide information on the instance, once it is running (which may be a few minutes after you request it) you will see output such as the following:

    RESERVATION r-281fc441 999999999999 default
    INSTANCE i-0c999999 ami-2b5fba42 ec2-10-11-12-13.compute-1.amazonaws.com domU-12-34-56-78-9a-bc.compute-1.internal running gsg-keypair 0 m1.small 2008-11-04T06:03:09+0000 us-east-1c aki-a71cf9ce ari-a51cf9cc

    The command “ssh -i id_rsa-gsg-keypair root@ec2-10-11-12-13.compute-1.amazonaws.com” will then grant you root access. The part of the name such as 10-11-12-13 is the public IP address. Naturally you won’t see 10.11.12.13, it will instead be public addresses in the Amazon range – I replaced the addresses to avoid driving bots to their site.

    The name domU-12-34-56-78-9a-bc.compute-1.internal is listed in Amazon’s internal DNS and returns the private IP address (in the 10.0.0.0/8 range) which is used for the instance. The instance has no public IP address, all connections (both inbound and outbound) run through some sort of NAT. This shouldn’t be a problem for HTTP, SMTP, and most protocols that are suitable for running on such a service. But for FTP or UDP based services it might be a problem. The part of the name such as12-34-56-78-9a-bc is the MAC address of the eth0 device.

    To halt a service you can run shutdown or halt as root in the instance, or run the ec2-terminate-instances command and give it the instance ID that you want to terminate. It seems to me that the best way of terminating an instance would be to run a script that produces a summary of whatver the instance did (you might not want to preserve all the log data, but some summary information would be useful), and give all operations that are in progress time to stop before running halt. A script could run on the management system to launch such an orderly shutdown script on the instance and then uses ec2-terminate-instances if the instance does not terminate quickly enough.

    In the near future I will document many aspects of using EC2. This will include dynamic configuration of the host, dynamic DNS, and S3 storage among other things.


  • Types of Cloud Computing

    The term Cloud Computing seems to be poorly defined at the moment, as an example the Wikipedia page about it is rather incoherent [1].

    The one area in which all definitions of the term agree is that a service is provided by a varying number of servers of which the details are unknown to the people who use the services – including the people who program them.

    There seem to be four main definitions of cloud computing. The first one is Distributed Computing [2] (multiple machines in different administrative domains being used to perform a single task or several tasks), this definition does not seem to be well accepted and will probably disappear.

    The next one is Software As A Service [3] which is usually based on the concept of outsourcing the ownership and management of some software and accessing it over the Internet. An example is the companies that offer outsourced management of mail servers that are compatible with MS Exchange (a notoriously difficult system to manage). For an annual fee per email address you can have someone else run an Exchange compatible mail server which can talk to Blackberries etc. The main benefit of SAAS is that it saves the risk and expense of managing the software, having the software deployed at a central location is merely a way of providing further cost reductions (some companies that offer SAAS services will also install the software on your servers at greater expense).

    The last two definitions are the ones that I consider the most interesting, that is virtual machines (also known as Cloud Infrastructure [4]) and virtual hosting of applications (also known as Cloud Platforms [5]). The Amazon EC2 (Elastic Cloud Computing) service [6] seems to be the leading virtual machine cloud service at the moment. Google App Engine [7] seems to be the leading cloud application hosting service at the moment.

    It is also claimed that “Cloud Storage” is part of cloud computing. It seems to me that storing data on servers all over the world is something that was done long before the term “Cloud Computing” was invented, so I don’t think it’s deserving of the term.


  • Upgrading a server to 64bit Xen

    I have access to a server in Germany that was running Debian/Etch i386 but needed to be running Xen with the AMD64 version of Debian/Lenny (well it didn’t really need to be Lenny but we might as well get two upgrades done at the same time). Most people would probably do a complete reinstall, but I knew that I could do the upgrade while the machine is in a server room without any manual intervention. I didn’t achieve all my goals (I wanted to do it without having to boot the recovery system – we ended up having to boot it twice) but no dealings with the ISP staff were required.

    The first thing to do is to get a 64bit kernel running. Based on past bad experiences I’m not going to use the Debian Xen kernel on a 64bit system (in all my tests it has had kernel panics in the Dom0 when doing any serious disk IO). So I chose the CentOS 5 kernel.

    To get the kernel running I copied the kernel files (/boot/vmlinuz-2.6.18-92.1.13.el5xen /boot/System.map-2.6.18-92.1.13.el5xen /boot/config-2.6.18-92.1.13.el5xen) and the modules (/lib/modules/2.6.18-92.1.13.el5xen) from a CentOS machine. I just copied a .tgz archive as I didn’t want to bother installing alien or doing anything else that took time. Then I ran the Debian mkinitramfs program to create the initrd (the 32bit tools for creating an initrd work well with a 64bit kernel). Then I created the GRUB configuration entry (just copied the one from the CentOS box and changed the root= kernel parameter and the root GRUB parameter), crossed my fingers and rebooted. I tested this on a machine in my own computer room to make sure it worked before deploying it in Germany, but there was still some risk.

    After rebooting it the command arch reported x86_64 – so it had a 64bit Xen kernel running correctly.

    The next thing was to create a 64bit Lenny image. I got the Lenny Beta 2 image and used debootstrap to create the image (I consulted my blog post about creating Xen images for the syntax [1] – one of the benefits of blogging about how you solve technical problems). Then I used scp to copy a .tgz file of that to the server in Germany. Unfortunately the people who had set up that server had used all the disk space in two partitions, one for root and one for swap. While I can use regular files for Xen images (with performance that will probably suck a bit – Ext3 is not a great filesystem for big files) I can’t use them for a new root filesystem. So I formatted the swap space as ext3.

    Then to get it working I merely had to update the /etc/fstab, /etc/network/interfaces, and /etc/resolv.conf files to make it basically functional. Of course ssh access is necessary to do anything with the server once it boots, so I chrooted into the environment and ran “apt-get update ; apt-get install openssh-server udev ; apt-get dist-upgrade“.

    I stuffed this up and didn’t allow myself ssh access the first time, so the thing to do is to start sshd in the chroot environment and make sure that you can really login. Without having udev running a ssh login will probably result in the message “stdin: is not a tty“, that is not a problem. Getting that to work by the commands ‘ssh root@server “mkdir /dev/pts”‘ and ‘ssh root@server “mount -t devpts devpts /dev/pts”‘ is not a challenge. But installing udev first is a better idea.

    Then after that I added a new grub entry as the default which used the CentOS kernel and /dev/sda1 (the device formerly used for swap space) as root. I initially used the CentOS Xen kernel (all Red Hat based distributions bundle the Xen kernel with the Linux kernel – which makes some sense). But the Debian Xen utilities didn’t like that so I changed to the Debian Xen kernel.

    Once I had this basically working I copied the 64bit installation to the original device and put the 32bit files in a subdirectory named “old” (so configuration can be copied). When I changed the configuration and rebooted it worked until I installed SE Linux. It seems that the Debian init scripts will in many situations quietly work when the root device is incorectly specified in /etc/fstab. This however requires creating a device node somewhere else for fsck and the SE Linux policy version 2:0.0.20080702-12 was not permitting this. I have since uploaded policy 2:0.0.20080702-13 to fix this bug and requested that the release team allow it in Lenny – I think that a bug which can make a server fail to boot is worthy of inclusion!

    Finally to get the CentOS kernel working with Debian you need to load the following modules in the Dom0 (as discussed in my previous post about kernel issues [2]):
    blktap
    blkbk
    netbk

    It seems that the Debian Xen kernel has those modules linked in and the Debian Xen utilities expect that.

    Currently I’m using Debian kernels 2.6.18 and 2.6.26 for the DomUs. I have considered using the CentOS kernel but they decided that /dev/console is not good enough for the console of a DomU and decided to use something else. Gratuitous differences are annoying (every other machine both real and virtual has /dev/console). If I find problems with the Debian kernels in DomUs I will change to the CentOS kernel. Incidentally one problem I have had with a CentOS kernel for a DomU (when running on a CentOS Dom0) was that the CentOS initrd seems to have some strange expectations of the root filesystem, when they are not met things go wrong – a common symptom is that the nash process will go in a loop and use 100% CPU time.

    One of the problems I had was converting the configuration for the primary network device from eth0 to xenbr0. In my first attempt I had not installed the bridge-utils package and the machine booted up without network access. In future I will setup xenbr1 (a device for private networking that is not connected to an Ethernet device) first and test it, if it works then there’s a good chance that the xenbr0 device (which is connected to the main Ethernet port of the machine) will work.

    After getting the machine going I found a number of things that needed to be fixed with the Xen SE Linux policy. Hopefully the release team will let me get another version of the policy into Lenny (the current one doesn’t work).


  • Kernel issues with Debian Xen and CentOS Kernels

    Last time I tried using a Debian 64bit Xen kernel for Dom0 I was unable to get it to work correctly, it continually gave kernel panics when doing any serious disk IO. I’ve just tried to reproduce that problem on a test machine with a single SATA disk and it seems to be working correctly so I guess that it might be related to using software RAID and LVM (LVM is really needed for Xen and RAID is necessary for every serious server IMHO).

    To solve this I am now experimenting with using a CentOS kernel on Debian systems.

    There are some differences between the kernels that are relevant, the most significant one is the choice of which modules are linked in to the kernel and which ones have to be loaded with modprobe. The Debian choice is to have the drivers blktap blkbk and netbk linked in while the Red Hat / CentOS choice was to have them as modules. Therefore the Debian Xen utilities don’t try and load those modules and therefore when you use the CentOS kernel without them loaded Xen simply doesn’t work.

    Error: Device 0 (vif) could not be connected. Hotplug scripts not working.

    You will get the above error (after a significant delay) from the command “xm create -c name” if you try and start a DomU that has networking when the driver netbk is not loaded.

    XENBUS: Timeout connecting to device: device/vbd/768 (state 3)

    You will get the above error (or something similar with a different device number) for every block device from the kernel of the DomU if using one of the Debian 2.6.18 kernels, if using a 2.6.26 kernel then you get “XENBUS: Waiting for devices to initialise“.

    Also one issue to note is that when you use a file: block device (IE a regular file) then Xen will use a loopback device (internally it seems to only like block devices). If you are having this problem and you destroy the DomU (or have it abort after trying for 300 seconds) then it will leave the loopback device enabled (it seems that the code for freeing resources in the error path is buggy). I have filed Debian bug report #503044 [1] requesting that the Xen packages change the kernel configuration to allow more loopback devices and Debian bug report #503046 [2] requesting that the resources be freed correctly.

    Finally the following messages appear in /var/log/daemon.log if you don’t have the driver blktap installed:
    BLKTAPCTRL[2150]: couldn’t find device number for ‘blktap0′
    BLKTAPCTRL[2150]: Unable to start blktapctrl

    It doesn’t seem to cause a problem (in my tests I can’t find something I want to do with Xen that required blktap), but I have loaded the driver – even removing error messages is enough of a benefit.

    Another issue is that the CentOS kernel packages include a copy of the Xen kernel, so you have a Linux kernel matching the Xen kernel. So of course it is tempting to try and run that CentOS Xen kernel on a Debian system. Unfortunately the Xen utilities in Debian/Lenny don’t match the Xen kernel used for CentOS 5 and you get messages such as the following in /var/log/xen/xend-debug.log:

    sysctl operation failed — need to rebuild the user-space tool set?
    Exception starting xend: (13, ‘Permission denied’)

    Update: Added a reference to another Debian bug report.


  • Updated EC2 API Tools package

    I’ve updated my package of the Amazon EC2 API Tools for Debian [1]. Now it uses the Sun JDK. Kaffe doesn’t work due to not supporting annotations, I haven’t filed a bug because Kaffe is known to be incomplete.

    OpenJDK doesn’t work – apparently because it doesn’t include trusted root certificates (see Debian bug #501643) [2].

    GCJ doesn’t work, not sure why so I filed Debian bug #501743 [3].

    I don’t think that Java is an ideal language choice for utility programs. It seems that Perl might be a better option as it’s supported everywhere and has always been free (the Sun JVM has only just started to become free). The lack of freeness of Java results in a lower quality, and in this case several hours of my time wasted.


  • Getting Started with Amazon EC2

    The first thing you need to do to get started using the Amazon Elastic Compute Cloud (EC2) [1] is to install the tools to manage the service. The service is run in a client-server manner. You install the client software on your PC to manage the EC2 services that you use.

    There are the AMI tools to manage the machine images [2] and the API tools to launch and manage instances [3].

    The AMI tools come as both a ZIP file and an RPM package and contain Ruby code, while the API tools are written in Java and only come as a ZIP file.

    There are no clear license documents that I have seen for any of the software in question, I recall seeing one mention on one of the many confusing web pages of the code being “proprietary” but nothing else. While it seems most likely (but far from certain) that Amazon owns the copyright to the code in question, there is no information on how the software may be used – apart from an implied license that if you are a paying EC2 customer then you can use the tools (as there is no other way to use EC2). If anyone can find a proper license agreement for this software then please let me know.

    To get software working in the most desirable manner it needs to be packaged for the distribution on which it is going to be used, as I prefer to use Debian that means packaging it for Debian. Also when packaging the software you can fix some of the silly things that get included in software that is designed for non-packaged release (such as demanding that environment variables be set to specify where the software is installed). So I have built packages for Debian/Lenny for the benefit of myself and some friends and colleagues who use Debian and EC2.

    As I can’t be sure of what Amazon would permit me to do with their code I have to assume that they don’t want me to publish Debian packages for the benefit of all Debian and Ubuntu users who are (or might become) EC2 customers. So instead I have published the .diff.gz files from my Debian/Lenny packages [4] to allow other people to build identical packages after downloading the source from Amazon. At the moment the packages are a little rough, and as I haven’t actually got an EC2 service running with them yet they may have some really bad bugs. But getting the software to basically work took more time than expected. So even if there happen to be some bugs that make it unusable in it’s current state (the code for determining where it looks for PEM files at best needs a feature enhancement and at worst may be broken at the moment) then it would still save people some time to use my packages and fix whatever needs fixing.


  • Could we have an Open Computing Cloud?

    One of the most interesting new technologies that has come out recently is Cloud Computing, the most popular instance seems to be the Amazon EC2 (Elastic Cloud Computing). I think it would be good if there were some open alternatives to EC2.

    Amazon charges $0.10 per compute hour for a virtual machine that has one Compute Unit (equivalent to a 1.0 to 1.2GHz 2007 Opteron core) and 1.7G of RAM. Competing with this will be difficult as it’s difficult to be cheaper than 10 cents an hour ($876.60 per annum) to a sufficient extent to compensate for the great bandwidth that Amazon has on offer.

    The first alternative that seems obvious is a cooperative model. In the past I’ve run servers for the use of friends in the free software community. It would be easy for me to do such things in future, and Xen makes this a lot easier than it used to be. If anyone wants a DomU for testing something related to Debian SE Linux then I can set one up in a small amount of time. If there was free software to manage such things then it wuld be practical to have some sort of share system for community members.

    The next possibility is a commercial model. If I could get Xen to provide a single Amazon Compute Unit to one DomU (not less or more) then I wouldn’t notice it on some of my Xen servers. 1.7G of RAM is a moderate amount, but as 3G seems to be typical for new desktop systems (Intel is still making chipsets that support a maximum of 4G of address space [2], when you subtract the address space for video and PCI you might as well only get 3G) it would not be inconceivable to use 1.7G DomUs on idle desktop machines. But it’s probably more practical to have a model with less RAM. For my own use I run a number of DomUs with 256M of RAM for testing and development and the largest server DomU I run is 400M (that is for ClamAV, SpamAssassin, and WordPress). While providing 1.7G of RAM and 1CU for less than 10 cents an hour may be difficult, but providing an option of 256M of RAM and 0.2CU (burstable to 0.5CU) for 2 cents an hour would give the same aggregate revenue for the hardware while also offering a cheaper service for people who want that. 2 cents an hour is more than the cost of some of the Xen server plans that ISPs offer [3] but if you only need a server for part of the time then it would have the potential to save some money.

    For storage Amazon has some serious bandwidth inside it’s own network for transferring the image to the machine for booting. To do things on the cheap the way to go would be to create a binary diff of a common image. If everyone who ran virtual servers had images of the common configurations of the popular distributions then creating an image to boot would only require sending a diff (maybe something based on XDelta [4]). Transferring 1GB of filesystem image over most network links is going to be unreasonably time consuming, transferring a binary diff of an up to date CentOS or Debian install vs a usable system image based on CentOS or Debian which has all the updates applied is going to be much faster.

    Of course something like this would not be suitable for anything that requires security. But there are many uses for servers that don’t require much security.


  • Killing Servers with Virtualisation and Swap

    The Problem:

    A problem with virtual machines is the fact that one rogue DomU can destroy the performance of all the others by inappropriate resource use. CPU scheduling is designed to allow reasonable sharing of computational resources, it is unfortunately not well documented, the XenSource wiki currently doesn’t document the “credit” scheduler which is used in Debian/Etch and CentOS 5 [1]. One interesting fact is that CPU scheduling in Xen can have a significant effect on IO performance as demonstrated in the paper by Ludmila Cherkasova, Diwaker Gupta and Amin Vahdat [2]. But they only showed a factor of two performance difference (which while bad is not THAT bad).

    A more significant problem is managing virtual memory, when there is excessive paging performance can drop by a factor of 100 and even the most basic tasks become impossible.

    The design of Xen is that every DomU is allocated some physical RAM and has it’s own swap space. I have previously written about my experiments to optimise swap usage on Xen systems by using a tmpfs in the Dom0 [3]. The aim was to have every Xen DomU swap data out to a tmpfs so that if one DomU was paging heavily and the other DomUs were not paging then the paging might take place in the Dom0′s RAM and not hit disk. The experiments were not particularly successful but I would be interested in seeing further research in this area as there might be some potential to do some good.

    I have previously written about the issues related to swap space sizing on Linux [4]. My conclusion is that following the “twice RAM” myth will lead to systems becoming unusable due to excessive swapping in situations where they might otherwise be usable if the kernel killed some processes instead (naturally there are exceptions to my general rule due to different application memory use patterns – but I think that my general rule is a lot better than the “twice RAM” one).

    One thing that I didn’t consider at the time is the implications of this problem for Xen servers. If you have 10 physical machines and one starts paging excessively then you have one machine to reboot. If you have 10 Xen DomUs on a single host and one starts paging heavily then you end up with one machine that is unusable due to thrashing and nine machines that deliver very poor disk read performance – which might make them unusable too. Read performance can particularly suffer in a situation when one process or VM is writing heavily to disk due to the way that the disk queuing works, it’s not uncommon for an application to read dozens or hundreds of blocks from disk to satisfy a single trivial request from a user, if each of these block read requests has to wait for a large amount of data to be written out from the write-back cache then performance will suck badly (I have seen this in experiments on single disks and on Linux software RAID – but have not had the opportunity to do good tests on a hardware RAID array).

    Currently for Xen DomUs I am allocating swap spaces no larger than 512M, as anything larger than that is likely to cause excessive performance loss to the rest of the server if it is actually used.

    A Solution for Similar Problems:

    A well known optimisation technique of desktop systems is to use a separate disk for swap, in desktop machines people often use the old disk as swap after buying a new larger disk for main storage. The benefit of this is that swap use will not interfere with other disk use, for example the disk reads needed to run the “ps” and “kill” programs won’t be blocked by the memory hog that you want to kill. I believe that similar techniques can be applied to Xen servers and give even greater benefits. When a desktop machine starts paging excessively the user will eventually take a coffee break and let the machine recover, but when an Internet server such as a web server starts paging excessively the requests keep coming in and the number of active processes increases so it seems likely that using a different device for the swap will allow some processes to satisfy requests by reading data from disk while some other processes are waiting to be paged in.

    Applying it to Xen Servers:

    The first thing that needs to be considered for such a design is the importance of reliable swap. When it comes to low-end servers there is ongoing discussion about the relative merits of RAID-0 and RAID-1 for swap. The benefit of RAID-0 is performance (at least in perception – I can imagine some OS swapping algorithms that could potentially give better performance on RAID-1 and I am not aware of any research in this area). The benefit of RAID-1 is reliability. Now there are two issues in regard to reliability, one is continuity of service (EG being able to hot-swap a failed disk while the server is running), and the other is the absence of data loss. For some systems it may be acceptable to have a process SEGV (which I presume is the result if a page-in request fails) due to a dead disk (reserving the data loss protection of RAID for files). One issue related to this is the ability to regain control of a server after a problem. For example if the host OS of a machine had non-RAID swap then a disk failure could prevent a page-in of data related to sshd or some similar process and thus make it impossible to recover the machine without hardware access. But if the swap for a virtual machine was on a non-RAID disk and the host had RAID for it’s swap then the sysadmin could login to the host and reboot the DomU after creating a new swap space on a working disk.

    Now if you have a server with 8 or 12 disks (both of which seem to be reasonably common capacities of modern 2RU servers) and if you decide that RAID is not required for the swap space of DomUs then it would be possible to assign single disks for swap spaces for groups of virtual machines. So if one client had several virtual machines they could have them share the same single disk for the swap, so a thrashing server would only affect the performance of other VMs from the same client. One possible configuration would be a 12 disk server that has a four disk RAID-5 array for main storage and 8 single disks for swap. 8 CPU cores is common for a modern 2RU server, so it would be possible to lock 8 groups of DomUs so that they share CPUs and swap spaces. Another possibility would be to have four groups of DomUs where each group had a RAID-1 array for swap and two CPU cores.

    I am not sure of the aggregate performance impact of such a configuration, I suspect that a group of single disks would give better performance for swap than a single RAID array and that RAID-1 would outperform RAID-5. For a single DomU it seems most likely that using part of a large RAID array for swap space would give better performance. But the benefit in partitioning the server seems clear. An architecture where each DomU had it’s own dedicated disk for a swap space is something that I would consider a significant benefit if renting a Xen DomU. I would rather have the risk of down-time (which should be short with hot-swap disks and hardware monitoring) in the rare case of a disk failure than have bad performance regularly in the common situation of someone else’s DomU being overloaded.

    Failing that, having a separate RAID array for swap would be a significant benefit. If every process that isn’t being paged out could deliver full performance while one DomU was thrashing then it would be a significant benefit over the situation where any DomU can thrash and kill the file access performance of all other DomUs. A single RAID-1 array should handle all the swap space requirements for a small or medium size Xen server

    One thing that I have not tested is the operation of LVM when one disk goes bad. In the case of a disk with bad sectors it’s possible to move the LVs that are not affected to other disks and to remove the LV that was affected and re-create it after removing the bad disk. The case of a disk that is totally dead (IE the PV header can’t be read or written) might cause some additional complications.




©2012 etbe - Russell Coker Entries (RSS) and Comments (RSS)  Raindrops Theme