Debian Multimedia and SE Linux

I have just had a need to install packages from Debian Multimedia to correctly play .3gp files from my mobile phone (the stock MPlayer in Debian would not play the sound).

As part of getting this to work in a way that I like I rebuilt some packages so that shared objects would not demand an executable stack, and added them to my SE Linux Etch repository [1]. The liblzo2-2 package is in Debian so I filed bug report #511479 against it. Not that I expect it to be fixed for Etch now that Lenny is about to be released, but it’s good to have the data in the bug tracking system for the benefit of all interested people.

The lame and xvidcore packages are only in the Debian Multimedia archive. I’ve sent email to the maintainer with patches. Not sure if he will accept them (again it’s not a good time for filing bug reports about Etch), but there’s no harm in sending them in.

The lame package also required execmod access, but I don’t have enough time to devote to fixing that. For background information about execmod see my previous post [2].

See my previous post about executable stacks for more background information [3].

The next thing to do is to test this out in Lenny; hopefully I’ll get time to work on this tomorrow.

New version of Bonnie++ and Violin Memory

I have just released version 1.03e of my Bonnie++ benchmark [1]. The only change is support for direct IO in Bonnie++ (via the -D command-line parameter). The patch for this was written by Dave Murch of Violin Memory [2]. Violin specialise in 2RU storage servers based on DRAM and/or Flash storage. One of their products is designed to handle a sustained load of 100,000 write IOPS (in 4K blocks) and 200,000 read IOPS for its 10-year life (but it’s not clear whether you could do 100,000 writes AND 200,000 reads in the same second). The only pricing information that they have online is a claim that flash costs less than $50 per gig. While that would be quite affordable for dozens of gigs and not really expensive for hundreds of gigs, they are discussing a device with 4TB capacity, so it sounds rather expensive – but of course it would be a lot cheaper than using hard disks if you need that combination of capacity and performance.

I wonder how much benefit you would get from using a Violin device to manage the journals for 100 servers in a data center. It seems that 1000 writes per second is near the upper end of the capacity of a 2RU server for many common workloads; this is of course just a rough estimate based on observations of some servers that I run. If the main storage was on a SAN then using data journaling and putting the journals on a Violin device seems likely to improve latency (data is committed faster and the application can report success to the client sooner) while also reducing the load on the SAN disks (which are really expensive).
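On ext3 this setup would use an external journal plus full data journaling; a sketch of the commands involved (device and mount-point names here are hypothetical):

```shell
# Format a slice of the fast device as an external journal
mke2fs -O journal_dev /dev/violin/server42-journal

# Create the server's data filesystem with its journal on that device
mke2fs -j -J device=/dev/violin/server42-journal /dev/sda1

# Mount with full data journaling so writes commit via the fast journal
mount -o data=journal /dev/sda1 /data
```

With data=journal, all data passes through the journal before reaching the main storage, which is what lets the fast device absorb the latency-critical part of the write path.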

Now given that their price point is less than $50 per gig, it seems that a virtual hosting provider could provide really fast storage to their customers for a quite affordable price. $5 per month per gig for flash storage in a virtual hosting environment would be an attractive option for many people. Currently if you have a small service that you want hosted, a virtual server is the best way to do it, and as most providers offer little information on the disk IO capacity of their services it seems quite unlikely that anyone has taken any serious steps to prevent high load from one customer from degrading the performance of the rest. With flash storage you not only get a much higher number of writes per second, but one customer writing data won’t seriously impact read speed for other customers (with hard drives, one process that does a lot of writes can cripple the performance of processes that do reads).

The experimental versions of Bonnie++ have better support for testing some of these usage scenarios. One new feature is measuring the worst-case latency of all operations in each section of the test run. I will soon release Bonnie++ version 1.99 which includes direct IO support; it should show some significant benefits for all usage cases involving Violin devices, ZFS (when configured with multiple types of storage hardware), NetApp Filers, and other advanced storage options.

For a while I have been dithering about the exact feature list of Bonnie++ 2.x. After some pressure from a contributor to the OpenSolaris project I have decided to freeze the feature list at the current 1.94 level plus direct IO support. This doesn’t mean that I will stop adding new features in the 2.0x branch, but I will avoid doing anything that can change the results. So in future, benchmark results made with Bonnie++ version 1.94 can be directly compared to results made with version 2.0 and above. There is one minor issue: new versions of GCC have in the past made differences to some of the benchmark results (the per-character IO test was the main one) – but that’s not my problem. As far as I am concerned Bonnie++ benchmarks everything from the compiler to the mass storage device in terms of disk IO performance. If you compare two systems with different kernels, different versions of GCC, or other differences then it’s up to you to make appropriate notes of what was changed.

This means that the OpenSolaris people can now cease using the 1.0x branch of Bonnie++, and other distributions can do the same if they wish. I have just uploaded version 1.03e to Debian and will request that it go into Lenny – I believe that it is way too late to put 1.9x in Lenny. But once Lenny is released I will upload version 2.00 to Debian/Unstable and that will be the only version supported in Debian after that time.

Physical vs Virtual Servers

In a comment on my post about Slicehost, Linode, and scaling up servers [1] it was suggested that there is no real difference between a physical server and a set of slices of a virtual server that takes up all the resources of the machine.

The commentator notes that it’s easier to manage a virtual machine. When you have a physical machine running at an ISP server room there are many things that need to be monitored, including the temperature at various points inside the case and the operation of various parts (fans and hard disks being two obvious ones). When you run the physical server you have to keep such software running (you maintain the base OS). If the ISP owns the server (which is what you need if the server is in another country) then the ISP staff are the main people to review the output. Having to maintain software that provides data for other people is a standard part of a sys-admin’s job, but when that data determines whether the server will die it is easier if one person manages it all.

If you have a Xen DomU that uses all the resources of the machine (well, all but the small portion used by the Dom0 and the hypervisor) then a failing hard disk could simply be replaced by the ISP staff, who would notify you of the expected duration of the RAID rebuild (which would degrade performance). For more serious failures the data could be migrated to another machine, and in the case of predicted failures (such as unexpected temperature increases or the failure of a cooling fan) it is possible to migrate a running Xen DomU to another server. If the server migration is handled well then this can be a significant benefit of virtualisation for an ISP customer. Also Xen apparently supports having the RAM for a DomU balloon out to a larger size than was used on boot; I haven’t tested this feature and don’t know how well it works. If it supports ballooning to something larger than the physical size in the original server then it would be possible to migrate a running instance to a machine with more RAM to upgrade it.
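For concreteness, the Xen operations mentioned look roughly like this from the Dom0 (domain and host names are hypothetical; live migration requires the xend relocation server to be enabled on both hosts):

```shell
# Live-migrate a running DomU to another physical host
xm migrate --live customer-domu backup-server

# Balloon the DomU's RAM, up to the maxmem set in its config file
xm mem-set customer-domu 2048
```

The mem-set ceiling is the maxmem value in the domain configuration, which is why ballooning past the original boot size needs to be planned for when the DomU is first created.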

The question is whether it’s worth the cost. Applications which need exactly the resources of one physical server seem pretty rare to me. Applications which need resources that are considerably smaller than a single modern server are very common, and applications which have to be distributed among multiple servers are not that common (although many of us hope that our projects will become so successful ;). So the question of whether it’s worth the cost is often really whether the overhead of virtualisation will make a single large machine image take more resources than a single server can provide (moving from a single server to multiple servers costs a lot of developer time, and moving to a larger single server dramatically increases the price). There is also an issue of latency: all IO operations can be expected to take slightly longer, so even if the CPU is at 10% load and there is a lot of free RAM some client operations will still take longer. But I hope that it wouldn’t be enough to compete with the latency of the Internet – even a hard drive seek is faster than the round trip times I expect for IP packets from most customer machines.

VMware has published an interesting benchmark of VMware vs Xen vs native hardware [2]. It appears to have been written in February 2007, and while its intent is to show VMware as being better than Xen, in most cases it seems to show them both as being good enough. The tests involved virtualising 32bit Windows systems; this doesn’t seem an unreasonable test as many ISPs are offering 32bit virtual machines because 32bit code tends to use less RAM. One unfortunate thing is that they give no explanation of why “Integer Math” might run at just over 80% of native performance on VMware and just under 60% of native performance on Xen. The other test results seem to show that for a virtualised Windows OS either VMware or Xen will deliver enough performance (apart from the ones where VMware claims that Xen provides only a tiny fraction of native performance – that’s a misconfiguration that is best ignored). Here is an analysis of the VMware benchmark and the XenSource response (which has disappeared from the net) [3].

The Cambridge Xen people have results showing a single Xen DomU delivering more than 90% native performance on a variety of well known benchmarks [4].

As it seems that in every case we can expect more than 90% of native performance from a single DomU, and as the case of needing more than 90% of native performance is rare, there is no real difference that we should care about when running servers, and the ease of management outweighs the small performance benefit of using native hardware.

Now it appears that Slicehost [5] caters to people who desire this type of management. Their virtual server plans have RAM at every power of two from 256M to 8G, and then they have 15.5G – which seems to imply that they are using physical servers with 16G of RAM and that 15.5G is all that is left after the Xen hypervisor and the Dom0 have taken some. One possible disadvantage of this is that if you want all the CPU power of a server but not so much RAM (or the other way around) then the Slicehost 15.5G plan might involve more hardware being assigned to you than you really need. But given the economies of scale involved in purchasing and managing the large number of servers that Slicehost is running, it might cost them more to run a machine with 8G of RAM as a special order than to buy their standard 16G machine.

Other virtual hosting companies such as Gandi and Linode clearly describe that they don’t support a single instance taking all the resources of the machine (1/4 and 1/5 of a machine respectively are the maximums). I wonder if they are limiting the size of virtual machines to avoid the possibility of needing to shuffle virtual machines when migrating a running virtual machine.

One significant benefit of having a physical machine over renting a collection of DomUs is the ability to run virtual machines as you desire. I prefer to have a set of DomUs on the same physical server so that if one DomU is running slowly then I have the option to optimise other DomUs to free up some capacity. I can change the amounts of RAM and the number of virtual CPUs allocated to each DomU as needed. I am not aware of anyone giving me the option to rent all the capacity of a single server in the form of managed DomUs and then assign the amounts of RAM, disk, and CPU capacity to them as I wish. If Slicehost offered such a deal then one of my clients would probably rent a Slicehost server for this purpose as soon as their current contract runs out.

It seems that there is a lot of potential to provide significant new features for virtual hosting. I expect that someone will start offering these things in the near future. I will advise my clients to try and avoid signing any long-term contracts (where long means one year in the context of hosting) so that they keep their options open for future offers.