Linux, politics, and other interesting things
I’ve just bought a new Thinkpad that has hardware virtualisation support and I’ve got KVM running.
The Linux-KVM site has some information on using hugetlbfs to allow the use of 2MB pages for KVM . I put “vm.nr_hugepages = 1024” in /etc/sysctl.conf to reserve 2G of RAM for KVM use. The web page notes that it may be impossible to allocate enough pages if you set it some time after boot (the kernel can allocate memory that can’t be paged and it’s possible for RAM to become too fragmented to allow allocation). As a test I reduced my allocation to 296 pages and then increased it again to 1024, I was surprised to note that my system ran extremely slow while reserving the pages – it seems that allocating such pages is efficient when done at boot time but not so efficient when done later.
hugetlbfs /hugepages hugetlbfs mode=1770,gid=121 0 0
I put the above line in /etc/fstab to mount the hugetlbfs filesystem. The mode of 1770 allows anyone in the group to create files but not unlink or rename each other’s files. The gid of 121 is for the kvm group.
I’m not sure how hugepages are used, they aren’t used in the most obvious way. I expected that allocating 1024 huge pages would allow allocating 2G of RAM to the virtual machine, that’s not the case as “-m 2048” caused kvm to fail. I also expected that the number of HugePages free according to /proc/meminfo would reliably drop by an amount that approximately matches the size of the virtual machine – which doesn’t seem to be the case.
I have no idea why KVM with Hugepages would be significantly slower for user and system CPU time but still slightly faster for the overall build time (see the performance section below). I’ve been unable to find any documents explaining in which situations huge pages provide advantages and disadvantages or how they work with KVM virtualisation – the virtual machine allocates memory in 4K pages so how does that work with 2M pages provided to it by the OS?
But Hugepages does provide a slight benefit in performance and if you have plenty of RAM (I have 5G and can afford to buy more if I need it) you should just install it as soon as you start.
open /dev/kvm: Permission denied
Could not initialize KVM, will disable KVM support
One thing that annoyed me about KVM is that the Debian/Lenny version will run QEMU instead if it can’t run KVM. I discovered this when a routine rebuild of the SE Linux Policy packages in a Debian/Unstable virtual machine took an unreasonable amount of time. When I halted the virtual machine I noticed that it had displayed the above message on stderr before changing into curses mode (I’m not sure the correct term for this) such that the message was obscured until the xterm was returned to the non-curses mode at program exit. I had to add the user in question to the kvm group. I’ve filed Debian bug report #574063 about this .
Below is a table showing the time taken for building the SE Linux reference policy on Debian/Unstable. It compares running QEMU emulation (using the kvm command but without permission to access /dev/kvm), KVM with and without hugepages, Xen, and a chroot. Xen is run on an Opteron 1212 Dell server system with 2*1TB SATA disks in a RAID-1 while the KVM/QEMU tests are on an Intel T7500 CPU in a Thinkpad T61 with a 100G SATA disk . All virtual machines had 512M of RAM and 2 CPU cores. The Opteron 1212 system is running Debian/Lenny and the Thinkpad is running Debian/Lenny with a 2.6.32 kernel from Testing.
|QEMU on Opteron 1212 with Xen installed||126m54||39m36||8m1|
|QEMU on T7500||95m42||42m57||8m29|
|KVM on Opteron 1212||7m54||4m47||2m26|
|Xen on Opteron 1212||6m54||3m5||1m5|
|KVM on T7500||6m3||2m3||1m9|
|KVM Hugepages on T7500 with NCurses console||5m58||3m32||2m16|
|KVM Hugepages on T7500||5m50||3m31||1m54|
|KVM Hugepages on T7500 with 1800M of RAM||5m39||3m30||1m48|
|KVM Hugepages on T7500 with 1800M and file output||5m7||3m28||1m38|
|Chroot on T7500||3m43||3m11||29|
I was surprised to see how inefficient it is when compared with a chroot on the same hardware. It seems that the system time is the issue. Most of the tests were done with 512M of RAM for the virtual machine, I tried 1800M which improved performance slightly (less IO means less context switches to access the real block device) and redirecting the output of dpkg-buildpackage to /tmp/out and /tmp/err reduced the built time by 32 seconds – it seems that the context switches for networking or console output really hurt performance. But for the default build it seems that it will take about 50% longer in a virtual machine than in a chroot, this is bearable for the things I do (of which building the SE Linux policy is the most time consuming), but if I was to start compiling KDE then I would be compelled to use a chroot.
I was also surprised to see how slow it was when compared to Xen, for the tests on the Opteron 1212 system I used a later version of KVM (qemu-kvm 0.11.0+dfsg-1~bpo50+1 from Debian/Unstable) but could only use 2.6.26 as the virtualised kernel (the Debian 2.6.32 kernels gave a kernel Oops on boot). I doubt that the lower kernel version is responsible for any significant portion of the extra minute of build time.
One way of managing storage for a virtual machine is to use files on a large filesystem for it’s block devices, this can work OK if you use a filesystem that is well designed for large files (such as XKS). I prefer to use LVM, one thing I have not yet discovered is how to make udev assign the KVM group to all devices that match /dev/V0/kvm-*.
KVM seems to be basically designed to run from a session, unlike Xen which can be started with “xm create” and then run in the background until you feel like running “xm console” to gain access to the console. One way of dealing with this is to use screen. The command “screen -S kvm-foo -d -m kvm WHATEVER” will start a screen session named kvm-foo that will be detached and will start by running kvm with “WHATEVER” as the command-line options. When screen is used for managing virtual machines you can use the command “screen -ls” to list the running sessions and then commands such as “screen -r kvm-unstable” to reattach to screen sessions. To detach from a running screen session you type ^A^D.
The problem with this is that screen will exit when the process ends and that loses the shutdown messages from the virtual machine. To solve this you can put “exec bash” or “sleep 200” at the end of the script that runs kvm.
start-stop-daemon -S -c USERNAME --exec /usr/bin/screen -- -S kvm-unstable -d -m /usr/local/sbin/kvm-unstable
On a Debian system the above command in a system boot script (maybe /etc/rc.local) could be used to start a KVM virtual machine on boot. In this example USERNAME would be replaced by the name of the account used to run kvm, and /usr/local/sbin/kvm-unstable is a shell script to run kvm with the correct parameters. Then as user USERNAME you can attach to the session later with the command “screen -x kvm-unstable“. Thanks to Jason White for the tip on using screen.
I’ve filed Debian bug report #574069  requesting that kvm change it’s argv so that top(1) and similar programs can be used to distinguish different virtual machines. Currently when you have a few entries named kvm in top’s output it is annoying to match the CPU hogging process to the virtual machine it’s running.
It is possible to use KVM with X or VNC for a graphical display by the virtual machine. I don’t like these options, I believe that Xephyr provides better isolation, I’ve previously documented how to use Xephyr .
kvm -kernel /boot/vmlinuz-2.6.32-2-amd64 -initrd /boot/initrd.img-2.6.32-2-amd64 -hda /dev/V0/unstable -hdb /dev/V0/unstable-swap -m 512 -mem-path /hugepages -append "selinux=1 audit=1 root=/dev/hda ro rootfstype=ext4" -smp 2 -curses -redir tcp:2022::22
The above is the current kvm command-line that I’m using for my Debian/Unstable test environment.
I’m using KVM options such as “-redir tcp:2022::22” to redirect unprivileged ports (in this case 2022) to the ssh port. This works for a basic test virtual machine but is not suitable for production use. I want to run virtual machines with minimal access to the environment, this means not starting them as root.
One thing I haven’t yet investigated is the vde2 networking system which allows a private virtual network over multiple physical hosts and which should allow kvm to be run without root privs. It seems that all the other networking options for kvm which have appealing feature sets require that the kvm process be started with root privs.
It seems that KVM is significantly slower than a chroot, so for a basic build environment a secure chroot environment would probably be a better option. I had hoped that KVM would be more reliable than Xen which would offset the performance loss – however as KVM and Debian kernel 2.6.32 don’t work together on my Opteron system it seems that I will have some reliability issues with KVM that compare with the Xen issues. There are currently no Xen kernels in Debian/Testing so KVM is usable now with the latest bleeding edge stuff (on my Thinkpad at least) while Xen isn’t.
Qemu is really slow, so Xen is the only option for 32bit hardware. Therefore all my 32bit Xen servers need to keep running Xen.
I don’t plan to switch my 64bit production servers to KVM any time soon. When Debian/Squeeze is released I will consider whether to use KVM or Xen after upgrading my 64bit Debian server. I probably won’t upgrade my 64bit RHEL-5 server any time soon – maybe when RHEL-7 is released. My 64bit Debian test and development server will probably end up running KVM very soon, I need to upgrade the kernel for Ext4 support and that makes KVM more desirable.
So it seems that for me KVM is only going to be seriously used on my laptop for a while.
Generally I am disappointed with KVM. I had hoped that it would give almost the performance of Xen (admittedly it was only 14.5% slower). I had also hoped that it would be really reliable and work with the latest kernels (unlike Xen) but it is giving me problems with 2.6.32 on Opteron. Also it has some new issues such as deciding to quietly do something I don’t want when it’s unable to do what I want it to do.