Minikube and Debian

I just started looking at the Kubernetes documentation and interactive tutorial [1], which incidentally is really good. Everyone who is developing a complex system should look at this to get some ideas for online training. Here are some notes on setting it up on Debian.

Add Kubernetes Apt Repository

deb https://apt.kubernetes.io/ kubernetes-xenial main

First add the above to your apt sources configuration (/etc/apt/sources.list or some file under /etc/apt/sources.list.d) for the kubectl package. Ubuntu Xenial is near enough to Debian/Buster and Debian/Unstable that it should work well for both of them. Then install the GPG key “6A030B21BA07F4FB” for use by apt:

gpg --recv-key 6A030B21BA07F4FB
gpg --list-sigs 6A030B21BA07F4FB
gpg --export 6A030B21BA07F4FB | apt-key add -

The Google key in question is not signed.
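
Putting it together, a minimal sketch for the repository part looks like this (the file name under /etc/apt/sources.list.d is my choice, adjust as you prefer):

echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
apt update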

Install Packages for the Tutorial

The online training is based on “minikube” which uses libvirt to set up a KVM virtual machine to run the cluster. To get this running you need a system that is capable of running KVM (IE the BIOS is set to allow hardware virtualisation). It MIGHT work on QEMU software emulation without KVM support (technically it’s possible but it would be slow and require some code to handle that); I didn’t test whether it does. Run the following command to install libvirt, kvm, dnsmasq (which minikube requires), and kubectl on Debian/Buster:

apt install libvirt-clients libvirt-daemon-system qemu-kvm dnsmasq kubectl

For Debian/Unstable run the following command:

apt install libvirt-clients libvirt-daemon-system qemu-system-x86 dnsmasq kubectl

To run libvirt as non-root without needing a password for everything you need to add the user in question to the libvirt group. I recommend running things as non-root whenever possible. In this case entering a password for everything will probably be more pain than you want. The Debian Wiki page about KVM [2] is worth reading.
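
Something like the following as root should do it (the user name is an example, and the user needs to log in again for the new group membership to take effect):

usermod -a -G libvirt whoever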

Install Minikube Test Environment

Here is the documentation for installing Minikube [3]. Basically just download a single executable from the net, put it in your $PATH, and run it. Best to use non-root for that. Also you need at least 3G of temporary storage space in the home directory of the user that runs it.
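
As a sketch of that process (the URL is the one the Minikube documentation [3] gave for Linux at the time, check it for the current one, and ~/bin is just my choice of a directory in $PATH):

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
chmod 755 minikube
mkdir -p ~/bin && mv minikube ~/bin/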

After installing minikube run “minikube start” which will download container image data and start it up. Then you can run commands like the following to see what it has done.

# get overview of virsh commands
virsh help
# list domains
virsh --connect qemu:///system list
# list block devices a domain uses
virsh --connect qemu:///system domblklist minikube
# show stats on block device usage
virsh --connect qemu:///system domblkstat minikube hda
# list virtual networks
virsh --connect qemu:///system net-list
# list dhcp leases on a virtual network
virsh --connect qemu:///system net-dhcp-leases minikube-net
# list network filters
virsh --connect qemu:///system nwfilter-list
# list real network interfaces
virsh --connect qemu:///system iface-list

Qemu (KVM) and 9P (Virtfs) Mounts

I’ve tried setting up the Qemu (in this case KVM as it uses the Qemu code in question) 9P/Virtfs filesystem for sharing files to a VM. Here is the Qemu documentation for it [1].

VIRTFS="-virtfs local,path=/vmstore/virtfs,security_model=mapped-xattr,id=zz,writeout=immediate,fmode=0600,dmode=0700,mount_tag=zz"
VIRTFS="-virtfs local,path=/vmstore/virtfs,security_model=passthrough,id=zz,writeout=immediate,mount_tag=zz"

Above are the 2 configuration snippets I tried on the server side. The first uses mapped xattrs (which means that all files will have the same UID/GID and the host will use XATTRs to store the Unix permissions) and the second uses passthrough, which requires KVM to run as root and gives the same permissions on the host as in the VM. The advantages of passthrough are better performance through writing less metadata and having the same permissions in host and VM. The advantages of mapped XATTRs are running KVM/Qemu as non-root and that a SUID file in the VM doesn’t imply a SUID file on the host.

Here is the link to Bonnie++ output comparing Ext3 on a KVM block device (stored on a regular file in a BTRFS RAID-1 filesystem on 2 SSDs on the host), a NFS share from the host from the same BTRFS filesystem, and virtfs shares of the same filesystem. The only tests that Ext3 doesn’t win are some of the latency tests; latency is based on the worst case, not the average. I expected Ext3 to win most tests, but didn’t expect it to lose any latency tests.

Here is a link to Bonnie++ output comparing just NFS and Virtfs. It’s obvious that Virtfs compares poorly, giving about half the performance on many tests. Surprisingly the only tests where Virtfs compared well to NFS were the file creation tests, which I had expected Virtfs with mapped XATTRs to do poorly on due to the extra metadata.

Here is a link to Bonnie++ output comparing only Virtfs. The options are mapped XATTRs with the default msize, mapped XATTRs with a 512k msize (I don’t know if this made a difference, the results are within the range of random variation), and passthrough. There’s an obvious performance benefit in passthrough for the small file tests due to the lower metadata overhead, but as creating small files isn’t a bottleneck on most systems a 20% to 30% improvement in that area probably doesn’t matter much. The result from the random seeks test in passthrough is unusual; I’ll have to do more testing on that.

SE Linux

On Virtfs the XATTR used for SE Linux labels is passed through to the host. So every label used in a VM has to be valid on the host and accessible to the context of the KVM/Qemu process. That’s not really an option so you have to use the context mount option. Having the mapped XATTR mode work for SE Linux labels is a necessary feature.
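
For reference, a context mount of one of the above shares from inside the VM would be something like this (the mount tag “zz” matches the server-side snippets above, the label and mount point are just examples):

mount -t 9p -o trans=virtio,version=9p2000.L,context=system_u:object_r:var_t:s0 zz /mnt/9p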

Conclusion

The msize mount option in the VM doesn’t appear to do anything and it doesn’t appear in /proc/mounts; I don’t know if it’s even supported in the kernel I’m using.

The passthrough and mapped XATTR modes give near enough the same performance that there doesn’t seem to be a benefit of one over the other.

NFS gives significant performance benefits over Virtfs while also using less CPU time in the VM. It has the issue of files named .nfs* hanging around if the VM crashes while programs were using deleted files. It’s also better known: ask for help with an NFS problem and you are more likely to get advice than when asking for help with a virtfs problem.

Virtfs might be a better option for accessing databases than NFS due to its internal operation probably being a better match to Unix filesystem semantics, but running database servers on the host is probably a better choice anyway.

Virtfs generally doesn’t seem to be worth using. I had hoped for performance that was better than NFS but the only benefit I seemed to get was avoiding the .nfs* file issue.

The best options for storage for a KVM/Qemu VM seem to be Ext3 for files that are only used on one VM and for which the size won’t change suddenly or unexpectedly (particularly the root filesystem) and NFS for everything else.


Windows 10 on Debian under KVM

Here are some things that you need to do to get Windows 10 running on a Debian host under KVM.

UEFI Booting

UEFI is big and complex, but most of what it does isn’t needed at all. If all you want to do is boot from an image of a disk with a GPT partition table then you just install the package ovmf and add something like the following to your KVM start script:

UEFI="-drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_CODE.fd -drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_VARS.fd"

Note that some of the documentation on this doesn’t have the OVMF_VARS.fd file set to readonly. Allowing writes to that file means that the VM boot process (and maybe later) can change EFI variables that affect later boots and other VMs if they all share the same file. For a basic boot you don’t need to change variables so you want it read-only. Also having it read-only is necessary if you want to run KVM as non-root.

As an experiment I tried booting without the OVMF_VARS.fd file. It didn’t boot, and even after configuring it to use the OVMF_VARS.fd file again Windows gave a boot error about the “boot configuration data file” that required booting from recovery media. Apparently configuration mistakes with EFI can mess up the Windows installation, so be careful and back up the Windows installation regularly!

Linux can boot from EFI but you generally don’t want to unless the boot device is larger than 2TB. It’s relatively easy to convert a Linux installation on a GPT disk to a virtual image on a DOS partition table disk or on block devices without partition tables, and that gives a faster boot. If the same person runs the host hardware and the VMs then the best choice for Linux is to have no partition tables, just one filesystem per block device (which makes resizing much easier), and have the kernel passed as a parameter to kvm. So booting a VM from EFI is probably only useful for booting Windows VMs and for Linux boot loader development and testing.
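
As a sketch of that sort of setup (the paths are examples only), booting a Linux VM with no boot loader and no partition table looks like this:

kvm -kernel /boot/guest/vmlinuz -initrd /boot/guest/initrd.img -append "root=/dev/vda ro" -drive format=raw,file=/vmstore/linuxvm,if=virtio -m 1024 -smp 2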

As an aside, the Debian Wiki page about Secure Boot on a VM [4] was useful for this. It’s unfortunate that it and so much of the documentation about UEFI is about secure boot which isn’t so useful if you just want to boot a system without regard to the secure boot features.

Emulated IDE Disks

Debian kernels (and probably kernels from many other distributions) are compiled with the paravirtualised storage device drivers. Windows by default doesn’t support such devices so you need to emulate an IDE/SATA disk so you can boot Windows and install the paravirtualised storage driver. The following configuration snippet has a commented line for paravirtualised IO (which is fast) and an uncommented line for a virtual IDE/SATA disk that will allow an unmodified Windows 10 installation to boot.

#DRIVE="-drive format=raw,file=/home/kvm/windows10,if=virtio"
DRIVE="-drive id=disk,format=raw,file=/home/kvm/windows10,if=none -device ahci,id=ahci -device ide-drive,drive=disk,bus=ahci.0"

Spice Video

Spice is an alternative to VNC; here is the main web site for Spice [1]. Spice has many features that could be really useful for some people, like audio, sharing USB devices from the client, and streaming video support. I don’t have a need for those features right now but it’s handy to have the options. My main reason for choosing Spice over VNC is that the mouse cursor in ssvnc doesn’t follow the actual mouse and can be difficult or impossible to click on items near the edges of the screen.

The following configuration will make the QEMU code listen with SSL on port 1234 on all IPv4 addresses. Note that this exposes the Spice password to anyone who can run ps on the KVM server; I’ve filed Debian bug #965061 requesting the option of a password file to address this. Also note that the “qxl” virtual video hardware is VGA compatible and can be expected to work with OS images that haven’t been modified for virtualisation, but it works better with special video drivers.

KEYDIR=/etc/letsencrypt/live/kvm.example.com-0001
-spice password=xxxxxxxx,x509-cacert-file=$KEYDIR/chain.pem,x509-key-file=$KEYDIR/privkey.pem,x509-cert-file=$KEYDIR/cert.pem,tls-port=1234,tls-channel=main -vga qxl

To connect to the Spice server I installed the spice-client-gtk package in Debian and ran the following command:

spicy -h kvm.example.com -s 1234 -w xxxxxxxx

Note that this exposes the Spice password to anyone who can run ps on the system used as a client for Spice, I’ve filed Debian bug #965060 requesting the option of a password file to address this.

This configuration with an unmodified Windows 10 image only supported an 800*600 VGA display.

Networking

To set up bridged networking as non-root you need to do something like the following as root:

chgrp kvm /usr/lib/qemu/qemu-bridge-helper
setcap cap_net_admin+ep /usr/lib/qemu/qemu-bridge-helper
mkdir -p /etc/qemu
echo "allow all" > /etc/qemu/bridge.conf
chgrp kvm /etc/qemu/bridge.conf
chmod 640 /etc/qemu/bridge.conf

Windows 10 supports the emulated Intel E1000 network card. Configuration like the following sets up networking on a bridge named br0 with an emulated E1000 card. MAC addresses that have a 1 in the second least significant bit of the first octet are “locally administered” (like IPv4 addresses starting with “10.”); see the Wikipedia page about MAC Address for details.

The following is an example of network configuration where $ID is an ID number for the virtual machine. So far I haven’t come close to 256 VMs on one network so I’ve only needed one octet.

NET="-device e1000,netdev=net0,mac=02:00:00:00:01:$ID -netdev tap,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper,br=br0"

Final KVM Settings

KEYDIR=/etc/letsencrypt/live/kvm.example.com-0001
SPICE="-spice password=xxxxxxxx,x509-cacert-file=$KEYDIR/chain.pem,x509-key-file=$KEYDIR/privkey.pem,x509-cert-file=$KEYDIR/cert.pem,tls-port=1234,tls-channel=main -vga qxl"

UEFI="-drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_CODE.fd -drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_VARS.fd"

DRIVE="-drive format=raw,file=/home/kvm/windows10,if=virtio"

NET="-device e1000,netdev=net0,mac=02:00:00:00:01:$ID -netdev tap,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper,br=br0"

kvm -m 4000 -smp 2 $SPICE $UEFI $DRIVE $NET

Windows Settings

The Spice Download page has a link for “spice-guest-tools” that has the QXL video driver among other things [2]. This seems to be needed for resolutions greater than 800*600.

The Virt-Manager Download page has a link for “virt-viewer” which is the Spice client for Windows systems [3]; they have MSI files for both i386 and AMD64 Windows.

It’s probably a good idea to set the display and the system to never sleep (I haven’t tested what happens if you don’t do that, but there’s no benefit in sleeping). Before uploading an image I disabled the pagefile and set the partition to the minimum size so I had less data to upload.

Problems

Here are some things I haven’t solved yet.

The aSpice Android client for the Spice protocol fails to connect with the QEMU code at the server giving the following message on stderr: “error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca:../ssl/record/rec_layer_s3.c:1544:SSL alert number 48“.

Spice is supposed to support dynamic changes to screen resolution on the VM to match the window size at the client, but this doesn’t work for me, not even with the Red Hat QXL drivers installed.

The Windows Spice client doesn’t seem to support TLS, I guess running some sort of proxy for TLS would work but I haven’t tried that yet.

Debian PPC64EL Emulation

In my post on Debian S390X Emulation [1] I mentioned having problems booting a Debian PPC64EL kernel under QEMU. Giovanni commented that they had PPC64EL working and gave a link to their site with Debian QEMU images for various architectures [2]. I tried their image which worked then tried mine again which also worked – it seemed that a recent update in Debian/Unstable fixed the bug that made QEMU not work with the PPC64EL kernel.

Here are the instructions on how to do it.

First you need to create a filesystem in an image file with commands like the following:

truncate -s 4g /vmstore/ppc
mkfs.ext4 /vmstore/ppc
mount -o loop /vmstore/ppc /mnt/tmp

Then visit the Debian Netinst page [3] to download the PPC64EL net install ISO. Then loopback mount it somewhere convenient like /mnt/tmp2.
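
For example (the ISO file name depends on the current point release):

mkdir -p /mnt/tmp2
mount -o loop debian-10.7.0-ppc64el-netinst.iso /mnt/tmp2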

The package qemu-system-ppc has the program for emulating a PPC64LE system; the qemu-user-static package has the program for emulating PPC64LE for a single program (IE a statically linked program or a chroot environment), which you need to run debootstrap. The following commands should be most of what you need.

apt install qemu-system-ppc qemu-user-static

update-binfmts --display

# qemu ppc64 needs exec stack to solve "Could not allocate dynamic translator buffer"
# so enable that on SE Linux systems
setsebool -P allow_execstack 1

debootstrap --foreign --arch=ppc64el --no-check-gpg buster /mnt/tmp file:///mnt/tmp2
chroot /mnt/tmp /debootstrap/debootstrap --second-stage

cat << END > /mnt/tmp/etc/apt/sources.list
deb http://mirror.internode.on.net/pub/debian/ buster main
deb http://security.debian.org/ buster/updates main
END
echo "APT::Install-Recommends False;" > /mnt/tmp/etc/apt/apt.conf

echo ppc64 > /mnt/tmp/etc/hostname

# /usr/bin/awk: error while loading shared libraries: cannot restore segment prot after reloc: Permission denied
# only needed for chroot
setsebool allow_execmod 1

chroot /mnt/tmp apt update
# why aren't they in the default install?
chroot /mnt/tmp apt install perl dialog
chroot /mnt/tmp apt dist-upgrade
chroot /mnt/tmp apt install bash-completion locales man-db openssh-server build-essential systemd-sysv ifupdown vim ca-certificates gnupg
# install kernel last because systemd install rebuilds initrd
chroot /mnt/tmp apt install linux-image-ppc64el
chroot /mnt/tmp dpkg-reconfigure locales
chroot /mnt/tmp passwd

cat << END > /mnt/tmp/etc/fstab
/dev/vda / ext4 noatime 0 0
#/dev/vdb none swap defaults 0 0
END

mkdir /mnt/tmp/root/.ssh
chmod 700 /mnt/tmp/root/.ssh
cp ~/.ssh/id_rsa.pub /mnt/tmp/root/.ssh/authorized_keys
chmod 600 /mnt/tmp/root/.ssh/authorized_keys

rm /mnt/tmp/vmlinux* /mnt/tmp/initrd*
mkdir /boot/ppc64
cp /mnt/tmp/boot/[vi]* /boot/ppc64

# clean up
umount /mnt/tmp
umount /mnt/tmp2

# setcap binary for starting bridged networking
setcap cap_net_admin+ep /usr/lib/qemu/qemu-bridge-helper

# afterwards set the access on /etc/qemu/bridge.conf so it can only
# be read by the user/group permitted to start qemu/kvm
echo "allow all" > /etc/qemu/bridge.conf

Here is an example script for starting kvm. It can be run by any user that can read /etc/qemu/bridge.conf.

#!/bin/bash
set -e

KERN="-kernel /boot/ppc64/vmlinux-4.19.0-9-powerpc64le -initrd /boot/ppc64/initrd.img-4.19.0-9-powerpc64le"

# single network device, can have multiple
NET="-device e1000,netdev=net0,mac=02:02:00:00:01:04 -netdev tap,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper"

# random number generator for fast start of sshd etc
RNG="-object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0"

# I have lockdown because it does no harm now and is good for future kernels
# I enable SE Linux everywhere
KERNCMD="net.ifnames=0 noresume security=selinux root=/dev/vda ro lockdown=confidentiality"

kvm -drive format=raw,file=/vmstore/ppc64,if=virtio $RNG -nographic -m 1024 -smp 2 $KERN -curses -append "$KERNCMD" $NET

Debian S390X Emulation

I decided to set up some virtual machines for different architectures. One that I decided to try was S390X – the latest 64bit version of the IBM mainframe. Here’s how to do it; I tested on a host running Debian/Unstable but Buster should work in the same way.

First you need to create a filesystem in an image file with commands like the following:

truncate -s 4g /vmstore/s390x
mkfs.ext4 /vmstore/s390x
mount -o loop /vmstore/s390x /mnt/tmp

Then visit the Debian Netinst page [1] to download the S390X net install ISO. Then loopback mount it somewhere convenient like /mnt/tmp2.

The package qemu-system-misc has the program for emulating a S390X system (among many others); the qemu-user-static package has the program for emulating S390X for a single program (IE a statically linked program or a chroot environment), which you need to run debootstrap. The following commands should be most of what you need.

# Install the basic packages you need
apt install qemu-system-misc qemu-user-static debootstrap

# List the support for different binary formats
update-binfmts --display

# qemu s390x needs exec stack to solve "Could not allocate dynamic translator buffer"
# so you probably need this on SE Linux systems
setsebool allow_execstack 1

# commands to do the main install
debootstrap --foreign --arch=s390x --no-check-gpg buster /mnt/tmp file:///mnt/tmp2
chroot /mnt/tmp /debootstrap/debootstrap --second-stage

# set the apt sources
cat << END > /mnt/tmp/etc/apt/sources.list
deb http://YOURLOCALMIRROR/pub/debian/ buster main
deb http://security.debian.org/ buster/updates main
END
# for minimal install do not want recommended packages
echo "APT::Install-Recommends False;" > /mnt/tmp/etc/apt/apt.conf

# update to latest packages
chroot /mnt/tmp apt update
chroot /mnt/tmp apt dist-upgrade

# install kernel, ssh, and build-essential
chroot /mnt/tmp apt install bash-completion locales linux-image-s390x man-db openssh-server build-essential
chroot /mnt/tmp dpkg-reconfigure locales
echo s390x > /mnt/tmp/etc/hostname
chroot /mnt/tmp passwd

# copy kernel and initrd
mkdir -p /boot/s390x
cp /mnt/tmp/boot/vmlinuz* /mnt/tmp/boot/initrd* /boot/s390x

# setup /etc/fstab
cat << END > /mnt/tmp/etc/fstab
/dev/vda / ext4 noatime 0 0
#/dev/vdb none swap defaults 0 0
END

# clean up
umount /mnt/tmp
umount /mnt/tmp2

# setcap binary for starting bridged networking
setcap cap_net_admin+ep /usr/lib/qemu/qemu-bridge-helper

# afterwards set the access on /etc/qemu/bridge.conf so it can only
# be read by the user/group permitted to start qemu/kvm
echo "allow all" > /etc/qemu/bridge.conf

Some of the above can be considered more as pseudo-code in shell script form than as an exact way of doing things. While you can copy and paste all of the above into a command line and have a reasonable chance of having it work, I think it would be better to look at each command and decide whether it’s right for you and whether you need to alter it slightly for your system.

To run qemu as non-root you need to have a helper program with extra capabilities to setup bridged networking. I’ve included that in the explanation because I think it’s important to have all security options enabled.

The “-object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-ccw,rng=rng0” part is to give entropy to the VM from the host; otherwise it will take ages to start sshd. Note that this is slightly but significantly different from the command used for other architectures (the “ccw” is the difference).

I’m not sure if “noresume” on the kernel command line is required, but it doesn’t do any harm. The “net.ifnames=0” stops systemd from renaming Ethernet devices. For the virtual networking the “ccw” again is a difference from other architectures.

Here is a basic command to run a QEMU virtual S390X system. If all goes well it should give you a login: prompt on a curses based text display; you can then log in as root and should be able to run “dhclient eth0” and other similar commands to set up networking and allow ssh logins.

qemu-system-s390x -drive format=raw,file=/vmstore/s390x,if=virtio -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-ccw,rng=rng0 -nographic -m 1500 -smp 2 -kernel /boot/s390x/vmlinuz-4.19.0-9-s390x -initrd /boot/s390x/initrd.img-4.19.0-9-s390x -curses -append "net.ifnames=0 noresume root=/dev/vda ro" -device virtio-net-ccw,netdev=net0,mac=02:02:00:00:01:02 -netdev tap,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper

Here is a slightly more complete QEMU command. It has 2 block devices, for root and swap. It has SE Linux enabled for the VM (SE Linux works nicely on S390X). I added the “lockdown=confidentiality” kernel security option even though it’s not supported in 4.19 kernels, it doesn’t do any harm and when I upgrade systems to newer kernels I won’t have to remember to add it.

qemu-system-s390x -drive format=raw,file=/vmstore/s390x,if=virtio -drive format=raw,file=/vmswap/s390x,if=virtio -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-ccw,rng=rng0 -nographic -m 1500 -smp 2 -kernel /boot/s390x/vmlinuz-4.19.0-9-s390x -initrd /boot/s390x/initrd.img-4.19.0-9-s390x -curses -append "net.ifnames=0 noresume security=selinux root=/dev/vda ro lockdown=confidentiality" -device virtio-net-ccw,netdev=net0,mac=02:02:00:00:01:02 -netdev tap,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper

Try It Out

I’ve got an S390X system online for a while: “ssh root@s390x.coker.com.au” with password “SELINUX” to try it out.

PPC64

I’ve tried running a PPC64 virtual machine. I did the same things to set it up and then tried launching it with the following result:

qemu-system-ppc64 -drive format=raw,file=/vmstore/ppc64,if=virtio -nographic -m 1024 -kernel /boot/ppc64/vmlinux-4.19.0-9-powerpc64le -initrd /boot/ppc64/initrd.img-4.19.0-9-powerpc64le -curses -append "root=/dev/vda ro"

Above is the minimal qemu command that I’m using. Below is the result, it stops after the “4.” from “4.19.0-9”. Note that I had originally tried with a more complete and usable set of options, but I trimmed it to the minimal needed to demonstrate the problem.

  Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
  This program and the accompanying materials are made available
  under the terms of the BSD License available at
  http://www.opensource.org/licenses/bsd-license.php

Booting from memory...
Linux ppc64le
#1 SMP Debian 4.

The kernel is from the package linux-image-4.19.0-9-powerpc64le which is a dependency of the package linux-image-ppc64el in Debian/Buster. The program qemu-system-ppc64 is from version 5.0-5 of the qemu-system-ppc package.

Any suggestions on what I should try next would be appreciated.

Xen CPU Use per Domain again

8 years ago I wrote a script to summarise Xen CPU use per domain [1]. Since then changes to Xen required changes to the script. I have new versions for Debian/Wheezy (Xen 4.1) and Debian/Jessie (Xen 4.4).

Here’s a new script for Debian/Wheezy:

#!/usr/bin/perl
use strict;

open(LIST, "xm list --long|") or die "Can't get list";

my $name = "Dom0";
my $uptime = 0.0;
my $cpu_time = 0.0;
my $total_percent = 0.0;
my $cur_time = time();

open(UPTIME, "</proc/uptime") or die "Can't open /proc/uptime";
my @arr = split(/ /, <UPTIME>);
$uptime = $arr[0];
close(UPTIME);

my %all_cpu;

while(<LIST>)
{
  chomp;
  if($_ =~ /^\)/)
  {
    my $cpu = $cpu_time / $uptime * 100.0;
    if($name =~ /Domain-0/)
    {
      printf("%s uses %.2f%% of one CPU\n", $name, $cpu);
    }
    else
    {
      $all_cpu{$name} = $cpu;
    }
    $total_percent += $cpu;
    next;
  }
  $_ =~ s/\).*$//;
  if($_ =~ /start_time /)
  {
    $_ =~ s/^.*start_time //;
    $uptime = $cur_time - $_;
    next;
  }
  if($_ =~ /cpu_time /)
  {
    $_ =~ s/^.*cpu_time //;
    $cpu_time = $_;
    next;
  }
  if($_ =~ /\(name /)
  {
    $_ =~ s/^.*name //;
    $name = $_;
    next;
  }
}
close(LIST);

sub hashValueDescendingNum {
  $all_cpu{$b} <=> $all_cpu{$a};
}

my $key;

foreach $key (sort hashValueDescendingNum (keys(%all_cpu)))
{
  printf("%s uses %.2f%% of one CPU\n", $key, $all_cpu{$key});
}

printf("Overall CPU use approximates %.1f%% of one CPU\n", $total_percent);

Here’s the script for Debian/Jessie:

#!/usr/bin/perl

use strict;

open(UPTIME, "xl uptime|") or die "Can't get uptime";
open(LIST, "xl list|") or die "Can't get list";

my %all_uptimes;

while(<UPTIME>)
{
  chomp $_;

  next if($_ =~ /^Name/);
  $_ =~ s/ +/ /g;

  my @split1 = split(/ /, $_);
  my $dom = $split1[0];
  my $uptime = 0;
  my $time_ind = 2;
  if($split1[3] eq "days,")
  {
    $uptime = $split1[2] * 24 * 3600;
    $time_ind = 4;
  }
  my @split2 = split(/:/, $split1[$time_ind]);
  $uptime += $split2[0] * 3600 + $split2[1] * 60 + $split2[2];
  $all_uptimes{$dom} = $uptime;
}
close(UPTIME);

my $total_percent = 0;

while(<LIST>)
{
  chomp $_;

  my $dom = $_;
  $dom =~ s/ .*$//;

  if ( $_ =~ /(\d+)\.[0-9]$/ )
  {
    my $percent = $1 / $all_uptimes{$dom} * 100.0;
    $total_percent += $percent;
    printf("%s uses %.2f%% of one CPU\n", $dom, $percent);
  }
  else
  {
    next;
  }
}

printf("Overall CPU use approximates  %.1f%% of one CPU\n", $total_percent);


Dedicated vs Virtual Servers

A common question about hosting is whether to use a dedicated server or a virtual server.

Dedicated Servers

If you use a dedicated server then you will face the risk of problems which interrupt the boot process. It seems that all the affordable dedicated server offerings lack any good remote management, so when the server doesn’t boot you either have to raise a trouble ticket with the company running the Data-Center (DC) or use some sort of hack. Hetzner is a dedicated server company that I have had good experiences with [1], when a server in their DC fails to boot you can use their web based interface (at no extra charge or delay) to boot a Linux recovery environment which can then be used to fix whatever the problem may be. They also charge extra for hands-on support which could be used if the Linux recovery environment revealed no errors but the system just didn’t work. This isn’t nearly as good as using something like IPMI which permits remote console access to see error messages and more direct control of rebooting.

The up-side of a dedicated server is performance. Some people think that avoiding virtualisation improves performance, but in practice most virtual servers use virtualisation technologies that have little overhead. A bigger performance issue than the virtualisation overhead is the fact that most companies running DCs have a range of hardware in their DC and your system (whether a virtual server or a dedicated server) will be on a random system from their DC. I have observed hosting companies to give different speed CPUs and for dedicated servers different amounts of RAM for the same price. I expect that the disk IO performance also varies a lot but I have no evidence. As long as the hosting company provides everything that they offered before you sign the contract you can’t complain. It’s worth noting that CPU performance is either poorly specified or absent in most offers and disk IO performance is almost never specified. One advantage of dedicated servers in this regard is that you get to know the details of the hardware and can therefore refuse certain low spec hardware.

The real performance benefit of a dedicated server is that disk IO performance won’t be hurt by other users of the same system. Disk IO is the real issue as CPU and RAM are easy to share fairly but disk performance is difficult to share and is also a significant bottleneck on many servers.

Dedicated servers also have a higher minimum price due to the fact that a real server is being used which involves hardware purchase and rack space. Hetzner’s offers which start at E29 per month are about as cheap as it’s possible to get. But it appears that the E29 offer is for an old server – new hardware starts at E49 per month which is still quite cheap. But no dedicated server compares to the virtual servers which can be rented for prices less than $10 per month.

Virtual Servers

A virtual server will typically have an effective management interface. You should expect to get web based access to the system console as well as ssh console access. If console access is not sufficient to recover the system then there is an option to boot from a recovery device. This allows you to avoid many situations that could potentially result in down-time and when things go wrong it allows you to recover faster. Linode is an example of a company that provides virtual servers and provides a great management interface [2]. It would take a lot of work with performance monitoring and graphing tools to give the performance overview that comes for free with the Linode interface.

Disk IO performance can suck badly on virtual servers and it can happen suddenly and semi-randomly. If someone else who is using a virtual server on the same hardware is the target of a DoS attack then your performance can disappear. Performance for CPU is generally fairly reliable though. So a CPU bound server would be a better fit on the typical virtual server options than a disk IO bound server.

Virtual servers are a lot cheaper at the low end so if you don’t need the hardware capabilities of a minimal dedicated server (with 1G of RAM for Hetzner and a minimum of 8G of RAM for some other providers) then you can save a lot of money by getting a virtual server.

Finally the options for running a virtual machine under a virtual machine aren’t good. AFAIK the only options that would work on a commercial VPS offering are QEMU (an x86 CPU instruction emulator), Hercules (a S/370, S/390, and Z series IBM mainframe emulator), and similar CPU emulators. Please let me know if there are any other good options for running a virtual machine on a VPS. Now while these emulators are apparently good for debugging OS development they aren’t something that is generally useful for running a virtual machine. I knew someone who ran his important servers under Hercules so that x86 exploits couldn’t be used for attacking them, but apart from that CPU emulation isn’t generally useful for servers.

Summary

If you want to have entire control of the hardware or if you want to run your own virtual machines that suit your needs (EG one with lots of RAM and another with lots of disk space) then a dedicated server is required. If you want to have minimal expense or the greatest ease of sysadmin use then a virtual server is a better option.

But the cheapest option for virtual hosting is to rent a server from Hetzner, run Xen on it, and then rent out DomUs to other people. Apart from the inevitable pain that you experience if anything goes wrong with the Dom0 this is a great option.

As an aside, if anyone knows of a reliable company that offers some benefits over Hetzner then please let me know.

What I would Like to See

There is no technical reason why a company like Linode couldn’t make an offer which was a single DomU on a server taking up all available RAM, CPU, and disk space. Such an offer would be really compelling if it wasn’t excessively expensive. That would give Linode ease of management and also a guarantee that no-one else could disrupt your system by doing a lot of disk IO. This would be really easy for Linode (or any virtual server provider) to implement.

There is also no technical reason why a company like Linode couldn’t allow their customers to rent all the capacity of a physical system and then subdivide it among DomUs as they wish. I have a few clients who would be better suited by Linode DomUs that are configured for their needs rather than stock Linode offerings (which never seem to offer the exact amounts of RAM, disk, or CPU that are required). Also if I had a Linode physical server that only had DomUs for my clients then I could make sure that none of them had excessive disk IO that affected the others. This would require many extra features in the Linode management web pages, so it seems unlikely that they will do it. Please let me know if there is someone doing this; it’s obvious enough that someone must be doing it.

Update:

A Rimuhosting employee pointed out that they offer virtual servers on dedicated hardware which meets this criteria [3]. Rimuhosting allows you to use their VPS management system for running all the resources in a single server (so no-one else can slow your VM down) and also allow custom partitioning of a server into as many VMs as you desire.


Server Costs vs Virtual Server Costs

The Claim

I have seen it claimed that renting a virtual server can be cheaper than paying for electricity on a server you own. So I’m going to analyse this with electricity costs from Melbourne, Australia and the costs of running virtual servers in the US and Europe as these are the options available to me.

The Costs

According to my last bill I’m paying 18.25 cents per kWh – that’s a domestic rate for electricity use and businesses pay different rates. For this post I’m interested in SOHO and hobbyist use so business rates aren’t relevant. I’ll assume that a year has 365.25 days as I really doubt that people will change their server arrangements to save some money on a leap year. A device that draws 1W of power if left on for 365.25 days will take 365.25*24/1000 = 8.766kWh which will cost 8.766*0.1825 = $1.5997950. I’ll round that off to $1.60 per Watt-year.
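
For anyone who wants to redo that calculation with their own tariff, it’s just:

# cost of one Watt-year at 18.25 cents/kWh
echo "365.25 * 24 / 1000 * 0.1825" | bc -l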

I’ve documented the power use of some systems that I own [1]. I’ll use the idle power use because most small servers spend so much time idling that the time they spend doing something useful doesn’t affect the average power use. I think it’s safe to assume that someone who really wants to save money on a small server isn’t going to buy a new system, so I’ll look at the older and cheaper systems. The lowest power use there is a Cobalt Qube: a 450MHz AMD K6 is really small, but 20W when idling means a cost of only $32 per annum. My Thinkpad T41p is a powerful little system, a 1.7GHz Pentium-M with 1.5G of RAM, a 100G IDE disk and a Gig-E port should be quite useful as a server – which now that the screen is broken is a good use for it. That Thinkpad drew 23W at idle with the screen on last time I tested it, which means an annual cost of $36.80 – or something a little less if I leave the screen turned off. A 1.8GHz Celeron with 3 IDE disks drew 58W when idling (but with the disks still spinning); let’s assume for the sake of discussion that a well configured system of that era would take 60W on average and cost $96 per annum.

So my cost for electricity would vary from as little as $36.80 to as much as $96 per year depending on the specs of the system I choose. That’s not considering the possibility of doing something crazy like ripping the IDE disk out of an old Thinkpad and using some spare USB flash devices for storage – I’ve been given enough USB flash devices to run a RAID array if I was really enthusiastic.

For virtual server hosting the cheapest I could find was Xen Europe, which charges E5 for a virtual server with 128M of RAM, 10G of storage, and 1TB of data transfer [2] – that is $AU7.38. The next best was Quantact, which charges $US15 for a virtual server with 256M of RAM [3] – that is $AU16.41.

Really for my own use if I was paying I might choose Linode [4] or Slicehost [5]; they both charge $US20 ($AU21.89) for their cheapest virtual server, which has 360M or 256M of RAM respectively. I’ve done a lot of things with Linode and Slicehost and had some good experiences; Xen Europe got some good reviews last time I checked but I haven’t used them.

The Conclusion

A Xen Europe virtual server at $88.56 per annum might be slightly cheaper than running my old Celeron system – but it would be more expensive than buying electricity for my old Thinkpad. If I needed more than 128M of RAM (which seems likely) then the next cheapest option is a 256M Xen Europe server for $14.76 per month, which is $177.12 per annum and makes my old computers look very appealing. If I needed more than a Gig of RAM then my old Thinkpad would be a clear winner, and if I needed good disk IO capacity (something that always seems poor in virtual servers) then a local server would also win.

Virtual servers win when serious data transfer is needed. Even if you aren’t based in a country like Australia where data transfer quotas are small (see my previous post about why Internet access in Australia sucks [6]) you will probably find that any home Internet connection you can reasonably afford doesn’t allow the fast transfer of large quantities of data that you would desire from a server.

So I conclude that apart from strange and unusual corner cases it is cheaper in terms of ongoing expenses to run a small server in your own home than to rent a virtual server.

If you have to purchase a system to run as a server (let’s say $200 for something cheap) and assume hardware depreciation expenses (maybe another $200 every two years) then you might be able to save money. But this also seems like a corner case as the vast majority of people who have the skills to run such servers also have plenty of old hardware, they replace their main desktop systems periodically and often receive gifts of old hardware.

One final fact that is worth considering is that if your time has a monetary value and you aren’t going to learn anything useful by running your own local server, then using a managed virtual server such as those provided by Linode (who have a really good management console) will probably save enough time to make it worth the expense.


Xen and Debian/Squeeze

Ben Hutchings announced that the Debian kernel team are now building Xen flavoured kernels for Debian/Unstable [1]. Thanks to Max Attems and the rest of the kernel team for this and all their other great work! Thanks Ben for announcing it. The same release included OpenVZ, updated DRM, and the kernel mode part of Nouveau – but Xen is what interests me most.

I’ve upgraded the Xen server that I use for my SE Linux Play Machine [2] to test this out.

To get this working you first need to remove xen-tools as the Testing version of bash-completion has an undeclared conflict; see Debian bug report #550590.

Then you need to upgrade to Unstable, this requires upgrading the kernel first as udev won’t upgrade without it.

If you have an existing system you need to install xen-hypervisor-3.4-i386 and purge xen-hypervisor-3.2-1-i386 as the older Xen hypervisor won’t boot the newer kernel. This also requires installing xen-utils-3.4 and removing xen-utils-3.2-1 as the utilities have to match the kernel. You don’t strictly need to remove the old hypervisor and utils packages as it should be possible to have dual-boot configured with old and new versions of Xen and matching Linux kernels. But this would be painful to manage as update-grub doesn’t know how to match Xen and Linux kernel versions so you will get Grub entries that are not bootable – it’s best to just do a clean break and keep a non-Xen version of the older kernel installed in case it doesn’t initially boot.
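
In apt terms that swap is something like the following (package names as above):

apt-get install xen-hypervisor-3.4-i386 xen-utils-3.4
apt-get purge xen-hypervisor-3.2-1-i386 xen-utils-3.2-1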

An apt-get dist-upgrade operation will result in installing the grub-pc package. The update-grub2 command doesn’t generate Xen entries. I’ve filed Debian bug report #574666 about this.

Because the Linux kernel doesn’t want to reduce in size to low values I use “xenhopt=dom0_mem=142000” in my GRUB 0.98 configuration so that the kernel doesn’t allocate as much RAM to its internal data structures. In the past I’ve encountered a kernel memory management bug related to significantly reducing the size of the Dom0 memory after boot [3].
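
For reference, here is a sketch of the relevant part of a GRUB 0.9x /boot/grub/menu.lst, assuming the Debian convention of commented directives that update-grub reads (the xenkopt line is just an example):

## Xen hypervisor options to use with the default Xen boot option
# xenhopt=dom0_mem=142000
## Xen Linux kernel options to use with the default Xen boot option
# xenkopt=console=tty0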

Before I upgraded I had the dom0_mem size set to 122880 but when running Testing that seems to get me a kernel Out Of Memory condition from udev in the early stages of boot which prevents LVM volumes from being scanned and therefore prevents swap from being enabled so the system doesn’t work correctly (if at all). I had this problem with 138000M of RAM so I chose 142000 as a safe number. Now I admit that the system would probably boot with less RAM if I disabled SE Linux, but the SE Linux policy size of the configuration I’m using in the Dom0 has dropped from 692K to 619K so it seems likely that the increase in required memory is not caused by SE Linux.

The Xen Dom0 support on i386 in Debian/Unstable seems to work quite well. I wouldn’t recommend it for any serious use, but for something that’s inherently designed for testing (such as a SE Linux Play Machine) then it works well. My Play Machine has been offline for the last few days while I’ve been working on it. It didn’t take much time to get Xen working, it took a bit of time to get the SE Linux policy for Unstable working well enough to run Xen utilities in enforcing mode, and it took three days because I had to take time off to work on other projects.


Starting with KVM

I’ve just bought a new Thinkpad that has hardware virtualisation support and I’ve got KVM running.

HugePages

The Linux-KVM site has some information on using hugetlbfs to allow the use of 2MB pages for KVM [1]. I put “vm.nr_hugepages = 1024” in /etc/sysctl.conf to reserve 2G of RAM for KVM use. The web page notes that it may be impossible to allocate enough pages if you set it some time after boot (the kernel can allocate memory that can’t be paged and it’s possible for RAM to become too fragmented to allow allocation). As a test I reduced my allocation to 296 pages and then increased it again to 1024, I was surprised to note that my system ran extremely slow while reserving the pages – it seems that allocating such pages is efficient when done at boot time but not so efficient when done later.
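
The same setting can be applied at run time, and the number of pages that were actually reserved checked, like this:

sysctl vm.nr_hugepages=1024
grep HugePages /proc/meminfo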

hugetlbfs /hugepages hugetlbfs mode=1770,gid=121 0 0

I put the above line in /etc/fstab to mount the hugetlbfs filesystem. The mode of 1770 allows anyone in the group to create files but not unlink or rename each other’s files. The gid of 121 is for the kvm group.

I’m not sure how hugepages are used; they aren’t used in the most obvious way. I expected that allocating 1024 huge pages would allow allocating 2G of RAM to the virtual machine, but that’s not the case as “-m 2048” caused kvm to fail. I also expected that the number of HugePages free according to /proc/meminfo would reliably drop by an amount that approximately matches the size of the virtual machine – which doesn’t seem to be the case.

I have no idea why KVM with Hugepages would be significantly slower for user and system CPU time but still slightly faster for the overall build time (see the performance section below). I’ve been unable to find any documents explaining in which situations huge pages provide advantages and disadvantages or how they work with KVM virtualisation – the virtual machine allocates memory in 4K pages so how does that work with 2M pages provided to it by the OS?

But Hugepages does provide a slight benefit in performance and if you have plenty of RAM (I have 5G and can afford to buy more if I need it) you should just set it up as soon as you start.

I have filed Debian bug report #574073 about KVM displaying an error you normally can’t see when it can’t access the hugepages filesystem [6].

Permissions

open /dev/kvm: Permission denied
Could not initialize KVM, will disable KVM support

One thing that annoyed me about KVM is that the Debian/Lenny version will run QEMU emulation instead if it can’t run KVM. I discovered this when a routine rebuild of the SE Linux Policy packages in a Debian/Unstable virtual machine took an unreasonable amount of time. When I halted the virtual machine I noticed that it had displayed the above message on stderr before changing into curses mode (I’m not sure of the correct term for this), such that the message was obscured until the xterm was returned to non-curses mode at program exit. I had to add the user in question to the kvm group. I’ve filed Debian bug report #574063 about this [2].

Performance

Below is a table showing the time taken for building the SE Linux reference policy on Debian/Unstable. It compares running QEMU emulation (using the kvm command but without permission to access /dev/kvm), KVM with and without hugepages, Xen, and a chroot. Xen is run on an Opteron 1212 Dell server system with 2*1TB SATA disks in a RAID-1 while the KVM/QEMU tests are on an Intel T7500 CPU in a Thinkpad T61 with a 100G SATA disk [4]. All virtual machines had 512M of RAM and 2 CPU cores. The Opteron 1212 system is running Debian/Lenny and the Thinkpad is running Debian/Lenny with a 2.6.32 kernel from Testing.

Configuration                                      Elapsed  User    System
QEMU on Opteron 1212 with Xen installed            126m54   39m36   8m1
QEMU on T7500                                      95m42    42m57   8m29
KVM on Opteron 1212                                7m54     4m47    2m26
Xen on Opteron 1212                                6m54     3m5     1m5
KVM on T7500                                       6m3      2m3     1m9
KVM Hugepages on T7500 with NCurses console        5m58     3m32    2m16
KVM Hugepages on T7500                             5m50     3m31    1m54
KVM Hugepages on T7500 with 1800M of RAM           5m39     3m30    1m48
KVM Hugepages on T7500 with 1800M and file output  5m7      3m28    1m38
Chroot on T7500                                    3m43     3m11    0m29

I was surprised to see how inefficient it is when compared with a chroot on the same hardware. It seems that the system time is the issue. Most of the tests were done with 512M of RAM for the virtual machine; I tried 1800M which improved performance slightly (less IO means fewer context switches to access the real block device) and redirecting the output of dpkg-buildpackage to /tmp/out and /tmp/err reduced the build time by 32 seconds – it seems that the context switches for networking or console output really hurt performance. But for the default build it seems that it will take about 50% longer in a virtual machine than in a chroot. This is bearable for the things I do (of which building the SE Linux policy is the most time consuming), but if I was to start compiling KDE then I would be compelled to use a chroot.

I was also surprised to see how slow it was when compared to Xen, for the tests on the Opteron 1212 system I used a later version of KVM (qemu-kvm 0.11.0+dfsg-1~bpo50+1 from Debian/Unstable) but could only use 2.6.26 as the virtualised kernel (the Debian 2.6.32 kernels gave a kernel Oops on boot). I doubt that the lower kernel version is responsible for any significant portion of the extra minute of build time.

Storage

One way of managing storage for a virtual machine is to use files on a large filesystem for its block devices; this can work OK if you use a filesystem that is well designed for large files (such as XFS). I prefer to use LVM. One thing I have not yet discovered is how to make udev assign the kvm group to all devices that match /dev/V0/kvm-*.
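
A udev rule along the following lines might do it; this is a hypothetical, untested sketch that relies on the DM_NAME property that the device-mapper rules set (an LV named kvm-unstable in VG V0 shows up as V0-kvm--unstable):

# /etc/udev/rules.d/65-kvm-lvm.rules (hypothetical, untested)
SUBSYSTEM=="block", ENV{DM_NAME}=="V0-kvm--*", GROUP="kvm", MODE="0660"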

Startup

KVM seems to be basically designed to run from a session, unlike Xen which can be started with “xm create” and then run in the background until you feel like running “xm console” to gain access to the console. One way of dealing with this is to use screen. The command “screen -S kvm-foo -d -m kvm WHATEVER” will start a screen session named kvm-foo that will be detached and will start by running kvm with “WHATEVER” as the command-line options. When screen is used for managing virtual machines you can use the command “screen -ls” to list the running sessions and then commands such as “screen -r kvm-unstable” to reattach to screen sessions. To detach from a running screen session you type ^A^D.

The problem with this is that screen will exit when the process ends and that loses the shutdown messages from the virtual machine. To solve this you can put “exec bash” or “sleep 200” at the end of the script that runs kvm.

start-stop-daemon -S -c USERNAME --exec /usr/bin/screen -- -S kvm-unstable -d -m /usr/local/sbin/kvm-unstable

On a Debian system the above command in a system boot script (maybe /etc/rc.local) could be used to start a KVM virtual machine on boot. In this example USERNAME would be replaced by the name of the account used to run kvm, and /usr/local/sbin/kvm-unstable is a shell script to run kvm with the correct parameters. Then as user USERNAME you can attach to the session later with the command “screen -x kvm-unstable“. Thanks to Jason White for the tip on using screen.

I’ve filed Debian bug report #574069 [3] requesting that kvm change its argv[0] so that top(1) and similar programs can be used to distinguish different virtual machines. Currently when you have a few entries named kvm in top’s output it is annoying to match the CPU hogging process to the virtual machine it’s running.

It is possible to use KVM with X or VNC for a graphical display by the virtual machine. I don’t like these options, I believe that Xephyr provides better isolation, I’ve previously documented how to use Xephyr [5].

kvm -kernel /boot/vmlinuz-2.6.32-2-amd64 -initrd /boot/initrd.img-2.6.32-2-amd64 -hda /dev/V0/unstable -hdb /dev/V0/unstable-swap -m 512 -mem-path /hugepages -append "selinux=1 audit=1 root=/dev/hda ro rootfstype=ext4" -smp 2 -curses -redir tcp:2022::22

The above is the current kvm command-line that I’m using for my Debian/Unstable test environment.

Networking

I’m using KVM options such as “-redir tcp:2022::22” to redirect unprivileged ports (in this case 2022) to the ssh port. This works for a basic test virtual machine but is not suitable for production use. I want to run virtual machines with minimal access to the environment, this means not starting them as root.
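
With that redirection in place the ssh server in the VM is reachable from the host like this:

ssh -p 2022 root@localhost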

One thing I haven’t yet investigated is the vde2 networking system which allows a private virtual network over multiple physical hosts and which should allow kvm to be run without root privs. It seems that all the other networking options for kvm which have appealing feature sets require that the kvm process be started with root privs.

Is KVM worth using?

It seems that KVM is significantly slower than a chroot, so for a basic build environment a secure chroot environment would probably be a better option. I had hoped that KVM would be more reliable than Xen, which would offset the performance loss – however as KVM and Debian kernel 2.6.32 don’t work together on my Opteron system it seems that I will have some reliability issues with KVM comparable to the Xen issues. There are currently no Xen kernels in Debian/Testing so KVM is usable now with the latest bleeding edge stuff (on my Thinkpad at least) while Xen isn’t.

Qemu is really slow, so Xen is the only option for 32bit hardware. Therefore all my 32bit Xen servers need to keep running Xen.

I don’t plan to switch my 64bit production servers to KVM any time soon. When Debian/Squeeze is released I will consider whether to use KVM or Xen after upgrading my 64bit Debian server. I probably won’t upgrade my 64bit RHEL-5 server any time soon – maybe when RHEL-7 is released. My 64bit Debian test and development server will probably end up running KVM very soon, I need to upgrade the kernel for Ext4 support and that makes KVM more desirable.

So it seems that for me KVM is only going to be seriously used on my laptop for a while.

Generally I am disappointed with KVM. I had hoped that it would give almost the performance of Xen (admittedly it was only 14.5% slower). I had also hoped that it would be really reliable and work with the latest kernels (unlike Xen) but it is giving me problems with 2.6.32 on Opteron. Also it has some new issues such as deciding to quietly do something I don’t want when it’s unable to do what I want it to do.