
Xen and Bridging

In a default configuration of Xen a virtual Ethernet device will be created for each DomU interface and associated with a bridge. A previous post documented how to configure a bridge named xenbr0.

The basic configuration of Xen that most people use is to have a single virtual Ethernet port for each Xen instance and have them all connected to the one bridge, and then the Dom0 will have an IP address on the bridge interface that is used for routing packets to the outside world. This works really well if you have a subnet that you are using for all Xen DomU IP addresses, if you are using NAT for communication, or if the DomU needs no communication outside the Dom0 and other DomU’s on the same machine (a common case for testing).

But if you have a collection of servers that you want to consolidate on a single piece of hardware then you end up using a single sub-net that spans some physical machines, some Xen Dom0’s, and some DomU’s. The solution to this is to use bridged networking.

Unfortunately most documentation of bridged networking is really confusing, and none of my Google searches turned up the most relevant fact:

When setting up a bridge on the local Ethernet you must make your physical ethernet device (eth0 or whatever) be strictly a slave to the bridge and then assign the IP address used for the physical network to the bridge.

ifconfig eth0 up
brctl addif xenbr0 eth0

For example if you have 10.0.0.42 being the IP address used by the Dom0 on the local Ethernet via device eth0 and you want to use bridging for DomU’s then you simply make eth0 owned by xenbr0 (the typical name for the Xen bridge) with the above commands in your script to configure the xenbr0 device. Then treat xenbr0 in the same way that you treated eth0 before enabling bridging.
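Putting the pieces together, a minimal sketch of the bridge setup (assuming the bridge has already been created, using the 10.0.0.42 address from above, and a hypothetical 10.0.0.254 gateway) would be:

ifconfig eth0 0.0.0.0 up                             # eth0 itself carries no IP address
brctl addif xenbr0 eth0                              # make eth0 a slave of the bridge
ifconfig xenbr0 10.0.0.42 netmask 255.255.255.0 up   # the Dom0 address lives on the bridge
route add default gw 10.0.0.254                      # gateway address is an assumption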

Also there’s nothing stopping you from having one bridge for DomU’s that can talk directly to the physical Ethernet and another for DomU’s that only use routed networking; see my previous post about using multiple Ethernet devices in Xen for more background information.

Modules and NFS for Xen

I’m just in the process of converting a multi-user system to a Xen DomU. It was running on a stand-alone Fedora Core 5 i386 system and I want to run it on a Fedora 7 DomU under a CentOS 5 Dom0 on an Opteron system.

The first stage of the conversion was to copy an image of the Fedora Core 5 system and make it a DomU under CentOS. I had some problems getting a Fedora Core 5 Xen kernel to boot, so I installed a 64bit CentOS 5 kernel with the Fedora Core 5 user-space. I had expected problems with kernel modules (in particular that the 32bit modutils would be unable to load 64bit modules), but surprisingly everything just worked.

The next stage was to have the old server NFS export /home and have it mounted by the Xen DomU; this worked well for about a week. The step after that was to move the data onto the new server. My first attempt was to have the Dom0 running the filesystems and NFS exporting them to the DomU, but this caused an OpenOffice error: “Error saving the document Name: General Error. General input/output error.”
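For reference, the first arrangement (old server exporting /home to the DomU) looked roughly like this (a sketch; the host name, network range, and export options are assumptions):

# on the old Fedora Core 5 server, in /etc/exports:
/home 10.0.0.0/24(rw,sync,no_subtree_check)

# on the DomU:
mount -t nfs oldserver:/home /home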

So having 32bit Fedora Core 5 with a 64bit CentOS 5 kernel NFS mounting from a 32bit Fedora Core 5 system works well, while mounting from a 64bit CentOS 5 system fails. If anything I would have expected better results from having the same version of the kernel on NFS client and server.

The next issue is whether a 64bit Fedora 7 system in a DomU can NFS mount the data from the CentOS 5 kernel with Fedora Core 5 user-space. If not, it’ll make testing the Fedora 7 upgrade significantly more painful than it might otherwise be.

If only we had a network filesystem for Unix that supported POSIX semantics.

Bizarre “No space left on device” error from Xen

What should have been a routine “remove DIMMs and run memtest until things work” procedure to solve a memory error became a lot more complex due to poor error handling in Xen.

The following error occurred because the tdb database /var/lib/xenstored/tdb was corrupt. To fix it you must rm the file and kill the xenstored process (which will otherwise recreate the file with the same corrupt data). It took me a few hours to work this out.

# xm create -c smtp
Using config file "/etc/xen/smtp".
Error: (28, 'No space left on device, while writing /local/domain/0/backend/vbd/18/768/online : 1')

Strangely the command “/etc/init.d/xend restart” does not restart xenstore or xenconsole (which is why running “/etc/init.d/xend stop” before rm’ing the file didn’t do any good).
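To summarise the fix described above, the sequence that worked was roughly this (a sketch; note that restarting xend alone is not enough):

/etc/init.d/xend stop
kill $(pidof xenstored)     # xend stop leaves xenstored running
rm /var/lib/xenstored/tdb   # otherwise xenstored recreates the corrupt database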

After fixing the above problem I encountered the following error condition. It seems that there is no support for restarting daemons such as xenstored. So I had to reboot the machine. After that it worked.

# xm create -c smtp
Using config file "/etc/xen/smtp".
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.

I’ve filed Debian bug 433780 about this.


Xen and Heartbeat

Xen (a system for running multiple virtual Linux machines) has some obvious benefits for testing Heartbeat (the clustering system) – the cheapest new machine that is on sale in Australia can be used to simulate a four node cluster. I’m not sure whether there is any production use for a cluster running under Xen (I look forward to seeing some comments or other blog posts about this).

Most cluster operations run on a Xen virtual machine in the same way as they would under physically separate machines, and Xen even supports simulating a SAN or fiber-channel shared storage device if you use the syntax phy:/dev/VG/LV,hdd,w! in the Xen disk configuration line (the exclamation mark means that the volume is writable even if someone else is writing to it).

The one missing feature is the ability to STONITH a failed node. This is quite critical as the design of Heartbeat is that a service on a node which is not communicating will not be started on another node until the failed node comes up after a reboot or the STONITH sub-system states that it has rebooted it or turned it off. This means that the failure of a node implies the permanent failure of all services on it until/unless the node can be STONITH’d.

To solve this problem I have written a quick Xen STONITH module. The first issue is how to communicate between the DomU’s (Xen virtual machines) and the Dom0 (the physical host). It seemed that the best way to do this is to ssh to special accounts on the Dom0 and then use sudo to run a script that calls the Xen xm utility to actually restart the node. That way the Xen virtual machine gets limited access to the Dom0, and the shell script could even be written to allow each VM to only manage a sub-set of the VMs on the host (so you could have multiple virtual clusters on the one physical host and prevent them from messing with each other through accident or malice).

xen ALL=NOPASSWD:/usr/local/sbin/xen-stonith

Above is the relevant section from my /etc/sudoers file. It allows user xen to execute the script /usr/local/sbin/xen-stonith as root to do the work.

One thing to note is that from each of the DomU’s you must be able to ssh from root on the node to the specified account for the Xen STONITH service without using a password and without any unreasonable delay (IE put UseDNS no in /etc/ssh/sshd_config).
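As an illustration only (a hypothetical sketch, not the complete script referenced below), the sudo-invoked helper could look something like this:

#!/bin/sh
# /usr/local/sbin/xen-stonith - hypothetical sketch of the helper run via sudo
# Usage: xen-stonith <domU-name>
NODE="$1"
case "$NODE" in
  node1|node2|node3|node4) ;;                       # only DomU's of this virtual cluster (names are assumptions)
  *) echo "fencing $NODE not permitted" >&2; exit 1 ;;
esac
xm destroy "$NODE"                                  # hard power-off of the virtual machine
xm create "/etc/xen/$NODE"                          # boot it again so it can rejoin the cluster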

In the section below (which isn’t in the feed) there are complete scripts for configuring this.


first look at CentOS 5 Xen

I have just installed a machine running CentOS 5 as a Xen server. I installed a full GUI environment on the dom0 so that GUI tools can be used for managing the virtual servers.

The first problem I had was selecting the “Installation source”; it’s described in the error message as an “Invalid PV media address” when you get it wrong, which caused me a little confusion when installing it at 10PM. Then I had a few problems getting the syntax of an nfs://1.2.3.4:/directory URL correct. But these were trivial annoyances. It was a little annoying that my attempts to use a “file://” URL were rejected; I had hoped that it would just run exportfs to make the NFS export from the local machine (much faster than using an NFS server over the network, which is what the current setup will lead people to do).

The first true deficiency I found with the tools is that it provides no way of creating filesystems on block devices. The process of allocating a block device or file from the Xen configuration tool is merely assigning a virtual block device to the Xen image – and only one such virtual block device is permitted. Then the CentOS 5 installation instance that runs under Xen will have to partition the disk (it doesn’t support installing directly to an unpartitioned disk) which will make things painful when it comes time to resize the filesystems.

When running Debian Xen servers I do everything manually. A typical Debian Xen instance that I run will have a virtual block device /dev/hda for the root FS, /dev/hdb for swap, and /dev/hdc for /home. Then if I want to resize them I merely stop the Xen instance, run “e2fsck -f” on the filesystem followed by “resize2fs” and the LVM command “lvresize” (in the appropriate order depending on whether I am extending or reducing the filesystem).
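For example, shrinking /home and later growing it again goes roughly like this (a sketch with an assumed LV name and sizes; the DomU must be stopped first):

e2fsck -f /dev/lvm/xenhome        # always check before resizing
resize2fs /dev/lvm/xenhome 4G     # shrink the filesystem first...
lvresize -L 4G /dev/lvm/xenhome   # ...then the LV (lvresize asks for confirmation when shrinking)

lvresize -L 8G /dev/lvm/xenhome   # to grow, extend the LV first...
e2fsck -f /dev/lvm/xenhome
resize2fs /dev/lvm/xenhome        # ...then expand the filesystem to fill it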

Xen also supports creating a virtual partitioned disk. This means I could have /dev/lvm/xenroot, and /dev/lvm/xenswap, and /dev/lvm/xenhome appear in the domU as /dev/hda1, /dev/hda2, and /dev/hda3. This means that I could have a single virtual disk that allows the partitions to be independently resized when the domU in question is not running. I have not tried using this feature as it doesn’t suit my usage patterns. But it’s interesting and unfortunate that the GUI tools which are part of CentOS don’t support it.

When I finally got to run the install process it had a virtual graphics environment (which is good) but unfortunately it suffered badly from the two-mouse-cursor problem, with different accelerations used for the two cursors so that the difference in position between them varied in different parts of the screen. This was rather surprising as the dom0 had a default GNOME install.

Xen and eth device renaming

Recently I rebooted one of my Debian Xen servers and suddenly all the Ethernet devices which used to be eth0 in the domU’s became eth1.

vif = [ '', 'bridge=xenbr1' ]

I used to have the above as the interface definition and for domU’s that had only a single interface that worked well (if there is only one interface then it should be eth0). However in a recent etch update this changed, so I had to use ifrename as documented in my previous blog post. It’s annoying when things break because a reasonable assumption which previously worked suddenly stops working.

Even if the bug in question (if it is regarded as a bug) is fixed I’ll keep using ifrename, as it doesn’t do any harm.

Update: I have changed my Xen configuration to use fixed MAC addresses which seems to be a better solution than using ifrename. See the Wikipedia page about MAC addresses for information on how to choose them. I’m currently using manually assigned MAC addresses from the range 00:16:3e (which is assigned to Xen).
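For example, a vif line with a manually assigned MAC address looks like the following (the particular address is just an illustration from the Xen range):

vif = [ 'mac=00:16:3e:00:00:01, bridge=xenbr0' ]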

xen sucks

According to Debian bug #399113 and linked discussion it is impossible to run a stable system on Xen without enabling PAE. It seems that no-one is considering the fact that a hypervisor that runs on both 32bit and 64bit architectures should be able to support 32bit systems with <4G of RAM (IE not using the PAE feature).

But instead to work around this bug the Debian developers have decided to just enable PAE. This is annoying for me as I have to either buy a new laptop or reduce my use of Xen.

I wonder what will happen if a Xen bug is discovered that only happens on PAE systems? Would that make Xen only an AMD64 thing?

Xephyr

As part of my work on Xen I’ve been playing with Xephyr (a replacement for Xnest). My plan is to use Xen instances for running different versions of desktop environments. You can’t just ssh -X to a Xen image and run things. One problem is that some programs such as Firefox do strange things to try and ensure that you only have one instance running. Another problem is with security, the X11 security extensions don’t seem to do much good. A quick test indicates that a ssh -X session can’t copy the window contents of a ssh -Y session, but can copy the contents of all windows run in the KDE environment. So this extension to X (and the matching ssh support) seem to do little good.

One thing I want to do is to have a Xen image for running Firefox with risky extensions such as Flash and keep it separate from my main desktop for security and manageability.

Xephyr :1 -auth ~/.Xauth-Xephyr -reset -terminate -screen 1280x1024

My plan is to use a command such as the above to run the virtual screen. That means to have a screen resolution of 1280×1024, to terminate the X server when the last client exits (both the -reset and the -terminate options are required for this), to be display :1 and listen with TCP (the default), and to use an authority file named ~/.Xauth-Xephyr.

xauth generate :1 .

The first problem is how to generate the auth file, the xauth utility is documented as doing it via the above command. But this really connects to a running X server and copies the auth data from it.

The solution (as pointed out to me by Dr. Brian May) is to be found in the startx script which solves this problem. The way to do it is to use the add :1 . $COOKIE command in xauth to create the auth file used by the X server, and to generate the cookie with the mcookie program.

In ~/.ssh/config:
Host server
SendEnv DISPLAY

In /etc/ssh/sshd_config:
AcceptEnv DISPLAY

The next requirement is to tell the remote machine (which incidentally doesn’t need to be a Xen virtual machine, it can be any untrusted host that contains X applications you want to run) which display to use. The first thing to do is to ssh to the machine in question and run the xauth program to add the same cookie as is used for the X server. Then the DISPLAY environment variable can be sent across the link by setting the ~/.ssh/config file at the client end to have the above settings (where server is the name of the host we will connect to via SSH) and in the sshd_config file on the server have the line AcceptEnv DISPLAY to accept the DISPLAY environment variable. It would have been a little easier to configure if I had added the auth entry to the main ~/.Xauthority file and used the command DISPLAY=:1 ssh -X server, this would be the desired configuration when operating over an untrusted network. But when talking to a local Xen instance it gives better performance to not encrypt the X data.

The following script will generate an xauth entry, run a 1280×1024 resolution Xephyr session, and connect to the root account on machine server and run the twm window manager. Xephyr will exit when all X applications end. Note that you probably want to use passwordless authentication on the server as typing a password twice to start the session would be a drag.

#!/bin/sh

COOKIE=`mcookie`
FILE=~/.Xauth-Xephyr
rm -f $FILE
#echo "add 10.1.0.1:1 . $COOKIE" | xauth
ssh root@server "echo \"add 10.1.0.1:1 . $COOKIE\" | xauth"
echo "add :1 . $COOKIE" | xauth -f $FILE
Xephyr :1 -auth $FILE -reset -terminate -screen 1280x1024 $* &
DISPLAY=10.1.0.1:1 ssh root@server twm
wait

Xen shared storage

disk = [ 'phy:/dev/vg/xen1,hda,w', 'phy:/dev/vg/xen1-swap,hdb,w', 'phy:/dev/vg/xen1-drbd,hdc,w', 'phy:/dev/vg/san,hdd,w!' ]

For some work that I am doing I am trying to simulate a cluster that uses fiber channel SAN storage (among other things). The above is the disk line I’m using for one of my cluster nodes: hda and hdb are the root and swap disks for a cluster node, hdc is a DRBD store (DRBD allows a RAID-1 to be run across the cluster nodes via TCP), and hdd is a SAN volume. The important thing to note is the “w!” mode for the device: this means write access is granted even in situations where Xen thinks it’s unwise (IE it’s being used by another Xen node or is mounted on the dom0). I’ve briefly tested this by making a filesystem on /dev/hdd on one node, copying data to it, then unmounting it and mounting it on another node to read the data.
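The test I did was along these lines (a sketch; the filesystem type and mount point are assumptions):

# on the first node
mkfs.ext3 /dev/hdd
mount /dev/hdd /mnt
cp -a /etc /mnt       # write some test data
umount /mnt

# on the second node
mount /dev/hdd /mnt   # the data written by the first node is visible here
ls /mnt/etc
umount /mnt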

There are some filesystems that support having multiple nodes mounting the same device at the same time, these include CXFS, GFS, and probably some others. It would be possible to run one of those filesystems across nodes of a Xen cluster. However that isn’t my aim at this time. I merely want to have one active node mount the filesystem while the others are on standby.

One thing that needs to be solved for Xen clusters is fencing. When a node of a cluster is misbehaving it needs to be denied access to the hardware in case it recovers some hours later and starts writing to a device that is now being used by another node. AFAIK the only way of doing this is via the xm destroy command, probably by having a cluster node ssh to the dom0 and then run a setuid program that calls xm destroy.


multiple ethernet devices in Xen

It seems that no-one has documented what needs to be done to correctly run multiple Ethernet devices (with one always being eth0 and the other always being eth1) in a Linux Xen configuration (or if it is documented then google wouldn’t find it for me).

vif = [ 'mac=00:16:3e:00:01:01', 'mac=00:16:3e:00:02:01, bridge=xenbr1' ]

Firstly I use a vif line such as the above in the Xen configuration. This means that there is one Ethernet device with the hardware address 00:16:3e:00:01:01 and another with the address 00:16:3e:00:02:01. I just updated this section: the 00:16:3e prefix has officially been allocated to the Xen project for virtual machines. Therefore on your Xen installation you can do whatever you like with MAC addresses in that range without risk of conflicting with real hardware. The Xen code uses random MAC addresses in that range if you let it.

I have two bridge devices, xenbr0 and xenbr1. I only need to specify xenbr1 in the vif line, as Xen uses the default bridge (xenbr0) when no bridge is specified.

Now when my domU’s boot they assign ethernet device names from the range eth0 to eth8. If there is only one virtual Ethernet device then it is always eth0 and things are easy. But for multiple devices I need to rename the interfaces.

eth0 mac 00:16:3e:00:01:01
eth1 mac 00:16:3e:00:02:01

This is done through the ifrename program (package name ifrename in Debian). I create a file named /etc/iftab with the above contents and then early in the boot process (before the interfaces are brought up) the devices will be renamed.

In the Red Hat model you edit the files such as /etc/sysconfig/networking/devices/ifcfg-eth0 and change the line that starts with HWADDR to cause a device rename on boot.
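For example (a sketch; only the HWADDR line is significant here and the other settings are assumptions):

# ifcfg-eth0 in the directory mentioned above
DEVICE=eth0
HWADDR=00:16:3e:00:01:01
ONBOOT=yes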

Update: the original version of this post used MAC addresses with a prefix of 00:00:00, the officially allocated prefix for Xen is 00:16:3e which I now use. Thanks to the person who commented about this.