Discovering OS Bugs and Using Snapshots

I’m running Debian/Unstable on an EeePC 701. I’ve got an SD card for /home etc, but the root filesystem is on the internal 4G flash storage which doesn’t have much spare space (I’ve got a full software development environment with GCC, debuggers, etc, as well as KDE4). On some of my systems I’ve started the practice of having two root filesystem installs. Modern disks are big enough that it’s usually difficult to use all the space, and even if you do use most of it, a second root filesystem only takes a fraction of a percent of the available space.

Today I discovered a problem with my EeePC. I had upgraded to the latest Unstable packages a few days ago, and now when I run X programs the screen flickers really badly every time it’s updated. Pressing a key in a terminal window makes the screen shake, and watching a video with mplayer makes it shake constantly to such a degree that it’s not usable. If this problem had occurred on a system with a second root filesystem I could have upgraded the other install a few packages at a time to try and discover the root cause. But without the space for a second root filesystem this isn’t an option.

I hope that Btrfs [1] becomes ready for serious use soon; its snapshot facility might make it possible for me to preserve the old version in a bootable form before upgrading my EeePC (although even then disk space would be tight).
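If the root filesystem were already on Btrfs, preserving the pre-upgrade state would be a single command, something like the following (a sketch only, the snapshot path is just an example):

# take a writable snapshot of the root subvolume before upgrading
btrfs subvolume snapshot / /snapshots/root-before-upgrade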

So I guess I now need to test different versions of the X related packages in a chroot environment to track this bug down. Sigh.
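A rough sketch of how I’d set up such a chroot follows (the directory and mirror are just examples, and the X packages would then be installed or downgraded one version at a time):

# build a minimal Unstable chroot on the SD card to save flash space
debootstrap unstable /home/chroot-x http://ftp.debian.org/debian
# then install the X packages to test inside it
chroot /home/chroot-x apt-get install xserver-xorg-core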

Ext4 and Debian/Lenny

I want to use the Ext4 filesystem on Xen DomUs. The reason for this is that the problem of long fsck times on Ext3 (as described in my previous post about Ext4 [1]) is compounded if you have multiple DomUs running fsck at the same time.

One issue that makes this difficult is that it is very important to be able to mount a DomU filesystem in the Dom0, and it is extremely useful to be able to fsck a DomU filesystem from the Dom0 (for example when you want to resize the root filesystem of the DomU).

I have Dom0 systems running CentOS5, RHEL5, and Debian/Lenny, and I have DomU systems running CentOS5, RHEL4, Debian/Lenny, and Debian/Unstable. So to get Ext4 support on all my Xen servers I need it for Debian/Lenny and RHEL4 (Debian/Unstable has full support for Ext4 and RHEL5 and CentOS5 have been updated to support it [2]).

The Debian kernel team apparently don’t plan to add kernel support for Ext4 in Lenny (they generally don’t do such things) and even backports.debian.org doesn’t have a version of e2fsprogs that supports ext4. So getting Lenny going with Ext4 requires a non-default kernel and a back-port of the utilities. In the past I’ve used CentOS and RHEL kernels to run Debian systems and that has worked reasonably well. I wouldn’t recommend doing so for a Dom0 or a non-virtual install, but for a DomU it works well enough and it’s not too difficult to recover from problems. So I have decided to upgrade most of my Lenny virtual machines to a CentOS 5 kernel.

When installing a CentOS 5 kernel to replace a Debian/Lenny kernel you have to use “console=tty0” as a kernel parameter instead of “xencons=tty”, use /dev/xvc0 as the terminal for running a getty (IE xvc0 is the parameter to getty), and edit /etc/rc.local (or some other init script) to run “killall -9 nash-hotplug” because a nash process from the Red Hat initrd goes into an infinite loop. Of course upgrading a CentOS kernel on a Debian system is a little more inconvenient (I upgrade a CentOS DomU and then copy the kernel modules to the Debian DomUs and the vmlinuz and initrd to the Dom0).
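For reference, the DomU changes amount to something like the following (a sketch from memory, the getty line and rc.local approach may need adjusting for your setup):

# in the Dom0 config file for the DomU: extra = "console=tty0"
# run a getty on the Red Hat style Xen console device
echo "co:2345:respawn:/sbin/getty 38400 xvc0" >> /etc/inittab
# kill the looping nash process from the Red Hat initrd at the end of boot
echo "killall -9 nash-hotplug" >> /etc/rc.local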

The inconvenience of this can be an issue in an environment where multiple people are involved in running the systems; if a sysadmin who lacks skills or confidence takes over they may be afraid to upgrade the kernel to solve security issues. Also “apt-get dist-upgrade” won’t show that a CentOS kernel can be updated, so a little more management effort is required to track which machines need to be upgraded.

deb http://www.coker.com.au lenny misc

To backport the e2fsprogs package I first needed to backport util-linux, debhelper, libtool, xz-utils, base-files, and dpkg. This is the most significant and invasive back-port I’ve done. The above apt repository has all the packages for AMD64 and i386 architectures.

For a Debian system, after the right kernel is installed and e2fsprogs (and its dependencies) are upgraded, the command “tune2fs -O flex_bg,uninit_bg /dev/xvda” can be used to enable the ext4 filesystem. At the next reboot the system will prompt for the root password and allow you to manually run “e2fsck -y /dev/xvda” to do the real work of transitioning the filesystem (unlike Red Hat based distributions which do this automatically).
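The whole sequence for a Lenny DomU looks roughly like this (a sketch assuming a single root filesystem on /dev/xvda):

tune2fs -O flex_bg,uninit_bg /dev/xvda
# edit /etc/fstab to change the filesystem type from ext3 to ext4
reboot
# at the root password prompt after the reboot:
e2fsck -y /dev/xvda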

So the current state of my Debian systems is that the DomUs run the CentOS kernel and my backported utilities while the Dom0 runs just the backported utilities with the Lenny kernel. Thus the Debian Dom0 can’t mount filesystems from the DomUs – which makes things very difficult when there is a problem that needs to be fixed in a DomU; I have to either mount the filesystem from another DomU or boot with “init=/bin/bash”.

Ext4 and RHEL5/CentOS5

I have just noticed that Red Hat added Ext4 support to RHEL-5 in kernel 2.6.18-110.el5. They also added a new package named e4fsprogs (a break from the e2fsprogs name that has been used for so long). Hopefully they will use a single package for utilities for Ext2/3/4 filesystems in RHEL-6 and not continue this package split. Using commands such as e4fsck and tune4fs is a minor inconvenience.

Converting a RHEL 5 or CentOS 5 system to Ext4 merely requires running the command “tune4fs -O flex_bg,uninit_bg /dev/WHATEVER” to enable Ext4 on the devices, editing /etc/fstab to change the filesystem type to ext4, running a command such as “mkinitrd -f /boot/initrd-2.6.18-164.9.1.el5xen.img 2.6.18-164.9.1.el5xen” to generate a new initrd with Ext4 support (which must be done after editing /etc/fstab), and then rebooting.
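As a sketch (the device name is an example, and the kernel version is the one mentioned above):

tune4fs -O flex_bg,uninit_bg /dev/xvda   # repeat for each device
# edit /etc/fstab to change ext3 to ext4, then rebuild the initrd:
mkinitrd -f /boot/initrd-2.6.18-164.9.1.el5xen.img 2.6.18-164.9.1.el5xen
reboot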

When the system is booted it will run fsck on the filesystems automatically – but it does not display progress reports, which is rather disconcerting. The system will display “/ contains a file system with errors, check forced.” and apparently hang for a long time. This is however slightly better than the situation on Debian/Unstable where upgrading to Ext4 results in an fsck error on boot which forces you to log in in single user mode to run fsck [1] – which would be unpleasant if you don’t have convenient console access. Hopefully this will be fixed before Squeeze is released.

I now have a couple of my CentOS 5 DomUs running with Ext4 and it seems to work well.

The Transition to Ext4

I’ve been investigating the Ext4 filesystem [1].

The main factor that is driving me to Ext4 at the moment is fsck times. I have some systems running Ext3 on large filesystems which I need to extend. In most cases Ext3 filesystems have large numbers of Inodes free because the ratio of Inodes to filesystem size is set when the filesystem is created; enlarging the filesystem increases the number of Inodes in proportion, and apart from a backup/format/restore there is no way of changing this. Some of the filesystems I manage can’t be converted that way because the backup/restore would involve an unreasonable amount of downtime.
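To see how big an issue this is on a given filesystem you can check the Inode usage, something like the following (the device name is an example):

df -i                                        # Inodes used and free per filesystem
tune2fs -l /dev/sda1 | grep -i 'inode count'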

Page 11 of the OLS paper by Avantika Mathur et al [2] has a graph of the relationship between the number of Inodes and fsck time.

Ext4 also has a number of other features to improve performance, including changes to journaling and block allocation.

Now my most important systems are all virtualised. I am using Debian/Lenny and RHEL5 for the Dom0s. Red Hat might back-port Ext4 to the RHEL5 kernel, but there will probably never be a supported kernel for Debian/Lenny with Ext4 and Xen Dom0 support (there may never be a kernel for any Debian release with such support).

So this means that in a few months time I will be running some DomUs which have filesystems that can’t be mounted in the Dom0. This isn’t a problem when everything works well. But when things go wrong it’s really convenient to be able to mount a filesystem in the Dom0 to fix things, and this option will disappear for some of my systems; if virtual machine A has a problem then I will have to mount its filesystems with virtual machine B to fix it. Of course this is a strong incentive to use multiple block devices for the virtual machine so that a small root filesystem can be run with Ext3 and the rest can be Ext4.

At the moment only Debian/Unstable and Fedora have support for Ext4 so this isn’t a real issue. But Debian/Squeeze will release with Ext4 support and I expect that RHEL6 will also have it. When those releases happen I will be upgrading my virtual machines and will have these support issues.

It’s a pity that Red Hat never supported XFS; I could have solved some of these problems years ago if XFS had been available.

Now for non-virtual machines one factor to consider is that the legacy version of GRUB doesn’t support Ext4, which I discovered after I used tune2fs to convert all the filesystems on my EeePC to Ext4. I think I could have undone that tune2fs change, but instead I decided to upgrade to the new version of GRUB and copy the kernel and initramfs to a USB device in case it didn’t boot. It turns out that the new version of GRUB seems to work well for booting from Ext4.
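The upgrade amounts to something like the following (a sketch only; the device name and USB mount point are examples, and the copy to USB is just a rescue precaution):

apt-get install grub-pc            # the new version of GRUB
grub-install /dev/sda
update-grub
cp /boot/vmlinuz-* /boot/initrd.img-* /media/usb/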

One thing that is not explicitly mentioned in the howto is that the fsck pass needed to convert to Ext4 will not be done automatically by most distributions. So when I converted my EeePC I had to use sulogin to manually fsck the filesystems. This isn’t a problem with a laptop, but could be a problem with a co-located server system.

For the long term BTRFS may be a better option; I plan to test it on /home on my EeePC, but I will give Ext4 some more testing first. In any case the Ext3 filesystems on big servers are not going to go away in a hurry.

Finding Thread-unsafe Code

One problem that I have had on a number of occasions when developing Unix software is libraries that use non-reentrant code being called from threaded programs. For example, a function such as strtok() is implemented with a static variable to allow subsequent calls to operate on the same string, so calling it from a threaded program may result in a SEGV (if, for example, thread A calls strtok() and then frees the memory before thread B makes a second call to strtok()). Another problem is that a multithreaded program may have multiple threads performing operations on data of different sensitivity levels; for example a threaded milter may operate on email destined for different users at the same time. In that case use of a library call which is not thread safe may result in data being sent to the wrong destination.

One potential solution is to use a non-threaded programming model (IE a state machine or using multiple processes). State machines don’t work with libraries based on a callback model (EG libmilter), can’t take advantage of the CPU power available in a system with multiple CPU cores, and require asynchronous implementations of DNS name resolution. Multiple processes will often give less performance and are badly received by users who don’t want to see hundreds of processes in ps output.

So the question is how to discover whether a library that is used by your program has code that is not reentrant. Obviously a library could implement its own functions that use static variables – I don’t have a solution to this. But a more common problem is a library that uses strtok() and other libc functions that aren’t reentrant – simply because they are more convenient. Trying to examine the program with nm and similar tools doesn’t seem viable as libraries tend to depend on other libraries, so it’s not uncommon to have 20 shared objects being linked in at run-time. Also there is the potential problem of code that isn’t called: if library function foo() happens to call strtok() but I only call function bar() from that library, then even though the symbol strtok is resolved at run-time it shouldn’t be a problem for me.

So the obvious step is to use a LD_PRELOAD hack to override all the undesirable functions with code that will assert() or otherwise notify the developer. Bruce Chapman of Sun did a good job of this in 2002 for Solaris [1]. His code is very feature complete but has a limited list of unsafe functions.

Instead of using his code I wrote a minimal implementation of the same concept which searches the section 3 man pages installed on the system for functions which have a _r variant. In addition to that list of functions I added some functions from Bruce’s list which did not have a _r variant. That way I got a list of 72 functions compared to the 40 that Bruce uses. Of course with my method the number of functions that are intercepted will depend on the configuration of the system used to build the code – but that is OK; if the man pages are complete then they will cover all the functions that can be called from programs that you write.

Now there is one significant disadvantage to my code: the case where unsafe functions are called before any child threads are created. Such code will be aborted even though in production it won’t cause any problems. One thing I am idly considering is writing code to parse the man pages for the various functions so it can use the correct parameters when proxying the library calls with dlsym(RTLD_NEXT, function_name). The other option would be to hand code each of the 72 functions (and use more hand coding for each new library function I wanted to add).

To run my code you simply compile the shared object and then run “LD_PRELOAD=./thread.so ./program_to_test” and the program will abort and generate a core dump if the undesirable functions are called.

Here’s the source to gen.sh, the script that generates thread.c:

#!/bin/bash
# Generate thread.c: a stub for every undesirable function which just calls
# assert(0), so any call to one of them from the program under test aborts.
cat > thread.c << END
#undef NDEBUG
#include <assert.h>
END
# Functions from Bruce Chapman's list which have no _r variant.
OTHERS="getservbyname getservbyport getprotobyname getnetbyname getnetbyaddr getrpcbyname getrpcbynumber getrpcent ctermid tempnam gcvt getservent"
# Add every function that has a _r man page in section 3 (lgamma excluded).
for n in $OTHERS $(ls -1 /usr/share/man/man3/*_r.*|sed -e "s/^.*\///" -e "s/_r\..*$//"|grep -v ^lgamma|sort -u) ; do
  cat >> thread.c << END
void $n()
{
  assert(0);
}
END
done

Here is the Makefile, probably the tabs will be munged by my blog but I’m sure you know where they go:

all: thread.so

thread.c: gen.sh Makefile
	./gen.sh

thread.so: thread.c
	gcc -shared -o thread.so -fPIC thread.c

clean:
	rm thread.so thread.c
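After running make, a quick way to see which functions ended up being intercepted is to list the symbols defined in the shared object:

make
nm -D thread.so | grep ' T '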

Update:
Simon Josefsson wrote an interesting article in response to this [2].

Per-process Namespaces – pam-namespace

Mike writes about his work in using namespaces on Linux [1]. In 2006 I presented a paper titled “Polyinstantiation of directories in an SE Linux system” about this at the SAGE-AU conference [2].

Newer versions of the code in question have been included in Debian/Lenny. So if you want to use namespaces for a login session on a Lenny system you can do the following:
mkdir /tmp-inst
chmod 0 /tmp-inst
echo "/tmp /tmp-inst/ user root" >> /etc/security/namespace.conf
echo "session required pam_namespace.so" >> /etc/pam.d/common-session

Then every user will have their own unique /tmp and be unable to mess with other users.
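A quick way to verify this (assuming two test accounts, the user names are just examples):

# in a login session as user fred:
touch /tmp/fred-test
# in a separate login session as user barney:
ls /tmp        # fred-test is not visible, each user sees their own /tmp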

If you want to use the shared-subtrees facility so that mount commands which don’t affect /tmp are propagated to other sessions, then you need to have the following commands run at boot (maybe from /etc/rc.local):
mount --make-shared /
mount --bind /tmp /tmp
mount --make-private /tmp

The functionality in pam_namespace.so to use the SE Linux security context to instantiate the directory seems broken in Lenny. I’ll write a patch for this shortly.

While my paper is not particularly useful as documentation of pam_namespace.so (things changed after I wrote it), it does cover the threats that you face in terms of hostile use of /tmp and how namespaces may be used to solve them.

Things you can do for your LUG

A Linux Users Group, like most volunteer organisations, will often have a small portion of the membership making most of the contributions. I believe that every LUG has many people who would like to contribute but don’t know how; here are some suggestions for what you can do.

Firstly, offer talks. Many people seem to believe that giving a talk for a LUG requires expert knowledge. While it is desirable to have experts in an area share their knowledge, it is definitely not a requirement that you be an expert to give a talk. The only requirement is that you know more than the audience – and a small amount of research can achieve that goal.

One popular talk that is often given is “what’s new in Linux”. This is not a talk that requires deep knowledge, but it does require spending some time reading the news (which lots of people do for fun anyway). So if you spend an average of 30 minutes every week day reading about new developments in Linux and other new technology, you could spend another minute a day (20 minutes a month) making notes, and the result would be a 10 to 15 minute talk that would be well received. A talk about what’s new is one way that a novice can give a presentation that will get the attention of all the experts (who know their own area well but often don’t have time to see the big picture).

There are many aspects of Linux that are subtle, tricky, and widely misunderstood. Often mastering them is more a matter of spending time testing than anything else. An example of this is the chmod command (and all the Unix permissions that are associated with it). I believe that the majority of Linux users don’t understand all the subtleties of Unix permissions (I have even seen an employee of a Linux vendor make an error in this regard while running a formal training session). A newbie who spent a few hours trying the various combinations of chmod etc and spoke about the results could give a talk that would teach something to almost everyone in the audience. I believe that there are many other potential talk topics of this nature.

One thing that is often overlooked when considering how to contribute to LUGs is the possibility of sharing hardware. We all have all the software we need for free but hardware still costs money. If you have some hardware that hasn’t been used for a year then consider whether you will ever use it again, if it’s not likely to be used then offer it to your LUG (either via a mailing list or by just bringing it to a meeting). Also if you see some hardware that is about to be discarded and you think that someone in your LUG will like it then grab it! In a typical year I give away a couple of car-loads of second-hand hardware, most of it was about to be thrown out by a client so I grab it for my local LUG. Taking such hardware reduces disposal costs for my clients, prevents computer gear from poisoning landfill (you’re not supposed to put it in the garbage but most people do), and helps random people who need hardware.

One common use for the hardware I give away is for children. Most people are hesitant to buy hardware specifically for children as it only takes one incident of playing with the switch labeled 240V/110V (or something of a similar nature) to destroy it. Free hardware allows children to get more access to computers at an early age.

Finally, one way to contribute is by joining the committee. Many people find it difficult to attend meetings, and attending both a regular meeting and a committee meeting every month is harder still. So if you have no problems in attending meetings then please consider contributing in this way.

Debugging as a Demonstration Sport

I was watching So You Think You Can Dance [1] and thinking about the benefits that it provides to the dancing industry. The increase in public appreciation for the sport will increase the amount of money that is available to professionals, and getting more people interested in dancing as a profession will increase the level of skill in the entire industry. While the show hasn’t interested me much (I prefer to watch dancing in the context of music videos and avoid the reality TV aspect) I appreciate what it is doing. On a more general note I think that anything which promotes interest in the arts is a good thing.

I have been wondering whether similar benefits can be provided to the IT industry through competitions. There are some well established programming contests aimed at university level students and computer conferences often have contests. But the down-side of them in terms of audience interest is that they are either performed in exam conditions or performed over the course of days – neither of which makes for good viewing. The audience interaction is generally limited to the award ceremony and maybe some blog posts by the winners explaining their work.

There are a number of real-world coding tasks that can be performed in a moderate amount of time. One example is debugging certain classes of bugs; this includes memory leaks, SEGVs, and certain types of performance and reliability problems. Another is fixing man pages.

A way of running such a contest might be to have a dozen contestants on stage with their laptops connected to a KVM switch. They could choose tasks from the bug list of their favorite distribution, and when they completed a task (built a deb or rpm package with the bug fixed and updated the bug report with a patch) they could request to have their port on the KVM switch and their microphone enabled so that they could explain to the audience what they did.

Points would be awarded based on the apparent difficulty of the bug and the clarity of the explanation to the audience. A major aim of such an exercise would be to encourage members of the audience to spend some of their spare time fixing bugs!

Basically it would be a public Bug Squashing Party (BSP) but with points awarded and some minor prizes (it would be best to avoid significant prizes as that can lead to hostility).

Swapping to a Floppy Disk

In the mid ’90s I was part-owner of a small ISP. We had given out Trumpet Winsock [1] to a large number of customers and couldn’t convert them to anything else. Unfortunately a new release of the Linux kernel (from memory I think it was 2.0) happened not to work with Trumpet Winsock. Not wanting to stick with the old kernel, I decided to install a Linux machine running a 1.2.x kernel for the sole purpose of proxying connections for the Winsock users. I had a 386 machine with 8M of RAM that was suitable for the purpose.

At that time hard disks were moderately expensive, and the servers were stored in a hot place which tended to make drives die more rapidly than they might otherwise. So I didn’t want to use a hard disk for that purpose.

I configured the machine to boot from a floppy disk (CD-ROM drives also weren’t cheap then) and use an NFS root filesystem. The problem was that it needed slightly more than 8M of RAM, and swapping to NFS was not supported. My solution was to mount the floppy disk read-write and use a swap file on the floppy. The performance difference between floppy disks and hard disks was probably a factor of 10 or 20 – but both were glacially slow compared to main memory. After running for about half an hour the machine reached a state where about 400K of rarely used data had been paged out and the floppy drive was then hardly ever used.
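From memory the setup was something like the following (the sizes and paths are approximate):

mount /dev/fd0 /mnt
dd if=/dev/zero of=/mnt/swapfile bs=1024 count=1000
mkswap /mnt/swapfile
swapon /mnt/swapfile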

I had initially expected that the floppy disk would get a lot of use and wear out, I had prepared a few spare disks so that they could be swapped in case of read errors. But in about a year of service I don’t recall having a bad sector on a floppy (I replaced the floppy whenever I upgraded the kernel or rebooted for any other reason as a routine precaution).

Does anyone have an anecdote to beat that?

A Basic IPVS Configuration

I have just configured IPVS on a Xen server for load balancing between multiple virtual hosts. The benefit is not load balancing but management: with two virtual machines providing a service I can gracefully shut one down for maintenance and have the other take the load. When there are two machines providing a service, a load balancing configuration is much better than a hot-spare. One reason is that there may be application scaling issues that prevent one machine with twice the resources from giving as much performance as two smaller machines. Another is that if you have a machine configured but never used there will always be some doubt as to whether it would work…

The first thing to do is to assign the IP address of the service to the front-end machine so that other machines on the segment (IE routers) will be able to send data to it. If the address for the service is 10.0.0.5 then the command “ip addr add dev eth0 10.0.0.5/24 broadcast +” will make it a secondary address on the eth0 interface. On a Debian system you would add the line “up ip addr add dev eth0 10.0.0.5/24 broadcast + || true” to the appropriate section of /etc/network/interfaces; for a Red Hat system it seems that /etc/rc.local is the best place for it. I expect that it would be possible to merely advertise the IP address via ARP without adding it to the interface, but the ability to ping the IPVS server on the service address seems useful and there seems to be no benefit in not assigning the address.

There are three methods used by IPVS for forwarding packets: gatewaying/routing (the default), IPIP encapsulation (tunneling), and masquerading. The gatewaying/routing method requires the back-end server to respond to requests on the service address. That would mean assigning the address to the back-end server without advertising it via ARP (which seems likely to have some issues for managing the system). The IPIP encapsulation method requires setting up IPIP which seemed like it would be excessively difficult (although maybe not more so than setting up masquerading). The masquerading option (which I initially chose) rewrites the packets to have the IP address of the real server. So for example if the service address is 10.0.0.5 and the back-end server has the address 10.0.1.5 then it will see packets addressed to 10.0.1.5. A benefit of masquerading is that it allows you to use different ports, so for example you could have a non-virtualised mail server listening on port 25 and a back-end server for a virtual service listening on port 26. While there is no practical limit to the number of private IP addresses that you might use, it seems easier to manage servers listening on different ports with the same IP address – and there is the issue of server programs that are not written to support binding to a specific IP address.

ipvsadm -A -t 10.0.0.5:25 -s lblc -p
ipvsadm -a -t 10.0.0.5:25 -r 10.0.1.5 -m

The above two commands create an IPVS configuration that listens on port 25 of IP address 10.0.0.5 and then masquerades connections to 10.0.1.5 on port 25 (the default is to use the same port).
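For example, a real server entry that uses the port mapping feature would look like the following (a hypothetical back-end listening on port 26):

ipvsadm -a -t 10.0.0.5:25 -r 10.0.1.5:26 -m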

Now the problem is in getting the packets to return via the IPVS server. If the IPVS server happens to be your default gateway then it’s not a problem and it will already be working after the above two commands (if a service is listening on 10.0.1.5 port 25).

If the IPVS server is not the default gateway and you have only one IP address on the back-end server then this will require using netfilter to mark the packets and then routing based on the mark. Marking via netfilter also seems to be the only well documented way of doing similar things. I spent some time working on this and didn’t get it working. However having multiple IP addresses per server is a recommended practice anyway (a back-end interface for communication between servers as well as a front-end interface for public data).
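For completeness, the netfilter marking approach that is usually documented looks roughly like the following on the back-end server (this is the sort of thing I was attempting, not a configuration that worked for me):

iptables -t mangle -A OUTPUT -p tcp --sport 25 -j MARK --set-mark 1
ip rule add fwmark 1 table 1
ip route add default via 10.0.0.1 table 1

What I actually used instead relies on the source address of the second interface, as shown by the next two commands.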

ip rule add from 10.0.1.5 table 1
ip route add default via 10.0.0.1 table 1

I use the above two commands to set up a new routing table for the data for the virtual service. The first line causes any packets from 10.0.1.5 to be routed according to table 1 (I currently have a rough plan to have table numbers match ethernet device numbers; the data in question is going out device eth1). The second line adds a default route to table 1 which sends all packets to 10.0.0.1 (the private IP address of the IPVS server).

Then it SHOULD all be working, but in the network that I’m using (RHEL4 DomU, RHEL5 Dom0, and IPVS) it doesn’t. For some reason the data packets from the DomU are not seen as part of the same TCP stream (both by Netfilter connection tracking and by the TCP code in the kernel). So I get an established connection (3 way handshake completed) but no data transfer; the server sends the SMTP greeting repeatedly but nothing is received. At this stage I’m not sure whether there is something missing in my configuration or whether there’s a bug in IPVS. I would be happy to send tcpdump output to anyone who wants to try and figure it out.

My next attempt at this was via routing. I removed the “-m” option from the ipvsadm command and added the service IP address to the back-end with the command “ifconfig lo:0 10.0.0.5 netmask 255.255.255.255” and configured the mail server to bind to port 25 on address 10.0.0.5. Success at last!
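So the final working configuration was roughly as follows (the -g gatewaying method is the ipvsadm default, so the flag is optional):

# on the IPVS server, change the real server entry from masquerading to routing:
ipvsadm -e -t 10.0.0.5:25 -r 10.0.1.5 -g
# on the back-end server:
ifconfig lo:0 10.0.0.5 netmask 255.255.255.255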

Now I just have to get Piranha working to remove back-end servers from the list when they fail.

Update: It’s quite important that when adding a single IP address to device lo:0 you use a netmask of 255.255.255.255. If you use the same netmask as the front-end device (which would seem like a reasonable thing to do) then (with RHEL4 kernels at least) you get proxy ARPs by default. For example, if you used netmask 255.255.255.0 to add address 10.0.0.5 to device lo:0 then on device eth0 the machine will start answering ARP requests for 10.0.0.6 etc. Havoc then ensues.