WordPress Multisite on Debian

WordPress (a common CMS for blogs) is designed to be copied to a directory that Apache can serve and run by a user with no particular privileges while managing installation of its own updates and plugins. Debian is designed around the idea of the package management system controlling everything on behalf of a sysadmin.

When I first started using WordPress there was a version called “WordPress MU” (Multi User) which supported multiple blogs. It was a separate archive to the main WordPress and didn’t support all the plugins and themes. As a main selling point of WordPress is the ability to select from the significant library of plugins and themes this was a serious problem.

Debian WordPress

The people who maintain the Debian package of WordPress have always supported multiple blogs on one system and made it very easy to run in that manner. There’s a /etc/wordpress directory for configuration files for each blog with names such as config-etbe.coker.com.au.php. This allows having multiple separate blogs running from the same tree of PHP source which means only one thing to update when there’s a new version of WordPress (often fixing security issues).

One thing that appears to be lacking with the Debian system is separate directories for “media”. WordPress supports uploading images (which are scaled to several different sizes) as well as sound and apparently video. By default under Debian they are stored in /var/lib/wordpress/wp-content/uploads/YYYY/MM/filename. If you have several blogs on one system they all get to share the same directory tree. That may be OK for one person running multiple blogs but is obviously bad when several bloggers have independent blogs on the same server.

Multisite

If you enable the “multisite” support in WordPress then you have WordPress support for multiple blogs. The administrator of the multisite configuration has the ability to specify media paths etc for all the child blogs.

The first problem with this is that one person has to be the multisite administrator. As I’m the sysadmin of the WordPress servers in question that’s an obvious task for me. But the problem is that the multisite administrator doesn’t just do sysadmin tasks such as specifying storage directories. They also do fairly routine tasks like enabling plugins. Preventing bloggers from installing new plugins is reasonable and is the default Debian configuration. Preventing them from selecting which of the installed plugins are activated is unreasonable in most situations.

The next issue is that some core parts of WordPress functionality on the sub-blogs refer to the administrator blog; recovering a forgotten password is one example. I don’t want users of other blogs on the system to be referred to my blog when they forget their password.

A final problem with multisite is that it makes things more difficult if you want to move a blog to another system. Instead of just sending a dump of the MySQL database and a copy of the Apache configuration for the site you have to configure it for which blog will be its master. If going between multisite and non-multisite you have to change some of the data about accounts. This will be annoying both when adding new sites to a server and when moving sites from the server to a non-multisite server somewhere else.

I now believe that WordPress multisite has little value for people who use Debian. The Debian way is the better way.

So I had to back out the multisite changes. Fortunately I had a cron job to make snapshots of the BTRFS subvolume that has the database so it was easy to revert to an older version of the MySQL configuration.

Upload Location

update etbe_options set option_value='/var/lib/wordpress/wp-content/uploads/etbe.coker.com.au' where option_name='upload_path';

It turns out that if you don’t have a multisite blog then there’s no way of changing the upload directory without using SQL. The above SQL code is an example of how to do this. Note that it seems that there is special case handling of a value of ‘wp-content/uploads‘ and any other path needs to be fully qualified.
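As a minimal sketch of the above, the following shell function prints the same SQL statement for an arbitrary blog. The function name upload_path_sql is hypothetical, and it assumes that the “etbe_” part of the table name is a per-blog table prefix. It does no SQL quoting, so it is only suitable for paths that you typed yourself.

```shell
# upload_path_sql PREFIX DIR: print SQL that sets upload_path for one blog.
# Hypothetical helper; PREFIX is assumed to be the blog's table prefix and
# DIR must be a fully qualified path. No SQL escaping is done.
upload_path_sql() {
  local prefix="$1" dir="$2"
  printf "update %soptions set option_value='%s' where option_name='upload_path';\n" \
    "$prefix" "$dir"
}
```

The output can be reviewed and then piped to the mysql client for whichever database holds the blog.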

For my own blog however I choose to avoid the WordPress media management and use the following shell script to create suitable HTML code for an image that links to a high resolution version. I use GIMP to create the smaller version of the image which gives me a lot of control over how to crop and compress the image to ensure that enough detail is visible while still being small enough for fast download.

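#!/bin/bash
# For each FILE-big image named on the command line, print HTML that shows the
# smaller version at half its pixel size and links to the big version.
set -e

# BASE can be overridden from the environment
if [ "$BASE" = "" ]; then
  BASE="http://www.coker.com.au/blogpics/2018"
fi

while [ "$1" != "" ]; do
  BIG=$1
  # The small image has the same name without "-big"
  SMALL=$(echo "$1" | sed -e s/-big//)
  # The third field of the identify output is the geometry, e.g. 1600x1200
  RES=$(identify "$SMALL"|cut -f3 -d\ )
  WIDTH=$(($(echo $RES|cut -f1 -dx)/2))px
  HEIGHT=$(($(echo $RES|cut -f2 -dx)/2))px
  echo "<a href=\"$BASE/$BIG\"><img src=\"$BASE/$SMALL\" width=\"$WIDTH\" height=\"$HEIGHT\" alt=\"\" /></a>"
  shift
done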
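The width and height arithmetic in the script can be tried on its own without ImageMagick. The sketch below does the same halving with shell parameter expansion instead of cut; the function name half_size is hypothetical and the geometry is passed in as a WIDTHxHEIGHT string such as the one identify prints.

```shell
# half_size GEOMETRY: print "WIDTHpx HEIGHTpx" at half the given WxH size.
# Integer division is used, so odd sizes round down.
half_size() {
  local res="$1"
  local w="${res%x*}" h="${res#*x}"
  echo "$((w / 2))px $((h / 2))px"
}
```

For example, “half_size 1600x1200” prints “800px 600px”.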

QEMU for ARM Processes

I’m currently doing some embedded work on ARM systems. Having a virtual ARM environment is of course helpful. For the i586 class embedded systems that I run it’s very easy to set up a virtual environment; I just have a chroot run from systemd-nspawn with the --personality=x86 option. I run it on my laptop for my own development and on a server my client owns so that they can deal with the “hit by a bus” scenario. I also occasionally run KVM virtual machines to test the boot image of i586 embedded systems (they use GRUB etc and are just like any other 32bit Intel system).

ARM systems have a different boot setup: there is a U-Boot loader that is fairly tightly coupled with the kernel. ARM systems also tend to have more unusual hardware choices. While the i586 embedded systems I support turned out to work well with standard Debian kernels (even though the reference OS for the hardware has a custom kernel) the ARM systems need a special kernel. I spent a reasonable amount of time playing with QEMU and was unable to make it boot from a U-Boot ARM image. The Google searches I performed didn’t turn up anything that helped me. If anyone has good references for getting QEMU to work for an ARM system image on an AMD64 platform then please let me know in the comments. While I am currently surviving without that facility it would be a handy thing to have if it was relatively easy to do (my client isn’t going to pay me to spend a week working on this and I’m not inclined to devote that much of my hobby time to it).

QEMU for Process Emulation

I’ve given up on emulating an entire system and now I’m using a chroot environment with systemd-nspawn.

The package qemu-user-static has statically linked programs for emulating various CPUs on a per-process basis. You can run this as “/usr/bin/qemu-arm-static ./statically-linked-arm-program”. The Debian package qemu-user-static uses the binfmt_misc support in the kernel to automatically run /usr/bin/qemu-arm-static when an ARM binary is executed. So if you have copied the image of an ARM system to /chroot/arm you can run commands like the following to enter the chroot:

cp /usr/bin/qemu-arm-static /chroot/arm/usr/bin/qemu-arm-static
chroot /chroot/arm bin/bash

Then you can create a full virtual environment with “/usr/bin/systemd-nspawn -D /chroot/arm” if you have systemd-container installed.
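If the chroot fails with an exec format error, one thing worth checking is whether the binfmt_misc handler is registered. The sketch below assumes that the handler appears as a file named qemu-arm under /proc/sys/fs/binfmt_misc whose first line is “enabled”; the function name binfmt_enabled is hypothetical and the directory can be overridden for testing.

```shell
# binfmt_enabled NAME [DIR]: succeed if the binfmt_misc entry NAME exists
# and reports itself as enabled. DIR defaults to the kernel's binfmt_misc
# directory; the entry name (e.g. qemu-arm) is an assumption.
binfmt_enabled() {
  local name="$1" dir="${2:-/proc/sys/fs/binfmt_misc}"
  [ -r "$dir/$name" ] && grep -qx enabled "$dir/$name"
}
```

A typical use would be “binfmt_enabled qemu-arm || echo no ARM handler registered” before entering the chroot.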

Selecting the CPU Type

There is a huge range of ARM CPUs with different capabilities. How this compares to the range of x86 and AMD64 CPUs depends on how you are counting (the i5 system I’m using now has 76 CPU capability flags). The default CPU type for qemu-arm-static is armv7l and I need to emulate a system with an armv5tejl. Setting the environment variable QEMU_CPU=pxa250 gives me armv5tel emulation.

The ARM Architecture Wikipedia page [2] says that in armv5tejl the T stands for Thumb instructions (which I don’t think Debian uses), the E stands for DSP enhancements (which probably isn’t relevant for me as I’m only doing integer maths), the J stands for supporting special Java instructions (which I definitely don’t need) and I’m still trying to work out what L means (comments appreciated).

So it seems clear that the armv5tel emulation provided by QEMU_CPU=pxa250 will do everything I need for building and testing ARM embedded software. The issue is how to enable it. For a user shell I can just put export QEMU_CPU=pxa250 in .login or something, but I want to emulate an entire system (cron jobs, ssh logins, etc).

I’ve filed Debian bug #870329 requesting a configuration file for this [1]. If I put such a configuration file in the chroot everything would work as desired.

To get things working in the meantime I wrote the below wrapper for /usr/bin/qemu-arm-static that calls /usr/bin/qemu-arm-static.orig (the renamed version of the original program). It’s ugly (I would use a config file if I needed to support more than one type of CPU) but it works.

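#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Installed as /usr/bin/qemu-arm-static: force the emulated CPU type and
 * then hand over to the renamed original program. */
int main(int argc, char **argv)
{
  (void)argc;
  if(setenv("QEMU_CPU", "pxa250", 1))
  {
    perror("Can't set $QEMU_CPU");
    return 1;
  }
  /* execv() only returns if it failed */
  execv("/usr/bin/qemu-arm-static.orig", argv);
  fprintf(stderr, "Can't execute \"%s\" because of qemu failure\n", argv[0]);
  return 1;
}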

More KVM Modules Configuration

Last year I blogged about blacklisting a video driver so that KVM virtual machines didn’t go into graphics mode [1]. Now I’ve been working on some other things to make virtual machines run better.

I use the same initramfs for the physical hardware as for the virtual machines. So I need to remove modules that are needed for booting the physical hardware from the VMs as well as other modules that get dragged in by systemd and other things. One significant saving from this is that I use BTRFS for the physical machine and the BTRFS driver takes 1M of RAM!

The first thing I did to reduce the number of modules was to edit /etc/initramfs-tools/initramfs.conf and change “MODULES=most” to “MODULES=dep”. This significantly reduced the number of modules loaded and also stopped the initramfs from probing for a non-existent floppy drive which added about 20 seconds to the boot. Note that this will result in your initramfs not supporting different hardware. So if you plan to take a hard drive out of your desktop PC and install it in another PC this could be bad for you, but for servers it’s OK as that sort of upgrade is uncommon for servers and only done with some planning (such as creating an initramfs just for the migration).
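For anyone who wants to script that change rather than use an editor, here is a minimal sketch. The function name set_initramfs_modules is hypothetical, it relies on the -i option of GNU sed, and the file argument exists so that it can be tried on a copy first.

```shell
# set_initramfs_modules VALUE [FILE]: rewrite the MODULES= line in
# initramfs.conf (or in a copy of it, when FILE is given).
set_initramfs_modules() {
  local value="$1" conf="${2:-/etc/initramfs-tools/initramfs.conf}"
  sed -i -e "s/^MODULES=.*/MODULES=$value/" "$conf"
}
# After changing the real file the initramfs has to be rebuilt, e.g.:
#   update-initramfs -u
```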

I put the following rmmod commands in /etc/rc.local to remove modules that are automatically loaded:
rmmod btrfs
rmmod evdev
rmmod lrw
rmmod glue_helper
rmmod ablk_helper
rmmod aes_x86_64
rmmod ecb
rmmod xor
rmmod raid6_pq
rmmod cryptd
rmmod gf128mul
rmmod ata_generic
rmmod ata_piix
rmmod i2c_piix4
rmmod libata
rmmod scsi_mod
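
If the same rc.local is used on machines that don’t load all of those modules, rmmod will print an error for each missing one. A minimal sketch of a guard, assuming the usual /proc/modules format where each line starts with the module name followed by a space (the function name module_loaded is hypothetical):

```shell
# module_loaded NAME [FILE]: succeed if NAME is listed in /proc/modules
# (or in FILE, which must use the same format).
module_loaded() {
  local name="$1" list="${2:-/proc/modules}"
  grep -q "^$name " "$list"
}
# Usage sketch, keeping the same order as the list above:
#   for m in btrfs evdev lrw; do
#     if module_loaded "$m"; then rmmod "$m"; fi
#   done
```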

In /etc/modprobe.d/blacklist.conf I have the following lines to stop drivers being loaded. The first line is to stop the video mode being set and the rest are just to save space. One thing that inspired me to do this is that the parallel port driver gave a kernel error when it loaded and tried to access non-existent hardware.
blacklist bochs_drm
blacklist joydev
blacklist ppdev
blacklist sg
blacklist psmouse
blacklist pcspkr
blacklist sr_mod
blacklist acpi_cpufreq
blacklist cdrom
blacklist tpm
blacklist tpm_tis
blacklist floppy
blacklist parport_pc
blacklist serio_raw
blacklist button

On the physical machine I have the following in /etc/modprobe.d/blacklist.conf. Most of this is to prevent loading of filesystem drivers when making an initramfs. I do this because I know there’s never going to be any need for CDs, parallel devices, graphics, or strange block devices in a server room. I wouldn’t do any of this for a desktop workstation or laptop.
blacklist ppdev
blacklist parport_pc
blacklist cdrom
blacklist sr_mod
blacklist nouveau

blacklist ufs
blacklist qnx4
blacklist hfsplus
blacklist hfs
blacklist minix
blacklist ntfs
blacklist jfs
blacklist xfs

Video Mode and KVM

I recently changed my KVM servers to use the kernel command-line parameter nomodeset for the virtual machine kernels so that they don’t try to go into graphics mode. I do this because I don’t have X11 or VNC enabled and I want a text console to use with the -curses option of KVM. Without the nomodeset KVM just says that it’s in 1024*768 graphics mode and doesn’t display the text.

Now my KVM server running Debian/Unstable has had its virtual machines start going into graphics mode in spite of the nomodeset parameter. It seems that an update to QEMU has added a new virtual display driver which recent kernels from Debian/Unstable support with the bochs_drm driver, and that driver apparently doesn’t respect nomodeset.

The solution is to create a file named /etc/modprobe.d/blacklist.conf with the contents “blacklist bochs_drm” and now my virtual machines have a usable plain-text console again! This blacklist method works for all video drivers; you can blacklist similar modules for the other virtual display hardware. But it would be nice if the one kernel option would cover them all.
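A minimal sketch of adding such a line from a script, without creating duplicates if it is run more than once. The function name blacklist_module is hypothetical and the file argument is there so it can be tried on a scratch file.

```shell
# blacklist_module MODULE [FILE]: append "blacklist MODULE" to FILE unless
# that exact line is already present.
blacklist_module() {
  local mod="$1" file="${2:-/etc/modprobe.d/blacklist.conf}"
  if ! grep -qx "blacklist $mod" "$file" 2>/dev/null; then
    echo "blacklist $mod" >> "$file"
  fi
}
```

Running “blacklist_module bochs_drm” as root would then give the configuration described above.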

Ethernet Interface Naming With Systemd

Systemd has a new way of specifying names for Ethernet interfaces as documented in systemd.link(5). The Debian package should keep working with the old 70-persistent-net.rules file, but I had a problem with this that forced me to learn about systemd.link(5).

Below is a little shell script I wrote to convert a basic 70-persistent-net.rules (that only matches on MAC address) to systemd.link files.

#!/bin/bash

RULES=/etc/udev/rules.d/70-persistent-net.rules

# One .link file is written for each interface that has a NAME="..." rule
for n in $(grep ^SUB $RULES|sed -e 's/^.*NAME..//' -e 's/.$//') ; do
  NAME=/etc/systemd/network/10-$n.link
  LINE=$(grep "$n" $RULES)
  # Remove everything up to the MAC address and everything after it
  MAC=$(echo "$LINE"|sed -e 's/^.*address....//' -e 's/...ATTR.*$//')
  echo "[Match]" > $NAME
  echo "MACAddress=$MAC" >> $NAME
  echo "[Link]" >> $NAME
  echo "Name=$n" >> $NAME
done
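
To show what the conversion produces, here is a sketch that handles a single rule line and prints the .link contents instead of writing a file. The function name rule_to_link is hypothetical, it matches on the literal ATTR{address}== and NAME= keys instead of counting characters, and the MAC address in the usage comment is made up.

```shell
# rule_to_link LINE: print the systemd.link contents for one udev rule line
# that matches on ATTR{address} and sets NAME.
rule_to_link() {
  local line="$1" name mac
  name=$(echo "$line" | sed -e 's/^.*NAME="//' -e 's/".*$//')
  mac=$(echo "$line" | sed -e 's/^.*ATTR{address}=="//' -e 's/".*$//')
  printf '[Match]\nMACAddress=%s\n[Link]\nName=%s\n' "$mac" "$name"
}
# Usage with a made-up rule:
#   rule_to_link 'SUBSYSTEM=="net", ATTR{address}=="00:11:22:33:44:55", NAME="eth0"'
```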

BTRFS Status April 2014

Since my blog post about BTRFS in March [1] not much has changed for me. Until yesterday I was using 3.13 kernels on all my systems and dealing with the occasional kmail index file corruption problem.

Yesterday my main workstation ran out of disk space and went read-only. I started a BTRFS balance which didn’t seem to be doing any good because most of the space was actually in use so I deleted a bunch of snapshots. Then my X session aborted (some problem with KDE or the X server – I’ll never know as logs couldn’t be written to disk). I rebooted the system and had kernel threads go into infinite loops with repeated messages about a lack of response for 22 seconds (I should have photographed the screen). When it got into that state the ALT-Fn keys to change a virtual console sometimes worked but nothing else worked – the terminal usually didn’t respond to input.

To try and stop the kernel from entering an infinite loop on every boot I used “rootflags=skip_balance” on the kernel command line to stop it from continuing the balance, which made the system usable for a little longer. Unfortunately the skip_balance mount option doesn’t permanently apply; the kernel will keep trying to balance the filesystem on every mount until a “btrfs balance cancel” operation succeeds. But my attempts to cancel the balance always failed.

When I booted my system with skip_balance it would sometimes free some space from the deleted snapshots, after two good runs I got to 17G free. But after that every time I rebooted it would report another Gig or two free (according to “btrfs filesystem df“) and then hang without committing the changes to disk.

I solved this problem by upgrading my USB rescue image to kernel 3.14 from Debian/Experimental and mounting the filesystem from the rescue image. After letting kernel 3.14 work on the filesystem for a while it was in a state where I could use it with kernel 3.13 and then boot the system normally to upgrade it to kernel 3.14.

I had a minor extra complication due to the fact that I was running “apt-get dist-upgrade” at the time the filesystem went read-only so the dpkg records of which packages were installed were a bit messed up. But that was easy to fix by running a diff against /var/lib/dpkg/info on a recent snapshot. In retrospect I should have copied from an old snapshot of the root filesystem, but I fixed the problems faster than I could think of better ways to fix them.

When running a balance the system had a peak IO rate of about 30MB/s reads and 30MB/s writes. That compares to the maximum contiguous file IO speed of 260MB/s for reads and 320MB/s for writes. During that time it had about 50% CPU time used for my Q8400 quad-core CPU. So far the only tasks that I do regularly which have CPU speed as a significant bottleneck are BTRFS filesystem balancing and recoding MP4 files. Compiling hasn’t been an issue because recently I haven’t been compiling many programs that are particularly big.

Lessons Learned

I should photograph the screen regularly when doing things that won’t be logged, those kernel error messages might have been useful to me or someone else.

The fact that the only kernel that runs BTRFS the way I need comes from the Experimental repository in Debian stands in contrast to the recent kernel patch that stops describing BTRFS as experimental. While I have a high opinion of the people who provide support for the kernel in commercial distributions and their ability to back-port fixes from newer kernels I’m concerned about their decision to support BTRFS. I’m also dubious about whether we can offer BTRFS support in Debian/Jessie (the next version of Debian) without a significant warning. OTOH if you find yourself with a BTRFS system that isn’t working well you could always hire me to fix it. I accept payment via Paypal, bank transfer, or Bitcoin. If you want to pay me in Grange then I assure you I will never forget about it. ;)

I thought that I wouldn’t have CPU speed issues when I started using the AMD64 architecture, for most tasks that’s been the case. But for systems for which storage is important I’ll look at getting faster CPUs because of BTRFS. Using faster CPUs for storage isn’t that uncommon (I used to work for SGI and dealt with some significant CPU power used for file serving), but needing a fast quad-core CPU to drive a single SSD is a little disappointing. While recovery from file system corner cases isn’t going to be particularly common it’s something that you want completed quickly, for personal systems you want to be doing something else and for work systems you don’t want down-time.

The BTRFS problems with running out of disk space are really serious. It seems that even workstations used at home can’t survive without monitoring. For any other filesystem used at home you can just let it get full and then delete stuff.

Include “rootflags=skip_balance” in the boot loader configuration for every system with a BTRFS root filesystem and in the /etc/fstab for every non-root BTRFS filesystem. I haven’t yet encountered a single situation where continuing the balance did any good or where it didn’t do any harm.
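As a sketch of what that could look like on a Debian system that boots with GRUB (an assumption, other boot loaders have their own configuration): /etc/default/grub is a shell fragment, so the kernel parameter is a variable assignment. The device and mount point in the fstab comment are placeholders.

```shell
# /etc/default/grub fragment. This replaces any existing value, so merge it
# with whatever parameters are already set, then run update-grub.
GRUB_CMDLINE_LINUX="rootflags=skip_balance"

# /etc/fstab line for a non-root BTRFS filesystem (placeholder device and
# mount point), with skip_balance as a mount option:
#   /dev/sdb1  /data  btrfs  skip_balance  0  0
```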

Swap Space and SSD

In 2007 I wrote a blog post about swap space [1]. The main point of that article was to debunk the claim that Linux needs a swap space twice as large as main memory (in summary, such advice is based on BSD Unix systems and has never applied to Linux, and most storage devices aren’t fast enough for large swap). That post was picked up by Barrapunto (Spanish Slashdot) and became one of the most popular posts I’ve written [2].

In the past 7 years things have changed. Back then 2G of RAM was still a reasonable amount and 4G was a lot for a desktop system or laptop. Now there are even phones with 3G of RAM, 4G is about the minimum for any new desktop or laptop, and desktop/laptop systems with 16G aren’t that uncommon. Another significant development is the use of SSDs which dramatically improve speed for some operations (mainly seeks).

As SATA SSDs for desktop use start at about $110 I think it’s safe to assume that everyone who wants a fast desktop system has one. As a major limiting factor in swap use is the seek performance of the storage the use of SSDs should allow greater swap use. My main desktop system has 4G of RAM (it’s an older Intel 64bit system and doesn’t support more) and has 4G of swap space on an Intel SSD. My work flow involves having dozens of Chromium tabs open at the same time, usually performance starts to drop when I get to about 3.5G of swap in use.

While SSD generally has excellent random IO performance the contiguous IO performance often isn’t much better than hard drives. My Intel SSDSC2CT12 300i 128G can do over 5000 random seeks per second but for sustained contiguous filesystem IO can only do 225M/s for writes and 274M/s for reads. The contiguous IO performance is less than twice as good as a cheap 3TB SATA disk. It also seems that the performance of SSDs isn’t as consistent as that of hard drives: when a hard drive delivers a certain level of performance it can generally do so 24*7, but a SSD will sometimes reduce performance to move blocks around (the erase block size is usually a lot larger than the filesystem block size).

It’s obvious that SSDs allow significantly better swap performance and therefore make it viable to run a system with more swap in use but that doesn’t allow unlimited swap. Even when using programs like Chromium (which seems to allocate huge amounts of RAM that aren’t used much) it doesn’t seem viable to have swap be much bigger than 4G on a system with 4G of RAM. Now I could buy another SSD and use two swap spaces for double the overall throughput (which would still be cheaper than buying a PC that supports 8G of RAM), but that still wouldn’t solve all problems.

One issue I have been having on occasion is BTRFS failing to allocate kernel memory when managing snapshots. I’m not sure if this would be solved by adding more RAM as it could be an issue of RAM fragmentation – I won’t file a bug report about this until some of the other BTRFS bugs are fixed. Another problem I have had is when running Minecraft the driver for my ATI video card fails to allocate contiguous kernel memory, this is one that almost certainly wouldn’t be solved by just adding more swap – but might be solved if I tweaked the kernel to be more aggressive about swapping out data.

In 2007 when using hard drives for swap I found that the maximum space that could be used with reasonable performance for typical desktop operations was something less than 2G. Now with a SSD the limit for usable swap seems to be something like 4G on a system with 4G of RAM. On a system with only 2G of RAM that might allow the system to be usable with swap being twice as large as RAM, but with the amounts of RAM in modern PCs it seems that even SSD doesn’t allow using a swap space larger than RAM for typical use unless it’s being used for hibernation.
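To compare a system against those figures, the following sketch reports RAM size, swap in use, and whether the configured swap is larger than RAM. It reads the MemTotal, SwapTotal and SwapFree fields of /proc/meminfo (values in kB); the function name swap_report is hypothetical and the file argument is only there so that it can be tried on sample data.

```shell
# swap_report [FILE]: summarise RAM and swap from /proc/meminfo style data.
swap_report() {
  local file="${1:-/proc/meminfo}"
  awk '/^MemTotal:/  {mem=$2}
       /^SwapTotal:/ {st=$2}
       /^SwapFree:/  {sf=$2}
       END {printf "ram_kb=%d swap_used_kb=%d swap_larger_than_ram=%s\n",
            mem, st - sf, (st > mem ? "yes" : "no")}' "$file"
}
```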

Conclusion

It seems that nothing has significantly changed in the last 7 years. We have more RAM, faster storage, and applications that are more memory hungry. The end result is that swap still isn’t very usable for anything other than hibernation if it’s larger than RAM.

It would be nice if application developers could stop increasing the use of RAM. Currently it seems that the RAM requirements for Linux desktop use are about 3 years behind the RAM requirements for Windows. This is convenient as a PC is fully depreciated according to the tax office after 3 years. This makes it easy to get 3 year old PCs cheaply (or sometimes for free as rubbish) which work really well for Linux. But it would be nice if we could be 4 or 5 years behind Windows in terms of hardware requirements to reduce the hardware requirements for Linux users even further.

Is Portslave Still Useful?

Portslave is a project that was started in the 90’s to listen to a serial port and launch a PPP or SLIP session after a user has been authenticated. I describe it as a “project” not a “program” because a large part of its operation is via a shared object that hooks into pppd, so if you connect to a Portslave terminal server and just start sending PPP data then pppd will be launched and use the Portslave shared object for authentication. This dual mode of operation makes it a little tricky to develop and maintain: every significant update to pppd requires that Portslave be recompiled at the minimum, and sometimes code changes in Portslave have been required to match changes in pppd. CHAP authentication was broken in a pppd update in 2004 and I never fixed it. As an aside, the last significant code change I made was to disable CHAP support, so I haven’t been actively working on it for 9 years.

I took over the Portslave project in 2000, at the time there were three separate forks of the project with different version numbering schemes. I used the release date as the only version number for my Portslave releases so that it would be easy for users to determine which version was the latest. Getting the latest version was very important given the ties to pppd.

When I started maintaining Portslave I had a couple of clients that maintained banks of modems for ISP service and for their staff to connect to the Internet. Also multi-port serial devices were quite common and modems were the standard way of connecting to the Internet.

Since that time all my clients have ceased running modems. Most people connect to the Internet via ADSL or Cable, and when people travel they use 3G net access via their phone which is usually cheaper, faster, and more convenient than using a modem. The last code changes I made to Portslave were in 2010, since then I’ve made one upload to Debian for the sole purpose of compiling against a new version of pppd.

I have no real interest in maintaining Portslave, it’s no longer a fun project for me, I don’t have enough spare time for such things, and no-one is paying me to work on it.

Currently Portslave has two Debian bugs. One is from a CMU project to scan programs for crashes that might indicate security flaws; it seems that Portslave crashes if standard input isn’t a terminal device [1]. That one shouldn’t be difficult to solve.

The other Debian bug is due to Portslave being compiled against an obsolete RADIUS client library [2]. It also shouldn’t be that difficult to fix, when I made it use libradius1 that wasn’t a difficult task and it should be even easier to convert from one RADIUS library to another.

But the question is whether it’s worth bothering. Is anyone using Portslave? Is anyone prepared to maintain it in Debian? Should I just file a bug report requesting that Portslave be removed from Debian?

Hetzner now Offers SSD

Hetzner is offering new servers with SSD, good news for people who want to run ZFS (for ZIL and/or L2ARC). See the EX server configuration list for more information [1]. Unfortunately they don’t specify what brand of SSD, this is a concern for me as some of the reports about SSD haven’t been that positive, getting whichever SSD is cheapest isn’t appealing. A cheap SSD might be OK for L2ARC (read cache), but for ZIL (write cache) reliability is fairly important. If anyone has access to a Hetzner server with SSD then please paste the relevant output of lsscsi into a comment.

The next issue is that they only officially offer it on the new “EX 8S” server. SSD will be of most interest to people who also want lots of RAM (the zfsonlinux.org code has given me kernel panics when running with a mere 4G of RAM – even when I did the recommended tuning to reduce ARC size). Also people who want more capable storage options will tend to want more RAM if only for disk caching.

But I’m sure that there are plenty of people who would be happy to have SSD on a smaller and cheaper server. The biggest SSD offering of 240G is bigger than a lot of servers. I run a Hetzner server that has only 183G of disk space in use (and another 200G of backups). If the backups were on another site then the server in question could have just a RAID-1 of SSD for all its storage. In this case it wouldn’t be worth doing as the server doesn’t have much disk IO load, but it would be nice to have the option. The exact same server plus some more IO load would make SSD the ideal choice.

The biggest problem is that the EX 8S server is really expensive. Hard drives which are included in the base price for cheaper options are now expensive additions. A server with 2*3TB disks and 2*240G SSD is E167 per month! That’s more expensive than three smaller servers that have 2*3TB disks! The good news for someone who wants SSD is that the Hetzner server “auction” has some better deals [2]. As is always the case with auction sites the exact offers will change by the moment, but currently they offer a server with 2*120G SSD and 24G of RAM for E88 per month and a server with 2*120G SSD, 2*1.5T HDD, and 24G of RAM for E118. E88 is a great deal if your storage fits in 240G and E118 could be pretty good if you only have 1.5T of data that needs ZFS features.

The main SSD offering is still a good option for some cases. A project that I did a couple of years ago would probably have worked really well on a E167/month server with 2*3TB and 2*240G SSD. It was designed around multiple database servers sharding the load which was largely writes, so SSD would have allowed a significant reduction in the number of servers.

They also don’t offer SSD on their “storage servers” which is a significant omission. I presume that they will fix that soon enough. 13 disks and 2 SSD will often be more useful than 15 disks. That’s assuming the SSD doesn’t suck of course.

The reason this is newsworthy is that most hosted server offerings have very poor disk IO and no good options for expanding it. For servers that you host yourself it’s not too difficult to buy extra trays of disks or even a single rack-mount server that has any number of internal disks in the range 2 to 24 and any choice as to how you populate them. But with rented servers it’s typically 2 disks with no options to add SSD or other performance enhancements and no possibility of connecting a SAN. As an aside it would still be nice if someone ran a data center that supported NetApp devices and gave the option of connecting an arbitrary number of servers to a NetApp Filer (or a redundant pair of Filers). If anyone knows of a hosting company that provides options for good disk IO which are better than just providing SSD or cheaper than E167 per month then please provide the URL in a comment.

Update: It seems that I can get SSD added to one of the cheaper servers. This is a good option as I have some servers that already have the “flexi-pack” due to a need for more IP addresses.

ZFS on Debian/Wheezy

As storage capacities increase the probability of data corruption increases, as does the amount of time required for a fsck on a traditional filesystem. Also the capacity of disks is increasing a lot faster than the contiguous IO speed, which means that the RAID rebuild time is increasing. For example my first hard disk was 70M and had a transfer rate of 500K/s, which meant that the entire contents could be read in a mere 140 seconds! The last time I did a test on a more recent disk, a 1TB SATA disk gave contiguous transfer rates ranging from 112MB/s to 52MB/s, which meant that reading the entire contents took 3 hours and 10 minutes, and that problem is worse with newer bigger disks. The long rebuild times make greater redundancy more desirable.
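The arithmetic behind those figures is easy to reproduce. This is a minimal sketch using decimal units (1M = 1000K, 1TB = 1000000000K), which is an assumption that matches the 140 second figure; the function name read_time_s is hypothetical.

```shell
# read_time_s SIZE_KB RATE_KB_PER_S: whole seconds needed to read SIZE_KB
# at a constant rate.
read_time_s() {
  echo $(( $1 / $2 ))
}
# 70M disk at 500K/s:    read_time_s 70000 500            (140 seconds)
# 1TB at 112MB/s (best): read_time_s 1000000000 112000    (about 2.5 hours)
# 1TB at 52MB/s (worst): read_time_s 1000000000 52000     (about 5.3 hours)
```

The measured time of 3 hours and 10 minutes (11400 seconds) falls between the best and worst cases, as expected for a disk that slows down towards the inner tracks.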

BTRFS vs ZFS

Both BTRFS and ZFS checksum all data to cover the case where a disk returns corrupt data, they don’t need a fsck program, and the combination of checksums and built-in RAID means that they should have less risk of data loss due to a second failure during rebuild. ZFS supports RAID-Z which is essentially a RAID-5 with checksums on all blocks to handle the case of corrupt data as well as RAID-Z2 which is similarly equivalent to RAID-6. RAID-Z is quite important if you don’t want to have half your disk space taken up by redundancy or if you want to have your data survive the loss of more than one disk, so until BTRFS has an equivalent feature ZFS offers significant benefits. Also BTRFS is still rather new which is a concern for software that is critical to data integrity.

I am about to install a system to be a file server and Xen server which probably isn’t going to be upgraded a lot over the next few years. It will have 4 disks so ZFS with RAID-Z offers a significant benefit over BTRFS for capacity and RAID-Z2 offers a significant benefit for redundancy. As it won’t be upgraded a lot I’ll start with Debian/Wheezy even though it isn’t released yet because the system will be in use without much change well after Squeeze security updates end.
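To make the capacity and redundancy trade-off concrete, here is a rough sketch for a 4 disk system. It is an approximation that ignores filesystem metadata overhead. The "mirror" row stands for any scheme that stores two copies of everything, and the 3TB disk size is only an example value.

```python
# Parity disks per layout: RAID-Z is like RAID-5, RAID-Z2 is like RAID-6.
PARITY_DISKS = {"raidz": 1, "raidz2": 2}


def usable_disks(layout: str, disks: int) -> int:
    """Approximate number of disks' worth of usable space."""
    if layout == "mirror":
        # Two copies of all data: half the raw space is redundancy.
        return disks // 2
    return disks - PARITY_DISKS[layout]


def guaranteed_failures(layout: str) -> int:
    """Number of disk losses the layout is guaranteed to survive."""
    if layout == "mirror":
        # A second failure is fatal if it hits the other copy of the same data.
        return 1
    return PARITY_DISKS[layout]


if __name__ == "__main__":
    disks, size_tb = 4, 3  # example values, not a measured configuration
    for layout in ("mirror", "raidz", "raidz2"):
        print(layout,
              usable_disks(layout, disks) * size_tb, "TB usable,",
              "survives", guaranteed_failures(layout), "disk failure(s)")
```

With 4 disks RAID-Z gives three disks of space where mirroring gives two, and RAID-Z2 gives the same space as mirroring but survives any two failures.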

ZFS on Wheezy

Getting ZFS to basically work isn’t particularly hard. The ZFSonLinux.org site has the code and reasonable instructions for doing it [1]. The zfsonlinux code doesn’t compile out of the box on Wheezy, although it works well on Squeeze. I found it easier to get the latest Ubuntu working with ZFS, and then I rebuilt the Ubuntu packages for Debian/Wheezy and they worked. This wasn’t particularly difficult, but it’s a pity that the zfsonlinux site doesn’t support recent kernels.

Root on ZFS

The complication with root on ZFS is that the ZFS FAQ recommends using whole disks for best performance so you can avoid alignment problems on 4K sector disks (which is an issue for any disk large enough that you want to use it with ZFS) [2]. This means you have to either use /boot on ZFS (which seems a little too experimental for me) or have a separate boot device.

Currently I have one server running with 4*3TB disks in a RAID-Z array and a single smaller disk for the root filesystem. Having a fifth disk attached by duct-tape to a system that is only designed for four disks isn’t ideal, but when you have an OS image that is backed up (and not so important) and a data store that’s business critical (but not needed every day) then a failure on the root device can be fixed the next day without serious problems. But I want to fix this and avoid creating more systems like it.

There is some good documentation on using Ubuntu with root on ZFS [3]. I considered using Ubuntu LTS for the server in question, but as I prefer Debian and I can recompile Ubuntu packages for Debian it seems that Debian is the best choice for me. I compiled those packages for Wheezy, did the install and DKMS build, and got ZFS basically working without much effort.

The problem then became getting ZFS to work for the root filesystem. The Ubuntu packages didn’t work with the Debian initramfs for some reason and modules failed to load. This wasn’t necessarily a show-stopper as I can modify such things myself, but it’s another painful thing to manage and another way that the system can potentially break on upgrade.

The next issue is the unusual way that ZFS mounts filesystems. Instead of having block devices to mount and entries in /etc/fstab the ZFS system does things for you. So if you want a ZFS volume to be mounted as root you configure the mountpoint via the “zfs set mountpoint” command. This of course means that it doesn’t get mounted if you boot with a different root filesystem and adds some needless pain to the process. When I encountered this I decided that root on ZFS isn’t a good option. So for this new server I’ll install it with an Ext4 filesystem on a RAID-1 device for root and /boot and use ZFS for everything else.
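As a small illustration of the difference from /etc/fstab, the sketch below only builds and prints the “zfs set mountpoint” command line; it does not run anything. The dataset name tank/root is a hypothetical example and the helper function is not part of any ZFS tooling.

```python
import shlex


def zfs_set_mountpoint_cmd(dataset: str, mountpoint: str) -> list:
    """Build the argv for setting the mountpoint property of a dataset.

    The mountpoint is stored as a property inside the pool itself,
    which is why no /etc/fstab entry is involved.
    """
    return ["zfs", "set", f"mountpoint={mountpoint}", dataset]


if __name__ == "__main__":
    # "tank/root" is a made-up dataset name for illustration only.
    cmd = zfs_set_mountpoint_cmd("tank/root", "/")
    print(shlex.join(cmd))
```

Because the property lives in the pool, a rescue system that imports the pool sees the same mountpoint setting, which is the source of the pain described above.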

Correct Alignment

After setting up the system with a 4 disk RAID-1 (or mirror for the pedants who insist that true RAID-1 has only two disks) for root and boot I then created partitions for ZFS. According to fdisk output the partitions /dev/sda2, /dev/sdb2 etc had their first sector address as a multiple of 2048 which I presume addresses the alignment requirement for a disk that has 4K sectors.
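The alignment reasoning can be sketched as follows, assuming that fdisk reports start addresses in 512 byte logical sectors and that the disk has 4096 byte physical sectors. A start sector that is a multiple of 2048 is then a multiple of 1MiB, which is also a multiple of 4K.

```python
LOGICAL_SECTOR = 512    # assumed unit of the start address reported by fdisk
PHYSICAL_SECTOR = 4096  # assumed physical sector size of a "4K" disk


def is_aligned(start_sector: int,
               logical: int = LOGICAL_SECTOR,
               physical: int = PHYSICAL_SECTOR) -> bool:
    """True if the partition's byte offset is a multiple of the physical sector size."""
    return (start_sector * logical) % physical == 0


if __name__ == "__main__":
    # 2048 is the start sector mentioned in the text; 63 is the old DOS-style
    # default, included only as a contrasting example.
    for start in (2048, 4096, 63):
        print(start, "aligned" if is_aligned(start) else "NOT aligned")
```

Any start sector divisible by 8 passes this check, so a multiple of 2048 is comfortably aligned.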

Installing ZFS

deb http://www.coker.com.au wheezy zfs

I created the above APT repository (only AMD64) for ZFS packages based on Darik Horn’s Ubuntu packages (thanks for the good work Darik). Installing zfs-dkms, spl-dkms, and zfsutils gave a working ZFS system. I could probably have used Darik’s binary packages but I think it’s best to rebuild Ubuntu packages to use on Debian.

The server in question hasn’t gone live in production yet (it turns out that we don’t have agreement on what the server will do). But so far it seems to be working OK.