ZFS vs BTRFS on Cheap Dell Servers

I previously wrote about my first experiences with BTRFS [1]. Since then I’ve been using BTRFS on more systems and have had good results. The main problem I want to address is the reliability of RAID [2].

Requirements for a File Server

Now one of my clients needs a new fileserver. They need to reliably store terabytes of data (currently 6TB and growing) which mostly consists of data files in the 10MB – 15MB size range. The data files will almost never be re-written and I anticipate that the main bottleneck will be the latency of NFS and other network file sharing protocols. I would hope that saturating a GigE network when sending 10MB data files from SATA disks via NFS, AFS, or SMB wouldn’t be a technical challenge.

It seems that BTRFS is the way of the future. But it’s still rather new and the lack of RAID-5 and RAID-6 is a serious issue when you need to store 10TB with today’s technology (that would be 8*3TB disks for RAID-10 vs 5*3TB disks for RAID-5). Also the case of two disks entirely failing in a short period of time requires RAID-6 (or RAID-Z2, as the ZFS variant of RAID-6 is known). With BTRFS at its current stage of development it seems that to recover from two failed disks you need to run BTRFS on top of another RAID-6 (maybe Linux software RAID-6). But for filesystems based on concepts similar to ZFS and BTRFS you want the filesystem to run the RAID so that if a block has a filesystem hash mismatch then the correct copy can be reconstructed from parity.

ZFS seems to be a lot more complex than BTRFS. While having more features is a good thing (BTRFS seems to be missing some sysadmin friendly features at this stage) complexity means that I need to learn more and test more before going live.

But it seems that the built in RAID-5 and RAID-6 is the killer issue. Servers start becoming a lot more expensive if you want more than 8 disks, and even going past 6 disks is a significant price point. As 3TB disks are available, an 8 disk RAID-6 gives something like 18TB of usable space vs 12TB on a RAID-10, and a 6 disk RAID-6 gives about 12TB vs 9TB on a RAID-10. With RAID-10 (IE BTRFS) my client couldn’t use a 6 disk server such as the Dell PowerEdge T410 for $1500 as 9TB of usable storage isn’t adequate, and the Dell PowerEdge T610 which can support 8 disks and costs $2100 would be barely adequate for the near future with only 12TB of usable storage. Dell does sell significantly larger servers such that any of my clients’ needs could be covered by RAID-10, but in addition to costing more there are issues of power use and noise. When comparing a T610 and a T410 with a full set of disks the price difference is $1000 (assuming $200 per disk), which is probably worth paying to delay any future need for upgrades.
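As a trivial illustration of the capacity arithmetic above: RAID-6 loses two disks to parity and RAID-10 loses half the disks to mirroring. A shell sketch (the helper names are my own invention):

# usable capacity in TB given the number of disks and the TB per disk
raid6_tb()  { echo $(( ($1 - 2) * $2 )); }
raid10_tb() { echo $(( $1 / 2 * $2 )); }
raid6_tb 8 3    # 18
raid10_tb 8 3   # 12
raid6_tb 6 3    # 12
raid10_tb 6 3   # 9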

Buying Disks

The problem with the PowerEdge T610 server is that it uses hot-swap disks and the biggest disks available are 2TB for $586.30! 2TB*8 in RAID-6 gives 12TB of usable space for $4690.40! This compares poorly to the PowerEdge T410 which supports non-hot-swap disks so I can buy 6*3TB disks for something less than $200 each and get 12TB of usable space for $1200. If I could get hot-swap trays for Dell disks at a reasonable price then the T610 would be worth considering. But as 12TB of storage should do for at least the next 18 months it seems that the T410 is clearly the better option.

Does anyone know how to get cheap disk trays for Dell servers?

Implementation

In mailing list discussions some people suggest using Solaris or FreeBSD for a ZFS server. ZFS was designed for and implemented on Solaris, and FreeBSD was the first port. However Solaris and FreeBSD aren’t commonly used systems so it’s harder to find skilled people to work with them and there is less of a guarantee that the desired software will work. Among other things it’s really convenient to be able to run software for embedded Linux i386 systems on the server.

The first port of ZFS to Linux was based on FUSE [3]. This allows a clean separation of ZFS code from the Linux kernel code to avoid license issues but does have some performance problems. I don’t think that I will have any performance issues on this server as the data files are reasonably large, are received via an ADSL link, and require quite a bit of CPU time to process when they are accessed. But ZFS-FUSE doesn’t seem to be particularly popular.

The ZFS On Linux project provides source for a ZFS kernel module which you can compile and load [4]. As the module isn’t distributed with or statically linked to the kernel the license conflict of the CDDL ZFS code and the GPL Linux kernel code is apparently solved. I’ve read some positive reports from people who use this so it will be my preferred option.
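As a sketch of what I expect the setup to look like (pool and device names are examples only, and I will test the exact commands before going live), creating a RAID-Z2 pool over six disks and a filesystem on it would be something like this:

# create a RAID-Z2 pool named tank over 6 disks
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
# create a filesystem for the data and verify pool health
zfs create tank/data
zpool status tank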


Cheap NAS Devices Suck

There are some really good Network Attached Storage (NAS) devices on the market. NetApp is one company that is known for making good products [1]. The advantage of a NAS is that you have a device with NVRAM for write-back caching, a filesystem that supports all the necessary features for best performance (NetApp developed their own filesystem WAFL to provide the features they needed), and a set of quality hardware that has been tested and certified to work together.

If you want a cheap NAS then you end up with something running Linux with GPL filesystems. This isn’t a bad thing as such, but some of the best performance and data integrity features are available in ZFS (which isn’t GPL) and BTRFS (which isn’t ready for production use). Not to mention WAFL which has been providing ZFS/BTRFS type features for more than a decade.

A cheap NAS will generally be sold without disks as this is the best way to keep costs down. Selling with disks either means selling lots of different variations (which means it can’t be sold off the shelf) or selling packages that don’t quite suit some customers (thus causing people to buy the device and replace the disks which means extra costs). This means that the vendors can’t provide the guarantees about disk quality and suitability that NetApp can provide.

One major problem with a NAS is that you typically can’t get shell access. Commands such as “rm -rf” and “cp -rl” which are typically rather quick when performed locally can take ages when run over NFS. Also commands such as “grep -R” which can perform reasonably well over NFS will always perform better when run locally. Also tasks such as compiling big programs which require good disk speed as well as some CPU time can be run locally if you have a file server system that also has local accounts (IE a typical multi-user Unix server configuration), but a dedicated NAS will prevent that.

I have never used a NetApp device due to my clients deciding (sometimes correctly and sometimes incorrectly IMHO) that they are too expensive. But when considering what my clients do the down-sides of lacking local code execution on a NetApp would be more than compensated in many cases by the significant performance and reliability advantages that they offer (see my post about the reliability issues in standard RAID implementations [2]).

A server class system (with ECC RAM as a minimum criteria) running RAID-6 can be a fairly decent file server. That requires hardware RAID with NVRAM for the write-back cache for decent write performance, but when write performance isn’t required software RAID does the job quite well. A Dell tower system will typically hold at least 4 disks which means 6TB of RAID-6 storage and is quite cheap – it can be under $2000. Also such a system can be easily expanded with extra Ethernet ports etc. NetApp doesn’t sell products directly and doesn’t list prices, but they do have some adverts for products being “under $7500”. That’s not really cheap but not THAT expensive when you consider the features.

A hidden cost in running a NAS is having someone perform sysadmin work on it. For a relatively expensive device that offers significant features such as a NetApp Filer this expense probably isn’t too great. But for a device that does what any PC running Linux can do it’s noteworthy that more training or experimenting time is required.

There are some special cases where small and cheap NAS appliances really make sense, such as the Apple Time Capsule for home network backups. But apart from that I don’t think that cheap NAS appliances make sense. It seems that cheap NAS devices provide the biggest down-sides of expensive NAS devices (in terms of lacking local access and having a different administration interface to servers) while also having the biggest down-sides of PC servers (lacking the advanced features of WAFL and performance of a NetApp).

Earlier today I started a process of reorganising some backups which included backups to a cheap NAS. I have been very unimpressed by the time taken to copy and rm files over NFS. I’m sure that the job would have been completed hours ago if I had local root access to the NAS.


Starting with BTRFS

Based on my investigation of RAID reliability [1] I have determined that BTRFS [2] is the Linux storage technology that has the best potential to increase data integrity without costing a lot of money. Basically a BTRFS internal RAID-1 should offer equal or greater data protection than RAID-6.

As BTRFS is so important and so very different to any prior technology for Linux it’s not something that can be easily deployed in the same way as other filesystems. It is possible to easily switch between filesystems such as Ext4 and XFS because they work in much the same way: you have a single block device which the filesystem uses to create a single mount-point. BTRFS however supports internal RAID, so it may have multiple block devices, and it may offer multiple mountable filesystems and snapshots. Much of the functionality of Linux Software RAID and LVM is covered by BTRFS. So the sensible way to deploy BTRFS is to give it all your storage and not make use of any other RAID or LVM.

So I decided to do a test installation. I started with a Debian install CD that was made shortly before the release of Squeeze (it was the first to hand) and installed with BTRFS for the root filesystem. I then upgraded to Debian/Unstable to get the latest kernel as BTRFS is developing rapidly. The system failed on the first boot after upgrading to Unstable because the /etc/fstab entry for the root filesystem had the FSCK pass number set to 1 – which wasn’t going to work as no FSCK program has been written. I changed that number to 0 and it then worked.
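For reference, the change was to the last field of the root entry in /etc/fstab, something like the following (the UUID is a placeholder):

# before: pass number 1 requests a boot-time fsck that doesn't exist for BTRFS
# UUID=xxxxxxxx / btrfs defaults 0 1
# after: pass number 0 skips the boot-time check
UUID=xxxxxxxx / btrfs defaults 0 0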

The initial install was on a desktop system that had a single IDE drive and a CD-ROM drive. For /boot I used a degraded RAID-1, and after completing the installation I removed the CD-ROM drive and installed a second hard drive; after that it was easy to add the other device to the RAID-1. Then I tried to add a new device to the BTRFS group with the command “btrfs device add /dev/sdb2 /dev/sda2” and was informed that it can’t do that to a mounted filesystem! That will decrease the possibilities for using BTRFS on systems with hot-swap drives; I hope that the developers regard it as a bug.

Then I booted with an ext3 filesystem for root and tried the “btrfs device add /dev/sdb2 /dev/sda2” again but got the error message “btrfs: sending ioctl 5000940a to a partition!” which is not even found by Google.

The next thing that I wanted to do was to put a swap file on BTRFS; the benefits of having redundancy and checksums on swap space seem obvious – and other BTRFS features such as compression might give a benefit too. So I created a file by using dd to take data from /dev/zero, ran mkswap on it and then tried to run swapon. But I was told that the file has holes and can’t be used. Automatically making zero blocks into holes is a useful feature in many situations, but not in this case.
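The test was essentially the following (the path and size are arbitrary); the swapon stage is where it fails:

# create a 1GB file of zeroes on the BTRFS filesystem
dd if=/dev/zero of=/btrfs/swapfile bs=1M count=1024
mkswap /btrfs/swapfile
# fails – swapon refuses the file because it contains holes
swapon /btrfs/swapfile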

So far my experience with BTRFS is that all the basic things work (IE storing files, directories, etc). But the advanced functions I wanted from BTRFS (mirroring and making a reliable swap space) failed. This is a bit disappointing, but BTRFS isn’t described as being ready for production yet.


More DRBD Performance tests

I’ve previously written Some Notes on DRBD [1] and a post about DRBD Benchmarking [2].

Previously I had determined that replication protocol C gives the best performance for DRBD, that the batch-time parameters for Ext4 aren’t worth touching for a single IDE disk, that barrier=0 gives a massive performance boost, and that DRBD gives a significant performance hit even when the secondary is not connected. Below are the results of some more tests of delivering mail from my Postal benchmark to my LMTP server which uses the Dovecot delivery agent to write it to disk. The rates are in messages per minute, where each message is an average of 70K in size. The ext4 filesystem is used for all tests and the filesystem features list is “has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize”.

Configuration (P4 2.8GHz)                      Messages/minute
Default Ext4                                   1663
barrier=0                                      2875
DRBD no secondary al-extents=7                 645
DRBD no secondary default (al-extents=127)     2409
DRBD no secondary al-extents=1024              2513
DRBD no secondary al-extents=3389              2650
DRBD connected                                 1575
DRBD connected al-extents=1024                 1560
DRBD connected al-extents=1024 Gig-E           1544

The al-extents option determines the size of the dirty areas that need to be resynced when a failed node rejoins the cluster. The default is 127 extents of 4M each, giving 508MB to be synchronised. The maximum is 3389 extents, for a synchronisation size of just over 13G. Even with fast disks and gigabit Ethernet it’s going to take a while to synchronise things if dirty zones are 13GB in size. In my tests using the maximum size of al-extents gives a 10% performance benefit in disconnected mode while a size of 1024 gives a 4% performance boost. Changing the al-extents size seems to make no significant difference for a connected DRBD device.
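Changing this is done in the disk section of the resource definition; a minimal sketch (the resource name is an example):

# in /etc/drbd.d/mail0.res
resource mail0 {
        disk {
                al-extents 1024;
        }
        # ... device, disk, meta-disk, and address settings as usual ...
}
# apply the new setting to a running resource
drbdadm adjust mail0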

All the tests on connected DRBD devices were done with 100baseT apart from the last one which was a separate Gigabit Ethernet cable connecting the two systems.

Conclusions

For the level of traffic that I’m using it seems that Gigabit Ethernet provides no performance benefit, the fact that it gave a slightly lower result is not relevant as the difference is within the margin of error.

Increasing the al-extents value helps with disconnected performance, a value of 1024 gives a 4% performance boost. I’m not sure that a value of 3389 is a good idea though.

The ext4 barriers are disabled by DRBD so a disconnected DRBD device gives performance that is closer to a barrier=0 mount than a regular ext4 mount. With the significant performance difference between connected and disconnected modes it seems possible that for some usage scenarios it could be useful to disable the DRBD secondary at times of peak load – it depends on whether DRBD is used as a really current backup or a strict mirror.
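Disconnecting and reconnecting the secondary can be done at run time; a sketch, with a hypothetical resource name:

# on the primary, detach the secondary during peak load
drbdadm disconnect mail0
# later, reconnect and let DRBD resynchronise the dirty extents
drbdadm connect mail0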

Future Tests

I plan to do some tests of DRBD over Linux software RAID-1 and tests to compare RAID-1 with and without bitmap support. I also plan to do some tests with the BTRFS filesystem; I know it’s not ready for production but it would still be nice to know what the performance is like.

But I won’t use the same systems, they don’t have enough CPU power. In my previous tests I established that a 1.5GHz P4 isn’t capable of driving the 20G IDE disk to its maximum capacity and I’m not sure that the 2.8GHz P4 is capable of running a RAID to its capacity. So I will use a dual-core 64bit system with a pair of SATA disks for future tests. The difference in performance between 20G IDE disks and 160G SATA disks should be a lot less than the performance difference between a 2.8GHz P4 and a dual-core 64bit CPU.


5 Principles of Backup Software

Everyone agrees that backups are generally a good thing. But it seems that there is a lot less agreement about how backups should work. Here is a list of 5 principles of backup software that seem to get ignored most of the time:

(1/5) Backups should not be Application Specific

It’s quite reasonable for people to want to extract data from a backup on a different platform. Maybe someone will want to extract data a few decades after the platform becomes obsolete. I believe that vendors of backup software have an ethical obligation to make it possible for customers to get their data out with minimal effort regardless of the circumstances.

Often when writing a backup application there will be good reasons for not using the existing formats for data storage (tar, cpio, zip, etc). But ideally any data store which involves something conceptually similar to a collection of files in one larger file will use one of those formats. There have been backward compatible extensions to tar and zip for SE Linux contexts and for OS/2 EAs – the possibility of extending archive file formats with no consequence other than warnings on extraction with an unpatched utility has been demonstrated.

For a backup which doesn’t involve source files (EG the contents of some sort of database) then it should be in a format that can be easily understood and parsed. Well designed XML is generally a reasonable option. Generally the format should involve plain text that is readable and easy to understand which is optionally compressed with a common compression utility (pkzip is a reasonable choice).

(2/5) Data Store Formats should be Published

For every data store there should be public documentation about its format to allow future developers to write support for it. It really isn’t difficult to release some commented header files so that people can easily determine the data structures. This includes all data stores including databases and filesystems. If I suddenly find myself with a 15yo image of a NTFS filesystem containing a proprietary database I should be able to find official header files for the version of NTFS and the database server in question so I can decode the data if it’s important enough.

When an application vendor hides the data formats it gives the risk of substantial data loss at some future time. Imposing such risk on customers to try and prevent them from migrating to a rival product is unethical.

(3/5) Backups should be forward and backward compatible

It is entirely unreasonable for a vendor to demand that all their users install the latest versions of their software. There are lots of good reasons for not upgrading which includes hardware not supporting new versions of the OS, lack of Internet access to perform the upgrade, application compatibility, and just liking the way the old version works. Even for the case of a critical security fix it should be possible to restore data without applying the fix.

For any pair of versions of software that are only separated by a few versions it should be possible to backup data from one and restore to the other. Even if the data can’t be used directly (EG a backup of AMD64 programs that is restored on an i386 system) it should still be accessible. If a new version of the software doesn’t support the ancient file formats then it should be possible for the users to get a slightly older version which talks to both the old and new versions.

Backups made on 64bit systems running the latest development version of Linux and on 10yo 32bit proprietary Unix systems are interchangeable. Admittedly Unix is really good at preserving file format compatibility, but there is no technical reason why other systems can’t do the same. Source code to cpio, tar, and gzip is freely available!

Apple TimeMachine fails badly in this regard, even a slightly older version of Mac OS can’t do a restore. It is however nice that most of the TimeMachine data is a tree of files which could be just copied to another system.

(4/5) Backup Software should not be Dropped

Sony Ericsson has made me hate them even more by putting the following message on their update web site:

The Backup and Restore app will be overwritten and cannot be used to restore data. Check out Android Market for alternative apps to back up and restore your data, such as MyBackup.

So if you own a Sony Ericsson phone and it is lost, stolen, or completely destroyed and all you have is a backup made by the Sony Ericsson tool then the one thing you absolutely can’t do is to buy a new Sony Ericsson phone to restore the data.

I believe that anyone who releases backup software has an ethical obligation to support restoring to all equivalent systems. How difficult would it be to put a new free app in the Android Market whose sole purpose is recovering old Sony Ericsson backups onto newer phones? It really can’t be that difficult, so even if they don’t want to waste critical ROM space by putting the feature in all new phones they can make it available to everyone who needs it. When compared to the cost of developing a new Android release for a series of phones the cost of writing such a restore program would be almost nothing.

It is simply mind-boggling that Sony Ericsson go against their own commercial interests in this regard. Surely it would make good business sense to be able to sell replacements for all the lost and broken Sony Ericsson phones, but instead customers who get burned by broken backups are given an incentive to buy a product from any other vendor.

(5/5) The greater the control over data the greater the obligation for protecting it

If you have data stored in a simple and standard manner (EG the /DCIM directory containing MP4 and JPEG files that is on the USB accessible storage in every modern phone) then IMHO it’s quite OK to leave customers to their own devices in terms of backups. Typical users can work out that if they don’t backup their pictures then they risk losing them, and they can work out how to do it.

My Sony Ericsson phones have data stored under /data (settings for Android applications) which is apparently only accessible as root. Sony Ericsson have denied me root access which prevents me running backup programs such as Titanium Backup, therefore I believe that they have a great obligation to provide a way of making a backup of this data and restoring it on a new phone or a phone that has been updated. To just provide phone upgrade instructions which tell me that my phone will be entirely wiped and that I should search the App Market for backup programs is unacceptable.

I believe that there are two ethical options available to Sony Ericsson at this time, one is to make it easy to root phones so that Titanium Backup and similar programs can be used, and the other option is to release a suitable backup program for older phones. Based on experience I don’t expect Sony Ericsson to choose either option.

Now it is also a bad thing for the Android application developers to make it difficult or impossible to backup their data. For example the Wiki for one Android game gives instructions for moving the saved game files to a new phone which starts with “root your phone”. The developers of that game should have read the Wiki, realised that rooting a phone for the mundane task of transferring saved game files is totally unreasonable, and developed a better alternative.

The best thing for developers to do is to allow the users to access their own data in the most convenient manner. Then it becomes the user’s responsibility to manage it and they can concentrate on improving their application.

Why Freedom is Important

Installing CyanogenMod on my Galaxy S was painful, but having root access so I can do anything I want is a great benefit. If phone vendors would do the right thing then I could recommend that other people use the vendor release, but it seems that vendors can be expected to act unethically. So I can’t recommend that anyone use an un-modded Android phone at any time. I also can’t recommend ever buying a Sony Ericsson product, not even when it’s really cheap.

Google have done a great thing with their Data Liberation Front [1]. Not only are they providing access to the data they store on our behalf (which is a good thing) but they have a mission statement that demands the same behavior from other companies – they make it an issue of competitive advantage! So while Sony Ericsson and other companies might not see a benefit in making people like me stop hating them, failing to be as effective in marketing as Google is a real issue. Data Liberation is something that should be discussed at board elections of IT companies.

Keep in mind the fact that ethics are not just about doing nice things, they are about establishing expectations of conduct that will be used by people who deal with you in future. Sony Ericsson has shown that I should expect that they will treat the integrity of my data with contempt and I will keep this in mind every time I decline an opportunity to purchase their products. Google has shown that they consider the protection of my data as an important issue and therefore I can be confident when using and recommending their services that I won’t get stuck with data that is locked away.

While Google has demonstrated that corporations can do the right thing, the vast majority of evidence suggests that we should never trust a corporation with anything that we might want to retrieve when it’s not immediately profitable for the corporation. Therefore avoiding commercial services for storing important data is the sensible thing to do.


Reliability of RAID

ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]. Basically as drives get larger the probability of hitting a read error during reconstruction increases and therefore you need to have more redundancy to deal with this. He suggests that as of 2009 drives were too big for a reasonable person to rely on correct reads from all remaining drives after one drive failed (in the case of RAID-5) and that in 2019 there will be a similar issue with RAID-6.

Of course most systems in the field aren’t using even RAID-6. All the most economical hosting options involve just RAID-1, and RAID-5 is still fairly popular with small servers. With RAID-1 and RAID-5 you have a serious problem when (not if) a disk returns random or outdated data and says that it is correct: you have no way of knowing which of the disks in the set has good data and which has bad data. For RAID-5 it is theoretically possible to reconstruct the data in some situations by determining which disk should have its data discarded to give a result that passes higher level checks (EG fsck or application data consistency), but this is probably only viable in extreme cases (EG one disk returns only corrupt data for all reads).

For the common case of a RAID-1 array if one disk returns a few bad sectors then probably most people will just hope that it doesn’t hit something important. The case of Linux software RAID-1 is of interest to me because that is used by many of my servers.

Robin has also written about some NetApp research into the incidence of read errors which indicates that 8.5% of “consumer” disks had such errors during the 32 month study period [2]. This is a concern as I run enough RAID-1 systems with “consumer” disks that it is very improbable that I’m not getting such errors. So the question is, how can I discover such errors and fix them?

In Debian the mdadm package does a monthly scan of all software RAID devices to try and find such inconsistencies, but it doesn’t send an email to alert the sysadmin! I have filed Debian bug #658701 with a patch to make mdadm send email about this. But this really isn’t going to help a lot as the email will be sent AFTER the kernel has synchronised the data with a 50% chance of overwriting the last copy of good data with the bad data! Also the kernel code doesn’t seem to tell userspace which disk had the wrong data in a 3-disk mirror (and presumably a RAID-6 works in the same way) so even if the data can be corrected I won’t know which disk is failing.
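In the meantime a check can be triggered by hand and the result inspected via sysfs; a sketch for /dev/md0 (this is the same mechanism the Debian cron job uses):

# start a consistency check of the array
echo check > /sys/block/md0/md/sync_action
# when the check finishes, a non-zero count means the mirrors disagreed
cat /sys/block/md0/md/mismatch_cnt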

Another problem with RAID checking is the fact that it will inherently take a long time and in practice can take a lot longer than necessary. For example I run some systems with LVM on RAID-1 on which only a fraction of the VG capacity is used, in one case the kernel will check 2.7TB of RAID even when there’s only 470G in use!

The BTRFS Filesystem

The btrfs Wiki is currently at btrfs.ipv5.de as the kernel.org wikis are apparently still read-only since the compromise [3]. BTRFS is noteworthy for doing checksums on data and metadata and for having internal support for RAID. So if two disks in a BTRFS RAID-1 disagree then the one with valid checksums will be taken as correct!

I’ve just done a quick test of this. I created a filesystem with the command “mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid?” and copied /dev/urandom to it until it was full. I then used dd to copy /dev/urandom to some parts of /dev/vg0/raidb while reading files from the mounted filesystem – that worked correctly although I was disappointed that it didn’t report any errors, I had hoped that it would read half the data from each device and fix some errors on the fly. Then I ran the command “btrfs scrub start .” and it gave lots of verbose errors in the kernel message log telling me which device had errors and where the errors are. I was a little disappointed that the command “btrfs scrub status .” just gave me a count of the corrected errors and didn’t mention which device had the errors.
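The test was along these lines (assuming the “raid?” glob matched LVs named raida and raidb; the second dd of course destroys data on raidb):

# create a BTRFS RAID-1 over two LVs and fill it with random data
mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raida /dev/vg0/raidb
mount /dev/vg0/raida /mnt
dd if=/dev/urandom of=/mnt/data bs=1M
# corrupt part of one device behind the filesystem's back
dd if=/dev/urandom of=/dev/vg0/raidb bs=1M count=100 seek=1000
# scrub finds and corrects the bad blocks, with details in the kernel log
cd /mnt
btrfs scrub start .
btrfs scrub status .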

It seems to me that BTRFS is going to be a much better option than Linux software RAID once it is stable enough to use in production. I am considering upgrading one of my less important servers to Debian/Unstable to test out BTRFS in this configuration.

BTRFS is rumored to have performance problems; I will test this but don’t have time to do so right now. Anyway I’m not always particularly concerned about performance, I have some systems where reliability is important enough to justify a performance loss.

BTRFS and Xen

The system with the 2.7TB RAID-1 is a Xen server and LVM volumes on that RAID are used for the block devices of the Xen DomUs. It seems obvious that I could create a single BTRFS filesystem for such a machine that uses both disks in a RAID-1 configuration and then use files on the BTRFS filesystem for Xen block devices. But that would give a lot of overhead of having a filesystem within a filesystem. So I am considering using two LVM volume groups, one for each disk. Then for each DomU which does anything disk intensive I can export two LVs, one from each physical disk and then run BTRFS inside the DomU. The down-side of this is that each DomU will need to scrub the devices and monitor the kernel log for checksum errors. Among other things I will have to back-port the BTRFS tools to CentOS 4.
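A sketch of that layout (VG, LV, and DomU device names are examples):

# in the Dom0: one VG per physical disk, one LV from each for the DomU
lvcreate -L 100G -n domu1a vg_sda
lvcreate -L 100G -n domu1b vg_sdb
# after exporting both LVs to the DomU via the Xen config, inside the DomU:
mkfs.btrfs -m raid1 -d raid1 /dev/xvdb /dev/xvdc
mount /dev/xvdb /data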

This will be more difficult to manage than just having an LVM VG running on a RAID-1 array and giving each DomU a couple of LVs for storage.

BTRFS and DRBD

The combination of BTRFS RAID-1 and DRBD is going to be a difficult one. The obvious way of doing it would be to run DRBD over loopback devices that use large files on a BTRFS filesystem. That gives the overhead of a filesystem in a filesystem as well as the DRBD overhead.

It would be nice if BTRFS supported more than two copies of mirrored data. Then instead of DRBD over RAID-1 I could have two servers that each have two devices exported via NBD and BTRFS could store the data on all four devices. With that configuration I could lose an entire server and get a read error without losing any data!

Comparing Risks

I don’t want to use BTRFS in production now because of the risk of bugs. While it’s unlikely to have really serious bugs it’s theoretically possible that a bug could deny access to data until kernel code is fixed and it’s also possible (although less likely) that a bug could result in data being overwritten such that it can never be recovered. But for the current configuration (Ext4 on Linux software RAID-1) it’s almost certain that I will lose small amounts of data and it’s most probable that I have silently lost data on many occasions without realising.


Some Notes on DRBD

DRBD is a system for replicating a block device across multiple systems. It’s most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it’s committed to disk locally and remotely. There is support for having multiple systems write to disk at the same time, but naturally that only works if the filesystem drivers are aware of this.

I’m installing DRBD on some Debian/Squeeze servers for the purpose of mirroring a mail store across multiple systems. For the virtual machines which run mail queues I’m not using DRBD because the failure conditions that I’m planning for don’t include two disks entirely failing. I’m planning for a system having an outage for a while so it’s OK to have some inbound and outbound mail delayed but it’s not OK for the mail store to be unavailable.

Global changes I’ve made in /etc/drbd.d/global_common.conf

In the common section I changed the protocol from “C” to “B”. This means that a write() system call returns after data is committed locally and sent to the other node. So if the primary node goes permanently offline AND the secondary node has a transient power failure or kernel crash causing the buffer contents to be lost then writes can be lost. I don’t think that this scenario is likely enough to make it worth choosing protocol C and requiring that all writes go to disk on both nodes before they are considered to be complete.

In the net section I added the following:

sndbuf-size 512k;
data-integrity-alg sha1;

This uses a larger network send buffer (apparently good for fast local networks – although I’d have expected that the low delay on a local Gig-E would give a low bandwidth delay product) and uses sha1 hashes on all packets (why does it default to no data integrity checking?).
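Putting the two changes together, the relevant parts of /etc/drbd.d/global_common.conf look roughly like this:

common {
        protocol B;
        net {
                sndbuf-size 512k;
                data-integrity-alg sha1;
        }
}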

Reserved Numbers

The default port number that is used is 7789. I think it’s best to use ports below 1024 for system services so I’ve set up some systems starting with port 100 and going up from there. I use a different port for every DRBD instance, so if I have two clustered resources on a LAN then I’ll use different ports even if they aren’t configured to ever run on the same system. You never know when the cluster assignment will change, and DRBD port numbers seem like something that could cause real problems if there was a port conflict.
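So a resource definition ends up looking something like this (names and addresses are examples):

resource mail0 {
        device /dev/drbd0;
        disk /dev/vg0/mail;
        meta-disk internal;
        on server-a {
                address 10.0.0.1:100;
        }
        on server-b {
                address 10.0.0.2:100;
        }
}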

Most of the documentation assumes that the DRBD device nodes on a system will start at /dev/drbd0 and increment, but this is not a requirement. I am configuring things such that there will only ever be one /dev/drbd0 on a network. This means that there is no possibility of a cut/paste error in a /etc/fstab file or a Xen configuration file causing data loss. As an aside I recently discovered that a Xen Dom0 can do a read-write mount of a block device that is being used read-write by a Xen DomU, there is some degree of protection against a DomU using a block device that is already being used in the Dom0 but no protection against the Dom0 messing with the DomU’s resources.

It would be nice if there was an option of using some device name other than /dev/drbdX where X is a number. Using meaningful names would reduce the incidence of doing things to the wrong device.

As an aside it would be nice if there was some sort of mount helper for determining which devices shouldn’t be mounted locally and which mount options are permitted – it MIGHT be OK to do a read-only mount of a DomU’s filesystem in the Dom0 but probably all mounting should be prevented. Also a mount helper for such things would ideally be able to change the default mount options, for example it could make the defaults be nosuid,nodev (or even noexec,nodev) when mounting filesystems from removable devices.

Initial Synchronisation

After a few trials it seems to me that things generally work if you create DRBD on two nodes at the same time and then immediately make one of them primary. If you don’t then it will probably refuse to accept one copy of the data as primary as it can’t seem to realise that both are inconsistent. I can’t understand why it does this in the case where there are two nodes with inconsistent data, you know for sure that there is no good data so there should be an operation to zero both devices and make them equal. Instead there seems to be no direct way of doing that.

The solution sometimes seems to be to run “drbdsetup /dev/drbd0 primary --” (where drbd0 is replaced with the appropriate device). This seems to work well and allowed me to create a DRBD installation before I had installed the second server. If the servers have been connected in Inconsistent/Inconsistent state then the solution seems to involve running “drbdadm -- --overwrite-data-of-peer primary db0-mysql” (for the case of a resource named db0-mysql defined in /etc/drbd.d/db0-mysql.res).

Also it seems that some commands can only be run from one node. So if you have a primary node that’s in service and another node in Secondary/Unknown state (IE disconnected) with data state Inconsistent/DUnknown, then while you would expect to be able to connect from the secondary node it appears that nothing other than a “drbdadm connect” command run from the primary node will get things going.
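For reference, the whole sequence that worked for me amounts to something like this (using the db0-mysql resource from above):

# on both nodes
drbdadm create-md db0-mysql
drbdadm up db0-mysql
# on the node whose data should win (or either node when both are blank)
drbdadm -- --overwrite-data-of-peer primary db0-mysql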

Hetzner Failover Configuration

The Wiki documenting how to configure IP failover for Hetzner servers [1] is closely tied to the Linux HA project [2]. This is OK if you want a Heartbeat cluster, but if you want manual failover or an automatic failover from some other form of script then it’s not useful. So I’ll provide the simplest possible documentation.

Below is a sample of shell code to get the current failover settings and change them to point the IP address to a different server. In my tests this takes between 19 and 20 seconds to complete, when the command completes the new server will be active and no IP packets will be lost – but TCP connections will be broken if the servers don’t support shared TCP state.

# username and password for the Hetzner robot
USERPASS=USER:PASS
# public IP
IP=10.1.2.3
# new active server
ACTIVE=10.2.3.4
# get current values
curl -s -u $USERPASS https://robot-ws.your-server.de/failover.yaml/$IP
# change active server
curl -s -u $USERPASS https://robot-ws.your-server.de/failover.yaml/$IP -d active_server_ip=$ACTIVE

Below is the output of the above commands showing the old state and the new state.

failover:
ip: 10.1.2.3
netmask: 255.255.255.255
server_ip: 10.2.3.3
active_server_ip: 10.2.3.4
failover:
ip: 10.1.2.3
netmask: 255.255.255.255
server_ip: 10.2.3.4
active_server_ip: 10.2.3.4


Why Clusters Usually Don’t Work

It’s widely regarded that to solve reliability problems you can just install a cluster. It’s quite obvious that if instead of having one system of a particular type you have multiple systems of that type and a cluster configured such that broken systems aren’t used then reliability will increase. Also in the case of routine maintenance a cluster configuration can allow one system to be maintained in a serious way (EG being rebooted for a kernel or BIOS upgrade) without interrupting service (apart from a very brief interruption that may be needed for resource failover). But there are some significant obstacles in the path of getting a good cluster going.

Buying Suitable Hardware

If you only have a single server that is doing something important and you have some budget for doing things properly then you really must do everything possible to keep it going. You need RAID storage with hot-swap disks, hot-swap redundant PSUs, and redundant ethernet cables bonded together. But if you have redundant servers then the requirement for making one server reliable is slightly reduced.

Hardware is getting cheaper all the time, a Dell R300 1RU server configured with redundant hot-plug PSUs, two 250G hot-plug SATA disks in a RAID-1 array, 2G of RAM, and a dual-core Xeon Pro E3113 3.0GHz CPU apparently costs just under $2,800AU (when using Google Chrome I couldn’t add some necessary jumper cables to the list so I couldn’t determine the exact price). So a cluster of two of them would cost about $5,600 just for the servers. But a Dell R200 1RU server with no redundant PSUs, a single 250G SATA disk, 2G of RAM, and a Core 2 Duo E7400 2.8GHz CPU costs only $1,048.99AU. So if a low end server is required then you could buy two R200 servers that have no redundancy built in which cost less than a single server that has hardware RAID and redundant PSUs. Those two servers have different sets of CPU options and probably other differences in the technical specs, but for many applications they will probably both provide more than adequate performance.

Using a server that doesn’t even have RAID is a bad idea, a minimal RAID configuration is a software RAID-1 array which only requires an extra disk per server. That takes the price of a Dell R200 to $1,203. So it seems that two low-end 1RU servers from Dell that have minimal redundancy features will be cheaper than a single 1RU server that has the full set of features. If you want to serve static content then that’s all you need, and a cluster can save you money on hardware! Of course we can debate whether any cluster node should be missing redundant hot-plug PSUs and disks. But that’s not an issue I want to address in this post.

Also serving static content is the simplest form of cluster, if you have a cluster for running a database server then you will need a dual-attached RAID array which will make things start to get expensive (or software for replicating the data over the network which is difficult to configure and may be expensive), so while a trivial cluster may not cost any extra money a real-world cluster deployment is likely to add significant expense.

My observation is that most people who implement clusters tend to have problems getting budget for decent hardware. When you have redundancy via the cluster you can tolerate slightly less expected uptime from the individual servers. While we can debate about whether a cluster member should have redundant PSUs and other expensive features it does seem that using a cheap desktop system as a cluster node is a bad idea. Unfortunately some managers think that a cluster solves the reliability problem and therefore you can just use recycled desktop systems as cluster nodes, this doesn’t give a good result.

Even if it is agreed that server class hardware is used for all servers so features such as ECC RAM are used you will still have problems if someone decides to use different hardware specs for each of the cluster nodes.

Testing a Cluster

Testing a non-clustered server or some servers that use a load-balancing device at the front-end isn’t that difficult in concept. Sure you have lots of use cases and exception conditions to test, but they are all mostly straight-through tests. With a cluster you need to test node failover at unexpected times. When a node is regarded as having an inconsistent state (which can mean that one service it runs could not be cleanly shut down when it was due to be migrated) it will need to be rebooted which is sometimes known as a STONITH. A STONITH event usually involves something like IPMI to cut the power or a command such as “reboot -nf“, this loses cached data and can cause serious problems for any application which doesn’t call fsync() as often as it should. It seems likely that the vast majority of sysadmins run programs which don’t call fsync() often enough, but the probability of losing data is low and the probability of losing data in a way that you will notice (IE it doesn’t get automatically regenerated) is even lower. The low probability of data loss due to race conditions combined with the fact that a server with a UPS and redundant PSUs doesn’t unexpectedly halt that often means that problems don’t get found easily. But when clusters have problems and start calling STONITH the probability starts increasing.

Getting cluster software to work in a correct manner isn’t easy. I filed Debian bug #430958 about dpkg (the Debian package manager) not calling fsync() and thus having the potential to leave systems in an inconsistent or unusable state if a STONITH happened at the wrong time. I was inspired to find this problem after finding the same problem with RPM on a SUSE system. The result of applying a patch to call fsync() on every file was bug report #578635 about the performance of doing so, the eventual solution was to call sync() after each package is installed. Next time I do any cluster work on Debian I will have to test whether the sync() code seems to work as desired.

Getting software to work in a cluster requires that not only bugs in system software such as dpkg be fixed, but also bugs in 3rd party applications and in-house code. Please someone write a comment claiming that their favorite OS has no such bugs and the commercial and in-house software they use is also bug-free – I could do with a cheap laugh.

For the most expensive cluster I have ever installed (worth about 4,000,000 UK pounds – back when the pound was worth something) I was not allowed to power-cycle the servers. Apparently the servers were too valuable to be rebooted in that way, so if they did happen to have any defective hardware or buggy software that would do something undesirable after a power problem it would become apparent in production rather than being a basic warranty or patching issue before the system went live.

I have heard many people argue that if you install a reasonably common OS on a server from a reputable company and run reasonably common server software then the combination would have been tested before and therefore almost no testing is required. I think that some testing is always required (and I always seem to find some bugs when I do such tests), but I seem to be in a minority on this issue as less testing saves money – unless of course something breaks. It seems that the need for testing systems before going live is much greater for clusters, but most managers don’t allocate budget and other resources for this.

Finally there is the issue of testing issues related to custom code and the user experience. What is the correct thing to do with an interactive application when one of the cluster nodes goes down and how would you implement it at the back-end?

Running a Cluster

Systems don’t just sit there without changing, you have new versions of the OS and applications and requirements for configuration changes. This means that the people who run the cluster will ideally have some specialised cluster skills. If you hire sysadmins without regard to cluster skills then you will probably end up not hiring anyone who has any prior experience with the cluster configuration that you use. Learning to run a cluster is not like learning to run yet another typical Unix daemon, it requires some differences in the way things are done. All changes have to be strictly made to all nodes in the cluster, having a cluster fail-over to a node that wasn’t upgraded and can’t understand the new data is not fun at all!

My observation is that the typical experience of having a team of sysadmins who have no prior cluster experience being hired to run a cluster usually involves “learning experiences” for everyone. It’s probably best to assume that every member of the team will break the cluster and cause down-time on at least one occasion! This can be alleviated by only having one or two people ever work on the cluster and having everyone else delegate cluster work to them. Of course if something goes wrong when the cluster experts aren’t available then the result is even more downtime than might otherwise be expected.

Hiring sysadmins who have prior experience running a cluster with the software that you use is going to be very difficult. It seems that any organisation that is planning a cluster deployment should plan a training program for sysadmins. Have a set of test machines suitable for running a cluster and have every new hire install the cluster software and get it all working correctly. It’s expensive to buy extra systems for such testing, but it’s much more expensive to have people who lack necessary skills try and run your most important servers!

The trend in recent years has been towards sysadmins not being system programmers. This may be a good thing in other areas but it seems that in the case of clustering it is very useful to have a degree of low level knowledge of the system that you can only gain by having some experience doing system coding in C.

It’s also a good idea to have a test network which has machines in an almost identical configuration to the production servers. Being able to deploy patches to test machines before applying them in production is a really good thing.

Conclusion

Running a cluster is something that you should either do properly or not at all. If you do it badly then the result can easily be less uptime than a single well-run system.

I am not suggesting that people avoid running clusters. You can take this post as a list of suggestions for what to avoid doing if you want a successful cluster deployment.


A Basic IPVS Configuration

I have just configured IPVS on a Xen server for load balancing between multiple virtual hosts. The benefit is not load balancing but management. With two virtual machines providing a service I can gracefully shut one down for maintenance and have the other take the load. When there are two machines providing a service a load balancing configuration is much better than a hot-spare, one reason is the fact that there may be application scaling issues that prevent one machine with twice the resources from giving as much performance as two smaller machines. Another is the fact that if you have a machine configured but never used there will always be some doubt as to whether it would work…

The first thing to do is to assign the IP address of the service to the front-end machine so that other machines on the segment (IE routers) will be able to send data to it. If the address for the service is 10.0.0.5 then the command “ip addr add dev eth0 10.0.0.5/24 broadcast +” will make it a secondary address on the eth0 interface. On a Debian system you would add the line “up ip addr add dev eth0 10.0.0.5/24 broadcast + || true” to the appropriate section of /etc/network/interfaces, for a Red Hat system it seems that /etc/rc.local is the best place for it. I expect that it would be possible to merely advertise the IP address via ARP without adding it to the interface, but the ability to ping the IPVS server on the service address seems useful and there seems no benefit in not assigning the address.

There are three methods used by IPVS for forwarding packets, gatewaying/routing (the default), IPIP encapsulation (tunneling), and masquerading. The gatewaying/routing method requires the back-end server to respond to requests on the service address. That would mean assigning the address to the back-end server without advertising it via ARP (which seems likely to have some issues for managing the system). The IPIP encapsulation method requires setting up IPIP which seemed like it would be excessively difficult (although maybe not more than required to set up masquerading). The masquerading option (which I initially chose) rewrites the packets to have the IP address of the real server. So for example if the service address is 10.0.0.5 and the back-end server has the address 10.0.1.5 then it will see packets addressed to 10.0.1.5. A benefit of masquerading is that it allows you to use different ports, so for example you could have a non-virtualised mail server listening on port 25 and a back-end server for a virtual service listening on port 26. While there is no practical limit to the number of private IP addresses that you might use it seems easier to manage servers listening on different ports with the same IP address – and there is the issue of server programs that are not written to support binding to a specific IP address.

ipvsadm -A -t 10.0.0.5:25 -s lblc -p
ipvsadm -a -t 10.0.0.5:25 -r 10.0.1.5 -m

The above two commands create an IPVS configuration that listens on port 25 of IP address 10.0.0.5 and then masquerades connections to 10.0.1.5 on port 25 (the default is to use the same port).

Now the problem is in getting the packets to return via the IPVS server. If the IPVS server happens to be your default gateway then it’s not a problem and it will already be working after the above two commands (if a service is listening on 10.0.1.5 port 25).

If the IPVS server is not the default gateway and you have only one IP address on the back-end server then this will require using netfilter to mark the packets and then route based on the packet matching. Marking via netfilter also seems to be the only well documented way of doing similar things. I spent some time working on this and didn’t get it working. However having multiple IP addresses per server is a recommended practice anyway (a back-end interface for communication between servers as well as a front-end interface for public data).

ip rule add from 10.0.1.5 table 1
ip route add default via 10.0.0.1 table 1

I use the above two commands to set up a new routing table for the data for the virtual service. The first line causes any packets from 10.0.1.5 to be sent to routing table 1 (I currently have a rough plan to have table numbers match ethernet device numbers, the data in question is going out device eth1). The second line adds a default route to table 1 which sends all packets to 10.0.0.1 (the private IP address of the IPVS server).

Then it SHOULD all be working, but in the network that I’m using (RHEL4 DomU and RHEL5 Dom0 and IPVS) it doesn’t. For some reason the data packets from the DomU are not seen as part of the same TCP stream (both in Netfilter connection tracking and by the TCP code in the kernel). So I get an established connection (3 way handshake completed) but no data transfer. The server sends the SMTP greeting repeatedly but nothing is received. At this stage I’m not sure whether there is something missing in my configuration or whether there’s a bug in IPVS. I would be happy to send tcpdump output to anyone who wants to try and figure it out.

My next attempt at this was via routing. I removed the “-m” option from the ipvsadm command and added the service IP address to the back-end with the command “ifconfig lo:0 10.0.0.5 netmask 255.255.255.255” and configured the mail server to bind to port 25 on address 10.0.0.5. Success at last!
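So the working direct routing configuration amounts to this (addresses as in the examples above; -g selects the gatewaying/routing method, which is the default):

# on the IPVS front-end
ipvsadm -A -t 10.0.0.5:25 -s lblc -p
ipvsadm -a -t 10.0.0.5:25 -r 10.0.1.5 -g
# on the back-end: the service address on lo:0 with a host netmask
ifconfig lo:0 10.0.0.5 netmask 255.255.255.255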

Now I just have to get Piranha working to remove back-end servers from the list when they fail.

Update: It’s quite important that when adding a single IP address to device lo:0 you use a netmask of 255.255.255.255. If you use the same netmask as the front-end device (which would seem like a reasonable thing to do) then (with RHEL4 kernels at least) you get proxy ARPs by default. For example if you used netmask 255.255.255.0 to add address 10.0.0.5 to device lo:0 then on device eth0 the machine will start answering ARP requests for 10.0.0.6 etc. Havoc then ensues.