Finding Corrupt Files that cause a Kernel Error

There is a BTRFS bug in kernel 3.13 which is triggered by Kmail and causes Kmail index files to become seriously corrupt. Another bug in BTRFS causes a kernel GPF when an application tries to read such a file, which results in a SEGV being sent to the application. After that the kernel ceases to operate correctly for any files on that filesystem, and no command other than “reboot -nf” (hard reset without flushing write-back caches) can be relied on to work correctly. The second bug should be fixed in Linux 3.14; I’m not sure about the first one.

In the meantime I have several systems running Kmail on BTRFS which have this problem.

(strace tar cf - . | cat > /dev/null) 2>&1 | tail

To discover which file is corrupt I run the above command after a reboot. Below is a sample of the typical output, which shows that the file named “.trash.index” is corrupt. After discovering the file name I run “reboot -nf” and then delete the file (the file can be deleted on a clean system but not after a kernel GPF). Recently I’ve been doing this about once every 5 days, so on average each Kmail/BTRFS system has been getting disk corruption every two weeks. Fortunately every time the corruption has been on an index file so I haven’t needed to restore from backups.

newfstatat(4, ".trash.index", {st_mode=S_IFREG|0600, st_size=33, …}, AT_SYMLINK_NOFOLLOW) = 0
openat(4, ".trash.index", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_NOFOLLOW|O_CLOEXEC) = 5
fstat(5, {st_mode=S_IFREG|0600, st_size=33, …}) = 0
read(5,  <unfinished …>
+++ killed by SIGSEGV +++

BTRFS Status March 2014

I’m currently using BTRFS on most systems that I can access easily. It’s not nearly reliable enough that I want to install it on a server in another country or an embedded device that’s only accessible via 3G, but for systems where I can access the console it’s not doing too badly.

Balancing and Space Allocation

# btrfs filesystem df /
Data, single: total=103.97GiB, used=85.91GiB
System, DUP: total=32.00MiB, used=20.00KiB
Metadata, DUP: total=1.78GiB, used=1.31GiB
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/ac696117-473c-4945-a71e-917e09c6503c 108G 89G 19G 84% /

Currently there are still situations where it can run out of space and deadlock on freeing space. The above shows the output of the btrfs df command and the regular df command. I have about 106G of disk space allocated to data and metadata in BTRFS while df shows that the entire filesystem (i.e. the block device) is 108G. So if another 2G of data or metadata is allocated then the system is at risk of deadlocking. To avoid that I have to run “btrfs balance start /” to start a balance, which defragments the space use and frees some block groups. Currently there is a bug in BTRFS (present in all Debian/Unstable kernels) which prevents a balance operation from completing when systemd is used in a default configuration (something about the way systemd accesses its journal files triggers a BTRFS bug). This is really inconvenient, particularly given that there’s probably a strong correlation between people who use experimental filesystems and people who use experimental init programs.
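A full balance rewrites every block group and can take hours. The usage filters can reclaim mostly empty block groups much more quickly, so a sketch like the following (with an arbitrary 50% threshold) could be used to reclaim space before the device is fully allocated. It doesn’t avoid the systemd related bug, it just makes routine balancing quicker:

#!/bin/bash
# rewrite only block groups that are less than 50% used,
# returning the freed space to the unallocated pool
btrfs balance start -dusage=50 -musage=50 /
# show how much space is still allocated afterwards
btrfs filesystem df /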

When you get to the stage of the filesystem being deadlocked you can sometimes recover by removing snapshots and sometimes by adding a new device to the filesystem (even a USB flash drive will do). But I once had a filesystem get into a state where there wasn’t enough space to balance, add a device, or remove a snapshot – so I had to do a backup/format/restore.
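For the record, this is the sequence I would try on a filesystem that has deadlocked on space, assuming a spare USB device shows up as /dev/sdz (a made-up name). Adding the device gives the allocator some free chunks so a balance can proceed, and the device can be removed afterwards:

# temporarily add a spare device so the allocator has free chunks
btrfs device add /dev/sdz /
# free partially used block groups now that a balance can proceed
btrfs balance start -dusage=50 /
# migrate data off the spare device and remove it again
btrfs device delete /dev/sdz /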

Quota Groups

Last time I asked the developers (a few weeks ago) they told me that quota groups aren’t ready to use. They also said that they know about enough bugs that there’s no benefit in testing that feature. Even people who want to report bugs in BTRFS shouldn’t use quotas.

Kernel Panics with Kmail

I’ve had three systems develop filesystem corruption on files related to Kmail (the email program from KDE). I suspect that Kmail is triggering a bug in BTRFS. On all three systems the filesystem developed corruption that persisted across a reboot. One of the three systems was fixed by deleting the file for the Outbox, the others are waiting for kernel 3.14 which is supposed to fix the bug that causes kernel panics when accessing the corrupted files in question.

I don’t know whether kernel 3.14 will fix the bug that caused the corruption in the first place.

Conclusion

As I don’t use quotas BTRFS is working well for me on systems that have plenty of storage space and don’t run Kmail. There are some systems running systemd where I plan to upgrade the kernel before all the filesystem space is allocated. One of my systems is currently running SysVinit so I can balance the filesystem.

Apart from these issues BTRFS is working reasonably well for me. I haven’t yet had its filesystem checksums correct corrupted data from disk in any situation other than tests (I have had ZFS correct such an error, so the hardware I use does benefit from this feature). I have restored data from BTRFS snapshots on many occasions, so that feature has been a major benefit for me. When I had a system with faulty RAM the internal checks in BTRFS alerted me to the problem and I didn’t lose any data; the filesystem became read-only and I was able to copy everything off even though it was too corrupted for writes.

Dell PowerEdge T110

In June 2008 I received a Dell PowerEdge T105 server to run in my home for a client [1]. That system has run well for over 5 years for the purposes of my client and also as my own home fileserver and workstation. But now it’s getting a bit old: while it was still basically working, the cooling fans were getting noisy, faster systems are available, and it was crashing occasionally, which could have been due to hardware or software.

On the 7th of November I got a new Dell PowerEdge T110. It’s got an i3-3220 CPU (rated at 4218 by cpubenchmark.net) which is a lot better than the AMD Opteron 1212 (rated at 982). It takes up to 4*3.5″ SATA disks (as opposed to 2 disks) and has more options for memory expansion. Next time I run out of disk space I’ll add another RAID-1 pair of disks instead of buying new disks.

Generally this system is much the same as the one it replaces. It’s a cheap server which unfortunately lacks sound hardware and usable video hardware. Sound is a problem I already solved with USB speakers, but for the new system I bought a PCIe video card. Fortunately the system has PCIe*16 sockets (which apparently only have PCIe*8 wiring), which avoids the problem I had in the past trying to obtain a suitable video card.

The crashes turned out to be due to BTRFS and now that I’ve made some tweaks everything is running well.

I’ll probably buy another Dell PowerEdge in about 5 years time.

Google web sites and Chromium CPU Use

Chromium is the free software build of the Google Chrome web browser. It’s essentially the same as the Google code but will often be an older version, particularly when you get Chromium from Debian/Stable (or any other Linux distribution that doesn’t track the latest versions all the time) and compare it to getting Chrome straight from Google.

My wife is using Chromium on an AMD Opteron 1212 system for all the usual web browsing tasks. Recently I’ve noticed that it takes a lot of CPU time whenever she leaves a Google web site open, whether that’s Google+, Gmail, or Youtube.

Web standards are complex and it’s difficult to do everything the way that one might desire. Making a web browser that doesn’t take 100% CPU time when the user is away from their desk may be a difficult technical challenge. Designing a web site that doesn’t trigger such unwanted behavior in common web browsers might also be a challenge.

But when one company produces both a web browser and some web sites that get a lot of traffic it’s rather disappointing that they don’t get this right.

It could be that Google have fixed this in a more recent version of the Chrome source tree, and it could be that they fixed the browser code before rolling out a new version of Google+ etc which causes problems with the old version (which might explain why I’ve never seen this problem). But even if that is the case it’s still disappointing that they aren’t supporting older versions. There is a real need for computers that don’t need to be updated all the time, running a 3 month old Linux distribution such as Debian/Wheezy shouldn’t be a problem.

There’s also a possibility that part of the cause of the problem is that an Opteron 1212 is a relatively slow CPU by today’s standards; it’s the slowest system I’m currently supporting for serious desktop use, and I don’t think it was one of the fastest CPUs available when it was released 4 years ago. But I think we should be able to expect systems to remain usable for more than 4 years. The Opteron 1212 system is a Dell PowerEdge tower server that is used as a workstation and a file server, so while I get desktop systems with faster CPUs for free I want to keep using the old PowerEdge server to avoid data corruption. As an aside, I’ve been storing important data on BTRFS for a year now and the only data loss I’ve suffered has been due to a faulty DIMM. The filesystem checksums built in to modern filesystems such as BTRFS and ZFS mean that RAM corruption makes up a greater portion of the risk to data integrity, and the greater complexity of the data structures in such filesystems gives the possibility of corruption that can’t be fixed without mkfs (as happened to me twice on the system with a bad DIMM).

The consequences of such wasted CPU use are reduced CPU time for other programs which might be doing something useful, extra electricity use, and more noise from CPU cooling fans (which is particularly annoying for me in this case).

Any suggestions for reducing the CPU use of web browsers, particularly when idle?

Using BTRFS

I’ve just installed BTRFS on some systems that matter to me. It is still regarded as experimental but Oracle supports it with their kernel so it can’t be too bad – and it’s almost guaranteed that anything other than BTRFS or ZFS will lose data if you run as many systems as I do. Also I run lots of systems that don’t have enough RAM for ZFS (4G isn’t enough for ZFS in my tests). So I have to use BTRFS.

BTRFS and Virtual Machines

I’m running BTRFS for the DomUs on a virtual server which has 4G of RAM (and thus can’t run ZFS). The way I have done this is to use ext4 on Linux software RAID-1 for the root filesystem on the Dom0 and use BTRFS for the rest. For BTRFS and virtual machines there seem to be two good options, given that I want BTRFS to use its own RAID-1 so that it can correct errors from a corrupted disk. One is to use a single BTRFS filesystem with RAID-1 for all the storage and then have each VM use a file on that big BTRFS filesystem for all its storage. The other option is to have each virtual machine run BTRFS RAID-1.

I’ve created two LVM Volume Groups (VGs) named diska and diskb; each DomU has a Logical Volume (LV) from each VG and runs BTRFS. So if a disk becomes corrupt the DomU will have to figure out what the problem is and fix it.
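As a rough sketch of what that looks like for one DomU (the LV and domain names here are made up for the example), the Xen config exports one LV from each VG and the DomU makes a BTRFS RAID-1 across the two virtual disks:

# /etc/xen/example.cfg (fragment) - one LV from each VG
disk = [ 'phy:/dev/diska/example,xvda,w',
         'phy:/dev/diskb/example,xvdb,w' ]

# inside the DomU, a BTRFS RAID-1 across the two virtual disks:
mkfs.btrfs -m raid1 -d raid1 /dev/xvda /dev/xvdb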

#!/bin/bash
# scrub the BTRFS root filesystem of every running DomU via ssh
for n in $(xm list | cut -f1 -d' ' | egrep -v '^Name|^Domain-0') ; do
  echo "$n"
  ssh "$n" "btrfs scrub start -B -d /"
done

I use the above script in a cron job from the Dom0 to scrub the BTRFS filesystems in the DomUs. I use the -B option so that I will receive email about any errors and so that there won’t be multiple DomUs scrubbing at the same time (which would be really bad for performance).

BTRFS and Workstations

The first workstation installs of BTRFS that I did were similar to installations of Ext3/4 in that I had multiple filesystems on LVM block devices. This caused all the usual problems of filesystem sizes and also significantly hurt performance (sync seems to perform very badly on a BTRFS filesystem and it gets really bad with lots of BTRFS filesystems). BTRFS allows using subvolumes for snapshots and it’s designed to handle large filesystems so there’s no reason to have more than one filesystem IMHO.

It seems to me that the only benefit in using multiple BTRFS filesystems on a system is if you want to use different RAID options. I presume that eventually the BTRFS developers will support different RAID options on a per-subvolume basis (they seem to want to copy all ZFS features). I would like to be able to configure /home to use 3 copies of all data and metadata on a workstation that only has a single disk.

Currently I have some workstations using BTRFS with a BTRFS RAID-1 configuration for /home and a regular non-RAID configuration for everything else. But now it seems that this is a bad idea; I would be better off just using a single copy of all data on workstations (as I did for everything on workstations for the previous 15 years of running Linux desktop systems) and making backups that are frequent enough to avoid a great risk.
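If I do convert those workstations, the balance convert filters should allow going from RAID-1 back to a single copy in place and then dropping the second disk. This is a sketch of the commands I expect to use (with /dev/sdb as a made-up name for the disk being removed), not something I have tested on those machines yet:

# convert data back to a single copy but keep duplicated metadata
btrfs balance start -dconvert=single -mconvert=dup /home
# then remove the second disk from the filesystem
btrfs device delete /dev/sdb /home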

BTRFS and Servers

One server that I run is primarily used as an NFS server and as a workstation. I have a pair of 3TB SATA disks in a BTRFS RAID-1 configuration mounted as /big and with subvolumes under /big for the various NFS exports. The system also has a 120G Intel SSD for /boot (Ext4) and the root filesystem which is BTRFS and also includes /home. The SSD gives really good read performance which is largely independent of what is done with the disks so booting and workstation use are very fast even when cron jobs are hitting the file server hard.
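For reference, setting up a filesystem like /big is just a matter of making a BTRFS RAID-1 across the pair of disks and creating a subvolume per export; the device and subvolume names below are invented for the example:

# BTRFS RAID-1 for both data and metadata across the two 3TB disks
mkfs.btrfs -L big -m raid1 -d raid1 /dev/sdb /dev/sdc
mount /dev/sdb /big
# one subvolume per NFS export
btrfs subvolume create /big/home-dirs
btrfs subvolume create /big/archive
# /etc/exports entries then refer to /big/home-dirs etc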

The system has used a RAID-1 array of 1TB SATA disks for all its storage ever since 1TB disks were big. So moving to a single storage device for /home is a decrease in theoretical reliability (in addition to the fact that an SSD might be less reliable than a traditional disk). The next thing that I am going to do is to install cron jobs that back up the root filesystem to something under /big. The server in question isn’t used for anything that requires high uptime, so if the SSD dies entirely and I need to replace it with another boot device then it will be really annoying but it won’t be a great problem.

Snapshot Backups

One of the most important uses of backups is to recover from basic user mistakes such as deleting the wrong file. To deal with this I wrote some scripts to create backups from a cron job. I put the snapshots of a subvolume under a subvolume named “backup”. A common use is to have everything on the root filesystem, /home as a subvolume, /home/backup as another subvolume, and then subvolumes for backups such as /home/backup/2012-12-17, /home/backup/2012-12-17:00:15, and /home/backup/2012-12-17:00:30. I make /home/backup world readable so every user can access their own backups without involving me. Of course this means that if they make a mistake related to security then I would have to help them correct it – but I don’t expect my users to deal with security issues; if they accidentally grant inappropriate access to their files then I will be the one to notice and correct it.
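Setting that up only requires creating the backup subvolume once per filesystem and making it world readable, something like the following (using the /home example from above):

btrfs subvolume create /home/backup
chmod 755 /home/backup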

Here is a script I name btrfs-make-snapshot which has an optional first parameter “-d” to make it just display the btrfs commands it would run and not actually do anything. The second parameter is either “minutes” or “days” depending on whether you want to create a snapshot on a short interval (I use 15 minutes) or a daily snapshot. All other parameters are paths of subvolumes that are to be backed up:

#!/bin/bash
set -e

# usage:
# btrfs-make-snapshot [-d] minutes|days paths
# example:
# btrfs-make-snapshot minutes /home /mail

# with -d just display the btrfs commands instead of running them
if [ "$1" == "-d" ]; then
  BTRFS="echo btrfs"
  shift
else
  BTRFS=/sbin/btrfs
fi

# minute based snapshots include the time in their name, daily ones don't
if [ "$1" == "minutes" ]; then
  DATE=$(date +%Y-%m-%d:%H:%M)
else
  DATE=$(date +%Y-%m-%d)
fi
shift

# create a read-only snapshot of each subvolume under its backup subvolume
for n in "$@" ; do
  $BTRFS subvol snapshot -r "$n" "$n/backup/$DATE"
done

Here is a script I name btrfs-remove-snapshots which removes old snapshots to free space. It has an optional first parameter “-d” to make it just display the btrfs commands it would run and not actually do anything. The next parameters are the number of minute based and day based snapshots to keep (I am currently experimenting with 100 100 for /home to keep 15 minute snapshots for 25 hours and daily snapshots for 100 days). After that is a list of filesystems to remove snapshots from. The removal will be from under the backup subvolume of the path in question.

#!/bin/bash
set -e

# usage:
# btrfs-remove-snapshots [-d] MINSNAPS DAYSNAPS paths
# example:
# btrfs-remove-snapshots 100 100 /home /mail

# with -d just display the btrfs commands instead of running them
if [ "$1" == "-d" ]; then
  BTRFS="echo btrfs"
  shift
else
  BTRFS=/sbin/btrfs
fi

MINSNAPS=$1
shift
DAYSNAPS=$1
shift

for DIR in "$@" ; do
  # strip the leading / so the path matches the subvolume names
  # printed by "btrfs subvol list"
  BASE=$(echo $DIR | cut -c 2-200)
  # minute based snapshots have a ":" in the name, keep the newest $MINSNAPS
  for n in $(btrfs subvol list $DIR|grep $BASE/backup/.*:|head -n -$MINSNAPS|sed -e "s/^.* //"); do
    $BTRFS subvol delete /$n
  done
  # daily snapshots have no ":", keep the newest $DAYSNAPS
  for n in $(btrfs subvol list $DIR|grep $BASE/backup/|grep -v :|head -n -$DAYSNAPS|sed -e "s/^.* //"); do
    $BTRFS subvol delete /$n
  done
done
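For completeness, here is the sort of cron setup that drives the two scripts; the install paths and schedule are just an example matching the 15 minute and daily intervals mentioned above:

# /etc/cron.d/btrfs-snapshots (example)
*/15 * * * * root /usr/local/sbin/btrfs-make-snapshot minutes /home /mail
30 0 * * *   root /usr/local/sbin/btrfs-make-snapshot days /home /mail
45 0 * * *   root /usr/local/sbin/btrfs-remove-snapshots 100 100 /home /mail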

A Warning

The Debian/Wheezy kernel (based on the upstream kernel 3.2.32) doesn’t seem to cope well when you run out of space by making snapshots. I have a filesystem that I am still trying to recover after doing that.

I’ve just been buying larger storage devices for systems while migrating them to BTRFS, so I should be able to avoid running out of disk space again until I can upgrade to a kernel that fixes such bugs.

Cheap Bulk Storage

The Problem

Some of my clients need systems that store reasonable amounts of data. This is enough data that we can expect some data corruption on disk (so traditional RAID doesn’t suffice), that old fashioned filesystems like Ext3/4 will have unreasonable fsck times, and that the number of disks in a small server isn’t enough.

NetApp is a really good option for bulk reliable storage, but their products are very expensive. BTRFS has a lot of potential, but the currently released versions (as supported in distributions such as Debian/Wheezy) lack significant features. One significant gap in current BTRFS releases is something equivalent to the ZFS send/receive functionality for remote backups; this was a major factor when I analysed the options for hard drive based backup [1], and you should always think about backup before deploying a new system. Currently ZFS is the best choice for reliable bulk storage if you can’t afford NetApp. Any storage system needs a minimum level of reliability if only to protect its own metadata, and a basic RAID array doesn’t protect against media corruption with current data volumes. The combination of performance, lack of fsck (which is a performance feature), large storage support, backup, and significant real-world use makes ZFS a really good option.

Now I need to get some servers for more than 8.1TiB of storage (the capacity of a RAID-Z array of 4*3TB disks). One of my clients needs significantly more, probably at least 10 disks in a RAID-Z array so none of the cheaper servers will do.

Basically the issue that some of my clients are dealing with (and which I have to solve) is how to provide a relatively cheap ZFS system for storing reasonable amounts of data. For some systems I need to start with about 10 disks and be able to scale to 24 disks or more without excessive expense. Also to make things a little easier and cheaper 24*7 operation is not required, so instead of paying for hot-swap disks we can just schedule down-time outside business hours.

The Problem with Dell

Dell is really good for small systems: the PowerEdge tower servers that support 2*3.5″ or 4*3.5″ disks and have space for an SSD or two are really affordable and easy to order. But even in the mid-size Dell tower servers (which are small by server standards) you have problems with just getting a few disks operating outside a RAID array [2]. The Dell online store is really great for small servers; any time I’m buying a server for less than $2500 I check the Dell online store first and usually their price is good enough that there is no need to get a quote from another company. Unfortunately all the servers with bigger storage involve disks that are unreasonably expensive (it seems that Dell makes their profit on the parts) and prices are not available online. I gave my email address and phone number to the Dell web site on Wednesday and they haven’t cared to get back to me yet. This is the type of service that makes me avoid IBM and HP for any server deployment where the Dell online store sells something suitable!

BackBlaze

For some time BackBlaze have been getting interest by describing how they store lots of data in a small amount of space by tightly stacking SATA disks. They don’t think that ZFS on Linux is ready for production, but their hardware ideas are useful. They have recently described their latest architecture [3]. They describe it as 135TB for $7,384. Of course the 135TB number is based on the idea of getting the full 3TB capacity out of each disk, which they can do as they have redundancy over multiple storage pods. But anyone who wants a single fileserver needs some internal redundancy to cover disk failure. One option might be to have three RAID-Z2 arrays of 15 disks, which gives a usable capacity of 39*3TB==117TB==106TiB. Note that while the ZFS documentation recommends between 3 and 9 disks per zpool for performance I don’t expect performance problems; with only a gigabit Ethernet connection three ZFS zpools should have no trouble making the network the bottleneck.

For this option the way to go would be to start with an array of 15 disks and then buy a second set of 15 disks when the first storage pool becomes full. It seems likely that 4TB disks will become cheap before a 35TiB array is filled so we can get more efficiency by delaying purchases. The BackBlaze pod isn’t cheap; it is sold as a complete system without storage disks for $US5,395 by Protocase [4]. That gives a markup of $US3,411 over the BackBlaze cost, which isn’t too bad given that BackBlaze are quoting the insane bulk discount hardware prices that I could never get. Protocase also offer the case on its own for anyone who wants to build a system around it. It seems like the better option is to buy the system from Protocase, but that would end up being over $6,000 when Australian import duty is added and probably close to $7,000 when shipping etc is included.

Norco

Norco offers a case that takes 24 hot-swap SATA/SAS disks and a regular PC motherboard for $US399 [5]. It’s similar to the BackBlaze pod but smaller and cheaper, and there’s no obvious option to buy a configured and tested system. 24 disks would allow two RAID-Z2 arrays of 12 disks; the first array could provide 27TiB and the second array could provide something bigger when new disks are released.

SuperMicro

SuperMicro has a range of storage servers that support from 12 to 36 disks [6]. They seem good, but I’d have to deal with a reseller to buy them which would involve pain at best and at worst they wouldn’t bother getting me a quote because I only want one server at a time.

Conclusion

Does anyone know of any other options for affordable systems suitable for running ZFS on SATA disks? Preferably ones that don’t involve dealing with resellers.

At the moment it seems that the best option is to get a Norco case and build my own system as I don’t think that any of my clients needs the capacity of a BackBlaze pod at the moment. Supermicro seems good but I’d have to deal with a reseller. In my experience the difference between the resellers of such computer systems and used car dealers is that used car dealers are happy to sell one car at a time and that every used car dealer at least knows how to drive.

Also if you are an Australian reader of my blog and you want to build such storage servers to sell to my clients in Melbourne then I’d be interested to see an offer. But please make sure that any such offer includes a reference to your contributions to the Linux community if you think I won’t recognise your name. If you don’t contribute then I probably don’t want to do business with you.

As an aside, I was recently at a camera store helping a client test a new DSLR when one of the store employees started telling me how good ZFS is for storing RAW images. I totally agree that ZFS is the best filesystem for storing large RAW files and this is what I am working on right now. But it’s not the sort of advice I expect to receive at a camera store, not even one that caters to professional photographers.

Hard Drives for Backup

The general trend seems to be that cheap hard drives are increasing in capacity faster than much of the data that is commonly stored. Back in 1998 I had a 3G disk in my laptop and about 800M was used for my home directory. Now I have 6.2G used for my home directory (and another 2G in ~/src) out of the 100G capacity in my laptop. So my space usage for my home directory has increased by a factor of about 8 while my space available has increased by a factor of about 30. When I had 800M for my home directory I saved space by cropping pictures for my web site and deleting the originals (thus losing some data I would rather have today), but now I just keep everything and it still doesn’t take up much of my hard drive. Similar trends apply to most systems that I use and that I run for my clients.

Due to the availability of storage people are gratuitously using a lot of disk space. A relative recently took 10G of pictures on a holiday, her phone has 12G of internal storage so there was nothing stopping her. She might decide that half the pictures aren’t that great if she had to save space, but that space is essentially free (she couldn’t buy a cheaper phone with less storage) so there’s no reason to delete any pictures.

When considering backup methods one important factor is the ability to store all of one type of data on one backup device. Having a single backup span multiple disks, tapes, etc has a dramatic impact on the ease of recovery and the potential for data loss. Currently 3TB SATA disks are really cheap and 4TB disks are available but rather expensive. Currently only one of my clients has more than 4TB of data used for one purpose (IE a single filesystem) so apart from that client a single SATA disk can backup anything that I run.

Benefits of Hard Drive Backup

When using a hard drive there is an option to make it a bootable disk in the same format as the live disk. I haven’t done this, but if you want the option of a quick recovery from a hardware failure then having a bootable disk with all the data on it is a good option. For example a server with software RAID-1 could have a backup disk that is configured as a degraded RAID-1 array.
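A minimal sketch of what that could look like with Linux software RAID, assuming the backup disk appears as /dev/sdc (a made-up device name): create a degraded RAID-1 with one member missing, put a filesystem on it, and copy the data on.

# create a degraded RAID-1 with only the backup disk as a member
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 missing
mkfs.ext3 /dev/md1
mount /dev/md1 /mnt/backup
# copy the live data onto the backup disk, e.g. with rsync
rsync -aHx / /mnt/backup/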

The biggest benefit is the ability to read a disk anywhere. I’ve read many reports of tape drives being discovered to be defective at the least convenient time. With a SATA disk you can install it in any PC or put it in a USB bay if you have USB 3.0 or the performance penalty of USB 2.0 is bearable – a USB 2.0 bay is great if you want to recover a single file, but if you want terabytes in a hurry then it won’t do.

A backup on a hard drive will typically use a common filesystem. For backing up Linux servers I generally use Ext3; at some future time I will move to BTRFS as having checksums on all data is a good feature for a backup. Using a regular filesystem means that I can access the data anywhere without needing any special software, I can run programs like diff on the backup, and I can export the backup via NFS or Samba if necessary. You never know how you will need to access your backup so it’s best to keep your options open.

Hard drive backups are the best solution for files that are accidentally deleted. You can have the first line of backups on a local server (or through a filesystem like BTRFS or ZFS that supports snapshots) and files can be recovered quickly. Even a SATA disk in a USB bay is very fast for recovering a single file.

LTO tapes have a maximum capacity of 1.5TB at the moment and tape size has been increasing more slowly than disk size. Also LTO tapes have an expected lifetime of only 200 reads/writes of the entire tape. It seems to me that tapes don’t provide a great benefit unless you are backing up enough data to need a tape robot.

Problems with a Hard Drive Backup

Hard drives tend not to survive being dropped so posting a hard drive for remote storage probably isn’t a good option. This can be solved by transferring data over the Internet if the data isn’t particularly big or doesn’t change too much (I have a 400G data set backed up via rsync to another country because most of the data doesn’t change over the course of a year). Also if the data is particularly small then solid state storage (which costs about $1 per GB) is a viable option; I run more than a few servers which could be entirely backed up to a 200G SSD. $200 for a single backup of 200G of data is a bit expensive, but the potential for saving time and money on the restore means that it can be financially viable.

Some people claim that tape storage will better survive a Carrington Event than hard drives. I’m fairly dubious about the benefits of this, if a hard drive in a Faraday Cage (such as a regular safe that is earthed) is going to be destroyed then you will probably worry about security of the food supply instead of your data. Maybe I should just add a disclaimer “this backup system won’t survive a zombie apocalypse”. ;)

It’s widely regarded that tape storage lasts longer than hard drives. I doubt that this provides a real benefit as some of my personal servers are running on 20G hard drives from back when 20G was big. The fact that drives tend to last for more than 10 years combined with the fact that newer bigger drives are always being released means that important backups can be moved to bigger drives. As a general rule you should assume that anything which isn’t regularly tested doesn’t work. So whatever your backup method you should test it regularly and have multiple copies of the data to deal with the case when one copy becomes corrupt. The process of testing a backup can involve moving it to newer media.

I’ve seen it claimed that a benefit of tape storage is that part of the data can be recovered from a damaged tape. One problem with this is that part of a database often isn’t particularly useful. Another issue is that in my experience hard drives usually don’t fail entirely unless you drop them, drives usually fail a few sectors at a time.

How to Implement Hard Drive Backup

The most common need for backups is when someone deletes the wrong file. It’s usually a small restore and you want it to be an easy process. The best solution to this is to have a filesystem with snapshots such as BTRFS or ZFS. In theory it shouldn’t be too difficult to have a cron job manage snapshots, but as I’ve only just started putting BTRFS and ZFS on servers I haven’t got around to changing my backups. Snapshots won’t cover more serious problems such as hardware, software, or user errors that wipe all the disks in a server. For example the only time I lost a significant amount of data from a hosted server was when the data center staff wiped it, so obviously good off-site backups are needed.

The easiest way to deal with problems that wipe a server is to have data copied to another system. For remote backups you can rsync to a local system and then use “cp -rl” or your favorite snapshot system to make a hard linked copy of the tree. A really neat feature is the ZFS ability to “send” a filesystem snapshot (or the diff between two snapshots) to a remote system [1]. Once you have regular backups on local storage you can then copy them to removable disks as often as you wish. I think I’ll have to install ZFS on some of my servers for the sole purpose of getting the “send” feature! There are NAS devices that provide similar functionality to ZFS send/receive (maybe implemented with ZFS), but I’m not a fan of cheap NAS devices [2].
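As a rough sketch of the rsync plus hard-link approach (the host name and paths here are made up for the example), each run keeps a dated directory where unchanged files are hard links to the previous copy, so they cost almost no extra space:

#!/bin/bash
# pull a backup and keep hard-linked dated copies
set -e
SRC=server.example.com:/home/
DEST=/backup/server/current
DATE=$(date +%Y-%m-%d)
rsync -aHx --delete $SRC $DEST
# hard link the current tree into a dated copy
cp -rl $DEST /backup/server/$DATE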

It seems that the best way to address the first two needs of backup (fast local restore and resilience in the face of site failure) is to use ZFS snapshots on the server and ZFS send/receive to copy the data to another site. The next issue is that the backup server probably won’t be big enough for all the archives and you want to be able to recover from a failure on the backup server. This requires some removable storage.

The simplest removable backup is to use a SATA drive bay with eSATA and USB connectors. You use a regular filesystem like Ext3 and just copy the files on. It’s easy, cheap, and requires no special skill or software. Requiring no special skill is important; you never know who will be called on to recover from backups.

When a server is backing up another server by rsync (whether it’s in the same rack or another country) you want the backup server to be reliable. However there is no requirement for a single reliable server and sometimes having multiple backup servers will be cheaper. At current purchase prices you can buy two cheap tower systems with 4*3TB disks for less money than a single server that has redundant PSUs and other high end server features. Having two cheap servers die at once seems quite unlikely so getting two backup servers would be the better choice.

For filesystems that are bigger than 4TB a disk based backup would require backup software that handles multi part archives. One would hope that any software that is designed for tape backup would work well for this (consider a hard drive as a tape with a very fast seek), but often things don’t work as desired. If anyone knows of a good Linux backup program that supports multiple 4TB SATA disks in eSATA or USB bays then please let me know.

Conclusion

BTRFS or ZFS snapshots are the best way of recovering from simple mistakes.

ZFS send/receive seems to be the best way of synchronising updates to filesystems to other systems or sites.

ZFS should be used for all servers. Even if you don’t currently need send/receive you never know what the future requirements may be. Apart from needing huge amounts of RAM (one of my servers had OOM failures when it had a mere 4G of RAM) there doesn’t seem to be any down-side to ZFS.

I’m unsure of whether to use BTRFS for removable backup disks. The immediate up-sides are checksums on all data and meta-data and the possibility of using built-in RAID-1 so that a random bad sector is unlikely to lose data. There is also the possibility of using snapshots on a removable backup disk (if the disk contains separate files instead of an archive). The down-sides are lack of support on older systems and the fact that BTRFS is fairly new.

Have I missed anything?

ZFS on Debian/Wheezy

As storage capacities increase the probability of data corruption increases, as does the amount of time required for a fsck on a traditional filesystem. Also the capacity of disks is increasing a lot faster than the contiguous IO speed, which means that the RAID rebuild time is increasing. For example my first hard disk was 70M and had a transfer rate of 500K/s, which meant that the entire contents could be read in a mere 140 seconds! The last time I tested a more recent disk, a 1TB SATA disk gave contiguous transfer rates ranging from 112MB/s to 52MB/s, which meant that reading the entire contents took 3 hours and 10 minutes, and that problem is worse with newer bigger disks. The long rebuild times make greater redundancy more desirable.

BTRFS vs ZFS

Both BTRFS and ZFS checksum all data to cover the case where a disk returns corrupt data, they don’t need a fsck program, and the combination of checksums and built-in RAID means that they should have less risk of data loss due to a second failure during rebuild. ZFS supports RAID-Z which is essentially a RAID-5 with checksums on all blocks to handle the case of corrupt data, as well as RAID-Z2 which is the similar equivalent of RAID-6. RAID-Z is quite important if you don’t want to have half your disk space taken up by redundancy, and RAID-Z2 if you want your data to survive the loss of more than one disk, so until BTRFS has equivalent features ZFS offers significant benefits. Also BTRFS is still rather new, which is a concern for software that is critical to data integrity.

I am about to install a system to be a file server and Xen server which probably isn’t going to be upgraded a lot over the next few years. It will have 4 disks so ZFS with RAID-Z offers a significant benefit over BTRFS for capacity and RAID-Z2 offers a significant benefit for redundancy. As it won’t be upgraded a lot I’ll start with Debian/Wheezy even though it isn’t released yet because the system will be in use without much change well after Squeeze security updates end.

ZFS on Wheezy

Getting ZFS to basically work isn’t particularly hard; the ZFSonLinux.org site has the code and reasonable instructions for doing it [1]. The zfsonlinux code doesn’t compile out of the box on Wheezy although it works well on Squeeze. I found it easier to get the latest Ubuntu working with ZFS and then rebuild the Ubuntu packages for Debian/Wheezy, and they worked. This wasn’t particularly difficult, but it’s a pity that the zfsonlinux site didn’t support recent kernels.

Root on ZFS

The complication with root on ZFS is that the ZFS FAQ recommends using whole disks for best performance so you can avoid alignment problems on 4K sector disks (which is an issue for any disk large enough that you want to use it with ZFS) [2]. This means you have to either use /boot on ZFS (which seems a little too experimental for me) or have a separate boot device.

Currently I have one server running with 4*3TB disks in a RAID-Z array and a single smaller disk for the root filesystem. Having a fifth disk attached by duct-tape to a system that is only designed for four disks isn’t ideal, but when you have an OS image that is backed up (and not so important) and a data store that’s business critical (but not needed every day) then a failure on the root device can be fixed the next day without serious problems. But I want to fix this and avoid creating more systems like it.

There is some good documentation on using Ubuntu with root on ZFS [3]. I considered using Ubuntu LTS for the server in question, but as I prefer Debian and I can recompile Ubuntu packages for Debian it seems that Debian is the best choice for me. I compiled those packages for Wheezy, did the install and DKMS build, and got ZFS basically working without much effort.

The problem then became getting ZFS to work for the root filesystem. The Ubuntu packages didn’t work with the Debian initramfs for some reason and modules failed to load. This wasn’t necessarily a show-stopper as I can modify such things myself, but it’s another painful thing to manage and another way that the system can potentially break on upgrade.

The next issue is the unusual way that ZFS mounts filesystems. Instead of having block devices to mount and entries in /etc/fstab the ZFS system does things for you. So if you want a ZFS volume to be mounted as root you configure the mountpoint via the “zfs set mountpoint” command. This of course means that it doesn’t get mounted if you boot with a different root filesystem and adds some needless pain to the process. When I encountered this I decided that root on ZFS isn’t a good option. So for this new server I’ll install it with an Ext4 filesystem on a RAID-1 device for root and /boot and use ZFS for everything else.
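For reference, this is roughly what the mountpoint configuration looks like; the pool and dataset names below are made up for the example:

# tell ZFS to mount this dataset as the root filesystem
zfs set mountpoint=/ rpool/ROOT/debian
# an ordinary dataset gets a mountpoint property the same way
zfs set mountpoint=/big/export tank/export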

Correct Alignment

After setting up the system with a 4 disk RAID-1 (or mirror for the pedants who insist that true RAID-1 has only two disks) for root and boot I then created partitions for ZFS. According to fdisk output the partitions /dev/sda2, /dev/sdb2 etc had their first sector address as a multiple of 2048 which I presume addresses the alignment requirement for a disk that has 4K sectors.
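A quick way to check this is to read the starting sector of each partition from sysfs and confirm it’s a multiple of 2048 (i.e. 1MiB aligned with 512 byte sectors). This sketch assumes the four disks are sda through sdd:

#!/bin/bash
# print the starting sector of each ZFS partition and the remainder
# when divided by 2048, a remainder of 0 means 1MiB alignment
for d in a b c d ; do
  start=$(cat /sys/block/sd$d/sd${d}2/start)
  echo "sd${d}2 starts at sector $start (remainder $((start % 2048)))"
done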

Installing ZFS

deb http://www.coker.com.au wheezy zfs

I created the above APT repository (only AMD64) for ZFS packages based on Darik Horn’s Ubuntu packages (thanks for the good work Darik). Installing zfs-dkms, spl-dkms, and zfsutils gave a working ZFS system. I could probably have used Darik’s binary packages but I think it’s best to rebuild Ubuntu packages to use on Debian.
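After adding the repository the install and pool creation went something like the following; the pool and dataset names are arbitrary and the partitions are the sdX2 partitions created above, so treat this as a sketch rather than an exact transcript:

apt-get update
apt-get install zfs-dkms spl-dkms zfsutils
# create a RAID-Z pool on the second partition of each disk,
# ashift=12 forces 4K alignment for disks with 4K sectors
zpool create -o ashift=12 tank raidz /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
zfs create tank/export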

The server in question hasn’t gone live in production yet (it turns out that we don’t have agreement on what the server will do). But so far it seems to be working OK.

BTRFS and ZFS as Layering Violations

LWN has an interesting article comparing recent developments in the Linux world to the “Unix Wars” that essentially killed every proprietary Unix system [1]. The article is really interesting and I recommend reading it, it’s probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it).

A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do.

The Benefits of Layers

I don’t believe in the BTRFS/ZFS design as strongly as the commentator probably thinks. The way my servers (and a huge number of other Linux systems) currently work, with RAID forming a reliable array from a set of cheap disks for reliability and often capacity or performance, is a good thing. I have storage on top of the RAID array and can fix the RAID without bothering about the filesystem(s) – and have done so in the past. I can also test the RAID array without involving any filesystem specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So Linux on a laptop is much the same as Linux on a server in terms of storage once we get past the issue of whether a single disk or a RAID array is used for the LVM PV; among other things this means that the same code paths are used and I’m less likely to encounter a bug when I install a new system.

LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try and fix it – without interfering with other filesystems.
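As a sketch of that workflow (the volume group and LV names are invented for the example), a snapshot taken before the repair attempt keeps the option of rolling back:

umount /dev/vg0/data
# snapshot the damaged LV so the repair attempt can be rolled back
lvcreate -s -n data-snap -L 10G /dev/vg0/data
fsck.ext4 -y /dev/vg0/data
# if the repair goes badly, merge the snapshot back to undo it:
# lvconvert --merge /dev/vg0/data-snap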

When using layered storage I can easily add or change layers when it’s appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data.

When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can’t separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem.

Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one device on NBD on another system. That’s a bit unusual but it is apparently working well for some people. The internal RAID on ZFS and BTRFS doesn’t support such things and using software RAID underneath ZFS or BTRFS loses some features.

When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than internal RAID loses reliability features for ZFS and BTRFS that means that no matter how you might implement those filesystems with DRBD it seems that you will lose somehow. It seems that neither BTRFS nor ZFS supports a disconnected RAID mode (like a Linux software RAID with a bitmap so it can resync only the parts that didn’t change) so it’s not possible to use BTRFS or ZFS RAID-1 with an NBD device.

The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.
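A sketch of that arrangement (the pool, zvol, and DRBD resource names are invented): create a zvol on each server, point the DRBD resource at the zvol device, and put Ext4 on the replicated device.

# on each server: create a 100G zvol to act as the DRBD backing device
zfs create -V 100G tank/drbd0
# the DRBD resource in /etc/drbd.d/r0.res then uses /dev/zvol/tank/drbd0
# as its disk; once DRBD is up the filesystem goes on the DRBD device
mkfs.ext4 /dev/drbd0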

The Benefits of Integration

When RAID and the filesystem are separate things (with some added abstraction from LVM) it’s difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise its operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance.

It would be possible to have a RAID driver that includes checksums for all blocks, it could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. Then to provide all the reliability benefits of ZFS you would at least need a filesystem that stores multiple copies of the data which would of course need checksums (because the filesystem could be used on a less reliable block device) and therefore you would end up with two checksums on the same data. Note that if you want to have a RAID array with checksums on all blocks then ZFS has a volume management feature (which is well described by Mark Round) [3]. Such a zvol could be used for a block device in a virtual machine and in an ideal world it would be possible to use one as swap space. But the zvol is apparently managed with all the regular ZFS mechanisms so it’s not a direct list of blocks on disk and thus can’t be extracted if there is a problem with ZFS.

Snapshots are an essential feature by today’s standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead.

The way that ZFS supports snapshots which inherit encryption keys is also interesting.

Conclusion

It’s technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that puts checksums on all blocks. But it appears that there isn’t much interest in developing such things. So while people would use it (and people are using ZFS zvols as block devices for other filesystems as described in a comment on Mark Round’s blog) it’s probably not going to be implemented.

Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment.

ZFS on Linux isn’t a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that’s not possible when you have important Linux software that needs fast access to the storage.

ZFS vs BTRFS on Cheap Dell Servers

I previously wrote about my first experiences with BTRFS [1]. Since then I’ve been using BTRFS on more systems and have had good results. The main problem I want to address is with the reliability of RAID [2].

Requirements for a File Server

Now one of my clients needs a new fileserver. They need to reliably store terabytes of data (currently 6TB and growing) which is mostly comprised of data files in the 10MB – 15MB size range. The data files will almost never be re-written and I anticipate that the main bottleneck will be the latency of NFS and other network file sharing protocols. I would hope that saturating a GigE network when sending 10MB data files from SATA disks via NFS, AFS, or SMB wouldn’t be a technical challenge.

It seems that BTRFS is the way of the future. But it’s still rather new and the lack of RAID-5 and RAID-6 is a serious issue when you need to store 10TB with today’s technology (that would be 8*3TB disks for RAID-10 vs 5*3TB disks for RAID-5). Also the case of two disks failing entirely within a short period of time requires RAID-6 (or RAID-Z2 as the ZFS variant of RAID-6 is known). With BTRFS at its current stage of development it seems that to recover from two disks failing you need to run BTRFS on top of another RAID-6 (maybe Linux software RAID-6). But for filesystems based on concepts similar to ZFS and BTRFS you want the filesystem to run the RAID so that if a block has a filesystem hash mismatch then the correct copy can be reconstructed from parity.

ZFS seems to be a lot more complex than BTRFS. While having more features is a good thing (BTRFS seems to be missing some sysadmin friendly features at this stage) complexity means that I need to learn more and test more before going live.

But it seems that the built in RAID-5 and RAID-6 is the killer issue. Servers start becoming a lot more expensive if you want more than 8 disks, and even going past 6 disks is a significant price point. As 3TB disks are available, an 8 disk RAID-6 gives something like 18TB usable space vs 12TB on a RAID-10, and a 6 disk RAID-6 gives about 12TB vs 9TB on a RAID-10. With RAID-10 (i.e. BTRFS) my client couldn’t use a 6 disk server such as the Dell PowerEdge T410 for $1500 as 9TB of usable storage isn’t adequate, and the Dell PowerEdge T610 which can support 8 disks and costs $2100 would be barely adequate for the near future with only 12TB of usable storage. Dell does sell significantly larger servers such that any of my clients’ needs could be covered by RAID-10, but in addition to costing more there are issues of power use and noise. When comparing a T610 and a T410 with a full set of disks the price difference is $1000 (assuming $200 per disk), which is probably worth paying to delay any future need for upgrades.

Buying Disks

The problem with the PowerEdge T610 server is that it uses hot-swap disks and the biggest disks available are 2TB for $586.30! 2TB*8 in RAID-6 gives 12TB of usable space for $4690.40! This compares poorly to the PowerEdge T410 which supports non-hot-swap disks so I can buy 6*3TB disks for something less than $200 each and get 12TB of usable space for $1200. If I could get hot-swap trays for Dell disks at a reasonable price then the T610 would be worth considering. But as 12TB of storage should do for at least the next 18 months it seems that the T410 is clearly the better option.

Does anyone know how to get cheap disk trays for Dell servers?

Implementation

In mailing list discussions some people suggest using Solaris or FreeBSD for a ZFS server. ZFS was designed for and implemented on Solaris, and FreeBSD was the first port. However Solaris and FreeBSD aren’t commonly used systems so it’s harder to find skilled people to work with them and there is less of a guarantee that the desired software will work. Among other things it’s really convenient to be able to run software for embedded Linux i386 systems on the server.

The first port of ZFS to Linux was based on FUSE [3]. This allows a clean separation of ZFS code from the Linux kernel code to avoid license issues but does have some performance problems. I don’t think that I will have any performance issues on this server as the data files are reasonably large, are received via an ADSL link, and require quite a bit of CPU time to process when they are accessed. But ZFS-FUSE doesn’t seem to be particularly popular.

The ZFS On Linux project provides source for a ZFS kernel module which you can compile and load [4]. As the module isn’t distributed with or statically linked to the kernel the license conflict of the CDDL ZFS code and the GPL Linux kernel code is apparently solved. I’ve read some positive reports from people who use this so it will be my preferred option.