BTRFS Training

Some years ago Barwon South Water gave LUV 3 old 1RU Sun servers for any use related to free software. We gave one of those servers to the Canberra makerlab, another is used as the server for the LUV mailing lists and web site, and the third was put aside for training. The servers have hot-swap 15,000rpm SAS disks – i.e. disks with a replacement cost greater than the budget we have for hardware. As we were given a spare 70G disk (and a 140G disk can replace a 70G disk) the LUV server has 2*70G disks, while the 140G disks (which can’t be replaced) are in the server used for training.

On Saturday I ran a BTRFS and ZFS training session for the LUV Beginners’ SIG. This was inspired by the amount of discussion of those filesystems on the mailing list and the amount of interest when we have lectures on those topics.

The training went well: the meeting was better attended than most Beginners’ SIG meetings and the people who attended seemed to enjoy it. One thing that I will do better in future is clearly documenting which commands are expected to fail and how to log in to the system. The users all logged in to accounts on a Xen server and then ssh’d to root at their DomU. I think it would have saved a bit of time if I had aliased commands like “btrfs” to “echo you must login to your virtual server first” or made the shell prompt on the Dom0 include instructions to log in to the DomU.

Each user or group had a virtual machine. The server has 32G of RAM and I ran 14 virtual servers that each had 2G of RAM. In retrospect I should have configured fewer servers and asked people to work in groups; that would have allowed more RAM for each virtual server and for the Dom0. The Dom0 ran a BTRFS RAID-1 filesystem and each virtual machine had a snapshot of the block devices from my master image for the training. Performance was quite good initially, as the OS image was shared and fitted into cache, but when many users were corrupting and scrubbing filesystems performance became very poor. The disks performed well (sustaining over 100 writes per second) but that’s not much when shared between 14 active users.

The ZFS part of the tutorial was based on RAID-Z (I didn’t use RAID-5/6 in BTRFS because it’s not ready to use and didn’t use RAID-1 in ZFS because most people want RAID-Z). Each user had 5*4G virtual disks (2 for the OS and 3 for BTRFS and ZFS testing). By the end of the training session there was about 76G of storage used in the filesystem (including the space used by the OS for the Dom0), so each user had something like 5G of unique data.

We are now considering what other training we can run on that server. I’m thinking of running training on DNS and email. Suggestions for other topics would be appreciated. For training that’s not disk intensive we could run many more than 14 virtual machines, 60 or more should be possible.

Below are the notes from the BTRFS part of the training. Anyone could do this on their own by substituting 2 empty partitions for /dev/xvdd and /dev/xvde. On a Debian/Jessie system all you need to do to get ready is install the btrfs-tools package. Note that this does carry some risk if you make a typo; an advantage of doing this sort of thing in a virtual machine is that there’s no possibility of breaking things that matter.

  1. Making the filesystem
    1. Make the filesystem. This makes a filesystem that spans 2 devices (note that you must use the -f option if there was already a filesystem on those devices):
      mkfs.btrfs /dev/xvdd /dev/xvde
    2. Use file(1) to see basic data from the superblocks:
      file -s /dev/xvdd /dev/xvde
    3. Mount the filesystem (can mount either block device, the kernel knows they belong together):
      mount /dev/xvdd /mnt/tmp
    4. See a BTRFS df of the filesystem, shows what type of RAID is used:
      btrfs filesystem df /mnt/tmp
    5. See more information about FS device use:
      btrfs filesystem show /mnt/tmp
    6. Balance the filesystem to change it to RAID-1 and verify the change (note that some parts of the filesystem were single and RAID-0 before this change):
      btrfs balance start -dconvert=raid1 -mconvert=raid1 -sconvert=raid1 --force /mnt/tmp
      btrfs filesystem df /mnt/tmp
    7. See if there are any errors, shouldn’t be any (yet):
      btrfs device stats /mnt/tmp
    8. Copy some files to the filesystem:
      cp -r /usr /mnt/tmp
    9. Check the filesystem for basic consistency (only checks checksums):
      btrfs scrub start -B -d /mnt/tmp
  2. Online corruption
    1. Corrupt the filesystem:
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=2000 seek=50
    2. Scrub again, should give a warning about errors:
      btrfs scrub start -B /mnt/tmp
    3. Check error count:
      btrfs device stats /mnt/tmp
    4. Corrupt it again:
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=2000 seek=50
    5. Unmount it:
      umount /mnt/tmp
    6. In another terminal follow the kernel log:
      tail -f /var/log/kern.log
    7. Mount it again and observe it correcting errors on mount:
      mount /dev/xvdd /mnt/tmp
    8. Run a diff, observe kernel error messages and observe that diff reports no file differences:
      diff -ru /usr /mnt/tmp/usr/
    9. Run another scrub, this will probably correct some errors which weren’t discovered by diff:
      btrfs scrub start -B -d /mnt/tmp
  3. Offline corruption
    1. Unmount the filesystem, corrupt the start of the first device, then try mounting it again; this will fail because the superblock was wiped:
      umount /mnt/tmp
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=200
      mount /dev/xvdd /mnt/tmp
      mount /dev/xvde /mnt/tmp
    2. The filesystem was not mountable due to the lack of a superblock. It might be possible to recover from this, but that’s more advanced, so we will instead restore the RAID. Mount the filesystem in degraded RAID mode, which allows full operation:
      mount /dev/xvde /mnt/tmp -o degraded
    3. Add /dev/xvdd back to the RAID:
      btrfs device add /dev/xvdd /mnt/tmp
    4. Show the filesystem devices, observe that xvdd is listed twice, the missing device and the one that was just added:
      btrfs filesystem show /mnt/tmp
    5. Remove the missing device and observe the change:
      btrfs device delete missing /mnt/tmp
      btrfs filesystem show /mnt/tmp
    6. Balance the filesystem, not sure this is necessary but it’s good practice to do it when in doubt:
      btrfs balance start /mnt/tmp
    7. Umount and mount it, note that the degraded option is not needed:
      umount /mnt/tmp
      mount /dev/xvdd /mnt/tmp
  4. Experiment
    1. Experiment with the “btrfs subvolume create” and “btrfs subvolume delete” commands (which act like mkdir and rmdir).
    2. Experiment with “btrfs subvolume snapshot SOURCE DEST” and “btrfs subvolume snapshot -r SOURCE DEST” for creating regular and read-only snapshots of other subvolumes (including the root).
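For reference, the experiments in the last section can be run with commands like the following (subvolume names are just examples), inside the filesystem mounted earlier:

```shell
# create a subvolume and put a file in it
btrfs subvolume create /mnt/tmp/vol1
cp /etc/hosts /mnt/tmp/vol1/

# a writable snapshot and a read-only snapshot of that subvolume
btrfs subvolume snapshot /mnt/tmp/vol1 /mnt/tmp/vol1-snap
btrfs subvolume snapshot -r /mnt/tmp/vol1 /mnt/tmp/vol1-ro

# list the subvolumes, then remove them
btrfs subvolume list /mnt/tmp
btrfs subvolume delete /mnt/tmp/vol1-snap /mnt/tmp/vol1-ro /mnt/tmp/vol1
```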

RAID Pain

One of my clients has a NAS device. Last week they tried what should have been a routine RAID operation: they added a new larger disk as a hot-spare and told the RAID array to replace one of the active disks with it. The aim was to replace the disks one at a time to grow the array. But one of the other disks had an error during the rebuild and things fell apart.

I was called in after the NAS had been rebooted when it was refusing to recognise the RAID. The first thing that occurred to me is that maybe RAID-5 isn’t a good choice for the RAID. While it’s theoretically possible for a RAID rebuild to not fail in such a situation (the data that couldn’t be read from the disk with an error could have been regenerated from the disk that was being replaced) it seems that the RAID implementation in question couldn’t do it. As the NAS is running Linux I presume that at least older versions of Linux have the same problem. Of course if you have a RAID array that has 7 disks running RAID-6 with a hot-spare then you only get the capacity of 4 disks. But RAID-6 with no hot-spare should be at least as reliable as RAID-5 with a hot-spare.

Whenever you recover from disk problems the first thing to do is make a read-only copy of the data; then you can’t make things worse. This is a problem when you are dealing with 7 disks. Fortunately they were only 3TB disks and each had only 2TB in use, so I found some space on a ZFS pool and bought a few 6TB disks which I formatted as BTRFS filesystems. For this task I wanted filesystems that support snapshots, so I could work on snapshots rather than the original copy.

I expect that at some future time I will be called in when an array of 6+ disks of the largest available size fails. This will be a more difficult problem to solve as I don’t own any system that can handle so many disks.

I copied a few of the disks to a ZFS filesystem on a Dell PowerEdge T110 running kernel 3.2.68. Unfortunately that system seems to have a problem with USB, when copying from 4 disks at once each disk was reading about 10MB/s and when copying from 3 disks each disk was reading about 13MB/s. It seems that the system has an aggregate USB bandwidth of 40MB/s – slightly greater than USB 2.0 speed. This made the process take longer than expected.

One of the disks had a read error; this was presumably the cause of the original RAID failure. dd has the option conv=noerror to make it continue after a read error. This initially seemed good, but the resulting file was smaller than the source partition: it seems that conv=noerror alone doesn’t seek the output file to maintain input and output alignment. If I had a hard drive filled with plain ASCII that MIGHT even be useful, but for a filesystem image it’s worse than useless. The only option was to repeatedly run dd with matching skip and seek options incrementing by 1K until it had passed the section with errors.
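That block-by-block loop can be sketched as below. The demonstration runs on ordinary temporary files; for a real recovery you would substitute the failing device for SRC, the image file for DST, and loop over the region around the errors:

```shell
#!/bin/bash
# Demonstrated on ordinary files; substitute the failing device for
# SRC and the destination image for DST in a real recovery.
SRC=$(mktemp)
DST=$(mktemp)
dd if=/dev/urandom of="$SRC" bs=1024 count=16 2>/dev/null

# Copy one 1K block at a time with matching skip and seek so a failed
# read doesn't shift all later data. conv=notrunc stops dd truncating
# the image on every iteration, and "|| true" carries on past errors.
for ((blk=0; blk<16; blk++)); do
  dd if="$SRC" of="$DST" bs=1024 count=1 skip=$blk seek=$blk conv=notrunc 2>/dev/null || true
done

cmp "$SRC" "$DST" && echo "copies match"
rm -f "$SRC" "$DST"
```

When a block genuinely can’t be read the dd for that block fails and the loop moves on, leaving a 1K hole at the correct offset instead of a shifted image.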

for n in /dev/loop[0-6] ; do echo $n ; mdadm --examine -v -v --scan $n|grep Events ; done

Once I had all the images I had to assemble them. The Linux Software RAID didn’t like the array because not all the devices had the same event count. The way Linux Software RAID (and probably most RAID implementations) work is that each member of the array has an event counter that is incremented when disks are added, removed, and when data is written. If there is an error then after a reboot only disks with matching event counts will be used. The above command shows the Events count for all the disks.

Fortunately different event numbers aren’t going to stop us. After assembling the array (which failed to run) I ran “mdadm -R /dev/md1” which kicked some members out. I then added them back manually and forced the array to run. Unfortunately attempts to write to the array failed (presumably due to mismatched event counts).

Now my next problem is that I can make a 10TB degraded RAID-5 array which is read-only but I can’t mount the XFS filesystem because XFS wants to replay the journal. So my next step is to buy another 2*6TB disks to make a RAID-0 array to contain an image of that XFS filesystem.

Finally backups are a really good thing…

BTRFS Status June 2015

The version of btrfs-tools in Debian/Jessie is incapable of creating a filesystem that can be mounted by the kernel in Debian/Wheezy. If you want to use a BTRFS filesystem on Jessie and Wheezy (which isn’t uncommon with removable devices) the only options are to use the Wheezy version of mkfs.btrfs or to use a Jessie kernel on Wheezy. I recently got bitten by this issue when I created a BTRFS filesystem on a removable device with a lot of important data (which is why I wanted metadata duplication and checksums) and had to read it on a server running Wheezy. Fortunately KVM in Wheezy works really well so I created a virtual machine to read the disk. Setting up a new KVM isn’t that difficult, but it’s not something I want to do while a client is anxiously waiting for their data.

BTRFS has been working well for me apart from the Jessie/Wheezy compatibility issue (which was an annoyance but didn’t stop me doing what I wanted). I haven’t written a BTRFS status report for a while because everything has been OK and there has been nothing exciting to report.

I regularly get errors from the cron jobs that run a balance, claiming that the filesystem has run out of free space. I have the cron jobs because of past problems with BTRFS running out of metadata space. In spite of the jobs often failing, the systems keep working, so I’m not too worried at the moment. I think this is a bug, but there are many more important bugs.

Linux kernel version 3.19 was the first to have working support for RAID-5 recovery, which means it was the first version to have usable RAID-5 (I think there is no point in having RAID-5 without recovery). It wouldn’t be prudent to trust your important data to a new feature in a filesystem, so at this stage BTRFS RAID-5 might be a viable option for a very large scratch space, but I wouldn’t use it for anything else. BTRFS has still had little performance optimisation; while this doesn’t matter much for SSDs or single-disk filesystems, it would probably hurt a lot for a RAID-5 of hard drives. Maybe BTRFS RAID-5 would be good for a scratch array of SSDs. The reports of problems with RAID-5 don’t surprise me at all.

I have a BTRFS RAID-1 filesystem on 2*4TB disks which is giving poor metadata performance: simple operations like “ls -l” on a directory with ~200 subdirectories take many seconds to run. I suspect that part of the problem is the filesystem being written by cron jobs, with files accumulating over more than a year. The “btrfs filesystem” command (see btrfs-filesystem(8)) allows defragmenting files and directory trees, but unfortunately it doesn’t support recursively defragmenting directories while skipping files. I really wish there was a way to get BTRFS to put all metadata on SSD and all data on hard drives. Sander suggested the following command on the BTRFS mailing list to defragment directories:

find / -xdev -type d -execdir btrfs filesystem defrag -c {} +

Below is the output of “zfs list -t snapshot” on a server I run. It’s often handy to know how much space is used by snapshots, but unfortunately BTRFS has no support for this.

NAME                        USED  AVAIL  REFER  MOUNTPOINT
hetz0/be0-mail@2015-03-10  2.88G      -   387G  -
hetz0/be0-mail@2015-03-11  1.12G      -   388G  -
hetz0/be0-mail@2015-03-12  1.11G      -   388G  -
hetz0/be0-mail@2015-03-13  1.19G      -   388G  -

Hugo pointed out on the BTRFS mailing list that the following command will give the amount of space used for snapshots. $SNAPSHOT is the name of a snapshot and $LASTGEN is the generation number of the previous snapshot you want to compare with.

btrfs subvolume find-new $SNAPSHOT $LASTGEN | awk '{total = total + $7}END{print total}'

One upside of the BTRFS implementation in this regard is that the above btrfs command, when not piped through awk, shows the names of the files that were written and the amounts of data written to them. Through casually examining this output I discovered that the most-written files in my home directory were under the “.cache” directory (which wasn’t exactly a surprise).

Now I am configuring workstations with a separate subvolume for ~/.cache for the main user. This means that ~/.cache changes don’t get stored in the hourly snapshots and less disk space is used for snapshots.
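Assuming /home is itself a BTRFS filesystem or subvolume, that setup is roughly the following (the username and paths are examples, and the user should be logged out while the directory is moved):

```shell
# Move the old cache aside, replace it with a subvolume, then copy
# the contents back; snapshots of /home will no longer include it.
mv /home/user/.cache /home/user/.cache-old
btrfs subvolume create /home/user/.cache
chown user:user /home/user/.cache
cp -a /home/user/.cache-old/. /home/user/.cache/
rm -rf /home/user/.cache-old
```

This works because BTRFS snapshots don’t recurse into child subvolumes: a snapshot of /home records ~/.cache as an empty directory.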

Conclusion

My observation is that things are going quite well with BTRFS. It’s more than 6 months since I had a noteworthy problem, which is pretty good for a filesystem that’s still under active development. But I still run many systems which could benefit from the data-integrity features of ZFS and BTRFS, yet don’t have the resources to run ZFS and need more reliability than I can expect from an unattended BTRFS system.

At this time the only servers I run with BTRFS are located within a reasonable drive from my home (not the servers in Germany and the US) and are easily accessible (not the embedded systems). ZFS is working well for some of the servers in Germany. Eventually I’ll probably run ZFS on all the hosted servers in Germany and the US, I expect that will happen before I’m comfortable running BTRFS on such systems. For the embedded systems I will just take the risk of data loss/corruption for the next few years.

BTRFS Status Dec 2014

My last problem with BTRFS was in August [1]. BTRFS has been running mostly uneventfully for me for the last 4 months. That’s a good improvement, but the fact that 4 months without problems is noteworthy for something as important as a filesystem is a cause for ongoing concern.

A RAID-1 Array

A week ago I had a minor problem with my home file server: one of the 3TB disks in the BTRFS RAID-1 started giving read errors. That’s not a big deal; I bought a new disk and did a “btrfs replace” operation, which was quick and easy. The first annoyance was that the output of “btrfs device stats” reported an error count for the new device; it seems that “btrfs replace” copies everything from the old disk including the error count. The solution is to use “btrfs device stats -z” to reset the count after replacing a device.
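For anyone hitting the same thing, the sequence looks roughly like this (the device names and mount point are hypothetical):

```shell
# replace the failing disk with the new one while the filesystem is mounted
btrfs replace start /dev/sdb /dev/sdc /mnt/data
btrfs replace status /mnt/data

# the error count is carried over from the old disk, so reset it
btrfs device stats /mnt/data
btrfs device stats -z /mnt/data
```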

I replaced the 3TB disk with a 4TB disk (at current prices it doesn’t make sense to buy a new 3TB disk). As I was running low on disk space I also added a 1TB disk to give 4TB of RAID-1 capacity; one of the nice features of BTRFS is that a RAID-1 filesystem can use any combination of disks to store 2 copies of every block of data. I started running a btrfs balance to get BTRFS to use all the space, before learning from the mailing list that I should have run “btrfs filesystem resize” instead. So my balance operation had configured the filesystem for 2*3TB+1*1TB disks, which wasn’t the right configuration once the 4TB disk was to be fully used. To make it even more annoying, the “btrfs filesystem resize” command takes a “devid” rather than a device name.
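The step I missed can be sketched like this: look up the devid of the enlarged device with “btrfs filesystem show”, then resize that device to its maximum (devid 2 and the mount point are just examples):

```shell
# show lists each device with its devid and how much of it BTRFS uses
btrfs filesystem show /mnt/data
# tell BTRFS to use all of device 2 (the new, larger disk)
btrfs filesystem resize 2:max /mnt/data
```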

I think that when BTRFS is more stable it would be good to have the btrfs utility warn the user about such potential mistakes. When a replacement device is larger than the old one it will be very common to want to use that space. The btrfs utility could easily suggest the most likely “btrfs filesystem resize” to make things easier for the user.

In a disturbing coincidence a few days after replacing the first 3TB disk the other 3TB disk started giving read errors. So I replaced the second 3TB disk with a 4TB disk and removed the 1TB disk to give a 4TB RAID-1 array. This is when it would be handy to have the metadata duplication feature and copies= option of ZFS.

Ctree Corruption

2 weeks ago a basic workstation with a 120G SSD owned by a relative stopped booting; the most significant errors it gave were “BTRFS: log replay required on RO media” and “BTRFS: open_ctree failed”. The solution to this is to run the command “btrfs-zero-log”, but that initially didn’t work. I restored the system from a backup (which was 2 months old) and took the SSD home to work on it. A day later “btrfs-zero-log” worked correctly and I recovered all the data. Note that I didn’t even try mounting the filesystem in question read-write; I mounted it read-only to copy all the data off. While in theory the filesystem should have been OK, I had no need to keep using it at that time (having already wiped the original device and restored from backup) and I don’t have confidence in BTRFS working correctly in that situation.
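In recent btrfs-progs the standalone btrfs-zero-log program became a subcommand of the main tool; on a current system the recovery attempt would look something like this (the device name is an example):

```shell
# discard the log tree that can't be replayed - this loses at most
# the last few seconds of writes before the crash
btrfs rescue zero-log /dev/sda2
# then mount read-only and copy the data off rather than trusting it read-write
mount -o ro /dev/sda2 /mnt/recovery
```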

While it was nice to get all the data back it’s a concern when commands don’t operate consistently.

Debian and BTRFS

I was concerned when the Debian kernel team chose 3.16 as the kernel for Jessie (the next Debian release). Judging by the way development has been going I wasn’t confident that 3.16 would turn out to be stable enough for BTRFS. But 3.16 is working reasonably well on a number of systems so it seems that it’s likely to work well in practice.

But I’m still deploying more ZFS servers.

The Value of Anecdotal Evidence

When evaluating software based on reports from reliable sources (i.e. most readers will trust me to run systems well and only report genuine bugs), bad reports carry much more weight than good reports. The fact that I’ve seen kernel 3.16 work reasonably well on ~6 systems is nice, but that doesn’t mean it will work well on thousands of other systems – although it does indicate that it will work well on more systems than some earlier Linux kernels which had common BTRFS failures.

But the annoyances I had with the 3TB array are repeatable and will annoy many other people. The ctree corruption problem MIGHT have been initially caused by a memory error (it’s a desktop machine without ECC RAM), but the recovery process was problematic and other users might expect problems in such situations.


Improving Computer Reliability

In a comment on my post about Taxing Inferior Products [1] Ben pointed out that most crashes are due to software bugs. Both Ben and I work on the Debian project and have had significant experience of software causing system crashes for Debian users.

But I still think that the widespread adoption of ECC RAM is a good first step towards improving the reliability of the computing infrastructure.

Currently when software developers receive bug reports they always wonder whether the bug was caused by defective hardware. So when bugs can’t be reproduced (or can’t be reproduced in a way that matches the bug report) they often get put in a list of random crash reports and no further attention is paid to them.

When a system has ECC RAM and a filesystem that uses checksums for all data and metadata, we can have greater confidence that random bugs aren’t due to hardware problems. For example, if a user reports a file corruption bug that they can’t repeat and that occurred when using the Ext3 filesystem on a typical desktop PC, I’ll wonder about the reliability of the storage and RAM in their system. If the same bug report came from someone with ECC RAM using the ZFS filesystem then I would be more likely to consider it a software bug.

The current situation is that every part of a typical PC is unreliable. When a bug can be attributed to one of several pieces of hardware, the OS kernel and even malware (in the case of MS Windows) it’s hard to know where to start in tracking down a bug. Most users have given up and accepted that crashing periodically is just what computers do. Even experienced Linux users sometimes give up on trying to track down bugs properly because it’s sometimes very difficult to file a good bug report. For the typical computer user (who doesn’t have the power that a skilled Linux user has) it’s much worse, filing a bug report seems about as useful as praying.

One of the features of ECC RAM is that the motherboard can inform the user (either at boot time, after a NMI reboot, or through system diagnostics) of the problem so it can be fixed. A feature of filesystems such as ZFS and BTRFS is that they can inform the user of drive corruption problems, sometimes before any data is lost.

My recommendation of BTRFS in regard to system integrity does have a significant caveat: currently the reliability decrease due to crashes outweighs the reliability increase due to checksums. This isn’t all bad, because at least when BTRFS crashes you know what the problem is, and BTRFS is rapidly improving in this regard. When I discuss BTRFS in posts like this one I’m considering the theoretical issues related to the design, not the practical issues of software bugs. That said, I’ve twice had a BTRFS filesystem seriously corrupted by a faulty DIMM in a system without ECC RAM.


Using BTRFS

I’ve just installed BTRFS on some systems that matter to me. It is still regarded as experimental but Oracle supports it with their kernel so it can’t be too bad – and it’s almost guaranteed that anything other than BTRFS or ZFS will lose data if you run as many systems as I do. Also I run lots of systems that don’t have enough RAM for ZFS (4G isn’t enough for ZFS in my tests). So I have to use BTRFS.

BTRFS and Virtual Machines

I’m running BTRFS for the DomUs on a virtual server which has 4G of RAM (and thus can’t run ZFS). The way I have done this is to use ext4 on Linux software RAID-1 for the root filesystem on the Dom0 and BTRFS for the rest. For BTRFS and virtual machines there seem to be two good options, given that I want BTRFS to use its own RAID-1 so that it can correct errors from a corrupted disk. One is to use a single BTRFS RAID-1 filesystem for all the storage and have each VM use a file on that big BTRFS filesystem for all its storage. The other option is to have each virtual machine run BTRFS RAID-1.

I’ve created two LVM Volume Groups (VGs) named diska and diskb, each DomU has a Logical Volume (LV) from each VG and runs BTRFS. So if a disk becomes corrupt the DomU will have to figure out what the problem is and fix it.

#!/bin/bash
# scrub the root filesystem of every running DomU, one at a time
for n in $(xm list | cut -f1 -d' ' | egrep -v '^(Name|Domain-0)') ; do
  echo $n
  ssh $n "btrfs scrub start -B -d /"
done

I use the above script in a cron job from the Dom0 to scrub the BTRFS filesystems in the DomUs. I use the -B option so that I will receive email about any errors and so that there won’t be multiple DomUs scrubbing at the same time (which would be really bad for performance).

BTRFS and Workstations

The first workstation installs of BTRFS that I did were similar to installations of Ext3/4 in that I had multiple filesystems on LVM block devices. This caused all the usual problems of filesystem sizes and also significantly hurt performance (sync seems to perform very badly on a BTRFS filesystem and it gets really bad with lots of BTRFS filesystems). BTRFS allows using subvolumes for snapshots and it’s designed to handle large filesystems so there’s no reason to have more than one filesystem IMHO.

It seems to me that the only benefit in using multiple BTRFS filesystems on a system is if you want to use different RAID options. I presume that eventually the BTRFS developers will support different RAID options on a per-subvolume basis (they seem to want to copy all ZFS features). I would like to be able to configure /home to use 3 copies of all data and metadata on a workstation that only has a single disk.

Currently I have some workstations using BTRFS with a RAID-1 configuration for /home and a regular non-RAID configuration for everything else. But now it seems that this is a bad idea: I would be better off using a single copy of all data on workstations (as I did for everything on workstations for the previous 15 years of running Linux desktop systems) and making backups frequent enough to keep the risk low.

BTRFS and Servers

One server that I run is primarily used as an NFS server and as a workstation. I have a pair of 3TB SATA disks in a BTRFS RAID-1 configuration mounted as /big and with subvolumes under /big for the various NFS exports. The system also has a 120G Intel SSD for /boot (Ext4) and the root filesystem which is BTRFS and also includes /home. The SSD gives really good read performance which is largely independent of what is done with the disks so booting and workstation use are very fast even when cron jobs are hitting the file server hard.

The system has used a RAID-1 array of 1TB SATA disks for all its storage ever since 1TB disks were big. So moving to a single storage device for /home is a decrease in theoretical reliability (in addition to the fact that an SSD might be less reliable than a traditional disk). The next thing I am going to do is install cron jobs that back up the root filesystem to something under /big. The server in question isn’t used for anything that requires high uptime, so if the SSD dies entirely and I need to replace it with another boot device it will be really annoying but not a great problem.

Snapshot Backups

One of the most important uses of backups is recovering from basic user mistakes such as deleting the wrong file. To deal with this I wrote some scripts to create backups from a cron job. I put the snapshots of a subvolume under a subvolume named “backup”. A common use is to have everything on the root filesystem, /home as a subvolume, /home/backup as another subvolume, and then subvolumes for backups such as /home/backup/2012-12-17, /home/backup/2012-12-17:00:15, and /home/backup/2012-12-17:00:30. I make /home/backup world readable so every user can access their own backups without involving me. Of course this means that if they make a mistake related to security then I would have to help them correct it – but I don’t expect my users to deal with security issues; if they accidentally grant inappropriate access to their files then I will be the one to notice and correct it.

Here is a script I name btrfs-make-snapshot. It has an optional first parameter “-d” to make it just display the btrfs commands it would run without actually doing anything. The second parameter is either “minutes” or “days” depending on whether you want to create a snapshot on a short interval (I use 15 minutes) or a daily snapshot. All other parameters are paths of subvolumes to be backed up:

#!/bin/bash
set -e

# usage:
# btrfs-make-snapshot [-d] minutes|days paths
# example:
# btrfs-make-snapshot minutes /home /mail

if [ "$1" == "-d" ]; then
  BTRFS="echo btrfs"
  shift
else
  BTRFS=/sbin/btrfs
fi

if [ "$1" == "minutes" ]; then
  DATE=$(date +%Y-%m-%d:%H:%M)
else
  DATE=$(date +%Y-%m-%d)
fi
shift

for n in $* ; do
  $BTRFS subvol snapshot -r $n $n/backup/$DATE
done

Here is a script I name btrfs-remove-snapshots which removes old snapshots to free space. It has an optional first parameter “-d” to make it just display the btrfs commands it would run without actually doing anything. The next parameters are the number of minute-based and day-based snapshots to keep (I am currently experimenting with 100 100 for /home, keeping 15 minute snapshots for 25 hours and daily snapshots for 100 days). After that is a list of filesystems to remove snapshots from; the removal will be from under the backup subvolume of the path in question.

#!/bin/bash
set -e

# usage:
# btrfs-remove-snapshots [-d] MINSNAPS DAYSNAPS paths
# example:
# btrfs-remove-snapshots 100 100 /home /mail

if [ "$1" == "-d" ]; then
  BTRFS="echo btrfs"
  shift
else
  BTRFS=/sbin/btrfs
fi

MINSNAPS=$1
shift
DAYSNAPS=$1
shift

for DIR in $* ; do
  BASE=$(echo $DIR | cut -c 2-200)
  # minute-based snapshots have a ":" in their name, daily ones don't;
  # head -n -N passes through everything except the newest N snapshots
  for n in $(btrfs subvol list $DIR|grep $BASE/backup/.*:|head -n -$MINSNAPS|sed -e "s/^.* //"); do
    $BTRFS subvol delete /$n
  done
  for n in $(btrfs subvol list $DIR|grep $BASE/backup/|grep -v :|head -n -$DAYSNAPS|sed -e "s/^.* //"); do
    $BTRFS subvol delete /$n
  done
done

A Warning

The Debian/Wheezy kernel (based on the upstream kernel 3.2.32) doesn’t seem to cope well when you run out of space by making snapshots. I have a filesystem that I am still trying to recover after doing that.

I’ve just been buying larger storage devices for systems while migrating them to BTRFS, so I should be able to avoid running out of disk space again until I can upgrade to a kernel that fixes such bugs.


Hard Drives for Backup

The general trend seems to be that cheap hard drives are increasing in capacity faster than much of the data that is commonly stored. Back in 1998 I had a 3G disk in my laptop and about 800M was used for my home directory. Now I have 6.2G used for my home directory (and another 2G in ~/src) out of the 100G capacity of my laptop. So my home directory usage has increased by a factor of about 8 while my available space has increased by a factor of about 30. When I had 800M for my home directory I saved space by cropping pictures for my web site and deleting the originals (thus losing some data I would rather have today), but now I just keep everything and it still doesn’t take up much of my hard drive. Similar trends apply to most systems that I use and that I run for my clients.

Due to the availability of storage people are gratuitously using a lot of disk space. A relative recently took 10G of pictures on a holiday; her phone has 12G of internal storage so there was nothing stopping her. She might decide that half the pictures aren’t that great if she had to save space, but that space is essentially free (she couldn’t buy a cheaper phone with less storage) so there’s no reason to delete any pictures.

When considering backup methods one important factor is the ability to store all of one type of data on one backup device. Having a single backup span multiple disks, tapes, etc has a dramatic impact on the ease of recovery and the potential for data loss. Currently 3TB SATA disks are really cheap and 4TB disks are available but rather expensive. Currently only one of my clients has more than 4TB of data used for one purpose (IE a single filesystem) so apart from that client a single SATA disk can backup anything that I run.

Benefits of Hard Drive Backup

When using a hard drive there is an option to make it a bootable disk in the same format as the live disk. I haven’t done this, but if you want the option of a quick recovery from a hardware failure then having a bootable disk with all the data on it is a good option. For example a server with software RAID-1 could have a backup disk that is configured as a degraded RAID-1 array.
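The degraded-array idea can be sketched with mdadm; the device names here are hypothetical, and this is a sketch of the approach rather than a tested procedure, so don’t run it against disks you care about:

```shell
# create a RAID-1 with one member deliberately "missing", using only the
# backup disk; the array is fully functional, just degraded
mdadm --create /dev/md9 --level=1 --raid-devices=2 /dev/sdc1 missing
mkfs.ext3 /dev/md9
mount /dev/md9 /mnt/backup
# ...copy the data on; after a failure of the live array the backup disk
# can go in a replacement server and a second member can be added later:
# mdadm /dev/md9 --add /dev/sdd1
```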

The biggest benefit is the ability to read a disk anywhere. I’ve read many reports of tape drives being discovered to be defective at the least convenient time. With a SATA disk you can install it in any PC or put it in a USB bay if you have USB 3.0 or the performance penalty of USB 2.0 is bearable – a USB 2.0 bay is great if you want to recover a single file, but if you want terabytes in a hurry then it won’t do.

A backup on a hard drive will typically use a common filesystem. For backing up Linux servers I generally use Ext3; at some future time I will move to BTRFS, as having checksums on all data is a good feature for a backup. Using a regular filesystem means that I can access the data anywhere without needing any special software, I can run programs like diff on the backup, and I can export the backup via NFS or Samba if necessary. You never know how you will need to access your backup so it’s best to keep your options open.

Hard drive backups are the best solution for files that are accidentally deleted. You can have the first line of backups on a local server (or through a filesystem like BTRFS or ZFS that supports snapshots) and files can be recovered quickly. Even a SATA disk in a USB bay is very fast for recovering a single file.

LTO tapes have a maximum capacity of 1.5TB at the moment and tape size has been increasing more slowly than disk size. Also LTO tapes have an expected lifetime of only 200 reads/writes of the entire tape. It seems to me that tapes don’t provide a great benefit unless you are backing up enough data to need a tape robot.

Problems with a Hard Drive Backup

Hard drives tend not to survive being dropped so posting a hard drive for remote storage probably isn’t a good option. This can be solved by transferring data over the Internet if the data isn’t particularly big or doesn’t change too much (I have a 400G data set backed up via rsync to another country because most of the data doesn’t change over the course of a year). Also if the data is particularly small then solid state storage (which costs about $1 per GB) is a viable option; I run more than a few servers which could be entirely backed up to a 200G SSD. $200 for a single backup of 200G of data is a bit expensive, but the potential for saving time and money on the restore means that it can be financially viable.

Some people claim that tape storage will better survive a Carrington Event than hard drives. I’m fairly dubious about the benefits of this, if a hard drive in a Faraday Cage (such as a regular safe that is earthed) is going to be destroyed then you will probably worry about security of the food supply instead of your data. Maybe I should just add a disclaimer “this backup system won’t survive a zombie apocalypse”. ;)

It’s widely regarded that tape storage lasts longer than hard drives. I doubt that this provides a real benefit as some of my personal servers are running on 20G hard drives from back when 20G was big. The fact that drives tend to last for more than 10 years combined with the fact that newer bigger drives are always being released means that important backups can be moved to bigger drives. As a general rule you should assume that anything which isn’t regularly tested doesn’t work. So whatever your backup method you should test it regularly and have multiple copies of the data to deal with the case when one copy becomes corrupt. The process of testing a backup can involve moving it to newer media.

I’ve seen it claimed that a benefit of tape storage is that part of the data can be recovered from a damaged tape. One problem with this is that part of a database often isn’t particularly useful. Another issue is that in my experience hard drives usually don’t fail entirely unless you drop them, drives usually fail a few sectors at a time.

How to Implement Hard Drive Backup

The most common need for backups is when someone deletes the wrong file. It’s usually a small restore and you want it to be an easy process. The best solution to this is to have a filesystem with snapshots such as BTRFS or ZFS. In theory it shouldn’t be too difficult to have a cron job manage snapshots, but as I’ve only just started putting BTRFS and ZFS on servers I haven’t got around to changing my backups. Snapshots won’t cover more serious problems such as hardware, software, or user errors that wipe all the disks in a server. For example the only time I lost a significant amount of data from a hosted server was when the data center staff wiped it, so obviously good off-site backups are needed.

The easiest way to deal with problems that wipe a server is to have data copied to another system. For remote backups you can rsync to a local system and then use “cp -rl” or your favorite snapshot system to make a hard linked copy of the tree. A really neat feature is the ZFS ability to “send” a filesystem snapshot (or the diff between two snapshots) to a remote system [1]. Once you have regular backups on local storage you can then copy them to removable disks as often as you wish; I think I’ll have to install ZFS on some of my servers for the sole purpose of getting the “send” feature! There are NAS devices that provide similar functionality to the ZFS send/receive (maybe implemented with ZFS), but I’m not a fan of cheap NAS devices [2].
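The rsync plus “cp -rl” approach can be sketched as follows. With “cp -rl” the directories are copied but the files are hard linked, so each extra generation only costs space for the files that changed since the last sync. The paths here are temporary directories purely for illustration:

```shell
#!/bin/sh
set -e
SRC=$(mktemp -d)     # stands in for the tree rsync'd from the live server
BACKUP=$(mktemp -d)
echo "important data" > "$SRC/file.txt"

# step 1: sync the live tree (rsync -a in production, cp -r here)
cp -r "$SRC" "$BACKUP/current"

# step 2: make a hard linked copy of the tree as a dated snapshot
cp -rl "$BACKUP/current" "$BACKUP/snap-2013-01-01"

# both copies share one inode until the file is replaced by a later sync
stat -c %h "$BACKUP/current/file.txt"
# prints: 2
```

The next rsync replaces changed files in “current” with new inodes, so the snapshot keeps the old contents.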

It seems that the best way to address the first two needs of backup (fast local restore and resilience in the face of site failure) is to use ZFS snapshots on the server and ZFS send/receive to copy the data to another site. The next issue is that the backup server probably won’t be big enough for all the archives and you want to be able to recover from a failure on the backup server. This requires some removable storage.

The simplest removable backup is to use a SATA drive bay with eSATA and USB connectors. You use a regular filesystem like Ext3 and just copy the files on. It’s easy, cheap, and requires no special skill or software. Requiring no special skill is important, you never know who will be called on to recover from backups.

When a server is backing up another server by rsync (whether it’s in the same rack or another country) you want the backup server to be reliable. However there is no requirement for a single reliable server and sometimes having multiple backup servers will be cheaper. At current purchase prices you can buy two cheap tower systems with 4*3TB disks for less money than a single server that has redundant PSUs and other high end server features. Having two cheap servers die at once seems quite unlikely so getting two backup servers would be the better choice.

For filesystems that are bigger than 4TB a disk based backup would require backup software that handles multi part archives. One would hope that any software that is designed for tape backup would work well for this (consider a hard drive as a tape with a very fast seek), but often things don’t work as desired. If anyone knows of a good Linux backup program that supports multiple 4TB SATA disks in eSATA or USB bays then please let me know.
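In the absence of a proper multi-volume backup program, standard tools can at least split an archive stream across fixed-size pieces, one piece per disk. The sizes here are tiny for illustration; on real 4TB disks the split size would be a little under the disk capacity:

```shell
#!/bin/sh
set -e
DATA=$(mktemp -d); PARTS=$(mktemp -d); RESTORE=$(mktemp -d)
head -c 100000 /dev/zero > "$DATA/big.img"

# write the archive stream in 40000 byte pieces; each piece would go on
# a separate removable disk
tar -C "$DATA" -cf - . | split -b 40000 - "$PARTS/backup-part-"
ls "$PARTS" | wc -l
# prints: 3

# restore by concatenating the pieces back into one stream
cat "$PARTS"/backup-part-* | tar -C "$RESTORE" -xf -
cmp "$DATA/big.img" "$RESTORE/big.img" && echo "restore OK"
```

The weakness compared to real backup software is that every piece is needed for the restore, which is one reason part of this post asks for better options.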

Conclusion

BTRFS or ZFS snapshots are the best way of recovering from simple mistakes.

ZFS send/receive seems to be the best way of synchronising updates to filesystems to other systems or sites.

ZFS should be used for all servers. Even if you don’t currently need send/receive you never know what the future requirements may be. Apart from needing huge amounts of RAM (one of my servers had OOM failures when it had a mere 4G of RAM) there doesn’t seem to be any down-side to ZFS.

I’m unsure of whether to use BTRFS for removable backup disks. The immediate up-sides are checksums on all data and meta-data and the possibility of using built-in RAID-1 so that a random bad sector is unlikely to lose data. There is also the possibility of using snapshots on a removable backup disk (if the disk contains separate files instead of an archive). The down-sides are lack of support on older systems and the fact that BTRFS is fairly new.

Have I missed anything?


ZFS on Debian/Wheezy

As storage capacities increase the probability of data corruption increases, as does the amount of time required for a fsck on a traditional filesystem. Also the capacity of disks is increasing a lot faster than the contiguous IO speed, which means that the RAID rebuild time is increasing. For example my first hard disk was 70M and had a transfer rate of 500K/s, which meant that the entire contents could be read in a mere 140 seconds! The last time I tested a more recent disk, a 1TB SATA disk gave contiguous transfer rates ranging from 112MB/s to 52MB/s, which meant that reading the entire contents took 3 hours and 10 minutes, and that problem is worse with newer bigger disks. The long rebuild times make greater redundancy more desirable.

BTRFS vs ZFS

Both BTRFS and ZFS checksum all data to cover the case where a disk returns corrupt data, they don’t need a fsck program, and the combination of checksums and built-in RAID means that they should have less risk of data loss due to a second failure during rebuild. ZFS supports RAID-Z which is essentially a RAID-5 with checksums on all blocks to handle the case of corrupt data as well as RAID-Z2 which is a similar equivalent to RAID-6. RAID-Z is quite important if you don’t want to have half your disk space taken up by redundancy or if you want to have your data survive the loss of more than one disk, so until BTRFS has an equivalent feature ZFS offers significant benefits. Also BTRFS is still rather new which is a concern for software that is critical to data integrity.

I am about to install a system to be a file server and Xen server which probably isn’t going to be upgraded a lot over the next few years. It will have 4 disks so ZFS with RAID-Z offers a significant benefit over BTRFS for capacity and RAID-Z2 offers a significant benefit for redundancy. As it won’t be upgraded a lot I’ll start with Debian/Wheezy even though it isn’t released yet because the system will be in use without much change well after Squeeze security updates end.
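For the 4*3TB configuration under consideration the capacity arithmetic works out as follows (sizes in TB, shell arithmetic):

```shell
#!/bin/sh
DISKS=4; SIZE=3   # four 3TB disks
echo "RAID-1 pairs: $(( DISKS / 2 * SIZE ))TB usable, survives 1 failure per pair"
echo "RAID-Z:       $(( (DISKS - 1) * SIZE ))TB usable, survives any 1 failure"
echo "RAID-Z2:      $(( (DISKS - 2) * SIZE ))TB usable, survives any 2 failures"
# prints 6TB, 9TB, and 6TB respectively
```

So RAID-Z gives 50% more usable space than mirroring, and RAID-Z2 gives the same space as mirroring but survives any two disk failures.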

ZFS on Wheezy

Getting ZFS to basically work isn’t particularly hard; the ZFSonLinux.org site has the code and reasonable instructions for doing it [1]. The zfsonlinux code doesn’t compile out of the box on Wheezy although it works well on Squeeze. I found it easier to get the latest Ubuntu working with ZFS and then I rebuilt the Ubuntu packages for Debian/Wheezy and they worked. This wasn’t particularly difficult but it’s a pity that the zfsonlinux site didn’t support recent kernels.

Root on ZFS

The complication with root on ZFS is that the ZFS FAQ recommends using whole disks for best performance so you can avoid alignment problems on 4K sector disks (which is an issue for any disk large enough that you want to use it with ZFS) [2]. This means you have to either use /boot on ZFS (which seems a little too experimental for me) or have a separate boot device.

Currently I have one server running with 4*3TB disks in a RAID-Z array and a single smaller disk for the root filesystem. Having a fifth disk attached by duct-tape to a system that is only designed for four disks isn’t ideal, but when you have an OS image that is backed up (and not so important) and a data store that’s business critical (but not needed every day) then a failure on the root device can be fixed the next day without serious problems. But I want to fix this and avoid creating more systems like it.

There is some good documentation on using Ubuntu with root on ZFS [3]. I considered using Ubuntu LTS for the server in question, but as I prefer Debian and I can recompile Ubuntu packages for Debian it seems that Debian is the best choice for me. I compiled those packages for Wheezy, did the install and DKMS build, and got ZFS basically working without much effort.

The problem then became getting ZFS to work for the root filesystem. The Ubuntu packages didn’t work with the Debian initramfs for some reason and modules failed to load. This wasn’t necessarily a show-stopper as I can modify such things myself, but it’s another painful thing to manage and another way that the system can potentially break on upgrade.

The next issue is the unusual way that ZFS mounts filesystems. Instead of having block devices to mount and entries in /etc/fstab the ZFS system does things for you. So if you want a ZFS volume to be mounted as root you configure the mountpoint via the “zfs set mountpoint” command. This of course means that it doesn’t get mounted if you boot with a different root filesystem and adds some needless pain to the process. When I encountered this I decided that root on ZFS isn’t a good option. So for this new server I’ll install it with an Ext4 filesystem on a RAID-1 device for root and /boot and use ZFS for everything else.
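For reference, the mountpoint handling that caused the trouble looks something like this; the pool and filesystem names are hypothetical and this is a sketch, not the exact commands used on the server:

```shell
# ZFS filesystems are mounted by the zfs utilities, not via /etc/fstab
zpool create tank raidz /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
zfs create tank/root
zfs set mountpoint=/ tank/root
# "zfs get mountpoint tank/root" shows the setting; nothing is written
# to /etc/fstab, so booting with a different root filesystem won't
# mount it in the usual way
```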

Correct Alignment

After setting up the system with a 4 disk RAID-1 (or mirror for the pedants who insist that true RAID-1 has only two disks) for root and boot I then created partitions for ZFS. According to fdisk output the partitions /dev/sda2, /dev/sdb2 etc had their first sector address as a multiple of 2048 which I presume addresses the alignment requirement for a disk that has 4K sectors.
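The arithmetic behind that presumption: a start sector of 2048 with 512 byte logical sectors is a 1MiB offset, which is a multiple of the 4KiB physical sector size (this is the alignment that recent fdisk versions use by default):

```shell
#!/bin/sh
START=2048        # first sector of the partition, as reported by fdisk
LOGICAL=512       # logical sector size in bytes
PHYSICAL=4096     # physical sector size of a 4K disk
OFFSET=$(( START * LOGICAL ))
echo "partition starts at byte $OFFSET"
# prints: partition starts at byte 1048576
[ $(( OFFSET % PHYSICAL )) -eq 0 ] && echo "aligned"
# prints: aligned
```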

Installing ZFS

deb http://www.coker.com.au wheezy zfs

I created the above APT repository (only AMD64) for ZFS packages based on Darik Horn’s Ubuntu packages (thanks for the good work Darik). Installing zfs-dkms, spl-dkms, and zfsutils gave a working ZFS system. I could probably have used Darik’s binary packages but I think it’s best to rebuild Ubuntu packages to use on Debian.

The server in question hasn’t gone live in production yet (it turns out that we don’t have agreement on what the server will do). But so far it seems to be working OK.


BTRFS and ZFS as Layering Violations

LWN has an interesting article comparing recent developments in the Linux world to the “Unix Wars” that essentially killed every proprietary Unix system [1]. The article is really interesting and I recommend reading it, it’s probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it).

A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do.

The Benefits of Layers

I don’t believe as strongly in the BTRFS/ZFS design as the commentator probably thinks. The way my servers (and a huge number of other Linux systems) currently work, with RAID forming a reliable array from a set of cheap disks for reliability and often capacity or performance, is a good thing. I have storage on top of the RAID array and can fix the RAID without bothering about the filesystem(s) – and have done so in the past. I can also test the RAID array without involving any filesystem specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So Linux on a laptop is much the same as Linux on a server in terms of storage once we get past the issue of whether a single disk or a RAID array is used for the LVM PV. Among other things this means that the same code paths are used and I’m less likely to encounter a bug when I install a new system.

LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try and fix it – without interfering with other filesystems.

When using layered storage I can easily add or change layers when it’s appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data.

When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can’t separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem.

Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one device on NBD on another system. That’s a bit unusual but it is apparently working well for some people. The internal RAID on ZFS and BTRFS doesn’t support such things and using software RAID underneath ZFS or BTRFS loses some features.

When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than internal RAID loses reliability features for ZFS and BTRFS that means that no matter how you might implement those filesystems with DRBD it seems that you will lose somehow. It seems that neither BTRFS nor ZFS supports a disconnected RAID mode (like a Linux software RAID with a bitmap so it can resync only the parts that didn’t change) so it’s not possible to use BTRFS or ZFS RAID-1 with an NBD device.

The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.

The Benefits of Integration

When RAID and the filesystem are separate things (with some added abstraction from LVM) it’s difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise its operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance.

It would be possible to have a RAID driver that includes checksums for all blocks, it could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. Then to provide all the reliability benefits of ZFS you would at least need a filesystem that stores multiple copies of the data which would of course need checksums (because the filesystem could be used on a less reliable block device) and therefore you would end up with two checksums on the same data. Note that if you want to have a RAID array with checksums on all blocks then ZFS has a volume management feature (which is well described by Mark Round) [3]. Such a zvol could be used for a block device in a virtual machine and in an ideal world it would be possible to use one as swap space. But the zvol is apparently managed with all the regular ZFS mechanisms so it’s not a direct list of blocks on disk and thus can’t be extracted if there is a problem with ZFS.

Snapshots are an essential feature by today’s standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead.

The way that ZFS supports snapshots which inherit encryption keys is also interesting.

Conclusion

It’s technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that put checksums on all blocks. But it appears that there isn’t much interest in developing such things. So while people would use it (and people are using ZFS ZVols as block devices for other filesystems as described in a comment on Mark Round’s blog) it’s probably not going to be implemented.

Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment.

ZFS on Linux isn’t a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that’s not possible when you have important Linux software that needs fast access to the storage.


The Most Important things for running a Reliable Internet Service

One of my clients is currently investigating new hosting arrangements. It’s a bit of a complex process because there are lots of architectural issues relating to things such as the storage and backup of some terabytes of data and some serious computation on the data. Among other options we are considering cheap servers in the EX range from Hetzner [1] which provide 3TB of RAID-1 storage per server along with reasonable CPU power and RAM and Amazon EC2 [2]. Hetzner and Amazon aren’t the only companies providing services that can be used to solve my client’s problems, but they both provide good value for what they provide and we have prior experience with them.

To add an extra complication my client did some web research on hosting companies and found that Hetzner wasn’t even in the list of reliable hosting companies (whichever list that was). This is in some ways not particularly surprising: Hetzner offers servers without a full management interface (you can’t see a serial console or a KVM; you merely get access to reset it) and the best value servers (the only servers to consider for many terabytes of data) have SATA disks which presumably have a lower MTBF than SAS disks.

But I don’t think that this is a real problem. Even when hardware that’s designed for the desktop is run in a server room the reliability tends to be reasonable. My experience is that a desktop PC with two hard drives in a RAID-1 array will give a level of reliability in practice that compares very well to an expensive server with ECC RAM, redundant fans, redundant PSUs, etc.

My experience is that the most critical factor for server reliability is management. A server that is designed to be reliable can give very poor uptime if poorly maintained or if there is no rapid way of discovering and fixing problems. But a system that is designed to be cheap can give quite good uptime if well maintained and if problems can be rapidly discovered and fixed.

A Brief Overview of Managing Servers

There are text books about how to manage servers, so obviously I can’t cover the topic in detail in a blog post. But here are some quick points. Note that I’m not claiming that this list includes everything, please add comments about anything particularly noteworthy that you think I’ve missed.

  1. For a server to be well managed it needs to be kept up to date. It’s probably a good idea for management to have this on the list of things to do. A plan to check for necessary updates and apply them at fixed times (at least once a week) would be a good thing. My experience is that usually managers don’t have anything to do with this and sysadmins either apply patches or not at their own whim.
  2. It is really ideal for people to know how all the software works. For every piece of software that’s running it should either have come from a source that provides some degree of support (EG a Linux distribution) or be maintained by someone who knows it well. When you install custom software from people who become unavailable then it puts the reliability of the entire system at risk – if anything breaks then you won’t be able to get it fixed quickly.
  3. It should be possible to rapidly discover problems, having a client phone you to tell you that your web site is offline is a bad thing. Ideally you will have software like Nagios monitoring the network and reporting problems via a SMS gateway service such as ClickaTell.com. I am not sure that Nagios is the best network monitoring system or that ClickaTell is the best SMS gateway, but they have both worked well in my experience. If you think that there are better options for either of those then please write a comment.
  4. It should be possible to rapidly fix problems. That means that a sysadmin must be available 24*7 to respond to SMS and you must have a backup sysadmin for when the main person takes a holiday, or ideally two backup sysadmins so that if one is on holiday and another has an emergency then problems can still be fixed. Another thing to consider is that an increasing number of hotels, resorts, and cruise ships are providing net access. So you could decrease your need for backup sysadmins if you give a holiday bonus to a sysadmin who uses a hotel, resort, or cruise ship that has good net access. ;)
  5. If it seems likely that there may be some staff changes then it’s a really good idea to hire a potential replacement on a casual basis so that they can learn how things work. There have been a few occasions when I started a sysadmin contract after the old sysadmin ceased being on speaking terms with the company owner. This made it difficult for me to learn what was going on.
  6. If your network is in any way complex (IE it’s something that needs some skill to manage) then it will probably be impossible to hire someone who has experience in all the areas of technology at a salary you are prepared to pay. So you should assume that whoever you hire will do some learning on the job. This isn’t necessarily a problem but is something that needs to be considered. If you use some unusual hardware or software and want it to run reliably then you should have a spare system for testing so that the types of mistake which are typically made in the learning process are not made on your production network.

Conclusion

If you have a business which depends on running servers on the Internet and you don’t do all the things in the above list then the reliability of a service like Hetzner probably isn’t going to be an issue at all.