BTRFS Status March 2014

I’m currently using BTRFS on most systems that I can access easily. It’s not nearly reliable enough that I want to install it on a server in another country or an embedded device that’s only accessible via 3G, but for systems where I can access the console it’s not doing too badly.

Balancing and Space Allocation

# btrfs filesystem df /
Data, single: total=103.97GiB, used=85.91GiB
System, DUP: total=32.00MiB, used=20.00KiB
Metadata, DUP: total=1.78GiB, used=1.31GiB
# df -h /

Filesystem Size Used Avail Use% Mounted on
/dev/disk/by-uuid/ac696117-473c-4945-a71e-917e09c6503c 108G 89G 19G 84% /

Currently there are still situations where BTRFS can run out of space and deadlock while trying to free space. The above shows the output of the btrfs df command and the regular df command. I have about 106G of disk space allocated to data and metadata in BTRFS while df shows that the entire filesystem (i.e. the block device) is 108G, so if I use another 2G of data or metadata then the system is at risk of deadlocking. To avoid that happening I have to run "btrfs balance start /" to start a balance, which consolidates the allocated space and frees some block groups. Currently there is a bug in BTRFS (present in all Debian/Unstable kernels) which prevents a balance operation from completing when systemd is used in its default configuration (something about the way systemd accesses its journal files triggers a BTRFS bug). This is really inconvenient, particularly given that there's probably a strong correlation between people who use experimental filesystems and people who use experimental init programs.
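
When most of the space is allocated, a balance that only rewrites mostly-empty chunks is often enough to free space without rewriting everything. On a reasonably recent kernel and btrfs-progs the usage filters do that; the 50% thresholds below are just an example:

# btrfs balance start -dusage=50 -musage=50 /
# btrfs filesystem df /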

When you get to the stage of the filesystem being deadlocked you can sometimes recover by removing snapshots and sometimes by adding a new device to the filesystem (even a USB flash drive will do). But I once had a filesystem get into a state where there wasn’t enough space to balance, add a device, or remove a snapshot – so I had to do a backup/format/restore.
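
For the cases where adding a device does work the sequence is roughly as follows; the device name is just an example, and the extra device can be removed again once the balance has freed some space:

# btrfs device add /dev/sdc1 /
# btrfs balance start /
# btrfs device delete /dev/sdc1 /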

Quota Groups

Last time I asked the developers (a few weeks ago) they told me that quota groups aren’t ready to use. They also said that they know about enough bugs that there’s no benefit in testing that feature. Even people who want to report bugs in BTRFS shouldn’t use quotas.

Kernel Panics with Kmail

I’ve had three systems develop filesystem corruption on files related to Kmail (the email program from KDE), and I suspect that Kmail is triggering a bug in BTRFS. On all three systems the corruption persisted across a reboot. One of the three systems was fixed by deleting the file for the Outbox; the others are waiting for kernel 3.14, which is supposed to fix the bug that causes kernel panics when accessing the corrupted files in question.

I don’t know whether kernel 3.14 will fix the bug that caused the corruption in the first place.

Conclusion

As I don't use quotas, BTRFS is working well for me on systems that have plenty of storage space and don't run Kmail. There are some systems running systemd where I plan to upgrade the kernel before all the filesystem space is allocated. One of my systems is currently running SysVinit so that I can balance the filesystem.

Apart from these issues BTRFS is working reasonably well for me. I haven't yet had its filesystem checksums correct corrupted data from disk in any situation other than tests (I have had ZFS correct such an error, so the hardware I use does benefit from this). I have restored data from BTRFS snapshots on many occasions, so that feature has been a major benefit for me. When I had a system with faulty RAM the internal checks in BTRFS alerted me to the problem and I didn't lose any data; the filesystem became read-only and I was able to copy everything off even though it was too corrupted for writes.
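
My snapshot use is nothing fancy, just read-only snapshots of the subvolumes I care about and copying files back out when needed. The subvolume, date, and file names below are only examples:

# btrfs subvolume snapshot -r /home /snapshots/home-2014-03-01
# cp -a /snapshots/home-2014-03-01/user/important.file /home/user/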

Cheap Bulk Storage

The Problem

Some of my clients need systems that store reasonable amounts of data. This is enough data that we can expect some data corruption on disk (so traditional RAID doesn't work), that old-fashioned filesystems like Ext3/4 will have unreasonable fsck times, and that the number of disks that fit in a small server isn't enough.

NetApp is a really good option for bulk reliable storage, but their products are very expensive. BTRFS has a lot of potential, but the currently released versions (as supported in distributions such as Debian/Wheezy) lack significant features. One significant omission in current BTRFS releases is something equivalent to the ZFS send/receive functionality for remote backups; this was a major factor when I analysed the options for hard drive based backup [1], and you should always think about backup before deploying a new system. Currently ZFS is the best choice for reliable bulk storage if you can't afford NetApp. Any storage system needs a minimum level of reliability if only to protect its own metadata, and a basic RAID array doesn't protect against media corruption at current data volumes. The combination of performance, lack of fsck (which is a performance feature), large storage support, backup, and significant real-world use makes ZFS a really good option.
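
For reference, ZFS incremental backups to a remote machine work along these lines; the pool names, dataset names, and remote host are just examples:

# zfs snapshot tank/data@2014-03-01
# zfs send tank/data@2014-03-01 | ssh backuphost zfs receive backup/data
# zfs snapshot tank/data@2014-03-08
# zfs send -i tank/data@2014-03-01 tank/data@2014-03-08 | ssh backuphost zfs receive backup/data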

Now I need to get some servers for more than 8.1TiB of storage (the capacity of a RAID-Z array of 4*3TB disks). One of my clients needs significantly more, probably at least 10 disks in a RAID-Z array so none of the cheaper servers will do.

Basically the issue that some of my clients are dealing with (and which I have to solve) is how to provide a relatively cheap ZFS system for storing reasonable amounts of data. For some systems I need to start with about 10 disks and be able to scale to 24 disks or more without excessive expense, as sketched below. Also, to make things a little easier and cheaper, 24*7 operation is not required, so instead of paying for hot-swap disks we can just schedule down-time outside business hours.
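
Growing a ZFS pool by adding whole RAID-Z2 vdevs fits that requirement reasonably well. As a sketch with purely hypothetical device names, a pool could start with one 12 disk RAID-Z2 vdev and gain a second when the first fills up:

# zpool create tank raidz2 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm
# zpool add tank raidz2 sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy
# zpool status tank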

The Problem with Dell

Dell is really good for small systems; the PowerEdge tower servers that support 2*3.5″ or 4*3.5″ disks and have space for an SSD or two are really affordable and easy to order. But even in the mid-size Dell tower servers (which are small by server standards) you have problems just getting a few disks operating outside a RAID array [2]. The Dell online store is really great for small servers: any time I'm buying a server for less than $2500 I check the Dell online store first, and usually their price is good enough that there is no need to get a quote from another company. Unfortunately all the servers with bigger storage involve disks that are unreasonably expensive (it seems that Dell makes their profit on the parts) and prices are not available online. I gave my email address and phone number to the Dell web site on Wednesday and they haven't bothered to get back to me yet. This is the type of service that makes me avoid IBM and HP for any server deployment where the Dell online store sells something suitable!

BackBlaze

For some time BackBlaze have been getting attention by describing how they store lots of data in a small amount of space by tightly stacking SATA disks. They don't think that ZFS on Linux is ready for production, but their hardware ideas are useful. They have recently described their latest architecture [3]. They describe it as 135TB for $7,384. Of course the 135TB number is based on the idea of getting the full 3TB capacity out of each disk, which they can do as they have redundancy over multiple storage pods. But anyone who wants a single fileserver needs some internal redundancy to cover disk failure. One option might be to have three RAID-Z2 arrays of 15 disks each, which gives a usable capacity of 39*3TB==117TB==106TiB. Note that while the ZFS documentation recommends between 3 and 9 disks per RAID-Z group for performance I don't expect performance problems; with only a gigabit Ethernet connection, three ZFS pools should still leave the network as the bottleneck.

For this option the way to go would be to start with an array of 15 disks and then buy a second set of 15 disks when the first storage pool becomes full. It seems likely that 4TB disks will become cheap before a 35TiB array is filled, so we can get more efficiency by delaying purchases. The BackBlaze pod isn't cheap; they are sold as a complete system without storage disks for $US5,395 by Protocase [4]. That gives a markup of $US3,411 over the BackBlaze cost, which isn't too bad given that BackBlaze are quoting the insane bulk discount hardware prices that I could never get. Protocase also offer the case on its own for anyone who wants to build a system around it. It seems like the better option is to buy the complete system from Protocase, but that would end up being over $6,000 when Australian import duty is added and probably close to $7,000 when shipping etc is included.

Norco

Norco offers a case that takes 24 hot-swap SATA/SAS disks and a regular PC motherboard for $US399 [5]. It's similar to the BackBlaze pod but smaller and cheaper, and there's no obvious option to buy a configured and tested system. 24 disks would allow two RAID-Z2 arrays of 12 disks; the first array could provide 27TiB and the second could provide something bigger when new disks are released.

SuperMicro

SuperMicro has a range of storage servers that support from 12 to 36 disks [6]. They seem good, but I'd have to deal with a reseller to buy them, which would involve pain at best; at worst they wouldn't bother getting me a quote because I only want one server at a time.

Conclusion

Does anyone know of any other options for affordable systems suitable for running ZFS on SATA disks? Preferably ones that don’t involve dealing with resellers.

At the moment it seems that the best option is to get a Norco case and build my own system, as I don't think that any of my clients needs the capacity of a BackBlaze pod yet. SuperMicro seems good but I'd have to deal with a reseller. In my experience the difference between the resellers of such computer systems and used car dealers is that used car dealers are happy to sell one car at a time and that every used car dealer at least knows how to drive.

Also if you are an Australian reader of my blog and you want to build such storage servers to sell to my clients in Melbourne then I’d be interested to see an offer. But please make sure that any such offer includes a reference to your contributions to the Linux community if you think I won’t recognise your name. If you don’t contribute then I probably don’t want to do business with you.

As an aside, I was recently at a camera store helping a client test a new DSLR when one of the store employees started telling me how good ZFS is for storing RAW images. I totally agree that ZFS is the best filesystem for storing large RAW files and this is what I am working on right now. But it’s not the sort of advice I expect to receive at a camera store, not even one that caters to professional photographers.

BTRFS and ZFS as Layering Violations

LWN has an interesting article comparing recent developments in the Linux world to the “Unix Wars” that essentially killed every proprietary Unix system [1]. I recommend reading it; it's probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it).

A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do.

The Benefits of Layers

I don't believe in the BTRFS/ZFS design as strongly as the commentator probably thinks. The current way my servers (and a huge number of other Linux systems) work, with software RAID combining a set of cheap disks into a reliable array for redundancy and often for capacity or performance as well, is a good thing. The filesystems sit on top of the RAID array, so I can fix the RAID without bothering about the filesystem(s) – and have done so in the past. I can also test the RAID array without involving any filesystem specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So Linux on a laptop is much the same as Linux on a server in terms of storage once we get past the issue of whether a single disk or a RAID array is used for the LVM PV; among other things this means that the same code paths are used and I'm less likely to encounter a bug when I install a new system.

LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try and fix it – without interfering with other filesystems.
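
As a rough sketch of that workflow (the volume group and LV names are just examples), the snapshot preserves the corrupted state in case the repair attempt makes things worse:

# umount /dev/vg0/mail
# lvcreate --snapshot --name mail-before-fsck --size 10G /dev/vg0/mail
# fsck.ext4 -f /dev/vg0/mail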

When using layered storage I can easily add or change layers when it’s appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data.

When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can’t separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem.

Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one member on an NBD device exported by another system. That's a bit unusual but it is apparently working well for some people. The internal RAID in ZFS and BTRFS doesn't support such things, and using software RAID underneath ZFS or BTRFS loses some features.

When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than their internal RAID costs ZFS and BTRFS some of their reliability features, it seems that no matter how you combine those filesystems with DRBD you lose something. Neither BTRFS nor ZFS appears to support a disconnected RAID mode (like a Linux software RAID with a write-intent bitmap so it can resync only the parts that changed), so it's not practical to use BTRFS or ZFS RAID-1 with an NBD device.
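
For comparison, the layered setup described in the update above can be put together with Linux software RAID along these lines; the host name, port, and device names are only examples, and older versions of nbd-client take the port as a positional argument as shown:

# nbd-client remotehost 10809 /dev/nbd0
# mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal /dev/sda2 /dev/nbd0
# mkfs.ext4 /dev/md0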

The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.
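
A minimal sketch of that combination, with a hypothetical pool name and an arbitrary size, is to create a zvol and point DRBD at the resulting block device:

# zfs create -V 500G tank/drbd0
# ls -l /dev/zvol/tank/drbd0

The zvol would then be named as the backing disk in the DRBD resource configuration and mkfs.ext4 run on the /dev/drbdN device once the resource is up.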

The Benefits of Integration

When RAID and the filesystem are separate things (with some added abstraction from LVM) it's difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise its operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load, and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance.

It would be possible to have a RAID driver that includes checksums for all blocks, it could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. Then to provide all the reliability benefits of ZFS you would at least need a filesystem that stores multiple copies of the data which would of course need checksums (because the filesystem could be used on a less reliable block device) and therefore you would end up with two checksums on the same data. Note that if you want to have a RAID array with checksums on all blocks then ZFS has a volume management feature (which is well described by Mark Round) [3]. Such a zvol could be used for a block device in a virtual machine and in an ideal world it would be possible to use one as swap space. But the zvol is apparently managed with all the regular ZFS mechanisms so it’s not a direct list of blocks on disk and thus can’t be extracted if there is a problem with ZFS.

Snapshots are an essential feature by today’s standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead.

The way that ZFS supports snapshots which inherit encryption keys is also interesting.

Conclusion

It's technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that puts checksums on all blocks. But it appears that there isn't much interest in developing such things. So while people would use it (and people are using ZFS zvols as block devices for other filesystems, as described in a comment on Mark Round's blog) it's probably not going to be implemented.

Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment.

ZFS on Linux isn’t a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that’s not possible when you have important Linux software that needs fast access to the storage.

ZFS vs BTRFS on Cheap Dell Servers

I previously wrote about my first experiences with BTRFS [1]. Since then I’ve been using BTRFS on more systems and have had good results. The main problem I want to address is with the reliability of RAID [2].

Requirements for a File Server

Now one of my clients needs a new fileserver. They need to reliably store terabytes of data (currently 6TB and growing) which mostly consists of data files in the 10MB – 15MB size range. The data files will almost never be re-written and I anticipate that the main bottleneck will be the latency of NFS and other network file sharing protocols. I would hope that saturating a GigE network when sending 10MB data files from SATA disks via NFS, AFS, or SMB wouldn't be a technical challenge.

It seems that BTRFS is the way of the future. But it's still rather new, and the lack of RAID-5 and RAID-6 is a serious issue when you need to store 10TB with today's technology (that would be 8*3TB disks for RAID-10 vs 5*3TB disks for RAID-5). Also the case of two disks entirely failing in a short period of time requires RAID-6 (or RAID-Z2 as the ZFS variant of RAID-6 is known). With BTRFS at its current stage of development it seems that to survive two disks failing you need to run BTRFS on top of another RAID-6 (maybe Linux software RAID-6), as sketched below. But for filesystems based on concepts similar to ZFS and BTRFS you want to have the filesystem run the RAID so that if a block fails the filesystem checksum then the correct copy can be reconstructed from parity.
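
That layered fallback would look roughly like the following with Linux software RAID-6 under a single-device BTRFS filesystem (the device names and mount point are examples only); it survives two disk failures but loses the ability to repair data checksum errors from a second copy:

# mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]
# mkfs.btrfs /dev/md0
# mount /dev/md0 /data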

ZFS seems to be a lot more complex than BTRFS. While having more features is a good thing (BTRFS seems to be missing some sysadmin friendly features at this stage) complexity means that I need to learn more and test more before going live.

But it seems that built-in RAID-5 and RAID-6 support is the deciding issue. Servers start becoming a lot more expensive if you want more than 8 disks, and even going past 6 disks is a significant price point. With 3TB disks an 8 disk RAID-6 gives something like 18TB of usable space vs 12TB on a RAID-10, and a 6 disk RAID-6 gives about 12TB vs 9TB on a RAID-10. With RAID-10 (i.e. BTRFS) my client couldn't use a 6 disk server such as the Dell PowerEdge T410 for $1500 as 9TB of usable storage isn't adequate, and the Dell PowerEdge T610, which can support 8 disks and costs $2100, would be barely adequate for the near future with only 12TB of usable storage. Dell does sell significantly larger servers such that any of my clients' needs could be covered by RAID-10, but in addition to costing more there are issues of power use and noise. When comparing a T610 and a T410 with a full set of disks the price difference is $1000 (assuming $200 per disk), which is probably worth paying to delay any future need for upgrades.

Buying Disks

The problem with the PowerEdge T610 server is that it uses hot-swap disks and the biggest disks available are 2TB for $586.30! 2TB*8 in RAID-6 gives 12TB of usable space for $4690.40! This compares poorly to the PowerEdge T410 which supports non-hot-swap disks so I can buy 6*3TB disks for something less than $200 each and get 12TB of usable space for $1200. If I could get hot-swap trays for Dell disks at a reasonable price then the T610 would be worth considering. But as 12TB of storage should do for at least the next 18 months it seems that the T410 is clearly the better option.

Does anyone know how to get cheap disk trays for Dell servers?

Implementation

In mailing list discussions some people suggest using Solaris or FreeBSD for a ZFS server. ZFS was designed for and implemented on Solaris, and FreeBSD was the first port. However Solaris and FreeBSD aren’t commonly used systems so it’s harder to find skilled people to work with them and there is less of a guarantee that the desired software will work. Among other things it’s really convenient to be able to run software for embedded Linux i386 systems on the server.

The first port of ZFS to Linux was based on FUSE [3]. This allows a clean separation of ZFS code from the Linux kernel code to avoid license issues, but it does have some performance problems. I don't think that I will have any performance issues on this server as the data files are reasonably large, are received via an ADSL link, and require quite a bit of CPU time to process when they are accessed. But ZFS-FUSE doesn't seem to be particularly popular.

The ZFS On Linux project provides source for a ZFS kernel module which you can compile and load [4]. As the module isn’t distributed with or statically linked to the kernel the license conflict of the CDDL ZFS code and the GPL Linux kernel code is apparently solved. I’ve read some positive reports from people who use this so it will be my preferred option.
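
A quick sanity check once the modules are built and installed is just to load them and confirm that the userspace tools can see the kernel side:

# modprobe zfs
# dmesg | grep ZFS
# zpool status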

Starting with BTRFS

Based on my investigation of RAID reliability [1] I have determined that BTRFS [2] is the Linux storage technology that has the best potential to increase data integrity without costing a lot of money. Basically a BTRFS internal RAID-1 should offer equal or greater data protection than RAID-6.

As BTRFS is so important and so very different from any prior technology for Linux it's not something that can be deployed in the same way as other filesystems. It is easy to switch between filesystems such as Ext4 and XFS because they work in much the same way: you have a single block device which the filesystem uses to provide a single mount-point. BTRFS, on the other hand, supports internal RAID, so it may span multiple block devices, and it may offer multiple mountable subvolumes and snapshots. Much of the functionality of Linux Software RAID and LVM is covered by BTRFS. So the sensible way to deploy BTRFS is to give it all your storage and not make use of any other RAID or LVM.

So I decided to do a test installation. I started with a Debian install CD that was made shortly before the release of Squeeze (it was the first to hand) and installed with BTRFS for the root filesystem; I then upgraded to Debian/Unstable to get the latest kernel as BTRFS is developing rapidly. The system failed on the first boot after upgrading to Unstable because the /etc/fstab entry for the root filesystem had the fsck pass number set to 1 – which wasn't going to work as no fsck program has been written. I changed that number to 0 and it then worked.
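
For anyone hitting the same boot failure the fix is just the final field of the fstab entry; the device name here is an example:

/dev/sda2 / btrfs defaults 0 0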

The initial install was on a desktop system that had a single IDE drive and a CD-ROM drive. For /boot I used a degraded RAID-1; after completing the installation I removed the CD-ROM drive and installed a second hard drive, and then it was easy to add the other device to the RAID-1. Then I tried to add a new device to the BTRFS filesystem with the command “btrfs device add /dev/sdb2 /dev/sda2” and was informed that it can't do that to a mounted filesystem! That will decrease the possibilities for using BTRFS on systems with hot-swap drives; I hope that the developers regard it as a bug.

Then I booted with an ext3 filesystem for root and tried the “btrfs device add /dev/sdb2 /dev/sda2” again but got the error message “btrfs: sending ioctl 5000940a to a partition!” which is not even found by Google.
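
For what it's worth, the documented syntax for adding a device takes the mount point of the filesystem rather than a second device node, so on later kernels I would expect something along these lines to work (the paths are examples):

# mount /dev/sda2 /mnt
# btrfs device add /dev/sdb2 /mnt
# btrfs filesystem balance /mnt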

The next thing that I wanted to do was to put a swap file on BTRFS; the benefits of having redundancy and checksums on swap space seem obvious – and other BTRFS features such as compression might give a benefit too. So I created a file by using dd to take data from /dev/zero, ran mkswap on it and then tried to run swapon. But I was told that the file has holes and can't be used. Automatically turning zero blocks into holes is a useful feature in many situations, but not in this case.
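
The sequence I tried was along these lines (the size is arbitrary):

# dd if=/dev/zero of=/swapfile bs=1M count=2048
# mkswap /swapfile
# swapon /swapfile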

So far my experience with BTRFS is that all the basic things work (i.e. storing files, directories, etc). But the advanced functions I wanted from BTRFS (mirroring and making a reliable swap space) failed. This is a bit disappointing, but BTRFS isn't described as being ready for production yet.