BTRFS Status July 2014

My last BTRFS status report was in April [1], it wasn’t the most positive report with data corruption and system hangs. Hacker News has a brief discussion of BTRFS which includes the statement “Russell Coker’s reports of his experiences with BTRFS give me the screaming heebie-jeebies, no matter how up-beat and positive he stays about it” [2] (that’s one of my favorite comments about my blog).

Since April things have worked better. Linux kernel 3.14 solves the worst problems I had with 3.13 and it’s generally doing everything I want it to do. I now have cron jobs making snapshots as often as I wish (as frequently as every 15 minutes on some systems), automatically removing snapshots (removing 500+ snapshots at once doesn’t hang the system), balancing, and scrubbing. The fact that I can now run a filesystem balance (a type of defragment operation for BTRFS that frees some “chunks”) from a cron job and expect the system not to hang means that I haven’t run out of metadata chunk space. I expect that running out of metadata space can still cause filesystem deadlocks, given the lack of reports on the BTRFS mailing list of fixes in that regard, but as long as balance works well we can work around that.
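
As a rough sketch of what those cron jobs look like, below is the sort of /etc/cron.d file I mean. The snapshot directory, naming scheme, and schedule are only examples and would need adjusting for your own systems (and /home has to be a subvolume for the snapshot command to work):

# /etc/cron.d/btrfs-maint (example only)
# read-only snapshot of the /home subvolume every 15 minutes
*/15 * * * * root /sbin/btrfs subvolume snapshot -r /home /home/.snapshots/$(date +\%Y\%m\%d-\%H\%M)
# weekly balance to free unused chunks
30 3 * * 0 root /sbin/btrfs balance start /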

My main workstation now has 35 days of uptime and my home server has 90 days of uptime. The server that stores my email now has 93 days of uptime even though it’s running Linux kernel 3.13.10. I am rather nervous about that server because in my experience every kernel before 3.14.1 had BTRFS problems that would cause system hangs, and I don’t want a server that’s an hour’s drive away to hang…

The server that runs my email is using kernel 3.13.10 because when I briefly tried a 3.14 kernel it didn’t work reliably with Xen 4.1 from Debian/Wheezy, and my choices were to use Xen 4.3 from Debian/Unstable to match the newer Linux kernel or to stay with an earlier Linux kernel. I have a couple of Xen servers running Debian/Unstable for test purposes which are working well, so I may upgrade my mail server to the latest Xen and Linux kernels from Unstable in the near future. But for the moment I’m just not doing many snapshots and never running a filesystem scrub on that server.

Scrubbing

In kernel 3.14 scrub is working reliably for me and I have cron jobs to scrub filesystems on every system running that kernel. So far I’ve never seen it report an error on a system that matters to me but I expect that it will happen eventually.
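
For reference the scrub cron jobs are as simple as something like the following (the schedule is just an example), and “btrfs scrub status /” shows the result of the last run at any time:

# /etc/cron.d/btrfs-scrub (example only), scrub the root filesystem every Sunday night
0 22 * * 0 root /sbin/btrfs scrub start -B -d /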

The paper “An Analysis of Data Corruption in the Storage Stack” from the University of Wisconsin (based on NetApp data) [3] shows that “nearline” disks (IE any disks I can afford) have an incidence of checksum errors (occasions when the disk returns bad data but claims it to be good) of about 0.42%. There are 18 disks running in systems I personally care about (as opposed to systems where I am paid to care), so with a 0.42% probability of a disk experiencing data corruption per year that gives a 7.3% probability of such corruption on at least one disk in any year, and a greater than 50% chance that it’s already happened over the last 10 years. Of the 18 disks in question 15 are currently running BTRFS, and of those 15, 10 are scrubbed regularly (the other 5 are in systems that don’t run 24*7 plus the server running kernel 3.13.10).
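
For anyone who wants to check the arithmetic, the probabilities above come from calculations like the following:

echo '1 - (1 - 0.0042)^18' | bc -l   # about .073, IE a 7.3% chance of corruption on at least one of 18 disks in a year
echo '1 - (1 - 0.073)^10' | bc -l    # about .53, IE more than a 50% chance over 10 years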

Newer Kernels

The discussion on the BTRFS mailing list about kernel 3.15 is mostly about hangs. This is correlated with some changes to improve performance, so I presume those changes have exposed race conditions. Based on those discussions I haven’t felt inclined to run a 3.15 kernel, and as the developers already have some good bug reports I don’t think that more testing from me would benefit me or the Linux community at this time.

I don’t have a personal interest in RAID-5 or RAID-6. The only systems I run that have more data than will fit on a RAID-1 array of cheap SATA disks are ones that I am paid to run – and they are running ZFS. So the ongoing development of RAID-5 and RAID-6 code isn’t an incentive for me to run newer kernels. Eventually I’ll test out RAID-6 code, but at the moment I don’t think they need more bug reports in this area.

I don’t have a great personal interest in filesystem performance at this time. There are some serious BTRFS performance issues. One problem is that a filesystem balance and subvolume removal seem to take excessive amounts of CPU time. Another is that there isn’t much support for balancing IO across multiple devices (in RAID-1 every process has all its read requests sent to one device). For large-scale use of a filesystem these are significant problems. But when you have basic requirements (such as a mail server for dozens of users or a personal workstation with a quad-core CPU and fast SSD storage) it doesn’t make much difference. Currently all of my systems which use BTRFS have storage hardware that exceeds the system performance requirements by such a large margin that nothing other than installing Debian packages can slow the system down. So while there are performance improvements in newer versions of the BTRFS kernel code that isn’t an incentive for me to upgrade.

It’s just been announced that Debian/Jessie will use Linux 3.16, so I guess I’ll have to test that a bit for the benefit of Debian users. I am concerned that 3.16 won’t be stable enough for typical users at the time that Jessie is released.

Why I Use BTRFS

I’ve just had to do yet another backup/format/restore operation on my workstation due to a BTRFS corruption problem, but as usual I didn’t lose any data. The BTRFS data integrity features work reasonably well even when the filesystem gets into a state where the kernel will only accept a read-only mount.

Given that the BTRFS tag on my blog is mostly about problems with BTRFS I think it’s time that I explain why I use it in spite of the problems before people start to worry about my sanity or competence.

The first thing to note is that BTRFS is fairly resilient to errors when mounted in read-only mode. When mounting a filesystem read-write there are a number of ways in which things can break, often because the kernel code can’t handle corrupt metadata – I don’t know how much of this is inherent to the design of BTRFS and how much is simply missing error handling in the filesystem code. Some of the errors that I have had weren’t entirely the fault of BTRFS: I twice had to do a backup/format/restore of my workstation due to a faulty DIMM corrupting memory (which has the potential to mess up any filesystem), but I still didn’t lose any data AFAIK.

The next thing to note is that I don’t use BTRFS when doing paid sysadmin work. ZFS is a solid and reliable filesystem and it is working really well for my clients while BTRFS has too many issues at the moment. As an aside I’m not interested in any comments about the ZFS license situation from anyone who’s not officially representing Oracle.

I also don’t use BTRFS on systems that I can’t access easily. The servers I have running BTRFS are all within an hour’s drive of home; while driving for an hour on account of a kernel or filesystem error is really annoying, it’s not as bad as dealing with a remote server where I have no direct access.

Reasons to Use BTRFS

The benefits of BTRFS right now are snapshots (which are good for a first-line backup) and the basic data integrity features. I’ve found these features to work well in real use.

According to the Comparison of File Systems page on Wikipedia, ZFS and BTRFS are the only general purpose filesystems (IE for disks, not tapes, NVRAM, or a cluster) that support checksums for all data as well as compression. Given that the ZFS license issues will never allow it to be included in the Linux kernel tree it seems clear that BTRFS is the next significant filesystem for Linux. More testing of BTRFS is a good thing; while there are a number of known problems that the developers are working on, more testing is needed now to find corner cases. We also need a lot of testing to find bugs related to interactions with other software.

I’ve recently filed bug reports against the Debian installer because it can’t install to a BTRFS RAID-1 (fortunately BTRFS supports changing to RAID-1 after installation) and because it doesn’t support formatting an existing BTRFS filesystem (the mkfs program needs a -f option in that case). I also sent in a patch for the magic database used by file(1) to provide more information on BTRFS filesystems (which is in Debian/testing but not Debian/Wheezy). These are the sorts of things you encounter when routinely using software that you don’t necessarily notice in basic testing.
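
For reference, formatting over an existing BTRFS filesystem and converting a fresh install to RAID-1 afterwards come down to something like the following (device names are examples):

# mkfs.btrfs refuses to overwrite an existing BTRFS filesystem without -f
mkfs.btrfs -f /dev/sdX1
# after installation, add a second device and convert data and metadata to RAID-1
btrfs device add /dev/sdY1 /
btrfs balance start -dconvert=raid1 -mconvert=raid1 /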

As an aside the Debian installation process failed at the GRUB step when I manually balanced a filesystem to use RAID-1 while the Debian installation was in progress. I didn’t file a bug report because the best advice is to not mess with filesystems while the installer is running. I’ll do a lot more testing of this when the Debian installer supports a BTRFS RAID-1 installation.

A final thing that we need to work on is developing sysadmin best practices and scripts for managing BTRFS filesystems. I’ve done some work on scripts to create snapshots for online backups but there are issues of managing free space etc. Working out how to best manage a new filesystem takes years because there are many corner cases you may only encounter after a system has been running for a long time. So I really wouldn’t want to be in the situation of using a new filesystem on an important server without having had practice running it on less important systems; I did that with ZFS and now have a hacky first install that I have to support for years.

BTRFS vs LVM

For some years LVM (the Linux Logical Volume Manager) has been used in most Linux systems. LVM allows one or more storage devices (disks, partitions, or RAID sets) to be assigned to a Volume Group (VG), portions of which can then be allocated to Logical Volumes (LVs) which can be used like any other block device; a VG can have many LVs.

One of the significant features of LVM is that you can create snapshots of an LV. One common use is to have multiple snapshots of an LV for online backups; another is to make a snapshot of a filesystem before making a backup to external storage, as the snapshot is unchanging so there’s no problem of inconsistencies due to backing up a changing data set. A snapshot will have the same filesystem label and UUID as the original, so you should always mount an LVM device by its name (which will be /dev/$VGNAME/$LVNAME).
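
For anyone who hasn’t used LVM snapshots, the commands are roughly as follows (the VG, LV, and mount point names are just examples):

# create a 5G snapshot of the LV "home" in VG "vg0" and mount it read-only for backup
lvcreate -s -L 5G -n home-snap /dev/vg0/home
mount -o ro /dev/vg0/home-snap /mnt/backup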

One of the problems with the ReiserFS filesystem was that there was no way to know whether a block of storage was a data block, a metadata block, or unused. A reiserfsck --rebuild-tree would find any blocks that appeared to be metadata and treat them as such: deleted files would reappear, and file contents which matched metadata (such as a file containing an image of a ReiserFS filesystem) would be treated as metadata. One of the impacts of this was that a hostile user could create a file which would create a SUID root program if the sysadmin ran a --rebuild-tree operation.

BTRFS solves the problem of filesystem images by using a filesystem specific UUID in every metadata block. One impact of this is that if you want to duplicate a BTRFS filesystem image and use both copies on the same system you need to regenerate all the checksums of metadata blocks with the new UUID. The way BTRFS works is that filesystems are identified by UUID so having multiple block devices with the same UUID causes the kernel to get confused. Making an LVM snapshot really isn’t a good idea in this situation. It’s possible to change BTRFS kernel code to avoid some of the problems of duplicate block devices and it’s most likely that something will be done about it in future. But it still seems like a bad idea to use LVM with BTRFS.

The most common use of LVM is to divide the storage of a single disk or RAID array for the use of multiple filesystems. Each filesystem can be enlarged (by extending the LV and making the filesystem use the space) and snapshots can be taken. With BTRFS you can use subvolumes for the snapshots, and the best use of BTRFS (IMHO) is to give it all the storage that’s available so there is no need to enlarge a filesystem in typical use. BTRFS supports quotas on subvolumes; they aren’t really usable yet but in the future they will remove the need to create multiple filesystems to control disk space use. An important but less common use of LVM is to migrate a live filesystem to a new disk or RAID array, but this can be done with BTRFS too by adding a new partition or disk to a filesystem and then removing the old one.
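
As a sketch, migrating a live BTRFS filesystem to a new disk is just the following (device names and mount point are examples); the delete operation moves all the data off the old device before removing it:

btrfs device add /dev/sdb1 /data
btrfs device delete /dev/sda1 /data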

It doesn’t seem that LVM offers any benefits when you use BTRFS. When I first experimented with BTRFS I used LVM but I didn’t find any benefit in using LVM and it was only a matter of luck that I didn’t use a snapshot and break things.

Snapshots of BTRFS Filesystems

One reason for creating a snapshot of a filesystem (as opposed to a snapshot of a subvolume) is for making backups of virtual machines without support from inside the virtual machine (EG running an old RHEL5 virtual machine that doesn’t have the BTRFS utilities). Another is for running training on virtual servers where you want to create one copy of the filesystem for each student. To solve both these problems I am currently using files in a BTRFS subvolume as filesystem images. The BTRFS kernel code won’t touch those files unless I create a loop device for them, so to avoid the duplicate UUID problem I only create a loop device for one image at a time.
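
Getting at one of those image files is something like the following (paths are examples), remembering to only have one image attached at a time:

losetup /dev/loop0 /xenstore/vm1-root
mount /dev/loop0 /mnt/vm1
# do whatever backup or copy is needed, then detach
umount /mnt/vm1
losetup -d /dev/loop0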

One tip for doing this: don’t use names such as /xenstore/vm1 for the files containing filesystem images, use names such as /xenstore/vm1-root. If you try to create a virtual machine named “vm1” then Xen will look for a file named “vm1” in the current directory before looking in /etc/xen and will try to use the filesystem image as a Xen configuration file. It would be nice if the search path for Xen configuration files either didn’t include the current directory or included it at the end of the list. Including the current directory in the path is a DOS mistake that should have gone away a long time ago.

Psychology and Block Devices

ZFS has a similar design to BTRFS in many ways and has some similar issues. But one benefit for ZFS is that it manages block devices in a “zpool”, first you create a zpool with the block devices and after that you can create ZFS filesystems or “ZVOL” block devices. I think that most sysadmins would regard a zpool as something similar to LVM (which may or may not be correct depending on how you look at it) and immediately rule out the possibility of running a zpool on LVM.

BTRFS looks like a regular Unix filesystem in many ways: you can have a single block device that you mount with the usual mount command. The fact that BTRFS can support multiple block devices in a RAID configuration isn’t so obvious, and the fact that it implements equivalents to most LVM functionality probably isn’t known to most people when they start using it. The most obvious way to start using BTRFS is to use it just like an Ext3/4 filesystem on an LV and to use LVM snapshots to back up data; this is made even more likely by the fact that there is a program to convert an ext2/3/4 filesystem to BTRFS. This seems likely to cause data loss.

Swap Breaking SSD

I’ve seen many comments about swap space and SSD claiming that swap will inherently destroy SSD through using too many writes. The latest was in the comments of my post about swap space and SSD performance [1]. Note that I’m not criticising the person who commented on my blog; everyone has heard lots of reports about possible problems that they avoid without analysing them in detail.

The first thing to note is that the quality of flash memory varies a lot, the chips that are used in SSDs for workstation/server use are designed to last while those in USB-flash devices aren’t. I’ve documented my unsuccessful experiments with using USB-flash for the root filesystem of a gateway server [2] (and the flash device that wasn’t used for swap died too).

The real issue when determining whether swap will break your SSD is the amount of writes. While swapping can do a lot of writing quickly, that usually doesn’t happen unless something has gone wrong. The workstation that I’m currently using has writes to the root filesystem outnumbering writes to swap by a factor of 130:1 (by volume of data written). On other days I’ve seen it as low as 42:1; in either case it’s writes to the root filesystem (which is BTRFS and includes /home etc) that will break the SSD if anything does. For some other workstations I run I see ratios of 201:1 (that’s with 8G of RAM), 57:1, and 23:1. In a quick search I couldn’t find a single system I run where even 10% of disk writes were attributed to swap. This really isn’t surprising given that adding RAM is a cheap way to improve the performance of most systems. If a SSD didn’t do any wear leveling (as is rumored to be the case with cheap USB flash devices) then swap use might still cause a problem because of the number of writes in a small area. But if that was the case then filesystem journals and other fixed data structures would be more likely to cause a problem – and any swap based breakage would break swap not the root filesystem (although if swap was the first partition then it might also break the MBR).
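
The write ratios above come from comparing the sectors-written counts for the relevant block devices; a quick way of checking the equivalent on your own systems is something like the following (sda1 being the root filesystem and sda2 being swap in this example):

awk '$3 == "sda1" || $3 == "sda2" {printf "%s %dMB written\n", $3, $10 * 512 / 1048576}' /proc/diskstats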

Of the workstations that are convenient to inspect the one with the most writes to the root filesystem and swap space (IE everything on /dev/sda) had 128G written per day for 1.2 days of uptime, but that involved running a filesystem balance and some torrent downloads (not the typical use). Among the two systems which had been running for more than 48 hours with typical use the most writes was 24G in a day. If a SSD can sustain 10,000 writes per block (which is smaller than quoted by most flash manufacturers nowadays) and has perfect wear leveling (which is unlikely) then the system with a 120G SSD and 128G written in a day could continue like that for almost 10,000 days or 27 years – much longer than storage is expected to work or be useful. So for workstation use (where even 24G of writes per day probably counts as heavy use) a 120G SSD that can sustain 10,000 writes per block shouldn’t be at great risk of wearing out.

Conclusion

I believe that swap is much less likely to break SSDs than regular file access on every system with a reasonable amount of RAM. For systems which don’t have enough RAM you probably want the speed of SSD for swap space anyway in spite of the risks.

If wear leveling works as designed and the 10,000+ writes per block claims are accurate then SSDs will massively outlast their useful life. Long before they wear out they will be too small, too slow, and probably not compatible. 26 years ago I had a 5.25″ full height ST-506 disk in my desktop PC; it wouldn’t physically fit in most systems I own now (unless I removed the DVD drive), there is no possibility of buying a controller (I don’t own a system with an ISA bus), and that disk was too slow and small by today’s standards. A 27yo SSD isn’t going to be useful for anything; even for archive storage it’s no good as no-one has tested flash for such long term storage and the data could decay before then.

BTRFS Status April 2014

Since my blog post about BTRFS in March [1] not much has changed for me. Until yesterday I was using 3.13 kernels on all my systems and dealing with the occasional kmail index file corruption problem.

Yesterday my main workstation ran out of disk space and went read-only. I started a BTRFS balance which didn’t seem to be doing any good because most of the space was actually in use so I deleted a bunch of snapshots. Then my X session aborted (some problem with KDE or the X server – I’ll never know as logs couldn’t be written to disk). I rebooted the system and had kernel threads go into infinite loops with repeated messages about a lack of response for 22 seconds (I should have photographed the screen). When it got into that state the ALT-Fn keys to change a virtual console sometimes worked but nothing else worked – the terminal usually didn’t respond to input.

To try and stop the kernel from entering an infinite loop on every boot I used “rootflags=skip_balance” on the kernel command line to stop it from continuing the balance, which made the system usable for a little longer. Unfortunately the skip_balance mount option doesn’t apply permanently; the kernel will keep trying to balance the filesystem on every mount until a “btrfs balance cancel” operation succeeds. But my attempts to cancel the balance always failed.

When I booted my system with skip_balance it would sometimes free some space from the deleted snapshots, and after two good runs I got to 17G free. But after that every time I rebooted it would report another Gig or two free (according to “btrfs filesystem df”) and then hang without committing the changes to disk.

I solved this problem by upgrading my USB rescue image to kernel 3.14 from Debian/Experimental and mounting the filesystem from the rescue image. After letting kernel 3.14 work on the filesystem for a while it was in a state where I could use it with kernel 3.13, and I could then boot the system normally and upgrade it to kernel 3.14.

I had a minor extra complication due to the fact that I was running “apt-get dist-upgrade” at the time the filesystem went read-only, so the dpkg records of which packages were installed were a bit messed up. But that was easy to fix by running a diff against /var/lib/dpkg/info on a recent snapshot. In retrospect I should have copied it from an old snapshot of the root filesystem, but I fixed the problems faster than I could think of better ways to fix them.

When running a balance the system had a peak IO rate of about 30MB/s reads and 30MB/s writes. That compares to a maximum contiguous file IO speed of 260MB/s for reads and 320MB/s for writes. During that time about 50% of the CPU time on my Q8400 quad-core CPU was in use. So far the only tasks that I do regularly which have CPU speed as a significant bottleneck are BTRFS filesystem balancing and recoding MP4 files. Compiling hasn’t been an issue because recently I haven’t been compiling many programs that are particularly big.

Lessons Learned

I should photograph the screen regularly when doing things that won’t be logged, those kernel error messages might have been useful to me or someone else.

The fact that the only kernel that runs BTRFS the way I need comes from the Experimental repository in Debian stands in contrast to the recent kernel patch that stops describing BTRFS as experimental. While I have a high opinion of the people who provide support for the kernel in commercial distributions and their ability to back-port fixes from newer kernels I’m concerned about their decision to support BTRFS. I’m also dubious about whether we can offer BTRFS support in Debian/Jessie (the next version of Debian) without a significant warning. OTOH if you find yourself with a BTRFS system that isn’t working well you could always hire me to fix it. I accept payment via Paypal, bank transfer, or Bitcoin. If you want to pay me in Grange then I assure you I will never forget about it. ;)

I thought that I wouldn’t have CPU speed issues when I started using the AMD64 architecture, and for most tasks that’s been the case. But for systems where storage is important I’ll look at getting faster CPUs because of BTRFS. Using faster CPUs for storage isn’t that uncommon (I used to work for SGI and dealt with some significant CPU power used for file serving), but needing a fast quad-core CPU to drive a single SSD is a little disappointing. While recovery from filesystem corner cases isn’t going to be particularly common it’s something that you want completed quickly; for personal systems you want to be doing something else and for work systems you don’t want down-time.

The BTRFS problems with running out of disk space are really serious. It seems that even workstations used at home can’t survive without monitoring. For any other filesystem used at home you can just let it get full and then delete stuff.

Include “rootflags=skip_balance” in the boot loader configuration for every system with a BTRFS root filesystem and in the /etc/fstab for every non-root BTRFS filesystem. I haven’t yet encountered a single situation where continuing the balance did any good or when it didn’t do any harm.
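
On a Debian system that means entries along the lines of the following (the UUID and mount point are placeholders):

# in /etc/default/grub (add to any existing options, then run update-grub)
GRUB_CMDLINE_LINUX="rootflags=skip_balance"
# in /etc/fstab for a non-root BTRFS filesystem
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /data btrfs defaults,skip_balance 0 0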

Swap Space and SSD

In 2007 I wrote a blog post about swap space [1]. The main point of that article was to debunk the claim that Linux needs a swap space twice as large as main memory (in summary, such advice is based on BSD Unix systems, has never applied to Linux, and most storage devices aren’t fast enough for a large swap space anyway). That post was picked up by Barrapunto (Spanish Slashdot) and became one of the most popular posts I’ve written [2].

In the past 7 years things have changed. Back then 2G of RAM was still a reasonable amount and 4G was a lot for a desktop system or laptop. Now there are even phones with 3G of RAM, 4G is about the minimum for any new desktop or laptop, and desktop/laptop systems with 16G aren’t that uncommon. Another significant development is the use of SSDs which dramatically improve speed for some operations (mainly seeks).

As SATA SSDs for desktop use start at about $110 I think it’s safe to assume that everyone who wants a fast desktop system has one. As a major limiting factor in swap use is the seek performance of the storage, the use of SSDs should allow greater swap use. My main desktop system has 4G of RAM (it’s an older Intel 64bit system and doesn’t support more) and has 4G of swap space on an Intel SSD. My work flow involves having dozens of Chromium tabs open at the same time; usually performance starts to drop when I get to about 3.5G of swap in use.

While SSDs generally have excellent random IO performance, their contiguous IO performance often isn’t much better than that of hard drives. My Intel SSDSC2CT12 300i 128G can do over 5000 random seeks per second but for sustained contiguous filesystem IO can only do 225M/s for writes and 274M/s for reads. The contiguous IO performance is less than twice as good as a cheap 3TB SATA disk. It also seems that the performance of SSDs isn’t as consistent as that of hard drives; when a hard drive delivers a certain level of performance it can generally do so 24*7, but a SSD will sometimes reduce performance to move blocks around (the erase block size is usually a lot larger than the filesystem block size).

It’s obvious that SSDs allow significantly better swap performance and therefore make it viable to run a system with more swap in use but that doesn’t allow unlimited swap. Even when using programs like Chromium (which seems to allocate huge amounts of RAM that aren’t used much) it doesn’t seem viable to have swap be much bigger than 4G on a system with 4G of RAM. Now I could buy another SSD and use two swap spaces for double the overall throughput (which would still be cheaper than buying a PC that supports 8G of RAM), but that still wouldn’t solve all problems.
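
If I did get a second SSD the way to have the kernel stripe swap across both devices is to give the two swap spaces equal priority in /etc/fstab, something like this (device names are examples):

/dev/sda2 none swap sw,pri=10 0 0
/dev/sdb2 none swap sw,pri=10 0 0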

One issue I have been having on occasion is BTRFS failing to allocate kernel memory when managing snapshots. I’m not sure if this would be solved by adding more RAM as it could be an issue of RAM fragmentation – I won’t file a bug report about this until some of the other BTRFS bugs are fixed. Another problem I have had is that when running Minecraft the driver for my ATI video card fails to allocate contiguous kernel memory; this is one that almost certainly wouldn’t be solved by just adding more swap – but it might be solved if I tweaked the kernel to be more aggressive about swapping out data.
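
The knob for making the kernel more aggressive about swapping is vm.swappiness; I haven’t tested whether it helps with this particular problem, but the change is as simple as:

sysctl -w vm.swappiness=80    # default is 60, higher values favor swapping application memory over dropping cache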

In 2007 when using hard drives for swap I found that the maximum space that could be used with reasonable performance for typical desktop operations was something less than 2G. Now with a SSD the limit for usable swap seems to be something like 4G on a system with 4G of RAM. On a system with only 2G of RAM that might allow the system to be usable with swap being twice as large as RAM, but with the amounts of RAM in modern PCs it seems that even SSD doesn’t allow using a swap space larger than RAM for typical use unless it’s being used for hibernation.

Conclusion

It seems that nothing has significantly changed in the last 7 years. We have more RAM, faster storage, and applications that are more memory hungry. The end result is that swap still isn’t very usable for anything other than hibernation if it’s larger than RAM.

It would be nice if application developers could stop increasing the use of RAM. Currently it seems that the RAM requirements for Linux desktop use are about 3 years behind the RAM requirements for Windows. This is convenient as a PC is fully depreciated according to the tax office after 3 years. This makes it easy to get 3 year old PCs cheaply (or sometimes for free as rubbish) which work really well for Linux. But it would be nice if we could be 4 or 5 years behind Windows in terms of hardware requirements to reduce the hardware requirements for Linux users even further.

Finding Corrupt Files that Cause a Kernel Error

There is a BTRFS bug in kernel 3.13 which is triggered by Kmail and causes Kmail index files to become seriously corrupt. Another bug in BTRFS causes a kernel GPF when an application tries to read such a file, which results in a SEGV being sent to the application. After that the kernel ceases to operate correctly for any files on that filesystem and no command other than “reboot -nf” (hard reset without flushing write-back caches) can be relied on to work correctly. The second bug should be fixed in Linux 3.14; I’m not sure about the first one.

In the meantime I have several systems running Kmail on BTRFS which have this problem.

(strace tar cf - . | cat > /dev/null) 2>&1 | tail

To discover which file is corrupt I run the above command after a reboot. Below is a sample of the typical output of that command, which shows that the file named “.trash.index” is corrupt. After discovering the file name I run “reboot -nf” and then delete the file (the file can be deleted on a clean system but not after a kernel GPF). Recently I’ve been doing this about once every 5 days, so on average each Kmail/BTRFS system has been getting disk corruption every two weeks. Fortunately every time the corruption has been on an index file so I haven’t needed to restore from backups.

newfstatat(4, ".trash.index", {st_mode=S_IFREG|0600, st_size=33, …}, AT_SYMLINK_NOFOLLOW) = 0
openat(4, ".trash.index", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_NOFOLLOW|O_CLOEXEC) = 5
fstat(5, {st_mode=S_IFREG|0600, st_size=33, …}) = 0
read(5,  <unfinished …>
+++ killed by SIGSEGV +++

BTRFS Status March 2014

I’m currently using BTRFS on most systems that I can access easily. It’s not nearly reliable enough that I want to install it on a server in another country or an embedded device that’s only accessible via 3G, but for systems where I can access the console it’s not doing too badly.

Balancing and Space Allocation

# btrfs filesystem df /
Data, single: total=103.97GiB, used=85.91GiB
System, DUP: total=32.00MiB, used=20.00KiB
Metadata, DUP: total=1.78GiB, used=1.31GiB
# df -h /
Filesystem                                              Size  Used Avail Use% Mounted on
/dev/disk/by-uuid/ac696117-473c-4945-a71e-917e09c6503c  108G   89G   19G  84% /

Currently there are still situations where BTRFS can run out of space and deadlock on freeing space. The above shows the output of the btrfs df command and the regular df command: I have about 106G of disk space used by data and metadata in BTRFS while df shows that the entire filesystem (IE the block device) is 108G. So if I use another 2G of data or metadata then the system is at risk of deadlocking. To avoid that happening I have to run “btrfs balance start /” to start a balance which defragments the space use and frees some blocks. Currently there is a bug in BTRFS (present in all Debian/Unstable kernels) which prevents a balance operation from completing when systemd is used in a default configuration (there’s something about the way systemd accesses its journal files that triggers a BTRFS bug). This is really inconvenient, particularly given that there’s probably a strong correlation between people who use experimental filesystems and people who use experimental init programs.
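
One thing that can help when the aim is just to reclaim mostly-empty chunks rather than rewrite everything is a balance with a usage filter, something like the following which only rewrites data chunks that are less than 50% used:

btrfs balance start -dusage=50 /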

When you get to the stage of the filesystem being deadlocked you can sometimes recover by removing snapshots and sometimes by adding a new device to the filesystem (even a USB flash drive will do). But I once had a filesystem get into a state where there wasn’t enough space to balance, add a device, or remove a snapshot – so I had to do a backup/format/restore.

Quota Groups

Last time I asked the developers (a few weeks ago) they told me that quota groups aren’t ready to use. They also said that they know about enough bugs that there’s no benefit in testing that feature. Even people who want to report bugs in BTRFS shouldn’t use quotas.

Kernel Panics with Kmail

I’ve had three systems develop filesystem corruption on files related to Kmail (the email program from KDE). I suspect that Kmail is triggering a bug in BTRFS. On all three systems the filesystem developed corruption that persisted across a reboot. One of the three systems was fixed by deleting the file for the Outbox, the others are waiting for kernel 3.14 which is supposed to fix the bug that causes kernel panics when accessing the corrupted files in question.

I don’t know whether kernel 3.14 will fix the bug that caused the corruption in the first place.

Conclusion

As I don’t use quotas BTRFS is working well for me on systems that have plenty of storage space and don’t run Kmail. There are some systems running systemd where I plan to upgrade the kernel before all the filesystem space is allocated. One of my systems is currently running SysVinit so I can balance the filesystem.

Apart from these issues BTRFS is working reasonably well for me. I haven’t yet had its filesystem checksums correct corrupted data from disk in any situation other than tests (I have had ZFS correct such an error, so the hardware I use does benefit from this). I have restored data from BTRFS snapshots on many occasions, so that feature has been a major benefit for me. When I had a system with faulty RAM the internal checks in BTRFS alerted me to the problem and I didn’t lose any data; the filesystem became read-only and I was able to copy everything off even though it was too corrupted for writes.

Cheap Bulk Storage

The Problem

Some of my clients need systems that store reasonable amounts of data. This is enough data that we can expect some data corruption on disk (more than traditional RAID can handle), that old fashioned filesystems like Ext3/4 will have unreasonable fsck times, and that the number of disks that fit in a small server won’t be enough.

NetApp is a really good option for bulk reliable storage, but their products are very expensive. BTRFS has a lot of potential, but the currently released versions (as supported in distributions such as Debian/Wheezy) lack significant features. One significant lack in current BTRFS releases is something equivalent to the ZFS send/receive functionality for remote backups; this was a major factor when I analysed the options for hard drive based backup [1], and you should always think about backup before deploying a new system. Currently ZFS is the best choice for bulk storage which is reliable if you can’t afford NetApp. Any storage system needs a minimum level of reliability, if only to protect its own metadata, and a basic RAID array doesn’t protect against media corruption with current data volumes. The combination of performance, lack of fsck (which is a performance feature), large storage support, backup, and significant real-world use makes ZFS a really good option.

Now I need to get some servers for more than 8.1TiB of storage (the capacity of a RAID-Z array of 4*3TB disks). One of my clients needs significantly more, probably at least 10 disks in a RAID-Z array so none of the cheaper servers will do.

Basically the issue that some of my clients are dealing with (and which I have to solve) is how to provide a relatively cheap ZFS system for storing reasonable amounts of data. For some systems I need to start with about 10 disks and be able to scale to 24 disks or more without excessive expense. Also to make things a little easier and cheaper 24*7 operation is not required, so instead of paying for hot-swap disks we can just schedule down-time outside business hours.

The Problem with Dell

Dell is really good for small systems: the PowerEdge tower servers that support 2*3.5″ or 4*3.5″ disks and which have space for an SSD or two are really affordable and easy to order. But even in the mid-size Dell tower servers (which are small by server standards) you have problems with just getting a few disks operating outside a RAID array [2]. The Dell online store is really great for small servers; any time I’m buying a server for less than $2500 I check the Dell online store first and usually their price is good enough that there is no need to get a quote from another company. Unfortunately all the servers with bigger storage involve disks that are unreasonably expensive (it seems that Dell makes their profit on the parts) and prices are not available online. I gave my email address and phone number to the Dell web site on Wednesday and they haven’t cared to get back to me yet. This is the type of service that makes me avoid IBM and HP for any server deployment where the Dell online store sells something suitable!

BackBlaze

For some time BackBlaze have been getting interest by describing how they store lots of data in a small amount of space by tightly stacking SATA disks. They don’t think that ZFS on Linux is ready for production, but their hardware ideas are useful. They have recently described their latest architecture [3]. They describe it as 135TB for $7,384. Of course the 135TB number is based on the idea of getting the full 3TB capacity out of each disk, which they can do as they have redundancy over multiple storage pods. But anyone who wants a single fileserver needs some internal redundancy to cover disk failure. One option might be to have three RAID-Z2 arrays of 15 disks each, which gives a usable capacity of 39*3TB == 117TB, or about 106TiB. Note that while the ZFS documentation recommends between 3 and 9 disks per zpool for performance I don’t expect performance problems; when you only have a gigabit Ethernet connection there shouldn’t be a problem with three ZFS zpools making the network the bottleneck.

For this option the way to go would be to start with an array of 15 disks and then buy a second set of 15 disks when the first storage pool becomes full. It seems likely that 4TB disks will become cheap before a 35TiB array is filled, so we can get more efficiency by delaying purchases. The BackBlaze pod isn’t cheap: it is sold as a complete system without storage disks for $US5,395 by Protocase [4]. That gives a markup of $US3,411 over the BackBlaze cost, which isn’t too bad given that BackBlaze are quoting the insane bulk discount hardware prices that I could never get. Protocase also offer the case on its own for anyone who wants to build a system around it. It seems like the better option is to buy the system from Protocase, but that would end up being over $6,000 when Australian import duty is added and probably close to $7,000 when shipping etc is included.

Norco

Norco offers a case that takes 24 hot-swap SATA/SAS disks and a regular PC motherboard for $US399 [5]. It’s similar to the BackBlaze pod but smaller, cheaper, and there’s no obvious option to buy a configured and tested system. 24 disks would allow two RAID-Z2 arrays of 12 disks; the first array could provide 27TiB and the second could provide something bigger when larger disks are released.

SuperMicro

SuperMicro has a range of storage servers that support from 12 to 36 disks [6]. They seem good, but I’d have to deal with a reseller to buy them which would involve pain at best and at worst they wouldn’t bother getting me a quote because I only want one server at a time.

Conclusion

Does anyone know of any other options for affordable systems suitable for running ZFS on SATA disks? Preferably ones that don’t involve dealing with resellers.

At the moment it seems that the best option is to get a Norco case and build my own system as I don’t think that any of my clients needs the capacity of a BackBlaze pod at the moment. Supermicro seems good but I’d have to deal with a reseller. In my experience the difference between the resellers of such computer systems and used car dealers is that used car dealers are happy to sell one car at a time and that every used car dealer at least knows how to drive.

Also if you are an Australian reader of my blog and you want to build such storage servers to sell to my clients in Melbourne then I’d be interested to see an offer. But please make sure that any such offer includes a reference to your contributions to the Linux community if you think I won’t recognise your name. If you don’t contribute then I probably don’t want to do business with you.

As an aside, I was recently at a camera store helping a client test a new DSLR when one of the store employees started telling me how good ZFS is for storing RAW images. I totally agree that ZFS is the best filesystem for storing large RAW files and this is what I am working on right now. But it’s not the sort of advice I expect to receive at a camera store, not even one that caters to professional photographers.

BTRFS and ZFS as Layering Violations

LWN has an interesting article comparing recent developments in the Linux world to the “Unix Wars” that essentially killed every proprietary Unix system [1]. The article is really interesting and I recommend reading it, it’s probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it).

A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do.

The Benefits of Layers

I don’t believe in the BTRFS/ZFS design as strongly as the commentator probably thinks. The current way my servers (and a huge number of other Linux systems) work – using RAID to form a reliable array out of a set of cheap disks for reliability and often capacity or performance – is a good thing. I have storage on top of the RAID array and can fix the RAID without bothering about the filesystem(s) – and have done so in the past. I can also test the RAID array without involving any filesystem specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So Linux on a laptop is much the same as Linux on a server in terms of storage once we get past the issue of whether a single disk or a RAID array is used for the LVM PV; among other things this means that the same code paths are used and I’m less likely to encounter a bug when I install a new system.

LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try and fix it – without interfering with other filesystems.

When using layered storage I can easily add or change layers when it’s appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data.

When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can’t separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem.

Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one device on NBD on another system. That’s a bit unusual but it is apparently working well for some people. The internal RAID on ZFS and BTRFS doesn’t support such things and using software RAID underneath ZFS or BTRFS loses some features.

When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than internal RAID loses reliability features for ZFS and BTRFS that means that no matter how you might implement those filesystems with DRBD it seems that you will lose somehow. It seems that neither BTRFS nor ZFS supports a disconnected RAID mode (like a Linux software RAID with a bitmap so it can resync only the parts that didn’t change) so it’s not possible to use BTRFS or ZFS RAID-1 with an NBD device.

The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.

The Benefits of Integration

When RAID and the filesystem are separate things (with some added abstraction from LVM) it’s difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise its operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load, and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance.

It would be possible to have a RAID driver that includes checksums for all blocks, it could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. Then to provide all the reliability benefits of ZFS you would at least need a filesystem that stores multiple copies of the data which would of course need checksums (because the filesystem could be used on a less reliable block device) and therefore you would end up with two checksums on the same data. Note that if you want to have a RAID array with checksums on all blocks then ZFS has a volume management feature (which is well described by Mark Round) [3]. Such a zvol could be used for a block device in a virtual machine and in an ideal world it would be possible to use one as swap space. But the zvol is apparently managed with all the regular ZFS mechanisms so it’s not a direct list of blocks on disk and thus can’t be extracted if there is a problem with ZFS.
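
Creating such a zvol is trivial, something like the following (the pool and volume names are examples, and on ZFS on Linux the block device appears under /dev/zvol/):

zfs create -V 10G tank/vol0
mkfs.ext4 /dev/zvol/tank/vol0    # or mkswap, or export it to a virtual machine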

Snapshots are an essential feature by today’s standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead.

The way that ZFS supports snapshots which inherit encryption keys is also interesting.

Conclusion

It’s technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that puts checksums on all blocks. But it appears that there isn’t much interest in developing such things. So while people would use it (and people are using ZFS zvols as block devices for other filesystems as described in a comment on Mark Round’s blog) it’s probably not going to be implemented.

Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment.

ZFS on Linux isn’t a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that’s not possible when you have important Linux software that needs fast access to the storage.