BTRFS Training

Some years ago Barwon South Water gave LUV 3 old 1RU Sun servers for any use related to free software. We gave one of those servers to the Canberra makerlab, another is used as the server for the LUV mailing lists and web site, and the 3rd server was put aside for training. The servers have hot-swap 15,000rpm SAS disks – IE disks that have a replacement cost greater than the budget we have for hardware. As we were given a spare 70G disk (and a 140G disk can replace a 70G disk) the LUV server has 2*70G disks, while the 140G disks (which can’t be replaced) are in the training server.

On Saturday I ran a BTRFS and ZFS training session for the LUV Beginners’ SIG. This was inspired by the amount of discussion of those filesystems on the mailing list and the amount of interest when we have lectures on those topics.

The training went well, the meeting was better attended than most Beginners’ SIG meetings, and the people who attended seemed to enjoy it. One thing that I will do better in future is to clearly document which commands are expected to fail and how to log in to the system. The users all logged in to accounts on a Xen server and then ssh’d to root on their DomU. I think that it would have saved a bit of time if I had aliased commands like “btrfs” to “echo you must login to your virtual server first” or made the shell prompt on the Dom0 include instructions to log in to the DomU.

Each user or group had a virtual machine. The server has 32G of RAM and I ran 14 virtual servers that each had 2G of RAM. In retrospect I should have configured fewer servers and asked people to work in groups; that would have allowed more RAM for each virtual server and also more RAM for the Dom0. The Dom0 was running a BTRFS RAID-1 filesystem and each virtual machine had a snapshot of the block devices from my master image for the training. Performance was quite good initially as the OS image was shared and fit into cache, but when many users were corrupting and scrubbing filesystems performance became very poor. The disks performed well (sustaining over 100 writes per second) but that’s not much when shared between 14 active users.

The ZFS part of the tutorial was based on RAID-Z (I didn’t use RAID-5/6 in BTRFS because it’s not ready to use and didn’t use RAID-1 in ZFS because most people want RAID-Z). Each user had 5*4G virtual disks (2 for the OS and 3 for BTRFS and ZFS testing). By the end of the training session there was about 76G of storage used in the filesystem (including the space used by the OS for the Dom0), so each user had something like 5G of unique data.

We are now considering what other training we can run on that server. I’m thinking of running training on DNS and email. Suggestions for other topics would be appreciated. For training that’s not disk intensive we could run many more than 14 virtual machines, 60 or more should be possible.

Below are the notes from the BTRFS part of the training; anyone could do this on their own if they substitute 2 empty partitions for /dev/xvdd and /dev/xvde. On a Debian/Jessie system all that you need to do to get ready for this is to install the btrfs-tools package. Note that this does have some risk if you make a typo. An advantage of doing this sort of thing in a virtual machine is that there’s no possibility of breaking things that matter.

  1. Making the filesystem
    1. Make the filesystem; this creates a filesystem that spans 2 devices (note that you must use the -f option if there was already a filesystem on those devices):
      mkfs.btrfs /dev/xvdd /dev/xvde
    2. Use file(1) to see basic data from the superblocks:
      file -s /dev/xvdd /dev/xvde
    3. Mount the filesystem (can mount either block device, the kernel knows they belong together):
      mount /dev/xvdd /mnt/tmp
    4. See a BTRFS df of the filesystem, which shows what type of RAID is used:
      btrfs filesystem df /mnt/tmp
    5. See more information about FS device use:
      btrfs filesystem show /mnt/tmp
    6. Balance the filesystem to change it to RAID-1 and verify the change (note that some parts of the filesystem were single and RAID-0 before this change):
      btrfs balance start -dconvert=raid1 -mconvert=raid1 -sconvert=raid1 --force /mnt/tmp
      btrfs filesystem df /mnt/tmp
    7. See if there are any errors, shouldn’t be any (yet):
      btrfs device stats /mnt/tmp
    8. Copy some files to the filesystem:
      cp -r /usr /mnt/tmp
    9. Check the filesystem for basic consistency (only checks checksums):
      btrfs scrub start -B -d /mnt/tmp
  2. Online corruption
    1. Corrupt the filesystem:
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=2000 seek=50
    2. Scrub again, should give a warning about errors:
      btrfs scrub start -B /mnt/tmp
    3. Check error count:
      btrfs device stats /mnt/tmp
    4. Corrupt it again:
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=2000 seek=50
    5. Unmount it:
      umount /mnt/tmp
    6. In another terminal follow the kernel log:
      tail -f /var/log/kern.log
    7. Mount it again and observe it correcting errors on mount:
      mount /dev/xvdd /mnt/tmp
    8. Run a diff, observe kernel error messages and observe that diff reports no file differences:
      diff -ru /usr /mnt/tmp/usr/
    9. Run another scrub, this will probably correct some errors which weren’t discovered by diff:
      btrfs scrub start -B -d /mnt/tmp
  3. Offline corruption
    1. Unmount the filesystem, corrupt the start of it, then try mounting it again; this will fail because the superblocks were wiped:
      umount /mnt/tmp
      dd if=/dev/zero of=/dev/xvdd bs=1024k count=200
      mount /dev/xvdd /mnt/tmp
      mount /dev/xvde /mnt/tmp
    2. Note that the filesystem was not mountable due to the lack of a superblock. It might be possible to recover from this, but that’s more advanced so we will restore the RAID instead.
      Mount the filesystem in degraded RAID mode, which allows full operation:
      mount /dev/xvde /mnt/tmp -o degraded
    3. Add /dev/xvdd back to the RAID:
      btrfs device add /dev/xvdd /mnt/tmp
    4. Show the filesystem devices, observe that xvdd is listed twice, the missing device and the one that was just added:
      btrfs filesystem show /mnt/tmp
    5. Remove the missing device and observe the change:
      btrfs device delete missing /mnt/tmp
      btrfs filesystem show /mnt/tmp
    6. Balance the filesystem; I’m not sure this is necessary, but it’s good practice to do it when in doubt:
      btrfs balance start /mnt/tmp
    7. Unmount and mount it again; note that the degraded option is not needed:
      umount /mnt/tmp
      mount /dev/xvdd /mnt/tmp
  4. Experiment
    1. Experiment with the “btrfs subvolume create” and “btrfs subvolume delete” commands (which act like mkdir and rmdir).
    2. Experiment with “btrfs subvolume snapshot SOURCE DEST” and “btrfs subvolume snapshot -r SOURCE DEST” for creating regular and read-only snapshots of other subvolumes (including the root); example commands are shown below.
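
A possible sequence for these experiments (the subvolume and snapshot names used here are arbitrary):

      btrfs subvolume create /mnt/tmp/data
      cp -r /etc /mnt/tmp/data/
      btrfs subvolume snapshot /mnt/tmp/data /mnt/tmp/data-snap
      btrfs subvolume snapshot -r /mnt/tmp/data /mnt/tmp/data-ro-snap
      btrfs subvolume list /mnt/tmp
      btrfs subvolume delete /mnt/tmp/data-snap
      btrfs subvolume delete /mnt/tmp/data-ro-snap
      btrfs subvolume delete /mnt/tmp/data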

BTRFS Status June 2015

The version of btrfs-tools in Debian/Jessie is incapable of creating a filesystem that can be mounted by the kernel in Debian/Wheezy. If you want to use a BTRFS filesystem on Jessie and Wheezy (which isn’t uncommon with removable devices) the only options are to use the Wheezy version of mkfs.btrfs or to use a Jessie kernel on Wheezy. I recently got bitten by this issue when I created a BTRFS filesystem on a removable device with a lot of important data (which is why I wanted metadata duplication and checksums) and had to read it on a server running Wheezy. Fortunately KVM in Wheezy works really well so I created a virtual machine to read the disk. Setting up a new KVM isn’t that difficult, but it’s not something I want to do while a client is anxiously waiting for their data.

BTRFS has been working well for me apart from the Jessie/Wheezy compatibility issue (which was an annoyance but didn’t stop me doing what I wanted). I haven’t written a BTRFS status report for a while because everything has been OK and there has been nothing exciting to report.

I regularly get errors from the cron jobs that run a balance, claiming that the filesystem has run out of free space. I have the cron jobs due to past problems with BTRFS running out of metadata space. In spite of the jobs often failing the systems keep working, so I’m not too worried at the moment. I think this is a bug, but there are many more important bugs.

Linux kernel version 3.19 was the first version to have working support for RAID-5 recovery, which means it was the first version to have usable RAID-5 (I think there is no point even having RAID-5 without recovery). It wouldn’t be prudent to trust your important data to a new feature in a filesystem. So at this stage if I needed a very large scratch space then BTRFS RAID-5 might be a viable option, but for anything else I wouldn’t use it. BTRFS has still had little performance optimisation; while this doesn’t matter much for SSDs or single-disk filesystems, for a RAID-5 of hard drives it would probably hurt a lot. Maybe BTRFS RAID-5 would be good for a scratch array of SSDs. The reports of problems with RAID-5 don’t surprise me at all.

I have a BTRFS RAID-1 filesystem on 2*4TB disks which is giving poor performance on metadata; simple operations like “ls -l” on a directory with ~200 subdirectories take many seconds to run. I suspect that part of the problem is due to the filesystem being written by cron jobs with files accumulating over more than a year. The “btrfs filesystem” command (see btrfs-filesystem(8)) allows defragmenting files and directory trees, but unfortunately it doesn’t support recursively defragmenting directories without also defragmenting the files in them. I really wish there was a way to get BTRFS to put all metadata on SSD and all data on hard drives. Sander suggested the following command on the BTRFS mailing list to defragment directories:

find / -xdev -type d -execdir btrfs filesystem defrag -c {} +

Below is the output of “zfs list -t snapshot” on a server I run; it’s often handy to know how much space is used by snapshots, but unfortunately BTRFS has no support for this.

NAME USED AVAIL REFER MOUNTPOINT
hetz0/be0-mail@2015-03-10 2.88G 387G
hetz0/be0-mail@2015-03-11 1.12G 388G
hetz0/be0-mail@2015-03-12 1.11G 388G
hetz0/be0-mail@2015-03-13 1.19G 388G

Hugo pointed out on the BTRFS mailing list that the following command will give the amount of space used for snapshots. $SNAPSHOT is the name of a snapshot and $LASTGEN is the generation number of the previous snapshot you want to compare with.

btrfs subvolume find-new $SNAPSHOT $LASTGEN | awk '{total = total + $7}END{print total}'

One upside of the BTRFS implementation in this regard is that the above btrfs command, when not piped through awk, shows you the names of the files that are being written and the amount of data written to them. By casually examining this output I discovered that the most written files in my home directory were under the “.cache” directory (which wasn’t exactly a surprise).

Now I am configuring workstations with a separate subvolume for ~/.cache for the main user. This means that ~/.cache changes don’t get stored in the hourly snapshots and less disk space is used for snapshots.
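
For reference, converting an existing ~/.cache directory to a subvolume goes roughly like this (a sketch assuming the home directory is on the BTRFS filesystem and the user is called “user”; because snapshots aren’t recursive, the nested subvolume is automatically excluded from snapshots of its parent):

mv /home/user/.cache /home/user/.cache-old
btrfs subvolume create /home/user/.cache
chown user:user /home/user/.cache
chmod 700 /home/user/.cache
rm -rf /home/user/.cache-old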

Conclusion

My observation is that things are going quite well with BTRFS. It’s more than 6 months since I had a noteworthy problem, which is pretty good for a filesystem that’s still under active development. But I still run many systems that could benefit from the data integrity features of ZFS and BTRFS, yet don’t have the resources to run ZFS and need more reliability than I can expect from an unattended BTRFS system.

At this time the only servers I run with BTRFS are located within a reasonable drive of my home (not the servers in Germany and the US) and are easily accessible (not the embedded systems). ZFS is working well for some of the servers in Germany. Eventually I’ll probably run ZFS on all the hosted servers in Germany and the US; I expect that will happen before I’m comfortable running BTRFS on such systems. For the embedded systems I will just take the risk of data loss/corruption for the next few years.

BTRFS Status Dec 2014

My last problem with BTRFS was in August [1]. BTRFS has been running mostly uneventfully for me for the last 4 months, that’s a good improvement but the fact that 4 months of no problems is noteworthy for something as important as a filesystem is a cause for ongoing concern.

A RAID-1 Array

A week ago I had a minor problem with my home file server: one of the 3TB disks in the BTRFS RAID-1 started giving read errors. That’s not a big deal, I bought a new disk and did a “btrfs replace” operation which was quick and easy. The first annoyance was that the output of “btrfs device stats” reported an error count for the new device; it seems that “btrfs replace” copies everything from the old disk including the error count. The solution is to use “btrfs device stats -z” to reset the count after replacing a device.
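
The sequence was something like the following (the device names and mount point here are examples, not the ones I actually used):

btrfs replace start -B /dev/old-disk /dev/new-disk /mnt/raid
btrfs device stats /mnt/raid      # the error count from the old disk shows up here
btrfs device stats -z /mnt/raid   # reset the counters after the replacement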

I replaced the 3TB disk with a 4TB disk (with current prices it doesn’t make sense to buy a new 3TB disk). As I was running low on disk space I added a 1TB disk to give it 4TB of RAID-1 capacity; one of the nice features of BTRFS is that a RAID-1 filesystem can support any combination of disks and use them to store 2 copies of every block of data. I started running a btrfs balance to get BTRFS to try and use all the space before learning from the mailing list that I should have run “btrfs filesystem resize” to make it use all the space. So my balance operation had arranged the data for 2*3TB+1*1TB of disks, which wasn’t the right layout once the full capacity of the 4TB disk was available. To make it even more annoying the “btrfs filesystem resize” command takes a “devid” not a device name.
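
For anyone else hitting this, the fix is along these lines (the devid of 2 and the mount point are examples; get the real devid from “btrfs filesystem show”):

btrfs filesystem show /mnt/raid
btrfs filesystem resize 2:max /mnt/raid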

I think that when BTRFS is more stable it would be good to have the btrfs utility warn the user about such potential mistakes. When a replacement device is larger than the old one it will be very common to want to use that space. The btrfs utility could easily suggest the most likely “btrfs filesystem resize” to make things easier for the user.

In a disturbing coincidence a few days after replacing the first 3TB disk the other 3TB disk started giving read errors. So I replaced the second 3TB disk with a 4TB disk and removed the 1TB disk to give a 4TB RAID-1 array. This is when it would be handy to have the metadata duplication feature and copies= option of ZFS.

Ctree Corruption

2 weeks ago a basic workstation with a 120G SSD owned by a relative stopped booting, the most significant errors it gave were “BTRFS: log replay required on RO media” and “BTRFS: open_ctree failed”. The solution to this is to run the command “btrfs-zero-log”, but that initially didn’t work. I restored the system from a backup (which was 2 months old) and took the SSD home to work on it. A day later “btrfs-zero-log” worked correctly and I recovered all the data. Note that I didn’t even try mounting the filesystem in question read-write, I mounted it read-only to copy all the data off. While in theory the filesystem should have been OK I didn’t have a need to keep using it at that time (having already wiped the original device and restored from backup) and I don’t have confidence in BTRFS working correctly in that situation.
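
A sketch of the recovery process described above (the device name and destination path are examples):

btrfs-zero-log /dev/sdb
mount -o ro /dev/sdb /mnt/recovery
cp -a /mnt/recovery/. /somewhere/safe/
umount /mnt/recovery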

While it was nice to get all the data back it’s a concern when commands don’t operate consistently.

Debian and BTRFS

I was concerned when the Debian kernel team chose 3.16 as the kernel for Jessie (the next Debian release). Judging by the way development has been going I wasn’t confident that 3.16 would turn out to be stable enough for BTRFS. But 3.16 is working reasonably well on a number of systems so it seems that it’s likely to work well in practice.

But I’m still deploying more ZFS servers.

The Value of Anecdotal Evidence

When evaluating software based on reports from reliable sources (IE most readers will trust me to run systems well and only report genuine bugs) bad reports have a much higher weight than good reports. The fact that I’ve seen kernel 3.16 work reasonably well on ~6 systems is nice but that doesn’t mean it will work well on thousands of other systems – although it does indicate that it will work well on more systems than some earlier Linux kernels which had common BTRFS failures.

But the annoyances I had with the 3TB array are repeatable and will annoy many other people. The ctree corruption problem MIGHT have been initially caused by a memory error (it’s a desktop machine without ECC RAM) but the recovery process was problematic and other users might expect problems in such situations.

More BTRFS Fun

I wrote a BTRFS status report yesterday commenting on the uneventful use of BTRFS recently [1].

Early this morning the server that stores my email (which had 93 days uptime) had a filesystem related problem. The root filesystem became read-only and then the kernel message log filled with unrelated messages so there was no record of the problem. I’m now considering setting up rsyslogd to log the kernel messages to a tmpfs filesystem to cover such problems in future. As RAM is so cheap it wouldn’t matter if a few megs of RAM were wasted by that in normal operation if it allowed me to extract useful data when something goes really wrong. It’s really annoying to have a system in a state where I can login as root but not find out what went wrong.
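
A minimal sketch of what I have in mind (the mount point and file names are hypothetical, this isn’t something I’ve deployed yet):

# /etc/fstab – a small tmpfs for volatile logs
tmpfs  /var/log/volatile  tmpfs  size=16m,mode=0750  0  0

# /etc/rsyslog.d/kern-volatile.conf – copy kernel messages to the tmpfs
kern.*  /var/log/volatile/kern.log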

After that I tried 2 kernels in the 3.14 series, both of which had kernel BUG assertions related to Xen networking and failed to provide working networking, so I filed Debian Bug #756714. Fortunately they at least had enough uptime for me to run a filesystem scrub which reported no errors.

Then I reverted to kernel 3.13.10 but the reboot to apply that kernel change failed. Systemd was unable to unmount the root filesystem (maybe because of a problem with Xen) and then hung the system instead of rebooting, so I filed Debian Bug #756725. I believe that if asked to reboot a system there is no benefit in hanging the system with no user space processes accessible. Here are some useful things that systemd could have done:

  1. Just reboot without umounting (like “reboot -nf” does).
  2. Pause for some reasonable amount of time to give the sysadmin a possibility of seeing the error and then rebooting.
  3. Go back to a regular runlevel, starting daemons like sshd.
  4. Offer a login prompt to allow the sysadmin to login as root and diagnose the problem.

Options 1, 2, and 3 would have saved me a bit of driving. Option 4 would have allowed me to at least diagnose the problem (which might be worth the drive).

Having a system on the other side of the city which has no remote console access just hang after a reboot command is not useful, it would be near the top of the list of things I don’t want to happen in that situation. The best thing I can say about systemd’s operation in this regard is that it didn’t make the server catch fire.

Now all I really know is that 3.14 kernels won’t work for my server, 3.13 will cause problems that no-one can diagnose due to lack of data, and I’m now going to wait for it to fail again. As an aside the server has ECC RAM and its hardware is known to be good, so I’m sure that BTRFS is at fault.

BTRFS Status July 2014

My last BTRFS status report was in April [1], it wasn’t the most positive report with data corruption and system hangs. Hacker News has a brief discussion of BTRFS which includes the statement “Russell Coker’s reports of his experiences with BTRFS give me the screaming heebie-jeebies, no matter how up-beat and positive he stays about it” [2] (that’s one of my favorite comments about my blog).

Since April things have worked better. Linux kernel 3.14 solves the worst problems I had with 3.13 and it’s generally doing everything I want it to do. I now have cron jobs making snapshots as often as I wish (as frequently as every 15 minutes on some systems), automatically removing snapshots (removing 500+ snapshots at once doesn’t hang the system), balancing, and scrubbing. The fact that I can now run a filesystem balance (which is a type of defragment operation for BTRFS that frees some “chunks”) from a cron job and expect the system not to hang means that I don’t run out of metadata chunk space. I expect that running out of metadata space can still cause filesystem deadlocks, given the lack of reports on the BTRFS mailing list of fixes in that regard, but as long as balance works well we can work around that.
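
As an example, the cron jobs are along these lines (a simplified sketch, the paths and schedules are illustrative rather than my exact configuration; snapshot destinations must be on the same BTRFS filesystem):

# /etc/cron.d/btrfs-maintenance
0 * * * *  root  btrfs subvolume snapshot -r / /snapshots/root-$(date +\%Y\%m\%d-\%H\%M)
0 3 * * 0  root  btrfs balance start -dusage=50 -musage=50 /
0 4 1 * *  root  btrfs scrub start -B /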

My main workstation now has 35 days of uptime and my home server has 90 days of uptime. Also the server that stores my email now has 93 days uptime even though it’s running Linux kernel 3.13.10. I am rather nervous about the server running 3.13.10 because in my experience every kernel before 3.14.1 had BTRFS problems that would cause system hangs. I don’t want a server that’s an hour’s drive away to hang…

The server that runs my email is using kernel 3.13.10 because when I briefly tried a 3.14 kernel it didn’t work reliably with Xen 4.1 from Debian/Wheezy, and I had a choice of using Xen 4.3 from Debian/Unstable to match the Linux kernel or using an earlier Linux kernel. I have a couple of Xen servers running Debian/Unstable for test purposes which are working well, so I may upgrade my mail server to the latest Xen and Linux kernels from Unstable in the near future. But for the moment I’m just not doing many snapshots and never running a filesystem scrub on that server.

Scrubbing

In kernel 3.14 scrub is working reliably for me and I have cron jobs to scrub filesystems on every system running that kernel. So far I’ve never seen it report an error on a system that matters to me but I expect that it will happen eventually.

The paper “An Analysis of Data Corruption in the Storage Stack” from the University of Wisconsin (based on NetApp data) [3] shows that “nearline” disks (IE any disks I can afford) have an incidence of checksum errors (occasions when the disk returns bad data but claims it to be good) of about 0.42%. There are 18 disks running in systems I personally care about (as opposed to systems where I am paid to care), so with a 0.42% probability of a disk experiencing data corruption per year that would give a 7.3% probability of having such corruption on at least one disk in any year and a greater than 50% chance that it’s already happened over the last 10 years. Of the 18 disks in question 15 are currently running BTRFS. Of the 15 running BTRFS 10 are scrubbed regularly (the other 5 are systems that don’t run 24*7 and the system running kernel 3.13.10).
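
Spelling out that arithmetic (assuming independent corruption events at the quoted rate):

1 - (1 - 0.0042)^{18} ≈ 0.073   (chance of at least one corrupt disk in a year)
1 - (1 - 0.073)^{10} ≈ 0.53    (chance over 10 years)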

Newer Kernels

The discussion on the BTRFS mailing list about kernel 3.15 is mostly about hangs. This is correlated with some changes to improve performance so I presume that it has exposed race conditions. Based on those discussions I haven’t felt inclined to run a 3.15 kernel. As the developers already have some good bug reports I don’t think that I could provide any benefit by doing more testing at this time. I think that there would be no benefit to me personally or the Linux community in testing 3.15.

I don’t have a personal interest in RAID-5 or RAID-6. The only systems I run that have more data than will fit on a RAID-1 array of cheap SATA disks are ones that I am paid to run – and they are running ZFS. So the ongoing development of RAID-5 and RAID-6 code isn’t an incentive for me to run newer kernels. Eventually I’ll test out RAID-6 code, but at the moment I don’t think they need more bug reports in this area.

I don’t have a great personal interest in filesystem performance at this time. There are some serious BTRFS performance issues. One problem is that a filesystem balance and subtree removal seem to take excessive amounts of CPU time. Another is that there isn’t much support for balancing IO to multiple devices (in RAID-1 every process has all its read requests sent to one device). For large-scale use of a filesystem these are significant problems. But when you have basic requirements (such as a mail server for dozens of users or a personal workstation with a quad-core CPU and fast SSD storage) it doesn’t make much difference. Currently all of my systems which use BTRFS have storage hardware that exceeds the system performance requirements by such a large margin that nothing other than installing Debian packages can slow the system down. So while there are performance improvements in newer versions of the BTRFS kernel code that isn’t an incentive for me to upgrade.

It’s just been announced that Debian/Jessie will use Linux 3.16, so I guess I’ll have to test that a bit for the benefit of Debian users. I am concerned that 3.16 won’t be stable enough for typical users at the time that Jessie is released.

Why I Use BTRFS

I’ve just had to do yet another backup/format/restore operation on my workstation due to a BTRFS corruption problem, but as usual I didn’t lose any data. The BTRFS data integrity features work reasonably well even when the filesystem gets into a state where the kernel will only accept a read-only mount.

Given that the BTRFS tag on my blog is mostly about problems with BTRFS I think it’s time that I explain why I use it in spite of the problems before people start to worry about my sanity or competence.

The first thing to note is that BTRFS is fairly resilient toward errors when mounted in read-only mode. When mounting a filesystem read-write there are a number of ways in which things can break, which are often due to kernel code not being able to handle corrupt metadata – I don’t know how much of this is inherent to the design of BTRFS and how much is simply missing features in filesystem error handling. Some of the errors that I have had weren’t entirely the fault of BTRFS: I twice had to do a backup/format/restore of my workstation due to a faulty DIMM corrupting memory (which has the potential to mess up any filesystem), but I still didn’t lose any data AFAIK.

The next thing to note is that I don’t use BTRFS when doing paid sysadmin work. ZFS is a solid and reliable filesystem and it is working really well for my clients while BTRFS has too many issues at the moment. As an aside I’m not interested in any comments about the ZFS license situation from anyone who’s not officially representing Oracle.

I also don’t use BTRFS on systems that I can’t access easily. The servers I have running BTRFS are all within an hour’s drive of home; while driving for an hour on account of a kernel or filesystem error is really annoying, it’s not as bad as dealing with a remote server where I have no direct access.

Reasons to Use BTRFS

The benefits of BTRFS right now are snapshots (which are good for a first-line backup) and the basic data integrity features. I’ve found these features to work well in real use.

According to the Comparison of File Systems page on Wikipedia, ZFS and BTRFS are the only general purpose filesystems (IE for disks, not tapes, NVRAM, or clusters) that support checksums for all data and compression. Given that ZFS license issues will never allow it to be included in the Linux kernel tree it seems clear that BTRFS is the next significant filesystem for Linux. More testing of BTRFS is a good thing; while there are a number of known problems that the developers are working on, it seems that more testing is needed now to find corner cases. Also we need a lot of testing to find bugs related to interactions with other software.

I’ve recently filed bug reports against the Debian installer because it can’t install to a BTRFS RAID-1 (fortunately BTRFS supports changing to RAID-1 after installation) and because it doesn’t support formatting an existing BTRFS filesystem (the mkfs program needs a -f option in that case). I also sent in a patch for the magic database used by file(1) to provide more information on BTRFS filesystems (which is in Debian/testing but not Debian/Wheezy). These are the sorts of things you encounter when routinely using software but don’t necessarily notice in basic testing.

As an aside the Debian installation process failed at the GRUB step when I manually balanced a filesystem to use RAID-1 while the Debian installation was in progress. I didn’t file a bug report because the best advice is to not mess with filesystems while the installer is running. I’ll do a lot more testing of this when the Debian installer supports a BTRFS RAID-1 installation.

A final thing that we need to work on is developing sysadmin best practices and scripts for managing BTRFS filesystems. I’ve done some work on scripts to create snapshots for online backups but there are issues of managing free space etc. Working out how to best manage a new filesystem is something that takes years because there are many corner cases you may only encounter after a system has been running for a long time. So I really wouldn’t want to be in the situation of using a new filesystem on an important server without having practised running it on less important systems; I did that with ZFS and now have a hacky first install that I have to support for years.

BTRFS vs LVM

For some years LVM (the Linux Logical Volume Manager) has been used in most Linux systems. LVM allows one or more storage devices (disks, partitions, or RAID sets) to be assigned to a Volume Group (VG), space from which can then be allocated to Logical Volumes (LVs) that behave like any other block device; a VG can have many LVs.

One of the significant features of LVM is that you can create snapshots of an LV. One common use is to keep multiple snapshots of an LV for online backups; another is to make a snapshot of a filesystem before making a backup to external storage, as the snapshot is unchanging there’s no problem of inconsistencies due to backing up a changing data set. When you create a snapshot it will have the same filesystem label and UUID, so you should always mount an LVM device by its name (which will be /dev/$VGNAME/$LVNAME).
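
As an illustration, a typical snapshot-for-backup sequence looks something like this (the VG and LV names are examples):

lvcreate --size 10G --snapshot --name root-snap /dev/vg0/root
mount /dev/vg0/root-snap /mnt/snap   # mount by VG/LV name, not by UUID or label
# run the backup from /mnt/snap, then clean up:
umount /mnt/snap
lvremove /dev/vg0/root-snap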

One of the problems with the ReiserFS filesystem was that there was no way to know whether a block of storage was a data block, a metadata block, or unused. A reiserfsck --rebuild-tree would find any blocks that appeared to be metadata and treat them as such: deleted files would reappear, and file contents which matched metadata (such as a file containing an image of a ReiserFS filesystem) would be treated as metadata. One of the impacts of this was that a hostile user could create a file which would create a SUID root program if the sysadmin ran a --rebuild-tree operation.

BTRFS solves the problem of filesystem images by using a filesystem specific UUID in every metadata block. One impact of this is that if you want to duplicate a BTRFS filesystem image and use both copies on the same system you need to regenerate all the checksums of metadata blocks with the new UUID. The way BTRFS works is that filesystems are identified by UUID so having multiple block devices with the same UUID causes the kernel to get confused. Making an LVM snapshot really isn’t a good idea in this situation. It’s possible to change BTRFS kernel code to avoid some of the problems of duplicate block devices and it’s most likely that something will be done about it in future. But it still seems like a bad idea to use LVM with BTRFS.

The most common use of LVM is to divide the storage of a single disk or RAID array for the use of multiple filesystems. Each filesystem can be enlarged (through extending the LV and making the filesystem use the space) and snapshots can be taken. With BTRFS you can use subvolumes for the snapshots and the best use of BTRFS (IMHO) is to give it all the storage that’s available so there is no need to enlarge a filesystem in typical use. BTRFS supports quotas on subvolumes which aren’t really usable yet but in the future will remove the need to create multiple filesystems to control disk space use. An important but less common use of LVM is to migrate a live filesystem to a new disk or RAID array, but this can be done by BTRFS too by adding a new partition or disk to a filesystem and then removing the old one.
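
For example, the BTRFS equivalent of migrating to a new disk is just (device names and mount point are examples):

btrfs device add /dev/sdb1 /data
btrfs device delete /dev/sda1 /data   # the delete moves all data off the old device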

It doesn’t seem that LVM offers any benefits when you use BTRFS. When I first experimented with BTRFS I used LVM but I didn’t find any benefit in using LVM and it was only a matter of luck that I didn’t use a snapshot and break things.

Snapshots of BTRFS Filesystems

One reason for creating a snapshot of a filesystem (as opposed to a snapshot of a subvolume) is for making backups of virtual machines without support from inside the virtual machine (EG running an old RHEL5 virtual machine that doesn’t have the BTRFS utilities). Another is for running training on virtual servers where you want to create one copy of the filesystem for each student. To solve both these problems I am currently using files in a BTRFS subvolume. The BTRFS kernel code won’t touch those files unless I create a loop device so I can only create a loop device for one file at a time.
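
Accessing one of those images looks something like this (file and mount point names are examples); because the images share a filesystem UUID only one can be attached at a time:

losetup /dev/loop0 /xenstore/vm1-root
mount /dev/loop0 /mnt/tmp
# do the backup or inspection, then:
umount /mnt/tmp
losetup -d /dev/loop0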

One tip for doing this: don’t use names such as /xenstore/vm1 for the files containing filesystem images, use names such as /xenstore/vm1-root. If you try to create a virtual machine named “vm1” then Xen will look for a file named “vm1” in the current directory before looking in /etc/xen and will try to use the filesystem image as a Xen configuration file. It would be nice if there was a path for Xen configuration files that either didn’t include the current directory or included it at the end of the list. Including the current directory in the path is a DOS mistake that should have gone away a long time ago.

Psychology and Block Devices

ZFS has a similar design to BTRFS in many ways and has some similar issues. But one benefit for ZFS is that it manages block devices in a “zpool”, first you create a zpool with the block devices and after that you can create ZFS filesystems or “ZVOL” block devices. I think that most sysadmins would regard a zpool as something similar to LVM (which may or may not be correct depending on how you look at it) and immediately rule out the possibility of running a zpool on LVM.

BTRFS looks like a regular Unix filesystem in many ways, you can have a single block device that you mount with the usual mount command. The fact that BTRFS can support multiple block devices in a RAID configuration isn’t so obvious, and the fact that it implements equivalents to most LVM functionality probably isn’t known to most people when they start using it. The most obvious way to start using BTRFS is to use it just like an Ext3/4 filesystem on an LV and to use LVM snapshots to back up data, which is made even more likely by the fact that there is a program to convert an ext2/3/4 filesystem to BTRFS. This seems likely to cause data loss.

Swap Breaking SSD

I’ve seen many comments about swap space and SSD claiming that swap will inherently destroy SSD through using too many writes. The latest was in the comments of my post about swap space and SSD performance [1]. Note that I’m not criticising the person who commented on my blog, everyone has heard lots of reports about possible problems that they avoid without analysing them in detail.

The first thing to note is that the quality of flash memory varies a lot, the chips that are used in SSDs for workstation/server use are designed to last while those in USB-flash devices aren’t. I’ve documented my unsuccessful experiments with using USB-flash for the root filesystem of a gateway server [2] (and the flash device that wasn’t used for swap died too).

The real issue when determining whether swap will break your SSD is the amount of writes. While swapping can do a lot of writing quickly, that usually doesn’t happen unless something has gone wrong. The workstation that I’m currently using has writes to the root filesystem outnumbering writes to swap by a factor of 130:1 (by volume of data written). On other days I’ve seen it as low as 42:1; in either case it’s writes to the root filesystem (which is BTRFS and includes /home etc) that will break the SSD if anything does. For some other workstations I run I see ratios of 201:1 (that’s with 8G of RAM), 57:1, and 23:1. In a quick search I couldn’t find a single system I run where even 10% of disk writes were attributed to swap. This really isn’t surprising given that adding RAM is a cheap way to improve the performance of most systems. If a SSD didn’t do any wear leveling (as is rumored to be the case with cheap USB flash devices) then swap use might still cause a problem because of the number of writes in a small area, but if that was the case then filesystem journals and other fixed data structures would be more likely to cause a problem – and any swap based breakage would break swap not the root filesystem (although if swap was the first partition then it might also break the MBR).
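
One quick way to check the volume of data written to each partition is something like this (it assumes the root filesystem is on sda1 and swap on sda2; field 10 of /proc/diskstats is sectors written and a sector is 512 bytes):

awk '$3 == "sda1" || $3 == "sda2" {printf "%s: %.1f GB written\n", $3, $10*512/1024/1024/1024}' /proc/diskstats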

Of the workstations that are convenient to inspect the one with the most writes to the root filesystem and swap space (IE everything on /dev/sda) had 128G written per day for 1.2 days of uptime, but that involved running a filesystem balance and some torrent downloads (not the typical use). Among the two systems which had been running for more than 48 hours with typical use the most writes was 24G in a day. If a SSD can sustain 10,000 writes per block (which is smaller than quoted by most flash manufacturers nowadays) and has perfect wear leveling (which is unlikely) then the system with a 120G SSD and 128G written in a day could continue like that for almost 10,000 days or 27 years – much longer than storage is expected to work or be useful. So for workstation use (where even 24G of writes per day probably counts as heavy use) a 120G SSD that can sustain 10,000 writes per block shouldn’t be at great risk of wearing out.

Conclusion

I believe that swap is much less likely to break SSDs than regular file access on every system with a reasonable amount of RAM. For systems which don’t have enough RAM you probably want the speed of SSD for swap space anyway in spite of the risks.

If wear leveling works as designed and the 10,000+ writes per block claims are accurate then SSDs will massively outlast their useful life. Long before they wear out they should be too small, too slow, and probably not compatible. 26 years ago I had a 5.25″ full height ST-506 disk in my desktop PC, it wouldn’t physically fit in most systems I own now (unless I removed the DVD drive), there is no possibility of buying a controller (I don’t own a system with an ISA bus), and that disk was too slow and small by today’s standards. A 27yo SSD isn’t going to be useful for anything, even for archive storage it’s no good as no-one has tested long term storage and data could decay before then.

BTRFS Status April 2014

Since my blog post about BTRFS in March [1] not much has changed for me. Until yesterday I was using 3.13 kernels on all my systems and dealing with the occasional kmail index file corruption problem.

Yesterday my main workstation ran out of disk space and went read-only. I started a BTRFS balance which didn’t seem to be doing any good because most of the space was actually in use so I deleted a bunch of snapshots. Then my X session aborted (some problem with KDE or the X server – I’ll never know as logs couldn’t be written to disk). I rebooted the system and had kernel threads go into infinite loops with repeated messages about a lack of response for 22 seconds (I should have photographed the screen). When it got into that state the ALT-Fn keys to change a virtual console sometimes worked but nothing else worked – the terminal usually didn’t respond to input.

To try to stop the kernel from entering an infinite loop on every boot I used “rootflags=skip_balance” on the kernel command line to stop it from continuing the balance, which made the system usable for a little longer. Unfortunately the skip_balance mount option doesn’t apply permanently; the kernel will keep trying to balance the filesystem on every mount until a “btrfs balance cancel” operation succeeds. But my attempts to cancel the balance always failed.

When I booted my system with skip_balance it would sometimes free some space from the deleted snapshots, after two good runs I got to 17G free. But after that every time I rebooted it would report another Gig or two free (according to “btrfs filesystem df“) and then hang without committing the changes to disk.

I solved this problem by upgrading my USB rescue image to kernel 3.14 from Debian/Experimental and mounting the filesystem from the rescue image. After letting kernel 3.14 work on the filesystem for a while it was in a state where I could use it with kernel 3.13 and then boot the system normally to upgrade it to kernel 3.14.

I had a minor extra complication due to the fact that I was running “apt-get dist-upgrade” at the time the filesystem went read-only, so the dpkg records of which packages were installed were a bit messed up. But that was easy to fix by running a diff against /var/lib/dpkg/info on a recent snapshot. In retrospect I should have copied from an old snapshot of the root filesystem, but I fixed the problems faster than I could think of better ways to fix them.

When running a balance the system had a peak IO rate of about 30MB/s reads and 30MB/s writes. That compares to the maximum contiguous file IO speed of 260MB/s for reads and 320MB/s for writes. During that time about 50% of the CPU time of my Q8400 quad-core CPU was in use. So far the only tasks that I do regularly which have CPU speed as a significant bottleneck are BTRFS filesystem balancing and recoding MP4 files. Compiling hasn’t been an issue because recently I haven’t been compiling many programs that are particularly big.

Lessons Learned

I should photograph the screen regularly when doing things that won’t be logged, those kernel error messages might have been useful to me or someone else.

The fact that the only kernel that runs BTRFS the way I need comes from the Experimental repository in Debian stands in contrast to the recent kernel patch that stops describing BTRFS as experimental. While I have a high opinion of the people who provide support for the kernel in commercial distributions and their ability to back-port fixes from newer kernels I’m concerned about their decision to support BTRFS. I’m also dubious about whether we can offer BTRFS support in Debian/Jessie (the next version of Debian) without a significant warning. OTOH if you find yourself with a BTRFS system that isn’t working well you could always hire me to fix it. I accept payment via Paypal, bank transfer, or Bitcoin. If you want to pay me in Grange then I assure you I will never forget about it. ;)

I thought that I wouldn’t have CPU speed issues when I started using the AMD64 architecture, for most tasks that’s been the case. But for systems for which storage is important I’ll look at getting faster CPUs because of BTRFS. Using faster CPUs for storage isn’t that uncommon (I used to work for SGI and dealt with some significant CPU power used for file serving), but needing a fast quad-core CPU to drive a single SSD is a little disappointing. While recovery from file system corner cases isn’t going to be particularly common it’s something that you want completed quickly, for personal systems you want to be doing something else and for work systems you don’t want down-time.

The BTRFS problems with running out of disk space are really serious. It seems that even workstations used at home can’t survive without monitoring. For any other filesystem used at home you can just let it get full and then delete stuff.

Include “rootflags=skip_balance” in the boot loader configuration for every system with a BTRFS root filesystem, and the “skip_balance” mount option in /etc/fstab for every non-root BTRFS filesystem. I haven’t yet encountered a single situation where continuing the balance did any good or where it didn’t do any harm.
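
Concretely that means something like the following (the UUID and mount point are placeholders):

# /etc/default/grub – then run update-grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet rootflags=skip_balance"

# /etc/fstab entry for a non-root BTRFS filesystem
UUID=01234567-89ab-cdef-0123-456789abcdef  /data  btrfs  defaults,skip_balance  0  0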

Swap Space and SSD

In 2007 I wrote a blog post about swap space [1]. The main point of that article was to debunk the claim that Linux needs a swap space twice as large as main memory (in summary such advice is based on BSD Unix systems, has never applied to Linux, and in any case most storage devices aren’t fast enough for large swap). That post was picked up by Barrapunto (Spanish Slashdot) and became one of the most popular posts I’ve written [2].

In the past 7 years things have changed. Back then 2G of RAM was still a reasonable amount and 4G was a lot for a desktop system or laptop. Now there are even phones with 3G of RAM, 4G is about the minimum for any new desktop or laptop, and desktop/laptop systems with 16G aren’t that uncommon. Another significant development is the use of SSDs which dramatically improve speed for some operations (mainly seeks).

As SATA SSDs for desktop use start at about $110 I think it’s safe to assume that everyone who wants a fast desktop system has one. As a major limiting factor in swap use is the seek performance of the storage the use of SSDs should allow greater swap use. My main desktop system has 4G of RAM (it’s an older Intel 64bit system and doesn’t support more) and has 4G of swap space on an Intel SSD. My work flow involves having dozens of Chromium tabs open at the same time, usually performance starts to drop when I get to about 3.5G of swap in use.

While SSDs generally have excellent random IO performance, their contiguous IO performance often isn’t much better than hard drives. My Intel SSDSC2CT12 300i 128G can do over 5000 random seeks per second but for sustained contiguous filesystem IO can only do 225M/s for writes and 274M/s for reads. The contiguous IO performance is less than twice as good as a cheap 3TB SATA disk. It also seems that the performance of SSDs isn’t as consistent as that of hard drives; when a hard drive delivers a certain level of performance it can generally do so 24*7, but a SSD will sometimes reduce performance to move blocks around (the erase block size is usually a lot larger than the filesystem block size).

It’s obvious that SSDs allow significantly better swap performance and therefore make it viable to run a system with more swap in use but that doesn’t allow unlimited swap. Even when using programs like Chromium (which seems to allocate huge amounts of RAM that aren’t used much) it doesn’t seem viable to have swap be much bigger than 4G on a system with 4G of RAM. Now I could buy another SSD and use two swap spaces for double the overall throughput (which would still be cheaper than buying a PC that supports 8G of RAM), but that still wouldn’t solve all problems.
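
For reference, striping swap across two devices is just a matter of giving the swap areas equal priority (device names are examples):

# /etc/fstab – equal priorities make the kernel use both swap areas round-robin
/dev/sda2  none  swap  sw,pri=10  0  0
/dev/sdb2  none  swap  sw,pri=10  0  0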

One issue I have been having on occasion is BTRFS failing to allocate kernel memory when managing snapshots. I’m not sure if this would be solved by adding more RAM as it could be an issue of RAM fragmentation – I won’t file a bug report about this until some of the other BTRFS bugs are fixed. Another problem I have had is when running Minecraft the driver for my ATI video card fails to allocate contiguous kernel memory, this is one that almost certainly wouldn’t be solved by just adding more swap – but might be solved if I tweaked the kernel to be more aggressive about swapping out data.

In 2007 when using hard drives for swap I found that the maximum space that could be used with reasonable performance for typical desktop operations was something less than 2G. Now with a SSD the limit for usable swap seems to be something like 4G on a system with 4G of RAM. On a system with only 2G of RAM that might allow the system to be usable with swap being twice as large as RAM, but with the amounts of RAM in modern PCs it seems that even SSD doesn’t allow using a swap space larger than RAM for typical use unless it’s being used for hibernation.

Conclusion

It seems that nothing has significantly changed in the last 7 years. We have more RAM, faster storage, and applications that are more memory hungry. The end result is that swap still isn’t very usable for anything other than hibernation if it’s larger than RAM.

It would be nice if application developers could stop increasing the use of RAM. Currently it seems that the RAM requirements for Linux desktop use are about 3 years behind the RAM requirements for Windows. This is convenient as a PC is fully depreciated according to the tax office after 3 years. This makes it easy to get 3 year old PCs cheaply (or sometimes for free as rubbish) which work really well for Linux. But it would be nice if we could be 4 or 5 years behind Windows in terms of hardware requirements to reduce the hardware requirements for Linux users even further.