Finding Storage Performance Problems

Here are some basic things to do when debugging storage performance problems on Linux. It’s deliberately not an advanced guide; I might write about more advanced topics in a later post.

Disk Errors

When a hard drive is failing it often has to read sectors several times to get the right data, and this can dramatically reduce performance. As most hard drives aren’t monitored properly (with email or SMS alerts on errors) it’s quite common for the first notification of an impending failure to be user complaints about poor performance.

View your kernel message log with the dmesg command and look in /var/log/kern.log (or wherever your system is configured to store kernel logs) for messages about disk read errors, bus resets, and anything else unusual related to the drives.
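For example, commands like the following will catch most of the interesting messages; the exact wording of the errors varies between drivers, so treat these patterns as a starting point rather than a definitive filter:

  dmesg | grep -iE 'error|reset|fail'
  grep -iE 'i/o error|ata[0-9]+.*(error|reset)' /var/log/kern.log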

If you use an advanced filesystem like BTRFS or ZFS there are filesystem-specific commands to report errors. For BTRFS you can run “btrfs device stats MOUNTPOINT” and for ZFS you can run “zpool status”.
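To illustrate, “btrfs device stats” prints a set of per-device error counters along the following lines (this is what a healthy device looks like; any non-zero counter deserves investigation):

  [/dev/sda2].write_io_errs   0
  [/dev/sda2].read_io_errs    0
  [/dev/sda2].flush_io_errs   0
  [/dev/sda2].corruption_errs 0
  [/dev/sda2].generation_errs 0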

Most performance problems aren’t caused by failing drives, but it’s a good idea to eliminate that possibility before you continue your investigation.

One other thing to look out for is a RAID array where one disk is noticeably slower than the others. For example in a RAID-5 or RAID-6 array every drive should have almost the same number of reads and writes, so if one disk in the array is at 99% utilisation while the other disks are at 5% then that’s an indication of a failing disk. This can happen even when SMART etc don’t report errors.

Monitoring IO

The iostat program in the Debian sysstat package tells you how much IO is going to each disk. If you have physical hard drives sda, sdb, and sdc you could run the command “iostat -x 10 sda sdb sdc” to report the IO going to each disk over 10 second periods. You can choose other durations, but I find that 10 seconds is long enough to give useful results.

By default iostat will give stats on all block devices including LVM volumes, but that usually gives too much data to analyse easily.

The most useful things that iostat tells you are the %util (the percentage utilisation – anything over 90% is a serious problem), the reads per second “r/s”, and the writes per second “w/s”.
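If you only care about those columns you can trim the output with awk; the sketch below assumes the column layout of the sysstat versions current at the time of writing (r/s and w/s in columns 4 and 5, %util last), so check the header on your system before trusting it:

  iostat -x 10 sda sdb sdc | awk '/^Device|^sd/ {print $1, $4, $5, $NF}'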

The block device parameters to iostat can be hard drives, partitions, LVM volumes, encrypted devices, or any other type of block device. After you have discovered which block devices are nearing their maximum load you can discover which of the partitions, RAID arrays, or swap devices on that disk are causing the load in question.

The iotop program in Debian (package iotop) gives a display that’s similar to that of top but for disk IO. It generally isn’t essential (you can run “ps ax|grep D” to get most of that information), but it is handy. It will tell you which programs are causing IO on a busy filesystem. This can be useful when you have a busy system and don’t know why. It isn’t very useful on a system that is used for one task, EG a database server that is known to be busy doing database stuff.
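A slightly more precise version of the “ps ax|grep D” trick is to ask ps for the process state column explicitly so that a capital D elsewhere in the line doesn’t match; something like:

  ps axo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'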

It’s generally a good idea to have sysstat and iotop installed on all systems. If a system is experiencing severe performance problems you might not want to wait for new packages to be installed.

In Debian the sysstat package includes the sar utility which can give historical information on system load. One benefit of using sar for diagnosing performance problems is that it shows you the time of day that has the most load, which is the best time to investigate.
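For example “sar -d -p” shows per-device transfer statistics for the current day, and adding “-f /var/log/sysstat/saNN” (the Debian default location, where NN is the day of the month) reads the data recorded on a previous day:

  sar -d -p
  sar -d -p -f /var/log/sysstat/sa05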

Swap Use

Swap use sometimes confuses people. In many cases swap use decreases overall disk IO; the Linux paging algorithms are designed that way. So if you have a server that accesses a lot of data it might swap out some unused programs to make more space for cache.

When you have multiple virtual machines on one system sharing the same disks it can be difficult to determine the best allocation for RAM. If one VM has some applications allocating a lot of RAM but not using it much then it might be best to give it less RAM and force those applications into swap so that another VM can cache all the data it accesses a lot.

The important thing is not the amount of swap that is allocated but the amount of IO that goes to the swap partition. Any significant amount of disk IO going to a swap device is a serious problem that can be solved by adding more RAM.
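The swap IO rate is easy to check: “vmstat 10” reports it in the si and so columns (the amount of memory swapped in and out per second), and if sysstat is collecting data then “sar -W” shows the history of pages swapped in and out per second:

  vmstat 10
  sar -W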

Reads vs Writes

The ratio of reads to writes depends on the applications and the amount of RAM. Some applications can have most of their reads satisfied from cache. For example an ideal configuration of a mail server will have writes significantly outnumbering reads (I’ve seen ratios of 5:1 for writes to reads on real mail servers). Ideally a mail server will cache all new mail for at least an hour, and as the most prolific users check their mail more frequently than that, most mail will be downloaded before it leaves the cache. If you have a mail server with reads outnumbering writes then it needs more RAM. RAM is cheap nowadays, so unless you want to compete with Gmail it should be affordable to buy enough RAM to cache all recent mail.

The ratio of reads to writes is important because it’s one way of quickly determining if you have enough RAM and adding RAM is often the cheapest way of improving performance.

Unbalanced IO

One common performance problem on systems with multiple disks is having more load going to some disks than to others. This might not be a problem (EG having cron jobs run on disks that are under heavy load while the web server accesses data from lightly loaded disks). But you need to consider whether it’s desirable to have some disks under more load than others.

The simplest solution to this problem is to just have a single RAID array for all data storage. This is also the solution that gives you the maximum available disk space if you use RAID-5 or RAID-6.

A more complex option is to use SSDs for things that require performance and hard drives for things that don’t. This can be done with the ZIL and L2ARC features of ZFS or by just creating a filesystem on SSD for the most frequently accessed data.
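In the ZFS case the SSDs are just added to an existing pool as log and cache devices; a sketch with a hypothetical pool name and device names:

  zpool add tank log mirror /dev/sdf /dev/sdg
  zpool add tank cache /dev/sdh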

What Did I Miss?

I’m sure that I missed something, please let me know of any other basic things to do – or suggestions for a post on more advanced things.

Booting GPT

I’m installing new 4TB disks in an older Dell server. It’s a PowerEdge T110 with a G6950 CPU so it’s not really old, but it’s a couple of generations behind the latest Dell servers.

I tried to enable UEFI booting, but when I turned that option on the system locked up during the BIOS boot process (it wouldn’t boot from the CD or take keyboard input). So I had to make it boot with a BIOS-compatible MBR and a GPT partition table.

Number  Start (sector)    End (sector)  Size      Code  Name
  1            2048            4095  1024.0 KiB  EF02  BIOS boot partition
  2            4096        25169919  12.0 GiB    FD00  Linux RAID
  3        25169920      7814037134  3.6 TiB    8300  Linux filesystem

After spending way too much time reading various web pages I discovered that the above partition table works. The 1MB partition is for GRUB code and needs to be enabled by a parted command such as the following:

parted /dev/sda set 1 bios_grub on

/dev/sda2 is a RAID-1 array used for the root filesystem. If I was installing a non-RAID system I’d use the same partition table but with a type of 8300 instead of FD00. I have a RAID-1 array over sda2 and sdb2 for the root filesystem, and sda3, sdb3, sdc3, sdd3, and sde3 are used for a RAID-Z array. I’m reserving space for the root filesystem on all 5 disks because it seems like a good idea to use the same partition table everywhere, and the 12G per disk that is unused on sdc, sdd, and sde isn’t worth worrying about when dealing with 4TB disks.
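If you want to reproduce that partition table from a script, sgdisk (from the gdisk package) can do it non-interactively; a sketch using the sector numbers shown above:

  sgdisk -n 1:2048:4095 -t 1:EF02 -c 1:"BIOS boot partition" \
         -n 2:4096:25169919 -t 2:FD00 -c 2:"Linux RAID" \
         -n 3:25169920:7814037134 -t 3:8300 -c 3:"Linux filesystem" /dev/sda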

More BTRFS Fun

I wrote a BTRFS status report yesterday commenting on the uneventful use of BTRFS recently [1].

Early this morning the server that stores my email (which had 93 days of uptime) had a filesystem related problem. The root filesystem became read-only and then the kernel message log filled with unrelated messages, so there was no record of the problem. I’m now considering setting up rsyslogd to log kernel messages to a tmpfs filesystem to cover such problems in future. As RAM is so cheap it wouldn’t matter if a few megs of RAM were wasted by that in normal operation if it allowed me to extract useful data when something goes really wrong. It’s really annoying to have a system in a state where I can login as root but not find out what went wrong.
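A minimal sketch of the rsyslog side of that idea, assuming the log goes under /run (which is already a tmpfs on current Debian systems); the file name is arbitrary and the leading “-” just disables syncing after each write:

  # /etc/rsyslog.d/kern-tmpfs.conf
  kern.*  -/run/kern-buffer.log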

After that I tried two kernels in the 3.14 series, both of which had kernel BUG assertions related to Xen networking and failed to network correctly; I filed Debian Bug #756714. Fortunately they at least had enough uptime for me to run a filesystem scrub, which reported no errors.

Then I reverted to kernel 3.13.10, but the reboot to apply that kernel change failed. Systemd was unable to umount the root filesystem (maybe because of a problem with Xen) and then hung the system instead of rebooting; I filed Debian Bug #756725. I believe that if asked to reboot a system there is no benefit in hanging it with no user space processes accessible. Here are some useful things that systemd could have done:

  1. Just reboot without umounting (like “reboot -nf” does).
  2. Pause for some reasonable amount of time to give the sysadmin a chance of seeing the error, and then reboot.
  3. Go back to a regular runlevel, starting daemons like sshd.
  4. Offer a login prompt to allow the sysadmin to login as root and diagnose the problem.

Options 1, 2, and 3 would have saved me a bit of driving. Option 4 would have allowed me to at least diagnose the problem (which might be worth the drive).

Having a system on the other side of the city which has no remote console access just hang after a reboot command is not useful, it would be near the top of the list of things I don’t want to happen in that situation. The best thing I can say about systemd’s operation in this regard is that it didn’t make the server catch fire.

Now all I really know is that 3.14 kernels won’t work for my server, 3.13 will cause problems that no-one can diagnose due to lack of data, and I’m now going to wait for it to fail again. As an aside the server has ECC RAM and its hardware is known to be good, so I’m sure that BTRFS is at fault.

BTRFS Status July 2014

My last BTRFS status report was in April [1], it wasn’t the most positive report with data corruption and system hangs. Hacker News has a brief discussion of BTRFS which includes the statement “Russell Coker’s reports of his experiences with BTRFS give me the screaming heebie-jeebies, no matter how up-beat and positive he stays about it” [2] (that’s one of my favorite comments about my blog).

Since April things have worked better. Linux kernel 3.14 solves the worst problems I had with 3.13 and it’s generally doing everything I want it to do. I now have cron jobs making snapshots as often as I wish (as frequently as every 15 minutes on some systems), automatically removing snapshots (removing 500+ snapshots at once doesn’t hang the system), balancing, and scrubbing. The fact that I can now run a filesystem balance (a type of defragment operation for BTRFS that frees some “chunks”) from a cron job and expect the system not to hang means that I don’t run out of metadata chunk space. I expect that running out of metadata space can still cause filesystem deadlocks, given the lack of reports on the BTRFS mailing list of fixes in that regard, but as long as balance works well we can work around that.
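As a rough sketch of what those cron jobs look like (the schedules, the /backup path, and the usage thresholds are examples; /backup is assumed to be a directory on the same BTRFS filesystem, and % signs have to be escaped in crontab entries):

  # /etc/cron.d/btrfs
  */15 * * * * root btrfs subvolume snapshot -r / /backup/root-$(date +\%Y\%m\%d\%H\%M)
  30 4 * * 0   root btrfs balance start -dusage=50 -musage=50 /
  30 5 1 * *   root btrfs scrub start -B /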

My main workstation now has 35 days of uptime and my home server has 90 days of uptime. Also the server that stores my email now has 93 days uptime even though it’s running Linux kernel 3.13.10. I am rather nervous about the server running 3.13.10 because in my experience every kernel before 3.14.1 had BTRFS problems that would cause system hangs. I don’t want a server that’s an hour’s drive away to hang…

The server that runs my email is using kernel 3.13.10 because when I briefly tried a 3.14 kernel it didn’t work reliably with Xen 4.1 from Debian/Wheezy, and I had a choice of using Xen 4.3 from Debian/Unstable to match the Linux kernel or using an earlier Linux kernel. I have a couple of Xen servers running Debian/Unstable for test purposes which are working well, so I may upgrade my mail server to the latest Xen and Linux kernels from Unstable in the near future. But for the moment I’m just not doing many snapshots and never running a filesystem scrub on that server.

Scrubbing

In kernel 3.14 scrub is working reliably for me and I have cron jobs to scrub filesystems on every system running that kernel. So far I’ve never seen it report an error on a system that matters to me but I expect that it will happen eventually.

The paper “An Analysis of Data Corruption in the Storage Stack” from the University of Wisconsin (based on NetApp data) [3] shows that “nearline” disks (IE any disks I can afford) have an annual incidence of checksum errors (occasions when the disk returns bad data but claims it to be good) of about 0.42%. There are 18 disks running in systems I personally care about (as opposed to systems where I am paid to care), so with a 0.42% probability of a disk experiencing data corruption per year that gives a 7.3% probability of having such corruption on at least one disk in any year and a greater than 50% chance that it has already happened over the last 10 years. Of the 18 disks in question 15 are currently running BTRFS. Of the 15 running BTRFS, 10 are scrubbed regularly (the other 5 are systems that don’t run 24*7 plus the system running kernel 3.13.10).
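For anyone who wants to check the arithmetic, those figures come from treating each disk-year as an independent 0.42% chance of corruption:

  1 - (1 - 0.0042)^18 ≈ 0.073   (at least one corrupt disk in a given year)
  1 - (1 - 0.073)^10  ≈ 0.53    (at least one such event over 10 years)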

Newer Kernels

The discussion on the BTRFS mailing list about kernel 3.15 is mostly about hangs. This is correlated with some changes to improve performance, so I presume that those changes have exposed race conditions. Based on those discussions I haven’t felt inclined to run a 3.15 kernel. As the developers already have some good bug reports I don’t think that more testing from me at this time would benefit me personally or the Linux community.

I don’t have a personal interest in RAID-5 or RAID-6. The only systems I run that have more data than will fit on a RAID-1 array of cheap SATA disks are ones that I am paid to run – and they are running ZFS. So the ongoing development of RAID-5 and RAID-6 code isn’t an incentive for me to run newer kernels. Eventually I’ll test out RAID-6 code, but at the moment I don’t think they need more bug reports in this area.

I don’t have a great personal interest in filesystem performance at this time. There are some serious BTRFS performance issues. One problem is that a filesystem balance and subtree removal seem to take excessive amounts of CPU time. Another is that there isn’t much support for balancing IO to multiple devices (in RAID-1 every process has all its read requests sent to one device). For large-scale use of a filesystem these are significant problems. But when you have basic requirements (such as a mail server for dozens of users or a personal workstation with a quad-core CPU and fast SSD storage) it doesn’t make much difference. Currently all of my systems which use BTRFS have storage hardware that exceeds the system performance requirements by such a large margin that nothing other than installing Debian packages can slow the system down. So while there are performance improvements in newer versions of the BTRFS kernel code that isn’t an incentive for me to upgrade.

It’s just been announced that Debian/Jessie will use Linux 3.16, so I guess I’ll have to test that a bit for the benefit of Debian users. I am concerned that 3.16 won’t be stable enough for typical users at the time that Jessie is released.

Why I Use BTRFS

I’ve just had to do yet another backup/format/restore operation on my workstation due to a BTRFS corruption problem, but as usual I didn’t lose any data. The BTRFS data integrity features work reasonably well even when the filesystem gets into a state where the kernel will only accept a read-only mount.

Given that the BTRFS tag on my blog is mostly about problems with BTRFS I think it’s time that I explain why I use it in spite of the problems before people start to worry about my sanity or competence.

The first thing to note is that BTRFS is fairly resilient to errors when mounted read-only. When mounting a filesystem read-write there are a number of ways in which things can break, often due to kernel code not being able to handle corrupt metadata – I don’t know how much of this is inherent to the design of BTRFS and how much is simply missing features in filesystem error handling. Some of the errors that I have had weren’t entirely the fault of BTRFS; I twice had to do a backup/format/restore of my workstation due to a faulty DIMM corrupting memory (which has the potential to mess up any filesystem), but I still didn’t lose any data AFAIK.

The next thing to note is that I don’t use BTRFS when doing paid sysadmin work. ZFS is a solid and reliable filesystem and it is working really well for my clients while BTRFS has too many issues at the moment. As an aside I’m not interested in any comments about the ZFS license situation from anyone who’s not officially representing Oracle.

I also don’t use BTRFS on systems that I can’t access easily. The servers I have running BTRFS are all within an hour’s drive from home; while driving for an hour on account of a kernel or filesystem error is really annoying, it’s not as bad as dealing with a remote server where I have no direct access.

Reasons to Use BTRFS

The benefits of BTRFS right now are snapshots (which are good for a first-line backup) and the basic data integrity features. I’ve found these features to work well in real use.

According to the Comparison of File Systems page on Wikipedia, ZFS and BTRFS are the only general purpose filesystems (IE for disks, not tapes, NVRAM, or clusters) that support checksums for all data and compression. Given that ZFS license issues will never allow it to be included in the Linux kernel tree it seems clear that BTRFS is the next significant filesystem for Linux. More testing of BTRFS is a good thing; while there are a number of known problems that the developers are working on, more testing is needed now to find corner cases. We also need a lot of testing to find bugs related to interactions with other software.

I’ve recently filed bug reports against the Debian installer because it can’t install to a BTRFS RAID-1 (fortunately BTRFS supports changing to RAID-1 after installation) and because it doesn’t support formatting an existing BTRFS filesystem (the mkfs program needs a -f option in that case). I also sent in a patch for the magic database used by file(1) to provide more information on BTRFS filesystems (which is in Debian/testing but not Debian/Wheezy). These are the sorts of things you encounter when routinely using software that you don’t necessarily notice in basic testing.

As an aside the Debian installation process failed at the GRUB step when I manually balanced a filesystem to use RAID-1 while the Debian installation was in progress. I didn’t file a bug report because the best advice is to not mess with filesystems while the installer is running. I’ll do a lot more testing of this when the Debian installer supports a BTRFS RAID-1 installation.

A final thing that we need to work on is developing sysadmin best practices and scripts for managing BTRFS filesystems. I’ve done some work on scripts to create snapshots for online backups, but there are issues of managing free space etc. Working out how to best manage a new filesystem takes years because there are many corner cases you may only encounter after a system has been running for a long time. So I really wouldn’t want to use a new filesystem on an important server without having practice running it on less important systems; I did that with ZFS and now have a hacky first install that I have to support for years.

BTRFS vs LVM

For some years LVM (the Linux Logical Volume Manager) has been used in most Linux systems. LVM allows one or more storage devices (disks, partitions, or RAID sets) to be assigned to a Volume Group (VG), some of whose space can then be allocated to Logical Volumes (LVs), which can be used like any other block device; a VG can have many LVs.

One of the significant features of LVM is that you can create snapshots of an LV. One common use is to have multiple snapshots of an LV for online backups; another is to make a snapshot of a filesystem before making a backup to external storage, as the snapshot is unchanging so there’s no problem of inconsistencies due to backing up a changing data set. When you create a snapshot it will have the same filesystem label and UUID as the original, so you should always mount an LVM device by its name (which will be /dev/$VGNAME/$LVNAME).
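For reference, creating and mounting an LVM snapshot looks like the following (the VG, LV, and sizes are hypothetical, and the snapshot only needs enough space to hold the changes made during its lifetime). This is fine for filesystems like Ext3/4, but see below for why it’s a bad idea with BTRFS:

  lvcreate -s -L 10G -n home-snap /dev/vg0/home
  mount /dev/vg0/home-snap /mnt/snap
  # run the backup, then clean up
  umount /mnt/snap
  lvremove /dev/vg0/home-snap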

One of the problems with the ReiserFS filesystem was that there was no way to know whether a block of storage was a data block, a metadata block, or unused. A reiserfsck --rebuild-tree would find any blocks that appeared to be metadata and treat them as such: deleted files would reappear, and file contents which matched metadata (such as a file containing an image of a ReiserFS filesystem) would be treated as metadata. One of the impacts of this was that a hostile user could create a file which would produce a SUID root program if the sysadmin ran a --rebuild-tree operation.

BTRFS solves the problem of filesystem images by including a filesystem specific UUID in every metadata block. One impact of this is that if you want to duplicate a BTRFS filesystem image and use both copies on the same system you need to regenerate the checksums of all metadata blocks with a new UUID. BTRFS identifies filesystems by UUID, so having multiple block devices with the same UUID causes the kernel to get confused, and making an LVM snapshot really isn’t a good idea in this situation. It’s possible to change the BTRFS kernel code to avoid some of the problems of duplicate block devices and it’s likely that something will be done about it in future. But it still seems like a bad idea to use LVM with BTRFS.

The most common use of LVM is to divide the storage of a single disk or RAID array between multiple filesystems. Each filesystem can be enlarged (by extending the LV and making the filesystem use the space) and snapshots can be taken. With BTRFS you can use subvolumes for snapshots, and the best use of BTRFS (IMHO) is to give it all the available storage so there is no need to enlarge a filesystem in typical use. BTRFS supports quotas on subvolumes, which aren’t really usable yet but in the future will remove the need to create multiple filesystems to control disk space use. An important but less common use of LVM is to migrate a live filesystem to a new disk or RAID array, but BTRFS can do this too by adding a new partition or disk to a filesystem and then removing the old one, as shown below.
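The migration only takes a couple of commands; a sketch with a hypothetical mount point and device names (the delete operation moves all the data off the old device before removing it, so it can take a long time):

  btrfs device add /dev/sdc1 /data
  btrfs device delete /dev/sdb1 /data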

It doesn’t seem that LVM offers any benefits when you use BTRFS. When I first experimented with BTRFS I used LVM but I didn’t find any benefit in using LVM and it was only a matter of luck that I didn’t use a snapshot and break things.

Snapshots of BTRFS Filesystems

One reason for creating a snapshot of a whole filesystem image (as opposed to a snapshot of a subvolume) is for making backups of virtual machines without support from inside the virtual machine (EG running an old RHEL5 virtual machine that doesn’t have the BTRFS utilities). Another is for running training on virtual servers where you want to create one copy of the filesystem for each student. To solve both these problems I am currently using files in a BTRFS subvolume. The BTRFS kernel code won’t touch those files unless I create a loop device, so I only create a loop device for one file at a time.
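A sketch of how one of those image files gets accessed, with hypothetical paths (“losetup -f --show” picks the first free loop device and prints its name):

  LOOP=$(losetup -f --show /xenstore/vm1-root)
  mount $LOOP /mnt/vm1
  # copy or back up the data, then clean up
  umount /mnt/vm1
  losetup -d $LOOP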

One tip for doing this: don’t use names such as /xenstore/vm1 for the files containing filesystem images, use names such as /xenstore/vm1-root. If you try to create a virtual machine named “vm1” then Xen will look for a file named “vm1” in the current directory before looking in /etc/xen and will try to use a filesystem image as a Xen configuration file. It would be nice if there was a path for Xen configuration files that either didn’t include the current directory or included it at the end of the list. Including the current directory in the path is a DOS mistake that should have gone away a long time ago.

Psychology and Block Devices

ZFS has a similar design to BTRFS in many ways and has some similar issues. But one benefit for ZFS is that it manages block devices in a “zpool”: first you create a zpool from the block devices and after that you can create ZFS filesystems or “ZVOL” block devices in it. I think that most sysadmins would regard a zpool as something similar to LVM (which may or may not be correct depending on how you look at it) and immediately rule out the possibility of running a zpool on LVM.

BTRFS looks like a regular Unix filesystem in many ways: you can have a single block device that you mount with the usual mount command. The fact that BTRFS can support multiple block devices in a RAID configuration isn’t so obvious, and the fact that it implements equivalents to most LVM functionality probably isn’t known to most people when they start using it. The most obvious way to start using BTRFS is to use it just like an Ext3/4 filesystem on an LV and to use LVM snapshots to back up data; this is made even more likely by the fact that there is a program to convert an ext2/3/4 filesystem to BTRFS. This seems likely to cause data loss.

Swap Breaking SSD

I’ve seen many comments about swap space and SSD claiming that swap will inherently destroy SSDs through too many writes. The latest was in the comments of my post about swap space and SSD performance [1]. Note that I’m not criticising the person who commented on my blog; everyone has heard lots of reports about possible problems that they avoid without analysing them in detail.

The first thing to note is that the quality of flash memory varies a lot. The chips that are used in SSDs for workstation/server use are designed to last, while those in USB-flash devices aren’t. I’ve documented my unsuccessful experiments with using USB-flash for the root filesystem of a gateway server [2] (and the flash device that wasn’t used for swap died too).

The real issue when determining whether swap will break your SSD is the amount of writes. While swapping can do a lot of writing quickly, that usually doesn’t happen unless something has gone wrong. The workstation that I’m currently using has writes to the root filesystem outnumbering writes to swap by a factor of 130:1 (by volume of data written). On other days I’ve seen it as low as 42:1; in either case it’s writes to the root filesystem (which is BTRFS and includes /home etc) that will break the SSD if anything does. For some other workstations I run I see ratios of 201:1 (that’s with 8G of RAM), 57:1, and 23:1. In a quick search I couldn’t find a single system I run where even 10% of disk writes were attributed to swap. This really isn’t surprising given that adding RAM is a cheap way to improve the performance of most systems. If an SSD didn’t do any wear leveling (as is rumored to be the case with cheap USB flash devices) then swap use might still cause a problem because of the number of writes in a small area, but if that was the case then filesystem journals and other fixed data structures would be more likely to cause a problem – and any swap based breakage would break swap, not the root filesystem (although if swap was the first partition then it might also break the MBR).

Of the workstations that are convenient to inspect, the one with the most writes to the root filesystem and swap space (IE everything on /dev/sda) had 128G written per day over 1.2 days of uptime, but that involved running a filesystem balance and some torrent downloads (not typical use). Among the two systems which had been running for more than 48 hours with typical use the most writes was 24G in a day. If an SSD can sustain 10,000 writes per block (which is lower than quoted by most flash manufacturers nowadays) and has perfect wear leveling (which is unlikely) then the system with a 120G SSD and 128G written per day could continue like that for almost 10,000 days or 27 years – much longer than storage is expected to work or be useful. So for workstation use (where even 24G of writes per day probably counts as heavy use) a 120G SSD that can sustain 10,000 writes per block shouldn’t be at great risk of wearing out.
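If you want to check the write volume on your own system, field 10 of /proc/diskstats is the number of 512 byte sectors written to each device since boot; a rough sketch for one device:

  awk '$3 == "sda" {printf "%.1f GB written since boot\n", $10 * 512 / 2^30}' /proc/diskstats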

Conclusion

I believe that swap is much less likely to break SSDs than regular file access on every system with a reasonable amount of RAM. For systems which don’t have enough RAM you probably want the speed of SSD for swap space anyway in spite of the risks.

If wear leveling works as designed and the 10,000+ writes per block claims are accurate then SSDs will massively outlast their useful life. Long before they wear out they will be too small, too slow, and probably incompatible with newer systems. 26 years ago I had a 5.25″ full height ST-506 disk in my desktop PC; it wouldn’t physically fit in most systems I own now (unless I removed the DVD drive), there is no possibility of buying a controller for it (I don’t own a system with an ISA bus), and that disk was too slow and small by today’s standards. A 27yo SSD isn’t going to be useful for anything; even for archive storage it’s no good as no-one has tested such long term storage and the data could decay before then.

BTRFS Status April 2014

Since my blog post about BTRFS in March [1] not much has changed for me. Until yesterday I was using 3.13 kernels on all my systems and dealing with the occasional kmail index file corruption problem.

Yesterday my main workstation ran out of disk space and went read-only. I started a BTRFS balance which didn’t seem to be doing any good (because most of the space was actually in use) so I deleted a bunch of snapshots. Then my X session aborted (some problem with KDE or the X server – I’ll never know as logs couldn’t be written to disk). I rebooted the system and had kernel threads go into infinite loops with repeated messages about a lack of response for 22 seconds (I should have photographed the screen). When it got into that state the ALT-Fn keys to change virtual console sometimes worked but nothing else did – the terminal usually didn’t respond to input.

To try and stop the kernel from entering an infinite loop on every boot I used “rootflags=skip_balance” on the kernel command line to stop it from continuing the balance, which made the system usable for a little longer. Unfortunately the skip_balance mount option doesn’t apply permanently; the kernel will keep trying to balance the filesystem on every mount until a “btrfs balance cancel” operation succeeds. But my attempts to cancel the balance always failed.

When I booted my system with skip_balance it would sometimes free some space from the deleted snapshots; after two good runs I got to 17G free. But after that every time I rebooted it would report another gig or two free (according to “btrfs filesystem df”) and then hang without committing the changes to disk.

I solved this problem by upgrading my USB rescue image to kernel 3.14 from Debian/Experimental and mounting the filesystem from the rescue image. After letting kernel 3.14 work on the filesystem for a while it was in a state where I could use it with kernel 3.13 and then boot the system normally to upgrade it to kernel 3.14.

I had a minor extra complication due to the fact that I was running “apt-get dist-upgrade” at the time the filesystem went read-only, so the dpkg records of which packages were installed were a bit messed up. But that was easy to fix by running a diff against /var/lib/dpkg/info on a recent snapshot. In retrospect I should have copied it from an old snapshot of the root filesystem, but I fixed the problems faster than I could think of better ways to fix them.

When running a balance the system had a peak IO rate of about 30MB/s reads and 30MB/s writes. That compares to the maximum contiguous file IO speed of 260MB/s for reads and 320MB/s for writes. During that time about 50% of the CPU time of my Q8400 quad-core CPU was in use. So far the only tasks that I do regularly which have CPU speed as a significant bottleneck are BTRFS filesystem balancing and recoding MP4 files. Compiling hasn’t been an issue because recently I haven’t been compiling many programs that are particularly big.

Lessons Learned

I should photograph the screen regularly when doing things that won’t be logged; those kernel error messages might have been useful to me or someone else.

The fact that the only kernel that runs BTRFS the way I need comes from the Experimental repository in Debian stands in contrast to the recent kernel patch that stops describing BTRFS as experimental. While I have a high opinion of the people who provide support for the kernel in commercial distributions and their ability to back-port fixes from newer kernels I’m concerned about their decision to support BTRFS. I’m also dubious about whether we can offer BTRFS support in Debian/Jessie (the next version of Debian) without a significant warning. OTOH if you find yourself with a BTRFS system that isn’t working well you could always hire me to fix it. I accept payment via Paypal, bank transfer, or Bitcoin. If you want to pay me in Grange then I assure you I will never forget about it. ;)

I thought that I wouldn’t have CPU speed issues when I started using the AMD64 architecture, and for most tasks that’s been the case. But for systems where storage is important I’ll look at getting faster CPUs because of BTRFS. Using fast CPUs for storage isn’t that uncommon (I used to work for SGI and dealt with significant CPU power being used for file serving), but needing a fast quad-core CPU to drive a single SSD is a little disappointing. While recovery from filesystem corner cases isn’t going to be particularly common, it’s something that you want completed quickly; for personal systems you want to be doing something else and for work systems you don’t want down-time.

The BTRFS problems with running out of disk space are really serious. It seems that even workstations used at home can’t survive without monitoring. For any other filesystem used at home you can just let it get full and then delete stuff.

Include “rootflags=skip_balance” in the boot loader configuration for every system with a BTRFS root filesystem and put skip_balance in /etc/fstab for every non-root BTRFS filesystem. I haven’t yet encountered a single situation where continuing the balance did any good, or one where it didn’t do any harm.
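A sketch of what that looks like on a Debian system booting with GRUB (the UUID and mount point are placeholders):

  # /etc/default/grub (run update-grub after editing)
  GRUB_CMDLINE_LINUX="rootflags=skip_balance"

  # /etc/fstab entry for a non-root BTRFS filesystem
  UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /data btrfs defaults,skip_balance 0 0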

Swap Space and SSD

In 2007 I wrote a blog post about swap space [1]. The main point of that article was to debunk the claim that Linux needs a swap space twice as large as main memory (in summary, such advice is based on BSD Unix systems and has never applied to Linux, and most storage devices aren’t fast enough for such a large swap space anyway). That post was picked up by Barrapunto (Spanish Slashdot) and became one of the most popular posts I’ve written [2].

In the past 7 years things have changed. Back then 2G of RAM was still a reasonable amount and 4G was a lot for a desktop system or laptop. Now there are even phones with 3G of RAM, 4G is about the minimum for any new desktop or laptop, and desktop/laptop systems with 16G aren’t that uncommon. Another significant development is the use of SSDs which dramatically improve speed for some operations (mainly seeks).

As SATA SSDs for desktop use start at about $110 I think it’s safe to assume that everyone who wants a fast desktop system has one. As a major limiting factor in swap use is the seek performance of the storage the use of SSDs should allow greater swap use. My main desktop system has 4G of RAM (it’s an older Intel 64bit system and doesn’t support more) and has 4G of swap space on an Intel SSD. My work flow involves having dozens of Chromium tabs open at the same time, usually performance starts to drop when I get to about 3.5G of swap in use.

While SSDs generally have excellent random IO performance, the contiguous IO performance often isn’t much better than hard drives. My Intel SSDSC2CT12 300i 128G can do over 5000 random seeks per second but for sustained contiguous filesystem IO can only do 225M/s for writes and 274M/s for reads. The contiguous IO performance is less than twice as good as a cheap 3TB SATA disk. It also seems that the performance of SSDs isn’t as consistent as that of hard drives; when a hard drive delivers a certain level of performance it can generally do so 24*7, but an SSD will sometimes reduce performance to move blocks around (the erase block size is usually a lot larger than the filesystem block size).

It’s obvious that SSDs allow significantly better swap performance and therefore make it viable to run a system with more swap in use, but that doesn’t allow unlimited swap. Even when using programs like Chromium (which seems to allocate huge amounts of RAM that aren’t used much) it doesn’t seem viable to have swap be much bigger than 4G on a system with 4G of RAM. Now I could buy another SSD and use two swap spaces for double the overall throughput (which would still be cheaper than buying a PC that supports 8G of RAM), but that still wouldn’t solve all problems.
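If I did add a second SSD for swap then giving both swap spaces the same priority would make the kernel stripe across them; a sketch with hypothetical partitions:

  # /etc/fstab
  /dev/sda2 none swap sw,pri=10 0 0
  /dev/sdb2 none swap sw,pri=10 0 0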

One issue I have been having on occasion is BTRFS failing to allocate kernel memory when managing snapshots. I’m not sure if this would be solved by adding more RAM as it could be an issue of RAM fragmentation – I won’t file a bug report about this until some of the other BTRFS bugs are fixed. Another problem I have had is when running Minecraft the driver for my ATI video card fails to allocate contiguous kernel memory, this is one that almost certainly wouldn’t be solved by just adding more swap – but might be solved if I tweaked the kernel to be more aggressive about swapping out data.

In 2007 when using hard drives for swap I found that the maximum space that could be used with reasonable performance for typical desktop operations was something less than 2G. Now with a SSD the limit for usable swap seems to be something like 4G on a system with 4G of RAM. On a system with only 2G of RAM that might allow the system to be usable with swap being twice as large as RAM, but with the amounts of RAM in modern PCs it seems that even SSD doesn’t allow using a swap space larger than RAM for typical use unless it’s being used for hibernation.

Conclusion

It seems that nothing has significantly changed in the last 7 years. We have more RAM, faster storage, and applications that are more memory hungry. The end result is that swap still isn’t very usable for anything other than hibernation if it’s larger than RAM.

It would be nice if application developers could stop increasing the use of RAM. Currently it seems that the RAM requirements for Linux desktop use are about 3 years behind the RAM requirements for Windows. This is convenient as a PC is fully depreciated according to the tax office after 3 years. This makes it easy to get 3 year old PCs cheaply (or sometimes for free as rubbish) which work really well for Linux. But it would be nice if we could be 4 or 5 years behind Windows in terms of hardware requirements to reduce the hardware requirements for Linux users even further.

Finding Corrupt Files that cause a Kernel Error

There is a BTRFS bug in kernel 3.13 which is triggered by Kmail and causes Kmail index files to become seriously corrupt. Another bug in BTRFS causes a kernel GPF when an application tries to read such a file, which results in a SEGV being sent to the application. After that the kernel ceases to operate correctly for any files on that filesystem and no command other than “reboot -nf” (hard reset without flushing write-back caches) can be relied on to work correctly. The second bug should be fixed in Linux 3.14; I’m not sure about the first one.

In the meantime I have several systems running Kmail on BTRFS which have this problem.

(strace tar cf - . |cat > /dev/null) 2>&1|tail

To discover which file is corrupt I run the above command after a reboot. Below is a sample of the typical output of that command, which shows that the file named “.trash.index” is corrupt. After discovering the file name I run “reboot -nf” and then delete the file (the file can be deleted on a clean system but not after a kernel GPF). Recently I’ve been doing this about once every 5 days, so on average each Kmail/BTRFS system has been getting disk corruption every two weeks. Fortunately every time the corruption has been on an index file so I haven’t needed to restore from backups.

newfstatat(4, ".trash.index", {st_mode=S_IFREG|0600, st_size=33, …}, AT_SYMLINK_NOFOLLOW) = 0
openat(4, ".trash.index", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_NOFOLLOW|O_CLOEXEC) = 5
fstat(5, {st_mode=S_IFREG|0600, st_size=33, …}) = 0
read(5,  <unfinished …>
+++ killed by SIGSEGV +++