Based on my investigation of RAID reliability [1] I have determined that BTRFS [2] is the Linux storage technology that has the best potential to increase data integrity without costing a lot of money. Basically a BTRFS internal RAID-1 should offer equal or greater data protection than RAID-6.
As BTRFS is so important and so very different to any prior technology for Linux, it’s not something that can be easily deployed in the same way as other filesystems. It is possible to easily switch between filesystems such as Ext4 and XFS because they work in much the same way: you have a single block device which the filesystem uses to create a single mount-point. BTRFS, by contrast, supports internal RAID, so it may have multiple block devices, and it may offer multiple mountable filesystems and snapshots. Much of the functionality of Linux Software RAID and LVM is covered by BTRFS. So the sensible way to deploy BTRFS is to give it all your storage and not make use of any other RAID or LVM.
So I decided to do a test installation. I started with a Debian install CD that was made shortly before the release of Squeeze (it was first to hand) and installed with BTRFS for the root filesystem. I then upgraded to Debian/Unstable to get the latest kernel, as BTRFS is developing rapidly. The system failed on the first boot after upgrading to Unstable because the /etc/fstab entry for the root filesystem had the FSCK pass number set to 1 – which wasn’t going to work as no FSCK program has been written. I changed that number to 0 and it then worked.
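The fstab change described above can be sketched as a small shell helper. This is only an illustration: the function name disable_btrfs_fsck_pass is made up for this example, and it prints the adjusted content to stdout rather than editing /etc/fstab in place, so the result can be reviewed first. It relies on the fsck pass number being the sixth field of an fstab entry.

```shell
# Sketch only: set the fsck pass number (6th fstab field) to 0 for btrfs entries.
# disable_btrfs_fsck_pass is an illustrative name, not an existing tool.
disable_btrfs_fsck_pass() {
    # $1: path to an fstab-format file; the adjusted content goes to stdout
    awk '
        # Leave comments and blank lines untouched
        /^[ \t]*#/ || NF == 0 { print; next }
        # Field 3 is the filesystem type, field 6 the fsck pass number
        $3 == "btrfs" && NF >= 6 { $6 = 0 }
        { print }
    ' "$1"
}

# Example use (review the output before replacing the real file):
#   disable_btrfs_fsck_pass /etc/fstab > /tmp/fstab.new
```

Note that awk rewrites a modified line with single spaces between fields, which fstab accepts.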
The initial install was on a desktop system that had a single IDE drive and a CD-ROM drive. For /boot I used a degraded RAID-1, and after completing the installation I removed the CD-ROM drive and installed a second hard drive. After that it was easy to add the other device to the RAID-1. Then I tried to add a new device to the BTRFS group with the command “btrfs device add /dev/sdb2 /dev/sda2” and was informed that it can’t do that to a mounted filesystem! That will decrease the possibilities for using BTRFS on systems with hot-swap drives; I hope that the developers regard it as a bug.
Then I booted with an ext3 filesystem for root and tried the “btrfs device add /dev/sdb2 /dev/sda2” again but got the error message “btrfs: sending ioctl 5000940a to a partition!” which is not even found by Google.
The next thing that I wanted to do was to put a swap file on BTRFS. The benefits of having redundancy and checksums on swap space seem obvious – and other BTRFS features such as compression might give a benefit too. So I created a file by using dd to take data from /dev/zero, ran mkswap on it and then tried to run swapon. But I was told that the file has holes and can’t be used. Automatically making zero blocks into holes is a useful feature in many situations, but not in this case.
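For reference, the swap file attempt can be sketched roughly as follows. The helper has_holes is a made-up name, and it is only a heuristic that compares allocated bytes with the apparent file size using GNU stat; it is not the check that swapon itself performs, and filesystem features such as compression may affect what it reports. The path /swapfile and the size are placeholders.

```shell
# Sketch: report whether a file looks sparse by comparing allocated bytes to its size.
# Assumes GNU stat (coreutils); has_holes is an illustrative helper name.
has_holes() {
    # %b = allocated blocks, %B = bytes per reported block, %s = apparent size
    stat -c '%b %B %s' "$1" | {
        read -r blocks block_size size
        [ $(( blocks * block_size )) -lt "$size" ]
    }
}

# Steps matching the description above (need root; path and size are placeholders):
#   dd if=/dev/zero of=/swapfile bs=1M count=512
#   chmod 600 /swapfile
#   mkswap /swapfile
#   has_holes /swapfile && echo "file looks sparse"
#   swapon /swapfile
```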
So far my experience with BTRFS is that all the basic things work (i.e. storing files, directories, etc). But the advanced functions I wanted from BTRFS (mirroring and making a reliable swap space) failed. This is a bit disappointing, but BTRFS isn’t described as being ready for production yet.
For the ioctl warnings, see BTS#656899 and https://lkml.org/lkml/2012/1/24/136 . From what I’ve understood, the warnings are only temporary.
Storing a swap file on a CoW file system such as btrfs is simply not possible. Turning 0s into holes is not really an optimization here; it is AFAIK just an error code for some syscall, hacked in this way to forbid swapon from working on btrfs, as btrfs can by design not support swap files. You can set up a loop device from the swap file and use the loop device as swap though, AFAIK. With btrfs, you should really use a swap partition instead of a swap file, or no swap at all (which has the added benefit of making the OOM killer kill processes sooner, which is useful if you have a large memory leak in an application you develop).
You can add a device to a mounted filesystem. You should specify the mounted path instead of the device.
btrfs device add /dev/sdb1 /
I’ve just recently started to look at btrfs and haven’t played with multi-device filesystems yet, but as far as I understand the help output of btrfs, the correct command should be:
btrfs device add /dev/sdb2 /mountpoint
Where /mountpoint is the mount point of your btrfs system. Have you tried that?
You’re adding devices the wrong way
It should be
btrfs device add /dev/sdb2 /dev/sda2 /mountpoint
And adding devices can only be done online. The filesystem HAS to be mounted in order to add devices.
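The point made in these comments, that the last argument must be the mounted path and not a device, could be guarded with a small wrapper like the one below. This is a sketch under assumptions: add_btrfs_device and the DRY_RUN variable are made-up names for illustration and are not part of btrfs-progs, and the mountpoint command from util-linux (or busybox) is assumed to be installed.

```shell
# Sketch: refuse to run "btrfs device add" unless the target is a mount point.
# add_btrfs_device and DRY_RUN are illustrative names, not part of btrfs-progs.
add_btrfs_device() {
    new_dev=$1   # device to add, e.g. /dev/sdb2
    target=$2    # path where the btrfs filesystem is mounted, e.g. /
    if ! mountpoint -q "$target"; then
        echo "error: $target is not a mount point; pass the mounted path, not a device" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only print the command that would be run
        echo "btrfs device add $new_dev $target"
    else
        btrfs device add "$new_dev" "$target"
    fi
}

# Example: show the command without running it
#   DRY_RUN=1 add_btrfs_device /dev/sdb2 /
```

With this guard, the original attempt of passing /dev/sda2 as the last argument would be rejected before btrfs is called.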
Arno: Thanks for that information, I guess I’ll just have to wait for an updated kernel.
Julian: There is no reason why they can’t do it. I believe that ZFS allows marking some files as non-COW and BTRFS could do the same.
Michael and Jeroen: Thanks for that information. Anyway something I did apparently destroyed the filesystem so I had to reinstall the machine. Now I’ve created a small ext3 root filesystem for testing. I’ll test it again once the kernel bug is fixed.
btrfs does have a NOCOW file flag, but I’m not sure it is ready yet:
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg13395.html
Just making a file NOCOW does not help, swap file support relies on one function that btrfs intentionally does not implement due to potential corruptions (the swap implementation relies on some assumptions which may not hold in btrfs, like block numbers in the swap file while btrfs has a different block number mapping in case of multiple devices). The benefits like redundancy and compression are exactly the ones which either prevent swap (mapping to multiple block numbers for raid1) or are not applicable (swap is using direct IO which turns compression off for the file). If you want compressed swap, use compcache/zram.
There is a patchset (swap-over-nfs) which enhances the swapfile API and btrfs could use it, but the patchset is not merged and I don’t know when it will be, if ever.
References:
* http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=35054394c4b3cecd52577c2662c84da1f3e73525
* http://www.spinics.net/lists/linux-btrfs/msg05042.html
http://richardhartmann.de/blog/posts/2012/02/RAID-sucks/
Richard Hartmann wrote an insightful post which was apparently inspired by this post.
kdave: Thanks for those references, does this mean that swap doesn’t work reliably with software RAID?