
ZFS vs BTRFS on Cheap Dell Servers

I previously wrote about my first experiences with BTRFS [1]. Since then I’ve been using BTRFS on more systems and have had good results. The main problem I want to address is the reliability of RAID [2].

Requirements for a File Server

Now one of my clients needs a new fileserver. They need to reliably store terabytes of data (currently 6TB and growing), most of which is data files in the 10MB – 15MB size range. The data files will almost never be rewritten and I anticipate that the main bottleneck will be the latency of NFS and other network file sharing protocols. I would hope that saturating a GigE network when sending 10MB data files from SATA disks via NFS, AFS, or SMB wouldn’t be a technical challenge.

It seems that BTRFS is the way of the future. But it’s still rather new and the lack of RAID-5 and RAID-6 is a serious issue when you need to store 10TB with today’s technology (that would be 8*3TB disks for RAID-10 vs 5*3TB disks for RAID-5). Also, surviving two disks entirely failing in a short period of time requires RAID-6 (or RAID-Z2, as the ZFS variant of RAID-6 is known). With BTRFS at its current stage of development it seems that to recover from two disks failing you need to have BTRFS on top of another RAID-6 (maybe Linux software RAID-6). But for filesystems based on concepts similar to ZFS and BTRFS you want the filesystem to run the RAID, so that if a block has a filesystem hash mismatch the correct copy can be reconstructed from parity.
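The disk counts above come from simple arithmetic. Here is a rough sketch of it; the function name is made up for illustration, and it assumes all disks are the same size and ignores filesystem overhead:

```python
import math

def min_disks(target_tb, disk_tb, level):
    """Smallest number of equal-sized disks giving at least target_tb usable."""
    data_disks = math.ceil(target_tb / disk_tb)
    if level == "raid10":
        return 2 * data_disks      # every data disk has a mirror
    if level == "raid5":
        return data_disks + 1      # one disk's worth of parity
    if level == "raid6":
        return data_disks + 2      # two disks' worth of parity
    raise ValueError(f"unknown RAID level: {level}")

for level in ("raid10", "raid5", "raid6"):
    print(level, min_disks(10, 3, level))
```

For 10TB on 3TB disks this gives 8 disks for RAID-10, 5 for RAID-5, and 6 for RAID-6.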

ZFS seems to be a lot more complex than BTRFS. While having more features is a good thing (BTRFS seems to be missing some sysadmin-friendly features at this stage), complexity means that I need to learn more and test more before going live.

But it seems that built-in RAID-5 and RAID-6 is the killer issue. Servers become a lot more expensive if you want more than 8 disks, and even going past 6 disks is a significant price point. As 3TB disks are available, an 8 disk RAID-6 gives something like 18TB of usable space vs 12TB on a RAID-10, and a 6 disk RAID-6 gives about 12TB vs 9TB on a RAID-10. With RAID-10 (i.e. BTRFS) my client couldn’t use a 6 disk server such as the Dell PowerEdge T410 for $1500, as 9TB of usable storage isn’t adequate, and the Dell PowerEdge T610, which can support 8 disks and costs $2100, would be barely adequate for the near future with only 12TB of usable storage. Dell does sell significantly larger servers such that any of my client’s needs could be covered by RAID-10, but in addition to costing more there are issues of power use and noise. When comparing a T610 and a T410 with a full set of disks the price difference is $1000 (assuming $200 per disk), which is probably worth paying to delay any future need for upgrades.
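To make the comparison concrete, here is a small sketch of the capacity and price figures. The helper names are invented for illustration, the server prices and the $200 per disk are the ones quoted above, and the capacities are raw (no filesystem overhead):

```python
def usable_tb(disks, disk_tb, level):
    """Raw usable capacity for equal-sized disks."""
    if level == "raid10":
        return (disks // 2) * disk_tb   # half the disks hold mirror copies
    if level == "raid6":
        return (disks - 2) * disk_tb    # two disks' worth of parity
    raise ValueError(f"unknown RAID level: {level}")

def total_cost(server_price, disks, price_per_disk):
    """Server plus a full set of disks."""
    return server_price + disks * price_per_disk

for disks in (6, 8):
    print(disks, "disks:",
          usable_tb(disks, 3, "raid6"), "TB RAID-6 vs",
          usable_tb(disks, 3, "raid10"), "TB RAID-10")

t410 = total_cost(1500, 6, 200)
t610 = total_cost(2100, 8, 200)
print("T610 - T410 =", t610 - t410)
```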

Buying Disks

The problem with the PowerEdge T610 server is that it uses hot-swap disks and the biggest disks available are 2TB for $586.30! 2TB*8 in RAID-6 gives 12TB of usable space for $4690.40! This compares poorly to the PowerEdge T410 which supports non-hot-swap disks so I can buy 6*3TB disks for something less than $200 each and get 12TB of usable space for $1200. If I could get hot-swap trays for Dell disks at a reasonable price then the T610 would be worth considering. But as 12TB of storage should do for at least the next 18 months it seems that the T410 is clearly the better option.
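The difference is easier to see as a price per usable terabyte. This is only a sketch: the function name is invented, and $200 is used as an upper bound for the price of a 3TB disk:

```python
from decimal import Decimal

def raid6_disk_cost(disks, disk_tb, price_per_disk):
    """Return (total disk price, usable TB, price per usable TB) for RAID-6."""
    total = disks * Decimal(price_per_disk)
    usable = (disks - 2) * disk_tb      # two disks' worth of parity
    return total, usable, total / usable

for name, args in (("T610, 2TB hot-swap", (8, 2, "586.30")),
                   ("T410, 3TB plain SATA", (6, 3, "200"))):
    total, usable, per_tb = raid6_disk_cost(*args)
    print(f"{name}: ${total} for {usable}TB, about ${per_tb:.2f} per TB")
```

Both configurations give 12TB usable, so the comparison comes down entirely to the price of the disks.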

Does anyone know how to get cheap disk trays for Dell servers?

Implementation

In mailing list discussions some people suggest using Solaris or FreeBSD for a ZFS server. ZFS was designed for and implemented on Solaris, and FreeBSD was the first port. However, Solaris and FreeBSD aren’t commonly used systems, so it’s harder to find skilled people to work with them and there is less of a guarantee that the desired software will work. Among other things it’s really convenient to be able to run software for embedded Linux i386 systems on the server.

The first port of ZFS to Linux was based on FUSE [3]. This allows a clean separation of ZFS code from the Linux kernel code to avoid license issues, but it does have some performance problems. I don’t think that I will have any performance issues on this server, as the data files are reasonably large, are received via an ADSL link, and require quite a bit of CPU time to process when they are accessed. But ZFS-FUSE doesn’t seem to be particularly popular.

The ZFS On Linux project provides source for a ZFS kernel module which you can compile and load [4]. As the module isn’t distributed with or statically linked to the kernel the license conflict of the CDDL ZFS code and the GPL Linux kernel code is apparently solved. I’ve read some positive reports from people who use this so it will be my preferred option.

Starting with BTRFS

Based on my investigation of RAID reliability [1] I have determined that BTRFS [2] is the Linux storage technology that has the best potential to increase data integrity without costing a lot of money. Basically a BTRFS internal RAID-1 should offer equal or greater data protection than RAID-6.

As BTRFS is so important and so very different to any prior technology for Linux, it’s not something that can be easily deployed in the same way as other filesystems. It is possible to easily switch between filesystems such as Ext4 and XFS because they work in much the same way: you have a single block device which the filesystem uses to create a single mount-point. BTRFS, in contrast, supports internal RAID, so it may have multiple block devices, and it may offer multiple mountable filesystems and snapshots. Much of the functionality of Linux Software RAID and LVM is covered by BTRFS. So the sensible way to deploy BTRFS is to give it all your storage and not make use of any other RAID or LVM.

So I decided to do a test installation. I started with a Debian install CD that was made shortly before the release of Squeeze (it was first to hand) and installed with BTRFS for the root filesystem. I then upgraded to Debian/Unstable to get the latest kernel, as BTRFS is developing rapidly. The system failed on the first boot after upgrading to Unstable because the /etc/fstab entry for the root filesystem had the FSCK pass number set to 1, which wasn’t going to work as no FSCK program has been written. I changed that number to 0 and it then worked.
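The pass number is the sixth field of an fstab entry. Here is a minimal sketch of the change applied to the text of an fstab file. The function name and the sample entries (including /dev/md0 for /boot) are hypothetical, and rewritten lines come out tab-separated rather than keeping their original spacing:

```python
def disable_btrfs_fsck(fstab_text):
    """Return fstab_text with the fsck pass number set to 0 on BTRFS entries."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave blank lines, comments, short lines and non-BTRFS entries alone.
        if (len(fields) < 4 or fields[0].startswith("#")
                or fields[2] != "btrfs"):
            out.append(line)
            continue
        # The dump and pass fields are optional, so pad them before editing.
        fields += ["0"] * (6 - len(fields))
        fields[5] = "0"
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"

# Hypothetical sample, not copied from the real system.
sample = (
    "# /etc/fstab\n"
    "/dev/sda2 / btrfs defaults 0 1\n"
    "/dev/md0 /boot ext3 defaults 0 2\n"
)
print(disable_btrfs_fsck(sample), end="")
```

Only the BTRFS line is changed; the ext3 entry for /boot keeps its pass number.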

The initial install was on a desktop system that had a single IDE drive and a CD-ROM drive. For /boot I used a degraded RAID-1, and after completing the installation I removed the CD-ROM drive and installed a second hard drive. After that it was easy to add the other device to the RAID-1. Then I tried to add a new device to the BTRFS group with the command “btrfs device add /dev/sdb2 /dev/sda2” and was informed that it can’t do that to a mounted filesystem! That will decrease the possibilities for using BTRFS on systems with hot-swap drives. I hope that the developers regard it as a bug.

Then I booted with an ext3 filesystem for root and tried the “btrfs device add /dev/sdb2 /dev/sda2” command again, but got the error message “btrfs: sending ioctl 5000940a to a partition!”, which Google doesn’t even find.
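For comparison, the btrfs-progs manual page documents the form “btrfs device add <device> <path>”, where the last argument is the mount point of a mounted BTRFS filesystem rather than a block device. Whether that explains the errors above is an assumption, since it hasn’t been checked against the tool version used in this test. A sketch that only builds the command line, with a hypothetical helper name and assuming the filesystem is mounted as root:

```python
import shlex

def device_add_command(new_device, mount_point):
    """Build the argv for adding new_device to the BTRFS mounted at mount_point."""
    return ["btrfs", "device", "add", new_device, mount_point]

cmd = device_add_command("/dev/sdb2", "/")
print(shlex.join(cmd))
# Actually running it needs root and a real BTRFS mount, for example:
#   subprocess.run(cmd, check=True)
```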

The next thing that I wanted to do was to put a swap file on BTRFS. The benefits of having redundancy and checksums on swap space seem obvious, and other BTRFS features such as compression might give a benefit too. So I created a file by using dd to copy data from /dev/zero, ran mkswap on it, and then tried to run swapon. But I was told that the file has holes and can’t be used. Automatically making zero blocks into holes is a useful feature in many situations, but not in this case.
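One rough way to check whether a file has holes is to compare its apparent size with the space actually allocated to it. This is a heuristic sketch with made-up function names: on Linux st_blocks counts 512-byte units, and a filesystem that compresses data can also report fewer blocks than the size implies:

```python
import os

def looks_sparse(size_bytes, blocks_512):
    """True if fewer bytes are allocated than the file's apparent size."""
    return blocks_512 * 512 < size_bytes

def file_looks_sparse(path):
    """Apply the same check to a real file via stat()."""
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)

# A 1MB file with nothing allocated vs. one that is fully allocated.
print(looks_sparse(1024 * 1024, 0))
print(looks_sparse(1024 * 1024, 2048))
```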

So far my experience with BTRFS is that all the basic things work (i.e. storing files, directories, etc). But the advanced functions I wanted from BTRFS (mirroring and making a reliable swap space) failed. This is a bit disappointing, but BTRFS isn’t described as being ready for production yet.