I previously wrote about my first experiences with BTRFS [1]. Since then I’ve been using BTRFS on more systems and have had good results. The main problem I want to address is the reliability of RAID [2].
Requirements for a File Server
Now one of my clients needs a new file server. They need to reliably store terabytes of data (currently 6TB and growing) which mostly consists of data files in the 10MB – 15MB size range. The data files will almost never be re-written and I anticipate that the main bottleneck will be the latency of NFS and other network file sharing protocols. I would hope that saturating a GigE network when sending 10MB data files from SATA disks via NFS, AFS, or SMB wouldn’t be a technical challenge.
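As a rough sanity check, here’s the arithmetic (the throughput figures below are assumptions for illustration, not measurements of this hardware):

# Back-of-envelope: can 10MB files saturate GigE?
GIGE_USABLE_MB_S = 110.0  # ~1Gb/s minus protocol overhead (assumed figure)
SATA_READ_MB_S = 100.0    # sustained read from one SATA disk (assumed figure)
FILE_SIZE_MB = 10.0

wire_time_ms = FILE_SIZE_MB / GIGE_USABLE_MB_S * 1000
disk_time_ms = FILE_SIZE_MB / SATA_READ_MB_S * 1000
print("per 10MB file: %.0fms on the wire, %.0fms from one disk"
      % (wire_time_ms, disk_time_ms))
# A single disk nearly keeps pace with GigE, and an array of disks will
# be well ahead of it, so per-request protocol latency is what matters.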
It seems that BTRFS is the way of the future. But it’s still rather new and the lack of RAID-5 and RAID-6 is a serious issue when you need to store 10TB with today’s technology (that would be 8*3TB disks for RAID-10 vs 5*3TB disks for RAID-5). Also the case of two disks entirely failing in a short period of time requires RAID-6 (or RAID-Z2, as the ZFS variant of RAID-6 is known). With BTRFS at its current stage of development it seems that to recover from two disks failing you need to run BTRFS on top of a separate RAID-6 layer (maybe Linux software RAID-6). But for filesystems based on concepts similar to ZFS and BTRFS you want the filesystem to run the RAID, so that if a block fails its filesystem checksum the correct copy can be reconstructed from parity.
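To illustrate why the filesystem should drive the RAID, here is a toy single-parity sketch (per-block hashes identify the corrupt block, then XOR parity rebuilds it); this is nothing like the real ZFS/BTRFS implementations, just the concept:

# Toy RAID-5-style recovery driven by filesystem checksums.
import hashlib
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"block-0!", b"block-1!", b"block-2!"]      # one block per disk
parity = reduce(xor, data)                          # parity block
hashes = [hashlib.sha256(d).digest() for d in data] # filesystem checksums

data[1] = b"garbage!"  # simulate silent corruption on one disk

for i, block in enumerate(data):
    if hashlib.sha256(block).digest() != hashes[i]:
        # The hash tells us WHICH block is bad; parity plus the good
        # blocks lets us rebuild it. A RAID layer without the hashes
        # can't tell a corrupt data block from a corrupt parity block.
        others = [d for j, d in enumerate(data) if j != i]
        data[i] = reduce(xor, others + [parity])
        print("block %d failed its checksum, rebuilt: %r" % (i, data[i]))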
ZFS seems to be a lot more complex than BTRFS. While having more features is a good thing (BTRFS seems to be missing some sysadmin-friendly features at this stage), complexity means that I need to learn more and test more before going live.
But it seems that the built-in RAID-5 and RAID-6 is the killer issue. Servers start becoming a lot more expensive if you want more than 8 disks, and even going past 6 disks is a significant price point. As 3TB disks are available, an 8 disk RAID-6 gives something like 18TB of usable space vs 12TB on a RAID-10, and a 6 disk RAID-6 gives about 12TB vs 9TB on a RAID-10. With RAID-10 (i.e. BTRFS) my client couldn’t use a 6 disk server such as the Dell PowerEdge T410 for $1500, as 9TB of usable storage isn’t adequate, and the Dell PowerEdge T610, which can support 8 disks and costs $2100, would be barely adequate for the near future with only 12TB of usable storage. Dell does sell significantly larger servers, such that any of my client’s needs could be covered by RAID-10, but in addition to costing more there are issues of power use and noise. When comparing a T610 and a T410 with a full set of disks the price difference is $1000 (assuming $200 per disk), which is probably worth paying to delay any future need for upgrades.
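For reference, here’s the capacity arithmetic behind those numbers (a simple approximation that ignores TB vs TiB and filesystem overhead):

# Usable capacity for the configurations discussed above (3TB disks).
# RAID-10 halves raw capacity; RAID-5 loses one disk; RAID-6 loses two.
DISK_TB = 3

def usable_tb(disks: int, level: str) -> int:
    if level == "raid10":
        return disks // 2 * DISK_TB
    if level == "raid5":
        return (disks - 1) * DISK_TB
    if level == "raid6":
        return (disks - 2) * DISK_TB
    raise ValueError(level)

print("5 disks: RAID-5 %dTB" % usable_tb(5, "raid5"))
for disks in (6, 8):
    print("%d disks: RAID-6 %dTB, RAID-10 %dTB"
          % (disks, usable_tb(disks, "raid6"), usable_tb(disks, "raid10")))
# 5 disks: RAID-5 12TB
# 6 disks: RAID-6 12TB, RAID-10 9TB
# 8 disks: RAID-6 18TB, RAID-10 12TB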
Buying Disks
The problem with the PowerEdge T610 server is that it uses hot-swap disks and the biggest disks available are 2TB for $586.30! 2TB*8 in RAID-6 gives 12TB of usable space for $4690.40! This compares poorly to the PowerEdge T410, which supports non-hot-swap disks, so I can buy 6*3TB disks for something less than $200 each and get 12TB of usable RAID-6 space for $1200. If I could get hot-swap trays for Dell disks at a reasonable price then the T610 would be worth considering. But as 12TB of storage should do for at least the next 18 months, it seems that the T410 is clearly the better option.
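The price comparison in one place (using the prices quoted above, chassis cost excluded):

# Price per usable TB for the two options, RAID-6 in both cases.
options = {
    # name: (disk_count, tb_per_disk, price_per_disk)
    "T610 (8 x 2TB Dell hot-swap)": (8, 2, 586.30),
    "T410 (6 x 3TB commodity)":     (6, 3, 200.00),
}

for name, (disks, tb, price) in options.items():
    usable = (disks - 2) * tb  # RAID-6 loses two disks of capacity
    total = disks * price
    print("%s: %dTB usable for $%.2f ($%.0f/TB)"
          % (name, usable, total, total / usable))
# The Dell hot-swap option costs roughly four times as much per usable TB.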
Does anyone know how to get cheap disk trays for Dell servers?
Implementation
In mailing list discussions some people suggest using Solaris or FreeBSD for a ZFS server. ZFS was designed for and implemented on Solaris, and FreeBSD was the first port. However, Solaris and FreeBSD aren’t commonly used systems, so it’s harder to find skilled people to work with them and there is less of a guarantee that the desired software will work. Among other things, it’s really convenient to be able to run software for embedded Linux i386 systems on the server.
The first port of ZFS to Linux was based on FUSE [3]. This allows a clean separation of the ZFS code from the Linux kernel code to avoid license issues, but does have some performance problems. I don’t think that I will have any performance issues on this server as the data files are reasonably large, are received via an ADSL link, and require quite a bit of CPU time to process when they are accessed. But ZFS-FUSE doesn’t seem to be particularly popular.
The ZFS On Linux project provides source for a ZFS kernel module which you can compile and load [4]. As the module isn’t distributed with or statically linked to the kernel, the license conflict between the CDDL ZFS code and the GPL Linux kernel code is apparently solved. I’ve read some positive reports from people who use this, so it will be my preferred option.
Just as a note, we have been using zfs-fuse for Bacula backup volumes with compression enabled for more than a year and it works really reliably.
12TB of space, and with an old Xeon about 80MB/s throughput.
Not sure if these guys deliver “down under”, but at least it might help to get an idea of the going rate for empty trays: http://www.servertrays.com/category/823/Dell
I’ve bought a number of Dell’s R510s with 12 drive bays. That’s a great density option if you don’t need a DVD drive. Additionally you can get Dell hotswap trays in a variety of places for a lot cheaper than Dell drives. Just search “Dell r510 drive tray” and you can pick them up for $25 apiece.
gebi: I’m hoping for a bit more than 80MB/s; I’d like to saturate at least one GigE port and maybe saturate two ports.
YC: Thanks a lot for that! €30 per tray is somewhat expensive, but when compared to the disk price difference ($586.30 for 2TB from Dell vs $180 for 3TB from everyone else) it’s nothing.
Would using FreeNAS be an option?
It’s FreeBSD-based and thus has a good ZFS implementation. Its web interface makes it easy to set up, configure, and use, and should be easy enough for even your customer to make some changes on their own.
Keep in mind that you’ll need ~1GB RAM for each TB of ZFS storage (2GB if you enable deduplication) …
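Applying that rule of thumb to the 12TB pool discussed above (the per-TB figures are the commenter’s guideline, not a hard requirement):

POOL_TB = 12  # usable capacity planned for this server

print("ARC only:   ~%dGB RAM" % (POOL_TB * 1))
print("with dedup: ~%dGB RAM" % (POOL_TB * 2))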
I had the same problems years ago when I needed to build something very similar.
I weighed dozens of options and in the end went for OpenSolaris with RAID-Z.
Eventually we opted for a new external DAS storage unit, but I remember Supermicro offered plenty of options for building a storage server.
Keep in mind that you may find it difficult to find 3TB SATA drives with RAID firmware on the open market.
http://etbe.coker.com.au/2012/03/23/cheap-nas-devices-suck/
bash_vi: The server I’m planning is to replace the one which inspired the above blog post. While a FreeNAS box won’t be that bad, it will still involve some extra management effort and be an additional platform.
One of the advantages of a file server over a NAS is that the file server can run your data intensive applications. This is more difficult when the file server is yet another different platform. The client in question currently uses only Linux and Mac OS X (and utterly depends on both of them); if possible I’d like to keep it to those two platforms.
Adam: From the Dell specs it seems that the default controller for the T410 and T610 doesn’t offer RAID functionality. So I’m hoping that it won’t refuse to operate with non-sanctioned disks. Does anyone have any experience with this?
What about kFreeBSD? Is this a real alternative to zfs-fuse or is it still too young?
kFreeBSD is still a different platform. While it does have a lot of Debian packages running, it won’t always be the same. kFreeBSD should be in some ways better for ZFS because it has the same ZFS kernel code as the main BSD port, and presumably the user-space code for managing ZFS isn’t too easy to mess up. But it’s going to be more difficult for everything else.
http://richardhartmann.de/blog/posts/2012/04/18-potpourri-i/
Richard Hartmann refers to my “continuing quest for modern, reliable, software-based storage on Linux”. I’m like a knight searching for the holy grail of reliable storage that’s affordable! ;)
http://www.supermicro.com/products/nfo/superworkstation.cfm
Richard suggests Supermicro. While I’m not after rack-mounted servers (I need something that can work in an office), they have some workstations that seem good.
The issue with having RAID / enterprise firmware disks isn’t about ensuring that the disks will run in the Dell; rather it is about ensuring that you can control retry timeouts so that the drive isn’t booted from the RAID set prematurely, and ensuring that it doesn’t time out for so long that it takes out the disk controller with it (both of which I’ve seen happen). Finding such 3TB drives on the open market is probable but hardly assured. Some consumer-grade disks do support adjusting the timeouts, but many are locked down now. You need to do your homework before buying.
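For example, drives that support SCT Error Recovery Control can have their timeouts adjusted with smartmontools. A minimal sketch, assuming smartctl is installed, the drive actually supports SCT ERC, and /dev/sda is a placeholder device name:

# Query and cap a drive's error-recovery timeouts via SCT ERC.
import subprocess

DEVICE = "/dev/sda"  # placeholder, adjust for the drive in question

# Show the drive's current SCT ERC read/write timeouts.
subprocess.run(["smartctl", "-l", "scterc", DEVICE], check=True)

# Ask the drive to give up after 7 seconds (values are in tenths of a
# second) so the RAID layer handles the error instead of the controller
# timing out. Many locked-down consumer drives will refuse this command.
subprocess.run(["smartctl", "-l", "scterc,70,70", DEVICE], check=True)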
Belated for sure, but for the record…
15 hot-swap drive bays – no trays needed
Yes, it’s a rack mount, but…
http://www.plinkusa.net/web4090XH10T.htm
See model IPC-4090XH15T, currently $320 in the USA.
It handles server-size eATX boards and dual power supplies if desired.
Or for 145TB in the same space:
http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/