Cheap Bulk Storage

The Problem

Some of my clients need systems that store reasonable amounts of data. This is enough data that we can expect some data corruption on disk such that traditional RAID doesn’t work, that old fashioned filesystems like Ext3/4 will have unreasonable fsck times, and that the number of disks in a small server isn’t enough.

NetApp is a really good option for bulk reliable storage, but their products are very expensive. BTRFS has a lot of potential, but the currently released versions (as supported in distributions such as Debian/Wheezy) lack significant features. One significant lack in current BTRFS releases is something equivalent to the ZFS send/receive functionality for remote backups. This was a major factor when I analysed the options for hard drive based backup [1], and you should always think about backup before deploying a new system. Currently ZFS is the best choice for reliable bulk storage if you can’t afford NetApp. Any storage system needs a minimum level of reliability, if only to protect its own metadata, and a basic RAID array doesn’t protect against media corruption with current data volumes. The combination of performance, lack of fsck (which is a performance feature), large storage support, backup, and significant real-world use makes ZFS a really good option.

Now I need to get some servers for more than 8.1TiB of storage (the capacity of a RAID-Z array of 4*3TB disks). One of my clients needs significantly more, probably at least 10 disks in a RAID-Z array so none of the cheaper servers will do.
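The capacity figure above can be reproduced with a little arithmetic. This is only a rough sketch: the function name is made up for illustration, and it ignores ZFS metadata and other overhead, so real usable space will be somewhat lower.

```python
TB = 10**12   # bytes in a terabyte, as printed on disk labels
TIB = 2**40   # bytes in a tebibyte, as reported by most tools


def raidz_usable_tib(disks: int, disk_tb: float, parity: int = 1) -> float:
    """Approximate usable TiB of a single RAID-Z array.

    parity is 1 for RAID-Z, 2 for RAID-Z2 and 3 for RAID-Z3.
    ZFS overhead is ignored (an assumption made to keep this simple).
    """
    if disks <= parity:
        raise ValueError("an array needs more disks than parity devices")
    data_disks = disks - parity
    return data_disks * disk_tb * TB / TIB


if __name__ == "__main__":
    # A RAID-Z array of 4*3TB disks has 3 data disks, so 9TB raw.
    print(f"{raidz_usable_tib(4, 3):.2f} TiB")
```

For 4*3TB disks this comes out at a little under 8.2TiB, in line with the figure quoted above.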

Basically the issue that some of my clients are dealing with (and which I have to solve) is how to provide a relatively cheap ZFS system for storing reasonable amounts of data. For some systems I need to start with about 10 disks and be able to scale to 24 disks or more without excessive expense. Also, to make things a little easier and cheaper, 24*7 operation is not required, so instead of paying for hot-swap disks we can just schedule down-time outside business hours.

The Problem with Dell

Dell is really good for small systems. The PowerEdge tower servers that support 2*3.5″ or 4*3.5″ disks and which have space for an SSD or two are really affordable and easy to order. But even in the mid-size Dell tower servers (which are small by server standards) you have problems with just getting a few disks operating outside a RAID array [2]. The Dell online store is really great for small servers. Any time I’m buying a server for less than $2500 I check the Dell online store first, and usually their price is good enough that there is no need to get a quote from another company. Unfortunately all the servers with bigger storage involve disks that are unreasonably expensive (it seems that Dell makes their profit on the parts) and prices are not available online. I gave my email address and phone number to the Dell web site on Wednesday and they haven’t cared to get back to me yet. This is the type of service that makes me avoid IBM and HP for any server deployment where the Dell online store sells something suitable!

BackBlaze

For some time BackBlaze have been getting interest by describing how they store lots of data in a small amount of space by tightly stacking SATA disks. They don’t think that ZFS on Linux is ready for production, but their hardware ideas are useful. They have recently described their latest architecture [3]. They describe it as 135TB for $7,384. Of course the 135TB number is based on the idea of getting the full 3TB capacity out of each of the 45 disks, which they can do as they have redundancy over multiple storage pods. But anyone who wants a single fileserver needs some internal redundancy to cover disk failure. One option might be to have three RAID-Z2 arrays of 15 disks, which leaves 13 data disks per array and gives a usable capacity of 39*3TB==117TB==106TiB. Note that while the ZFS documentation recommends between 3 and 9 disks per RAID-Z group for performance, I don’t expect performance problems. When you only have a gigabit Ethernet connection, three ZFS zpools should easily be fast enough that the network is the bottleneck.
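To put the gigabit Ethernet bottleneck in numbers, here is a rough sketch of how long it takes to move data over a network link. The function name and the 80% efficiency default are assumptions chosen for illustration, not measurements from any of these systems.

```python
def transfer_hours(data_tb: float, link_gbit: float = 1.0,
                   efficiency: float = 0.8) -> float:
    """Hours needed to move data_tb terabytes over a network link.

    link_gbit is the nominal link speed in gigabits per second and
    efficiency is the assumed fraction of it that is actually usable.
    """
    if data_tb < 0 or link_gbit <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, link speed, or efficiency")
    total_bytes = data_tb * 10**12
    bytes_per_second = link_gbit * 10**9 / 8 * efficiency
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    # One 15 disk RAID-Z2 array of 3TB disks holds 39TB of data.
    for gbit in (1.0, 10.0):
        hours = transfer_hours(39, link_gbit=gbit)
        print(f"{gbit:g}Gbit/s: {hours:.1f} hours to copy 39TB")
```

At an assumed 80% of gigabit speed (about 100MB/s) a single terabyte takes close to three hours, so any array that can stream faster than that will be waiting on the network.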

For this option the way to go would be to start with an array of 15 disks and then buy a second set of 15 disks when the first storage pool becomes full. It seems likely that 4TB disks will become cheap before a 35TiB array is filled, so we can get more efficiency by delaying purchases. The BackBlaze pod isn’t cheap: it is sold as a complete system without storage disks for $US5,395 by Protocase [4]. That gives a markup of $US3,411 over the BackBlaze cost, which isn’t too bad given that BackBlaze are quoting the insane bulk discount hardware prices that I could never get. Protocase also offer the case on its own for anyone who wants to build a system around it. It seems like the better option is to buy the system from Protocase, but that would end up being over $6,000 when Australian import duty is added and probably close to $7,000 when shipping etc is included.

Norco

Norco offers a case that takes 24 hot-swap SATA/SAS disks and a regular PC motherboard for $US399 [5]. It’s similar to the BackBlaze pod but smaller, cheaper, and there’s no obvious option to buy a configured and tested system. 24 disks would allow two RAID-Z2 arrays of 12 disks, the first array could provide 27TiB and the second array could provide something bigger when new disks are released.
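The staged approach of filling the case one array at a time can be sketched as below. The class and function names are invented for this example, the 4TB size for the second array is only a guess at what "something bigger" might be, and ZFS overhead is ignored.

```python
from dataclasses import dataclass
from typing import Iterable

TB = 10**12   # bytes in a terabyte
TIB = 2**40   # bytes in a tebibyte


@dataclass(frozen=True)
class RaidzArray:
    """One RAID-Z array; parity=2 means RAID-Z2."""
    disks: int
    disk_tb: float
    parity: int = 2

    def usable_tib(self) -> float:
        # Only the non-parity disks contribute usable space.
        return (self.disks - self.parity) * self.disk_tb * TB / TIB


def total_usable_tib(arrays: Iterable[RaidzArray]) -> float:
    """Combined usable TiB of all arrays in the case."""
    return sum(a.usable_tib() for a in arrays)


if __name__ == "__main__":
    first = RaidzArray(disks=12, disk_tb=3)
    second = RaidzArray(disks=12, disk_tb=4)  # hypothetical later purchase
    print(f"first array:  {first.usable_tib():.1f} TiB")
    print(f"second array: {second.usable_tib():.1f} TiB")
    print(f"full case:    {total_usable_tib([first, second]):.1f} TiB")
```

The first array comes out at about 27TiB as stated above, and a second array of larger disks would add proportionally more without needing a bigger case.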

SuperMicro

SuperMicro has a range of storage servers that support from 12 to 36 disks [6]. They seem good, but I’d have to deal with a reseller to buy them which would involve pain at best and at worst they wouldn’t bother getting me a quote because I only want one server at a time.

Conclusion

Does anyone know of any other options for affordable systems suitable for running ZFS on SATA disks? Preferably ones that don’t involve dealing with resellers.

At the moment it seems that the best option is to get a Norco case and build my own system, as I don’t think that any of my clients needs the capacity of a BackBlaze pod right now. SuperMicro seems good but I’d have to deal with a reseller. In my experience the difference between the resellers of such computer systems and used car dealers is that used car dealers are happy to sell one car at a time and that every used car dealer at least knows how to drive.

Also if you are an Australian reader of my blog and you want to build such storage servers to sell to my clients in Melbourne then I’d be interested to see an offer. But please make sure that any such offer includes a reference to your contributions to the Linux community if you think I won’t recognise your name. If you don’t contribute then I probably don’t want to do business with you.

As an aside, I was recently at a camera store helping a client test a new DSLR when one of the store employees started telling me how good ZFS is for storing RAW images. I totally agree that ZFS is the best filesystem for storing large RAW files and this is what I am working on right now. But it’s not the sort of advice I expect to receive at a camera store, not even one that caters to professional photographers.

11 comments to Cheap Bulk Storage

  • RoboTux

    What about the recently merged btrfs send/receive in Linux? It’s probably too experimental to be used by company though.

  • Steven C

    What about ZFS’ natural habitat, an old Sun Fire ‘Thumper’ X4500? 48 SATA bays which ought to be plenty. Or buy two if they are cheap, and enjoy having plenty of spare fans/PSUs lying around.

    There is also the technique of swapping out component drives of an existing mirrored vdev with new ones of higher capacity; then the vdev expands. That way you can take advantage of new drives with larger capacities if/when they become available, instead of being forced to add extra drives for extra vdevs. Having the pool split across more, smaller vdevs makes this easiest. I guess this doesn’t work with RAID-Z though.

  • Jason Riedy

    Um, at peak *theoretical* capacity, a 1GbE link will require 1000GB/TB * 1000MB/GB / 100MB/s = 10000s (2.77 hours) to transfer 1TB. A more realistic 80% usage gives around 3.5 hours, and even that will require quite a bit of tuning effort.

    Are you sure your clients ever *use* the data? 100TB with a 1GbE link… 350 hours… Perhaps the best solution is determining what summary of the data they’ll use and only storing that?

  • roedie

    I’d say Debian with ZFS on Linux is the way to go. I’ve built a lot of storage systems using Chenbro cases. They are not too bad and come in a lot of sizes. But if you want to go bigger I recommend using separate disk chassis instead of buying a big server with a lot of drive bays.

    For your camera store note: We use ZFS to store a lot of RAW files from cameras (~300TB+)

  • etbe

    RoboTux: Yes, it’s too experimental. Also BTRFS has no RAID-Z equivalent yet and using RAID-1 loses a lot of space.

    Steven C: How would I get one of those? Ebay Australia has hardly anything from Sun and what it does have is small, ancient (SCSI), or both.

    Jason Riedy: A large part of the processing will be done on the server, this is one reason a NAS isn’t a good option. The processing that is done is going to be CPU intensive so 100MB/s isn’t going to be a bottleneck. Also I am investigating 10GigE.

    roedie: Do you know of a good cheap disk chassis?

    Thanks for the comments.

  • James

    Digicor are a SuperMicro reseller who I’ve dealt with in the past, and had no problems buying one case at a time from. However, some friends bought a SuperMicro case from the US (ProVantage) because it was a lot cheaper. You can view some pictures and their ZFS pool details.

  • Pete

    “What about the recently merged btrfs send/receive in Linux? It’s probably too experimental to be used by company though.” – RoboTux

    It should be considered too experimental for anyone’s data beyond those interested in testing. BTRFS still lacks complex RAID too.

    “There is also the technique of swapping out component drives of an existing mirrored vdev with new ones of higher capacity; then the vdev expands. That way you can take advantage of new drives with larger capacities if/when they become available, instead of being forced to add extra drives for extra vdevs. Having the pool split across more, smaller vdevs makes this easiest. I guess this doesn’t work with RAID-Z though.” – Steven C

    The drive swapping strategy is equally valid on mirror and RAIDZ pools. More spindles give you better performance for a number of metrics too. Finding a second hand thumper is an option, possibly not the best one since it will be without warranty, and I assume the controllers won’t support high capacity disks, so you’ll waste a lot of capacity/money filling it.

    “Um, at peak *theoretical* capacity, a 1GbE link will require 1000GB/TB * 1000MB/GB / 100MB/s = 10000s (2.77 hours) to transfer 1TB. A more realistic 80% usage gives around 3.5 hours, and even that will require quite a bit of tuning effort. Are you sure your clients ever *use* the data? 100TB with a 1GbE link… 350 hours… Perhaps the best solution is determining what summary of the data they’ll use and only storing that” – Jason Riedy

    My ZFS box does >100MB/sec for sequential over GigE with jumbo frames. I doubt the entire contents are going to be read frequently, on the flip side it’s probably equitable to store all the data online when compared to restoring from some slow media when required. Also, if processing will occur server-side, my (16-bay populated – 2×8-disk RAIDZ2, Norco 4224) ZFS box does ~1GB/sec sequential.

  • roedie

    etbe: I wanted to say LSI. But I cannot seem to find them anymore on their site. Maybe on ebay. There are other vendors like Xyratex and Dothill. I’ve tried contacting them but they never got back to me. I even thought about using old NetApp shelves, but those SATA shelves (you need the brackets with interposers) are only 2Gb/s which is not fast… at least not fast enough for me.

  • Craig

    FYI, techbuy.com.au in Sydney have both Norco cases and Supermicro cases, motherboards and systems in stock. The 4224 is $459 AU plus delivery (est. $22 to melbourne). That’s a good price considering how much it would cost to ship same from the US to Australia (from what i’ve been told, techbuy has them shipped in bulk direct from taiwan or wherever it is they’re actually made, not from the US).

    http://www.techbuy.com.au/p/149123/CASINGS_SERVER_-_4U/Norco/RPC-4224.asp

    I ordered one of the Norco 4220 cases from them last year for a zfsonlinux server at work. There were significant hassles and delays because they were out of stock at the time and the next batch kept getting delayed. Apparently they now have enough stock so this is not a problem.

    I would suggest LSI SAS2008 cards flashed with the IT firmware (e.g. the IBM M1015 can be bought new off ebay for about $100-$150 AUD for an 8i – if you have enough PCIe 8x slots, buying 3 is cheaper than bothering with SAS expanders).

    http://www.servethehome.com/ibm-serveraid-m1015-part-4/

    I use these cards in a few machines and they’ve been trouble-free (at least, since i figured out that if i wanted to use cheap consumer SATA drives without ZFS getting irked by long retries – timeouts – then the IT firmware is essential)

    Techbuy have similar cards (like the LSI Logic 9211-8i for $364, or LSI 9201-16i for $576), which may be an acceptable markup if you need local warranty/support.

    http://www.servethehome.com/current-lsi-hba-controller-features-compared/

    BTW if you want to plug any of the hotswap bays into motherboard SATA slots then you’ll need a **reverse** 4xSATA to SFF-8087 cable. you can get these cheap on ebay from any number of chinese cable + other misc. junk shops, about $10-ish, or you can pay about 5-10x as much locally.

    For cables from LSI to the hotswap bays you’ll just need SFF-8087 to SFF-8087. nice and simple. again, cables are much cheaper from ebay.

    BTW, for best performance, your raid-z devices should be made up of power-of-two data disks plus the parity drives. e.g. 4 or 8 data disks plus 1, 2, or 3 drives for raid-z1, -z2, or -z3. in a 4224 case this would give you enough drive bays for two raid-z2 arrays with 8 data disks and two parity disks each, and four spare bays for replacing drives as needed. Or three raidz-2 devs with 4d & 2p each, and no spare bays.

    With 4TB drives, that would be 64TB (8 drives x 4TB x 2 vdevs) or 48TB (4 x 4TB x 3). With 3TB drives, either 48TB or 36TB.

    The main advantage with having three raidz2 devices of 4 data disks each rather than two of 8 data disks each is that you can add or upgrade drives in 6-disk batches rather than 10-disk batches. e.g. you could start with 6 x 3TB disks (for 12TB storage), then add another 6 later, and then add-or-replace another 6x4TB as the drives become cheaper.

    The main disadvantage is that you lose more space and money on parity drives.

  • Steven C

    Just had an afterthought: isn’t there some way to put the drives in any number of separate chassis? The ‘master’ of the ZFS pool could attach through something like iSCSI over GbE, and would only need some SSDs for boot/ZIL/L2ARC attached locally. That would make it easy to add more capacity, but would it perform well?

  • Craig Sanders

    @steven C: it would be a lot simpler (and give much better performance) to have a SAS card with 4 or 8 (or 16 or more) external ports, and either connect them directly to a hot-swap backplane in a drive-only chassis or to a SAS expander in same.

    latest SAS, like latest SATA, offers 6Gbps per port (i.e. per drive). GbE doesn’t even come close to the bandwidth of a single SAS/SATA port, let alone 4 or more of them.

    (Of course, 6Gbps is really only relevant for SSDs at the moment. Magnetic hard disks max out at about 1-1.5Gbps each, which is why SAS expanders are viable without losing much – a 4-port SAS connection could easily support 12 or even 16 hard disks).