LWN has an interesting article comparing recent developments in the Linux world to the “Unix Wars” that essentially killed every proprietary Unix system [1]. I recommend reading it; it’s probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it).
A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do.
The Benefits of Layers
I don’t believe in the BTRFS/ZFS design as strongly as the commentator probably thinks. The way my servers (and a huge number of other Linux systems) currently work – using RAID to combine a set of cheap disks into a single reliable array, often for extra capacity or performance as well – is a good thing. Because the RAID array is a separate layer I can fix it without bothering about the filesystem(s) on top of it – and have done so in the past. I can also test the RAID array without involving any filesystem specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So once we get past the question of whether the LVM PV is a single disk or a RAID array, Linux on a laptop is much the same as Linux on a server in terms of storage; among other things this means that the same code paths are used and I’m less likely to encounter a bug when I install a new system.
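As a concrete illustration, the layered setup is built something like this (the partition names, sizes, and VG name are just examples for the sketch):

    # mirror two cheap disks into one reliable array
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # put LVM on top of the array, exactly as it would go on a single disk
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate -n root -L 20G vg0
    mkfs.ext4 /dev/vg0/root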
LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try to fix it – without interfering with other filesystems.
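For example, the repair workflow is roughly the following (assuming an LV named data in a VG named vg0, which are made-up names for the sketch):

    umount /dev/vg0/data
    # snapshot first so the pre-repair state is preserved
    lvcreate -s -n data-snap -L 5G /dev/vg0/data
    fsck.ext4 -f /dev/vg0/data
    # if the repair goes badly the snapshot still has the original data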
When using layered storage I can easily add or change layers when it’s appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data.
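Adding encryption as a layer on just one LV looks roughly like this (the LV name is a made-up example):

    # encrypt only this LV; the LV holding .iso files stays unencrypted
    lvcreate -n private -L 50G vg0
    cryptsetup luksFormat /dev/vg0/private
    cryptsetup luksOpen /dev/vg0/private private_crypt
    mkfs.ext4 /dev/mapper/private_crypt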
When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can’t separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem.
Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one device on NBD on another system. That’s a bit unusual but it is apparently working well for some people. The internal RAID on ZFS and BTRFS doesn’t support such things and using software RAID underneath ZFS or BTRFS loses some features.
When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than internal RAID costs ZFS and BTRFS some reliability features, it seems that no matter how you implement those filesystems with DRBD you will lose somehow. Neither BTRFS nor ZFS seems to support a disconnected RAID mode (like a Linux software RAID with a bitmap so it can resync only the parts that changed), so it’s not possible to use BTRFS or ZFS RAID-1 with an NBD device.
The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.
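A rough sketch of that arrangement, assuming a pool named tank and a resource named drbd0 (names and sizes are made up for the example):

    # on each node: create a zvol to act as the DRBD backing device
    zfs create -V 100G tank/drbd0
    # point the DRBD resource's "disk" directive at /dev/zvol/tank/drbd0,
    # then once the resource is connected and this node is primary:
    mkfs.ext4 /dev/drbd0
    mount /dev/drbd0 /mnt/data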
The Benefits of Integration
When RAID and the filesystem are separate things (with some added abstraction from LVM) it’s difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise its operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load, and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance.
It would be possible to have a RAID driver that includes checksums for all blocks; it could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. But to provide all the reliability benefits of ZFS you would also need a filesystem that stores multiple copies of the data, which would of course need its own checksums (because the filesystem could be used on a less reliable block device), and therefore you would end up with two checksums on the same data. Note that if you want a RAID array with checksums on all blocks then ZFS already provides one through its volume management feature, the zvol (which is well described by Mark Round) [3]. Such a zvol could be used as a block device in a virtual machine, and in an ideal world it would be possible to use one as swap space. But a zvol is managed with all the regular ZFS mechanisms so it’s not a direct list of blocks on disk and thus can’t be extracted if there is a problem with ZFS.
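For what it’s worth, creating a zvol and handing it to a virtual machine is only a couple of commands (the pool and volume names are made-up examples):

    # a zvol appears as an ordinary block device but gets ZFS checksums,
    # snapshots, and (if the pool is redundant) redundancy underneath
    zfs create -V 20G tank/vm1-disk
    ls -l /dev/zvol/tank/vm1-disk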
Snapshots are an essential feature by today’s standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead.
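The low overhead is matched by low administrative effort; a BTRFS or ZFS snapshot is a single command (the mountpoint, subvolume, and dataset names are examples, and the snapshot directory must already exist):

    btrfs subvolume snapshot -r /data /data/.snapshots/before-upgrade
    zfs snapshot tank/data@before-upgrade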
The way that ZFS supports snapshots which inherit encryption keys is also interesting.
Conclusion
It’s technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that puts checksums on all blocks. But it appears that there isn’t much interest in developing such things. So while people would use it (and people are using ZFS zvols as block devices for other filesystems, as described in a comment on Mark Round’s blog) it’s probably not going to be implemented.
Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment.
ZFS on Linux isn’t a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that’s not possible when you have important Linux software that needs fast access to the storage.
Heh, you finally made me cash in the DD LWN subscription option, just so I can read the article you link to.
I wonder how well ZFS works on Debian GNU/kFreeBSD? IIRC you wrote you have no experience with kfbsd (or fbsd in general), but maybe someone else could comment on this?
If you don’t want to deal with BTRFS’s RAID, why don’t you use BTRFS (without RAID) on top of software RAID?
IIRC, ZFS is less code than Solaris LVM+UFS and it’s easier to use, so it’s not clear that the “complexity” exists.
Also, you say that several things like NBD don’t work with ZFS/BTRFS, but why wouldn’t they work?
Until LVM has some way of doing snapshots without preallocated space being put aside^W^Wwasted for it… it’s not so great.
If it had some sort of [RAM-style] overcommit facility, where it’d know what blocks the underlying FS hadn’t used and reused that… that’d be awesome. Not sure if it’s feasible though.
mirabilos: There are lots of other articles and discussions that are more informative. I think it’s good that you used your access as LWN is a great resource and now that you’ve used it once (and presumably instructed it to leave you logged in) you will use it more often.
I believe that FreeBSD was the second platform for ZFS and that as there are no license issues there it’s a first class kernel feature (as opposed to an optional extra that’s not properly supported on Linux). Debian kFreeBSD has the same kernel license situation as regular FreeBSD so it should be good. Getting the ZFS utilities to work on kFreeBSD should be easier than getting them to work on Linux.
Paul: If you use BTRFS without RAID on top of Linux software RAID then a single sector corruption can be copied to the other disk in the RAID set by an mdadm sync, and then BTRFS has no way of recovering. You could use BTRFS RAID-1 on top of software RAID and that will work well, but you lose 75% of the disk space as opposed to 50% for a regular RAID-1. I have some systems with enough spare disk space and IO capacity that writing every block 4 times is viable, but I also have some systems where it’s not.
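The double-RAID setup looks roughly like this (the partition names are examples); every block of data and metadata ends up on all four disks:

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc2 /dev/sdd2
    # BTRFS mirrors across the two md arrays, which each mirror internally
    mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1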
Wes: Using software RAID, LVM, and a filesystem as separate layers means that each one can be debugged separately. If the RAID and LVM are working correctly but a filesystem is corrupted then you can take an image of the filesystem to check it (or use LVM for a snapshot). If they are all integrated then you can’t address one issue independently of the others.
You can use Linux software RAID across a local disk (or RAID array) and an NBD device. When you boot the system it would probably have the NBD device disconnected, but it can be added back to the live array, and with bitmaps there is little data to copy for the synchronisation. It seems that you can’t use BTRFS or ZFS RAID-1 with one component being an NBD device; they just aren’t designed for transient devices in a RAID set.
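A sketch of that, assuming the remote disk is exported on port 2000 of a host called remote-server (all names and device numbers are made up):

    # attach the remote export and build a mirror with a write-intent bitmap
    nbd-client remote-server 2000 /dev/nbd0
    mdadm --create /dev/md1 --level=1 --raid-devices=2 --bitmap=internal /dev/sda3 /dev/nbd0
    # after booting with the remote side unreachable, reconnect and re-add it;
    # the bitmap limits the resync to the blocks that changed in the meantime
    nbd-client remote-server 2000 /dev/nbd0
    mdadm /dev/md1 --re-add /dev/nbd0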
Robert: Yes, the ZFS and BTRFS snapshots solve that problem nicely.
Robert: Thin provisioning is coming to LVM (it’s an experimental feature now, I don’t know if it will be ready for the wheezy release).
https://www.redhat.com/archives/linux-lvm/2012-January/msg00018.html
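For reference, the thin provisioning commands look roughly like this with a recent enough LVM (the VG, pool, and LV names are made-up examples):

    # a pool of real space, then a thin LV that only consumes pool space as it's written
    lvcreate -L 100G --thinpool tpool vg0
    lvcreate -V 500G --thin -n thinvol vg0/tpool
    # snapshots of thin LVs need no preallocated space either
    lvcreate -s -n thinvol-snap vg0/thinvol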
TK: Thanks for that. Also sorry about your other comment with multiple URLs, I noticed it in the spam list just after clicking on “empty spam”. Due to the small number of non-spam comments that end up in the spam folder I sometimes click on the empty spam button before reading them well.
“””The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.”””
No, that doesn’t do it. One of the big points of having the file system do checksums is that you can detect and correct problems that happen not only due to ‘bitrot’ or other disk problems, but also in transit to storage. So by running Ext4 on top of ZFS you are ruining the best features of ZFS.
If you wanted to retain BTRFS or ZFS features on a DRBD-type setup then you could get rid of local RAID features completely, export each drive individually, and put them together in a pool. Or use RAID on each device, export each as a single iSCSI LUN, and then use BTRFS to RAID-1 them together. Depending on what you are trying to accomplish either approach can yield superior results.
“””It’s technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that put checksums on all blocks. But it appears that there isn’t much interest in developing such things. “””
Because that would be _worse_ than ZFS, not better.
High-end RAID and SAN devices use features like checksums and nightly scrubs to check for, detect, and correct corruption. However they are useless for detecting corruption that happens in transit to the storage array. ZFS and BTRFS provide features that were previously only available on high-end systems, and have the potential to implement them better than even the expensive systems could.
Besides all that stuff….
While I don’t know the design of ZFS, BTRFS is all data-only. It doesn’t deal with blocks or pools of blocks. Everything it does is completely logical, handling data only. When it performs RAID-1, for example, it doesn’t mirror fixed blocks; it just makes sure that every hunk of data is available on at least 2 devices.
With BTRFS you have snapshots that are cheap and easy to use. With LVM they are expensive and troublesome for the administrator. BTRFS makes those features trivial to take advantage of: snapshots and subvolumes are easy and quick to use, while for LVM it is anything but. With BTRFS you can easily expand and shrink file systems on the fly; it’s quick, easy, and only requires a couple of commands, whereas shrinking file systems using software RAID + LVM + Ext4 is extremely troublesome and slow. So you can do things like remove older, slower drives and add in much larger ones without having to shut down the system or umount the file systems. This adds a huge amount of flexibility you can’t have with other solutions.
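For example, growing, shrinking, and migrating drives on a mounted BTRFS filesystem is roughly this (the mountpoint and device names are examples):

    btrfs filesystem resize -10g /mnt/data      # shrink while mounted
    btrfs device add /dev/sde /mnt/data         # add a new, larger drive
    btrfs device delete /dev/sdb /mnt/data      # migrate data off an old drive
    btrfs filesystem resize max /mnt/data       # grow into the new space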
With BTRFS you can use an odd number of drives to implement RAID-1 or RAID-0 features. You can choose how data is to be striped. You can have some data as RAID-10 and other data as RAID-0. You can use RAID-1 for metadata and RAID-0 for data. Whatever RAID level you want to use for whatever purpose, you can do it. You can re-balance and migrate data to and from different RAID levels on the fly.
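Converting RAID levels on a live filesystem, for instance, is a single (long-running) command with a recent enough kernel and btrfs-progs (the mountpoint is an example):

    # convert data to RAID-0 and metadata to RAID-1 without unmounting
    btrfs balance start -dconvert=raid0 -mconvert=raid1 /mnt/data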
With BTRFS you can use ‘seed’ devices. You can combine read-only file systems with read-write ones, then migrate fully over to the read-write version whenever you need to.
You can get all hand-wavy and claim that you can get similar things done with combinations of RAID levels and multiple file system mounts on different volume groups. Maybe throw some AUFS/UnionFS action at it and crud like that. I’ve done all that stuff myself in the past. BTRFS/ZFS not only does it faster, but it’s _vastly_ easier to use, much more manageable, more flexible, and less dangerous.
In Linux, at least, there are two layers we are dealing with here:
1) File system level
2) Block storage level
_thats_it_. That’s your layers.
NBD, DRBD, LVM, MD RAID, and even things like iSCSI or SSD drives, are all methods of block level manipulation. There is no logical reason why you would want LVM on top of RAID. There is no point to that division except that we happen to have software RAID to duplicate hardware RAID features and that LVM is a very poor copy of what HP-UX had at some point. There is a lot of overlap between LVM and RAID: there is really no point in doing RAID-0 or RAID-1 separately when using LVM, for example, since LVM can handle striping and mirroring on its own. It’s the sort of thing that is necessary for LVM to do its job. Should LVM be barred from having RAID-1 features because it’s a layer violation for RAID devices?
And what is the technical reason why file systems shouldn’t be able to manage duplicating and tracking data?
From what I’ve seen, ZFS and BTRFS do a massively better job than was ever possible on Linux, FreeBSD, or Solaris without the use of proprietary file systems. They implement these sorts of features in a faster, better, more manageable way than any previously existing software RAID + volume management + file system solution ever did, especially for Linux and especially if you don’t have hundreds of thousands of dollars to spend on high-end hardware and proprietary file systems.
To me this speaks volumes about the correctness and practical nature of modern file system design.
nate: Firstly you should probably write some blog posts about this yourself. At 910 words including quoted text I think you’ve set a new record for the longest comment on my blog! Please write a blog post explaining how to best use DRBD with ZFS while not losing the ability to recover from the situation where one disk returns bad data and says it’s good.
Now while it’s theoretically possible for Ext4 filesystem data to be corrupted on the way to storage, in practice, for Ext4 to have such a corruption while ZFS would not, the corruption would need to occur after the filesystem layer (IE after the write() system call) and before the data gets to the block device. This could happen with a virtual machine, although it seems unlikely. Any corruption between the application and the filesystem won’t be covered no matter what you do.
One possibility I considered was to have two physical disks in a server that are managed by LVM and then have each VM get two LVs for its storage to run BTRFS or ZFS with internal RAID-1. This would require a little extra work on systems running older distros such as RHEL4 but wouldn’t be impossible. Managing two LVM VGs and two LVs per DomU would be a bit of extra effort but it’s not impossible.
While BTRFS does support RAID-0, RAID-1, etc, it doesn’t support mixing them within the same filesystem. I can’t just have a single BTRFS filesystem spanning /dev/sda2 and /dev/sdb2 and then have both RAID-0 and RAID-1 subvolumes under that.
http://www.reddit.com/r/linux/comments/swpu8/btrfs_and_zfs_as_layering_violations/
Reddit has some commentary on this post. I like the way that they get back to the point of my previous posts that ZFS and BTRFS offer significant benefits and the point of this post that the benefits outweigh the down-sides for many important use cases.