The general trend seems to be that cheap hard drives are increasing in capacity faster than much of the data that is commonly stored. Back in 1998 I had a 3G disk in my laptop and about 800M was used for my home directory. Now I have 6.2G used for my home directory (and another 2G in ~/src) out of the 100G capacity in my laptop. So my space usage for my home directory has increased by a factor of about 8 while my space available has increased by a factor of about 30. When I had 800M for my home directory I saved space by cropping pictures for my web site and deleting the originals (thus losing some data I would rather have today), but now I just keep everything and it still doesn’t take up much of my hard drive. Similar trends apply to most systems that I use and that I run for my clients.
Due to the availability of storage, people are gratuitously using a lot of disk space. A relative recently took 10G of pictures on a holiday; her phone has 12G of internal storage so there was nothing stopping her. She might decide that half the pictures aren’t that great if she had to save space, but that space is essentially free (she couldn’t buy a cheaper phone with less storage) so there’s no reason to delete any pictures.
When considering backup methods one important factor is the ability to store all of one type of data on one backup device. Having a single backup span multiple disks, tapes, etc. has a dramatic impact on the ease of recovery and the potential for data loss. Currently 3TB SATA disks are really cheap and 4TB disks are available but rather expensive. Currently only one of my clients has more than 4TB of data used for one purpose (i.e. a single filesystem), so apart from that client a single SATA disk can back up anything that I run.
Benefits of Hard Drive Backup
When using a hard drive there is an option to make it a bootable disk in the same format as the live disk. I haven’t done this, but if you want the option of a quick recovery from a hardware failure then having a bootable disk with all the data on it is a good option. For example a server with software RAID-1 could have a backup disk that is configured as a degraded RAID-1 array.
The biggest benefit is the ability to read a disk anywhere. I’ve read many reports of tape drives being discovered to be defective at the least convenient time. With a SATA disk you can install it in any PC or put it in a USB bay if you have USB 3.0 or the performance penalty of USB 2.0 is bearable – a USB 2.0 bay is great if you want to recover a single file, but if you want terabytes in a hurry then it won’t do.
A backup on a hard drive will typically use a common filesystem. For backing up Linux servers I generally use Ext3; at some future time I will move to BTRFS, as having checksums on all data is a good feature for a backup. Using a regular filesystem means that I can access the data anywhere without needing any special software, I can run programs like diff on the backup, and I can export the backup via NFS or Samba if necessary. You never know how you will need to access your backup so it’s best to keep your options open.
Hard drive backups are the best solution for files that are accidentally deleted. You can have the first line of backups on a local server (or through a filesystem like BTRFS or ZFS that supports snapshots) and files can be recovered quickly. Even a SATA disk in a USB bay is very fast for recovering a single file.
LTO tapes have a maximum capacity of 1.5TB at the moment and tape size has been increasing more slowly than disk size. Also LTO tapes have an expected lifetime of only 200 reads/writes of the entire tape. It seems to me that tapes don’t provide a great benefit unless you are backing up enough data to need a tape robot.
Problems with a Hard Drive Backup
Hard drives tend not to survive being dropped, so posting a hard drive for remote storage probably isn’t a good option. This can be solved by transferring data over the Internet if the data isn’t particularly big or doesn’t change too much (I have a 400G data set backed up via rsync to another country because most of the data doesn’t change over the course of a year). Also, if the data is particularly small then solid state storage (which costs about $1 per GB) is a viable option; I run more than a few servers which could be entirely backed up to a 200G SSD. $200 for a single backup of 200G of data is a bit expensive, but the potential for saving time and money on the restore means that it can be financially viable.
Some people claim that tape storage will better survive a Carrington Event than hard drives. I’m fairly dubious about the benefits of this. If a hard drive in a Faraday Cage (such as a regular safe that is earthed) is going to be destroyed then you will probably worry about security of the food supply instead of your data. Maybe I should just add a disclaimer “this backup system won’t survive a zombie apocalypse”. ;)
It’s widely believed that tape storage lasts longer than hard drives. I doubt that this provides a real benefit, as some of my personal servers are running on 20G hard drives from back when 20G was big. The fact that drives tend to last for more than 10 years, combined with the fact that newer bigger drives are always being released, means that important backups can be moved to bigger drives. As a general rule you should assume that anything which isn’t regularly tested doesn’t work. So whatever your backup method you should test it regularly and have multiple copies of the data to deal with the case when one copy becomes corrupt. The process of testing a backup can involve moving it to newer media.
I’ve seen it claimed that a benefit of tape storage is that part of the data can be recovered from a damaged tape. One problem with this is that part of a database often isn’t particularly useful. Another issue is that in my experience hard drives usually don’t fail entirely unless you drop them; drives usually fail a few sectors at a time.
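One simple way to test a backup that is stored as plain files is to compare checksums between the live tree and the copy. The following is a minimal sketch and not a complete tool: the function names `sha256_of` and `verify_backup` are made up for illustration, and it assumes the backup is a regular directory tree that mirrors the source (not an archive).

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_root, backup_root):
    """Compare every regular file under source_root with its backup copy.

    Returns a sorted list of relative paths that are missing from the
    backup or whose contents differ. Symbolic links are skipped.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            source_path = os.path.join(dirpath, name)
            if os.path.islink(source_path):
                continue  # this sketch only checks regular files
            relative = os.path.relpath(source_path, source_root)
            backup_path = os.path.join(backup_root, relative)
            if not os.path.isfile(backup_path):
                problems.append(relative)
            elif sha256_of(source_path) != sha256_of(backup_path):
                problems.append(relative)
    return sorted(problems)
```

Files that changed on the live system after the backup was taken will also be reported, so a check like this is most useful against a snapshot or a quiet data set.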
How to Implement Hard Drive Backup
The most common need for backups is when someone deletes the wrong file. It’s usually a small restore and you want it to be an easy process. The best solution to this is to have a filesystem with snapshots such as BTRFS or ZFS. In theory it shouldn’t be too difficult to have a cron job manage snapshots, but as I’ve only just started putting BTRFS and ZFS on servers I haven’t got around to changing my backups. Snapshots won’t cover more serious problems such as hardware, software, or user errors that wipe all the disks in a server. For example the only time I lost a significant amount of data from a hosted server was when the data center staff wiped it, so obviously good off-site backups are needed.
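The part of such a cron job that needs some thought is deciding which old snapshots to delete. Below is a rough sketch of one possible retention policy. The policy itself (keep everything from the last week, then one snapshot per week for a month) and the name `snapshots_to_prune` are assumptions for illustration. The code only works on dates, so creating and deleting the snapshots with the BTRFS or ZFS tools is left out.

```python
from datetime import date, timedelta


def snapshots_to_prune(snapshot_days, today, keep_daily=7, keep_weekly=4):
    """Return the snapshot dates that the retention policy would delete.

    Policy (an assumption for illustration): keep every snapshot from the
    last keep_daily days, plus the newest snapshot in each of the
    keep_weekly most recent ISO weeks among the older snapshots.
    """
    daily_cutoff = today - timedelta(days=keep_daily)
    keep = {day for day in snapshot_days if day > daily_cutoff}

    # Walk the older snapshots from newest to oldest, so the first
    # snapshot seen in each ISO week is the newest one of that week.
    older = sorted((d for d in snapshot_days if d <= daily_cutoff), reverse=True)
    weeks_kept = []
    for day in older:
        week = tuple(day.isocalendar())[:2]  # (ISO year, ISO week)
        if week not in weeks_kept:
            if len(weeks_kept) == keep_weekly:
                break
            weeks_kept.append(week)
            keep.add(day)

    return sorted(set(snapshot_days) - keep)
```

With one snapshot per day this keeps at most 11 snapshots on the server, which should be enough for the common case of someone noticing a deleted file a few days later.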
The easiest way to deal with problems that wipe a server is to have data copied to another system. For remote backups you can rsync to a local system and then use “cp -rl” or your favorite snapshot system to make a hard linked copy of the tree. A really neat feature is the ZFS ability to “send” a filesystem snapshot (or the diff between two snapshots) to a remote system [1]. Once you have regular backups on local storage you can then copy them to removable disks as often as you wish. I think I’ll have to install ZFS on some of my servers for the sole purpose of getting the “send” feature! There are NAS devices that provide similar functionality to the ZFS send/receive (maybe implemented with ZFS), but I’m not a fan of cheap NAS devices [2].
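To show what the hard linked copy amounts to, here is a small sketch that does roughly the same job as “cp -rl” for a tree of regular files. The function name `hardlink_tree` is made up for illustration, and the sketch ignores symbolic links, ownership, and permissions, which a real tool would preserve.

```python
import os


def hardlink_tree(source_root, snapshot_root):
    """Recreate the directories of source_root under snapshot_root and
    hard-link every file into place instead of copying its data.

    Both trees must be on the same filesystem, because hard links cannot
    cross filesystems. Symbolic links and file metadata are not handled.
    """
    for dirpath, _dirnames, filenames in os.walk(source_root):
        relative = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(snapshot_root, relative))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # The new directory entry points at the same inode as the source.
            os.link(os.path.join(dirpath, name), os.path.join(target_dir, name))
```

This works as a history mechanism because rsync by default writes a changed file to a temporary name and renames it into place, so the older hard linked trees keep pointing at the old contents. It would not work with an option like --inplace, which modifies the shared data.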
It seems that the best way to address the first two needs of backup (fast local restore and resilience in the face of site failure) is to use ZFS snapshots on the server and ZFS send/receive to copy the data to another site. The next issue is that the backup server probably won’t be big enough for all the archives and you want to be able to recover from a failure on the backup server. This requires some removable storage.
The simplest removable backup is to use a SATA drive bay with eSATA and USB connectors. You use a regular filesystem like Ext3 and just copy the files on. It’s easy, cheap, and requires no special skill or software. Requiring no special skill is important because you never know who will be called on to recover from backups.
When a server is backing up another server by rsync (whether it’s in the same rack or another country) you want the backup server to be reliable. However there is no requirement for a single reliable server and sometimes having multiple backup servers will be cheaper. At current purchase prices you can buy two cheap tower systems with 4*3TB disks for less money than a single server that has redundant PSUs and other high end server features. Having two cheap servers die at once seems quite unlikely so getting two backup servers would be the better choice.
For filesystems that are bigger than 4TB a disk based backup would require backup software that handles multi-part archives. One would hope that any software that is designed for tape backup would work well for this (consider a hard drive as a tape with a very fast seek), but often things don’t work as desired. If anyone knows of a good Linux backup program that supports multiple 4TB SATA disks in eSATA or USB bays then please let me know.
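In the absence of such a program, one possible approach is to avoid multi-part archives and instead assign whole files to disks, so that every disk remains an ordinary filesystem that can be read on its own. The sketch below uses the first-fit decreasing heuristic for this. The name `plan_disks` and the idea of planning from a path-to-size mapping are assumptions for illustration, and it cannot handle a single file that is bigger than one disk.

```python
def plan_disks(file_sizes, disk_capacity):
    """Assign whole files to disks using first-fit decreasing.

    file_sizes maps a path to its size in bytes. Returns a list of disks,
    each of which is a list of the paths assigned to it. Raises ValueError
    if any single file is larger than one disk.
    """
    disks = []  # paths assigned to each disk
    free = []   # remaining bytes on each disk

    # Place the biggest files first; ties are broken by path for stability.
    ordered = sorted(file_sizes.items(), key=lambda item: (-item[1], item[0]))
    for path, size in ordered:
        if size > disk_capacity:
            raise ValueError("%s does not fit on a single disk" % path)
        for index, remaining in enumerate(free):
            if size <= remaining:
                disks[index].append(path)
                free[index] -= size
                break
        else:
            # No existing disk has room, so start a new one.
            disks.append([path])
            free.append(disk_capacity - size)
    return disks
```

A real plan would also have to leave some room for filesystem overhead on each disk and record which disk holds which file, so that a restore doesn’t involve trying every disk in turn.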
Conclusion
BTRFS or ZFS snapshots are the best way of recovering from simple mistakes.
ZFS send/receive seems to be the best way of synchronising updates to filesystems to other systems or sites.
ZFS should be used for all servers. Even if you don’t currently need send/receive you never know what the future requirements may be. Apart from needing huge amounts of RAM (one of my servers had OOM failures when it had a mere 4G of RAM) there doesn’t seem to be any down-side to ZFS.
I’m unsure of whether to use BTRFS for removable backup disks. The immediate up-sides are checksums on all data and meta-data and the possibility of using built-in RAID-1 so that a random bad sector is unlikely to lose data. There is also the possibility of using snapshots on a removable backup disk (if the disk contains separate files instead of an archive). The down-sides are lack of support on older systems and the fact that BTRFS is fairly new.
Have I missed anything?
Btrfs send/receive has been merged in Linux 3.6, so you don’t have to use ZFS just to get the send/receive feature.
IIRC, the big worry in the case of a Carrington Event was lack of timely replacements for the huge power transformers that would likely be destroyed.
Jeroen: That’s great! But it won’t be in Debian until Wheezy+1 and I’ve got some servers that need upgrading much sooner than that. Maybe in Wheezy+1 time I’ll convert some servers to BTRFS, but then SSD is becoming more popular (prices for home use are decreasing and availability in DCs is increasing) so unless BTRFS gets a feature like ZIL there will still be a good incentive for using ZFS.
Paul: If we had a government that wasn’t stupid and cowardly then they would reduce the anti-terrorism budget and assign some serious money towards preparing for such things. Having a stockpile of replacement transformers, engine control systems for government vehicles and essential services, etc. would cost nothing compared to the post-9/11 expenses. But Carrington Events are likely to kill many more people than al-Qaeda.
How are you running ZFS?
I’m currently toying with zfs-fuse on an external USB drive: I rsync my HDD to it and take a snapshot. Dedupe + compression seems to work fairly well to save space, but my main fs is ext3.
http://etbe.coker.com.au/2012/07/31/zfs-debian-wheezy/
alex: I’m using zfsonlinux code, and for Debian/wheezy I’m using the Ubuntu packages, see the above URL.
I wouldn’t be inclined to run it on removable media as it doesn’t seem to be designed for such things.
Between filesystem snapshots and rsync I would put rdiff-backup: you get a mountable filesystem replica (local or over the net), plus snapshots of the differences:
“The target directory ends up a copy of the source directory, but extra reverse diffs are stored in a special subdirectory of that target directory, so you can still recover files lost some time ago. The idea is to combine the best features of a mirror and an incremental backup.” (from rdiff-backup site).
The bad thing is that it is not actively maintained.
http://lwn.net/Articles/506244/
Jeroen: The above LWN article describes the BTRFS send/receive functionality as experimental as of last month; it seems that the file format isn’t even stabilised.
http://www.nongnu.org/rdiff-backup/
Above is the web site for rdiff-backup.
http://en.wikipedia.org/wiki/Rsync
rdiff is based on the rsync algorithm and described in the above Wikipedia page.
Paolo: Thanks for the suggestion of that, I’ll investigate it for backing up databases. But for things like Maildir storage rsync and cp -rl is best.
Well, we use rdiff-backup storage for maildir (~400GB), because all other options (Bacula, rsync, etc.) have not proved successful due to the time to complete, the high volume of data (in the case of a monthly full backup) and/or the lack of historical data.
In addition we are able to mount the “backup” and restart the service in the event of total loss of live data, or to restore a mailbox as of a certain date with a single command.
The other viable alternative – some sort of LVM or filesystem snapshot – is not an option for us at the moment.
What are your concerns about using rdiff-backup to archive maildir storage?
Paolo: Why do you need rdiff for Maildir? When backing up Maildir I exclude index files (because they would be regenerated by the server anyway if I restored them) and the server doesn’t modify any email files once they are written. So there should be no benefit in storing file diffs.
Why would rdiff provide any benefit over “cp -rl” of an rsync tree?
As for snapshots, LVM is somewhat painful. BTRFS and ZFS offer the promise of doing this properly, but I haven’t yet tested this in production. Comments about ZFS send/receive make it sound easy, but it seems that a fair amount of scripting would be needed to cover the cases of network outages between sites causing failed backups. You can’t just send/receive the diff between the last two snapshots, you have to cover the case where the previous backup run aborted.
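The core of that scripting is working out what to send after a run has failed. A minimal sketch of that decision follows. It assumes that an interrupted receive leaves no partial snapshot on the destination, so the lists of snapshot names on each side can be trusted. The name `choose_send` is made up for illustration, and obtaining the two lists and running the actual send/receive commands is not shown.

```python
def choose_send(source_snapshots, destination_snapshots):
    """Decide what to send so that the destination catches up.

    Both arguments are lists of snapshot names, oldest first. Returns None
    if there is nothing to send, otherwise a (base, target) tuple: target
    is the newest source snapshot and base is the newest snapshot that
    exists on both sides, or None if a full send is needed.
    """
    if not source_snapshots:
        return None
    target = source_snapshots[-1]
    present = set(destination_snapshots)
    if target in present:
        return None  # destination is already up to date

    # Search backwards for the most recent snapshot that both sides hold.
    # After an aborted run this is older than the second-newest snapshot.
    for name in reversed(source_snapshots):
        if name in present:
            return (name, target)
    return (None, target)
```

One consequence is that the source must not delete a snapshot until a newer one is known to have arrived at the other site, otherwise there is no common base and the only option is a full send.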
You are right, any email file is never modified, so no need of computing diffs on it.
But rdiff stores not only file diffs, but also the tree diff, so you have a full copy of your data, plus historical differences with the previous backups.
In this manner we can restore a mailbox deleted x days before, while you can’t do that only with “cp -rl” (without multiple full copies of the data).
Feel free to tell me if it is not appropriate to continue this discussion on the blog.
Paolo: The point of “cp -rl” is that you have hard-links not copies of the data. So you have multiple copies of the directories (which does take some space) but no multiple copies of data. Using that method I can just pipe tar over ssh to restore the data as it’s all there in a regular file tree.
This discussion is fine here.
Russell: I had missed the use of hard links, my fault.
At this point the only advantage of rdiff-backup is that it uses a single command, instead of rsync for the remote copy and cp -rl to create the historical data, which avoids a bit of scripting for automating the tasks.
I must say that this use of hard links (which I had not well considered before) intrigues me, I think I’ll make some experiments on it. Thank you.
Russell,
I used the rsync + cp -al trick for many years. I learnt that filesystems and tools tend to work badly when there are huge numbers of hardlinks. For example, I had a bad time moving my backups to another computer because rsync required too much memory. Removing large trees of hardlinks was surprisingly slow. And so on.
That’s one good reason to favor a backup tool rather than hardlink trees.
That doesn’t mean hardlink trees are a bad idea, they’re just not perfect.