The general trend seems to be that cheap hard drives are increasing in capacity faster than much of the data that is commonly stored. Back in 1998 I had a 3G disk in my laptop and about 800M was used for my home directory. Now I have 6.2G used for my home directory (and another 2G in ~/src) out of the 100G capacity in my laptop. So my space usage for my home directory has increased by a factor of about 8 while my space available has increased by a factor of about 30. When I had 800M for my home directory I saved space by cropping pictures for my web site and deleting the originals (thus losing some data I would rather have today) but now I just keep everything and it still doesn’t take up much of my hard drive. Similar trends apply to most systems that I use and that I run for my clients.
Due to the availability of storage, people are gratuitously using a lot of disk space. A relative recently took 10G of pictures on a holiday; her phone has 12G of internal storage so there was nothing stopping her. She might decide that half the pictures aren’t that great if she had to save space, but that space is essentially free (she couldn’t buy a cheaper phone with less storage) so there’s no reason to delete any pictures.
When considering backup methods one important factor is the ability to store all of one type of data on one backup device. Having a single backup span multiple disks, tapes, etc. has a dramatic impact on the ease of recovery and the potential for data loss. Currently 3TB SATA disks are really cheap and 4TB disks are available but rather expensive. Currently only one of my clients has more than 4TB of data used for one purpose (i.e. a single filesystem) so apart from that client a single SATA disk can back up anything that I run.
Benefits of Hard Drive Backup
When using a hard drive there is an option to make it a bootable disk in the same format as the live disk. I haven’t done this, but if you want the option of a quick recovery from a hardware failure then having a bootable disk with all the data on it is a good option. For example a server with software RAID-1 could have a backup disk that is configured as a degraded RAID-1 array.
The biggest benefit is the ability to read a disk anywhere. I’ve read many reports of tape drives being discovered to be defective at the least convenient time. With a SATA disk you can install it in any PC or put it in a USB bay if you have USB 3.0 or the performance penalty of USB 2.0 is bearable – a USB 2.0 bay is great if you want to recover a single file, but if you want terabytes in a hurry then it won’t do.
A backup on a hard drive will typically use a common filesystem. For backing up Linux servers I generally use Ext3; at some future time I will move to BTRFS, as having checksums on all data is a good feature for a backup. Using a regular filesystem means that I can access the data anywhere without needing any special software, I can run programs like diff on the backup, and I can export the backup via NFS or Samba if necessary. You never know how you will need to access your backup so it’s best to keep your options open.
Hard drive backups are the best solution for files that are accidentally deleted. You can have the first line of backups on a local server (or through a filesystem like BTRFS or ZFS that supports snapshots) and files can be recovered quickly. Even a SATA disk in a USB bay is very fast for recovering a single file.
LTO tapes have a maximum capacity of 1.5TB at the moment and tape size has been increasing more slowly than disk size. Also LTO tapes have an expected lifetime of only 200 reads/writes of the entire tape. It seems to me that tapes don’t provide a great benefit unless you are backing up enough data to need a tape robot.
Problems with a Hard Drive Backup
Hard drives tend not to survive being dropped so posting a hard drive for remote storage probably isn’t a good option. This can be solved by transferring data over the Internet if the data isn’t particularly big or doesn’t change too much (I have a 400G data set backed up via rsync to another country because most of the data doesn’t change over the course of a year). Also if the data is particularly small then solid state storage (which costs about $1 per GB) is a viable option. I run more than a few servers which could be entirely backed up to a 200G SSD. $200 for a single backup of 200G of data is a bit expensive, but the potential for saving time and money on the restore means that it can be financially viable.
Some people claim that tape storage will better survive a Carrington Event than hard drives. I’m fairly dubious about the benefits of this. If a hard drive in a Faraday cage (such as a regular safe that is earthed) is going to be destroyed then you will probably be worrying about the security of the food supply instead of your data. Maybe I should just add a disclaimer “this backup system won’t survive a zombie apocalypse”. ;)
It’s widely believed that tape storage lasts longer than hard drives. I doubt that this provides a real benefit, as some of my personal servers are running on 20G hard drives from back when 20G was big. The fact that drives tend to last for more than 10 years combined with the fact that newer bigger drives are always being released means that important backups can be moved to bigger drives. As a general rule you should assume that anything which isn’t regularly tested doesn’t work. So whatever your backup method you should test it regularly and have multiple copies of the data to deal with the case when one copy becomes corrupt. The process of testing a backup can involve moving it to newer media.
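One simple way to test a backup that is stored as plain files is to compare checksums of the backup against the live tree. The following is a minimal sketch, assuming both trees are mounted on the same machine; the function names (file_digest, tree_digests, compare_trees) are made up for illustration and the choice of SHA-256 is an assumption, any strong hash would do.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        # Symlinks are skipped; this sketch only checks regular files.
        if p.is_file() and not p.is_symlink()
    }


def compare_trees(source, backup):
    """Return (missing, extra, changed) relative paths, each sorted."""
    src, bak = tree_digests(source), tree_digests(backup)
    missing = sorted(set(src) - set(bak))   # in source, not in backup
    extra = sorted(set(bak) - set(src))     # in backup, not in source
    changed = sorted(k for k in set(src) & set(bak) if src[k] != bak[k])
    return missing, extra, changed
```

On a live system some differences are expected because files change after the backup runs, so the output is a list to review rather than a pass/fail result. Reading every byte of the backup is the point: it exercises the whole disk, which is also when failing sectors tend to show up.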
I’ve seen it claimed that a benefit of tape storage is that part of the data can be recovered from a damaged tape. One problem with this is that part of a database often isn’t particularly useful. Another issue is that in my experience hard drives usually don’t fail entirely unless you drop them; drives usually fail a few sectors at a time.
How to Implement Hard Drive Backup
The most common need for backups is when someone deletes the wrong file. It’s usually a small restore and you want it to be an easy process. The best solution to this is to have a filesystem with snapshots such as BTRFS or ZFS. In theory it shouldn’t be too difficult to have a cron job manage snapshots, but as I’ve only just started putting BTRFS and ZFS on servers I haven’t got around to changing my backups. Snapshots won’t cover more serious problems such as hardware, software, or user errors that wipe all the disks in a server. For example the only time I lost a significant amount of data from a hosted server was when the data center staff wiped it, so obviously good off-site backups are needed.
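The part of a snapshot cron job that needs some thought is deciding which old snapshots to delete. Here is a minimal sketch of one possible retention policy, assuming one snapshot per day identified by its date; the function name and the keep_daily and keep_weekly parameters are hypothetical, and actually creating and destroying snapshots is left to the filesystem's own tools.

```python
from datetime import date, timedelta


def snapshots_to_prune(snapshot_dates, today, keep_daily=7, keep_weekly=4):
    """Return the snapshot dates that a cron job could delete.

    Keeps every snapshot from the last keep_daily days, plus the newest
    snapshot in each of the keep_weekly most recent ISO weeks before
    that window. Everything else is returned for deletion, oldest first.
    """
    daily_cutoff = today - timedelta(days=keep_daily)
    keep = set()
    weekly = {}  # (ISO year, ISO week) -> newest snapshot date in that week
    for d in snapshot_dates:
        if d > daily_cutoff:
            keep.add(d)
        else:
            key = tuple(d.isocalendar())[:2]
            if key not in weekly or d > weekly[key]:
                weekly[key] = d
    # Newest weeks first; keep only the most recent keep_weekly of them.
    for key in sorted(weekly, reverse=True)[:keep_weekly]:
        keep.add(weekly[key])
    return sorted(d for d in set(snapshot_dates) if d not in keep)
```

Keeping the decision in a pure function like this means the policy can be checked against a list of dates before it is ever allowed to delete anything.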
The easiest way to deal with problems that wipe a server is to have data copied to another system. For remote backups you can rsync to a local system and then use “cp -rl” or your favorite snapshot system to make a hard linked copy of the tree. A really neat feature is the ZFS ability to “send” a filesystem snapshot (or the diff between two snapshots) to a remote system [1]. Once you have regular backups on local storage you can then copy them to removable disks as often as you wish. I think I’ll have to install ZFS on some of my servers for the sole purpose of getting the “send” feature! There are NAS devices that provide similar functionality to the ZFS send/receive (maybe implemented with ZFS), but I’m not a fan of cheap NAS devices [2].
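To show why the hard linked copy is cheap, here is a minimal sketch of roughly what “cp -rl” does, assuming the destination doesn't exist yet and is on the same filesystem as the source. The function name hardlink_copy is made up for illustration, and symlinks, special files, permissions and ownership are deliberately ignored.

```python
import os


def hardlink_copy(src, dst):
    """Recreate the directory tree under src at dst, hard-linking each file.

    Roughly what "cp -rl src dst" does when dst does not yet exist. Both
    paths must be on the same filesystem, since hard links cannot cross
    filesystems. Symlinked directories are not recreated in this sketch.
    """
    for dirpath, dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir)  # fails if dst already exists
        for name in filenames:
            # No file data is copied, only a new directory entry is made.
            os.link(os.path.join(dirpath, name), os.path.join(target_dir, name))
```

This works as a snapshot because rsync by default writes a changed file to a temporary name and renames it into place, which breaks the link and leaves the old version in the earlier copy. That doesn't hold if rsync is run with --inplace, or if something else modifies the files in place.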
It seems that the best way to address the first two needs of backup (fast local restore and resilience in the face of site failure) is to use ZFS snapshots on the server and ZFS send/receive to copy the data to another site. The next issue is that the backup server probably won’t be big enough for all the archives and you want to be able to recover from a failure on the backup server. This requires some removable storage.
The simplest removable backup is to use a SATA drive bay with eSATA and USB connectors. You use a regular filesystem like Ext3 and just copy the files on. It’s easy, cheap, and requires no special skill or software. Requiring no special skill is important, as you never know who will be called on to recover from backups.
When a server is backing up another server by rsync (whether it’s in the same rack or another country) you want the backup server to be reliable. However there is no requirement for a single reliable server and sometimes having multiple backup servers will be cheaper. At current purchase prices you can buy two cheap tower systems with 4*3TB disks for less money than a single server that has redundant PSUs and other high end server features. Having two cheap servers die at once seems quite unlikely so getting two backup servers would be the better choice.
For filesystems that are bigger than 4TB a disk based backup would require backup software that handles multi part archives. One would hope that any software that is designed for tape backup would work well for this (consider a hard drive as a tape with a very fast seek), but often things don’t work as desired. If anyone knows of a good Linux backup program that supports multiple 4TB SATA disks in eSATA or USB bays then please let me know.
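One alternative to a true multi part archive is to assign whole files to disks so that each disk carries a regular filesystem and can be read on its own. The following is a minimal sketch of that idea using first-fit decreasing bin packing. This is an assumption about how one might do it, not a description of how any existing backup program works, and the function name plan_volumes is made up for illustration. It can't handle a single file that is bigger than one disk.

```python
def plan_volumes(file_sizes, capacity):
    """Assign files to backup disks using first-fit decreasing.

    file_sizes maps a path to its size in bytes; capacity is the usable
    size of one disk in bytes. Returns a list of volumes, each a list of
    paths. Raises ValueError if a single file is larger than one disk.
    """
    volumes = []  # one list of paths per disk
    free = []     # remaining bytes on each disk, same order as volumes
    # Largest files first; ties broken by path so the plan is repeatable.
    for path, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        if size > capacity:
            raise ValueError(f"{path} ({size} bytes) does not fit on one disk")
        for i, room in enumerate(free):
            if size <= room:
                volumes[i].append(path)
                free[i] -= size
                break
        else:
            # No existing disk has room, so start a new one.
            volumes.append([path])
            free.append(capacity - size)
    return volumes
```

In practice the capacity passed in should be somewhat less than the raw disk size to allow for filesystem overhead, and the plan should be saved somewhere other than the backup disks so that you know which disk to reach for when restoring a given file.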
Conclusion
BTRFS or ZFS snapshots are the best way of recovering from simple mistakes.
ZFS send/receive seems to be the best way of synchronising updates to filesystems to other systems or sites.
ZFS should be used for all servers. Even if you don’t currently need send/receive you never know what the future requirements may be. Apart from needing huge amounts of RAM (one of my servers had OOM failures when it had a mere 4G of RAM) there doesn’t seem to be any down-side to ZFS.
I’m unsure of whether to use BTRFS for removable backup disks. The immediate up-sides are checksums on all data and meta-data and the possibility of using built-in RAID-1 so that a random bad sector is unlikely to lose data. There is also the possibility of using snapshots on a removable backup disk (if the disk contains separate files instead of an archive). The down-sides are lack of support on older systems and the fact that BTRFS is fairly new.
Have I missed anything?