BTRFS Rebuild Time

In February I replaced a Dell T320 server with an HP Z640 workstation for use as a home server/workstation [1]. The T320 has 8*3.5″ drive bays, which I had used to put 3*4TB disks in a BTRFS RAID-1 array for 6TB of usable capacity. The Z640 has only 2*3.5″ bays and 4*2.5″ bays, so one option would have been to buy a 4TB 2.5″ SSD and keep the same 3*4TB array as before. Instead I chose to use a spare 8TB disk in an array with one of the original 4TB disks and some extra space on NVMe devices (the system has 2*1TB NVMe devices, which are used as a 380G RAID-1 for the root filesystem with the rest going to the storage array). It’s nice how BTRFS allows putting any storage you have in a RAID-1 configuration.

Unfortunately it seems that I chose the wrong 4TB disk for this, as it failed three days ago. It gave thousands of read and write errors and then Linux decided that the drive no longer existed. I tried rebooting the system to get it back in the BTRFS array, but it failed again, and so quickly that it wasn’t even possible to use the data on it as part of a RAID rebuild. So I removed that disk and put in one of the other 4TB disks.

As the array is composed of an 8TB disk and 3 other devices that don’t add up to 8TB, the layout has the 8TB disk holding one copy of everything and the other devices each holding part of the second copy. So the rebuild process consisted of copying data from the 8TB disk to the replacement 4TB disk. For a RAID array run in the manner of Linux software RAID, the rebuild of a RAID-1 involves a linear copy of data, which is the optimal case for hard disks. Copying 4TB of data in that manner would have an average speed of a bit over 100MB/s and take about 11 hours. With BTRFS the source disk has to be updated for each block that is recreated, so the process was bottlenecked on writing to the 8TB disk.

It took 2 days 23 hours to complete. The process involved reading 3,478,031MB and writing 4,405,545MB. The system was live during the process and some cron jobs etc were writing to the array, but in the 12 hours since the rebuild completed the array has had only 7,038MB written. So presumably about 42G of actual data was written to the array during the rebuild, and the other 4.3TB written to the 8TB disk came from the process of copying 3.5TB from it to another device. Iostat reported an average of 645.36 TPS for the duration of the rebuild, which seems like a decent number for a hard drive, and it showed the drive at 99%+ of IO capacity throughout.
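As a rough check of the arithmetic above, here is a small Python sketch that reproduces the estimates from the quoted figures. It assumes decimal units (1TB = 1,000,000MB) and that the background write rate during the rebuild matched the 12 hours afterwards. The variable names are only for illustration.

```python
# Sanity check of the rebuild figures quoted in the post.
# Assumption: decimal units, so 1TB = 1,000,000MB.
MB_PER_TB = 1_000_000

# Idealised linear RAID-1 rebuild: 4TB copied at about 100MB/s.
linear_hours = (4 * MB_PER_TB / 100) / 3600

# Observed BTRFS rebuild time: 2 days 23 hours.
rebuild_hours = 2 * 24 + 23
rebuild_seconds = rebuild_hours * 3600

# Totals reported for the rebuild.
read_mb = 3_478_031
written_mb = 4_405_545

# Assumption: normal activity wrote at the same rate during the rebuild
# as in the 12 hours after it (7,038MB in 12 hours).
background_mb = (7_038 / 12) * rebuild_hours

# Writes attributable to the rebuild itself.
rebuild_write_mb = written_mb - background_mb

# Combined read and write throughput averaged over the whole rebuild.
combined_mb_per_s = (read_mb + written_mb) / rebuild_seconds

print(f"linear copy estimate:   {linear_hours:.1f} hours")
print(f"observed rebuild:       {rebuild_hours} hours")
print(f"background writes:      {background_mb / 1000:.1f} GB")
print(f"rebuild-related writes: {rebuild_write_mb / MB_PER_TB:.2f} TB")
print(f"combined throughput:    {combined_mb_per_s:.1f} MB/s")
```

Under those assumptions the rebuild took a bit over six times as long as the idealised linear copy, with combined throughput of around 30MB/s.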

While waiting for this to complete I wrote a blog post about storage trends [2]. One thing I didn’t mention in that post is that if you are the type of person who checks the rebuild progress fifty times a day, then that should be counted as part of the cost of using slow storage. If instead of an 8TB disk plus some SSD storage I had used 2*4TB disks and 1*4TB SSD, as I had considered doing, then instead of having 3.8TB on one device I would have had about 2.5TB, and the rebuild would probably have taken about 2/3 of the time. If I had moved the array to 3*4TB SSDs then it would have taken a small fraction of the time.
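The 2/3 figure can be sketched as a simple proportional estimate. This assumes rebuild time scales linearly with the amount of data to be recreated, which is a simplification, and the function name is made up for illustration.

```python
def scaled_rebuild_hours(observed_hours, observed_tb, alternative_tb):
    """Scale an observed rebuild time by the amount of data to recreate.

    Assumption: rebuild time is proportional to the data on the failed device.
    """
    return observed_hours * alternative_tb / observed_tb


# Observed: 2 days 23 hours with 3.8TB on one device.
observed_hours = 2 * 24 + 23

# Alternative layout considered in the post: about 2.5TB per device.
ratio = 2.5 / 3.8
estimate_hours = scaled_rebuild_hours(observed_hours, 3.8, 2.5)

print(f"fraction of observed time: {ratio:.2f}")
print(f"estimated rebuild time:    {estimate_hours:.1f} hours")
```

On that assumption the alternative layout would have taken a little under 2 days instead of almost 3.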

One thing to note is that I made a mistake in this operation by removing the failed device instead of doing a “btrfs replace” operation, which can be significantly faster. If I had done this correctly then I would have written a blog post about the rebuild taking 2 days or something. The issues of hard drives being slow and of me compulsively checking the progress would still apply.
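For reference, here is a minimal Python sketch of what the “btrfs replace” approach looks like. It only builds and prints the command lines, it does not run them. The devid, target device, and mount point are hypothetical placeholders, and the helper function is invented for this example. The sketch assumes the failed disk is addressed by its devid (as listed by “btrfs filesystem show”) because the device itself has disappeared.

```python
import shlex


def replace_commands(devid, target, mountpoint):
    """Build argv lists for starting a btrfs replace and checking its status.

    devid: numeric BTRFS device id of the failed device.
    target: block device that takes over from the failed one.
    mountpoint: where the filesystem is mounted.
    """
    start = ["btrfs", "replace", "start", str(devid), target, mountpoint]
    status = ["btrfs", "replace", "status", mountpoint]
    return [start, status]


# Hypothetical values, not taken from the post.
for argv in replace_commands(devid=2, target="/dev/sdX", mountpoint="/mnt/storage"):
    print(shlex.join(argv))
```

The commands need root, and a filesystem with a missing device generally has to be mounted in degraded mode before the replace can be started.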
