One of my clients has a NAS device. Last week they tried to do what should have been a routine RAID operation: they added a new, larger disk as a hot-spare and told the RAID array to replace one of the active disks with the hot-spare. The aim was to replace the disks one at a time to grow the array. But one of the other disks had an error during the rebuild and things fell apart.
I was called in after the NAS had been rebooted, at which point it was refusing to recognise the RAID. The first thing that occurred to me is that maybe RAID-5 isn’t a good choice for this array. While it’s theoretically possible for a RAID rebuild to not fail in such a situation (the data that couldn’t be read from the disk with an error could have been regenerated from the disk that was being replaced), it seems that the RAID implementation in question couldn’t do it. As the NAS is running Linux I presume that at least older versions of Linux have the same problem. Of course if you have a RAID array that has 7 disks running RAID-6 with a hot-spare then you only get the capacity of 4 disks. But RAID-6 with no hot-spare should be at least as reliable as RAID-5 with a hot-spare.
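To make the capacity trade-off concrete, here is a minimal sketch of the arithmetic. The function name usable_disks is made up for illustration, and it assumes equal-sized disks, one parity disk for RAID-5 and two for RAID-6.

```shell
# usable_disks LEVEL TOTAL_DISKS HOT_SPARES
# Prints how many disks' worth of capacity the array provides.
# Hypothetical helper: assumes all disks are the same size.
usable_disks() {
    level=$1
    total=$2
    spares=$3
    active=$((total - spares))
    case $level in
        5) parity=1 ;;
        6) parity=2 ;;
        *) echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
    echo $((active - parity))
}

usable_disks 6 7 1   # RAID-6 with a hot-spare: prints 4
usable_disks 5 7 1   # RAID-5 with a hot-spare: prints 5
usable_disks 6 7 0   # RAID-6 with no hot-spare: prints 5
```

On 7 disks, RAID-6 with no hot-spare gives the same usable capacity as RAID-5 with a hot-spare.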
Whenever you recover from disk problems the first thing you want to do is to make a read-only copy of the data. Then you can’t make things worse. This is a problem when you are dealing with 7 disks; fortunately they were only 3TB disks and each had only 2TB in use. So I found some space on a ZFS pool and bought a few 6TB disks which I formatted as BTRFS filesystems. For this task I only wanted filesystems that support snapshots so I could work on snapshots and not on the original copy.
I expect that at some future time I will be called in when an array of 6+ disks of the largest available size fails. This will be a more difficult problem to solve as I don’t own any system that can handle so many disks.
I copied a few of the disks to a ZFS filesystem on a Dell PowerEdge T110 running kernel 3.2.68. Unfortunately that system seems to have a problem with USB: when copying from 4 disks at once each disk was reading about 10MB/s, and when copying from 3 disks each disk was reading about 13MB/s. It seems that the system has an aggregate USB bandwidth of about 40MB/s, roughly what USB 2.0 delivers in practice. This made the process take longer than expected.
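For a sense of scale, here is a rough estimate of the copy time per disk. It assumes that about 2TB (decimal) is read from each disk and that the per-disk rates above hold for the whole copy; copy_hours is a made-up helper name.

```shell
# copy_hours SIZE_BYTES RATE_MB_PER_S
# Prints the copy time in whole hours (rounded down), using decimal megabytes.
# Hypothetical helper for a back-of-the-envelope estimate only.
copy_hours() {
    size_bytes=$1
    rate_mb=$2
    seconds=$((size_bytes / (rate_mb * 1000000)))
    echo $((seconds / 3600))
}

copy_hours 2000000000000 10   # 4 disks at once, 10MB/s each: prints 55
copy_hours 2000000000000 13   # 3 disks at once, 13MB/s each: prints 42
```

Either way each batch of disks takes about two days.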
One of the disks had a read error; this was presumably the cause of the original RAID failure. dd has the option conv=noerror to make it continue after a read error. This initially seemed good but the resulting file was smaller than the source partition. It seems that conv=noerror on its own doesn’t pad or seek the output file to maintain input and output alignment. If I had a hard drive filled with plain ASCII that MIGHT even be useful, but for a filesystem image it’s worse than useless. My workaround was to repeatedly run dd with matching skip and seek options, incrementing by 1K until it had passed the section with errors.
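The skip/seek approach can be sketched as a small loop. This is not the exact command sequence used here: the function name copy_range and its arguments are invented for illustration, and it assumes the destination image already exists at its full size so that unreadable blocks simply stay as they were (for example zero-filled). GNU dd also accepts conv=noerror,sync, which pads each failed input block with NULs and so keeps the output aligned in a single run.

```shell
# copy_range SRC DST FIRST_BLOCK LAST_BLOCK [BLOCK_SIZE]
# Copies blocks FIRST_BLOCK..LAST_BLOCK (inclusive) one at a time, using the
# same value for skip= and seek= so every block lands at its original offset.
# Hypothetical helper: DST is assumed to exist already at full size.
copy_range() {
    src=$1
    dst=$2
    first=$3
    last=$4
    bs=${5:-1024}
    n=$first
    while [ "$n" -le "$last" ]; do
        # conv=notrunc stops dd from truncating DST after the written block.
        if ! dd if="$src" of="$dst" bs="$bs" count=1 \
                skip="$n" seek="$n" conv=notrunc 2>/dev/null; then
            echo "block $n unreadable, left unchanged in $dst" >&2
        fi
        n=$((n + 1))
    done
}
```

Running one dd per 1K block is far too slow for a whole disk, so this only makes sense for the small range around the errors, with ordinary large-block dd runs covering the rest.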
for n in /dev/loop[0-6] ; do echo $n ; mdadm --examine -v -v --scan $n|grep Events ; done
Once I had all the images I had to assemble them. The Linux Software RAID didn’t like the array because not all the devices had the same event count. The way Linux Software RAID (and probably most RAID implementations) works is that each member of the array has an event counter that is incremented when disks are added or removed and when data is written. If there is an error then after a reboot only disks with matching event counts will be used. The above command shows the Events count for all the disks.
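A slightly more structured version of that check might look like the following sketch. The function names event_count and compare_events are invented for illustration, and the parsing assumes that the mdadm --examine output contains a line of the form "Events : N".

```shell
# Print the event counter from `mdadm --examine` output read on stdin.
# Assumes a line of the form "    Events : 1234" is present.
event_count() {
    awk -F' : ' '/^ *Events :/ { print $2; exit }'
}

# compare_events DEVICE...
# Prints "device events" for each member, then a one-line verdict.
compare_events() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(mdadm --examine "$dev" | event_count)"
    done | awk '
        { print; seen[$2]++ }
        END {
            n = 0
            for (e in seen) n++
            if (n > 1) print "event counts differ"
            else print "event counts match"
        }'
}
```

It would be called as compare_events /dev/loop[0-6], normally as root.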
Fortunately different event numbers aren’t going to stop us. After assembling the array (which failed to run) I ran “mdadm -R /dev/md1” which kicked some members out. I then added them back manually and forced the array to run. Unfortunately attempts to write to the array failed (presumably due to mismatched event counts).
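For comparison, a common alternative to kicking members out and re-adding them is a forced assembly. The sketch below is not the sequence described above. It assumes the images are working copies (such as snapshots), because --force may rewrite the metadata on out-of-date members. The function names, image paths and the DRY_RUN variable are invented for illustration; the losetup and mdadm options are standard ones.

```shell
# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# assemble_from_images MD_DEVICE IMAGE...
# Attaches each image to a free loop device, then force-assembles the array
# and starts it read-only. Hypothetical helper: use copies of the images.
assemble_from_images() {
    md=$1
    shift
    loops=""
    i=0
    for img in "$@"; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "losetup --find --show $img"
            loop="LOOP$i"   # placeholder name, no device is attached
        else
            loop=$(losetup --find --show "$img") || return 1
        fi
        loops="$loops $loop"
        i=$((i + 1))
    done
    # --force accepts members whose event counts are behind;
    # --readonly starts the array without allowing writes to it.
    # $loops is deliberately unquoted so each device is a separate argument.
    run mdadm --assemble --force --readonly "$md" $loops
}
```

Setting DRY_RUN=1 first prints the commands without touching anything, which is worth doing before a real recovery attempt.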
Now my next problem is that I can make a 10TB degraded RAID-5 array which is read-only but I can’t mount the XFS filesystem because XFS wants to replay the journal. So my next step is to buy another two 6TB disks to make a RAID-0 array to contain an image of that XFS filesystem.
Finally backups are a really good thing…
Check out GNU ddrescue to avoid that dd problem.
Christian sent me the following comment via email. This is the first report I’ve seen of problems when JavaScript and cookies are enabled. I’ll investigate this.
Thanks for the suggestion Christian.
I wanted to post a reply to your blog post at
http://etbe.coker.com.au/2015/06/28/raid-pain/
and couldn’t because it always kept complaining that Javascript and
cookies needed to be active, which I made sure of. (I even tried a
different browser.)
Anyway, in case you’re interested, here’s the comment I wanted to post:
-----------------------------------------------------------------------
With regards to dd: do you know GNU ddrescue? You can just do
ddrescue /dev/broken /path/to/image.img /path/to/image.log
That will automatically skip errors and get as big chunks as possible
from the drive, ignoring errors. Then it will revisit the blocks with
errors and try very hard to extract stuff from them (trying to read
them from all directions and from the middle, etc.). Since it generates
a log file, you can even interrupt it (Ctrl+C + wait until the current
attempted read operation is done) and have it restart at the same place
at a later point in time.
In fact, I only use plain dd to copy an image TO a disk (like making a
bootable USB stick for installation etc.). Because I’ve had a bad
sector even in drives I thought were good a couple of times, ddrescue
is now the standard tool I use to image disks / partitions if need be,
regardless of whether I think there’s a hardware defect or not. Also,
it’s much more intelligent than dd and automatically adjusts block
size, so you don’t have to play around with bs= etc.
Note that there’s also dd_rescue (with an underscore), which has the
same goal, but personally I much prefer the GNU tool, especially from
a usability standpoint.
-----------------------------------------------------------------------
(If you fix the commenting function, I’ll repost it on your blog again
because it may also be useful for other people.)
I think you do not need physical disks. You can just attach the images to loopback devices (losetup) and assemble the RAID from there.
mdadm v3.3 supports --replace; with older versions of mdadm (as in Wheezy) it is still possible to hot replace a drive if you’re using kernel 3.2+:
1. mdadm /dev/mdX --add /dev/
2. echo want_replacement > /sys/block/md0/md//state
for details see http://unix.stackexchange.com/questions/74924/how-to-safely-replace-a-not-yet-failed-disk-in-a-linux-raid5-array
Mount XFS without recovery:
mount -o ro,norecovery
(as documented in the mount(8) man page)
-Dave.