Planning Servers for Failure

Sometimes computers fail. If you run enough computers then you will encounter failures regularly, and if the computers are important then you need to plan for failure.

An ideal situation is to have redundant servers. However, a misconfigured cluster can cause more downtime than it prevents, and properly implementing a cluster requires more expensive hardware (you need at least two servers plus hardware that allows a good node to kill a bad node) as well as more time (from people who may charge higher rates).

Most companies don’t have redundant servers. So if you have non-redundant servers there seem to be two reasonable options. The first is to use more expensive hardware on a support contract. If a server is really important to you then get a 24*7 support contract – it only takes an extra mouse click (and a couple of thousand dollars) when ordering a Dell server. I am not going to debate the relative merits of Dell vs IBM vs HP at this time, but I think that most people will agree that Dell offers significant advantages over a white-box server, both in terms of quality (a low incidence of failure) and support (including 24*7 part replacement).

The second option is to have a cheap server that can be easily replaced. This IS appropriate for some tasks. For example I have installed many cheap desktop systems with two IDE disks in a RAID-1 array that run as Internet gateway systems for small businesses. The requirements were that they be quiet, use little power (due to poorly ventilated server rooms / cupboards), be relatively reliable, and be reasonably cheap. If one of those systems suddenly fails and no replacement hardware is available then someone’s desktop PC can be taken as a replacement; having one person unable to work due to a lack of a PC is better than having everyone’s work impeded by lack of Internet access!
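Setting up such a two-disk RAID-1 array with Linux software RAID takes only a few commands. A minimal sketch, where the device names are examples to adjust for the actual hardware:

```shell
# Create a RAID-1 array from the first partition of each disk
# (hda1/hdc1 are example IDE device names; modern systems use sda1/sdb1).
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1

# Make a filesystem on the mirror and confirm both members are active.
mkfs -t ext3 /dev/md0
mdadm --detail /dev/md0

# Watch array state and rebuild progress at any time.
cat /proc/mdstat
```

If one disk dies the machine keeps running on the remaining member, and a replacement disk can be added back into the array with `mdadm --add` when convenient.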

This ability to swap hardware depends on the new hardware being reasonably similar. Finding a desktop PC in an office today which can support two IDE disks and which has an Ethernet port on the motherboard and a spare PCI slot is not too difficult. I expect that in the near future such machines will start to disappear, which will be an incentive for using systems with SATA disks and USB keyboards as routers.

This evening I had to advise someone who was dealing with a broken server. The system in question is mission critical, was based on white-box hardware, and had four SATA disks in an LVM volume group for the root filesystem. This gave a 600G filesystem with less than 10G in use. If the person who installed it had chosen to use only a single disk (or even better two disks in a RAID-1 array) then there would have been a wide range of systems that could take the disks and be used to keep the company running for business tomorrow. But finding a computer that can handle four SATA disks is a little trickier.
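A more easily transplanted layout for a server like this would be two disks in a RAID-1 array with LVM on top of the mirror, so any machine with two SATA ports could take the disks. A sketch, with example device and volume group names (not the configuration of the actual broken server):

```shell
# Mirror the two disks, then use the mirror as the LVM physical volume.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
pvcreate /dev/md0
vgcreate vg0 /dev/md0

# Allocate only what is needed (the broken server used under 10G of 600G);
# unused extents stay free and can be added later with lvextend.
lvcreate -L 20G -n root vg0
mkfs -t ext3 /dev/vg0/root
```

This keeps the flexibility of LVM for growing the filesystem later without tying the installation to a four-disk chassis.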

Running a mission critical server without RAID is obviously very wrong. But using four disks in an LVM volume group both increases the probability of a data-destroying disk failure and makes it more difficult to replace the computer itself. Some server installations are fractally wrong.

3 comments to Planning Servers for Failure

  • RAID is no panacea. Compare with ZFS.

  • etbe

    Toby: I agree that RAID doesn’t solve all problems. But it does solve certain classes of reasonably common problems very well.

    Please explain the relevance of your comment about ZFS.

I am not sure how familiar you are with ZFS, so I’ll summarise: the crucial advantage is that ZFS uses checksums for data and metadata and performs self-healing, whereas RAID cannot even detect corruption.

A half-written mirror caused by a power failure, for example, leaves RAID with a likely undetected, unfixable integrity problem. ZFS will detect and fix it, ensuring that bad data will not reach your application or operating system.

    By this criterion alone, it obsoletes software RAID and practically all forms of hardware RAID.
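For comparison, the ZFS equivalent of the RAID-1 setups discussed above is a mirrored pool, and the self-healing Toby describes can be exercised explicitly with a scrub. A sketch with example pool and device names:

```shell
# Create a mirrored pool; ZFS checksums all data and metadata it writes.
zpool create tank mirror sda sdb

# Read every block in the pool and verify its checksum, repairing any
# block that fails verification from the good side of the mirror.
zpool scrub tank

# Report pool health, including checksum errors found and repaired.
zpool status tank
```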