Why Clusters Usually Don’t Work

It’s widely believed that reliability problems can be solved by simply installing a cluster. It’s true that if, instead of a single system of a particular type, you have multiple systems of that type with a cluster configured so that broken systems aren’t used, then reliability will increase. A cluster configuration also allows one system to undergo serious maintenance (e.g. a reboot for a kernel or BIOS upgrade) without interrupting service, apart from the very brief interruption that may be needed for resource failover. But there are some significant obstacles in the path of getting a good cluster going.

Buying Suitable Hardware

If you only have a single server doing something important, and you have some budget for doing things properly, then you really must do everything possible to keep it going: RAID storage with hot-swap disks, hot-swap redundant PSUs, and redundant Ethernet connections bonded together. But if you have redundant servers then the requirement for making any one server reliable is slightly reduced.

Hardware is getting cheaper all the time. A Dell R300 1RU server configured with redundant hot-plug PSUs, two 250G hot-plug SATA disks in a RAID-1 array, 2G of RAM, and a dual-core Xeon Pro E3113 3.0GHz CPU apparently costs just under $2,800AU (when using Google Chrome I couldn’t add some necessary jumper cables to the list, so I couldn’t determine the exact price). So a cluster of two of them would cost about $5,600 just for the servers. But a Dell R200 1RU server with no redundant PSUs, a single 250G SATA disk, 2G of RAM, and a Core 2 Duo E7400 2.8GHz CPU costs only $1,048.99AU. So if a low-end server is required then you could buy two R200 servers that have no redundancy built in for less than the cost of a single server that has hardware RAID and redundant PSUs. Those two servers have different sets of CPU options and probably other differences in their technical specs, but for many applications either will provide more than adequate performance.

Using a server that doesn’t even have RAID is a bad idea; a minimal RAID configuration is a software RAID-1 array, which only requires an extra disk per server. That takes the price of a Dell R200 to $1,203. So it seems that two low-end 1RU servers from Dell that have minimal redundancy features will be cheaper than a single 1RU server that has the full set of features. If you want to serve static content then that’s all you need, and a cluster can save you money on hardware! Of course we can debate whether any cluster node should be missing redundant hot-plug PSUs and disks, but that’s not an issue I want to address in this post.

Serving static content is also the simplest form of cluster. If you have a cluster running a database server then you will need a dual-attached RAID array, which starts to get expensive (or software for replicating the data over the network, which is difficult to configure and may be expensive). So while a trivial cluster may not cost any extra money, a real-world cluster deployment is likely to add significant expense.

My observation is that most people who implement clusters have trouble getting budget for decent hardware. When you have redundancy via the cluster you can tolerate slightly lower expected uptime from the individual servers. While we can debate whether a cluster member should have redundant PSUs and other expensive features, it does seem that using a cheap desktop system as a cluster node is a bad idea. Unfortunately some managers think that a cluster solves the reliability problem and that you can therefore just use recycled desktop systems as cluster nodes; this doesn’t give a good result.

Even if it is agreed that server-class hardware (with features such as ECC RAM) will be used for all servers, you will still have problems if someone decides to use different hardware specs for each of the cluster nodes.

Testing a Cluster

Testing a non-clustered server, or a set of servers behind a load-balancing device, isn’t that difficult in concept. Sure, you have lots of use cases and exception conditions to test, but they are mostly straight-through tests. With a cluster you need to test node failover at unexpected times. When a node is regarded as having an inconsistent state (which can mean that one service it runs could not be cleanly shut down when it was due to be migrated) it will need to be forcibly rebooted, which is known as a STONITH (“Shoot The Other Node In The Head”). A STONITH event usually involves something like IPMI cutting the power or a command such as “reboot -nf”; this loses cached data and can cause serious problems for any application which doesn’t call fsync() as often as it should. It seems likely that the vast majority of sysadmins run programs which don’t call fsync() often enough, but the probability of losing data is low, and the probability of losing data in a way that you will notice (i.e. it doesn’t get automatically regenerated) is even lower. The low probability of data loss due to race conditions, combined with the fact that a server with a UPS and redundant PSUs doesn’t unexpectedly halt very often, means that such problems don’t get found easily. But when clusters have problems and start invoking STONITH, the probability starts increasing.
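As a minimal sketch of what “calling fsync() as often as it should” means, here is a durable file update in Python: write to a temporary file, fsync() it, rename it over the target, then fsync() the directory so the rename itself is on disk. The function and file names are my own illustration, not from any particular application:

```python
import os
import tempfile

def durable_write(path, data):
    """Write data to path so it survives a STONITH or power cut.

    Without the fsync() calls, a forced reboot ("reboot -nf") can
    leave an empty or truncated file behind.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # flush file contents to stable storage
        os.replace(tmp, path)      # atomic rename over the old file
        dirfd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dirfd)        # make the rename itself durable
        finally:
            os.close(dirfd)
    except BaseException:
        try:
            os.unlink(tmp)         # clean up the temp file on failure
        except FileNotFoundError:
            pass
        raise
```

The rename step means a reader always sees either the complete old contents or the complete new contents, never a half-written file.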

Getting cluster software to work correctly isn’t easy. I filed Debian bug #430958 about dpkg (the Debian package manager) not calling fsync() and thus having the potential to leave systems in an inconsistent or unusable state if a STONITH happened at the wrong time. I was inspired to look for this problem after finding the same problem with RPM on a SUSE system. Applying a patch to call fsync() on every file resulted in bug report #578635 about the performance of doing so; the eventual solution was to call sync() after each package is installed. Next time I do any cluster work on Debian I will have to test whether the sync() code works as desired.
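The trade-off that bug #578635 settled on can be sketched as follows. The function and the (path, data) file list are hypothetical illustrations, not dpkg’s actual code; the point is one sync() per package instead of one fsync() per file:

```python
import os

def install_package_files(files):
    """Write all of a package's files, then make them durable in one go.

    files is an iterable of (path, data) pairs.  Calling fsync() on
    every file proved too slow (Debian bug #578635), so a single
    sync() after the whole package is used instead.
    """
    for path, data in files:
        with open(path, "wb") as f:
            f.write(data)
    os.sync()  # flush all dirty buffers once, after the whole package
```

This is cheaper than per-file fsync() but still bounds the window in which a STONITH can leave a half-installed package on disk.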

Getting software to work in a cluster requires that not only bugs in system software such as dpkg be fixed, but also bugs in 3rd party applications and in-house code. Please someone write a comment claiming that their favorite OS has no such bugs and the commercial and in-house software they use is also bug-free – I could do with a cheap laugh.

For the most expensive cluster I have ever installed (worth about 4,000,000 UK pounds – back when the pound was worth something) I was not allowed to power-cycle the servers. Apparently the servers were too valuable to be rebooted in that way, so if they did happen to have any defective hardware or buggy software that would do something undesirable after a power problem it would become apparent in production rather than being a basic warranty or patching issue before the system went live.

I have heard many people argue that if you install a reasonably common OS on a server from a reputable company and run reasonably common server software then the combination would have been tested before and therefore almost no testing is required. I think that some testing is always required (and I always seem to find some bugs when I do such tests), but I seem to be in a minority on this issue as less testing saves money – unless of course something breaks. It seems that the need for testing systems before going live is much greater for clusters, but most managers don’t allocate budget and other resources for this.

Finally there is the issue of testing custom code and the user experience. What is the correct thing for an interactive application to do when one of the cluster nodes goes down, and how would you implement it at the back-end?

Running a Cluster

Systems don’t just sit there unchanged; there are new versions of the OS and applications, and requirements for configuration changes. This means that the people who run the cluster should ideally have some specialised cluster skills. If you hire sysadmins without regard to cluster skills then you will probably end up not hiring anyone who has prior experience with the cluster configuration that you use. Learning to run a cluster is not like learning to run yet another typical Unix daemon; it requires some differences in the way things are done. All changes have to be made consistently to all nodes in the cluster; having the cluster fail over to a node that wasn’t upgraded and can’t understand the new data is not fun at all!

My observation is that the typical experience of having a team of sysadmins who have no prior cluster experience being hired to run a cluster usually involves “learning experiences” for everyone. It’s probably best to assume that every member of the team will break the cluster and cause down-time on at least one occasion! This can be alleviated by only having one or two people ever work on the cluster and having everyone else delegate cluster work to them. Of course if something goes wrong when the cluster experts aren’t available then the result is even more downtime than might otherwise be expected.

Hiring sysadmins who have prior experience running a cluster with the software that you use is going to be very difficult. It seems that any organisation planning a cluster deployment should plan a training program for sysadmins: have a set of test machines suitable for running a cluster, and have every new hire install the cluster software and get it all working correctly. It’s expensive to buy extra systems for such testing, but it’s much more expensive to have people who lack the necessary skills try to run your most important servers!

The trend in recent years has been towards sysadmins not being system programmers. This may be a good thing in other areas but it seems that in the case of clustering it is very useful to have a degree of low level knowledge of the system that you can only gain by having some experience doing system coding in C.

It’s also a good idea to have a test network which has machines in an almost identical configuration to the production servers. Being able to deploy patches to test machines before applying them in production is a really good thing.


Running a cluster is something that you should either do properly or not at all. If you do it badly then the result can easily be less uptime than a single well-run system.

I am not suggesting that people avoid running clusters. You can take this post as a list of suggestions for what to avoid doing if you want a successful cluster deployment.

11 comments to Why Clusters Usually Don’t Work

  • You should probably state which *type* of clustering you’re talking about. Many of your assumptions are wrong for HPC clustering.

  • Carsten

@Jo: Well, it depends of course. We are running a largish HPC cluster (a top-100 position in 2008) and of course want to have minimal downtime for any server. Of course, worker nodes are not as important as, say, a nut server or a file server with the users’ home directories on it.

@Russell: Many of the points you touch on are right. Some of them are not so right – as Jo pointed out – for the HPC/Beowulf world. But essentially yes, one should always try to get reliable hardware which is not at the frontier, i.e. don’t buy the newest hardware as you will encounter many bugs initially, plus you will get more for your bucks.

One very important point you only mention between the lines: you don’t just need good people, you need to automate as much as possible and document every tiny bit as much as possible, regardless of how much you hate it.

We started off with everyone knowing basically everything, but we grew and needed to partition the workload a bit, and now we are being bitten, because much work was not properly documented in the past; as soon as one admin is not available, the others have to work around this and figure out how to tackle the problem…

    But overall a nice article!

Was the 4M pounds cluster a commercial or a scientific one?

  • Iain


Can you expand on which aspects you’re referring to?

    HPC clusters have a lot in common with load-balancing clusters. Issues like failover-testing are obviously less relevant, but the approach to hardware-speccing and cluster management are very similar. I totally agree with the observation on teaching our sysadmins to manage the *cluster*, not the individual machines.

  • To take these in reverse order…

    3) “Running a cluster is something that you should either do properly or not at all. If you do it badly then the result can easily be less uptime than a single well-run system.”
    Absolutely. If the application really needs to be clustered, then you have to be totally paranoid about the running of it; adding very complex software around a basic application, and expecting that everything will all be fine, is totally backwards. The additional complexity adds risks and/or costs (sysadmin experience, as you mention, being the highest – given the fundamentals that you mention – mirrored storage (ideally hotplug), hotplug PSUs, bonded NICs and so on – more outages are caused by human error than by system failure)

    2) “For the most expensive cluster I have ever installed (worth about 4,000,000 UK pounds – back when the pound was worth something) I was not allowed to power-cycle the servers.”
I would refuse to install a cluster under such circumstances. If the application is important enough to the customer for them to cluster it, then they must take the outage to confirm that it works. Otherwise, I cannot sign off that the deliverable has been provided.

    1) “Please someone write a comment claiming that their favorite OS has no such bugs and the commercial and in-house software they use is also bug-free – I could do with a cheap laugh.”
    I have lost count of how many installs of Solaris/SunCluster/HA-Oracle I have done, and I have never heard of data loss due to failure-fencing (STONITH as you call it) with that combination. If you can provide an example, I have no axe to grind, just my own experience.

  • etbe

Jo: Yes, HPC clustering is quite different in all its forms – I wasn’t thinking of that at all when writing this article. I know some people who run 1000+ systems, and when some of them fall over they do things such as re-submit jobs. If your aim is just to do a lot of small and medium size computation runs from a large data set then lots of the redundancy issues become a lot easier. You can design a system such that any failure loses a few hours of computation from one node and not be concerned about it. However if you are running a system that manages user account data and you lose some password changes then it’s a very different issue.

    Carsten: Good point about documentation. That sort of thing comes after training people and buying suitable hardware. The 4M cluster was commercial, it ran something that you can think of as a database.

    Steve: Whether you can refuse or not depends on your position. If you happen to be the most junior person in the team and you can be replaced without concern if you decide to resign then you might as well stay IMHO.

As for not losing data, Oracle is well regarded for being quite reliable in the face of many adverse situations (unlike, for example, MySQL, which didn’t even checksum its data as of the last report I heard). But this doesn’t mean that you should blindly trust an Oracle installation! Take your Oracle installation, test it in some unlikely situations (including a power failure and a second power failure during restart), and then you can have some confidence that the installation in question (not other installations) will work.

    Also if you are looking for corner cases then try removing the battery from the write-back cache on the Sun RAID system, last I heard the RAID controllers went into a different mode of operation when the battery was dead which is something that’s worth testing.

    It’s quite reasonable to say “I’ve done this a dozen times with success”, but it’s not reasonable to say “this is from Sun and it’s good so I won’t test it” – which is the sort of thing I’ve heard many times.

Regarding refusal – if I will be held to task if it doesn’t work, I would rather leave, even if it was a junior position. The closest thing we have to a Hippocratic oath is “let me show (to my satisfaction and to the customer’s) that it works before signing it off”.

    Yes, test it to pieces. Absolutely. Apart from anything else, I don’t want a call 3 months later because a config file in /etc on Node2 is wrong, but we were never allowed to test failover.

Recently I have been doing stuff with Veritas VCS; there are some compromises that VCS makes which shock me. Split-brain is allowed, if the customer chooses to configure it that way. I.e. where STONITH should happen, both nodes can instead be allowed to continue working on the shared data, each blissfully thinking it is the only surviving node. That should never be allowed.

  • James

For larger clusters I’d consider netbooting the nodes instead of having the system on a local disk. I’m pretty sure LiveJournal did this with their Debian-based webserving infrastructure. Depending on your data storage requirements you could use a large SAN and get away with having no local disk at all.

  • @etbe I reckon threaded commenting would make this much easier to read ;)
    Anyway, for those of us who do HPC professionally, other types of clustering aren’t a first thought when the word is used in isolation. Maybe parallel filesystem clustering…
    The “running a cluster” stuff is all pretty applicable to HPC, but obviously the things you test in the testing bit are rather different (you’re mainly looking for benchmarking reproducibility from what was proposed by the vendor in their tender document, plus some burn-in without failure)
    Hardware requirements, however, are very different.
    For parallel codes, a single failure of any kind (even a networking glitch) can kill an MPI job entirely – this is an unavoidable reality of how MPI is written. As a result, you tend to focus on minimizing the risks of certain classes of failure, without wasting money on redundancy that you won’t take advantage of. ECC RAM is in, redundant power supplies are out. PXEBOOT is in, RAID is out. Infiniband is in, Ethernet is out. And, overall, buying the latest generation of hardware you can afford is vital – if you’re running your hardware for 3 years, nobody wants to be running applications on hardware which was already old before it hit the machine room.
    You’re running hundreds of machines flat-out 24/7. Stuff will fail. But adding redundancy costs you X nodes out of your budget – and you could re-do the lost work at a lower “cost”.

  • Carsten


And if you don’t have MPI jobs, but more of a work farm, don’t waste money on Infiniband as it’s just too costly and will easily cost you 10-20% of your nodes – switching is also more expensive than a 10GE backbone structure.

If you have the room and the manpower you can easily try to run your hardware for 5-6 years: 3 years under warranty and then in cannibalism mode, i.e. dead nodes become replacement parts.

  • etbe

    Tom Fifield gave an interesting LUV talk about using EC2 as a cluster for high energy physics. Apparently for his use it was cheaper than owning servers. Part of that was due to the fact that he would otherwise be running servers in Japan (where electricity is expensive) and part of it was due to the servers in question being seriously used for a few months a year – the economics might have been different if there was a need for 365 days worth of computing.

    Infiniband is horribly expensive. It’s really a shame that some vendors don’t give out samples of Infiniband gear to LUGs and Hackerspaces. Last I checked they were regularly releasing new and faster versions of Infiniband which obsoleted a lot of older kit. So giving some obsolete stuff that no-one wanted much to hackers who can develop code that will run on the latest version would be a good investment.

    As for 6yo hardware, after Adobe dropped support for 64bit Flash I tried running Flash on a 7yo 32bit system to watch the season finale of Desperate Housewives. It wasn’t until that viewing failed utterly that I realised how significant the changes in CPUs have been over that time period.

    But I guess it depends on what you are really trying to do. If you are running a job that needs to have huge numbers of disks connected then as disks haven’t changed much in the last decade old machines won’t do too badly.

    But if we assume that CPU power has been doubling every 18 months then 6yo systems would have 1/16 the power of modern systems which would make a huge difference to what you can do in terms of getting work done and avoiding sysadmin pain. Not to mention the fact that RAM sizes have been steadily increasing and RAM for old systems is often unreasonably expensive.

Back to the topic of what sysadmin skills are needed: based on the comments here and my past discussions with scientific computing people, I believe that the training requirements for sysadmin work on a scientific cluster are probably greater than those for a typical HA cluster for an ISP or corporate server. It’s not just clustering which needs to be learned, it’s technologies such as Infiniband, issues such as vibration affecting disk performance, advanced networking (most organisations don’t even push 1Gb/s networking to its limits, let alone 10Gb/s), and probably lots of other things that I can’t think of right now. Then there are OS issues like NUMA systems such as the SGI Altix which are internally quite unlike any regular server, filesystem drivers for clustered filesystems, high performance NFS, various message passing systems (where you need to support several because you support applications written to use them), compilers for unusual languages such as parallel Fortran (which I keep hearing people talk about, so I’m sure that some of you have to support it), unusually large amounts of storage which require unusual backup schemes, and lots more.

  • @etbe
    if there’s one thing you don’t need to stress about administering, it’s an Altix. Think of it as a badly behaved SLES desktop.