In a comment on my blog post “a Heartbeat developer comments on my blog post” Alan Robertson writes:
I got in a hurry on my math because of the emergency. So, there are even more assumptions (errors?) than I documented.
In particular, the probability model I gave was for a particular node to fail. So the probability of either of two failing would be double that, and either of three failing would be triple that.
Note that the probability of multiple simultaneous failures goes up as a power, but the probability of any one of them failing only goes up linearly.
I really need to sit down and do the math carefully – but the idea of the simultaneous failures going up as a power is true. And the “any of” probability goes up linearly. That’s also true. This is why people can actually use larger HA clusters ;-).
The 5 years figure is the industry standard quoted figure for an average Intel-based server to fail.
The four hours to repair is a common high-quality of service response time from a hardware vendor. I admit that’s not the same as actual repair time, but if some “repairs” are just reboots, then it’s not a horrible number to start with – if your vendor has cached some spares nearby. I suppose I should sit down and do the math right, and make a spreadsheet of it. (I wonder if I remember that much math?)
I assume disk failures are taken care of by hot swap disks, RAID, etc. and so in effect they “never fail” (at least not totally) so that these failures don’t have to be accounted for by the overall availability model.
Here’s an intuitive way of thinking about it “from your gut”…
If I took your whole data center and made a cluster out of it, what’s the chance that at least half of your servers would fail at once?
Pretty darn small, is the short answer ;-). If it’s not pretty darn small, you need to buy better servers, and IBM has just the servers for you ;-). Or maybe they need to hire a better SysAdmin ;-)
If you ask yourself “when is the last time at least half my machines in my data center couldn’t communicate with the other half”, then hopefully that’s also a “pretty darn small” chance too. If not, there are well-known methods for making networks highly reliable too.
[I'm still ignoring "catastrophes" that you haven't accounted for in your HA architecture].
I’m not saying this is free, and it can be pricey. One of my other favorite sayings is “Paranoia is an expensive hobby”. How much do you want to spend?
You tell me how much you want to spend, and you can figure out how to spend it.
I’ll make a separate comment on quorum models later. It’s getting late here.
My only comment in response to that is to say that I still believe the calculations of probability to be correct in my original post and I am interested to see someone prove otherwise.
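To make the "linear versus power" distinction concrete, here is a rough Python sketch. It uses the two figures from Alan's comment (one failure per server every 5 years, 4 hours to repair) and adds assumptions of its own: nodes fail independently, and the chance that a node is down at a given moment is repair time divided by the total cycle time. The function names are made up for this illustration and it is not meant to settle which calculation is correct.

```python
from math import comb

# Figures quoted in the comment; everything else here is an assumption.
HOURS_PER_YEAR = 24 * 365
MTBF_HOURS = 5 * HOURS_PER_YEAR  # one failure per node every 5 years
MTTR_HOURS = 4                   # vendor response time used as repair time


def node_down_probability(mtbf=MTBF_HOURS, mttr=MTTR_HOURS):
    """Fraction of time a single node is down (steady-state assumption)."""
    return mttr / (mtbf + mttr)


def any_down(n, p):
    """Probability that at least one of n independent nodes is down."""
    return 1 - (1 - p) ** n


def all_down(n, p):
    """Probability that all n independent nodes are down at the same time."""
    return p ** n


def at_least_k_down(n, k, p):
    """Probability that k or more of n independent nodes are down at once."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p = node_down_probability()
for n in (1, 2, 3, 5):
    # any_down grows roughly as n * p; all_down shrinks as p ** n.
    print(n, any_down(n, p), all_down(n, p))
```

For small p, any_down(n, p) is very close to n * p, which is the "linear" part. The at_least_k_down function is the one to use for the "at least half of your servers fail at once" question, for example at_least_k_down(3, 2, p) for a three-node cluster.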
Another comment by Alan:
Data corruption, no doubt, is almost always much worse than loss of availability. And some kinds of data corruption are worse than others. For example, mounting a non-clustered shared disk filesystem twice simultaneously is usually much worse than updating two replicas of the data simultaneously. In the first case, you have to restore to your previous backups and lose all data since then. In the second case, you only lose updates that were made to one of the sides (with the possibility of recovering them by significant effort), and you instantly have a working copy of the data which is nearly always much newer than your last backup. Typically you would only lose a few minutes of updates at worst – and depending on the kind of networking failure, you might not lose anything.
Heartbeats certainly aren’t enough. You need to monitor the health of your servers and the health of your applications. Heartbeat monitors applications and can easily be informed of and act on the health of your servers (with release 2 style Linux-HA Heartbeat configurations).
Since data corruption is so serious, cluster designers worry a great deal about split-brain, which is managed using the ideas of quorum and its sibling, fencing.
This is all about keeping bad things from happening.
This post is really about quorum, since Russ had expressed interest in it.
Quorum is the idea that you can uniquely choose a subcluster to represent the whole cluster in those cases where communication failure has caused the cluster to split into separate sub-clusters which cannot properly communicate with each other. In this way, only one of the subclusters continues on, and the others will sit on their hands and do nothing waiting for a person to fix things.
Some of the kinds of quorum mentioned below are better than others. But, most importantly, they can be used in combination as described later.
The most common kind of quorum is the one that Russ mentioned in his earlier post – the majority quorum. In this method, for a cluster of n nodes, you grant quorum to a sub-cluster which has more than INT(n/2) members. This means that if you have a 3-node cluster, you have to have two nodes to continue. If you have 4 nodes, you have to have 3 nodes to continue. For 5 nodes, you have to have 3 nodes, and so on.
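The majority rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical and this is not Heartbeat code.

```python
def has_majority_quorum(members_present, cluster_size):
    """True if a sub-cluster has more than INT(n/2) of the n cluster nodes."""
    return members_present > cluster_size // 2


def nodes_needed(cluster_size):
    """Smallest sub-cluster size that still has majority quorum."""
    return cluster_size // 2 + 1


for n in (3, 4, 5):
    print(n, nodes_needed(n))  # prints "3 2", "4 3", "5 3"
```

Because two disjoint sub-clusters cannot both contain more than INT(n/2) nodes, at most one of them can ever pass this test, which is the uniqueness property that quorum is meant to provide.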
Other basic methods include disk reserve, where you have to reserve a disk to have quorum. In this case, if only one node survives and it can reserve the disk, it continues to run. However, the disk becomes a single point of failure. This may not be a problem if this single disk is required to run any of the cluster services, since they would fail without it anyway. [Heartbeat does not support this method].
An analogous method is to implement a software resource which grants quorum to one subcluster in a fashion analogous to the disk reserve method. This has the advantage of not requiring disk reserves, or a shared disk, but it has the same SPOF disadvantage as the disk reserve method. Heartbeat does support this method using the quorum daemon. It’s incredibly useful for those cases (like split-site clusters) where you cannot use fencing.
Another method is to grant quorum to any subcluster which can ping a certain set of nodes, and not grant it to any which can’t access those nodes. This isn’t a wonderful method, and has obvious disadvantages with respect to uniqueness, and single points of failure. (Heartbeat doesn’t yet implement this one).
Another method is to grant quorum to any node which is a member of a 2-node cluster. This is better than losing quorum and stopping when one node stops, but obviously completely ignores the uniqueness requirement of quorum.
Another method is to ask a human being if you have quorum. This is hardly an ideal circumstance, but useful in some contexts as described below. (Heartbeat doesn’t yet implement this one).
Perhaps you say, really the only one of these that’s really good is the first one – the majority vote method.
And, I would generally agree with you. But, Heartbeat has the ability to use these in combination which makes some of those methods that seem flaky to be much more reasonable.
Heartbeat has the ability to have multiple quorum modules declared, and they’re used in this way: Any module can return HAVEQUORUM, NOQUORUM, or TIE. If they return HAVEQUORUM or NOQUORUM, then no further quorum modules are consulted. However, if they return TIE, then the next quorum module is consulted for its opinion. If the last quorum module returns TIE, it is treated the same as NOQUORUM.
This enables you to use one quorum module to break the tie declared by a previous quorum module.
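The chaining rule described above can be sketched in Python as follows. This is a toy model, not Heartbeat's actual plugin interface: the module functions, the "view" dictionary and its keys are all hypothetical, and it assumes that the voting module reports TIE when the cluster splits exactly in half.

```python
from enum import Enum


class Verdict(Enum):
    HAVEQUORUM = "HAVEQUORUM"
    NOQUORUM = "NOQUORUM"
    TIE = "TIE"


def majority_vote(view):
    """Hypothetical voting module: TIE on an exact half split (assumption)."""
    present, total = view["present"], view["total"]
    if 2 * present > total:
        return Verdict.HAVEQUORUM
    if 2 * present == total:
        return Verdict.TIE
    return Verdict.NOQUORUM


def quorumd_stub(view):
    """Hypothetical stand-in for asking an external quorum daemon."""
    if view.get("quorumd_grants"):
        return Verdict.HAVEQUORUM
    return Verdict.NOQUORUM


def consult(modules, view):
    """Ask each module in order; the first answer other than TIE is final."""
    for module in modules:
        verdict = module(view)
        if verdict is not Verdict.TIE:
            return verdict
    # Every module returned TIE (or none were configured): no quorum.
    return Verdict.NOQUORUM


# Half of a 4-node cluster: the vote ties, so the quorum daemon decides.
print(consult([majority_vote, quorumd_stub],
              {"present": 2, "total": 4, "quorumd_grants": True}))
```

The ordering matters: a later module is only consulted when every earlier one returned TIE, so a weaker method placed at the end of the list can never override a definite answer from the majority vote.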
You could then use the quorumd to break the tie created by a voting module. Or you could use the quorumd instead of the “two-node” module. Or you could use the “pingable” module instead of the “two-node” module. Or you could at the end always tack on a “human” module, in case all else returns TIE.
This is kind of cool, actually. My favorites for next implementation are the pingable and consult human modules.
And, of course, if your cluster loses quorum due to real server failures, there are always ways to work around it, with a little human intervention. One method is to tell Heartbeat to ignore quorum. Another is to tell Heartbeat to remove certain nodes from the cluster, after you verify that they’re really dead. And, I’m sure that in a pinch, some new methods will be invented. And some of them might actually work ;-).
Regarding the quorumd, it seems that this is an extra server that will generally run on another machine separate from the rest of the cluster. So if we had a two-node cluster with a quorumd then it would effectively be a three-node cluster where one node is not configured to run any resources. It seems that the simpler approach in many cases would be to merely have a three-node cluster with resources not configured to run on one of the nodes.
For example, if I were running a mail server cluster for an ISP I might configure a three-node cluster consisting of the two mail server back-end machines and one other machine that is lightly loaded (e.g. a DNS server), with the DNS machine configured not to run MTA resources.