
Heartbeat version 2.0 CIB STONITH example configuration

Below is a sample script to configure the ssh STONITH agent for the Heartbeat system. STONITH will reboot nodes when things go wrong to restore the integrity of the cluster.

The STONITH test program supports the -n option to list parameters and the -l option to list nodes. The following is an example of using it with the ssh agent:
# stonith -t ssh -n
hostlist
# stonith -t ssh hostlist="node-0 node-1" -l
node-0
node-1

The hostlist tuple is the only configuration option for ssh. It is assumed that you have passwordless ssh logins allowed between root on all the nodes in the cluster so the host name list is all that’s needed.

The important thing to note about the constraint is that you are constraining the parent of the clones (which in this example has an ID of “DoFencing”), not a clone instance.

ssh is the simplest and in many ways least useful method of STONITH, but it’s good for an example as everyone knows it. Once you get ssh going it’ll be trivial to get the other methods working.

See below for the script to insert the XML in the CIB.

priorities for heartbeat services

Currently I am considering the priority scheme to use for some highly available services running on Linux with Heartbeat.

The Heartbeat system has a number of factors that can be used to determine the weight for running a particular service on a given node. One is the connectivity to other systems as determined by ping (every system that is pingable can add a value to the score). One is the number of failures (every failure deducts a value from the total score). One is the weight for staying on the same node (IE if the situation changes and the current node is no longer the ideal node, you might not want to move the service to a different node immediately, as a move gives some seconds of no service). The last is the preference for each node that may run the service.

For a given node to run a particular service, its score has to be greater than the score of every other node and also greater than zero. If all nodes have a score that is zero or less then the service will not run.
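The selection rule above can be sketched in a few lines. This is only an illustration of the rule as stated here, not Heartbeat's actual implementation: the function name `pick_node` and the node names are made up, and returning nothing on an exact tie is a literal reading of "greater than all other nodes" rather than documented behaviour.

```python
from typing import Dict, Optional


def pick_node(scores: Dict[str, int]) -> Optional[str]:
    """Return the node with the strictly highest positive score, else None."""
    if not scores:
        return None
    # Sort nodes from highest score to lowest.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_node, best_score = ranked[0]
    if best_score <= 0:
        # Every node is at zero or below, so the service does not run.
        return None
    if len(ranked) > 1 and ranked[1][1] == best_score:
        # Assumption: an exact tie means no node is "greater than all others".
        return None
    return best_node


print(pick_node({"node-0": 100, "node-1": 50}))
print(pick_node({"node-0": 0, "node-1": -5}))
```

The first call picks node-0 and the second picks no node at all, which is the "service will not run" case.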

Now in the case of a service that repeatedly fails (EG a filesystem mount that relies on a hardware RAID which is not connected) then what should we do? One option is to have the score for running on a particular node be for example 100 times the value that is subtracted on failure. In this case after 100 failures on that node (and an appropriate number of failures on other nodes which are permitted to run the service) it will be disabled. Then the service has to be explicitly re-enabled (or a node rebooted) before it will run again.

The other option would be to have the value that is subtracted on failure be less than a billionth of the score for running on a particular node, so that the service will keep trying to start for the next few hundred years. The up-side of this is that there is less fiddling required; the down-side is that some CPU and disk resources will be kept active in repeatedly starting the service.
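To put numbers on the two options, here is a rough sketch of the arithmetic. The function names are hypothetical, and the figure of 10 seconds per start attempt is purely an assumption for illustration; the real retry interval depends on the resource and its timeouts.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def failures_until_disabled(node_score: int, failure_penalty: int) -> int:
    """Failures needed before the score is no longer greater than zero."""
    # Ceiling division: the service stops once the score reaches zero or less.
    return -(-node_score // failure_penalty)


def years_of_retrying(failures: int, seconds_per_attempt: float) -> float:
    """How long the retries last if every attempt takes the same time."""
    return failures * seconds_per_attempt / SECONDS_PER_YEAR


# Option 1: the node score is 100 times the failure penalty.
print(failures_until_disabled(100, 1))

# Option 2: the failure penalty is a billionth of the node score.
attempts = failures_until_disabled(10**9, 1)
print(round(years_of_retrying(attempts, 10.0)))
```

With the assumed 10 seconds per attempt, a billion attempts works out to a bit over three centuries, which matches the "few hundred years" estimate.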

Now I have to decide which option to take in this regard. Any comments would be appreciated.

more about Heartbeat

In a comment on my blog post “a Heartbeat developer comments on my blog post” Alan Robertson writes:
I got in a hurry on my math because of the emergency. So, there are even more assumptions (errors?) than I documented.
In particular, the probability model I gave was for a particular node to fail. So the probability of either of two failing would be double that, and either of three failing would be triple that.
Note that the probability of multiple simultaneous failures goes up as a power, but the probability of “either of” only goes up linearly.
I really need to sit down and do the math carefully – but the idea of the simultaneous failures going up as a power is true. And the “any of” probability goes up linearly. That’s also true. This is why people can actually use larger HA clusters ;-).
The 5 years figure is the industry standard quoted figure for an average Intel-based server to fail.
The four hours to repair is a common high-quality of service response time from a hardware vendor. I admit that’s not the same as actual repair time, but if some “repairs” are just reboots, then it’s not a horrible number to start with – if your vendor has cached some spares nearby. I suppose I should sit down and do the math right, and make a spreadsheet of it. (I wonder if I remember that much math?)
I assume disk failures are taken care of by hot swap disks, RAID, etc. and so in effect they “never fail” (at least not totally) so that these failures don’t have to be accounted for by the overall availability model.
Here’s an intuitive way of thinking about it “from your gut”…
If I took your whole data center and made a cluster out of it, what’s the chance that at least half of your servers would fail at once?
Pretty darn small, is the short answer ;-). If it’s not pretty darn small, you need to buy better servers, and IBM has just the servers for you ;-). Or maybe they need to hire a better SysAdmin ;-)
If you ask yourself “when is the last time at least half my machines in my data center couldn’t communicate with the other half”, then hopefully that’s also a “pretty darn small” chance too. If not, there are well-known methods for making networks highly reliable too.
[I’m still ignoring “catastrophes” that you haven’t accounted for in your HA architecture].
I’m not saying this is free, and it can be pricey. One of my other favorite sayings is “Paranoia is an expensive hobby”. How much do you want to spend?
You tell me how much you want to spend, and you can figure out how to spend it.
I’ll make a separate comment on quorum models later. It’s getting late here.

My only comment in response to that is to say that I still believe the calculations of probability to be correct in my original post and I am interested to see someone prove otherwise.

Another comment by Alan:
Data corruption, no doubt, is almost always much worse than loss of availability. And some kinds of data corruption are worse than others. For example, mounting a non-clustered shared disk filesystem twice simultaneously is usually much worse than updating two replicas of the data simultaneously. In the first case, you have to restore to your previous backups and lose all data since then. In the second case, you only lose updates that were made to one of the sides, and you instantly have a working copy of the data which is nearly always much newer than your last backup (with the possibility of recovering them by significant effort). Typically you would only lose a few minutes of updates at worst – and depending on the kind of networking failure, you might not lose anything.
Heartbeats certainly aren’t enough. You need to monitor the health of your servers and the health of your applications. Heartbeat monitors applications and can easily be informed of and act on the health of your servers (with release 2 style Linux-HA Heartbeat configurations).

Followed by:
Since data corruption is so serious, this is why cluster designers worry so much about split-brain, which is managed using the ideas of quorum and its sibling, fencing.
This is all about keeping bad things from happening.
This post is really about quorum, since Russ had expressed interest in it.
Quorum is the idea that you can uniquely choose a subcluster to represent the whole cluster in those cases where communication failure has caused the cluster to split into separate sub-clusters which cannot properly communicate with each other. In this way, only one of the subclusters continues on, and the others will sit on their hands and do nothing waiting for a person to fix things.
Some of the kinds of quorum mentioned below are better than others. But, most importantly, they can be used in combination as described later.
The most common kind of quorum is that Russ mentioned in his earlier post – the majority quorum. In this method, for a cluster of n nodes, you grant quorum to a sub-cluster which has more than INT(n/2) members. This means that if you have a 3-node cluster, you have to have two nodes to continue. If you have 4 nodes, you have to have 3 nodes to continue. For 5 nodes, you have to have 3 nodes, and so on.
Other basic methods include disk reserve, so that you have to reserve a disk to have quorum. In this case, if only one node survives and it can reserve the disk, it continues to run. However, the disk becomes a single point of failure. This may not be a problem if this single disk is required to run any of the cluster services, since they would fail without it anyway. [Heartbeat does not support this method].
An analogous method is to implement a software resource which grants quorum to one subcluster in a fashion analogous to the disk reserve method. This has the advantage of not requiring disk reserves, or a shared disk, but it has the same SPOF disadvantage as the disk reserve method. Heartbeat does support this method using the quorum daemon. It’s incredibly useful for those cases (like split-site clusters) where you cannot use fencing.
Another method is to grant quorum to any subcluster which can ping a certain set of nodes, and not grant it to any which can’t access those nodes. This isn’t a wonderful method, and has obvious disadvantages with respect to uniqueness, and single points of failure. (Heartbeat doesn’t yet implement this one).
Another method is to grant quorum to any node which is a member of a 2-node cluster. This is better than losing quorum and stopping when one node stops, but obviously completely ignores the uniqueness requirement of quorum.
Another method is to ask a human being if you have quorum. This is hardly an ideal circumstance, but useful in some contexts as described below. (Heartbeat doesn’t yet implement this one).
Perhaps you say, really the only one of these that’s really good is the first one – the majority vote method.
And, I would generally agree with you. But, Heartbeat has the ability to use these in combination which makes some of those methods that seem flaky to be much more reasonable.
Heartbeat has the ability to have multiple quorum modules declared, and they’re used in this way: Any module can return HAVEQUORUM, NOQUORUM, or TIE. If they return HAVEQUORUM or NOQUORUM, then no further quorum modules are consulted. However, if they return TIE, then the next quorum module is consulted for its opinion. If the last quorum module returns TIE, it is treated the same as NOQUORUM.
This enables you to use one quorum module to break the tie declared by a previous quorum module.
You could then use the quorumd to break the tie created by a voting module. Or you could use the quorumd instead of the “two-node” module. Or you could use the “pingable” module instead of the “two-node” module. Or you could at the end always tack on a “human” module, in case all else returns TIE.
This is kind of cool, actually. My favorites for next implementation are the pingable and consult human modules.
And, of course, if your cluster loses quorum due to real server failures, there are always ways to work around it, with a little human intervention. One method is to tell Heartbeat to ignore quorum. Another is to tell Heartbeat to remove certain nodes from the cluster, after you verify that they’re really dead. And, I’m sure that in a pinch, some new methods will be invented. And some of them might actually work ;-).
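The chaining of quorum modules described in the comment above can be sketched as follows. This is an illustration of the described logic only, not Heartbeat's plugin API: the names `Verdict`, `decide`, `majority` and `human` are invented for the example, and having the majority module report TIE on an exact half split is an assumption.

```python
from enum import Enum
from typing import Callable, Iterable


class Verdict(Enum):
    HAVEQUORUM = "HAVEQUORUM"
    NOQUORUM = "NOQUORUM"
    TIE = "TIE"


def decide(modules: Iterable[Callable[[], Verdict]]) -> Verdict:
    """Consult modules in order; the first answer that is not TIE wins."""
    for module in modules:
        verdict = module()
        if verdict is not Verdict.TIE:
            return verdict
    # A TIE from the last module is treated the same as NOQUORUM.
    return Verdict.NOQUORUM


def majority(visible: int, total: int) -> Callable[[], Verdict]:
    """Hypothetical voting module: more than INT(total/2) nodes are needed."""
    def module() -> Verdict:
        if 2 * visible > total:
            return Verdict.HAVEQUORUM
        if 2 * visible == total:
            return Verdict.TIE  # assumption: an exact half split is a tie
        return Verdict.NOQUORUM
    return module


def human(says_yes: bool) -> Callable[[], Verdict]:
    """Hypothetical "ask a person" module used as the final tie-breaker."""
    def module() -> Verdict:
        return Verdict.HAVEQUORUM if says_yes else Verdict.NOQUORUM
    return module


# Two of four nodes visible: the vote ties and the operator breaks the tie.
print(decide([majority(2, 4), human(True)]))
# Only one of four nodes visible: the vote says no and nobody else is asked.
print(decide([majority(1, 4), human(True)]))
```

The point of the design is that a weak module (such as the human one) is only ever consulted when a stronger module could not decide.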

Regarding the quorumd, it seems that this is an extra server that will generally run on another machine separate from the rest of the cluster. So if we had a two-node cluster with a quorumd then it would effectively be a three-node cluster where one node is not configured to run any resources. It seems that the simpler approach in many cases would be to merely have a three-node cluster with resources not configured to run on one of the nodes.

For example if I was running a mail server cluster for an ISP I might configure a three node cluster of the two mail server back-end machines and one other machine that is lightly loaded (EG a DNS server) and have it configured not to run MTA resources on the DNS machine.

a Heartbeat developer comments on my blog post

Alan Robertson (a major contributor to the Heartbeat project) commented on my post “failure probability and clusters”. His comment deserves wider readership than a comment generally gets so I’m making a post out of it. Here it is:

One of my favorite phrases is “complexity is the enemy of reliability” . This is absolutely true, but not a complete picture, because you don’t actually care much about reliability, you care about availability.
Complexity (which reduces MTBF) is only worth it if you can use it to drastically cut MTTR – which in turn raises availability significantly. If your MTTR was 0, then you wouldn’t care if you ever had a failure. Of course, it’s never zero.
But, with normal clustering software, you can significantly improve your availability, AND your maintainability.
Your post makes some assumptions which are more than a little simplistic. To be fair, the real mathematics of this are pretty darn complicated.
First I agree that there are FAR more 2-node clusters than larger clusters. But, I think for a different reason. People understand 2-node clusters. I’m not saying this isn’t important, it is important. But, it’s not related to reliability.
Second, you assume a particular model of quorum, and there are many. It is true that your model is the most common, but it’s hardly the only one – not even for heartbeat (and there are others we want to implement).
Third, if you have redundant networking, and multiple power sources, as you should, then system failures become much less correlated. The normal model which is used is completely uncorrelated failures.
This is obviously an oversimplification as well, but if you have redundant power supplies supplied from redundant power feeds, and redundant networking etc. it’s not a bad approximation.
So, if you have an MTTR of 4 hours to repair broken hardware, what you care about is the probability of having additional failures during those four hours.
If your HA software can recover from an error in 60 seconds, then that’s your effective MTTR as seen by (a subset) of users. Some won’t see it at all. And, of course, that should also go into your computation. This depends on knowing a lot about what kind of protocol is involved, and what the probability of various lengths of failures is to be visible to various kinds of users. And, of course, no one really knows that either in practice.
If you have a hardware failure every 5 years approximately, and a hardware repair MTTR of 4 hours, then the probability of a second failure during that time is about .009%. The probability of two failures occurring during that time is about 8×10^-7% – which is a pretty small number.
Probabilities for higher order failures are proportionately smaller.
But, of course, like any calculation, the probabilities of this are calculated using a number of simplifying assumptions.
It assumes, for example, that the probabilities of correlated failures are small. For example, the probability of a flood taking out all the servers, or some other disaster is ignored.
You can add complexity to solve those problems too ;-), but at some point the managerial difficulties (complexity) overwhelms you and you say (regardless of the numbers) that you don’t want to go there.
Managerial complexity is minimized by uniformity in the configuration. So, if all your nodes can run any service, that’s good. If they’re asymmetric, and very wildly so, that’s bad.
I have to go now, I had a family emergency come up while I was writing this. Later…

End quote.

It’s interesting to note that there are other models of quorum; I’ll have to investigate that. Most places I have worked have had an MTTR that is significantly greater than four hours. But if you have hot-swap hard drives (so drive failure isn’t a serious problem) then having machines have an average of one failure per five years should be possible.
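The figures quoted in the comment can be reproduced with a quick calculation. This sketch assumes uncorrelated failures at a constant rate and uses the simple approximation that the chance of a failure in a short window is the window length divided by the mean time between failures; the function name is made up for the example.

```python
HOURS_PER_YEAR = 365 * 24


def failure_probability(mtbf_years: float, window_hours: float) -> float:
    """Approximate chance that one node fails within the repair window."""
    # Reasonable only while the window is much shorter than the MTBF.
    return window_hours / (mtbf_years * HOURS_PER_YEAR)


p_one = failure_probability(5, 4)  # one particular node fails during a repair
p_two = p_one ** 2                 # two particular nodes both fail in that time

print(f"{p_one * 100:.5f}%")  # 0.00913%
print(f"{p_two * 100:.2e}%")  # 8.34e-07%
```

Changing the 4 hour window to a repair time of a day or more shows how quickly the numbers grow when the MTTR is longer.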

2 node vs 3+ node clusters

A comment on my post about the failure probability of clusters suggested that a six node cluster that has one node fail should become a five node cluster.

The problem with this is what to do when nodes recover from a failure. For example if a six node cluster had a node fail and became a five node cluster, then became a three node cluster after another two nodes had failed, then you would have half the cluster that was disconnected. If the three nodes that appeared to have failed became active again but unable to see the other three nodes then you would have a split-brain situation.

As noted in the comment, the special case of a two node cluster does have different failure situations. If the connection between nodes goes down and the router can still be pinged then you can have a split-brain situation. To avoid this you will generally have a direct connection between the two nodes (either a null-modem cable or a crossover Ethernet cable); such cables are more reliable than networking which involves a switch or hub. Also the network interface which involves the router in question will ideally also be used as a method of maintaining cluster status – it seems unlikely that two nodes will both be able to ping the router but be unable to send data to each other.

For best reliability you need to use multiple network interfaces between cluster nodes. One way of doing this is to have a pair of Ethernet ports bonded for providing the service (connected to two switches and pinging a router to determine which switch is best to use). The Heartbeat software supports encrypted data so it should be safe to run it on the same interface as used for providing the service (of course if you provide a service to the public Internet then you want a firewall to prevent machines on the net from trying to attack it).

Heartbeat also supports using multiple interfaces for maintaining the cluster data, so you can have one network dedicated to cluster operations and the network that is used for providing the service can be a backup network for cluster data. The pingd service allows Heartbeat to place services on nodes that have good connectivity to the net. So you could have multiple nodes that each have one Ethernet port for providing the service and one port as a backup for Heartbeat operations; if pingd indicates that the service port is not functioning correctly then the services will be moved to other nodes.

If you want to avoid having private Heartbeat data going over the service interface then in the two-node case you need a minimum of two Ethernet ports for Heartbeat and one port for providing the service if you use pingd. If you don’t use pingd then you need two bonded ports for providing the service and two ports (either bonded or independently configured in Heartbeat) for Heartbeat, giving a total of four ports.

When there are more than two nodes in the cluster the criterion for cluster membership is that a majority of nodes are connected. This makes split-brain impossible and reduces the need to have reliable Ethernet interfaces. A cluster with three or more nodes could have a single service port and a single private port for Heartbeat, or if you trust the service interface you could do it all on one Ethernet port.
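The reason a majority rule rules out split-brain is simple counting: two disjoint groups of nodes cannot both contain more than half of the cluster. A small sketch that checks this by brute force (the function names are made up for illustration):

```python
def has_quorum(connected: int, total: int) -> bool:
    """A partition has quorum when it holds a strict majority of all nodes."""
    return 2 * connected > total


def split_brain_possible(total: int) -> bool:
    """True if some split into two groups gives both groups quorum."""
    return any(
        has_quorum(group, total) and has_quorum(total - group, total)
        for group in range(total + 1)
    )


for nodes in range(3, 10):
    print(nodes, split_brain_possible(nodes))

# An even split, such as three and three in a six node cluster,
# leaves neither half with quorum.
print(has_quorum(3, 6))
```

The last line is the down-side of even node counts: an even split cannot cause split-brain, but it does stop the whole cluster.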

In summary, three nodes is better than two, but requires more hardware. Five nodes is better than three, but as I wrote in my previous post four nodes is not much good. I recommend against any even number of nodes other than two for the same reason.

failure probability and clusters

When running a high-availability cluster of two nodes it will generally be configured such that if one node fails then the other runs. Some common operation (such as accessing a shared storage device or pinging a router) will be used by the surviving node to determine that the other node is dead and that it’s not merely a networking problem. Therefore if you lose one node then the system keeps operating until you lose another.

When you run a three-node cluster the general configuration is that a majority of nodes is required. So if the cluster is partitioned then one node on its own will shut down all services while two nodes that can talk to each other will continue operating as normal. This means that to lose the cluster you need to lose all inter-node communication or have two nodes fail.

If the probability of a node surviving for the time interval required to repair a node that’s already died is N (where N is a number between 0 and 1 – 1 means 100% chance of success and 0 means it is certain to fail) then for a two node cluster the probability of the second node surviving long enough for a dead node to be fixed is N. For a three node cluster the probability that both the surviving two nodes will survive is N^2. This is significantly less, therefore a three node cluster is more likely to experience a critical second failure than a two node cluster.

For a four node cluster you need three active nodes to have quorum. Therefore the probability that a second node won’t fail is N^3 – even worse again!

For a five node cluster you can lose two nodes without losing the cluster. If you have already lost a node the probability that you won’t lose another two is N^4+(1-N)*N^3*4. As long as N is greater than 0.8 the probability of keeping three nodes out of four is greater than the probability of a single node not failing.
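As a quick check of the five node formula, the sketch below compares it with the two node figure (which is just N) for a few values. The function name is made up; the formula is the one given above, which simplifies to 4N^3 - 3N^4. Solving 4N^3 - 3N^4 = N gives a crossover at N = (1 + sqrt(13)) / 6, or about 0.77, so any N above 0.8 is comfortably on the right side.

```python
from math import sqrt


def five_node_survival(n: float) -> float:
    """Chance that at least three of the four remaining nodes survive."""
    all_four = n ** 4
    exactly_three = 4 * (1 - n) * n ** 3
    return all_four + exactly_three


for n in (0.7, 0.8, 0.9, 0.99):
    better = five_node_survival(n) > n
    print(f"N={n}: five nodes {five_node_survival(n):.4f}, better than N: {better}")

crossover = (1 + sqrt(13)) / 6
print(f"crossover at N of about {crossover:.3f}")
```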

To see the probabilities of four and five node clusters experiencing a catastrophic failure after one node has died run the following shell script for different values of N (0.9 and 0.99 are reasonable values to try). You might hope that the probability of a second node remaining online while the first node is being repaired is significantly higher than 0.9, however when you consider that the first node’s failure might have been partially caused by the ambient temperature, power supply problems, vibration, or other factors that affect multiple nodes I don’t think it’s impossible for the probability to be as low as 0.9.

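N=0.9   # probability that a node survives the repair window; try 0.99 as well
echo "$N^4+(1-$N)*$N^3*4" | bc -l ; echo "$N^3" | bc -l   # five node result, then four node result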

So it seems that if reliability is your aim in having a cluster then your options are two nodes (if you can be certain of avoiding split-brain) or five nodes. Six nodes is not a good option as the probability of losing three nodes out of six is greater than the probability of losing three nodes out of five. Seven and nine node clusters would also be reasonable options.
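The same reasoning can be extended to any cluster size with a binomial sum. This is a sketch under the assumptions used in this post: one node has already failed, every remaining node independently survives the repair window with probability N, the cluster needs a strict majority of its original node count, and a two node cluster is treated as the special case where the single survivor keeps running. The function name is made up for the example.

```python
from math import comb


def survival_after_one_failure(nodes: int, n: float) -> float:
    """Chance the cluster keeps running while the first failed node is repaired."""
    if nodes == 2:
        # Special case described above: the surviving node carries on alone.
        return n
    remaining = nodes - 1
    quorum = nodes // 2 + 1  # strict majority of the original node count
    return sum(
        comb(remaining, alive) * n ** alive * (1 - n) ** (remaining - alive)
        for alive in range(quorum, remaining + 1)
    )


for size in range(2, 10):
    print(size, round(survival_after_one_failure(size, 0.9), 5))
```

Running it for N = 0.9 shows the pattern described above: each even size from four upwards does worse than the odd size just below it.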

But it’s not surprising that a Google search for “five node” cluster high-availability gives about a tenth as many results as a search for “four node” cluster high-availability. Most people in the computer industry like powers of two more than they like maths.

heartbeat – what defines a cluster?

In Debian bug 418210 there is discussion of what constitutes a cluster.

I believe that the node configuration lines in the config file /etc/ha.d/ha.cf should authoritatively define what is in the cluster and any broadcast packets from other nodes should be ignored.

Currently if you have two clusters sharing the same VLAN and they both use the same auth code then they will get confused about which node belongs to each cluster.

I set up a couple of clusters for testing (one Debian/Etch and the other Debian/unstable) under Xen using the same bridge device – naturally I could set up separate bridges – but why should I have to?

I gave each of them the same auth code (one was created by copying the block devices from the other – they have the same root password so there shouldn’t be a need for changing any other passwords). Then things all fell apart. They would correctly determine that they should each have two nodes in the cluster (mapping to the two node lines), but cluster 1 would get nodes ha1 and ha2-unstable even though it had node lines for ha1 and ha2.

I have been told that this is the way it’s supposed to be and I should just use different ports or different physical media.

I wonder how many companies have multiple Heartbeat installations on different VLANs such that a single mis-connected cable will make all hell break loose on their network…