A Heartbeat developer comments on my blog post

Alan Robertson (a major contributor to the Heartbeat project) commented on my post “failure probability and clusters”. His comment deserves wider readership than a comment generally gets, so I’m making a post out of it. Here it is:

One of my favorite phrases is “complexity is the enemy of reliability”. This is absolutely true, but not a complete picture, because you don’t actually care much about reliability; you care about availability.
Complexity (which reduces MTBF) is only worth it if you can use it to drastically cut MTTR – which in turn raises availability significantly. If your MTTR were 0, then you wouldn’t care if you ever had a failure. Of course, it’s never zero.
But, with normal clustering software, you can significantly improve your availability, AND your maintainability.
Your post makes some assumptions which are more than a little simplistic. To be fair, the real mathematics of this are pretty darn complicated.
First, I agree that there are FAR more 2-node clusters than larger clusters. But, I think, for a different reason: people understand 2-node clusters. I’m not saying this isn’t important; it is important. But it’s not related to reliability.
Second, you assume a particular model of quorum, and there are many. It is true that your model is the most common, but it’s hardly the only one – not even for heartbeat (and there are others we want to implement).
Third, if you have redundant networking and multiple power sources, as you should, then system failures become much less correlated. The normal model used is completely uncorrelated failures.
This is obviously an oversimplification as well, but if you have redundant power supplies fed from redundant power feeds, redundant networking, and so on, it’s not a bad approximation.
So, if you have an MTTR of 4 hours to repair broken hardware, what you care about is the probability of having additional failures during those four hours.
If your HA software can recover from an error in 60 seconds, then that’s your effective MTTR as seen by (a subset of) users. Some won’t see it at all. And, of course, that should also go into your computation. This depends on knowing a lot about what kind of protocol is involved, and what the probability is that failures of various lengths are visible to various kinds of users. And, of course, no one really knows that either in practice.
If you have a hardware failure approximately every 5 years, and a hardware repair MTTR of 4 hours, then the probability of a second failure during that time is about 0.009%. The probability of two additional failures occurring during that time is about 8×10^-7% – which is a pretty small number.
Probabilities for higher order failures are proportionately smaller.
But, of course, like any calculation, this one rests on a number of simplifying assumptions.
It assumes, for example, that the probabilities of correlated failures are small: the probability of a flood or some other disaster taking out all the servers is ignored.
You can add complexity to solve those problems too ;-), but at some point the managerial difficulties (complexity) overwhelm you and you say (regardless of the numbers) that you don’t want to go there.
Managerial complexity is minimized by uniformity in the configuration. So, if all your nodes can run any service, that’s good. If they’re asymmetric, and wildly so, that’s bad.
I have to go now; I had a family emergency come up while I was writing this. Later…

End quote.
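
To make the availability arithmetic in Alan’s comment concrete, here is a rough back-of-the-envelope sketch. The figures (a five-year MTBF, a four-hour hardware repair, a 60-second failover) are just the illustrative numbers from his comment, and the formula is the standard steady-state availability approximation, not anything specific to Heartbeat.

```python
# Back-of-the-envelope availability arithmetic, using the illustrative
# figures from the comment above (5-year MTBF, 4-hour repair, 60 s failover).

HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 5 * HOURS_PER_YEAR  # roughly one hardware failure every five years

# Without clustering, users wait out the full hardware repair.
print(f"4-hour repair MTTR: {availability(mtbf, 4):.5%}")

# With HA failover, most users see something closer to a 60-second outage.
print(f"60-second failover: {availability(mtbf, 60 / 3600):.5%}")
```

Cutting the user-visible MTTR from four hours to about a minute moves availability from roughly four nines to six, which is the trade Alan is describing: the cluster adds complexity (and so lowers MTBF), but it buys back far more through the shorter outage.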

It’s interesting to note that there are other models of quorum; I’ll have to investigate that. Most places I have worked have had an MTTR significantly greater than four hours. But if you have hot-swap hard drives (so drive failure isn’t a serious problem), then an average of one hardware failure per machine every five years should be possible.
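
As a sanity check on the numbers in the quoted comment, here is the same calculation under the uncorrelated-failure assumption Alan mentions; the 24-hour figure at the end is my own addition, to show how a longer repair time widens the window.

```python
# Probability of further node failures while broken hardware is being repaired,
# assuming uncorrelated failures at a constant rate (the simplification noted
# in the comment above).

HOURS_PER_YEAR = 24 * 365
mtbf = 5 * HOURS_PER_YEAR   # ~43,800 hours between hardware failures per node
mttr = 4                    # hours to repair the failed hardware

# Chance that one particular other node also fails during the repair window.
p_one = mttr / mtbf
print(f"one additional failure during repair:  {p_one:.4%}")       # ~0.009%

# Chance that two further nodes both fail during the same window.
print(f"two additional failures during repair: {p_one ** 2:.1e}")  # ~8e-9, i.e. ~8x10^-7 %

# A longer repair time widens the window proportionally.
print(f"one additional failure, 24-hour MTTR:  {24 / mtbf:.4%}")    # ~0.05%
```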
