Currently I am considering the priority scheme to use for some highly available services running on Linux with Heartbeat.
The Heartbeat system has a number of factors that can be used to determine the weight for running a particular service on a given node. One is connectivity to other systems as determined by ping (every system that is pingable can add a value to the score). Another is the number of failures (every failure deducts a value from the total score). Another is the weight for staying on the same node (if the situation changes and the current node is no longer the ideal node you might not want to immediately move the service to a different node, as the move causes some seconds of no service). The last is the preference for each node that may run the service.
For a given node to run a particular service, its score has to be greater than that of all other nodes and also greater than zero. If all nodes have a score of zero or less then the service will not run anywhere.
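To make the trade-offs easier to discuss, here is a rough Python sketch of the scoring model as I have described it. The function and parameter names are my own and purely illustrative; this is not Heartbeat's actual configuration syntax or internal code.

    def node_score(node_preference, pingable_hosts, ping_weight,
                   failure_count, failure_penalty, stickiness, on_current_node):
        # Start from the administrator-assigned preference for this node.
        score = node_preference
        # Connectivity: every system that is pingable adds a value to the score.
        score += pingable_hosts * ping_weight
        # Reliability: every recorded failure deducts a value from the score.
        score -= failure_count * failure_penalty
        # Stickiness: a bonus for staying on the node currently running the service,
        # so a small change elsewhere does not trigger a move and a few seconds of downtime.
        if on_current_node:
            score += stickiness
        return score

    def choose_node(scores):
        # The service runs on the highest-scoring node, and only if that score is
        # above zero; if every node scores zero or less the service does not run.
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None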
Now in the case of a service that repeatedly fails (e.g. a filesystem mount that relies on a hardware RAID which is not connected), what should we do? One option is to have the score for running on a particular node be, for example, 100 times the value that is subtracted on failure. In this case, after 100 failures on that node (and an appropriate number of failures on the other nodes which are permitted to run the service) the service will be disabled. It then has to be explicitly re-enabled (or a node rebooted) before it will run again.
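With some hypothetical numbers plugged into the sketch above, the first option works out like this:

    # Option 1 (hypothetical numbers): the node preference is 100 times the failure
    # penalty, so roughly 100 failures on a node reduce its score to zero and the
    # service is effectively disabled there until someone intervenes.
    node_preference = 10000
    failure_penalty = node_preference // 100          # 100 deducted per failure
    failures_until_disabled = node_preference // failure_penalty
    print(failures_until_disabled)                    # 100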
The other option would be to have the value that is subtracted on failure be no more than a billionth of the score for running on a particular node, so that the service will keep trying to start for the next few hundred years. The up-side of this is that less fiddling is required; the down-side is that some CPU and disk resources will be consumed by repeatedly trying to start the service.
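Again with hypothetical numbers, and assuming a restart attempt roughly every 10 seconds, the second option takes centuries to exhaust the score:

    # Option 2 (hypothetical numbers): the failure penalty is a billionth of the
    # node preference, and the retry interval of 10 seconds is an assumption.
    node_preference = 1_000_000_000
    failure_penalty = 1
    retry_interval_seconds = 10
    seconds_to_exhaust = (node_preference // failure_penalty) * retry_interval_seconds
    print(seconds_to_exhaust / (365 * 24 * 3600))     # roughly 317 years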
Now I have to decide which option to take in this regard; any comments would be appreciated.