One problem I have had when configuring Heartbeat clusters is performing a STONITH that originates outside the Heartbeat system.
STONITH was designed for the Heartbeat system to know when a node is not operating correctly (this can either be determined by the node itself or by other nodes in the network) and then force a hardware reset so that the non-functional node will not interfere with another node that is designated to take over the service.
However, code that is called by Heartbeat will sometimes have more information about the state of the system than Heartbeat can access. For example, if I have a service that accesses a filesystem on an external RAID, it is common for the RAID to track who is accessing it. In some situations the RAID hardware can "fence" access (so that when machine B mounts the filesystem, machine A can no longer access it). In other situations the RAID may only be capable of informing the system that another machine is registered as the owner of the device. In that second case, a machine that is about to mount such a device must either prevent the previous owner from accessing the device (which may be impossible or unreasonably difficult) or reset the previous owner.
Until recently I had been doing this by writing some code to extract the STONITH configuration from the CIB and call the stonith utility. The problem with this approach is that there is no requirement that every node be capable of performing a STONITH on every other node. Even if every node is designed to be capable of rebooting every other node, a partial failure condition may restrict the set of nodes that can perform a STONITH on the target.
Currently the recommended way of doing this is via the apitest test program. Below is an example of the command used to reset the node node-1 with a timeout of 20000 ms, followed by the output when the reset completes successfully. I have suggested that the Heartbeat developers provide an official interface for doing this (rather than a test of the API) and I believe that this is being considered. In the meantime, the following is the only way of doing it:
# /usr/lib/heartbeat/stonithdtest/apitest 1 node-1 20000 0
optype=1, node_name=node-1, result=0, node_list=node-0
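A script that needs to reset the previous owner before mounting a device could wrap this call along the following lines. This is only a minimal sketch: the function names stonith_result and reset_node are hypothetical, it assumes the output has the key=value format shown above, and it assumes that result=0 indicates success as it does in that example. The arguments are passed exactly as in the command above, including the trailing 0.

```shell
#!/bin/sh
# Path as used in the example above; adjust for your installation.
APITEST=${APITEST:-/usr/lib/heartbeat/stonithdtest/apitest}

# Print the value of the result= field from one line of apitest output.
# Assumes the "key=value, key=value" format shown in the example above.
# Prints nothing if the line has no result= field.
stonith_result() {
    printf '%s\n' "$1" | sed -n 's/.*result=\([0-9-]*\).*/\1/p'
}

# Hypothetical helper: ask for a reset of a node and succeed only when
# the output reports result=0.
# Usage: reset_node NODE [TIMEOUT_MS]
reset_node() {
    node=$1
    timeout_ms=${2:-20000}
    # The exit status of apitest is deliberately ignored here (an
    # assumption of this sketch); only the reported result is checked.
    output=$("$APITEST" 1 "$node" "$timeout_ms" 0 2>&1) || true
    line=$(printf '%s\n' "$output" | grep 'result=' | tail -n 1)
    [ "$(stonith_result "$line")" = "0" ]
}
```

A caller could then run something like "reset_node node-1 20000 || exit 1" before attempting the mount, so that the mount is skipped whenever the reset is not confirmed.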