Node Failure & Recovery in Galera Cluster (2024)

Individual nodes fail to operate when they lose touch with the cluster. This can occur due to various reasons. For instance, in the event of hardware failure or software crash, the loss of network connectivity or the failure of a state transfer. Anything that prevents the node from communicating with the cluster is generalized behind the concept of node failure. Understanding how nodes fail will help in planning for their recovery.

Detecting Single Node Failures

When a node fails the only sign is the loss of connection to the node processes as seen by other nodes. Thus nodes are considered failed when they lose membership with the cluster’s Primary Component. That is, from the perspective of the cluster when the nodes that form the Primary Component can no longer see the node, that node is failed. From the perspective of the failed node itself, assuming that it has not crashed, it has lost its connection with the Primary Component.

Although there are third-party tools for monitoring nodes—such as ping, Heartbeat, and Pacemaker—they can be grossly off in their estimates on node failures. These utilities do not participate in the Galera Cluster group communications and remain unaware of the Primary Component.

If you want to monitor the Galera Cluster node status poll the wsrep_local_state status variable or through the Notification Command.

For more information on monitoring the state of cluster nodes, see the chapter on Monitoring the Cluster.

The cluster determines node connectivity from the last time it received a network packet from the node. You can configure how often the cluster checks this using the evs.inactive_check_period parameter. During the check, if the cluster finds that the time since the last time it received a network packet from the node is greater than the value of the evs.keepalive_period parameter, it begins to emit heartbeat beacons. If the cluster continues to receive no network packets from the node for the period of the evs.suspect_timeout parameter, the node is declared suspect. Once all members of the Primary Component see the node as suspect, it is declared inactive—that is, failed.

If no messages were received from the node for a period greater than the evs.inactive_timeout period, the node is declared failed regardless of the consensus. The failed node remains non-operational until all members agree on its membership. If the members cannot reach consensus on the liveness of a node, the network is too unstable for cluster operations.

The relationship between these option values is:

evs.keepalive_period	<=	evs.inactive_check_period
evs.inactive_check_period	<=	evs.suspect_timeout
evs.suspect_timeout	<=	evs.inactive_timeout
evs.inactive_timeout	<=	evs.consensus_timeout

Note

Unresponsive nodes that fail to send messages or heartbeat beacons on time—for instance, in the event of heavy swapping—may also be pronounced failed. This prevents them from locking up the operations of the rest of the cluster. If you find this behavior undesirable, increase the timeout parameters.