Fault Detection and Network Partitioning | MGR in simple terms

2022.09.19

This article introduces the fault detection mechanism of MGR and how to deal with it after a network partition occurs.

1. Fault detection

When a node in an MGR cluster cannot communicate normally with the other nodes, the fault detection mechanism is triggered: the majority of nodes vote and then decide whether to expel it from the MGR cluster.

When a failure occurs, fault detection can only work if a majority of nodes survive, in which case MGR can restore availability on its own; when the majority of nodes are themselves abnormal, MGR cannot recover by itself and requires human intervention.

In MGR, nodes regularly exchange messages with each other. When no message has been received from a node for more than 5 seconds (fixed at 5 seconds in MySQL; configurable in GreatSQL via the new option group_replication_communication_flp_timeout), that node is marked as suspicious. Each surviving MGR node then checks for suspicious nodes every 15 seconds (in GreatSQL this is adjusted to every 2 seconds, which is more efficient and is described later), and a node that remains suspicious beyond the expel timeout is evicted from the MGR cluster.
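
For example, you can watch a node's state change to UNREACHABLE (and later disappear from the list once it is evicted) from any surviving member with this read-only query against the standard performance_schema membership view:

[root@GreatSQL]> SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members;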

Note that the option group_replication_member_expel_timeout has a default value of 5 starting from MySQL 8.0.21. In versions <= MySQL 8.0.20, the default value is 0, meaning a node is expelled as soon as it is judged suspicious. MySQL 5.7 does not have this option and behaves the same as a value of 0.

In MySQL, MGR failure detection is performed by a separate thread that checks every 15 seconds (SUSPICION_PROCESSING_THREAD_PERIOD = 15 is hardcoded in the source code). Therefore, when a node fails, in the worst case it may take 5 (seconds without a message before the node is judged suspicious) + 15 (SUSPICION_PROCESSING_THREAD_PERIOD) + 5 (group_replication_member_expel_timeout) = 25 seconds to evict it. In the best case, the node can be evicted after as little as 5 + 5 = 10 seconds.

This has been optimized in GreatSQL: the new option group_replication_communication_flp_timeout (default 5, minimum 3, maximum 60) defines how many seconds a node can go without sending a message before it is judged suspicious. In addition, the hardcoded value has been changed to SUSPICION_PROCESSING_THREAD_PERIOD = 2, which means the failure detection thread checks every 2 seconds instead of every 15 seconds. So in GreatSQL, eviction completes in as little as 5 (group_replication_communication_flp_timeout) + 5 (group_replication_member_expel_timeout) = 10 seconds, and in at most 5 + 5 + 2 (SUSPICION_PROCESSING_THREAD_PERIOD) = 12 seconds.
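
A minimal sketch of adjusting this GreatSQL-only suspicion timeout at runtime; whether it can be changed dynamically may depend on the GreatSQL version, so treat this as an illustration rather than a definitive procedure:

-- GreatSQL only: seconds without any message before a node is marked suspicious
[root@GreatSQL]> SET GLOBAL group_replication_communication_flp_timeout = 5;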

When network conditions are poor, it is recommended to appropriately increase the value of group_replication_member_expel_timeout to avoid nodes being evicted frequently because of network fluctuations. But also pay attention to another risk, as described in the article: Technology Sharing | Why MGR's AFTER consistency mode is not recommended.
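
For example, on a flaky network you might give suspicious nodes more time to recover before they are evicted; the value 30 below is purely illustrative:

-- Allow a suspicious node 30 seconds to recover before eviction
[root@GreatSQL]> SET GLOBAL group_replication_member_expel_timeout = 30;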

The surviving nodes delete the expelled node from the member list, but the expelled node may not be aware that it has been expelled (perhaps the cause was only a brief, temporary network abnormality). After its connectivity is restored, the node first receives a new view message showing that it has been expelled from the MGR cluster, and then it attempts to rejoin. An evicted node will try up to group_replication_autorejoin_tries times to rejoin the MGR cluster.

The option group_replication_exit_state_action defines the node's behavior after it is expelled. The default is to set super_read_only = ON and enter read-only mode.
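
A sketch of configuring the post-eviction behavior; READ_ONLY is the documented default action, and 3 rejoin attempts is just an illustrative value (MySQL waits 5 minutes between attempts):

-- Stay alive in super_read_only mode after being expelled (the default)
[root@GreatSQL]> SET GLOBAL group_replication_exit_state_action = 'READ_ONLY';

-- Automatically attempt to rejoin the group up to 3 times
[root@GreatSQL]> SET GLOBAL group_replication_autorejoin_tries = 3;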

2. When a minority of members lose contact

When a minority of members in the cluster become unreachable (UNREACHABLE), they do not automatically leave the MGR cluster by default. You can set group_replication_unreachable_majority_timeout: when a minority node has been disconnected from the majority for longer than this threshold, it automatically leaves the MGR cluster. If set to 0 (the default), the minority node waits indefinitely and never leaves automatically. When a node leaves the cluster, its pending transactions are rolled back, its state changes to ERROR, and it then executes the behavior defined by the option group_replication_exit_state_action. If group_replication_autorejoin_tries is set, it will also automatically try to rejoin the MGR cluster.
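
A minimal example; the 10-second value is illustrative, and per the MySQL documentation a value of 0 means minority members wait indefinitely:

-- A minority node leaves the group after 10 seconds of being cut off from the majority
[root@GreatSQL]> SET GLOBAL group_replication_unreachable_majority_timeout = 10;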

3. When a majority of members lose contact

When a majority of nodes become unreachable, for example when 2 nodes in a 3-node MGR cluster are disconnected, the remaining single node cannot form a majority and therefore cannot reach a decision on new transaction requests; a network partition (split brain) has occurred. That is, the MGR cluster is split into two or more regions, none of which holds a majority, so the MGR cluster cannot provide write services.

At this point, manual intervention is required to force a new member list by setting group_replication_force_members. For example, an MGR cluster consists of three nodes, two of which are accidentally disconnected and only one node survives; in this case you need to manually set group_replication_force_members to force a member list containing only the last surviving node (see the example after the reminders below).

Two important reminders:

1) Using this method is basically a last resort and requires great care. If used incorrectly, it can create an artificial split-brain scenario, cause the entire system to become completely blocked, or end up forcing the wrong member list.
2) After forcibly setting a new member list and unblocking MGR, remember to clear this option value, otherwise START GROUP_REPLICATION cannot be executed again.
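
As an illustration, assuming the surviving node's group-communication address (its group_replication_local_address) is 192.168.1.10:33061 (a hypothetical address), the forced member list is set on that node and then cleared once the group is unblocked:

-- On the only surviving node: force a membership list containing just this node
[root@GreatSQL]> SET GLOBAL group_replication_force_members = '192.168.1.10:33061';

-- After the group is unblocked, clear the value so that
-- START GROUP_REPLICATION can be executed again later
[root@GreatSQL]> SET GLOBAL group_replication_force_members = '';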

4. Xcom cache

When a node is in a suspicious state, the transaction messages it misses are cached in the Xcom cache of the other nodes until it is confirmed to be kicked out of the MGR cluster. The size of this cache is controlled by the option group_replication_message_cache_size. If the suspicious node recovers within a short period of time, it first reads the missed records from the Xcom cache to recover, and then performs distributed recovery. Therefore, when the network is unstable or the concurrent transaction volume is large and physical memory is sufficient, the Xcom cache size can be appropriately increased; conversely, when physical memory is small or the network is relatively stable, the Xcom cache should not be set too large, to reduce the risk of OOM.

In MySQL 5.7, the maximum Xcom cache size is 1G and cannot be dynamically adjusted. As of MySQL 8.0, it can be dynamically adjusted. In versions <= MySQL 8.0.20, the minimum value is 1G. In versions >= MySQL 8.0.21, the minimum value is 128M.
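
For example, on a server with plenty of memory the cache can be enlarged at runtime; 2GB below is just an illustrative value (the unit is bytes):

-- Enlarge the Xcom message cache to 2GB
[root@GreatSQL]> SET GLOBAL group_replication_message_cache_size = 2147483648;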

You can execute the following SQL to view the current Xcom cache consumption:


[root@GreatSQL]> SELECT * FROM performance_schema.memory_summary_global_by_event_name
  WHERE EVENT_NAME LIKE 'memory/group_rpl/GCS_XCom::xcom_cache';

In MySQL, the Xcom cache is allocated dynamically on demand: if there is too much free space it is released, and if space is insufficient more memory is allocated, about 250,000 cache items at a time, which can easily cause a response delay of around 150ms. That is, response latency may jitter repeatedly as the transaction volume changes.

In GreatSQL, a static allocation mechanism is adopted for the Xcom cache: about 1GB of memory is pre-allocated for the Xcom cache at startup, which avoids the aforementioned risk of response delay jitter. The "side effect" is that the mysqld process occupies more memory than before, so it is not suitable for servers where memory is particularly tight.

5. Network Partitioning

In MGR, transactions need to reach consensus among a majority of nodes before they can be committed or rolled back. Similarly, the inter-node communication messages mentioned earlier also require consensus among a majority of nodes. When a majority of nodes in the MGR cluster lose contact, consensus cannot be formed and the majority voting/arbitration requirement cannot be met, so MGR rejects write transaction requests. This situation is also known as a network partition: the MGR cluster is split into two or more partitions that cannot communicate with each other, and the nodes in any single partition cannot form a majority.

A network partition may cause the Primary node to be kicked out of the MGR cluster. When it tries to rejoin, it may have more local transactions than the group because some of its transactions were never synchronized to the other nodes, and an error like the following is reported:


This member has more executed transactions than those present in the group. Local transactions: xx:1-300917674 > Group transactions: xx:1-300917669

In this case, manual intervention is required to decide which node has the latest data and should serve as the Primary node.
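
To decide which node actually has the most complete data, a common approach is to compare the executed GTID set on every node and pick the one whose set is a superset of the others; a read-only check:

-- Run on each node and compare the results
[root@GreatSQL]> SELECT @@global.gtid_executed;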

6. Summary

This article introduced the fault detection mechanism of MGR, the Xcom cache, what a network partition is, the impact when a fault occurs, and how to recover from it.