
Resolving Split-Brain Issues

Split-brain is the undesirable state in which both members of a failover pair act as primary at the same time. This rare situation can occur when both systems are up and running but lose all contact with each other on the MGMT and HA ports at the same time, for example because of a network outage or a cable mishap. Split-brain can also result from an error in the failover software. In this situation, the secondary system assumes that the primary system has failed and takes on the primary role, while the primary system, which has no contact with the secondary system, continues to act as primary. Having two primary systems introduces issues such as VIP contention and duplication of data.

To detect a split-brain issue, complete the following:

  1. When the "Lost connectivity via peer replication link" alert occurs, run the failover status command on both members.
  2. Check the output on each member: if failover is enabled and both nodes report the Primary role, the pair is in a split-brain state.

For example:

NM85 (primary)> failover status
               Failover enabled: Yes
               Connection state: WFConnection
Replication Role [Local|Remote]: Primary | Unknown
      Disk state [Local|Remote]: UpToDate | DUnknown
                     I/O status: Running
   Network data [Sent|Received]: 0 KB | 0 KB
   Local disk data [Read|Write]: 141621 KB | 476576 KB
          Currently out of sync: 74964 KB

NM84 (secondary)> failover status
               Failover enabled: Yes
               Connection state: WFConnection
Replication Role [Local|Remote]: Primary | Unknown
      Disk state [Local|Remote]: UpToDate | DUnknown
                     I/O status: Running
   Network data [Sent|Received]: 0 KB | 0 KB
   Local disk data [Read|Write]: 524661 KB | 743072 KB
          Currently out of sync: 70760 KB
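
The check described in step 2 can be sketched in Python as follows. This is only an illustrative sketch, not part of NetMRI: the function names (parse_failover_status, acts_as_primary, is_split_brain) are hypothetical, and the parser assumes the failover status output uses the "label: value" layout shown in the example above. How the output text is captured from each member is left out.

```python
import re


def parse_failover_status(text):
    """Parse captured 'failover status' output into a dict keyed by label.

    Assumes each field is printed as 'label: value', as in the example
    output above. Values written as 'a | b' become two-element lists.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # blank lines and the prompt line carry no field
        label, value = line.split(":", 1)
        # Drop hints such as "[Local|Remote]" from the label.
        label = re.sub(r"\s*\[.*?\]", "", label).strip()
        parts = [p.strip() for p in value.split("|")]
        fields[label] = parts if len(parts) > 1 else parts[0]
    return fields


def acts_as_primary(fields):
    """True if failover is enabled and the local replication role is Primary."""
    role = fields.get("Replication Role")
    local_role = role[0] if isinstance(role, list) else role
    return fields.get("Failover enabled") == "Yes" and local_role == "Primary"


def is_split_brain(status_a, status_b):
    """True if both members report themselves as primary with failover enabled."""
    return all(
        acts_as_primary(parse_failover_status(s)) for s in (status_a, status_b)
    )


# Usage with abbreviated copies of the two outputs shown above.
nm85 = """NM85 (primary)> failover status
 Failover enabled: Yes
Replication Role [Local|Remote]: Primary | Unknown
"""
nm84 = """NM84 (secondary)> failover status
 Failover enabled: Yes
Replication Role [Local|Remote]: Primary | Unknown
"""
print(is_split_brain(nm85, nm84))  # prints True
```

The sketch checks only the two fields that step 2 names (failover enabled and the local replication role). The remaining fields are parsed but not used in the decision.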

You can resolve a split-brain issue by choosing one system to retain its data (the survivor) and the other system to discard its data (the victim), and then forcing the victim into the secondary role. When choosing, examine each system and select the one with the most complete data as the survivor. If you are unsure, select the original primary as the survivor and the original secondary as the victim.

Typically, the data on the two systems is similar, because both systems have access to the same network, collect data from the same pool of devices, and perform the same tasks. The data from before the split-brain state is identical on both systems because it was replicated while the systems were still connected. Only the data collected during the split-brain state differs. The longer the systems remain in a split-brain state, the more their data diverges.

To resolve a split-brain issue using the NetMRI UI, complete the following:

  1. Connect to the management IP address of the victim system and log in to the system using your username and password.
  2. Go to the Settings icon > Setup > Failover Configuration tab.
  3. On the Failover Configuration page, click Become Secondary.

To resolve a split-brain issue using the NetMRI administrative shell, complete the following:

  1. Use a terminal program to connect to the management IP address of the victim system.
  2. Log in to the administrative shell using your username and password.
  3. At the administrative shell prompt, enter the failover role secondary command, and then press Enter.