About Automatic Failover

You can create a NetMRI failover pair using two NetMRI appliances, in which one acts as the primary appliance and the other as the secondary appliance. A failover pair provides a redundant operational mode between the primary and secondary appliances, greatly reducing service downtime when one of them goes out of service. You can configure two Operation Center (OC) appliances, collector appliances, or standalone appliances to form a failover pair.

In a failover pair, the primary appliance actively discovers and manages network devices and serves the Web UI and the CLI over the shared VIP address, while the secondary appliance continuously keeps its database synchronized with the primary. Although you can access a failover pair using either the VIP address of the failover pair or the management IP address of the primary appliance, using the management IP is not recommended: during a failover, the roles of the primary and secondary appliances reverse and the management IP becomes unreachable. Accessing the failover pair using the VIP address ensures that you are always contacting the active primary appliance.

Note that during a failover, all active connections between the NetMRI appliances and the network devices are disrupted and all ongoing processes fail. All active Web UI and CLI sessions are also disrupted, and users with active sessions must reconnect and log in again after the secondary appliance assumes the primary role.

Note the following about the automatic failover feature:

  • Failover pairs are supported only on NetMRI NT-1400, NT-2200, and NT-4000 (G8 only) appliances. They are not supported on NetMRI virtual appliances.
  • Failover is supported in NetMRI 7.1.1 and later releases.
  • Collector failover is supported in NetMRI 7.1.2 and later releases.
  • Both the primary and secondary appliances must be the same appliance model and run the same software version.
  • The management IP address of both the primary and secondary must be on the same subnet.
  • The VIP address, shared by the primary and secondary, must be on the same subnet as the management IP address.
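
For example, a failover pair on a hypothetical management subnet of 10.1.0.0/24 could be addressed as follows (all three addresses are illustrative and must be on the same subnet):

    Primary management IP:    10.1.0.11
    Secondary management IP:  10.1.0.12
    Shared VIP address:       10.1.0.10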

Deploying Automatic Failover for New Appliances

The following are the prerequisites for deploying automatic failover for new appliances:

  • Configure two supported NetMRI appliances with licenses installed.
  • Both the appliances must be of the same appliance model and same software version number.
  • Provision three IP addresses on the same subnet: a VIP address and two management IP addresses for the appliances.
  • If you are using the direct replication method to connect the appliances, you need an Ethernet cable to connect the systems directly through their HA ports.
  • If you are using the network replication method, you must connect the systems over a local network and acquire two replication IP addresses on the same subnet. You must also select a TCP port for the replication traffic.

Note

Infoblox recommends that you use the direct replication method for the best reliability and performance. The network replication method will have higher latency and a greater chance of connection breakage, and thus lower reliability and performance.
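
If you nonetheless choose the network replication method, the replication addressing could look like the following sketch (all values are illustrative; per the configuration settings described later, the replication port must be a TCP port greater than 1024):

    Primary replication IP:    192.168.100.1
    Secondary replication IP:  192.168.100.2
    Replication subnet mask:   255.255.255.0
    Replication TCP port:      7789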

You can deploy two new Operation Centers (OC), collector appliances, or standalone appliances to form a failover pair, as follows:

  1. Set up and configure two new NetMRI appliances as separate systems. Ensure that the appliances are running NetMRI 7.1.1 or later. For collector failover configuration, make sure that the appliances are running NetMRI 7.1.2 or later.
  2. Connect both the systems using one of the following methods:
    • Direct replication: Connect the systems directly through their HA ports.
    • Network replication: Connect the HA port of both systems to a network using an Ethernet cable.
      Infoblox recommends that you connect the systems using the direct replication method.
  3. Run the Setup Wizard on both appliances, set the admin password, and then install the license. The admin password must be the same on both systems. For more information, see Running the Setup Wizard.
    At this point, you do not need to complete the entire configuration wizard on both systems; you can complete the configuration on the primary system only.
  4. Configure the failover settings on the Operation Center and collectors, as described in Specifying Automatic Failover Settings.

    Note

    After specifying the failover configuration settings and completing the enable operation, the systems start synchronizing data. This process might take up to one hour, depending on the appliance model.

  5. For an Operation Center and collector failover, complete the following:

    • Log in to the Admin Shell on the Operation Center and run the configure tunserver command. Enter the VIP address of the Operation Center when prompted for the IP address of the Operation Center server.
    • To register each collector with the Operation Center, log in to the Admin Shell on each collector and run the register tunclient command. Enter the VIP address of the Operation Center when prompted for the IP address of the Operation Center.
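
For example, the relevant prompts look similar to the following abbreviated session (10.1.0.10 is a hypothetical OC VIP; the full prompt sequences are shown in Reconfiguring the Operation Center HA Pair and Reconfiguring the Collector HA Pair later in this document):

    oc (primary)> configure tunserver
    ...
    Server Public Virtual Name or VIP address [10.1.0.10]:

    collector> register tunclient
    ...
    Registration Server/IP [e.g., example.com]: 10.1.0.10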

Specifying Automatic Failover Settings

To specify automatic failover configuration settings:

  1. Go to the Settings > Setup > Failover Configuration tab.
    The Failover Configuration page appears, listing all device interfaces that are used by the system.
  2. In the Failover Configuration page, complete the following:
      • Virtual IP address: Enter the VIP address.
      • Connection Mode: Select the connection mode from the drop-down list. Select Direct if the systems are connected directly through the HA port, or Network if the HA ports of both systems are connected to a network. Infoblox recommends that you use the Direct connection mode.
      • Virtual Hostname: Enter the hostname for the system.
      • Port: Enter the TCP port for replication traffic if you are using the Network connection mode. You must enter a port number greater than 1024.

    In the Replication Nodes section, enter the following for both the Primary and Secondary nodes:

      • Role: Displays the role of the appliance, either PRIMARY or SECONDARY.
      • Management IP: Enter the management IP address of the system.
      • Hostname: Enter the hostname of the system.
      • Replication IP: Enter the IP address used for replication traffic, if you are using Network connection mode.
      • Subnet: Enter the subnet mask of the replication IP, if you are using Network connection mode. Note that the subnet mask must be the same for both primary and secondary appliances.
  3. Click Update to update the settings and replicate data on both the primary and secondary appliances.
  4. Click Enable to start connecting the systems.
    The secondary system synchronizes data with the primary system. This process might take about one hour, depending on the appliance model.
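
For example, a completed configuration for a directly connected pair might look like the following (all values are hypothetical):

    Virtual IP address: 10.1.0.10
    Connection Mode:    Direct
    Virtual Hostname:   netmri-ha
    Replication Nodes:
      PRIMARY     Management IP: 10.1.0.11    Hostname: netmri-a
      SECONDARY   Management IP: 10.1.0.12    Hostname: netmri-b

With the Network connection mode, you would also complete the Port, Replication IP, and Subnet fields as described in step 2.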

  Migrating Existing Systems as Failover Pairs

You can migrate two existing Operation Centers (OC) or standalone appliances to form a failover pair. Ensure that both appliances are running NetMRI 7.1.1 or later. To form a collector failover pair, upgrade the existing collectors to NetMRI 7.1.2 or later.

The following are the prerequisites for migrating existing systems as a failover pair:

  • Two supported NetMRI appliances with licenses installed. You can choose an existing appliance and a second appliance of the same model.
  • For an HA pair, provision three IP addresses: one for the primary appliance, one for the secondary appliance, and a virtual IP address shared between the failover pair.
  • If you are using the network replication method to connect the appliances, you must connect the systems over a local network and two replication IPs must be acquired on the same subnet. You must also select a TCP port for the replication traffic.

The example later in this section describes the migration of an Operation Center with two collectors to an HA Operation Center with two HA collectors. It uses the following conventions:

  • A — the existing nodes.
  • B — the new nodes to be paired with the existing ones.

The steps required to migrate existing systems as failover pairs depend on whether your appliances use the old or the new partition scheme. If your appliances use the old partition scheme, you must prepare them first. To determine which partition scheme an appliance uses, run the show diskusage command from the Admin Shell and search for the "/drbd0" substring, as in the illustrative check below. If the substring is present, the appliance runs the new scheme.
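
For example, the check might look like the following (the output format and sizes are illustrative; what matters is whether a "/drbd0" entry is present):

    netmri> show diskusage
    Filesystem      Size  Used  Avail  Use%  Mounted on
    ...
    /dev/drbd0      886G  213G  673G    25%  /drbd0

If no "/drbd0" entry appears, the appliance still uses the old partition scheme and must be prepared as described in the following sections.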

To migrate existing systems to form a failover pair, perform the steps described in the following sections:

  • Preparing Secondary Appliances (Old Partition Scheme)
  • Preparing the Existing Operation Center Node (Old Partition Scheme)
  • Preparing the Existing Collector Nodes (Old Partition Scheme)
  • Configuring a Failover Pair
  • Reconfiguring the Operation Center HA Pair
  • Reconfiguring the Collector HA Pair

Note

For the new partition scheme, configure three nodes, one with an OC license and two with Standalone licenses, all running the same software version, and reset the admin password in the GUI to match the other system. Then proceed with the last three steps in the list above.

Preparing Secondary Appliances (Old Partition Scheme)

To prepare a secondary (B) device, complete the following:

  1. Install NetMRI with the same version as on node A. If you are using a device that you used earlier, update it to the node A version.
  2. If the device was already used, run the reset command on node B.
  3. Run the repartition command on node B.
  4. Run the config server command on node B. 

    Note

    The management port of node B must be on the same network as node A's, and node B must use the same time zone and other server settings as node A. If you are using scan ports, connect the scan ports of the second node to the network in the same way as on the existing device. For more information, see Failover and Scan Interfaces.

  5. Install the license on node B. The license must have the same type, device limit, and expiration date as node A.

    Note

    In the case of an Operation Center, run config server again after the license installation without modifying any parameters.

  6. Reset the admin password in the GUI to match the other system, or complete the UI setup through the Setup Wizard.

You now have the new node with the new partition scheme prepared to participate in the HA pair.
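
In summary, the node B preparation sequence looks like the following sketch (the prompt is illustrative; the license installation and admin password reset are completed in the GUI):

    node-b> reset          <- only if the device was used earlier
    node-b> repartition
    node-b> config server  <- same network, time zone, etc. as node A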

Preparing the Existing Operation Center Node (Old Partition Scheme)

To prepare the existing OC node, complete the following:

  1. Follow the steps to prepare node B for HA OC as described in the section above, Preparing Secondary Appliances (Old Partition Scheme).
  2. On node A, disable SNMP collection: go to Settings > Setup > Collection and Groups > Global > Network Polling and deselect the SNMP Collection check box.
  3. Generate a database archive of node A and restore it on node B. For more information, see NetMRI Database Management.
    • If data is restored successfully, proceed to the next step.
    • If the restore failed due to disk space exhaustion, try reducing data retention settings on your existing NetMRI system to reduce the archive size. It might take up to 24 hours for reduced data retention settings to take effect. For more information, see Data Retention or contact Infoblox Support for further assistance.
  4. Run the configure server command on node B.
  5. Run the config tunserver command with the new server IP (IP of node B).
  6. Re-enable SNMP collection after restoring the archive on node B: go to Settings > Setup > Collection and Groups > Global > Network Polling and select the SNMP Collection check box.
  7. Log in to the Admin Shell on node A, enter the reset system command, and then enter the repartition command.
  8. After repartitioning is complete, run the configure server command on node A.
  9. Install the license.
  10. Reset the admin password in the GUI to match the other system.

The two nodes are now ready for failover configuration. Node B, which holds all the OC data, will be the primary, although no collectors are connected at this point.

Preparing the Existing Collector Nodes (Old Partition Scheme)

To prepare the existing collector nodes, complete the following:

  1. Follow the steps to prepare node B for HA Collector as described in the section above, Preparing Secondary Appliances (Old Partition Scheme).
  2. Log in to the Admin Shell on the existing node, enter the reset system command, and then enter the repartition command.
  3. After repartitioning is complete, run the configure server command.
  4. Install the license, which must be identical to the license on node A.
  5. Reset the admin password in the GUI to match the other system.
  6. Log in to the Admin Shell on node A, enter the config tunclient command and connect it to the node B OC.
  7. You now have two nodes ready for failover configuration where any of them can be primary.
  8. If you want to make node B the primary, complete the following:
    • Log in to the UI of the node B (primary) OC and go to Settings > Setup > Tunnels and Collectors.
    • Choose the existing collector (A), select Collector Replacement, and enter the serial number of collector node B.
    • Log in to the Admin Shell on collector node B, enter the config tunclient command, and connect it to the node B OC (primary).

As a result, you have a newly partitioned system (an OC (B) and two collectors (A and B)) with prepared secondary devices, as pictured in the sketch below. You can now configure a failover pair.
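
The resulting topology can be pictured as follows (roles shown are those immediately after migration; for each collector pair, either node can act as the primary, as noted above):

    HA OC:           OC A (secondary)  <->  OC B (primary, holds all OC data)
    HA Collector 1:  Collector 1A  <->  Collector 1B
    HA Collector 2:  Collector 2A  <->  Collector 2B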

Configuring a Failover Pair

To configure a failover pair from the prepared appliances, complete the following:

  1. Log in to the Operation Center UI as admin.
  2. Go to Settings > Setup > Failover Configuration. This page shows the HA status of your deployment (the OC and two collectors).
  3. Choose OC > Edit and configure the Operation Center HA pair. Wait until it is finished and the status is OK.
  4. Choose the first collector > Edit and configure the first collector HA pair. Wait until it is finished and the status is OK.
  5. Choose the second collector > Edit and configure the second collector HA pair. Wait until it is finished and the status is OK.

Note

If the failover status is not OK (for example, Standalone, Replication Down, or Not Synced), wait about an hour and then try a resynchronization from the secondary device.

You now have a NetMRI deployment with HA appliances.

Reconfiguring the Operation Center HA Pair

To reconfigure the Operation Center HA pair, complete the following:

  1. Log in to the Operation Center CLI as admin.
  2. Run the reset tunserver command:

    oc (primary)> reset tunserver
    Notice: This operation will clear all Tunnel CA, server, and client
    configuration and shut down the Tunnel service.
    Continue? (y/n) [n]: y
    +++ Stopping OpenVPN Server ... OK
    +++ Configuring OpenVPN Service ... OK
    +++ Clearing Server Config ... OK
    +++ Clearing CA Config ... OK
    Launching "failover tunserver reset" on "172.19.2.59"...
    The server needs to be restarted for these changes to take effect.
    Do you wish to restart the server now? (y/n) [y]: y
    +++ Restarting Server ... OK

  3. Run the configure tunserver command and configure it with the OC VIP address (Server Public Virtual Name or VIP address):

    oc (primary)> config tunserver
    +++ Configuring CA Settings
    CA key expiry in days [5475]:
    CA key size in bits [2048]:
    +++ Configuring Server Settings
    Server key expiry in days [5475]:
    Server key size in bits [2048]:
    Server Public Virtual Name or VIP address [172.19.2.66]: <- defaults to the OC VIP
    Select tunnel IP protocol. 'udp' recommended for better performance.
    Protocol (udp, udp6, tcp) [udp]:
    Tunnel network /24 base [169.254.50.0]:
    Block cipher:
    0. None (RSA auth)
    1. Blowfish-CBC
    2. AES-128-CBC
    3. Triple DES
    4. AES-256-CBC
    Enter Choice [2]:
    Use compression [y]:
    Use these settings? (y/n) [n]: y
    +++ Initializing CA (may take a minute) ...
    +++ Creating Server Params and Keypair ...
    Generating DH parameters, 2048 bit long safe prime, generator 2
    This is going to take a long time
As a result, the Operation Center now uses the new server IP address (the failover VIP).

Reconfiguring the Collector HA Pair

To reconfigure the collector HA pair, complete the following:

  1. Log in to the collector CLI as admin.
  2. Run the reset tunclient command:

    col2 (primary)> reset tunclient
    Launching "show version" on "172.19.2.62"...
    Notice: This operation will clear all Tunnel client
    configuration and shut down the local Tunnel service.
    Continue? (y/n) [n]: y
    +++ Stopping OpenVPN Service ... OK
    +++ Configuring OpenVPN Service ... OK
    +++ Clearing Client Config ... OK
    +++ Adjusting ACLs ... OK
    Launching "failover tunclient reset" on "172.19.2.62"...

  3. Run the configure tunclient command and enter the OC VIP address when prompted for the Registration Server/IP:

    col2 (primary)> config tunclient
    NOTICE: The inactivity timeout is being disabled temporarily while this command is run.
    Launching "show version" on "172.19.2.62"...
    NOTICE: The time zone on this system is US/Pacific.
    This MUST match the Operation Center time zone for registration to be successful.
    If the time zones are not equal, you must first use "configure server" to set
    the collector time zone to match.
    PLEASE NOTE: Changing the time zone WILL REQUIRE A SYSTEM REBOOT.
    Do you want to continue registration? (y/n) [n]: y
    +++ Configuring Tunnel Registration Settings
    Registration Server/IP [e.g., example.com]: 172.19.2.66
    Registration protocol (http|https) [https]:
    Registration username: admin
    Registration password:
    Register this system? (y/n) [y]: y
    .......

    This step takes a long time to complete (about 40 or more minutes).

  4. Repeat the above steps for each collector.

The NetMRI system (the OC and two collectors) is now migrated to failover pairs.

Note

If devices were discovered from both collectors (for example, when a device has multiple interfaces), these devices are displayed in gray, without a link to the device viewer, on the collector that did not discover them first. After the system is migrated, the following issue is observed for one of the collectors (the second one): in Network Explorer > Discovery, the devices listed on the left that came from the initial collection and were discovered by both collectors cannot be discovered or deleted using the Discover Now or Delete buttons. However, you can discover them from Settings > Discovery Settings > Seed Routers/CIDR.

Manually Initiating Failover

If you want to swap roles between the members of a failover pair, you can manually initiate a failover. Within about five minutes after initiating a manual failover, the secondary system assumes the primary role and takes ownership of the VIP address. Note that a manually initiated failover causes a temporary service disruption.

To initiate a manual failover using the GUI, complete the following:

  1. Log in to the primary system using your username and password.
  2. Go to the Settings > Setup > Failover Configuration tab.
  3. In the Failover Configuration page, click Become Secondary.

To initiate a manual failover using the NetMRI Admin Shell, perform one of the following:

  • Log in to the Admin Shell on the primary system, enter the failover role secondary command, and then press Enter.
  • Log in to the Admin Shell on the secondary system, enter the failover role primary command, and then press Enter.
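
For example, a manual failover initiated from the Admin Shell of the primary looks similar to the following (the prompt is illustrative):

    netmri-a (primary)> failover role secondary

Within about five minutes, the peer system assumes the primary role and takes ownership of the VIP address, and the local system becomes the secondary.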

Monitoring Automatic Failover

To monitor the current status of the failover pair, complete the following:

  1. Go to the Settings > Setup > Failover Configuration tab.
    The Failover Configuration page appears, listing all device interfaces that are used by the system.
  2. In the Failover Configuration page, the Status field displays the current status of the failover pair. The current status can be one of the following:
    • OK (green): Indicates that the failover pair is connected and synchronized. If the primary fails, the secondary automatically takes over the primary role.
    • Syncing (yellow): Indicates that the failover pair is connected and the primary and secondary are synchronizing data. If the primary fails during synchronization, the secondary system cannot automatically take over as the primary system.
    • Replication Down (red): Indicates that the failover pair is disconnected on the HA port but reachable on the MGMT port. This can be caused by a cable mishap or by the secondary going offline.
    • Peer Down (red): Indicates that the secondary has lost connection with the primary on both HA and MGMT ports.

You can click the status link to view details about the failover status.
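
You can also check the failover state from the Admin Shell by running the failover status command. The following sketch shows a representative healthy output, modeled on the examples in Resolving Split Brain Issues; the actual values and field layout may vary:

    netmri-a (primary)> failover status
     Failover enabled: Yes
     Connection state: Connected
     Replication Role [Local|Remote]: Primary | Secondary
     Disk state [Local|Remote]: UpToDate | UpToDate
     I/O status: Running
     Currently out of sync: 0 KB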

Viewing Failover Settings

To view configuration details of the Operation Center (OC) and Collector pair:

  • Go to the Settings > Setup > Failover Configuration tab.

The Failover Configuration page appears, listing all device interfaces that are used by the system.

Note

For an OC and collector setup, the first row of the Failover Configuration page displays the OC pair information and the other rows display the collector pair information.

  • Actions: Click the Action icon to choose Edit or Status.
  • Virtual IP: Displays the virtual IP address.
  • Virtual Host Name: Displays the virtual hostname.
  • Connection: Displays the connection mode.
  • First MGMT IP: Displays the management IP address of the primary.
  • Second MGMT IP: Displays the management IP address of the secondary.
  • First MGMT Hostname: Displays the management hostname of the primary.
  • Second MGMT Hostname: Displays the management hostname of the secondary.
  • First Replication IP: Displays the IP address of the replication traffic of the primary.
  • Second Replication IP: Displays the IP address of the replication traffic of the secondary.
  • Port: Displays the port number for replication traffic.
  • Status: Displays the connection status. For more information, see Monitoring Automatic Failover.

Failover and Scan Interfaces

In a failover pair, although the scan interfaces are enabled only on the primary system, the scan interface configurations are replicated on both systems. When the primary fails, the secondary activates its scan interfaces (physical and virtual) using the same IP configurations. Both the primary and the secondary can access the network using the same scan interface configurations, so after a failover, the NetMRI appliance continues to interact with the devices using the same scan interfaces.

If no scan interfaces are configured on either failover system, the NetMRI appliances interact with network devices using the management port. The physical management port configuration is not replicated between the systems, so after a failover, the NetMRI appliance interacts with the devices using the management port of the local system. Therefore, you must configure the management IPs and infrastructure ACLs on both systems.

Software Upgrades

To upgrade a failover pair, you need to perform a software upgrade only on the primary system. The primary system upgrades locally, and then automatically upgrades the secondary. Note that during the upgrade of both systems, the failover capability is suspended. After upgrading the secondary system, both systems automatically connect and synchronize data.

Resolving Split Brain Issues

Generally speaking, "Split Brain" is a term used to describe the undesirable state in which both members of a failover pair act as primary at the same time. This is a rare situation which can occur when both the systems are up and running, but the systems completely disconnect from one another on both the MGMT and HA ports at the same time due to a network outage or a cable mishap. Split brain can also occur due to an error in the failover software. In this case, the secondary system assumes that the primary system has failed and takes on the primary role. The primary system, which does not have any contact with the secondary system continues to perform as the primary system. Having two primary systems introduces issues such as VIP contention and duplication of data.

To detect a split brain issue, complete the following:

  1. When the "Lost connectivity via peer replication link" alert occurs, run failover status on both members.
  2. Check the failover status: If the failover is enabled and both nodes are in the primary mode, you have the split brain situation.

For example:

NM85 (primary)> failover status
 Failover enabled: Yes
 Connection state: WFConnection
 Replication Role [Local|Remote]: Primary | Unknown
 Disk state [Local|Remote]: UpToDate | DUnknown
 I/O status: Running
 Network data [Sent|Received]: 0 KB | 0 KB
 Local disk data [Read|Write]: 141621 KB | 476576 KB
 Currently out of sync: 74964 KB

NM84 (secondary)> failover status
 Failover enabled: Yes
 Connection state: WFConnection
 Replication Role [Local|Remote]: Primary | Unknown
 Disk state [Local|Remote]: UpToDate | DUnknown
 I/O status: Running
 Network data [Sent|Received]: 0 KB | 0 KB
 Local disk data [Read|Write]: 524661 KB | 743072 KB
 Currently out of sync: 70760 KB

You can resolve a split brain issue by choosing one system to retain its data (the survivor) and the other to discard its data (the victim), and then forcing the victim into the secondary role. When choosing the survivor and the victim, look at each system and select the one with the most complete data as the survivor. If you are unsure, select the original primary as the survivor and the secondary as the victim. Typically, the data on both systems is similar: both systems have access to the same network, collect data from the same pool of devices, and perform the same tasks. The data from before the split brain state is identical on each system because it was replicated while the systems were still connected; only the data collected during the split brain state differs. The longer the systems remain in a split brain state, the more they diverge.

To resolve a split brain issue using the GUI, complete the following:

  1. Connect to the management IP address of the victim system and log in to the system using your username and password.
  2. Go to the Settings > Setup > Failover Configuration tab.
  3. In the Failover Configuration page, click Become Secondary.

To resolve a split brain issue using the NetMRI Admin Shell, complete the following:

  1. Use a terminal program to connect to the management IP address of the victim system.
  2. Log in to the Admin Shell using your username and password.
  3. At the Admin Shell prompt, enter the failover role secondary command, and then press Enter.
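
For example, on the victim system (NM84 from the example above, which still believes it is the primary):

    NM84 (primary)> failover role secondary

After the victim becomes the secondary, its divergent data is discarded and it resynchronizes from the survivor. Run the failover status command on both members to confirm that the roles are again Primary | Secondary and that the pair is synchronizing.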