About Automatic Failover
You can create a NetMRI failover pair using two NetMRI appliances, in which one acts as the primary appliance and the other as the secondary appliance. A failover pair provides a backup or redundant operational mode between the primary and secondary appliances so you can greatly reduce service downtime when one of them is out of service. You can configure two Operation Center (OC) appliances, collector appliances, or standalone appliances to form a failover pair.

...

  • Failover pair is supported only on NetMRI NT-1400, NT-2200, and NT-4000 (G8 only) appliances. It is not supported on NetMRI virtual appliances.
  • Failover is supported in NetMRI 7.1.1 and later releases.
  • Collector failover is supported in NetMRI 7.1.2 and later releases.
  • Both the primary and secondary must be of the same appliance model and same software version number.
  • The management IP address of both the primary and secondary must be on the same subnet.
  • The VIP address, shared by the primary and secondary, must be on the same subnet as the management IP address.
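The subnet requirements above can be verified before deployment. The following is a minimal sketch using Python's standard ipaddress module; the addresses are placeholders, not values from this guide:

```python
import ipaddress

def same_subnet(addrs, prefix_len):
    """Return True if every address falls in the same network
    for the given prefix length (e.g. 24 for a /24 subnet)."""
    networks = {
        ipaddress.ip_interface(f"{a}/{prefix_len}").network
        for a in addrs
    }
    return len(networks) == 1

# Placeholder addresses: the VIP plus the two management IPs.
vip = "10.0.1.10"
primary_mgmt = "10.0.1.11"
secondary_mgmt = "10.0.1.12"

print(same_subnet([vip, primary_mgmt, secondary_mgmt], 24))  # True
```

The same check applies to the two replication IP addresses when you use the network replication method.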

...

The following are the prerequisites for deploying automatic failover for new appliances:

  • Configure two supported NetMRI appliances with licenses installed.
  • Both the appliances must be of the same appliance model and same software version number.
  • Provision three IP addresses on the same subnet: A VIP address and two management IP addresses for the appliances.
  • If you are using the direct replication method to connect both appliances, you need an Ethernet cable to connect the systems directly through their HA Ports.
  • If you are using the network replication method to connect the appliances, you must connect the systems over a local network and two replication IP addresses must be acquired on the same subnet. You must also select a TCP port for the replication traffic.
Note:

Infoblox recommends that you use the direct replication method for the best reliability and performance. The network replication method will have higher latency and a greater chance of connection breakage, and thus lower reliability and performance.

You can deploy two new Operation Centers (OC), collector appliances, or standalone appliances to form a failover pair, as follows:

  1. Set up and configure two new NetMRI appliances as separate systems. Ensure that the appliances are running NetMRI 7.1.1 or later. For collector failover configuration, make sure that the appliances are running NetMRI 7.1.2 or later.
  2. Connect both the systems using one of the following methods:
    • Direct replication: Connect the systems directly through their HA ports.
    • Network replication: Connect the HA port of both systems to a network using an Ethernet cable.

Infoblox recommends that you connect the systems using the direct replication method.

3. Run the Setup Wizard on both appliances, set the admin password, and then install the license. The admin password must be the same on both systems. For more information, see Running the Setup Wizard.

At this point, you do not need to complete the entire configuration wizard on both systems; completing it on the primary system alone is sufficient.

4. Configure the failover settings on the Operation Center and collectors, as described in Specifying Automatic Failover Settings.

Note:

After specifying the failover configuration settings and completing the enable operation, the systems start synchronizing data. This process might take up to one hour, depending on the appliance model.

5. For an Operation Center and collector failover, complete the following:

    • Log in to the Admin Shell on the Operation Center and run the configure tunserver command. Enter the VIP address of the Operation Center when prompted for the IP address of the Operation Center server.
    • To register a collector with the Operation Center, log in to the Admin Shell on each collector and run the register tunclient command. Enter the VIP address of the Operation Center when prompted for the IP address of the Operation Center.

...

To specify automatic failover configuration settings:

...


...

  • Virtual IP address: Enter the VIP address.
  • Connection Mode: Select the connection mode from the drop-down list. Select Direct if the systems are connected directly through the HA port, or select Network if the HA ports of both systems are connected to a network. Infoblox recommends that you use the Direct connection mode.
  • Virtual Hostname: Enter the hostname for the system.
  • Port: Enter the TCP port for replication traffic if you are using the Network connection mode. You must enter a port number greater than 1024.

In the Replication Nodes section, enter the following for both Primary and Secondary.

    • Role: Displays the role of the appliance, either PRIMARY or SECONDARY.
    • Management IP: Enter the management IP address of the system.
    • Hostname: Enter the hostname of the system.
    • Replication IP: Enter the IP address used for replication traffic, if you are using Network connection mode.
    • Subnet: Enter the subnet mask of the replication IP, if you are using Network connection mode. Note that the subnet mask must be the same for both primary and secondary appliances.
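A minimal sketch of the field rules above, expressed as a validator (the field names and sample values are illustrative, not NetMRI's internal ones):

```python
def validate_failover_form(form):
    """Check two of the rules above: in Network connection mode, the
    replication port must be greater than 1024, and the replication
    subnet mask must match between primary and secondary."""
    errors = []
    if form["connection_mode"] == "Network":
        if form["port"] <= 1024:
            errors.append("Port must be greater than 1024")
        if form["primary"]["subnet"] != form["secondary"]["subnet"]:
            errors.append("Subnet mask must match on primary and secondary")
    return errors

form = {  # placeholder values for illustration only
    "connection_mode": "Network",
    "port": 7789,
    "primary": {"management_ip": "10.0.1.11", "subnet": "255.255.255.0"},
    "secondary": {"management_ip": "10.0.1.12", "subnet": "255.255.255.0"},
}
print(validate_failover_form(form))  # []
```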

...


...

4. Click Enable to start connecting the systems.

...


...

You can migrate two existing Operation Centers (OC) or standalone appliances to form a failover pair. Ensure that both appliances are running NetMRI 7.1.1 or later. To form a collector failover, migrate the existing collector to NetMRI 7.1.2 or a later release.

The following are the prerequisites for migrating existing systems as a failover pair:

  • Two supported NetMRI appliances with licenses installed. You can choose an existing appliance and a second appliance of the same model.
  • For an HA pair, provision three IP addresses: One for the primary appliance, another one for the secondary appliance, and a virtual IP address shared between the failover pair.
  • If you are using the network replication method to connect the appliances, you must connect the systems over a local network, and two replication IP addresses must be acquired on the same subnet. You must also select a TCP port for the replication traffic.

The example later in this section describes the migration of an Operation Center with two collectors to an HA Operation Center with two HA collectors. It uses the following conventions:

  • A — Nodes of the existing devices.
  • B — New nodes to be added and paired with the existing devices.

The steps required to migrate existing systems as failover pairs depend on whether your appliances use the old or the new partition scheme. If your appliances use the old partition scheme, you must prepare them first. To determine which partition scheme an appliance uses, run the show diskusage command from the Admin Shell and search for the "/drbd0" substring. If the substring is present, the appliance runs the new scheme.
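The partition-scheme check above can be scripted against captured command output. A sketch; the sample output text is invented for illustration:

```python
def uses_new_partition_scheme(show_diskusage_output: str) -> bool:
    """The new partition scheme is indicated by the "/drbd0" substring
    appearing in the output of the 'show diskusage' Admin Shell command."""
    return "/drbd0" in show_diskusage_output

# Hypothetical captured output; the real format may differ.
sample = """Filesystem      Size  Used Avail Use% Mounted on
/dev/drbd0      1.8T  620G  1.1T  36% /drbd0
"""
print(uses_new_partition_scheme(sample))  # True
```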

To migrate existing systems to form a failover pair, perform the steps described in the following sections:

Note:
For the new partition scheme, configure three nodes—one with an OC license and two others with Stand Alone licenses—with the same version and licenses, and reset the admin password in the GUI to match the other system. Then proceed with the last three steps from the list above.

Preparing Secondary Appliances (Old Partition Scheme)

To prepare a secondary (B) device, complete the following:

...

Run the config server command on node B. 

Note:

The management port of node B should be on the same network, and the system should use the same time zone and related settings, as node A. If you are using scan ports, connect the scan ports of the second node to the network in the same way as on the existing device. For more information, see Failover and Scan Interfaces.

Install the license on node B. The license must have the same type, device limit, and expiration date as node A.

Note:

In the case of an Operation Center, run config server again after the license installation without modifying any parameters.

...

You now have the new node with the new partition scheme prepared to participate in the HA pair.

Preparing the Existing Operation Center Node (Old Partition Scheme)

To prepare the existing OC node, complete the following:

  1. Follow the steps to prepare node B for HA OC as described in the section above, Preparing Secondary Appliances (Old Partition Scheme).
  2. On node A, disable SNMP collection. Go to Settings -> Setup -> Collection and Groups -> Global -> Network Polling and deselect the SNMP Collection checkbox.
  3. Generate a database archive of node A and restore it on node B. For more information, see NetMRI Database Management.
    • If data is restored successfully, proceed to the next step.
    • If the restore failed due to disk space exhaustion, try reducing data retention settings on your existing NetMRI system to reduce the archive size. It might take up to 24 hours for reduced data retention settings to take effect. For more information, see Data Retention or contact Infoblox Support for further assistance.
  4. Run the configure server command on node B.
  5. Run the config tunserver command with the new server IP (IP of node B).
  6. Re-enable SNMP collection after restoring the archive on node B. Go to Settings -> Setup -> Collection and Groups -> Global -> Network Polling and select the SNMP Collection checkbox.
  7. Log in to the Admin Shell on node A, enter the reset system command, and then enter the repartition command.
  8. After repartitioning is complete, run the configure server command on node A.
  9. Install the license.
  10. Reset the admin password in the GUI to match the other system.

The two nodes are now ready for failover configuration, where node B will be the primary with all OC data but, at this point, no connected collectors.

Preparing the Existing Collector Nodes (Old Partition Scheme)

To prepare the existing collector nodes, complete the following:

  1. Follow the steps to prepare node B for HA Collector as described in the section above, Preparing Secondary Appliances (Old Partition Scheme).
  2. Log in to the Admin Shell on the existing node, enter the reset system command, and then enter the repartition command.
  3. After repartitioning is complete, run the configure server command.
  4. Install the license; it must be identical to the existing node's license.
  5. Reset the admin password in the GUI to match the other system.
  6. Log in to the Admin Shell on node A, enter the config tunclient command and connect it to the node B OC.
  7. You now have two nodes ready for failover configuration; either one can become the primary.
  8. If you want to make node B primary, complete the following:
    • Log in to the UI of node B (primary) OC and go to Settings -> Setup -> Tunnels and Collectors.
    • Choose the existing collector (A), select Collector Replacement, and insert the Serial Number of node B.
    • Log in to the Admin Shell on node B, enter the config tunclient command and connect it to node B OC (primary).

As a result, you have a newly partitioned system (OC (B) and two collectors (A and B)) with prepared secondary devices. You can now configure a failover pair.

Configuring a Failover Pair

To configure a failover pair from the prepared appliances, complete the following:

  1. Log in to the Operation Center UI as admin.
  2. Go to Settings -> Setup -> Failover Configuration. Here you can see the HA status of your deployment (the OC and two collectors).
  3. Choose OC -> Edit and configure the Operation Center HA pair. Wait until it is finished and the status is OK.
  4. Choose the first collector -> Edit and configure the first collector HA pair. Wait until it is finished and the status is OK.
  5. Choose the second collector -> Edit and configure the second collector HA pair. Wait until it is finished and the status is OK.
Note:

If the failover status is not OK (for example, Standalone, Replication Down, or Not Synced), wait about an hour and then resynchronize from the secondary device.

You now have a NetMRI deployment with HA appliances.

Reconfiguring the Operation Center HA Pair

To reconfigure the Operation Center HA pair, complete the following:

...

oc (primary)> reset tunserver

Notice: This operation will clear all Tunnel CA, server, and client

...

+++ Stopping OpenVPN Server ... OK

+++ Configuring OpenVPN Service ... OK

+++ Clearing Server Config ...OK

...

The server needs to be restarted for these changes to take effect.

...

oc (primary)> config tunserver

+++ Configuring CA Settings

CA key expiry in days [5475]:

CA key size in bits [2048]:

+++ Configuring Server Settings

Server key expiry in days [5475]:

Server key size in bits [2048]:

Server Public Virtual Name or VIP address [172.19.2.66]: <- By default, this is already the OC VIP

Select tunnel IP protocol. 'udp' recommended for better performance.

Protocol (udp, udp6, tcp) [udp]:

Tunnel network /24 base [169.254.50.0]:

Block cipher:

0. None (RSA auth)

1. Blowfish-CBC

2. AES-128-CBC

3. Triple DES

4. AES-256-CBC

Enter Choice [2]:

...

+++ Initializing CA (may take a minute) ...

+++ Creating Server Params and Keypair ...

Generating DH parameters, 2048 bit long safe prime, generator 2

This is going to take a long time

As a result, the Operation Center now uses the new server IP address (the failover VIP).

Reconfiguring the Collector HA Pair

To reconfigure the collector HA pair, complete the following:

...

Launching "show version" on "172.19.2.62"...

Notice: This operation will clear all Tunnel client

...

Continue? (y/n) [n]: y

+++ Stopping OpenVPN Service ... OK

+++ Configuring OpenVPN Service ... OK

+++ Clearing Client Config ... OK

...

Launching "show version" on "172.19.2.62"...

NOTICE: The time zone on this system is US/Pacific.

This MUST match the Operation Center time zone for registration to be successful.

If the time zones are not equal, you must first use "configure server" to set

...

PLEASE NOTE: Changing the time zone WILL REQUIRE A SYSTEM REBOOT.

Do you want to continue registration? (y/n) [n]: y

+++ Configuring Tunnel Registration Settings

Registration Server/IP [e.g., example.com]: 172.19.2.66

Registration protocol (http|https) [https]:

Registration username: admin

...

Register this system? (y/n) [y]:y

.......

...


...


...


...

If you want to swap roles between the members of a failover pair, you can manually initiate a failover. Within about five minutes after initiating a manual failover, the secondary system assumes the primary role and takes ownership of the VIP address. Note that a manually initiated failover causes a temporary service disruption.

To initiate a manual failover using the GUI, complete the following:

  1. Log in to the primary system using your username and password.
  2. Go to the Settings -> Setup -> Failover Configuration tab.
  3. In the Failover Configuration page, click Become Secondary.

To initiate a manual failover using the NetMRI Admin Shell, perform one of the following:

  • Log in to the Admin Shell on the primary system, enter the failover role secondary command, and then press Enter.
  • Log in to the Admin Shell on the secondary system, enter the failover role primary command, and then press Enter.

...

To monitor the current status of the failover pair, complete the following:

  1. Go to the Settings -> Setup -> Failover Configuration tab.
    The Failover Configuration page appears, listing all device interfaces that are used by the system.
  2. In the Failover Configuration page, the Status field displays the current status of the failover pair. The current status can be one of the following:
    • OK (green): Indicates that the failover pair is connected and synchronized. If the primary fails, the secondary automatically takes over the primary role.
    • Syncing (yellow): Indicates that the failover pair is connected and the primary and secondary are synchronizing data. If the primary fails during synchronization, the secondary system cannot automatically take over as the primary system.
    • Replication Down (red): Indicates that the failover pair is disconnected on the HA port but reachable on the MGMT port. This may be due to a cable mishap or when the secondary goes offline.
    • Peer Down (red): Indicates that the secondary has lost connection with the primary on both HA and MGMT ports.

You can click the status link and view details about the failover status.
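The status values above can be summarized in a small lookup table. A sketch, with descriptions paraphrased from this section:

```python
# Paraphrased from the status list above: status -> (color, meaning).
FAILOVER_STATUS = {
    "OK": ("green", "Connected and synchronized; automatic takeover possible"),
    "Syncing": ("yellow", "Connected, still synchronizing; no automatic takeover yet"),
    "Replication Down": ("red", "HA port disconnected, MGMT port still reachable"),
    "Peer Down": ("red", "Secondary lost both HA and MGMT connectivity to the primary"),
}

def can_auto_failover(status: str) -> bool:
    """Only a fully synchronized pair ('OK') can fail over automatically."""
    return status == "OK"

print(can_auto_failover("Syncing"))  # False
```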

...

To view configuration details of the Operation Center (OC) and Collector pair:

  • Go to the Settings -> Setup -> Failover Configuration tab.

The Failover Configuration page appears, listing all device interfaces that are used by the system.

Note:

For an OC-collector setup, the first row of the Failover Configuration page displays the OC pair information, and the other rows display the collector pair information.

  • Actions: Click the Action icon to select Edit or Status.
  • Virtual IP: Displays the virtual IP address.
  • Virtual Host Name: Displays the virtual hostname.
  • Connection: Displays the connection mode.
  • First MGMT IP: Displays the management IP address of the primary.
  • Second MGMT IP: Displays the management IP address of the secondary.
  • First MGMT Hostname: Displays the management hostname of the primary.
  • Second MGMT Hostname: Displays the management hostname of the secondary.
  • First Replication IP: Displays the IP address of the replication traffic of the primary.
  • Second Replication IP: Displays the IP address of the replication traffic of the secondary.
  • Port: Displays the port number for replication traffic.
  • Status: Displays the connection status. For more information, see Monitoring Automatic Failover.

...

In a failover pair, although the scan interfaces are enabled only on the primary system, the scan interface configurations are replicated on both systems. When the primary fails, the secondary activates its scan interfaces (physical and virtual) using the same IP configurations. Because both the primary and the secondary can access the network using the same scan interface configurations, the NetMRI appliance continues to interact with the devices through the same scan interfaces after a failover.

If no scan interfaces are configured on either failover system, the NetMRI appliances interact with network devices using the management port. The physical management port configuration is not replicated between the systems, so after a failover, the NetMRI appliance interacts with the devices using the management port of the local system. Therefore, you must configure the management IPs and infrastructure ACLs on both systems.

...

To upgrade a failover pair, you need to perform a software upgrade only on the primary system. The primary system upgrades locally, and then automatically upgrades the secondary. Note that during the upgrade of both systems, the failover capability is suspended. After upgrading the secondary system, both systems automatically connect and synchronize data.

...

"Split brain" is a term used to describe the undesirable state in which both members of a failover pair act as the primary at the same time. This rare situation can occur when both systems are up and running but completely disconnect from one another on both the MGMT and HA ports at the same time, due to a network outage or a cable mishap. Split brain can also occur due to an error in the failover software. In this case, the secondary system assumes that the primary system has failed and takes on the primary role. The primary system, which has no contact with the secondary system, continues to perform as the primary. Having two primary systems introduces issues such as VIP contention and duplication of data.

To detect a split brain issue, complete the following:

  1. When the "Lost connectivity via peer replication link" alert occurs, run the failover status command on both members.
  2. Check the failover status: if failover is enabled and both nodes are in the primary role, you have a split brain situation.

For example:

NM85 (primary)> failover status

 Failover enabled: Yes

 Connection state: WFConnection

Replication Role [Local|Remote]: Primary | Unknown

  Disk state [Local|Remote]: UpToDate | DUnknown

   I/O status: Running

Network data [Sent|Received]: 0 KB | 0 KB

Local disk data [Read|Write]: 141621 KB | 476576 KB

Currently out of sync: 74964 KB

NM84 (secondary)> failover status

 Failover enabled: Yes

 Connection state: WFConnection

Replication Role [Local|Remote]: Primary | Unknown

   Disk state [Local|Remote]: UpToDate | DUnknown

I/O status: Running

Network data [Sent|Received]: 0 KB | 0 KB

Local disk data [Read|Write]: 524661 KB | 743072 KB

Currently out of sync: 70760 KB
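Transcripts like the ones above can also be checked programmatically: split brain means both members report a local replication role of Primary. A sketch; the parsing logic is an assumption based only on the output format shown here:

```python
import re

def local_role(failover_status_output: str) -> str:
    """Extract the local replication role from 'failover status' output."""
    m = re.search(r"Replication Role \[Local\|Remote\]:\s*(\w+)",
                  failover_status_output)
    return m.group(1) if m else "Unknown"

def is_split_brain(status_a: str, status_b: str) -> bool:
    """Both members reporting a local role of Primary indicates split brain."""
    return (local_role(status_a) == "Primary"
            and local_role(status_b) == "Primary")

# Abbreviated stand-ins for the transcripts above.
node_a = "Failover enabled: Yes\nReplication Role [Local|Remote]: Primary | Unknown"
node_b = "Failover enabled: Yes\nReplication Role [Local|Remote]: Primary | Unknown"
print(is_split_brain(node_a, node_b))  # True
```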

You can resolve a split brain issue by choosing one system to retain its data (the survivor) and the other to discard its data (the victim), and then forcing the victim into the secondary role. When choosing the survivor and the victim, examine each system and select the one with the most complete data as the survivor. If you are unsure, select the original primary as the survivor and the secondary as the victim. Typically, the data on both systems is similar: both systems have access to the same network, collect data from the same pool of devices, and perform the same tasks. The data collected before the split brain state is identical on each system because it was replicated while the systems were still connected; only the data collected during the split brain state differs. The longer the systems remain in a split brain state, the more they diverge.

To resolve a split brain issue using the GUI, complete the following:

  1. Connect to the management IP address of the victim system and log in to the system using your username and password.
  2. Go to the Settings -> Setup -> Failover Configuration tab.
  3. In the Failover Configuration page, click Become Secondary.

To resolve a Split Brain issue using the NetMRI Admin Shell, complete the following:

...
