
Managing Failover Associations

After you establish a failover association, you can monitor its status periodically to ensure that it is functioning properly. You can also delete a failover association when it is not assigned to any DHCP range.
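See the following sections on how to manage failover associations:

  • Modifying Failover Associations

  • Monitoring Failover Associations

  • Deleting Failover Associations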

Under special circumstances, you can manually adjust the configuration of a failover association. For example, when you know in advance that a peer will be out of service for an extended period of time, you can manually set the functional peer to the PARTNER-DOWN state. This allows the functional peer to assume all leases and allocate addresses to client requests at full capacity. In addition, when you suspect that the databases in a failover association are not synchronized, you can consider performing a force recovery (after you consult with Infoblox Technical Support or your Infoblox representative) so that the secondary server can completely rebuild its lease table with updates from the primary server.
See the following sections on how to set a peer to the PARTNER-DOWN state and perform a force recovery:

  • Setting a Peer in the Partner-Down State

  • Performing a Force Recovery

Modifying Failover Associations

To modify a failover association:

  1. From the Data Management tab, select the DHCP tab -> Members tab -> Failover Associations -> failover_association checkbox, and then click the Edit icon.

  2. The DHCP Failover Association editor contains the following tabs from which you can modify data:

    • General: In the Basic tab, modify the fields as described in Adding Failover Associations.
      In the Advanced tab, complete the following to modify the port number you use for the failover association:

      • Failover Port: Click Override to enter a port number for the failover association. You can use any available port from 1 to 63999. The default is 647 for a new installation and 519 for an upgrade.
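The port rule above can be sketched in a few lines of Python. This is only an illustration of the documented range and defaults: the function name and its `override` and `upgraded` parameters are hypothetical, and the sketch does not check whether the port is actually available on the appliance.

```python
# Range and defaults as stated in the Failover Port description.
MIN_FAILOVER_PORT = 1
MAX_FAILOVER_PORT = 63999
DEFAULT_PORT_NEW_INSTALL = 647
DEFAULT_PORT_UPGRADE = 519


def resolve_failover_port(override=None, upgraded=False):
    """Return the failover port, applying the documented range and defaults.

    `override` and `upgraded` are hypothetical names used only in this sketch.
    """
    if override is None:
        # No override: fall back to the default for this kind of installation.
        return DEFAULT_PORT_UPGRADE if upgraded else DEFAULT_PORT_NEW_INSTALL
    if isinstance(override, bool) or not isinstance(override, int):
        raise TypeError("failover port must be an integer")
    if not MIN_FAILOVER_PORT <= override <= MAX_FAILOVER_PORT:
        raise ValueError(
            f"failover port must be between {MIN_FAILOVER_PORT} "
            f"and {MAX_FAILOVER_PORT}"
        )
    return override
```

For example, `resolve_failover_port(upgraded=True)` yields the upgrade default, while `resolve_failover_port(64000)` raises `ValueError`.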

    • Triggers: Before editing the triggers and timers, ensure that you understand the ramifications of the changes. Improper configuration of the triggers can cause the failover association to fail. For information about the fields in the Basic tab, see Adding Failover Associations. The following are the triggers in the Advanced tab:

      • Max Response Delay Before Failover(s): Specifies the maximum time (in seconds) that a failover peer waits without hearing from its partner before it enters the COMMUNICATIONS-INTERRUPTED state. The duration must be long enough to prevent frequent connections and disconnections between the DHCP failover peers, yet short enough that a transient network failure does not keep the peers out of contact for an extended duration. The recommended default is 60 seconds.

      • Max Number of Unacked Updates: Specifies the number of "unacked" packets the server can send before a failover occurs. The default is 10 messages.

      • Max Client Lead Time (s): Specifies the length of time that a failover peer can renew a lease without contacting its peer. The larger the number, the longer it takes for the peer to recover IP addresses after moving to the PARTNER-DOWN state. The smaller the number, the more load your servers experience when they are not communicating. The default is 3600 seconds.

      • Max Load Balancing Delay (s): Specifies the cutoff after which load balancing is disabled. The cutoff is based on the number of seconds since a client sent its first DHCPDISCOVER message. For instance, if one of the failover peers gets into a state where it is busy responding to failover messages but is not responding to other client requests, the other peer responds to the client requests when the clients retry. This does not cause a failover. The default is three seconds.
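When reviewing trigger changes, it can help to keep the documented defaults in one place and list which values differ from them. The following is a minimal sketch under that assumption. The class and field names are hypothetical and do not correspond to any NIOS API; only the default values come from the descriptions above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FailoverTriggers:
    """Advanced-tab triggers with the defaults stated in this section."""

    max_response_delay_s: int = 60        # Max Response Delay Before Failover
    max_unacked_updates: int = 10         # Max Number of Unacked Updates
    max_client_lead_time_s: int = 3600    # Max Client Lead Time (MCLT)
    max_load_balancing_delay_s: int = 3   # Max Load Balancing Delay

    def changed_from_defaults(self):
        """Return {field: (default, current)} for every non-default value."""
        defaults = FailoverTriggers()
        return {
            f.name: (getattr(defaults, f.name), getattr(self, f.name))
            for f in fields(self)
            if getattr(self, f.name) != getattr(defaults, f.name)
        }
```

For instance, `FailoverTriggers(max_client_lead_time_s=1800).changed_from_defaults()` reports only the MCLT, which makes an unintended change easier to spot before it is applied.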

    • Failover Settings: This is valid for Microsoft Management only. Modify failover association settings. For information, see Configuring Failover Associations.
      If you modify the failover settings from the secondary Microsoft server, the appliance updates the failover settings on NIOS only when the following conditions are met:

      • DHCP synchronization is enabled for the primary Microsoft server. If it is disabled, you must enable it for the settings to be reflected on NIOS.

      • The primary synchronization interval has elapsed. For example, suppose you modify the failover settings from the secondary Microsoft server, the synchronization interval for the primary server is five minutes, and the interval for the secondary server is one minute. In this case, the failover settings are updated on NIOS only after the primary server synchronization interval of five minutes.

    • Extensible Attributes: Add and delete extensible attributes that are associated with a failover association. You can also modify the values of extensible attributes. For information, see Using Extensible Attributes.

Monitoring Failover Associations

After you configure a failover association, the peers establish a TCP connection for communication. In a normal operational state, they send keepalive messages and database updates every time they grant a lease. However, there are times when the failover association experiences problems and goes into a state other than NORMAL. You can monitor the overall state of a failover association and the individual status of the peers to verify that the servers are operating and communicating properly.
Both peers in a failover association maintain the same DHCP fingerprinting state (enabled or disabled) even when one of the peers fails or becomes operational again. Note that both peers must be in the same Grid for the fingerprinting state to stay the same. For information about DHCP fingerprinting, see About DHCP Fingerprints.
In this panel, you can also modify some of the data in the table. Double-click a row of data, and either edit the data in the field or select an item from a drop-down list. Note that some fields are read-only. For more information about this feature, see Modifying Data in Tables.
To monitor the failover association status:

  1. From the Data Management tab, select the DHCP tab -> Members tab -> IPv4 Failover Associations section. Grid Manager displays the list of failover associations and their overall status.

  2. To view detailed information about a failover association, select the failover_association checkbox, and then click the Show Status icon.

  3. In the Failover Association Status dialog box, Grid Manager displays the overall status of the failover association and the status of both the primary and secondary servers.
    The failover association can be in one of the following states:

    • OK (green): The failover association is functioning properly.

    • DEGRADED (yellow): The failover association is degraded when one of the peers is giving out limited addresses.

    • FAILURE (red): The failover association is not functioning, possibly because it is not completely configured. The peers are not assigning IP addresses.

    For each peer, Grid Manager displays the hostname or IP address, the status, and the event date. The peer can be in one of the following states:

    • STARTUP: The server is starting up.

    • NORMAL: The server is in a normal operational state in which it responds to its load balancing subset of DHCP clients.

    • PAUSED: This state allows a peer to inform the other peer that it is going out of service for a short period of time so the other peer can immediately transition to the COMMUNICATIONS-INTERRUPTED state and start providing DHCP service to DHCP clients.

    • COMMUNICATIONS-INTERRUPTED: The servers are not communicating with each other. Both servers provide DHCP service to DHCP clients from which they receive DHCP requests.

    • PARTNER-DOWN: The server assumes control of the DHCP service because its peer is out of service.

    • RECOVER: The server is starting up and trying to get a complete update from its peer and discovers that its peer is in the PARTNER-DOWN state.

    • RECOVER-WAIT: The server has received a complete update from its peer and is waiting for the MCLT period to pass before transitioning to the RECOVER-DONE state.

    • RECOVER-DONE: The server completed an update from its peer.

    • POTENTIAL-CONFLICT: The peers are not synchronized due to an administrative error or an incorrect state transition. Check the failover configuration and correct the error.

    • CONFLICT-DONE: This is a temporary state that the primary server enters after it received updates from the secondary server when it was in the POTENTIAL-CONFLICT state.

    • RESOLUTION-INTERRUPTED: The server responds to DHCP clients in a limited way when it is in this state.

    • UNKNOWN: The DHCP server is in an unknown state. The failover association is not functioning properly, possibly because it is configured improperly. For example, the failover association is not assigned to any DHCP range.

    • SHUTDOWN: This state allows a peer to inform the other peer that it is going out of service for a long period of time so the other peer can immediately transition to the PARTNER-DOWN state and completely assume control of the DHCP service.

Note

NIOS does not support PARTNER-DOWN and Force Recovery for a Microsoft DHCP failover association.
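The peer states listed above can be modeled as an enumeration when you process status information in a script. The sketch below is illustrative only: the grouping of states that "need attention" is an assumption of this example, not a rule defined by NIOS, and the function names and the input dictionary shape are hypothetical.

```python
from enum import Enum


class PeerState(Enum):
    STARTUP = "STARTUP"
    NORMAL = "NORMAL"
    PAUSED = "PAUSED"
    COMMUNICATIONS_INTERRUPTED = "COMMUNICATIONS-INTERRUPTED"
    PARTNER_DOWN = "PARTNER-DOWN"
    RECOVER = "RECOVER"
    RECOVER_WAIT = "RECOVER-WAIT"
    RECOVER_DONE = "RECOVER-DONE"
    POTENTIAL_CONFLICT = "POTENTIAL-CONFLICT"
    CONFLICT_DONE = "CONFLICT-DONE"
    RESOLUTION_INTERRUPTED = "RESOLUTION-INTERRUPTED"
    UNKNOWN = "UNKNOWN"
    SHUTDOWN = "SHUTDOWN"


# Assumption for this sketch: states an operator probably wants to review.
NEEDS_ATTENTION = {
    PeerState.COMMUNICATIONS_INTERRUPTED,
    PeerState.POTENTIAL_CONFLICT,
    PeerState.RESOLUTION_INTERRUPTED,
    PeerState.UNKNOWN,
}


def parse_state(text):
    """Convert a state string such as 'partner-down' to a PeerState."""
    return PeerState(text.strip().upper())


def peers_needing_attention(peer_states):
    """Return the sorted peer names whose state is in NEEDS_ATTENTION.

    `peer_states` maps a peer name to its state string (hypothetical shape).
    """
    return sorted(
        name
        for name, state in peer_states.items()
        if parse_state(state) in NEEDS_ATTENTION
    )
```

A call such as `peers_needing_attention({"dhcp1.example.com": "NORMAL", "dhcp2.example.com": "potential-conflict"})` returns only the second peer. An unrecognized state string raises `ValueError`, which is usually preferable to silently ignoring it.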

Deleting Failover Associations

You cannot delete a failover association that is currently assigned to a DHCP range. Before you delete a failover association, ensure that it is not assigned to any DHCP range.
To delete a failover association:

  1. From the Data Management tab, select the DHCP tab -> Members tab -> Failover Associations -> failover_association checkbox, and then click the Delete icon.

  2. In the Delete Confirmation dialog box, click Yes.
    The appliance puts the failover association in the recycle bin, if enabled.
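Before deleting, you may want to list the DHCP ranges that still reference the association. The sketch below assumes you already have the range data as a list of dictionaries. The `range` and `failover_association` keys are hypothetical and do not represent a documented export format.

```python
def ranges_blocking_delete(association_name, dhcp_ranges):
    """Return the ranges that still use the given failover association.

    Each entry in `dhcp_ranges` is assumed to be a dict with a 'range' key
    and an optional 'failover_association' key (hypothetical shape).
    """
    return [
        r["range"]
        for r in dhcp_ranges
        if r.get("failover_association") == association_name
    ]


# Illustrative data only.
sample_ranges = [
    {"range": "10.0.0.10-10.0.0.100", "failover_association": "fo-assoc-1"},
    {"range": "10.0.1.10-10.0.1.100"},
]
blocking = ranges_blocking_delete("fo-assoc-1", sample_ranges)
```

An empty result means that, as far as the supplied data shows, nothing prevents the deletion.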

Setting a Peer in the Partner-Down State

If one of the peers in a failover association is out of service for an extended period of time, you should consider putting the functional peer in the PARTNER-DOWN state. When you place the functional peer in the PARTNER-DOWN state, it assumes full DHCP services for the networks. Because the functional server may not have received all the updates from its peer, it extends all leases based on the MCLT. Once the following conditions are met, the functional peer provides DHCP services autonomously:

  • It has reclaimed all the leases that belonged to its peer.

  • The MCLT has passed.

When the peer that is offline comes back online, it synchronizes with the functional peer and reestablishes the communication before it provides DHCP services to the clients.

Warning

Before you put a peer in the PARTNER-DOWN state, ensure that the other peer is indeed out of service. If both the primary and secondary servers are operational when you place one of them in the PARTNER-DOWN state, both servers may stop issuing leases for at least the period defined by the MCLT.

To set a peer in the PARTNER-DOWN state:

  1. From the Data Management tab, select the DHCP tab -> Members tab -> Failover Associations -> failover_association checkbox.

  2. Expand the Toolbar and click Set Partner Down.

  3. In the Set Failover Association Partner Down dialog box, select one of the following:

    • Primary: Select this if the secondary server is out of service.

    • Secondary: Select this if the primary server is out of service.

  4. Click OK.

Note

You cannot place the functional peer in the PARTNER-DOWN state for a Microsoft DHCP failover in NIOS.
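The selection rule in the procedure above, together with the warning and the Microsoft restriction, can be expressed as a small guard function. This is a sketch only. The function and parameter names are hypothetical, and determining whether a peer is really out of service remains the operator's responsibility.

```python
def peer_to_set_partner_down(primary_in_service, secondary_in_service,
                             microsoft_failover=False):
    """Return 'Primary' or 'Secondary': the option to select in the dialog box.

    The returned value names the functional peer that should be placed in
    the PARTNER-DOWN state.
    """
    if microsoft_failover:
        # Per the note above, NIOS does not support this for Microsoft failover.
        raise ValueError("PARTNER-DOWN is not supported for Microsoft DHCP failover")
    if primary_in_service and secondary_in_service:
        # Per the warning above, this may stop both servers from issuing leases.
        raise ValueError("both peers are operational; do not set PARTNER-DOWN")
    if not primary_in_service and not secondary_in_service:
        raise ValueError("no functional peer is available")
    # Select Primary if the secondary is out of service, and vice versa.
    return "Primary" if primary_in_service else "Secondary"
```

For example, when the secondary server is out of service, `peer_to_set_partner_down(True, False)` returns `"Primary"`.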

Performing a Force Recovery

When the primary and secondary peers are not synchronized, you can perform a force recovery to set the primary server in the PARTNER-DOWN state while putting the secondary server in the RECOVER state. During a force recovery, all leases in the databases are resynchronized. When you perform a force recovery, the secondary server does not serve any DHCP leases for a minimum of the MCLT while it resynchronizes with the primary server. Before you perform a force recovery, consult with Infoblox Technical Support or your Infoblox representative to ensure that the force recovery is appropriate for the situation.
To perform a force recovery:

  1. From the Data Management tab, select the DHCP tab -> Members tab -> Failover Associations -> failover_association checkbox.

  2. Expand the Toolbar and click Force Recovery State.

  3. In the Force Secondary Peer Recovery State dialog box, click OK.

The appliance synchronizes the databases on the primary and secondary servers.
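Because the secondary server does not serve leases for at least the MCLT during a force recovery, it can be useful to estimate the earliest time at which it could resume service. The following is a rough sketch that assumes the MCLT is the only lower bound. Actual resynchronization can take longer, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

DEFAULT_MCLT_SECONDS = 3600  # default Max Client Lead Time stated in this document


def earliest_secondary_service(recovery_start, mclt_seconds=DEFAULT_MCLT_SECONDS):
    """Return the earliest time the secondary could serve leases again.

    This is a lower bound only: recovery start plus the MCLT.
    """
    if mclt_seconds < 0:
        raise ValueError("MCLT must not be negative")
    return recovery_start + timedelta(seconds=mclt_seconds)


# Illustrative start time; with the default MCLT the bound is one hour later.
start = datetime(2024, 1, 1, 12, 0, 0)
earliest = earliest_secondary_service(start)
```

Scheduling a force recovery so that this window falls in a low-traffic period reduces the impact of the secondary server being unavailable.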

Recovering DHCP Failover Associations

During conflict resolution, when the primary peer of the DHCP failover association is in the CONFLICT-DONE state and the secondary peer is in the POTENTIAL-CONFLICT state, the secondary peer might experience problems (such as a restart or a network outage) and go into an invalid state. This results in a deadlock state for the failover association, causing a DHCP service outage. When the failover association is in a deadlock state, you can perform a recovery for the failover association. You can run the recovery for only one failover association at a time, and only when the primary member is in the CONFLICT-DONE state. This feature is supported for Infoblox appliances only and not for any other external DHCP servers.

To recover a DHCP failover association:

  1. From the Data Management tab, select the DHCP tab -> Members tab -> Failover Associations -> failover_association checkbox.

  2. Expand the Toolbar and click Recovery from Deadlock State.

  3. In the Failover Recovery Progress dialog box, click Start to start the recovery of the failover association from the deadlock state.

  4. In the confirmation dialog box, click Yes.

Grid Manager starts the failover recovery, and you can view the following information in the Failover Recovery Progress dialog box:

  • Failover association: The name of the failover association.

  • Primary: The hostname or IP address of the primary server.

  • Secondary: The hostname or IP address of the secondary server.

  • Number of leases to be processed: The total number of leases to be processed.

  • Number of leases processed: The number of leases that have been processed.

  • Current Status: Displays the current status of the failover recovery process. The current status can be one of the following:

    • Pending: The failover recovery is initiated for a failover association and the recovery process will start soon.

    • Calculating: The appliance calculates the total number of leases to be processed.

    • Applying: The appliance looks for conflicts and tries to resolve the conflicts.

    • Completed: The failover recovery is completed successfully.

    • Failed: The failover recovery fails.

If the recovery fails, Grid Manager also displays the reason for the failure.

After successful completion of the failover recovery, you must restart both the primary and secondary peers to bring them back to the CONFLICT-DONE state.
You can stop the failover recovery operation by clicking Stop in the Failover Recovery Progress dialog box before the recovery process is complete.
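The two lease counters and the status values shown in the Failover Recovery Progress dialog box are enough to derive a simple progress figure. The sketch below models them in Python. The enumeration mirrors the statuses listed above, while the function names and the percentage calculation are assumptions of this example rather than anything Grid Manager exposes.

```python
from enum import Enum


class RecoveryStatus(Enum):
    PENDING = "Pending"
    CALCULATING = "Calculating"
    APPLYING = "Applying"
    COMPLETED = "Completed"
    FAILED = "Failed"


def progress_percent(leases_processed, leases_to_process):
    """Return progress as a percentage, capped at 100.

    Returns 0.0 while the total is not yet known (for example, during
    the Pending or Calculating status).
    """
    if leases_to_process <= 0:
        return 0.0
    return min(100.0, 100.0 * leases_processed / leases_to_process)


def is_finished(status):
    """Return True once the recovery has either completed or failed."""
    return status in (RecoveryStatus.COMPLETED, RecoveryStatus.FAILED)
```

For example, 250 of 1000 leases processed corresponds to `progress_percent(250, 1000)`, or 25 percent, and `is_finished(RecoveryStatus.APPLYING)` is false.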
