Troubleshooting High Availability

Manual Failover, Fallback, and Recovery

Use Cisco Unified Communications Manager Administration to initiate a manual failover, fallback, and recovery for IM and Presence Service nodes in a presence redundancy group. You can also initiate these actions from Cisco Unified Communications Manager or IM and Presence Service using the CLI. See the Command Line Interface Guide for Cisco Unified Communications Solutions for details.

  • Manual failover: When you initiate a manual failover, the Cisco Server Recovery Manager stops the critical services on the failed node. All users from the failed node are disconnected and must log in again to the backup node.


    Note

    After a manual failover occurs, critical services are not started until you initiate a manual fallback.
  • Manual fallback: When you initiate a manual fallback, the Cisco Server Recovery Manager restarts critical services on the primary node and disconnects all users that had been failed over. Those users must then log in again to their assigned node.

  • Manual recovery: When both nodes in the presence redundancy group are in a failed state and you initiate a manual recovery, the IM and Presence Service restarts the Cisco Server Recovery Manager service on both nodes in the presence redundancy group.

Initiate Manual Failover

You can manually initiate a failover of IM and Presence Service nodes in a presence redundancy group using Cisco Unified Communications Manager Administration.

Procedure


Step 1

Select System > Presence Redundancy Groups.

The Find and List Presence Redundancy Groups window displays.

Step 2

Select the presence redundancy group search parameters, and then click Find.

Matching records appear.

Step 3

Select the presence redundancy group that is listed in the Find and List Presence Redundancy Groups window.

The Presence Redundancy Group Configuration window appears.

Step 4

Click Failover in the ServerAction field.

Note 

This button appears only when the server and presence redundancy group are in the correct states.


Initiate Manual Fallback

Use Cisco Unified Communications Manager Administration to manually initiate the fallback of an IM and Presence Service node in a presence redundancy group that has failed over. For more information about presence redundancy group node status, see topics related to node state, state change causes, and recommended actions.

Procedure


Step 1

Select System > Presence Redundancy Groups.

The Find and List Presence Redundancy Groups window displays.

Step 2

Select the presence redundancy group search parameters, and then click Find.

Matching records appear.

Step 3

Select the presence redundancy group that is listed in the Find and List Presence Redundancy Groups window.

The Presence Redundancy Group Configuration window appears.

Step 4

Click Fallback in the ServerAction field.

Note 

This button appears only when the server and presence redundancy group are in the correct states.


Initiate Manual Recovery

A manual recovery is necessary when both nodes in the presence redundancy group are in the failed state. Use Cisco Unified Communications Manager Administration to manually initiate the recovery of IM and Presence Service nodes in a presence redundancy group that is in the failed state.

For more information about presence redundancy group node status, see topics related to node state, state change causes, and recommended actions.

Procedure


Step 1

Select System > Presence Redundancy Groups.

The Find and List Presence Redundancy Groups window displays.

Step 2

Select the presence redundancy group search parameters, and then click Find.

Matching records appear.

Step 3

Select the presence redundancy group that is listed in the Find and List Presence Redundancy Groups window.

The Presence Redundancy Group Configuration window appears.

Step 4

Click Recover.

Note 

This button appears only when the server and presence redundancy group are in the correct states.


View Presence Redundancy Group Node Status

Use the Cisco Unified CM Administration user interface to view the status of IM and Presence Service nodes that are members of a presence redundancy group.

Procedure


Step 1

Choose System > Presence Redundancy Groups.

The Find and List Presence Redundancy Groups window displays.

Step 2

Choose the presence redundancy group search parameters, and then click Find.

Matching records appear.

Step 3

Choose a presence redundancy group that is listed in the search results.

The Presence Redundancy Group Configuration window appears. If two nodes are configured in that group and high availability is enabled, then the status of the nodes within that group is displayed in the High Availability area.


Node State Definitions

Table 1. Presence Redundancy Group Node State Definitions

State | Description
Initializing | This is the initial (transition) state when the Cisco Server Recovery Manager service starts; it is a temporary state.
Idle | IM and Presence Service is in Idle state when failover occurs and services are stopped. In Idle state, the IM and Presence Service node does not provide any availability or Instant Messaging services. In Idle state, you can manually initiate a fallback to this node using the Cisco Unified CM Administration user interface.
Normal | This is a stable state. The IM and Presence Service node is operating normally. In this state, you can manually initiate a failover to this node using the Cisco Unified CM Administration user interface.
Running in Backup Mode | This is a stable state. The IM and Presence Service node is acting as the backup for its peer node. Users have moved to this (backup) node.
Taking Over | This is a transition state. The IM and Presence Service node is taking over for its peer node.
Failing Over | This is a transition state. The IM and Presence Service node is being taken over by its peer node.
Failed Over | This is a steady state. The IM and Presence Service node has failed over, but no critical services are down. In this state, you can manually initiate a fallback to this node using the Cisco Unified CM Administration user interface.
Failed Over with Critical Services Not Running | This is a steady state. Some of the critical services on the IM and Presence Service node have either stopped or failed.
Falling Back | This is a transition state. The system is falling back to this IM and Presence Service node from the node that is running in backup mode.
Taking Back | This is a transition state. The failed IM and Presence Service node is taking back over from its peer.
Running in Failed Mode | An error occurs during the transition states or Running in Backup Mode state.
Unknown | Node state is unknown. A possible cause is that high availability was not enabled properly on the IM and Presence Service node. Restart the Server Recovery Manager service on both nodes in the presence redundancy group.
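The state definitions above say which manual action is offered in which state. The following is a minimal Python sketch of that mapping, not a Cisco API: the function names `available_manual_action` and `recovery_needed` are hypothetical, and the mapping covers only what the table states explicitly (failover from Normal, fallback from Idle or Failed Over, recovery when both nodes are in a failed state).

```python
from typing import Optional

# Manual action that the state definitions say can be initiated per node state.
# States that are not listed (for example, transition states) offer no action here.
MANUAL_ACTION_BY_STATE = {
    "Normal": "failover",
    "Idle": "fallback",
    "Failed Over": "fallback",
}

FAILED_STATE = "Running in Failed Mode"


def available_manual_action(node_state: str) -> Optional[str]:
    """Return the manual action the table associates with a node state, if any."""
    return MANUAL_ACTION_BY_STATE.get(node_state)


def recovery_needed(node1_state: str, node2_state: str) -> bool:
    """Manual recovery applies only when both nodes are in the failed state."""
    return node1_state == FAILED_STATE and node2_state == FAILED_STATE
```

For example, `available_manual_action("Taking Over")` returns `None`, because the table does not list a manual action for transition states.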

Node States, Causes, and Recommended Actions

You can view the status of nodes in a presence redundancy group on the Presence Redundancy Group Configuration window when you choose a group using the Cisco Unified CM Administration user interface.

Table 2. Presence Redundancy Group Node High-Availability States, Causes, and Recommended Actions

Node 1 State | Node 1 Reason | Node 2 State | Node 2 Reason | Cause/Recommended Actions

Normal

Normal

Normal

Normal

Normal

Failing Over

On Admin Request

Taking Over

On Admin Request

The administrator initiated a manual failover from node 1 to node 2. The manual failover is in progress.

Idle

On Admin Request

Running in Backup Mode

On Admin Request

The manual failover from node 1 to node 2 that the administrator initiated is complete.

Taking Back

On Admin Request

Falling Back

On Admin Request

The administrator initiated a manual fallback from node 2 to node 1. The manual fallback is in progress.

Idle

Initialization

Running in Backup Mode

On Admin Request

The administrator restarts the SRM service on node 1 while node 1 is in "Idle" state.

Idle

Initialization

Running in Backup Mode

Initialization

The administrator either restarts both nodes in the presence redundancy group, or restarts the SRM service on both nodes while the presence redundancy group was in manual failover mode.

Idle

On Admin Request

Running in Backup Mode

Initialization

The administrator restarts the SRM service on node 2 while node 2 is running in backup mode, but before the heartbeat on node 1 times out.

Failing Over

On Admin Request

Taking Over

Initialization

The administrator restarts the SRM service on node 2 while node 2 is taking over, but before the heartbeat on node 1 times out.

Taking Back

Initialization

Falling Back

On Admin Request

The administrator restarts the SRM service on node 1 while taking back, but before the heartbeat on node 2 times out. After the taking back process is complete, both nodes are in Normal state.

Taking Back

Automatic Fallback

Falling Back

Automatic Fallback

Automatic Fallback has been initiated from node 2 to node 1 and is currently in progress.

Failed Over

Initialization or Critical Services Down

Running in Backup Mode

Critical Service Down

Node 1 transitions to Failed Over state when either of the following conditions occurs:

  • Critical services come back up due to a reboot of node 1.

  • The administrator starts critical services on node 1 while node 1 is in Failed Over with Critical Services Not Running state.

When node 1 transitions to Failed Over state, the node is ready for the administrator to perform a manual fallback to restore the nodes in the presence redundancy group to Normal state.

Failed Over with Critical Services Not Running

Critical Service Down

Running in Backup Mode

Critical Service Down

A critical service is down on node 1. IM and Presence Service performs an automatic failover to node 2.

Recommended Actions:

  1. Check node 1 for any critical services that are down and try to manually start those services.

  2. If the critical services on node 1 do not start, then reboot node 1.

  3. When all the critical services are up and running after the reboot, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.

Failed Over with Critical Services Not Running

Database Failure

Running in Backup Mode

Database Failure

A database service is down on node 1. IM and Presence Service performs an automatic failover to node 2.

Recommended Actions:

  1. Reboot node 1.

  2. When all the critical services are up and running after the reboot, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.

Running in Failed Mode

Start of Critical Services Failed

Running in Failed Mode

Start of Critical Services Failed

Critical services fail to start while a node in the presence redundancy group is taking back from the other node.

Recommended Actions. On the node that is taking back, perform the following actions:

  1. Check the node for critical services that are down. To manually start these services, click Recovery in the Presence Redundancy Group Configuration window.

  2. If the critical services do not start, reboot the node.

  3. When all the critical services are up and running after the reboot, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.

Running in Failed Mode

Critical Service Down

Running in Failed Mode

Critical Service Down

Critical services go down on the backup node. Both nodes enter the failed state.

Recommended Actions:

  1. Check the backup node for critical services that are down. To start these services manually, click Recovery in the Presence Redundancy Group Configuration window.

  2. If the critical services do not start, reboot the node.

Node 1 is down due to loss of network connectivity or the SRM service is not running.

Running in Backup Mode

Peer Down

Node 2 has lost the heartbeat from node 1. IM and Presence Service performs an automatic failover to node 2.

Recommended Action. If node 1 is up, perform the following actions:

  1. Check and repair the network connectivity between nodes in the presence redundancy group. When you reestablish the network connection between the nodes, the node may go into a failed state. Click Recovery in the Presence Redundancy Group Configuration window to restore the nodes to the Normal state.

  2. Start the SRM service and perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.

  3. (If the node is down) Repair and power up node 1.

  4. When the node is up and all critical services are running, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.

Node 1 is down (due to possible power down, hardware failure, shutdown, reboot)

Running in Backup Mode

Peer Reboot

IM and Presence Service performs an automatic failover to node 2 due to the following possible conditions on node 1:
  • hardware failure

  • power down

  • restart

  • shutdown

Recommended Actions:

  1. Repair and power up node 1.

  2. When the node is up and all critical services are running, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.

Failed Over with Critical Services Not Running or Failed Over

Initialization

Running in Backup Mode

Peer Down During Initialization

Node 2 does not see node 1 during startup.

Recommended Action:

When node 1 is up and all critical services are running, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.

Running in Failed Mode

Cisco Server Recovery Manager Take Over Users Failed

Running in Failed Mode

Cisco Server Recovery Manager Take Over Users Failed

User move fails during the taking over process.

Recommended Action:

Possible database error. Click Recovery in the Presence Redundancy Group Configuration window. If the problem persists, then reboot the nodes.

Running in Failed Mode

Cisco Server Recovery Manager Take Back Users Failed

Running in Failed Mode

Cisco Server Recovery Manager Take Back Users Failed

User move fails during the falling back process.

Recommended Action:

Possible database error. Click Recovery in the Presence Redundancy Group Configuration window. If the problem persists, then reboot the nodes.

Running in Failed Mode

Unknown

Running in Failed Mode

Unknown

The SRM on a node restarts while the SRM on the other node is in a failed state, or an internal system error occurs.

Recommended Action:

Click Recovery in the Presence Redundancy Group Configuration window. If the problem persists, then reboot the nodes.

Backup Activated

Auto Recover Database Failure

Failover Affected Services

Auto Recover Database Failure

The database goes down on the backup node. The peer node is in failover mode and can take over for all users in the presence redundancy group. Auto-recovery operation automatically occurs and all users are moved over to the primary node.

Backup Activated

Auto Recover Database Failure

Failover Affected Services

Auto Recover Critical Service Down

A critical service goes down on the backup node. The peer node is in failover mode and can take over for all users in the presence redundancy group. Auto-recovery operation automatically occurs and all users are moved over to the peer node.

Unknown

Unknown

Node state is unknown.

A possible cause is that high availability was not enabled properly on the IM and Presence Service node.

Recommended Action:

Restart the Server Recovery Manager service on both nodes in the presence redundancy group.
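A runbook or monitoring script could encode rows of Table 2 as a lookup from node 1's state and reason to the recommended actions. The sketch below is a hypothetical illustration in Python that covers only four rows of the table. The function name `recommended_actions` and the data layout are assumptions, not part of any Cisco tool, and the action text is condensed from the table.

```python
from typing import Dict, List, Tuple

# (node 1 state, node 1 reason) -> recommended actions, condensed from Table 2.
# Only a few rows are shown; extend the mapping from the table as needed.
_ROWS: List[Tuple[str, str, List[str]]] = [
    ("Failed Over with Critical Services Not Running", "Critical Service Down", [
        "Check node 1 for critical services that are down and try to start them manually.",
        "If the critical services do not start, reboot node 1.",
        "When all critical services are running, perform a manual fallback.",
    ]),
    ("Failed Over with Critical Services Not Running", "Database Failure", [
        "Reboot node 1.",
        "When all critical services are running, perform a manual fallback.",
    ]),
    ("Running in Failed Mode", "Unknown", [
        "Click Recovery in the Presence Redundancy Group Configuration window.",
        "If the problem persists, reboot the nodes.",
    ]),
    ("Unknown", "", [
        "Restart the Server Recovery Manager service on both nodes.",
    ]),
]

# Keys are casefolded so that capitalization differences in state names do not matter.
_ACTIONS: Dict[Tuple[str, str], List[str]] = {
    (state.casefold(), reason.casefold()): actions for state, reason, actions in _ROWS
}


def recommended_actions(state: str, reason: str = "") -> List[str]:
    """Return the recommended actions for a node 1 state and reason, or [] if none is listed."""
    return _ACTIONS.get((state.casefold(), reason.casefold()), [])
```

A state and reason pair that is not in the mapping, such as Normal and Normal, returns an empty list, which a caller can treat as "no action required or row not encoded".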

Restarting Services with High Availability

If a system configuration change or system upgrade requires you to disable High Availability and then restart the Cisco XCP Router, the Cisco Presence Engine, or the server itself, you must allow sufficient time for Cisco Jabber sessions to be recreated before you enable High Availability again. Otherwise, presence does not work for Jabber clients whose sessions have not been recreated.

Make sure to follow this process:

Procedure


Step 1

Before you make any changes, check the Presence Topology window in Cisco Unified CM IM and Presence Administration (System > Presence Topology). Record the number of users assigned to each node in each presence redundancy group.

Step 2

Disable High Availability in each Presence Redundancy Group and wait at least two minutes for the new HA settings to synchronize.

Step 3

Do whichever of the following is required for your update:

  • Restart the Cisco XCP Router
  • Restart the Cisco Presence Engine
  • Restart the server
Step 4

After the restart, monitor the number of active sessions on all nodes.

Step 5

On each node, run the show perf query counter "Cisco Presence Engine" ActiveJsmSessions CLI command to confirm the number of active sessions. The number of active sessions should match the number of assigned users that you recorded in Step 1. It should take no more than 15 minutes for all sessions to resume.

Step 6

Once all of your sessions are created, you can enable High Availability within the Presence Redundancy Group.

Note 

If 30 minutes pass and the active sessions have not yet been created, restart the Cisco Presence Engine. If that does not work, there is a larger system issue for you to fix.

Note 

Back-to-back restarts of the Cisco XCP Router and the Cisco Presence Engine are not recommended. If you must restart both services, restart the first service and wait for all of the JSM sessions to be recreated before you restart the second service.
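The wait in Steps 4 through 6 can be sketched as a polling loop. The Python example below is a minimal sketch under stated assumptions: `read_active` is a hypothetical callable that you supply to return the current ActiveJsmSessions value per node (for example, by collecting the CLI output yourself), and the 30-minute timeout and 60-second polling interval are illustrative defaults. The comparison follows the procedure literally and expects the active session count to equal the recorded number of assigned users.

```python
import time
from typing import Callable, Dict


def sessions_restored(assigned: Dict[str, int], active: Dict[str, int]) -> bool:
    """True when every node reports the session count recorded in Step 1."""
    return all(active.get(node) == count for node, count in assigned.items())


def wait_for_sessions(
    assigned: Dict[str, int],
    read_active: Callable[[], Dict[str, int]],  # hypothetical: returns {node: ActiveJsmSessions}
    timeout_s: float = 30 * 60,  # after 30 minutes, the note suggests restarting the Presence Engine
    poll_s: float = 60,          # illustrative polling interval, not from the source
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until all sessions are recreated (True) or the timeout elapses (False)."""
    deadline = clock() + timeout_s
    while True:
        if sessions_restored(assigned, read_active()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A return value of `True` corresponds to the point at which High Availability can be enabled again. A return value of `False` corresponds to the 30-minute note above. The `sleep` and `clock` parameters exist only so that the loop can be exercised without real waiting.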