Troubleshooting High Availability

Manual Failover, Fallback, and Recovery

Use Cisco Unified Communications Manager Administration to initiate a manual failover, fallback, and recovery for IM and Presence Service nodes in a presence redundancy group. You can also initiate these actions from Cisco Unified Communications Manager or IM and Presence Service using the CLI. See the Command Line Interface Guide for Cisco Unified Communications Solutions for details.

Manual failover: When you initiate a manual failover, the Cisco Server Recovery Manager stops the critical services on the failed node. All users from the failed node are disconnected and must log in again on the backup node.

Note

After a manual failover occurs, critical services are not restarted until you initiate a manual fallback.

Manual fallback: When you initiate a manual fallback, the Cisco Server Recovery Manager restarts critical services on the primary node and disconnects all users that had been failed over. Those users must then log in again on their assigned node.
Manual recovery: When both nodes in the presence redundancy group are in a failed state and you initiate a manual recovery, the IM and Presence Service restarts the Cisco Server Recovery Manager service on both nodes in the presence redundancy group.

Initiate Manual Failover

You can manually initiate a failover of IM and Presence Service nodes in a presence redundancy group using Cisco Unified Communications Manager Administration.

Procedure

Step 1

Select System > Presence Redundancy Groups.

The Find and List Presence Redundancy Groups window displays.

Step 2

Select the presence redundancy group search parameters, and then click Find.

Matching records appear.

Step 3

Select the presence redundancy group that is listed in the Find and List Presence Redundancy Groups window.

The Presence Redundancy Group Configuration window appears.

Step 4

Click Failover in the ServerAction field.

Note

This button appears only when the server and presence redundancy group are in the correct states.

Initiate Manual Fallback

Use Cisco Unified Communications Manager Administration to manually initiate the fallback of an IM and Presence Service node in a presence redundancy group that has failed over. For more information about presence redundancy group node status, see topics related to node state, state change causes, and recommended actions.

Procedure

Step 1

Select System > Presence Redundancy Groups.

The Find and List Presence Redundancy Groups window displays.

Step 2

Select the presence redundancy group search parameters, and then click Find.

Matching records appear.

Step 3

Select the presence redundancy group that is listed in the Find and List Presence Redundancy Groups window.

The Presence Redundancy Group Configuration window appears.

Step 4

Click Fallback in the ServerAction field.

Note

This button appears only when the server and presence redundancy group are in the correct states.

Initiate Manual Recovery

A manual recovery is necessary when both nodes in the presence redundancy group are in the failed state. Use Cisco Unified Communications Manager Administration to manually initiate the recovery of IM and Presence Service nodes in a presence redundancy group that is in the failed state.

For more information about presence redundancy group node status, see topics related to node state, state change causes, and recommended actions.

Procedure

Step 1

Select System > Presence Redundancy Groups.

The Find and List Presence Redundancy Groups window displays.

Step 2

Select the presence redundancy group search parameters, and then click Find.

Matching records appear.

Step 3

Select the presence redundancy group that is listed in the Find and List Presence Redundancy Groups window.

The Presence Redundancy Group Configuration window appears.

Step 4

Click Recover.

Note

This button appears only when the server and presence redundancy group are in the correct states.

View Presence Redundancy Group Node Status

Use the Cisco Unified CM Administration user interface to view the status of IM and Presence Service nodes that are members of a presence redundancy group.

Procedure

Step 1

Choose System > Presence Redundancy Groups.

The Find and List Presence Redundancy Groups window displays.

Step 2

Choose the presence redundancy group search parameters, and then click Find.

Matching records appear.

Step 3

Choose a presence redundancy group that is listed in the search results.

The Presence Redundancy Group Configuration window appears. If two nodes are configured in that group and high availability is enabled, the status of the nodes within that group is displayed in the High Availability area.

Node State Definitions

Table 1. Presence Redundancy Group Node State Definitions
State	Description
Initializing	This is the initial (transition) state when the Cisco Server Recovery Manager service starts; it is a temporary state.
Idle	IM and Presence Service is in Idle state when failover occurs and services are stopped. In Idle state, the IM and Presence Service node does not provide any availability or Instant Messaging services. In Idle state, you can manually initiate a fallback to this node using the Cisco Unified CM Administration user interface.
Normal	This is a stable state. The IM and Presence Service node is operating normally. In this state, you can manually initiate a failover to this node using the Cisco Unified CM Administration user interface.
Running in Backup Mode	This is a stable state. The IM and Presence Service node is acting as the backup for its peer node. Users have moved to this (backup) node.
Taking Over	This is a transition state. The IM and Presence Service node is taking over for its peer node.
Failing Over	This is a transition state. The IM and Presence Service node is being taken over by its peer node.
Failed Over	This is a steady state. The IM and Presence Service node has failed over, but no critical services are down. In this state, you can manually initiate a fallback to this node using the Cisco Unified CM Administration user interface.
Failed Over with Critical Services Not Running	This is a steady state. Some of the critical services on the IM and Presence Service node have either stopped or failed.
Falling Back	This is a transition state. The system is falling back to this IM and Presence Service node from the node that is running in backup mode.
Taking Back	This is a transition state. The failed IM and Presence Service node is taking back over from its peer.
Running in Failed Mode	An error occurs during the transition states or Running in Backup Mode state.
Unknown	Node state is unknown. A possible cause is that high availability was not enabled properly on the IM and Presence Service node. Restart the Server Recovery Manager service on both nodes in the presence redundancy group.
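
If you track node status in a monitoring script, the states in Table 1 can be captured as a small lookup. The following Python sketch is illustrative only and is not part of any Cisco tool. The names NodeState, state_kind, and manual_action are hypothetical. The classification mirrors the wording of Table 1: states that the table calls stable or steady are grouped as "stable", states that it calls transition or temporary are grouped as "transition", and states that the table does not label are reported as "unspecified".

```python
from enum import Enum
from typing import Optional


class NodeState(Enum):
    """Node states as named in Table 1 (hypothetical helper, not a Cisco API)."""
    INITIALIZING = "Initializing"
    IDLE = "Idle"
    NORMAL = "Normal"
    RUNNING_IN_BACKUP_MODE = "Running in Backup Mode"
    TAKING_OVER = "Taking Over"
    FAILING_OVER = "Failing Over"
    FAILED_OVER = "Failed Over"
    FAILED_OVER_CRITICAL_SERVICES_NOT_RUNNING = (
        "Failed Over with Critical Services Not Running"
    )
    FALLING_BACK = "Falling Back"
    TAKING_BACK = "Taking Back"
    RUNNING_IN_FAILED_MODE = "Running in Failed Mode"
    UNKNOWN = "Unknown"


# States that Table 1 describes as transition or temporary.
_TRANSITION = frozenset({
    NodeState.INITIALIZING,
    NodeState.TAKING_OVER,
    NodeState.FAILING_OVER,
    NodeState.FALLING_BACK,
    NodeState.TAKING_BACK,
})

# States that Table 1 describes as stable or steady.
_STABLE = frozenset({
    NodeState.NORMAL,
    NodeState.RUNNING_IN_BACKUP_MODE,
    NodeState.FAILED_OVER,
    NodeState.FAILED_OVER_CRITICAL_SERVICES_NOT_RUNNING,
})

# Manual actions that Table 1 says you can initiate in a given state.
_MANUAL_ACTION = {
    NodeState.NORMAL: "Failover",
    NodeState.IDLE: "Fallback",
    NodeState.FAILED_OVER: "Fallback",
}


def state_kind(state: NodeState) -> str:
    """Return 'transition', 'stable', or 'unspecified' for a node state."""
    if state in _TRANSITION:
        return "transition"
    if state in _STABLE:
        return "stable"
    return "unspecified"


def manual_action(state: NodeState) -> Optional[str]:
    """Return the manual action Table 1 mentions for this state, if any."""
    return _MANUAL_ACTION.get(state)
```

For example, manual_action(NodeState.IDLE) returns "Fallback", and a state that is in transition returns None because Table 1 lists no manual action for it.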

Node States, Causes, and Recommended Actions

You can view the status of nodes in a presence redundancy group on the Presence Redundancy Group Configuration window when you choose a group using the Cisco Unified CM Administration user interface.

Table 2. Presence Redundancy Group Node High-Availability States, Causes, and Recommended Actions
Node 1 State	Node 1 Reason	Node 2 State	Node 2 Reason	Cause/Recommended Actions
Normal	Normal	Normal	Normal	Normal
Failing Over	On Admin Request	Taking Over	On Admin Request	The administrator initiated a manual failover from node 1 to node 2. The manual failover is in progress.
Idle	On Admin Request	Running in Backup Mode	On Admin Request	The manual failover from node 1 to node 2 that the administrator initiated is complete.
Taking Back	On Admin Request	Falling Back	On Admin Request	The administrator initiated a manual fallback from node 2 to node 1. The manual fallback is in progress.
Idle	Initialization	Running in Backup Mode	On Admin Request	The administrator restarts the SRM service on node 1 while node 1 is in "Idle" state.
Idle	Initialization	Running in Backup Mode	Initialization	The administrator either restarts both nodes in the presence redundancy group, or restarts the SRM service on both nodes while the presence redundancy group was in manual failover mode.
Idle	On Admin Request	Running in Backup Mode	Initialization	The administrator restarts the SRM service on node 2 while node 2 is running in backup mode, but before the heartbeat on node 1 times out.
Failing Over	On Admin Request	Taking Over	Initialization	The administrator restarts the SRM service on node 2 while node 2 is taking over, but before the heartbeat on node 1 times out.
Taking Back	Initialization	Falling Back	On Admin Request	The administrator restarts the SRM service on node 1 while node 1 is taking back, but before the heartbeat on node 2 times out. After the taking back process is complete, both nodes are in Normal state.
Taking Back	Automatic Fallback	Falling Back	Automatic Fallback	Automatic Fallback has been initiated from node 2 to node 1 and is currently in progress.
Failed Over	Initialization or Critical Services Down	Running in Backup Mode	Critical Service Down	Node 1 transitions to Failed Over state when either of the following conditions occurs: (1) critical services come back up due to a reboot of node 1, or (2) the administrator starts critical services on node 1 while node 1 is in Failed Over with Critical Services Not Running state. When node 1 transitions to Failed Over state, the node is ready for the administrator to perform a manual fallback to restore the nodes in the presence redundancy group to Normal state.
Failed Over with Critical Services not Running	Critical Service Down	Running in Backup Mode	Critical Service Down	A critical service is down on node 1. IM and Presence Service performs an automatic failover to node 2. Recommended Actions: Check node 1 for any critical services that are down and try to manually start those services. If the critical services on node 1 do not start, then reboot node 1. When all the critical services are up and running after the reboot, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.
Failed Over with Critical Services not Running	Database Failure	Running in Backup Mode	Database Failure	A database service is down on node 1. IM and Presence Service performs an automatic failover to node 2. Recommended Actions: Reboot node 1. When all the critical services are up and running after the reboot, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.
Running in Failed Mode	Start of Critical Services Failed	Running in Failed Mode	Start of Critical Services Failed	Critical services fail to start while a node in the presence redundancy group is taking back from the other node. Recommended Actions: On the node that is taking back, check for critical services that are down. To manually start these services, click Recovery in the Presence Redundancy Group Configuration window. If the critical services do not start, reboot the node. When all the critical services are up and running after the reboot, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.
Running in Failed Mode	Critical Service Down	Running in Failed Mode	Critical Service Down	Critical services go down on the backup node. Both nodes enter the failed state. Recommended Actions: Check the backup node for critical services that are down. To start these services manually, click Recovery in the Presence Redundancy Group Configuration window. If the critical services do not start, reboot the node.
Node 1 is down due to loss of network connectivity or the SRM service is not running.		Running in Backup Mode	Peer Down	Node 2 has lost the heartbeat from node 1. IM and Presence Service performs an automatic failover to node 2. Recommended Actions: If node 1 is up: (1) Check and repair the network connectivity between the nodes in the presence redundancy group. When you reestablish the network connection between the nodes, the node may go into a failed state; click Recovery in the Presence Redundancy Group Configuration window to restore the nodes to the Normal state. (2) Start the SRM service and perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state. If node 1 is down: Repair and power up node 1. When the node is up and all critical services are running, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.
Node 1 is down (due to possible power down, hardware failure, shutdown, reboot)		Running in Backup Mode	Peer Reboot	IM and Presence Service performs an automatic failover to node 2 due to one of the following possible conditions on node 1: hardware failure, power down, restart, or shutdown. Recommended Actions: Repair and power up node 1. When the node is up and all critical services are running, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.
Failed Over with Critical Services not Running OR Failed Over	Initialization	Running in Backup Mode	Peer Down During Initialization	Node 2 does not see node 1 during startup. Recommended Action: When node 1 is up and all critical services are running, perform a manual fallback to restore the nodes in the presence redundancy group to the Normal state.
Running in Failed Mode	Cisco Server Recovery Manager Take Over Users Failed	Running in Failed Mode	Cisco Server Recovery Manager Take Over Users Failed	User move fails during the taking over process. Recommended Action: Possible database error. Click Recovery in the Presence Redundancy Group Configuration window. If the problem persists, then reboot the nodes.
Running in Failed Mode	Cisco Server Recovery Manager Take Back Users Failed	Running in Failed Mode	Cisco Server Recovery Manager Take Back Users Failed	User move fails during the falling back process. Recommended Action: Possible database error. Click Recovery in the Presence Redundancy Group Configuration window. If the problem persists, then reboot the nodes.
Running in Failed Mode	Unknown	Running in Failed Mode	Unknown	The SRM on a node restarts while the SRM on the other node is in a failed state, or an internal system error occurs. Recommended Action: Click Recovery in the Presence Redundancy Group Configuration window. If the problem persists, then reboot the nodes.
Backup Activated	Auto Recover Database Failure	Failover Affected Services	Auto Recover Database Failure	The database goes down on the backup node. The peer node is in failover mode and can take over for all users in the presence redundancy group. Auto-recovery operation automatically occurs and all users are moved over to the primary node.
Backup Activated	Auto Recover Database Failure	Failover Affected Services	Auto Recover Critical Service Down	A critical service goes down on the backup node. The peer node is in failover mode and can take over for all users in the presence redundancy group. Auto-recovery operation automatically occurs and all users are moved over to the peer node.
Unknown		Unknown		Node state is unknown. A possible cause is that high availability was not enabled properly on the IM and Presence Service node. Recommended Action: Restart the Server Recovery Manager service on both nodes in the presence redundancy group.
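
As a quick triage aid, a few of the state pairs in Table 2 can be condensed into a lookup keyed on the states of node 1 and node 2. The Python sketch below is a hedged illustration, not a Cisco tool: the function name recommended_action is hypothetical, the action text is abbreviated from Table 2, and only a subset of rows is covered. It ignores the Reason columns, so always confirm against the full table before you act.

```python
from typing import Dict, Tuple


def _key(node1_state: str, node2_state: str) -> Tuple[str, str]:
    """Normalize a state pair so that lookups ignore case and padding."""
    return (node1_state.strip().casefold(), node2_state.strip().casefold())


# Condensed from Table 2; keyed on (node 1 state, node 2 state).
_ACTIONS: Dict[Tuple[str, str], str] = {
    _key("Normal", "Normal"): "No action needed.",
    _key("Failing Over", "Taking Over"):
        "Failover in progress; wait for it to complete.",
    _key("Taking Back", "Falling Back"):
        "Fallback in progress; wait for it to complete.",
    _key("Failed Over", "Running in Backup Mode"):
        "Perform a manual fallback.",
    _key("Failed Over with Critical Services not Running",
         "Running in Backup Mode"):
        "Start critical services on node 1 or reboot it, "
        "then perform a manual fallback.",
    _key("Running in Failed Mode", "Running in Failed Mode"):
        "Click Recovery; if the problem persists, reboot the nodes.",
    _key("Unknown", "Unknown"):
        "Restart the Server Recovery Manager service on both nodes.",
}

_DEFAULT = "No condensed entry; consult Table 2."


def recommended_action(node1_state: str, node2_state: str) -> str:
    """Return an abbreviated recommended action for a pair of node states."""
    return _ACTIONS.get(_key(node1_state, node2_state), _DEFAULT)
```

For example, recommended_action("Unknown", "Unknown") returns the instruction to restart the Server Recovery Manager service on both nodes, and any pair that is not in the condensed lookup returns the pointer back to Table 2.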

Restarting Services with High Availability

If you make system configuration changes or perform system upgrades that require you to disable High Availability and then restart the Cisco XCP Router, the Cisco Presence Engine, or the server itself, you must allow sufficient time for Cisco Jabber sessions to be recreated before you re-enable High Availability. Otherwise, presence does not work for Jabber clients whose sessions have not been recreated.

Make sure to follow this process:

Procedure

Step 1

Before you make any changes, open the Presence Topology window in Cisco Unified CM IM and Presence Administration (System > Presence Topology). Record the number of users assigned to each node in each presence redundancy group.

Step 2

Disable High Availability in each Presence Redundancy Group and wait at least two minutes for the new HA settings to synchronize.

Step 3

Do whichever of the following is required for your update:

Restart the Cisco XCP Router
Restart the Cisco Presence Engine
Restart the server

Step 4

After the restart, monitor the number of active sessions on all nodes.

Step 5

On each node, run the show perf query counter "Cisco Presence Engine" ActiveJsmSessions CLI command to confirm the number of active sessions. The number of active sessions should match the number of assigned users that you recorded in Step 1. It should take no more than 15 minutes for all sessions to resume.

Step 6

Once all of your sessions are created, you can enable High Availability within the Presence Redundancy Group.

Note

If 30 minutes pass and the active sessions have not yet been created, restart the Cisco Presence Engine. If that does not work, a larger system issue needs to be fixed.

Note

Back-to-back restarts of the Cisco XCP Router and the Cisco Presence Engine are not recommended. If you must restart both, restart the first service and wait for all of the JSM sessions to be recreated before you restart the second service.
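
Steps 4 through 6 amount to polling each node until its active session count matches the number that you recorded in Step 1. The following Python sketch shows one way to structure that wait. It is an illustration under assumptions: wait_for_sessions and read_active_sessions are hypothetical names, and the way the counter value is obtained (for example, by running the command from Step 5 on each node and reading the value) is left to the caller, because the command's output format is not described here. The 15-minute default comes from Step 5, and the exact-match comparison follows the wording of that step.

```python
import time
from typing import Callable, Dict


def wait_for_sessions(
    expected: Dict[str, int],
    read_active_sessions: Callable[[str], int],
    timeout_s: float = 15 * 60,
    poll_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, int]:
    """Poll until every node reports its expected active session count.

    expected maps a node name to the assigned-user count from Step 1.
    read_active_sessions is a caller-supplied function (hypothetical) that
    returns the current ActiveJsmSessions value for a node.

    Returns the nodes that still do not match when the timeout expires,
    mapped to their expected counts. An empty dict means all nodes matched.
    """
    deadline = clock() + timeout_s
    while True:
        # Re-read every node on each pass and keep those that do not match.
        pending = {
            node: want
            for node, want in expected.items()
            if read_active_sessions(node) != want
        }
        if not pending or clock() >= deadline:
            return pending
        sleep(poll_s)
```

An empty result indicates that every node matched, so High Availability can be enabled again. A non-empty result lists the nodes to investigate before you continue.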

Configuration and Administration of the IM and Presence Service on Cisco Unified Communications Manager, Release 11.5(1)
