How High Availability Works
The Cisco EPN Manager high availability (HA) framework ensures continued system operation in case of failure. HA uses a pair of linked, synchronized Cisco EPN Manager servers to minimize or eliminate the impact of application or hardware failures that may take place on either server. Servers can fail due to issues in one or more of the following areas:
- Application processes—Server, TFTP, FTP, and other process failures. You can view the status of these processes using the CLI ncs status command.
- Database server—Database-related process failures (the database server runs as a service on Cisco EPN Manager).
- Network—Problems with network access or reachability.
- System—Problems with the server's physical hardware or operating system.
- Virtual machine (if HA is running in a VM environment)—Problems with the VM environment on which the primary and secondary servers are installed.
The following figure shows the main components and process flows for an HA setup.
An HA deployment consists of a primary and a secondary server, with a Health Monitor (HM) instance running as an application process on each server. When the primary server goes down (because of a failure or because it is manually stopped), the secondary server takes over and manages the network while you restore access to the primary server. If the deployment is configured for automatic failover, the secondary server takes over the active role within two to three minutes after the failover. This HA framework is based on the active/passive (cold standby) model of operation. Because it is not a clustered system, sessions are not preserved on the secondary server when the primary server fails.
When the issues on the primary server are resolved and the server is running again, it remains in standby mode while it syncs its data with the active secondary server. Once the primary server is available again, you can initiate a failback operation. When a failback is triggered, the primary server takes over the active role again. This role switch between the primary and secondary servers completes within two to three minutes.
Whenever the HA framework detects a change on the primary server, it synchronizes that change with the secondary server. These changes are of two types:
- File changes, which are synchronized using the HTTPS protocol. This includes items such as report configurations, configuration templates, TFTP-root directory, administration settings, licensing files, and the key store. File synchronization is done:
  - In batches, for files that are not updated frequently (such as license files). These files are synchronized once every 500 seconds.
  - Near real-time, for files that are updated frequently. These files are synchronized once every 11 seconds.
- Database changes, such as updates related to configuration, performance and monitoring data. Oracle Recovery Manager (RMAN) creates the initial standby database and Oracle Active Data Guard synchronizes the databases when there is any change.
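The two file synchronization intervals above can be illustrated with a minimal sketch. It is not the product's implementation: the `TrackedFile` class, the `files_due` and `mark_synced` helpers, and the file names are hypothetical, and only the 500-second and 11-second intervals come from the text.

```python
from dataclasses import dataclass

BATCH_INTERVAL_S = 500     # infrequently updated files (interval from the text)
REALTIME_INTERVAL_S = 11   # frequently updated files (interval from the text)


@dataclass
class TrackedFile:
    """Hypothetical record for a file that HA keeps in sync."""
    path: str                # illustrative name only
    frequent: bool           # True if the file is updated frequently
    last_sync: float = 0.0   # time of the last sync, in seconds

    @property
    def interval(self) -> int:
        return REALTIME_INTERVAL_S if self.frequent else BATCH_INTERVAL_S


def files_due(files, now):
    """Return the files whose sync interval has elapsed at time `now`."""
    return [f for f in files if now - f.last_sync >= f.interval]


def mark_synced(files, now):
    """Record that the given files were synchronized at time `now`."""
    for f in files:
        f.last_sync = now
```

For example, with one license file (`frequent=False`) and one template file (`frequent=True`), only the template file is due at 11 seconds, while both are due by 500 seconds.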
The primary and secondary HA servers exchange the following messages to maintain synchronization between the two servers:
- Database Sync—Includes all the information necessary to ensure that the databases on the primary and secondary servers are running and synchronized.
- File Sync—Includes frequently updated configuration files. These are synchronized every 11 seconds, while other infrequently updated configuration files are synchronized every 500 seconds.
- Process Sync—Ensures that application- and database-related processes are running. These messages fall under the Heartbeat category.
- Health Monitor Sync—These messages check for the network, system, and health monitor failure conditions.
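As a reading aid, the four message types above can be summarized in a small lookup structure. This is only a sketch: the `HAMessage` enum, the `CHECKS` table, and the helper functions are hypothetical names, and the condition labels are paraphrased from the list rather than taken from any product API.

```python
from enum import Enum


class HAMessage(Enum):
    DATABASE_SYNC = "Database Sync"
    FILE_SYNC = "File Sync"
    PROCESS_SYNC = "Process Sync"
    HEALTH_MONITOR_SYNC = "Health Monitor Sync"


# Only Process Sync is described as a Heartbeat message in the list above.
HEARTBEAT_MESSAGES = frozenset({HAMessage.PROCESS_SYNC})

# Conditions each message type covers, paraphrased from the list above.
CHECKS = {
    HAMessage.DATABASE_SYNC: {"databases running", "databases synchronized"},
    HAMessage.FILE_SYNC: {"configuration files synchronized"},
    HAMessage.PROCESS_SYNC: {"application processes", "database processes"},
    HAMessage.HEALTH_MONITOR_SYNC: {"network", "system", "health monitor"},
}


def is_heartbeat(message):
    """Return True if the message type falls under the Heartbeat category."""
    return message in HEARTBEAT_MESSAGES


def messages_checking(condition):
    """Return the message types that cover the given condition label."""
    return [m for m in HAMessage if condition in CHECKS[m]]
```

For instance, `messages_checking("network")` returns only `HAMessage.HEALTH_MONITOR_SYNC`, which matches the description of the Health Monitor Sync messages.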
HA States and Transitions
The following table lists the HA states, including those that require no response from you. You can view these states from the HA Status page or from the Health Monitor. For a list of HA events and instructions for enabling, disabling, and adjusting them, see Customize Server Internal SNMP Traps and Forward the Traps.
| State | Server | Description |
| --- | --- | --- |
| Stand Alone | Both | HA is not configured on this server. |
| Primary Alone | Primary | Primary server has restarted after it lost the secondary server (only Health Monitor is running in this state). |
| HA Initializing | Both | HA registration process between the primary and secondary server has started. |
| Primary Active | Primary | Primary server is now active and is synchronizing with the secondary server. |
| Primary Database Copy Failed | Primary | Restarted primary server detected a data gap, triggered a data copy from the active secondary server, and the database copy failed. When a primary server is restarted, it always checks whether a data gap has occurred because the primary server was down for 24 hours or more. This copy rarely fails, but if it does, all attempts to fail back to the primary are blocked until the database copy completes successfully. As soon as it does, the primary state is set to Primary Syncing. |
| Primary Failover | Primary | Primary server detected a failure. |
| Primary Failback | Primary | User-triggered failback is currently in progress. |
| Primary Lost Secondary | Primary | Primary server is unable to communicate with the secondary server. |
| Primary Preparing for Failback | Primary | Primary server has started up in standby mode after a failover (because the secondary server is still active). When the primary server is ready for failback, its state will be set to Primary Syncing. |
| Primary Syncing | Primary | Primary server is synchronizing the database and configuration files from the active secondary server. This occurs after a failover, when primary processes are brought up (and the secondary server is playing the active role). |
| Primary Uncertain | Primary | Primary server's application processes are not able to connect to its database. |
| Secondary Alone | Secondary | Primary server is not reachable from secondary server after a primary server restart. |
| Secondary Syncing | Secondary | Secondary server is synchronizing the database and configuration files from the primary server. |
| Secondary Active | Secondary | Failover from the primary server to the secondary server has completed successfully. |
| Secondary Lost Primary | Secondary | Secondary server is not able to connect to the primary server (occurs when the primary fails or network connectivity is lost). For automatic failover, the secondary server will automatically move to the Secondary Active state. For manual failover, you must trigger the failover to make the secondary server active (see Trigger Failover). |
| Secondary Failover | Secondary | Failover has been triggered and is in progress. |
| Secondary Failback | Secondary | Failback has been triggered, and database and file replication is in progress. |
| Secondary Post Failback | Secondary | Failback has been triggered; associated process stops and restarts are in progress. Database and configuration files have been replicated from the secondary server to the primary server. The primary server status will change to Primary Active, and the secondary server HA status will change to Secondary Syncing. |
| Secondary Uncertain | Secondary | Secondary server's application processes cannot connect to the server's database. |
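
A few of the transitions that the table above states explicitly can be written down as a small sketch. This is an illustration under assumptions, not the Health Monitor's logic: the `HAState` enum, the `NEXT_STATE` map, and both helper functions are hypothetical, they cover only a subset of the states, and intermediate states such as Secondary Failover are skipped.

```python
from enum import Enum


class HAState(Enum):
    PRIMARY_ACTIVE = "Primary Active"
    PRIMARY_DATABASE_COPY_FAILED = "Primary Database Copy Failed"
    PRIMARY_PREPARING_FOR_FAILBACK = "Primary Preparing for Failback"
    PRIMARY_SYNCING = "Primary Syncing"
    SECONDARY_LOST_PRIMARY = "Secondary Lost Primary"
    SECONDARY_ACTIVE = "Secondary Active"
    SECONDARY_POST_FAILBACK = "Secondary Post Failback"
    SECONDARY_SYNCING = "Secondary Syncing"


# Follow-on states named in the table descriptions (a subset only).
NEXT_STATE = {
    # Set once the database copy completes successfully.
    HAState.PRIMARY_DATABASE_COPY_FAILED: HAState.PRIMARY_SYNCING,
    # Set once the primary server is ready for failback.
    HAState.PRIMARY_PREPARING_FOR_FAILBACK: HAState.PRIMARY_SYNCING,
    # The primary server moves to Primary Active at the same time.
    HAState.SECONDARY_POST_FAILBACK: HAState.SECONDARY_SYNCING,
}


def after_lost_primary(automatic_failover, manually_triggered=False):
    """End state of a secondary server that is in Secondary Lost Primary.

    With automatic failover the secondary becomes active on its own;
    with manual failover it stays put until failover is triggered.
    """
    if automatic_failover or manually_triggered:
        return HAState.SECONDARY_ACTIVE
    return HAState.SECONDARY_LOST_PRIMARY


def failback_blocked(primary_state):
    """Failback is blocked while the primary's database copy has failed."""
    return primary_state is HAState.PRIMARY_DATABASE_COPY_FAILED
```

With manual failover configured, `after_lost_primary(False)` leaves the secondary server in Secondary Lost Primary until `manually_triggered=True` is passed, which mirrors the Secondary Lost Primary row.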