Overview

Cisco NX-OS is a resilient operating system that is specifically designed for high availability at the network, system, and process level.

This chapter describes high availability (HA) concepts and features for Cisco NX-OS devices.

Licensing Requirements

For a complete explanation of Cisco NX-OS licensing recommendations and how to obtain and apply licenses, see the Cisco NX-OS Licensing Guide.

Information About High Availability

To prevent or minimize traffic disruption during hardware or software failures, Cisco NX-OS has these features:

  • Redundancy—Cisco NX-OS HA provides physical and software redundancy at every component level, spanning across the physical, environmental, power, and system software aspects of its architecture.

  • Isolation of planes and processes—Cisco NX-OS HA provides isolation between control and data forwarding planes within the device and between software components, so that a failure within one plane or process does not disrupt others.

  • Restartability—Most system functions and services are isolated so that they can be restarted independently after a failure while other services continue to run. In addition, most system services can perform stateful restarts, which allow the service to resume operations transparently to other services.

  • Supervisor stateful switchover—The Nexus 7000 series supports an active and standby dual supervisor configuration. State and configuration remain constantly synchronized between the two supervisor modules to provide seamless and stateful switchover in the event of a supervisor module failure.

  • Nondisruptive upgrades—Cisco NX-OS supports the in-service software upgrade (ISSU) feature, which allows you to upgrade the device software while the switch continues to forward traffic. ISSU reduces or eliminates the downtime typically caused by software upgrades.

Service-Level High Availability

Cisco NX-OS has a modularized architecture that compartmentalizes components for fault isolation, redundancy, and resource efficiency.

For additional details about service-level HA, see the Service-Level High Availability chapter.

Isolation of Processes

In the Cisco NX-OS software, independent processes, known as services, perform a function or set of functions for a subsystem or feature set. Each service and service instance runs as an independent, protected process. This approach provides a highly fault-tolerant software infrastructure and fault isolation between services. A failure in a service instance (such as 802.1Q) does not affect any other services running at that time (such as the Link Aggregation Control Protocol [LACP]). In addition, each instance of a service can run as an independent process, which means that two instances of a routing protocol (for example, two instances of the Open Shortest Path First [OSPF] protocol) can run as separate processes.

Process Restartability

Cisco NX-OS processes run in a protected memory space, independently of each other and of the kernel. This process isolation provides fault containment and enables rapid restarts. Process restartability ensures that process-level failures do not cause system-level failures. In addition, most services can perform stateful restarts, which allow a service that experiences a failure to be restarted and to resume operations transparently to other services within the platform and to neighboring devices within the network.
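
The idea of a stateful restart can be sketched in a few lines of Python. This is a toy model, not NX-OS code: the Service class, the checkpoint dictionary, and the sample state are all hypothetical, and the sketch assumes that a service saves its runtime state outside its own process memory so that a restarted instance can reload it.

```python
class Service:
    """Toy model of a restartable service (hypothetical, not NX-OS code)."""

    def __init__(self, name, checkpoint):
        self.name = name
        # The checkpoint store lives outside the service's own memory,
        # so it survives a crash of the service process.
        self.checkpoint = checkpoint
        # A stateful restart resumes from the last saved state, if any.
        self.state = dict(checkpoint.get(name, {}))

    def learn(self, key, value):
        self.state[key] = value
        # Persist every change so that a restart loses nothing.
        self.checkpoint[self.name] = dict(self.state)


def restart(service):
    """Replace a failed instance with a fresh one that reloads its state."""
    return Service(service.name, service.checkpoint)


store = {}
ospf = Service("ospf", store)
lacp = Service("lacp", store)
ospf.learn("10.0.0.0/8", "eth1/1")
lacp.learn("port-channel1", "up")

ospf = restart(ospf)   # simulate a crash of one service, then a restart
print(ospf.state)      # state recovered from the checkpoint
print(lacp.state)      # the other service was never touched
```

Because each service keeps its own entry in the store, restarting one instance leaves the others untouched, which mirrors the fault isolation described above.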

System-Level High Availability

The Nexus 7000 series is protected from system failure by redundant hardware components and a high-availability software framework.

For additional information about system-level HA features, see the System-Level High Availability chapter.

Physical Redundancy

The Nexus 7000 series has the following physical redundancies:

  • Power Supply Redundancy—The Cisco Nexus 7000 series chassis supports three power supply modules on a Cisco Nexus 7010 switch and up to four power supply modules on a Cisco Nexus 7018 switch. Each power supply module is composed of two internal, isolated power units, which gives two power paths per power supply module and six paths in total for a chassis that is fully populated with three power supply modules.

  • Fan Tray Redundancy—The Cisco Nexus 7010 chassis contains two redundant system fan trays for I/O module cooling and two redundant fan trays for switch fabric module cooling. One of each pair of fan trays is sufficient to provide system cooling. There is no time limit for replacing a failed Cisco Nexus 7010 fan tray, but to ensure the proper airflow, you must leave the failed fan tray in place.

    The Cisco Nexus 7018 chassis contains two fan trays, each of which is required to cool the modules in the chassis. The upper fan tray cools slots 1 to 9 and the fabric modules. The lower fan tray cools slots 10 to 18. Each of these fan trays is hot swappable, but you must replace a fan tray within 3 minutes of removal or the switch will shut down.

  • Fabric Redundancy—Cisco NX-OS provides switching fabric availability through redundant switch fabric modules. You can configure a single Cisco Nexus 7000 series chassis with one to five switch fabric cards for capacity and redundancy. Each I/O module installed in the system automatically connects to and uses all functionally installed switch fabric modules. A failure of a switch fabric module triggers an automatic reallocation and balancing of traffic across the remaining active switch fabric modules. Replacing the failed fabric module reverses this process. When you insert the fabric module and bring it online, traffic is again redistributed across all installed fabric modules and redundancy is restored.

  • Supervisor Module Redundancy—The Cisco Nexus 7000 series chassis supports dual supervisor modules to provide redundancy for the control and management plane. A dual supervisor configuration operates in an active/standby capacity in which only one of the supervisor modules is active at any given time, while the other acts as a standby backup. The state and configuration remain constantly synchronized between the two supervisor modules to provide a stateful switchover if the active supervisor module fails.
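
The power-path arithmetic and the fabric rebalancing described in the list above can be illustrated with a short Python sketch. The function names and the module labels are hypothetical, and the even split of traffic across fabric modules is an assumption made for illustration; the actual distribution performed by the hardware may differ.

```python
def power_paths(supplies, units_per_supply=2):
    """Each power supply module contains two isolated power units (paths)."""
    return supplies * units_per_supply


def fabric_shares(active_modules):
    """Split traffic across the active fabric modules.

    An even split is an assumption made for this illustration.
    """
    if not active_modules:
        raise ValueError("no active fabric modules")
    share = 1.0 / len(active_modules)
    return {module: share for module in active_modules}


fabrics = ["fab1", "fab2", "fab3", "fab4", "fab5"]
print(power_paths(3))            # three supplies with two paths each
print(fabric_shares(fabrics))    # all five fabric modules carry traffic

fabrics.remove("fab3")           # a fabric module fails
print(fabric_shares(fabrics))    # traffic rebalanced over the remaining four

fabrics.append("fab3")           # the module is replaced and brought online
print(fabric_shares(fabrics))    # traffic is spread over five modules again
```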

ISSU

Cisco NX-OS allows you to perform an in-service software upgrade (ISSU), which is also known as a nondisruptive upgrade. The modular software architecture of Cisco NX-OS supports plug-in-based services and features, which allow you to perform complete image upgrades of supervisors and switching modules with little to no impact on other modules. Because of this design, you can upgrade Cisco NX-OS nondisruptively, with no impact on the data forwarding plane and with nonstop forwarding during the software upgrade, even between full image versions.

For additional details about ISSU, see ISSU and High Availability.

VDCs

Cisco NX-OS implements logical virtualization at the device level, which allows multiple instances of a device to operate on the same physical switch simultaneously. These logical operating environments are known as virtual device contexts, or VDCs. VDCs provide logically separate device environments that you can independently configure and manage. This degree of separation provides fault isolation in addition to security and administrative benefits. Human errors or failure conditions that occur in the configuration of one VDC remain isolated within that VDC. Although VDCs are not primarily a high-availability feature, their operationally independent fault domains contribute to availability and prevent service disruptions that are associated with device configuration.

For more information on VDCs, see the Cisco Nexus 7000 Series NX-OS Virtual Device Context Configuration Guide.

Network-Level High Availability

Cisco NX-OS optimizes network convergence by providing tools and functions that make both failover and fallback transparent and fast.

For additional information about network-level HA features, see the Network-Level High Availability chapter.

Internal CRC Error Detection and Isolation

Starting with the Cisco NX-OS Release 8.4(1), the Internal Cyclic Redundancy Check (CRC) error detection and isolation feature is supported on the Cisco Nexus 7000 Series switches.

This feature enables the switches to detect CRC errors that occur internally within a switch and isolate the source of these errors when the configured threshold value is reached. Use the hardware fabric crc threshold command to enable this feature. This command also enables powering down of any switching or supervisor module that exceeds the internal CRC error threshold limit. By default, the Internal CRC Error Detection and Isolation feature is disabled.

Detection of CRC errors along with powering down of any switching or supervisor module that exceeds the internal CRC error threshold limit is only supported on devices in which SM15 or SAC ASICs are present. No action is triggered on devices on which any other ASICs are present.

Internal CRC errors are usually caused by a fault in the system. Such faults may be transient, such as an ungracefully removed module, or permanent, such as a badly seated module or, in rare cases, a failing or failed hardware component. Here, module refers to either a switching module or a supervisor module. The rate of errors depends on many factors and may range from very high to very low. The error-rate threshold is configurable as a system-wide value, but separate error counts are maintained for each module to identify an error source. Errors on each module are handled individually when the error count exceeds the threshold.

Polling of internal CRC errors is based on an interrupt timer. If an interrupt flag is set, a CRC error has occurred and the CRC counter is incremented accordingly. When the CRC counter value crosses the threshold value, the CRC error handler is triggered. The duration of the interrupt timer depends on factors such as the supervisor module being used and the number of ASICs in the device.


Note

The counters are reset 24 hours after the Internal Cyclic Redundancy Check (CRC) error detection and isolation feature was first configured.
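
The counting behavior described in this section (one system-wide threshold, a separate counter for each module, and a reset of the counters every 24 hours) can be modeled with the following Python sketch. It is an illustration only, not the NX-OS implementation: the CrcMonitor class and its method names are hypothetical, and the sketch assumes that the handler fires when a module's count exceeds the threshold and that the 24-hour windows are aligned to the time the feature was configured.

```python
WINDOW_SECONDS = 24 * 60 * 60  # counters are reset every 24 hours


class CrcMonitor:
    """Toy model of per-module CRC error counting (not NX-OS code)."""

    def __init__(self, threshold=3, start_time=0.0):
        self.threshold = threshold       # one system-wide value
        self.window_start = start_time   # when the feature was configured
        self.counts = {}                 # separate counter per module
        self.isolated = set()            # modules that were powered down

    def record_error(self, module, now):
        """Count one CRC error; return True if the module gets isolated."""
        # Start a new window and clear the counters once 24 hours have passed.
        if now - self.window_start >= WINDOW_SECONDS:
            elapsed = int((now - self.window_start) // WINDOW_SECONDS)
            self.window_start += elapsed * WINDOW_SECONDS
            self.counts.clear()
        if module in self.isolated:
            return False
        self.counts[module] = self.counts.get(module, 0) + 1
        # Assumption: the handler fires when the count exceeds the threshold.
        if self.counts[module] > self.threshold:
            self.isolated.add(module)
            return True
        return False


monitor = CrcMonitor(threshold=3)
for second in range(4):
    tripped = monitor.record_error("module-2", now=float(second))
print(tripped, monitor.isolated)
```

Keeping one counter per module is what lets the model single out the faulty module while the others keep running, even though the threshold itself is shared.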


Configuring Internal CRC Error Detection and Isolation

Procedure


Step 1

Enter global configuration mode:

switch# configure terminal

Step 2

Enable internal CRC error detection and isolation:

switch(config)# hardware fabric crc threshold <threshold-count>

Note

The error rate is measured over sequential 24-hour windows. The threshold range is 1 to 100. If the threshold is not specified, the default threshold value is 3.

Step 3

(Optional) Disable internal CRC error detection and isolation:

switch(config)# no hardware fabric crc

Step 4

(Optional) Save the running configuration to the startup configuration:

switch(config)# copy running-config startup-config


Running Configuration

This example shows how to enable internal CRC error detection and isolation.


configure terminal
 hardware fabric crc threshold 5
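
When this configuration is generated by a script, the threshold can be checked against the documented range before any command is sent to the device. The following Python helper is a minimal sketch: the function name is hypothetical, it only builds the two command lines shown in this section, and it always writes the threshold explicitly (using the documented default of 3) rather than relying on the device to fill it in.

```python
DEFAULT_THRESHOLD = 3  # documented default when no threshold is specified


def crc_threshold_commands(threshold=DEFAULT_THRESHOLD):
    """Return the CLI lines that enable internal CRC error detection.

    The valid range (1 to 100) comes from this section; the helper
    itself is hypothetical and does not talk to a device.
    """
    if isinstance(threshold, bool) or not isinstance(threshold, int):
        raise TypeError("threshold must be an integer")
    if not 1 <= threshold <= 100:
        raise ValueError("threshold must be between 1 and 100")
    return [
        "configure terminal",
        f"hardware fabric crc threshold {threshold}",
    ]


print("\n".join(crc_threshold_commands(5)))
```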
 

Additional Management Tools for Availability

Cisco NX-OS incorporates several Cisco system management tools for monitoring and notification of system availability events.

GOLD

The Cisco Generic On-Line Diagnostics (GOLD) subsystem and additional monitoring processes on the supervisor trigger a stateful failover to the redundant supervisor when they detect unrecoverable critical failures, service restartability errors, kernel errors, or hardware failures.

For information about configuring GOLD, see the Cisco Nexus 7000 Series NX-OS System Management Configuration Guide.

EEM

Cisco Embedded Event Manager (EEM) consists of Event Detectors, the Event Manager, and an Event Manager Policy Engine. Using EEM, you can define policies to take specific actions when the system software recognizes certain events through the Event Detectors. The result is a flexible set of tools to automate many network management tasks and to direct the operation of Cisco NX-OS to increase availability, collect information, and notify external systems or personnel about critical events.

For information about configuring EEM, see the Cisco Nexus 7000 Series NX-OS System Management Configuration Guide.
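
The relationship between event detectors, policies, and actions can be pictured with a small Python sketch. It is a conceptual model only and does not use EEM syntax: the EventManager class, the event name, and the actions are hypothetical.

```python
class EventManager:
    """Toy model of a detector -> policy -> action flow (not EEM syntax)."""

    def __init__(self):
        self.policies = []  # (event name, action) pairs

    def register(self, event_name, action):
        """Define a policy: run `action` whenever `event_name` is detected."""
        self.policies.append((event_name, action))

    def publish(self, event_name, details):
        """Called by an event detector; runs every matching policy in order."""
        return [action(details)
                for name, action in self.policies
                if name == event_name]


manager = EventManager()
manager.register("module-failure",
                 lambda d: f"collect logs for module {d['module']}")
manager.register("module-failure",
                 lambda d: f"notify operations: module {d['module']} failed")

for line in manager.publish("module-failure", {"module": 3}):
    print(line)
```

Several policies can react to the same event, which is how one detected condition can both collect information and notify external systems.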

Smart Call Home

Combining Cisco GOLD and Cisco EEM capabilities, Smart Call Home provides an e-mail-based notification of critical system events. Smart Call Home has message formats that are compatible with pager services, standard e-mail, or XML-based automated parsing applications. You can use this feature to page a network support engineer, e-mail a network operations center, or use Cisco Smart Call Home services to automatically generate a case with Cisco’s Technical Assistance Center (TAC).

For information about configuring Smart Call Home, see the Cisco Nexus 7000 Series NX-OS System Management Configuration Guide.