Before Cisco NX-OS
Release 6.2(8), runtime tests did not take corrective recovery actions when
they detected a hardware failure. The default action through EEM included
generating alerts (callhome, syslog) and logging (OBFL, exception logs). These
actions were informative, but they did not remove faulty devices from the
network, which could lead to network disruption and traffic black holing.
Before Cisco NX-OS Release 6.2(8), you had to shut down the faulty devices
manually to recover the network.
In Cisco NX-OS Release
6.2(8) and later releases, you can configure the system to take disruptive
action if it detects a failure on one of the following runtime
(health-monitoring) tests:
The recovery actions
feature is disabled by default. When you enable it, the system takes
disruptive action after repeated failures on the runtime (health-monitoring)
tests. The feature enables or disables the corrective (conservative) action on
all four tests simultaneously; the corrective action itself differs for each
test. The system takes corrective action once a test crosses its maximum
consecutive failure count.
With the recovery
actions feature enabled, the corrective action for each test is as follows:
- SnakeLoopback test: After the test detects 10 consecutive failures on any
port of the module, the system moves the faulty port to an error-disabled state.
- StandbyFabricLoopback test: The system attempts to reload the standby supervisor if it receives an error on this test, and it continues
to reload the standby supervisor if the failure persists after the reload. The system cannot power off the standby supervisor.
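
The consecutive-failure rule described for the SnakeLoopback test can be illustrated with a small model. The following Python sketch is not NX-OS code: the class names, the `record_result` method, and the port name are hypothetical, and it assumes that the action fires on the tenth consecutive failure and that a passing run resets the count.

```python
from dataclasses import dataclass, field

# Threshold from the SnakeLoopback description above. Acting exactly on the
# 10th consecutive failure is an assumption of this sketch.
MAX_CONSECUTIVE_FAILURES = 10


@dataclass
class PortHealth:
    """Hypothetical per-port state; not an NX-OS data structure."""
    consecutive_failures: int = 0
    err_disabled: bool = False


@dataclass
class SnakeLoopbackMonitor:
    """Toy model of the recovery-actions behavior for one module."""
    recovery_enabled: bool = False  # the feature is disabled by default
    ports: dict = field(default_factory=dict)

    def record_result(self, port: str, passed: bool) -> bool:
        """Record one test run; return True if the port was just error-disabled."""
        state = self.ports.setdefault(port, PortHealth())
        if state.err_disabled:
            return False  # port is already out of service
        if passed:
            state.consecutive_failures = 0  # assumed: a pass resets the count
            return False
        state.consecutive_failures += 1
        if (self.recovery_enabled
                and state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES):
            state.err_disabled = True
            return True
        return False


monitor = SnakeLoopbackMonitor(recovery_enabled=True)
results = [monitor.record_result("Ethernet1/1", passed=False) for _ in range(10)]
```

In this model only the tenth call returns `True`. With `recovery_enabled` left at its default of `False`, failures are still counted but the port is never error-disabled, which mirrors the alert-only behavior of releases before 6.2(8).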
Finally, the system
maintains a history of the recovery actions that includes details of each
action, the test type, and the severity. You can display these counters.