The documentation set for this product strives to use bias-free language. For the purposes of this documentation set, bias-free is defined as language that does not imply discrimination based on age, disability, gender, racial identity, ethnic identity, sexual orientation, socioeconomic status, and intersectionality. Exceptions may be present in the documentation due to language that is hardcoded in the user interfaces of the product software, language used based on RFP documentation, or language that is used by a referenced third-party product. Learn more about how Cisco is using Inclusive Language.
This document describes troubleshooting techniques for the Nexus 7000 (N7K) hardware.
This command displays the fan module status on the switch.
SITE1-AGG1# show environment fan Fan: ------------------------------------------------------ Fan Model Hw Status ------------------------------------------------------ Fan1(sys_fan1) N7K-C7010-FAN-S 1.1 Ok Fan2(sys_fan2) N7K-C7010-FAN-S 1.1 Ok Fan3(fab_fan1) N7K-C7010-FAN-F 1.1 Ok Fan4(fab_fan2) N7K-C7010-FAN-F 1.1 Ok Fan_in_PS1 -- -- Ok Fan_in_PS2 -- -- Ok Fan_in_PS3 -- -- Shutdown Fan Zone Speed: Zone 1: 0x78 Zone 2: 0x58 Fan Air Filter : Present
Fan status can be one of ok, failure or absent.
“Fan module removed. Fan module has been absent for 120 seconds"
This command displays the power supplies installed, power usage summary and status of power supplies on the switch.
The command as well as a sample output is provided.
SITE1-AGG1# show environment power Power Supply: Voltage: 50 Volts Power Actual Total Supply Model Output Capacity Status (Watts ) (Watts ) ------- ------------------- ----------- ----------- -------------- 1 N7K-AC-6.0KW 1179 W 6000 W Ok 2 N7K-AC-6.0KW 1117 W 6000 W Ok 3 N7K-AC-6.0KW 0 W 0 W Shutdown Actual Power Module Model Draw Allocated Status (Watts ) (Watts ) ------- ------------------- ----------- ----------- -------------- 1 N7K-M148GT-11 N/A 400 W Powered-Up 3 N7K-M132XP-12 N/A 750 W Powered-Up 4 N7K-F132XP-15 318 W 385 W Powered-Up 5 N7K-SUP1 N/A 210 W Powered-Up 6 N7K-SUP1 N/A 210 W Powered-Up 10 N7K-M132XP-12L 535 W 750 W Powered-Up Xb1 N7K-C7010-FAB-1 N/A 80 W Powered-Up Xb2 N7K-C7010-FAB-1 N/A 80 W Powered-Up Xb3 N7K-C7010-FAB-1 N/A 80 W Powered-Up Xb4 xbar N/A 80 W Absent Xb5 xbar N/A 80 W Absent fan1 N7K-C7010-FAN-S 133 W 720 W Powered-Up fan2 N7K-C7010-FAN-S 133 W 720 W Powered-Up fan3 N7K-C7010-FAN-F 12 W 120 W Powered-Up fan4 N7K-C7010-FAN-F 12 W 120 W Powered-Up N/A - Per module power not available Power Usage Summary: -------------------- Power Supply redundancy mode (configured) PS-Redundant Power Supply redundancy mode (operational) Non-Redundant Total Power Capacity (based on configured mode) 12000 W Total Power of all Inputs (cumulative) 12000 W Total Power Output (actual draw) 2296 W Total Power Allocated (budget) 4785 W Total Power Available for additional modules 7215 W
Power supply status can be one of these:
Power supply failures:
Each power supply has a LED that indicates power output status. This LED is directly controlled by the power supply and a red color indicates a power supply failure. When you scan the syslog, you might show alternating messages about power supply failure and recovery, further indicating power supply related problems.
Each card in the chassis has atleast two temperature sensors. Each temperature sensor is configured with a minor and a major threshold. This command with sample output shows how temperature information can be retrieved from the switch:
SITE1-AGG1# show environment temperature Temperature: -------------------------------------------------------------------- Module Sensor MajorThresh MinorThres CurTemp Status (Celsius) (Celsius) (Celsius) -------------------------------------------------------------------- 1 Crossbar(s5) 105 95 46 Ok 1 CTSdev4 (s9) 115 105 56 Ok 1 CTSdev5 (s10) 115 105 57 Ok 1 CTSdev7 (s12) 115 105 56 Ok 1 CTSdev9 (s14) 115 105 53 Ok 1 CTSdev10(s15) 115 105 53 Ok 1 CTSdev11(s16) 115 105 52 Ok 1 CTSdev12(s17) 115 105 51 Ok 1 QEng1Sn1(s18) 115 105 51 Ok 1 QEng1Sn2(s19) 115 105 50 Ok 1 QEng1Sn3(s20) 115 105 48 Ok 1 QEng1Sn4(s21) 115 105 48 Ok 1 L2Lookup(s22) 120 110 47 Ok 1 L3Lookup(s23) 120 110 54 Ok 3 Crossbar(s5) 105 95 50 Ok 3 QEng1Sn1(s12) 115 110 69 Ok 3 QEng1Sn2(s13) 115 110 67 Ok 3 QEng1Sn3(s14) 115 110 66 Ok 3 QEng1Sn4(s15) 115 110 67 Ok 3 QEng2Sn1(s16) 115 110 70 Ok 3 QEng2Sn2(s17) 115 110 67 Ok 3 QEng2Sn3(s18) 115 110 66 Ok 3 QEng2Sn4(s19) 115 110 67 Ok 3 L2Lookup(s27) 115 105 51 Ok 3 L3Lookup(s28) 120 110 64 Ok 4 Crossbar1(s1) 105 95 69 Ok 4 Crossbar2(s2) 105 95 52 Ok 4 L2dev1(s3) 105 95 37 Ok 4 L2dev2(s4) 105 95 43 Ok 4 L2dev3(s5) 105 95 45 Ok 4 L2dev4(s6) 105 95 45 Ok 4 L2dev5(s7) 105 95 40 Ok 4 L2dev6(s8) 105 95 41 Ok 4 L2dev7(s9) 105 95 42 Ok 4 L2dev8(s10) 105 95 40 Ok 4 L2dev9(s11) 105 95 38 Ok 4 L2dev10(s12) 105 95 38 Ok 4 L2dev11(s13) 105 95 38 Ok 4 L2dev12(s14) 105 95 37 Ok 4 L2dev13(s15) 105 95 34 Ok 4 L2dev14(s16) 105 95 33 Ok 4 L2dev15(s17) 105 95 33 Ok 4 L2dev16(s18) 105 95 32 Ok 5 Intake (s3) 60 42 24 Ok 5 EOBC_MAC(s4) 105 95 42 Ok 5 CPU (s5) 105 95 42 Ok 5 Crossbar(s6) 105 95 47 Ok 5 Arbiter (s7) 110 100 55 Ok 5 CTSdev1 (s8) 115 105 44 Ok 5 InbFPGA (s9) 105 95 43 Ok 5 QEng1Sn1(s10) 115 105 48 Ok 5 QEng1Sn2(s11) 115 105 46 Ok 5 QEng1Sn3(s12) 115 105 44 Ok 5 QEng1Sn4(s13) 115 105 44 Ok 6 Intake (s3) 60 42 24 Ok 6 EOBC_MAC(s4) 105 95 40 Ok 6 CPU (s5) 105 95 36 Ok 6 Crossbar(s6) 105 95 45 Ok 6 Arbiter (s7) 110 100 52 Ok 6 CTSdev1 (s8) 115 105 43 Ok 6 InbFPGA (s9) 105 95 43 Ok 6 QEng1Sn1(s10) 115 105 53 Ok 6 QEng1Sn2(s11) 115 105 51 Ok 6 QEng1Sn3(s12) 115 105 48 Ok 6 QEng1Sn4(s13) 115 105 48 Ok 10 Crossbar(s5) 105 95 46 Ok 10 QEng1Sn1(s12) 115 110 65 Ok 10 QEng1Sn2(s13) 115 110 62 Ok 10 QEng1Sn3(s14) 115 110 64 Ok 10 QEng1Sn4(s15) 115 110 65 Ok 10 QEng2Sn1(s16) 115 110 65 Ok 10 QEng2Sn2(s17) 115 110 63 Ok 10 QEng2Sn3(s18) 115 110 64 Ok 10 QEng2Sn4(s19) 115 110 65 Ok 10 L2Lookup(s27) 115 105 51 Ok 10 L3Lookup(s28) 120 110 71 Ok xbar-1 Intake (s2) 60 42 27 Ok xbar-1 Crossbar(s3) 105 95 55 Ok xbar-2 Intake (s2) 60 42 25 Ok xbar-2 Crossbar(s3) 105 95 49 Ok xbar-3 Intake (s2) 60 42 26 Ok xbar-3 Crossbar(s3) 105 95 47 Ok
The Intake sensor is placed at the airflow intake and is the most critical indicator of card temperature. All software actions are taken based on a major temperature violation of the Intake sensor.
These result in a syslog message, callhome event and a Simple Network Management Protocol (SNMP) trap. This priority 1 or 2 messages are printed in the syslog – Module 1 reported Major temperature alarm (sensor-index 1 temperature 76).
The linecard is instantly shutdown with this priority 0 syslog message - Module 1 powered down due to major temperature alarm.
The redundant Supervisor is instantly shutdown. This will result in either a switchover or the standby shutting down, depending on the particular Supervisor that violated the threshold. This priority 0 syslog message is displayed - Module 1 powered down due to major temperature alarm.
Sometimes, the temperature sensors fail and become inaccessible. No explicit software action is taken for this condition. This priority 4 syslog message is printed – Module 1 temperature sensor failed.
Debugging a switch/supervisor level reset/reload typically involves looking into debug/log information stored on the Non-Volatile Random Access Memory (NVRAM) on the Supervisors. There are 3 kinds of debug/log information present in the NVRAM that might hold some important information.
1.1 Reset reason
Reset reasons are stored on the Supervisor NVRAM on each Supervisor. Each Supervisor stores its own reset reason. After the switch comes back up, the reset reasons can be dumped using this CLI command. A sample output is provided.
SITE1-AGG1# show system reset-reason ----- reset reason for Supervisor-module 5 (from Supervisor in slot 5) --- 1) No time Reason: Unknown Service: Version: 6.1(2) 2) No time Reason: Unknown Service: Version: 6.1(1) 3) At 246445 usecs after Wed Nov 7 21:26:59 2012 Reason: Reset triggered due to Switchover Request by User Service: SAP(93): Swover due to install Version: 6.1(2) 4) At 36164 usecs after Tue Nov 6 01:18:15 2012 Reason: Reset Requested by CLI command reload Service: Version: 5.2(1) ----- reset reason for Supervisor-module 5 (from Supervisor in slot 6) --- 1) At 939785 usecs after Wed Nov 7 22:28:36 2012 Reason: Reset due to upgrade Service: Version: 6.1(1) 2) At 687128 usecs after Thu Mar 29 18:06:34 2012 Reason: Reset of standby by active sup due to sysmgr timeout Service: Version: 6.0(2) 3) At 10012 usecs after Thu Mar 29 17:56:13 2012 Reason: Reset of standby by active sup due to sysmgr timeout Service: Version: 6.0(2) 4) At 210045 usecs after Thu Mar 29 17:45:51 2012 Reason: Reset of standby by active sup due to sysmgr timeout Service: Version: 6.0(2) ----- reset reason for Supervisor-module 6 (from Supervisor in slot 5) --- 1) At 50770 usecs after Wed Nov 7 21:12:19 2012 Reason: Reset due to upgrade Service: Version: 6.1(2) 2) At 434294 usecs after Mon Nov 5 22:10:16 2012 Reason: Reset due to upgrade Service: Version: 5.2(1) 3) At 518 usecs after Mon Nov 5 21:21:51 2012 Reason: Reset Requested by CLI command reload Service: Version: 5.2(7) 4) At 556934 usecs after Mon Nov 5 21:12:15 2012 Reason: Reset due to upgrade Service: Version: 5.2(1) ----- reset reason for Supervisor-module 6 (from Supervisor in slot 6) --- 1) No time Reason: Unknown Service: Version: 6.1(2) 2) At 462775 usecs after Wed Nov 7 22:38:44 2012 Reason: Reset triggered due to Switchover Request by User Service: SAP(93): Swover due to install Version: 6.1(1) 3) No time Reason: Unknown Service: Version: 6.1(2) 4) No time Reason: Unknown Service: Version: 5.2(1)
Upto the last 4 reset reasons are saved and displayed. A reset reason contains:
Sometimes a reset reason of Unknown is displayed. Reset reasons that are unknown to software or beyond software control are categorized as Unknown. These typically include:
1.2 NVRAM syslog
Syslog messages that are priority 0, 1 and 2 are also logged into the NVRAM of the Supervisor. After the switch comes back up online, syslog messages in the NVRAM can be displayed using this command. The command and a sample output is displayed:
SITE1-AGG1# show log nvram 2012 Nov 17 05:59:51 SITE1-AGG1 %$ VDC-1 %$ %SYSMGR-STANDBY-2-LAST_CORE_BASIC_TRACE: : PID 15681 with message 'Core detected due to hwclock crash'. 2012 Nov 17 12:07:11 SITE1-AGG1 %$ VDC-1 %$ %CMPPROXY-2-LOG_CMP_UP: Connectivity Management processor(on module 5) is now UP 2012 Nov 17 12:07:56 SITE1-AGG1 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 1 has come online 2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_OK: Power supply 1 ok (Serial number DTM131000A4) 2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_FANOK: Fan in Power supply 1 ok 2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_OK: Power supply 2 ok (Serial number DTM140700HS) 2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_FANOK: Fan in Power supply 2 ok 2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_DETECT: Power supply 3 detected but shutdown (Serial number DTM1413004P) 2012 Nov 17 12:07:59 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-XBAR_DETECT: Xbar 1 detected (Serial number JAF1308ABCS) 2012 Nov 17 12:08:01 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-XBAR_DETECT: Xbar 2 detected (Serial number JAB120600NX) 2012 Nov 17 12:08:02 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-XBAR_DETECT: Xbar 3 detected (Serial number JAF1508AJHN) 2012 Nov 17 12:08:04 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_DETECT: Module 1 detected (Serial number JAB121602HP) Module-Type 10/100/1000 Mbps Ethernet Module Model N7K-M148GT-11 2012 Nov 17 12:08:04 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_PWRUP: Module 1 powered up (Serial number JAB121602HP) 2012 Nov 17 12:08:11 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_DETECT: Module 3 detected (Serial number JAF1441BSED) Module-Type 10 Gbps Ethernet Module Model N7K-M132XP-12 2012 Nov 17 12:08:11 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_DETECT: Module 4 detected (Serial number JAF1542ABML) Module-Type 1/10 Gbps Ethernet Module Model N7K-F132XP-15 2012 Nov 17 12:08:12 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_PWRUP: Module 3 powered up (Serial number JAF1441BSED) 2012 Nov 17 12:08:12 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_PWRUP: Module 4 powered up (Serial number JAF1542ABML) 2012 Nov 17 12:08:15 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_DETECT: Module 10 detected (Serial number JAF1521BNMK) Module-Type 10 Gbps Ethernet XL Module Model N7K-M132XP-12L 2012 Nov 17 12:08:15 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_PWRUP: Module 10 powered up (Serial number JAF1521BNMK) 2012 Nov 17 12:08:30 SITE1-AGG1 %$ VDC-1 %$ %CMPPROXY-STANDBY-2-LOG_CMP_UP: Connectivity Management processor(on module 6) is now UP 2012 Nov 17 12:08:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-FANMOD_FAN_OK: Fan module 1 (Fan1(sys_fan1) fan) ok 2012 Nov 17 12:08:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-FANMOD_FAN_OK: Fan module 2 (Fan2(sys_fan2) fan) ok 2012 Nov 17 12:08:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-FANMOD_FAN_OK: Fan module 3 (Fan3(fab_fan1) fan) ok 2012 Nov 17 12:08:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-FANMOD_FAN_OK: Fan module 4 (Fan4(fab_fan2) fan) ok 2012 Nov 17 12:11:40 SITE1-AGG1 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 2 has come online 2012 Nov 17 12:12:31 SITE1-AGG1 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 3 has come online 2012 Nov 17 12:13:21 SITE1-AGG1 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 4 has come online 2012 Nov 17 13:10:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_TEMPMINALRM: Xbar-1 reported minor temperature alarm. Sensor=2 Temperature=43 MinThreshold=42 2012 Nov 17 19:56:35 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_TEMPOK: Xbar-1 recovered from minor temperature alarm. Sensor=2 Temperature=41 MinThreshold=42
Scanning the NVRAM syslog might provide some more information on the particular failure that caused the switch/Supervisor reload/reset.
1.3 Module exceptionlog
Module exceptionlog is a wraparound log of all errors and exceptional conditions on each module. Some exceptions are catastrophic, some partially affect certain ports in a module, others are for warning purposes. Each log entry has the particular device that logged the exception, the exception level, error code, ports affected, timestamp. The exception log is stored in the NVRAM on the Supervisor and it can be displayed using this CLI command. A sample output is provided.
SITE1-AGG1# show module internal exceptionlog ********* Exception info for module 1 ******** exception information --- exception instance 1 ---- Module Slot Number: 1 Device Id : 10 Device Name : eobc Device Errorcode : 0xc0005043 Device ID : 00 (0x00) Device Instance : 05 (0x05) Dev Type (HW/SW) : 00 (0x00) ErrNum (devInfo) : 67 (0x43) System Errorcode : 0x4042004d EOBC link failure Error Type : Warning PhyPortLayer : Ethernet Port(s) Affected : none DSAP : 0 (0x0) UUID : 0 (0x0) Time : Mon Nov 5 20:39:38 2012 (Ticks: 5098948A jiffies) exception information --- exception instance 2 ---- Module Slot Number: 1 Device Id : 10 Device Name : eobc Device Errorcode : 0xc0005047 Device ID : 00 (0x00) Device Instance : 05 (0x05) Dev Type (HW/SW) : 00 (0x00) ErrNum (devInfo) : 71 (0x47) System Errorcode : 0x4042004e EOBC heartbeat failure Error Type : Warning PhyPortLayer : Ethernet Port(s) Affected : none DSAP : 0 (0x0) UUID : 0 (0x0) Time : Mon Nov 5 20:39:37 2012 (Ticks: 50989489 jiffies)
The exceptionlog provides critical information to troubleshoot errors and exception conditions. Some of the device IDs are listed here.
#define DEV_LINECARD_CTRL 1 #define DEV_SAHARA_FPGA 2 #define DEV_RIVIERA_ASIC 3 #define DEV_LUXOR_ASIC 4 #define DEV_FRONTIER_U_ASIC 5 #define DEV_FRONTIER_D_ASIC 6 #define DEV_ALADDIN_ASIC 7 #define DEV_SSA_ASIC 8 #define DEV_MIRAGE_ASIC 9 #define DEV_EOBC_MAC 10 #define DEV_SUPERVISOR_CTRL 11 #define DEV_BELLAGIO_ASIC 12 #define DEV_SIBYTE 13 #define DEV_FLAMINGO 14 #define DEV_FATW_CTRL 15 #define DEV_MGMT_MAC 16 #define DEV_MOD_RDN_CTRL 17 #define DEV_MOD_ENV 18 #define DEV_GG_FPGA 19 #define DEV_BALLY_MAIN_BOARD 20 #define DEV_BALLY_DAUGHTER_CARD 21 #define DEV_LOCAL_SSO_ASIC 22 #define DEV_REMOTE_SSO_ASIC 23 #define DEV_ID_UD_FIX_FPGA 24 #define DEV_ID_PM_FPGA 25 // PM - Power Mngmnt #define DEV_ID_SUP_XBUS2 26 #define DEV_MARRIOTT_FPGA 27 #define DEV_REUSE_ME 28 #define DEV_GBIC 29 #define DEV_XGFC_FPGA 30 #define DEV_GNN_FPGA 31 #define DEV_SIBYTE_MEM_EPLD 32 #define DEV_BATTERY 33 #define DEV_IDE_DISK 45 #define DEV_XCVR 46 #define DEV_LINECARD 48 #define DEV_TEMP_SENSOR 49 #define DEV_HIFN_COMP 50 #define DEV_X2 51
In the Multilayer Data Switch (MDS) Chassis, the supervisor modules are brought up a little differently than the line-card modules. When two supervisors are present in the system and the system is powered-up, one of the supervisors will become active and the other standby. Active Supervisor bring-up and Standby Supervisor bring-up is different and is discussed here.
If there is no active supervisor in the system, the supervisor which boots up will default to active supervisor. A process called system manager is responsible for loading all the software components in an orderly fashion on the supervisor. One of the first software components that is run on the supervisor is the platform manager. This component will load all the kernel drivers and handshakes with the system manager. On Success, system-manager will go ahead and start the rest of the processes based on the internal dependency between processes.
From module manager’s perspective, Supervisor is just like another line-card module with subtle differences. When platform manager indicates to module manager that the Supervisor is UP, module manager does not wait for Registration. Instead, it informs all the software components that Supervisor is up (also known as Sup Insertion Sequence). All the components will configure the supervisor. If any component comes back with a failure, the supervisor will be rebooted.
If there is an active supervisor in the system, the supervisor which is booting up will default to standby supervisor state. The standby supervisor needs to mirror the state of the active supervisor. This is achieved by ‘system manager’ on active, initiating a gsync (global sync) of active supervisor state to standby supervisor. Once all the components on the standby are synchronized with that of the active supervisor, module manager is informed that the standby supervisor is up.
Module-manager will now go ahead and inform all the software components on the active supervisor to configure the standby supervisor (Also Known as Standby Sup Insertion Sequence). Any errors from any component during the Standby Sup Insertion Sequence will result in Standby Supervisor Reboot.
MDS maintains lot of debug information during runtime. But, whenever a supervisor reboots much of the debug information is lost. However all critical information is stored in non volatile ram, which can be used to reconstruct the failure. When an Active Supervisor reboots, the information that is stored in its nvram cannot be obtained until it comes back up again. Once the Supervisor comes back up again, these commands can be used to dump the persistent log:
Switch# show logging nvram
Switch# show system reset-reason
Switch# show module internal exception-log
Example 1: Active Sup Reboot (due to Supervisor Process Crash)
In this example, a Supervisor Process crashed (Service “xbar”) which causes the Active sup to be rebooted. When the supervisor comes back up again, the information stored in the reset-reason gives a clear indication, for the reboot of the supervisor.
switch# show system reset-reason ----- reset reason for module 6 ----- 1) At 94009 usecs after Tue Sep 27 18:52:13 2005 Reason: Reset triggered due to HA policy of Reset Service: Service "xbar" Version: 2.1(2)
If there is standby supervisor in the system, the standby supervisor will now become active supervisor. Displaying the syslog information on the standby supervisor will also provide the same information (although not as explicitly as ‘show system reset-reason’).
Switch# show logging 2005 Sep 27 18:58:05 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 1225) hasn't caught signal 9 (no core). 2005 Sep 27 18:58:06 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 2349) hasn't caught signal 9 (no core). 2005 Sep 27 18:58:06 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 2352) hasn't caught signal 9 (no core).
Example 2: Active Sup Reboot (due to runtime diagnostic failure)
In this example, Supervisor in slot-6 is active and the arbiter on the Supervisor reports a Fatal Error. When any hardware device reports a Fatal Error, the module that contains the device is rebooted. In this case the Active Supervisor is rebooted. If there is a standby supervisor, the standby supervisor will take over. Syslog messages on the standby supervisor and exception log will have information to identify the source of error.
Switch# show logging 2005 Sep 28 14:17:47 172.20.150.204 %XBAR-5-XBAR_STATUS_REPORT: Module 6 reported status for component 12 code 0x60a02. 2005 Sep 28 14:17:59 172.20.150.204 %PORT-5-IF_UP: Interface mgmt0 on slot 5 is up 2005 Sep 28 14:18:00 172.20.150.204 %CALLHOME-2-EVENT: SUP_FAILURE switch# show module internal exceptionlog module 6 ********* Exception info for module 6 ******** exception information --- exception instance 1 ---- device id: 12 device errorcode: 0x80000020 system time: (1127917068 ticks) Wed Sep 28 14:17:48 2005 error type: FATAL error Number Ports went bad: 1,2,3,4,5,6 exception information --- exception instance 2 ---- device id: 12 device errorcode: 0x00060a02 system time: (1127917067 ticks) Wed Sep 28 14:17:47 2005 error type: Warning Number Ports went bad: 1,2,3,4,5,6
In addition, when the rebooted sup comes online again, ‘show system reset-reason’ will contain relevant information too. In this case the module 6 (which was the active sup) was rebooted by Sap 48 with error-code 0x80000020. The process which owns this sap can be obtained by the command ‘show system internal mts sup sap 48 description’ which says that the process was xbar-manager.
switch(standby)# show system reset-reason ----- reset reason for module 6 ----- 1) At 552751 usecs after Wed Sep 28 14:17:48 2005 Reason: Reset Requested due to Fatal Module Error Service: lcfail:80000020 sap:48 node:060 Version: 2.1(2)
Example 3: Standby Sup Failed to Come Online
In this example, active sup is up and running and standby sup is plugged into the system. However show module does not indicate that the module has ever come up.
switch# show module Mod Ports Module-Type Model Status --- ----- -------------------------------- ------------------ ------------ 5 0 Supervisor/Fabric-1 DS-X9530-SF1-K9 active * 8 8 IP Storage Services Module powered-dn Mod Sw Hw World-Wide-Name(s) (WWN) --- ----------- ------ -------------------------------------------------- 5 2.1(2) 1.1 -- Mod MAC-Address(es) Serial-Num --- -------------------------------------- ---------- 5 00-0b-be-f7-4d-1c to 00-0b-be-f7-4d-20 JAB070307XG
However, if you login to the console of the standby sup, it says it is standby.
runlog>telnet sw4-ts 2004 Trying 172.22.22.55... Connected to sw4-ts.cisco.com (172.22.22.55). Escape character is '^]'. MDS Switch login: admin Password: Cisco Storage Area Networking Operating System (SAN-OS) Software TAC support: http://www.cisco.com/tac Copyright (c) 2002-2005, Cisco Systems, Inc. All rights reserved. The copyrights to certain works contained herein are owned by other third parties and are used and distributed under license. Some parts of this software are covered under the GNU Public License. A copy of the license is available at http://www.gnu.org/licenses/gpl.html. switch(standby)#
As discussed earlier, when the standby sup is inserted into the system, the configuration and the state of all the components of the active supervisor is copied over to the standby (gsync). Till this process is complete, active supervisor does not consider standby supervisor is present. To verify if this process is complete, you could issue the following command on the active supervisor. The output of the command indicates that synchronization in progress (and is probably never completed).
switch# show system redundancy status Redundancy mode --------------- administrative: HA operational: None This supervisor (sup-1) ----------------------- Redundancy state: Active Supervisor state: Active Internal state: Active with HA standby Other supervisor (sup-2) ------------------------ Redundancy state: Standby Supervisor state: HA standby Internal state: HA synchronization in progress
The most likely reason why this could have happened is, if one of the software components on the standby failed to synchronize its state with the active supervisor. To verify which processes did not synchronize, you can issue this command on the active supervisor and the output indicates a lot of software components have not completed gsync.
switch# show system internal sysmgr gsyncstats Name Gsync done Gsync time(sec) ---------------- ---------- ------------- aaa 1 0 ExceptionLog 1 0 platform 1 1 radius 1 0 securityd 1 0 SystemHealth 1 0 tacacs 0 N/A acl 1 0 ascii-cfg 1 1 bios_daemon 0 N/A bootvar 1 0 callhome 1 0 capability 1 0 cdp 1 0 cfs 1 0 cimserver 1 0 cimxmlserver 0 N/A confcheck 1 0 core-dmon 1 0 core-client 0 N/A device-alias 1 0 dpvm 0 N/A dstats 1 0 epld_upgrade 0 N/A epp 1 1
In addition, looking at the standby supervisor we see that xbar software component has been restarted 23 times. This looks like the most likely cause that the standby did not come up.
switch(standby)# show system internal sysmgr service all Name UUID PID SAP state Start count ---------------- ---------- ------ ----- ----- ----------- aaa 0x000000B5 1458 111 s0009 1 ExceptionLog 0x00000050 [NA] [NA] s0002 None platform 0x00000018 1064 39 s0009 1 radius 0x000000B7 1457 113 s0009 1 securityd 0x0000002A 1456 55 s0009 1 vsan 0x00000029 1436 15 s0009 1 vshd 0x00000028 1408 37 s0009 1 wwn 0x00000030 1435 114 s0009 1 xbar 0x00000017 [NA] [NA] s0017 23 xbar_client 0x00000049 1434 917 s0009 1
Example 3: Standby Sup is in Powered-up State
In this example, standby sup is inserted in slot 6. show module command issued on the active-sup, shows Standby Sup is in powered-up state.
switch# show module Mod Ports Module-Type Model Status --- ----- -------------------------------- ------------------ ------------ 5 0 Supervisor/Fabric-1 DS-X9530-SF1-K9 active * 6 0 Supervisor/Fabric-1 powered-up 8 8 IP Storage Services Module powered-dn Mod Sw Hw World-Wide-Name(s) (WWN) --- ----------- ------ -------------------------------------------------- 5 2.1(2) 1.1 -- Mod MAC-Address(es) Serial-Num --- -------------------------------------- ---------- 5 00-0b-be-f7-4d-1c to 00-0b-be-f7-4d-20 JAB070307XG
In this example, show logging does not give any valuable information and neither does show module internal exception-log. However as all state transitions for a given module is stored in the module manager we can look at the state transistions of the module manager to figure out what is wrong. The internal state transistions are:
Switch# show module internal event-history module 5 64) FSM:<ID(1): Slot 6, node 0x0601> Transition at 563504 usecs after Wed Sep 28 14:44:53 2005 Previous state: [LCM_ST_LC_NOT_PRESENT] Triggered event: [LCM_EV_PFM_MODULE_SUP_INSERTED] Next state: [LCM_ST_SUPERVISOR_INSERTED] 65) FSM:<ID(1): Slot 6, node 0x0601> Transition at 563944 usecs after Wed Sep 28 14:44:53 2005 Previous state: [LCM_ST_SUPERVISOR_INSERTED] Triggered event: [LCM_EV_START_SUP_INSERTED_SEQUENCE] Next state: [LCM_ST_CHECK_INSERT_SEQUENCE] 66) Event:ESQ_START length:32, at 564045 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x2710, Ret:success Seq Type:SERIAL 67) Event:ESQ_REQ length:32, at 564422 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x1, Ret:success [E_MTS_TX] Dst:MTS_SAP_MIGUTILS_DAEMON(949), Opc:MTS_OPC_LC_INSERTED(1081) 68) Event:ESQ_RSP length:32, at 566174 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x1, Ret:success [E_MTS_RX] Src:MTS_SAP_MIGUTILS_DAEMON(949), Opc:MTS_OPC_LC_INSERTED(1081) 69) Event:ESQ_REQ length:32, at 566346 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x2, Ret:success [E_MTS_TX] Dst:MTS_SAP_NTP(72), Opc:MTS_OPC_LC_INSERTED(1081) 70) Event:ESQ_RSP length:32, at 566635 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x2, Ret:success [E_MTS_RX] Src:MTS_SAP_NTP(72), Opc:MTS_OPC_LC_INSERTED(1081) 71) Event:ESQ_REQ length:32, at 566772 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x3, Ret:success [E_MTS_TX] Dst:MTS_SAP_XBAR_MANAGER(48), Opc:MTS_OPC_LC_INSERTED(1081) 73) Event:ESQ_RSP length:32, at 586418 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x3, Ret:(null) [E_MTS_RX] Src:MTS_SAP_XBAR_MANAGER(48), Opc:MTS_OPC_LC_INSERTED(1081) 74) FSM:<ID(1): Slot 6, node 0x0601> Transition at 586436 usecs after Wed Sep 28 14:44:53 2005 Previous state: [LCM_ST_CHECK_INSERT_SEQUENCE] Triggered event: [LCM_EV_LC_INSERTED_SEQ_FAILED] Next state: [LCM_ST_CHECK_REMOVAL_SEQUENCE] 75) Event:ESQ_START length:32, at 586611 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x2710, Ret:success Seq Type:SERIAL 76) Event:ESQ_REQ length:32, at 593649 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x1, Ret:success [E_MTS_TX] Dst:MTS_SAP_MIGUTILS_DAEMON(949), Opc:MTS_OPC_LC_REMOVED(1082) 77) Event:ESQ_RSP length:32, at 594854 usecs after Wed Sep 28 14:44:53 2005 Instance:1, Seq Id:0x1, Ret:success [E_MTS_RX] Src:MTS_SAP_MIGUTILS_DAEMON(949), Opc:MTS_OPC_LC_REMOVED(1082) 90) FSM:<ID(1): Slot 6, node 0x0601> Transition at 604447 usecs after Wed Sep 28 14:44:53 2005 Previous state: [LCM_ST_CHECK_REMOVAL_SEQUENCE] Triggered event: [LCM_EV_ALL_LC_REMOVED_RESP_RECEIVED] Next state: [LCM_ST_LC_FAILURE] 91) FSM:<ID(1): Slot 6, node 0x0601> Transition at 604501 usecs after Wed Sep 28 14:44:53 2005 Previous state: [LCM_ST_LC_FAILURE] Triggered event: [LCM_EV_LC_INSERTED_SEQ_FAILED] Next state: [LCM_ST_LC_FAILURE] 92) FSM:<ID(1): Slot 6, node 0x0601> Transition at 604518 usecs after Wed Sep 28 14:44:53 2005 Previous state: [LCM_ST_LC_FAILURE] Triggered event: [LCM_EV_SUPERVISOR_FAILURE] Next state: [LCM_ST_LC_NOT_PRESENT] Curr state: [LCM_ST_LC_NOT_PRESENT] switch#
Look at the logs above Index 92, indicates that the supervisor is in failed state and the triggered event is LCM_EV_LC_INSERTED_SEQ_FAILED. (Insertion sequence failed). Going up the logs to find out why Insertion Sequence failed, see that insertion sequence failed right after a response from MTS_SAP_XBAR_MANAGER (Index 73 and Index 74). This indicates that there is something wrong with xbar configuration when the standby sup is inserted. More debugging can be done by looking at the internal logs of the failed component (in this case, xbar component).