Troubleshooting N7K HW(fan/PS/Temp/Xbar/SUP)


Updated: October 14, 2016

Document ID: 200148

Introduction

Debugging Chassis Issues

Fan Issues

Power Supply

Temperature or Heat

Debugging Supervisor Module Issues

Switch/Supervisor Reset/Reload

Active Supervisor Bring-up

Standby Supervisor Bring-up

Active Supervisor Reboot

Introduction

This document describes troubleshooting techniques for the Nexus 7000 (N7K) hardware.

Debugging Chassis Issues

Fan Issues

This command displays the fan module status on the switch.

SITE1-AGG1# show environment fan
Fan:
------------------------------------------------------
Fan             Model                Hw         Status
------------------------------------------------------
Fan1(sys_fan1)  N7K-C7010-FAN-S      1.1        Ok  
Fan2(sys_fan2)  N7K-C7010-FAN-S      1.1        Ok  
Fan3(fab_fan1)  N7K-C7010-FAN-F      1.1        Ok  
Fan4(fab_fan2)  N7K-C7010-FAN-F      1.1        Ok  
Fan_in_PS1      --                   --         Ok             
Fan_in_PS2      --                   --         Ok             
Fan_in_PS3      --                   --         Shutdown       
Fan Zone Speed: Zone 1: 0x78 Zone 2: 0x58
Fan Air Filter : Present

Fan status can be one of Ok, Failure, or Absent.

Ok – All fans, including the fan controller, are functioning properly.
Failure – One or more fans or the fan controller has failed. Software cannot determine whether a single fan, multiple fans, or all fans have failed; this status is displayed if at least one fan has failed. This priority 1 syslog message is printed: Fan module Failed.
Absent – The fan module has been removed. As soon as the fan module is removed, software starts a 5-minute countdown; if the fan module is not reinserted within 5 minutes, the entire switch is shut down. Software reads a byte on the Serial Electrically Erasable Programmable Read Only Memory (SEEPROM) to determine whether the fan module is present. If the fan module is only partially inserted, or software cannot access the SEEPROM on the fan module for any other reason, software cannot distinguish this case from a real fan module removal, and the switch is shut down after 5 minutes. When software detects a removal, this priority 0 syslog message is printed every 5 seconds:

"Fan module removed. Fan module has been absent for 120 seconds"

No explicit action is taken by software on a Power Supply fan failure, other than indicating such a failure using syslog messages.
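
When the fan table is collected from many switches, the status column can be checked by a script. The following is a minimal sketch, assuming the output has been captured as plain text with the column layout of the sample above; the function name fans_not_ok is hypothetical and not part of any Cisco tool.

```python
import re

# Rows copied from the sample `show environment fan` output above.
SAMPLE = """\
Fan             Model                Hw         Status
Fan1(sys_fan1)  N7K-C7010-FAN-S      1.1        Ok
Fan3(fab_fan1)  N7K-C7010-FAN-F      1.1        Ok
Fan_in_PS1      --                   --         Ok
Fan_in_PS3      --                   --         Shutdown
Fan Zone Speed: Zone 1: 0x78 Zone 2: 0x58
Fan Air Filter : Present
"""

# A data row has exactly four whitespace-separated fields:
# name, model, hardware revision, status. Requiring at least one
# character after "Fan" skips the column-header row.
ROW = re.compile(r"^(Fan\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def fans_not_ok(text):
    """Return (name, status) for every fan row whose status is not Ok."""
    flagged = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m and m.group(4).lower() != "ok":
            flagged.append((m.group(1), m.group(4)))
    return flagged


print(fans_not_ok(SAMPLE))
```

In the sample, only the fan in power supply 3 is flagged, which is consistent with that power supply being shut down.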

Power Supply

This command displays the power supplies installed, power usage summary and status of power supplies on the switch.

The command as well as a sample output is provided.

SITE1-AGG1# show environment power 
Power Supply:
Voltage: 50 Volts
Power                              Actual        Total
Supply    Model                    Output     Capacity    Status
                                 (Watts )     (Watts )
-------  -------------------  -----------  -----------  --------------
1        N7K-AC-6.0KW              1179 W       6000 W     Ok        
2        N7K-AC-6.0KW              1117 W       6000 W     Ok        
3        N7K-AC-6.0KW                 0 W          0 W     Shutdown  

                                  Actual        Power      
Module    Model                     Draw    Allocated    Status
                                 (Watts )     (Watts )     
-------  -------------------  -----------  -----------  --------------
1        N7K-M148GT-11              N/A          400 W    Powered-Up
3        N7K-M132XP-12              N/A          750 W    Powered-Up
4        N7K-F132XP-15              318 W        385 W    Powered-Up
5        N7K-SUP1                   N/A          210 W    Powered-Up
6        N7K-SUP1                   N/A          210 W    Powered-Up
10       N7K-M132XP-12L             535 W        750 W    Powered-Up
Xb1      N7K-C7010-FAB-1            N/A           80 W    Powered-Up
Xb2      N7K-C7010-FAB-1            N/A           80 W    Powered-Up
Xb3      N7K-C7010-FAB-1            N/A           80 W    Powered-Up
Xb4      xbar                       N/A           80 W    Absent
Xb5      xbar                       N/A           80 W    Absent
fan1     N7K-C7010-FAN-S            133 W        720 W    Powered-Up
fan2     N7K-C7010-FAN-S            133 W        720 W    Powered-Up
fan3     N7K-C7010-FAN-F             12 W        120 W    Powered-Up
fan4     N7K-C7010-FAN-F             12 W        120 W    Powered-Up

N/A - Per module power not available


Power Usage Summary:
--------------------
Power Supply redundancy mode (configured)                PS-Redundant
Power Supply redundancy mode (operational)               Non-Redundant

Total Power Capacity (based on configured mode)              12000 W
Total Power of all Inputs (cumulative)                       12000 W
Total Power Output (actual draw)                              2296 W
Total Power Allocated (budget)                                4785 W
Total Power Available for additional modules                  7215 W

Power supply status can be one of these:

Ok – Power supply is functioning properly
Fail/Shutdown – Either the power supply has failed or it has been shut down using the switch on the power supply. Whenever a power supply fails, software prints this priority 2 syslog message: Power supply 1 failed or shutdown (Serial number xxxx).
Shutdown – Software has shut down the power supply. Software shuts down the lower-capacity power supply only if it detects a mismatched pair of power supplies and the mode is redundant or there is a transition from combined to redundant mode. If both power supplies have the same capacity or the mode is combined, software never shuts down a power supply. This priority 2 syslog message accompanies a software power supply shutdown: Detected power supply 1. This reduces the redundant power available to the system and can cause service disruptions (Serial number xxxx).
Absent – The power supply has been removed. This priority 2 syslog message is printed during a power supply removal: Power supply 2 removed (Serial number xxxx).

Power supply failures:

Each power supply has an LED that indicates power output status. This LED is controlled directly by the power supply, and a red color indicates a power supply failure. When you scan the syslog, you might see alternating messages about power supply failure and recovery, which further indicates a power supply related problem.
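
The figures in the Power Usage Summary of the sample output can be cross-checked against the per-supply and per-module tables. The sketch below copies the values from the sample by hand; the function name power_summary is hypothetical, and the relationship "available = capacity - allocated" is inferred from the sample numbers rather than taken from a specification.

```python
# Allocated watts per module, copied from the sample output above.
# Note that the absent Xb4 and Xb5 slots still carry an 80 W allocation.
ALLOCATED_W = {
    "1": 400, "3": 750, "4": 385, "5": 210, "6": 210, "10": 750,
    "Xb1": 80, "Xb2": 80, "Xb3": 80, "Xb4": 80, "Xb5": 80,
    "fan1": 720, "fan2": 720, "fan3": 120, "fan4": 120,
}

# Actual output watts per power supply, copied from the sample output.
ACTUAL_OUTPUT_W = {"1": 1179, "2": 1117, "3": 0}

# "Total Power Capacity (based on configured mode)" from the sample.
TOTAL_CAPACITY_W = 12000


def power_summary(allocated, outputs, capacity):
    """Recompute the budget lines of the Power Usage Summary."""
    budget = sum(allocated.values())
    return {
        "allocated": budget,
        "actual_draw": sum(outputs.values()),
        "available": capacity - budget,
    }


print(power_summary(ALLOCATED_W, ACTUAL_OUTPUT_W, TOTAL_CAPACITY_W))
```

The recomputed values match the summary in the sample: 4785 W allocated, 2296 W actual draw, and 7215 W available for additional modules.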

Temperature or Heat

Each card in the chassis has at least two temperature sensors. Each temperature sensor is configured with a minor and a major threshold. This command, shown with sample output, retrieves temperature information from the switch:

SITE1-AGG1# show environment temperature 
Temperature:
--------------------------------------------------------------------
Module   Sensor        MajorThresh   MinorThres   CurTemp     Status
                       (Celsius)     (Celsius)    (Celsius)         
--------------------------------------------------------------------
1        Crossbar(s5)    105             95          46         Ok             
1        CTSdev4 (s9)    115             105         56         Ok             
1        CTSdev5 (s10)   115             105         57         Ok             
1        CTSdev7 (s12)   115             105         56         Ok             
1        CTSdev9 (s14)   115             105         53         Ok             
1        CTSdev10(s15)   115             105         53         Ok             
1        CTSdev11(s16)   115             105         52         Ok             
1        CTSdev12(s17)   115             105         51         Ok             
1        QEng1Sn1(s18)   115             105         51         Ok             
1        QEng1Sn2(s19)   115             105         50         Ok             
1        QEng1Sn3(s20)   115             105         48         Ok             
1        QEng1Sn4(s21)   115             105         48         Ok             
1        L2Lookup(s22)   120             110         47         Ok             
1        L3Lookup(s23)   120             110         54         Ok             
3        Crossbar(s5)    105             95          50         Ok             
3        QEng1Sn1(s12)   115             110         69         Ok             
3        QEng1Sn2(s13)   115             110         67         Ok             
3        QEng1Sn3(s14)   115             110         66         Ok             
3        QEng1Sn4(s15)   115             110         67         Ok             
3        QEng2Sn1(s16)   115             110         70         Ok             
3        QEng2Sn2(s17)   115             110         67         Ok             
3        QEng2Sn3(s18)   115             110         66         Ok             
3        QEng2Sn4(s19)   115             110         67         Ok             
3        L2Lookup(s27)   115             105         51         Ok             
3        L3Lookup(s28)   120             110         64         Ok             
4        Crossbar1(s1)   105             95          69         Ok             
4        Crossbar2(s2)   105             95          52         Ok             
4        L2dev1(s3)      105             95          37         Ok             
4        L2dev2(s4)      105             95          43         Ok             
4        L2dev3(s5)      105             95          45         Ok             
4        L2dev4(s6)      105             95          45         Ok             
4        L2dev5(s7)      105             95          40         Ok             
4        L2dev6(s8)      105             95          41         Ok             
4        L2dev7(s9)      105             95          42         Ok             
4        L2dev8(s10)     105             95          40         Ok             
4        L2dev9(s11)     105             95          38         Ok             
4        L2dev10(s12)    105             95          38         Ok             
4        L2dev11(s13)    105             95          38         Ok             
4        L2dev12(s14)    105             95          37         Ok             
4        L2dev13(s15)    105             95          34         Ok             
4        L2dev14(s16)    105             95          33         Ok             
4        L2dev15(s17)    105             95          33         Ok             
4        L2dev16(s18)    105             95          32         Ok             
5        Intake  (s3)    60              42          24         Ok             
5        EOBC_MAC(s4)    105             95          42         Ok             
5        CPU     (s5)    105             95          42         Ok             
5        Crossbar(s6)    105             95          47         Ok             
5        Arbiter (s7)    110             100         55         Ok             
5        CTSdev1 (s8)    115             105         44         Ok             
5        InbFPGA (s9)    105             95          43         Ok             
5        QEng1Sn1(s10)   115             105         48         Ok             
5        QEng1Sn2(s11)   115             105         46         Ok             
5        QEng1Sn3(s12)   115             105         44         Ok             
5        QEng1Sn4(s13)   115             105         44         Ok             
6        Intake  (s3)    60              42          24         Ok             
6        EOBC_MAC(s4)    105             95          40         Ok             
6        CPU     (s5)    105             95          36         Ok             
6        Crossbar(s6)    105             95          45         Ok             
6        Arbiter (s7)    110             100         52         Ok             
6        CTSdev1 (s8)    115             105         43         Ok             
6        InbFPGA (s9)    105             95          43         Ok             
6        QEng1Sn1(s10)   115             105         53         Ok             
6        QEng1Sn2(s11)   115             105         51         Ok             
6        QEng1Sn3(s12)   115             105         48         Ok             
6        QEng1Sn4(s13)   115             105         48         Ok             
10       Crossbar(s5)    105             95          46         Ok             
10       QEng1Sn1(s12)   115             110         65         Ok             
10       QEng1Sn2(s13)   115             110         62         Ok             
10       QEng1Sn3(s14)   115             110         64         Ok             
10       QEng1Sn4(s15)   115             110         65         Ok             
10       QEng2Sn1(s16)   115             110         65         Ok             
10       QEng2Sn2(s17)   115             110         63         Ok             
10       QEng2Sn3(s18)   115             110         64         Ok             
10       QEng2Sn4(s19)   115             110         65         Ok             
10       L2Lookup(s27)   115             105         51         Ok             
10       L3Lookup(s28)   120             110         71         Ok             
xbar-1   Intake  (s2)    60              42          27         Ok             
xbar-1   Crossbar(s3)    105             95          55         Ok             
xbar-2   Intake  (s2)    60              42          25         Ok             
xbar-2   Crossbar(s3)    105             95          49         Ok             
xbar-3   Intake  (s2)    60              42          26         Ok             
xbar-3   Crossbar(s3)    105             95          47         Ok

The Intake sensor is placed at the airflow intake and is the most critical indicator of card temperature. All software actions are taken based on a major temperature violation of the Intake sensor.

All minor threshold violations and major threshold violations on non-Intake sensors

These result in a syslog message, a Call Home event, and a Simple Network Management Protocol (SNMP) trap. A priority 1 or 2 message such as this one is printed in the syslog: Module 1 reported Major temperature alarm (sensor-index 1 temperature 76).

Major temperature threshold violation on a linecard on Intake sensor

The linecard is shut down immediately with this priority 0 syslog message: Module 1 powered down due to major temperature alarm.

Major temperature threshold violation on a redundant Supervisor on Intake sensor

The redundant Supervisor is shut down immediately. This results in either a switchover or the standby shutting down, depending on which Supervisor violated the threshold. This priority 0 syslog message is displayed: Module 1 powered down due to major temperature alarm.

Temperature sensor failure

Sometimes, the temperature sensors fail and become inaccessible. No explicit software action is taken for this condition. This priority 4 syslog message is printed – Module 1 temperature sensor failed.
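
The threshold rules in this section can be summarized as a small decision function. This is a sketch under stated assumptions, not the platform software's logic: it assumes a violation occurs when the current temperature is greater than or equal to the threshold (the document does not state whether the comparison is inclusive), and the names Reading, parse_row, and classify are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Reading:
    module: str
    sensor: str
    major: int    # major threshold, Celsius
    minor: int    # minor threshold, Celsius
    current: int  # current temperature, Celsius


# Matches one data row of `show environment temperature`; the sensor name
# may or may not be separated from its "(sN)" index by spaces.
ROW = re.compile(
    r"^(\S+)\s+(\S+?)\s*\(s\d+\)\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s*$"
)


def parse_row(line):
    """Turn one table row into a Reading, or None for non-data lines."""
    m = ROW.match(line)
    if not m:
        return None
    module, sensor, major, minor, current = m.groups()
    return Reading(module, sensor, int(major), int(minor), int(current))


def classify(r):
    """Return (alarm level, action) following the rules described above."""
    # ASSUMPTION: ">=" is used; the exact comparison is not documented here.
    if r.current >= r.major:
        if r.sensor.lower() == "intake":
            return ("major", "power down module")
        return ("major", "syslog, Call Home event, SNMP trap")
    if r.current >= r.minor:
        return ("minor", "syslog, Call Home event, SNMP trap")
    return ("ok", "no action")


row = "xbar-1   Intake  (s2)    60              42          27         Ok"
print(classify(parse_row(row)))
```

Only a major violation on an Intake sensor leads to a power-down in this sketch; every other violation is reported but not acted on, which mirrors the behavior described in this section.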

Debugging Supervisor Module Issues

Switch/Supervisor Reset/Reload

Debugging a switch/supervisor level reset/reload typically involves looking into debug/log information stored on the Non-Volatile Random Access Memory (NVRAM) on the Supervisors. There are 3 kinds of debug/log information present in the NVRAM that might hold some important information.

1.1 Reset reason

Reset reasons are stored on the Supervisor NVRAM on each Supervisor. Each Supervisor stores its own reset reason. After the switch comes back up, the reset reasons can be dumped using this CLI command. A sample output is provided.

SITE1-AGG1# show system reset-reason 
----- reset reason for Supervisor-module 5 (from Supervisor in slot 5) ---
1) No time
    Reason: Unknown
    Service: 
    Version: 6.1(2)
2) No time
    Reason: Unknown
    Service: 
    Version: 6.1(1)
3) At 246445 usecs after Wed Nov  7 21:26:59 2012
    Reason: Reset triggered due to Switchover Request by User
    Service: SAP(93): Swover due to install
    Version: 6.1(2)
4) At 36164 usecs after Tue Nov  6 01:18:15 2012
    Reason: Reset Requested by CLI command reload
    Service: 
    Version: 5.2(1)
----- reset reason for Supervisor-module 5 (from Supervisor in slot 6) ---
1) At 939785 usecs after Wed Nov  7 22:28:36 2012
    Reason: Reset due to upgrade
    Service: 
    Version: 6.1(1)
2) At 687128 usecs after Thu Mar 29 18:06:34 2012
    Reason: Reset of standby by active sup due to sysmgr timeout
    Service: 
    Version: 6.0(2)
3) At 10012 usecs after Thu Mar 29 17:56:13 2012
    Reason: Reset of standby by active sup due to sysmgr timeout
    Service: 
    Version: 6.0(2)
4) At 210045 usecs after Thu Mar 29 17:45:51 2012
    Reason: Reset of standby by active sup due to sysmgr timeout
    Service: 
    Version: 6.0(2)
----- reset reason for Supervisor-module 6 (from Supervisor in slot 5) ---
1) At 50770 usecs after Wed Nov  7 21:12:19 2012
    Reason: Reset due to upgrade
    Service: 
    Version: 6.1(2)
2) At 434294 usecs after Mon Nov  5 22:10:16 2012
    Reason: Reset due to upgrade
    Service: 
    Version: 5.2(1)
3) At 518 usecs after Mon Nov  5 21:21:51 2012
    Reason: Reset Requested by CLI command reload
    Service: 
    Version: 5.2(7)
4) At 556934 usecs after Mon Nov  5 21:12:15 2012
    Reason: Reset due to upgrade
    Service: 
    Version: 5.2(1)
----- reset reason for Supervisor-module 6 (from Supervisor in slot 6) ---
1) No time
    Reason: Unknown
    Service: 
    Version: 6.1(2)
2) At 462775 usecs after Wed Nov  7 22:38:44 2012
    Reason: Reset triggered due to Switchover Request by User
    Service: SAP(93): Swover due to install
    Version: 6.1(1)
3) No time
    Reason: Unknown
    Service: 
    Version: 6.1(2)
4) No time
    Reason: Unknown
    Service: 
    Version: 5.2(1)

Up to the last 4 reset reasons are saved and displayed. A reset reason contains:

Timestamp of when the reset/reload occurred
Reason for resetting/reloading the card
Service that caused that reset/reload, if any
Software version that was running at that time

Sometimes a reset reason of Unknown is displayed. Reset reasons that are unknown to software or beyond software control are categorized as Unknown. These typically include:

Any power cycle of switch – including controlled power cycle of power supplies or a reset of power supplies caused by a power glitch or power failure
Front panel reset button reset on Supervisor
Any other hardware failures causing the CPU/DRAM/IO to reset or hang
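
To compare reset reasons across many captures, the output can be split into records with the four fields listed above. The following is a minimal sketch that assumes the text layout of the sample output; parse_reset_reasons is a hypothetical helper, not a Cisco API.

```python
import re

# Two entries copied from the sample `show system reset-reason` output.
SAMPLE = """\
----- reset reason for Supervisor-module 5 (from Supervisor in slot 5) ---
1) No time
    Reason: Unknown
    Service:
    Version: 6.1(2)
3) At 246445 usecs after Wed Nov  7 21:26:59 2012
    Reason: Reset triggered due to Switchover Request by User
    Service: SAP(93): Swover due to install
    Version: 6.1(2)
"""

ENTRY = re.compile(r"^(\d+)\)\s+(.*)$")
FIELD = re.compile(r"^\s*(Reason|Service|Version):\s*(.*)$")


def parse_reset_reasons(text):
    """Return one dict per numbered entry; separator lines are ignored."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            entries.append({"index": int(m.group(1)), "when": m.group(2).strip()})
            continue
        m = FIELD.match(line)
        if m and entries:
            entries[-1][m.group(1).lower()] = m.group(2).strip()
    return entries


records = parse_reset_reasons(SAMPLE)
unknown = [r["index"] for r in records if r.get("reason") == "Unknown"]
print(unknown)
```

Entries with a reason of Unknown are the ones to correlate with power events or hardware faults, because software has no record of why they occurred.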

1.2 NVRAM syslog

Syslog messages of priority 0, 1, and 2 are also logged to the NVRAM of the Supervisor. After the switch comes back online, the syslog messages in the NVRAM can be displayed using this command. The command and a sample output are shown:

SITE1-AGG1# show log nvram
2012 Nov 17 05:59:51 SITE1-AGG1 %$ VDC-1 %$ %SYSMGR-STANDBY-2-LAST_CORE_BASIC_TRACE: : PID 15681 with message 'Core detected due to hwclock crash'. 
2012 Nov 17 12:07:11 SITE1-AGG1 %$ VDC-1 %$ %CMPPROXY-2-LOG_CMP_UP: Connectivity Management processor(on module 5) is now UP
2012 Nov 17 12:07:56 SITE1-AGG1 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 1 has come online 
2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_OK: Power supply 1 ok (Serial number DTM131000A4)
2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_FANOK: Fan in Power supply 1 ok
2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_OK: Power supply 2 ok (Serial number DTM140700HS)
2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_FANOK: Fan in Power supply 2 ok
2012 Nov 17 12:07:58 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-PS_DETECT: Power supply 3 detected but shutdown (Serial number DTM1413004P)
2012 Nov 17 12:07:59 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-XBAR_DETECT: Xbar 1 detected (Serial number JAF1308ABCS)
2012 Nov 17 12:08:01 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-XBAR_DETECT: Xbar 2 detected (Serial number JAB120600NX)
2012 Nov 17 12:08:02 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-XBAR_DETECT: Xbar 3 detected (Serial number JAF1508AJHN)
2012 Nov 17 12:08:04 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_DETECT: Module 1 detected (Serial number JAB121602HP) Module-Type 10/100/1000 Mbps Ethernet Module Model N7K-M148GT-11
2012 Nov 17 12:08:04 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_PWRUP: Module 1 powered up (Serial number JAB121602HP)
2012 Nov 17 12:08:11 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_DETECT: Module 3 detected (Serial number JAF1441BSED) Module-Type 10 Gbps Ethernet Module Model N7K-M132XP-12
2012 Nov 17 12:08:11 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_DETECT: Module 4 detected (Serial number JAF1542ABML) Module-Type 1/10 Gbps Ethernet Module Model N7K-F132XP-15
2012 Nov 17 12:08:12 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_PWRUP: Module 3 powered up (Serial number JAF1441BSED)
2012 Nov 17 12:08:12 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_PWRUP: Module 4 powered up (Serial number JAF1542ABML)
2012 Nov 17 12:08:15 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_DETECT: Module 10 detected (Serial number JAF1521BNMK) Module-Type 10 Gbps Ethernet XL Module Model N7K-M132XP-12L
2012 Nov 17 12:08:15 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_PWRUP: Module 10 powered up (Serial number JAF1521BNMK)
2012 Nov 17 12:08:30 SITE1-AGG1 %$ VDC-1 %$ %CMPPROXY-STANDBY-2-LOG_CMP_UP: Connectivity Management processor(on module 6) is now UP
2012 Nov 17 12:08:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-FANMOD_FAN_OK: Fan module 1 (Fan1(sys_fan1) fan) ok
2012 Nov 17 12:08:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-FANMOD_FAN_OK: Fan module 2 (Fan2(sys_fan2) fan) ok
2012 Nov 17 12:08:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-FANMOD_FAN_OK: Fan module 3 (Fan3(fab_fan1) fan) ok
2012 Nov 17 12:08:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-FANMOD_FAN_OK: Fan module 4 (Fan4(fab_fan2) fan) ok
2012 Nov 17 12:11:40 SITE1-AGG1 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 2 has come online 
2012 Nov 17 12:12:31 SITE1-AGG1 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 3 has come online 
2012 Nov 17 12:13:21 SITE1-AGG1 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 4 has come online 
2012 Nov 17 13:10:33 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_TEMPMINALRM: Xbar-1 reported minor temperature alarm. Sensor=2 Temperature=43 MinThreshold=42
2012 Nov 17 19:56:35 SITE1-AGG1 %$ VDC-1 %$ %PLATFORM-2-MOD_TEMPOK: Xbar-1 recovered from minor temperature alarm. Sensor=2 Temperature=41 MinThreshold=42

Scanning the NVRAM syslog might provide some more information on the particular failure that caused the switch/Supervisor reload/reset.
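
When scanning a long NVRAM syslog, it can help to split each message into its facility, severity, and mnemonic. The sketch below is based only on the message shapes visible in the sample output, including the optional STANDBY qualifier; the regular expression and the function name parse_message are assumptions, not a documented grammar.

```python
import re

# %FACILITY[-QUALIFIER]-SEVERITY-MNEMONIC: text
# The "%$ VDC-1 %$" prefix does not match because "$" is not a name character.
MSG = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)"
    r"(?:-(?P<qualifier>[A-Z0-9_]+))?"
    r"-(?P<severity>\d)"
    r"-(?P<mnemonic>[A-Z0-9_]+):\s*(?P<text>.*)"
)


def parse_message(line):
    """Return the message fields as a dict, or None if the line has no tag."""
    m = MSG.search(line)
    if not m:
        return None
    fields = m.groupdict()
    fields["severity"] = int(fields["severity"])
    return fields


line = (
    "2012 Nov 17 13:10:33 SITE1-AGG1 %$ VDC-1 %$ "
    "%PLATFORM-2-MOD_TEMPMINALRM: Xbar-1 reported minor temperature alarm. "
    "Sensor=2 Temperature=43 MinThreshold=42"
)
print(parse_message(line)["mnemonic"])
```

Filtering on the mnemonic field (for example, everything that is not a PS_OK, MOD_DETECT, or similar bring-up message) is one way to isolate the events that preceded a reload.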

1.3 Module exceptionlog

The module exceptionlog is a wraparound log of all errors and exceptional conditions on each module. Some exceptions are catastrophic, some partially affect certain ports in a module, and others serve only as warnings. Each log entry records the device that logged the exception, the exception level, the error code, the ports affected, and a timestamp. The exception log is stored in the NVRAM on the Supervisor and can be displayed using this CLI command. A sample output is provided.

SITE1-AGG1# show module internal exceptionlog 
********* Exception info for module 1 ********
exception information --- exception instance 1 ----
Module Slot Number: 1
Device Id         : 10
Device Name       : eobc
Device Errorcode  : 0xc0005043
Device ID         : 00 (0x00)
Device Instance   : 05 (0x05)
Dev Type (HW/SW)  : 00 (0x00)
ErrNum (devInfo)  : 67 (0x43)
System Errorcode  : 0x4042004d EOBC link failure
Error Type        : Warning
PhyPortLayer      : Ethernet
Port(s) Affected  : none
DSAP              : 0 (0x0)
UUID              : 0 (0x0)
Time              : Mon Nov  5 20:39:38 2012
                    (Ticks: 5098948A jiffies) 
 
exception information --- exception instance 2 ----
Module Slot Number: 1
Device Id         : 10
Device Name       : eobc
Device Errorcode  : 0xc0005047
Device ID         : 00 (0x00)
Device Instance   : 05 (0x05)
Dev Type (HW/SW)  : 00 (0x00)
ErrNum (devInfo)  : 71 (0x47)
System Errorcode  : 0x4042004e EOBC heartbeat failure
Error Type        : Warning
PhyPortLayer      : Ethernet
Port(s) Affected  : none
DSAP              : 0 (0x0)
UUID              : 0 (0x0)
Time              : Mon Nov  5 20:39:37 2012
                    (Ticks: 50989489 jiffies)

The exceptionlog provides critical information to troubleshoot errors and exception conditions. Some of the device IDs are listed here.

#define DEV_LINECARD_CTRL 1
#define DEV_SAHARA_FPGA 2
#define DEV_RIVIERA_ASIC 3
#define DEV_LUXOR_ASIC 4
#define DEV_FRONTIER_U_ASIC 5
#define DEV_FRONTIER_D_ASIC 6
#define DEV_ALADDIN_ASIC 7
#define DEV_SSA_ASIC 8
#define DEV_MIRAGE_ASIC 9
#define DEV_EOBC_MAC 10
#define DEV_SUPERVISOR_CTRL 11
#define DEV_BELLAGIO_ASIC 12
#define DEV_SIBYTE 13
#define DEV_FLAMINGO 14
#define DEV_FATW_CTRL 15
#define DEV_MGMT_MAC 16
#define DEV_MOD_RDN_CTRL 17
#define DEV_MOD_ENV 18
#define DEV_GG_FPGA 19
#define DEV_BALLY_MAIN_BOARD 20
#define DEV_BALLY_DAUGHTER_CARD 21
#define DEV_LOCAL_SSO_ASIC 22
#define DEV_REMOTE_SSO_ASIC 23
#define DEV_ID_UD_FIX_FPGA 24
#define DEV_ID_PM_FPGA 25 // PM - Power Mngmnt
#define DEV_ID_SUP_XBUS2 26
#define DEV_MARRIOTT_FPGA 27
#define DEV_REUSE_ME 28
#define DEV_GBIC 29
#define DEV_XGFC_FPGA 30
#define DEV_GNN_FPGA 31
#define DEV_SIBYTE_MEM_EPLD 32
#define DEV_BATTERY 33
#define DEV_IDE_DISK 45
#define DEV_XCVR 46
#define DEV_LINECARD 48
#define DEV_TEMP_SENSOR 49
#define DEV_HIFN_COMP 50
#define DEV_X2 51
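
The list above can be turned into a lookup table so that the numeric Device Id in an exceptionlog entry is translated to a name. This is a minimal sketch that parses a short excerpt of the list; device_table and device_name are hypothetical helpers, and IDs that are not in the list are reported as unknown rather than guessed.

```python
import re

# A few lines copied from the device ID list above.
HEADER_EXCERPT = """\
#define DEV_EOBC_MAC 10
#define DEV_SUPERVISOR_CTRL 11
#define DEV_BELLAGIO_ASIC 12
#define DEV_ID_PM_FPGA 25 // PM - Power Mngmnt
#define DEV_TEMP_SENSOR 49
"""

DEFINE = re.compile(r"^#define\s+(DEV_\w+)\s+(\d+)")


def device_table(text):
    """Build {device id: name} from '#define NAME NUMBER' lines."""
    table = {}
    for line in text.splitlines():
        m = DEFINE.match(line)
        if m:
            table[int(m.group(2))] = m.group(1)
    return table


def device_name(table, dev_id):
    """Look up a device id; IDs outside the list are left unresolved."""
    return table.get(dev_id, f"unknown device id {dev_id}")


table = device_table(HEADER_EXCERPT)
print(device_name(table, 10))
```

For the sample exceptionlog earlier in this section, Device Id 10 resolves to DEV_EOBC_MAC, which agrees with the Device Name of eobc shown in the same entry.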

In the Multilayer Director Switch (MDS) chassis, the supervisor modules are brought up a little differently than the line-card modules. When two supervisors are present in the system and the system is powered up, one of the supervisors becomes active and the other becomes standby. Active Supervisor bring-up and Standby Supervisor bring-up differ, and both are discussed here.

Active Supervisor Bring-up

If there is no active supervisor in the system, the supervisor that boots up defaults to active supervisor. A process called system manager is responsible for loading all the software components in an orderly fashion on the supervisor. One of the first software components that runs on the supervisor is the platform manager. This component loads all the kernel drivers and handshakes with the system manager. On success, system manager starts the rest of the processes based on the internal dependencies between processes.

From module manager’s perspective, Supervisor is just like another line-card module with subtle differences. When platform manager indicates to module manager that the Supervisor is UP, module manager does not wait for Registration. Instead, it informs all the software components that Supervisor is up (also known as Sup Insertion Sequence). All the components will configure the supervisor. If any component comes back with a failure, the supervisor will be rebooted.

Standby Supervisor Bring-up

If there is an active supervisor in the system, the supervisor that is booting up defaults to the standby supervisor state. The standby supervisor needs to mirror the state of the active supervisor. This is achieved by system manager on the active supervisor, which initiates a gsync (global sync) of the active supervisor state to the standby supervisor. Once all the components on the standby are synchronized with those of the active supervisor, module manager is informed that the standby supervisor is up.
Module manager then informs all the software components on the active supervisor to configure the standby supervisor (also known as the Standby Sup Insertion Sequence). Any error from any component during the Standby Sup Insertion Sequence results in a reboot of the standby supervisor.

Active Supervisor Reboot

MDS maintains a lot of debug information during runtime, but whenever a supervisor reboots, much of that information is lost. However, all critical information is stored in non-volatile RAM (NVRAM), which can be used to reconstruct the failure. When an Active Supervisor reboots, the information stored in its NVRAM cannot be obtained until it comes back up. Once the Supervisor is back up, these commands can be used to dump the persistent log:

Switch# show logging nvram
Switch# show system reset-reason
Switch# show module internal exception-log

Example 1: Active Sup Reboot (due to Supervisor Process Crash)

In this example, a Supervisor process (service "xbar") crashed, which caused the Active sup to be rebooted. When the supervisor comes back up, the information stored in the reset-reason clearly indicates why the supervisor rebooted.

switch# show system reset-reason
----- reset reason for module 6 -----
1) At 94009 usecs after Tue Sep 27 18:52:13 2005
Reason: Reset triggered due to HA policy of Reset
Service: Service "xbar"
Version: 2.1(2)

If there is a standby supervisor in the system, the standby supervisor now becomes the active supervisor. Displaying the syslog information on the standby supervisor also provides the same information (although not as explicitly as 'show system reset-reason').

Switch# show logging
2005 Sep 27 18:58:05 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 1225) hasn't caught signal 9 (no core).
2005 Sep 27 18:58:06 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 2349) hasn't caught signal 9 (no core).
2005 Sep 27 18:58:06 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 2352) hasn't caught signal 9 (no core).
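
Repeated SERVICE_CRASHED messages such as those above can be tallied per service to confirm which process was crashing before the reboot. This is a sketch that assumes the message wording shown in this example; crashes_by_service is a hypothetical helper.

```python
import re
from collections import Counter

# The three messages copied from the `show logging` output above.
SAMPLE = """\
2005 Sep 27 18:58:05 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 1225) hasn't caught signal 9 (no core).
2005 Sep 27 18:58:06 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 2349) hasn't caught signal 9 (no core).
2005 Sep 27 18:58:06 172.20.150.204 %SYSMGR-3-SERVICE_CRASHED: Service "xbar" (PID 2352) hasn't caught signal 9 (no core).
"""

CRASH = re.compile(r'%SYSMGR-\d-SERVICE_CRASHED: Service "([^"]+)" \(PID (\d+)\)')


def crashes_by_service(text):
    """Count SERVICE_CRASHED messages per service name."""
    return Counter(m.group(1) for m in CRASH.finditer(text))


print(crashes_by_service(SAMPLE))
```

In this example the same service crashes three times with three different PIDs within about one second, which suggests the service was restarted and crashed again each time before the supervisor was reset.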

Example 2: Active Sup Reboot (due to runtime diagnostic failure)

In this example, Supervisor in slot-6 is active and the arbiter on the Supervisor reports a Fatal Error. When any hardware device reports a Fatal Error, the module that contains the device is rebooted. In this case the Active Supervisor is rebooted. If there is a standby supervisor, the standby supervisor will take over. Syslog messages on the standby supervisor and exception log will have information to identify the source of error.

Switch# show logging
2005 Sep 28 14:17:47 172.20.150.204 %XBAR-5-XBAR_STATUS_REPORT: Module 6 reported status for component 12 code 0x60a02.
2005 Sep 28 14:17:59 172.20.150.204 %PORT-5-IF_UP: Interface mgmt0 on slot 5 is up
2005 Sep 28 14:18:00 172.20.150.204 %CALLHOME-2-EVENT: SUP_FAILURE

switch# show module internal exceptionlog module 6
********* Exception info for module 6 ********

exception information --- exception instance 1 ----
device id: 12
device errorcode: 0x80000020
system time: (1127917068 ticks) Wed Sep 28 14:17:48 2005

error type: FATAL error
Number Ports went bad:
1,2,3,4,5,6

exception information --- exception instance 2 ----
device id: 12
device errorcode: 0x00060a02
system time: (1127917067 ticks) Wed Sep 28 14:17:47 2005

error type: Warning
Number Ports went bad:
1,2,3,4,5,6

In addition, when the rebooted sup comes online again, 'show system reset-reason' also contains relevant information. In this case, module 6 (which was the active sup) was rebooted by SAP 48 with error code 0x80000020. The process that owns this SAP can be obtained with the command 'show system internal mts sup sap 48 description', which reports that the process was xbar-manager.

switch(standby)# show system reset-reason
----- reset reason for module 6 -----
1) At 552751 usecs after Wed Sep 28 14:17:48 2005
Reason: Reset Requested due to Fatal Module Error
Service: lcfail:80000020 sap:48 node:060
Version: 2.1(2)
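The Service line of the reset reason carries the error code and the SAP number used in the follow-up lookup. As a rough illustration, the Python sketch below extracts both and prints the lookup command named in the text. The helper name parse_service_line is hypothetical, and the lcfail/sap/node layout is assumed from this one sample.

```python
import re

# Sample text copied from the reset-reason output above.
RESET_REASON = """\
----- reset reason for module 6 -----
1) At 552751 usecs after Wed Sep 28 14:17:48 2005
Reason: Reset Requested due to Fatal Module Error
Service: lcfail:80000020 sap:48 node:060
Version: 2.1(2)
"""


def parse_service_line(text):
    """Extract error code, SAP, and node from the 'Service:' line, or None."""
    m = re.search(
        r"^Service:\s*lcfail:([0-9a-fA-F]+)\s+sap:(\d+)\s+node:(\S+)",
        text,
        re.M,
    )
    if m is None:
        return None
    return {
        "errorcode": int(m.group(1), 16),  # hex digits without a 0x prefix
        "sap": int(m.group(2)),
        "node": m.group(3),
    }


info = parse_service_line(RESET_REASON)
# Build the follow-up command that identifies the owner of the SAP.
print(f"show system internal mts sup sap {info['sap']} description")
```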

Example 3: Standby Sup Failed to Come Online

In this example, the active supervisor is up and running, and a standby supervisor is plugged into the system. However, the show module output does not indicate that the standby module has ever come up.

switch# show module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
5    0      Supervisor/Fabric-1               DS-X9530-SF1-K9     active *
8    8      IP Storage Services Module                            powered-dn

Mod  Sw           Hw      World-Wide-Name(s) (WWN)
---  -----------  ------  --------------------------------------------------
5    2.1(2)       1.1     --

Mod  MAC-Address(es)                         Serial-Num
---  --------------------------------------  ----------
5    00-0b-be-f7-4d-1c to 00-0b-be-f7-4d-20  JAB070307XG

However, if you log in to the console of the standby supervisor, the prompt shows that it is the standby.

runlog>telnet sw4-ts 2004
Trying 172.22.22.55...
Connected to sw4-ts.cisco.com (172.22.22.55).
Escape character is '^]'.

MDS Switch
login: admin
Password:
Cisco Storage Area Networking Operating System (SAN-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (c) 2002-2005, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained herein are owned by
other third parties and are used and distributed under license.
Some parts of this software are covered under the GNU Public
License. A copy of the license is available at
http://www.gnu.org/licenses/gpl.html.
switch(standby)#

As discussed earlier, when the standby supervisor is inserted into the system, the configuration and the state of all components of the active supervisor are copied over to the standby (gsync). Until this process is complete, the active supervisor does not consider the standby supervisor to be present. To verify whether this process is complete, enter this command on the active supervisor. The output indicates that synchronization is still in progress (and has probably never completed).

switch# show system redundancy status
Redundancy mode
---------------
administrative: HA
operational: None

This supervisor (sup-1)
-----------------------
Redundancy state: Active
Supervisor state: Active
Internal state: Active with HA standby

Other supervisor (sup-2)
------------------------
Redundancy state: Standby
Supervisor state: HA standby
Internal state: HA synchronization in progress
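If this check is repeated across many switches, the redundancy output can be parsed per section. The Python sketch below is one possible way to do that. The helper name parse_sections is hypothetical, and the parsing rules (lines with a colon are key/value pairs, other non-dash lines are section headers) are assumptions based on this sample only.

```python
# Sample text copied from the redundancy status output above.
REDUNDANCY_STATUS = """\
Redundancy mode
---------------
administrative: HA
operational: None

This supervisor (sup-1)
-----------------------
Redundancy state: Active
Supervisor state: Active
Internal state: Active with HA standby

Other supervisor (sup-2)
------------------------
Redundancy state: Standby
Supervisor state: HA standby
Internal state: HA synchronization in progress
"""


def parse_sections(text):
    """Group 'key: value' lines under the most recent section header."""
    sections, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the dashed underline below each header.
        if not line or set(line) == {"-"}:
            continue
        if ":" in line and current is not None:
            key, value = line.split(":", 1)
            sections[current][key.strip()] = value.strip()
        else:
            current = line
            sections[current] = {}
    return sections


status = parse_sections(REDUNDANCY_STATUS)
other = next(v for k, v in status.items() if k.startswith("Other supervisor"))
# The standby is still syncing if its internal state says so.
sync_pending = "synchronization in progress" in other["Internal state"]
print("gsync still in progress on the other supervisor:", sync_pending)
```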

The most likely reason is that one of the software components on the standby failed to synchronize its state with the active supervisor. To verify which processes did not synchronize, enter this command on the active supervisor. In the output, several software components show a Gsync done value of 0 and a Gsync time of N/A, which indicates that they have not completed gsync.

switch# show system internal sysmgr gsyncstats
Name              Gsync done  Gsync time(sec)
----------------  ----------  -------------
aaa               1           0
ExceptionLog      1           0
platform          1           1
radius            1           0
securityd         1           0
SystemHealth      1           0
tacacs            0           N/A
acl               1           0
ascii-cfg         1           1
bios_daemon       0           N/A
bootvar           1           0
callhome          1           0
capability        1           0
cdp               1           0
cfs               1           0
cimserver         1           0
cimxmlserver      0           N/A
confcheck         1           0
core-dmon         1           0
core-client       0           N/A
device-alias      1           0
dpvm              0           N/A
dstats            1           0
epld_upgrade      0           N/A
epp               1           1
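Because the full gsyncstats list can be long, a short filter makes the unsynchronized components easier to spot. The Python sketch below lists every row whose Gsync done column is 0. The helper name pending_gsync is hypothetical, and reading a 0 in that column as "not synchronized" is an assumption drawn from the N/A times in the same rows.

```python
# Rows copied from the gsyncstats output above.
GSYNC_STATS = """\
Name Gsync done Gsync time(sec)
---------------- ---------- -------------
aaa 1 0
ExceptionLog 1 0
platform 1 1
radius 1 0
securityd 1 0
SystemHealth 1 0
tacacs 0 N/A
acl 1 0
ascii-cfg 1 1
bios_daemon 0 N/A
bootvar 1 0
callhome 1 0
capability 1 0
cdp 1 0
cfs 1 0
cimserver 1 0
cimxmlserver 0 N/A
confcheck 1 0
core-dmon 1 0
core-client 0 N/A
device-alias 1 0
dpvm 0 N/A
dstats 1 0
epld_upgrade 0 N/A
epp 1 1
"""


def pending_gsync(text):
    """Return the names of services whose 'Gsync done' column is 0."""
    pending = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three whitespace-separated columns;
        # the header and the dashed rule never have "0" in column two.
        if len(parts) == 3 and parts[1] == "0":
            pending.append(parts[0])
    return pending


print(pending_gsync(GSYNC_STATS))
```

For the sample above, the filter returns tacacs, bios_daemon, cimxmlserver, core-client, dpvm, and epld_upgrade.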

In addition, the standby supervisor shows that the xbar software component has been restarted 23 times. This is the most likely reason that the standby did not come up.

switch(standby)# show system internal sysmgr service all
Name              UUID        PID     SAP    state  Start count
----------------  ----------  ------  -----  -----  -----------
aaa               0x000000B5  1458    111    s0009  1
ExceptionLog      0x00000050  [NA]    [NA]   s0002  None
platform          0x00000018  1064    39     s0009  1
radius            0x000000B7  1457    113    s0009  1
securityd         0x0000002A  1456    55     s0009  1
vsan              0x00000029  1436    15     s0009  1
vshd              0x00000028  1408    37     s0009  1
wwn               0x00000030  1435    114    s0009  1
xbar              0x00000017  [NA]    [NA]   s0017  23
xbar_client       0x00000049  1434    917    s0009  1
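A repeatedly restarting service stands out by its Start count. The Python sketch below flags any service whose count exceeds a threshold. The helper name restart_counts and the default threshold of 1 are illustrative assumptions, not values taken from the switch software.

```python
# Rows copied from the sysmgr service output above.
SERVICE_TABLE = """\
Name UUID PID SAP state Start count
---------------- ---------- ------ ----- ----- -----------
aaa 0x000000B5 1458 111 s0009 1
ExceptionLog 0x00000050 [NA] [NA] s0002 None
platform 0x00000018 1064 39 s0009 1
radius 0x000000B7 1457 113 s0009 1
securityd 0x0000002A 1456 55 s0009 1
vsan 0x00000029 1436 15 s0009 1
vshd 0x00000028 1408 37 s0009 1
wwn 0x00000030 1435 114 s0009 1
xbar 0x00000017 [NA] [NA] s0017 23
xbar_client 0x00000049 1434 917 s0009 1
"""


def restart_counts(text, threshold=1):
    """Return {service: start_count} for services started more than threshold times."""
    flagged = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have six columns; a non-numeric count ("None") is skipped.
        if len(parts) == 6 and parts[5].isdigit() and int(parts[5]) > threshold:
            flagged[parts[0]] = int(parts[5])
    return flagged


print(restart_counts(SERVICE_TABLE))
```

For the sample above, only xbar (start count 23) is flagged.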

Example 4: Standby Sup is in Powered-up State

In this example, the standby supervisor is inserted in slot 6. The show module command, issued on the active supervisor, shows the standby supervisor in the powered-up state.

switch# show module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
5    0      Supervisor/Fabric-1               DS-X9530-SF1-K9     active *
6    0      Supervisor/Fabric-1                                   powered-up
8    8      IP Storage Services Module                            powered-dn

Mod  Sw           Hw      World-Wide-Name(s) (WWN)
---  -----------  ------  --------------------------------------------------
5    2.1(2)       1.1     --

Mod  MAC-Address(es)                         Serial-Num
---  --------------------------------------  ----------
5    00-0b-be-f7-4d-1c to 00-0b-be-f7-4d-20  JAB070307XG

In this example, show logging does not give any valuable information, and neither does show module internal exceptionlog. However, because all state transitions for a given module are stored in the module manager, you can look at the module manager's state transitions to figure out what is wrong. The internal state transitions are:

Switch# show module internal event-history module 5
64) FSM:<ID(1): Slot 6, node 0x0601> Transition at 563504 usecs after Wed Sep 28 14:44:53 2005
Previous state: [LCM_ST_LC_NOT_PRESENT]
Triggered event: [LCM_EV_PFM_MODULE_SUP_INSERTED]
Next state: [LCM_ST_SUPERVISOR_INSERTED]

65) FSM:<ID(1): Slot 6, node 0x0601> Transition at 563944 usecs after Wed Sep 28 14:44:53 2005
Previous state: [LCM_ST_SUPERVISOR_INSERTED]
Triggered event: [LCM_EV_START_SUP_INSERTED_SEQUENCE]
Next state: [LCM_ST_CHECK_INSERT_SEQUENCE]

66) Event:ESQ_START length:32, at 564045 usecs after Wed Sep 28 14:44:53 2005 
Instance:1, Seq Id:0x2710, Ret:success
Seq Type:SERIAL

67) Event:ESQ_REQ length:32, at 564422 usecs after Wed Sep 28 14:44:53 2005 
Instance:1, Seq Id:0x1, Ret:success
[E_MTS_TX] Dst:MTS_SAP_MIGUTILS_DAEMON(949), Opc:MTS_OPC_LC_INSERTED(1081)

68) Event:ESQ_RSP length:32, at 566174 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x1, Ret:success
[E_MTS_RX] Src:MTS_SAP_MIGUTILS_DAEMON(949), Opc:MTS_OPC_LC_INSERTED(1081)

69) Event:ESQ_REQ length:32, at 566346 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x2, Ret:success
[E_MTS_TX] Dst:MTS_SAP_NTP(72), Opc:MTS_OPC_LC_INSERTED(1081)

70) Event:ESQ_RSP length:32, at 566635 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x2, Ret:success
[E_MTS_RX] Src:MTS_SAP_NTP(72), Opc:MTS_OPC_LC_INSERTED(1081)

71) Event:ESQ_REQ length:32, at 566772 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x3, Ret:success
[E_MTS_TX] Dst:MTS_SAP_XBAR_MANAGER(48), Opc:MTS_OPC_LC_INSERTED(1081)

73) Event:ESQ_RSP length:32, at 586418 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x3, Ret:(null)
[E_MTS_RX] Src:MTS_SAP_XBAR_MANAGER(48), Opc:MTS_OPC_LC_INSERTED(1081)

74) FSM:<ID(1): Slot 6, node 0x0601> Transition at 586436 usecs after Wed Sep 28 14:44:53 2005
Previous state: [LCM_ST_CHECK_INSERT_SEQUENCE]
Triggered event: [LCM_EV_LC_INSERTED_SEQ_FAILED]
Next state: [LCM_ST_CHECK_REMOVAL_SEQUENCE]

75) Event:ESQ_START length:32, at 586611 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x2710, Ret:success
Seq Type:SERIAL

76) Event:ESQ_REQ length:32, at 593649 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x1, Ret:success
[E_MTS_TX] Dst:MTS_SAP_MIGUTILS_DAEMON(949), Opc:MTS_OPC_LC_REMOVED(1082)

77) Event:ESQ_RSP length:32, at 594854 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x1, Ret:success
[E_MTS_RX] Src:MTS_SAP_MIGUTILS_DAEMON(949), Opc:MTS_OPC_LC_REMOVED(1082)

90) FSM:<ID(1): Slot 6, node 0x0601> Transition at 604447 usecs after Wed Sep 28 14:44:53 2005
Previous state: [LCM_ST_CHECK_REMOVAL_SEQUENCE]
Triggered event: [LCM_EV_ALL_LC_REMOVED_RESP_RECEIVED]
Next state: [LCM_ST_LC_FAILURE]

91) FSM:<ID(1): Slot 6, node 0x0601> Transition at 604501 usecs after Wed Sep 28 14:44:53 2005
Previous state: [LCM_ST_LC_FAILURE]
Triggered event: [LCM_EV_LC_INSERTED_SEQ_FAILED]
Next state: [LCM_ST_LC_FAILURE]

92) FSM:<ID(1): Slot 6, node 0x0601> Transition at 604518 usecs after Wed Sep 28 14:44:53 2005
Previous state: [LCM_ST_LC_FAILURE]
Triggered event: [LCM_EV_SUPERVISOR_FAILURE]
Next state: [LCM_ST_LC_NOT_PRESENT]

Curr state: [LCM_ST_LC_NOT_PRESENT]
switch#

At the end of the log above (Index 90 through Index 92), the supervisor is in the failed state (LCM_ST_LC_FAILURE), and the triggered event at Index 91 is LCM_EV_LC_INSERTED_SEQ_FAILED (insertion sequence failed). Going back up the log to find out why the insertion sequence failed shows that it failed right after a response from MTS_SAP_XBAR_MANAGER (Index 73 and Index 74); that response is also the only one with Ret:(null) instead of Ret:success. This indicates that something is wrong with the xbar configuration when the standby supervisor is inserted. More debugging can be done by looking at the internal logs of the failed component (in this case, the xbar component).
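The same walk through the event history can be approximated with a script. The Python sketch below scans an excerpt of the log for ESQ_RSP records whose Ret field is not success and reports the source SAP. The helper name failed_responses is hypothetical, and treating any non-success Ret value as a failure is an assumption based on this one trace.

```python
import re

# Excerpt copied from the event-history output above (Index 69 to Index 74).
EVENT_HISTORY = """\
69) Event:ESQ_REQ length:32, at 566346 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x2, Ret:success
[E_MTS_TX] Dst:MTS_SAP_NTP(72), Opc:MTS_OPC_LC_INSERTED(1081)

70) Event:ESQ_RSP length:32, at 566635 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x2, Ret:success
[E_MTS_RX] Src:MTS_SAP_NTP(72), Opc:MTS_OPC_LC_INSERTED(1081)

71) Event:ESQ_REQ length:32, at 566772 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x3, Ret:success
[E_MTS_TX] Dst:MTS_SAP_XBAR_MANAGER(48), Opc:MTS_OPC_LC_INSERTED(1081)

73) Event:ESQ_RSP length:32, at 586418 usecs after Wed Sep 28 14:44:53 2005
Instance:1, Seq Id:0x3, Ret:(null)
[E_MTS_RX] Src:MTS_SAP_XBAR_MANAGER(48), Opc:MTS_OPC_LC_INSERTED(1081)

74) FSM:<ID(1): Slot 6, node 0x0601> Transition at 586436 usecs after Wed Sep 28 14:44:53 2005
Previous state: [LCM_ST_CHECK_INSERT_SEQUENCE]
Triggered event: [LCM_EV_LC_INSERTED_SEQ_FAILED]
Next state: [LCM_ST_CHECK_REMOVAL_SEQUENCE]
"""


def failed_responses(text):
    """Return (index, source SAP name, SAP number) for non-success ESQ_RSP records."""
    results = []
    # Records are separated by blank lines; each starts with "<index>) ".
    for record in re.split(r"\n\s*\n", text.strip()):
        head = re.match(r"(\d+)\) Event:ESQ_RSP\b", record)
        ret = re.search(r"Ret:(\S+)", record)
        src = re.search(r"Src:(\w+)\((\d+)\)", record)
        if head and ret and src and ret.group(1) != "success":
            results.append((int(head.group(1)), src.group(1), int(src.group(2))))
    return results


for index, sap_name, sap_id in failed_responses(EVENT_HISTORY):
    print(f"Index {index}: non-success response from {sap_name} (SAP {sap_id})")
```

For this excerpt, the only record reported is Index 73 from MTS_SAP_XBAR_MANAGER (SAP 48), which matches the manual reading of the log.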

Contributed by Cisco Engineers

Jane Gao
Cisco TAC Engineer

This Document Applies to These Products

Nexus 7000 Series Switches
