This document describes how to troubleshoot line card (LC) issues on a Cisco Network Convergence System 4000 Series chassis (NCS4016): the faulty states in which a line card can get stuck, the possible reasons, and the recovery actions.
The NCS4016 is a chassis with 16 LC slots (0-15), and each LC has a capacity of 200G. Below is the basic sequence of events when an LC boots up on an NCS4016 chassis.
Before you start troubleshooting, keep a note of the commands below.
With all LCs and RPs operational, the output should look like this:
sysadmin-vm:0_RP0# show platform
Tue Aug 18 19:57:02.631 UTC
Location Card Type HW State SW State Config State
----------------------------------------------------------------------------
0/0 NCS4K-2H-O-K OPERATIONAL N/A NSHUT
0/5 NCS4K-24LR-O-S OPERATIONAL N/A NSHUT
0/6 NCS4K-20T-O-S OPERATIONAL N/A NSHUT
0/8 NCS4K-2H-O-K OPERATIONAL N/A NSHUT
0/RP0 NCS4K-RP OPERATIONAL OPERATIONAL NSHUT
0/FC1 NCS4016-FC-M OPERATIONAL N/A NSHUT
0/CI0 NCS4K-CRAFT OPERATIONAL N/A NSHUT
0/FT0 NCS4K-FTA OPERATIONAL N/A NSHUT
0/FT1 NCS4K-FTA OPERATIONAL N/A NSHUT
0/PT0 NCS4K-AC-PEM OPERATIONAL N/A NSHUT
0/PT1 NCS4K-AC-PEM OPERATIONAL N/A NSHUT
0/EC0 NCS4K-ECU OPERATIONAL N/A NSHUT
sysadmin-vm:0_RP0#
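When many cards are installed, it can help to scan a captured copy of this output for cards that are not in the expected state. The following is a minimal Python sketch, not a Cisco tool. It assumes the output was saved as text, that every column value is a single token (as in the sample above), and that a healthy card shows HW State OPERATIONAL with SW State either OPERATIONAL or N/A. The function names and the POWERED_ON row in the sample data are illustrative.

```python
import re

# Matches data rows such as "0/RP0 NCS4K-RP OPERATIONAL OPERATIONAL NSHUT".
# Assumption: each of the five columns is a single whitespace-free token.
ROW = re.compile(
    r"^(?P<location>\d+/\S+)\s+(?P<card_type>\S+)\s+(?P<hw_state>\S+)"
    r"\s+(?P<sw_state>\S+)\s+(?P<config_state>\S+)$"
)

def parse_show_platform(text):
    """Return one dict per card row; prompts, timestamps and headers are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows

def cards_needing_attention(rows):
    """Cards whose HW state is not OPERATIONAL, or whose SW state is
    neither OPERATIONAL nor N/A (assumed healthy values, see lead-in)."""
    return [
        r for r in rows
        if r["hw_state"] != "OPERATIONAL"
        or r["sw_state"] not in ("OPERATIONAL", "N/A")
    ]

# Illustrative capture; the POWERED_ON row is made up for the example.
SAMPLE = """\
Location Card Type HW State SW State Config State
----------------------------------------------------------------------------
0/0 NCS4K-2H-O-K OPERATIONAL N/A NSHUT
0/RP0 NCS4K-RP OPERATIONAL OPERATIONAL NSHUT
0/FC1 NCS4016-FC-M POWERED_ON N/A NSHUT
"""

for card in cards_needing_attention(parse_show_platform(SAMPLE)):
    print(card["location"], card["hw_state"], card["sw_state"])
```

Any card that this sketch reports would then be examined with the state-specific checks described in the rest of this document.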
Below are a few common faulty HW and SW states in which an LC can get stuck, along with their reasons.
This state suggests that the card failed to boot because of a power issue, or because the CCC power-on interpreter prevented the power-up sequence from completing.
Recommended actions:
Check the output of the command below.
sysadmin-vm:0_RP1# show platform detail location <location of card>
In the output of the above command, look for “Last Event” and “Last Event Reason”; these fields give the reason for the failure.
sysadmin-vm:0_RP1# show platform detail location 0/fc1
Sat Jul 4 13:52:14.782 UTC
Platform Information for 0/FC1
PID : NCS4016-FC-M
Description : "NCS 4016 Agnostic Cross Connect - Multichassis "
VID/SN : V01
HW Oper State : OPERATIONAL
SW Oper State : N/A
Configuration : "NSHUT RST"
HW Version : 1.0
Last Event : HW_EVENT_FAILURE
Last Event Reason : "Intial discovery FAIL EXIT0 , power request on, but not finish ccc-pon startup power_control 0x00000001"
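If the detail output is collected from several cards, the two fields of interest can be pulled out programmatically. The sketch below is an illustration only. It assumes the output was saved as text and that every field appears on its own line in the "Key : value" form shown above. The function name is hypothetical.

```python
def parse_platform_detail(text):
    """Split 'Key : value' lines into a dict, dropping surrounding quotes.

    Lines without the ' : ' separator (prompt, timestamp, banner) are ignored.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip().strip('"').strip()
    return fields

# Excerpt of the output shown above, kept verbatim (including its spelling).
SAMPLE = '''\
Platform Information for 0/FC1
PID : NCS4016-FC-M
HW Oper State : OPERATIONAL
SW Oper State : N/A
Configuration : "NSHUT RST"
Last Event : HW_EVENT_FAILURE
Last Event Reason : "Intial discovery FAIL EXIT0 , power request on, but not finish ccc-pon startup power_control 0x00000001"
'''

detail = parse_platform_detail(SAMPLE)
print("Last Event:", detail.get("Last Event"))
print("Last Event Reason:", detail.get("Last Event Reason"))
```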
For the above failure state, you can also check the status of the CCC controller for the particular location. Check the status of the power zones whose Power Contrl column shows “SET”, since different LCs use different power zones to boot up.
sysadmin-vm:0_RP0# show controller ccc power detail location 0/RP0
Tue Aug 18 18:33:30.245 UTC
Power detail : Zone information for 0/RP0:
---------------------------------------------------------
| Power Zone | Power Status | Power Contrl | Power Fault |
---------------------------------------------------------
| 0 | OK | SET | -- |
| 1 | OK | -- | -- |
| 2 | OK | SET | -- |
| 3 | OK | -- | -- |
| 4 | OK | SET | -- |
| 5 | -- | -- | -- |
| 6 | OK | -- | -- |
| 7 | -- | -- | -- |
| 8 | OK | SET | -- |
sysadmin-vm:0_RP0#
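The zone check described above can be sketched in Python as follows. This is a hedged illustration: it assumes the table was saved as text in the pipe-delimited layout shown above, and it treats a zone as suspect when its Power Contrl is SET but its Power Status is not OK or its Power Fault is not "--". That rule is an assumption drawn from the healthy sample, not a documented Cisco criterion. The function names and the failing row in the sample data are hypothetical.

```python
def parse_power_zones(text):
    """Return one dict per zone row of the 'show controller ccc power detail' table."""
    zones = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have four cells and start with a numeric zone id;
        # header and separator lines do not.
        if len(cells) == 4 and cells[0].isdigit():
            zones.append({
                "zone": int(cells[0]),
                "status": cells[1],
                "control": cells[2],
                "fault": cells[3],
            })
    return zones

def suspect_zones(zones):
    """Zones that are SET but not reporting OK with no fault (assumed rule)."""
    return [
        z for z in zones
        if z["control"] == "SET" and (z["status"] != "OK" or z["fault"] != "--")
    ]

# Zone 2 is a made-up failing row for illustration only.
SAMPLE = """\
---------------------------------------------------------
| Power Zone | Power Status | Power Contrl | Power Fault |
---------------------------------------------------------
| 0 | OK | SET | -- |
| 1 | OK | -- | -- |
| 2 | -- | SET | -- |
"""

for zone in suspect_zones(parse_power_zones(SAMPLE)):
    print("Check power zone", zone["zone"], "status:", zone["status"])
```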
Recovery Actions:
sysadmin-vm:0_RP1# hw-module location <location of card> reload
This state is seen on LCs that have no CPU; all LCs in the NCS4K are CPU-less.
Recommended actions:
sysadmin-vm:0_RP1# show platform
0/FC0 NC4K-FC OPERATIONAL N/A NSHUT
0/FC1 NC4K-FC POWERED_ON N/A NSHUT
0/FC2 NC4K-FC OPERATIONAL N/A NSHUT
In this case, the fabric driver tries to recover the card on its own, but if it cannot detect the ASIC within 3 minutes, the card ends up in the POWERED_ON state.
Check the output below, which shows that all cards present in the chassis have powered on successfully.
sysadmin-vm:0_RP0# show controller ccc power summary
Tue Aug 18 19:09:37.575 UTC
CCC Power Summary :
Location Card Type Power State
----------------------------------------------------------------
0/0 NCS4K-2H-O-K ON
0/FC1 NCS4016-FC-M ON
0/5 NCS4K-24LR-O-S ON
0/6 NCS4K-20T-O-S ON
0/RP0 NCS4K-RP ON
0/8 NCS4K-2H-O-K ON
sysadmin-vm:0_RP0#
Recovery Actions:
sysadmin-vm:0_RP1# hw-module location <location of card> reload
This means that the card has been detected and is in a powered-off state. This can be a valid state when the card has been configured to power off. Otherwise, the card might have been forced to shut down because of an environmental alarm, or the CCC driver might have failed to detect the card because of I2C failures.
Recommended actions:
sysadmin-vm:0_RP1# show platform detail location <location of card>
In the above output, check “Last Event” and “Last Event Reason”.
If the card has been shut down because of an alarm condition, you can confirm the alarms with the command below. The output shows the alarm conditions for the respective card locations.
sysadmin-vm:0_RP0# show alarms
Tue Aug 18 18:03:35.421 UTC
-------------------------------------------------------------------------------
Active Alarms
-------------------------------------------------------------------------------
Location Severity Group Set time Description
-------------------------------------------------------------------------------
0/PT0-PM0 major environ 05/22/70 04:56:45 Power Module Error (PM_NO_INPUT_DETECTED).
0/PT0-PM0 major environ 05/22/70 04:56:45 Power Module Output Disabled (PM_OUTPUT_EN_PIN_HI).
0/PT0-PM2 major environ 05/22/70 04:56:45 Power Module Error (PM_NO_INPUT_DETECTED).
0/PT0-PM2 major environ 05/22/70 04:56:45 Power Module Output Disabled (PM_OUTPUT_EN_PIN_HI).
0/PT0-PM3 major environ 05/22/70 04:56:45 Power Module Error (PM_NO_INPUT_DETECTED).
0/PT0-PM3 major environ 05/22/70 04:56:45 Power Module Output Disabled (PM_OUTPUT_EN_PIN_HI).
0/PT1-PM1 major environ 05/22/70 04:56:45 Power Module Error (PM_NO_INPUT_DETECTED).
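On a chassis with many active alarms, it can be convenient to filter a saved copy of this output by card location. The following is a minimal Python sketch under stated assumptions: the output was captured as text, each alarm occupies one line in the column order shown above, and sub-module locations use a "-" suffix (for example 0/PT0-PM0 under 0/PT0). The function names are hypothetical.

```python
import re

# Location, severity, group, set date, set time, then free-text description.
ALARM = re.compile(
    r"^(?P<location>\d+/\S+)\s+(?P<severity>\S+)\s+(?P<group>\S+)"
    r"\s+(?P<date>\d{2}/\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})"
    r"\s+(?P<description>.+)$"
)

def parse_alarms(text):
    """Return one dict per alarm row; headers and separators are skipped."""
    alarms = []
    for line in text.splitlines():
        match = ALARM.match(line.strip())
        if match:
            alarms.append(match.groupdict())
    return alarms

def alarms_for_card(alarms, card_location):
    """Alarms raised on the card itself or on one of its sub-modules."""
    return [
        a for a in alarms
        if a["location"] == card_location
        or a["location"].startswith(card_location + "-")
    ]

# Rows copied from the output shown above.
SAMPLE = """\
Location Severity Group Set time Description
0/PT0-PM0 major environ 05/22/70 04:56:45 Power Module Error (PM_NO_INPUT_DETECTED).
0/PT0-PM0 major environ 05/22/70 04:56:45 Power Module Output Disabled (PM_OUTPUT_EN_PIN_HI).
0/PT1-PM1 major environ 05/22/70 04:56:45 Power Module Error (PM_NO_INPUT_DETECTED).
"""

for alarm in alarms_for_card(parse_alarms(SAMPLE), "0/PT0"):
    print(alarm["location"], alarm["severity"], alarm["description"])
```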
You can also check the alarms for the specific location of the card.
sysadmin-vm:0_RP1# show alarms brief card location <location of card>
Recovery Actions:
sysadmin-vm:0_RP1# hw-module location <location of card> reload
The most common reason for this state is that the CCC driver failed to read the IDPROM from the card, or detected IDPROM corruption, so the card could not be identified.
sysadmin-vm:0_RP1# show platform
Sat Jul 4 15:27:50.478 UTC
Location Card Type HW State SW State Config State
----------------------------------------------------------------------------
0/1 UNKNOWN POWERED_ON OPERATIONAL NSHUT
Recovery Actions:
sysadmin-vm:0_RP1# hw-module location <location of card> reload
Note that for a card to enter the SW_INACTIVE state, it must first be operational in the HW state. A common reason for a card to get into this state is that the host OS cannot access the SSD.
Recommended actions:
Check whether the card has a control Ethernet connection.
sysadmin-vm:0_RP1# show controller switch reachable
Sat Jul 4 16:31:33.690 UTC
Rack Card Switch
--------------------
0 RP0 RP-SW
0 RP1 RP-SW
0 LC0 LC-SW
0 LC1 LC-SW
0 LC2 LC-SW
0 LC4 LC-SW
If the card does not have a control Ethernet connection, execute the command below to check the Ethernet protocol state toward the card. The protocol state should be either “Active” or “Standby”; any other state indicates a connection issue.
sysadmin-vm:0_RP0# show controller switch mlap location 0/RP0/RP-SW
Tue Aug 18 18:08:22.343 UTC
Rack Card Switch Rack Serial Number
--------------------------------------
0 RP0 RP-SW SAL19058RDF
Phys Admin Protocol Forward Protocol
Port State State State State Type Connects To
--------------------------------------------------------------------------
0 Down Up Down - Internal LC15
1 Down Up Down - Internal LC7
2 Down Up Down - Internal LC13
3 Down Up Down - Internal LC12
4 Down Up Down - Internal LC14
5 Down Up Down - Internal LC11
6 Up Up Active Forwarding Internal LC6
7 Up Up Active Forwarding Internal LC5
8 Down Up Down - Internal LC1
9 Down Up Down - Internal LC4
10 Down Up Down - Internal LC3
11 Down Up Down - Internal LC10
16 Up Up Active Forwarding Internal LC0
17 Up Up Active Forwarding Internal LC8
26 Down Up Down - Internal LC2
27 Down Up Down - Internal LC9
32 Down Up Down - Internal MATESC (RP0 Ctrl)
33 Down Up Down - Internal MATESC (RP1 Ctrl)
36 Up Up Active Forwarding Internal CCC (RP0 Ctrl)
37 Up Up Rem Managed Forwarding Internal CCC (RP1 Ctrl)
52 Down Up Down - External SFP+ 1
54 Down Up Down - External SFP+ 0
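The Active/Standby rule stated above can be applied to a saved copy of this table with a short script. The Python sketch below is illustrative only. It assumes the column order shown above, that the physical and admin states are always Up or Down, and that the Type column is either Internal or External. The protocol state may contain a space (for example "Rem Managed"), which the pattern allows for. The function names are hypothetical.

```python
import re

# Port, Phys State, Admin State, Protocol State (may contain a space),
# Forward State, Type, Connects To.
PORT = re.compile(
    r"^(?P<port>\d+)\s+(?P<phys>Up|Down)\s+(?P<admin>Up|Down)"
    r"\s+(?P<protocol>.+?)\s+(?P<forward>\S+)"
    r"\s+(?P<type>Internal|External)\s+(?P<connects_to>.+)$"
)

def parse_mlap(text):
    """Return one dict per port row of the 'show controller switch mlap' table."""
    ports = []
    for line in text.splitlines():
        match = PORT.match(line.strip())
        if match:
            ports.append(match.groupdict())
    return ports

def link_issues(ports, card):
    """Ports toward `card` whose protocol state is neither Active nor Standby."""
    return [
        p for p in ports
        if p["connects_to"] == card and p["protocol"] not in ("Active", "Standby")
    ]

# Rows copied from the output shown above.
SAMPLE = """\
Port State State State State Type Connects To
6 Up Up Active Forwarding Internal LC6
8 Down Up Down - Internal LC1
37 Up Up Rem Managed Forwarding Internal CCC (RP1 Ctrl)
"""

ports = parse_mlap(SAMPLE)
for card in ("LC6", "LC1"):
    bad = link_issues(ports, card)
    print(card, "connection issue" if bad else "ok")
```

Note that a slot with no card installed also shows a Down port, so a result for an empty slot is expected rather than a fault.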
Recovery Actions:
If you have confirmed that the port is down, you can also try to access the card CPU console and check whether the card is responsive. Upon access, the card displays messages that suggest why it went into the SW_INACTIVE state.
sysadmin-vm:0_RP1# attach location <location of card>
As a last resort, re-image the card. Consult a technical expert before you perform this step.
#reimage_chassis -s <slot id>