Introduction
This document describes how to identify memory issues in ASR5K-PSC-32G (Packet Services Card 2 (PSC2)) and ASR5K-PSC-64G (Packet Services Card 3 (PSC3)) cards. The typical symptom of the problem is that the card resets itself. All of the information required to troubleshoot is available in the Show Support Detail (SSD) output.
Prerequisites
Requirements
Cisco recommends that you have knowledge of the CLI of the Aggregation Services Router 5000 (ASR5K).
Components Used
This document is not restricted to specific software and hardware versions.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
Memory Issues
Either the Packet Services Card 2 (PSC2) or the Packet Services Card 3 (PSC3) might reset due to either a kernel crash or a missing heartbeat.
Kernel Crash
A kernel crash can happen when the card experiences multiple Correctable Memory Errors or a single Uncorrectable Memory Error. In order to identify whether the issue is a kernel crash, follow these steps:
- In the SSD, check the show crash list output for a kernel crash:
<snip>
******** show crash list *******
== =================== ======= ========== =========== ================
#  Time                Process Card/CPU/  SW          HW_SER_NUM
                               PID        VERSION     SMC / Crash Card
== =================== ======= ========== =========== ================
86 2012-Jun-07+18:28:21 sessmgr 15/0/04453 12.2(42876) PLB30103469/PLB40098624
87 2012-Jun-15+04:02:34 kernel 16/0/NA 12.2(NA) PLB30103469/PLB39098500
88 2012-Jun-15+04:50:38 sessmgr 02/0/04372 12.2(42876) PLB30103469/PLB40098609
<snip>
- Once the crash number for the kernel crash is identified, check the crash details for that crash number in the SSD. In the previous example, Crash 87 occurred on Card 16.
<snip>
********************* CRASH #87 ***********************
2.6.38-staros-v3-hw-64 #1 SMP PREEMPT Wed Apr 18 14:32:38 EDT 2012 1 0 PLB39098500 428760, label "": Corrected error (Socket=0 channel=0 dimm=0)
<4>[52569.305831] EDAC MC0: CE row 0, channel 0, label "": Corrected error (Socket=0 channel=0 dimm=0)
<4>[52569.314566] EDAC MC0: CE row 0, channel 0, label "": Corrected error (Socket=0 channel=0 dimm=0)
<4>[52579.321273] edac_mc_handle_fbd_ce: 449 callbacks suppressed
<4>[52579.326820] EDAC MC0: CE row 0, channel 0, label "": Corrected error (Socket=0 channel=0 dimm=0)
…………..
<0>[52668.605978] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 8: fe0000000001009f
<0>[52668.614014] [Hardware Error]: TSC 66946ea1b05a ADDR 44f307280 MISC 4c43688800045941
<0>[52668.621767] [Hardware Error]: PROCESSOR 0:106a5 TIME 1339732830 SOCKET 0 APIC 0
<0>[52668.629028] [Hardware Error]: Machine check: Processor context corrupt
<0>[52668.635520] Kernel panic - not syncing: Fatal Machine check
<snip>
The "EDAC MC0: CE row 0, channel 0, label "": Corrected error" messages, together with the "Kernel panic" message, indicate a memory failure, and a Return Material Authorization (RMA) is required.
Memory Not Detected
The PSC2/PSC3 line card might reboot with an indication of a missing heartbeat. One possible cause is that the system detected a bad DIMM. When a bad DIMM is detected, the card tries to reboot multiple times before it goes into the Offline state.
For the PSC2 card, these errors appear in the debug console card x cpu 0 output found in the SSD:
1338537199.891 card 6-cpu0: ERROR: Memory size 24576 MB for cpu0 not matching with value 32768 MB in IDEEPROM
1338537199.891 card 6-cpu0:
1338537199.891 card 6-cpu0: ERROR: Bus 255 CPU 0 Chan 0 DIMM 0 NotPresent
The syslog also records this error:
The Packet Services Card 2 with serial number SAD154403TT in slot 6 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=[CPU0 MB: CFE_FAILURE] [CPU1] [CPU2] [CPU3] [GPIO_IN: 00,ff,ff,ff] [GPIO_OUT: 01,ff,00,ff]
For the PSC3 card, this warning appears in the debug console card x cpu 0 output found in the SSD:
1412147713.299 card 7-cpu0: WARNING: Memory size 49152 MB for cpu0 not matching with value 65536 MB in IDEEPROM
The card that experiences this problem needs to be replaced.
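The debug console messages shown above follow a regular pattern, so they can be extracted from an SSD automatically. The following Python sketch is a minimal example under stated assumptions: the function names are hypothetical, and the regular expressions are modeled only on the PSC2 and PSC3 sample lines in this document, so other software releases might word these messages differently.

```python
import re

# Modeled on: "card 6-cpu0: ERROR: Memory size 24576 MB for cpu0 not matching
# with value 32768 MB in IDEEPROM" (PSC3 logs the same text as a WARNING).
MISMATCH = re.compile(
    r"card (?P<card>\d+)-cpu(?P<cpu>\d+): (?:ERROR|WARNING): "
    r"Memory size (?P<found>\d+) MB for cpu\d+ not matching with value "
    r"(?P<expected>\d+) MB in IDEEPROM"
)
# Modeled on: "card 6-cpu0: ERROR: Bus 255 CPU 0 Chan 0 DIMM 0 NotPresent".
NOT_PRESENT = re.compile(
    r"card (?P<card>\d+)-cpu\d+: ERROR: "
    r"Bus \d+ CPU \d+ Chan (?P<chan>\d+) DIMM (?P<dimm>\d+) NotPresent"
)


def memory_mismatches(console_text):
    """Return one dict per size mismatch, including the missing amount in MB."""
    results = []
    for m in MISMATCH.finditer(console_text):
        found = int(m.group("found"))
        expected = int(m.group("expected"))
        results.append({
            "card": int(m.group("card")),
            "cpu": int(m.group("cpu")),
            "found_mb": found,
            "expected_mb": expected,
            "missing_mb": expected - found,
        })
    return results


def missing_dimms(console_text):
    """Return (card, channel, dimm) for every DIMM reported as NotPresent."""
    return [
        (int(m.group("card")), int(m.group("chan")), int(m.group("dimm")))
        for m in NOT_PRESENT.finditer(console_text)
    ]
```

For the PSC2 sample above, the detected memory is 8192 MB short of the 32768 MB recorded in the IDEEPROM. For the PSC3 sample, 16384 MB of the expected 65536 MB is missing.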