Introduction
This document describes how to troubleshoot an npumgr restart that is triggered by EZprmSER_CheckError on the Aggregation Services Router 5500 (ASR5500).
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
- Hardware knowledge of ASR5500
- StarOS
Components Used
This document is not restricted to specific software and hardware versions.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Problem
After a Network Processing Unit (NPU) memory error is detected, the npumgr task can restart with a segmentation fault that has this signature:
Fatal Signal 11: Segmentation fault
PC: [0d8e2647/X] EZprmSER_CheckError()
Faulty address: 0x272e95d4
Signal from: kernel
Signal detail: address not mapped to object
Process: card=7 cpu=1 arch=X pid=16579 argv0=npumgr
Crash time: 2017-Oct-03+01:02:32 UTC
Recent errno: 115 Operation now in progress
Build_number: 67999
Stack (22120@0x0xffc3a000):
[0d8e2647/X] EZprmSER_CheckError() sp=0xffc3aaf0
[0d78c348/X] EZapiPrm_SERCheckError() sp=0xffc3ab14
[004f4ba5/X] aresEZevents_MemSErr_Handler() sp=0xffc3ad94
[004f688b/X] aresEZevents_Handler() sp=0xffc3f104
[0d77206c/X] EZdev_ISRTask() sp=0xffc3f138
[0c25eb02/X] sn_loop_run() sp=0xffc3f5e8
[0bf451c5/X] main() sp=0xffc3f658
This restart can be seen on both Data Processing Card (DPC) and Management Input/Output (MIO) cards.
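When many crash records need to be triaged, the signature above can be matched programmatically. The following is a minimal sketch in Python, not a Cisco tool: the function names `parse_crash_record` and `matches_signature` are hypothetical, and the regular expressions assume that records follow the exact layout of the crash record shown in this document.

```python
import re
from datetime import datetime, timezone

# Excerpt of the crash record shown in this document.
SAMPLE = """\
Fatal Signal 11: Segmentation fault
PC: [0d8e2647/X] EZprmSER_CheckError()
Faulty address: 0x272e95d4
Signal from: kernel
Signal detail: address not mapped to object
Process: card=7 cpu=1 arch=X pid=16579 argv0=npumgr
Crash time: 2017-Oct-03+01:02:32 UTC
"""


def parse_crash_record(text):
    """Extract key fields from one crash record (hypothetical helper).

    Returns None if the PC, Process, or Crash time line is missing.
    """
    pc = re.search(r"^PC: \[[0-9a-f]+/\w+\] (\w+)\(\)", text, re.M)
    proc = re.search(r"^Process: card=(\d+) cpu=(\d+) .*argv0=(\S+)", text, re.M)
    when = re.search(r"^Crash time: (\S+) UTC", text, re.M)
    if not (pc and proc and when):
        return None
    return {
        "function": pc.group(1),
        "card": int(proc.group(1)),
        "cpu": int(proc.group(2)),
        "process": proc.group(3),
        # Timestamp layout assumed from the sample: YYYY-Mon-DD+HH:MM:SS UTC
        "time": datetime.strptime(
            when.group(1), "%Y-%b-%d+%H:%M:%S"
        ).replace(tzinfo=timezone.utc),
    }


def matches_signature(record):
    """True if the record is an npumgr crash in EZprmSER_CheckError."""
    return (
        record is not None
        and record["process"] == "npumgr"
        and record["function"] == "EZprmSER_CheckError"
    )


record = parse_crash_record(SAMPLE)
print(record["card"], matches_signature(record))
```

The card number and crash time extracted this way are the two values needed to track whether the same card reports the issue again.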
The events that lead to the restart can be summarized as follows:
- A memory error (single-bit ECC error) is detected on the NPU.
- The NPU raises an interrupt to notify the npumgr driver that a memory error has been detected.
- npumgr attempts to scan the memory for the error, and the restart occurs in the npudriver code.
The NPU is restarted any time a parity (or memory) error is observed on the NPU of a card; the node reacts in much the same way when the npumgr task restarts. Because the trigger for the restart is known to be the NPU interrupt for an observed memory error, this restart is considered a transient hardware error.
Note that a cosmic ray or an electrostatic discharge can cause a bit to flip in memory; this is the kind of error that ECC is designed to correct.
A single ECC error on a card is therefore an expected event.
If a card has more than one ECC error within a month, then the card is suspected to have a hardware issue.
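The recurrence rule above can be expressed as a small check over a history of crash events. This is an illustrative sketch only: the function name `suspect_cards` is hypothetical, the event list is made-up sample data (only the first entry comes from the crash record in this document), and "a month" is approximated as 30 days, which is an assumption rather than a Cisco-defined threshold.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: "within a month" is approximated as a 30-day window.
WINDOW = timedelta(days=30)


def suspect_cards(events, window=WINDOW):
    """Return card numbers with two or more events inside the window.

    events: iterable of (card_number, crash_time) tuples.
    """
    by_card = defaultdict(list)
    for card, when in events:
        by_card[card].append(when)

    suspects = []
    for card, times in by_card.items():
        times.sort()
        # Compare each event with the next one in time order on the same card.
        if any(later - earlier <= window
               for earlier, later in zip(times, times[1:])):
            suspects.append(card)
    return sorted(suspects)


# Hypothetical history: card 7 repeats within 30 days, card 3 does not.
events = [
    (7, datetime(2017, 10, 3, 1, 2, 32)),
    (7, datetime(2017, 10, 20, 14, 0, 0)),
    (3, datetime(2017, 1, 1, 0, 0, 0)),
    (3, datetime(2017, 6, 1, 0, 0, 0)),
]
print(suspect_cards(events))  # [7]
```

A card that appears in the result is a candidate for closer hardware inspection; a card with a single event, or with events far apart, is not flagged.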
Solution
Cisco recommends that you monitor the card and replace it if a similar issue is seen on the same card within a month.
The segmentation fault is triggered during the fast NPU restart that recovers from the memory error, while debug data is being collected.
The segmentation fault itself is addressed by Cisco bug ID CSCvu44031.