Troubleshoot Card Restart due to NPUMgr Restart due to EZprmSER_CheckError

Updated: February 8, 2022

Document ID: 217678

Introduction

This document describes how to troubleshoot an npumgr restart that is triggered by EZprmSER_CheckError on the Aggregation Services Router 5500 (ASR5500).

Prerequisites

Requirements

Cisco recommends that you have knowledge of these topics:

Hardware knowledge of ASR5500
StarOS

Components Used

This document is not restricted to specific software and hardware versions.

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.

Problem

After a Network Processing Unit (NPU) memory error is detected, the npumgr process can fail with a segmentation fault that has this signature:

Fatal Signal 11: Segmentation fault
  PC: [0d8e2647/X] EZprmSER_CheckError()
  Faulty address: 0x272e95d4
  Signal from: kernel
  Signal detail: address not mapped to object
  Process: card=7 cpu=1 arch=X pid=16579 argv0=npumgr
  Crash time: 2017-Oct-03+01:02:32 UTC
  Recent errno: 115 Operation now in progress
  Build_number: 67999
  Stack (22120@0x0xffc3a000):
    [0d8e2647/X] EZprmSER_CheckError() sp=0xffc3aaf0
    [0d78c348/X] EZapiPrm_SERCheckError() sp=0xffc3ab14
    [004f4ba5/X] aresEZevents_MemSErr_Handler() sp=0xffc3ad94
    [004f688b/X] aresEZevents_Handler() sp=0xffc3f104
    [0d77206c/X] EZdev_ISRTask() sp=0xffc3f138
    [0c25eb02/X] sn_loop_run() sp=0xffc3f5e8
    [0bf451c5/X] main() sp=0xffc3f658
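When many crash records must be reviewed, a small script can check whether a record matches this signature. The sketch below is illustrative only: it assumes the crash record has already been saved as text in the layout shown above, and the helper name `match_signature` and the regular expressions are hypothetical, not part of StarOS.

```python
import re

# Hypothetical patterns based only on the field layout of the record shown above.
PROCESS_RE = re.compile(r"Process:\s+card=(\d+)\s+cpu=(\d+)\s+.*argv0=(\S+)")
PC_RE = re.compile(r"PC:\s+\[[0-9a-fA-F]+/\w+\]\s+(\w+)\(\)")
TIME_RE = re.compile(r"Crash time:\s+(\S+)")


def match_signature(crash_text):
    """Return card, cpu and crash time if the record matches, else None."""
    proc = PROCESS_RE.search(crash_text)
    pc = PC_RE.search(crash_text)
    if not proc or not pc:
        return None
    # The signature is a signal 11 in npumgr with the PC in EZprmSER_CheckError.
    if "Fatal Signal 11" not in crash_text:
        return None
    if proc.group(3) != "npumgr" or pc.group(1) != "EZprmSER_CheckError":
        return None
    when = TIME_RE.search(crash_text)
    return {
        "card": int(proc.group(1)),
        "cpu": int(proc.group(2)),
        "crash_time": when.group(1) if when else None,
    }


SAMPLE = """Fatal Signal 11: Segmentation fault
  PC: [0d8e2647/X] EZprmSER_CheckError()
  Process: card=7 cpu=1 arch=X pid=16579 argv0=npumgr
  Crash time: 2017-Oct-03+01:02:32 UTC
"""

print(match_signature(SAMPLE))
```

The returned card number and crash time are the two values needed to track repeat occurrences on the same card.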

This restart can be seen on both Data Processing Card (DPC) and Management Input/Output (MIO) cards.

The events that lead to the restart can be summarized as follows:

A memory error (a single-bit Error Correcting Code (ECC) error) is detected on the NPU.
The NPU interrupts the npumgr driver to signal that a memory error has been detected.
npumgr attempts to scan the memory for the error and restarts from within the NPU driver code.

The NPU is restarted any time a parity (or memory) error is observed on the NPU of a card. This is similar to how the node reacts when the npumgr task restarts. Because the trigger for the restart is known to be the NPU interrupt for an observed memory error, this restart is considered a transient hardware error.

Note that a cosmic ray or an electrostatic discharge can cause a bit to flip in memory. This is the kind of error that ECC is designed to correct.
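To illustrate how an error-correcting code recovers from a single flipped bit, here is a minimal sketch that uses the textbook Hamming(7,4) code. This is an assumption chosen for illustration only: the ECC scheme actually used by the NPU memory is not described in this document, and the function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(codeword):
    """Fix at most one flipped bit; return (data_bits, error_position)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 0 means no error was detected
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single-bit flip at position 5
print(correct(word))
```

A single flipped bit is located by the syndrome and corrected, which is why an isolated event of this kind is treated as transient rather than as a sign of failed hardware.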

A single ECC error on a card is an expected event.
If a card has more than one ECC error within a month, the card is suspected to have a hardware issue.
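The rule above can be applied to a list of recorded events. The following sketch is a hypothetical helper, not a Cisco tool. It assumes each event is a (card number, crash time) pair with the time in the format printed in the crash record, and it approximates "a month" as 30 days.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)  # assumption: "a month" approximated as 30 days


def parse_crash_time(text):
    # Format as printed in the crash record, for example 2017-Oct-03+01:02:32
    return datetime.strptime(text, "%Y-%b-%d+%H:%M:%S")


def suspect_cards(events):
    """Return sorted card numbers with two or more events within WINDOW."""
    by_card = {}
    for card, stamp in events:
        by_card.setdefault(card, []).append(parse_crash_time(stamp))
    suspects = []
    for card, times in by_card.items():
        times.sort()
        # Compare each event with the next one on the same card.
        if any(later - earlier <= WINDOW for earlier, later in zip(times, times[1:])):
            suspects.append(card)
    return sorted(suspects)


events = [
    (7, "2017-Oct-03+01:02:32"),
    (7, "2017-Oct-20+10:00:00"),
    (3, "2017-Jan-01+00:00:00"),
    (3, "2017-Jun-01+00:00:00"),
]
print(suspect_cards(events))
```

In this sample data, card 7 has two events 17 days apart and is flagged, while card 3 has two events months apart and is not.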

Solution

Cisco recommends that you monitor the card and replace it if a similar issue is seen on the same card within a month.

The segmentation fault is triggered during the fast NPU restart that is performed for memory error recovery, while debug data is being collected.

The segmentation fault is addressed by Cisco bug ID CSCvu44031.

Revision History

Revision	Publish Date	Comments
1.0	08-Feb-2022	Initial Release

Contributed by Cisco Engineers

Ayodele Adebawojo
Cisco TAC Engineer
