Managing Router Hardware

This chapter describes how to clear the memory and partitions of an RP or a line card before an RMA (Return Merchandise Authorization).

Clear the Memory and the Partitions of a Card

Users can clear the memory and the partitions of an RP or a line card before an RMA (Return Merchandise Authorization). Do this when the card is defective and has to be returned.

When a line card or an RP is identified for an RMA, the user might want to remove the card from the chassis. However, service personnel may not be available onsite to remove the card immediately. By clearing the memory and partitions of the card, users can wipe the RP or the line card, power it off, and leave it in the slot until it can be removed.

After clearing the memory, do not reload the card or the chassis until the card is removed from the slot. Reloading reboots the card or the chassis, which restores the data that was erased.

In a dual-RP system, the reset of the standby RP must be executed from the active RP. After the standby RP has been cleared, it is shut down to prevent it from resynchronizing with the active RP.

Prerequisites

The XR VM and the System Admin VM must be operational.


Note


Do not perform an admin process restart, card reload, or an FPD upgrade while clearing the memory and partitions of the card.


Commands

Run the following commands from the XR VM to clear the memory and the partitions of the card:

  • show zapdisk locations - displays the locations where the memory and the partitions can be cleared.

  • zapdisk start location <location-id> - clears the memory and the partition from the specified location.
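As a minimal sketch of how the two commands relate, the following Python snippet parses captured output of show zapdisk locations and builds one zapdisk start location command per card. The function names and the idea of capturing the output as text are assumptions for illustration. The sample text is copied from the example output in this section.

```python
import re

# Sample output copied from the example in this section.
SAMPLE_OUTPUT = """\
0/RP1      Fully qualified location specification
0/2        Fully qualified location specification
0/6        Fully qualified location specification
all        all locations
"""

# A location looks like 0/2 or 0/RP1. The keyword "all" is skipped on purpose
# so that every card is cleared by an explicit command.
LOCATION_RE = re.compile(r"^\s*(\d+/\w+)\s")


def parse_zapdisk_locations(output: str) -> list[str]:
    """Return the card locations listed by 'show zapdisk locations'."""
    locations = []
    for line in output.splitlines():
        match = LOCATION_RE.match(line)
        if match:
            locations.append(match.group(1))
    return locations


def build_zapdisk_commands(locations: list[str]) -> list[str]:
    """Build one 'zapdisk start location' command string per location."""
    return [f"zapdisk start location {loc}" for loc in locations]


if __name__ == "__main__":
    for command in build_zapdisk_commands(parse_zapdisk_locations(SAMPLE_OUTPUT)):
        print(command)
```

The sketch only produces command strings. How they are sent to the router is left out deliberately.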

The following steps explain how to clear the memory or the partition of the card:

  1. Display the Locations to Clear the Memory - Use the show zapdisk locations command to display the locations to be cleared.

    The following example shows how to display the location:

    <! Display the Locations to Clear the Memory !>
    
    Router# show zapdisk locations
    0/RP1      Fully qualified location specification
    0/2        Fully qualified location specification
    0/6        Fully qualified location specification
    all        all locations
    
    Router#conf t
    Router(config)#logging console disable 
    Router(config)#commit 
    Router(config)#end
                  
  2. Clear the Memory or Partition - Use the zapdisk start location command to clear the memory or partition.

    The following example shows how to clear the memory or partition:

    <! Clear the Memory or Partition !>
    
    Router#zapdisk start location 0/2
    Action on designated location is in progress, please monitor admin syslog.
    Action on designated location is in progress, please monitor admin syslog.
    
    Router#zapdisk start location 0/6
    Action on designated location is in progress, please monitor admin syslog.
    Action on designated location is in progress, please monitor admin syslog.
    
    Router#zapdisk start location 0/RP1
    Action on designated location is in progress, please monitor admin syslog.
    Action on designated location is in progress, please monitor admin syslog.
    
  3. Verify that the memory and the partitions are cleared - Use the show logging, show platform, show controller card-mgr, and show reboot-history card location commands to verify that the memory and partitions are cleared.

    The following example shows how to verify if the memory and partitions are successfully cleared:

    
    <!Verification!>
    
    sysadmin-vm:0_RP0# show controller card-mgr event-history brief location 0/2
    
    Card Event History for: 0/2
      
    Card Event History as seen by Master (0/RP0)
      Current State: ZAPDISK_POWERED_ON
    
      DATE   TIME (UTC)    STATE                   EVENT                           
      -----  ------------  ----------------------  ------------------------------
      03/04  22:26:13.400  ZAPDISK_RESET           ev_dm1_power_up_ok              
      03/04  22:26:02.630  SYSADMIN_VM_GOING_DOWN  ev_zapdisk_req                  
      03/04  22:25:46.660  CARD_READY              ev_sysadmin_vm_shutdown         
      03/04  21:58:14.842  OIR_INSERT_NOTIF        if_card_local_init_done         
      03/04  21:58:14.841  WAIT_CARD_INFO          ev_card_info_synced             
      03/04  21:57:57.219  WAIT_SYSADMIN_VM_READY  ev_sysadmin_vm_booted           
      03/04  21:57:45.305  HOST_OS_RUNNING         ev_sysadmin_vm_started          
      03/04  21:57:24.371  BOOTLDR_STARTED         ev_host_os_started              
      03/04  21:56:04.619  CARD_POWERED_ON         ev_bootldr_started              
      03/04  21:55:58.212  CARD_IN_RESET           ev_dm1_power_up_ok              
      03/04  21:55:45.397  IMAGE_INSTALLED         ev_ios_install_reset            
      03/04  21:55:44.896  INSTALLING_IMAGE        ev_ios_install_done             
      03/04  21:54:53.045  WAIT_FIRST_EVENT        ev_ios_install_started          
      03/04  21:54:53.043  IDLE                    ev_present                      
    
    sysadmin-vm:0_RP0# show controller card-mgr event-history brief location 0/6
    Card Event History for: 0/6
      
    Card Event History as seen by Master (0/RP0)
      Current State: ZAPDISK_POWERED_ON
    
      DATE   TIME (UTC)    STATE                   EVENT                           
      -----  ------------  ----------------------  ------------------------------
      03/04  22:26:14.309  ZAPDISK_RESET           ev_dm1_power_up_ok              
      03/04  22:26:03.722  SYSADMIN_VM_GOING_DOWN  ev_zapdisk_req                  
      03/04  22:25:49.563  CARD_READY              ev_sysadmin_vm_shutdown         
      03/04  22:00:32.071  OIR_INSERT_NOTIF        if_card_local_init_done         
      03/04  22:00:32.070  WAIT_CARD_INFO          ev_card_info_synced             
      03/04  22:00:10.314  WAIT_SYSADMIN_VM_READY  ev_sysadmin_vm_booted           
      03/04  21:59:57.999  HOST_OS_RUNNING         ev_sysadmin_vm_started          
      03/04  21:59:35.271  BOOTLDR_STARTED         ev_host_os_started              
      03/04  21:58:18.244  CARD_POWERED_ON         ev_bootldr_started              
      03/04  21:58:11.836  CARD_IN_RESET           ev_dm1_power_up_ok              
      03/04  21:57:59.122  IMAGE_INSTALLED         ev_ios_install_reset            
      03/04  21:57:58.521  INSTALLING_IMAGE        ev_ios_install_done             
      03/04  21:54:53.045  WAIT_FIRST_EVENT        ev_ios_install_started          
      03/04  21:54:53.043  IDLE                    ev_present                      
    
    Aborted: by user
    sysadmin-vm:0_RP0# show controller card-mgr event-history brief location 0/RP1
    Card Event History for: 0/RP1
      
    Card Event History as seen by Master (0/RP0)
      Current State: ZAPDISK_POWERED_ON
    
      DATE   TIME (UTC)    STATE                   EVENT                           
      -----  ------------  ----------------------  ------------------------------
      03/04  22:26:24.730  ZAPDISK_RESET           ev_dm1_power_up_ok              
      03/04  22:26:04.503  HOST_GOING_DOWN         ev_zapdisk_req                  
      03/04  22:26:00.677  SYSADMIN_VM_GOING_DOWN  ev_host_shutdown_started        
      03/04  22:25:54.770  CARD_READY              ev_sysadmin_vm_shutdown         
      03/04  21:57:28.878  OIR_INSERT_NOTIF        if_card_local_init_done         
      03/04  21:57:28.878  WAIT_CARD_INFO          ev_card_info_synced             
      03/04  21:57:11.443  WAIT_SYSADMIN_VM_READY  ev_sysadmin_vm_booted           
      03/04  21:56:59.228  HOST_OS_RUNNING         ev_sysadmin_vm_started          
      03/04  21:56:31.882  BOOTING_IOS_IMAGE       ev_host_os_started              
      03/04  21:56:26.466  BOOTING_IOS_IMAGE       ev_boot_kernel                  
      03/04  21:56:12.834  CARD_POWERED_ON         ev_bootldr_ssd_boot             
      03/04  21:56:09.730  CARD_IN_RESET           ev_dm1_power_up_ok              
      03/04  21:55:48.701  IMAGE_INSTALLED         ev_ios_install_reset            
      03/04  21:55:47.700  INSTALLING_IMAGE        ev_ios_install_done             
      03/04  21:54:53.046  WAIT_FIRST_EVENT        ev_ios_install_started          
    Aborted: by user
    sysadmin-vm:0_RP0# show logging | i card_mgr                                 
    0/RP0/ADMIN0:Mar  4 22:26:03.240 : card_mgr[3211]: %DRIVER-CARD_MGR-5-ZAPDISK_STARTED : Card cleanup started for location 0/2  
    0/RP0/ADMIN0:Mar  4 22:26:04.332 : card_mgr[3211]: %DRIVER-CARD_MGR-5-ZAPDISK_STARTED : Card cleanup started for location 0/6  
    0/RP0/ADMIN0:Mar  4 22:26:04.503 : card_mgr[3211]: %DRIVER-CARD_MGR-5-ZAPDISK_STARTED : Card cleanup started for location 0/RP1  
    sysadmin-vm:0_RP0# show reboot-history card location 0/2
    Card Reboot History for 0/2
     0
      Reason Code  22
      Reason       "ZAPDISK by user request"
      Src Location 0/RP0
      Src Name     card_mgr
    sysadmin-vm:0_RP0# show reboot-history card location 0/6 
    
    Card Reboot History for 0/6
     0
        Reason Code  22
      Reason       "ZAPDISK by user request"
      Src Location 0/RP0
      Src Name     card_mgr
    sysadmin-vm:0_RP0# show reboot-history card location 0/RP1
    Card Reboot History for 0/RP1
     0
        Reason Code  22
      Reason       "ZAPDISK by user request"
      Src Location 0/RP0
      Src Name     card_mgr
    
  4. Power-Down the Card - Shut down the card.
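The check in step 3 can be scripted. The sketch below, in Python, inspects captured output of show controller card-mgr event-history brief. It treats a current state of ZAPDISK_POWERED_ON together with an ev_zapdisk_req event as the success signal. That criterion is an assumption drawn from the sample output above, not a documented rule, and the function name is hypothetical.

```python
import re

# Abbreviated sample based on the verification output in step 3.
SAMPLE_HISTORY = """\
Card Event History for: 0/2

Card Event History as seen by Master (0/RP0)
  Current State: ZAPDISK_POWERED_ON

  DATE   TIME (UTC)    STATE                   EVENT
  -----  ------------  ----------------------  ------------------------------
  03/04  22:26:13.400  ZAPDISK_RESET           ev_dm1_power_up_ok
  03/04  22:26:02.630  SYSADMIN_VM_GOING_DOWN  ev_zapdisk_req
  03/04  22:25:46.660  CARD_READY              ev_sysadmin_vm_shutdown
"""

# One history row: date, time, state, event.
EVENT_ROW_RE = re.compile(r"^\s*\d{2}/\d{2}\s+[\d:.]+\s+(\S+)\s+(\S+)\s*$")


def zapdisk_completed(history: str) -> bool:
    """Return True if the history shows a zapdisk request and the final zapdisk state."""
    state = re.search(r"Current State:\s*(\S+)", history)
    events = [
        m.group(2)
        for line in history.splitlines()
        if (m := EVENT_ROW_RE.match(line))
    ]
    return (
        state is not None
        and state.group(1) == "ZAPDISK_POWERED_ON"
        and "ev_zapdisk_req" in events
    )


if __name__ == "__main__":
    print(zapdisk_completed(SAMPLE_HISTORY))
```

A similar check could be applied to the show reboot-history output by looking for the reason string "ZAPDISK by user request".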

System Logs during RSP Switchover

Table 1. Feature History Table

Feature Name: RSP Slot Location in Syslog

Release Information: Release 7.8.1

Feature Description: When an RSP switchover occurs, the router logs the active RSP slot location in the syslog message. This helps you quickly identify the active RSP slot from your router's system log messages.

In earlier releases, the RSP switchover Syslog message didn't include the active RSP slot location.

In the event of an RSP switchover, the router logs the following syslog messages:

RP/0/1/CPU0:Feb 19 09:08:00.655 UTC: rmf_svr[436]: %HA-REDCON-6-GO_ACTIVE : this card going active
RP/1/1/CPU0:Mar 8 11:43:29.041 UTC: rmf_svr[147]: %HA-REDCON-6-GO_STANDBY : this card going standby, location RP/1/1/CPU0

From Cisco IOS XR Release 7.8.1 onwards, the RSP switchover syslog message for the active RSP includes the RSP slot location as well:

RP/0/1/CPU0:Mar  8 11:42:50.876 UTC: rmf_svr[165]: %HA-REDCON-6-GO_ACTIVE : this card going active , location RP/0/1/CPU0:
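For log processing, the following Python sketch extracts the event type and, when present, the slot location from these messages. The regular expression is an assumption based only on the sample messages above, and the function name is hypothetical. Messages from earlier releases simply yield no location.

```python
import re
from typing import Optional

# Sample messages copied from this section.
OLD_MESSAGE = (
    "RP/0/1/CPU0:Feb 19 09:08:00.655 UTC: rmf_svr[436]: "
    "%HA-REDCON-6-GO_ACTIVE : this card going active"
)
NEW_MESSAGE = (
    "RP/0/1/CPU0:Mar  8 11:42:50.876 UTC: rmf_svr[165]: "
    "%HA-REDCON-6-GO_ACTIVE : this card going active , location RP/0/1/CPU0:"
)

# The location part is optional so that both message formats are accepted.
REDCON_RE = re.compile(
    r"%HA-REDCON-6-(?P<event>GO_ACTIVE|GO_STANDBY)\s*:\s*this card going \w+"
    r"(?:\s*,\s*location\s+(?P<location>[\w/]+))?"
)


def parse_redcon(message: str) -> Optional[dict]:
    """Return the event and location of a switchover message, or None if it is not one."""
    match = REDCON_RE.search(message)
    if match is None:
        return None
    return {"event": match.group("event"), "location": match.group("location")}


if __name__ == "__main__":
    print(parse_redcon(OLD_MESSAGE))
    print(parse_redcon(NEW_MESSAGE))
```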

Configurable Fault Recovery Attempts

Table 2. Feature History Table

Feature Name: Configurable Fault Recovery Attempts

Release Information: Release 24.3.1

Feature Description: Introduced in this release on: NCS 5500 modular routers (NCS 5500 line cards; NCS 5700 line cards [Mode: Compatibility; Native])

You can now reduce the risk of traffic loss by controlling fault recovery attempts by a line card, fabric card, shelf controller, or route processor. This feature allows you to specify the number of recovery attempts before the card is shut down, offering greater control and flexibility.

This feature is disabled by default.

The feature introduces these changes:

CLI:

YANG DATA Model:

This feature is supported on the Cisco NCS 5500 series modular routers and on these line cards:

  • NC57-48Q2D-S

  • NC57-48Q2D-SE-S

  • NC57-36H6D-S

  • NC55-24X100G-SE

  • NC55-36X100G-A-SE

  • NC55-MOD-A-S

  • NC55-MOD-A-SE-S

  • NC55-36X100G-S

  • NC55-36X100G

Fault Recovery Mechanism

Fault recovery is a mechanism designed to handle faults in hardware components such as line cards, fabric cards, shelf controllers, and route processors. This mechanism ensures that a faulty card does not enter a continuous cycle of automatic recovery attempts, which can lead to operational instability.

How Fault Recovery Mechanism Works

Critical alarms trigger a reload of the affected hardware module as a recovery action. Reloading a card shifts its traffic to an alternate path, and after the reload completes, the traffic streams move back. If the errors persist, this back-and-forth switching of traffic can continue until someone brings down the card. Depending on the configured features, the overall capacity, and the traffic load on the router, traffic can be lost if one hardware module keeps reloading and momentarily takes the traffic load each time.

In previous releases, when a line card, fabric card, shelf controller, or route processor experienced a fault, it triggered fault recovery and rebooted itself to become operational again. The fault recovery mechanism was time based: the fault recovery count was reset to zero if the card remained operational for more than an hour, and the faulty card was shut down only after the count exceeded five. Because power-related faults were infrequent and the count kept resetting to zero, the card never entered the shut down mode. As a result, the card always attempted fault recovery.
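The time-based behavior can be illustrated with a small Python simulation. This is a simplified model under stated assumptions, not the router's implementation: fault times are given in seconds, the gap between two consecutive faults stands in for the time the card stayed operational, and the names are hypothetical.

```python
RESET_WINDOW_S = 3600  # count resets after more than an hour of operation
MAX_RECOVERIES = 5     # card is shut down once the count exceeds this value


def legacy_fault_recovery(fault_times_s: list[float]) -> list[str]:
    """Return the action taken for each fault under the time-based scheme."""
    actions = []
    count = 0
    last_fault = None
    for t in fault_times_s:
        # More than an hour without a fault: the count starts over.
        if last_fault is not None and t - last_fault > RESET_WINDOW_S:
            count = 0
        count += 1
        last_fault = t
        if count > MAX_RECOVERIES:
            actions.append("shutdown")
            break
        actions.append("reload")
    return actions


if __name__ == "__main__":
    # Faults two hours apart: the count resets every time, so the card never shuts down.
    print(legacy_fault_recovery([i * 7200 for i in range(10)]))
    # Faults ten minutes apart: the sixth fault exceeds the limit.
    print(legacy_fault_recovery([i * 600 for i in range(10)]))
```

With infrequent faults the model reloads the card every time, which matches the behavior described above.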

How to Control Fault Recovery Attempts

Rather than reloading hardware modules for fault recovery while the router is carrying live traffic, it is better to power down the affected hardware module and notify users so that they can attempt recovery in a controlled environment. You can set the number of recovery attempts that are allowed before the card is shut down.

Cisco IOS XR Software Release 24.3.1 introduces the hw-module fault-recovery command, which sets the number of times fault recovery can take place before a faulty card is permanently shut down.

For example, if you configure the fault recovery count to 1, the router reboots the faulty module for the first recovery attempt. On the next attempt, the router shuts down or powers off the faulty module.
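The example can be expressed as a short Python model. It is a sketch under assumptions: each fault counts as one recovery attempt, the card is shut down when the attempts exceed the configured count, and the counter is never reset over time, which the text does not state explicitly. The function name is hypothetical.

```python
def configured_fault_recovery(num_faults: int, count: int) -> list[str]:
    """Return the action taken for each fault when a recovery count is configured."""
    actions = []
    attempts = 0
    for _ in range(num_faults):
        attempts += 1
        if attempts > count:
            # Attempts exceeded the configured count: stop recovering.
            actions.append("shutdown")
            break
        actions.append("reload")
    return actions


if __name__ == "__main__":
    # Count 1: one reload, then shutdown on the next fault.
    print(configured_fault_recovery(3, 1))
```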

Restrictions and Guidelines for Configurable Fault Recovery Attempts

Guidelines for Configurable Fault Recovery Attempts

Follow these guidelines for configuring fault recovery attempts:

  • Configure the hw-module fault-recovery location command for each location individually. To apply this configuration to all the locations, specify each location individually and then save your changes.

  • This feature is disabled by default.

Restrictions for Configurable Fault Recovery Attempts

These restrictions apply when you configure fault recovery attempts:

  • When you configure the hw-module fault-recovery location command, the router prompt displays the location all option, but it is not functional.

Configure Fault Recovery Attempts

Configuration Examples

This configuration example shows how to set the fault recovery count to 1 on the fabric card 0/FC0.

Router#configure
Router(config)#hw-module fault-recovery location 0/FC0 count 1
Router(config)#commit

This configuration example shows how to configure fault recovery on multiple locations.

Router#configure
Router(config)#hw-module fault-recovery location 0/FC1 count 1
Router(config)#hw-module fault-recovery location 0/RP0 count 2
Router(config)#hw-module fault-recovery location 0/FT2 count 1
Router(config)#commit

Note


If you do not specify the fault-recovery count for a location, the router sets the count value to three by default.
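Because each location has to be configured individually, the configuration lines can be generated from a table of locations. The Python sketch below does this and rejects the non-functional all keyword. The function name and the input format are assumptions for illustration, and the sketch does not validate the count range because this section does not state it.

```python
def build_fault_recovery_config(counts: dict[str, int]) -> list[str]:
    """Build one hw-module fault-recovery line per location, followed by commit."""
    lines = []
    for location, count in counts.items():
        # The CLI offers 'location all', but it is not functional.
        if location.lower() == "all":
            raise ValueError("'location all' is not functional; list each location")
        lines.append(f"hw-module fault-recovery location {location} count {count}")
    lines.append("commit")
    return lines


if __name__ == "__main__":
    for line in build_fault_recovery_config({"0/FC1": 1, "0/RP0": 2, "0/FT2": 1}):
        print(line)
```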


Verification

Use the show running-config formal | include hw-module command to display the number of recovery attempts a card can initiate before it is shut down.

Router#show running-config formal | include hw-module 
Building configuration... 
hw-module fault-recovery location 0/FC0 count 1

The following system log is generated when the number of fault recovery attempts on the card exceeds the configured count:

Router:Dec 4 15:44:25.247 PST: shelfmgr[121]: %PLATFORM-SHELFMGR-4-CARD_SHUTDOWN : Shutting down 0/FC0: Fault retry attempts exceeded configured count(1)

Use the show reboot history command to find out why the card was shut down. In the following example, the reason is Fault retry attempts exceeded configured count(1).

Router:ios#show reboot history location 0/FC0 detail
Mon Dec  4 15:44:55.827 PST
--------------------------------------------------------------------------------
No   Attribute       Value
--------------------------------------------------------------------------------
1    Time (PST)      Dec 04 2023 15:44:22
     Cause Code      0x0800000d
     Cause String    REBOOT_CAUSE_FM
     Graceful Reload No
     Kdump Requested No
     Reason          Fault retry attempts exceeded configured count(1)  

Use the show platform command to see the current state of the card that was shut down by the fault recovery handling feature.

Router:ios#show platform 
Mon Oct 2 21:08:03.383 UTC 

Location  Card Type        HW State         SW State       Config State
----------------------------------------------------------------------------
0/0       NC55-36X100G     POWERED_OFF     SW_INACTIVE     NSHUT
0/1       NC55-36X100G-S   OPERATIONAL     OPERATIONAL     NSHUT
0/2       NC55-36X100G-S   OPERATIONAL     OPERATIONAL     NSHUT
0/3       NC55-36X100G     OPERATIONAL     OPERATIONAL     NSHUT
0/6       NC55-36X100G-S   OPERATIONAL     OPERATIONAL     NSHUT
0/8       NC55-36X100G-S   OPERATIONAL     OPERATIONAL     NSHUT
0/15      NC55-36X100G     OPERATIONAL     OPERATIONAL     NSHUT
0/RP0     NC55-RP          OPERATIONAL     OPERATIONAL     NSHUT
0/RP1     NC55-RP          OPERATIONAL     OPERATIONAL     NSHUT
0/FC0     NC55-5516-FC     SHUT DOWN       OPERATIONAL     NSHUT
0/FC1     NC55-5516-FC     OPERATIONAL     OPERATIONAL     NSHUT
0/FC2     NC55-5516-FC     OPERATIONAL     OPERATIONAL     NSHUT
0/FC3     NC55-5516-FC     OPERATIONAL     OPERATIONAL     NSHUT
0/FC4     NC55-5516-FC     OPERATIONAL     OPERATIONAL     NSHUT
0/FC5     NC55-5516-FC     OPERATIONAL     OPERATIONAL     NSHUT
0/FT0     NC55-5516-FAN    OPERATIONAL     N/A             NSHUT
0/FT1     NC55-5516-FAN    OPERATIONAL     N/A             NSHUT
0/FT2     NC55-5516-FAN    OPERATIONAL     N/A             NSHUT
0/PM0     N9K-PAC-3000W-B  OPERATIONAL     N/A             NSHUT
Router#
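To pick out affected cards from show platform output, a script can split each row on runs of two or more spaces, so that a state such as SHUT DOWN stays in one field. The following Python sketch does this. It assumes the column layout shown above, uses an abbreviated copy of that output as sample data, and the function name is hypothetical.

```python
import re

# Abbreviated sample based on the show platform output above.
SAMPLE_PLATFORM = """\
Location  Card Type        HW State         SW State       Config State
----------------------------------------------------------------------------
0/0       NC55-36X100G     POWERED_OFF     SW_INACTIVE     NSHUT
0/1       NC55-36X100G-S   OPERATIONAL     OPERATIONAL     NSHUT
0/FC0     NC55-5516-FC     SHUT DOWN       OPERATIONAL     NSHUT
0/FT0     NC55-5516-FAN    OPERATIONAL     N/A             NSHUT
"""


def non_operational_cards(output: str) -> dict[str, str]:
    """Map each location whose HW State is not OPERATIONAL to that state."""
    cards = {}
    for line in output.splitlines():
        # Columns are separated by at least two spaces; single spaces stay inside a field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 5 or not re.match(r"^\d+/\w+$", fields[0]):
            continue  # header, separator, timestamp, or prompt line
        location, _card_type, hw_state, _sw_state, _config_state = fields
        if hw_state != "OPERATIONAL":
            cards[location] = hw_state
    return cards


if __name__ == "__main__":
    print(non_operational_cards(SAMPLE_PLATFORM))
```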