This document describes the Nexus 7000 Supervisor 2/2E compact flash failure issue documented in software defect CSCus22805, all the possible failure scenarios, and recovery steps.
Before attempting any workaround, it is strongly recommended to have physical access to the device in case a physical reseat is required. Some workarounds require a reload, for which console access may be required, and it is always recommended to perform these workarounds with console access to the supervisor in order to observe the boot process.
If any of the steps in the workarounds fail, contact Cisco TAC for additional possible recovery options.
Each N7K supervisor 2/2E is equipped with 2 eUSB flash devices in RAID1 configuration, one primary and one mirror. Together they provide non-volatile repositories for boot images, startup configuration and persistent application data.
Over a period of months or years in service, one of these devices can become disconnected from the USB bus, which causes the RAID software to drop the device from the configuration. The supervisor can still function normally with one of the two devices. However, when the second device also drops out of the array, the bootflash is remounted as read-only, meaning that you cannot save configuration or files to the bootflash, and the standby cannot sync from the active in the event that it is reloaded.
There is no operational impact on systems running in a dual flash failure state; however, a reload of the affected supervisor is needed to recover from this state. Furthermore, any changes to the running configuration are not reflected in the startup configuration and would be lost in the event of a power outage.
These symptoms have been seen:
switch# show diagnostic result module 5
Current bootup diagnostic level: complete
Module 5: Supervisor module-2 (Standby)
Test results: (. = Pass, F = Fail, I = Incomplete,
U = Untested, A = Abort, E = Error disabled)
1) ASICRegisterCheck-------------> .
2) USB---------------------------> .
3) NVRAM-------------------------> .
4) RealTimeClock-----------------> .
5) PrimaryBootROM----------------> .
6) SecondaryBootROM--------------> .
7) CompactFlash------------------> F <=====
8) ExternalCompactFlash----------> .
9) PwrMgmtBus--------------------> U
10) SpineControlBus---------------> .
11) SystemMgmtBus-----------------> U
12) StatusBus---------------------> U
13) StandbyFabricLoopback---------> .
14) ManagementPortLoopback--------> .
15) EOBCPortLoopback--------------> .
16) OBFL--------------------------> .
dcd02.ptfrnyfs# copy running-config startup-config
[########################################] 100%
Configuration update aborted: request was aborted
switch %MODULE-4-MOD_WARNING: Module 2 (Serial number: JAF1645AHQT) reported warning due to The compact flash power test failed in device DEV_UNDEF (device error 0x0)
switch %DEVICE_TEST-2-COMPACT_FLASH_FAIL: Module 1 has failed test CompactFlash 20 times on device Compact Flash due to error The compact flash power test failed
To diagnose the current state of the compact flash cards, use these internal commands. Note that the commands do not parse (there is no context-sensitive help or tab completion), so they must be typed out completely:
switch# show system internal raid | grep -A 1 "Current RAID status info"
switch# show system internal file /proc/mdstat
If there are two supervisors in the chassis, you also need to check the status of the standby supervisor to determine which failure scenario applies. Do this by prepending the command with the "slot x" keyword, where "x" is the slot number of the standby supervisor. This allows you to run the command remotely on the standby.
switch# slot 2 show system internal raid | grep -A 1 "Current RAID status info"
switch# slot 2 show system internal file /proc/mdstat
These commands give a large amount of RAID statistics and events, but only the current RAID information is of interest.
In the line "RAID data from CMOS", look at the hex value that follows 0xa5. This value indicates how many flashes currently face an issue.
For example:
switch# show system internal raid | grep -A 1 "Current RAID status info"
Current RAID status info:
RAID data from CMOS = 0xa5 0xc3
From this output, look at the number beside 0xa5, which here is 0xc3. Use the key below to determine whether the primary compact flash, the alternate (mirror), or both have failed. The output above shows 0xc3, which indicates that both the primary and the alternate compact flashes have failed.
0xf0 | No failures reported |
0xe1 | Primary flash failed |
0xd2 | Alternate (or mirror) flash failed |
0xc3 | Both primary and alternate failed |
In the "/proc/mdstat" output ensure that all disks are showing as "U", which represents "U"p:
switch# slot 2 show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdc6[0] sdb6[1]
77888 blocks [2/1] [_U]
md5 : active raid1 sdc5[0] sdb5[1]
78400 blocks [2/1] [_U]
md4 : active raid1 sdc4[0] sdb4[1]
39424 blocks [2/1] [_U]
md3 : active raid1 sdc3[0] sdb3[1]
1802240 blocks [2/1] [_U]
In this scenario the output shows that the primary compact flash is not up ([_U]). A healthy output shows all blocks as [UU].
Note: Both outputs must be healthy (0xf0 and [UU]) for the supervisor to be diagnosed as healthy. If the CMOS data shows 0xf0 but /proc/mdstat shows [_U], the box is still unhealthy.
To determine which scenario applies, use the commands from the "Diagnosis" section above and correlate the results with a scenario letter below. Using the columns, match the number of failed compact flashes on each supervisor.
For example, if the code is 0xe1 on the Active supervisor and 0xd2 on the Standby, that is "1 Fail" on the Active and "1 Fail" on the Standby, which is scenario letter "D".
Single Supervisor:
Scenario Letter | Active Supervisor | Active Supervisor Code |
A | 1 Fail | 0xe1 or 0xd2 |
B | 2 Fails | 0xc3 |
Dual Supervisors:
Scenario Letter | Active Supervisor | Standby Supervisor | Active Supervisor Code | Standby Supervisor Code |
C | 0 Fail | 1 Fail | 0xf0 | 0xe1 or 0xd2 |
D | 1 Fail | 0 Fail | 0xe1 or 0xd2 | 0xf0 |
E | 1 Fail | 1 Fail | 0xe1 or 0xd2 | 0xe1 or 0xd2 |
F | 2 Fails | 0 Fail | 0xc3 | 0xf0 |
G | 0 Fail | 2 Fails | 0xf0 | 0xc3 |
H | 2 Fails | 1 Fail | 0xc3 | 0xe1 or 0xd2 |
I | 1 Fail | 2 Fails | 0xe1 or 0xd2 | 0xc3 |
J | 2 Fails | 2 Fails | 0xc3 | 0xc3 |
Recovery Scenario:
1 Fail on the Active
Steps to Resolution:
1. Load the Flash Recovery Tool to repair the bootflash. You can download the recovery tool from CCO under Utilities for the N7000 platform or use the link below:
The tool is distributed as a tar.gz compressed file. Uncompress it to find the .gbin recovery tool and a .pdf readme. Review the readme file, then load the .gbin tool onto the bootflash of the N7K. While this recovery is designed to be non-impacting and can be performed live, TAC recommends performing it in a maintenance window in case any unexpected issues arise. Once the file is on the bootflash, you can run the recovery tool as shown below:
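A minimal sketch of getting the tool onto the supervisor and starting it follows. The TFTP server address and the management VRF are placeholders for your environment, and the .gbin filename matches the version referenced later in this document; adjust both for the release you actually downloaded.
! Copy the extracted .gbin tool to the active supervisor bootflash (server address is an example)
switch# copy tftp://192.0.2.10/n7000-s2-flash-recovery-tool.10.0.2.gbin bootflash: vrf management
! Start the recovery tool; it detects a disconnected eUSB disk and attempts to re-add it to the RAID
switch# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin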
You can check the recovery status with "show system internal file /proc/mdstat"; it can take several minutes to fully repair all disks to a [UU] status. An example of a recovery in progress looks as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdd6[2] sdc6[0]
77888 blocks [2/1] [U_] <-- "U_" represents the broken state
resync=DELAYED
md5 : active raid1 sdd5[2] sdc5[0]
78400 blocks [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdd4[2] sdc4[0]
39424 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdd3[2] sdc3[0]
1802240 blocks [2/1] [U_]
[=>...................] recovery = 8.3% (151360/1802240) finish=2.1min speed=12613K/sec
unused devices: <none>
After the recovery is finished, the output should look as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 :active raid1 sdd6[1] sdc6[0]
77888 blocks [2/2] [UU] <-- "UU" represents the fixed state
md5 :active raid1 sdd5[1] sdc5[0]
78400 blocks [2/2] [UU]
md4 :active raid1 sdd4[1] sdc4[0]
39424 blocks [2/2] [UU]
md3 :active raid1 sdd3[1] sdc3[0]
1802240 blocks [2/2] [UU]
unused devices: <none>
Recovery Scenario:
2 Fails on the Active
Steps to Resolution:
Note: It is commonly seen in instances of dual flash failure that a software reload might not fully recover the RAID and that running the recovery tool or performing subsequent reloads might be required. In almost every occurrence, the issue has been resolved with a physical reseat of the supervisor module. Therefore, if physical access to the device is possible, after backing up the configuration externally, you can attempt a quick recovery that has the highest chance of succeeding by physically reseating the supervisor when you are ready to reload the device. This fully removes power from the supervisor and should allow recovery of both disks in the RAID. Proceed to Step 3 if the physical reseat recovery is only partial, or to Step 4 if it is entirely unsuccessful, that is, the system does not fully boot.
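Since the bootflash is read-only in this state, back up the configuration before the reseat and verify the RAID afterward. A brief sketch follows; the TFTP target is prompted for, and the verification commands are the same diagnostics used earlier in this document.
! Back up the running configuration of all VDCs externally (prompts for the TFTP server and filename)
switch# copy running-config tftp: vdc-all
! After the supervisor is reseated and the system has booted, confirm both RAID members are back
switch# show system internal raid | grep -A 1 "Current RAID status info"
switch# show system internal file /proc/mdstat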
Recovery Scenario:
0 Fails on the Active
1 Fail on the Standby
Steps to Resolution:
In the scenario of a dual supervisor setup with no flash failures on the active and a single failure on the standby, a non-impacting recovery can be performed.
1. As the active has no failures and the standby has only a single failure, the Flash Recovery Tool can be loaded onto the active and executed. Once the tool runs, it automatically copies itself to the standby and attempts to resync the array. The recovery tool can be downloaded here:
Once you have downloaded the tool, unzipped it, and uploaded it to the bootflash of the box, you will need to execute the following command to begin the recovery:
# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
The tool will start running and detect disconnected disks and attempt to resync them with the RAID array.
You can check the recovery status with:
# show system internal file /proc/mdstat
Verify that recovery is proceeding; it can take several minutes to fully repair all disks to a [UU] status. An example of a recovery in progress looks as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdd6[2] sdc6[0]
77888 blocks [2/1] [U_] <-- "U_" represents the broken state
resync=DELAYED
md5 : active raid1 sdd5[2] sdc5[0]
78400 blocks [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdd4[2] sdc4[0]
39424 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdd3[2] sdc3[0]
1802240 blocks [2/1] [U_]
[=>...................] recovery = 8.3% (151360/1802240) finish=2.1min speed=12613K/sec
unused devices: <none>
After recovery is finished it should look as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 :active raid1 sdd6[1] sdc6[0]
77888 blocks [2/2] [UU] <-- "UU" represents the correct state
md5 :active raid1 sdd5[1] sdc5[0]
78400 blocks [2/2] [UU]
md4 :active raid1 sdd4[1] sdc4[0]
39424 blocks [2/2] [UU]
md3 :active raid1 sdd3[1] sdc3[0]
1802240 blocks [2/2] [UU]
unused devices: <none>
After all disks are in [UU], the RAID array is fully back up with both disks sync'd.
2. If the Flash Recovery Tool is unsuccessful, then, since the active has both disks up, the standby should be able to sync to the active successfully on reload.
Therefore, in a scheduled window, perform an "out-of-service module x" for the standby supervisor. It is recommended to have console access to the standby to observe the boot process in case any unexpected issues arise. After the supervisor is down, wait a few seconds and then perform "no poweroff module x" for the standby. Wait until the standby fully boots into the "ha-standby" status.
After the standby is back up, check the RAID with "slot x show system internal raid" and "slot x show system internal file /proc/mdstat".
If both disks are not fully back up after reload, run the recovery tool again.
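A brief sketch of this step 2 sequence, assuming the standby sits in slot 6 (adjust the slot number for your chassis):
switch# configure terminal
! Take the standby out of service, wait a few seconds, then power it back on
switch(config)# out-of-service module 6
switch(config)# no poweroff module 6
switch(config)# end
! Wait for the standby to reach "ha-standby" in show module, then re-check its RAID
switch# show module
switch# slot 6 show system internal raid | grep -A 1 "Current RAID status info"
switch# slot 6 show system internal file /proc/mdstat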
3. If the reload and the recovery tool are not successful, it is recommended to attempt a physical reseat of the standby module within the window to try to clear the condition. If the physical reseat is not successful, try performing an "init system" from switch boot mode, following the password recovery steps to break into that mode during boot. If still unsuccessful, contact TAC to attempt a manual recovery.
Recovery Scenario:
1 Fail on the Active
0 Fails on the Standby
Steps to Resolution:
In the scenario of a dual supervisor setup with 1 flash failure on the active and no failures on the standby, a non-impacting recovery can be performed with the Flash Recovery Tool.
1. As the standby has no failures and the active has only a single failure, the Flash Recovery Tool can be loaded onto the active and executed. Once the tool runs, it automatically copies itself to the standby and attempts to resync the array. The recovery tool can be downloaded here:
Once you have downloaded the tool, unzipped it, and uploaded it to the bootflash of the active, you will need to execute the following command to begin the recovery:
# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
The tool will start running and detect disconnected disks and attempt to resync them with the RAID array.
You can check the recovery status with:
# show system internal file /proc/mdstat
Verify that recovery is proceeding; it can take several minutes to fully repair all disks to a [UU] status. An example of a recovery in progress looks as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdd6[2] sdc6[0]
77888 blocks [2/1] [U_] <-- "U_" represents the broken state
resync=DELAYED
md5 : active raid1 sdd5[2] sdc5[0]
78400 blocks [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdd4[2] sdc4[0]
39424 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdd3[2] sdc3[0]
1802240 blocks [2/1] [U_]
[=>...................] recovery = 8.3% (151360/1802240) finish=2.1min speed=12613K/sec
unused devices: <none>
After recovery is finished it should look as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 :active raid1 sdd6[1] sdc6[0]
77888 blocks [2/2] [UU] <-- "UU" represents the correct state
md5 :active raid1 sdd5[1] sdc5[0]
78400 blocks [2/2] [UU]
md4 :active raid1 sdd4[1] sdc4[0]
39424 blocks [2/2] [UU]
md3 :active raid1 sdd3[1] sdc3[0]
1802240 blocks [2/2] [UU]
unused devices: <none>
After all disks are in [UU], the RAID array is fully back up with both disks sync'd.
2. If the Flash Recovery Tool is unsuccessful, the next step is to perform a "system switchover" to fail over the supervisor modules in a maintenance window.
Therefore, in a scheduled window, perform a "system switchover". It is recommended to have console access to observe the boot process in case any unexpected issues arise. Wait until the standby fully boots into the "ha-standby" status.
After the standby is back up, check the RAID with "slot x show system internal raid" and "slot x show system internal file /proc/mdstat".
If both disks are not fully back up after reload, run the recovery tool again.
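A brief sketch of this step, assuming the supervisor with the failure is in slot 5 and becomes the standby after the switchover (slot numbers are placeholders):
! Fail over to the standby; the formerly active supervisor reloads and should resync its RAID
switch# system switchover
! From the new active, wait for slot 5 to reach "ha-standby", then verify its RAID
switch# show module
switch# slot 5 show system internal raid | grep -A 1 "Current RAID status info"
switch# slot 5 show system internal file /proc/mdstat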
3. If the reload and the recovery tool are not successful, it is recommended to attempt a physical reseat of the standby module within the window to try to clear the condition. If the physical reseat is not successful, try performing an "init system" from switch boot mode, following the password recovery steps to break into that mode during boot. If still unsuccessful, contact TAC to attempt a manual recovery.
Recovery Scenario:
1 Fail on the Active
1 Fail on the Standby
Steps to Resolution:
In the event of a single flash failure on both the active and the standby, a non-impacting workaround can still be accomplished.
1. As no supervisor is in a read-only state, the first step is to attempt using the Flash Recovery Tool.
The recovery tool can be downloaded here:
Once you have downloaded the tool, unzipped it, and uploaded it to the bootflash of the active, you will need to execute the following command to begin the recovery:
# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
It automatically detects disconnected disks on the active and attempts repair, and it also copies itself to the standby to detect and correct failures there.
You can check the recovery status with:
# show system internal file /proc/mdstat
Verify that recovery is proceeding; it can take several minutes to fully repair all disks to a [UU] status. An example of a recovery in progress looks as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdd6[2] sdc6[0]
77888 blocks [2/1] [U_] <-- "U_" represents the broken state
resync=DELAYED
md5 : active raid1 sdd5[2] sdc5[0]
78400 blocks [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdd4[2] sdc4[0]
39424 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdd3[2] sdc3[0]
1802240 blocks [2/1] [U_]
[=>...................] recovery = 8.3% (151360/1802240) finish=2.1min speed=12613K/sec
unused devices: <none>
After recovery is finished it should look as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 :active raid1 sdd6[1] sdc6[0]
77888 blocks [2/2] [UU] <-- "UU" represents the correct state
md5 :active raid1 sdd5[1] sdc5[0]
78400 blocks [2/2] [UU]
md4 :active raid1 sdd4[1] sdc4[0]
39424 blocks [2/2] [UU]
md3 :active raid1 sdd3[1] sdc3[0]
1802240 blocks [2/2] [UU]
unused devices: <none>
After all disks are in [UU], the RAID array is fully back up with both disks sync'd.
If both supervisors recover into the [UU] status, then recovery is complete. If recovery is partial or did not succeed go to Step 2.
2. If the recovery tool does not succeed, identify the current state of the RAID on both modules. If there is still a single flash failure on each, attempt a "system switchover", which reloads the current active and forces the standby into the active role.
After the previous active reloads back into "ha-standby", check its RAID status; it should have recovered during the reload.
If that supervisor recovers successfully after the switchover, you can run the Flash Recovery Tool again to repair the single disk failure on the current active supervisor, or perform another "system switchover" to reload the current active and return the repaired supervisor to the active role. Verify that the reloaded supervisor has both disks repaired, and re-run the recovery tool if necessary.
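Before and after each switchover in this step, it helps to re-check the RAID state on both supervisors from the active. A short sketch, assuming the standby is in slot 6 (a placeholder for your chassis):
! RAID state of the active supervisor
switch# show system internal raid | grep -A 1 "Current RAID status info"
! RAID state of the standby supervisor, run remotely with the slot prefix
switch# slot 6 show system internal raid | grep -A 1 "Current RAID status info"
! If the remaining failure is on the current active, fail over so that it reloads as the standby
switch# system switchover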
3. If during this process a switchover does not fix the RAID, perform an "out-of-service module x" for the standby followed by "no poweroff module x" to fully remove and re-apply power to the module.
If the out-of-service procedure is not successful, attempt a physical reseat of the standby.
If, after running the recovery tool, one supervisor recovers its RAID and the other still has a failure, force the supervisor with the single failure into the standby role with a "system switchover" if necessary. If the supervisor with the single failure is already the standby, perform an "out-of-service module x" for the standby and "no poweroff module x" to fully remove and reapply power to the module. If it still does not recover, attempt a physical reseat of the module. If a reseat does not fix it, break into the switch boot prompt with the password recovery procedure and perform an "init system" to reinitialize the bootflash. If this is still unsuccessful, have TAC attempt a manual recovery.
Note: If at any point the standby gets stuck in a "powered-up" state rather than "ha-standby" and you cannot bring it fully up with the steps above, a chassis reload is required.
Recovery Scenario:
2 Fails on the Active
0 Fails on the Standby
Steps to Resolution:
With 2 failures on the active and 0 on the standby supervisor, a non-impacting recovery is possible. The effort involved depends on how much running configuration has been added since the standby became unable to sync its running-config with the active.
The recovery procedure is to copy the current running configuration off of the active supervisor, fail over to the healthy standby supervisor, copy the missing running configuration to the new active, manually bring the previous active back online, and then run the recovery tool.
1. Back up all running configuration externally with "copy running-config tftp: vdc-all" (or to a local USB device if a TFTP server is not available). Because the bootflash on the active is read-only, recent changes exist only in the running configuration.
2. Once the running-configuration has been copied off of the active supervisor, it is a good idea to compare it to the startup configuration to see what has changed since the last save. This can be seen with "show startup-config". The differences depend entirely on the environment, but it is good to be aware of what could be missing when the standby comes online as the active. It is also a good idea to have the differences copied out to a notepad so that they can be quickly added to the new active supervisor after the switchover.
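A short sketch of steps 1 and 2; the off-box comparison is done on a workstation, and the destinations are prompted for or are examples only:
! Back up the running configuration of all VDCs (prompts for the TFTP server and filename)
switch# copy running-config tftp: vdc-all
! Also copy the last saved configuration off-box so that the two files can be compared
switch# copy startup-config tftp:
! (On a workstation) diff the two files; the differences are what must be re-added after the switchover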
3. After the differences have been evaluated, perform a supervisor switchover. TAC recommends that this is done during a maintenance window, as unforeseen issues can occur. The command to fail over to the standby is "system switchover".
4. The switchover should occur very quickly, and the new standby will begin rebooting. During this time, add any missing configuration back to the new active. This can be done by copying the configuration from the TFTP server (or wherever it was saved previously) or by manually adding the configuration in the CLI. In most instances the missing configuration is very short and the CLI option is the most feasible.
5. After some time the new standby supervisor may come back online in an "ha-standby" state, but it commonly gets stuck in a "powered-up" state. The state can be viewed with the "show module" command, in the "Status" column next to the module.
If the new standby comes up in a "powered-up" state, bring it back online manually by issuing the following commands, where "x" is the standby module stuck in the "powered-up" state:
(config)# out-of-service module x
(config)# no poweroff module x
6. Once the standby is back online in an "ha-standby" state, you will then need to run the recovery tool to ensure that the recovery is complete. The tool can be downloaded at the following link:
Once you have downloaded the tool, unzipped it, and uploaded it to the bootflash of the box, you will need to execute the following command to begin the recovery:
# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
The tool will start running and detect disconnected disks and attempt to resync them with the RAID array.
You can check the recovery status with:
# show system internal file /proc/mdstat
Verify that recovery is proceeding; it can take several minutes to fully repair all disks to a [UU] status. An example of a recovery in progress looks as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdd6[2] sdc6[0]
77888 blocks [2/1] [U_] <-- "U_" represents the broken state
resync=DELAYED
md5 : active raid1 sdd5[2] sdc5[0]
78400 blocks [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdd4[2] sdc4[0]
39424 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdd3[2] sdc3[0]
1802240 blocks [2/1] [U_]
[=>...................] recovery = 8.3% (151360/1802240) finish=2.1min speed=12613K/sec
unused devices: <none>
After recovery is finished it should look as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 :active raid1 sdd6[1] sdc6[0]
77888 blocks [2/2] [UU] <-- "UU" represents the correct state
md5 :active raid1 sdd5[1] sdc5[0]
78400 blocks [2/2] [UU]
md4 :active raid1 sdd4[1] sdc4[0]
39424 blocks [2/2] [UU]
md3 :active raid1 sdd3[1] sdc3[0]
1802240 blocks [2/2] [UU]
unused devices: <none>
After all disks are in [UU], the RAID array is fully back up with both disks sync'd.
Recovery Scenario:
0 Fails on the Active
2 Fails on the Standby
Steps to Resolution:
With 0 failures on the active and 2 on the standby supervisor, a non-impacting recovery is possible.
The recovery procedure will be to perform a reload of the standby.
1. It is commonly seen on supervisors with a dual flash failure that a software "reload module x" only partially repairs the RAID, or that the module gets stuck in "powered-up" upon reboot.
Therefore, it is recommended to either physically reseat the supervisor with the dual flash failure to fully remove and reapply power to the module, or to perform the following (where "x" is the standby slot number):
# out-of-service module x
# no poweroff module x
If the standby keeps getting stuck in the "powered-up" state and ultimately keeps power cycling after the steps above, this is likely because the active reloads the standby for not coming up in time.
This can occur because the booting standby attempts to re-initialize its bootflash/RAID, which can take up to 10 minutes, but it keeps being reset by the active before it can complete.
To resolve this, configure the following using 'x' for the standby slot # stuck in powered-up:
(config)# system standby manual-boot
(config)# reload module x force-dnld
The above prevents the active from automatically resetting the standby, then reloads the standby and forces it to sync its image from the active.
Wait 10-15 minutes to see if the standby is finally able to get to ha-standby status. After it is in ha-standby status, re-enable automatic reboots of the standby with:
(config)# system no standby manual-boot
2. Once the standby is back online in an "ha-standby" state, run the recovery tool to ensure that the recovery is complete. The tool can be downloaded at the following link:
https://software.cisco.com/download/release.html?mdfid=284472710&flowid=&softwareid=282088132&relind=AVAILABLE&rellifecycle=&reltype=latest
Once you have downloaded the tool, unzipped it, and uploaded it to the bootflash of the box, you will need to execute the following command to begin the recovery:
# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
The tool will start running and detect disconnected disks and attempt to resync them with the RAID array.
You can check the recovery status with:
# show system internal file /proc/mdstat
Verify that recovery is proceeding; it can take several minutes to fully repair all disks to a [UU] status. An example of a recovery in progress looks as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdd6[2] sdc6[0]
77888 blocks [2/1] [U_] <-- "U_" represents the broken state
resync=DELAYED
md5 : active raid1 sdd5[2] sdc5[0]
78400 blocks [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdd4[2] sdc4[0]
39424 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdd3[2] sdc3[0]
1802240 blocks [2/1] [U_]
[=>...................] recovery = 8.3% (151360/1802240) finish=2.1min speed=12613K/sec
unused devices: <none>
After recovery is finished it should look as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 :active raid1 sdd6[1] sdc6[0]
77888 blocks [2/2] [UU] <-- "UU" represents the correct state
md5 :active raid1 sdd5[1] sdc5[0]
78400 blocks [2/2] [UU]
md4 :active raid1 sdd4[1] sdc4[0]
39424 blocks [2/2] [UU]
md3 :active raid1 sdd3[1] sdc3[0]
1802240 blocks [2/2] [UU]
unused devices: <none>
After all disks are in [UU], the RAID array is fully back up with both disks sync'd.
Recovery Scenario:
2 Fails on the Active
1 Fail on the Standby
Steps to Resolution:
With 2 failures on the active and 1 on the standby supervisor, a non-impacting recovery is possible. The effort involved depends on how much running configuration has been added since the standby became unable to sync its running-config with the active.
The recovery procedure is to back up the current running configuration from the active supervisor, fail over to the standby supervisor, copy the missing running configuration to the new active, manually bring the previous active back online, and then run the recovery tool.
1. Back up all running configuration externally with "copy running-config tftp: vdc-all". Note that after a dual flash failure, configuration changes made since the system remounted read-only are not present in the startup configuration. Review "show system internal raid" for the affected module to determine when the second disk failed, which is when the system went read-only. From there, review "show accounting log" for each VDC to determine what changes were made since the dual flash failure, so that you know what to re-add if the startup configuration persists upon reload.
Note that it is possible for the startup configuration to be wiped upon the reload of a supervisor with a dual flash failure, which is why the configuration must be backed up externally.
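A short sketch of these step 1 checks, run from the current active; the commands are the ones referenced above:
! Back up the running configuration of all VDCs externally (prompts for the TFTP server and filename)
switch# copy running-config tftp: vdc-all
! Review the RAID event history to find when the second disk failed (when bootflash went read-only)
switch# show system internal raid
! Review what was configured after that point; repeat in each VDC (switchto vdc <name>)
switch# show accounting log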
2. Once the running-configuration has been copied off of the active supervisor, it is a good idea to compare it to the startup configuration to see what has changed since the last save. This can be seen with "show startup-config". The differences depend entirely on the environment, but it is good to be aware of what could be missing when the standby comes online as the active. It is also a good idea to have the differences copied out to a notepad so that they can be quickly added to the new active supervisor after the switchover.
3. After the differences have been evaluated, perform a supervisor switchover. TAC recommends that this is done during a maintenance window, as unforeseen issues can occur. The command to fail over to the standby is "system switchover".
4. The switchover should occur very quickly, and the new standby will begin rebooting. During this time, add any missing configuration back to the new active. This can be done by copying the configuration from the TFTP server (or wherever it was saved previously) or by manually adding the configuration in the CLI. Do not copy directly from TFTP to the running-configuration; copy the file to bootflash first and then to the running configuration, as sketched below. In most instances the missing configuration is very short and the CLI option is the most feasible.
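A short sketch of re-applying the saved configuration on the new active, per the bootflash-first guidance above; the TFTP server, VRF, and filename are placeholders:
! Stage the saved configuration on the new active's bootflash first
switch# copy tftp://192.0.2.10/n7k-backup.cfg bootflash: vrf management
! Apply it to the running configuration (this merges, it does not replace)
switch# copy bootflash:n7k-backup.cfg running-config
! Save once the configuration is confirmed complete
switch# copy running-config startup-config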
5. After some time the new standby supervisor may come back online in an "ha-standby" state, but it commonly gets stuck in a "powered-up" state. The state can be viewed with the "show module" command, in the "Status" column next to the module.
If the new standby comes up in a "powered-up" state, bring it back online manually by issuing the following commands, where "x" is the standby module stuck in the "powered-up" state:
(config)# out-of-service module x
(config)# no poweroff module x
If the standby keeps getting stuck in the "powered-up" state and ultimately keeps power cycling after the steps above, this is likely because the active reloads the standby for not coming up in time.
This can occur because the booting standby attempts to re-initialize its bootflash/RAID, which can take up to 10 minutes, but it keeps being reset by the active before it can complete.
To resolve this, configure the following using 'x' for the standby slot # stuck in powered-up:
(config)# system standby manual-boot
(config)# reload module x force-dnld
The above prevents the active from automatically resetting the standby, then reloads the standby and forces it to sync its image from the active.
Wait 10-15 minutes to see if the standby is finally able to get to ha-standby status. After it is in ha-standby status, re-enable automatic reboots of the standby with:
(config)# system no standby manual-boot
6. Once the standby is back online in an "ha-standby" state, you will then need to run the recovery tool to ensure that the recovery is complete and to repair the single disk failure on the active. The tool can be downloaded at the following link:
https://software.cisco.com/download/release.html?mdfid=284472710&flowid=&softwareid=282088132&relind=AVAILABLE&rellifecycle=&reltype=latest
Once you have downloaded the tool, unzipped it, and uploaded it to the bootflash of the box, you will need to execute the following command to begin the recovery:
# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
The tool will start running and detect disconnected disks and attempt to resync them with the RAID array.
You can check the recovery status with:
# show system internal file /proc/mdstat
Verify that recovery is proceeding; it can take several minutes to fully repair all disks to a [UU] status. An example of a recovery in progress looks as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdd6[2] sdc6[0]
77888 blocks [2/1] [U_] <-- "U_" represents the broken state
resync=DELAYED
md5 : active raid1 sdd5[2] sdc5[0]
78400 blocks [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdd4[2] sdc4[0]
39424 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdd3[2] sdc3[0]
1802240 blocks [2/1] [U_]
[=>...................] recovery = 8.3% (151360/1802240) finish=2.1min speed=12613K/sec
unused devices: <none>
After recovery is finished it should look as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 :active raid1 sdd6[1] sdc6[0]
77888 blocks [2/2] [UU] <-- "UU" represents the correct state
md5 :active raid1 sdd5[1] sdc5[0]
78400 blocks [2/2] [UU]
md4 :active raid1 sdd4[1] sdc4[0]
39424 blocks [2/2] [UU]
md3 :active raid1 sdd3[1] sdc3[0]
1802240 blocks [2/2] [UU]
unused devices: <none>
After all disks are in [UU], the RAID array is fully back up with both disks sync'd.
If the current active with a single failure is not recovered by the recovery tool, attempt another "system switchover", ensuring that your current standby is in "ha-standby" status. If still not successful, contact Cisco TAC.
Recovery Scenario:
1 Fail on the Active
2 Fails on the Standby
Steps to Resolution:
In a dual supervisor scenario with 1 failure on the active and 2 failures on the standby supervisor, a non-impacting recovery may be possible, but in many cases a reload is necessary.
The process is to first back up all running configurations, then attempt to recover the failed compact flash on the active with the recovery tool, and then, if successful, manually reload the standby and run the recovery tool again. If the initial recovery attempt is unable to recover the failed flash on the active, TAC must be engaged to attempt a manual recovery with the debug plugin.
1. Backup all running configuration externally with "copy running-config tftp: vdc-all". You may also copy the running-config to a local USB stick if a TFTP server is not set up in the environment.
2. Once the current running-configuration is backed up, you will then need to run the recovery tool to attempt a recovery of the failed flash on the active. The tool can be downloaded at the following link:
Once you have downloaded the tool, unzipped it, and uploaded it to the bootflash of the box, you will need to execute the following command to begin the recovery:
# load bootflash:n7000-s2-flash-recovery-tool.10.0.2.gbin
The tool will start running and detect disconnected disks and attempt to resync them with the RAID array.
You can check the recovery status with:
# show system internal file /proc/mdstat
Verify that recovery is proceeding; it can take several minutes to fully repair all disks to a [UU] status. An example of a recovery in progress looks as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdd6[2] sdc6[0]
77888 blocks [2/1] [U_] <-- "U_" represents the broken state
resync=DELAYED
md5 : active raid1 sdd5[2] sdc5[0]
78400 blocks [2/1] [U_]
resync=DELAYED
md4 : active raid1 sdd4[2] sdc4[0]
39424 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdd3[2] sdc3[0]
1802240 blocks [2/1] [U_]
[=>...................] recovery = 8.3% (151360/1802240) finish=2.1min speed=12613K/sec
unused devices: <none>
After recovery is finished it should look as follows:
switch# show system internal file /proc/mdstat
Personalities : [raid1]
md6 :active raid1 sdd6[1] sdc6[0]
77888 blocks [2/2] [UU] <-- "UU" represents the correct state
md5 :active raid1 sdd5[1] sdc5[0]
78400 blocks [2/2] [UU]
md4 :active raid1 sdd4[1] sdc4[0]
39424 blocks [2/2] [UU]
md3 :active raid1 sdd3[1] sdc3[0]
1802240 blocks [2/2] [UU]
unused devices: <none>
After all disks are in [UU], the RAID array is fully back up with both disks sync'd.
3. If, after running the Recovery Tool in step 2, you are not able to recover the failed compact flash on the active supervisor, you must contact TAC to attempt a manual recovery using the linux debug plugin.
4. After verifying that both flashes show as "[UU]" on the active, you can proceed with manually rebooting the standby supervisor. This can be done by issuing the following commands, where "x" is the slot number of the standby module:
(config)# out-of-service module x
(config)# no poweroff module x
This should bring the standby supervisor back into an "ha-standby" state (check the "Status" column in the "show module" output). If this is successful, proceed to Step 6; if not, try the procedure outlined in Step 5.
5. If the standby keeps getting stuck in the "powered-up" state and ultimately keeps power cycling after the steps above, this is likely because the active reloads the standby for not coming up in time. This can occur because the booting standby attempts to re-initialize its bootflash/RAID, which can take up to 10 minutes, but it keeps being reset by the active before it can complete. To resolve this, configure the following, where "x" is the standby slot number stuck in "powered-up":
(config)# system standby manual-boot
(config)# reload module x force-dnld
The above prevents the active from automatically resetting the standby, then reloads the standby and forces it to sync its image from the active.
Wait 10-15 minutes to see if the standby is finally able to get to ha-standby status. After it is in ha-standby status, re-enable automatic reboots of the standby with:
(config)# system no standby manual-boot
6. Once the standby is back online in an "ha-standby" state, run the recovery tool to ensure that the recovery is complete. You can run the same tool that is already on the active for this step; no additional download is needed, as the recovery tool runs on both the active and the standby.
Recovery Scenario:
2 Fails on the Active
2 Fails on the Standby
Steps to Resolution:
Note: It is commonly seen in instances of dual flash failure that a software "reload" might not fully recover the RAID and that running the recovery tool or performing subsequent reloads might be required. In almost every occurrence, the issue has been resolved with a physical reseat of the supervisor module. Therefore, if physical access to the device is possible, after backing up the configuration externally, you can attempt a quick recovery that has the highest chance of succeeding by physically reseating the supervisor when you are ready to reload the device. This fully removes power from the supervisor and should allow recovery of both disks in the RAID. Proceed to Step 3 if the physical reseat recovery is only partial, or to Step 4 if it is entirely unsuccessful, that is, the system does not fully boot.
See the Long Term Solutions section below.
The reason this is not possible is that, in order for the standby supervisor to come up in an "ha-standby" state, the active supervisor must write several things to its compact flash (SNMP info, and so on), which it cannot do if it has a dual flash failure itself.
Contact Cisco TAC for options in this scenario.
There is a separate defect for the N7700 Sup2E, CSCuv64056. The recovery tool will not work for the N7700.
The recovery tool does not work for NPE images.
No. An ISSU will utilize a supervisor switchover, which may not perform correctly due to the compact flash failure.
The RAID status bits are reset after a board reset once auto-recovery is applied.
However, not all failure conditions can be recovered automatically.
If the RAID status bits are not printed as [2/2] [UU], recovery is incomplete.
Follow the recovery steps listed above.
No, but the system may not boot back up after a power failure, and the startup configuration will be lost as well.
An ISSU will not fix a failed eUSB. The best option is to run the recovery tool in the case of a single eUSB failure on the supervisor, or to reload the supervisor in the case of a dual eUSB failure.
Once the issue is corrected, proceed with the upgrade. The fix for CSCus22805 helps correct a single eUSB failure ONLY, and it does so by scanning the system at regular intervals and attempting to reawaken an inaccessible or read-only eUSB with a script.
It is rare for both eUSB flashes on a supervisor to fail simultaneously, so this workaround is effective.
Generally it is associated with longer uptime. This has not been exactly quantified, but it can range from a year or longer. The bottom line is that the more stress on the eUSB flash in terms of reads and writes, the higher the probability of the system running into this scenario.
"show system internal raid" shows the flash status twice, in different sections, and these sections are not always consistent.
The first section shows the current status and the second section shows the bootup status.
The current status is what matters, and it should always show as [UU].
This defect has a workaround in 6.2(14), but the firmware fix was added to 6.2(16) and 7.2(x) and later.
It is advisable to upgrade to a release with the firmware fix to completely resolve this issue.
If you are unable to upgrade to a fixed version of NX-OS, there are two possible solutions.
Solution 1 is to run the Flash Recovery Tool proactively every week with the scheduler. Use the following scheduler configuration, with the flash recovery tool present in the bootflash:
feature scheduler
scheduler job name Flash_Job
copy bootflash:/n7000-s2-flash-recovery-tool.10.0.2.gbin bootflash:/flash_recovery_tool_copy
load bootflash:/flash_recovery_tool_copy
exit
scheduler schedule name Flash_Recovery
job name Flash_Job
time weekly 7
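After applying the configuration above, you can confirm that the job and schedule are in place and, later, that the job actually ran. A brief sketch follows; command availability can vary slightly by release:
! Confirm the job and its weekly schedule were accepted
switch# show scheduler job name Flash_Job
switch# show scheduler schedule name Flash_Recovery
! After the scheduled time has passed, confirm the job ran successfully
switch# show scheduler logfile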
Notes:
Solution 2 is documented at the following technote link: