This document describes how to resolve fabric errors reported on the Cisco Nexus 7000 platform. Troubleshooting fabric Cyclic Redundancy Check (CRC) errors involves data collection, data analysis, and a process of elimination in order to isolate the faulty component. This document covers the most common types of fabric CRC errors.
Here is a high-level diagram of a Nexus 7018 fabric module with M1 linecards:
The previous image gives an overview of the components involved when a packet traverses a fabric module. Stage 1 (S1), Stage 2 (S2), and Stage 3 (S3) are the three stages of the Nexus 7000 fabric, Octopus is the queue engine, Santa Cruz (SC) is the fabric ASIC, and Instance 1 and 2 are the two SC instances on the XBAR. This document considers only one XBAR. Please remember that most of the Nexus 7000 Series switches have three or more XBARs installed.
Assume a unidirectional flow from Module 1 (M1) to Module 2 (M2). The ingress Octopus-1 on M1 performs error checks on packets it receives from the south, and the egress Octopus-1 on M2 performs them on packets it receives from the north. If a CRC error is detected in S3, the problem might have occurred in S1 or S2 as well, because no CRC check is performed in those stages. The devices involved in the path are therefore the ingress Octopus, the chassis, the crossbar fabric, and the egress Octopus.
In M1/Fab1 architecture, CRCs are detected only on the egress linecard (S3).
Here is a sample error message:
%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with
CRC error from MOD 15 through XBAR slot 1/inst 1
This is reported by M1, which indicates that it received packets with the wrong CRC from Module 15 (M15) via XBAR slot 1/instance 1.
This section describes four of the most common types of fabric CRC Errors.
%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with
CRC error from MOD 15 through XBAR slot 1/inst 1
This means that the module in slot 1 detected a CRC error from M15 through XBAR slot 1/instance 1. The module where the CRC errors originate is referred to as the ingress module (M15 in this case), and the module that reported the problem is the egress module (M1). XBAR 1 is the crossbar through which the packet was received. There are two instances per XBAR.
%OC_USD-SLOT4-2-RF_CRC: OC2 received packets with
CRC error from MOD 1
In this message, Module 4 (M4) reported the CRC error from M1. Notice that the XBAR information is missing: the system is unable to ascertain the XBAR that the packet traversed. There are many possible reasons, but the most common ones are that the information in the fabric header of the packet might be corrupt, so the source module cannot be determined, or that the XBAR that was traversed was removed from the system after the error incremented, so it was not reported in the hourly syslog message.
%OC_USD-2-RF_CRC: OC1 received packets with
CRC error from MOD 16 through XBAR slot 1/inst 1
In this instance, a device detected a CRC error from Module 16 (M16) through XBAR 1. There is, however, no receiver module in the message. When the Supervisor (SUP) detects a CRC error that comes from the fabric module, the slot information is not logged, so a message without slot information means that the SUP detected the problem. This does not mean that the SUP is bad. Just as when a module reports the problem, there are multiple components that might have caused it: M16, the chassis (not as likely), XBAR 1, or the SUP.
%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with
CRC error from MOD 11 or 12 or 14 or 15 or 16 or 17 or 18
The source module is gleaned from the ingress Octopus that sourced the bad packet. The driver that raises an interrupt in order to log this error message does not always know the ingress Octopus from which the bad packet originated. This is because some of the bits used in order to represent the ingress Octopus are not used. If the system determines that multiple modules have these unused bits turned on, it must assume that any one of them might be the source, which causes the error message to include all of those modules. The system found that Module 13 (M13) cannot have this conflict due to those bits not being used; thus, it is not logged as a potential source.
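The four message forms above differ only in which fields are present, so they can be split into fields with one pattern. The following is a minimal sketch, assuming the messages have exactly the layout quoted in this document; the function name parse_rf_crc and the dictionary keys are illustrative and are not part of NX-OS.

```python
import re

# Pattern for the RF_CRC message forms quoted in this document.
# The SLOT field, the "OCn received packets with" text, and the XBAR
# list are all optional, because each is absent in at least one form.
RF_CRC = re.compile(
    r"%OC_USD-(?:SLOT(?P<slot>\d+)-)?2-RF_CRC:"
    r"(?:\s*OC(?P<oc>\d+) received packets with)?"
    r"\s*CRC error from MOD (?P<mods>\d+(?: or \d+)*)"
    r"(?: through XBAR (?P<xbars>.+))?"
)

def parse_rf_crc(message):
    """Return the fields of one RF_CRC message, or None if it does not match."""
    text = " ".join(message.split())  # rejoin a message wrapped across lines
    m = RF_CRC.search(text)
    if m is None:
        return None
    xbars = re.findall(r"slot (\d+)/inst (\d+)", m.group("xbars") or "")
    return {
        # None means no SLOT field, which the text attributes to the SUP.
        "reporter_slot": int(m.group("slot")) if m.group("slot") else None,
        "sources": [int(n) for n in m.group("mods").split(" or ")],
        # Empty list means the XBAR could not be determined.
        "xbars": [(int(slot), int(inst)) for slot, inst in xbars],
    }
```

A message with several candidate sources yields a list with more than one entry in "sources", which is a hint that the source module is ambiguous rather than that all of the listed modules are faulty.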
Newer linecards (M2) and fabric module 2 (FAB2) detect CRC errors in S1, S2, or S3. A detailed investigation that finds patterns in the failures and log messages helps isolate the faulty component.
Here are some questions to ask:
Answers to these questions allow you to approach the troubleshooting procedure from an angle that is more likely to lead to a faster resolution.
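One way to look for such patterns is to count how often each reporting slot, source module, and XBAR slot appears in the collected syslog. The following is a rough sketch, assuming each RF_CRC message sits on a single line; the function name tally is hypothetical, and the counts are only a starting point for the analysis, not a verdict.

```python
import re
from collections import Counter

def tally(log_lines):
    """Count reporting slots, source modules, and XBAR slots in RF_CRC lines."""
    reporters, sources, xbars = Counter(), Counter(), Counter()
    for line in log_lines:
        if "RF_CRC" not in line:
            continue
        slot = re.search(r"SLOT(\d+)-", line)
        # A message without a SLOT field was reported by the supervisor.
        reporters[slot.group(1) if slot else "SUP"] += 1
        mods = re.search(r"from MOD (\d+(?: or \d+)*)", line)
        if mods:
            sources.update(mods.group(1).split(" or "))
        # Lowercase "slot N/inst" only appears in the XBAR part of a message.
        xbars.update(re.findall(r"slot (\d+)/inst", line))
    return reporters, sources, xbars
```

A component that appears in every message while the other fields vary is usually the first one worth checking.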
This section establishes a general framework used in order to troubleshoot these issues.
This section provides examples of how to troubleshoot similar problems.
%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7
For a few hours, CRC errors are seen on M1 and Module 3 (M3) that come from Module 7 (M7) only.
There is a bad or mis-seated XBAR that corrupts packets headed to M7, or M7 is bad or mis-seated.
If you have three XBARs installed, it gives you N+1 redundancy. Therefore, you are able to shut them down one at a time (never shut down more than one at any given time) with only minimal impact in order to see if the problem is resolved. Enter these commands in order to complete this process:
N7K(config)# poweroff xbar 1
<monitor>
N7K(config)# no poweroff xbar 1
N7K(config)# poweroff xbar 2
<monitor>
N7K(config)# no poweroff xbar 2
N7K(config)# poweroff xbar 3
<monitor>
N7K(config)# no poweroff xbar 3
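The same one-at-a-time sequence can be written out for any set of XBAR slots. This is only a sketch that generates the text of the steps shown above; it does not connect to a switch, and the function name xbar_isolation_steps is hypothetical.

```python
def xbar_isolation_steps(xbar_slots):
    """Yield the one-at-a-time power-off sequence for the given XBAR slots."""
    for slot in xbar_slots:
        yield f"poweroff xbar {slot}"
        yield "<monitor>"  # placeholder: watch for new RF_CRC messages
        # Restore this XBAR before the next one is touched, so that no
        # more than one XBAR is ever powered off at the same time.
        yield f"no poweroff xbar {slot}"

for step in xbar_isolation_steps([1, 2, 3]):
    print(step)
```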
In this particular case study, the problem was not resolved when the XBARs were shut down.
As there are two modules that report CRC errors, it is unlikely that those two modules (M1 & M3) are the cause. The next step is to reseat M7 (ingress module), because it is most likely the faulty component. Mis-seated linecards might cause this problem, and it is recommended to reseat the module before replacement.
In this case study, CRC errors continued to increment on the fabric module after M7 was reseated. Contact the Cisco Technical Assistance Center (TAC) at this point (or earlier) in order to replace M7, since the reseat did not resolve the problem.
In this case study, the replacement of M7 stopped the fabric CRC error messages, and resolved the packet loss.
%OC_USD-SLOT11-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT12-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT13-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT15-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT2-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT4-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT5-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT6-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT7-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT8-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
Multiple modules report CRC errors from Module 12 (M12) that go through XBAR 3.
XBAR 3 is bad or mis-seated, or M12 is mis-seated or faulty.
In this case, XBAR 3 was shut down with the procedure previously described (in the first case study) and monitored for further errors. The errors ceased when XBAR 3 was shut down. XBAR 3 was then reseated, with care taken to ensure that no pins were bent on the midplane and that the module was properly inserted. After XBAR 3 was reenabled, the problem did not recur. This problem is attributed to a mis-seated XBAR module.
%OC_USD-SLOT6-2-RF_CRC: OC1 received packets with CRC error from
MOD 1 or 2 or 7 or 13 or 17 through XBAR
slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1
%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error from
MOD 1 or 2 or 3 or 7 or 15 or 17 through XBAR
slot 2/inst 1 and slot 3/inst 1
%OC_USD-SLOT6-2-RF_CRC: OC1 received packets with CRC error from
MOD 1 or 2 or 5 or 7 or 16 or 17 through XBAR
slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1
Module 6 (M6) reports packets with CRC errors received from multiple linecards and XBARs.
M6 is mis-seated or bad.
M6 is the most likely cause of this issue because it is the one module common to all of the error messages: every message is reported by slot 6, while the listed source modules and XBARs vary. Therefore, attempt to reseat M6 in order to see if the issue is resolved before you replace it.
In this case, M6 was reseated, but the errors persisted, so a Cisco TAC case was opened in order to have M6 replaced. After M6 was replaced, the errors were no longer reported.
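The reasoning used in the three case studies can be summarized as a rough heuristic over the sets of reporting modules, source modules, and XBAR slots seen in the messages. The sketch below is an illustration distilled from those case studies, not an algorithm used by NX-OS or by Cisco TAC; the function name likely_suspects and its return strings are hypothetical.

```python
def likely_suspects(reporters, sources, xbars):
    """Return components to check first, in order, or [] if no pattern is clear."""
    if len(reporters) == 1 and len(sources) > 1:
        # One egress module blames many sources: the reporter is the
        # common factor (third case study).
        return [f"module {next(iter(reporters))} (reporter)"]
    if len(sources) == 1 and len(reporters) > 1:
        # Many egress modules blame one source: check the XBAR path first,
        # then the source module (first and second case studies).
        if len(xbars) == 1:
            xbar = f"XBAR {next(iter(xbars))}"
        else:
            xbar = "an XBAR (isolate one at a time)"
        return [xbar, f"module {next(iter(sources))} (source)"]
    return []

# First case study: M1 and M3 report errors from M7, no XBAR information.
print(likely_suspects({1, 3}, {7}, set()))
```

An empty result means the messages do not fit either pattern, and more data needs to be collected before a component is reseated or replaced.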
Here is a list of the commands used in order to troubleshoot/debug:
Revision | Publish Date | Comments |
---|---|---|
1.0 | 13-Aug-2013 | Initial Release |