Introduction
This document describes key software defects which can cause corrupt data frames to be injected into a Unified Computing System (UCS) Fabric as identified by interface Cyclic Redundancy Check (CRC) or Frame Check Sequence (FCS) error counters.
Note: This document does not describe how to isolate the point of CRC injection.
Background Information
In a UCS environment, CRC errors can be of high impact. The isolation and mitigation of the cause of such errors must be treated with high priority.
The impact depends on the point at which the issue occurs, which can extend to multiple chassis and impact both Ethernet and Storage connectivity.
While physical component failure (especially cable and Small Form-Factor Pluggable (SFP)) is the most frequent cause, there are known software defects which can also cause CRC errors.
These defects cause low signal strength between various components, which leads to corrupt frames.
A key concept you can refer to is Eye Height which is a measure of the signal integrity between physical layer components. If the signal level drops below a particular level (differs between components), frames sent or received can be corrupted.
Cisco recommends that you have reviewedFlexPod Common Performance Problems, especiallyFrame and Packet Loss in order to identify the source of unstomped CRC errors within the UCS Fabric and/or upstream switches.
While the document is intended for FlexPod deployments, the section mentioned is applicable for non-FlexPod UCS environments.
Indications To CRC Related Defect
If you have Twinax cabling in your UCS environment, it is more likely to be impacted by one or more of these defects, as the majority of the defects are for Twinax based cabling.
Environments which only have optical cabling can still experience issues, as it CRC errors can be injected between Adapter and UCS I/O Module (IOM). However, this is limited to specific servers and does not affect multiple servers or chassis in the case of an Uplink or Server port issue.
If disable/enable of a port in UCS Manager seems to stop interface errors with no further action such as cable swap or reseat, further checks must be made to verify if a software defect is the root cause of the issue.
If CRC errors have been seen after sudden port flaps/reboots, these defects can be a possible cause.
Commands To Verify Eye Height
A key indication of a CRC related software defects is a low Eye Height value for one or more ports.
Common commands used to check this are:
Nexus 5500 based switches:
show hardware internal carmel eye
UCS 6200 Fabric Interconnects:
connect nxos a
show hardware internal carmel eye
exit
connect nxos b
show hardware internal carmel eye
exit
Sample output that shows a good Eye Height (200 mv):
UCSB-5-A(nxos)# show hardware internal carmel eye
+-------+------------+-------------+------------+----------------------------+--+--+--+--+--+--+--+--+--+--+
| Port | Eye Height | Eye Width | Raw values | Time measured |St|20|21|22|23|24|25|26|2E|2F|
+-------+------------+-------------+------------+----------------------------+--+--+--+--+--+--+--+--+--+--+
Eth 1/1 | 200 mv | 796 mUI | 40/ 33 | 08/31/2016 16:48:52.345248 |a9|ee|82|00|00|6e|82|00|88|00|
fi0 | 200 mv | 843 mUI | 40/ 36 | 08/31/2016 16:48:52.350360 |00|00|00|00|00|00|00|00|00|00|
fi1 | 200 mv | 859 mUI | 40/ 37 | 08/31/2016 16:48:52.355470 |00|00|00|00|00|00|00|00|00|00|
On these platforms, if the value is:
- Below 50mV, it has been found to trigger CRC errors
- 50 - 100mV, it can cause CRC errors and mitigation is advised
- > 100 mV, it must not cause CRC errors
The above commands are not applicable to 6332, 6454 or 6324 Fabric Interconnects
UCS 2200 IOM Modules:
connect local-mgmt a or connect local-mgmt b
connect iom x
show platform software woodside sts (Note: The HI number/s for the servers that you need to check)
dbgexec woo
kr_geteye HIxx
Ctrl-C to exit dbgexec mode
Sample output that shows a good Eye Height (125 mV):
woo> kr_geteye HI31
[serdes] reg: 64/40h = 42ch
check_kr_status: HI31: up (kr_retries=0)
sent SPICO interrupt(20, 0, 49)
Vertical eye result 0x14
sent SPICO interrupt(20, 0, 49)
Horizontal eye result 0x28
HI31: 125.0 mV, 0.6250 UI (NORM)
UCS 2300 IOM Modules:
connect local-mgmt a or connect local-mgmt b
connect iom x
show platform software tiburon sts (Note the HI number/s for the servers you need to check)
dbgexec tib
kr_geteye 0 HIxx
Ctrl-C to exit dbgexec mode
Sample output that shows a good Eye Height (156 mv):
tib> kr_geteye 0 HI31
Start eye measurement HI31...
bottom: -73.5 (mV), top: 82.7 (mV), height: 156.2 (mV)
left: -0.34 (UI), right: 0.33 (UI), width: 0.69 (UI)
total time = 0.119456 sec
On these platforms, if the height value is:
- Below 90 mV, it has been found to trigger CRC errors
- > 90 mV, it must not trigger CRC errors
Defects
Fabric Interconnect
This defect is seen on Fabric Interconnect ports, such as Uplink and Server ports.
It is fixed in UCS Infrastructure 2.2(3a), Refer to Bug Search Tool for other fixed releases.
- Near identical bug which later affects UCS firmware:
CSCuw36398 Observing CRC errors on Copper cable
This defect is seen on Fabric Interconnect ports, such as Uplink and Server ports
It is fixed in UCS Infrastructure 2.2(7b). Refer to Bug Search Tool for other fixed releases.
IOM And Adapter
- CSCuz78417 Serdes eye height between IOM and VIC lower than 90mV
This defect is observed between IOM Host Interfaces (HIF) and Adapters backplane interfaces.
It has since beeen found that this can be caused by Chassis backplane issues. If you observe this issue, open a Service Request with Cisco TAC.
- CSCva47085 VIC1340+2304 IOM Native 40g Link Training Issue Causes Connectivity Loss
This defect is seen between IOM HIF and Adapters, which affects the individual servers.
Currently under investigation.
C-Series
- CSCux31002 VIC 1227 shows CRCs when you use an active twinax cable.
Fixed in standalone C Series firmware 2.0(9c). Refer to Bug Search Tool for other fixed releases.
This bug's trigger condition is the reverse of the common wisdom that Active Twinax is less likely to cause CRC issues due to its active power transmission.
Nexus 5500
- CSCuj86736 Need to optimize DFE tuning in 55xxUP series switches - RX CRC Errors
While not strictly a UCS bug, it is still commonly seen in UCS setups due to the prevalence of Nexus 55xx upstream. Refer to Bug Search Tool for details about fixed versions.
Workarounds/Mitigation
Refer to the release note for each bug for specific details, but if you have found evidence of low Eye Height, then shut/no shut of the port is reasonable.
In the case of an IOM/Adapter Eye Height defect, a reset of the DCE in the interface can be done. Navigate to Server > Adapter > DCE Interface > Reset Connectivityas it is appropriate.
Outputs must be then checked to see if the Eye Height has increased to good values and if CRC counters have no longer incremented.
Several flaps (commonly up to 5) can be needed to increase the Eye Height sufficently.
If the Eye Height does not recover after several link flaps, there could be a hardware failure of the component.
When you flap ports, be aware that this can trigger a shallow discovery by UCS Manager.
A shallow discovery under normal circumstances is not data plane impacting, however, there are known defects that affect B200-M4 blades (see CSCut61527 for the most common defect). A shallow discovery can turn into a deep discovery, which can trigger Host OS reboot.
Cisco recommends that you review the Release Notes for your UCS Manager version for other applicable defects.
Besides manual port flapping as a reactive recovery step, UCS Policy-Based Port Error Handling in UCS Manager 2.2(4) and later can be used to disable NIF ports when CRC errors are seen. While such action can quickly limit the impact of CRC errors, it can have the potential for disruption of traffic flow, hence is not enabled by default and care must be taken if you enabke it.
UCS Manager generates faults for CRC errors and such faults can be monitored via XML API or Simple Network Management Protocol (SNMP).