Introduction
This document describes how to troubleshoot memory modules and related issues in the Cisco Unified Computing System (UCS) solution.
Prerequisites
Requirements
Cisco recommends knowledge of Cisco Unified Computing System (UCS).
Components Used
This document is not restricted to specific software and hardware versions.
However, this document addresses:
- Cisco UCS B-Series Blade Servers
- UCS Manager
- UCS uses Dual In-line Memory Modules (DIMMs) as RAM modules.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Troubleshoot Methodology
This section covers several parts of UCS memory issues.
- Memory placement
- Troubleshoot DIMMs via UCSM and CLI
- Logs to check in technical support
Terms and Acronyms
DIMM | Dual In-line Memory Module
ECC | Error Correcting Code
LVDIMM | Low Voltage DIMM
MCA | Machine Check Architecture
MEMBIST | Memory Built-in Self Test
MRC | Memory Reference Code
POST | Power On Self Test
SPD | Serial Presence Detect
DDR | Double Data Rate
RAS | Reliability, Availability and Serviceability
Memory Placement
Memory placement is one of the most notable physical aspects of the UCS solution.
Typically, the server comes with memory pre-populated with a requested amount.
However, when in doubt, refer to the hardware installation guide.
For memory population rules, refer to B-series technical specifications for the specific platform.
B-series technical specifications link:
Data Sheets
Memory Errors
- DIMM Error
  - Multibit = Uncorrectable
    - At POST, the DIMM is mapped out by the BIOS; the OS does not see the DIMM.
    - At runtime, the error usually causes an OS reboot.
  - Singlebit = Correctable
    - The OS continues to see the DIMM.
- Error Correcting Code (ECC) Error
- Parity Error
- Serial Presence Detect (SPD) Error
- Configuration Error
  - Unsupported DIMMs
  - Unsupported DIMM population
  - Unpaired DIMMs
  - Mismatch errors
- Identity unestablishable error
  - Check and update the catalog.
Correctable versus Uncorrectable Errors
Whether a particular error is correctable or uncorrectable depends on the strength of the ECC code employed within the memory system.
Dedicated hardware is able to fix correctable errors when they occur with no impact on program execution.
The DIMMs with correctable errors are not disabled and are available for the OS to use. The Total Memory and Effective Memory are the same.
These correctable errors are reported in the UCSM operability state as Degraded while overall operability is Operable with correctable errors.
Uncorrectable errors make it impossible for the application or operating system to continue execution.
The DIMMs with uncorrectable errors are disabled and the OS does not see them. The UCSM operState changes to Inoperable in this case.
Troubleshoot DIMMs via UCSM and CLI
Check Errors from GUI
UCSM DIMM Status | UCSM Operability | Logs (SEL) | Description (Comments)
Operable | Operable | Check SEL log for DIMM related errors. | A DIMM is installed and functional.
Operable | Degraded | Check SEL for ECC errors. | A correctable ECC DIMM error is detected during run time.
Removed | N/A | No logs | A DIMM is not installed or has corrupted SPD data.
Disabled | Operable | Check SEL for Identity unestablishable errors. | Check and update the capability catalog.
Disabled | N/A | Check SEL to see if another DIMM has failed in the same channel. | A DIMM is healthy but disabled because a failed DIMM in the same channel prevented the configuration rule from being maintained.
Disabled | N/A | No logs | The memory configuration rule failed because of missing DIMMs.
Inoperable | Inoperable/Replacement required | | An uncorrectable (UE) ECC error was detected.
Degraded | Inoperable | Check SEL for ECC errors. | DIMM status and operability changed because ECC errors were detected before the host rebooted.
Degraded | Inoperable/Replacement required | Check SEL for ECC error during POST/MRC. | An uncorrectable ECC error was detected during runtime. The DIMM remains available to the OS; the OS crashes and comes back up but can still use this DIMM. The error can occur again later. The DIMM must be replaced in most situations.
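As a quick reference, the status table above can be encoded as data so that a script can suggest which SEL check applies to a given DIMM. This is only a minimal sketch: the rows are copied from the table, while the names `TRIAGE_ROWS` and `sel_guidance` are hypothetical and not part of any Cisco tool.

```python
# Rows copied from the status table: (DIMM Status, Operability, SEL guidance).
# An empty guidance string means the table gives no SEL hint for that row.
TRIAGE_ROWS = [
    ("Operable", "Operable", "Check SEL log for DIMM related errors."),
    ("Operable", "Degraded", "Check SEL for ECC errors."),
    ("Removed", "N/A", "No logs"),
    ("Disabled", "Operable", "Check SEL for Identity unestablishable errors."),
    ("Disabled", "N/A", "Check SEL to see if another DIMM has failed in the same channel."),
    ("Disabled", "N/A", "No logs"),
    ("Inoperable", "Inoperable/Replacement required", ""),
    ("Degraded", "Inoperable", "Check SEL for ECC errors."),
    ("Degraded", "Inoperable/Replacement required", "Check SEL for ECC error during POST/MRC."),
]


def sel_guidance(dimm_status, operability):
    """Return every non-empty SEL hint whose row matches both columns (case-insensitive)."""
    key = (dimm_status.strip().lower(), operability.strip().lower())
    return [
        hint
        for status, oper, hint in TRIAGE_ROWS
        if (status.lower(), oper.lower()) == key and hint
    ]


if __name__ == "__main__":
    # Two rows share the Disabled / N/A combination, so two hints are printed.
    for hint in sel_guidance("Disabled", "N/A"):
        print(hint)
```

Because some status combinations appear more than once in the table, the helper returns a list rather than a single answer.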
To obtain statistics, navigate to Equipment > Chassis > Server > Inventory > Memory, then right-click Memory and select Show Navigator.
Check Errors from CLI
These commands are useful when troubleshooting errors from CLI.
scope server x/y -> show memory detail
scope server x/y -> show memory-array detail
scope server x/y -> scope memory-array x -> show stats history memory-array-env-stats detail
From memory array scope, you can also get access to DIMM.
scope server x/y -> scope memory-array z -> scope dimm n
From there, you can obtain per-DIMM statistics or reset the error counters.
UCS /chassis/server/memory-array/dimm # reset-errors
UCS /chassis/server/memory-array/dimm* # commit-buffer
UCS /chassis/server/memory-array/dimm # show stats memory-error-state
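When the same reset has to be repeated for several DIMMs, it can help to generate the command sequence rather than type it each time. The following is a minimal sketch that only builds the text of the commands shown above, in the order they are entered; it does not connect to UCS Manager. The function name `reset_dimm_error_commands` is hypothetical, and the lowercase `scope dimm` spelling is an assumption about the CLI keyword.

```python
def reset_dimm_error_commands(chassis, server, memory_array, dimm):
    """Build, in order, the UCS Manager CLI lines that reset one DIMM's error counters."""
    fields = (
        ("chassis", chassis),
        ("server", server),
        ("memory_array", memory_array),
        ("dimm", dimm),
    )
    for name, value in fields:
        # Reject anything that is not a positive integer before formatting it.
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return [
        f"scope server {chassis}/{server}",
        f"scope memory-array {memory_array}",
        f"scope dimm {dimm}",  # assumption: the CLI keyword is lowercase
        "reset-errors",
        "commit-buffer",
        "show stats memory-error-state",
    ]


if __name__ == "__main__":
    # Hypothetical example: chassis 1, blade 3, memory array 1, DIMM 5.
    print("\n".join(reset_dimm_error_commands(1, 3, 1, 5)))
```

The output can be pasted into a CLI session line by line; review each command before you run it on a live system.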
If you see a correctable error that matches this information, the problem can be corrected by resetting the BMC instead of resetting the blade server.
Use these Cisco UCS Manager CLI commands:
(Resetting the BMC does not impact the OS running on the blade.)
To reset memory-error counters on a Cisco UCS C-Series Rack Server that operates in standalone mode, run these commands in the CLI:
UCS-C# scope reset-ecc
UCS-C/reset-ecc # set enabled yes
UCS-C/reset-ecc *# commit
For Colusa servers:
UCS# scope chassis
UCS /chassis # scope server x
UCS /chassis/server # reset-ecc
With UCS releases 2.27, and 3.1 and above, the thresholds for memory corrected errors have been removed.
Therefore, memory modules are no longer reported as Inoperable or Degraded solely due to corrected memory errors.
As per the whitepaper Managing Correctable Memory Errors on Cisco UCS Servers:
Industry demands for greater capacity, greater bandwidth, and lower operating voltages lead to increased memory error rates.
Traditionally, the industry has treated correctable errors in the same way as uncorrectable errors, requiring the module to be replaced immediately upon alert.
Given extensive research that correctable errors are not correlated with uncorrectable errors, and that correctable errors do not degrade system performance, the Cisco UCS team recommends against immediate replacement of modules with correctable errors.
Customers who experience a Degraded memory alert for correctable errors are advised to reset the memory error and resume operation. This recommendation helps to avoid unnecessary server disruption. Future enhancements to error management distinguish among various types of correctable errors, and identify the appropriate actions, if any, needed.
At minimum, use version 2.1(3c) or 2.2(1b), which include enhancements to UCS memory error management.
Log Files to Check in Tech Support
UCSM_X_TechSupport > sam_techsupportinfo provides information about DIMM and memory array.
Chassis/server tech support
CIMCX_TechSupport\tmp\CICMX_TechSupport.txt -> Generic tech support information about server X.
CIMCX_TechSupport\obfl\obfl-log -> OBFL logs provide an ongoing log of the status and boot of server X.
CIMCX_TechSupport\var\log\sel -> SEL logs for server X.
Based on the platform/version, navigate to the files in tech support bundle.
var/nuova/BIOS > RankMarginTest.txt
var/nuova/BIOS > MemoryHob.txt
var/nuova/BIOS > MrcOut_*.txt
These files provide information about memory as seen from BIOS level.
The information in these files can be cross-referenced against the DIMM status tables.
Example:
/var/nuova/BIOS/RankMarginTest.txt
- Useful for showing the test results from BIOS Training test MEMBIST.
- Look for errors.
- Look to see if any DIMMs are mapped out.
- Show DIMM specific information (Vendor/speed/PID).
DIMM |GB|R|MfgDate|Mod ID |DRAM ID |Reg ID |CtW Tck CLS Taa V|Freq|Part#
A1 18| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
A2 26| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
B1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
B2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
C1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
C2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
D1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
D2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
E1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
E2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
F1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
F2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
The first column has two values:
DIMM locator (F2)
DIMM status (01)
Here is a brief description for each status:
0x00 // Not Installed (No DIMM)
0x01 // Installed (Working)
//// 0x02-0F (Reserved)
//// Failed
0x10 // Failed Training
0x11 // Failed Clock Training
//// 0x12-17 (Reserved)
0x18 // Failed MemBIST
//// 0x19-1F (Reserved)
//// Ignored
0x20 // Ignored (Disabled from debug console)
0x21 // Ignored (SPD Error reported by BMC)
0x22 // Ignored (Non-RDIMM)
0x23 // Ignored (Non-ECC)
0x24 // Ignored (Non-x4)
0x25 // Ignored (Other PDIMM in same LDIMM failed)
0x26 // Ignored (Other LDIMM in same channel failed)
0x27 // Ignored (Other channel in LockStep or Mirror failed)
0x28 // Ignored (Invalid PDIMM population)
0x29 // Ignored (PDIMM Organization Mismatch)
0x2A // Ignored (PDIMM Register Vendor Mismatch)
//// 0x2B-7F (Reserved)
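To scan a large RankMarginTest.txt quickly, the locator and status pair in the first column can be decoded with a short script. The sketch below assumes each DIMM row starts with a locator such as A1, followed by a two-digit hexadecimal status and a pipe, as in the sample above. The status descriptions are copied from the list above; the function names are hypothetical.

```python
import re

# Status codes copied from the list above; any other value is treated as reserved.
DIMM_STATUS = {
    0x00: "Not Installed (No DIMM)",
    0x01: "Installed (Working)",
    0x10: "Failed Training",
    0x11: "Failed Clock Training",
    0x18: "Failed MemBIST",
    0x20: "Ignored (Disabled from debug console)",
    0x21: "Ignored (SPD Error reported by BMC)",
    0x22: "Ignored (Non-RDIMM)",
    0x23: "Ignored (Non-ECC)",
    0x24: "Ignored (Non-x4)",
    0x25: "Ignored (Other PDIMM in same LDIMM failed)",
    0x26: "Ignored (Other LDIMM in same channel failed)",
    0x27: "Ignored (Other channel in LockStep or Mirror failed)",
    0x28: "Ignored (Invalid PDIMM population)",
    0x29: "Ignored (PDIMM Organization Mismatch)",
    0x2A: "Ignored (PDIMM Register Vendor Mismatch)",
}

# Assumed row shape: "<locator> <status>|...", for example "F2 01| 8|2|...".
ROW = re.compile(r"^([A-Z]\d+)\s+([0-9A-Fa-f]{2})\|")


def parse_rank_margin_rows(text):
    """Return a list of (locator, status_code, description) for each DIMM row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            code = int(match.group(2), 16)
            rows.append((match.group(1), code, DIMM_STATUS.get(code, "Reserved")))
    return rows


def not_working(rows):
    """Keep DIMMs that are present but not in the Installed (Working) state."""
    return [row for row in rows if row[1] not in (0x00, 0x01)]


if __name__ == "__main__":
    sample = "A1 18| 8|2|2009W48|Samsung\nB1 01| 8|2|2009W48|Samsung"
    for locator, code, description in not_working(parse_rank_margin_rows(sample)):
        print(f"{locator}: 0x{code:02X} {description}")
```

In the sample table above, this approach would flag A1 (0x18, Failed MemBIST) and A2 (0x26, ignored because another DIMM in the same channel failed).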
var/nuova/BIOS > MemoryHob.txt
Shows effective and failed memory installed on the server.
+++ BEGINNING OF FILE
Memory Speed = 1067 MHz
Memory Mode = 00
RAS Modes = 03
MRC Flags = 0000000A
Total Memory = 98304 MB
Effective Memory = 90112 MB
Failed Memory = 8192 MB
Ignored Memory = 0 MB
Redundant Memory = 0 MB
|---------------------------------|
| Memory | Channel | DIMM Status |
| Channel | Status | 1 2 |
|---------------------------------|
| A | 01 | 01 01 |
| B | 01 | 01 01 |
| C | 01 | 01 01 |
| D | 01 | 01 01 |
| E | 01 | 01 01 |
| F | 01 | 01 18 |
|---------------------------------|
18h - DIMM status is marked as failed when it fails in MemBist test. Replace with a known good DIMM.
DIMM Status Description
00h Not Installed (No DIMM)
01h Installed (Working)
02h-0Fh Reserved
10h Failed (Training)
11h Failed (Clock training)
12h-17h Reserved
18h Failed (MemBIST)
19h-1Fh Reserved
20h Ignored (Disabled from debug console)
21h Ignored (SPD Error reported by BMC)
22h Ignored (Non-RDIMM)
23h Ignored (Non-ECC)
24h Ignored (Non-x4)
25h Ignored (Other PDIMM in same LDIMM failed)
26h Ignored (Other LDIMM in same channel failed)
27h Ignored (Other channel in LockStep or Mirror)
28h Ignored (Invalid memory population)
29h Ignored (Organization mismatch)
2Ah Ignored (Register vendor mismatch)
2Bh-7Fh Reserved
80h Ignored (Workaround - Looping)
81h Ignored (Stuck I2C bus)
82h-FFh Reserved
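The memory totals and the per-channel status grid in MemoryHob.txt can also be read programmatically. The following is a minimal sketch that assumes the layout shown in the sample above (summary lines of the form "Total Memory = 98304 MB" and grid rows of the form "| F | 01 | 01 18 |"). The function names are hypothetical, and treating 10h, 11h, and 18h as the failed codes follows the status table above.

```python
import re

SUMMARY = re.compile(r"^\s*(Total|Effective|Failed|Ignored|Redundant) Memory\s*=\s*(\d+)\s*MB")
# Assumed grid row: | channel letter | channel status | one status per DIMM slot |
GRID_ROW = re.compile(r"^\|\s*([A-Z])\s*\|\s*([0-9A-Fa-f]{2})\s*\|\s*((?:[0-9A-Fa-f]{2}\s*)+)\|")

FAILED_CODES = {0x10, 0x11, 0x18}  # Failed (Training), (Clock training), (MemBIST)


def parse_memory_hob(text):
    """Return (summary in MB keyed by name, DIMM status code keyed by locator)."""
    summary = {}
    dimm_status = {}
    for line in text.splitlines():
        match = SUMMARY.match(line)
        if match:
            summary[match.group(1)] = int(match.group(2))
            continue
        match = GRID_ROW.match(line.strip())
        if match:
            channel = match.group(1)
            # Slots are numbered from 1 in the order they appear in the row.
            for slot, code in enumerate(match.group(3).split(), start=1):
                dimm_status[f"{channel}{slot}"] = int(code, 16)
    return summary, dimm_status


def failed_dimms(dimm_status):
    """Return the sorted locators whose status is one of the failed codes."""
    return sorted(loc for loc, code in dimm_status.items() if code in FAILED_CODES)


if __name__ == "__main__":
    sample = (
        "Total Memory = 98304 MB\n"
        "Effective Memory = 90112 MB\n"
        "Failed Memory = 8192 MB\n"
        "| E | 01 | 01 01 |\n"
        "| F | 01 | 01 18 |\n"
    )
    summary, status = parse_memory_hob(sample)
    print(summary, failed_dimms(status))
```

In the sample file above, the 8192 MB of Failed Memory corresponds to the single 8 GB DIMM in slot F2 with status 18.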
DIMM Blocklisting
In Cisco UCS Manager, the state of the Dual In-line Memory Module (DIMM) is based on SEL event records.
When the BIOS encounters a noncorrectable memory error during memory test execution, the DIMM is marked as faulty.
A faulty DIMM is considered a nonfunctional device.
If you enable DIMM blocklisting, Cisco UCS Manager monitors the memory test execution messages and blocklists any DIMMs that encounter memory errors in the DIMM SPD data.
DIMM Blocklisting was introduced as an optional global policy in UCSM 2.2(2).
Server firmware must be 2.2(1)+ for B-series blades and 2.2(3)+ for C-series rack servers to properly implement this feature.
In UCSM 2.2(4), DIMM blocklisting is enabled by default.
Open the tech support file …/var/log/DimmBL.log
Open the file /var/nuova/BIOS/MrcOut.txt if it is available
Find the DIMM Status table. Look for DIMM Status:
DIMM Blocklisted = 1E
DIMM Status:
00 - Not Installed
01 - Installed
10 - Failed (Training failure)
1E - Failed (DIMM Blocklisted by BMC)
1F - Failed (SPD Error)
25 - Disabled (Other DIMM failed in same channel.)
Example:
DIMM Status:
|=======================|
| Memory | DIMM Status |
| Channel | 1 2 3 |
|=======================|
| A | 25 1F 25 |
| B | 01 01 01 |
| C | 1F 25 25 |
| D | 01 01 01 |
| E | 01 01 01 |
| F | 25 25 1E |
| G | 01 01 01 |
| H | 01 01 01 |
|=======================|
DIMM Status:
01 - Installed
1E - Failed (DIMM Blocklisted by BMC)
1F - Failed (SPD Error)
25 - Disabled (Other DIMM failed in same channel)
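To pick the blocklisted DIMMs out of a larger MrcOut file, the DIMM Status grid can be parsed in the same spirit. The sketch below assumes grid rows of the form "| F | 25 25 1E |" as in the example above, with slots numbered from 1 left to right. The descriptions are copied from the legend above; the function names are hypothetical.

```python
import re

# Legend copied from the DIMM Status list above.
MRC_STATUS = {
    "00": "Not Installed",
    "01": "Installed",
    "10": "Failed (Training failure)",
    "1E": "Failed (DIMM Blocklisted by BMC)",
    "1F": "Failed (SPD Error)",
    "25": "Disabled (Other DIMM failed in same channel)",
}

# Assumed grid row: | channel letter | one two-digit status per DIMM slot |
GRID_ROW = re.compile(r"^\|\s*([A-Z])\s*\|\s*((?:[0-9A-Fa-f]{2}\s+)+)\|$")


def parse_mrc_dimm_status(text):
    """Return a dict that maps a locator such as 'F3' to its status code string."""
    grid = {}
    for line in text.splitlines():
        match = GRID_ROW.match(line.strip())
        if match:
            channel = match.group(1)
            for slot, code in enumerate(match.group(2).split(), start=1):
                grid[f"{channel}{slot}"] = code.upper()
    return grid


def dimms_with_status(grid, code):
    """Return the sorted locators that report the given status code."""
    return sorted(loc for loc, value in grid.items() if value == code.upper())


def describe(code):
    """Look up a status code in the legend; unknown codes are reported as such."""
    return MRC_STATUS.get(code.upper(), "Unknown")


if __name__ == "__main__":
    sample = "| A | 25 1F 25 |\n| B | 01 01 01 |\n| F | 25 25 1E |"
    grid = parse_mrc_dimm_status(sample)
    print("Blocklisted:", dimms_with_status(grid, "1E"))
    print("SPD error:", dimms_with_status(grid, "1F"))
```

For the example grid above, this would report F3 as blocklisted and A2 and C1 as SPD errors, with the remaining DIMMs in channels A, C, and F disabled as a side effect.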
Methods to Clear DIMM Blocklisting Errors
UCSM GUI
UCSM CLI
UCS-B/chassis/server # reset-all-memory-errors
Related Information
Notable Bugs
Cisco bug ID CSCug93076 B200M3-DDR voltage regulator has excessive noise under light load
Cisco bug ID CSCup07488 IPMI DIMM fault sensor is setting Dimm Degraded with no error count.
Cisco bug ID CSCud22620 Improved accuracy at identifying Degraded DIMMs
Cisco bug ID CSCuw44524 C460M4, B260M4 or B460M4 IVB clear CMOS can cause memory UECC Error
Cisco bug ID CSCur19705 ECC/UECC Errors observed on B200M3
Cisco bug ID CSCvm88447 Reset ECC steps documentation are missing for Standalone Colusa Servers