Troubleshoot DIMM Memory Issues in UCS


Updated: January 22, 2024

Document ID: 200775


Troubleshoot Methodology

Terms and Acronyms

Memory Placement

Memory Errors

Correctable versus Uncorrectable Errors

Troubleshoot DIMMs via UCSM and CLI

Check Errors from GUI

Check Errors from CLI

Log Files to Check in Tech Support

DIMM Blocklisting

Methods to Clear DIMM Blocklisting Errors

Introduction

This document describes how to troubleshoot memory modules and related issues in the Cisco Unified Computing System (UCS) solution.

Prerequisites

Requirements

Cisco recommends knowledge of Cisco Unified Computing System (UCS).

Components Used

This document is not restricted to specific software and hardware versions.

However, this document addresses:

Cisco UCS B-Series Blade Servers
UCS Manager

UCS uses Dual In-line Memory Modules (DIMMs) as RAM modules.

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.

Troubleshoot Methodology

This section covers several parts of UCS memory issues.

Memory placement
Troubleshoot DIMMs via UCSM and CLI
Logs to check in technical support

Terms and Acronyms

DIMM	Dual In-line Memory Module
ECC	Error Correcting Code
LVDIMM	Low Voltage DIMM
MCA	Machine Check Architecture
MEMBIST	Memory Built-in Self Test
MRC	Memory Reference Code
POST	Power On Self Test
SPD	Serial Presence Detect
DDR	Double Data Rate
RAS	Reliability, Availability and Serviceability

Memory Placement

Memory placement is one of the most notable physical aspects of the UCS solution.

Typically, the server comes with memory pre-populated with a requested amount.

However, when in doubt, refer to the hardware installation guide.

For memory population rules, refer to B-series technical specifications for the specific platform.

B-series technical specifications link:

Data Sheets

Memory Errors

DIMM Error
- Multibit = Uncorrectable
- Singlebit = Correctable

Error Correcting Code (ECC) Error

Parity Error

Serial Presence Detect (SPD) Error

Configuration Error
- Not supported DIMMs
- Not supported DIMM population
- Unpaired DIMMs
- Mismatch errors

Identity unestablishable error
- Check and update the catalog.

Correctable versus Uncorrectable Errors

Whether a particular error is correctable or uncorrectable depends on the strength of the ECC code employed within the memory system.

Dedicated hardware is able to fix correctable errors when they occur with no impact on program execution.

DIMMs with correctable errors are not disabled and remain available to the OS, so Total Memory and Effective Memory are the same.

UCSM reports such a DIMM with an operability state of Degraded, while its overall status remains Operable (with correctable errors).

Uncorrectable errors make it impossible for the application or operating system to continue execution.

DIMMs with uncorrectable errors are disabled and the OS does not see them. The UCSM operState changes to Inoperable in this case.
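To illustrate how the strength of the code decides what is correctable, the following is a toy sketch of an extended Hamming (SECDED) code over 4 data bits. It is not the ECC scheme used by UCS memory controllers, which this document does not describe; the function names `encode` and `decode` are hypothetical. It only shows why one flipped bit can be repaired while two flipped bits can merely be detected.

```python
# Toy SECDED (single-error-correct, double-error-detect) code over 4 data bits.
# Illustrative only: real server ECC protects wider words with stronger codes.

def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8
    # Data bits occupy positions 3, 5, 6, 7; parity bits occupy 1, 2, 4.
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 is an overall parity bit that makes the total number of 1s even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions that hold a 1
    parity_broken = sum(code) % 2 == 1
    fixed = list(code)
    if parity_broken:
        # An odd number of flips; assume one. The syndrome names the flipped
        # position (0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Overall parity looks fine but the syndrome is nonzero: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
single = list(word)
single[5] ^= 1            # one flipped bit: singlebit, correctable
double = list(word)
double[5] ^= 1
double[6] ^= 1            # two flipped bits: multibit, uncorrectable
print(decode(word), decode(single), decode(double))
```

In this sketch the single-bit case returns the original data unchanged, which mirrors how a correctable error has no impact on program execution, while the two-bit case can only be reported.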

Troubleshoot DIMMs via UCSM and CLI

Check Errors from GUI

DIMM Status (UCSM)	Operability (UCSM)	Logs (SEL)	Description
Operable	Operable	Check SEL for DIMM-related errors.	A DIMM is installed and functional.
Operable	Degraded	Check SEL for ECC errors.	A correctable ECC DIMM error was detected at run time.
Removed	N/A	No logs	A DIMM is not installed or has corrupted SPD data.
Disabled	Operable	Check SEL for Identity unestablishable errors.	Check and update the capability catalog.
Disabled	N/A	Check SEL to see whether another DIMM failed in the same channel.	A DIMM is healthy but disabled because a failed DIMM in the same channel prevents the configuration rule from being met.
Disabled	N/A	No logs	A memory configuration rule failed because of missing DIMMs.
Inoperable	Inoperable/Replacement required		An uncorrectable ECC error (UE) was detected.
Degraded	Inoperable	Check SEL for ECC errors.	DIMM status and operability changed because ECC errors were detected before the host rebooted.
Degraded	Inoperable/Replacement required	Check SEL for ECC errors during POST/MRC.	An uncorrectable ECC error was detected during runtime. The DIMM remains available to the OS; the OS crashes and comes back up but can still use this DIMM. The error can occur again later. In most situations the DIMM must be replaced.

To obtain statistics, navigate to Equipment > Chassis > Server > Inventory > Memory, then right-click Memory and select Show Navigator.
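As a reading aid, the table above can be treated as a lookup keyed on the two UCSM fields. The sketch below is only that: the rows are transcribed from the table, while the names `ROWS` and `lookup` are hypothetical and not part of any Cisco tool. Note that some status pairs map to more than one row, so the function returns every match.

```python
# Rows transcribed from the table above:
# (DIMM status, accepted operability values, SEL hint, description)
ROWS = [
    ("Operable", ("Operable",), "Check SEL for DIMM related errors",
     "DIMM is installed and functional"),
    ("Operable", ("Degraded",), "Check SEL for ECC errors",
     "Correctable ECC error detected at run time"),
    ("Removed", ("N/A",), "No logs",
     "DIMM not installed or corrupted SPD data"),
    ("Disabled", ("Operable",), "Check SEL for Identity unestablishable errors",
     "Check and update capability catalog"),
    ("Disabled", ("N/A",), "Check SEL for another failed DIMM in the same channel",
     "Healthy DIMM disabled because of a failed DIMM in the same channel"),
    ("Disabled", ("N/A",), "No logs",
     "Memory configuration rule failed because of missing DIMMs"),
    ("Inoperable", ("Inoperable", "Replacement required"), "",
     "Uncorrectable ECC error detected"),
    ("Degraded", ("Inoperable",), "Check SEL for ECC errors",
     "ECC errors detected before host rebooted"),
    ("Degraded", ("Inoperable", "Replacement required"),
     "Check SEL for ECC error during POST/MRC",
     "Uncorrectable ECC error at runtime; DIMM usually must be replaced"),
]


def lookup(status, operability):
    """Return every table row that matches the given UCSM status pair."""
    s = status.strip().lower()
    o = operability.strip().lower()
    return [row for row in ROWS
            if row[0].lower() == s and o in (v.lower() for v in row[1])]


for row in lookup("Disabled", "N/A"):
    print(row[2], "->", row[3])
```

Because "Disabled / N/A" matches two rows, the SEL still has to be checked to tell a failed neighbor in the same channel apart from a missing DIMM.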

Check Errors from CLI

These commands are useful when troubleshooting errors from CLI.

scope server x/y -> show memory detail
scope server x/y -> show memory-array detail
scope server x/y -> scope memory-array z -> show stats history memory-array-env-stats detail

From the memory-array scope, you can also access an individual DIMM.

scope server x/y -> scope memory-array z -> scope dimm n

From there, you can obtain per-DIMM statistics or reset the error counters.

UCS /chassis/server/memory-array/dimm # reset-errors
UCS /chassis/server/memory-array/dimm* # commit-buffer
UCS /chassis/server/memory-array/dimm # show stats memory-error-state

If you see a correctable error that matches this information, the problem can be corrected by resetting the BMC instead of resetting the blade server.

Use these Cisco UCS Manager CLI commands:

(Resetting the BMC does not impact the OS running on the blade.)

To reset memory-error counters on a Cisco UCS C-Series Rack Server operating in standalone mode, run these commands on the CLI:

UCS-C# scope reset-ecc
UCS-C/reset-ecc # set enabled yes
UCS-C/reset-ecc *# commit

For Colusa servers:

UCS# scope chassis
UCS /chassis # scope server x
UCS /chassis/server # reset-ecc

With UCS releases 2.2(7), and 3.1 and above, the thresholds for memory corrected errors have been removed.

Therefore, memory modules are no longer reported as Inoperable or Degraded solely due to corrected memory errors.

As per the whitepaper Managing Correctable Memory Errors on Cisco UCS Servers:

Industry demands for greater capacity, greater bandwidth, and lower operating voltages lead to increased memory error rates.

Traditionally, the industry has treated correctable errors in the same way as uncorrectable errors, requiring the module to be replaced immediately upon alert.

Given extensive research that correctable errors are not correlated with uncorrectable errors, and that correctable errors do not degrade system performance, the Cisco UCS team recommends against immediate replacement of modules with correctable errors.

Customers who experience a Degraded memory alert for correctable errors are advised to reset the memory error and resume operation. This recommendation helps to avoid unnecessary server disruption. Future enhancements to error management will distinguish among various types of correctable errors and identify the appropriate actions, if any, that are needed.

At minimum, use version 2.1(3c) or 2.2(1b), which include enhancements to UCS memory error management.

Log Files to Check in Tech Support

UCSM_X_TechSupport > sam_techsupportinfo provides information about DIMM and memory array.

Chassis/server tech support

CIMCX_TechSupport\tmp\CIMCX_TechSupport.txt -> Generic tech support information about server X.
CIMCX_TechSupport\obfl\obfl-log -> OBFL logs provide an ongoing record of the status and boot of server X.
CIMCX_TechSupport\var\log\sel -> SEL logs for server X.

Based on the platform/version, navigate to the files in tech support bundle.

var/nuova/BIOS > RankMarginTest.txt

var/nuova/BIOS > MemoryHob.txt

var/nuova/BIOS > MrcOut_*.txt

These files provide information about memory as seen from BIOS level.

The information there can be cross-referenced against the DIMM status tables in this section.

Example:

/var/nuova/BIOS/RankMarginTest.txt

Useful for showing the test results from BIOS Training test MEMBIST.

Look for errors.
Look to see if any DIMMs are mapped out.
Show DIMM specific information (Vendor/speed/PID).

DIMM |GB|R|MfgDate|Mod ID |DRAM ID   |Reg ID    |CtW Tck CLS Taa V|Freq|Part#
A1 18| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
A2 26| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
B1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
B2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
C1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
C2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
D1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
D2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
E1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
E2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
F1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9 
F2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9

The first column has two values:

DIMM locator (F2)

DIMM status (01)

Here is a brief description for each status:

0x00 // Not Installed (No DIMM)
0x01 // Installed (Working)
//// 0x02-0F (Reserved)
//// Failed
0x10 // Failed Training
0x11 // Failed Clock Training
//// 0x12-17 (Reserved)
0x18 // Failed MemBIST
//// 0x19-1F (Reserved)
//// Ignored
0x20 // Ignored (Disabled from debug console)
0x21 // Ignored (SPD Error reported by BMC)
0x22 // Ignored (Non-RDIMM)
0x23 // Ignored (Non-ECC)
0x24 // Ignored (Non-x4)
0x25 // Ignored (Other PDIMM in same LDIMM failed)
0x26 // Ignored (Other LDIMM in same channel failed)
0x27 // Ignored (Other channel in LockStep or Mirror failed)
0x28 // Ignored (Invalid PDIMM population)
0x29 // Ignored (PDIMM Organization Mismatch)
0x2A // Ignored (PDIMM Register Vendor Mismatch)
//// 0x2B-7F (Reserved)
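The rows of RankMarginTest.txt shown earlier can be split mechanically to pull out the locator and status of each DIMM. The following is a minimal sketch that assumes every row has the ten pipe-separated columns of the sample above; the helper names `parse_row` and `not_working` are hypothetical, and other platforms or BIOS versions may lay the file out differently.

```python
# Three rows copied from the sample RankMarginTest.txt output above.
SAMPLE_ROWS = """\
A1 18| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
A2 26| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
B1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi   03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
"""

# Column names taken from the header line of the sample.
COLUMNS = ["DIMM", "GB", "R", "MfgDate", "Mod ID", "DRAM ID",
           "Reg ID", "CtW Tck CLS Taa V", "Freq", "Part#"]


def parse_row(line):
    """Split one data row into a dict keyed by the header column names."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # The first column carries two values: DIMM locator and hex status code.
    locator, status = row["DIMM"].split()
    row["locator"] = locator
    row["status"] = int(status, 16)
    return row


def not_working(text):
    """Return (locator, status) for DIMMs that are neither absent nor working."""
    rows = [parse_row(line) for line in text.splitlines() if line.strip()]
    return [(r["locator"], r["status"]) for r in rows
            if r["status"] not in (0x00, 0x01)]


for locator, status in not_working(SAMPLE_ROWS):
    print(f"{locator}: status 0x{status:02X}")
```

For the three sample rows this flags A1 (0x18, Failed MemBIST) and A2 (0x26, ignored because another LDIMM in the same channel failed), which matches how the status list above is meant to be read.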

var/nuova/BIOS > MemoryHob.txt

Shows effective and failed memory installed on the server.

 +++ BEGINNING OF FILE
Memory Speed     = 1067 MHz
Memory Mode      = 00
RAS Modes        = 03
MRC Flags        = 0000000A
Total Memory     = 98304 MB
Effective Memory = 90112 MB
Failed Memory    = 8192 MB
Ignored Memory   = 0 MB
Redundant Memory = 0 MB
|---------------------------------|
| Memory  | Channel | DIMM Status |
| Channel | Status  |   1    2    |
|---------------------------------|
|   A     |    01   |   01   01   |
|   B     |    01   |   01   01   |
|   C     |    01   |   01   01   |
|   D     |    01   |   01   01   |
|   E     |    01   |   01   01   |  
|   F     |    01   |   01   18   |
|---------------------------------|

18h - The DIMM status is marked as failed when the DIMM fails the MemBIST test. Replace it with a known good DIMM.
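A short sketch of how the MemoryHob.txt sample above could be checked by script: it collects the size lines, reads the per-channel DIMM status table, and reports slots that are neither empty nor working. The function name `parse_memory_hob` is hypothetical. The check that Total equals Effective + Failed + Ignored holds for this sample, but treating it as a general rule (for example when Redundant Memory is nonzero) is an assumption.

```python
import re

# Excerpt of the sample MemoryHob.txt shown above.
SAMPLE = """\
Memory Speed     = 1067 MHz
Total Memory     = 98304 MB
Effective Memory = 90112 MB
Failed Memory    = 8192 MB
Ignored Memory   = 0 MB
Redundant Memory = 0 MB
|---------------------------------|
| Memory  | Channel | DIMM Status |
| Channel | Status  |   1    2    |
|---------------------------------|
|   A     |    01   |   01   01   |
|   F     |    01   |   01   18   |
|---------------------------------|
"""

SIZE_LINE = re.compile(r"\s*(\w[\w ]*?)\s*=\s*(\d+) MB\s*$")
CHANNEL_LINE = re.compile(
    r"\|\s*([A-Z])\s*\|\s*([0-9A-F]{2})\s*\|\s*((?:[0-9A-F]{2}\s*)+)\|")


def parse_memory_hob(text):
    """Return (sizes in MB by label, status code by DIMM locator)."""
    sizes, dimms = {}, {}
    for line in text.splitlines():
        match = SIZE_LINE.match(line)
        if match:
            sizes[match.group(1)] = int(match.group(2))
            continue
        match = CHANNEL_LINE.match(line)
        if match:
            channel = match.group(1)
            # Slots are numbered from 1, as in the table header.
            for slot, code in enumerate(match.group(3).split(), start=1):
                dimms[f"{channel}{slot}"] = int(code, 16)
    return sizes, dimms


sizes, dimms = parse_memory_hob(SAMPLE)
accounted = (sizes["Effective Memory"] + sizes["Failed Memory"]
             + sizes["Ignored Memory"])
print("sizes add up:", accounted == sizes["Total Memory"])
print("suspect DIMMs:", {k: hex(v) for k, v in dimms.items()
                         if v not in (0x00, 0x01)})
```

In the sample, the 8192 MB of Failed Memory lines up with the single slot F2 reporting status 18h.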

DIMM Status Description

00h Not Installed (No DIMM)
01h Installed (Working)
02h-0Fh Reserved
10h Failed (Training)
11h Failed (Clock training)
12h-17h Reserved
18h Failed (MemBIST)
19h-1Fh Reserved
20h Ignored (Disabled from debug console)
21h Ignored (SPD Error reported by BMC)
22h Ignored (Non-RDIMM)
23h Ignored (Non-ECC)
24h Ignored (Non-x4)
25h Ignored (Other PDIMM in same LDIMM failed)
26h Ignored (Other LDIMM in same channel failed)
27h Ignored (Other channel in LockStep or Mirror)
28h Ignored (Invalid memory population)
29h Ignored (Organization mismatch)
2Ah Ignored (Register vendor mismatch)
2Bh-7Fh Reserved
80h Ignored (Workaround - Looping)
81h Ignored (Stuck I2C bus)
82h-FFh Reserved
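The status list above maps directly onto a small decoder. The sketch below is a convenience built only from that list; the names `DIMM_STATUS`, `describe`, and `group` are hypothetical. Any code the list marks as Reserved is returned as "Reserved", even though other files or platforms (such as the blocklisting table later in this document) may assign a meaning to some of those values.

```python
# Named codes from the DIMM Status Description list above.
DIMM_STATUS = {
    0x00: "Not Installed (No DIMM)",
    0x01: "Installed (Working)",
    0x10: "Failed (Training)",
    0x11: "Failed (Clock training)",
    0x18: "Failed (MemBIST)",
    0x20: "Ignored (Disabled from debug console)",
    0x21: "Ignored (SPD Error reported by BMC)",
    0x22: "Ignored (Non-RDIMM)",
    0x23: "Ignored (Non-ECC)",
    0x24: "Ignored (Non-x4)",
    0x25: "Ignored (Other PDIMM in same LDIMM failed)",
    0x26: "Ignored (Other LDIMM in same channel failed)",
    0x27: "Ignored (Other channel in LockStep or Mirror)",
    0x28: "Ignored (Invalid memory population)",
    0x29: "Ignored (Organization mismatch)",
    0x2A: "Ignored (Register vendor mismatch)",
    0x80: "Ignored (Workaround - Looping)",
    0x81: "Ignored (Stuck I2C bus)",
}


def describe(code):
    """Return the description for a one-byte DIMM status code."""
    if not 0x00 <= code <= 0xFF:
        raise ValueError("DIMM status is a single byte (00h-FFh)")
    # Every value not named in the list falls in a Reserved range.
    return DIMM_STATUS.get(code, "Reserved")


def group(code):
    """Return the coarse group: Not Installed, Installed, Failed, Ignored, Reserved."""
    return describe(code).split(" (")[0]


for text in ("01", "18", "26", "2B"):
    code = int(text, 16)
    print(f"{text}h -> {group(code)}: {describe(code)}")
```

Grouping by the leading word keeps the triage simple: Failed codes point at the DIMM itself, while Ignored codes usually point at a neighbor or at the population rules.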

DIMM Blocklisting

In Cisco UCS Manager, the state of the Dual In-line Memory Module (DIMM) is based on SEL event records.

When the BIOS encounters a noncorrectable memory error during memory test execution, the DIMM is marked as faulty.

A faulty DIMM is considered a nonfunctional device.

If you enable DIMM blocklisting, Cisco UCS Manager monitors the memory test execution messages and blocklists any DIMMs that encounter memory errors in the DIMM SPD data.

DIMM Blocklisting was introduced as an optional global policy in UCSM 2.2(2).

Server firmware must be 2.2(1)+ for B-series blades and 2.2(3)+ for C-series rack servers to properly implement this feature.

In UCSM 2.2(4), DIMM Blocklisting is enabled by default.

Open the tech support file …/var/log/DimmBL.log

Open the file /var/nuova/BIOS/MrcOut.txt if it is available

Find the DIMM Status table. Look for DIMM Status:

DIMM Blocklisted = 1E

DIMM Status:

00 - Not Installed

01 - Installed

10 - Failed (Training failure) clear

1E - Failed (DIMM Blocklisted by BMC)

1F - Failed (SPD Error)

25 - Disabled (Other DIMM failed in same channel.)

Example:

DIMM Status:

|=======================|
| Memory  | DIMM Status |
| Channel |  1  2  3    |
|=======================|
|    A    | 25 1F 25    |
|    B    | 01 01 01    |
|    C    | 1F 25 25    |
|    D    | 01 01 01    |
|    E    | 01 01 01    |
|    F    | 25 25 1E    |
|    G    | 01 01 01    |
|    H    | 01 01 01    |
|=======================|

DIMM Status:

01 - Installed
1E - Failed (DIMM Blocklisted by BMC)
1F - Failed (SPD Error)
25 - Disabled (Other DIMM failed in same channel)
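The example table can be scanned for blocklisted DIMMs with a few lines of script. This is a sketch under the assumption that each channel row has the form `| <letter> | <codes> |` as in the example; the names `dimm_states` and `with_status` are hypothetical, and real MrcOut files may contain additional text around the table.

```python
import re

# Three channel rows taken from the example table above.
SAMPLE = """\
| Memory  | DIMM Status |
| Channel |  1  2  3    |
|    A    | 25 1F 25    |
|    B    | 01 01 01    |
|    F    | 25 25 1E    |
"""

BLOCKLISTED = 0x1E  # Failed (DIMM Blocklisted by BMC)
SPD_ERROR = 0x1F    # Failed (SPD Error)

# A channel row: single-letter channel name, then two-digit hex codes.
ROW = re.compile(r"\|\s*([A-Z])\s*\|\s*((?:[0-9A-F]{2}\s+)+)\|")


def dimm_states(text):
    """Map DIMM locator (channel letter + slot number) to its status code."""
    states = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header, border, or unrelated line
        channel = match.group(1)
        for slot, code in enumerate(match.group(2).split(), start=1):
            states[f"{channel}{slot}"] = int(code, 16)
    return states


def with_status(states, wanted):
    """Return the sorted locators whose status equals the wanted code."""
    return sorted(loc for loc, code in states.items() if code == wanted)


states = dimm_states(SAMPLE)
print("blocklisted:", with_status(states, BLOCKLISTED))
print("SPD error:  ", with_status(states, SPD_ERROR))
```

In the example, channel F shows why the neighbors matter: F3 is the blocklisted DIMM (1E), and F1 and F2 are only disabled (25) as a consequence.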

Methods to Clear DIMM Blocklisting Errors

UCSM GUI

DIMM Blocklisting Errors

UCSM CLI

UCS-B/chassis/server # reset-all-memory-errors

Related Information

Cisco UCS Manager GUI Configuration Guide, Release 2.2

Field Notice: FN - 63651 - UCS-B M3-Series Blade Servers Might Get Memory Errors Due to Voltage Regulator Setting - BIOS/Firmware Upgrade Recommended

Notable Bugs

Cisco bug ID CSCug93076 B200M3-DDR voltage regulator has excessive noise under light load

Cisco bug ID CSCup07488 IPMI DIMM fault sensor is setting Dimm Degraded with no error count.

Cisco bug ID CSCud22620 Improved accuracy at identifying Degraded DIMMs

Cisco bug ID CSCuw44524 C460M4, B260M4 or B460M4 IVB clear CMOS can cause memory UECC Error

Cisco bug ID CSCur19705 ECC/UECC Errors observed on B200M3

Cisco bug ID CSCvm88447 Reset ECC steps documentation are missing for Standalone Colusa Servers

Revision History

Revision	Publish Date	Comments
4.0	22-Jan-2024	Updated Biased Language, Machine Translation, Style Requirements, and Formatting.
3.0	19-Dec-2022	Recertification
1.0	21-Oct-2016	Initial Release

Contributed by Cisco Engineers

Sivakumar Sukumar
Senior Technical Leader Cisco


This Document Applies to These Products

UCS B-Series Blade Servers
