Diagnostics Configuration

Overview of Cisco UCS Manager Diagnostics

The Cisco UCS Manager diagnostics tool enables you to verify the health of the hardware components on your servers. The diagnostics tool provides a variety of tests to exercise and stress the various hardware subsystems on the servers, such as memory and CPU. You can use the tool to run a sanity check on the state of your servers after you fix or replace a hardware component. You can also use this tool to run comprehensive burn-in tests before you deploy a new server in your production environment.

When a system is new, a default diagnostics policy is created in the org scope. This policy is named default and cannot be deleted; attempting to delete it returns an error message. The default diagnostics policy is the preferred way to run the same set of tests across all servers. Any diagnostics policy, including the default, can be customized.

The default policy contains a single memory test. You can modify the default parameters of this memory test, and you can also delete the memory test from the default diagnostics policy. However, a diagnostics policy that has no memory test will not run.

Creating a Diagnostics Policy

Before you begin

You must have admin privileges to perform this task.

Procedure


Step 1

Navigate to Servers > Policies > Diagnostics Policies.

Step 2

Click Add.

Step 3

Complete the following fields:

Field

Description

Name

Name of the diagnostics policy. The character limit is 16.

Description

Description of the diagnostics policy. This is optional.

Step 4

Click Next.

Step 5

Click Add.

Step 6

Complete the following fields:

Name

Description

Order

The order in which the tests will be executed.

CPU Filter

Sets the CPU filter to all CPUs or to a specified CPU.

Loop Count

Sets the loop count to the specified number of iterations. The range is 1 to 1000.

Memory Chunk Size

Sets the memory chunk to 5mb-chunk or big-chunk.

Memory Size

Sets the memory size to a specific value.

Pattern

Sets the memory test to butterfly, killer, prbs, prbs-addr, or prbs-killer.

Step 7

Click OK.

Step 8

Click Finish.
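The field limits in this procedure can be checked before you enter them in the GUI. The following is a minimal sketch only: validate_diag_policy is a hypothetical helper name, it does not contact Cisco UCS Manager, and it mirrors only the limits stated in the tables above (16-character name, loop count 1 to 1000, the two chunk sizes, and the five patterns). Rejecting an empty name is an assumption, not something this guide states.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check for diagnostics policy values. It only mirrors the
# limits listed in the field tables above and does not contact UCS Manager.

validate_diag_policy() {
    local name=$1 loop_count=$2 chunk_size=$3 pattern=$4

    # Name: at most 16 characters (rejecting an empty name is an assumption).
    if [ -z "$name" ] || [ "${#name}" -gt 16 ]; then
        echo "invalid name: use 1 to 16 characters"
        return 1
    fi

    # Loop Count: digits only, at most 4 digits, value from 1 to 1000.
    case $loop_count in
        ''|*[!0-9]*) echo "invalid loop count: not a whole number"; return 1 ;;
    esac
    if [ "${#loop_count}" -gt 4 ] || [ "$loop_count" -lt 1 ] || [ "$loop_count" -gt 1000 ]; then
        echo "invalid loop count: use 1 to 1000"
        return 1
    fi

    # Memory Chunk Size: one of the two documented values.
    case $chunk_size in
        5mb-chunk|big-chunk) ;;
        *) echo "invalid memory chunk size: $chunk_size"; return 1 ;;
    esac

    # Pattern: one of the documented memory test patterns.
    case $pattern in
        butterfly|killer|prbs|prbs-addr|prbs-killer) ;;
        *) echo "invalid pattern: $pattern"; return 1 ;;
    esac

    return 0
}

# Example call with made-up values.
validate_diag_policy "mem-burnin" 100 big-chunk prbs-killer && echo "policy values look valid"
```

The function prints the first problem it finds and returns a non-zero status, so it can be used as a guard in a larger script.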


Diagnostics Test on a Blade Server

Starting a Diagnostics Test on a Blade Server

Before you begin

You must have admin privileges to perform this task.

Procedure


Step 1

Navigate to Equipment > Chassis > Server.

Step 2

Choose the server for which you want to start the diagnostics test.

Step 3

Click on the Diagnostics tab.

Step 4

Click Start. Once the diagnostics test has started, the button will be grayed out.


Stopping a Diagnostics Test on a Blade Server

Procedure


Step 1

Navigate to Equipment > Chassis > Server.

Step 2

Choose the server for which you want to stop the diagnostics test.

Step 3

Click on the Diagnostics tab.

Step 4

Click Stop. Once the diagnostics test has stopped, the button will be grayed out.


Diagnostics Test on a Rack Server

Starting a Diagnostics Test on a Rack Server

The diagnostics test is available for C220 M5, C240 M5, C480 M5, and C480 M5 ML rack servers.

Before you begin

You must have admin privileges to perform this task.

Procedure


Step 1

Navigate to Equipment > Rack Mounts > Server.

Step 2

Choose the server for which you want to start the diagnostics test.

Step 3

Click on the Diagnostics tab.

Step 4

Click Start. Once the diagnostics test has started, the button will be grayed out.


Stopping a Diagnostics Test on a Rack Server

Procedure


Step 1

Navigate to Equipment > Rack Mounts > Server.

Step 2

Choose the server for which you want to stop the diagnostics test.

Step 3

Click on the Diagnostics tab.

Step 4

Click Stop. Once the diagnostics test has stopped, the button will be grayed out.


Starting a Diagnostics Test on All Servers


Note

Starting a diagnostics test on all servers causes each server to reboot.


Before you begin

You must have admin privileges to perform this task.

Procedure


Step 1

Navigate to Equipment > Diagnostics.

Step 2

Click Start. Once the diagnostics test has started, the link will be grayed out.

In the Diagnostic Result table, you can view the following information:

Field

Description

Name

The system-defined server name.

Chassis ID

The unique identifier for the chassis. This numeric identifier is assigned based on the location of the chassis within the system.

Note 

Not applicable for rack servers.

PID

The server model PID.

Overall Progress Percentage

A description of the overall progress percentage of the diagnostics test on the server.

Operation Status

A description of the diagnostics operation status of the server.

Note 

If a server fails to run the diagnostics test, click the server link to view the error description under the Diagnostics tab. You can also view the generated faults in the Faults tab.


Stopping a Diagnostics Test on All Servers

Before you begin

You must have admin privileges to perform this task.

Procedure


Step 1

Navigate to Equipment > Diagnostics.

Step 2

Click Stop. Once the diagnostics test has stopped, the link will be grayed out.


Viewing the Server Diagnostics Status/Result

Before you begin

You can run the diagnostics test on individual servers through the CLI and view the status on this page.

Procedure


Step 1

In the Navigation pane, click Equipment.

Step 2

Expand Equipment > Chassis > Servers.

For rack servers, expand Equipment > Rack Mounts > Server.

Step 3

Choose the server for which you want to view the diagnostic status and then click the Diagnostics tab.

You can view the following information:

Name

Description

Diagnostic Policies

Enables the user to select a diagnostics policy and apply it to a specific server.

Start/Stop

Enables the user to start or stop a diagnostics test on a specific server.

Operation State

The server's diagnostics operation status. Possible values are Idle, In-Progress, Completed, Failed, and Cancelled.

FSM Status Descr

A brief description of the current task in the server's diagnostics operation.

FSM Progress

The overall progress of the diagnostics operation being executed on the server.

Test Overall Progress

The overall progress of the diagnostics test.

Error Description

A description of the error returned from the diagnostics operation.

Table 1. Diagnostic Result

Name

Description

ID

The unique identifier associated with the test.

Test Type

The type of diagnostics test.

Status

The status of the test execution. Values are: Idle, In Progress, Completed, or Failed.

Description

The description of the diagnostics test run. Once the test is complete, it provides detailed descriptions of the result.

Result

The result of the diagnostics test. Values are Pass, Fail, or NA.

Progress Percentage

The progress percentage of the diagnostics test.
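If you record the Operation State values from the Diagnostics tab in a script (for example, while tracking several servers), it can help to group them into broader classes. The following is a sketch under stated assumptions: diag_state_class is a hypothetical helper, and the grouping into idle, running, and finished is an interpretation of the listed values, not behavior defined by Cisco UCS Manager.

```shell
#!/usr/bin/env bash
# Hypothetical helper: group an Operation State value into a broader class.
# The class names (idle, running, finished) are assumptions for illustration.

diag_state_class() {
    case $1 in
        Idle)                          echo "idle" ;;
        In-Progress|"In Progress")     echo "running" ;;
        Completed|Failed|Cancelled)    echo "finished" ;;
        *)                             echo "unknown"; return 1 ;;
    esac
}

# Example: decide whether a recorded state still needs to be watched.
state="In-Progress"
if [ "$(diag_state_class "$state")" = "running" ]; then
    echo "diagnostics still running"
fi
```

Any value not listed in the table is reported as unknown with a non-zero return status, so unexpected input is not silently treated as finished.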

Diagnostics Troubleshooting

Issue

Steps to Debug

If the BIOS detects a bad DIMM, the DIMM is disabled and is not visible to the Diagnostics operation.

Refer to memory-related faults in addition to the diagnostics operation results.

If the DIMM blacklisting feature is enabled and a DIMM is blacklisted, it is not visible to the Diagnostics operation.

Refer to memory-related faults in addition to the diagnostics operation results.

The Diagnostics operation may not execute successfully if the server has bad DIMMs that prevent the server from booting.

NA

The Diagnostics operation can fail if an uncorrectable error causes a server reboot.

NA

A Diagnostics operation failure can occur if there are memory errors that cause the Diagnostics operation to hang.

NA

The Diagnostics operation can be interrupted by external events, such as a managed endpoint failover or a critical UCSM process restart. In these cases, the Diagnostics operation is cancelled and the Memory Tests are marked as failed.

The failure is triggered by external events. Retry the Diagnostics operation.

A Memory test fails with the error: Uncorrectable errors detected.

Check for server faults under the Chassis/Server/Faults tab.

See the SEL logs for the DIMM errors under the Chassis/Server/SEL Logs tab.

A Memory test failure needs further analysis.

See the diagnostics operation logs in the following log file archive on the primary FI in the /workspace partition: diagnostics/diag_log_<system-name>_<timestamp>_<chassis-id>_<blade-id>.tgz

See the analysis file: tmp/ServerDiags/MemoryPmem2.<id>/MemoryPmem2.analysis in the previously mentioned log file archive.

Use the following command to find the diagnostics logs with the analysis files:

# for file in /workspace/diagnostics/*diag*; do tar -tzvf "$file" | grep analysis && echo "IN $file"; done
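To narrow the search to one blade, the chassis and blade IDs at the end of the archive name can be used as a filter. The following is a minimal sketch, assuming the archive names follow the diag_log_<system-name>_<timestamp>_<chassis-id>_<blade-id>.tgz pattern quoted above and that the analysis files end in .analysis. The function name find_diag_analysis is hypothetical, and the script assumes a bash shell with tar and grep available.

```shell
#!/usr/bin/env bash
# Sketch: list the analysis files inside diagnostics log archives for one blade.
# Assumes archive names end in _<chassis-id>_<blade-id>.tgz as described above.

find_diag_analysis() {
    local dir=$1 chassis=$2 blade=$3
    local archive members

    for archive in "$dir"/diag_log_*_"${chassis}_${blade}".tgz; do
        # When nothing matches, the pattern stays literal; skip it.
        [ -e "$archive" ] || continue

        # List the archive members and keep only the analysis files.
        # Archives without an analysis file are skipped.
        members=$(tar -tzf "$archive" | grep '\.analysis$') || continue

        echo "IN $archive"
        echo "$members"
    done
}

# Example (chassis 1, blade 2 are made-up IDs):
#   find_diag_analysis /workspace/diagnostics 1 2
```

Unlike the one-line command, this version lists only member names (tar -t without -v) and prints the archive name before its analysis files.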