Troubleshoot upgrade failure due to vMotion error for VMs with vGPU


Updated: July 28, 2023

Document ID: 220697

Introduction

This document describes how to troubleshoot a HyperFlex upgrade failure caused by a vMotion error for VMs with vGPU.

Prerequisites

HyperFlex/ESXi cluster with inconsistent ECC configuration for NVIDIA GPUs.

Note: Confirm that the system is not affected by Cisco bug ID CSCvp47724.

Requirements

vCenter cluster with NVIDIA GPUs enabled for VMs.

Components Used

The information in this document is based on these software and hardware versions:

Intersight (SaaS in this case)
HyperFlex 5.0(2a)
NVIDIA GPU (Tesla T4)

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.

Background information

In this specific scenario, vMotion failed because the ECC configuration of the NVIDIA GPUs was inconsistent across the nodes, which caused an error during the HyperFlex upgrade.

Note: NVIDIA GPU cards that use the Pascal architecture (such as Tesla P100 and P40) or the Volta architecture (such as Tesla V100), as well as the Tesla M6 and M60 GPUs, support ECC memory for improved data integrity. However, the NVIDIA vGPU software does not support ECC. You must therefore ensure that ECC memory is disabled on all GPUs when you use NVIDIA vGPU.

Problem

The upgrade fails because vMotion fails, and the vMotion failure is caused by inconsistent ECC configuration on the NVIDIA GPU.

Failed task: 'Verify Pre-Upgrade HXDP Validations'.


Solution

Disable ECC mode on the affected node.

How was the affected node identified?

A manual vMotion of VMs to the affected node fails with this error: "One or more devices (pciPassthru0) required by VM XXXX are not available on host XXXX".

Steps:

List the NVIDIA VIBs installed on the ESXi hypervisor:

# esxcli software vib list | grep -i NVIDIA

Check that the NVIDIA driver is operational:

[root@hxesxi:~] nvidia-smi
Sat Jul 22 01:31:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.02   Driver Version: 470.182.02   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
| N/A   35C    P8    16W /  70W |   1971MiB / 15359MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Check the ECC (Error Correcting Code) mode. In this case, ECC mode was found enabled on the affected node.

# nvidia-smi -q
ECC Mode
Current                     : Enabled
Pending                     : Enabled
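
The full `nvidia-smi -q` report is long, so it can help to pull out only the two ECC Mode fields. The snippet below is a minimal sketch rather than a Cisco or NVIDIA tool: `ecc_modes` is a hypothetical helper name, and it assumes the report contains an "ECC Mode" header followed by "Current" and "Pending" lines, as in the output shown above.

```shell
#!/bin/sh
# Sketch only: ecc_modes is a hypothetical helper, not an NVIDIA tool.
# Reads `nvidia-smi -q` text on stdin and prints the ECC Mode fields.
ecc_modes() {
    awk '
        tolower($0) ~ /ecc mode/ { in_ecc = 1; next }                  # section header
        in_ecc && /Current/      { print "current=" $NF }              # mode in effect now
        in_ecc && /Pending/      { print "pending=" $NF; in_ecc = 0 }  # mode after next reboot
    '
}

# Usage on the ESXi host (assumes nvidia-smi is in the PATH):
#   nvidia-smi -q | ecc_modes
```

On a host with several GPUs, the helper prints one current/pending pair per GPU, in the order in which they appear in the report.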

Disable ECC:

# nvidia-smi -e 0
Disabled ECC support for GPU 0000….
All done.

Reboot the node so that the pending mode takes effect, then check that the ECC mode is disabled:

# nvidia-smi -q
ECC Mode
Current                     : Disabled
Pending                     : Disabled
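
The Current field shows the mode that is in effect now, and the Pending field shows the mode that takes effect after the next reboot. The sketch below turns the two values into a short verdict. `ecc_status` is a hypothetical helper name, and the verdict strings are illustrative only.

```shell
#!/bin/sh
# Sketch only: ecc_status is a hypothetical helper.
# Usage: ecc_status <current-mode> <pending-mode>
ecc_status() {
    current=$1
    pending=$2
    if [ "$current" = "Disabled" ] && [ "$pending" = "Disabled" ]; then
        echo "ok: ECC disabled"
    elif [ "$pending" = "Disabled" ]; then
        # ECC was turned off, but the change is not active until a reboot
        echo "reboot required: ECC disable is pending"
    else
        echo "action required: run nvidia-smi -e 0"
    fi
}

# Example: ecc_status Enabled Disabled
```

A node that reports Current: Enabled and Pending: Disabled has had ECC turned off but has not yet been rebooted.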

After you disable ECC mode and reboot the node, vMotion succeeds and the upgrade proceeds.
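
Because the failure stems from an ECC configuration that differs between nodes, it can be useful to compare all hosts before you retry the upgrade. The following is a minimal sketch under stated assumptions: `ecc_enabled_hosts` is a hypothetical helper, the "<host> <mode>" input format is invented for this example, and the host names in the usage comment are placeholders. The collection loop assumes that SSH is enabled on each ESXi host and that the installed nvidia-smi accepts the `--query-gpu=ecc.mode.current` query.

```shell
#!/bin/sh
# Sketch only: ecc_enabled_hosts is a hypothetical helper.
# Input on stdin: one "<host> <mode>" pair per line (assumed format).
# Prints every host whose mode is not "Disabled"; returns 1 if any is found.
ecc_enabled_hosts() {
    awk '$2 != "Disabled" { print $1; found = 1 } END { exit found ? 1 : 0 }'
}

# Possible collection loop (host names are placeholders; only the first
# GPU of each host is read):
#   for h in hx-esxi-1 hx-esxi-2; do
#       mode=$(ssh root@"$h" nvidia-smi --query-gpu=ecc.mode.current \
#              --format=csv,noheader | head -n 1)
#       echo "$h $mode"
#   done | ecc_enabled_hosts
```

An empty result with a zero exit status means that every listed host reports ECC as disabled.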

Related information

Nvidia - Insufficient resources. One or more devices

VMware - Using GPUs with Virtual Machines on vSphere

Revision History

Revision	Publish Date	Comments
1.0	01-Aug-2023	Initial Release

Contributed by Cisco Engineers

Akash Malla
Cisco TAC


This Document Applies to These Products

HyperFlex HX Data Platform
