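Introduction
This document describes how to identify the Unified Computing System (UCS) server behind an alert and how to check its fault entries in the Cloud Native Deployment Platform (CNDP).
Background Information
Hardware-related alerts are reported in the Ultra Cloud Core Subscriber Microservices Infrastructure (SMI) Cluster Manager (CM) Common Execution Environment (CEE). Kubernetes (K8s), Docker, and related information is reported in the CM Virtual IP (VIP).
Note: Refer to the network design and the Customer Information Questionnaire (CIQ) to verify the IP addresses.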
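Problem
The error "Equipment Alarm" is reported in show alerts.
- Log in to the CM-CEE and run the commands show alerts active detail and show alerts history summary to display all active and historical alerts.
- Note the server IP reported in the alert.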
[lab-deployer/labceec01] cee# show alerts active detail
alerts active detail server-alert 9c367ce5ee48
severity major
type "Equipment Alarm"
startsAt 2021-10-27T17:10:37.025Z
source 10.10.10.10
summary "DDR4_P1_C1_ECC: DIMM 5 is inoperable : Check or replace DIMM"
labels [ "alertname: server-alert" "cluster: cr-chr-deployer" "description: DDR4_P1_C1_ECC: DIMM 5 is inoperable : Check or replace DIMM" "fault_id: sys/rack-unit-1/board/memarray-1/mem-5/fault-F0185" "id: 134219020" "monitor: prometheus" "replica: cr-chr-deployer" "server: 10.10.10.10" "severity: major" ]
annotations [ "dn: cr-chr-deployer/10.10.10.10/sys/rack-unit-1/board/memarray-1/mem-5/fault-F0185/134219020" "summary: DDR4_P1_C1_ECC: DIMM 5 is inoperable : Check or replace DIMM" "type: Equipment Alarm" ]
[lab-deployer/labceec01] cee# show alerts history summary
NAME UID SEVERITY STARTS AT DURATION SOURCE SUMMARY
---------------------------------------------------------------------------------------------
vm-alive f6a65030b593 minor 09-02T10:28:28 1m40s 10-192-0-13 labd0123 is alive.
vm-error 3a6d840e3eda major 09-02T10:27:18 1m 10-192-0-13 labd0123 is down.
vm-alive 49b2c1941dc6 minor 09-02T10:25:38 1m40s 10-192-0-14 labd0123 is alive.
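The step "note the server IP reported in the alert" can also be done from a saved copy of the CLI output. The following is a minimal Python sketch, assuming the output of show alerts active detail was captured as plain text in the layout shown above. The function name parse_equipment_alarms and the dictionary keys are hypothetical and chosen only for illustration; they are not part of any Cisco tool.

```python
import re

# Sample text copied from the "show alerts active detail" output above.
SAMPLE = """alerts active detail server-alert 9c367ce5ee48
severity major
type "Equipment Alarm"
startsAt 2021-10-27T17:10:37.025Z
source 10.10.10.10
summary "DDR4_P1_C1_ECC: DIMM 5 is inoperable : Check or replace DIMM"
labels [ "alertname: server-alert" "cluster: cr-chr-deployer" "description: DDR4_P1_C1_ECC: DIMM 5 is inoperable : Check or replace DIMM" "fault_id: sys/rack-unit-1/board/memarray-1/mem-5/fault-F0185" "id: 134219020" "monitor: prometheus" "replica: cr-chr-deployer" "server: 10.10.10.10" "severity: major" ]
"""

SIMPLE_KEYS = ("severity", "type", "startsAt", "source", "summary")


def parse_equipment_alarms(text):
    """Return one dict per alert block whose type is "Equipment Alarm".

    Assumption: each alert block starts with a line of the form
    "alerts active detail <name> <uid>", as in the sample above.
    """
    alerts = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("alerts active detail"):
            parts = line.split()
            if len(parts) >= 5:
                current = {"name": parts[3], "uid": parts[4]}
                alerts.append(current)
            continue
        if current is None:
            continue  # CLI prompt or other text before the first block
        key, _, value = line.partition(" ")
        if key in SIMPLE_KEYS:
            current[key] = value.strip().strip('"')
        elif key == "labels":
            match = re.search(r'"fault_id: ([^"]+)"', value)
            if match:
                current["fault_id"] = match.group(1)
    return [a for a in alerts if a.get("type") == "Equipment Alarm"]


if __name__ == "__main__":
    for alarm in parse_equipment_alarms(SAMPLE):
        print(alarm["source"], alarm.get("fault_id"), alarm["summary"])
```

The source field holds the IP address that is later searched for in the cluster configuration, and fault_id gives the path of the faulty component on the server.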
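Solution
To identify the services (containers) and/or the Virtual Machines (VMs) or Kernel-based Virtual Machines (KVMs) hosted on the server, run the command show running-config in the SMI CM and look for the configuration that contains the server IP.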
- Log in to the CM VIP (username: cloud-user)
- Get the Ops Center service IP from the smi-cm namespace
- Log in to the Ops Center and check the cluster configuration
- Identify the nodes and VMs that run on the server
cloud-user@lab-deployer-cm-primary:~$ kubectl get svc -n smi-cm
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cluster-files-offline-smi-cluster-deployer ClusterIP 10.102.200.178 <none> 8080/TCP 98d
iso-host-cluster-files-smi-cluster-deployer ClusterIP 10.102.100.208 192.168.1.102 80/TCP 98d
iso-host-ops-center-smi-cluster-deployer ClusterIP 10.102.200.73 192.168.1.102 3001/TCP 98d
netconf-ops-center-smi-cluster-deployer ClusterIP 10.102.100.207 192.168.184.193 3022/TCP,22/TCP 98d
ops-center-smi-cluster-deployer ClusterIP 10.10.20.20 <none> 8008/TCP,2024/TCP,2022/TCP,7681/TCP,3000/TCP,3001/TCP 98d
squid-proxy-node-port NodePort 10.102.60.114 <none> 3128:32261/TCP 98d
cloud-user@lab-deployer-cm-primary:~$ ssh -p 2024 admin@10.10.20.20
admin@10.10.20.20's password:
Welcome to the Cisco SMI Cluster Deployer on lab-deployer-cm-primary
Copyright © 2016-2020, Cisco Systems, Inc.
All rights reserved.
admin connected from 192.168.1.100 using ssh on ops-center-smi-cluster-deployer-7848c69844-xzdw6
[lab-deployer-cm-primary] SMI Cluster Deployer# show running-config clusters
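The session above picks the Ops Center service from the kubectl get svc -n smi-cm listing and connects to its cluster IP on port 2024. As a rough illustration, the Python sketch below selects that service from a saved copy of the listing. It assumes the default column layout shown above (NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE) and that the service name starts with "ops-center", as it does in this lab; the function name find_ops_center is hypothetical.

```python
# Sample text reduced from the "kubectl get svc -n smi-cm" output above.
SAMPLE = """NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cluster-files-offline-smi-cluster-deployer ClusterIP 10.102.200.178 <none> 8080/TCP 98d
iso-host-ops-center-smi-cluster-deployer ClusterIP 10.102.200.73 192.168.1.102 3001/TCP 98d
netconf-ops-center-smi-cluster-deployer ClusterIP 10.102.100.207 192.168.184.193 3022/TCP,22/TCP 98d
ops-center-smi-cluster-deployer ClusterIP 10.10.20.20 <none> 8008/TCP,2024/TCP,2022/TCP,7681/TCP,3000/TCP,3001/TCP 98d
squid-proxy-node-port NodePort 10.102.60.114 <none> 3128:32261/TCP 98d
"""


def find_ops_center(listing, cli_port="2024"):
    """Return (service name, cluster IP) of the Ops Center CLI service.

    Assumptions: the name starts with "ops-center" and the service
    exposes cli_port (2024 in the SSH command shown in this document).
    Returns None when no such service is listed.
    """
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        name, cluster_ip, ports = fields[0], fields[2], fields[4]
        # "3128:32261/TCP" -> "3128", "2024/TCP" -> "2024"
        port_numbers = [p.split("/")[0].split(":")[0] for p in ports.split(",")]
        if name.startswith("ops-center") and cli_port in port_numbers:
            return name, cluster_ip
    return None


if __name__ == "__main__":
    result = find_ops_center(SAMPLE)
    if result:
        name, ip = result
        print(f"ssh -p 2024 admin@{ip}  # service {name}")
```

Matching on both the name prefix and the port avoids picking the netconf or ISO host services, whose names also contain "ops-center".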
Sample Output for Containers
In this example, the server is used by the node primary-1.
[lab-deployer-cm-primary] SMI Cluster Deployer# show running-config clusters lab01-smf nodes primary-1
clusters lab01-smf
nodes primary-1
maintenance false
k8s node-type primary
k8s ssh-ip 10.192.10.22
k8s sshd-bind-to-ssh-ip true
k8s node-ip 10.192.10.22
k8s node-labels smi.cisco.com/node-type oam
exit
k8s node-labels smi.cisco.com/node-type-1 proto
exit
ucs-server cimc user admin
ucs-server cimc ip-address 10.10.10.10
Sample Output for VMs
A server can also be used for KVM-based VMs.
In this example, the server hosts the User Plane Functions (UPFs) upf1 and upf2.
[lab-deployer-cm-primary] SMI Cluster Deployer# show running-config clusters lab01-upf nodes labupf
clusters lab01-upf
nodes labupf
maintenance false
ssh-ip 10.192.30.7
type kvm
vms upf1
upf software lab...
...
type upf
exit
vms upf2
upf software lab...
...
type upf
exit
ucs-server cimc user admin
...
ucs-server cimc ip-address 10.10.10.10
...
exit
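When the deployer manages many clusters, searching the configuration by eye for the server IP is error prone. The Python sketch below shows one possible way to do the lookup from a saved copy of the show running-config clusters output. It assumes the line layout seen in the two samples above (clusters, nodes, vms, and ucs-server cimc ip-address lines); the function name nodes_by_cimc_ip is hypothetical, and elided "..." lines are simply skipped.

```python
# Sample text assembled from the two configuration samples above.
SAMPLE = """clusters lab01-smf
nodes primary-1
maintenance false
k8s node-type primary
k8s ssh-ip 10.192.10.22
ucs-server cimc user admin
ucs-server cimc ip-address 10.10.10.10
clusters lab01-upf
nodes labupf
maintenance false
ssh-ip 10.192.30.7
type kvm
vms upf1
...
type upf
exit
vms upf2
...
type upf
exit
ucs-server cimc user admin
...
ucs-server cimc ip-address 10.10.10.10
...
exit
"""


def nodes_by_cimc_ip(config_text, cimc_ip):
    """List the nodes (and their VMs) configured with the given CIMC IP."""
    nodes = []
    cluster = None
    node = None
    for raw in config_text.splitlines():
        words = raw.split()
        if len(words) == 2 and words[0] == "clusters":
            cluster, node = words[1], None
        elif len(words) == 2 and words[0] == "nodes":
            node = {"cluster": cluster, "node": words[1], "vms": [], "cimc_ip": None}
            nodes.append(node)
        elif node is not None and len(words) == 2 and words[0] == "vms":
            node["vms"].append(words[1])
        elif node is not None and words[:3] == ["ucs-server", "cimc", "ip-address"]:
            if len(words) >= 4:
                node["cimc_ip"] = words[3]
    return [n for n in nodes if n["cimc_ip"] == cimc_ip]


if __name__ == "__main__":
    for hit in nodes_by_cimc_ip(SAMPLE, "10.10.10.10"):
        vms = ", ".join(hit["vms"]) or "none"
        print(f'{hit["cluster"]} / {hit["node"]} (VMs: {vms})')
```

A node with an empty VM list corresponds to the container case (the server is a K8s node), while a node with VM names corresponds to the KVM case.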
Connect to the UCS Host via SSH
Connect to the UCS host and verify the fault entries with the commands scope fault, show fault-entries, and show fault-history.
labucs111-cmp1-11 /fault # show fault-entries
Time Severity Description
------------------------- ------------- ---------------------------------------
2021-03-26T10:10:10 major "DDR4_P1_C1_ECC: DIMM 19 is inoperable : Check or replace DIMM"
LABCP0222-Server22-02 /fault # show fault-history
Time Severity Source Cause Description
------------------- ------------- --------------- ------------------------- ----------------------------------------
2021 Dec 10 02:02:02 UTC info %CIMC EQUIPMENT_INOPERABLE "[F0174][cleared][equipment-inoperable][sys/rack-unit-1/board] IERR: A catastrophic fault has occurred on one of the processors: Cleared "
2021 Dec 1 01:01:01 UTC critical %CIMC EQUIPMENT_INOPERABLE "[F0174][critical][equipment-inoperable][sys/rack-unit-1/board] IERR: A catastrophic fault has occurred on one of the processors: Please check the processor's status. "
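The fault history can contain both a raise and a clear record for the same fault, as in the F0174 entries above. The following Python sketch is one way to check from a saved copy of the show fault-history output whether the most recent record for each fault is a clear. It assumes the description format "[code][state][cause][dn] message" seen above and English month abbreviations in the timestamp; the function names are hypothetical.

```python
import re
from datetime import datetime

# Sample lines copied from the "show fault-history" output above.
SAMPLE = """2021 Dec 10 02:02:02 UTC info %CIMC EQUIPMENT_INOPERABLE "[F0174][cleared][equipment-inoperable][sys/rack-unit-1/board] IERR: A catastrophic fault has occurred on one of the processors: Cleared "
2021 Dec 1 01:01:01 UTC critical %CIMC EQUIPMENT_INOPERABLE "[F0174][critical][equipment-inoperable][sys/rack-unit-1/board] IERR: A catastrophic fault has occurred on one of the processors: Please check the processor's status. "
"""

# "[F0174][cleared][equipment-inoperable][sys/rack-unit-1/board] text"
FAULT_RE = re.compile(
    r'"\[(F\d+)\]\[([^\]]+)\]\[([^\]]+)\]\[([^\]]+)\]\s*(.*?)\s*"'
)


def parse_fault_history(text):
    """Parse fault-history rows; rows that do not fit the layout are skipped."""
    entries = []
    for line in text.splitlines():
        match = FAULT_RE.search(line)
        tokens = line.split()
        if match is None or len(tokens) < 6:
            continue
        try:
            # First four tokens: year, month name, day, time. %b expects
            # English month names under the default C/English locale.
            when = datetime.strptime(" ".join(tokens[:4]), "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        code, state, cause, dn, message = match.groups()
        entries.append({
            "time": when, "severity": tokens[5], "code": code,
            "state": state, "cause": cause, "dn": dn, "message": message,
        })
    return entries


def latest_per_fault(entries):
    """Keep only the newest record for each (fault code, dn) pair."""
    latest = {}
    for entry in sorted(entries, key=lambda e: e["time"]):
        latest[(entry["code"], entry["dn"])] = entry
    return latest


if __name__ == "__main__":
    for (code, dn), entry in latest_per_fault(parse_fault_history(SAMPLE)).items():
        status = "cleared" if entry["state"] == "cleared" else "still raised"
        print(f'{code} on {dn}: {status} at {entry["time"].isoformat()}')
```

A fault whose newest record is not in the cleared state is a candidate for further hardware checks on the component named in the dn field.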