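This document describes how to identify the compute node and leaf switch for a specific Session Management Function (SMF) Cloud Native Deployment Platform (CNDP) deployment, and how to resolve the "network-receive-error" alert reported in the Common Execution Environment (CEE).
The "network-receive-error" alerts are reported on the CEE Opcenter for Rack2.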
[lab0200-smf/labceed22] cee# show alerts active summary
NAME UID SEVERITY STARTS AT SOURCE SUMMARY
-----------------------------------------------------------------------------------------------------------------------------------------------------
network-receive-error 998c77d6a6a0 major 10-26T00:10:31 lab0200-smf-mas Network interface "bd0" showing receive errors on hostname lab0200-s...
network-receive-error ea4217bf9d9e major 10-26T00:10:31 lab0200-smf-mas Network interface "bd0" showing receive errors on hostname lab0200-s...
network-receive-error 97fad40d2a58 major 10-26T00:10:31 lab0200-smf-mas Network interface "eno6" showing receive errors on hostname lab0200-...
network-receive-error b79540eb4e78 major 10-26T00:10:31 lab0200-smf-mas Network interface "eno6" showing receive errors on hostname lab0200-...
network-receive-error e3d163ff4012 major 10-26T00:10:01 lab0200-smf-mas Network interface "bd0" showing receive errors on hostname lab0200-s...
network-receive-error 12a7b5a5c5d5 major 10-26T00:10:01 lab0200-smf-mas Network interface "eno6" showing receive errors on hostname lab0200-...
For a description of the alert, refer to the Ultra Cloud Core Subscriber Microservices Infrastructure Operations Guide.
Alert: network-receive-errors
Annotations:
Type: Communications Alarm
Summary: Network interface "{{ $labels.device }}" showing receive errors on hostname {{ $labels.hostname }}"
Expression:
|
rate(node_network_receive_errs_total{device!~"veth.+"}[2m]) > 0
For: 2m
Labels:
Severity: major
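To make the alert expression concrete, the following Python sketch approximates what `rate(node_network_receive_errs_total[2m]) > 0` checks: whether the receive-error counter increased at all over a 2-minute window. It is a simplified illustration, not the Prometheus implementation (it ignores counter resets and extrapolation), and the function name and sample values are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second rate from (timestamp_seconds, counter_value) samples.

    Simplified on purpose: only the first and last samples are used, with
    no counter-reset handling and no extrapolation.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples of the receive-error counter for one device,
# collected over a 2-minute window (illustrative values, not lab data).
window = [(0, 49500), (60, 49520), (120, 49541)]
errors_per_second = simple_rate(window)
alert_condition = errors_per_second > 0
print(f"{errors_per_second:.3f} errors/s, condition met: {alert_condition}")
```

Because the rule also specifies `For: 2m`, the condition has to remain true for two minutes before the alert fires.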
Log in to CEE labceed22 and check the details of the "network-receive-error" alerts reported on the bd0 and eno6 interfaces in order to identify the node and pod.
[lab0200-smf/labceed22] cee# show alerts active summary
NAME UID SEVERITY STARTS AT SOURCE SUMMARY
---------------------------------------------------------------------------------------------------------------------------------------------------------
network-receive-error 3b6a0a7ce1a8 major 10-26T21:17:01 lab0200-smf-mas Network interface "bd0" showing receive errors on hostname tpc...
network-receive-error 15abab75c8fc major 10-26T21:17:01 lab0200-smf-mas Network interface "eno6" showing receive errors on hostname tp...
Run show alerts active detail network-receive-error <UID> to get the details of an alert.
In this example, the source of both alerts is node lab0200-smf-primary-1, pod node-exporter-47xmm.
[lab0200-smf/labceed22] cee# show alerts active detail network-receive-error 3b6a0a7ce1a8
alerts active detail network-receive-error 3b6a0a7ce1a8
severity major
type "Communications Alarm"
startsAt 2021-10-26T21:17:01.913Z
source lab0200-smf-primary-1
summary "Network interface \"bd0\" showing receive errors on hostname lab0200-smf-primary-1\""
labels [ "alertname: network-receive-errors" "cluster: lab0200-smf_cee-labceed22" "component: node-exporter" "controller_revision_hash: 75c4cb979f" "device: bd0" "hostname: lab0200-smf-primary-1" "instance: 10.192.1.42:9100" "job: kubernetes-pods" "monitor: prometheus" "namespace: cee-labceed22" "pod: node-exporter-47xmm" "pod_template_generation: 1" "replica: lab0200-smf_cee-labceed22" "severity: major" ]
annotations [ "summary: Network interface \"bd0\" showing receive errors on hostname lab0200-smf-primary-1\"" "type: Communications Alarm" ]
[lab0200-smf/labceed22] cee# show alerts active detail network-receive-error 15abab75c8fc
alerts active detail network-receive-error 15abab75c8fc
severity major
type "Communications Alarm"
startsAt 2021-10-26T21:17:01.913Z
source lab0200-smf-primary-1
summary "Network interface \"eno6\" showing receive errors on hostname lab0200-smf-primary-1\""
labels [ "alertname: network-receive-errors" "cluster: lab0200-smf_cee-labceed22" "component: node-exporter" "controller_revision_hash: 75c4cb979f" "device: eno6" "hostname: lab0200-smf-primary-1" "instance: 10.192.1.42:9100" "job: kubernetes-pods" "monitor: prometheus" "namespace: cee-labceed22" "pod: node-exporter-47xmm" "pod_template_generation: 1" "replica: lab0200-smf_cee-labceed22" "severity: major" ]
annotations [ "summary: Network interface \"eno6\" showing receive errors on hostname lab0200-smf-primary-1\"" "type: Communications Alarm" ]
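When many alerts are active, the node, device, and pod can be pulled out of the `labels` line programmatically. The following Python sketch assumes the `labels [ "key: value" ... ]` layout shown in the output above; the function name is hypothetical and the sample line is shortened for readability.

```python
import re


def parse_labels(labels_line):
    """Turn a 'labels [ "k: v" "k: v" ... ]' line into a dict."""
    # The key is everything up to the first ": "; the value may itself
    # contain colons (for example "instance: 10.192.1.42:9100").
    pairs = re.findall(r'"([^":]+): ([^"]*)"', labels_line)
    return dict(pairs)


labels_line = (
    'labels [ "alertname: network-receive-errors" "component: node-exporter" '
    '"device: eno6" "hostname: lab0200-smf-primary-1" '
    '"instance: 10.192.1.42:9100" "pod: node-exporter-47xmm" "severity: major" ]'
)
labels = parse_labels(labels_line)
print(labels["hostname"], labels["device"], labels["pod"])
```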
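Log in to the K8s Primary VIP of Rack2 and verify the status of the source node and pod.
In this example, both are in a good state: the node is Ready and the pod is Running.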
cloud-user@lab0200-smf-primary-1:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lab0200-smf-primary-1 Ready control-plane 105d v1.21.0
lab0200-smf-primary-2 Ready control-plane 105d v1.21.0
lab0200-smf-primary-3 Ready control-plane 105d v1.21.0
lab0200-smf-worker-1 Ready <none> 105d v1.21.0
lab0200-smf-worker-2 Ready <none> 105d v1.21.0
lab0200-smf-worker-3 Ready <none> 105d v1.21.0
lab0200-smf-worker-4 Ready <none> 105d v1.21.0
lab0200-smf-worker-5 Ready <none> 105d v1.21.0
cloud-user@lab0200-smf-primary-1:~$ kubectl get pods -A -o wide | grep node-exporter-47xmm
cee-labceed22 node-exporter-47xmm 1/1 Running 0 18d 10.192.1.44 lab0200-smf-primary-1 <none> <none>
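Verify that the bd0 and eno6 interfaces are up with ip addr | grep eno6 and ip addr | grep bd0.
Note: When the filter is applied for bd0, eno6 also appears in the output. This is because eno5 and eno6 are configured as bonded interfaces under bd0, which can be verified in the SMI Cluster Deployer.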
cloud-user@lab0200-smf-primary-1:~$ ip addr | grep eno6
3: eno6: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
cloud-user@lab0200-smf-primary-1:~$ ip addr | grep bd0
2: eno5: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
3: eno6: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
12: bd0: <BROADCAST,MULTICAST,PRIMARY,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
13: vlan111@bd0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
14: vlan112@bd0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
182: cali7a166bd093d@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
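The bond membership visible in this output can also be extracted with a short script. The Python sketch below is a minimal illustration with a hypothetical function name; it matches the keyword before the bond name as either `primary` (as printed in this document) or `master` (as iproute2 typically prints it), which is an assumption about the host's actual output.

```python
import re


def bond_members(ip_addr_output, bond):
    """Return interface names whose 'ip addr' line references the given bond."""
    pattern = re.compile(
        r'^\d+:\s+(\S+?):\s.*\b(?:primary|master)\s+' + re.escape(bond) + r'\b'
    )
    members = []
    for line in ip_addr_output.splitlines():
        match = pattern.match(line.strip())
        if match:
            members.append(match.group(1))
    return members


sample = """\
2: eno5: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
3: eno6: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
12: bd0: <BROADCAST,MULTICAST,PRIMARY,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
13: vlan111@bd0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
"""
print(bond_members(sample, "bd0"))
```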
Log in to the Cluster Manager VIP, then SSH to the Operations (Ops) Center ops-center-smi-cluster-deployer.
cloud-user@lab-deployer-cm-primary:~$ kubectl get svc -n smi-cm
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cluster-files-offline-smi-cluster-deployer ClusterIP 10.102.53.184 <none> 8080/TCP 110d
iso-host-cluster-files-smi-cluster-deployer ClusterIP 10.102.38.70 172.16.1.102 80/TCP 110d
iso-host-ops-center-smi-cluster-deployer ClusterIP 10.102.83.54 172.16.1.102 3001/TCP 110d
netconf-ops-center-smi-cluster-deployer ClusterIP 10.102.196.125 10.241.206.65 3022/TCP,22/TCP 110d
ops-center-smi-cluster-deployer ClusterIP 10.102.12.170 <none> 8008/TCP,2024/TCP,2022/TCP,7681/TCP,3000/TCP,3001/TCP 110d
squid-proxy-node-port NodePort 10.102.72.168 <none> 3128:32572/TCP 110d
cloud-user@lab-deployer-cm-primary:~$ ssh -p 2024 admin@10.102.12.170
admin@10.102.12.170's password:
Welcome to the Cisco SMI Cluster Deployer on lab-deployer-cm-primary
Copyright © 2016-2020, Cisco Systems, Inc.
All rights reserved.
admin connected from 172.16.1.100 using ssh on ops-center-smi-cluster-deployer-5cdc5f94db-bnxqt
[lab-deployer-cm-primary] SMI Cluster Deployer#
Verify the cluster, node-defaults, interfaces, and bond parameters mode for the node. In this example, the cluster is lab0200-smf.
[lab-deployer-cm-primary] SMI Cluster Deployer# show running-config clusters
clusters lab0200-smf
environment lab0200-smf-deployer_1
…
node-defaults initial-boot netplan ethernets eno5
dhcp4 false
dhcp6 false
exit
node-defaults initial-boot netplan ethernets eno6
dhcp4 false
dhcp6 false
exit
node-defaults initial-boot netplan ethernets enp216s0f0
dhcp4 false
dhcp6 false
exit
node-defaults initial-boot netplan ethernets enp216s0f1
dhcp4 false
dhcp6 false
exit
node-defaults initial-boot netplan ethernets enp94s0f0
dhcp4 false
dhcp6 false
exit
node-defaults initial-boot netplan ethernets enp94s0f1
dhcp4 false
dhcp6 false
exit
node-defaults initial-boot netplan bonds bd0
dhcp4 false
dhcp6 false
optional true
interfaces [ eno5 eno6 ]
parameters mode active-backup
parameters mii-monitor-interval 100
parameters fail-over-mac-policy active
exit
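On the Primary VIP, check for errors and/or drops on the bd0 and eno6 interfaces.
When both interfaces show drops, the UCS server or leaf switch hardware must be checked for hardware issues.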
cloud-user@lab0200-smf-primary-1:~$ ifconfig bd0
bd0: flags=5187<UP,BROADCAST,RUNNING,PRIMARY,MULTICAST> mtu 1500
inet6 fe80::8e94:1fff:fef6:53cd prefixlen 64 scopeid 0x20<link>
ether 8c:94:1f:f6:53:cd txqueuelen 1000 (Ethernet)
RX packets 47035763777 bytes 19038286946282 (19.0 TB)
RX errors 49541 dropped 845484 overruns 0 frame 49541
TX packets 53797663096 bytes 32320571418654 (32.3 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
cloud-user@lab0200-smf-primary-1:~$ ifconfig eno6
eno6: flags=6211<UP,BROADCAST,RUNNING,SECONDARY,MULTICAST> mtu 1500
ether 8c:94:1f:f6:53:cd txqueuelen 1000 (Ethernet)
RX packets 47035402290 bytes 19038274391478 (19.0 TB)
RX errors 49541 dropped 845484 overruns 0 frame 49541
TX packets 53797735337 bytes 32320609021235 (32.3 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
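To compare counters across interfaces or over time, the RX error line can be parsed rather than read by eye. The Python sketch below assumes the net-tools `ifconfig` layout shown above; the function name is hypothetical.

```python
import re


def rx_counters(ifconfig_output):
    """Extract the RX error counters from one interface's ifconfig output."""
    match = re.search(
        r'RX errors (\d+)\s+dropped (\d+)\s+overruns (\d+)\s+frame (\d+)',
        ifconfig_output,
    )
    if match is None:
        return None
    keys = ("errors", "dropped", "overruns", "frame")
    return dict(zip(keys, map(int, match.groups())))


sample = """\
eno6: flags=6211<UP,BROADCAST,RUNNING,SECONDARY,MULTICAST> mtu 1500
ether 8c:94:1f:f6:53:cd txqueuelen 1000 (Ethernet)
RX packets 47035402290 bytes 19038274391478 (19.0 TB)
RX errors 49541 dropped 845484 overruns 0 frame 49541
TX packets 53797735337 bytes 32320609021235 (32.3 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
"""
counters = rx_counters(sample)
print(counters)
```

In the sample, the `errors` and `frame` counters are equal, so every RX error on eno6 was counted as a frame error. Running the same parse twice a few minutes apart shows whether the counters are still increasing.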
In the SMI Cluster Deployer, run show running-config clusters <cluster name> nodes <node name> in order to find the CIMC IP address of the UCS server.
[lab-deployer-cm-primary] SMI Cluster Deployer# show running-config clusters lab0200-smf nodes primary-1
clusters lab0200-smf
nodes primary-1
maintenance false
host-profile cp-data-r2-sysctl
k8s node-type primary
k8s ssh-ip 10.192.1.42
k8s sshd-bind-to-ssh-ip true
k8s node-ip 10.192.1.42
k8s node-labels smi.cisco.com/node-type oam
exit
k8s node-labels smi.cisco.com/node-type-1 proto
exit
ucs-server cimc user admin
...
ucs-server cimc ip-address 172.16.1.62
...
exit
From the active CM, SSH to the CIMC IP address 172.16.1.62 and verify the server name.
In this example, the server name is LAB0200-Server8-02.
cloud-user@lab-deployer-cm-primary:~$ ssh admin@172.16.1.62
Warning: Permanently added '172.16.1.62' (RSA) to the list of known hosts.
admin@172.16.1.62's password:
LAB0200-Server8-02#
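Note: Verify the server name in the Customer Information Questionnaire (CIQ), if the CIQ is available.
On the Primary VIP, check the physical interface name for eno6 with the ls -la /sys/class/net command. In this example, eno6 maps to PCI address 1d:00.1, so that is the address to look for when lspci is used to identify the eno6 device.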
cloud-user@lab0200-smf-primary-1:~$ ls -la /sys/class/net
total 0
drwxr-xr-x 2 root root 0 Oct 12 06:18 .
drwxr-xr-x 87 root root 0 Oct 12 06:18 ..
lrwxrwxrwx 1 root root 0 Oct 12 06:18 bd0 -> ../../devices/virtual/net/bd0
lrwxrwxrwx 1 root root 0 Oct 12 06:18 bd1 -> ../../devices/virtual/net/bd1
…
lrwxrwxrwx 1 root root 0 Oct 12 06:18 eno5 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:01.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0/net/eno5
lrwxrwxrwx 1 root root 0 Oct 12 06:18 eno6 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:01.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.1/net/eno6
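Note: lspci shows information about all devices on the UCS server, such as MLOM, SLOM, PCI, and so on. The device information can be used to map the interface names seen in the ls -la /sys/class/net command output.
In this example, port 1d:00.1 belongs to the MLOM and is interface eno6. Interface eno5 is MLOM port 1d:00.0.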
cloud-user@lab0200-smf-primary-1:~$ lspci
……
1d:00.0 Ethernet controller: Cisco Systems Inc VIC Ethernet NIC (rev a2)
1d:00.1 Ethernet controller: Cisco Systems Inc VIC Ethernet NIC (rev a2)
3b:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
3b:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
5e:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
5e:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
d8:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
d8:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
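The mapping from an interface name to its lspci entry comes from the last PCI address in the /sys/class/net symlink target. The Python sketch below extracts that address from a target string copied from the output above; the function name is hypothetical.

```python
import re


def pci_address(sysfs_target):
    """Return the short PCI address (bus:device.function) of a NIC symlink target.

    Returns None for virtual devices such as bonds, which have no PCI address.
    """
    match = re.search(
        r'/[0-9a-f]{4}:([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])/net/[^/]+$',
        sysfs_target,
    )
    return match.group(1) if match else None


target = (
    "../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:01.0/"
    "0000:1b:00.0/0000:1c:00.0/0000:1d:00.1/net/eno6"
)
print(pci_address(target))
print(pci_address("../../devices/virtual/net/bd0"))
```

On a live host, the target string could be obtained with `os.readlink("/sys/class/net/eno6")` instead of being pasted in.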
In the CIMC GUI, match the MLOM MAC address with the one seen in the ifconfig output on the Primary VIP.
cloud-user@lab0200-smf-primary-1:~$ ifconfig bd0
bd0: flags=5187<UP,BROADCAST,RUNNING,PRIMARY,MULTICAST> mtu 1500
inet6 fe80::8e94:1fff:fef6:53cd prefixlen 64 scopeid 0x20<link>
ether 8c:94:1f:f6:53:cd txqueuelen 1000 (Ethernet)
RX packets 47035763777 bytes 19038286946282 (19.0 TB)
RX errors 49541 dropped 845484 overruns 0 frame 49541
TX packets 53797663096 bytes 32320571418654 (32.3 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
cloud-user@lab0200-smf-primary-1:~$ ifconfig eno6
eno6: flags=6211<UP,BROADCAST,RUNNING,SECONDARY,MULTICAST> mtu 1500
ether 8c:94:1f:f6:53:cd txqueuelen 1000 (Ethernet)
RX packets 47035402290 bytes 19038274391478 (19.0 TB)
RX errors 49541 dropped 845484 overruns 0 frame 49541
TX packets 53797735337 bytes 32320609021235 (32.3 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
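As shown in the image, in the Cluster Manager network the MLOM (eno5/eno6) is connected to Leaf 1 and Leaf 2.
Note: Verify the leaf switch hostnames in the CIQ, if the CIQ is available.
Log in to both leaf switches and filter the interface descriptions for the server name.
In this example, the LAB0200-Server8-02 MLOM ports are connected to interface Eth1/49 on both Leaf1 (MLOM-P2) and Leaf2 (MLOM-P1).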
Leaf1# sh int description | inc LAB0200-Server8-02
Eth1/10 eth 40G PCIE-01-2-LAB0200-Server8-02
Eth1/30 eth 40G PCIE-02-2-LAB0200-Server8-02
Eth1/49 eth 40G LAB0200-Server8-02 MLOM-P2
Leaf2# sh int description | inc LAB0200-Server8-02
Eth1/10 eth 40G PCIE-01-1-LAB0200-Server8-02
Eth1/30 eth 40G PCIE-02-1-LAB0200-Server8-02
Eth1/49 eth 40G LAB0200-Server8-02 MLOM-P1
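If several servers need to be traced, the leaf output can be filtered for MLOM ports with a few lines of code. The Python sketch below assumes the four-column layout (interface, type, speed, description) shown above; the function name is hypothetical.

```python
def mlom_ports(leaf_name, description_output):
    """Return (leaf, interface, description) for lines whose description mentions MLOM."""
    ports = []
    for line in description_output.splitlines():
        # Split into at most four fields so the description keeps its spaces.
        fields = line.split(None, 3)
        if len(fields) == 4 and "MLOM" in fields[3]:
            ports.append((leaf_name, fields[0], fields[3]))
    return ports


leaf1_output = """\
Eth1/10 eth 40G PCIE-01-2-LAB0200-Server8-02
Eth1/30 eth 40G PCIE-02-2-LAB0200-Server8-02
Eth1/49 eth 40G LAB0200-Server8-02 MLOM-P2
"""
print(mlom_ports("Leaf1", leaf1_output))
```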
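Important: Each issue needs its own analysis. If no errors are found on the Nexus side, check the UCS server interfaces for errors.
In this scenario, the issue is related to a link failure on Leaf1 int eth1/49, which is connected to LAB0200-Server8-02 MLOM eno6.
The UCS server was validated and no hardware issue was found; the MLOM and its ports were in a good state.
Leaf1 showed TX output errors: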
Leaf1# sh int Eth1/49
Ethernet1/49 is up
admin state is up, Dedicated Interface
Hardware: 10000/40000/100000 Ethernet, address: e8eb.3437.48ca (bia e8eb.3437.48ca)
Description: LAB0200-Server8-02 MLOM-P2
MTU 9216 bytes, BW 40000000 Kbit , DLY 10 usec
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, medium is broadcast
Port mode is trunk
full-duplex, 40 Gb/s, media type is 40G
Beacon is turned off
Auto-Negotiation is turned on FEC mode is Auto
Input flow-control is off, output flow-control is off
Auto-mdix is turned off
Rate mode is dedicated
Switchport monitor is off
EtherType is 0x8100
EEE (efficient-ethernet) : n/a
admin fec state is auto, oper fec state is off
Last link flapped 5week(s) 6day(s)
Last clearing of "show interface" counters never
12 interface resets
Load-Interval #1: 30 seconds
30 seconds input rate 162942488 bits/sec, 26648 packets/sec
30 seconds output rate 35757024 bits/sec, 16477 packets/sec
input rate 162.94 Mbps, 26.65 Kpps; output rate 35.76 Mbps, 16.48 Kpps
Load-Interval #2: 5 minute (300 seconds)
300 seconds input rate 120872496 bits/sec, 22926 packets/sec
300 seconds output rate 54245920 bits/sec, 17880 packets/sec
input rate 120.87 Mbps, 22.93 Kpps; output rate 54.24 Mbps, 17.88 Kpps
RX
85973263325 unicast packets 6318912 multicast packets 55152 broadcast packets
85979637389 input packets 50020924423841 bytes
230406880 jumbo packets 0 storm suppression bytes
0 runts 0 giants 0 CRC 0 no buffer
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
TX
76542979816 unicast packets 88726302 multicast packets 789768 broadcast packets
76632574981 output packets 29932747104403 bytes
3089287610 jumbo packets
79095 output error 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause
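The error counters in this output can be extracted for a quick comparison between leaf switches. The Python sketch below assumes the NX-OS counter wording shown above (`input error`, `output error`, `CRC`) and uses a shortened sample; the function name is hypothetical.

```python
import re


def interface_errors(show_interface_output):
    """Pull selected error counters from 'show interface' text."""
    counters = {}
    for name in ("input error", "output error", "CRC"):
        # Each counter is printed as "<number> <name>".
        match = re.search(r'(\d+)\s+' + re.escape(name) + r'\b', show_interface_output)
        if match:
            counters[name] = int(match.group(1))
    return counters


sample = """\
RX
0 runts 0 giants 0 CRC 0 no buffer
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
TX
79095 output error 0 collision 0 deferred 0 late collision
"""
errs = interface_errors(sample)
print(errs)
```

In this case the leaf reports no input errors but a non-zero output error count toward the server, which lines up with the receive errors seen on eno6.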
The "network-receive-error" alert was resolved with a cable replacement on Leaf1 int eth1/49.
The last interface link failure was reported before the cable was replaced.
2021 Nov 17 07:36:48 TPLF0201 %BFD-5-SESSION_STATE_DOWN: BFD session 1090519112 to neighbor 10.22.101.1 on interface Vlan2201 has gone down. Reason: Control
Detection Time Expired.
2021 Nov 17 07:37:30 TPLF0201 %BFD-5-SESSION_STATE_DOWN: BFD session 1090519107 to neighbor 10.22.101.2 on interface Vlan2201 has gone down. Reason: Control
Detection Time Expired.
2021 Nov 18 05:09:12 TPLF0201 %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/48 is down (Link failure)
After the cable replacement, the alerts on eno6/bd0 are cleared on labceed22.
[lab0200-smf/labceed22] cee# show alerts active summary
NAME UID SEVERITY STARTS AT SOURCE SUMMARY
---------------------------------------------------------------------------------------------------------------------------------------------------------
watchdog a62f59201ba8 minor 11-02T05:57:18 System This is an alert meant to ensure that the entire alerting pipeline is functional. This ale...
Revision | Publish Date | Comments |
---|---|---|
1.0 | 28-Mar-2022 | Initial Release |