Troubleshoot SMF CNDP "network-receive-error" on eno6/bd0 Interfaces


Updated: March 28, 2022

Document ID: 217732


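Introduction

This document describes how to identify the compute node and leaf switch for a specific Session Management Function (SMF) Cloud Native Deployment Platform (CNDP) and how to resolve the "network-receive-error" alert reported in the Common Execution Environment (CEE).

Problem

The "network-receive-error" alerts are reported on the CEE Opcenter for Rack2.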

[lab0200-smf/labceed22] cee# show alerts active summary
NAME UID SEVERITY STARTS AT SOURCE SUMMARY
-----------------------------------------------------------------------------------------------------------------------------------------------------
network-receive-error 998c77d6a6a0 major 10-26T00:10:31 lab0200-smf-mas Network interface "bd0" showing receive errors on hostname lab0200-s...
network-receive-error ea4217bf9d9e major 10-26T00:10:31 lab0200-smf-mas Network interface "bd0" showing receive errors on hostname lab0200-s...
network-receive-error 97fad40d2a58 major 10-26T00:10:31 lab0200-smf-mas Network interface "eno6" showing receive errors on hostname lab0200-...
network-receive-error b79540eb4e78 major 10-26T00:10:31 lab0200-smf-mas Network interface "eno6" showing receive errors on hostname lab0200-...
network-receive-error e3d163ff4012 major 10-26T00:10:01 lab0200-smf-mas Network interface "bd0" showing receive errors on hostname lab0200-s...
network-receive-error 12a7b5a5c5d5 major 10-26T00:10:01 lab0200-smf-mas Network interface "eno6" showing receive errors on hostname lab0200-...

For a description of the alert, refer to the Ultra Cloud Core Subscriber Microservices Infrastructure Operations Guide.

Alert: network-receive-errors
 Annotations:
  Type: Communications Alarm
  Summary: Network interface "{{ $labels.device }}" showing receive errors on hostname {{ $labels.hostname }}"
  Expression:
   |
   rate(node_network_receive_errs_total{device!~"veth.+"}[2m]) > 0
For: 2m
Labels:
 Severity: major
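
The expression above evaluates true when the per-second rate of the node_network_receive_errs_total counter over a 2-minute window is above zero for any device that is not a veth interface. The following Python sketch models that condition in a simplified way. It assumes no counter resets and ignores the window extrapolation that Prometheus applies, and the sample counter values are hypothetical.

```python
# Simplified model of:
#   rate(node_network_receive_errs_total{device!~"veth.+"}[2m]) > 0
# Assumptions: no counter resets, no Prometheus window extrapolation.
# The sample values below are hypothetical.
import re


def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def is_monitored(device):
    """Mirror the device!~"veth.+" matcher (Prometheus regexes are fully anchored)."""
    return re.fullmatch(r"veth.+", device) is None


def alert_condition(device, samples):
    """True when the device is monitored and its error counter is increasing."""
    return is_monitored(device) and simple_rate(samples) > 0


# Counter samples taken 30 seconds apart within one 2-minute window.
eno6 = [(0, 49500), (30, 49510), (60, 49520), (90, 49530), (120, 49541)]
idle = [(0, 10), (30, 10), (60, 10), (90, 10), (120, 10)]

print(alert_condition("eno6", eno6))      # True
print(alert_condition("eno6", idle))      # False
print(alert_condition("veth1a2b", eno6))  # False
```

The "For: 2m" clause means the condition must hold continuously for 2 minutes before the alert fires, so a single isolated error burst may not raise it.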

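Identify the Source of the Alert

Log in to CEE labceed22 and check the details of the "network-receive-error" alerts reported on the bd0 and eno6 interfaces to identify the node and pod.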

[lab0200-smf/labceed22] cee# show alerts active summary
NAME                   UID           SEVERITY  STARTS AT       SOURCE                 SUMMARY                                                            
---------------------------------------------------------------------------------------------------------------------------------------------------------
network-receive-error  3b6a0a7ce1a8  major     10-26T21:17:01  lab0200-smf-mas  Network interface "bd0" showing receive errors on hostname tpc...  
network-receive-error  15abab75c8fc  major     10-26T21:17:01  lab0200-smf-mas  Network interface "eno6" showing receive errors on hostname tp...

Run show alerts active detail network-receive-error <UID> to get the details of the alert.

In the example, the source of both alerts is node lab0200-smf-primary-1, pod node-exporter-47xmm.

[lab0200-smf/labceed22] cee# show alerts active detail network-receive-error  3b6a0a7ce1a8
alerts active detail network-receive-error 3b6a0a7ce1a8
 severity    major
 type        "Communications Alarm"
 startsAt    2021-10-26T21:17:01.913Z
 source      lab0200-smf-primary-1
 summary     "Network interface \"bd0\" showing receive errors on hostname lab0200-smf-primary-1\""
 labels      [ "alertname: network-receive-errors" "cluster: lab0200-smf_cee-labceed22" "component: node-exporter" "controller_revision_hash: 75c4cb979f" "device: bd0" "hostname: lab0200-smf-primary-1" "instance: 10.192.1.42:9100" "job: kubernetes-pods" "monitor: prometheus" "namespace: cee-labceed22" "pod: node-exporter-47xmm" "pod_template_generation: 1" "replica: lab0200-smf_cee-labceed22" "severity: major" ]
 annotations [ "summary: Network interface \"bd0\" showing receive errors on hostname lab0200-smf-primary-1\"" "type: Communications Alarm" ]

[lab0200-smf/labceed22] cee# show alerts active detail network-receive-error 15abab75c8fc
alerts active detail network-receive-error 15abab75c8fc
 severity    major
 type        "Communications Alarm"
 startsAt    2021-10-26T21:17:01.913Z
 source      lab0200-smf-primary-1
 summary     "Network interface \"eno6\" showing receive errors on hostname lab0200-smf-primary-1\""
 labels      [ "alertname: network-receive-errors" "cluster: lab0200-smf_cee-labceed22" "component: node-exporter" "controller_revision_hash: 75c4cb979f" "device: eno6" "hostname: lab0200-smf-primary-1" "instance: 10.192.1.42:9100" "job: kubernetes-pods" "monitor: prometheus" "namespace: cee-labceed22" "pod: node-exporter-47xmm" "pod_template_generation: 1" "replica: lab0200-smf_cee-labceed22" "severity: major" ]
 annotations [ "summary: Network interface \"eno6\" showing receive errors on hostname lab0200-smf-primary-1\"" "type: Communications Alarm" ]
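
When many alerts are active, the device, hostname, and pod can be pulled out of the labels line programmatically. The following Python sketch is one way to do that. It assumes the labels line has been copied from the show alerts active detail output as text (shortened here), and the helper name parse_labels is hypothetical. It does not handle the annotations line, which contains escaped quotes.

```python
import re


def parse_labels(line):
    """Turn a 'labels [ "key: value" ... ]' line into a dict."""
    # Key: anything up to the first ': '. Value: the rest up to the closing quote,
    # so values that contain ':' (such as ip:port) stay intact.
    return dict(re.findall(r'"([^":]+): ([^"]*)"', line))


# Shortened copy of the labels line from the bd0 alert above.
line = (
    'labels      [ "alertname: network-receive-errors" "component: node-exporter" '
    '"device: bd0" "hostname: lab0200-smf-primary-1" "instance: 10.192.1.42:9100" '
    '"namespace: cee-labceed22" "pod: node-exporter-47xmm" "severity: major" ]'
)

labels = parse_labels(line)
print(labels["device"], labels["hostname"], labels["pod"])
# bd0 lab0200-smf-primary-1 node-exporter-47xmm
```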

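Verify Node, Pod, and Port Status

Node and Pod Verification from the Primary VIP

Log in to the K8s Primary VIP of Rack2 and verify the status of the source node and pod.

In this example, both are in a good state: Ready and Running.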

cloud-user@lab0200-smf-primary-1:~$ kubectl get nodes
NAME                         STATUS   ROLES                  AGE    VERSION
lab0200-smf-primary-1   Ready    control-plane          105d   v1.21.0
lab0200-smf-primary-2   Ready    control-plane          105d   v1.21.0
lab0200-smf-primary-3   Ready    control-plane          105d   v1.21.0
lab0200-smf-worker-1   Ready    <none>                 105d   v1.21.0
lab0200-smf-worker-2   Ready    <none>                 105d   v1.21.0
lab0200-smf-worker-3   Ready    <none>                 105d   v1.21.0
lab0200-smf-worker-4   Ready    <none>                 105d   v1.21.0
lab0200-smf-worker-5   Ready    <none>                 105d   v1.21.0

cloud-user@lab0200-smf-primary-1:~$ kubectl get pods -A -o wide | grep node-exporter-47xmm
cee-labceed22     node-exporter-47xmm                                       1/1     Running   0          18d    10.192.1.44       lab0200-smf-primary-1   <none>           <none>

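Port Verification from the K8s Primary VIP

Verify that the bd0 and eno6 interfaces are UP with ip addr | grep eno6 and ip addr | grep bd0.

Note: When the filter is applied for bd0, eno6 also appears in the output. The reason is that eno5 and eno6 are configured as bonded interfaces under bd0, which can be verified in the SMI Cluster Deployer.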

cloud-user@lab0200-smf-primary-1:~$ ip addr | grep eno6
3: eno6: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000

cloud-user@lab0200-smf-primary-1:~$ ip addr | grep bd0
2: eno5: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
3: eno6: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
12: bd0: <BROADCAST,MULTICAST,PRIMARY,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
13: vlan111@bd0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
14: vlan112@bd0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
182: cali7a166bd093d@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
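
The relationship between bd0 and its member interfaces can also be read from the ip addr output with a short script. The following Python sketch assumes the output format shown above, where the owning bond appears after the keyword "primary" (stock iproute2 builds print "master" instead), and treats the text after the flags as keyword/value pairs. The function name bond_members is hypothetical.

```python
import re

# "<index>: <name>[@<parent>]: <FLAGS> key value key value ..."
LINE = re.compile(r"^\d+: (?P<name>[^:@]+)(?:@\S+)?: <(?P<flags>[^>]*)>(?P<rest>.*)$")


def bond_members(ip_addr_output, bond):
    """Return {interface: state} for interfaces attached to the given bond."""
    members = {}
    for line in ip_addr_output.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        words = m.group("rest").split()
        # Assumption: the remainder of the line is a flat list of key/value pairs.
        pairs = dict(zip(words[::2], words[1::2]))
        owner = pairs.get("primary") or pairs.get("master")
        if owner == bond:
            members[m.group("name")] = pairs.get("state", "UNKNOWN")
    return members


output = """\
2: eno5: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
3: eno6: <BROADCAST,MULTICAST,SECONDARY,UP,LOWER_UP> mtu 1500 qdisc mq primary bd0 state UP group default qlen 1000
12: bd0: <BROADCAST,MULTICAST,PRIMARY,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
13: vlan111@bd0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
"""

print(bond_members(output, "bd0"))  # {'eno5': 'UP', 'eno6': 'UP'}
```

VLAN subinterfaces such as vlan111@bd0 are stacked on top of the bond rather than attached to it as members, so they are not returned.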

Port Verification from the SMI Cluster Deployer

Log in to the Cluster Manager VIP and then SSH to the Operations (Ops) Center ops-center-smi-cluster-deployer.

cloud-user@lab-deployer-cm-primary:~$ kubectl get svc -n smi-cm
NAME                                          TYPE        CLUSTER-IP       EXTERNAL-IP     PORT(S)                                                 AGE
cluster-files-offline-smi-cluster-deployer    ClusterIP   10.102.53.184    <none>          8080/TCP                                                110d
iso-host-cluster-files-smi-cluster-deployer   ClusterIP   10.102.38.70     172.16.1.102   80/TCP                                                  110d
iso-host-ops-center-smi-cluster-deployer      ClusterIP   10.102.83.54     172.16.1.102   3001/TCP                                                110d
netconf-ops-center-smi-cluster-deployer       ClusterIP   10.102.196.125   10.241.206.65   3022/TCP,22/TCP                                         110d
ops-center-smi-cluster-deployer               ClusterIP   10.102.12.170    <none>          8008/TCP,2024/TCP,2022/TCP,7681/TCP,3000/TCP,3001/TCP   110d
squid-proxy-node-port                         NodePort    10.102.72.168    <none>          3128:32572/TCP                                          110d

cloud-user@lab-deployer-cm-primary:~$ ssh -p 2024 admin@10.102.12.170
admin@10.102.12.170's password:
      Welcome to the Cisco SMI Cluster Deployer on lab-deployer-cm-primary
      Copyright © 2016-2020, Cisco Systems, Inc.
      All rights reserved.
admin connected from 172.16.1.100 using ssh on ops-center-smi-cluster-deployer-5cdc5f94db-bnxqt
[lab-deployer-cm-primary] SMI Cluster Deployer#

Verify the cluster, node-defaults, interfaces, and parameters mode for the node. In this example, the cluster is lab0200-smf.

[lab-deployer-cm-primary] SMI Cluster Deployer#  show running-config clusters
clusters lab0200-smf
 environment lab0200-smf-deployer_1
…
 node-defaults initial-boot netplan ethernets eno5
  dhcp4 false
  dhcp6 false
 exit
 node-defaults initial-boot netplan ethernets eno6
  dhcp4 false
  dhcp6 false
 exit
 node-defaults initial-boot netplan ethernets enp216s0f0
  dhcp4 false
  dhcp6 false
 exit
 node-defaults initial-boot netplan ethernets enp216s0f1
  dhcp4 false
  dhcp6 false
 exit
 node-defaults initial-boot netplan ethernets enp94s0f0
  dhcp4 false
  dhcp6 false
 exit
 node-defaults initial-boot netplan ethernets enp94s0f1
  dhcp4 false
  dhcp6 false
 exit
node-defaults initial-boot netplan bonds bd0
  dhcp4      false
  dhcp6      false
  optional   true
  interfaces [ eno5 eno6 ]
  parameters mode      active-backup
  parameters mii-monitor-interval 100
  parameters fail-over-mac-policy active
 exit

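On the Primary VIP, verify errors and/or drops on interfaces bd0 and eno6.

When both interfaces show drops, check the UCS server and the leaf switch for hardware issues.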

cloud-user@lab0200-smf-primary-1:~$  ifconfig bd0
bd0: flags=5187<UP,BROADCAST,RUNNING,PRIMARY,MULTICAST>  mtu 1500
        inet6 fe80::8e94:1fff:fef6:53cd  prefixlen 64  scopeid 0x20<link>
        ether 8c:94:1f:f6:53:cd  txqueuelen 1000  (Ethernet)
        RX packets 47035763777  bytes 19038286946282 (19.0 TB)
        RX errors 49541  dropped 845484  overruns 0  frame 49541
        TX packets 53797663096  bytes 32320571418654 (32.3 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cloud-user@lab0200-smf-primary-1:~$  ifconfig eno6
eno6: flags=6211<UP,BROADCAST,RUNNING,SECONDARY,MULTICAST>  mtu 1500
        ether 8c:94:1f:f6:53:cd  txqueuelen 1000  (Ethernet)
        RX packets 47035402290  bytes 19038274391478 (19.0 TB)
        RX errors 49541  dropped 845484  overruns 0  frame 49541
        TX packets 53797735337  bytes 32320609021235 (32.3 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
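
To compare the counters on the bond and on its member, the RX lines of the ifconfig output can be parsed. The following Python sketch uses the values shown above. The function name rx_counters is hypothetical, and the error ratio is only an illustrative metric, not a threshold defined by the product.

```python
import re


def rx_counters(ifconfig_output):
    """Extract the RX packet and error counters from one ifconfig stanza."""
    packets = int(re.search(r"RX packets (\d+)", ifconfig_output).group(1))
    m = re.search(
        r"RX errors (\d+)\s+dropped (\d+)\s+overruns (\d+)\s+frame (\d+)",
        ifconfig_output,
    )
    errors, dropped, overruns, frame = map(int, m.groups())
    return {
        "packets": packets,
        "errors": errors,
        "dropped": dropped,
        "overruns": overruns,
        "frame": frame,
    }


bd0 = rx_counters("""
        RX packets 47035763777  bytes 19038286946282 (19.0 TB)
        RX errors 49541  dropped 845484  overruns 0  frame 49541
""")
eno6 = rx_counters("""
        RX packets 47035402290  bytes 19038274391478 (19.0 TB)
        RX errors 49541  dropped 845484  overruns 0  frame 49541
""")

# Identical error counts on bd0 and eno6 are consistent with the bond errors
# originating on the eno6 member link.
print(bd0["errors"] == eno6["errors"])  # True
print(bd0["errors"] == bd0["frame"])    # True
print(f"bd0 RX error ratio: {bd0['errors'] / bd0['packets']:.2e}")
```

In this sample every RX error is also counted as a frame error, which points at the physical layer (cable, optic, or port) rather than at buffer exhaustion, since overruns is 0.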

Identify the UCS Server

UCS Server Verification from the SMI Cluster Deployer

Run show running-config clusters <cluster name> nodes <node name> in the SMI Cluster Deployer to find the CIMC IP address of the UCS server.

[lab-deployer-cm-primary] SMI Cluster Deployer# show running-config clusters lab0200-smf nodes primary-1
clusters lab0200-smf
nodes primary-1
  maintenance  false
  host-profile cp-data-r2-sysctl
  k8s node-type       primary
  k8s ssh-ip          10.192.1.42
  k8s sshd-bind-to-ssh-ip true
  k8s node-ip         10.192.1.42
  k8s node-labels smi.cisco.com/node-type oam
  exit
  k8s node-labels smi.cisco.com/node-type-1 proto
  exit
  ucs-server cimc user admin
...
  ucs-server cimc ip-address 172.16.1.62
...
  exit

SSH to the CIMC IP address 172.16.1.62 through the Active CM and verify the server name.

In this example, the server name is LAB0200-Server8-02.

cloud-user@lab-deployer-cm-primary:~$ ssh admin@172.16.1.62
Warning: Permanently added '172.16.1.62' (RSA) to the list of known hosts.
admin@172.16.1.62's password:
LAB0200-Server8-02#

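Note: Verify the server name in the Customer Information Questionnaire (CIQ), if the CIQ is available.

Map the Primary VIP Ports to the UCS Network Interfaces

On the Primary VIP, check the physical interface name for eno6 with the ls -la /sys/class/net command. In this example, eno6 resolves to PCI address 1d:00.1, so that is the address to look for in the lspci output.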

cloud-user@lab0200-smf-primary-1:~$ ls -la /sys/class/net
total 0
drwxr-xr-x  2 root root    0 Oct 12 06:18 .
drwxr-xr-x 87 root root    0 Oct 12 06:18 ..
lrwxrwxrwx  1 root root    0 Oct 12 06:18 bd0 -> ../../devices/virtual/net/bd0
lrwxrwxrwx  1 root root    0 Oct 12 06:18 bd1 -> ../../devices/virtual/net/bd1
…
lrwxrwxrwx  1 root root    0 Oct 12 06:18 eno5 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:01.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.0/net/eno5
lrwxrwxrwx  1 root root    0 Oct 12 06:18 eno6 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:01.0/0000:1b:00.0/0000:1c:00.0/0000:1d:00.1/net/eno6

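Note: lspci shows information about all devices on the UCS server, such as MLOM, SLOM, PCI, and so on. The device information can be used to map the interface names in the ls -la /sys/class/net command output.

In the example, port 1d:00.1 belongs to the MLOM and is the eno6 interface. eno5 is MLOM port 1d:00.0.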

cloud-user@lab0200-smf-primary-1:~$ lspci
……
1d:00.0 Ethernet controller: Cisco Systems Inc VIC Ethernet NIC (rev a2)
1d:00.1 Ethernet controller: Cisco Systems Inc VIC Ethernet NIC (rev a2)
3b:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
3b:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
5e:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
5e:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
d8:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
d8:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
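
The mapping from interface name to PCI device can be sketched in Python as follows. The symlink target and the lspci lines are copied from the outputs above. The function names pci_address and lspci_table are hypothetical, and the sketch assumes PCI domain 0000, which is the only domain that appears in this example.

```python
import re

# The last path component before /net/<ifname> is the PCI function of the NIC.
PCI = re.compile(r"/[0-9a-f]{4}:([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])/net/[^/]+$")


def pci_address(sysfs_target):
    """Return the bus:device.function of a /sys/class/net symlink target, or None."""
    m = PCI.search(sysfs_target)
    return m.group(1) if m else None


def lspci_table(lspci_output):
    """Map 'bus:device.function' to the device description from lspci output."""
    table = {}
    for line in lspci_output.splitlines():
        if " " in line.strip():
            address, description = line.strip().split(" ", 1)
            table[address] = description
    return table


eno6_target = (
    "../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:01.0/"
    "0000:1b:00.0/0000:1c:00.0/0000:1d:00.1/net/eno6"
)
bd0_target = "../../devices/virtual/net/bd0"

devices = lspci_table("""\
1d:00.0 Ethernet controller: Cisco Systems Inc VIC Ethernet NIC (rev a2)
1d:00.1 Ethernet controller: Cisco Systems Inc VIC Ethernet NIC (rev a2)
3b:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
""")

print(pci_address(eno6_target))           # 1d:00.1
print(pci_address(bd0_target))            # None (virtual bond device)
print(devices[pci_address(eno6_target)])  # Ethernet controller: Cisco Systems Inc VIC Ethernet NIC (rev a2)
```

On a live host, the symlink target can be read with os.readlink("/sys/class/net/eno6") instead of being pasted in.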

In the CIMC GUI, match the MLOM MAC address with the one seen in the ifconfig output from the Primary VIP.

cloud-user@lab0200-smf-primary-1:~$  ifconfig bd0
bd0: flags=5187<UP,BROADCAST,RUNNING,PRIMARY,MULTICAST>  mtu 1500
        inet6 fe80::8e94:1fff:fef6:53cd  prefixlen 64  scopeid 0x20<link>
        ether 8c:94:1f:f6:53:cd  txqueuelen 1000  (Ethernet)
        RX packets 47035763777  bytes 19038286946282 (19.0 TB)
        RX errors 49541  dropped 845484  overruns 0  frame 49541
        TX packets 53797663096  bytes 32320571418654 (32.3 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


cloud-user@lab0200-smf-primary-1:~$  ifconfig eno6
eno6: flags=6211<UP,BROADCAST,RUNNING,SECONDARY,MULTICAST>  mtu 1500
        ether 8c:94:1f:f6:53:cd  txqueuelen 1000  (Ethernet)
        RX packets 47035402290  bytes 19038274391478 (19.0 TB)
        RX errors 49541  dropped 845484  overruns 0  frame 49541
        TX packets 53797735337  bytes 32320609021235 (32.3 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

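Identify the Leaf Switches

As shown in the figure, in the Cluster Manager network the MLOM (eno5/eno6) is connected to Leaf 1 and Leaf 2.

Note: Validate the leaf hostnames in the CIQ, if the CIQ is available.

CM Networking Design

Log in to both leaves and grep for the server name.

In the example, the LAB0200-Server8-02 MLOM interfaces are connected to interface Eth1/49 on Leaf1 and on Leaf2.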

Leaf1# sh int description | inc LAB0200-Server8-02
Eth1/10       eth    40G     PCIE-01-2-LAB0200-Server8-02
Eth1/30       eth    40G     PCIE-02-2-LAB0200-Server8-02
Eth1/49       eth    40G     LAB0200-Server8-02 MLOM-P2


Leaf2# sh int description | inc LAB0200-Server8-02
Eth1/10       eth    40G     PCIE-01-1-LAB0200-Server8-02
Eth1/30       eth    40G     PCIE-02-1-LAB0200-Server8-02
Eth1/49       eth    40G     LAB0200-Server8-02 MLOM-P1
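
When the description output is long, the MLOM ports for a server can be filtered with a short script. The following Python sketch assumes the four-column layout shown above (port, type, speed, description) and assumes the port descriptions follow the naming seen in this lab. The function name ports_for is hypothetical.

```python
def ports_for(description_output, server, keyword="MLOM"):
    """Return (port, description) pairs whose description names the server and keyword."""
    ports = []
    for line in description_output.splitlines():
        fields = line.split(None, 3)  # port, type, speed, free-text description
        if len(fields) == 4 and server in fields[3] and keyword in fields[3]:
            ports.append((fields[0], fields[3].strip()))
    return ports


leaf1 = """\
Eth1/10       eth    40G     PCIE-01-2-LAB0200-Server8-02
Eth1/30       eth    40G     PCIE-02-2-LAB0200-Server8-02
Eth1/49       eth    40G     LAB0200-Server8-02 MLOM-P2
"""
leaf2 = """\
Eth1/10       eth    40G     PCIE-01-1-LAB0200-Server8-02
Eth1/30       eth    40G     PCIE-02-1-LAB0200-Server8-02
Eth1/49       eth    40G     LAB0200-Server8-02 MLOM-P1
"""

print(ports_for(leaf1, "LAB0200-Server8-02"))  # [('Eth1/49', 'LAB0200-Server8-02 MLOM-P2')]
print(ports_for(leaf2, "LAB0200-Server8-02"))  # [('Eth1/49', 'LAB0200-Server8-02 MLOM-P1')]
```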

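Solution

Important: Every issue needs its own analysis. If no errors are found on the Nexus side, check the UCS server interfaces for errors.

In this scenario, the issue is related to a link failure on Leaf1 int eth1/49, which is connected to LAB0200-Server8-02 MLOM eno6.

The UCS server was validated and no hardware issue was found. The MLOM and its ports were in a good state.

Leaf1 showed TX output errors: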

Leaf1# sh int Eth1/49
Ethernet1/49 is up
admin state is up, Dedicated Interface
  Hardware: 10000/40000/100000 Ethernet, address: e8eb.3437.48ca (bia e8eb.3437.48ca)
  Description: LAB0200-Server8-02 MLOM-P2
  MTU 9216 bytes, BW 40000000 Kbit , DLY 10 usec
  reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, medium is broadcast
  Port mode is trunk
  full-duplex, 40 Gb/s, media type is 40G
  Beacon is turned off
  Auto-Negotiation is turned on  FEC mode is Auto
  Input flow-control is off, output flow-control is off
  Auto-mdix is turned off
  Rate mode is dedicated
  Switchport monitor is off
  EtherType is 0x8100
  EEE (efficient-ethernet) : n/a
    admin fec state is auto, oper fec state is off
  Last link flapped 5week(s) 6day(s)
  Last clearing of "show interface" counters never
  12 interface resets
  Load-Interval #1: 30 seconds
    30 seconds input rate 162942488 bits/sec, 26648 packets/sec
    30 seconds output rate 35757024 bits/sec, 16477 packets/sec
    input rate 162.94 Mbps, 26.65 Kpps; output rate 35.76 Mbps, 16.48 Kpps
  Load-Interval #2: 5 minute (300 seconds)
    300 seconds input rate 120872496 bits/sec, 22926 packets/sec
    300 seconds output rate 54245920 bits/sec, 17880 packets/sec
    input rate 120.87 Mbps, 22.93 Kpps; output rate 54.24 Mbps, 17.88 Kpps
  RX
    85973263325 unicast packets  6318912 multicast packets  55152 broadcast packets
    85979637389 input packets  50020924423841 bytes
    230406880 jumbo packets  0 storm suppression bytes
    0 runts  0 giants  0 CRC  0 no buffer
    0 input error  0 short frame  0 overrun   0 underrun  0 ignored
    0 watchdog  0 bad etype drop  0 bad proto drop  0 if down drop
    0 input with dribble  0 input discard
    0 Rx pause
  TX
    76542979816 unicast packets  88726302 multicast packets  789768 broadcast packets
    76632574981 output packets  29932747104403 bytes
    3089287610 jumbo packets
    79095 output error  0 collision  0 deferred  0 late collision
    0 lost carrier  0 no carrier  0 babble  0 output discard
    0 Tx pause
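
The leaf transmits toward the server, so output errors on the leaf port are in the same traffic direction as the receive errors seen on eno6. The following Python sketch pulls the non-zero error counters out of the show interface output. The counter lines are copied from the output above, and the function name error_counters and the list of counter names checked are choices made for this illustration.

```python
import re

COUNTERS = ("CRC", "input error", "input discard", "output error", "output discard")


def error_counters(show_int_output):
    """Return {counter name: value} for the selected counters found in the output."""
    found = {}
    for name in COUNTERS:
        m = re.search(r"(\d+) " + re.escape(name) + r"\b", show_int_output)
        if m:
            found[name] = int(m.group(1))
    return found


output = """\
  RX
    0 runts  0 giants  0 CRC  0 no buffer
    0 input error  0 short frame  0 overrun   0 underrun  0 ignored
    0 input with dribble  0 input discard
  TX
    79095 output error  0 collision  0 deferred  0 late collision
    0 lost carrier  0 no carrier  0 babble  0 output discard
"""

counters = error_counters(output)
nonzero = {name: value for name, value in counters.items() if value}
print(nonzero)  # {'output error': 79095}
```

Only the TX side of the leaf port shows errors here, which matches a fault on the link between Leaf1 Eth1/49 and the server rather than on traffic arriving at the leaf.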

The "network-receive-error" alert was resolved with a cable replacement on Leaf1 int eth1/49.

The last interface link failure was reported right before the cable replacement.

2021 Nov 17 07:36:48 TPLF0201 %BFD-5-SESSION_STATE_DOWN: BFD session 1090519112 to neighbor 10.22.101.1 on interface Vlan2201 has gone down. Reason: Control Detection Time Expired.
2021 Nov 17 07:37:30 TPLF0201 %BFD-5-SESSION_STATE_DOWN: BFD session 1090519107 to neighbor 10.22.101.2 on interface Vlan2201 has gone down. Reason: Control Detection Time Expired.
2021 Nov 18 05:09:12 TPLF0201 %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/48 is down (Link failure)

After the cable replacement, the alerts on eno6/bd0 cleared in labceed22.

[lab0200-smf/labceed22] cee# show alerts active summary
NAME UID SEVERITY STARTS AT SOURCE SUMMARY
---------------------------------------------------------------------------------------------------------------------------------------------------------
watchdog a62f59201ba8 minor 11-02T05:57:18 System This is an alert meant to ensure that the entire alerting pipeline is functional. This ale...

Revision History

Revision	Publish Date	Comments
1.0	28-Mar-2022	Initial Release

Contributed by Cisco Engineers

Nebojsa Kosanovic
Cisco TAC Engineer
Analu Moreno
Cisco TAC Engineer
