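Introduction
This document describes how to troubleshoot high CPU usage caused by interrupts on Cisco IOS® XE platforms that run 16.x releases.
Background Information
This document was written by Raymond Whiting and Yogesh Ramdoss, Cisco TAC Engineers.
This document also introduces several important new commands on this platform that are integral to troubleshooting high CPU usage problems. It is important to understand how Cisco IOS XE is built. With Cisco IOS XE, Cisco has moved to a Linux kernel, and all of the subsystems have been broken down into processes. All of the subsystems that were previously inside Cisco IOS, such as the module drivers and High Availability (HA), now run as software processes within the Linux Operating System (OS). Cisco IOS itself runs as a daemon within the Linux OS (IOSd). Cisco IOS XE retains not only the look and feel of classic Cisco IOS, but also its operation, support, and management.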
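Here are some useful definitions:
- Forwarding Engine Driver (FED): This is the heart of the Cisco Catalyst switch and is responsible for all hardware programming and forwarding.
- IOSd: This is the Cisco IOS daemon that runs on the Linux kernel. It runs as a software process within the kernel.
- Packet Delivery System (PDS): This is the architecture and process by which packets are delivered to and from the various subsystems. For example, it controls how packets are delivered from the FED to IOSd and vice versa.
- Control Plane (CP): The control plane is a generic term used to group together the functions and traffic that involve the CPU of the Catalyst switch. This includes traffic such as Spanning Tree Protocol (STP), Hot Standby Router Protocol (HSRP), and routing protocols that are destined to the switch or sent from the switch. It also includes application layer protocols, such as Secure Shell (SSH) and Simple Network Management Protocol (SNMP), that must be handled by the CPU.
- Data Plane (DP): Typically, the data plane encompasses the hardware ASICs and the traffic that is forwarded without assistance from the control plane.
- Punt: An ingress protocol control packet that is intercepted by the DP and sent to the CP for processing.
- Inject: A protocol packet that is generated by the CP and sent to the DP to egress on the I/O interfaces.
- LSMPI: Linux Shared Memory Punt Interface.
A high-level diagram of the communication path between the data plane and the control plane: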
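High CPU Troubleshooting Workflow
This section provides a systematic workflow to triage high CPU issues on the switch. Note that it covers the processes that were selected at the time this section was written.
Case Study 1. Address Resolution Protocol Interrupts
The troubleshooting and verification process in this section can be broadly used for high CPU usage due to interrupts.
Step 1. Identify the Process that Consumes CPU Cycles
The show process cpu command is used to display the current process state inside the IOSd daemon. When you add the output modifier | exclude 0.00, it filters out the processes that are currently idle.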
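This output provides two valuable pieces of information:
- CPU utilization for five seconds: 91%/30%
  - The first number (91%) is the overall CPU utilization of the switch
  - The second number (30%) is the utilization caused by interrupts from the data plane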
- The ARP Input (Address Resolution Protocol) process is the top Cisco IOS process that currently consumes resources:
Switch# show processes cpu sort | ex 0.00
CPU utilization for five seconds: 91%/30%; one minute: 30%; five minutes: 8%
 PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
  37       14645         325      45061 59.53% 18.86%  4.38%   0 ARP Input
 137        2288         115      19895  1.20%  0.14%  0.07%   0 Per-minute Jobs
 373        2626       35334         74  0.15%  0.11%  0.09%   0 MMA DB TIMER
 218        3123       69739         44  0.07%  0.09%  0.12%   0 IP ARP Retry Age
 404        2656       35333         75  0.07%  0.09%  0.09%   0 MMA DP TIMER
The show processes cpu platform sorted command is used to display the process utilization as seen from the Linux kernel. The output shows that the FED process is high, which is due to the ARP requests that are punted to the IOSd process:
Switch# show processes cpu platform sorted
CPU utilization for five seconds: 38%, one minute: 38%, five minutes: 40%
Core 0: CPU utilization for five seconds: 39%, one minute: 37%, five minutes: 39%
Core 1: CPU utilization for five seconds: 41%, one minute: 38%, five minutes: 40%
Core 2: CPU utilization for five seconds: 30%, one minute: 38%, five minutes: 40%
Core 3: CPU utilization for five seconds: 37%, one minute: 39%, five minutes: 41%
   Pid    PPid    5Sec    1Min    5Min  Status        Size  Name
--------------------------------------------------------------------------------
 22701   22439     89%     88%     88%  R       2187444224  linux_iosd-imag
 11626   11064     46%     47%     48%  S       2476175360  fed main event
  4585       2      7%      9%      9%  S                0  lsmpi-xmit
  4586       2      3%      6%      6%  S                0  lsmpi-rx
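When this check is repeated often, the two figures in the five-second field of show processes cpu can be separated with a small off-box script. The Python sketch below is only an illustration: the function name split_cpu_line is hypothetical, and the process-level share is taken as the total minus the interrupt share.

```python
import re

def split_cpu_line(line):
    """Split the 'NN%/MM%' field into total, interrupt, and process-level shares."""
    m = re.search(r"five seconds:\s*(\d+)%/(\d+)%", line)
    if m is None:
        raise ValueError("no 'five seconds: NN%/MM%' field found")
    total, interrupt = int(m.group(1)), int(m.group(2))
    # Assumption: whatever is not interrupt time is spent in IOSd processes.
    return {"total": total, "interrupt": interrupt, "process": total - interrupt}

line = "CPU utilization for five seconds: 91%/30%; one minute: 30%; five minutes: 8%"
print(split_cpu_line(line))  # {'total': 91, 'interrupt': 30, 'process': 61}
```

For the sample line, the process-level share of 61% is close to the sum of the per-process five-second values in the show processes cpu output, most of which belongs to ARP Input.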
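Step 2. Investigate Why FED Punts Packets to the Control Plane
From Step 1, you can conclude that the IOSd/ARP process runs high, but it is a victim of traffic introduced from the data plane. Further investigation is needed into why the FED process punts traffic to the CPU and where this traffic comes from.
The show platform software fed switch active punt cause summary command gives a high-level overview of the punt reasons. Any number that increments across multiple runs of this command points to a punt cause that is actively sending traffic to the CPU: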
Switch#show platform software fed switch active punt cause summary
Statistics for all causes

Cause  Cause Info                      Rcvd        Dropped
------------------------------------------------------------------------------
7      ARP request or response         18444227    0
11     For-us data                     16          0
21     RP<->QFP keepalive              3367        0
24     Glean adjacency                 2           0
55     For-us control                  6787        0
60     IP subnet or broadcast packet   14          0
96     Layer2 control protocols        3548        0
------------------------------------------------------------------------------
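The comparison between runs can be automated once two copies of the output have been saved. The following Python sketch shows one way to do it; the helper names are hypothetical, the first sample reuses rows from the output above, and the second sample is invented purely for illustration.

```python
import re

# One data row: cause id, cause name, received count, dropped count.
ROW = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s+(\d+)\s*$")

def parse_punt_causes(text):
    """Return {cause name: received count} from the command output."""
    counts = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            counts[m.group(2)] = int(m.group(3))
    return counts

def incrementing_causes(before, after):
    """Return (cause, delta) pairs that grew between two runs, largest first."""
    deltas = {name: after[name] - before.get(name, 0) for name in after}
    grown = [(name, d) for name, d in deltas.items() if d > 0]
    return sorted(grown, key=lambda item: item[1], reverse=True)

first_run = """\
7       ARP request or response         18444227   0
11      For-us data                     16         0
21      RP<->QFP keepalive              3367       0
96      Layer2 control protocols        3548       0
"""

# Hypothetical second run, invented for illustration only.
second_run = """\
7       ARP request or response         18951302   0
11      For-us data                     16         0
21      RP<->QFP keepalive              3372       0
96      Layer2 control protocols        3553       0
"""

for cause, delta in incrementing_causes(parse_punt_causes(first_run),
                                        parse_punt_causes(second_run)):
    print(f"{cause}: +{delta}")
```

Small deltas on keepalive and Layer 2 control causes are normal background activity; a cause that grows by several orders of magnitude more than the rest is the one to investigate.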
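Packets that are sent from the FED to the control plane use a split queue structure to guarantee high-priority control traffic, so that it is not lost behind lower-priority traffic such as ARP. A high-level overview of these queues can be viewed with show platform software fed switch active cpu-interface. After this command is run several times, the Forus Resolution queue (Forus means traffic destined for the CPU) is seen to increment rapidly.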
Switch#show platform software fed switch active cpu-interface

queue                   retrieved   dropped   invalid   hol-block
-------------------------------------------------------------------------
Routing Protocol        8182        0         0         0
L2 Protocol             161         0         0         0
sw forwarding           2           0         0         0
broadcast               14          0         0         0
icmp gen                0           0         0         0
icmp redirect           0           0         0         0
logging                 0           0         0         0
rpf-fail                0           0         0         0
DOT1X authentication    0           0         0         0
Forus Traffic           16          0         0         0
Forus Resolution        24097779    0         0         0
Inter FED               0           0         0         0
L2 LVX control          0           0         0         0
EWLC control            0           0         0         0
EWLC data               0           0         0         0
L2 LVX data             0           0         0         0
Learning cache          0           0         0         0
Topology control        4117        0         0         0
Proto snooping          0           0         0         0
DHCP snooping           0           0         0         0
Transit Traffic         0           0         0         0
Multi End station       0           0         0         0
Webauth                 0           0         0         0
Crypto control          0           0         0         0
Exception               0           0         0         0
General Punt            0           0         0         0
NFL sampled data        0           0         0         0
Low latency             0           0         0         0
EGR exception           0           0         0         0
FSS                     0           0         0         0
Multicast data          0           0         0         0
Gold packet             0           0         0         0
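The show platform software fed switch active punt cpuq all command gives a more detailed view of these queues. Queue 5 is responsible for ARP and, as expected, it increments across multiple runs of the command. The show plat soft fed sw active inject cpuq clear command can be used to clear the counters for easier reading.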
Switch#show platform software fed switch active punt cpuq all
<snip>
CPU Q Id                       : 5
CPU Q Name                     : CPU_Q_FORUS_ADDR_RESOLUTION
Packets received from ASIC     : 21018219
Send to IOSd total attempts    : 21018219
Send to IOSd failed count      : 0
RX suspend count               : 0
RX unsuspend count             : 0
RX unsuspend send count        : 0
RX unsuspend send failed count : 0
RX consumed count              : 0
RX dropped count               : 0
RX non-active dropped count    : 0
RX conversion failure dropped  : 0
RX INTACK count                : 1050215
RX packets dq'd after intack   : 90
Active RxQ event               : 3677400
RX spurious interrupt          : 1050016
<snip>
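From here, there are a couple of options. ARP is broadcast traffic, so you can look for interfaces that have an abnormally high rate of broadcast traffic (this is also useful when you troubleshoot Layer 2 loops). This command must be run multiple times to determine which interface actively increments.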
Switch#show interfaces counters

Port            InOctets    InUcastPkts    InMcastPkts    InBcastPkts
Gi1/0/1    1041141009678              9              0    16267828358
Gi1/0/2             1254             11              0              1
Gi1/0/3                0              0              0              0
Gi1/0/4                0              0              0              0
Another option is to use the Embedded Packet Capture (EPC) tool to gather a sample of the packets that are seen at the control plane.
Switch#monitor capture cpuCap control-plane in match any file location flash:cpuCap.pcap
Switch#show monitor capture cpuCap

Status Information for Capture cpuCap
  Target Type:
   Interface: Control Plane, Direction: IN
   Status : Inactive
  Filter Details:
   Capture all packets
  Buffer Details:
   Buffer Type: LINEAR (default)
  File Details:
   Associated file name: flash:cpuCap.pcap
  Limit Details:
   Number of Packets to capture: 0 (no limit)
   Packet Capture duration: 0 (no limit)
   Packet Size to capture: 0 (no limit)
   Packet sampling rate: 0 (no sampling)
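This command configures an internal capture on the switch that captures any traffic that is punted to the control plane. The traffic is saved to a file on flash. This is a normal Wireshark pcap file that can be exported from the switch and opened in Wireshark for further analysis.
Start the capture, let it run for a few seconds, and stop the capture: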
Switch#monitor capture cpuCap start
Enabling Control plane capture may seriously impact system performance. Do you want to continue? [yes/no]: yes
Started capture point : cpuCap
*Jun 14 17:57:43.172: %BUFCAP-6-ENABLE: Capture Point cpuCap enabled.

Switch#monitor capture cpuCap stop
Capture statistics collected at software:
        Capture duration - 59 seconds
        Packets received - 215950
        Packets dropped - 0
        Packets oversized - 0
        Bytes dropped in asic - 0
Stopped capture point : cpuCap
Switch#
*Jun 14 17:58:37.884: %BUFCAP-6-DISABLE: Capture Point cpuCap disabled.
The capture file can also be viewed on the switch:
Switch#show monitor capture file flash:cpuCap.pcap
Starting the packet display ........ Press Ctrl + Shift + 6 to exit

  1   0.000000 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  2   0.000054 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  3   0.000082 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  4   0.000109 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  5   0.000136 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  6   0.000162 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  7   0.000188 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  8   0.000214 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  9   0.000241 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
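From this output, it is evident that host 192.168.1.2 is the source of the constant ARPs that cause the high CPU utilization on the switch. Use the show ip arp and show mac address-table address commands to track down the host, and then either remove it from the network or address the ARPs. It is also possible to get full details of each captured packet with the detail option of the capture view command, show monitor capture file flash:cpuCap.pcap detail. Refer to this guide for more information about packet captures on a Catalyst switch.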
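When the capture is long, it can help to tally the senders rather than read the packet list by eye. This Python sketch assumes the text output of show monitor capture file has been saved off the switch; the function name count_arp_senders is hypothetical, and the sample reuses three lines from the capture shown earlier.

```python
import re
from collections import Counter

# Matches the summary text of an ARP request: "Who has <target>? Tell <sender>".
ARP_REQ = re.compile(r"ARP\s+\d+\s+Who has (\S+)\?\s+Tell (\S+)")

def count_arp_senders(capture_text):
    """Count ARP requests per sender IP (the address after 'Tell')."""
    senders = Counter()
    for line in capture_text.splitlines():
        m = ARP_REQ.search(line)
        if m:
            senders[m.group(2)] += 1
    return senders

sample = """\
  1   0.000000 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  2   0.000054 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
  3   0.000082 Xerox_d7:67:a1 -> Broadcast    ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
"""

for sender, count in count_arp_senders(sample).most_common():
    print(f"{sender}: {count} ARP requests")
```

The same counter can be keyed on the target address (the first capture group) to see whether a single unresolved destination drives the requests.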
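Case Study 2. IP Redirects with CoPP
Catalyst switches of the latest generation are protected by Control Plane Policing (CoPP) by default. CoPP is used to protect the CPU from malicious attacks and misconfigurations that can jeopardize the ability of the switch to maintain critical functions such as spanning tree and routing protocols. These protections can lead to scenarios in which the switch shows only slightly elevated CPU and clean interface counters, yet traffic is dropped as it traverses the switch. It is important to note the baseline CPU utilization of your device in normal operation. Elevated CPU utilization is not necessarily a problem, because it depends on the features that are enabled on the device, but when utilization increases with no configuration change, it can indicate a problem.
Consider this scenario: hosts that sit behind the gateway switch report slow download speeds and ping loss to the Internet. A general health check of the switch shows no errors on the interfaces and no ping loss when pings are sourced from the gateway switch.
When you check the CPU, it shows slightly elevated numbers due to interrupts.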
Switch#show processes cpu sorted | ex 0.00
CPU utilization for five seconds: 8%/7%; one minute: 8%; five minutes: 8%
 PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 122      913359     1990893        458  0.39%  1.29%  1.57%   0 IOSXE-RP Punt Se
 147        5823       16416        354  0.07%  0.05%  0.06%   0 PLFM-MGR IPC pro
 404       13237      183032         72  0.07%  0.08%  0.07%   0 MMA DP TIMER
When you check the CPU interface, you see that the ICMP redirect counter actively increments.
Switch#show platform software fed switch active cpu-interface

queue                   retrieved   dropped   invalid   hol-block
-------------------------------------------------------------------------
Routing Protocol        12175       0         0         0
L2 Protocol             236         0         0         0
sw forwarding           714673      0         0         0
broadcast               2           0         0         0
icmp gen                0           0         0         0
icmp redirect           2662788     0         0         0
logging                 7           0         0         0
rpf-fail                0           0         0         0
DOT1X authentication    0           0         0         0
Forus Traffic           21776434    0         0         0
Forus Resolution        724021      0         0         0
Inter FED               0           0         0         0
L2 LVX control          0           0         0         0
EWLC control            0           0         0         0
EWLC data               0           0         0         0
L2 LVX data             0           0         0         0
Learning cache          0           0         0         0
Topology control        6122        0         0         0
Proto snooping          0           0         0         0
DHCP snooping           0           0         0         0
Transit Traffic         0           0         0         0
Although no drops are observed in the FED, if you check CoPP, drops can be observed in the ICMP Redirect queue.
Switch#show platform hardware fed switch 1 qos queue stats internal cpu policer

                         CPU Queue Statistics
============================================================================================
                                              (default)  (set)     Queue
QId  PlcIdx  Queue Name                Enabled   Rate     Rate      Drop(Bytes)
-----------------------------------------------------------------------------
0    11      DOT1X Auth                  Yes     1000     1000      0
1    1       L2 Control                  Yes     2000     2000      0
2    14      Forus traffic               Yes     4000     4000      0
3    0       ICMP GEN                    Yes     600      600       0
4    2       Routing Control             Yes     5400     5400      0
5    14      Forus Address resolution    Yes     4000     4000      0
6    0       ICMP Redirect               Yes     600      600       463538463
7    16      Inter FED Traffic           Yes     2000     2000      0
8    4       L2 LVX Cont Pack            Yes     1000     1000      0
<snip>
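CoPP is essentially a QoS policy that is placed on the control plane of the device. CoPP works like any other QoS on the switch: when the queue for specific traffic is exhausted, traffic that uses that queue is dropped. From these outputs, you know that traffic is software switched due to ICMP redirects, and you know that this traffic is dropped due to the rate limit on the ICMP Redirect queue. You can complete a capture on the control plane to validate that the packets that hit the control plane are from the users.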
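On a switch with many CPU queues, the rows with non-zero drops are easy to miss. The Python sketch below filters a saved copy of the policer statistics for queues that have dropped traffic. The function name queues_with_drops is hypothetical, and the row layout is assumed to match the output shown earlier (QId, PlcIdx, name, enabled flag, default rate, set rate, dropped bytes).

```python
import re

# One policer row: QId, PlcIdx, queue name, Enabled, default rate, set rate, dropped bytes.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(.+?)\s+(Yes|No)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

def queues_with_drops(text):
    """Return [(queue name, set rate, dropped bytes)] for queues that dropped traffic."""
    hits = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(7)) > 0:
            hits.append((m.group(3), int(m.group(6)), int(m.group(7))))
    return hits

policer_output = """\
0    11     DOT1X Auth                  Yes     1000      1000     0
2    14     Forus traffic               Yes     4000      4000     0
5    14     Forus Address resolution    Yes     4000      4000     0
6    0      ICMP Redirect               Yes     600       600      463538463
7    16     Inter FED Traffic           Yes     2000      2000     0
"""

for name, rate, dropped in queues_with_drops(policer_output):
    print(f"{name}: set rate {rate}, {dropped} bytes dropped")
```

Because the drop counter is cumulative, a non-zero value only shows that the policer has dropped traffic at some point; compare two runs to confirm that drops are still in progress.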
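To see the match logic that each class uses, there is a CLI that helps identify the packet types that hit a specific queue. Consider this example, which shows what hits the system-cpp-police-routing-control class: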
Switch#show platform software qos copp policy-info
Default rates of all classmaps are displayed:
policy-map system-cpp-policy
class system-cpp-police-routing-control
police rate 5400 pps
Switch#show platform software qos copp class-info
ACL representable classmap filters are displayed:
class-map match-any system-cpp-police-routing-control
description Routing control and Low Latency
match access-group name system-cpp-mac-match-routing-control
match access-group name system-cpp-ipv4-match-routing-control
match access-group name system-cpp-ipv6-match-routing-control
match access-group name system-cpp-ipv4-match-low-latency
match access-group name system-cpp-ipv6-match-low-latency
mac access-list extended system-cpp-mac-match-routing-control
permit any host 0180.C200.0014
permit any host 0900.2B00.0004
ip access-list extended system-cpp-ipv4-match-routing-control
permit udp any any eq rip
<...snip...>
ipv6 access-list system-cpp-ipv6-match-routing-control
permit ipv6 any FF02::1:FF00:0/104
permit ipv6 any host FF01::1
<...snip...>
ip access-list extended system-cpp-ipv4-match-low-latency
permit udp any any eq 3784
permit udp any any eq 3785
ipv6 access-list system-cpp-ipv6-match-low-latency
permit udp any any eq 3784
permit udp any any eq 3785
<...snip...>
Switch#monitor capture cpuSPan control-plane in match any file location flash:cpuCap.pcap
Control-plane direction IN is already attached to the capture

Switch#monitor capture cpuSpan start
Enabling Control plane capture may seriously impact system performance. Do you want to continue? [yes/no]: yes
Started capture point : cpuSpan
Switch#
*Jun 15 17:28:52.841: %BUFCAP-6-ENABLE: Capture Point cpuSpan enabled.

Switch#monitor capture cpuSpan stop
Capture statistics collected at software:
        Capture duration - 12 seconds
        Packets received - 5751
        Packets dropped - 0
        Packets oversized - 0
        Bytes dropped in asic - 0
Stopped capture point : cpuSpan
Switch#
*Jun 15 17:29:02.415: %BUFCAP-6-DISABLE: Capture Point cpuSpan disabled.

Switch#show monitor capture file flash:cpuCap.pcap detailed
Starting the packet display ........ Press Ctrl + Shift + 6 to exit

Frame 1: 60 bytes on wire (480 bits), 60 bytes captured (480 bits) on interface 0
<snip>
Ethernet II, Src: OmronTat_2c:a1:52 (00:00:0a:2c:a1:52), Dst: Cisco_8f:cb:47 (00:42:5a:8f:cb:47)
<snip>
Internet Protocol Version 4, Src: 192.168.1.10, Dst: 8.8.8.8
<snip>
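When this host pings 8.8.8.8, it sends the ping to the MAC address of the gateway, because the destination address is outside of the VLAN. The gateway switch detects that the next hop is in the same VLAN, rewrites the destination MAC address to that of the firewall, and forwards the packet. This process can occur in hardware, but an exception to this hardware forwarding is the IP redirect process. When the switch receives the ping, it detects that it routes traffic within the same VLAN and punts the traffic to the CPU in order to generate a redirect packet back to the host. This redirect message informs the host that there is a more optimal path to the destination. In this case, the Layer 2 next hop is by design and expected, so the switch must be configured not to send redirect messages and to forward the packets in hardware. This is done when you disable redirects on the VLAN interface.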
interface Vlan1
 ip address 192.168.1.1 255.255.255.0
 no ip redirects
end
When IP redirects are turned off, the switch rewrites the MAC address and forwards the packet in hardware.
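The condition that triggers the punt can be pictured with a small model. The Python sketch below is a simplification written for illustration, not the switch's actual forwarding logic: it treats a packet as a redirect candidate when the sender and the routed next hop both sit in the subnet of the ingress VLAN interface and redirects are enabled. The function name and the firewall address 192.168.1.254 are hypothetical; the source does not give the firewall address.

```python
import ipaddress

def would_punt_for_redirect(svi_network, src_ip, next_hop_ip, redirects_enabled=True):
    """Simplified model: punt when the packet would be routed back into the VLAN it came from."""
    net = ipaddress.ip_network(svi_network)
    same_vlan = (ipaddress.ip_address(src_ip) in net
                 and ipaddress.ip_address(next_hop_ip) in net)
    return redirects_enabled and same_vlan

# Host 192.168.1.10 pings 8.8.8.8 through a firewall assumed to be at 192.168.1.254.
print(would_punt_for_redirect("192.168.1.0/24", "192.168.1.10", "192.168.1.254"))
# Same flow after "no ip redirects" is applied to interface Vlan1.
print(would_punt_for_redirect("192.168.1.0/24", "192.168.1.10", "192.168.1.254",
                              redirects_enabled=False))
```

The first call returns True and the second returns False, which mirrors the behavior before and after redirects are disabled on the VLAN interface.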
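Case Study 3. Intermittent High CPU
If the high CPU on the switch is intermittent, a script can be set up on the switch to run these commands automatically at the time of a high CPU event. This is done with the Cisco IOS Embedded Event Manager (EEM).
The entry-val is used to determine how high the CPU must be before the script triggers. The script monitors the 5-second CPU average SNMP OID. Two files are written to flash: tac-cpu-<timestamp>.txt contains the command outputs, and tac-cpu-<timestamp>.pcap contains the CPU ingress capture. These files can be reviewed at a later date.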
config t
no event manager applet high-cpu authorization bypass
event manager applet high-cpu authorization bypass
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type next entry-op gt entry-val 80 poll-interval 1 ratelimit 300 maxrun 180
action 0.01 syslog msg "High CPU detected, gathering system information."
action 0.02 cli command "enable"
action 0.03 cli command "term exec prompt timestamp"
action 0.04 cli command "term length 0"
action 0.05 cli command "show clock"
action 0.06 regex "([0-9]|[0-9][0-9]):([0-9]|[0-9][0-9]):([0-9]|[0-9][0-9])" $_cli_result match match1
action 0.07 string replace "$match" 2 2 "."
action 0.08 string replace "$_string_result" 5 5 "."
action 0.09 set time $_string_result
action 1.01 cli command "show proc cpu sort | append flash:tac-cpu-$time.txt"
action 1.02 cli command "show proc cpu hist | append flash:tac-cpu-$time.txt"
action 1.03 cli command "show proc cpu platform sorted | append flash:tac-cpu-$time.txt"
action 1.04 cli command "show interface | append flash:tac-cpu-$time.txt"
action 1.05 cli command "show interface stats | append flash:tac-cpu-$time.txt"
action 1.06 cli command "show log | append flash:tac-cpu-$time.txt"
action 1.07 cli command "show ip traffic | append flash:tac-cpu-$time.txt"
action 1.08 cli command "show users | append flash:tac-cpu-$time.txt"
action 1.09 cli command "show platform software fed switch active punt cause summary | append flash:tac-cpu-$time.txt"
action 1.10 cli command "show platform software fed switch active cpu-interface | append flash:tac-cpu-$time.txt"
action 1.11 cli command "show platform software fed switch active punt cpuq all | append flash:tac-cpu-$time.txt"
action 2.08 cli command "no monitor capture tac_cpu"
action 2.09 cli command "monitor capture tac_cpu control-plane in match any file location flash:tac-cpu-$time.pcap"
action 2.10 cli command "monitor capture tac_cpu start" pattern "yes"
action 2.11 cli command "yes"
action 2.12 wait 10
action 2.13 cli command "monitor capture tac_cpu stop"
action 3.01 cli command "term default length"
action 3.02 cli command "terminal no exec prompt timestamp"
action 3.03 cli command "no monitor capture tac_cpu"
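Actions 0.05 through 0.09 of the applet build the timestamp used in the file names: they pull the HH:MM:SS portion out of the show clock output and replace the colons at positions 2 and 5 with periods. The Python sketch below mirrors those steps for illustration only. The function name eem_time_suffix is hypothetical, the sample clock string is an assumed format, and the pattern is written as \d{1,2} because Python alternation stops at the first alternative that matches, while the applet is assumed to select the longest match.

```python
import re

def eem_time_suffix(show_clock_output):
    """Mimic applet actions 0.06-0.09: extract HH:MM:SS and turn it into HH.MM.SS."""
    m = re.search(r"\d{1,2}:\d{1,2}:\d{1,2}", show_clock_output)
    if m is None:
        raise ValueError("no time found in show clock output")
    match = m.group(0)
    # action 0.07: string replace "$match" 2 2 "."  (replace the character at index 2)
    step1 = match[:2] + "." + match[3:]
    # action 0.08: string replace "$_string_result" 5 5 "."  (replace the character at index 5)
    return step1[:5] + "." + step1[6:]

# Assumed show clock output format, for illustration only.
suffix = eem_time_suffix("*17:57:43.172 UTC Thu Jun 14 2018")
print(f"flash:tac-cpu-{suffix}.txt")  # flash:tac-cpu-17.57.43.txt
```

The fixed positions 2 and 5 rely on a two-digit hour, which is why the sketch assumes the clock prints times such as 09:05:07 rather than 9:05:07.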
Related Information