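Introduction
This document describes how to troubleshoot punt keepalive failures.
Prerequisites
Requirements
Basic knowledge of Cisco IOS® XE.
Components Used
This document is based on Cisco IOS XE routers such as the CSR8000v, ASR1000, and ISR4000 series.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.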
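Background Information
In Cisco IOS XE based systems, the punt path is an internal data path. It is the path over which the control plane and the data plane communicate.
This internal path transports the control plane packets that are destined for consumption by the router itself.
When this path fails, you can see this type of error in the log.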
%IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 60 seconds
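Keepalive messages monitor the health of the path between the QFP and the RP.
This path is critical for system operation.
If these keepalives are not received for 5 minutes (300 seconds), you can see this critical log: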
%IOSXE_INFRA-2-FATAL_NO_PUNT_KEEPALIVE: Keepalive not received for 300 seconds resetting
The system resets in order to recover from this condition.
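The two messages above differ only in severity, mnemonic, and the elapsed time they report, so they are easy to tell apart when you scan a saved log. The next block is a minimal sketch of that idea, not a Cisco tool: the function name classify_keepalive and the regular expression are assumptions written for this illustration, based only on the two message formats quoted in this document.

```python
import re

# Matches both keepalive messages quoted above
# (severity 4 warning and severity 2 fatal).
KEEPALIVE_RE = re.compile(
    r"%IOSXE_INFRA-(?P<severity>\d)-(?P<mnemonic>(?:FATAL_)?NO_PUNT_KEEPALIVE): "
    r"Keepalive not received for (?P<seconds>\d+) seconds"
)


def classify_keepalive(line):
    """Return (seconds_without_keepalive, is_fatal), or None if the line is unrelated."""
    match = KEEPALIVE_RE.search(line)
    if match is None:
        return None
    is_fatal = match.group("mnemonic").startswith("FATAL_")
    return int(match.group("seconds")), is_fatal


if __name__ == "__main__":
    samples = [
        "%IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 60 seconds",
        "%IOSXE_INFRA-2-FATAL_NO_PUNT_KEEPALIVE: Keepalive not received for 300 seconds resetting",
    ]
    for sample in samples:
        print(classify_keepalive(sample))
```

A fatal result means that the reset described above has already been triggered. A growing series of non-fatal results is the window in which additional data can still be collected.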
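Punt Debug Log File
When the system resets because of a punt keepalive failure, it creates a file named punt_debug.log that collects relevant data to help you understand the behavior at the time of the event.
Note: Ensure that the system runs a recent Cisco IOS XE software release so that the punt_debug.log file is generated.
This file contains the output of these commands, each run multiple times so that the different counters can be compared: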
show platform software infra punt-keepalive
show platform software infra lsmpi
show platform software infrastructure lsmpi driver
show platform software infra lsmpi bufusage
show platform software punt-policer
show platform software status control-processor brief
show process cpu platform sorted
show platform software infrastructure punt
show platform hardware qfp active statistics drop
show platform hardware qfp active infra punt statistics type per-cause
show platform hardware qfp active infrastructure bqs queue output default all
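Note: In punt_debug.log, focus on error indicators and on large packet counts that can point to the cause of the problem.
Linux Shared Memory Punt Interface (LSMPI)
This component transfers packets and messages from the forwarding processor to the route processor.
Punt Policer
The punt policer is a control plane protection mechanism that lets the system protect and police control plane packets.
With the show platform software punt-policer command, you can see the conform packets and the packets dropped by this policer.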
----------------- show platform software punt-policer ------------------
Per Punt-Cause Policer Configuration and Packet Counters
Punt                               Config Rate(pps)  Conform Packets   Dropped Packets   Config Burst(pkts)  Config Alert
Cause  Description                 Normal  High      Normal  High      Normal  High      Normal  High        Normal  High
-------------------------------------------------------------------------------------------------------------------------
2      IPv4 Options                874     655       0       0         0       0         874     655         Off     Off
3      Layer2 control and legacy   8738    2185      0       0         0       0         8738    2185        Off     Off
4      PPP Control                 437     1000      0       0         0       0         437     1000        Off     Off
—— snip : output omitted for brevity ——
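When the policer output is long, it can help to filter it offline for punt causes that show drops. The next block is a rough sketch under stated assumptions: each data row is taken to be a cause number, a free-text description, eight numeric counters in the column order shown in the header (rate, conform, dropped, burst, each as Normal and High), and two alert flags. The names parse_policer_row, causes_with_drops, and SAMPLE_OUTPUT are hypothetical helpers, not part of any Cisco tool.

```python
# Rows copied from the punt-policer output shown in this document.
SAMPLE_OUTPUT = """\
2 IPv4 Options 874 655 0 0 0 0 874 655 Off Off
3 Layer2 control and legacy 8738 2185 0 0 0 0 8738 2185 Off Off
4 PPP Control 437 1000 0 0 0 0 437 1000 Off Off
"""


def parse_policer_row(row):
    """Parse one data row; return None for headers, separators, or unknown lines."""
    tokens = row.split()
    # Expected: cause + at least one description word + 8 counters + 2 alert flags.
    if len(tokens) < 12 or not tokens[0].isdigit():
        return None
    counters = tokens[-10:-2]
    if not all(token.isdigit() for token in counters):
        return None
    values = [int(token) for token in counters]
    return {
        "cause": int(tokens[0]),
        "description": " ".join(tokens[1:-10]),
        "conform": values[2] + values[3],  # Normal + High conform packets
        "dropped": values[4] + values[5],  # Normal + High dropped packets
    }


def causes_with_drops(output):
    """Return the parsed rows whose dropped packet count is greater than zero."""
    rows = (parse_policer_row(line) for line in output.splitlines())
    return [row for row in rows if row is not None and row["dropped"] > 0]


if __name__ == "__main__":
    print(causes_with_drops(SAMPLE_OUTPUT))
```

None of the three rows shown in this document has drops, so the sample yields an empty list. On real output, any row returned is a candidate for closer inspection.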
The show platform software infrastructure punt command displays counter data about the punt causes.
------------------ show platform software infrastructure punt ------------------
LSMPI interface internal stats:
enabled=0, disabled=0, throttled=0, unthrottled=0, state is ready
Input Buffers = 51181083
Output Buffers = 51150283
—— snip : output omitted for brevity ——
EPC CP RX Pkt cleansed 0
Punt cause out of range 0
IOSXE-RP Punt packet causes:
3504959 ARP request or response packets
27 Incomplete adjacency packets
—— snip : output omitted for brevity ——
FOR_US Control IPv4 protcol stats:
2369262 TCP packets
FOR_US Control IPv6 protcol stats:
6057 ICMPV6 packets
Packet histogram(500 bytes/bin), avg size in 119, out 95:
Pak-Size In-Count Out-Count
0+: 51108211 51144723
500+: 22069 2632
1000+: 2172 0
1500+: 3170 0
This data helps you understand what impacts the punt keepalive path.
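One way to read the punt cause counters is to rank them, so that the heaviest contributors appear first. The next block is an illustrative sketch only. It assumes that the counters under the "IOSXE-RP Punt packet causes:" header are printed one per line as a count followed by a description, as in the output above. The helper names parse_punt_causes and top_causes are hypothetical.

```python
import re

# One counter line: a packet count followed by a free-text description.
COUNTER_RE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s*$")

# Lines copied from the output shown in this document.
SAMPLE_OUTPUT = """\
IOSXE-RP Punt packet causes:
3504959 ARP request or response packets
27 Incomplete adjacency packets
FOR_US Control IPv4 protcol stats:
2369262 TCP packets
"""


def parse_punt_causes(output):
    """Collect the 'count description' lines listed under the punt causes header."""
    causes = {}
    in_section = False
    for line in output.splitlines():
        if line.strip() == "IOSXE-RP Punt packet causes:":
            in_section = True
            continue
        if not in_section:
            continue
        match = COUNTER_RE.match(line)
        if match is None:
            break  # The first non-counter line ends the section.
        causes[match.group(2)] = int(match.group(1))
    return causes


def top_causes(causes, limit=3):
    """Return up to 'limit' (description, count) pairs, largest count first."""
    ranked = sorted(causes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    for description, count in top_causes(parse_punt_causes(SAMPLE_OUTPUT)):
        print(count, description)
```

The ranking does not say whether a cause is harmful. It only shows where most of the punted traffic comes from, which narrows down what to check next.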
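Embedded Event Manager (EEM) for Data Collection
If punt_debug.log does not provide enough data to diagnose the problem, you can use an EEM script to collect more data points at the time the problem occurs.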
event manager applet punt_script authorization bypass
event syslog pattern "IOSXE_INFRA-4-NO_PUNT_KEEPALIVE" maxrun 1000
action 0.0 cli command "enable"
action 0.1 set i "0"
action 0.2 cli command "test platform software punt-keepalive ignore-fault"
action 0.3 while $i lt 10
action 0.4 syslog msg "iteration $i"
action 0.9 cli command "show clock | append bootflash:qfp_lsmpi.txt"
action 1.0 cli command "show platform software infrastructure lsmpi | append bootflash:qfp_lsmpi.txt"
action 1.1 cli command "show platform software infrastructure lsmpi driver | append bootflash:qfp_lsmpi.txt"
action 1.2 cli command "show platform software infrastructure lsmpi driver 0 | append bootflash:qfp_lsmpi.txt"
action 1.3 cli command "show platform software infrastructure lsmpi bufusage | append bootflash:qfp_lsmpi.txt"
action 1.4 cli command "show platform software infrastructure lsmpi bufusage 0 | append bootflash:qfp_lsmpi.txt"
action 1.5 cli command "show platform software infrastructure punt-keepalive | append bootflash:qfp_lsmpi.txt"
action 1.6 cli command "show platform software infrastructure punt | append bootflash:qfp_lsmpi.txt"
action 1.7 cli command "show platform software punt-policer | append bootflash:qfp_lsmpi.txt"
action 1.8 cli command "show platform hardware qfp active infrastructure punt stat type per-cause | append bootflash:qfp_lsmpi.txt"
action 1.9 cli command "show platform hardware qfp active infrastructure punt statistics type punt-drop | append bootflash:qfp_lsmpi.txt"
action 1.a cli command "show platform hardware qfp active infrastructure punt statistics type inject-drop | append bootflash:qfp_lsmpi.txt"
action 1.b cli command "show platform hardware qfp active infrastructure bqs queue output default interface-string internal0/0/rp:0 hier detail | append bootflash:qfp_lsmpi.txt"
action 1.c cli command "show platform hardware qfp active statistics drop | append bootflash:qfp_lsmpi.txt"
action 1.d cli command "show platform hardware qfp active datapath utilization | append bootflash:qfp_lsmpi.txt"
action 1.e cli command "show platform hardware qfp active datapath infrastructure sw-hqf | append bootflash:qfp_lsmpi.txt"
action 1.f cli command "show platform hardware qfp active datapath infrastructure sw-distrib | append bootflash:qfp_lsmpi.txt"
action 1.g cli command "show platform hardware qfp active datapath infrastructure sw-pktmem | append bootflash:qfp_lsmpi.txt"
action 1.h cli command "show platform software status control-processor brief | append bootflash:qfp_lsmpi.txt"
action 2.0 increment i
action 2.1 wait 3
action 2.4 end
action 3.0 syslog msg "End of data collection. Please transfer the file at bootflash:qfp_lsmpi.txt"
action 5.0 cli command "debug platform hardware qfp active datapath crashdump"
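Note: The commands contained in the script vary with the platform on which the script is configured.
This script lets you understand the state of the LSMPI, the resources, and the punt path at the time of the problem.
The EEM script includes the debug platform hardware qfp active datapath crashdump command, which generates the QFP core dump that the developer team and TAC require.
Note: If you open a case with Cisco TAC, provide the core file generated by the script.
If a packet trace is needed, this amendment can be added to the script:
First, set up the packet trace configuration, which can be done from within the EEM script: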
debug platform packet-trace packet 8192 fia-trace circular
debug platform condition both
debug platform packet-trace copy packet both L2
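Then, start and stop the trace with these actions in the EEM script:
action 6.2 cli command "debug platform condition start"
action 6.3 wait 8
action 6.4 cli command "debug platform condition stop"
Then, dump the data to a separate file with these commands:
action 6.5 cli command "show platform packet-trace statistics | append bootflash:traceAll.txt"
action 6.6 cli command "show platform packet-trace summary | append bootflash:traceAll.txt"
action 6.7 cli command "show platform packet-trace packet all decode | append bootflash:traceAll.txt"
This packet trace action logic is added right after the end statement of the while loop in the EEM script.
This script lets you understand what type of packets cause the problem.
The packet trace feature is documented in Troubleshoot with the IOS XE Datapath Packet Trace Feature.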
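Example Case
A CSR8000v reboots constantly.
After the system report is extracted, you can observe crashdump and IOSd core files that show punt keepalive related functions in the stack trace.
The crashinfo file is a plain-text file, and in it you can see these symptoms: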
Jan 15 14:29:41.756 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 160 seconds
Jan 15 14:30:01.761 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 180 seconds
Jan 15 14:30:21.766 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 200 seconds
Jan 15 14:30:41.776 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 220 seconds
Jan 15 14:31:01.780 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 240 seconds
Jan 15 14:31:41.789 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 280 seconds
Jan 15 14:32:01.791 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 300 seconds
Jan 15 14:32:01.791 AWST: %IOSXE_INFRA-2-FATAL_NO_PUNT_KEEPALIVE: Keepalive not received for 300 seconds resetting
%Software-forced reload
Exception to IOS Thread:
Frame pointer 0x7F0AE0EE29A8, PC = 0x7F0B342C16D2
UNIX-EXT-SIGNAL: Aborted(6), Process = PuntInject Keepalive Process
-Traceback= 1#7b5996c3
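The affected process is the PuntInject Keepalive Process.
When the keepalive timer reaches the 300-second threshold, the system triggers the abort signal.
The punt_debug.log reveals some transmit failures in the output of the show platform software infrastructure lsmpi driver command: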
Reason for TX drops (sticky):
Bad packet len : 0
Bad buf len : 0
Bad ifindex : 0
No device : 0
No skbuff : 0
Device xmit fail : 82541 >>>>>>>>>>>>>>>>>>>>> Tx failure
This is a generic failure.
This counter increases across the multiple samples captured in the file.
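Because punt_debug.log repeats the same command several times, a counter that keeps growing can be confirmed by comparing its value across the samples. The next block sketches that comparison. The helper names are hypothetical, and the two earlier sample values in the usage example are invented for illustration: only the value 82541 appears in this document.

```python
import re

# Matches the sticky TX drop counter in the lsmpi driver output.
XMIT_FAIL_RE = re.compile(r"Device xmit fail\s*:\s*(\d+)")


def xmit_fail_samples(log_text):
    """Return every 'Device xmit fail' value found in the text, in file order."""
    return [int(value) for value in XMIT_FAIL_RE.findall(log_text)]


def deltas(values):
    """Return the change between each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def is_increasing(values):
    """True when there are at least two samples and each one is larger than the last."""
    return len(values) >= 2 and all(delta > 0 for delta in deltas(values))


if __name__ == "__main__":
    # Hypothetical log: the first two values are made up, the last is from this document.
    log_text = (
        "Device xmit fail : 80000\n"
        "Device xmit fail : 81200\n"
        "Device xmit fail : 82541 >>>>>>>>>>>>>>>>>>>>> Tx failure\n"
    )
    values = xmit_fail_samples(log_text)
    print(values, deltas(values), is_increasing(values))
```

A counter that is nonzero but flat across samples points to an older event, while one that grows between samples indicates that the failure is still in progress.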
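The EEM script was applied in order to collect more data about resources, the punt data path, and other infrastructure-related commands.
A check of the LSMPI punt traffic counters shows that EIGRP control plane packets stand out. These are the packets identified as For-us packets: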
17660574 For-us data packets
543616 RP<->QFP keepalive packets
1004 Glean adjacency packets
3260636 BFD control packets
122523839 For-us control packets<<<<
FOR_US Control IPv4 protcol stats:
153551 TCP packets
2663105 GRE packets
104394559 EIGRP packets<<<<
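To put the flagged counters in proportion, the EIGRP count can be expressed as a share of the For-us control packets. The next block is a small worked calculation with the figures from the output above. It assumes that the For-us control counter is the appropriate denominator for the IPv4 protocol statistics, and the function names are hypothetical.

```python
# Figures copied from the counters shown in this document.
FOR_US_CONTROL_PACKETS = 122523839
IPV4_PROTOCOL_STATS = {"TCP": 153551, "GRE": 2663105, "EIGRP": 104394559}


def share_of_total(count, total):
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


def dominant_protocol(stats):
    """Return the protocol name with the highest packet count."""
    return max(stats, key=stats.get)


if __name__ == "__main__":
    name = dominant_protocol(IPV4_PROTOCOL_STATS)
    share = share_of_total(IPV4_PROTOCOL_STATS[name], FOR_US_CONTROL_PACKETS)
    print(f"{name}: {share:.1f}% of For-us control packets")
```

With these figures, EIGRP accounts for roughly 85 percent of the For-us control packets, which is why it stands out in this case.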
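Later, it was found that the hypervisor was oversubscribed, which impacted the underlying compute resources.
The CSR8000v was deployed on another hypervisor, which helped to mitigate the problem.
Improvements
Starting with Cisco IOS XE release 17.15, automatic QFP core file generation was enhanced through Cisco bug ID CSCwf85505.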