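Introduction
This document describes how to troubleshoot punt keepalive failures.
Prerequisites
Requirements
Basic knowledge of Cisco IOS® XE.
Components Used
This document is based on Cisco IOS XE routers, such as the CSR8000v, ASR1000, and ISR4000 series.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Background Information
In Cisco IOS XE based systems, the punt path is an internal data path. It is the path over which the control plane and the data plane communicate.
This internal path carries control plane packets that are destined for the router itself.
When this path fails, this type of error is seen in the log: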
%IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 60 seconds
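Keepalive messages monitor the health of the path between the QFP and the RP.
This path is critical for the operation of the system.
If these keepalives are not received for 5 minutes, a critical log such as this one is seen: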
%IOSXE_INFRA-2-FATAL_NO_PUNT_KEEPALIVE: Keepalive not received for 300 seconds resetting
The system resets in order to recover from this condition.
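When many logs must be reviewed, the escalation from the warning message to the fatal message can be checked with a short script. The following is a minimal sketch and not a Cisco tool: it assumes syslog lines in the format shown above, and the function name summarize_keepalive_logs is hypothetical.

```python
import re

# Matches both the warning (severity 4) and the fatal (severity 2) message.
KEEPALIVE_RE = re.compile(
    r"%IOSXE_INFRA-(?P<sev>\d)-(?P<name>(?:FATAL_)?NO_PUNT_KEEPALIVE): "
    r"Keepalive not received for (?P<secs>\d+) seconds"
)


def summarize_keepalive_logs(lines):
    """Return (longest reported outage in seconds, True if the fatal message was logged)."""
    longest = 0
    fatal = False
    for line in lines:
        match = KEEPALIVE_RE.search(line)
        if match is None:
            continue  # unrelated log line
        longest = max(longest, int(match.group("secs")))
        if match.group("name").startswith("FATAL_"):
            fatal = True
    return longest, fatal


# The two messages quoted above, used as sample input.
sample = [
    "%IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 60 seconds",
    "%IOSXE_INFRA-2-FATAL_NO_PUNT_KEEPALIVE: Keepalive not received for 300 seconds resetting",
]
longest, fatal = summarize_keepalive_logs(sample)
print(f"longest outage: {longest}s, fatal reset logged: {fatal}")
```

A result with a long outage but no fatal message suggests that the path recovered before the 300-second threshold.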
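Punt Debug Log File
At the time of the punt keepalive failure events and the resulting reset, the system creates a file named punt_debug.log, which collects relevant data to understand the behavior at the time of the issue.
Note: Ensure that the system runs the latest Cisco IOS XE software release so that the punt_debug.log file is generated.
This file contains several executions of these commands, so that the counters can be compared over time: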
show platform software infra punt-keepalive
show platform software infra lsmpi
show platform software infrastructure lsmpi driver
show platform software infra lsmpi bufusage
show platform software punt-policer
show platform software status control-processor brief
show process cpu platform sorted
show platform software infrastructure punt
show platform hardware qfp active statistics drop
show platform hardware qfp active infra punt statistics type per-cause
show platform hardware qfp active infrastructure bqs queue output default all
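Note: In punt_debug.log, focus on error indicators and on large packet counts that can cause the issue.
Linux Shared Memory Punt Interface (LSMPI)
This component transfers packets and messages from the forwarding processor to the route processor.
Punt Policer
The punt policer is a control plane protection mechanism that allows the system to protect and police control plane packets.
With the command show platform software punt-policer, you can see the conformed packets and the packets dropped because of this policer.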
----------------- show platform software punt-policer ------------------
Per Punt-Cause Policer Configuration and Packet Counters
Punt                                   Config Rate(pps)     Conform Packets     Dropped Packets  Config Burst(pkts)        Config Alert
Cause  Description                     Normal      High    Normal      High    Normal      High    Normal      High    Normal      High
---------------------------------------------------------------------------------------------------------------------------------------
2      IPv4 Options                       874       655         0         0         0         0       874       655       Off       Off
3      Layer2 control and legacy         8738      2185         0         0         0         0      8738      2185       Off       Off
4      PPP Control                        437      1000         0         0         0         0       437      1000       Off       Off
—— snip : output omitted for brevity ——
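The policer output can be long, so it can help to filter for punt causes with a nonzero drop counter. The following is a minimal sketch that assumes each data row ends with the ten columns shown above (rate, conform, dropped, burst, and alert, each as a Normal and High pair). The function names are hypothetical, and the row for cause 99 is made-up sample data that only demonstrates a drop.

```python
def parse_policer_row(line):
    """Split one data row into cause, description, and the dropped counters.

    Returns None for headers, separators, and any other non-data line.
    """
    tokens = line.split()
    if len(tokens) < 12 or not tokens[0].isdigit():
        return None
    cols = tokens[-10:]  # the ten trailing columns, Normal/High pairs
    return {
        "cause": int(tokens[0]),
        "description": " ".join(tokens[1:-10]),
        "dropped_normal": int(cols[4]),
        "dropped_high": int(cols[5]),
    }


def causes_with_drops(lines):
    """Return the parsed rows whose Normal or High dropped counter is nonzero."""
    rows = (parse_policer_row(line) for line in lines)
    return [r for r in rows if r and (r["dropped_normal"] or r["dropped_high"])]


sample = [
    "2 IPv4 Options 874 655 0 0 0 0 874 655 Off Off",
    "3 Layer2 control and legacy 8738 2185 0 0 0 0 8738 2185 Off Off",
    # Hypothetical row, not taken from a real device:
    "99 Example cause 100 100 500 0 42 0 100 100 Off Off",
]
for row in causes_with_drops(sample):
    print(row["cause"], row["description"], row["dropped_normal"], row["dropped_high"])
```

Splitting from the right keeps descriptions that contain spaces intact, because only the last ten tokens are treated as counters.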
The command show platform software infrastructure punt displays counter data about the punt causes.
------------------ show platform software infrastructure punt ------------------
LSMPI interface internal stats:
enabled=0, disabled=0, throttled=0, unthrottled=0, state is ready
Input Buffers = 51181083
Output Buffers = 51150283
—— snip : output omitted for brevity ——
EPC CP RX Pkt cleansed 0
Punt cause out of range 0
IOSXE-RP Punt packet causes:
3504959 ARP request or response packets
27 Incomplete adjacency packets
—— snip : output omitted for brevity ——
FOR_US Control IPv4 protcol stats:
2369262 TCP packets
FOR_US Control IPv6 protcol stats:
6057 ICMPV6 packets
Packet histogram(500 bytes/bin), avg size in 119, out 95:
Pak-Size In-Count Out-Count
0+: 51108211 51144723
500+: 22069 2632
1000+: 2172 0
1500+: 3170 0
This data helps to understand what can affect the punt keepalive path.
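To see quickly which punt causes dominate, the counter lines can be ranked by count. The following is a minimal sketch that assumes the input contains counter lines of the form "count description", as in the punt cause and FOR_US sections of the output above. The function name top_punt_causes is hypothetical.

```python
import re

# A counter line starts with a number followed by a free-text description.
COUNTER_RE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s*$")


def top_punt_causes(lines, limit=5):
    """Return up to `limit` (count, description) pairs, largest count first."""
    counters = []
    for line in lines:
        match = COUNTER_RE.match(line)
        if match:
            counters.append((int(match.group(1)), match.group(2)))
    return sorted(counters, reverse=True)[:limit]


# Counter lines copied from the output above.
sample = [
    "IOSXE-RP Punt packet causes:",
    "3504959 ARP request or response packets",
    "27 Incomplete adjacency packets",
    "2369262 TCP packets",
    "6057 ICMPV6 packets",
]
for count, description in top_punt_causes(sample):
    print(f"{count:>12}  {description}")
```

Section headers and histogram lines such as "0+: ..." do not match the pattern and are skipped.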
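Embedded Event Manager (EEM) for Data Collection
If punt_debug.log does not provide enough data to diagnose the issue, an EEM script can be used to capture more data points at the time of the issue.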
event manager applet punt_script authorization bypass
event syslog pattern "IOSXE_INFRA-4-NO_PUNT_KEEPALIVE" maxrun 1000
action 0.0 cli command "enable"
action 0.1 set i "0"
action 0.2 cli command "test platform software punt-keepalive ignore-fault"
action 0.3 while $i lt 10
action 0.4 syslog msg "iteration $i"
action 0.9 cli command "show clock | append bootflash:qfp_lsmpi.txt"
action 1.0 cli command "show platform software infrastructure lsmpi | append bootflash:qfp_lsmpi.txt"
action 1.1 cli command "show platform software infrastructure lsmpi driver | append bootflash:qfp_lsmpi.txt"
action 1.2 cli command "show platform software infrastructure lsmpi driver 0 | append bootflash:qfp_lsmpi.txt"
action 1.3 cli command "show platform software infrastructure lsmpi bufusage | append bootflash:qfp_lsmpi.txt"
action 1.4 cli command "show platform software infrastructure lsmpi bufusage 0 | append bootflash:qfp_lsmpi.txt"
action 1.5 cli command "show platform software infrastructure punt-keepalive | append bootflash:qfp_lsmpi.txt"
action 1.6 cli command "show platform software infrastructure punt | append bootflash:qfp_lsmpi.txt"
action 1.7 cli command "show platform software punt-policer | append bootflash:qfp_lsmpi.txt"
action 1.8 cli command "show platform hardware qfp active infrastructure punt stat type per-cause | append bootflash:qfp_lsmpi.txt"
action 1.9 cli command "show platform hardware qfp active infrastructure punt statistics type punt-drop | append bootflash:qfp_lsmpi.txt"
action 1.a cli command "show platform hardware qfp active infrastructure punt statistics type inject-drop | append bootflash:qfp_lsmpi.txt"
action 1.b cli command "show platform hardware qfp active infrastructure bqs queue output default interface-string internal0/0/rp:0 hier detail | append bootflash:qfp_lsmpi.txt"
action 1.c cli command "show platform hardware qfp active statistics drop | append bootflash:qfp_lsmpi.txt"
action 1.d cli command "show platform hardware qfp active datapath utilization | append bootflash:qfp_lsmpi.txt"
action 1.e cli command "show platform hardware qfp active datapath infrastructure sw-hqf | append bootflash:qfp_lsmpi.txt"
action 1.f cli command "show platform hardware qfp active datapath infrastructure sw-distrib | append bootflash:qfp_lsmpi.txt"
action 1.g cli command "show platform hardware qfp active datapath infrastructure sw-pktmem | append bootflash:qfp_lsmpi.txt"
action 1.h cli command "show platform software status control-processor brief | append bootflash:qfp_lsmpi.txt"
action 2.0 increment i
action 2.1 wait 3
action 2.4 end
action 3.0 syslog msg "End of data collection. Please transfer the file at bootflash:qfp_lsmpi.txt"
action 5.0 cli command "debug platform hardware qfp active datapath crashdump"
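This script lets you understand the LSMPI, resource, and punt status during the issue.
The EEM script includes the command debug platform hardware qfp active datapath crashdump, which generates the QFP core dump that the developer team and TAC require.
Note: If you open a case with Cisco TAC, provide the core file generated by the script.
If packet traces are needed, these modifications can be added to the script.
First, set up the packet trace configuration, which can be done outside the EEM script: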
debug platform packet-trace packet 8192 fia-trace circular
debug platform condition both
debug platform packet-trace copy packet both L2
Then, start and stop the trace with these actions in the EEM script:
action 6.2 cli command "debug platform condition start"
action 6.3 wait 8
action 6.4 cli command "debug platform condition stop"
Then, dump the data into a separate file with these commands:
action 6.5 cli command "show platform packet-trace statistics | append bootflash:traceAll.txt"
action 6.6 cli command "show platform packet-trace summary | append bootflash:traceAll.txt"
action 6.7 cli command "show platform packet-trace packet all decode | append bootflash:traceAll.txt"
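This packet trace action logic is added after the end statement of the while loop in the EEM script.
This addition lets you understand what type of packets cause the issue.
The packet trace feature is documented in Troubleshoot with the IOS XE Datapath Packet Trace Feature.
Example Case
A CSR8000v restarts constantly.
After the system report is extracted, a crashdump and an IOSd core file can be observed, and the stack trace points to functions related to the punt keepalive.
The crashinfo file, however, is in clear text and shows these symptoms: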
Jan 15 14:29:41.756 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 160 seconds
Jan 15 14:30:01.761 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 180 seconds
Jan 15 14:30:21.766 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 200 seconds
Jan 15 14:30:41.776 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 220 seconds
Jan 15 14:31:01.780 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 240 seconds
Jan 15 14:31:41.789 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 280 seconds
Jan 15 14:32:01.791 AWST: %IOSXE_INFRA-4-NO_PUNT_KEEPALIVE: Keepalive not received for 300 seconds
Jan 15 14:32:01.791 AWST: %IOSXE_INFRA-2-FATAL_NO_PUNT_KEEPALIVE: Keepalive not received for 300 seconds resetting
%Software-forced reload
Exception to IOS Thread:
Frame pointer 0x7F0AE0EE29A8, PC = 0x7F0B342C16D2
UNIX-EXT-SIGNAL: Aborted(6), Process = PuntInject Keepalive Process
-Traceback= 1#7b5996c3
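The affected process is the PuntInject Keepalive Process.
When the keepalive outage reaches the 300-second threshold, the system triggers an abort signal.
The punt_debug.log file reveals transmit failures in the output of the show platform software infrastructure lsmpi driver command: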
Reason for TX drops (sticky):
Bad packet len : 0
Bad buf len : 0
Bad ifindex : 0
No device : 0
No skbuff : 0
Device xmit fail : 82541 >>>>>>>>>>>>>>>>>>>>> Tx failure
This is a generic failure.
This counter increases across the multiple samples taken in the file.
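Because punt_debug.log repeats each command several times, the growth of a sticky counter can be confirmed by extracting every sample of it. The following is a minimal sketch that assumes the counter line format shown above. The function names are hypothetical, and the first sample value (81200) is made up for illustration; only 82541 comes from the output above.

```python
import re

XMIT_FAIL_RE = re.compile(r"Device xmit fail\s*:\s*(\d+)")


def xmit_fail_samples(text):
    """Return every 'Device xmit fail' value found in the text, in file order."""
    return [int(value) for value in XMIT_FAIL_RE.findall(text)]


def consecutive_deltas(samples):
    """Return the increase between each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


sample_text = """
Device xmit fail : 81200
Device xmit fail : 82541 >>>>>>>>>>>>>>>>>>>>> Tx failure
"""
samples = xmit_fail_samples(sample_text)
print("samples:", samples)
print("deltas:", consecutive_deltas(samples))
```

Positive deltas indicate that transmit failures were still in progress while the file was collected, rather than being left over from an earlier event.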
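The EEM script is provided in order to obtain more data about resources, the punt data path, and other infrastructure related commands.
When the LSMPI punt traffic counters are checked, the EIGRP control plane packets stand out with a very high count. These packets are classified as for-us packets: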
17660574 For-us data packets
543616 RP<->QFP keepalive packets
1004 Glean adjacency packets
3260636 BFD control packets
122523839 For-us control packets<<<<
FOR_US Control IPv4 protcol stats:
153551 TCP packets
2663105 GRE packets
104394559 EIGRP packets<<<<
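The weight of EIGRP can be quantified from the counters above. The following is a minimal sketch that assumes the IPv4 protocol counters are a breakdown of the For-us control packets counter, which the output suggests but does not state. The function name protocol_shares is hypothetical.

```python
def protocol_shares(total, per_protocol):
    """Return (protocol, fraction of total) pairs, largest fraction first."""
    shares = [(name, count / total) for name, count in per_protocol.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Values copied from the counters above.
for_us_control = 122523839
ipv4_protocols = {"TCP": 153551, "GRE": 2663105, "EIGRP": 104394559}

for name, fraction in protocol_shares(for_us_control, ipv4_protocols):
    print(f"{name}: {fraction:.1%}")
```

Under that assumption, EIGRP accounts for roughly 85 percent of the for-us control packets, which is why it stands out in this case.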
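Later, it was found that the hypervisor was oversubscribed, which impacted the underlying compute resources.
The CSR8000v was deployed on another hypervisor, which helped to mitigate the issue.
Enhancement
An enhancement that automatically generates the QFP core file was introduced through Cisco bug ID CSCwf85505, starting with Cisco IOS XE release 17.15.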