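Introduction
This document describes how to troubleshoot the Fabric and Storage Card (FSC) when the "ThreshFabricEGQDiscards" SNMP trap is raised.
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
Components Used
This document is not restricted to specific software and hardware versions.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.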
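Problem
The "ThreshFabricEGQDiscards" error is seen when a Single Event Upset (SEU) occurs on the Fabric Edge (FE) chipset of one of the FSC cards in the ASR5500 chassis. Because of this bit flip in the FE table, the FE chip starts to corrupt packets (cells) in the fabric. This causes egress queue discards, which in turn cause heartbeat failures between cards.
An example of this problem can be seen with the Command Line Interface (CLI) command show snmp trap history verbose.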
Sat Jan 02 03:59:30 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 9 device 2 threshold 50 measured value 2430 interval 30
Sat Jan 02 03:59:30 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 9 device 1 threshold 50 measured value 2096 interval 30
Sat Jan 02 03:59:40 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 5 device 4 threshold 50 measured value 481 interval 30
Sat Jan 02 03:59:40 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 4 device 2 threshold 50 measured value 3761 interval 30
Sat Jan 02 03:59:40 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 4 device 1 threshold 50 measured value 3660 interval 30
Sat Jan 02 03:59:40 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 5 device 2 threshold 50 measured value 173 interval 30
Sat Jan 02 03:59:40 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 5 device 1 threshold 50 measured value 133 interval 30
Sat Jan 02 03:59:42 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 8 device 2 threshold 50 measured value 2977 interval 30
Sat Jan 02 03:59:42 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 8 device 1 threshold 50 measured value 4310 interval 30
Sat Jan 02 03:59:44 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 3 device 1 threshold 50 measured value 4499 interval 30
Sat Jan 02 03:59:44 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 3 device 2 threshold 50 measured value 4091 interval 30
Sat Jan 02 03:59:45 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 10 device 1 threshold 50 measured value 2796 interval 30
Sat Jan 02 03:59:45 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 10 device 2 threshold 50 measured value 5418 interval 30
Sat Jan 02 03:59:47 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 1 device 2 threshold 50 measured value 4747 interval 30
Sat Jan 02 03:59:47 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 1 device 1 threshold 50 measured value 5243 interval 30
Sat Jan 02 03:59:49 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 7 device 2 threshold 50 measured value 4644 interval 30
Sat Jan 02 03:59:49 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 7 device 1 threshold 50 measured value 5017 interval 30
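When many traps arrive within a few seconds, it can help to summarize them per slot and device. The script below is a minimal sketch that assumes the trap lines have exactly the format shown above; the function name parse_egq_traps and the sample variable are hypothetical and are not part of StarOs.

```python
import re
from collections import defaultdict

# Matches ThreshFabricEGQDiscards trap lines in the format shown above.
TRAP_RE = re.compile(
    r"\(ThreshFabricEGQDiscards\) slot (\d+) device (\d+) "
    r"threshold (\d+) measured value (\d+) interval (\d+)"
)


def parse_egq_traps(text):
    """Return {(slot, device): [measured values]} for traps above their threshold."""
    hits = defaultdict(list)
    for line in text.splitlines():
        m = TRAP_RE.search(line)
        if m is None:
            continue  # not an EGQ discard trap line
        slot, device, threshold, measured, _interval = map(int, m.groups())
        if measured > threshold:
            hits[(slot, device)].append(measured)
    return dict(hits)


# Two lines copied from the trap history above.
sample = """\
Sat Jan 02 03:59:30 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 9 device 2 threshold 50 measured value 2430 interval 30
Sat Jan 02 03:59:40 2016 Internal trap notification 523 (ThreshFabricEGQDiscards) slot 5 device 4 threshold 50 measured value 481 interval 30
"""

if __name__ == "__main__":
    for (slot, device), values in sorted(parse_egq_traps(sample).items()):
        print(f"slot {slot} device {device}: max measured value {max(values)}")
```

Traps that appear on many slots at the same time, as in the history above, are consistent with a fault in the shared fabric rather than on a single card.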
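This line is seen under multiple card CPU consoles:
Note: The command debug console card is a hidden/test command. Its output is collected for all cards on the ASR5500 each time the show support details command is run on the StarOs node.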
******** debug console card 1 cpu 0 tail 10000 only *******
Saturday January 02 05:45:38 EST 2016
[...]
2016-Jan-02+03:59:47.479 card 1-cpu0: afio [1/0/2701] [2862193.674] afio/afio_petrab_egress.c:121: #1: petrab=1=1/1, PetraB EGQ Egress drop threshold exceeded, drop count=5243, interval=30 secs, threshold=50
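The console message carries the same drop count, interval, and threshold that the SNMP trap reports. As a rough sketch, and assuming the message keeps exactly the layout shown above, the fields can be extracted like this; parse_egq_console_line and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

# Matches the afio "drop threshold exceeded" console line in the format shown above.
CONSOLE_RE = re.compile(
    r"card (\d+)-cpu(\d+): .*petrab=([^,]+), "
    r"PetraB EGQ Egress drop threshold exceeded, "
    r"drop count=(\d+), interval=(\d+) secs, threshold=(\d+)"
)


def parse_egq_console_line(line):
    """Return the fields of one console line as a dict, or None if it does not match."""
    m = CONSOLE_RE.search(line)
    if m is None:
        return None
    card, cpu, petrab, drops, interval, threshold = m.groups()
    return {
        "card": int(card),
        "cpu": int(cpu),
        "petrab": petrab,  # device identifier as printed in the log, e.g. "1=1/1"
        "drop_count": int(drops),
        "interval_secs": int(interval),
        "threshold": int(threshold),
    }


# The console line quoted above.
sample_line = (
    "2016-Jan-02+03:59:47.479 card 1-cpu0: afio [1/0/2701] [2862193.674] "
    "afio/afio_petrab_egress.c:121: #1: petrab=1=1/1, PetraB EGQ Egress drop "
    "threshold exceeded, drop count=5243, interval=30 secs, threshold=50"
)

if __name__ == "__main__":
    print(parse_egq_console_line(sample_line))
```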
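Troubleshoot
Check whether the egress discards are incrementing.
Note: If the fabric errors keep incrementing and you run the StarOs node on Release 19.0 or later, proceed to the Solution section of this document.
Note: If the fabric errors keep incrementing and you run the StarOs node on a release earlier than 19.0, open a service request with TAC.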
Step 1. Enter test mode. This is the command that enables it on the StarOs node:
cli test-commands [encrypted] password password
Step 2. Check the fabric health.
show fabric health | grep -i -E "^Petra-B|EGQ"
Example output when the problem is not present:
[local]#show fabric health | grep -i -E "^Petra-B|EGQ"
Petra-B 1=1/1
Petra-B 2=1/2
Petra-B 3=2/1
Petra-B 4=2/2
Petra-B 5=3/1
Petra-B 6=3/2
[...]
Example output when the number of packets discarded by the EGQ increases:
[local]#show fabric health | grep -i -E "^Petra-B|EGQ"
Petra-B 1=1/1
EGQ.RqpDiscardPacketCounter 1143278
EGQ.EhpDiscardPacketCounter 1143278
EGQ.PqpDiscardUnicastPacketCounter 1143278
Petra-B 2=1/2
EGQ.RqpDiscardPacketCounter 1068491
EGQ.EhpDiscardPacketCounter 1068491
EGQ.PqpDiscardUnicastPacketCounter 1068491
[local]#show fabric health | grep -i -E "^Petra-B|EGQ"
Petra-B 1=1/1
EGQ.RqpDiscardPacketCounter 1346022 <<<
EGQ.EhpDiscardPacketCounter 1346022 <<<
EGQ.PqpDiscardUnicastPacketCounter 1346022 <<<
Petra-B 2=1/2
EGQ.RqpDiscardPacketCounter 1271360 <<<
EGQ.EhpDiscardPacketCounter 1271360 <<<
EGQ.PqpDiscardUnicastPacketCounter 1271360 <<<
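The comparison between the two outputs above can also be done with a script. This is a minimal sketch that assumes the filtered output has the layout shown above (a Petra-B line followed by its EGQ counters); parse_fabric_health and counter_deltas are hypothetical helper names, not StarOs tools.

```python
import re

DEVICE_RE = re.compile(r"^Petra-B\s+(\S+)")
COUNTER_RE = re.compile(r"^(EGQ\.\w+)\s+(\d+)")


def parse_fabric_health(text):
    """Return {device: {counter: value}} from the filtered output shown above."""
    result, device = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        m = DEVICE_RE.match(line)
        if m:
            device = m.group(1)
            result[device] = {}
            continue
        m = COUNTER_RE.match(line)
        if m and device is not None:
            result[device][m.group(1)] = int(m.group(2))
    return result


def counter_deltas(before, after):
    """Return {(device, counter): delta} for counters that grew between snapshots.

    A counter missing from the first snapshot is treated as starting at 0.
    """
    deltas = {}
    for device, counters in after.items():
        for name, value in counters.items():
            delta = value - before.get(device, {}).get(name, 0)
            if delta > 0:
                deltas[(device, name)] = delta
    return deltas


# Values copied from the two outputs above (first device only).
first = """\
Petra-B 1=1/1
EGQ.RqpDiscardPacketCounter 1143278
"""
second = """\
Petra-B 1=1/1
EGQ.RqpDiscardPacketCounter 1346022 <<<
"""

if __name__ == "__main__":
    growth = counter_deltas(parse_fabric_health(first), parse_fabric_health(second))
    for (device, name), delta in sorted(growth.items()):
        print(f"Petra-B {device} {name} grew by {delta}")
```

An empty result means that no counter increased between the two snapshots.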
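Solution
Automatic Recovery Mechanism
Behavior change type:
New CLI command that enables the FSC auto-recovery/reset procedure when excessive fabric egress discards are detected.
Release: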
19.0
Old behavior:
Manual recovery procedure to reset the FSC.
New behavior:
New CLI configuration commands (refer to the documentation):
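Use fabric fsc auto-recovery enable max-attempts <X> to enable this feature.
max-attempts is the number of times each FSC is reset. By default, the maximum number of attempts is unlimited.
Use fabric fsc auto-recovery disable to disable this feature.
show afctrl-auto-recovery displays details about FSC auto-recovery, including the devices that have not yet been reset, the reset count, the maximum number of attempts, the egress discard threshold status, and the FSC auto-recovery history.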
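Warning: Customer impact: The FSC FE devices are reset, and any packets in transit are lost.
Note: When an MIO switchover occurs, all values except the history are replicated.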