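This document describes how to identify the different User Plane (UP) reload scenarios from their failure symptoms in order to troubleshoot them.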
RCM - Redundancy Configuration Manager
SSD - Show Support Details
UPF/UP - User Plane Function
VPP - Vector Packet Processing
BFD - Bidirectional Forwarding Detection
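Methods to identify the symptoms of UP reload scenarios:
In a CUPS setup, UP reloads are a frequently encountered problem, and they require effective symptom identification followed by troubleshooting.
To begin, check the system uptime to determine exactly when the last reboot occurred. This information helps focus the analysis on the RCM logs that correspond to the reload event.
Use this command to check the system uptime: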
******** show system uptime *******
Friday July 22 09:28:14 IST 2022
System uptime: 0D 0H 6M
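Note: Verify whether the RCM and UP timestamps are in the same time zone. If they differ, convert between them when you correlate events. For example, if the UP time is in IST and the RCM time is in UTC, the RCM time is always 5 hours 30 minutes behind the UP time.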
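As a minimal sketch of that correlation, the Python snippet below converts a UP timestamp in IST to UTC so it can be compared with RCM timestamps. The function name `up_ist_to_utc`, the sample timestamp, and the year 2022 are illustrative assumptions; syslog-style timestamps carry no year, so one has to be supplied.

```python
from datetime import datetime, timedelta, timezone

# IST is UTC+05:30; a fixed offset avoids any dependency on a tz database.
IST = timezone(timedelta(hours=5, minutes=30), "IST")


def up_ist_to_utc(stamp: str, year: int) -> datetime:
    """Convert a syslog-style UP timestamp (IST, no year) to UTC.

    Assumes English month abbreviations (the default C locale).
    """
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return local.replace(tzinfo=IST).astimezone(timezone.utc)


# Hypothetical example: a UP syslog entry stamped in IST.
utc_time = up_ist_to_utc("Mar 21 09:10:37", 2022)
print(utc_time.strftime("%b %d %H:%M:%S"))  # prints: Mar 21 03:40:37
```

The converted value is the time to search for in RCM logs that are written in UTC.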
Verify whether any crash occurred during the reload. Use this command to check for crash events:
******** show crash list *******
Sunday January 23 12:12:14 IST 2022
=== ==================== ======== ========== =============== =======================
# Time Process Card/CPU/ SW HW_SER_NUM
PID VERSION VPO / Crash Card
=== ==================== ======== ========== =============== =======================
1 2022-Jan-14+13:16:40 sessmgr 01/0/11287 21.25.5 NA
2 2022-Jan-19+20:51:01 sessmgr 01/0/16142 21.25.5 NA
3 2022-Jan-22+15:51:55 vpp 01/0/07307 21.25.5 NA
4 2022-Jan-22+15:52:08 sessmgr 01/0/27011 21.25.5 NA
5 2022-Jan-22+16:07:43 sessmgr 01/0/13528 21.25.5 NA
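In this step, check whether any crash occurred, such as a vpp or sessmgr crash. If a vpp crash is detected, the UP reloads immediately because of the crash, which prompts the RCM to initiate a switchover to another UP.
A consistent sequence of sessmgr crashes can trigger a vpp crash, which in turn reloads the UP.
Whenever such crashes are encountered, make sure you collect the core files for vpp/sessmgr.
Note: For vpp, a minicore might be available instead of the full core file.
Action plan: Once the core file or minicore is obtained, the next step is to debug the core file to pinpoint the root cause of the crash.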
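The crash check can be scripted when the output is collected from many UPs. The sketch below is one possible way to parse saved `show crash list` output in Python and pick out vpp crashes. The regular expression is inferred from the sample rows in this document only, and the names `CRASH_ROW` and `parse_crash_list` are hypothetical.

```python
import re
from collections import Counter

# Row layout inferred from the sample output above (an assumption, not a spec):
# <#> <YYYY-Mon-DD+HH:MM:SS> <process> <card/cpu/pid> <sw version> ...
CRASH_ROW = re.compile(
    r"^\s*(?P<num>\d+)\s+"
    r"(?P<time>\d{4}-[A-Za-z]{3}-\d{2}\+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<process>\S+)\s+(?P<location>\S+)\s+(?P<version>\S+)"
)


def parse_crash_list(text):
    """Return one dict per crash row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = CRASH_ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


sample = """\
1 2022-Jan-14+13:16:40 sessmgr 01/0/11287 21.25.5 NA
2 2022-Jan-19+20:51:01 sessmgr 01/0/16142 21.25.5 NA
3 2022-Jan-22+15:51:55 vpp 01/0/07307 21.25.5 NA
4 2022-Jan-22+15:52:08 sessmgr 01/0/27011 21.25.5 NA
5 2022-Jan-22+16:07:43 sessmgr 01/0/13528 21.25.5 NA
"""

crashes = parse_crash_list(sample)
counts = Counter(row["process"] for row in crashes)
vpp_crashes = [row for row in crashes if row["process"] == "vpp"]

for row in vpp_crashes:
    print("vpp crash at", row["time"], "on", row["location"])
```

The time of any vpp crash found this way can then be compared with the system uptime and with the RCM logs for the same period.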
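The following errors, found in syslog, relate to BFD monitoring failures.
These errors occur when BFD flaps or packet loss happen between the RCM and the UP, particularly when the connection between them passes through ACI.
Essentially, a timer is configured to monitor BFD packets. If this timer expires for any of the reasons mentioned, a monitoring failure is triggered. This event prompts the RCM to initiate a switchover.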
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.823] [bfd 170500 error] [1/0/9345 <bfdlc:0> bfd_network.c:1798] [software internal system] <bfdctx:7> Session(1/-1260920720) DOWN control detection timer expired
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.856] [bfd 170500 error] [1/0/9345 <bfdlc:0> bfd_network.c:1798] [software internal system] <bfdctx:5> Session(2/1090521080) DOWN control detection timer expired
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.859] [srp 84220 error] [1/0/10026 <vpnmgr:7> pnmgr_rcm_bfd.c:704] [context: rcmctx, contextID: 7] [software internal system syslog] BFD down, closing TCP.
Jan 22 15:51:56 <NODENAME> evlogd: [local-60sec55.979] [srp 84220 error] [1/0/10026 <vpnmgr:7> pnmgr_rcm_bgp.c:428] [context: rcmctx, contextID: 7] [software internal system syslog] Cannot inform RCM about BGP monitor failure as TCP connection with RCM down.
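To resolve this, check the system thoroughly for any underlying problem that could cause BFD flaps. Once the problematic timestamp is identified, coordinate with the ACI team to investigate whether any flap or issue on their end corresponds to that timestamp.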
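To find the timestamps to hand over for that investigation, the BFD down events can be extracted from a saved syslog file. The following Python sketch matches only the message wording shown in the sample lines of this document; the pattern `BFD_DOWN` and the function `bfd_down_events` are hypothetical names, and other releases might word the message differently.

```python
import re

# Matches the "DOWN control detection timer expired" message seen in the
# sample syslog lines; the syslog prefix format is assumed from those lines.
BFD_DOWN = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"Session\((?P<session>[^)]+)\) DOWN control detection timer expired"
)


def bfd_down_events(syslog_text):
    """Return (timestamp, session) pairs for each BFD session-down entry."""
    events = []
    for line in syslog_text.splitlines():
        match = BFD_DOWN.match(line)
        if match:
            events.append((match["stamp"], match["session"]))
    return events


sample = """\
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.823] [bfd 170500 error] [1/0/9345 <bfdlc:0> bfd_network.c:1798] [software internal system] <bfdctx:7> Session(1/-1260920720) DOWN control detection timer expired
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.856] [bfd 170500 error] [1/0/9345 <bfdlc:0> bfd_network.c:1798] [software internal system] <bfdctx:5> Session(2/1090521080) DOWN control detection timer expired
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.859] [srp 84220 error] [1/0/10026 <vpnmgr:7> pnmgr_rcm_bfd.c:704] [context: rcmctx, contextID: 7] [software internal system syslog] BFD down, closing TCP.
"""

events = bfd_down_events(sample)
for stamp, session in events:
    print(stamp, "BFD session", session, "went down")
```

Remember that these timestamps are in the UP time zone and might need conversion before they are compared with ACI or RCM logs.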
BGP flaps or monitoring failures in the UP can trigger an RCM-initiated switchover. The characteristic errors are shown here.
Mar 21 09:10:37 <NODENAME> evlogd: [local-60sec37.482] [vpn 5572 info] [1/0/10038 <vpnmgr:7> pnmgr_rcm_bgp.c:392] [context: rcmctx, contextID: 7] [software internal system critical-info syslog] BGP monitor group 3 down.
Mar 21 09:10:37 <NODENAME> evlogd: [local-60sec37.482] [vpn 5572 info] [1/0/10038 <vpnmgr:7> pnmgr_rcm_bgp.c:392] [context: rcmctx, contextID: 7] [software internal system critical-info syslog] BGP monitor group 4 down.
Mar 21 09:10:37 <NODENAME> evlogd: [local-60sec37.482] [srp 84220 error] [1/0/10038 <vpnmgr:7> pnmgr_rcm_bgp.c:423] [context: rcmctx, contextID: 7] [software internal system syslog] Informed RCM about BGP monitor failure.
Possible causes of BGP flaps, and how to identify them. SNMP traps can show errors that indicate a BGP flap:
Wed Jan 18 10:30:03 2023 Internal trap notification 1289 (BGPPeerSessionIPv6Down) vpn upf-in ipaddr abcd:ab:cd:abc::def
Wed Jan 18 10:30:09 2023 Internal trap notification 1288 (BGPPeerSessionIPv6Up) vpn upf-in ipaddr abcd:ab:cd:abc::def
Wed Jan 18 10:30:19 2023 Internal trap notification 1289 (BGPPeerSessionIPv6Down) vpn upf-in ipaddr abcd:ab:cd:abc::def
Start by using the context ID to identify the context associated with the error that indicates BGP flaps. With the context known, you can determine the service involved and retrieve the corresponding IP details.
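When many traps are logged, it can help to group them by peer and count the Down events. The Python sketch below does this for the BGP peer session traps shown above. It assumes the trap line layout matches those samples exactly; `TRAP` and `bgp_transitions` are hypothetical names.

```python
import re
from collections import defaultdict

# Layout assumed from the sample traps: timestamp, trap number,
# (BGPPeerSession...Up|Down), vpn name, peer address.
TRAP = re.compile(
    r"^(?P<when>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"Internal trap notification \d+ "
    r"\(BGPPeerSession\w*?(?P<state>Up|Down)\) "
    r"vpn (?P<vpn>\S+) ipaddr (?P<peer>\S+)"
)


def bgp_transitions(trap_text):
    """Group Up/Down transitions by (vpn, peer address), in log order."""
    history = defaultdict(list)
    for line in trap_text.splitlines():
        match = TRAP.match(line.strip())
        if match:
            key = (match["vpn"], match["peer"])
            history[key].append((match["when"], match["state"]))
    return history


sample = """\
Wed Jan 18 10:30:03 2023 Internal trap notification 1289 (BGPPeerSessionIPv6Down) vpn upf-in ipaddr abcd:ab:cd:abc::def
Wed Jan 18 10:30:09 2023 Internal trap notification 1288 (BGPPeerSessionIPv6Up) vpn upf-in ipaddr abcd:ab:cd:abc::def
Wed Jan 18 10:30:19 2023 Internal trap notification 1289 (BGPPeerSessionIPv6Down) vpn upf-in ipaddr abcd:ab:cd:abc::def
"""

history = bgp_transitions(sample)
down_counts = {
    key: sum(1 for _, state in events if state == "Down")
    for key, events in history.items()
}
for (vpn, peer), count in down_counts.items():
    print(f"{vpn} {peer}: {count} Down event(s)")
```

A peer with repeated Down events in a short window is a candidate for the flap that triggered the monitoring failure.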
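In both RCM-based and ICSR-based CUPS setups, a dedicated context is created within the UP. For example, an RCM setup creates the "rcm" context in the UP, whereas an ICSR setup creates the "srp" context. Here is a sample configuration for RCM-based CUPS: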
******** show rcm info *******
Thursday March 17 20:51:40 IST 2022
Redundancy Configuration Module:
-------------------------------------------------------------------------------
Context: rcm
Bind Address: <UPF IP binding with RCM controller>
Chassis State: Active
Session State: SockActive
Route-Modifier: 30
RCM Controller Address: <RCM controller IP>
RCM Controller Port: 9200
RCM Controller Connection State: Connected
Ready To Connect: Yes
Management IP Address: <UPF management IP>
Host ID: Active7
SSH IP Address: (Deactivated)
SSH IP Installation: Enabled
redundancy-configuration-module rcm
rcm controller-endpoint dest-ip-addr <Destination RCM controller IP> port 9200 upf-mgmt-ip-addr <UPF management IP> node-name <Nodename>
bind address <UPF IP binding with RCM controller>
monitor bfd peer X.X.X.X
monitor bgp failure reload active
monitor bgp context GnS5S8-U X.X.X.X group 1
monitor bgp context GnS5S8-U X.X.X.X group 1
monitor bgp context GnS5S8-U abcd:defc:c:f::XXXX group 2
monitor bgp context GnS5S8-U defg:abcg:c:f::XXXX group 2
monitor bgp context SGi Z.Z.Z.Z group 3
monitor bgp context SGi G.G.G.G group 3
monitor bgp context SGi XXXX:YYYY:c:f::aaaa group 4
monitor bgp context SGi XXXX:YYYY:c:f::bbbb group 4
monitor bgp context Li XXXX:YYYY:c:f::cccc group 5
monitor bgp context Li XXXX:YYYY:c:f::dddd group 5
monitor sx context GnS5S8-U bind-address XXXX:YYYY:c:f::eeee peer-address XXXX:YYYY:c:f::ffff
#exit
Sample configuration for ICSR-based CUPS without RCM:
******** show srp info *******
Sunday April 23 04:39:49 JST 2023
Service Redundancy Protocol:
-------------------------------------------------------------------------------
Context: SRP
Local Address: <UP IP>
Chassis State: Active
Chassis Mode: Backup
Chassis Priority: 10
Local Tiebreaker: FA-02-1B-E8-C1-7E
Route-Modifier: 3
Peer Remote Address: <UP IP>
Peer State: Standby
Peer Mode: Primary
Peer Priority: 1
Peer Tiebreaker: FA-02-1B-13-31-D1
Peer Route-Modifier: 6
Last Hello Message received: Sun Apr 23 04:39:47 2023 (2 seconds ago)
Peer Configuration Validation: Complete
Last Peer Configuration Error: None
Last Peer Configuration Event: Sun Apr 23 04:21:10 2023 (1119 seconds ago)
Last Validate Switchover Status: None
Connection State: Connected
service-redundancy-protocol
monitor bfd context SRP <bfd peer IP> chassis-to-chassis
monitor bfd context SRP <bfd peer IP> chassis-to-chassis
monitor bgp context SAEGW-U-1 <IP> group 1
monitor bgp context SAEGW-U-1 <IP> group 1
monitor bgp context SAEGW-U-1 <IP> group 2
monitor bgp context SAEGW-U-1 <IP> group 2
monitor bgp context SAEGW-U-1 <IP> group 3
monitor bgp context SAEGW-U-1 <IP> group 3
monitor bgp context SGI-1 <IP> group 4
monitor bgp context SGI-1 <IP> group 4
monitor system vpp delay-period 30
peer-ip-address <IP>
bind address <IP>
#exit
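In both configurations, monitoring is implemented for BGP (and similarly for BFD) within the respective context.
Each monitor instance is assigned a unique group number, with separate group numbers for different services. For example, in the RCM context, "SGi" is associated with group 3, "SGi IPv6" with group 4, and "Li" with group 5.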
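Because syslog reports failures by group number (for example, "BGP monitor group 3 down"), it can be handy to map each group back to its context and peers. The Python sketch below builds that map from saved configuration text. It assumes the `monitor bgp context <context> <peer> group <n>` line shape seen in the sample configuration; `MONITOR_BGP` and `bgp_monitor_groups` are hypothetical names, and the addresses are the placeholders used in this document.

```python
import re

# Line shape assumed from the sample configuration in this document.
MONITOR_BGP = re.compile(
    r"^\s*monitor bgp context (?P<context>\S+) (?P<peer>\S+) group (?P<group>\d+)\s*$"
)


def bgp_monitor_groups(config_text):
    """Map each monitor group number to its contexts and monitored peers."""
    groups = {}
    for line in config_text.splitlines():
        match = MONITOR_BGP.match(line)
        if match:
            entry = groups.setdefault(
                int(match["group"]), {"contexts": set(), "peers": []}
            )
            entry["contexts"].add(match["context"])
            entry["peers"].append(match["peer"])
    return groups


sample = """\
monitor bfd peer X.X.X.X
monitor bgp failure reload active
monitor bgp context SGi Z.Z.Z.Z group 3
monitor bgp context SGi G.G.G.G group 3
monitor bgp context SGi XXXX:YYYY:c:f::aaaa group 4
monitor bgp context SGi XXXX:YYYY:c:f::bbbb group 4
monitor bgp context Li XXXX:YYYY:c:f::cccc group 5
monitor bgp context Li XXXX:YYYY:c:f::dddd group 5
"""

groups = bgp_monitor_groups(sample)
for number in sorted(groups):
    entry = groups[number]
    print(number, sorted(entry["contexts"]), entry["peers"])
```

With this map, a group number from a monitoring error points directly at the context and peer addresses to check.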
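Based on the configuration shown, the RCM setup monitors the specified BGP links within this context. If any BGP link flaps, or a link cannot be detected, a monitoring failure can occur. In an ICSR setup of the UP without RCM, BGP link monitoring is performed by SRP. The mechanism works in the same way as explained here.
The primary goal is to monitor these links. When these monitoring errors are encountered, the first step is to determine why the link is not monitored. Possible causes include BGP flaps, a configuration mismatch between the IPs listed for monitoring and the IPs specified in their respective contexts, or packet loss.
As with BGP flaps, Sx flaps between the CP and the UP are monitored. If an Sx flap is detected, the RCM initiates a switchover accordingly.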
Sx flap errors, as seen in SNMP traps:
Thu Apr 28 15:22:55 2022 Internal trap notification 1382 (SxPathFailure) Context Name:gwctx, Service Name:sx-srvc-cp, Self-IP:X.X.X.X, Peer-IP:Y.Y.Y.Y, Old Recovery Timestamp:3854468847, New Recovery Timestamp
RCM controller logs:
Monitoring failure for BFD
{"log":"2022/11/12 13:33:31.138 [ERROR] [red.go:2144] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BFD\n",
Monitoring failure for BGP
{"log":"2022/11/12 15:34:27.644 [ERROR] [red.go:2144] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BGP\n"
Monitoring failure for Sx
{"log":"2022/11/12 15:34:46.763 [ERROR] [red.go:2144] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_SX\n"
RCM command outputs:
rcm show-status
(to check whether the RCM is in Master or Backup state)
rcm show-statistics configmgr
(to check the number of UPs connected to this configmgr and which UPs are currently active or standby)
rcm show-statistics controller
(to check the number of UPs connected to this controller and which UPs are currently active or standby)
rcm show-statistics switchover
rcm show-statistics switchover-verbose
(to check which UP was switched over to which UP, at what time, and for what reason)
Sample command output:
root@Nodename:
[unknown] rcm# rcm show-status
message :
{"status": "MASTER"}
[unknown] rcm# rcm show-statistics switchover
message :
{
"stats_history": [
{
"status": "Success",
"started": "Mar 21 03:40:37.480",
"ended": "Mar 21 03:40:41.659",
"switchoverreason": "BGP Failure",
"source_endpoint": "X.X.X.X",
"destination_endpoint": "Y.Y.Y.Y"
}
],
"num_switchover": 1
}
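If the `message` body of `rcm show-statistics switchover` is saved as JSON, the switchover duration can be computed from the `started` and `ended` fields. The Python sketch below assumes the field names shown in the sample output above and assumes the year 2022, because the timestamps carry no year; `parse_stamp` is a hypothetical helper.

```python
import json
from datetime import datetime


def parse_stamp(stamp, year):
    """Parse a 'Mar 21 03:40:37.480' style timestamp; the year is supplied."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


# JSON body copied from the sample output in this document.
message = """
{
  "stats_history": [
    {
      "status": "Success",
      "started": "Mar 21 03:40:37.480",
      "ended": "Mar 21 03:40:41.659",
      "switchoverreason": "BGP Failure",
      "source_endpoint": "X.X.X.X",
      "destination_endpoint": "Y.Y.Y.Y"
    }
  ],
  "num_switchover": 1
}
"""

stats = json.loads(message)
durations = []
for entry in stats["stats_history"]:
    started = parse_stamp(entry["started"], 2022)  # assumed year
    ended = parse_stamp(entry["ended"], 2022)
    seconds = (ended - started).total_seconds()
    durations.append(seconds)
    print(
        f'{entry["source_endpoint"]} -> {entry["destination_endpoint"]}: '
        f'{entry["switchoverreason"]}, {entry["status"]}, {seconds:.3f} s'
    )
```

For the sample entry this yields a duration of about 4.179 seconds for the BGP Failure switchover.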
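Collect the controller logs and review them carefully for each of the switchover scenarios described earlier. The purpose of this analysis is to confirm that the switchover process ran seamlessly, without any problems.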
{"log":"2022/05/10 00:30:48.553 [INFO] [events.go:87] [rcm_ctrl_ep.events.bfdmgr] eventsDbSetCallBack: endpoint X.X.X.X : STATE_UP -\u003e STATE_DOWN\n","stream":"stdout","time":"2022-05-10T00:30:48.553622344Z"}
--------------------Indication of active UP bfd went down
{"log":"2022/05/10 00:30:48.553 [DEBUG] [control.go:2920] [rcm_ctrl.control.main] [stateMachine]: Received Event Endpoint: groupId: 1 endpoint: X.X.X.X status: STATE_DOWN\n","stream":"stdout","time":"2022-05-10T00:30:48.553654666Z"}
{"log":"\n","stream":"stdout","time":"2022-05-10T00:30:48.553661415Z"}
{"log":"2022/05/10 00:30:48.553 [INFO] [red.go:2353] [rcm_ctrl.control.main] [upfHandlUpfAction]: StateChange: UPFAction_ActiveToDown\n","stream":"stdout","time":"2022-05-10T00:30:48.553670033Z"}
{"log":"2022/05/10 00:30:48.553 [ERROR] [red.go:2103] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BFD\n","stream":"stdout","time":"2022-05-10T00:30:48.55368269Z"}
{"log":"2022/11/12 13:33:27.759 [ERROR] [red.go:2144] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'Z.Z.Z.Z' monitor failure, reason UpfMonitor_BGP\n",
----------- Indication of BFD/BGP timer expired and there is a monitoring failure
{"log":"2022/05/10 00:30:48.553 [WARN] [red.go:2256] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: upf X.X.X.X Switch over to Y.Y.Y.Y\n","stream":"stdout","time":"2022-05-10T00:30:48.553696821Z"}
---------- Indication of switchover initiated by RCM
{"log":"2022/05/10 00:32:03.555 [DEBUG] [control.go:3533] [rcm_ctrl.control.main] [snmpThread]: SNMP trap raised for : SwitchoverComplete\n","stream":"stdout","time":"2022-05-10T00:32:03.556753903Z"}
{"log":"2022/05/10 00:32:03.603 [DEBUG] [control.go:1885] [rcm_ctrl.control.main] [handleUpfStateMsg]: endpoint: Y.Y.Y.Y State: UpfMsgState_Active RouteModifier: 28 HostID 'Active3'\n","stream":"stdout","time":"2022-05-10T00:32:03.60379131Z"}
{"log":"2022/05/10 00:32:03.603 [DEBUG] [control.go:2048] [rcm_ctrl.control.main] [handleUpfStateMsg]: endpoint: Y.Y.Y.Y OldState: UPFState_Active NewState: UPFState_Active\n","stream":"stdout","time":"2022-05-10T00:32:03.603847124Z"}
---------- Indication of switchover completed and other UP became Active
{"log":"2022/05/10 00:32:03.646 [INFO] [control.go:1054] [rcm_ctrl.control.main] [handleUpfActiveAckMsg]: Subscriber data / Sx messages flowing towards UP 'Y.Y.Y.Y'\n","stream":"stdout","time":"2022-05-10T00:32:03.646883813Z"}
-------------- Traffic routed towards other Active UP
{"log":"2022/05/10 00:32:53.861 [INFO] [red.go:859] [rcm_ctrl.control.main] [handleUpfSetStandby]: Assigning PEND_STANDBY state to UPF 'X.X.X.X'. Notifies Configmgr, NSO and Redmgrs after receiving State Ack from UPF.\n","stream":"stdout","time":"2022-05-10T00:32:53.862051117Z"}
{"log":"2022/05/10 00:32:53.861 [INFO] [red.go:1681] [rcm_ctrl.control.main] [sendStateToUpf]:send state UpfMsgState_Standby to upf X.X.X.X \n","stream":"stdout","time":"2022-05-10T00:32:53.862059689Z"}
{"log":"2022/05/10 00:32:53.890 [INFO] [red.go:1176] [rcm_ctrl.control.main] [handleUpfNotifyMgrs]: Received UpfMsgState_Standby ACK from UPF 'X.X.X.X'. Notifying Configmgr and Redmgrs.\n","stream":"stdout","time":"2022-05-10T00:32:53.890712421Z"}
---------------------- Switchovered UP became Standby
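The sequence above can also be traced with a short script. The Python sketch below reads controller log lines in the JSON-per-line form shown, skips annotation or truncated lines, and reports the failure reason and the time between the "Switch over to" entry and the SwitchoverComplete trap. Treating those two entries as the start and end markers is an assumption based on the annotations in this document; `parse_controller_log` and `first` are hypothetical helpers.

```python
import json
import re
from datetime import datetime

# The "log" field starts with "<date> <time> [LEVEL] ..." in the samples.
LOG_PREFIX = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[(\w+)\] (.*)$", re.S
)


def parse_controller_log(text):
    """Return (timestamp, level, message) tuples from JSON log lines."""
    events = []
    for line in text.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # annotation line or truncated JSON
        if not isinstance(record, dict):
            continue
        match = LOG_PREFIX.match(record.get("log", "").strip())
        if match:
            when = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f")
            events.append((when, match.group(2), match.group(3)))
    return events


def first(events, needle):
    """Return the first event whose message contains needle, else None."""
    return next((e for e in events if needle in e[2]), None)


sample = r"""
{"log":"2022/05/10 00:30:48.553 [ERROR] [red.go:2103] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BFD\n","stream":"stdout","time":"2022-05-10T00:30:48.55368269Z"}
{"log":"2022/05/10 00:30:48.553 [WARN] [red.go:2256] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: upf X.X.X.X Switch over to Y.Y.Y.Y\n","stream":"stdout","time":"2022-05-10T00:30:48.553696821Z"}
---------- Indication of switchover initiated by RCM
{"log":"2022/05/10 00:32:03.555 [DEBUG] [control.go:3533] [rcm_ctrl.control.main] [snmpThread]: SNMP trap raised for : SwitchoverComplete\n","stream":"stdout","time":"2022-05-10T00:32:03.556753903Z"}
"""

events = parse_controller_log(sample)
failure = first(events, "monitor failure")
started = first(events, "Switch over to")
completed = first(events, "SwitchoverComplete")

reason = re.search(r"reason (\w+)", failure[2]).group(1)
elapsed = (completed[0] - started[0]).total_seconds()
print(f"reason={reason}, switchover took {elapsed:.3f} s")
```

If any of the three markers is missing from a real log, `first` returns None, which is itself a hint that the switchover did not follow the expected sequence.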
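When the RCM initiates a switchover from one UP to another, it pushes the necessary configuration. To make sure this configuration is applied successfully, the RCM sets a timer for the process to complete.
Once the configuration is pushed and stored at a path on the UP, the UP executes the configuration within the time frame defined by the RCM.
After the UP finishes executing the configuration, it sends a signal to the RCM. This signal appears as an event log entry in syslog, which confirms that the configuration push completed successfully.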
Nov 13 12:01:09 <NODENAME> evlogd: [local-60sec9.041] [cli 30000 debug] [1/0/10935 <cli:1010935> cliparse.c:571] [context: local, contextID: 1] [software internal system syslog] CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete
Nov 13 12:01:09 <NODENAME> evlogd: [local-60sec9.041] [cli 30000 debug] [1/0/10935 <cli:1010935> cliparse.c:571] [context: local, contextID: 1] [software internal system syslog] CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete end-of-config
rcm-config-push-complete end-of-config
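A quick way to confirm the push completed is to search the UP syslog for the `rcm-config-push-complete end-of-config` entry. The Python sketch below is a minimal check that relies only on the line ending seen in the sample entries; the function name `config_push_completions` is hypothetical.

```python
def config_push_completions(syslog_text):
    """Return the syslog timestamps of end-of-config completion entries."""
    marker = "rcm-config-push-complete end-of-config"
    stamps = []
    for line in syslog_text.splitlines():
        # Only full syslog entries with a "CLI command" record are counted.
        if "CLI command" in line and line.rstrip().endswith(marker):
            stamps.append(" ".join(line.split()[:3]))  # e.g. "Nov 13 12:01:09"
    return stamps


sample = """\
Nov 13 12:01:09 <NODENAME> evlogd: [local-60sec9.041] [cli 30000 debug] [1/0/10935 <cli:1010935> cliparse.c:571] [context: local, contextID: 1] [software internal system syslog] CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete
Nov 13 12:01:09 <NODENAME> evlogd: [local-60sec9.041] [cli 30000 debug] [1/0/10935 <cli:1010935> cliparse.c:571] [context: local, contextID: 1] [software internal system syslog] CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete end-of-config
"""

completions = config_push_completions(sample)
print(completions or "no end-of-config entry found")
```

An empty result after a switchover suggests the configuration push did not finish, which is the cue to inspect the ConfigMgr logs.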
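The problematic CLI in the pushed configuration file can be identified from the RCM ConfigMgr logs.
SFTP-related problems can occur when the RCM tries to send the configuration but has difficulty establishing a connection with the UP. These problems can stem from password issues or other factors that affect SFTP operation.
Review the ConfigMgr logs to monitor the SFTP status and identify configuration errors. Here are examples of typical log entries.
SFTP logs in the RCM ConfigMgr logs appear as: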
{"log":"2022/11/12 23:53:09.066 rcm-configmgr [DEBUG] [sshclient.go:395] [rcm_grpc_ep.msg-process.Int] Initiate a sftp connection to host: X.X.X.X \n","stream":"stdout","time":"2022-11-12T23:53:09.067894173Z"}
{"log":"2022/11/12 23:53:09.066 rcm-configmgr [DEBUG] [sftpClient.go:26] [rcm_grpc_ep.grpc.Int] Conneting to host X.X.X.X for sftp with src path: /cfg/ConfigMgr/upfconfig10-103-108-154_22.cfg and dst path: /sftp/10-103-108-154_22.cfg \n","stream":"stdout","time":"2022-11-12T23:53:09.067903156Z"}
{"log":"2022/11/12 23:53:09.203 rcm-configmgr [DEBUG] [sftpClient.go:58] [rcm_grpc_ep.grpc.Int] Successfully opened the file%!(EXTRA string=/cfg/ConfigMgr/upfconfig10-103-108-154_22.cfg)\n","stream":"stdout","time":"2022-11-12T23:53:09.203698078Z"}
{"log":"2022/11/12 23:53:09.211 rcm-configmgr [DEBUG] [sftpClient.go:66] [rcm_grpc_ep.grpc.Int] Total bytes copied 405933: \n","stream":"stdout","time":"2022-11-12T23:53:09.212063509Z"}
Password expiry during SFTP, as observed in the UP syslogs:
2022-May-16+17:45:02.834 [cli 30005 info] [1/0/14263 <cli:1014263> _commands_cli.c:1474] [software internal system syslog] CLI session ended for Security Administrator admin on device /dev/pts/5
2022-May-16+17:45:02.834 [cli 30024 error] [1/0/14263 <cli:1014263> cli.c:1657] [software internal system syslog] Misc error: Password change required rc=0
2022-May-16+17:45:02.834 [cli 30087 info] [1/0/14263 <cli:1014263> cli.c:1352] [software internal system critical-info syslog] USER user 'admin' password has expired beyond grace period
2022-May-16+17:45:02.594 [cli 30004 info] [1/0/14263 <cli:1014263> cli_sess.c:164] [software internal system syslog] CLI session started for Security Administrator admin on device /dev/pts/5 from X.X.X.X
2022-May-16+17:45:02.537 [cli 30028 debug] [1/0/9816 <vpnmgr:1> luser_auth.c:1598] [context: local, contextID: 1] [software internal system syslog] Login attempt failure for user admin IP address X.X.X.X - Access type ssh/sftp
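If the password causes the SFTP problem, consider generating a new password or extending the password expiry.
If password problems are ruled out, check the number of concurrent SFTP sessions, because too many sessions can disrupt SFTP.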
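To check quickly whether a password problem is behind an SFTP failure, the UP syslog can be scanned for the expiry and login-failure messages shown above. The Python sketch below matches only the wording in those sample lines; `EXPIRED`, `LOGIN_FAIL`, and `scan_sftp_login_issues` are hypothetical names.

```python
import re

# Message wording taken from the sample UP syslog lines in this document.
EXPIRED = re.compile(r"USER user '(?P<user>[^']+)' password has expired")
LOGIN_FAIL = re.compile(
    r"Login attempt failure for user (?P<user>\S+) "
    r"IP address (?P<ip>\S+) - Access type (?P<access>\S+)"
)


def scan_sftp_login_issues(syslog_text):
    """Return (expired_users, login_failures) found in the syslog text."""
    expired, failures = [], []
    for line in syslog_text.splitlines():
        hit = EXPIRED.search(line)
        if hit:
            expired.append(hit["user"])
        hit = LOGIN_FAIL.search(line)
        if hit:
            failures.append((hit["user"], hit["ip"], hit["access"]))
    return expired, failures


sample = """\
2022-May-16+17:45:02.834 [cli 30024 error] [1/0/14263 <cli:1014263> cli.c:1657] [software internal system syslog] Misc error: Password change required rc=0
2022-May-16+17:45:02.834 [cli 30087 info] [1/0/14263 <cli:1014263> cli.c:1352] [software internal system critical-info syslog] USER user 'admin' password has expired beyond grace period
2022-May-16+17:45:02.537 [cli 30028 debug] [1/0/9816 <vpnmgr:1> luser_auth.c:1598] [context: local, contextID: 1] [software internal system syslog] Login attempt failure for user admin IP address X.X.X.X - Access type ssh/sftp
"""

expired, failures = scan_sftp_login_issues(sample)
print("expired passwords:", expired)
print("failed logins:", failures)
```

If both lists are empty, the password is less likely to be the cause, and the number of concurrent SFTP sessions is the next thing to check.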
Revision | Publish Date | Comments |
---|---|---|
1.0 | 14-Aug-2023 | Initial Release |