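This document describes how to identify the different user plane (UP) reload scenarios from their symptoms in order to troubleshoot the issue.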
RCM - Redundancy Configuration Manager
SSD - Show Support Details
UPF/UP - User Plane Function
VPP - Vector Packet Processing
BFD - Bidirectional Forwarding Detection
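Methods to identify the symptoms of UP reload scenarios:
In a CUPS setup, UP reload scenarios are a common challenge and call for effective symptom identification followed by troubleshooting.
To begin, check the system uptime to determine exactly when the UP last restarted. This helps focus the analysis on the RCM logs that correspond to the reload event.
Check the system uptime with this command: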
******** show system uptime *******
Friday July 22 09:28:14 IST 2022
System uptime: 0D 0H 6M
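Note: Verify that the RCM and UP timestamps refer to the same time zone. If they differ, convert one of them before you correlate events. For example, if the UP time is in IST and the RCM time is in UTC, the RCM time is always 5 hours 30 minutes behind the UP time.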
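The time zone correction in the note can be sketched in Python. This is a minimal illustration, not a Cisco tool: the helper name `up_ist_to_utc` is hypothetical, and it assumes the UP banner timestamp has the same layout as the `show system uptime` sample and is in IST.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed offset of UTC+05:30 with no daylight saving.
IST = timezone(timedelta(hours=5, minutes=30))

def up_ist_to_utc(stamp: str) -> datetime:
    """Convert a UP banner timestamp such as
    'Friday July 22 09:28:14 IST 2022' to an aware UTC datetime."""
    # Remove the literal 'IST' token; the offset is applied explicitly below.
    cleaned = stamp.replace(" IST ", " ")
    local = datetime.strptime(cleaned, "%A %B %d %H:%M:%S %Y")
    return local.replace(tzinfo=IST).astimezone(timezone.utc)

# Example: the UP banner time from the sample output, expressed in UTC
# so that it can be compared with RCM log timestamps.
print(up_ist_to_utc("Friday July 22 09:28:14 IST 2022").isoformat())
```

The result can be compared directly with the timestamps in the RCM controller logs when those are recorded in UTC.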
Verify whether any crash occurred during the reload. Use this command to check for crash events:
******** show crash list *******
Sunday January 23 12:12:14 IST 2022
=== ==================== ======== ========== =============== =======================
# Time Process Card/CPU/ SW HW_SER_NUM
PID VERSION VPO / Crash Card
=== ==================== ======== ========== =============== =======================
1 2022-Jan-14+13:16:40 sessmgr 01/0/11287 21.25.5 NA
2 2022-Jan-19+20:51:01 sessmgr 01/0/16142 21.25.5 NA
3 2022-Jan-22+15:51:55 vpp 01/0/07307 21.25.5 NA
4 2022-Jan-22+15:52:08 sessmgr 01/0/27011 21.25.5 NA
5 2022-Jan-22+16:07:43 sessmgr 01/0/13528 21.25.5 NA
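In this step, check whether any crash occurred, such as a vpp or sessmgr crash. If a vpp crash is detected, the UP reloads immediately because of the crash, which prompts the RCM to initiate a switchover to another UP.
If session manager (sessmgr) crashes occur in succession, they can trigger a VPP crash, which in turn reloads the UP.
Whenever such crashes are encountered, make sure to collect the core files for vpp/sessmgr.
Note: For vpp, a minicore can be obtained instead of a full core file.
Action plan: Once the core file or minicore is obtained, the next step is to debug the core file to pinpoint the root cause of the crash.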
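When the `show crash list` output has been saved to a text file, the crash check can be scripted. The sketch below is only an illustration in Python: the helper names `parse_crash_list` and `vpp_crashes` are hypothetical, and the regular expression assumes that each row follows the column layout of the sample output above.

```python
import re
from datetime import datetime

# One data row: number, timestamp, process, card/cpu/pid, software version.
CRASH_ROW = re.compile(
    r"^\s*(\d+)\s+(\d{4}-[A-Za-z]{3}-\d{2}\+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+)\s+(\S+)"
)

def parse_crash_list(text):
    """Return one dict per crash row; banner and header lines are ignored."""
    rows = []
    for line in text.splitlines():
        m = CRASH_ROW.match(line)
        if m:
            rows.append({
                "num": int(m.group(1)),
                "time": datetime.strptime(m.group(2), "%Y-%b-%d+%H:%M:%S"),
                "process": m.group(3),
                "version": m.group(5),
            })
    return rows

def vpp_crashes(rows):
    """Keep only the vpp crashes, which reload the UP immediately."""
    return [r for r in rows if r["process"] == "vpp"]

sample = """\
=== ==================== ======== ========== ===============
2 2022-Jan-19+20:51:01 sessmgr 01/0/16142 21.25.5 NA
3 2022-Jan-22+15:51:55 vpp 01/0/07307 21.25.5 NA
4 2022-Jan-22+15:52:08 sessmgr 01/0/27011 21.25.5 NA
"""
for crash in vpp_crashes(parse_crash_list(sample)):
    print(crash["num"], crash["time"], crash["process"])
```

The timestamps of the vpp rows can then be compared with the last reload time derived from the system uptime.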
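This section describes the errors found in the syslogs that relate to BFD monitor failures.
These errors occur when there is a BFD flap or packet loss between the RCM and the UP, particularly when the connectivity between them involves ACI.
In essence, a timer is configured to monitor BFD packets. If the timer expires for any of the reasons mentioned, a monitor failure is triggered. This event prompts the RCM to initiate a switchover.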
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.823] [bfd 170500 error] [1/0/9345 <bfdlc:0> bfd_network.c:1798] [software internal system] <bfdctx:7> Session(1/-1260920720) DOWN control detection timer expired
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.856] [bfd 170500 error] [1/0/9345 <bfdlc:0> bfd_network.c:1798] [software internal system] <bfdctx:5> Session(2/1090521080) DOWN control detection timer expired
Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.859] [srp 84220 error] [1/0/10026 <vpnmgr:7> pnmgr_rcm_bfd.c:704] [context: rcmctx, contextID: 7] [software internal system syslog] BFD down, closing TCP.
Jan 22 15:51:56 <NODENAME> evlogd: [local-60sec55.979] [srp 84220 error] [1/0/10026 <vpnmgr:7> pnmgr_rcm_bgp.c:428] [context: rcmctx, contextID: 7] [software internal system syslog] Cannot inform RCM about BGP monitor failure as TCP connection with RCM down.
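To resolve this issue, examine the system thoroughly for any underlying problem that could cause the BFD flap. Once the timestamp of the problem is identified, coordinate with the ACI team to investigate whether a flap or other issue occurred at their end at that timestamp.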
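To collect the timestamps that need to be checked on the ACI side, the BFD errors can be extracted from a saved syslog file. The Python sketch below is an illustration only; the helper name `bfd_down_events` is hypothetical, and the pattern assumes the line layout of the sample syslog entries above.

```python
import re

# Matches evlogd lines from the bfd facility that report a session going DOWN.
BFD_DOWN = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}).*\[bfd \d+ error\].*"
    r"Session\((\S+?)\) DOWN (.*)$"
)

def bfd_down_events(lines):
    """Return (timestamp, session, reason) for every BFD DOWN line."""
    events = []
    for line in lines:
        m = BFD_DOWN.match(line.strip())
        if m:
            events.append((m.group(1), m.group(2), m.group(3)))
    return events

sample = [
    "Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.823] [bfd 170500 error] "
    "[1/0/9345 <bfdlc:0> bfd_network.c:1798] [software internal system] "
    "<bfdctx:7> Session(1/-1260920720) DOWN control detection timer expired",
    "Jan 22 15:51:55 <NODENAME> evlogd: [local-60sec55.859] [srp 84220 error] "
    "[1/0/10026 <vpnmgr:7> pnmgr_rcm_bfd.c:704] [context: rcmctx, contextID: 7] "
    "[software internal system syslog] BFD down, closing TCP.",
]
for stamp, session, reason in bfd_down_events(sample):
    print(stamp, session, reason)
```

Only the first sample line is reported, because the second one comes from the srp facility and describes the consequence rather than the BFD session state.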
A BGP flap or a monitor failure in the UP can trigger a switchover initiated by the RCM. The specific errors are shown here.
Mar 21 09:10:37 <NODENAME> evlogd: [local-60sec37.482] [vpn 5572 info] [1/0/10038 <vpnmgr:7> pnmgr_rcm_bgp.c:392] [context: rcmctx, contextID: 7] [software internal system critical-info syslog] BGP monitor group 3 down.
Mar 21 09:10:37 <NODENAME> evlogd: [local-60sec37.482] [vpn 5572 info] [1/0/10038 <vpnmgr:7> pnmgr_rcm_bgp.c:392] [context: rcmctx, contextID: 7] [software internal system critical-info syslog] BGP monitor group 4 down.
Mar 21 09:10:37 <NODENAME> evlogd: [local-60sec37.482] [srp 84220 error] [1/0/10038 <vpnmgr:7> pnmgr_rcm_bgp.c:423] [context: rcmctx, contextID: 7] [software internal system syslog] Informed RCM about BGP monitor failure.
Possible factors that cause BGP flaps and ways to identify them. SNMP traps can show errors that indicate a BGP flap:
Wed Jan 18 10:30:03 2023 Internal trap notification 1289 (BGPPeerSessionIPv6Down) vpn upf-in ipaddr abcd:ab:cd:abc::def
Wed Jan 18 10:30:09 2023 Internal trap notification 1288 (BGPPeerSessionIPv6Up) vpn upf-in ipaddr abcd:ab:cd:abc::def
Wed Jan 18 10:30:19 2023 Internal trap notification 1289 (BGPPeerSessionIPv6Down) vpn upf-in ipaddr abcd:ab:cd:abc::def
To begin, use the context ID to identify the context associated with the error that indicates BGP flaps. With the context established, you can determine the specific service involved and retrieve the corresponding IP details.
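The Down/Up traps listed above can be paired per peer to see how often and for how long each BGP session flapped. The Python sketch below is an illustration; the helper name `pair_bgp_flaps` is hypothetical, and the pattern assumes the trap line layout shown in the sample.

```python
import re
from datetime import datetime

TRAP = re.compile(
    r"^(\w{3} \w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4}) Internal trap notification \d+ "
    r"\(BGPPeerSession\w*?(Up|Down)\) vpn (\S+) ipaddr (\S+)"
)

def pair_bgp_flaps(lines):
    """Return (vpn, peer, down_time, up_time) tuples.
    up_time is None when no Up trap follows the Down trap."""
    open_down = {}  # (vpn, peer) -> time of a Down trap still waiting for an Up
    flaps = []
    for line in lines:
        m = TRAP.match(line.strip())
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
        state, key = m.group(2), (m.group(3), m.group(4))
        if state == "Down":
            open_down.setdefault(key, when)
        elif key in open_down:
            flaps.append((*key, open_down.pop(key), when))
    # Sessions that went down and never came back in the captured window.
    flaps.extend((*key, when, None) for key, when in open_down.items())
    return flaps

traps = [
    "Wed Jan 18 10:30:03 2023 Internal trap notification 1289 (BGPPeerSessionIPv6Down) vpn upf-in ipaddr abcd:ab:cd:abc::def",
    "Wed Jan 18 10:30:09 2023 Internal trap notification 1288 (BGPPeerSessionIPv6Up) vpn upf-in ipaddr abcd:ab:cd:abc::def",
    "Wed Jan 18 10:30:19 2023 Internal trap notification 1289 (BGPPeerSessionIPv6Down) vpn upf-in ipaddr abcd:ab:cd:abc::def",
]
for vpn, peer, down, up in pair_bgp_flaps(traps):
    print(vpn, peer, down, up)
```

The `vpn` field of each tuple is the context name, which is the starting point for finding the affected service.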
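In both RCM-based and ICSR-based CUPS setups, a separate context is created in the UP. For example, in an RCM setup the "rcm" context is created in the UP, whereas an ICSR setup creates the "srp" context. This is a sample configuration for RCM-based CUPS: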
******** show rcm info *******
Thursday March 17 20:51:40 IST 2022
Redundancy Configuration Module:
-------------------------------------------------------------------------------
Context: rcm
Bind Address: <UPF IP binding with RCM controller>
Chassis State: Active
Session State: SockActive
Route-Modifier: 30
RCM Controller Address: <RCM controller IP>
RCM Controller Port: 9200
RCM Controller Connection State: Connected
Ready To Connect: Yes
Management IP Address: <UPF management IP>
Host ID: Active7
SSH IP Address: (Deactivated)
SSH IP Installation: Enabled
redundancy-configuration-module rcm
rcm controller-endpoint dest-ip-addr <Destination RCM controller IP> port 9200 upf-mgmt-ip-addr <UPF management IP> node-name <Nodename>
bind address <UPF IP binding with RCM controller>
monitor bfd peer X.X.X.X
monitor bgp failure reload active
monitor bgp context GnS5S8-U X.X.X.X group 1
monitor bgp context GnS5S8-U X.X.X.X group 1
monitor bgp context GnS5S8-U abcd:defc:c:f::XXXX group 2
monitor bgp context GnS5S8-U defg:abcg:c:f::XXXX group 2
monitor bgp context SGi Z.Z.Z.Z group 3
monitor bgp context SGi G.G.G.G group 3
monitor bgp context SGi XXXX:YYYY:c:f::aaaa group 4
monitor bgp context SGi XXXX:YYYY:c:f::bbbb group 4
monitor bgp context Li XXXX:YYYY:c:f::cccc group 5
monitor bgp context Li XXXX:YYYY:c:f::dddd group 5
monitor sx context GnS5S8-U bind-address XXXX:YYYY:c:f::eeee peer-address XXXX:YYYY:c:f::ffff
#exit
Sample configuration for ICSR-based CUPS without RCM:
******** show srp info *******
Sunday April 23 04:39:49 JST 2023
Service Redundancy Protocol:
-------------------------------------------------------------------------------
Context: SRP
Local Address: <UP IP>
Chassis State: Active
Chassis Mode: Backup
Chassis Priority: 10
Local Tiebreaker: FA-02-1B-E8-C1-7E
Route-Modifier: 3
Peer Remote Address: <UP IP>
Peer State: Standby
Peer Mode: Primary
Peer Priority: 1
Peer Tiebreaker: FA-02-1B-13-31-D1
Peer Route-Modifier: 6
Last Hello Message received: Sun Apr 23 04:39:47 2023 (2 seconds ago)
Peer Configuration Validation: Complete
Last Peer Configuration Error: None
Last Peer Configuration Event: Sun Apr 23 04:21:10 2023 (1119 seconds ago)
Last Validate Switchover Status: None
Connection State: Connected
service-redundancy-protocol
monitor bfd context SRP <bfd peer IP> chassis-to-chassis
monitor bfd context SRP <bfd peer IP> chassis-to-chassis
monitor bgp context SAEGW-U-1 <IP> group 1
monitor bgp context SAEGW-U-1 <IP> group 1
monitor bgp context SAEGW-U-1 <IP> group 2
monitor bgp context SAEGW-U-1 <IP> group 2
monitor bgp context SAEGW-U-1 <IP> group 3
monitor bgp context SAEGW-U-1 <IP> group 3
monitor bgp context SGI-1 <IP> group 4
monitor bgp context SGI-1 <IP> group 4
monitor system vpp delay-period 30
peer-ip-address <IP>
bind address <IP>
#exit
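In both configurations, BGP monitoring is implemented within the respective context, in the same way as BFD monitoring.
Each monitor entry is assigned a group number, and different services use separate group numbers. For example, in the RCM context, "SGi" is associated with group 3, "SGi IPv6" with group 4, and "Li" with group 5.
Based on the configuration shown, the RCM setup monitors the BGP links specified in this context. If any of these BGP links flaps, or the link cannot be detected, a monitor failure can occur. In an ICSR setup of UPs without RCM, SRP performs the BGP link monitoring, and the mechanism works in the same way.
The main goal is to keep these links under watch. When these monitoring errors appear, the first step is to determine why the link is not being monitored. Possible causes include a BGP flap, a configuration mismatch between the IP listed for monitoring and the IP specified in its respective context, or packet loss.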
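When a syslog entry such as "BGP monitor group 3 down" appears, the group number has to be mapped back to a context and its peers. The Python sketch below shows one way to build that mapping from a saved configuration. The helper name `bgp_monitor_groups` is hypothetical, and the pattern assumes the `monitor bgp context <context> <peer> group <n>` form used in the sample configurations.

```python
import re
from collections import defaultdict

MONITOR_BGP = re.compile(r"^\s*monitor bgp context (\S+) (\S+) group (\d+)\s*$")

def bgp_monitor_groups(config_text):
    """Map each monitor group number to its list of (context, peer) pairs."""
    groups = defaultdict(list)
    for line in config_text.splitlines():
        m = MONITOR_BGP.match(line)
        if m:
            groups[int(m.group(3))].append((m.group(1), m.group(2)))
    return dict(groups)

config = """\
redundancy-configuration-module rcm
monitor bgp failure reload active
monitor bgp context SGi Z.Z.Z.Z group 3
monitor bgp context SGi G.G.G.G group 3
monitor bgp context Li XXXX:YYYY:c:f::cccc group 5
"""
groups = bgp_monitor_groups(config)
# Peers to check when the syslog reports "BGP monitor group 3 down".
print(groups.get(3))
```

The peers returned for a group can then be compared with the BGP neighbors configured in the named context to rule out a configuration mismatch.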
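Similar to what was explained for BGP flaps, Sx flaps between the CP and the UP are also monitored. If an Sx flap is detected, the RCM initiates a switchover accordingly.
Sx flap errors that can be seen in the SNMP traps: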
Thu Apr 28 15:22:55 2022 Internal trap notification 1382 (SxPathFailure) Context Name:gwctx, Service Name:sx-srvc-cp, Self-IP:X.X.X.X, Peer-IP:Y.Y.Y.Y, Old Recovery Timestamp:3854468847, New Recovery Timestamp
RCM controller logs:
Monitoring failure for BFD
{"log":"2022/11/12 13:33:31.138 [ERROR] [red.go:2144] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BFD\n",
Monitoring failure for BGP
{"log":"2022/11/12 15:34:27.644 [ERROR] [red.go:2144] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BGP\n"
Monitoring failure for Sx
{"log":"2022/11/12 15:34:46.763 [ERROR] [red.go:2144] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_SX\n"
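The three failure types above differ only in the reason suffix, so they can be counted per UPF with one pattern. This Python sketch is an illustration; the helper name `monitor_failures` is hypothetical, and the pattern is taken from the wording of the sample log lines rather than from any documented log format.

```python
import re
from collections import Counter

MONITOR_FAILURE = re.compile(
    r"UPF '([^']+)' monitor failure, reason UpfMonitor_(\w+)"
)

def monitor_failures(lines):
    """Count monitor failures per (UPF address, reason)."""
    counts = Counter()
    for line in lines:
        m = MONITOR_FAILURE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

sample = [
    "[handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BFD\\n",
    "[handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BGP\\n",
    "[handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BGP\\n",
]
for (upf, reason), count in monitor_failures(sample).items():
    print(upf, reason, count)
```

The reason value (BFD, BGP, or SX) indicates which of the monitoring sections in this document applies.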
RCM command outputs:
rcm show-status
(to check whether the RCM is in Master or Backup state)
rcm show-statistics configmgr
(to check the number of UPs connected to this configmgr and which UPs are currently active and standby)
rcm show-statistics controller
(to check the number of UPs connected to this controller and which UPs are currently active and standby)
rcm show-statistics switchover
rcm show-statistics switchover-verbose
(to check which UP was switched over to which UP, at what time, and for what reason)
Sample command output:
root@Nodename:
[unknown] rcm# rcm show-status
message :
{"status": "MASTER"}
[unknown] rcm# rcm show-statistics switchover
message :
{
"stats_history": [
{
"status": "Success",
"started": "Mar 21 03:40:37.480",
"ended": "Mar 21 03:40:41.659",
"switchoverreason": "BGP Failure",
"source_endpoint": "X.X.X.X",
"destination_endpoint": "Y.Y.Y.Y"
}
],
"num_switchover": 1
}
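The `started` and `ended` fields make it easy to compute how long each switchover took. The Python sketch below is an illustration; the helper name `switchover_durations` is hypothetical, and because the timestamps carry no year, a placeholder year is prepended on the assumption that a switchover does not span a year boundary.

```python
import json
from datetime import datetime

def switchover_durations(message_json):
    """Return (source, destination, reason, seconds) per stats_history entry."""
    def parse(stamp):
        # The output has no year; 2000 is only a placeholder (assumption).
        return datetime.strptime(f"2000 {stamp}", "%Y %b %d %H:%M:%S.%f")

    results = []
    for entry in json.loads(message_json).get("stats_history", []):
        seconds = (parse(entry["ended"]) - parse(entry["started"])).total_seconds()
        results.append((
            entry["source_endpoint"],
            entry["destination_endpoint"],
            entry["switchoverreason"],
            seconds,
        ))
    return results

message = """
{"stats_history": [{"status": "Success",
                    "started": "Mar 21 03:40:37.480",
                    "ended": "Mar 21 03:40:41.659",
                    "switchoverreason": "BGP Failure",
                    "source_endpoint": "X.X.X.X",
                    "destination_endpoint": "Y.Y.Y.Y"}],
 "num_switchover": 1}
"""
for src, dst, reason, seconds in switchover_durations(message):
    print(f"{src} -> {dst} ({reason}): {seconds:.3f} s")
```

An unusually long duration for a given reason is a hint to inspect the controller logs for that time window.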
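Be sure to obtain the controller logs and review them carefully for any switchover scenario, as described earlier. This analysis is meant to confirm that the switchover ran smoothly and without issues.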
{"log":"2022/05/10 00:30:48.553 [INFO] [events.go:87] [rcm_ctrl_ep.events.bfdmgr] eventsDbSetCallBack: endpoint X.X.X.X : STATE_UP -\u003e STATE_DOWN\n","stream":"stdout","time":"2022-05-10T00:30:48.553622344Z"}
--------------------Indication of active UP bfd went down
{"log":"2022/05/10 00:30:48.553 [DEBUG] [control.go:2920] [rcm_ctrl.control.main] [stateMachine]: Received Event Endpoint: groupId: 1 endpoint: X.X.X.X status: STATE_DOWN\n","stream":"stdout","time":"2022-05-10T00:30:48.553654666Z"}
{"log":"\n","stream":"stdout","time":"2022-05-10T00:30:48.553661415Z"}
{"log":"2022/05/10 00:30:48.553 [INFO] [red.go:2353] [rcm_ctrl.control.main] [upfHandlUpfAction]: StateChange: UPFAction_ActiveToDown\n","stream":"stdout","time":"2022-05-10T00:30:48.553670033Z"}
{"log":"2022/05/10 00:30:48.553 [ERROR] [red.go:2103] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BFD\n","stream":"stdout","time":"2022-05-10T00:30:48.55368269Z"}
{"log":"2022/11/12 13:33:27.759 [ERROR] [red.go:2144] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: UPF 'Z.Z.Z.Z' monitor failure, reason UpfMonitor_BGP\n",
----------- Indication of BFD/BGP timer expired and there is a monitoring failure
{"log":"2022/05/10 00:30:48.553 [WARN] [red.go:2256] [rcm_ctrl.control.main] [handleUpfActiveToDownAction]: upf X.X.X.X Switch over to Y.Y.Y.Y\n","stream":"stdout","time":"2022-05-10T00:30:48.553696821Z"}
---------- Indication of switchover initiated by RCM
{"log":"2022/05/10 00:32:03.555 [DEBUG] [control.go:3533] [rcm_ctrl.control.main] [snmpThread]: SNMP trap raised for : SwitchoverComplete\n","stream":"stdout","time":"2022-05-10T00:32:03.556753903Z"}
{"log":"2022/05/10 00:32:03.603 [DEBUG] [control.go:1885] [rcm_ctrl.control.main] [handleUpfStateMsg]: endpoint: Y.Y.Y.Y State: UpfMsgState_Active RouteModifier: 28 HostID 'Active3'\n","stream":"stdout","time":"2022-05-10T00:32:03.60379131Z"}
{"log":"2022/05/10 00:32:03.603 [DEBUG] [control.go:2048] [rcm_ctrl.control.main] [handleUpfStateMsg]: endpoint: Y.Y.Y.Y OldState: UPFState_Active NewState: UPFState_Active\n","stream":"stdout","time":"2022-05-10T00:32:03.603847124Z"}
---------- Indication of switchover completed and other UP became Active
{"log":"2022/05/10 00:32:03.646 [INFO] [control.go:1054] [rcm_ctrl.control.main] [handleUpfActiveAckMsg]: Subscriber data / Sx messages flowing towards UP 'Y.Y.Y.Y'\n","stream":"stdout","time":"2022-05-10T00:32:03.646883813Z"}
-------------- Traffic routed towards other Active UP
{"log":"2022/05/10 00:32:53.861 [INFO] [red.go:859] [rcm_ctrl.control.main] [handleUpfSetStandby]: Assigning PEND_STANDBY state to UPF 'X.X.X.X'. Notifies Configmgr, NSO and Redmgrs after receiving State Ack from UPF.\n","stream":"stdout","time":"2022-05-10T00:32:53.862051117Z"}
{"log":"2022/05/10 00:32:53.861 [INFO] [red.go:1681] [rcm_ctrl.control.main] [sendStateToUpf]:send state UpfMsgState_Standby to upf X.X.X.X \n","stream":"stdout","time":"2022-05-10T00:32:53.862059689Z"}
{"log":"2022/05/10 00:32:53.890 [INFO] [red.go:1176] [rcm_ctrl.control.main] [handleUpfNotifyMgrs]: Received UpfMsgState_Standby ACK from UPF 'X.X.X.X'. Notifying Configmgr and Redmgrs.\n","stream":"stdout","time":"2022-05-10T00:32:53.890712421Z"}
---------------------- Switchovered UP became Standby
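The annotated sequence above can be turned into a simple timing check: the time from the first monitor failure to the `SwitchoverComplete` trap. The Python sketch below is an illustration with hypothetical helper names. It assumes each well-formed entry is a JSON object whose `log` field starts with a `YYYY/MM/DD HH:MM:SS.mmm` timestamp, as in the sample, and it skips annotation lines and truncated entries.

```python
import json
from datetime import datetime

def parse_controller_log(lines):
    """Yield (timestamp, message) for each well-formed JSON log line."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # annotation or blank line
        try:
            text = json.loads(line)["log"].strip()
            when = datetime.strptime(text[:23], "%Y/%m/%d %H:%M:%S.%f")
        except (ValueError, KeyError, TypeError):
            continue  # truncated JSON or an entry without a leading timestamp
        yield when, text[24:]

def switchover_seconds(lines):
    """Seconds from the first monitor failure to the next SwitchoverComplete."""
    failure = None
    for when, message in parse_controller_log(lines):
        if failure is None and "monitor failure" in message:
            failure = when
        elif failure is not None and "SwitchoverComplete" in message:
            return (when - failure).total_seconds()
    return None  # no complete failure/completion pair in the captured lines

sample = [
    json.dumps({"log": "2022/05/10 00:30:48.553 [ERROR] [red.go:2103] "
                       "UPF 'X.X.X.X' monitor failure, reason UpfMonitor_BFD\n"}),
    "---------- Indication of switchover initiated by RCM",
    json.dumps({"log": "\n"}),
    json.dumps({"log": "2022/05/10 00:32:03.555 [DEBUG] [control.go:3533] "
                       "SNMP trap raised for : SwitchoverComplete\n"}),
]
print(switchover_seconds(sample))
```

With the timestamps from the sample logs, the switchover completes roughly 75 seconds after the BFD monitor failure.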
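During an RCM switchover from one UP to another, the RCM pushes the required configuration. To ensure that this configuration is applied successfully, the RCM sets a timer for the process to complete.
Once the configuration is pushed and stored in a path on the UP, the UP executes the configuration within the time frame defined by the RCM.
After the UP finishes executing the configuration, it sends a signal to the RCM. This signal appears as an event log entry in the syslog and confirms that the configuration push completed successfully.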
Nov 13 12:01:09 <NODENAME> evlogd: [local-60sec9.041] [cli 30000 debug] [1/0/10935 <cli:1010935> cliparse.c:571] [context: local, contextID: 1] [software internal system syslog] CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete
Nov 13 12:01:09 <NODENAME> evlogd: [local-60sec9.041] [cli 30000 debug] [1/0/10935 <cli:1010935> cliparse.c:571] [context: local, contextID: 1] [software internal system syslog] CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete end-of-config
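A quick way to confirm that both markers were logged is to search the UP syslog for them. The Python sketch below is an illustration; the helper name `config_push_status` is hypothetical, and treating the presence of both entries as a completed push is an assumption based only on the two sample lines above.

```python
import re

# Matches the CLI event-log entry for either form of the marker command.
PUSH_MARKER = re.compile(
    r"CLI command \[user (\S+), mode .*\]: "
    r"rcm-config-push-complete( end-of-config)?\s*$"
)

def config_push_status(syslog_lines):
    """Report which of the two push-complete markers appear in the syslog."""
    status = {"push_complete": False, "end_of_config": False}
    for line in syslog_lines:
        m = PUSH_MARKER.search(line)
        if not m:
            continue
        if m.group(2):
            status["end_of_config"] = True
        else:
            status["push_complete"] = True
    return status

sample = [
    "Nov 13 12:01:09 <NODENAME> evlogd: [cli 30000 debug] CLI command "
    "[user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete",
    "Nov 13 12:01:09 <NODENAME> evlogd: [cli 30000 debug] CLI command "
    "[user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete end-of-config",
]
print(config_push_status(sample))
```

If either marker is missing after a switchover, the ConfigMgr logs are the next place to look.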
rcm-config-push-complete end-of-config
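Identify the problematic CLI in the pushed configuration file (this can be determined from the RCM ConfigMgr logs).
SFTP-related issues can occur when the RCM tries to send the configuration but has difficulty establishing a connection with the UP. These issues can stem from password complexity or other factors that affect SFTP operation.
Review the ConfigMgr logs to monitor the SFTP status and identify configuration errors. Typical log entries are shown here.
SFTP logs in the RCM ConfigMgr logs appear as: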
{"log":"2022/11/12 23:53:09.066 rcm-configmgr [DEBUG] [sshclient.go:395] [rcm_grpc_ep.msg-process.Int] Initiate a sftp connection to host: X.X.X.X \n","stream":"stdout","time":"2022-11-12T23:53:09.067894173Z"}
{"log":"2022/11/12 23:53:09.066 rcm-configmgr [DEBUG] [sftpClient.go:26] [rcm_grpc_ep.grpc.Int] Conneting to host X.X.X.X for sftp with src path: /cfg/ConfigMgr/upfconfig10-103-108-154_22.cfg and dst path: /sftp/10-103-108-154_22.cfg \n","stream":"stdout","time":"2022-11-12T23:53:09.067903156Z"}
{"log":"2022/11/12 23:53:09.203 rcm-configmgr [DEBUG] [sftpClient.go:58] [rcm_grpc_ep.grpc.Int] Successfully opened the file%!(EXTRA string=/cfg/ConfigMgr/upfconfig10-103-108-154_22.cfg)\n","stream":"stdout","time":"2022-11-12T23:53:09.203698078Z"}
{"log":"2022/11/12 23:53:09.211 rcm-configmgr [DEBUG] [sftpClient.go:66] [rcm_grpc_ep.grpc.Int] Total bytes copied 405933: \n","stream":"stdout","time":"2022-11-12T23:53:09.212063509Z"}
Password expiry during SFTP, as observed in the UP syslogs:
2022-May-16+17:45:02.834 [cli 30005 info] [1/0/14263 <cli:1014263> _commands_cli.c:1474] [software internal system syslog] CLI session ended for Security Administrator admin on device /dev/pts/5
2022-May-16+17:45:02.834 [cli 30024 error] [1/0/14263 <cli:1014263> cli.c:1657] [software internal system syslog] Misc error: Password change required rc=0
2022-May-16+17:45:02.834 [cli 30087 info] [1/0/14263 <cli:1014263> cli.c:1352] [software internal system critical-info syslog] USER user 'admin' password has expired beyond grace period
2022-May-16+17:45:02.594 [cli 30004 info] [1/0/14263 <cli:1014263> cli_sess.c:164] [software internal system syslog] CLI session started for Security Administrator admin on device /dev/pts/5 from X.X.X.X
2022-May-16+17:45:02.537 [cli 30028 debug] [1/0/9816 <vpnmgr:1> luser_auth.c:1598] [context: local, contextID: 1] [software internal system syslog] Login attempt failure for user admin IP address X.X.X.X - Access type ssh/sftp
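If the password causes the SFTP issue, consider generating a new password or extending the password expiry.
If password issues are ruled out, check the number of concurrent SFTP sessions, because too many sessions can cause SFTP to break.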
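The password-related symptoms can be separated from other SFTP failures by searching the UP syslog for the messages shown in the sample. The Python sketch below is an illustration; the helper name `classify_sftp_login_issues` and the category labels are hypothetical, and the search strings are copied from the sample syslog lines only.

```python
# Category label -> substring taken from the sample UP syslog entries.
SFTP_HINTS = {
    "password_expired": "password has expired",
    "password_change_required": "Password change required",
    "login_failure": "Login attempt failure",
}

def classify_sftp_login_issues(lines):
    """Group syslog lines by the password or login symptom they contain."""
    found = {name: [] for name in SFTP_HINTS}
    for line in lines:
        for name, needle in SFTP_HINTS.items():
            if needle in line:
                found[name].append(line.strip())
    return found

sample = [
    "2022-May-16+17:45:02.834 [cli 30024 error] Misc error: Password change required rc=0",
    "2022-May-16+17:45:02.834 [cli 30087 info] USER user 'admin' password has expired beyond grace period",
    "2022-May-16+17:45:02.537 [cli 30028 debug] Login attempt failure for user admin IP address X.X.X.X - Access type ssh/sftp",
    "2022-May-16+17:45:02.594 [cli 30004 info] CLI session started for Security Administrator admin",
]
for name, hits in classify_sftp_login_issues(sample).items():
    print(name, len(hits))
```

If none of the categories has a match, the password is less likely to be the cause and the concurrent session count is the next item to check.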
Revision | Publish Date | Comments |
---|---|---|
1.0 | 14-Aug-2023 | Initial Release |