Introduction
This document describes an RCM-based UPF upgrade failure that occurs when configmgr is missing a host entry.
Problem
When the RCM (Redundancy Configuration Manager) controller initiates a planned UPF (User Plane Function) switchover from UPF 1 (Active) to UPF 2 (Standby), configmgr is expected to have both UPF 1 and UPF 2 in its host list. In this failure scenario, however, configmgr for some reason does not have the Active UPF 1 in its host list, which contradicts the host list on the controller.
When RCM triggers a switchover from UPF 1 to UPF 2 in this condition, the switchover process is initiated. During the switchover, configmgr looks up the Active UPF 1 host details in its host list but does not find them.
The UPF switchover then fails with the reason "Old Active moved from PendingStandby to Active due to timeout in receiving Standby state (planned switchover)". UPF 1 moves from PendingStandby back to Active, and UPF 2 moves from PendingActive back to Standby.
How to detect that a switchover failure is due to configmgr missing host details in its host list
In the RCM tac dbg that covers the time of the switchover failure, look for this log event in the configmgr pod log:
2024/01/12 09:08:26.878 rcm-configmgr [DEBUG] [sshclient.go:980] [rcm_grpc_ep.msg-process.Int] [RcmGenTrap]: SNMP Trap Raised: (SwitchoverFailure) - Switchover from 10.248.187.151:22 to 10.248.187.153:22 in Group:1 Failed! Reason : Active Not Found
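A search for this event can be scripted. The following is a minimal Python sketch, not a Cisco tool: the pattern is derived only from the sample log line above, and the function name find_active_not_found is hypothetical. How the configmgr pod log is extracted from the tac dbg is not covered here, so the function simply takes an iterable of log lines.

```python
import re

# Pattern built from the sample configmgr log line shown above.
# It captures the old Active, the target Standby, and the group number.
FAILURE_RE = re.compile(
    r"\(SwitchoverFailure\) - Switchover from (?P<active>\S+) to (?P<standby>\S+) "
    r"in Group:(?P<group>\d+) Failed! Reason : Active Not Found"
)


def find_active_not_found(log_lines):
    """Return (active, standby, group) for every matching log line."""
    hits = []
    for line in log_lines:
        match = FAILURE_RE.search(line)
        if match:
            hits.append(
                (match.group("active"), match.group("standby"), int(match.group("group")))
            )
    return hits


sample = (
    "2024/01/12 09:08:26.878 rcm-configmgr [DEBUG] [sshclient.go:980] "
    "[rcm_grpc_ep.msg-process.Int] [RcmGenTrap]: SNMP Trap Raised: "
    "(SwitchoverFailure) - Switchover from 10.248.187.151:22 to "
    "10.248.187.153:22 in Group:1 Failed! Reason : Active Not Found"
)
print(find_active_not_found([sample]))
# prints [('10.248.187.151:22', '10.248.187.153:22', 1)]
```

In practice the list of lines would come from the configmgr pod log file, for example by iterating over an open file object.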
If the RCM tac dbg is not available, you can also confirm that the UPF switchover failed due to this issue by checking the SNMP traps from the RCM controller ops-center.
a) Log in to the Active RCM ops-center.
b) Run the command rcm show-snmp-trap history.
c) In the trap history, look for a SwitchoverFailure trap like this one:
SwitchoverFailure 2024-01-18T05:19:45.Z 2024-01-18T05:19:45.Z rcm-configmgr Switchover from 10.244.127.23:22 to 10.244.127.29:22 in Group:1 Failed! Reason : Active Not Found
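If the trap history is saved to a text file, the check in step c can be automated. The sketch below is illustrative only. It assumes each trap is one whitespace-separated row laid out like the sample above (trap name, two timestamps, source, then the free-text message); the field names in TrapRow are hypothetical labels, not column headers taken from the ops-center output.

```python
from typing import NamedTuple, Optional


class TrapRow(NamedTuple):
    # Field names are illustrative labels for the columns seen in the sample row.
    name: str
    first_timestamp: str
    second_timestamp: str
    source: str
    message: str


def parse_trap_row(row: str) -> Optional[TrapRow]:
    """Split a trap history row into its first four fields plus the message."""
    parts = row.split(None, 4)
    if len(parts) != 5:
        return None
    return TrapRow(*parts)


def is_active_not_found(row: str) -> bool:
    """True if the row is a SwitchoverFailure trap with reason Active Not Found."""
    trap = parse_trap_row(row)
    return (
        trap is not None
        and trap.name == "SwitchoverFailure"
        and trap.message.rstrip().endswith("Reason : Active Not Found")
    )


trap_line = (
    "SwitchoverFailure 2024-01-18T05:19:45.Z 2024-01-18T05:19:45.Z rcm-configmgr "
    "Switchover from 10.244.127.23:22 to 10.244.127.29:22 in Group:1 Failed! "
    "Reason : Active Not Found"
)
print(is_active_not_found(trap_line))  # prints True
```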
Solution
Until a permanent fix is available through Cisco bug ID CSCwi70133, the workaround is to delete the configmgr pod from the corresponding AIO (All In One) K8s master node with kubectl delete pod <configmgr-pod-name> -n <k8-name-space>
Example:
1. As part of the pre-checks in the UPF upgrade automation workflow, a check that compares the controller and configmgr host lists can be added. If a host is missing from the configmgr host list, delete the configmgr pod so that configmgr gets a fresh, complete host list from the controller.
2. If the UPF switchover is triggered manually, collect the output of the two CLI commands listed in a) and b) from the active RCM and compare them to find whether any host (Active or Standby) is missing from the configmgr output. If a host is missing, delete the configmgr pod from the RCM AIO K8s master node and recheck the controller and configmgr host lists. Once the hosts match on the controller and configmgr, proceed with the manual switchover of the UPFs from the controller.
a) rcm show-statistics controller
b) rcm show-statistics configmgr
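The comparison in both examples comes down to a set difference between the two host lists. The Python sketch below shows that logic only. The output format of the two show-statistics commands is not reproduced in this document, so the sketch assumes the host addresses have already been extracted into two collections. The function names and sample addresses are hypothetical.

```python
def hosts_missing_in_configmgr(controller_hosts, configmgr_hosts):
    """Return the hosts known to the controller but absent from configmgr."""
    return sorted(set(controller_hosts) - set(configmgr_hosts))


def precheck(controller_hosts, configmgr_hosts):
    """Return a short recommendation based on the host list comparison."""
    missing = hosts_missing_in_configmgr(controller_hosts, configmgr_hosts)
    if missing:
        return "Delete the configmgr pod and recheck. Missing hosts: " + ", ".join(missing)
    return "Host lists match. Proceed with the switchover."


# Illustrative data only: the Active UPF is missing from the configmgr list.
controller = ["10.248.187.151", "10.248.187.153"]
configmgr = ["10.248.187.153"]
print(precheck(controller, configmgr))
# prints Delete the configmgr pod and recheck. Missing hosts: 10.248.187.151
```

Only hosts missing on the configmgr side are reported, because that is the condition this document describes. A host present in configmgr but absent on the controller would need a separate check.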