Introduction
This document describes how to implement a workaround for Network Time Protocol (NTP) time sync issues in 5G Subscriber Microservices Infrastructure (SMI) protocol virtual machines (protocol VMs).
Prerequisites
Requirements
Cisco recommends that you have the knowledge of these topics:
- Cisco SMI
- 5G Cloud Native Deployment Platform (CNDP) architecture
- Cloud - Red Hat OpenStack Platform 13 ("Queens" release)
- Dockers and Kubernetes
Components used
The information in this document is based on these software and hardware versions:
- SMI 2020.01.1-21
- Kubernetes v1.16.2
- Red Hat OpenStack Platform 13 director
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Background Information
What is SMI?
Cisco SMI is a layered stack of cloud technologies and standards that enable microservices-based applications from the Cisco Mobility, Cable, and Broadband Network Gateway (BNG) business units. These applications have similar subscriber management functions and similar datastore requirements.
Attributes:
- The Layer Cloud Stack (technologies and standards) provides top-to-bottom deployments and accommodates current cloud infrastructures.
- All applications share the Common Execution Environment (CEE) for non-application functions (data storage, deployment, configuration, telemetry, alarm), which provides a consistent interaction and experience for all customer touchpoints and integration points.
- Applications and the CEE are deployed in microservice containers and are connected with an Intelligent Service Mesh.
- An exposed API for deployment, configuration, and management enables automation.
What is CEE?
- The CEE is a software solution that was developed to monitor mobile and cable applications that are deployed on the SMI. The CEE captures information (key metrics) from the applications in a centralized way for engineers to debug and troubleshoot.
- The CEE is the common set of tools that are installed for all the applications. It comes equipped with a dedicated Ops Center, which provides the Command Line Interface (CLI) and APIs to manage the monitor tools. Only one CEE is available for each cluster.
What are Protocol VMs?
Protocol VMs host the Session Management Function (SMF) application microservices that are responsible for the protocol translation. Two protocol VMs are deployed per instance of SMF. A protocol VM has pods such as GTPc, Lawful Intercept (LI), Radius, Rest, and so on.
Problem
From a node (protocol VM) that exhibits the issue, you can see that the network attempts to use two different interfaces to route the connectivity to the NTP server. However, only one of the networks can reach the IP.
When you run a ping without the interface for some time, you can see that it causes packet loss.
Sample alert from CEE:
[pod-name-cnat/global] cee# show alerts active summary
NAME UID SEVERITY STARTS AT SOURCE SUMMARY
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
VoNR_Ded_Bearer_Creat be183f61dd65 major 01-18T16:32:17 System Success percentage of VoNR - PCF Initiated Dedicated Bearer Creation procedure in NR turns less than t...
clock-is-not-in-synch 38a35e9a8bd8 major 01-18T13:11:15 pod-name-cnat-cnat-co Clock not in synch detected on hostname pod-name-cnat-cnat-core-protocol-data1 . Ensure NTP is configu...
clock-is-not-in-synch 4a7e138b8bae major 01-18T13:07:35 pod-name-cnat-cnat-co Clock not in synch detected on hostname pod-name-cnat-cnat-core-protocol-ims2 . Ensure NTP is configur...
clock-is-not-in-synch 5b3e128f0101 major 01-18T12:05:55 pod-name-cnat-cnat-co Clock not in synch detected on hostname pod-name-cnat-cnat-core-protocol-data2 . Ensure NTP is configu...
container-memory-usag f588aa627792 critical 12-28T15:47:30 path-provisioner-v6zx Pod cee-global/path-provisioner-v6zxj/k8s_path-provisioner_path-provisioner-v6zxj_cee-global_f1a1e8fa-...
container-memory-usag 55b2ea84466f critical 12-27T22:27:10 path-provisioner-v6zx Pod cee-global/path-provisioner-v6zxj/ uses high memory 82.69%.
N11_SM_Timeout_SR f6fb82cac197 major 10-31T15:51:16 System This alert is fired when the increase in timeout for N11 messages toward AMF crosses threshold
Radius_Server_RTT 94ac2cff43a3 warning 10-28T05:08:56 System RTT for Radius Server: 216.x.x.x, Port: 1813 in namespace: smf-mvno is more than 5 ms.
Radius_Server_RTT d02b60b3a3a4 warning 10-28T05:05:56 System RTT for Radius Server: 52.x.x.x, Port: 1813 in namespace: smf-mvno is more than 5 ms.
Radius_Server_RTT 9afcee101013 warning 10-28T05:05:36 System RTT for Radius Server: 52.x.x.x, Port: 1813 in namespace: smf-mvno is more than 5 ms.
N11_SM_Timeout_SR 206c4bbccf21 major 10-20T13:26:16 System This alert is fired when the increase in timeout for N11 messages toward AMF crosses threshold
N4_Total_Outbound_Int 4e0f3c1d6300 major 10-20T09:01:23 System This alert is fired when the percentage of N4 outbound responses sent is lesser than threshold %.
N4_Total_Outbound_Int 3e2fed624704 major 10-20T08:21:23 System This alert is fired when the percentage of N4 outbound responses sent is lesser than threshold %.
watchdog 97a7976a4103 minor 05-05T09:10:58 System This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always...
Troubleshoot
-
Log into the impacted VMs and run this command: ip route | grep -i default
In this example, you can see two default routes, which is incorrect. (There should be only one default route.)
ubuntu@pod-name-cnat-cnat-core-protocol-ims2:~$ ip route | grep -i default
default via 172.16.x.x dev ens4 proto dhcp src 172.16.x.x metric 100
default via 172.16.x.x dev ens3 proto dhcp src 172.16.x.x metric 100
-
Check the status of the chronyd (NTP) service. Even though the service is up, you will see "Can't synchronise" errors.
Sample output:
root@pod-name-cnat-cnat-core-protocol-data1:/home/ubuntu# systemctl status chronyd.service
chrony.service - chrony, an NTP client/serverLoaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-01-08 07:19:19 UTC; 11 months 24 days ago
Docs: man:chronyd(8)
man:chronyc(1)
man:chrony.conf(5)
Main PID: 4300 (chronyd)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/chrony.service
└─4300 /usr/sbin/chronyd
Dec 30 17:06:28 pod-name-cnat-cnat-core-protocol-data1 chronyd[4300]: Selected source 5.196.x.x
Dec 31 06:37:11 pod-name-cnat-cnat-core-protocol-data1 chronyd[4300]: Can't synchronise: no selectable sources
Dec 31 17:15:31 pod-name-cnat-cnat-core-protocol-data1 chronyd[4300]: Selected source 5.196.x.x
Jan 01 06:29:43 pod-name-cnat-cnat-core-protocol-data1 chronyd[4300]: Can't synchronise: no selectable sources
Jan 01 16:50:02 pod-name-cnat-cnat-core-protocol-data1 chronyd[4300]: Selected source 5.196.x.x
Jan 01 19:25:05 pod-name-cnat-cnat-core-protocol-data1 chronyd[4300]: Can't synchronise: no selectable sources
-
Check the status of the network-config service. (It is inactive.)
ubuntu@pod-name-cnat-cnat-core-protocol-data1:~$ systemctl status network-config
network-config.service - Job that configures for routes
Loaded: loaded (/etc/systemd/system/network-config.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Fri 2021-01-08 09:29:05 UTC; 1 years 0 months ago
Process: 1185 ExecStart=/etc/cisco/network-config.sh start (code=killed, signal=TERM)
Main PID: 1185 (code=killed, signal=TERM)
Workaround
Note: This procedure does not cause any downtime in the application.
-
Edit the network-config service script (/etc/systemd/system/network-config.service
) in the affected node to add these lines:
root@pod-name-cnat-cnat-core-protocol-data1:/home/ubuntu# vim /etc/systemd/system/network-config.service
Restart=always
RestartSec=10s
StartLimitIntervalSec=300s
StartLimitBurst=30
-
Start the network-config service:
root@pod-name-cnat-cnat-core-protocol-data1:/home/ubuntu# systemctl start network-config
-
Reload the configuration with systemctl:
root@pod-name-cnat-cnat-core-protocol-data1:/home/ubuntu# systemctl daemon-reload
-
Restart the chronyd service:
root@pod-name-cnat-cnat-core-protocol-data1:/home/ubuntu# systemctl restart chronyd.service
Verification
After you complete the Workaround steps, run this command to verify that the issues are fixed:
for server in $(awk '/protocol/{print $1}' /etc/hosts); do ssh $server "hostname && timedatectl status && chronyc sources -vn" ; done
Note: A 'System clock synchronized' value of 'yes' indicates that the issues are fixed.
Sample output:
ubuntu@pod-name-cnat-cnat-core-master1:~$ for server in $(awk '/protocol/{print $1}' /etc/hosts); do ssh $server "hostname && timedatectl status && chronyc sources -vn" ; done
pod-name-cnat-cnat-core-protocol-data1
Local time: Tue 2021-09-07 10:22:16 UTC
Universal time: Tue 2021-09-07 10:22:16 UTC
RTC time: Tue 2021-09-07 10:22:17
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: no
RTC in local TZ: no
210 Number of sources = 1
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* svtsnq01.mgmt 1 6 170 282 -1100us[-1858us] +/- 47ms
pod-name-cnat-cnat-core-protocol-data2
Local time: Tue 2021-09-07 10:22:16 UTC
Universal time: Tue 2021-09-07 10:22:16 UTC
RTC time: Tue 2021-09-07 10:22:17
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: no
RTC in local TZ: no
210 Number of sources = 1
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* svtsnq01.mgmt 1 6 74 206 -218us[-1550us] +/- 46ms
pod-name-cnat-cnat-core-protocol-ims1
Local time: Tue 2021-09-07 10:22:17 UTC
Universal time: Tue 2021-09-07 10:22:17 UTC
RTC time: Tue 2021-09-07 10:22:18
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: no
RTC in local TZ: no
210 Number of sources = 1
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* svtsnq01.mgmt 1 6 77 24 +796us[+1044us] +/- 47ms
pod-name-cnat-cnat-core-protocol-ims2
Local time: Tue 2021-09-07 10:22:17 UTC
Universal time: Tue 2021-09-07 10:22:17 UTC
RTC time: Tue 2021-09-07 10:22:18
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: no
RTC in local TZ: no
210 Number of sources = 1
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* svtsnq01.mgmt 1 6 17 98 +2176us[+2275us] +/- 47ms