Introduction
This document describes how to implement the workaround for the Subscriber Microservices Infrastructure (SMI) Common Execution Environment (CEE) pgpool POD restart issues.
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
- Cisco SMI CEE (Ultra Cloud Core CEE)
- 5G Cloud Native Deployment Platform (CNDP) or SMI Bare Metal (BM) architecture
- Docker and Kubernetes
Components Used
The information in this document is based on these software and hardware versions:
- SMI 2020.02.2.35
- Kubernetes v1.21.0
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Background Information
What is SMI?
Cisco SMI is a layered stack of cloud technologies and standards that enable microservices-based applications from Cisco Mobility, Cable, and Broadband Network Gateway (BNG) business units – all of which have similar subscriber management functions and similar datastore requirements.
The attributes are:
- Layered cloud stack (technologies and standards) that provides top-to-bottom deployments and also accommodates current customer cloud infrastructures.
- The CEE is shared by all applications for non-application functions (data storage, deployment, configuration, telemetry, and alarm). This provides consistent interaction and experience for all customer touchpoints and integration points.
- Applications and the CEE are deployed in microservice containers and connected with an Intelligent Service Mesh.
- Exposed API for deployment, configuration, and management in order to enable automation.
What is SMI CEE?
The CEE is a software solution developed to monitor mobile and cable applications that are deployed on the SMI. The CEE captures information (key metrics) from the applications in a centralized way for engineers to debug and troubleshoot.
The CEE is the common set of tools that are installed for all the applications. It comes equipped with a dedicated Ops Center, which provides the command line interface (CLI) and APIs to manage the monitoring tools. There is only one CEE available for each cluster.
What are CEE PODs?
A POD is the smallest deployable unit that runs on your Kubernetes cluster. A POD encapsulates one or more containers.
Kubernetes deploys one or more PODs on a single node, which can be a physical or virtual machine. Each POD has a discrete identity with an internal IP address and port space, but the containers within a POD can share storage and network resources. The CEE has a number of PODs with unique functions; pgpool and postgres are among them.
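If you want to list the CEE PODs and the nodes where they run, a standard Kubernetes command such as the one shown here can be used from the master; the namespace is a placeholder for your CEE namespace:
kubectl get pods -n <cee-namespace> -o wide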
What is pgpool POD?
Pgpool manages the Postgres resource pool for connection pooling, replication, load balancing, and so on. Pgpool is middleware that works between PostgreSQL servers and a PostgreSQL database client.
What is Postgres POD?
Postgres provides a Structured Query Language (SQL) database with redundancy in order to store alerts and Grafana dashboards.
Problem
The pgpool PODs restart repeatedly while the postgres PODs run with no issues.
In order to display the alerts, enter this command:
show alerts active summary | include "POD_|k8s-pod-"
A sample alert from the CEE is shown here.
[pod-name-smf-data/podname] cee# show alerts active summary | include "POD_|k8s-pod-"
k8s-pod-crashing-loop 1d9d2b113073 critical 12-15T21:47:39 pod-name-smf-data-mas
Pod cee-podname/grafana-65cbdb9846-krgfq (grafana) is restarting 1.03 times / 5 minutes.
POD_Restarted 04d42efb81de major 12-15T21:45:44 pgpool-67f48f6565-vjt Container=
k8s_pgpool_pgpool-67f48f6565-vjttd_cee-podname_a9f68607-eac4-40a9-86ef-db8176e0a22a_1474 of pod= pgpool-...
POD_Restarted f7657a0505c2 major 12-15T21:45:44 postgres-0 Container=
k8s_postgres_postgres-0_cee-podname_59e0a768-6870-4550-8db3-32e2ab047ce2_1385 of pod= postgres-0 in name...
POD_Restarted 6e57ae945677 major 12-15T21:45:44 alert-logger-d96644d4 Container=
k8s_alert-logger_alert-logger-d96644d4-dsc8h_cee-podname_2143c464-068a-418e-b5dd-ce1075b9360e_2421 of po...
k8s-pod-crashing-loop 5b8e6a207aad critical 12-15T21:45:09 pod-name-smf-data-mas Pod
cee-podname/pgpool-67f48f6565-vjttd (pgpool) is restarting 1.03 times / 5 minutes.
POD_Down 45a6b9bf73dc major 12-15T20:30:44 pgpool-67f48f6565-qbw Pod= pgpool-67f48f6565-qbw52 in namespace=
cee-podname is DOWN for more than 15min
POD_Down 4857f398a0ca major 12-15T16:40:44 pgpool-67f48f6565-vjt Pod= pgpool-67f48f6565-vjttd in namespace=
cee-podname is DOWN for more than 15min
k8s-pod-not-ready fc65254c2639 critical 12-11T21:07:29 pgpool-67f48f6565-qbw Pod
cee-podname/pgpool-67f48f6565-qbw52 has been in a non-ready state for longer than 1 minute.
k8s-pod-not-ready 008b859e7333 critical 12-11T16:35:49 pgpool-67f48f6565-vjt Pod
cee-podname/pgpool-67f48f6565-vjttd has been in a non-ready state for longer than 1 minute.
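You can also confirm the restart counts of the pgpool and postgres PODs from the Kubernetes master, for example:
kubectl get pods -A -o wide | egrep 'postgres|pgpool'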
Troubleshoot
From the Kubernetes master, enter this command:
kubectl describe pods -n <namespace> postgres-0
The sample output of the POD description is shown here. The output is truncated.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 14m default-scheduler Successfully assigned cee-pod-name-l1/postgres-2
to pod-name-master-3
Normal Pulling 14m kubelet Pulling image "docker.10.192.x.x.nip.io/cee-2020.02.2.i38/
smi-libraries/postgresql/2020.02.2/postgres:1.3.0-946d87d"
Normal Pulled 13m kubelet Successfully pulled image "docker.10.192.x.x.nip.io/cee-2020.02.2.i38/
smi-libraries/postgresql/2020.02.2/postgres:1.3.0-946d87d" in 29.048094722s
Warning Unhealthy 12m kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:09:48]
pod is not ready
Warning Unhealthy 10m kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:11:18]
pod is not ready
Warning Unhealthy 10m kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:11:48]
pod is not ready
Warning Unhealthy 9m49s kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:12:18]
pod is not ready
Warning Unhealthy 9m19s kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:12:48]
pod is not ready
Warning Unhealthy 8m49s kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:13:18]
pod is not ready
Warning Unhealthy 8m19s kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:13:48]
pod is not ready
Warning Unhealthy 7m49s kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:14:18]
pod is not ready
Warning Unhealthy 7m19s kubelet Readiness probe failed: [bin][h][ir] >>> [2021-10-11 18:14:48]
pod is not ready
Warning BackOff 6m44s kubelet Back-off restarting failed container
Or
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 13m default-scheduler 0/5 nodes are available: 2 node(s)
didn't match Pod's node affinity/selector, 3 node(s) didn't find available persistent
volumes to bind.
Normal Scheduled 13m default-scheduler Successfully assigned cee-pod-name-l1/postgres-0
to pod-name-master-1
Warning FailedScheduling 13m default-scheduler 0/5 nodes are available: 2 node(s)
didn't match Pod's node affinity/selector, 3 node(s) didn't find available
persistent volumes to bind.
Normal Pulling 13m kubelet Pulling image "docker.10.192.x.x.nip.io/cee-2020.02.2.i38/
smi-libraries/postgresql/2020.02.2/postgres:1.3.0-946d87d"
Normal Pulled 12m kubelet Successfully pulled image "docker.10.192.x.x.nip.io/
cee-2020.02.2.i38/smi-libraries/postgresql/2020.02.2/postgres:1.3.0-946d87d"
in 43.011763302s
Warning Unhealthy 7m20s kubelet Liveness probe failed: [bin][h][imm] >>>
[2021-10-11 18:09:16] My name is pg-postgres-0
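As an additional check that is not part of the outputs shown here, you can review the previous container logs of a pgpool POD that restarts and the state of the Postgres persistent volume claims from the Kubernetes master. The POD name and namespace are placeholders for values from your own cluster:
kubectl logs -n <namespace> <pgpool-pod-name> --previous
kubectl get pvc -n <namespace>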
Workaround
Note: This procedure does not cause any downtime in the application.
Shut Down the CEE
In order to shut down the CEE, enter these commands from the CEE:
[pod-name-smf-data/podname] cee#
[pod-name-smf-data/podname] cee# config terminal
Entering configuration mode terminal
[pod-name-smf-data/podname] cee(config)# system mode shutdown
[pod-name-smf-data/podname] cee(config)# commit
Commit complete.
Wait for the system to go to 100%.
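While the shutdown is in progress, you can optionally watch the CEE PODs terminate from the Kubernetes master; the namespace is a placeholder for your CEE namespace:
watch kubectl get pods -n <cee-namespace>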
Delete Content From Folders
From the master VIP, SSH to each of the master VMs and remove these folders: /data/cee-podname/data-postgres-[0-2].
Master 1
cloud-user@pod-name-smf-data-master-1:~$ sudo rm -rf /data/cee-podname/data-postgres-0
Master 2
cloud-user@pod-name-smf-data-master-2:~$ sudo rm -rf /data/cee-podname/data-postgres-1
Master 3
cloud-user@pod-name-smf-data-master-3:~$ sudo rm -rf /data/cee-podname/data-postgres-2
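Optionally, confirm on each master that the respective folder no longer exists before you proceed; the path uses the same placeholder namespace as the previous commands:
ls -ld /data/cee-podname/data-postgres-*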
Restore the CEE
In order to restore the CEE, enter these commands from the CEE:
[pod-name-smf-data/podname] cee#
[pod-name-smf-data/podname] cee# config terminal
Entering configuration mode terminal
[pod-name-smf-data/podname] cee(config)# system mode running
[pod-name-smf-data/podname] cee(config)# commit
Commit complete.
Wait for the system to go to 100%.
Post Checks
Verify Kubernetes from the master.
cloud-user@pod-name-smf-data-master-1:~$ kubectl get pods -A -o wide | egrep 'postgres|pgpool'
All PODs must display as up and running without any restarts.
Verify Alerts are Cleared From the CEE
In order to verify that alerts are cleared from the CEE, enter this command:
show alerts active summary | include "POD_|k8s-pod-"
Also, you can enter this command in order to ensure there is one master and two standby DBs:
echo "0----------------------------------";kubectl
exec -it postgres-0 -n $(kubectl get pods -A | grep postgres | awk '{print $1}' | head -1)
-- /usr/local/bin/cluster/healthcheck/is_major_master.sh;echo "1--------------------------
--------";kubectl exec -it postgres-1 -n $(kubectl get pods -A | grep postgres | awk '{print $1}'
| head -1) -- /usr/local/bin/cluster/healthcheck/is_major_master.sh;echo "2---------------
-------------------"; kubectl exec -it postgres-2 -n $(kubectl get pods -A | grep postgres |
awk '{print $1}' | head -1) -- /usr/local/bin/cluster/healthcheck/is_major_master.sh;
The sample expected output is:
cloud-user@pod-name-smf-data-master-1:~$ echo "0----------------------------------";kubectl
exec -it postgres-0 -n $(kubectl get pods -A | grep postgres | awk '{print $1}' | head -1)
-- /usr/local/bin/cluster/healthcheck/is_major_master.sh;echo "1--------------------------
--------";kubectl exec -it postgres-1 -n $(kubectl get pods -A | grep postgres | awk '{print $1}'
| head -1) -- /usr/local/bin/cluster/healthcheck/is_major_master.sh;echo "2---------------
-------------------"; kubectl exec -it postgres-2 -n $(kubectl get pods -A | grep postgres |
awk '{print $1}' | head -1) -- /usr/local/bin/cluster/healthcheck/is_major_master.sh;
0----------------------------------
[bin][h][imm] >>> [2021-12-15 22:05:18] My name is pg-postgres-0
[bin][h][imm] >>> My state is good.
[bin][h][imm] >>> I'm not a master, nothing else to do!
1----------------------------------
[bin][h][imm] >>> [2021-12-15 22:05:19] My name is pg-postgres-1
[bin][h][imm] >>> My state is good.
[bin][h][imm] >>> I think I'm master. Will ask my neighbors if they agree.
[bin][h][imm] >>> Will ask nodes from PARTNER_NODES list
[bin][h][imm] >>> Checking node pg-postgres-0
[bin][h][imm] >>>>>>>>> Count of references to potential master pg-postgres-1 is 1 now
[bin][h][imm] >>> Checking node pg-postgres-1
[bin][h][imm] >>> Checking node pg-postgres-2
[bin][h][imm] >>>>>>>>> Count of references to potential master pg-postgres-1 is 2 now
[bin][h][imm] >>> Potential masters got references:
[bin][h][imm] >>>>>> Node: pg-postgres-1, references: 2
[bin][h][imm] >>> I have 2/2 incoming reference[s]!
[bin][h][imm] >>>> 2 - Does anyone have more?
[bin][h][imm] >>> Yahoo! I'm real master...so I think!
2----------------------------------
[bin][h][imm] >>> [2021-12-15 22:05:21] My name is pg-postgres-2
[bin][h][imm] >>> My state is good.
[bin][h][imm] >>> I'm not a master, nothing else to do!