Introduction
This document describes the recovery steps when a pod on a Cisco Subscriber Microservices Infrastructure (SMI) cluster gets into the not ready state due to Kubernetes bug https://github.com/kubernetes/kubernetes/issues/82346.
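With this bug, an affected pod typically keeps running, but its Ready condition stays False, so its replica set or daemonset reports fewer Ready members than desired. As an illustrative check (not part of the original procedure), the Ready condition of a suspect pod can be inspected directly; replace the pod name and namespace with your own values.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'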
Problem
After site isolation, the Common Execution Environment (CEE) reported a Processing Error Alarm, and the system ready status is below 100%.
[site1app/pod1] cee# show alerts active
alerts active k8s-deployment-replica-mismatch f89d8d09389c
state active
severity critical
type "Processing Error Alarm"
startsAt 2021-05-27T08:38:58.703Z
source site1app-smi-cluster-policy-oam2
labels [ "component: kube-state-metrics" "deployment: prometheus-scrapeconfigs-synch" "exported_namespace: cee-pod1" "instance: 192.0.2.37:8080" "job: kubernetes-pods" "namespace: cee-pod1" "pod: kube-state-metrics-6c476f7494-tqkrc" "pod_template_hash: 6c476f7494" "release: cee-pod1-cnat-monitoring" ]
annotations [ "summary: Deployment cee-pod1/prometheus-scrapeconfigs-synch has not matched the expected number of replicas for longer than 2 minutes." ]
[site1app/pod1] cee# show system status
system status deployed true
system status percent-ready 92.68
ubuntu@site1app-smi-cluster-policy-mas01:~$ kubectl get rs -n cee-pod1 | grep scrape
NAME DESIRED CURRENT READY AGE
prometheus-scrapeconfigs-synch-ccd454f76 1 1 0 395d
prometheus-scrapeconfigs-synch-f5544b4f8 0 0 0 408d
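For completeness (this listing is not part of the original capture), the pods behind the mismatched replica set can be displayed directly; on an affected system the pod typically shows 0/1 in the READY column while its STATUS remains Running.
kubectl get pods -n cee-pod1 | grep prometheus-scrapeconfigs-synch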
Solution
Site isolation triggers the bug https://github.com/kubernetes/kubernetes/issues/82346. The workaround to bring the affected pods back to the Ready state is to restart them. The fix is included in upcoming CEE releases.
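Deleting an affected pod, as shown in the steps that follow, causes its controller to recreate it. As an illustrative alternative for deployment-managed pods only (not part of the documented workaround), the same effect can be achieved with a rollout restart of the owning deployment.
kubectl rollout restart deployment prometheus-scrapeconfigs-synch -n cee-pod1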
Initial Pod and System Verification
Log in to the CEE CLI and check the system status.
ssh -p 2024 admin@`kubectl get svc -A| grep " ops-center-cee" | awk '{print $4}'`
show alerts active
show system status
Restart of Affected Pods
Log in to the primary node and run these commands to identify the daemonsets and replica sets that do not have all members in the Ready state.
kubectl get daemonsets -A
kubectl get rs -A | grep -v '0 0 0'
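As an optional cross-check (not part of the original procedure), this one-liner prints every pod whose READY count does not match its expected count; the awk field positions assume the default kubectl column layout.
kubectl get pods -A --no-headers | awk '{ split($3, r, "/"); if (r[1] != r[2]) print $1, $2, $3, $4 }'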
Copy these commands into a notepad and replace every instance of cee-xyz with the CEE namespace on the site; a convenience loop that wraps the CEE-namespace commands is shown after the list.
kubectl describe pods core-retriever -n cee-xyz | egrep "^Name:|False" | grep -B1 False
kubectl describe pods calico-node -n kube-system | egrep "^Name:|False" | grep -B1 False
kubectl describe pods csi-cinder-nodeplugin -n kube-system | egrep "^Name:|False" | grep -B1 False
kubectl describe pods maintainer -n kube-system | egrep "^Name:|False" | grep -B1 False
kubectl describe pods kube-proxy -n kube-system | egrep "^Name:|False" | grep -B1 False
kubectl describe pods path-provisioner -n cee-xyz | egrep "^Name:|False" | grep -B1 False
kubectl describe pods logs-retriever -n cee-xyz | egrep "^Name:|False" | grep -B1 False
kubectl describe pods node-exporter -n cee-xyz | egrep "^Name:|False" | grep -B1 False
kubectl describe pods keepalived -n smi-vips | egrep "^Name:|False" | grep -B1 False
kubectl describe pods prometheus-scrapeconfigs-synch -n cee-xyz | egrep "^Name:|False" | grep -B1 False
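If preferred, the per-pod checks for the CEE namespace can be wrapped in a simple shell loop instead of being pasted one by one. This is only a convenience sketch and assumes the same cee-xyz placeholder used above.
for p in core-retriever path-provisioner logs-retriever node-exporter prometheus-scrapeconfigs-synch; do
  kubectl describe pods $p -n cee-xyz | egrep "^Name:|False" | grep -B1 False
done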
Execute the commands and collect the output. The output identifies the pod names, with their corresponding namespaces, that require a restart.
Restart all affected pods from the list obtained previously with these commands (replace the pod name and namespace accordingly).
kubectl delete pods core-retriever-abcde -n cee-xyz
…
Verify that the pods are up and running without any issue.
kubectl get pods -A
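Optionally (not part of the original procedure), kubectl wait can be used to block until a restarted pod reports the Ready condition; replace the pod name and namespace accordingly.
kubectl wait --for=condition=Ready pod/<pod-name> -n <namespace> --timeout=120s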
Verify Pods and System Status After Restart
Execute these commands:
kubectl get daemonsets -A
kubectl get rs -A | grep -v '0 0 0'
Confirm that the daemonsets and replica sets show all members in the Ready state.
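As an additional cross-check (not part of the original procedure), this one-liner prints only the replica sets whose READY count differs from DESIRED; the field positions assume the default kubectl column layout.
kubectl get rs -A --no-headers | awk '$3 != $5'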
Log in to the CEE CLI and confirm that there are no active alerts and that the system status is at 100%.
ssh -p 2024 admin@`kubectl get svc -A| grep " ops-center-cee" | awk '{print $4}'`
show alerts active
show system status