Introduction
This document describes the recovery steps to follow when Cisco Subscriber Microservices Infrastructure (SMI) pods enter a not-ready state due to Kubernetes bug https://github.com/kubernetes/kubernetes/issues/82346.
Problem
After a site isolation, the Common Execution Environment (CEE) reported a Processing Error Alarm, and the system ready state was below 100%.
[site1app/pod1] cee# show alerts active
alerts active k8s-deployment-replica-mismatch f89d8d09389c
state active
severity critical
type "Processing Error Alarm"
startsAt 2021-05-27T08:38:58.703Z
source site1app-smi-cluster-policy-oam2
labels [ "component: kube-state-metrics" "deployment: prometheus-scrapeconfigs-synch" "exported_namespace: cee-pod1" "instance: 192.0.2.37:8080" "job: kubernetes-pods" "namespace: cee-pod1" "pod: kube-state-metrics-6c476f7494-tqkrc" "pod_template_hash: 6c476f7494" "release: cee-pod1-cnat-monitoring" ]
annotations [ "summary: Deployment cee-pod1/prometheus-scrapeconfigs-synch has not matched the expected number of replicas for longer than 2 minutes." ]
[site1app/pod1] cee# show system status
system status deployed true
system status percent-ready 92.68
ubuntu@site1app-smi-cluster-policy-mas01:~$ kubectl get rs -n cee-pod1 | grep scrape
NAME DESIRED CURRENT READY AGE
prometheus-scrapeconfigs-synch-ccd454f76 1 1 0 395d
prometheus-scrapeconfigs-synch-f5544b4f8 0 0 0 408d
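The replica mismatch shown above can also be spotted programmatically. A minimal sketch, using the captured `kubectl get rs` output as sample data (in practice you would pipe the live command instead; the column layout NAME/DESIRED/CURRENT/READY/AGE is assumed):

```shell
# Sample `kubectl get rs -n cee-pod1` output, as captured above.
rs_output='NAME DESIRED CURRENT READY AGE
prometheus-scrapeconfigs-synch-ccd454f76 1 1 0 395d
prometheus-scrapeconfigs-synch-f5544b4f8 0 0 0 408d'

# Print replica sets whose READY count ($4) does not match DESIRED ($2),
# skipping the header row.
echo "$rs_output" | awk 'NR > 1 && $2 != $4 { print $1 }'
```

This flags prometheus-scrapeconfigs-synch-ccd454f76 (1 desired, 0 ready), which matches the k8s-deployment-replica-mismatch alert.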
Solution
Site isolation is a trigger for bug https://github.com/kubernetes/kubernetes/issues/82346. The workaround to bring these pods back to the Ready state is to restart the affected pods. The fix is included in an upcoming CEE release.
Initial Pod and System Verification
Log in to the CEE CLI and check the system status.
ssh -p 2024 admin@`kubectl get svc -A| grep " ops-center-cee" | awk '{print $4}'`
show alerts active
show system status
Restart the Affected Pods
Log in to the master node and run these commands to identify the daemonsets and replica sets that do not have all members in the Ready state.
kubectl get daemonsets -A
kubectl get rs -A | grep -v '0 0 0'
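The `grep -v '0 0 0'` filter only hides fully scaled-down replica sets; a more direct check compares the READY column against DESIRED. A sketch with hypothetical sample data (the daemonset names and counts below are illustrative, not from this site; the `kubectl get daemonsets -A` column layout NAMESPACE/NAME/DESIRED/CURRENT/READY/... is assumed):

```shell
# Hypothetical sample of `kubectl get daemonsets -A` output.
ds_output='NAMESPACE     NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
kube-system   calico-node   3         3         2       3            2           <none>          395d
kube-system   kube-proxy    3         3         3       3            3           <none>          395d'

# Print daemonsets whose READY count ($5) does not match DESIRED ($3).
echo "$ds_output" | awk 'NR > 1 && $3 != $5 { print $2 " (" $1 ")" }'
```

The same comparison works for replica sets by adjusting the column numbers to that listing's layout.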
Copy and paste these commands into a text editor, and replace every cee-xyz with the cee namespace used on your site.
kubectl describe pods core-retriever -n cee-xyz | egrep "^Name:|False" | grep -B1 False
kubectl describe pods calico-node -n kube-system | egrep "^Name:|False" | grep -B1 False
kubectl describe pods csi-cinder-nodeplugin -n kube-system | egrep "^Name:|False" | grep -B1 False
kubectl describe pods maintainer -n kube-system | egrep "^Name:|False" | grep -B1 False
kubectl describe pods kube-proxy -n kube-system | egrep "^Name:|False" | grep -B1 False
kubectl describe pods path-provisioner -n cee-xyz | egrep "^Name:|False" | grep -B1 False
kubectl describe pods logs-retriever -n cee-xyz | egrep "^Name:|False" | grep -B1 False
kubectl describe pods node-exporter -n cee-xyz | egrep "^Name:|False" | grep -B1 False
kubectl describe pods keepalived -n smi-vips | egrep "^Name:|False" | grep -B1 False
kubectl describe pods prometheus-scrapeconfigs-synch -n cee-xyz | egrep "^Name:|False" | grep -B1 False
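The list above can also be generated from a loop instead of edited by hand. A dry-run sketch that prints the pipelines for the cee-namespace pods so they can be reviewed before pasting into a terminal (CEE_NS is a placeholder variable standing in for your site's cee namespace, just like cee-xyz above):

```shell
CEE_NS="cee-xyz"   # placeholder: replace with the cee namespace on your site

# Print (not execute) the describe pipeline for each pod prefix that lives
# in the cee namespace.
for pod in core-retriever path-provisioner logs-retriever node-exporter prometheus-scrapeconfigs-synch; do
  echo "kubectl describe pods $pod -n $CEE_NS | egrep \"^Name:|False\" | grep -B1 False"
done
```

Dropping the `echo` (or piping the output to `sh`) runs the checks; the kube-system and smi-vips pods can be handled with a second loop over their own namespaces.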
Run the commands and collect the output. The output identifies the pod names, with their respective namespaces, that need to be restarted.
Restart all the affected pods from the list obtained earlier by issuing these commands (substitute the pod names and namespaces accordingly).
kubectl delete pods core-retriever-abcde -n cee-xyz
…
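The pod names fed into `kubectl delete pods` can be extracted straight from the filtered `kubectl describe` output of the previous step, since each hit begins with a `Name:` line. A minimal sketch over sample filtered output (core-retriever-abcde is the same placeholder name used in the commands above):

```shell
# Sample of the filtered describe output from the previous step.
filtered='Name:                 core-retriever-abcde
  Ready                 False'

# Keep only the pod names from the "Name:" lines.
pods=$(echo "$filtered" | awk '/^Name:/ { print $2 }')
echo "$pods"

# In practice, each name would then be deleted in its namespace, e.g.:
# for p in $pods; do kubectl delete pods "$p" -n cee-xyz; done
```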
Verify that the pods are up and running without any issues.
kubectl get pods -A
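To avoid scanning the full listing by eye, pods in a bad state can be filtered out directly. A sketch with hypothetical sample output (the pod names and counts are illustrative; the `kubectl get pods -A --no-headers` column layout NAMESPACE/NAME/READY/STATUS/... is assumed):

```shell
# Hypothetical sample of `kubectl get pods -A --no-headers` output.
pods_output='cee-pod1    kube-state-metrics-6c476f7494-tqkrc              1/1   Running            0   5m
cee-pod1    prometheus-scrapeconfigs-synch-ccd454f76-abcde   0/1   CrashLoopBackOff   4   5m'

# List pods whose STATUS ($4) is neither Running nor Completed.
echo "$pods_output" | awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 }'
```

An empty result here means every pod reports a healthy status.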
Verify Pod and System Status After the Restart
Run these commands:
kubectl get daemonsets -A
kubectl get rs -A | grep -v '0 0 0'
Confirm that the daemonsets and replica sets show all members in the Ready state.
Log in to the CEE CLI and confirm that there are no active alerts and that the system status is 100%.
ssh -p 2024 admin@`kubectl get svc -A| grep " ops-center-cee" | awk '{print $4}'`
show alerts active
show system status