Introduzione
In questo documento viene descritto come risolvere i problemi relativi all'arresto anomalo del pod su Cloud Native Deployment Platform (CNDP).
Prerequisiti
Requisiti
Nessun requisito specifico previsto per questo documento.
Componenti usati
Il documento può essere consultato per tutte le versioni software o hardware.
Le informazioni discusse in questo documento fanno riferimento a dispositivi usati in uno specifico ambiente di emulazione. Su tutti i dispositivi menzionati nel documento la configurazione è stata ripristinata ai valori predefiniti. Se la rete è operativa, valutare attentamente eventuali conseguenze derivanti dall'uso dei comandi.
Premesse
In questa configurazione, Cloud Native Deployment Platform (CNDP) ospita Session Management Function (SMF).
Problema
Vengono visualizzati avvisi su Common Execution Environment (CEE) per il blocco del pod.
Command:
cee# show alerts active summary summary
Example:
[smf-rcdn/cee-rcdn] cee# show alerts active summary summary
NAME UID SUMMARY
--------------------------------------------------------------------------------------------
k8s-pod-crashing-loop bd4394046466 Pod smf-rcdn/smf-service-n0-6 (smf-service) is...
k8s-pod-crashing-loop 0ac1019911e3 Pod smf-rcdn/smf-service-n0-14 (smf-service) i...
k8s-pod-crashing-loop eeff8fa16660 Pod smf-rcdn/smf-service-n0-9 (smf-service) is...
k8s-pod-crashing-loop 470ff66822dc Pod smf-rcdn/smf-service-n0-5 (smf-service) is...
k8s-pod-crashing-loop cc8950f07ace Pod smf-rcdn/smf-service-n0-15 (smf-service) i...
k8s-pod-crashing-loop 05a7d1e291a6 Pod smf-rcdn/smf-service-n0-3 (smf-service) is...
Analisi
Connettersi al nodo master e visualizzare tutti i pod kubernetes arrestati in modo anomalo. Grep per CrashLoopBackOff
.
Dallo stesso output, è possibile verificare il numero di riavvii del pod.
Command:
master$ kubectl get pods -n
|grep -v CrashLoopBackOff
Example:
cloud-user@smf-rcdn-master-1:~$ kubectl get pods -n smf-rcdn |grep -v Running
NAME READY STATUS RESTARTS AGE
smf-service-n0-10 1/2 CrashLoopBackOff 1224 6d7h
smf-service-n0-11 1/2 CrashLoopBackOff 1242 6d7h
smf-service-n0-15 1/2 CrashLoopBackOff 1244 6d7h
smf-service-n0-2 1/2 CrashLoopBackOff 1241 6d7h
smf-service-n0-3 1/2 CrashLoopBackOff 1251 6d7h
smf-service-n0-5 1/2 CrashLoopBackOff 1231 6d7h
smf-service-n0-7 1/2 CrashLoopBackOff 1249 6d7h
Descrivere il baccello che si è schiantato. In questo modo è possibile ottenere ulteriori dettagli sul perché il pod si è schiantato. Osservare i registri in Eventi.
Command:
master$ kubectl describe pod -n
|grep -i start
Example
:
cloud-user@smf-rcdn-master-1:~$ kubectl describe pod -n smf-rcdn smf-service-n0-11 |grep -i start Start Time: Tue, 09 Aug 2022 03:13:54 +0000 Started: Tue, 09 Aug 2022 03:13:56 +0000 Restart Count: 0 Started: Mon, 15 Aug 2022 11:33:10 +0000 Started: Mon, 15 Aug 2022 11:26:55 +0000 Restart Count: 1263 Started: Tue, 09 Aug 2022 03:13:58 +0000 Restart Count: 0 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning BackOff 65s (x15210 over 3d6h) kubelet Back-off restarting failed container
Ad esempio, se avete un baccello smf-service-n1-0
si è verificato un arresto anomalo ed è necessario connettersi al NODE smf-rcdn-service-ims2
per raccogliere i file di base.
ubuntu@smf-rcdn-master1:~$ kubectl get pods -n smf-ims -o wide | grep smf-service-n1-0
NAME READY STATUS RESTARTS AGE IP NODE NOMINDATEDN NODE READINESS GATES
smf-service-n1-0 2/2 Running 10 9h 10.20.9.142 smf-rcdn-service-ims2
Connect to the Node è il Pod host che si è bloccato e raccoglie il file binario. Questo file è necessario per l'analisi da parte di Cisco.
Command:
master1:~$ kubectl cp
/
:/opt/workspace/smf-service /tmp/smf-service
Example:
ubuntu@smf-rcdn-master1:~$ kubectl cp smf-ims/smf-service-n1-0:/opt/workspace/smf-service /tmp/smf-service
Connect to the Node (Connetti al nodo) è il Pod host che si è bloccato e si sposta nella cartella /var/lib/systemd/coredump/e visualizza il contenuto. Se generati, è possibile visualizzarli in questa cartella.
Example:
ubuntu@smf-rcdn-master1:~$ ssh smf-rcdn-service-ims2
ubuntu@smf-rcdn-service-ims2:~$ cd /var/lib/systemd/coredump/
ubuntu@smf-rcdn-service-ims2:/var/lib/systemd/coredump$ ls -ltr
total 982340
-rw-r----- 1 root root 52968460 Sep 21 16:40 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.1232.1599842408000000.lz4
-rw-r----- 1 root root 61609776 Sep 21 16:41 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.3468.1599842463000000.lz4
-rw-r----- 1 root root 74233259 Sep 21 16:46 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.28259.1599842775000000.lz4
-rw-r----- 1 root root 58241763 Sep 21 16:52 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.17155.1599843174000000.lz4
-rw-r----- 1 root root 43732684 Sep 21 16:56 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.3076.1599843385000000.lz4
-rw-r----- 1 root root 52377930 Sep 21 17:06 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.8024.1599844002000000.lz4
-rw-r----- 1 root root 63990106 Sep 21 17:07 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.26962.1599844074000000.lz4
-rw-r----- 1 root root 98058261 Sep 21 17:15 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.13026.1599844546000000.lz4
-rw-r----- 1 root root 59586871 Sep 21 17:24 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.21720.1599845052000000.lz4
-rw-r----- 1 root root 71187759 Sep 21 17:50 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.19705.1599846648000000.lz4
-rw-r----- 1 root root 96949278 Sep 21 17:57 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.11744.1599847049000000.lz4
-rw-r----- 1 root root 6052439 Sep 21 17:57 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.23846.1599847052000000.lz4
-rw-r----- 1 root root 70642243 Sep 21 17:58 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.18327.1599847110000000.lz4
-rw-r----- 1 root root 66052273 Sep 21 18:10 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.1504.1599847843000000.lz4
-rw-r----- 1 root root 65132876 Sep 21 18:10 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.12528.1599847855000000.lz4
-rw-r----- 1 root root 65000665 Sep 21 18:32 core.smf-service.0.a829fbabe2e649a7ab02150838fe47ae.9462.1599849167000000.lz4
ubuntu@smf-rcdn-master1:~$:/var/lib/systemd/coredump$
Tar tutti i file nella cartella.
ubuntu@smf-rcdn-service-ims2:~$ sudo tar czvfsmf-rcdn-service-ims2.tar.gz *.lz4
Da SFTP principale al nodo in cui si trovano i core e scaricarli nella cartella Master /tmp, quindi estrarlo sul PC.
ubuntu@smf-rcdn-master1:~$: sftp smf-rcdn-service-ims2
Il comando stampa i registri prima dell'ultimo riavvio del pod e acquisisce la firma dell'arresto anomalo.
Command:
master:~$ kubectl logs -n
-p
-c
Example:
ubuntu@smf-rcdn-master1:~$
kubectl logs -n smf-ims -p smf-service-n1-0 -c smf-service /usr/local/go/src/runtime/asm_amd64.s:1357 (0x462d01) panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x13d92f6] goroutine 839296 [running]: panic(0x196c320, 0x3441300) /usr/local/go/src/runtime/panic.go:722 +0x2c2 fp=0xc000a9d050 sp=0xc000a9cfc0 pc=0x432d82 runtime.panicmem(...) /usr/local/go/src/runtime/panic.go:199 runtime.sigpanic() /usr/local/go/src/runtime/signal_unix.go:394 +0x3ec fp=0xc000a9d080 sp=0xc000a9d050 pc=0x4487cc smf-service/userplane.(*UpfServData).
ProcessSessionModificationResponse(0xc0059fe660, 0xc005b98f00, 0xc00aa6e3c0, 0x2001181ae72b892, 0xc00ea43570, 0x3, 0x4,
0xc005cd0820, 0xc005b11410, 0xc005b10d20, ...) /opt/workspace/smf-service/src/smf-service/userplane/upfSessionModification.go:743 +0x526 fp=0xc000a9d408 sp=0xc000a9d080 pc=0x13d92f6 smf-service/procedures/4g/pdn5g4gHo.(*Pdn5g4gHoProcedure).awtUpfModifyProcN4ModifyResp(0xc005a17440, 0xc0099e36c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) /opt/workspace/smf-service/src/smf-service/procedures/4g/pdn5g4gHo/mbrUtils.go:485 +0x24d fp=0xc000a9d630 sp=0xc000a9d408 pc=0x1562d0d smf-service/procedures/4g/pdn5g4gHo.(*Pdn5g4gHoProcedure).handleUpfModifyEvents(0xc005a17440, 0xc0099e36c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) /opt/workspace/smf-service/src/smf-service/procedures/4g/pdn5g4gHo/stateHandler.go:196 +0x4a1 fp=0xc000a9d768 sp=0xc000a9d630 pc=0x1570d31 smf-service/procedures/4g/pdn5g4gHo.(*Pdn5g4gHoProcedure).HandleEvent(0xc005a17440, 0xc0099e36c0, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...) /opt/workspace/smf-service/src/smf-service/procedures/4g/pdn5g4gHo/procedure.go:364 +0x707 fp=0xc000a9d8d0 sp=0xc000a9d768 pc=0x1567887 smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-smf/smf-common.git/src/smf-common/callflow.(*BaseProcedure).Handle(0xc00568b4a0, 0xc0099e36c0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) /opt/workspace/smf-service/src/smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-smf/smf-common.git/src/smf-common/callflow/BaseProcedure.go:54 +0xdb
fp=0xc000a9d978 sp=0xc000a9d8d0 pc=0xf5996b smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-smf/smf-common.git/src/smf-common/callflow.(*SessionState).ProcessContinue(0xc00b79b6d0, 0xc0099e36c0,
0xc00568b4a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...) /opt/workspace/smf-service/src/smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-smf/smf-common.git/src/smf-common/callflow/SessionState.go:169 +0x1f2
fp=0xc000a9da20 sp=0xc000a9d978 pc=0xf5d552 smf-service/processor.(*SmfAppMessageProcessor).ProcessContinue(0x3a31da0, 0xc005b98f00, 0x1d34988, 0x35, 0x9, 0x1d34988, 0x35) /opt/workspace/smf-service/src/smf-service/processor/grpc_message_processor.go:430 +0x4ab fp=0xc000a9dc20 sp=0xc000a9da20 pc=0x174fc0b smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra.(*masterBlueprint).processTransaction
(0xc0003141e0, 0xc005b98f00, 0xc000a9dd98) /opt/workspace/smf-service/src/smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra/MasterBlueprint.go:301
+0x1a7 fp=0xc000a9dce8 sp=0xc000a9dc20 pc=0xd39ca7 smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra.(*masterBlueprint).
processTransactionWithCR(0xc0003141e0, 0xc005b98f00, 0x1cfeb00) /opt/workspace/smf-service/src/smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra/MasterBlueprint.go:234
+0x394 fp=0xc000a9de78 sp=0xc000a9dce8 pc=0xd396e4 smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra.(*masterBlueprint).
processSessionTransaction(0xc0003141e0, 0xc005b98f00, 0x1, 0x0) /opt/workspace/smf-service/src/smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra/MasterBlueprint.go:177
+0x124 fp=0xc000a9ded0 sp=0xc000a9de78 pc=0xd39104 smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra.(*masterBlueprint).
processEvent(0xc0003141e0, 0xc005b98f00, 0x1d02487) /opt/workspace/smf-service/src/smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra/MasterBlueprint.go:138 +0x5fc
fp=0xc000a9df88 sp=0xc000a9ded0 pc=0xd3869c smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra.(*ApplicationContext).NewTransaction.func2
(0xc0006af400, 0xc005b98f00) /opt/workspace/smf-service/src/smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra/ApplicationContext.go:1268
+0x7c fp=0xc000a9dfd0 sp=0xc000a9df88 pc=0xd9b69c runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc000a9dfd8 sp=0xc000a9dfd0 pc=0x462d01 created by smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra.(*ApplicationContext).NewTransaction /opt/workspace/smf-service/src/smf-service/vendor/wwwin-github.cisco.com/mobile-cnat-golang-lib/app-infra.git/src/app-infra/infra/ApplicationContext.go:1266 +0x62c goroutine 1 [sleep]: runtime.gopark(0x1dbaa10, 0x34ef580, 0xc001f01313, 0x2) /usr/local/go/src/runtime/proc.go:304 +0xe0 fp=0xc000a3bca8 sp=0xc000a3bc88 pc=0x434ea0 runtime.goparkunlock(...)
Connettersi a CEE e raccogliere il debug tac prima e dopo l'arresto anomalo del pod.
tac-debug-pkg create from yyyy-mm-dd_hh:mm:ss to yyyy-mm-dd_hh:mm:ss tac-debug-pkg create from yyyy-mm-dd_hh:mm:ss to yyyy-mm-dd_hh:mm:ss
Piano d'azione
Aprire la richiesta di servizio per Cisco TAC per trovare la causa principale dell'arresto anomalo.