Introduction
This document describes how to troubleshoot a Cisco Policy Suite (CPS) VM restart caused by a CentOS kernel crash.
Problem
Each CPS VM (qns, lb, pcrfclient, and so on) runs on CentOS. A VM can reboot because of a problem in CentOS rather than in the CPS application. If the reboot is caused by a problem in the CentOS kernel, the root cause cannot be found by investigating the CPS capture_env output, because the capture_env logs contain no error logs from the rebooted VM for the time of the reboot. In such cases, the logs under /var/crash can be used for the investigation.
Solution
CentOS can generate a kernel crash dump when a problem occurs in the kernel. By default, CPS is configured to collect kernel crash dumps on all VMs.
The status can be checked with this command:
[root@dc1-qns01 ~]# systemctl status kdump.service
● kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
Active: active (exited) since Tue 2023-01-10 07:29:35 UTC; 4 months 4 days ago
Main PID: 1023 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 75300)
Memory: 0
CGroup: /system.slice/kdump.service
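Besides the service status, it can help to confirm that a crash kernel is actually loaded and that memory was reserved for it at boot. The following is a minimal sketch, not a CPS-provided tool: the function names are hypothetical, and it assumes the standard Linux interfaces /sys/kernel/kexec_crash_loaded and /proc/cmdline are available on the VM.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CPS) to double-check kdump readiness.
# Each function accepts an optional file path so it can be tried on test data.

# Succeeds if the crash (kdump) kernel is loaded.
# /sys/kernel/kexec_crash_loaded contains 1 when loaded, 0 otherwise.
crash_kernel_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ "$(cat "$state_file" 2>/dev/null)" = "1" ]
}

# Succeeds if the kernel was booted with a crashkernel= memory reservation.
crashkernel_reserved() {
    cmdline_file="${1:-/proc/cmdline}"
    grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null
}

# Prints a one-line summary based on the two checks above.
kdump_summary() {
    if crash_kernel_loaded "$1" && crashkernel_reserved "$2"; then
        echo "kdump ready"
    else
        echo "kdump not ready"
    fi
}
```

Running kdump_summary with no arguments on a VM reads the live system files. A "kdump not ready" result suggests that a crash dump might not be written even though the service is enabled.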
If a kernel crash occurs while kdump.service is enabled, a directory named "address-YYYY-MM-DD-HH:MM:SS" is created under /var/crash. CentOS generates two files in this directory.
[root@dc1-lb02 127.0.0.1-2022-10-18-06:18:41]# pwd
/var/crash/127.0.0.1-2022-10-18-06:18:41
[root@dc1-lb02 127.0.0.1-2022-10-18-06:18:41]# ls -rtl
total 161436
-rw-r--r-- 1 root root 89787 Oct 18 2022 vmcore-dmesg.txt
-rw------- 1 root root 165215218 Oct 18 2022 vmcore
- vmcore:
A binary file that stores the contents of kernel memory at the time of the crash. Analysis requires tools such as the crash utility and the kernel-debuginfo package.
- vmcore-dmesg.txt:
A text file that contains the dmesg output (kernel log) at the time of the crash.
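Because vmcore-dmesg.txt is plain text, a first pass can be made without any debug tools. The sketch below is an illustration only: the function name is hypothetical, and the search patterns are an assumed set of strings that commonly appear in kernel crash messages, not an exhaustive or CPS-specific list.

```shell
#!/bin/sh
# Hypothetical helper (not part of CPS): print lines of a vmcore-dmesg.txt
# file that contain common kernel crash markers, prefixed with line numbers.
# The marker list is an assumption and may need to be extended.
panic_lines() {
    grep -n -E 'Kernel panic|BUG:|Oops|Call Trace' "$1"
}

# Example (path taken from the listing in this document):
#   panic_lines "/var/crash/127.0.0.1-2022-10-18-06:18:41/vmcore-dmesg.txt"
```

For deeper analysis of the binary vmcore, the crash utility is typically started with the debug vmlinux from the kernel-debuginfo package that matches the crashed kernel version, followed by the path to the vmcore file.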
As an example, in one case the CPS logs from the rebooted VM showed no error logs just before the reboot. Analysis on the VMware side showed that the reboot was associated with this error log, which indicates a problem in the guest OS:
The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
Check /var/crash on the rebooted VM for a directory whose timestamp matches the reboot time. In this example, the check showed that the reboot was due to a kernel problem on the CentOS side, and the investigation could proceed with the crash dump.
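The directory check can be sketched as a small helper. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the reboot date is supplied by the operator (for example, from the output of who -b or last -x reboot), and the crash directories follow the "address-YYYY-MM-DD-HH:MM:SS" naming described earlier.

```shell
#!/bin/sh
# Hypothetical helper (not part of CPS): list crash directories whose name
# contains the given date (YYYY-MM-DD). The second argument is the directory
# to search and defaults to /var/crash.
crash_dirs_for_date() {
    date_str="$1"
    root="${2:-/var/crash}"
    for d in "$root"/*-"$date_str"-*; do
        # An unmatched glob stays literal, so keep only real directories.
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

# Example: crash_dirs_for_date 2022-10-18
```

If the helper prints nothing for the reboot date, no kernel crash dump was written on that day, and the cause of the reboot probably needs to be looked for elsewhere.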