Introduction
This document describes High Load Alert investigation and recommended workarounds in Cisco Policy Suite (CPS).
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
Cisco also recommends that you have privilege root access to CPS CLI.
Components used
The information in this document is based on CPS 19.4
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Background Information
The load average is the average system load on a Linux server for a defined period of time. In other words, it is the CPU demand of a server that includes the sum of the active and the idle threads.
Measurement of load average is critical to understand how your servers perform; if overloaded, you must kill or optimize the processes that consume high amounts of resources, or provide more resources to balance the workload.
Typically, the top or the uptime command provides the load average of your server with output that looks like:
[root@cps-194-aio-mob ~]# uptime
11:41:08 up 6 days, 5:20, 2 users, load average: 0.71, 0.35, 0.24
[root@cps-194-aio-mob ~]#
[root@cps-194-aio-mob ~]# top
top - 12:17:26 up 6 days, 5:56, 2 users, load average: 0.09, 0.12, 0.13
Tasks: 185 total, 1 running, 183 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.8 us, 0.8 sy, 0.0 ni, 98.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12137348 total, 4128956 free, 5219860 used, 2788532 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 6586848 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7070 root 5 -15 8263680 1.3g 21728 S 12.5 11.6 561:38.74 java
1 root 20 0 191384 4320 2620 S 0.0 0.0 3:11.17 systemd
These numbers are the averages of the system load over a period of one, five, and 15 minutes.
Before you move further, let’s understand these two important phrases in all Unix-like systems:
System load/CPU Load – is a measurement of CPU over or under-utilization in a Linux system; the number of processes that are executed by the CPU or in idle state.
Load average – is the average system load calculated over a given period of time of 1, 5 and 15 minutes.
Problem
Whenever the load average of a CPS VM goes beyond the defined threshold, HighLoadAlert gets generated. The threshold value for the HighLoad alert is defined as 1.5*No Of CPUs in the VM. This configuration is provided in /etc/snmp/snmpd.conf:
load 12 12 12
# 1, 5 and 15 Minute Load Averages (UCD-SNMP-MIB la)
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.4 .1.3.6.1.4.1.2021.10.1.5.1
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.5 .1.3.6.1.4.1.2021.10.1.5.2
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.6 .1.3.6.1.4.1.2021.10.1.5.3
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.4.0 .1.3.6.1.4.1.2021.10.1.5.1
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.5.0 .1.3.6.1.4.1.2021.10.1.5.2
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.6.0 .1.3.6.1.4.1.2021.10.1.5.3
Sample HighLoad Alert:
2021-10-31T14:25:36.572711+05:30 XXXXX-lb01 snmptrapd[5717]: 2021-10-31 14:25:36 pcrfclient01 [UDP: [XX.XX.XX.XX]:46046->[XX.XX.XX.XX]:162]:#012DISMAN-EVENT-MIB::sysUpTimeInstance = 99307800#011SNMPv2-MIB::snmpTrapOID.0 = OID: DISMAN-EVENT-MIB::mteTriggerFired#011DISMAN-EVENT-MIB::mteHotTrigger.0 = STRING: HighLoadAlert#011DISMAN-EVENT-MIB::mteHotTargetName.0 = STRING: #011DISMAN-EVENT-MIB::mteHotContextName.0 = STRING: #011DISMAN-EVENT-MIB::mteHotOID.0 = OID: UCD-SNMP-MIB::laErrorFlag.1#011DISMAN-EVENT-MIB::mteHotValue.0 = INTEGER: 1#011UCD-SNMP-MIB::laNames.1 = STRING: Load-1#011UCD-SNMP-MIB::laErrMessage.1 = STRING: 1 min Load Average too high (= 64.84)
Troubleshoot HighLoad
Prior to further investigation, ensure that the affected VM has CPU count as per standard. This can be done with the respective CPS Installation guide where it mentions the CPU count required for each VM.
The only Linux command that combined provide load average and CPU utilization by processes, is top command. In order to identify the process that causes HighLoad, top command must be executed in the affected VM at regular intervals for a certain duration that covers the HighLoad instance. This command provides top output for every 3 sec, for 15000 number of times (you can change the number as per your scenario):
#top -b -n15000 >> top.txt &
[root@cps-194-aio-mob ~]# top
top - 09:32:11 up 7 days, 3:11, 3 users, load average: 0.13, 0.16, 0.15
Tasks: 184 total, 1 running, 182 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.8 us, 0.8 sy, 0.0 ni, 98.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12137348 total, 3911352 free, 5262096 used, 2963900 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 6520076 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7014 redis 20 0 147356 2372 1184 S 6.7 0.0 48:15.15 redis-server
7070 root 5 -15 8263688 1.4g 21744 S 6.7 11.8 645:12.88 java
1 root 20 0 191384 4320 2620 S 0.0 0.0 3:38.65 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.12 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:04.51 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root rt 0 0 0 0 S 0.0 0.0 0:01.76 migration/0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root 20 0 0 0 0 S 0.0 0.0 11:53.47 rcu_sched
Closely relate and compare the HighLoadAlert instance with the top command output, identify the process which is highly utilized CPU at the time of the alert.
Then to gather more information about that process, run this command:
Command Template:
#ps -ef | grep {PID}
Sample command:
[root@cps-194-aio-mob ~]# ps -ef | grep 7070
root 7070 1 6 Dec02 ? 12:17:06 /usr/bin/java -server -XX:+UnlockDiagnosticVMOptions -XX:+UnsyncloadClass -Xms2048m -Xmx2048m -javaagent:/opt/broadhop/qns-1/bin/jmxagent.jar -Dqns.config.dir=/etc/broadhop/pcrf -Dqns.instancenum=1 -Dlogback.configurationFile=/etc/broadhop/logback.xml -Djmx.port=9045 -Dorg.osgi.service.http.port=8080 -Dsnmp.port=1161 -Dcom.broadhop.run.systemId=lab -Dcom.broadhop.run.clusterId=cluster-1 -Dcom.broadhop.run.instanceId=cps-194-aio-mob-1 -Dcom.broadhop.config.url=http://pcrfclient01/repos/run/ -Dcom.broadhop.repository.credentials.isEncrypted=true -Dcom.broadhop.repository.credentials=qns-svn/3300901EA069E81CE29D4F77DE3C85FA@pcrfclient01 -Dcom.broadhop.referencedata.local.location=/var/broadhop/checkout -DdisableJms -DrefreshOnChange=true -DenableRuntimePolling=true -DdefaultNasIp=127.0.0.1 -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044 -Dua.version.2.0.compatible=true -Denable.compression=true -Denable.dictionary.compression=true -DuseZlibCompression=true -DenableBestCompression=true -DenableQueueSystem=false -Dredis.keystore.connection.string=lb01:lb01:6379:6379 -DbrokerUrl=failover:(tcp://lb01:61616,tcp://lb02:61616)?randomize=false -DjmsFlowControlHost=lb02 -DjmsFlowControlPort=9045 -Dosgi.framework.activeThreadType=normal -jar /opt/broadhop/qns-1/plugins/org.eclipse.equinox.launcher_1.1.0.v20100507.jar -console cps-194-aio-mob:9091 -clean -os linux -ws gtk -arch x86_64
root 7846 7587 0 11:00 pts/0 00:00:00 grep --color=auto 7070
[root@cps-194-aio-mob ~]#
Workaround
Once the Process that causes HighLoadAlert has been identified, then these workarounds can be considered:
Step 1. Restart the process.
#monit stop {Process Name}
Wait for 10 secs
#monit start {Process Name}
Step 2. If the process includes logback, then verify any logger with debug log level and change the log level of loggers from debug to warn/error.
Step 3. If Step 1. and Step 2. don't work, then tune the respective configuration file, with the help of the development team if required.