Troubleshoot High Load Alert and Recommended Workarounds in CPS

Updated: January 10, 2022

Document ID: 217619

Introduction

This document describes High Load Alert investigation and recommended workarounds in Cisco Policy Suite (CPS).

Prerequisites

Requirements

Cisco recommends that you have knowledge of these topics:

Linux
CPS

Cisco also recommends that you have root access to the CPS CLI.

Components Used

The information in this document is based on CPS 19.4.

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.

Background Information

The load average is the average system load on a Linux server over a defined period of time. In other words, it expresses the CPU demand on the server as the number of threads that are running or waiting to run; on Linux, threads in uninterruptible sleep (typically waiting on disk I/O) are counted as well.

Measurement of the load average is critical to understand how your servers perform. If a server is overloaded, you must kill or optimize the processes that consume high amounts of resources, or provide more resources to balance the workload.

Typically, the uptime or the top command provides the load average of your server, with output that looks like this:

[root@cps-194-aio-mob ~]# uptime 
11:41:08 up 6 days, 5:20, 2 users, load average: 0.71, 0.35, 0.24
[root@cps-194-aio-mob ~]#

[root@cps-194-aio-mob ~]# top
top - 12:17:26 up 6 days, 5:56, 2 users, load average: 0.09, 0.12, 0.13
Tasks: 185 total, 1 running, 183 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.8 us, 0.8 sy, 0.0 ni, 98.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12137348 total, 4128956 free, 5219860 used, 2788532 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 6586848 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
7070 root 5 -15 8263680 1.3g 21728 S 12.5 11.6 561:38.74 java 
1 root 20 0 191384 4320 2620 S 0.0 0.0 3:11.17 systemd

The three load average values are the averages of the system load over the last 1, 5, and 15 minutes.

Before you move further, it helps to understand these two terms, which are common to all Unix-like systems:

System load/CPU load: a measurement of CPU over-utilization or under-utilization in a Linux system, expressed as the number of processes that the CPU is executing or that are waiting to be executed.
Load average: the average system load calculated over periods of 1, 5, and 15 minutes.
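As a rough illustration of how to read these values, the sketch below divides the three load averages by the CPU count, so that VMs of different sizes can be compared. The helper name load_per_cpu is hypothetical and not part of CPS, and the 4-CPU figure is only an assumption for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CPS): divide each load average by the CPU count.
# Arguments: a "1min 5min 15min" string (as found in /proc/loadavg) and a CPU count.
load_per_cpu() {
    local loads="$1" cpus="$2"
    echo "$loads" | awk -v n="$cpus" '{ printf "%.2f %.2f %.2f\n", $1 / n, $2 / n, $3 / n }'
}

# Values taken from the uptime sample above, assuming a 4-CPU VM for illustration.
load_per_cpu "0.71 0.35 0.24" 4

# On a live Linux host, the same helper can read the kernel's own figures.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    load_per_cpu "$(cut -d' ' -f1-3 /proc/loadavg)" "$(nproc)"
fi
```

A per-CPU value that stays well below 1 indicates that the CPUs are mostly idle, as in the samples above.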

Problem

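Whenever the load average of a CPS VM exceeds the defined threshold, a HighLoadAlert is generated. The threshold is defined as 1.5 times the number of CPUs in the VM; for example, a VM with 8 CPUs has a threshold of 12. The threshold is configured through the load directive in /etc/snmp/snmpd.conf, which takes the limits for the 1-, 5-, and 15-minute load averages: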

load 12 12 12

# 1, 5 and 15 Minute Load Averages (UCD-SNMP-MIB la)
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.4 .1.3.6.1.4.1.2021.10.1.5.1
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.5 .1.3.6.1.4.1.2021.10.1.5.2
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.6 .1.3.6.1.4.1.2021.10.1.5.3
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.4.0 .1.3.6.1.4.1.2021.10.1.5.1
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.5.0 .1.3.6.1.4.1.2021.10.1.5.2
proxy -v 2c -c broadhop localhost .1.3.6.1.4.1.26878.200.3.2.70.1.6.0 .1.3.6.1.4.1.2021.10.1.5.3
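To check whether the configured limits match the 1.5-times-CPU rule on a given VM, something like the following sketch could be used. The helper names expected_threshold and configured_thresholds are hypothetical; the sketch only assumes that the load directive appears at the start of a line in snmpd.conf, as in the excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expected HighLoadAlert threshold for a CPU count (1.5 x CPUs).
expected_threshold() {
    awk -v n="$1" 'BEGIN { printf "%g\n", 1.5 * n }'
}

# Hypothetical helper: print the three limits from the first "load" directive in a file.
configured_thresholds() {
    awk '$1 == "load" { print $2, $3, $4; exit }' "$1"
}

# 8 CPUs give 12, which matches "load 12 12 12" above.
expected_threshold 8

# On a CPS VM, compare the configured limits with the value expected for this VM.
conf=/etc/snmp/snmpd.conf
if [ -r "$conf" ] && command -v nproc >/dev/null 2>&1; then
    echo "configured: $(configured_thresholds "$conf")"
    echo "expected:   $(expected_threshold "$(nproc)")"
fi
```

A mismatch between the two values suggests that the VM does not have the CPU count the configuration was written for.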

Sample HighLoad Alert:


2021-10-31T14:25:36.572711+05:30 XXXXX-lb01 snmptrapd[5717]: 2021-10-31 14:25:36 pcrfclient01 [UDP: [XX.XX.XX.XX]:46046->[XX.XX.XX.XX]:162]:#012DISMAN-EVENT-MIB::sysUpTimeInstance = 99307800#011SNMPv2-MIB::snmpTrapOID.0 = OID: DISMAN-EVENT-MIB::mteTriggerFired#011DISMAN-EVENT-MIB::mteHotTrigger.0 = STRING: HighLoadAlert#011DISMAN-EVENT-MIB::mteHotTargetName.0 = STRING: #011DISMAN-EVENT-MIB::mteHotContextName.0 = STRING: #011DISMAN-EVENT-MIB::mteHotOID.0 = OID: UCD-SNMP-MIB::laErrorFlag.1#011DISMAN-EVENT-MIB::mteHotValue.0 = INTEGER: 1#011UCD-SNMP-MIB::laNames.1 = STRING: Load-1#011UCD-SNMP-MIB::laErrMessage.1 = STRING: 1 min Load Average too high (= 64.84)
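The fields that matter for the investigation are the alert time, the VM that raised the trap, and the reported load. A minimal sketch for extracting them from snmptrapd log lines is shown below; the helper name parse_highload_alerts is hypothetical, and the pattern is written only against the sample alert format above, so other trap layouts may need a different expression.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read snmptrapd log lines on stdin and print
# "<date> <time> <source VM> <reported load>" for each HighLoadAlert trap.
parse_highload_alerts() {
    sed -n -E 's/.*snmptrapd\[[0-9]+\]: ([0-9-]+ [0-9:]+) ([^ ]+) .*mteHotTrigger\.0 = STRING: HighLoadAlert.*\(= ([0-9.]+)\).*/\1 \2 \3/p'
}

# Abbreviated copy of the sample alert above (middle fields shortened for readability).
sample='2021-10-31T14:25:36.572711+05:30 XXXXX-lb01 snmptrapd[5717]: 2021-10-31 14:25:36 pcrfclient01 [UDP: [XX.XX.XX.XX]:46046->[XX.XX.XX.XX]:162]:#012DISMAN-EVENT-MIB::mteHotTrigger.0 = STRING: HighLoadAlert#011UCD-SNMP-MIB::laErrMessage.1 = STRING: 1 min Load Average too high (= 64.84)'

printf '%s\n' "$sample" | parse_highload_alerts
```

On a live system, the function would be fed from whichever log file receives the snmptrapd output in your deployment.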

Troubleshoot HighLoad

Before you investigate further, ensure that the affected VM has the standard CPU count. Verify this against the CPS Installation Guide for your release, which lists the CPU count required for each VM.

The top command reports both the load average and the CPU utilization of each process in a single output. To identify the process that causes the high load, run top on the affected VM at regular intervals for a duration that covers the HighLoad instance. This command runs top in batch mode in the background and appends a sample to top.txt every 3 seconds, 15000 times, which covers about 12.5 hours at that interval (change the number of samples to suit your scenario):

#top -b -n15000 >> top.txt &

[root@cps-194-aio-mob ~]# top
top - 09:32:11 up 7 days, 3:11, 3 users, load average: 0.13, 0.16, 0.15
Tasks: 184 total, 1 running, 182 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.8 us, 0.8 sy, 0.0 ni, 98.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12137348 total, 3911352 free, 5262096 used, 2963900 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 6520076 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
7014 redis 20 0 147356 2372 1184 S 6.7 0.0 48:15.15 redis-server 
7070 root 5 -15 8263688 1.4g 21744 S 6.7 11.8 645:12.88 java 
1 root 20 0 191384 4320 2620 S 0.0 0.0 3:38.65 systemd 
2 root 20 0 0 0 0 S 0.0 0.0 0:00.12 kthreadd 
3 root 20 0 0 0 0 S 0.0 0.0 0:04.51 ksoftirqd/0 
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H 
7 root rt 0 0 0 0 S 0.0 0.0 0:01.76 migration/0 
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh 
9 root 20 0 0 0 0 S 0.0 0.0 11:53.47 rcu_sched

Correlate the time of the HighLoadAlert with the top command output, and identify the process that consumes the most CPU at the time of the alert.
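Scanning a long capture by hand is tedious, so the correlation can be sketched with a small filter. The helper name busiest_per_sample is hypothetical; it assumes the default top layout shown above, where rows are sorted by %CPU and COMMAND is the twelfth column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each sample in a "top -b" capture, print
# "<time> <1-min load> <PID> <%CPU> <COMMAND>" for the first (busiest) process row.
# An optional second argument keeps only samples whose 1-minute load is at or above it.
busiest_per_sample() {
    local file="$1" min_load="${2:-0}"
    awk -v min="$min_load" '
        /^top - / {
            ts = $3; load = 0; armed = 0
            if (match($0, /load average: [0-9.]+/))
                load = substr($0, RSTART + 14, RLENGTH - 14) + 0
            next
        }
        $1 == "PID" { armed = 1; next }      # column header: the next row is the busiest process
        armed && NF >= 12 {
            if (load >= min) print ts, load, $1, $9, $12
            armed = 0
        }
    ' "$file"
}

# Small capture built from the sample output above.
capture="$(mktemp)"
cat > "$capture" <<'EOF'
top - 09:32:11 up 7 days, 3:11, 3 users, load average: 0.13, 0.16, 0.15
Tasks: 184 total, 1 running, 182 sleeping, 0 stopped, 1 zombie

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7014 redis 20 0 147356 2372 1184 S 6.7 0.0 48:15.15 redis-server
7070 root 5 -15 8263688 1.4g 21744 S 6.7 11.8 645:12.88 java
EOF

busiest_per_sample "$capture"
rm -f "$capture"
```

Against a real capture, passing the alert threshold as the second argument (for example, busiest_per_sample top.txt 12) would narrow the output to the samples taken while the load was above the limit.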

To gather more information about that process, run this command:

Command Template:
#ps -ef | grep {PID}

Sample command:
[root@cps-194-aio-mob ~]# ps -ef | grep 7070
root 7070 1 6 Dec02 ? 12:17:06 /usr/bin/java -server -XX:+UnlockDiagnosticVMOptions -XX:+UnsyncloadClass -Xms2048m -Xmx2048m -javaagent:/opt/broadhop/qns-1/bin/jmxagent.jar -Dqns.config.dir=/etc/broadhop/pcrf -Dqns.instancenum=1 -Dlogback.configurationFile=/etc/broadhop/logback.xml -Djmx.port=9045 -Dorg.osgi.service.http.port=8080 -Dsnmp.port=1161 -Dcom.broadhop.run.systemId=lab -Dcom.broadhop.run.clusterId=cluster-1 -Dcom.broadhop.run.instanceId=cps-194-aio-mob-1 -Dcom.broadhop.config.url=http://pcrfclient01/repos/run/ -Dcom.broadhop.repository.credentials.isEncrypted=true -Dcom.broadhop.repository.credentials=qns-svn/3300901EA069E81CE29D4F77DE3C85FA@pcrfclient01 -Dcom.broadhop.referencedata.local.location=/var/broadhop/checkout -DdisableJms -DrefreshOnChange=true -DenableRuntimePolling=true -DdefaultNasIp=127.0.0.1 -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044 -Dua.version.2.0.compatible=true -Denable.compression=true -Denable.dictionary.compression=true -DuseZlibCompression=true -DenableBestCompression=true -DenableQueueSystem=false -Dredis.keystore.connection.string=lb01:lb01:6379:6379 -DbrokerUrl=failover:(tcp://lb01:61616,tcp://lb02:61616)?randomize=false -DjmsFlowControlHost=lb02 -DjmsFlowControlPort=9045 -Dosgi.framework.activeThreadType=normal -jar /opt/broadhop/qns-1/plugins/org.eclipse.equinox.launcher_1.1.0.v20100507.jar -console cps-194-aio-mob:9091 -clean -os linux -ws gtk -arch x86_64
root 7846 7587 0 11:00 pts/0 00:00:00 grep --color=auto 7070
[root@cps-194-aio-mob ~]#
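As the sample shows, a plain grep also returns the grep command itself, and it would match any line that merely contains the digits of the PID. One possible refinement, sketched here with a hypothetical helper named pid_rows, is to compare the PID column exactly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "ps -ef" output on stdin and keep the header line
# plus only the rows whose PID column (second field) equals the given PID.
pid_rows() {
    awk -v pid="$1" 'NR == 1 || $2 == pid'
}

# Demonstrate on the current shell; on a CPS VM, pass the PID found in the
# top output instead (for example, 7070).
ps -ef | pid_rows "$$"
```

The header line is kept so that the columns of the single matching row remain readable.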

Workaround

Once you have identified the process that causes the HighLoadAlert, consider these workarounds:

Step 1. Restart the process.

#monit stop {Process Name}
Wait for 10 seconds.
#monit start {Process Name}

Step 2. If the process uses logback, check for any logger that is set to the debug log level, and change the level of those loggers from debug to warn or error.
Step 3. If Step 1 and Step 2 do not resolve the issue, tune the respective configuration file, with the help of the development team if required.
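Steps 1 and 2 can be sketched as two small helpers. Both names (restart_process and debug_loggers) are hypothetical, and the MONIT variable is an assumption added only so that the sequence can be dry-run. The logback path /etc/broadhop/logback.xml is taken from the sample ps output above; the check assumes loggers are declared with a level="..." attribute.

```shell
#!/usr/bin/env bash
# MONIT is overridable so the sequence can be dry-run (for example, MONIT=echo).
MONIT="${MONIT:-monit}"

# Hypothetical helper for Step 1: stop, wait, then start one monit-managed process.
restart_process() {
    local name="$1" wait_secs="${2:-10}"
    "$MONIT" stop "$name" || return 1
    sleep "$wait_secs"
    "$MONIT" start "$name"
}

# Hypothetical helper for Step 2: list the lines of a logback file that set a debug level.
debug_loggers() {
    grep -n -i 'level="debug"' "$1"
}

# Dry run with a placeholder process name: print the monit commands instead of running them.
MONIT=echo restart_process example-process 0

# On a CPS VM, list debug-level loggers in the file referenced by the process.
if [ -r /etc/broadhop/logback.xml ]; then
    debug_loggers /etc/broadhop/logback.xml
fi
```

On a real VM, the call would look like restart_process {Process Name}, which uses the 10-second wait from Step 1 by default.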

Revision History

Revision	Publish Date	Comments
1.0	10-Jan-2022	Initial Release

Contributed by Cisco Engineers

Midhun P
Cisco TAC Engineer

This Document Applies to These Products

Policy Suite for Mobile
