This document describes how to troubleshoot five common problem scenarios encountered with Cisco Unified Communications Manager (CUCM) on the Unified Computing System (UCS) platform.
Some of the common causes are:
Cisco CallManager (CCM) and Computer Telephony Integration (CTI) services restart due to a CCM/CTI core dump.
CUCM Traces
Use these CLI commands in order to collect CUCM traces:
Examine these Real-Time Monitoring Tool (RTMT) logs:
Here is some sample output:
admin:utils core active list
Size Date Core File Name
===============================================
355732 KB 2014-X-X 11:27:29 core.XXX.X.ccm.XXXX
110164 KB 2014-X-X 11:27:25 core.XXX.X.CTIManager.XXXX
admin:utils core analyze output
====================================
CCM service backtrace
===================================
#0 0x00df6206 in raise () from /lib/libc.so.6
#1 0x00df7bd1 in abort () from /lib/libc.so.6
#2 0x084349cb in IntentionalAbort (reason=0xb0222f8 "CallManager unable to process
signals. This may be due to CPU or blocked function. Attempting to restart
CallManager.") at ProcessCMProcMon.cpp:80
#3 0x08434a8c in CMProcMon::monitorThread () at ProcessCMProcMon.cpp:530
#4 0x00a8fca7 in ACE_OS_Thread_Adapter::invoke (this=0xb2b04270) at OS_Thread_
Adapter.cpp:94
#5 0x00a45541 in ace_thread_adapter (args=0xb2b04270) at Base_Thread_Adapter.cpp:137
#6 0x004aa6e1 in start_thread () from /lib/libpthread.so.0
#7 0x00ea2d3e in clone () from /lib/libc.so.6
====================================
====================================
CTI Manager backtrace
===================================
#0 0x00b3e206 in raise () from /lib/libc.so.6
#1 0x00b3fbd1 in abort () from /lib/libc.so.6
#2 0x08497b11 in IntentionalAbort (reason=0x86fe488 "SDL Router Services declared
dead. This may be due to high CPU usage or blocked function. Attempting to restart
CTIManager.") at ProcessCTIProcMon.cpp:65
#3 0x08497c2c in CMProcMon::verifySdlTimerServices () at ProcessCTIProcMon.cpp:573
#4 0x084988d8 in CMProcMon::callManagerMonitorThread (cmProcMon=0x93c9638) at Process
CTIProcMon.cpp:330
#5 0x007bdca7 in ACE_OS_Thread_Adapter::invoke (this=0x992d710) at OS_Thread_
Adapter.cpp:94
#6 0x00773541 in ace_thread_adapter (args=0x992d710) at Base_Thread_Adapter.cpp:137
#7 0x0025d6e1 in start_thread () from /lib/libpthread.so.0
#8 0x00bead3e in clone () from /lib/libc.so.6
====================================
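For reference, backtraces like these come from the platform core analysis command. Here is a minimal session sketch; the core file name is taken from the utils core active list output, and the XXX placeholders stand for node- and process-specific values (on some CUCM versions the command takes the active keyword, as in utils core active analyze):

admin:utils core active list
admin:utils core analyze core.XXX.X.ccm.XXXX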
From the RIS Data Collector PerfMonLogs, you can see high disk I/O around the time of the core.
The backtrace matches Cisco bug ID CSCua79544: Frequent CCM Process Cores Due to High Disk I/O. This bug describes a hardware problem and explains how to isolate it further.
Enable File I/O Reporting (FIOR):
Use these commands in order to enable FIOR:
utils fior start
utils fior enable
Then, wait for the next occurrence. Collect the output with this CLI command: file get activelog platform/io-stats. Enter these commands in order to disable FIOR:
utils fior stop
utils fior disable
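Putting the steps together, a typical collection cycle looks like this. This is a sketch; utils fior status is included as an optional sanity check and is assumed to be available on your CUCM version:

admin:utils fior start
admin:utils fior enable
admin:utils fior status
 <wait for the next occurrence>
admin:file get activelog platform/io-stats
admin:utils fior stop
admin:utils fior disable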
Here is some sample FIOR log output:
kern 4 kernel: fio_syscall_table address set to c0626500 based on user input
kern 4 kernel: fiostats: address of do_execve set to c048129a
kern 6 kernel: File IO statistics module version 0.99.1 loaded.
kern 6 kernel: file reads > 265000 and writes > 51200 will be logged
kern 4 kernel: fiostats: enabled.
kern 4 kernel: fiostats[25487] started.
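Once file get activelog platform/io-stats lands the files on your SFTP server, standard shell tools can surface the processes that crossed the logging thresholds. The file names below are illustrative, since the exact naming varies by release:

$ ls io-stats*
$ grep -iE 'reads|writes' io-stats* | less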
High I/O wait usually indicates a problem with the UCS platform and its storage.
The UCS log is required to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions to collect the traces.
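Before you pull the UCS bundle, you can confirm the iowait symptom directly from the CUCM CLI. The show process load command wraps the platform top output, where the wa value in the CPU line shows the I/O wait percentage (output formatting varies by version):

admin:show process load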
CUCM reboots due to an ESXi crash, but the underlying issue is that the UCS machine loses power.
Examine these CUCM Traces:
There is nothing relevant in the CUCM traces. The CUCM stops before the incident, and this is followed by a normal service restart. This eliminates CUCM and indicates that the cause lies elsewhere.
The UCS Platform where the CUCM runs has the problem. The UCS Platform has many Virtual Machine (VM) instances that run on it. If any VM encounters an error, then it is seen in the UCS logs.
The UCS log is required in order to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions about how to collect the traces.
Here is some sample output:
5:2014 May 11 13:10:48:BMC:kernel:-:<5>[lpc_reset_isr_handler]:79:LPC Reset ISR ->
ResetState: 1
5:2014 May 11 13:10:48:BMC:kernel:-:<5>drivers/bmc/usb/usb1.1/se_pilot2_udc_usb1_1.c:
2288:USB FS: VDD Power WAKEUP- Power Good = OFF
5:2014 May 11 13:10:48:BMC:kernel:-:<5>[se_pilot2_wakeup_interrupt]:2561:USB HS:
VDD Power = OFF
5:2014 May 11 13:10:48:BMC:BIOSReader:1176: BIOSReader.c:752:File Close :
/var/nuova/BIOS/BiosTech.txt
5:2014 May 11 13:10:48:BMC:kernel:-:<5>[block_transfer_fetch_host_request_for_app]:
1720:block_transfer_fetch_host_request_for_app : BT_FILE_CLOSE : HostBTDescr = 27 :
FName = BiosTech.txt
5:2014 May 11 13:10:48:BMC:IPMI:1357: Pilot2SrvPower.c:466:Blade Power Changed To:
[ OFF ]
5:2014 May 11 13:10:49:BMC:lv_dimm:-: lv_dimm.c:126:[lpc_reset_seen]LPC Reset Count
is Different [0x1:0x2] Asserted LPC Reset Seen
The Pilot2SrvPower.c:466:Blade Power Changed To: [ OFF ] entry indicates that the UCS machine lost power. Hence, ensure that the UCS machine receives sufficient power.
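To investigate from the CIMC side, the System Event Log and the power supply state are the usual starting points. Here is a sketch of the CIMC CLI session, assuming a C-Series server; scope and command names can vary slightly across CIMC releases:

ucs-c220-m3# scope sel
ucs-c220-m3 /sel # show entries
ucs-c220-m3 /sel # exit
ucs-c220-m3# scope chassis
ucs-c220-m3 /chassis # show psu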
The CUCM VM crashes but still responds to pings. The vSphere console screen displays this information:
*ERROR* %No Memory Available *ERROR* %No Memory Available
Examine these CUCM Traces:
There is nothing relevant in the CUCM traces. The CUCM stops before the incident and is followed by a normal service restart. This eliminates CUCM and indicates that the cause lies elsewhere.
The UCS Platform where the CUCM runs has the problem. The UCS Platform has many VM instances that run on it. If any VM encounters an error, then it is seen in the UCS logs.
The UCS log is required in order to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions about how to collect the traces.
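If the VM still responds, capture the host memory state before you power it off, so that it can be correlated with the UCS logs. A minimal sketch from the ESXi shell, assuming SSH access to the host (press m inside esxtop for the memory view; vm-support generates the host diagnostic bundle):

~ # esxtop
~ # vm-support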
Power off the VM and reboot it. After the reboot, the system works fine.
The CUCM server goes into a hung state.
Examine these CUCM Traces:
There is nothing relevant in the CUCM traces. The CUCM stops before the incident and is followed by a normal service restart. This eliminates CUCM and indicates that the cause lies elsewhere.
The UCS Platform where the CUCM runs has the problem. The UCS Platform has many VM instances that run on it. If any VM encounters an error, then it is seen in the UCS logs.
The UCS log is required in order to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions about how to collect the traces.
Try a manual restart to see if it helps.
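A graceful restart from the CUCM CLI is preferred when the console still accepts input. If the VM is completely unresponsive, reset the guest from vSphere, or power-cycle the blade from the CIMC CLI as a last resort; a sketch, to be used only when the whole host is hung:

admin:utils system restart

ucs-c220-m3# scope chassis
ucs-c220-m3 /chassis # power cycle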
You receive this error:
The /common file system is mounted read only. Please use Recovery Disk to check the file system using fsck.
The Publisher (PUB) and one Subscriber (SUB) that are installed on the same UCS machine show the read-only mode error. The recovery disk does not fix the issue.
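If the CLI still responds, the built-in diagnostics, which include disk checks, are one way to confirm the state. A sketch, since the available modules vary by CUCM version:

admin:utils diagnose list
admin:utils diagnose test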
There is nothing relevant in the CUCM traces. The CUCM stops before the incident and is followed by a normal service restart. This eliminates CUCM and indicates that the cause lies elsewhere.
The UCS Platform where the CUCM runs has the problem. The UCS Platform has many VM instances that run on it. If any VM encounters an error, then it is seen in the UCS logs.
The UCS log is required in order to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions about how to collect the traces.
After hardware replacement, rebuild the problematic nodes.
This section describes how to collect the traces needed to identify the problem or provides links to articles that provide that information.
Refer to these articles for information about how to collect CIMC logs:
Using Cisco CIMC GUI to Collect show-tech Details
Visual Guide to collect Tech Support files (B and C series)
Refer to this article for information about how to collect ESXi logs:
Obtaining Diagnostic Information for ESXi 5.x hosts using the vSphere Client
Here is some sample CIMC CLI output from a Hard Disk Failure:
ucs-c220-m3 /chassis # show hdd
Name Status LocateLEDStatus
-------------------- -------------------- --------------------
HDD1_STATUS present TurnOFF
HDD2_STATUS present TurnOFF
HDD3_STATUS failed TurnOFF
HDD4_STATUS present TurnOFF
HDD5_STATUS absent TurnOFF
HDD6_STATUS absent TurnOFF
HDD7_STATUS absent TurnOFF
HDD8_STATUS absent TurnOFF
ucs-c220-m3 /chassis # show hdd-pid
Disk Controller Product ID Vendor Model
---- ----------- -------------------- ---------- ------------
1 SLOT-2 A03-D500GC3 ATA ST9500620NS
2 SLOT-2 A03-D500GC3 ATA ST9500620NS
3 SLOT-2 A03-D500GC3 ATA ST9500620NS
4 SLOT-2 A03-D500GC3 ATA ST9500620NS
ucs-c220-m3 /chassis/storageadapter # show physical-drive
Physical Drive Number  Controller  Health        Status            Manufacturer  Model        Predictive Failure Count  Drive Firmware  Coerced Size  Type
---------------------  ----------  ------------  ----------------  ------------  -----------  ------------------------  --------------  ------------  ----
1                      SLOT-2      Good          Online            ATA           ST9500620NS  0                         CC03            475883 MB     HDD
2                      SLOT-2      Good          Online            ATA           ST9500620NS  0                         CC03            475883 MB     HDD
3                      SLOT-2      Severe Fault  Unconfigured Bad  ATA           ST9500620NS  0                         CC03            0 MB          HDD
4                      SLOT-2      Good          Online            ATA           ST9500620NS  0                         CC03            475883 MB     HDD
Here is some sample CIMC CLI output from a RAID controller failure:
ucs-c220-m3 /chassis/storageadapter # show virtual-drive
Virtual Drive  Health          Status    Name  Size       RAID Level  Boot Drive
-------------  --------------  --------  ----  ---------  ----------  ----------
0              Moderate Fault  Degraded        951766 MB  RAID 10     true
Here is some sample CIMC GUI output from a Hard Disk Failure:
Here is some sample CIMC GUI output from a Purple Screen Error:
(RAID controller failure | Defect: Cisco bug ID CSCuh86924, ESXi PSOD PF Exception 14 - LSI RAID controller 9266-8i)
Here is some sample CIMC GUI output from a BBU Failure:
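The GUI screenshots are not reproduced here. A roughly equivalent battery check from the CIMC CLI, assuming the RAID controller sits in SLOT-2 as in the earlier output (the show bbu keyword is assumed to be present on your CIMC release):

ucs-c220-m3# scope chassis
ucs-c220-m3 /chassis # scope storageadapter SLOT-2
ucs-c220-m3 /chassis/storageadapter # show bbu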