This document describes how to troubleshoot five common problem scenarios encountered with Cisco Unified Communications Manager (CUCM) on the Unified Computing System (UCS) platform.
Some of the common causes are:
Cisco CallManager (CCM) and Computer Telephony Integration (CTI) services restart due to a CCM/CTI core dump.
CUCM Traces
Use these CLI commands in order to collect CUCM traces:
Examine these Real-Time Monitoring Tool (RTMT) logs:
Here is some sample output:
admin:utils core active list
Size Date Core File Name
===============================================
355732 KB 2014-X-X 11:27:29 core.XXX.X.ccm.XXXX
110164 KB 2014-X-X 11:27:25 core.XXX.X.CTIManager.XXXX
admin:utils core analyze output
====================================
CCM service backtrace
===================================
#0 0x00df6206 in raise () from /lib/libc.so.6
#1 0x00df7bd1 in abort () from /lib/libc.so.6
#2 0x084349cb in IntentionalAbort (reason=0xb0222f8 "CallManager unable to process
signals. This may be due to CPU or blocked function. Attempting to restart
CallManager.") at ProcessCMProcMon.cpp:80
#3 0x08434a8c in CMProcMon::monitorThread () at ProcessCMProcMon.cpp:530
#4 0x00a8fca7 in ACE_OS_Thread_Adapter::invoke (this=0xb2b04270) at OS_Thread_
Adapter.cpp:94
#5 0x00a45541 in ace_thread_adapter (args=0xb2b04270) at Base_Thread_Adapter.cpp:137
#6 0x004aa6e1 in start_thread () from /lib/libpthread.so.0
#7 0x00ea2d3e in clone () from /lib/libc.so.6
====================================
====================================
CTI Manager backtrace
===================================
#0 0x00b3e206 in raise () from /lib/libc.so.6
#1 0x00b3fbd1 in abort () from /lib/libc.so.6
#2 0x08497b11 in IntentionalAbort (reason=0x86fe488 "SDL Router Services declared
dead. This may be due to high CPU usage or blocked function. Attempting to restart
CTIManager.") at ProcessCTIProcMon.cpp:65
#3 0x08497c2c in CMProcMon::verifySdlTimerServices () at ProcessCTIProcMon.cpp:573
#4 0x084988d8 in CMProcMon::callManagerMonitorThread (cmProcMon=0x93c9638) at Process
CTIProcMon.cpp:330
#5 0x007bdca7 in ACE_OS_Thread_Adapter::invoke (this=0x992d710) at OS_Thread_
Adapter.cpp:94
#6 0x00773541 in ace_thread_adapter (args=0x992d710) at Base_Thread_Adapter.cpp:137
#7 0x0025d6e1 in start_thread () from /lib/libpthread.so.0
#8 0x00bead3e in clone () from /lib/libc.so.6
====================================
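For reference, backtraces like these come from the platform core analysis command. Here is a minimal session sketch; the core file name is taken from the utils core active list output, and the XXX placeholders stand for node- and process-specific values (on some CUCM versions the command takes the active keyword, as in utils core active analyze):

admin:utils core active list
admin:utils core analyze core.XXX.X.ccm.XXXX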
From the RIS Data Collector PerfMonLogs, you can see high disk I/O around the time of the core.
The backtrace matches Cisco bug ID CSCua79544: Frequent CCM Process Cores Due to High Disk I/O. This bug describes a hardware problem and explains how to isolate it further.
Enable File I/O Reporting (FIOR):
Use these commands in order to enable FIOR:
utils fior start
utils fior enable
Then, wait for the next occurrence. Collect the output with this CLI command: file get activelog platform/io-stats. Enter these commands in order to disable FIOR:
utils fior stop
utils fior disable
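Putting the steps together, a typical collection cycle looks like this. This is a sketch; utils fior status is included as an optional sanity check and is assumed to be available on your CUCM version:

admin:utils fior start
admin:utils fior enable
admin:utils fior status
 <wait for the next occurrence>
admin:file get activelog platform/io-stats
admin:utils fior stop
admin:utils fior disable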
Here is some sample FIOR log output:
kern 4 kernel: fio_syscall_table address set to c0626500 based on user input
kern 4 kernel: fiostats: address of do_execve set to c048129a
kern 6 kernel: File IO statistics module version 0.99.1 loaded.
kern 6 kernel: file reads > 265000 and writes > 51200 will be logged
kern 4 kernel: fiostats: enabled.
kern 4 kernel: fiostats[25487] started.
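Once file get activelog platform/io-stats lands the files on your SFTP server, standard shell tools can surface the processes that crossed the logging thresholds. The file names below are illustrative, since the exact naming varies by release:

$ ls io-stats*
$ grep -iE 'reads|writes' io-stats* | less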
High I/O wait usually indicates a problem with the UCS platform and its storage.
The UCS log is required to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions to collect the traces.
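Before you pull the UCS bundle, you can confirm the iowait symptom directly from the CUCM CLI. The show process load command wraps the platform top output, where the wa value in the CPU line shows the I/O wait percentage (output formatting varies by version):

admin:show process load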
CUCM reboots due to an ESXi crash, but the underlying issue is that the UCS machine loses power.
Examine these CUCM Traces:
There is nothing relevant in the CUCM traces. The CUCM stops before the incident, and this is followed by a normal service restart. This eliminates CUCM and indicates that the cause lies elsewhere.
The UCS Platform where the CUCM runs has the problem. The UCS Platform has many Virtual Machine (VM) instances that run on it. If any VM encounters an error, then it is seen in the UCS logs.
The UCS log is required in order to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions about how to collect the traces.
Here is some sample output:
5:2014 May 11 13:10:48:BMC:kernel:-:<5>[lpc_reset_isr_handler]:79:LPC Reset ISR ->
ResetState: 1
5:2014 May 11 13:10:48:BMC:kernel:-:<5>drivers/bmc/usb/usb1.1/se_pilot2_udc_usb1_1.c:
2288:USB FS: VDD Power WAKEUP- Power Good = OFF
5:2014 May 11 13:10:48:BMC:kernel:-:<5>[se_pilot2_wakeup_interrupt]:2561:USB HS:
VDD Power = OFF
5:2014 May 11 13:10:48:BMC:BIOSReader:1176: BIOSReader.c:752:File Close :
/var/nuova/BIOS/BiosTech.txt
5:2014 May 11 13:10:48:BMC:kernel:-:<5>[block_transfer_fetch_host_request_for_app]:
1720:block_transfer_fetch_host_request_for_app : BT_FILE_CLOSE : HostBTDescr = 27 :
FName = BiosTech.txt
5:2014 May 11 13:10:48:BMC:IPMI:1357: Pilot2SrvPower.c:466:Blade Power Changed To:
[ OFF ]
5:2014 May 11 13:10:49:BMC:lv_dimm:-: lv_dimm.c:126:[lpc_reset_seen]LPC Reset Count
is Different [0x1:0x2] Asserted LPC Reset Seen
The Pilot2SrvPower.c:466:Blade Power Changed To: [ OFF ] entry indicates that the UCS machine lost power. Hence, ensure that the UCS machine receives sufficient power.
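To investigate from the CIMC side, the System Event Log and the power supply state are the usual starting points. Here is a sketch of the CIMC CLI session, assuming a C-Series server; scope and command names can vary slightly across CIMC releases:

ucs-c220-m3# scope sel
ucs-c220-m3 /sel # show entries
ucs-c220-m3 /sel # exit
ucs-c220-m3# scope chassis
ucs-c220-m3 /chassis # show psu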
The CUCM VM crashes but still responds to pings. The vSphere console screen displays this information:
*ERROR* %No Memory Available *ERROR* %No Memory Available
Examine these CUCM Traces:
There is nothing relevant in the CUCM traces. The CUCM stops before the incident and is followed by a normal service restart. This eliminates CUCM and indicates that the cause lies elsewhere.
The UCS Platform where the CUCM runs has the problem. The UCS Platform has many VM instances that run on it. If any VM encounters an error, then it is seen in the UCS logs.
The UCS log is required in order to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions about how to collect the traces.
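If the VM still responds, capture the host memory state before you power it off, so that it can be correlated with the UCS logs. A minimal sketch from the ESXi shell, assuming SSH access to the host (press m inside esxtop for the memory view; vm-support generates the host diagnostic bundle):

~ # esxtop
~ # vm-support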
Power off the VM and reboot it. After the reboot, the system works fine.
The CUCM server goes into a hung state.
Examine these CUCM Traces:
There is nothing relevant in the CUCM traces. The CUCM stops before the incident and is followed by a normal service restart. This eliminates CUCM and indicates that the cause lies elsewhere.
The UCS Platform where the CUCM runs has the problem. The UCS Platform has many VM instances that run on it. If any VM encounters an error, then it is seen in the UCS logs.
The UCS log is required in order to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions about how to collect the traces.
Try a manual restart to see if it helps.
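A graceful restart from the CUCM CLI is preferred when the console still accepts input. If the VM is completely unresponsive, reset the guest from vSphere, or power-cycle the blade from the CIMC CLI as a last resort; a sketch, to be used only when the whole host is hung:

admin:utils system restart

ucs-c220-m3# scope chassis
ucs-c220-m3 /chassis # power cycle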
You receive this error:
The /common file system is mounted read only. Please use Recovery Disk to check the file system using fsck.
The Publisher (PUB) and one Subscriber (SUB) that are installed on the same UCS machine show the read-only mode error. The recovery disk does not fix the issue.
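If the CLI still responds, the built-in diagnostics, which include disk checks, are one way to confirm the state. A sketch, since the available modules vary by CUCM version:

admin:utils diagnose list
admin:utils diagnose test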
There is nothing relevant in the CUCM traces. The CUCM stops before the incident and is followed by a normal service restart. This eliminates CUCM and indicates that the cause lies elsewhere.
The UCS Platform where the CUCM runs has the problem. The UCS Platform has many VM instances that run on it. If any VM encounters an error, then it is seen in the UCS logs.
The UCS log is required in order to isolate the location of the cause. Refer to the How to Collect UCS Logs section for instructions about how to collect the traces.
After hardware replacement, rebuild the problematic nodes.
This section describes how to collect the traces needed to identify the problem or provides links to articles that provide that information.
Refer to these articles for information about how to collect CIMC logs:
Using Cisco CIMC GUI to Collect show-tech Details
Visual Guide to collect Tech Support files (B and C series)
Refer to this article for information about how to collect ESXi logs:
Obtaining Diagnostic Information for ESXi 5.x hosts using the vSphere Client
Here is some sample CIMC CLI output from a Hard Disk Failure:
ucs-c220-m3 /chassis # show hdd
Name Status LocateLEDStatus
-------------------- -------------------- --------------------
HDD1_STATUS present TurnOFF
HDD2_STATUS present TurnOFF
HDD3_STATUS failed TurnOFF
HDD4_STATUS present TurnOFF
HDD5_STATUS absent TurnOFF
HDD6_STATUS absent TurnOFF
HDD7_STATUS absent TurnOFF
HDD8_STATUS absent TurnOFF
ucs-c220-m3 /chassis # show hdd-pid
Disk Controller Product ID Vendor Model
---- ----------- -------------------- ---------- ------------
1 SLOT-2 A03-D500GC3 ATA ST9500620NS
2 SLOT-2 A03-D500GC3 ATA ST9500620NS
3 SLOT-2 A03-D500GC3 ATA ST9500620NS
4 SLOT-2 A03-D500GC3 ATA ST9500620NS
ucs-c220-m3 /chassis/storageadapter # show physical-drive
Physical Drive Number  Controller  Health        Status            Manufacturer  Model        Predictive Failure Count  Drive Firmware  Coerced Size  Type
---------------------  ----------  ------------  ----------------  ------------  -----------  ------------------------  --------------  ------------  ----
1                      SLOT-2      Good          Online            ATA           ST9500620NS  0                         CC03            475883 MB     HDD
2                      SLOT-2      Good          Online            ATA           ST9500620NS  0                         CC03            475883 MB     HDD
3                      SLOT-2      Severe Fault  Unconfigured Bad  ATA           ST9500620NS  0                         CC03            0 MB          HDD
4                      SLOT-2      Good          Online            ATA           ST9500620NS  0                         CC03            475883 MB     HDD
Here is some sample CIMC CLI output from a RAID controller failure:
ucs-c220-m3 /chassis/storageadapter # show virtual-drive
Virtual Drive  Health          Status    Name  Size       RAID Level  Boot Drive
-------------  --------------  --------  ----  ---------  ----------  ----------
0              Moderate Fault  Degraded        951766 MB  RAID 10     true
Here is some sample CIMC GUI output from a Hard Disk Failure:
Here is some sample CIMC GUI output from a Purple Screen Error:
(RAID controller failure | Defect: Cisco bug ID CSCuh86924, ESXi PSOD PF Exception 14 - LSI RAID controller 9266-8i)
Here is some sample CIMC GUI output from a BBU Failure:
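The GUI screenshots are not reproduced here. A roughly equivalent battery check from the CIMC CLI, assuming the RAID controller sits in SLOT-2 as in the earlier output (the show bbu keyword is assumed to be present on your CIMC release):

ucs-c220-m3# scope chassis
ucs-c220-m3 /chassis # scope storageadapter SLOT-2
ucs-c220-m3 /chassis/storageadapter # show bbu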