System Troubleshooting
Revised: August 10, 2011, OL-25016-01
Introduction
This chapter provides the information needed for monitoring and troubleshooting system events and alarms. This chapter is divided into the following sections:
• System Events and Alarms—Provides a brief overview of each system event and alarm
• Monitoring System Events—Provides the information needed for monitoring and correcting the system events
• Troubleshooting System Alarms—Provides the information needed for troubleshooting and correcting the system alarms
System Events and Alarms
This section provides a brief overview of all of the system events and alarms for the Cisco BTS 10200 Softswitch; the events and alarms are arranged in numerical order. Table 12-1 lists all of the system events and alarms by severity.
Note: Click the system message number in Table 12-1 to display information about the event or alarm.
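As a quick cross-check, the severities given in Tables 12-2 through 12-16 can be collected into one lookup. The following Python sketch is illustrative only: the names SYSTEM_SEVERITY, classify, and by_severity are hypothetical helpers and are not part of the BTS 10200 software. It follows this chapter's usage, in which Information and Warning messages are called events and Minor, Major, and Critical messages are called alarms.

```python
from __future__ import annotations

# Severity of each System message, transcribed from Tables 12-2 through 12-16.
SYSTEM_SEVERITY = {
    1: "Information",
    2: "Minor",
    3: "Minor",
    4: "Minor",
    5: "Warning",
    6: "Minor",
    7: "Minor",
    8: "Major",
    9: "Warning",
    10: "Minor",
    11: "Minor",
    12: "Major",
    13: "Critical",
    14: "Minor",
    15: "Major",
}

# This chapter refers to Information and Warning messages as events;
# all other severities are referred to as alarms.
EVENT_SEVERITIES = {"Information", "Warning"}


def classify(message_number: int) -> str:
    """Return "event" or "alarm" for a System message number."""
    severity = SYSTEM_SEVERITY[message_number]
    return "event" if severity in EVENT_SEVERITIES else "alarm"


def by_severity(severity: str) -> list[int]:
    """List the System message numbers that carry the given severity."""
    return [n for n, s in SYSTEM_SEVERITY.items() if s == severity]
```

For example, by_severity("Major") returns the System messages documented as major alarms in this chapter.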
System (1)
Table 12-2 lists the details of the System (1) information event. For additional information, refer to the "Test Report—System (1)" section.
Table 12-2 System (1) Details
Description | Test Report
Severity | Information
Threshold | 100
Throttle | 0
Primary Cause | This is a test report for the System category.
Primary Action | No action is required.
System (2)
Table 12-3 lists the details of the System (2) minor alarm. To troubleshoot and correct the cause of the alarm, refer to the "Inter-Process Communication Queue Read Failure—System (2)" section.
Table 12-3 System (2) Details
Description | Inter-Process Communication Queue Read Failure (IPC Queue Read Failure)
Severity | Minor
Threshold | 100
Throttle | 0
Datawords | Queue Name—STRING [20]; Location Tag—STRING [30]
Primary Cause | There is a problem with the inter-process communication (IPC) process.
Primary Action | If the problem persists, contact Cisco TAC.
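The Datawords entries in these tables pair a field name with a type such as ONE_BYTE, TWO_BYTES, FOUR_BYTES, or STRING [n]. The following Python sketch shows one way to split such an entry into its fields. It is an illustration under stated assumptions: parse_datawords is a hypothetical helper, and the byte widths (1, 2, and 4 bytes, and n bytes for STRING [n]) are inferred from the type names rather than stated in this chapter.

```python
from __future__ import annotations

import re

# Assumed byte widths for the fixed-size dataword types named in this chapter.
FIXED_WIDTHS = {"ONE_BYTE": 1, "TWO_BYTES": 2, "FOUR_BYTES": 4}

# A field is "<name><dash><type>", where <dash> is U+2014 as printed in the tables
# and <type> is a fixed-size keyword or "STRING [n]". Separators between fields
# (spaces, commas, semicolons) are skipped.
FIELD_RE = re.compile(
    r"(?P<name>.+?)\u2014"
    r"(?P<type>ONE_BYTE|TWO_BYTES|FOUR_BYTES|STRING \[(?P<len>\d+)\])[\s;,]*"
)


def parse_datawords(spec: str) -> list[tuple[str, str, int]]:
    """Split a Datawords entry into (field name, type, assumed width in bytes)."""
    fields = []
    for match in FIELD_RE.finditer(spec):
        type_name = match.group("type")
        if match.group("len") is not None:
            width = int(match.group("len"))  # STRING [n]: assumed to hold n bytes
            type_name = "STRING"
        else:
            width = FIXED_WIDTHS[type_name]
        fields.append((match.group("name").strip(), type_name, width))
    return fields


# Datawords entry of Table 12-3, System (2).
print(parse_datawords("Queue Name\u2014STRING [20] Location Tag\u2014STRING [30]"))
```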
System (3)
Table 12-4 lists the details of the System (3) minor alarm. To troubleshoot and correct the cause of the alarm, refer to the "Inter-Process Communication Message Allocate Failure—System (3)" section.
Table 12-4 System (3) Details
Description | Inter-Process Communication Message Allocate Failure (IPC Message Allocate Failure)
Severity | Minor
Threshold | 100
Throttle | 0
Datawords | Requested Size—TWO_BYTES; Error Code—FOUR_BYTES; Location Tag—STRING [30]
Primary Cause | There is a system error or there is not enough free memory left to allocate a message buffer.
Primary Action | If the problem persists, contact Cisco TAC.
System (4)
Table 12-5 lists the details of the System (4) minor alarm. To troubleshoot and correct the cause of the alarm, refer to the "Inter-Process Communication Message Send Failure—System (4)" section.
Table 12-5 System (4) Details
Description | Inter-Process Communication Message Send Failure (IPC Message Send Failure)
Severity | Minor
Threshold | 50
Throttle | 0
Datawords | Error Code—FOUR_BYTES; Destination Process—FOUR_BYTES; Message Number—FOUR_BYTES; Location Tag—STRING [30]
Primary Cause | The process for which the message is intended is not running.
Primary Action | Check to ensure that all components or processes are running. Attempt to restart any component or process that is not running.
Secondary Cause | An internal error has occurred.
Secondary Action | If the problem persists, contact Cisco TAC.
System (5)
Table 12-6 lists the details of the System (5) warning event. To monitor and correct the cause of the event, refer to the "Unexpected Inter-Process Communication Message Received—System (5)" section.
Table 12-6 System (5) Details
Description | Unexpected Inter-Process Communication Message Received (Unexpected IPC Message Received)
Severity | Warning
Threshold | 100
Throttle | 0
Datawords | Source Process Type—ONE_BYTE; Source Thread Type—ONE_BYTE; Message Number—TWO_BYTES; Location Tag—STRING [30]
Primary Cause | The process reporting the event is receiving messages it is not expecting.
Primary Action | Contact Cisco TAC.
System (6)
Table 12-7 lists the details of the System (6) minor alarm. To troubleshoot and correct the cause of the alarm, refer to the "Index List Insert Error—System (6)" section.
Table 12-7 System (6) Details
Description | Index List Insert Error (IDX List Insert Error)
Severity | Minor
Threshold | 100
Throttle | 0
Datawords | List Name—STRING [20]; Index of Entry Being—FOUR_BYTES; Location Tag—STRING [30]
Primary Cause | An internal error has occurred.
Primary Action | If the problem persists, contact Cisco TAC.
System (7)
Table 12-8 lists the details of the System (7) minor alarm. To troubleshoot and correct the cause of the alarm, refer to the "Index List Remove Error—System (7)" section.
Table 12-8 System (7) Details
Description | Index List Remove Error (IDX List Remove Error)
Severity | Minor
Threshold | 100
Throttle | 0
Datawords | List Name—STRING [20]; Index of Entry Being—FOUR_BYTES; Location Tag—STRING [30]
Primary Cause | An internal error has occurred.
Primary Action | If the problem persists, contact Cisco TAC.
System (8)
Table 12-9 lists the details of the System (8) major alarm. To troubleshoot and correct the cause of the alarm, refer to the "Thread Creation Failure—System (8)" section.
Table 12-9 System (8) Details
Description | Thread Creation Failure
Severity | Major
Threshold | 100
Throttle | 0
Datawords | Error Code—FOUR_BYTES; Thread Name—STRING [20]; Location Tag—STRING [30]
Primary Cause | An internal error has occurred. A process was unable to create one of its threads.
Primary Action | Attempt to restart the node on which the error occurred. If the same error occurs, contact Cisco TAC.
System (9)
Table 12-10 lists the details of the System (9) warning event. To monitor and correct the cause of the event, refer to the "Timer Start Failure—System (9)" section.
Table 12-10 System (9) Details
Description | Timer Start Failure
Severity | Warning
Threshold | 100
Throttle | 0
Datawords | Timer Type—STRING [20]; Location Tag—STRING [30]
Primary Cause | The process was unable to start a platform timer.
Primary Action | If the problem persists, contact Cisco TAC.
System (10)
Table 12-11 lists the details of the System (10) minor alarm. To troubleshoot and correct the cause of the alarm, refer to the "Index Update Registration Error—System (10)" section.
Table 12-11 System (10) Details
Description | Index Update Registration Error (IDX Update Registration Error)
Severity | Minor
Threshold | 100
Throttle | 0
Datawords | Error Code—FOUR_BYTES; Table Name—STRING [20]; Location Tag—STRING [30]
Primary Cause | An application unsuccessfully requested to be notified of table changes.
Primary Action | Contact Cisco TAC.
System (11)
Table 12-12 lists the details of the System (11) minor alarm. To troubleshoot and correct the cause of the alarm, refer to the "Index Table Add Entry Error—System (11)" section.
Table 12-12 System (11) Details
Description | Index Table Add Entry Error (IDX Table Add Entry Error)
Severity | Minor
Threshold | 100
Throttle | 0
Datawords | Table Name—STRING [20]; Index of Entry Being—FOUR_BYTES; Error Code—FOUR_BYTES; Location Tag—STRING [30]
Primary Cause | An internal error has occurred.
Primary Action | If the problem persists, contact Cisco TAC.
System (12)
Table 12-13 lists the details of the System (12) major alarm. To troubleshoot and correct the cause of the alarm, refer to the "Software Error—System (12)" section.
Table 12-13 System (12) Details
Description | Software Error
Severity | Major
Threshold | 100
Throttle | 0
Datawords | Context Description—STRING [80]; FileName—STRING [20]; Line Number of Code—TWO_BYTES; Error Specific Information—STRING [80]
Primary Cause | The logic path is not handled by an algorithm in the code.
Primary Action | Save a trace log from around the time of the occurrence and contact Cisco TAC.
System (13)
Table 12-14 lists the details of the System (13) critical alarm. To troubleshoot and correct the cause of the alarm, refer to the "Multiple Readers and Multiple Writers Maximum Q Depth Reached—System (13)" section.
Table 12-14 System (13) Details
Description | Multiple Readers and Multiple Writers Maximum Q Depth Reached (MRMW Max Q Depth Reached)
Severity | Critical
Threshold | 100
Throttle | 0
Datawords | High Mark for Queue Depth—FOUR_BYTES; Low Mark for Queue Depth—FOUR_BYTES
Primary Cause | Messages are flooding from a malfunctioning network element.
Primary Action | Check the messages to process.
Secondary Cause | Resource congestion or slow processing of messages from queue has occurred.
Secondary Action | Check the process and the system resources. You might need to fail over.
System (14)
Table 12-15 lists the details of the System (14) minor alarm. To troubleshoot and correct the cause of the alarm, refer to the "Multiple Readers and Multiple Writers Queue Reached Low Queue Depth—System (14)" section.
Table 12-15 System (14) Details
Description | Multiple Readers and Multiple Writers Queue Reached Low Queue Depth (MRMW Queue Reached Low Queue Depth)
Severity | Minor
Threshold | 100
Throttle | 0
Datawords | Lower Queue Depth Limit—FOUR_BYTES; Higher Queue Depth Limit—FOUR_BYTES
Primary Cause | Messages are being received from the network at a high rate.
Primary Action | Check the messages to the system.
Secondary Cause | System or processing thread congestion has occurred.
Secondary Action | Check the process and the system resources.
System (15)
Table 12-16 lists the details of the System (15) major alarm. To troubleshoot and correct the cause of the alarm, refer to the "Multiple Readers and Multiple Writers Throttle Queue Depth Reached—System (15)" section.
Table 12-16 System (15) Details
Description | Multiple Readers and Multiple Writers Throttle Queue Depth Reached (MRMW Throttle Queue Depth Reached)
Severity | Major
Threshold | 100
Throttle | 0
Datawords | Throttle Mark for Queue Depth—FOUR_BYTES; Throttle Clear Mark for Queue De—FOUR_BYTES
Primary Cause | Inbound network messages are arriving at a rate much higher than the processing capacity.
Primary Action | Determine the cause of increase in inbound network traffic and try to control the traffic externally.
Secondary Cause | Resource congestion resulting in a slowdown in processing messages from queue has occurred.
Secondary Action | Check the platform CPU utilization, the IPC queue depth, and the overall availability of system resources.
Monitoring System Events
This section provides the information you need for monitoring and correcting system events. Table 12-17 lists all of the system events in numerical order and provides cross-references to each subsection.
Test Report—System (1)
The Test Report event is for testing the system event category. The event is informational and no further action is required.
Inter-Process Communication Queue Read Failure—System (2)
The Inter-Process Communication Queue Read Failure alarm (minor) indicates that an IPC queue read has failed. To troubleshoot and correct the cause of this alarm, refer to the "Inter-Process Communication Queue Read Failure—System (2)" section under Troubleshooting System Alarms.
Inter-Process Communication Message Allocate Failure—System (3)
The Inter-Process Communication Message Allocate Failure alarm (minor) indicates that an IPC message allocation has failed. To troubleshoot and correct the cause of this alarm, refer to the "Inter-Process Communication Message Allocate Failure—System (3)" section under Troubleshooting System Alarms.
Inter-Process Communication Message Send Failure—System (4)
The Inter-Process Communication Message Send Failure alarm (minor) indicates that an IPC message send has failed. To troubleshoot and correct the cause of this alarm, refer to the "Inter-Process Communication Message Send Failure—System (4)" section under Troubleshooting System Alarms.
Unexpected Inter-Process Communication Message Received—System (5)
The Unexpected Inter-Process Communication Message Received event serves as a warning that an unexpected IPC message was received. The primary cause of the event is that the process reporting the event is receiving messages it is not expecting. To correct the primary cause of the event, contact Cisco TAC.
Index List Insert Error—System (6)
The Index List Insert Error alarm (minor) indicates that an error occurred while inserting an entry into the index list. To troubleshoot and correct the cause of this alarm, refer to the "Index List Insert Error—System (6)" section under Troubleshooting System Alarms.
Index List Remove Error—System (7)
The Index List Remove Error alarm (minor) indicates that an error occurred while removing an entry from the index list. To troubleshoot and correct the cause of this alarm, refer to the "Index List Remove Error—System (7)" section under Troubleshooting System Alarms.
Thread Creation Failure—System (8)
The Thread Creation Failure alarm (major) indicates that a process failed to create a thread. To troubleshoot and correct the cause of this alarm, refer to the "Thread Creation Failure—System (8)" section under Troubleshooting System Alarms.
Timer Start Failure—System (9)
The Timer Start Failure event serves as a warning that a timer start failure has occurred. The primary cause of the event is that the process was unable to start a platform timer. If the problem persists, contact Cisco TAC.
Index Update Registration Error—System (10)
The Index Update Registration Error alarm (minor) indicates that an index update registration error has occurred. To troubleshoot and correct the cause of this alarm, refer to the "Index Update Registration Error—System (10)" section under Troubleshooting System Alarms.
Index Table Add Entry Error—System (11)
The Index Table Add Entry Error alarm (minor) indicates that an error occurred during the addition of an entry in the index table. To troubleshoot and correct the cause of this alarm, refer to the "Index Table Add Entry Error—System (11)" section under Troubleshooting System Alarms.
Software Error—System (12)
The Software Error alarm (major) indicates that a software error has occurred. To troubleshoot and correct the cause of this alarm, refer to the "Software Error—System (12)" section under Troubleshooting System Alarms.
Multiple Readers and Multiple Writers Maximum Q Depth Reached—System (13)
The Multiple Readers and Multiple Writers Maximum Q Depth Reached alarm (critical) indicates that the multiple readers and multiple writers (MRMW) maximum queue depth has been reached. To troubleshoot and correct the cause of this alarm, refer to the "Multiple Readers and Multiple Writers Maximum Q Depth Reached—System (13)" section under Troubleshooting System Alarms.
Multiple Readers and Multiple Writers Queue Reached Low Queue Depth—System (14)
The Multiple Readers and Multiple Writers Queue Reached Low Queue Depth alarm (minor) indicates that the MRMW queue has reached the low queue depth threshold. To troubleshoot and correct the cause of this alarm, refer to the "Multiple Readers and Multiple Writers Queue Reached Low Queue Depth—System (14)" section under Troubleshooting System Alarms.
Multiple Readers and Multiple Writers Throttle Queue Depth Reached—System (15)
The Multiple Readers and Multiple Writers Throttle Queue Depth Reached alarm (major) indicates that the MRMW queue has reached the throttle depth. To troubleshoot and correct the cause of this alarm, refer to the "Multiple Readers and Multiple Writers Throttle Queue Depth Reached—System (15)" section under Troubleshooting System Alarms.
Troubleshooting System Alarms
This section provides the information you need for troubleshooting and correcting system alarms. Table 12-18 lists all of the system alarms in numerical order and provides cross-references to each subsection.
Inter-Process Communication Queue Read Failure—System (2)
The Inter-Process Communication Queue Read Failure alarm (minor) indicates that an IPC queue read has failed. The primary cause of the alarm is a problem with the inter-process communication (IPC) process. If the problem persists, contact Cisco TAC.
Inter-Process Communication Message Allocate Failure—System (3)
The Inter-Process Communication Message Allocate Failure alarm (minor) indicates that an IPC message allocation has failed. The primary cause of the alarm is a system error or insufficient free memory to allocate a message buffer. The allocation can fail for the following reasons:
• The message size is too large.
• The message pool has no free entry.
• An internal error has occurred.
To correct the primary cause of the alarm, contact Cisco TAC.
Before contacting Cisco TAC, collect statistics for the message pool and message queue.
To collect the statistics, use the pdm.CAxxx script in the /opt/OptiCall/CAxxx/bin directory.
Example:
pdm.CA146 -> 1.IPC Controls -> 2.Message Pool Stats & 6.Message Queue Stats
Also, use the top command to collect the current CPU usage.
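When gathering data for Cisco TAC, it can help to save command output to a timestamped file so that it can be matched to the alarm time. The following Python sketch is one possible approach and not a Cisco-provided tool: capture_snapshot, the output directory, and the file naming are hypothetical. It only wraps a non-interactive command. The pdm.CAxxx menu selections shown above are made by hand and are not scripted here.

```python
from __future__ import annotations

import subprocess
from datetime import datetime
from pathlib import Path


def capture_snapshot(command: list[str], out_dir: str, label: str) -> Path:
    """Run a non-interactive command once and save its output to a timestamped file."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(out_dir) / f"{label}-{stamp}.txt"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Capture both output streams; a 60-second timeout guards against a hung command.
    result = subprocess.run(command, capture_output=True, text=True, timeout=60)
    target.write_text(result.stdout + result.stderr)
    return target


# Example call. Assumption: the installed top accepts "-b -n 1" (batch mode, one
# iteration), as the Linux procps version does; other builds use different options.
# capture_snapshot(["top", "-b", "-n", "1"], "/tmp/tac-data", "top")
```

The function returns the path of the saved file, which can then be attached to the TAC case together with the message pool and message queue statistics.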
Inter-Process Communication Message Send Failure—System (4)
The Inter-Process Communication Message Send Failure alarm (minor) indicates that the IPC message send has failed. The primary cause of the alarm is that the process for which the message is intended is not running. To correct the primary cause of the alarm, check to ensure that all components and processes are running. Attempt to restart any component or process that is not running. The secondary cause of the alarm is that an internal error has occurred. To correct the secondary cause of the alarm, contact Cisco TAC.
Index List Insert Error—System (6)
The Index List Insert Error alarm (minor) indicates that an error occurred while inserting an entry into the index list. The primary cause of the alarm is that an internal error has occurred. To correct the primary cause of the alarm, contact Cisco TAC.
Index List Remove Error—System (7)
The Index List Remove Error alarm (minor) indicates that an index list remove error has occurred. The primary cause of the alarm is that an internal error has occurred. To correct the primary cause of the alarm, contact Cisco TAC.
Thread Creation Failure—System (8)
The Thread Creation Failure alarm (major) indicates that a thread creation has failed. The primary cause of the alarm is that an internal error occurred. A process was unable to create one of its threads. To correct the primary cause of the alarm, attempt to restart the node on which the error occurred. If the same alarm occurs, contact Cisco TAC.
Index Update Registration Error—System (10)
The Index Update Registration Error alarm (minor) indicates that an index update registration error has occurred. The primary cause of the alarm is that an application unsuccessfully requested to be notified of table changes. To correct the primary cause of the alarm, contact Cisco TAC.
Index Table Add Entry Error—System (11)
The Index Table Add Entry Error alarm (minor) indicates that an error occurred during the addition of an entry in the index table. The primary cause of the alarm is that an internal error has occurred. To correct the primary cause of the alarm, contact Cisco TAC.
Software Error—System (12)
The Software Error alarm (major) indicates that a software error has occurred. The primary cause of the alarm is that a logic path is not handled by any algorithm in the code. To correct the primary cause of the alarm, save the trace log from around the time of occurrence and contact Cisco TAC.
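To save only the part of a trace log that surrounds the occurrence, the log can be filtered by timestamp. The following Python sketch assumes, purely for illustration, that each log line begins with a timestamp in the form YYYY-MM-DD HH:MM:SS. The actual trace log format is not described in this chapter, so STAMP_FORMAT and STAMP_LENGTH must be adjusted to match the real log, and lines_near is a hypothetical helper.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp prefix; adjust to the real log
STAMP_LENGTH = 19                   # length of "YYYY-MM-DD HH:MM:SS"


def lines_near(lines, occurred_at, window_minutes=5):
    """Return the lines whose leading timestamp is within the window around occurred_at."""
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # lines that do not start with a timestamp are skipped
        if abs(stamp - occurred_at) <= window:
            selected.append(line)
    return selected
```

Note that this simple version drops continuation lines that carry no timestamp of their own. Widen window_minutes if the exact time of the occurrence is uncertain.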
Multiple Readers and Multiple Writers Maximum Q Depth Reached—System (13)
The Multiple Readers and Multiple Writers Maximum Q Depth Reached alarm (critical) indicates that the MRMW maximum queue depth has been reached. The primary cause of the alarm is message flooding from a malfunctioning network element. To correct the primary cause of the alarm, check the messages to process. The secondary cause of the alarm is resource congestion or slow processing of messages from the queue. To correct the secondary cause of the alarm, check the process and system resources. The system may need to be failed over.
Multiple Readers and Multiple Writers Queue Reached Low Queue Depth—System (14)
The Multiple Readers and Multiple Writers Queue Reached Low Queue Depth alarm (minor) indicates that the MRMW queue has reached the low queue depth threshold. The primary cause of the alarm is a high rate of messages from the network. To correct the primary cause of the alarm, check the messages to the system. The secondary cause of the alarm is system or processing thread congestion. To correct the secondary cause of the alarm, check process and system resources.
Multiple Readers and Multiple Writers Throttle Queue Depth Reached—System (15)
The Multiple Readers and Multiple Writers Throttle Queue Depth Reached alarm (major) indicates that the MRMW queue has reached the throttle depth. The primary cause of the alarm is that inbound network messages are arriving at a rate much higher than the processing capacity. To correct the primary cause of the alarm, determine the cause of the increase in inbound network traffic, and try to control the traffic externally. The secondary cause of the alarm is resource congestion that slows the processing of messages from the queue. To correct the secondary cause of the alarm, check the platform CPU utilization, IPC queue depths, and overall availability of system resources.
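The datawords for the MRMW alarms come in pairs of marks (a high mark and a low mark for System (13), a throttle mark and a throttle clear mark for System (15)). One common way to use such a pair is hysteresis: a condition is raised when the queue depth reaches the upper mark and is cleared only after the depth falls back to the lower mark. The following Python sketch illustrates that general technique. It is an interpretation for illustration only, not a description of how the BTS 10200 software implements these alarms, and the class name and mark values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DepthMonitor:
    """Raise a condition at high_mark and clear it only once depth is back at low_mark."""

    high_mark: int
    low_mark: int
    active: bool = False

    def update(self, depth: int) -> str:
        """Feed one queue-depth sample; return "raise", "clear", or "none"."""
        if not self.active and depth >= self.high_mark:
            self.active = True
            return "raise"
        if self.active and depth <= self.low_mark:
            self.active = False
            return "clear"
        return "none"


monitor = DepthMonitor(high_mark=80, low_mark=40)  # illustrative values only
print([monitor.update(depth) for depth in (10, 85, 60, 90, 40, 20)])
```

Because the clear mark sits below the raise mark, a queue depth that hovers near the upper mark does not produce a stream of repeated raise and clear notifications.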