Monitoring Alarms and Implementing Alarm Log Correlation
Alarm log correlation extends system logging to include the ability to group and filter messages generated by various applications and system servers and to isolate root messages on the router. This module describes the concepts and tasks related to monitoring or displaying router alarms, configuring alarm log correlation and monitoring alarm logs.
Prerequisites for Implementing Alarm Log Correlation
You must be in a user group associated with a task group that includes the proper task IDs. The command reference guides include the task IDs required for each command. If you suspect user group assignment is preventing you from using a command, contact your AAA administrator for assistance.
Information About Monitoring Alarms and Implementing Alarm Log Correlation
Displaying Router Alarms
You can view router alarms in brief or in detail.
Execute the command show alarms brief to view the router alarms in brief.
RP/0/RSP0/CPU0:router#show alarms brief
-----------------------------------------------------------------------
Active Alarms for 1/0
-----------------------------------------------------------------------
Location Severity Group Set time Description
-----------------------------------------------------------------------
0/1/CPU0 Critical Fabric 11/11/2022 10:34:22 IST LC Bandwidth Insufficient To Support Line Rate Traffic
1/0/CPU0 Major Software 11/11/2022 10:43:36 IST Optics1/0/0/20 - hw_optics: RX LOS LANE-0 ALARM
1/0/CPU0 Major Software 11/11/2022 10:43:36 IST Optics1/0/0/20 - hw_optics: RX LOS LANE-1 ALARM
---------------------------------------------------------------------------------
History Alarms for 1/0
--------------------------------------------------------------------------------
No entries.
--------------------------------------------------------------------------------
Suppressed Alarms for 1/0
--------------------------------------------------------------------------------
No entries.
--------------------------------------------------------------------------------
Conditions for 1/0
--------------------------------------------------------------------------------
No entries.
Execute the command show alarms detail to view the router alarms in detail.
RP/0/RSP0/CPU0:ddc2-uut#show alarms detail
--------------------------------------------------------
Active Alarms for 1/0
--------------------------------------------------------
Description: LC Bandwidth Insufficient To Support Line Rate Traffic
Location: 1/0/CPU0
AID: XR_FABRIC/SW_MISC_ERR/18
Tag String: FAM_FAULT_TAG_HW_FIA_LC_BANDWIDTH
Module Name: N/A
EID: MODULE/MSC/1:MODULE/SLICE/1:MODULE/PSE/1
Reporting Agent ID: 524365
Pending Sync: false
Severity: Critical
Status: Set
Group: Fabric
Set Time: 11/16/2022 20:44:44 IST
Clear Time: -
Service Affecting: NotServiceAffecting
Transport Direction: NotSpecified
Transport Source: NotSpecified
Interface: N/A
Alarm Name: LC-BW-DEG
--------------------------------------------------------
History Alarms for 1/0
--------------------------------------------------------
No entries.
--------------------------------------------------------
Suppressed Alarms for 1/0
--------------------------------------------------------
No entries.
--------------------------------------------------------
Conditions for 1/0
--------------------------------------------------------
No entries.
--------------------------------------------------------
Clients for 1/0
--------------------------------------------------------
Agent Name: optics_fm.xml
Agent ID: 196678
Agent Location: 1/0/CPU0
Agent Handle: 93827323237168
Agent State: Registered
Agent Type: Producer
Agent Filter Display: false
Agent Subscriber ID: 0
Agent Filter Severity: Unknown
Agent Filter State: Unknown
Agent Filter Group: Unknown
Agent Connect Count: 1
Agent Connect Timestamp: 11/16/2022 20:40:18 IST
Agent Get Count: 0
Agent Subscribe Count: 0
Agent Report Count: 8
--------------------------------------------------------
Statistics for 1/0
--------------------------------------------------------
Alarms Reported: 9
Alarms Dropped: 0
Active (bi-state set): 9
History (bi-state cleared): 0
Suppressed: 0
Dropped Invalid AID: 0
Dropped No Memory: 0
Dropped DB Error: 0
Dropped Clear Without Set: 0
Dropped Duplicate: 0
Cache Hit: 0
Cache Miss: 0
Alarm Logging and Debugging Event Management System
Cisco IOS XR Software Alarm Logging and Debugging Event Management System (ALDEMS) is used to monitor and store alarm messages that are forwarded by system servers and applications. In addition, ALDEMS correlates alarm messages forwarded due to a single root cause.
ALDEMS extends the basic logging and monitoring functionality of Cisco IOS XR Software, providing the level of alarm and event management necessary for a highly distributed system with potentially hundreds of line cards and thousands of interfaces.
Cisco IOS XR Software achieves this necessary level of alarm and event management by distributing logging applications across the nodes on the system.
The following figure illustrates the relationship between the components that constitute ALDEMS.
Correlator
The correlator receives messages from system logging (syslog) helper processes that are distributed across the nodes on the router and forwards syslog messages to the syslog process. If a logging correlation rule is configured, the correlator captures messages, searching for a match with any message specified in the rule. If the correlator finds a match, it starts a timer that corresponds to the timeout interval specified in the rule. The correlator continues searching for a match to messages in the rule until the timer expires. If the root-cause message was received, then a correlation occurs; otherwise, all captured messages are forwarded to the syslog. When a correlation occurs, the correlated messages are stored in the logging correlation buffer. The correlator tags each set of correlated messages with a correlation ID.
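The capture-and-timer behavior described above can be sketched in Python. This is an illustrative model only, not the router's implementation: the dictionary layout, the function names `capture` and `on_timer_expiry`, and the ID counter are all assumptions made for the example.

```python
from itertools import count

_correlation_ids = count(1)  # hypothetical source of correlation IDs


def capture(rule, messages):
    """Collect messages named in the rule that arrive before the timer expires."""
    wanted = {rule["root_cause"], *rule["non_root_causes"]}
    captured, deadline = [], None
    for msg in sorted(messages, key=lambda m: m["time_ms"]):
        if msg["id"] not in wanted:
            continue  # not named in the rule, so not captured
        if deadline is None:
            # The first matching message starts the rule timer.
            deadline = msg["time_ms"] + rule["timeout_ms"]
        if msg["time_ms"] > deadline:
            break  # timer expired; later matches are outside this window
        captured.append(msg)
    return captured


def on_timer_expiry(rule, captured):
    """Return (messages forwarded to syslog, entries for the correlation buffer)."""
    root = [m for m in captured if m["id"] == rule["root_cause"]]
    if not root:
        # No root-cause message arrived: forward everything that was captured.
        return list(captured), []
    corr_id = next(_correlation_ids)
    # Every message in the correlated set carries the same correlation ID.
    buffered = [dict(m, correlation_id=corr_id) for m in captured]
    return root, buffered
```

Each message is modeled as a dictionary with an `id` triplet (category, group, code) and an arrival time in milliseconds. The sketch forwards only the root-cause message when a correlation occurs and keeps the full correlated set in the buffer.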
System Logging Process
The system logging (syslog) process receives the messages that the correlator forwards. From the syslog process, messages are passed to the alarm logger and, depending on the network device configuration, to destinations such as the console, remote terminals, and remote servers.
Alarm Logger
The alarm logger is the final destination for system logging messages forwarded on the router. The alarm logger stores alarm messages in the logging events buffer. The logging events buffer is circular; that is, when full, it overwrites the oldest messages in the buffer.
Note: Alarms are prioritized in the logging events buffer. When it is necessary to overwrite an alarm record, the logging events buffer overwrites messages in the following order: nonbistate alarms first, then bistate alarms in the CLEAR state, and, finally, bistate alarms in the SET state.
When the table becomes full of messages caused by bistate alarms in the SET state, the earliest bistate message (based on the message time stamp, not arrival time) is reclaimed first. Adjust the sizes of the logging events buffer and the logging correlation buffer so that memory consumption stays within your requirements.
A table-full alarm is generated each time the logging events buffer wraps around. A threshold crossing notification is generated each time the logging events buffer reaches the capacity threshold.
Messages stored in the logging events buffer can be queried by clients to locate records matching specific criteria. The alarm logging mechanism assigns a sequential, unique ID to each alarm message.
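The overwrite order for a full logging events buffer can be modeled as a simple ranking. The sketch below is hypothetical: the record fields and the name `pick_victim` are invented for illustration, and it assumes that the oldest record is reclaimed first within every class (the text above states this explicitly only for bistate alarms in the SET state).

```python
def pick_victim(records):
    """Choose the record to overwrite when the events buffer is full.

    Each record is a dict with:
      "bistate":   True for bistate alarms, False otherwise
      "state":     "SET", "CLEAR", or None for nonbistate alarms
      "timestamp": message time stamp (not arrival time)
    """
    def priority(rec):
        if not rec["bistate"]:
            rank = 0  # nonbistate alarms are overwritten first
        elif rec["state"] == "CLEAR":
            rank = 1  # then bistate alarms in the CLEAR state
        else:
            rank = 2  # bistate alarms in the SET state are kept longest
        # Within a class, the earliest time stamp is reclaimed first.
        return (rank, rec["timestamp"])

    return min(records, key=priority)
```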
Configuring Alarm Log Correlation
Perform the configuration tasks in this section to configure alarm log correlation as required.
Configuring Logging Correlation Rules
Logging correlation can be used to isolate the most significant root messages for events affecting system performance. When correlation rules are configured, a common root event that is generating secondary (non-root-cause) messages can be isolated and sent to the syslog, while secondary messages are suppressed. An operator can retrieve all correlated messages from the logging correlator buffer to view correlation events that have occurred.
If a correlation rule is applied to the entire router, then correlation takes place only for those messages that match the configured cause values for the rule, regardless of the context or location setting of that message. If a correlation rule is applied to a specific set of contexts or locations, then correlation takes place only for those messages that match the configured cause values for the rule and that match at least one of those contexts or locations.
When a correlation rule is configured and applied, the correlator starts searching for a message match as specified in the rule. Timeout can be configured to specify the time interval for a message search once a match is found. Timeout begins when the correlator captures any alarm message specified for a correlation rule.
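The scoping of an applied rule (entire router versus specific contexts or locations) amounts to a two-part check. The following Python sketch expresses it; the function name `rule_applies` and the dictionary fields are assumptions made for this example and do not correspond to any router API.

```python
def rule_applies(message, rule, scope):
    """Decide whether a message is eligible for correlation under an applied rule.

    message: {"id": (category, group, code), "context": str or None,
              "location": str or None}
    rule:    {"causes": set of (category, group, code) triplets}
    scope:   {"all_of_router": bool, "contexts": set, "locations": set}
    """
    # The message must always match one of the configured cause values.
    if message["id"] not in rule["causes"]:
        return False
    # Applied to the entire router: context and location are ignored.
    if scope["all_of_router"]:
        return True
    # Otherwise at least one configured context or location must match.
    return (message.get("context") in scope["contexts"]
            or message.get("location") in scope["locations"])
```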
Configuration Example
This example shows how to configure and apply a logging correlation rule. In this example, timeout is configured as 60000 milliseconds.
Router# configure
Router(config)# logging correlator rule test type stateful
Router(config-corr-rule-st)# timeout 60000
Router(config)# logging correlator apply rule test
Router(config-corr-apply-rule)# all-of-router
or
Router(config-corr-apply-rule)# location 0/RP0/CPU0
or
Router(config-corr-apply-rule)# context rule_2
Router(config)# commit
Configuring a Logging Correlation Rule Set
You can configure a logging correlation rule set and include multiple correlation rules.
Configuration Example
This example shows how to configure and apply a logging correlation rule set for multiple correlation rules. The logging correlation rule set can be applied to the entire router or to a specific context or location.
Router# configure
Router(config)# logging correlator ruleset test
Router(config-corr-ruleset)# rulename test1
Router(config-corr-ruleset)# rulename test2
Router(config-corr-ruleset)# rulename test3
Router(config)# logging correlator apply ruleset test
Router(config-corr-apply-rule)# all-of-router
or
Router(config-corr-apply-rule)# location 0/RP0/CPU0
or
Router(config-corr-apply-rule)# context test123
Router(config)# commit
Correlating a Root Cause and Non-Root-Cause Alarms
The first message (with category, group, and code triplet) configured in a correlation rule defines the root-cause message. A root-cause message is always forwarded to the syslog process. You can correlate a root cause to one or more non-root-cause alarms and configure them as part of a rule.
Configuration Example
This example shows how to correlate a root cause to one or more non-root-cause alarms and configure them to a rule.
Router# configure
Router(config)# logging correlator rule rule_1 type stateful
Router(config-corr-rule-st)# rootcause CAT_b1 GROUP_b1 Root_b1
Router(config-corr-rule-st)# nonrootcause
Router(config-corr-rule-st-nonrc)# alarm CAT_b1 GROUP_b1 Root_b1
Router(config)# commit
Configuring Logging Suppression Rules
The alarm logging suppression feature enables you to suppress the logging of alarms by defining logging suppression rules that specify the types of alarms that you want to suppress. A logging suppression rule can specify all types of alarms or alarms with specific message categories, group names, and message codes. You can apply a logging suppression rule to alarms originating from all locations on the router or to alarms originating from specific nodes.
Configuration Example
This example shows how to configure logging suppression rules.
Router# configure
Router(config)# logging suppress rule infobistate
Router(config-suppr-rule)# alarm MBGL COMMIT SUCCEEDED
Router(config)# logging suppress apply rule infobistate
Router(config-suppr-apply-rule)# all-of-router
Router(config)# commit
Modifying Logging Events Buffer Settings
The alarm logger stores alarm messages in the logging events buffer. The logging events buffer overwrites the oldest messages in the buffer when it is full. Logging events buffer settings can be adjusted to respond to changes in user activity, network events, or system configuration events that affect network performance, or in network monitoring requirements. The appropriate settings depend on the configuration and requirements of the system. A threshold crossing notification is generated each time the logging events buffer reaches the capacity threshold.
Configuration Example
This example shows how to configure the logging events buffer size, threshold, and alarm filter.
Router# configure terminal
Router(config)# logging events buffer-size 50000
Router(config)# logging events threshold 85
Router(config)# logging events level warnings
Router(config)# commit
Modifying Logging Correlation Buffer Settings
When a correlation occurs, the correlated messages are stored in the logging correlation buffer. The size of the logging correlation buffer can be adjusted to accommodate the anticipated volume of incoming correlated messages. Records can be removed from the buffer by specifying the records, or the buffer can be cleared of all records.
Configuration Example
This example shows how to configure the correlation buffer size and remove records from the buffer.
Router# configure terminal
Router(config)# logging correlator buffer-size 100000
Router(config)# exit
Router# clear logging correlator delete 48 49 50 (optional)
Router# clear logging correlator delete all-in-buffer (optional)
Enabling Alarm Source Location Display Field for Bistate Alarms
Bistate alarms are generated by state changes associated with system hardware. The bistate alarm message format is similar to syslog messages. You can optionally configure the output to include the location of the actual alarm source, which may be different from the process that logged the alarm. For more information about bistate alarms, see Bistate Alarms.
Configuration Example
This example shows how to enable the alarm source location display field for bistate alarms.
Router# configure
Router(config)# logging events display-location
Router(config)# commit
Configuring SNMP Correlation Rules
In large-scale systems, many SNMP traps may be emitted at regular intervals. These traps add to the time that Cisco IOS XR spends processing traps. The additional traps can also slow down troubleshooting and increase the workload for the monitoring systems and the operators. SNMP alarm correlation helps to extract the generic pieces of correlation functionality from the existing syslog correlator. You can configure correlation rules for SNMP traps and apply them to specific trap destinations.
Configuration Example
This example shows how to configure and apply correlation rules for SNMP traps. The SNMP correlator buffer size is also configured as 600 bytes. The default value for buffer size is 64KB.
Router# configure terminal
Router(config)# snmp-server correlator buffer-size 600 (optional)
Router(config)# snmp-server correlator rule test rootcause A varbind A1 value regex RA1 nonrootcause trap B varbind B1 index regex RB1
Router(config)# snmp-server correlator apply rule test host ipv4 address 1.2.3.4
Router(config)# commit
Configuring SNMP Correlation Ruleset
You can configure an SNMP correlation rule set and include multiple SNMP correlation rules.
Configuration Example
This example shows how to configure a ruleset that groups two or more rules. You can apply the ruleset to a set of hosts or to all of them.
RP/0/RP0/CPU0:Router# configure terminal
RP/0/RP0/CPU0:Router(config)# snmp-server correlator ruleset rule1 rulename rule2
RP/0/RP0/CPU0:Router(config)# snmp-server correlator apply ruleset rule1 host ipv4 address 1.2.3.4
RP/0/RP0/CPU0:Router(config)# commit
Alarm Logging Correlation-Details
Alarm logging correlation can be used to isolate the most significant root messages for events affecting system performance. For example, the original message describing a card online insertion and removal (OIR) of a line card can be isolated so that only the root-cause message is displayed and all subsequent messages related to the same event are correlated. When correlation rules are configured, a common root event that is generating secondary (non-root-cause) messages can be isolated and sent to the syslog, while secondary messages are suppressed. An operator can retrieve all correlated messages from the logging correlator buffer to view correlation events that have occurred.
Correlation Rules
Correlation rules can be configured to isolate root messages that may generate system alarms. Correlation rules prevent unnecessary stress on Alarm Logging and Debugging Event Management System (ALDEMS) caused by the accumulation of unnecessary messages. Each correlation rule depends on a message identification, which consists of the following:
- Message category
- Message group name
- Message code
The correlator process scans messages for occurrences of the message. If the correlator receives a root message, the correlator stores it in the logging correlator buffer and forwards it to the syslog process on the RP. From there, the syslog process forwards the root message to the alarm logger, where it is stored in the logging events buffer. From the syslog process, the root message may also be forwarded to destinations such as the console, remote terminals, remote servers, the fault management system, and the Simple Network Management Protocol (SNMP) agent, depending on the network device configuration. Subsequent messages meeting the same criteria (including another occurrence of the root message) are stored in the logging correlation buffer and are forwarded to the syslog process on the router.
There are two types of correlation that rules can use to isolate root-cause messages: stateful correlation and nonstateful correlation. Nonstateful correlation is fixed after it has occurred, and non-root-cause alarms that are suppressed are never forwarded to the syslog process. All non-root-cause alarms remain buffered in correlation buffers. Stateful correlation can change after it has occurred if the bistate root-cause alarm clears. When the alarm clears, all the correlated non-root-cause alarms are sent to syslog and are removed from the correlation buffer. Stateful correlations are useful to detect non-root-cause conditions that continue to exist even if the suspected root cause no longer exists.
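The difference between the two correlation types shows up only when the root-cause alarm clears. A minimal Python sketch of that moment follows; the function name `on_root_cause_clear` and its arguments are hypothetical and serve only to illustrate the behavior described above.

```python
def on_root_cause_clear(held_non_root_alarms, stateful):
    """Model what happens to held non-root-cause alarms when the root cause clears.

    Returns (alarms released to syslog, alarms left in the correlation buffer).
    """
    if stateful:
        # Stateful: the correlation is undone, so the held alarms are sent
        # to syslog and removed from the correlation buffer.
        return list(held_non_root_alarms), []
    # Nonstateful: the correlation is fixed, so the alarms stay buffered
    # and are never forwarded.
    return [], list(held_non_root_alarms)
```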
Alarm Severity Level and Filtering
Filter settings can be used to display information based on severity level. The alarm filter display indicates the severity level settings used to report alarms, the number of records, and the current and maximum log size.
Alarms can be filtered according to the severity level shown in this table.
Severity Level | System Condition |
---|---|
0 | Emergencies |
1 | Alerts |
2 | Critical |
3 | Errors |
4 | Warnings |
5 | Notifications |
6 | Informational |
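As a rough illustration of severity filtering, the Python sketch below maps the condition names in the table to their numeric levels. It assumes the usual syslog convention that a lower number is more severe and that a filter set to a given level admits alarms at that level and all more severe ones. The function name `passes_filter` is hypothetical; verify the exact filter semantics against the command reference for your release.

```python
# Numeric levels taken from the table above.
SEVERITY = {
    "emergencies": 0,
    "alerts": 1,
    "critical": 2,
    "errors": 3,
    "warnings": 4,
    "notifications": 5,
    "informational": 6,
}


def passes_filter(alarm_severity, configured_level):
    """Return True if an alarm is at least as severe as the configured level.

    Assumption: lower numbers are more severe, so an alarm passes when its
    numeric severity is less than or equal to the configured level.
    """
    return alarm_severity <= SEVERITY[configured_level]
```

Under this assumption, a filter set to `warnings` would admit levels 0 through 4 and exclude notifications and informational messages.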
Bistate Alarms
Bistate alarms are generated by state changes associated with system hardware, such as a change of interface state from active to inactive, the online insertion and removal (OIR) of a line card, or a change in component temperature. Bistate alarm events are reported to the logging events buffer by default; informational and debug messages are not.
Cisco IOS XR Software provides the ability to reset and clear alarms. Clients interested in monitoring alarms in the system can register with the alarm logging mechanism to receive asynchronous notifications when a monitored alarm changes state.
Bistate alarm notifications provide the following information:
- The origination ID, which uniquely identifies the resource that causes an alarm to be raised or cleared. This resource may be an interface, a line card, or an application-specific integrated circuit (ASIC). The origination ID is a unique combination of the location, job ID, message group, and message context.
By default, the general format of bistate alarm messages is the same as for all syslog messages:
node-id:timestamp : process-name [pid] : %category-group-severity-code : message-text
The following is a sample bistate alarm message:
LC/0/0/CPU0:Jan 15 21:39:11.325 2016:ifmgr[163]: %PKT_INFRA-LINEPROTO-5-UPDOWN : Line protocol on Interface TenGigE 0/0/0/0, changed state to Down
The message text includes the location of the process logging the alarm. In this example, the alarm was logged by the line protocol on TenGigE interface 0/0/0/0. Optionally, you can configure the output to include the location of the actual alarm source, which may be different from the process that logged the alarm. This appears as an additional display field before the message text.
When alarm source location is displayed, the general format becomes:
node-id:timestamp : process-name [pid] : %category-group-severity-code : source-location message-text
The following is a sample when alarm source location is displayed:
LC/0/0/CPU0:Jan 15 21:39:11.325 2016:ifmgr[163]: %PKT_INFRA-LINEPROTO-5-UPDOWN : interface TenGigE 0/0/0/0 : Line protocol on Interface TenGigE 0/0/0/0, changed state to Down
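For readers who post-process these messages, the two formats can be split into fields with a regular expression. The Python sketch below is one possible parser, not a Cisco tool: the pattern, field names, and `parse_alarm` function are assumptions. It treats each sample message as a single line (the wrap in the samples is taken to be a page-layout artifact). Because the message text can itself contain " : ", the caller must state whether the source location field is enabled.

```python
import re

# Pattern for: node-id:timestamp : process-name [pid] :
#              %category-group-severity-code : rest
ALARM_RE = re.compile(
    r"^(?P<node>[^:]+):"
    r"(?P<timestamp>.+?)\s*:\s*"
    r"(?P<process>[\w.-]+)\s*\[(?P<pid>\d+)\]\s*:\s*"
    r"%(?P<category>\w+)-(?P<group>\w+)-(?P<severity>\d)-(?P<code>\w+)\s*:\s*"
    r"(?P<rest>.*)$"
)


def parse_alarm(line, has_source_location=False):
    """Split a bistate alarm message into named fields, or return None."""
    match = ALARM_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    rest = fields.pop("rest")
    if has_source_location:
        # The source location is the extra field before the message text.
        location, _, text = rest.partition(" : ")
        fields["source_location"], fields["text"] = location, text
    else:
        fields["source_location"], fields["text"] = None, rest
    return fields
```

All captured values are strings; convert `pid` and `severity` to integers if you need to compare them numerically.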
Context Correlation Flag
The context correlation flag controls whether correlation takes place on a per-context basis.
This flag causes behavior change only if the rule is applied to one or more contexts. It does not go into effect if the rule is applied to the entire router or location nodes.
The following is a scenario of context correlation behavior:
- Rule 1 has a root cause A and an associated non-root cause B.
- Context correlation flag is not set on Rule 1.
- Rule 1 is applied to contexts 1 and 2.
If the context correlation flag is not set on Rule 1, and alarm A is generated from context 1 while alarm B is generated from context 2, the rule applies across both contexts, regardless of the type of context.
If the context correlation flag is now set on Rule 1 and the same alarms are generated, they are not correlated as they are from different contexts.
With the flag set, the correlator analyzes alarms against the rule only if alarms arrive from the same context. In other words, if alarm A is generated from context 1 and alarm B is generated from context 2, then a correlation does not occur.
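The scenario above reduces to a single comparison, sketched here in Python. The function name `may_correlate` and its arguments are illustrative assumptions, and the sketch covers only a rule that has been applied to one or more contexts.

```python
def may_correlate(context_correlation_flag, root_context, non_root_context):
    """Decide whether two alarms matching the same rule may be correlated.

    Applies only when the rule is applied to contexts; the flag has no
    effect for rules applied to the entire router or to location nodes.
    """
    if not context_correlation_flag:
        # Flag not set: alarms are correlated across the applied contexts.
        return True
    # Flag set: both alarms must come from the same context.
    return root_context == non_root_context
```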
Duration Timeout Flags
The root-cause timeout, if specified, is an alternative rule timeout that is used when a non-root-cause alarm arrives before the root-cause alarm in a given rule. It is typically set to a shorter value, on the assumption that the root-cause alarm is then less likely to arrive, so that the hold on the non-root-cause alarms is released sooner.
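The choice between the two timeouts can be written as a small selection function. This Python sketch is a model of the rule described above, with hypothetical names and millisecond units chosen for the example.

```python
def effective_timeout_ms(first_alarm_is_root_cause, timeout_ms,
                         root_cause_timeout_ms=None):
    """Pick the timer value to use when the first matching alarm arrives."""
    if not first_alarm_is_root_cause and root_cause_timeout_ms is not None:
        # A non-root-cause alarm arrived first and an alternative timeout
        # is configured: hold the alarm for the (typically shorter) interval.
        return root_cause_timeout_ms
    # Otherwise the regular rule timeout applies.
    return timeout_ms
```

For example, with a rule timeout of 60000 milliseconds and a root-cause timeout of 10000 milliseconds, a non-root-cause alarm that arrives first would be held for at most 10000 milliseconds in this model.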