Monitoring Alarms and Implementing Alarm Log Correlation

This module describes the concepts and tasks related to monitoring or displaying router alarms, and configuring alarm log correlation. Alarm log correlation extends system logging to include the ability to group and filter messages generated by various applications and system servers and to isolate root messages on the router.

Prerequisites for Implementing Alarm Log Correlation

You must be in a user group associated with a task group that includes the proper task IDs. The command reference guides include the task IDs required for each command. If you suspect user group assignment is preventing you from using a command, contact your AAA administrator for assistance.

Information About Monitoring Alarms and Implementing Alarm Log Correlation

Displaying Router Alarms

You can view router alarms in brief or in detail.

Run the show alarms brief command to view a summary of the router alarms.


RP/0/RSP0/CPU0:router#show alarms brief

-----------------------------------------------------------------------
Active Alarms for 1/0
-----------------------------------------------------------------------
Location    Severity   Group     Set time               Description

-----------------------------------------------------------------------
0/1/CPU0   Critical   Fabric   11/11/2022 10:34:22 IST  LC Bandwidth Insufficient To Support Line Rate Traffic
1/0/CPU0    Major    Software  11/11/2022 10:43:36 IST   Optics1/0/0/20 - hw_optics:  RX LOS LANE-0 ALARM 
1/0/CPU0    Major    Software  11/11/2022 10:43:36 IST   Optics1/0/0/20 - hw_optics:  RX LOS LANE-1 ALARM 
---------------------------------------------------------------------------------
History Alarms for 1/0
--------------------------------------------------------------------------------
No entries.

--------------------------------------------------------------------------------
Suppressed Alarms for 1/0
--------------------------------------------------------------------------------
No entries.

--------------------------------------------------------------------------------
Conditions for 1/0
--------------------------------------------------------------------------------
No entries.

Run the show alarms detail command to view the router alarms in detail.


RP/0/RSP0/CPU0:ddc2-uut#show alarms detail

--------------------------------------------------------
Active Alarms for 1/0
--------------------------------------------------------
Description:             LC Bandwidth Insufficient To Support Line Rate Traffic                                                                                                                                                                 
Location:                1/0/CPU0                                                                                       
AID:                     XR_FABRIC/SW_MISC_ERR/18                                                                       
Tag String:              FAM_FAULT_TAG_HW_FIA_LC_BANDWIDTH                                                              
Module Name:             N/A                                                                                            
EID:                     MODULE/MSC/1:MODULE/SLICE/1:MODULE/PSE/1                                                       
Reporting Agent ID:      524365
Pending Sync:            false
Severity:                Critical
Status:                  Set
Group:                   Fabric
Set Time:                11/16/2022 20:44:44 IST
Clear Time:              -
Service Affecting:       NotServiceAffecting
Transport Direction:     NotSpecified
Transport Source:        NotSpecified
Interface:               N/A                                                                                            
Alarm Name:              LC-BW-DEG  
--------------------------------------------------------                                                                                    
History Alarms for 1/0
--------------------------------------------------------
No entries.

--------------------------------------------------------
Suppressed Alarms for 1/0
--------------------------------------------------------
No entries.

--------------------------------------------------------
Conditions for 1/0
--------------------------------------------------------
No entries.

--------------------------------------------------------
Clients for 1/0
--------------------------------------------------------
Agent Name:              optics_fm.xml                                                                                  
Agent ID:                196678
Agent Location:          1/0/CPU0                                                                                       
Agent Handle:            93827323237168                                                                                 
Agent State:             Registered
Agent Type:              Producer
Agent Filter Display:    false
Agent Subscriber ID:     0
Agent Filter Severity:   Unknown
Agent Filter State:      Unknown
Agent Filter Group:      Unknown
Agent Connect Count:     1
Agent Connect Timestamp: 11/16/2022 20:40:18 IST
Agent Get Count:         0
Agent Subscribe Count:   0
Agent Report Count:      8
--------------------------------------------------------
Statistics for 1/0
--------------------------------------------------------
Alarms Reported:                9
Alarms Dropped:                 0
Active (bi-state set):          9
History (bi-state cleared):     0
Suppressed:                     0
Dropped Invalid AID:            0
Dropped No Memory:              0
Dropped DB Error:               0
Dropped Clear Without Set:      0
Dropped Duplicate:              0
Cache Hit:                      0
Cache Miss:                     0

Alarm Logging and Debugging Event Management System

Cisco IOS XR Software Alarm Logging and Debugging Event Management System (ALDEMS) is used to monitor and store alarm messages that are forwarded by system servers and applications. In addition, ALDEMS correlates alarm messages forwarded due to a single root cause.

ALDEMS enlarges on the basic logging and monitoring functionality of Cisco IOS XR Software, providing the level of alarm and event management necessary for a highly distributed system with potentially hundreds of line cards and thousands of interfaces.

Cisco IOS XR Software achieves this necessary level of alarm and event management by distributing logging applications across the nodes on the system.

The following figure illustrates the relationship between the components that constitute ALDEMS.

Figure 1. ALDEMS Component Communications

Correlator

The correlator receives messages from system logging (syslog) helper processes that are distributed across the nodes on the router and forwards syslog messages to the syslog process. If a logging correlation rule is configured, the correlator captures messages and searches for a match with any message specified in the rule. When the correlator finds a match, it starts a timer that corresponds to the timeout interval specified in the rule, and it continues searching for matches to messages in the rule until the timer expires. If the root-cause message was received by then, a correlation occurs; otherwise, all captured messages are forwarded to the syslog. When a correlation occurs, the correlated messages are stored in the logging correlation buffer, and the correlator tags each set of correlated messages with a correlation ID.
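
The capture-and-timeout behavior described above can be sketched in Python. This is a minimal illustrative model, not the Cisco IOS XR implementation: the Rule and Correlator classes, their field names, and the use of explicit millisecond timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    """Hypothetical correlation rule: one root cause plus its non-root causes."""
    root: str
    non_roots: set
    timeout_ms: int


@dataclass
class Correlator:
    rule: Rule
    captured: list = field(default_factory=list)  # messages held while the timer runs
    deadline_ms: Optional[int] = None             # None means no timer is running
    next_id: int = 1                              # next correlation ID to assign

    def receive(self, msg, now_ms):
        """Return the messages forwarded to syslog immediately."""
        if msg != self.rule.root and msg not in self.rule.non_roots:
            return [msg]  # not named in the rule: pass straight through
        if self.deadline_ms is None:
            # The first matching message starts the rule timer.
            self.deadline_ms = now_ms + self.rule.timeout_ms
        self.captured.append(msg)
        return []

    def expire(self, now_ms):
        """Return (forwarded_messages, correlation_or_None) once the timer is up."""
        if self.deadline_ms is None or now_ms < self.deadline_ms:
            return [], None
        captured, self.captured, self.deadline_ms = self.captured, [], None
        if self.rule.root in captured:
            # Root cause seen: the set is kept as one correlation with its own ID,
            # and only the root-cause message goes on to syslog.
            correlation = {"id": self.next_id, "messages": captured}
            self.next_id += 1
            return [self.rule.root], correlation
        # No root cause arrived in time: everything captured is forwarded.
        return captured, None
```

In this sketch, a caller would invoke expire() periodically or from a timer; the returned correlation dictionary stands in for an entry in the logging correlation buffer.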

System Logging Process

By default, the router sends system logging messages to a system logging (syslog) process. Syslog helper processes, which are distributed across the nodes of the router, gather the syslog messages. The system logging process controls the distribution of logging messages to the various destinations, such as the system logging buffer, the console, terminal lines, or a syslog server, depending on the network device configuration.

Alarm Logger

The alarm logger is the final destination for system logging messages forwarded on the router. The alarm logger stores alarm messages in the logging events buffer. The logging events buffer is circular; that is, when full, it overwrites the oldest messages in the buffer.


Note


Alarms are prioritized in the logging events buffer. When it is necessary to overwrite an alarm record, the logging events buffer overwrites messages in the following order: nonbistate alarms first, then bistate alarms in the CLEAR state, and, finally, bistate alarms in the SET state.


When the buffer becomes full of messages caused by bistate alarms in the SET state, the earliest bistate message (based on the message time stamp, not arrival time) is reclaimed before others. Adjust the sizes of the logging events buffer and the logging correlation buffer accordingly, so that memory consumption stays within your requirements.
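
The overwrite order can be expressed as a sort key. The following Python sketch is illustrative only: the record layout (the kind, state, and timestamp keys) is hypothetical, and choosing the oldest record within the nonbistate and CLEAR classes is an assumption, because the text specifies time-stamp ordering only for bistate alarms in the SET state.

```python
def pick_victim(records):
    """Return the index of the record to overwrite when the buffer is full.

    Each record is a dict with hypothetical keys:
      'kind'      -> 'nonbistate' or 'bistate'
      'state'     -> 'SET' or 'CLEAR' (meaningful for bistate records only)
      'timestamp' -> message time stamp (any comparable value)
    """
    def priority(item):
        _, rec = item
        if rec["kind"] != "bistate":
            rank = 0          # nonbistate alarms are overwritten first
        elif rec["state"] == "CLEAR":
            rank = 1          # then bistate alarms in the CLEAR state
        else:
            rank = 2          # bistate alarms in the SET state go last
        # Within a class, the earliest message time stamp is reclaimed first.
        return (rank, rec["timestamp"])

    index, _ = min(enumerate(records), key=priority)
    return index
```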

A table-full alarm is generated each time the logging events buffer wraps around. A threshold crossing notification is generated each time the logging events buffer reaches the capacity threshold.

Messages stored in the logging events buffer can be queried by clients to locate records matching specific criteria. The alarm logging mechanism assigns a sequential, unique ID to each alarm message.

Configuring Alarm Log Correlation

Perform the configuration tasks in this section to configure alarm log correlation as required.

Configuring Logging Correlation Rules

Logging correlation can be used to isolate the most significant root messages for events affecting system performance. When correlation rules are configured, a common root event that is generating (root-cause) messages can be isolated and sent to the syslog, while secondary messages are suppressed. An operator can retrieve all correlated messages from the logging correlator buffer to view correlation events that have occurred. If a correlation rule is applied to the entire router, then correlation takes place only for those messages that match the configured cause values for the rule, regardless of the context or location setting of that message.

Timeout can be configured to specify the time interval for a message search once a match is found. Timeout begins when the correlator captures any alarm message specified for a correlation rule.

Configuration Example

This example shows how to configure and apply a logging correlation rule. In this example, timeout is configured as 60000 milliseconds.

Router# configure
Router(config)# logging correlator rule rule1 type stateful
Router(config-corr-rule-st)# timeout 60000
Router(config-corr-rule-st)# exit
Router(config)# commit

Correlating a Root Cause and Non-Root-Cause Alarms

The first message (with category, group, and code triplet) configured in a correlation rule defines the root-cause message. A root-cause message is always forwarded to the syslog process. You can correlate a root cause to one or more non-root-cause alarms and configure them as part of a rule.

Configuration Example

This example shows how to correlate a root cause to one or more non-root-cause alarms and configure them to a rule.

Router# configure
Router(config)# logging correlator rule rule1 type stateful
Router(config-corr-rule-st)# rootcause CAT_BI_1 GROUP_BI_1 CODE_BI_1
Router(config-corr-rule-st)# nonrootcause
Router(config-corr-rule-st-nonrc)# alarm CAT_BI_2 GROUP_BI_2 CODE_BI_2
Router(config)# commit

Applying a Logging Correlation Rule

If a correlation rule is applied to a specific set of contexts or locations, then correlation takes place only for those messages that match the configured cause values for the rule and that match at least one of those contexts or locations. When a correlation rule is configured and applied, the correlator starts searching for a message match as specified in the rule.

Configuration Example

This example shows how to apply a logging correlation rule.

Router# configure
Router(config)# logging correlator apply rule rule1 
Router(config-corr-apply-rule)# all-of-router 
or 
Router(config-corr-apply-rule)# location 0/1/CPU0 
or 
Router(config-corr-apply-rule)# context HundredGigE_0_0_0_0 
Router(config)# commit

Configuring a Logging Correlation Rule Set

You can configure a logging correlation rule set and include multiple correlation rules.

Configuration Example

This example shows how to configure and apply a logging correlation rule set for multiple correlation rules. To configure a ruleset, you should first configure a rule with the same name. The logging correlation ruleset can be applied to the entire router or to a specific context or location.

Router# configure
Router(config)# logging correlator ruleset rule1
Router(config-corr-ruleset)# rulename rule1
Router(config-corr-ruleset)# exit
Router(config)# logging correlator apply ruleset rule1
Router(config-corr-apply-rule)# all-of-router 
or
Router(config-corr-apply-rule)# location 0/2/CPU0
or
Router(config-corr-apply-rule)# context HundredGigE_0_0_0_0
Router(config)# commit

Configuring Hierarchical Correlation Rule Flags

Hierarchical correlation occurs when a single alarm is both a root cause for one correlation rule and a non-root cause for another rule, and the alarms that are generated result in successful correlations for both rules. What happens to a non-root-cause alarm depends on the behavior of its correlated root-cause alarm. In some cases, you may want to control the stateful behavior associated with these hierarchies by setting flags, such as reparenting and reissuing of non-bistate alarms. For detailed information about hierarchical correlation and correlation flags, see Hierarchical Correlation.

Configuration Example

This example shows how to configure hierarchical correlation rule flags.

Router# configure
Router(config)# logging correlator rule rule_stateful type stateful
Router(config-corr-rule-st)# reissue-nonbistate 
Router(config-corr-rule-st)# reparent
Router(config-corr-rule-st)# commit
Router(config-corr-rule-st)# exit
Router(config)# exit
Router# show logging correlator rule all (optional)

Configuring Logging Suppression Rules

The alarm logging suppression feature enables you to suppress the logging of alarms by defining logging suppression rules that specify the types of alarms that you want to suppress. A logging suppression rule can specify all types of alarms or alarms with specific message categories, group names, and message codes. You can apply a logging suppression rule to alarms originating from all locations on the router or to alarms originating from specific nodes.

Configuration Example

This example shows how to configure logging suppression rules.

Router# configure
Router(config)# logging suppress rule infobistate
Router(config-suppr-rule)# alarm MBGL COMMIT SUCCEEDED
Router(config-suppr-rule)# exit
Router(config)# logging suppress apply rule infobistate 
Router(config-suppr-apply-rule)# commit

Modifying Logging Events Buffer Settings

The alarm logger stores alarm messages in the logging events buffer. The logging events buffer overwrites the oldest messages in the buffer when it is full. Logging events buffer settings can be adjusted to respond to changes in user activity, network events, or system configuration events that affect network performance, or in network monitoring requirements. The appropriate settings depend on the configuration and requirements of the system. A threshold crossing notification is generated each time the logging events buffer reaches the capacity threshold.

Configuration Example

This example shows how to configure the logging events buffer size, threshold, and alarm filter.

Router# configure terminal
Router(config)# logging events buffer-size 50000
Router(config)# logging events threshold 85
Router(config)# logging events level warnings
Router(config)# commit
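
To illustrate how a percentage threshold relates to the threshold crossing notification, here is a small Python sketch. It is a simplified model rather than the router's behavior: the EventsBuffer class is hypothetical, and it measures capacity in records instead of using the units of the buffer-size setting.

```python
from collections import deque


class EventsBuffer:
    """Hypothetical circular buffer with a capacity threshold, counted in records."""

    def __init__(self, capacity, threshold_pct):
        # A deque with maxlen silently drops the oldest entry when it is full.
        self.records = deque(maxlen=capacity)
        self.threshold = capacity * threshold_pct / 100
        self.notifications = []

    def store(self, record):
        before = len(self.records)
        self.records.append(record)
        # Notify only at the moment the fill level reaches the threshold.
        if before < self.threshold <= len(self.records):
            self.notifications.append("threshold-crossing")
```

With a capacity of 20 records and a threshold of 85 percent, the notification in this model is raised when the 17th record is stored.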

Modifying Logging Correlation Buffer Settings

When a correlation occurs, the correlated messages are stored in the logging correlation buffer. The size of the logging correlation buffer can be adjusted to accommodate the anticipated volume of incoming correlated messages. Records can be removed from the buffer by specifying the records, or the buffer can be cleared of all records.

Configuration Example

This example shows how to configure the correlation buffer size and remove records from the buffer.

Router# configure terminal
Router(config)# logging correlator buffer-size 100000
Router(config)# commit
Router(config)# exit
Router# clear logging correlator delete 48 49 50 (optional)
Router# clear logging correlator delete all-in-buffer (optional)

Enabling Alarm Source Location Display Field for Bistate Alarms

Bistate alarms are generated by state changes associated with system hardware. The bistate alarm message format is similar to that of syslog messages. You can optionally configure the output to include the location of the actual alarm source, which may be different from the process that logged the alarm. For more information about bistate alarms, see Bistate Alarms.

Configuration Example

This example shows how to enable the alarm source location display field for bistate alarms.

Router# configure
Router(config)# logging events display-location
Router(config)# commit

Configuring SNMP Correlation Rules

In large-scale systems, you may encounter many SNMP traps emitted at regular intervals. These traps add to the time that Cisco IOS XR spends processing traps. The additional traps can also slow down troubleshooting and increase the workload for monitoring systems and operators. SNMP alarm correlation builds on the generic correlation functionality of the existing syslog correlator. You can configure correlation rules for SNMP traps and apply them to specific trap destinations.

Configuration Example

This example shows how to configure and apply correlation rules for SNMP traps. The SNMP correlator buffer size is also configured, as 1024 bytes. The default buffer size is 64 KB.

Router# configure terminal
Router(config)# snmp-server correlator buffer-size 1024 (optional)
Router(config)# snmp-server correlator rule rule1
Router(config-corr-rule-nonst)# timeout 100
Router(config-corr-rule-nonst)# rootcause 1.3.6.1.2.1.47.1.1
Router(config-corr-rule-nonst-rootvb)# varbind 1.3.6.1.2.1.47.1.2 index regex .*
Router(config-corr-rule-nonst-rootvb)# varbind 1.3.6.1.2.1.47.1.2 value regex .*
Router(config-corr-rule-nonst-rootvb)# exit
Router(config-corr-rule-nonst)# nonrootcause
Router(config-corr-rule-nonst-nonrc)# trap 1.3.6.1.2.1.47.1.1
Router(config-corr-rule-nonst-nonrcvb)# varbind 1.3.6.1.2.1.47.1.3 index regex .*
Router(config-corr-rule-nonst-nonrcvb)# varbind 1.3.6.1.2.1.47.1.3 value regex .*
Router(config-corr-rule-nonst-nonrcvb)# exit
Router(config-corr-rule-nonst-nonrc)# trap 1.3.6.1.2.1.47.1.2
Router(config-corr-rule-nonst-nonrcvb)# varbind 1.3.6.1.2.1.47.1.4 index regex .*
Router(config-corr-rule-nonst-nonrcvb)# varbind 1.3.6.1.2.1.47.1.4 value regex .*
Router(config-corr-rule-nonst-nonrcvb)# exit
Router(config-corr-rule-nonst-nonrc)# exit
Router(config-corr-rule-nonst)# exit
Router(config)# snmp-server correlator apply rule rule1 host ipv4 address 1.2.3.4 
Router(config)# commit

Configuring SNMP Correlation Ruleset

You can configure an SNMP correlation rule set and include multiple SNMP correlation rules.

Configuration Example

This example shows how to configure a ruleset that groups two or more rules. You can apply the ruleset to a specific set of hosts or to all of them.

Router# configure terminal
Router(config)# snmp-server correlator ruleset rule1 rulename rule1 
Router(config)# snmp-server correlator apply ruleset rule1 host ipv4 address 1.2.3.4 
Router(config)# commit

Alarm Logging Correlation-Details

Alarm logging correlation can be used to isolate the most significant root messages for events affecting system performance. For example, the original message describing a card online insertion and removal (OIR) of a line card can be isolated so that only the root-cause message is displayed and all subsequent messages related to the same event are correlated. When correlation rules are configured, a common root event that is generating (root-cause) messages can be isolated and sent to the syslog, while secondary messages are suppressed. An operator can retrieve all correlated messages from the logging correlator buffer to view correlation events that have occurred.

Correlation Rules

Correlation rules can be configured to isolate root messages that may generate system alarms. Correlation rules prevent unnecessary stress on the Alarm Logging and Debugging Event Management System (ALDEMS) caused by the accumulation of unnecessary messages. Each correlation rule depends on a message identification, consisting of a message category, message group name, and message code. The correlator process scans messages for occurrences of the identified message. If the correlator receives a root message, the correlator stores it in the logging correlator buffer and forwards it to the syslog process on the RP. From there, the syslog process forwards the root message to the alarm logger, where it is stored in the logging events buffer. From the syslog process, the root message may also be forwarded to destinations such as the console, remote terminals, remote servers, the fault management system, and the Simple Network Management Protocol (SNMP) agent, depending on the network device configuration. Subsequent messages meeting the same criteria (including another occurrence of the root message) are stored in the logging correlation buffer and are forwarded to the syslog process on the router.

If a message matches multiple correlation rules, all matching rules apply and the message becomes a part of all matching correlation queues in the logging correlator buffer. The following message fields are used to define a message in a logging correlation rule:
  • Message category

  • Message group

  • Message code

Wildcards can be used for any of the message fields to cover a wider set of messages.
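
As an illustration of matching a message against a rule entry with wildcards, the following Python sketch compares (category, group, code) triplets. The wildcard syntax is an assumption: it uses shell-style patterns from the standard fnmatch module, which may differ from the syntax the router accepts, and the matches function name is hypothetical.

```python
from fnmatch import fnmatchcase


def matches(pattern, message):
    """Compare two (category, group, code) triplets field by field.

    Each pattern field may contain shell-style wildcards such as '*'.
    """
    return all(fnmatchcase(value, pat) for pat, value in zip(pattern, message))
```

For example, the pattern ("PKT_INFRA", "*", "UPDOWN") matches any UPDOWN message in the PKT_INFRA category, whatever its group.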

Rules can use two types of correlation to isolate root-cause messages: stateful correlation and nonstateful correlation. Nonstateful correlation is fixed after it has occurred, and non-root-cause alarms that are suppressed are never forwarded to the syslog process. All non-root-cause alarms remain buffered in correlation buffers. Stateful correlation can change after it has occurred, if the bistate root-cause alarm clears. When the alarm clears, all the correlated non-root-cause alarms are sent to syslog and are removed from the correlation buffer. Stateful correlations are useful for detecting non-root-cause conditions that continue to exist even if the suspected root cause no longer exists.
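
The difference between the two correlation types shows up when the bistate root-cause alarm clears. The following Python sketch models only that moment; the function name and the representation of the correlation buffer as a plain list are assumptions made for illustration.

```python
def on_root_clear(rule_type, correlation_buffer):
    """Return the alarms released to syslog when the bistate root cause clears.

    correlation_buffer is a list of buffered non-root-cause alarms and is
    modified in place.
    """
    if rule_type == "stateful":
        # Stateful: release every correlated alarm and empty the buffer.
        released = list(correlation_buffer)
        correlation_buffer.clear()
        return released
    # Nonstateful: the correlation is fixed, so the alarms stay buffered.
    return []
```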

Alarm Severity Level and Filtering

Filter settings can be used to display information based on severity level. The alarm filter display indicates the severity level settings used to report alarms, the number of records, and the current and maximum log size.

Alarms can be filtered according to the severity level shown in this table.

Table 1. Alarm Severity Levels for Event Logging

Severity Level   System Condition
--------------   ----------------
0                Emergencies
1                Alerts
2                Critical
3                Errors
4                Warnings
5                Notifications
6                Informational
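
A severity filter based on this table can be sketched as follows. The sketch assumes the usual syslog convention that a filter level admits alarms at that level and at all more severe (numerically lower) levels; the SEVERITY mapping and the passes_filter function are hypothetical names.

```python
# Severity names and numbers taken from the table above.
SEVERITY = {
    "emergencies": 0,
    "alerts": 1,
    "critical": 2,
    "errors": 3,
    "warnings": 4,
    "notifications": 5,
    "informational": 6,
}


def passes_filter(alarm_severity, filter_level):
    """True if the alarm is at least as severe as the configured filter level.

    Lower numbers are more severe, so severity 2 passes a 'warnings' filter.
    """
    return alarm_severity <= SEVERITY[filter_level]
```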

Bistate Alarms

Bistate alarms are generated by state changes associated with system hardware, such as a change of interface state from active to inactive, the online insertion and removal (OIR) of a line card, or a change in component temperature. Bistate alarm events are reported to the logging events buffer by default; informational and debug messages are not.

Cisco IOS XR Software provides the ability to reset and clear alarms. Clients interested in monitoring alarms in the system can register with the alarm logging mechanism to receive asynchronous notifications when a monitored alarm changes state.

Bistate alarm notifications provide the following information:

  • The origination ID, which uniquely identifies the resource that causes an alarm to be raised or cleared. This resource may be an interface, a line card, or an application-specific integrated circuit (ASIC). The origination ID is a unique combination of the location, job ID, message group, and message context.

By default, the general format of bistate alarm messages is the same as for all syslog messages:

node-id:timestamp : process-name [pid] : %category-group-severity-code : message-text

The following is a sample bistate alarm message:

LC/0/0/CPU0:Jan 15 21:39:11.325 2016:ifmgr[163]: %PKT_INFRA-LINEPROTO-5-UPDOWN : Line protocol on Interface HundredGigE 0/0/0/0, changed state to Down

The message text includes the location of the process logging the alarm. In this example, the alarm was logged by the line protocol on HundredGigE interface 0/0/0/0. Optionally, you can configure the output to include the location of the actual alarm source, which may be different from the process that logged the alarm. This appears as an additional display field before the message text.

When alarm source location is displayed, the general format becomes:

node-id:timestamp : process-name [pid] : %category-group-severity-code : source-location message-text

The following is a sample when alarm source location is displayed:

LC/0/0/CPU0:Jan 15 21:39:11.325 2016:ifmgr[163]: %PKT_INFRA-LINEPROTO-5-UPDOWN : interface HundredGigE 0/0/0/0: Line protocol on Interface HundredGigE 0/0/0/0, changed state to Down
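
The general message format lends itself to parsing with a regular expression. The following Python sketch is one possible parser, written against the sample messages in this section only; the pattern, the field names, and the parse_alarm function are assumptions, and the pattern does not try to separate the optional source-location field from the message text.

```python
import re

# node-id:timestamp : process-name [pid] : %category-group-severity-code : message-text
ALARM_PATTERN = re.compile(
    r"^(?P<node>[^:]+):"
    r"(?P<timestamp>.+?)\s*:\s*"
    r"(?P<process>[\w.-]+)\s*\[(?P<pid>\d+)\]\s*:\s*"
    r"%(?P<category>\w+)-(?P<group>\w+)-(?P<severity>\d)-(?P<code>\w+)\s*:\s*"
    r"(?P<text>.*)$"
)


def parse_alarm(line):
    """Return a dict of message fields, or None if the line does not match."""
    match = ALARM_PATTERN.match(line)
    return match.groupdict() if match else None
```

When the alarm source location is enabled, it appears at the start of the text field in this sketch.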

Hierarchical Correlation

Hierarchical correlation takes effect when the following conditions are true:

  • When a single alarm is both a root cause for one rule and a non-root cause for another rule.

  • When alarms are generated that result in successful correlations associated with both rules.

The following example illustrates two hierarchical correlation rules:

Rule     Alarm              Category   Group     Code
------   ----------------   --------   -------   ------
Rule 1   Root Cause 1       Cat 1      Group 1   Code 1
Rule 1   Non-root Cause 2   Cat 2      Group 2   Code 2
Rule 2   Root Cause 2       Cat 2      Group 2   Code 2
Rule 2   Non-root Cause 3   Cat 3      Group 3   Code 3

If three alarms are generated for Cause 1, 2, and 3, with all alarms arriving within their respective correlation timeout periods, then the hierarchical correlation appears like this:

Cause 1 -> Cause 2 -> Cause 3

The correlation buffers show two separate correlations: one for Cause 1 and Cause 2 and the second for Cause 2 and Cause 3. However, the hierarchical relationship is implicitly defined.
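
The implicit hierarchy can be derived from the individual rules by following each alarm that is a non-root cause in one rule and a root cause in another. The Python sketch below illustrates the idea; the build_chains function and the dictionary layout are hypothetical, and the sketch assumes the rules contain no cycles.

```python
def build_chains(rules):
    """Return every root-to-leaf chain implied by rules that share an alarm.

    rules maps a rule name to (root_cause, [non_root_causes]).
    Assumes the rules do not form a cycle.
    """
    children = {}
    non_roots = set()
    for root, causes in rules.values():
        children.setdefault(root, []).extend(causes)
        non_roots.update(causes)

    # A chain starts at a root cause that is never a non-root cause elsewhere.
    tops = [root for root in children if root not in non_roots]

    def walk(node, path):
        path = path + [node]
        if node not in children:
            return [path]          # leaf: this alarm is not a root cause anywhere
        paths = []
        for child in children[node]:
            paths.extend(walk(child, path))
        return paths

    chains = []
    for top in tops:
        chains.extend(walk(top, []))
    return chains
```

Applied to the two rules in the example, the sketch yields the single chain Cause 1, Cause 2, Cause 3.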


Note


Stateful behavior, such as reparenting and reissuing of alarms, is supported for rules that are defined as stateful; that is, correlations that can change.


Context Correlation Flag

The context correlation flag determines whether correlations take place on a per-context basis.

This flag causes behavior change only if the rule is applied to one or more contexts. It does not go into effect if the rule is applied to the entire router or location nodes.

The following is a scenario of context correlation behavior:

  • Rule 1 has a root cause A and an associated non-root cause B.

  • Context correlation flag is not set on Rule 1.

  • Rule 1 is applied to contexts 1 and 2.

Because the context correlation flag is not set on Rule 1, alarm A generated from context 1 and alarm B generated from context 2 are correlated: the rule applies across both contexts, regardless of which context each alarm came from.

If the context correlation flag is now set on Rule 1 and the same alarms are generated, they are not correlated as they are from different contexts.

With the flag set, the correlator analyzes alarms against the rule only if alarms arrive from the same context. In other words, if alarm A is generated from context 1 and alarm B is generated from context 2, then a correlation does not occur.
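
The effect of the flag can be summarized as a small decision function. This Python sketch is illustrative only; the function name and parameters are hypothetical, and it covers only rules that are applied to contexts, which is the only case in which the flag changes behavior.

```python
def should_correlate(context_flag_set, applied_contexts, context_a, context_b):
    """Decide whether two alarms that match a rule may be correlated.

    applied_contexts is the set of contexts the rule is applied to;
    context_a and context_b are the contexts the two alarms came from.
    """
    if context_a not in applied_contexts or context_b not in applied_contexts:
        return False                    # the rule does not apply to this alarm
    if context_flag_set:
        return context_a == context_b   # per-context: same context required
    return True                         # flag not set: contexts may differ
```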

Duration Timeout Flags

The root-cause timeout, if specified, is an alternative rule timeout that is used when a non-root-cause alarm arrives before the root-cause alarm in the given rule. It is typically set to a shorter interval, on the assumption that the root-cause alarm is then less likely to arrive, so that the hold on the non-root-cause alarms is released sooner.
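
The choice between the two timeouts can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that the regular rule timeout is used whenever no root-cause timeout is specified.

```python
def effective_timeout(first_alarm_is_root, timeout_ms, rootcause_timeout_ms=None):
    """Pick the timer value to start when the first alarm of a rule arrives."""
    if not first_alarm_is_root and rootcause_timeout_ms is not None:
        # A non-root-cause alarm arrived first: use the alternative timeout.
        return rootcause_timeout_ms
    return timeout_ms
```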

Reparent Flag

The reparent flag specifies what happens to non-root-cause alarms in a hierarchical correlation when their immediate root cause clears.

Active-Standby Consistency Check

In a dual RP system, the interface state sync library monitors the port state on the active and standby RPs. An RP raises an alarm when one of its ports is down while the same port is up on the other RP.

Alarm Clearance

The alarm is cleared when the port state is consistent on both RPs.
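
The consistency check amounts to comparing per-port state on the two RPs. The Python sketch below illustrates the raise and clear conditions; the function name, the dictionary layout, and the port names used with it are hypothetical and do not reflect the interface state sync library itself.

```python
def consistency_alarms(active_ports, standby_ports):
    """Return (rp, port) pairs for which an inconsistency alarm would be raised.

    Both arguments map a port name to 'up' or 'down'. Only ports known to
    both RPs are compared.
    """
    alarms = []
    for port in sorted(active_ports.keys() & standby_ports.keys()):
        active, standby = active_ports[port], standby_ports[port]
        if active == "down" and standby == "up":
            alarms.append(("active", port))    # raised by the active RP
        elif standby == "down" and active == "up":
            alarms.append(("standby", port))   # raised by the standby RP
    return alarms
```

An empty result corresponds to the cleared condition: no port differs between the two RPs.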