Troubleshooting AAAAccSvrUnreachable and AAAAuthSvrUnreachable traps


Updated:July 11, 2015

Document ID:200018


Introduction

Trap triggers

Consecutive failures in a aaamgr process approach

Keepalive approach

Troubleshooting commands/approaches

Radius configuration basics

show task resources facility aaamgr all

show radius counters { {all | server <server IP>} [instance <aaamgr #>] | summary}

show session subsystem facility {aaamgr | sessmgr} {all | instance }

ping

traceroute

radius test instance x auth {radius group | all | server port }

radius test instance x accounting {radius group | all | server port }

show radius info [radius group ] instance { X | all}

Introduction

This article discusses how to troubleshoot the SNMP traps AAAAccSvrUnreachable and AAAAuthSvrUnreachable, which are triggered by reachability issues with a Remote Authentication Dial-In User Service (RADIUS) server used to authenticate subscribers (RADIUS can also authenticate operators logging into the node, but that case is not discussed here). There are two approaches that can be used to determine when either of these traps triggers. This article explains what conditions trigger these traps and what troubleshooting approaches and data collection can be used to determine root cause and resolve them. It also discusses some potential remediation steps that can be considered.

Note that the RESULT of unreachability will be call failures or accounting failures, the same as if radius responses are rejections instead of acceptances. While success/failure (authentication) rate is measured independently of timeout/reachability (there are traps and alarms for this) and can certainly be analyzed in its own right, the focus of this article will be on the reachability problem and not the reject problem.

Example output from the lab and from actual tickets is used throughout to illustrate the discussion. Addresses in this article that appear to be public IP addresses are fake.

Trap triggers

There are two different models/algorithms/approaches to choose from to determine the status of a radius server and when to try a different server if failures are occurring:

Consecutive failures in a aaamgr process approach

The original approach, and the one used more often by operators, involves keeping track of the number of failures that have occurred in a row for a particular aaamgr process. A aaamgr process is responsible for all radius message processing and exchange with a radius server, and many aaamgr processes exist in a chassis, each paired with a sessmgr process (the sessmgr processes are the main processes responsible for call control). All the aaamgr processes can be viewed with the "show task resources" command. A particular aaamgr process therefore processes radius messages for many calls, not just a single call, and this algorithm tracks how many times in a row a particular aaamgr process has failed to get a response to a request it has had to resend. Each such failure is an "Access-Request Timeout" as reported in "show radius counters".

The respective counter "Access-Request Current Consecutive Failures in a mgr", also from "show radius counters" is incremented when this occurs, and the "show radius accounting (or authentication) servers detail" command indicates the timestamps of the radius state change from Active to Not Responding (but no SNMP trap or logs are generated for just one failure). Here is an example for radius accounting:

[source]PDSN> show radius accounting servers detail
Friday November 28 23:23:34 UTC 2008

+-----Type:       (A) - Authentication   (a) - Accounting
|                 (C) - Charging         (c) - Charging Accounting
|                 (M) - Mediation       (m) - Mediation Accounting
|
|+----Preference: (P) - Primary         (S) - Secondary
||
||+---State:     (A) - Active          (N) - Not Responding
|||               (D) - Down             (W) - Waiting Accounting-On
|||               (I) - Initializing     (w) - Waiting Accounting-Off
|||               (a) - Active Pending   (U) - Unknown
|||
|||+--Admin       (E) - Enabled         (D) - Disabled
|||| Status:
||||
||||+-Admin
||||| status     (O) - Overridden       (.) - Not Overridden
||||| Overridden:
|||||
vvvvv IP             PORT GROUP
------ ------------- ----- -----------------------
PNE. 198.51.100.1    1813 default                            

Event History:
2008-Nov-28+23:18:36       Active
2008-Nov-28+23:18:57       Not Responding
2008-Nov-28+23:19:12       Active
2008-Nov-28+23:19:30       Not Responding
2008-Nov-28+23:19:36       Active
2008-Nov-28+23:20:57       Not Responding
2008-Nov-28+23:21:12       Active
2008-Nov-28+23:22:31       Not Responding
2008-Nov-28+23:22:36       Active
2008-Nov-28+23:23:30       Not Responding

If this counter reaches the configured value (default = 4) without ever being reset, per the following configurable (the brackets [ ] indicate an optional qualifier; specifying accounting applies the setting to accounting servers, and authentication is the default if accounting is not specified):

radius [accounting] detect-dead-server consecutive-failures 4

Then this server is marked "Down" for the period (minutes) configured:

radius [accounting] deadtime 10

An SNMP trap and logs are triggered as well. Here are examples for authentication and accounting respectively:

Fri Jan 30 06:17:19 2009 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 172.28.221.178
Fri Jan 30 06:22:19 2009 Internal trap notification 40 (AAAAuthSvrReachable) server 2 ip address 172.28.221.178


Fri Nov 28 21:59:12 2008 Internal trap notification 42 (AAAAccSvrUnreachable) server 6 ip address 172.28.221.178
Fri Nov 28 22:28:29 2008 Internal trap notification 43 (AAAAccSvrReachable) server 6 ip address 172.28.221.178


2008-Nov-28+21:59:12.899 [radius-acct 24006 warning] [8/0/518 <aaamgr:231> aaamgr_config.c:1060] [context: source, contextID: 2] [software internal security config user critical-info] Server 172.28.221.178:1813 unreachable

2008-Nov-28+22:28:29.280 [radius-acct 24007 info] [8/0/518 <aaamgr:231> aaamgr_config.c:1068] [context: source, contextID: 2] [software internal security config user critical-info] Server 172.28.221.178:1813 reachable

The traps indicate the server which is unreachable. Take note of any patterns. For instance, is it happening with one server or another or all servers, and what is the frequency of bouncing - is it happening continually or occasionally?
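One way to look for such patterns is to pair each Unreachable trap with the following Reachable trap for the same server and measure how long each outage lasted. The following Python sketch does that for trap lines in the format shown above. The function name and the sample list are illustrative only and not part of StarOS; the sketch also assumes the trap lines have been saved off the chassis in exactly this text format.

```python
import re
from datetime import datetime

# Matches lines such as:
# Fri Jan 30 06:17:19 2009 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 172.28.221.178
TRAP_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) Internal trap notification \d+ "
    r"\((?P<name>AAA(?:Auth|Acc)Svr(?:Unreachable|Reachable))\) "
    r"server (?P<index>\d+) ip address (?P<ip>\S+)"
)

def outage_durations(lines):
    """Return a list of (kind, server_ip, seconds_down) tuples.

    Each Unreachable trap is paired with the next Reachable trap
    seen for the same kind (auth/acct) and server IP.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = TRAP_RE.match(line.strip())
        if not m:
            continue  # not an AAA reachability trap line
        ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
        kind = "auth" if "Auth" in m["name"] else "acct"
        key = (kind, m["ip"])
        if m["name"].endswith("Unreachable"):
            down_since.setdefault(key, ts)
        elif key in down_since:
            outages.append((kind, m["ip"], (ts - down_since.pop(key)).total_seconds()))
    return outages

SAMPLE_TRAPS = [
    "Fri Jan 30 06:17:19 2009 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 172.28.221.178",
    "Fri Jan 30 06:22:19 2009 Internal trap notification 40 (AAAAuthSvrReachable) server 2 ip address 172.28.221.178",
    "Fri Nov 28 21:59:12 2008 Internal trap notification 42 (AAAAccSvrUnreachable) server 6 ip address 172.28.221.178",
    "Fri Nov 28 22:28:29 2008 Internal trap notification 43 (AAAAccSvrReachable) server 6 ip address 172.28.221.178",
]

for kind, ip, seconds in outage_durations(SAMPLE_TRAPS):
    print(f"{kind} server {ip} was down for {seconds:.0f} s")
```

Many short outages for one server suggest bouncing, while a few long outages suggest a sustained loss of connectivity.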

Also note that all it takes for this trap to be triggered is for one aaamgr to fail, so the tricky part about this trap is that it does not indicate the extent of the issue. It could be very extensive or very minor. That is up to the operator to determine, and approaches to figuring that out are discussed in this article.

show snmp trap statistics will report the number of times it has triggered since bootup, even if the older traps have long since been deleted. This example shows an accounting unreachable issue:

[source]PDSN> show snmp trap statistics | grep -i aaa
Wednesday September 10 08:38:19 UTC 2014

Trap Name                            #Gen #Disc  Disable Last Generated      
----------------------------------- ----- -----  ------- --------------------
AAAAccSvrUnreachable                  833     0       0  2014:09:10:08:36:54
AAAAccSvrReachable                    839     0       0  2014:09:10:08:37:00

Note that the aaamgr reported in the log example above is #231. This is the management aaamgr on the ASR 5000, which resides on the System Management Card (SMC). This output can be misleading: when one or more individual aaamgrs experience reachability issues, the instance number reported in the logs is the management aaamgr instance and not the particular instance(s) experiencing the issue. The reason is that if many instances were experiencing reachability issues, logging would fill up quickly if each were reported individually, so the design has been to report generically on the management instance. The troubleshooting section provides further details on how to determine which aaamgr(s) is/are failing. Starting in some versions of StarOS 17 and in v18+, this behavior has been changed so that the instance number of the aaamgr having connectivity issues (as reported in SNMP traps) is reported in the logs (Cisco CDETS CSCum84773), though still only the first occurrence (across multiple aaamgrs) is reported.

The management aaamgr is the maximum sessmgr instance number + 1, and so on an ASR 5500 it is 385 for Data Processing Card (DPC) or 1153 (for DPC 2).

As a sidenote, the management aaamgr is responsible for handling operator/administrator logins as well as handling change of authorization requests initiated from RADIUS servers themselves.

Continuing, the "show radius accounting (or authentication) servers detail" command indicates the timestamps of the state changes to Down that correspond to the traps/logs. (Reminder: Not Responding, defined earlier, is only a single aaamgr getting a timeout, whereas Down is a single aaamgr getting enough consecutive timeouts per configuration to trigger Down.)

vvvvv IP             PORT GROUP
----- --------------- ----- -----------------------
aSDE. 172.28.221.178 1813 default                            

Event History:
2008-Nov-28+21:59:12       Down
2008-Nov-28+22:28:29       Active
2008-Nov-28+22:28:57       Not Responding
2008-Nov-28+22:32:12       Down
2008-Nov-28+23:01:57       Active
2008-Nov-28+23:02:12       Not Responding
2008-Nov-28+23:05:12       Down
2008-Nov-28+23:19:29       Active
2008-Nov-28+23:19:57       Not Responding
2008-Nov-28+23:22:12       Down

If there is only one server configured, then it is not marked down, as that would be critical for successful call setup.

Worth mentioning is that there is another parameter that can be configured on the detect-dead-server config line called “response-timeout”. When specified, a server is marked down only when the consecutive failures and response-timeout conditions are both met. The response-timeout specifies a period of time when NO responses are received to ALL the requests sent to a particular server. (Note that this timer would be continually reset as responses are received.) This condition would be expected when either a server or the network connection is completely down, vs. partially compromised/degraded.

The use case for this would be a scenario where a burst in traffic causes the consecutive failures to trigger, but marking a server down immediately as a result is not desired. Rather, the server is only marked down after a specific period of time passes where no responses are received, effectively representing true server unreachability.

The method just discussed for controlling radius state machine changes depends on looking at all aaamgr processes and finding one that triggers the condition of failed retries. This method is subject to some randomness in failures, so it may not be the ideal algorithm for detecting failures. But it is especially good at finding aaamgr(s) that are broken while all others are working fine.
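The consecutive-failures logic described in this section can be sketched in Python as follows. This is a simplified model under stated assumptions, not the StarOS implementation: the class and method names are hypothetical, deadtime expiry is not modeled, and the exact rules for moving between Active and Not Responding are an approximation of what the event histories above show. A "timeout" here means one Access-Request Timeout, that is, a request whose retries were all exhausted.

```python
class DeadServerDetector:
    """Toy model of per-aaamgr consecutive-failure tracking for one server."""

    def __init__(self, threshold=4, response_timeout=None):
        self.threshold = threshold                # detect-dead-server consecutive-failures
        self.response_timeout = response_timeout  # optional response-timeout, in seconds
        self.failures = {}                        # aaamgr instance -> consecutive timeouts
        self.last_response = 0.0                  # time of last response from this server
        self.state = "Active"

    def on_response(self, instance, now):
        # Any response resets that aaamgr's counter and the silence timer.
        self.failures[instance] = 0
        self.last_response = now
        if self.state != "Down":                  # assumption: Down is only left via deadtime
            self.state = "Active"
        return self.state

    def on_timeout(self, instance, now):
        self.failures[instance] = self.failures.get(instance, 0) + 1
        silent_long_enough = (
            self.response_timeout is None
            or now - self.last_response >= self.response_timeout
        )
        # One aaamgr reaching the threshold is enough to mark the server Down.
        if self.failures[instance] >= self.threshold and silent_long_enough:
            self.state = "Down"
        elif self.state != "Down":
            self.state = "Not Responding"
        return self.state


detector = DeadServerDetector(threshold=4)
history = [detector.on_timeout(92, now=t) for t in range(1, 5)]
print(history)
```

With response_timeout set, the same four timeouts leave the server in Not Responding until no response has been seen from any aaamgr for the configured period.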

Keepalive approach

Another method of detecting radius server reachability is to use dummy keepalive test messages. This involves constantly sending fake radius messages instead of monitoring live traffic. One advantage of this method is that it is always active, whereas with the consecutive failures in a aaamgr approach there could be periods where no radius traffic is sent, so there is no way to know if a problem exists during those times, resulting in delayed detection when attempts do start occurring. Also, when a server is marked down, these keepalives continue to be sent so that the server can be marked up as soon as possible. The disadvantage of this approach is that it misses issues that are tied to specific aaamgr instances, because it uses the management aaamgr instance for the test messages.

Here are the various configurables relevant to this approach:

radius (accounting) detect-dead-server keepalive
radius (accounting) keepalive interval 30
radius (accounting) keepalive retries 3
radius (accounting) keepalive timeout 3
radius (accounting) keepalive consecutive-response 1
radius (accounting) keepalive username Test-Username
radius keepalive encrypted password 2ec59b3188f07d9b49f5ea4cc44d9586
radius (accounting) keepalive calling-station-id 000000000000000
radius keepalive valid-response access-accept

The command “radius (accounting) detect-dead-server keepalive” turns on the keep-alive approach instead of the consecutive failures in a aaamgr approach. In the example above, the system sends a test message with username Test-Username and password Test-Username every 30 seconds, and retries every 3 seconds if no response is received, and retries up to 3 times, after which it marks the server down. Once it gets its first response, it marks it back up again.
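As a rough illustration of the timing these settings imply, the following Python sketch models one keepalive cycle and estimates how long a failure could go undetected. The function names are hypothetical, and the estimate assumes that each transmission (the initial keepalive and every retry) waits the full timeout before the next step and that the server failed just after answering the previous keepalive. The real scheduling on the chassis may differ.

```python
def keepalive_cycle(send_probe, retries=3):
    """Send one keepalive plus up to `retries` retransmissions.

    send_probe() is a caller-supplied function that returns True when a
    valid response arrived within the keepalive timeout.
    Returns (state, transmissions_used).
    """
    for attempt in range(1 + retries):
        if send_probe():
            return "up", attempt + 1
    return "down", 1 + retries


def worst_case_detection_seconds(interval=30, timeout=3, retries=3):
    """Approximate upper bound on the time to mark a server down."""
    # Wait for the next keepalive, then exhaust the initial send and all retries.
    return interval + (1 + retries) * timeout


print(keepalive_cycle(lambda: False))    # no responses at all
print(worst_case_detection_seconds())    # with the example settings
```

Under these assumptions, the example settings (interval 30, timeout 3, retries 3) give an upper bound of roughly 42 seconds between a server failing and the trap being generated.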

Here is an example authentication request/response for the above settings:

<<<<OUTBOUND 17:50:12:657 Eventid:23901(6)

RADIUS AUTHENTICATION Tx PDU, from 192.168.50.151:32783 to 192.168.50.200:1812 (142) PDU-dict=starent-vsa1
Code: 1 (Access-Request)
Id: 16
Length: 142
Authenticator: 51 6D B2 7D 6A C6 9A 96 0C AB 44 19 66 2C 12 0A
     User-Name = Test-Username
     User-Password = B7 23 1F D1 86 46 4D 7F 8F E0 2A EF 17 A1 F3 BF
     Calling-Station-Id = 000000000000000
     Service-Type = Framed
     Framed-Protocol = PPP
     NAS-IP-Address = 192.168.50.151
     Acct-Session-Id = 00000000
     NAS-Port-Type = HRPD
     3GPP2-MIP-HA-Address = 255.255.255.255
     3GPP2-Correlation-Id = 00000000
     NAS-Port = 4294967295
     Called-Station-ID = 00

INBOUND>>>>> 17:50:12:676 Eventid:23900(6)
RADIUS AUTHENTICATION Rx PDU, from 192.168.50.200:1812 to 192.168.50.151:32783 (34) PDU-dict=starent-vsa1
Code: 2 (Access-Accept)
Id: 16
Length: 34
Authenticator: 21 99 F4 4C F8 5D F8 28 99 C6 B8 D9 F9 9F 42 70
     User-Password = testpassword

The same SNMP traps are used to signify the unreachable/down and reachable/up radius states as with the consecutive failures in a aaamgr approach:

Fri Feb 27 17:54:55 2009 Internal trap notification 39 (AAAAuthSvrUnreachable) server 1 ip address 192.168.50.200
Fri Feb 27 17:57:04 2009 Internal trap notification 40 (AAAAuthSvrReachable) server 1 ip address 192.168.50.200

The “show radius counters all” has a section for keeping track of the keepalive requests for authentication and accounting as well – here are the authentication counters:

   Server-specific Keepalive Auth Counters
   ---------------------------------------
     Keepalive Access-Request Sent:                            33            
     Keepalive Access-Request Retried:                         3              
     Keepalive Access-Request Timeouts:                        4              
     Keepalive Access-Accept Received:                         29            
     Keepalive Access-Reject Received:                         0              
     Keepalive Access-Response Bad Authenticator Received:     0              
     Keepalive Access-Response Malformed Received:             0              
     Keepalive Access-Response Malformed Attribute Received:   0              
     Keepalive Access-Response Unknown Type Received:          0              
     Keepalive Access-Response Dropped:                        0

Troubleshooting commands/approaches

Now that the triggers for the AAA Unreachable traps have been explained, the next step is to understand the various troubleshooting commands used to determine impact and to try to find the root cause. Unreachability is a very broad term. It doesn't explain where the unreachability is: in the network, on the server, or on the ASR. For instance, is it known whether the requests were even sent in the first place? Did the server receive the requests? Did it respond to the requests? Did the responses make it back to the ASR, and if so, were they processed or dropped on the internal path (i.e. flows)? This section attempts to address how to answer these questions.

Radius configuration basics

There are first some basics that one needs to be familiar with regarding the RADIUS configuration. Most of the configuration for RADIUS is in a specifically named group, and all contexts have a default group, which can be configured as follows. Many configurations have just one group, the default group.

[local]CSE2# config
[local]CSE2(config)# context aaa_ctx
[aaa_ctx]ASR5000(config-ctx)# aaa group default
[aaa_ctx]ASR5000(config-aaa-group)#

If specific named aaa groups are used, they are pointed to by the following statement configured in a subscriber profile or Access Point Name (APN) (depending on the call control technology), for example:

subscriber name <subscriber name>
  aaa group <group name>

Note: The system first checks the specific aaa group assigned to the subscriber, and then checks the aaa group default for additional configurables not defined in the specific group.
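The lookup order in this note behaves like a layered dictionary: the named group is consulted first, and the default group supplies anything the named group does not define. The following Python sketch shows the idea with collections.ChainMap. The setting names and values are made up for illustration and are not taken from a real configuration.

```python
from collections import ChainMap

# Hypothetical settings; only max-retries is defined in the named group.
default_group = {"timeout": 3, "max-retries": 5, "algorithm": "first-server"}
named_group = {"max-retries": 2}

# The named group takes precedence; the default group fills the gaps.
effective = ChainMap(named_group, default_group)

print(effective["max-retries"])  # from the named group
print(effective["timeout"])      # falls back to the default group
```

This is why a setting that seems to be missing from a named group may still be in effect for a subscriber.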

Here are useful commands that summarize all the values assigned to all the configurables in the various aaa group configurations. This allows quick viewing of all the configurables including default values without having to examine the configuration manually, and possibly help to avoid making mistakes when assuming certain settings. These commands report across all contexts:

show aaa group all
show aaa group name <group name>

The most important configurable is of course the radius access and accounting servers themselves. Here is an example:

radius server 209.165.201.1 key testtesttesttest port 1645 priority 1 max-rate 5
radius server 209.165.201.2 key testtesttesttest port 1645 priority 2 max-rate 5
radius accounting server 209.165.201.1 key testtesttesttest port 1646 priority 1
radius accounting server 209.165.201.2 key testtesttesttest port 1646 priority 2

Note the max-rate feature, which limits the number of requests sent to the server per aaamgr per second.

In addition, the NAS IP address is also required to be defined, which is the IP address on an interface in the context from which radius requests are sent and responses received. If not defined, requests are not sent and monitor subscriber traces may not post an obvious error (no radius requests sent and no indication why).

radius attribute nas-ip-address address 10.211.41.129

Note that because both authentication and accounting are often handled by the same server, a different port number is used to differentiate authentication vs. accounting traffic on the RADIUS server. For the ASR5K side, the UDP source port number is NOT specified and is chosen by the chassis on a aaamgr basis (more on this later).

Normally multiple access and accounting servers are specified for redundancy purposes. Either a round robin or prioritized order can be configured:

radius [accounting] algorithm {first-server | round-robin}

The first-server option results in ALL requests being sent to the server with the lowest-numbered priority. Only when retry failures occur, or worse, a server is marked down, is the server with the next priority tried. More on this below.
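The difference between the two algorithms can be sketched in Python as follows. This is only a model of the selection of the first server to try for each new request. The function name is hypothetical, failover after retry failures is not modeled, and the exact rotation order used by the chassis for round-robin is an assumption.

```python
from itertools import count

def make_selector(servers, algorithm="first-server"):
    """Return a function that picks the first server to try for each new request."""
    ordered = sorted(servers, key=lambda s: s["priority"])
    if algorithm == "first-server":
        # Every request starts with the lowest-numbered priority.
        return lambda: ordered[0]
    if algorithm == "round-robin":
        counter = count()
        # Requests rotate across the configured servers.
        return lambda: ordered[next(counter) % len(ordered)]
    raise ValueError(f"unknown algorithm: {algorithm}")


servers = [
    {"ip": "209.165.201.2", "priority": 2},
    {"ip": "209.165.201.1", "priority": 1},
]

pick_first = make_selector(servers, "first-server")
pick_rr = make_selector(servers, "round-robin")
print([pick_first()["ip"] for _ in range(3)])
print([pick_rr()["ip"] for _ in range(3)])
```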

When a radius (accounting or access) request is sent, a reply is expected. When a reply is not received within the timeout period (seconds):

radius [accounting] timeout 3

The request is resent up to the number of times specified:

radius [accounting] max-retries 5

This means that a request can be sent a total of max-retries + 1 times before the system gives up on the particular radius server being tried. At this point, it tries the same sequence with the next radius server in order. If each of the servers has been tried max-retries + 1 times without response, then the call is rejected, assuming there is no other reason for failure up to that point.

As a sidenote, there are configurables that allow for users to have access even if authentication and accounting fail due to timeouts to all servers, though a commercial deployment would not likely implement this:

radius allow [accounting] authentication-down

Also, there are configurables that can limit the absolute total number of transmissions of a particular request across all the configured servers, and these are disabled by default:

radius [accounting] max-transmissions 256

For example, if this is set to 1, then even if there is a secondary server, it is never attempted, because only one transmission is ever made for a specific subscriber setup.
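The retry and failover arithmetic described above can be worked through with a short Python sketch. The function name is hypothetical, and the sketch assumes that no response ever arrives and that every transmission waits the full timeout before the next one is sent.

```python
def attempt_plan(servers, max_retries=5, max_transmissions=None):
    """List (server, attempt_number) pairs in the order they would be sent
    if no response is ever received."""
    plan = []
    for server in servers:
        # Initial request plus max_retries retransmissions per server.
        for attempt in range(1, max_retries + 2):
            if max_transmissions is not None and len(plan) >= max_transmissions:
                return plan  # overall cap reached across all servers
            plan.append((server, attempt))
    return plan


servers = ["209.165.201.1", "209.165.201.2"]
timeout = 3  # radius timeout, in seconds

plan = attempt_plan(servers, max_retries=5)
print(len(plan), "transmissions,", len(plan) * timeout, "seconds before the call is rejected")

capped = attempt_plan(servers, max_retries=5, max_transmissions=1)
print(capped)  # the secondary server is never tried
```

With timeout 3 and max-retries 5, two servers give 12 transmissions and about 36 seconds before the call is rejected, which is worth comparing against how long the access side is willing to wait.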

show task resources facility aaamgr all

Each aaamgr process is paired with and "works for" an associated sessmgr process (responsible for overall call handling) and is located on a different Packet Services Card (PSC) or Data Processing Card (DPC) but using the same instance ID. Also in this example output note the special aaamgr instance 231 running on the System Management Card (SMC) for ASR 5000 (or Management Input Output card for ASR 5500 (MIO)) which does NOT process subscriber requests but does get used for radius test commands (see later section for more detail on that) AND for operator CLI login processing.

In this snippet, aaamgr 107 located on PSC 13 is responsible for handling all RADIUS processing for the paired sessmgr 107 located on PSC 1. Reachability problems for aaamgr 107 affect calls on sessmgr 107.

                  task   cputime       memory     files     sessions
cpu facility     inst used allc   used alloc used allc used allc S status
----------------------- --------- ------------- --------- ------------- -----
1/0 sessmgr       107 1.6% 100% 119.6M 155.0M   26 500   83 6600 I   good
13/1 aaamgr       107 0.3% 94% 30.8M 77.0M   18 500   --   -- -   good
8/0 aaamgr        231 0.1% 30% 11.6M 25.0M   19 500   --   -- -   good

In the following example, note that problems with aaamgr 92 are affecting the paired sessmgr as easily seen when compared to other sessmgrs with respect to session counts:

                   task   cputime        memory     files      sessions
cpu facility      inst used allc   used  alloc used allc  used  allc S status
----------------------- --------- ------------- --------- ------------- ------
12/0 sessmgr         92 1.2% 100% 451.5M  1220M   43  500   643 21120 I   good
16/0 aaamgr          92 0.0%  95% 119.0M 315.0M   20  500    --    -- -   good
12/0 sessmgr         95 6.9% 100% 477.3M  1220M   41  500  2626 21120 I   good
12/0 sessmgr        105 7.7% 100% 600.5M  1220M   45  500  2626 21120 I   good
12/0 sessmgr        126 3.4% 100% 483.0M  1220M   44  500  2625 21120 I   good
12/0 sessmgr        131 8.1% 100% 491.7M  1220M   45  500  2627 21120 I   good
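On a chassis with hundreds of sessmgrs, this comparison is easier to do with a script than by eye. The following Python sketch parses output in the format shown above and flags sessmgr instances whose session count is far below the median. The function names are hypothetical and the 50% cutoff is an arbitrary choice for illustration, not a Cisco recommendation.

```python
from statistics import median

SAMPLE = """\
12/0 sessmgr         92 1.2% 100% 451.5M  1220M   43  500   643 21120 I   good
16/0 aaamgr          92 0.0%  95% 119.0M 315.0M   20  500    --    -- -   good
12/0 sessmgr         95 6.9% 100% 477.3M  1220M   41  500  2626 21120 I   good
12/0 sessmgr        105 7.7% 100% 600.5M  1220M   45  500  2626 21120 I   good
12/0 sessmgr        126 3.4% 100% 483.0M  1220M   44  500  2625 21120 I   good
12/0 sessmgr        131 8.1% 100% 491.7M  1220M   45  500  2627 21120 I   good
"""

def sessmgr_sessions(text):
    """Map sessmgr instance number -> sessions in use."""
    sessions = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: cpu, facility, inst, cpu used/allc, mem used/alloc,
        # files used/allc, sessions used/allc, S, status
        if len(fields) >= 11 and fields[1] == "sessmgr":
            sessions[int(fields[2])] = int(fields[9])
    return sessions

def low_outliers(sessions, ratio=0.5):
    """Return instances whose session count is below ratio * median."""
    typical = median(sessions.values())
    return sorted(i for i, used in sessions.items() if used < ratio * typical)

print(low_outliers(sessmgr_sessions(SAMPLE)))
```

Any instance flagged this way is a candidate for a closer look at the paired aaamgr with the same instance number.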

show radius counters { {all | server <server IP>} [instance <aaamgr #>] | summary}

The number one command to be familiar with is "show radius counters" in its several varieties.

This command reports back many useful counters for troubleshooting radius issues. The "show radius counters all" command is very valuable in tracking success and failures on a server basis, and it is important to understand the meaning of the various counters that compose this command, as it may not be obvious. The command is context-sensitive and so must be run in the same context where the aaa group(s) are defined.

Important note: Over an un-monitored time period, it is difficult to draw any conclusions from the counter values or the relationships amongst counters. To make accurate conclusions, the best approach is to reset the counters and monitor them over a period of time when the issue being troubleshot is occurring.

In the following output, note "Access-Request Sent" = 1, while "Access-Request Retried" = 3. So, any given new request to a particular radius server is only counted once, and all the retries are counted separately. In this case, that is a total of 3 + 1 = 4 access requests sent. Note the counter "Access-Request Timeouts" = 1. A single timeout occurs only when ALL the retries fail, so in this case, 3 retries without a response result in 1 Timeout (not 4). This happens across all of the configured servers until there is success, or all attempts have failed. So pay attention to the counters that are tracked for each server separately. Here is an example of this, where:

radius max-retries 3
radius server 192.168.50.200 encrypted key 01abd002c82b4a2c port 1812 priority 1
radius server 192.168.50.250 encrypted key 01abd002c82b4a2c port 1812 priority 2

[destination]CSE2# show radius counters all

   Server-specific Authentication Counters
   ---------------------------------------
   Authentication server address 192.168.50.200, port 1812:
     Access-Request Sent:                                      1              
     Access-Request with DMU Attributes Sent:                  0              
     Access-Request Pending:                                   0              
     Access-Request Retried:                                   3              
     Access-Request with DMU Attributes Retried:               0              
     Access-Challenge Received:                                0              
     Access-Accept Received:                                   0              
     Access-Reject Received:                                   0              
     Access-Reject Received with DMU Attributes:               0              
     Access-Request Timeouts:                                  1              
     Access-Request Current Consecutive Failures in a mgr:     1            
     Access-Request Response Bad Authenticator Received:       0              
     Access-Request Response Malformed Received:               0              
     Access-Request Response Malformed Attribute Received:     0              
     Access-Request Response Unknown Type Received:            0              
     Access-Request Response Dropped:                          0              
     Access-Request Response Last Round Trip Time:             0.0 ms
     Access-Request Response Average Round Trip Time:          0.0 ms
     Current Access-Request Queued:                            0

...

Authentication server address 192.168.50.250, port 1812:
   Access-Request Sent:                                        1              
     Access-Request with DMU Attributes Sent:                  0              
     Access-Request Pending:                                   0              
     Access-Request Retried:                                   3              
     Access-Request with DMU Attributes Retried:               0              
     Access-Challenge Received:                                0              
     Access-Accept Received:                                   0              
     Access-Reject Received:                                   0              
     Access-Reject Received with DMU Attributes:               0              
     Access-Request Timeouts:                                  1              
     Access-Request Current Consecutive Failures in a mgr:     1            
     Access-Request Response Bad Authenticator Received:       0              
     Access-Request Response Malformed Received:               0              
     Access-Request Response Malformed Attribute Received:     0              
     Access-Request Response Unknown Type Received:            0              
     Access-Request Response Dropped:                          0              
     Access-Request Response Last Round Trip Time:             0.0 ms
     Access-Request Response Average Round Trip Time:          0.0 ms
     Current Access-Request Queued:                            0

Note also that timeouts are NOT counted as failures, the result being that the number of Access-Accept received and Access-Reject received will not add up to Access-Request Sent if there are any timeouts.
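The relationships among these counters can be checked with a short Python sketch that parses the counter lines and does the arithmetic. The parsing function is hypothetical and only handles lines of the form "name: integer". The sample reuses the values from the first server in the output above.

```python
import re

SAMPLE = """\
   Authentication server address 192.168.50.200, port 1812:
     Access-Request Sent:                                      1
     Access-Request Retried:                                   3
     Access-Accept Received:                                   0
     Access-Reject Received:                                   0
     Access-Request Timeouts:                                  1
     Access-Request Response Last Round Trip Time:             0.0 ms
"""

def parse_counters(text):
    """Return {counter name: integer value} for plain integer counters."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?):\s+(\d+)\s*$", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

c = parse_counters(SAMPLE)

# Each new request is counted once in Sent; retries are counted separately.
transmissions = c["Access-Request Sent"] + c["Access-Request Retried"]
# Requests that received any answer, accept or reject.
answered = c["Access-Accept Received"] + c["Access-Reject Received"]
# Sent minus answered should be accounted for by timeouts (or still pending).
unanswered = c["Access-Request Sent"] - answered

print(transmissions, answered, unanswered, c["Access-Request Timeouts"])
```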

Analysis of these counters may not be completely straightforward. For example for Mobile IP (MIP) protocol, as the authentications are failing, there is no MIP Registration Reply (RRP) being sent, and the mobile may continue to initiate new MIP Registration Requests (RRQ) because it has not received a MIP RRP. Each new MIP RRQ causes the PDSN to send a new Authentication request which itself can have its own series of retries. This can be seen in the Id field at the top of a packet trace – it is unique for each set of retries. The result is that the counters for Sent, Retried, and Timeout can be much higher than expected for the number of calls received. There is an option that can be enabled to minimize these extra re-tries, and it can be set in the Foreign Agent (FA) (but not on the Home Agent (HA)) service: “authentication mn-aaa <6 choices here> optimize-retries”

Some other useful counters:

"Access-Request Response Dropped" - occurs if the call fails to setup while waiting for responses to authentication requests.

"Access-Request Response Last Round Trip Time" - indicates any delays between the endpoints, though it obviously would not indicate where the delay might be.

"Access-Request Current Consecutive Failures in a mgr" relates to what was discussed in the first section on triggers for AAA Unreachable traps. It represents the aaamgr(s) with the highest count of consecutive timeouts.

"Current Access/Accounting-Request Queued" indicates requests that are not being responded to and remaining in the queue (accounting allows for a build-up of the queue indefinitely while authentication does not)

The most common scenario seen when AAA Unreachable is reported is that Access Timeouts and/or Response Drops are also occurring, while Access Responses are not keeping up with requests.

If access to privileged technical support mode is available, then further investigation can be done at the aaamgr instance level to determine if one or more specific aaamgrs are the cause of the increase in overall "bad" counts. For example, look for aaamgrs located on a specific PSC/DPC that have high counts, or a single aaamgr or random aaamgrs having issues. Look for patterns. If all or most aaamgrs are having issues, then there is an increased likelihood that the root cause is either external to the chassis OR manifesting large-scale on the chassis. General health checks should be done in that case.

Here is example output showing an issue with a specific aaamgr for accounting. (The issue turned out to be a bug in a firewall between the ASR5K and the RADIUS server that was blocking traffic from the port of a specific aaamgr instance, 114.) Over a three-week period, only 48 responses were received, yet over 100,000 timeouts occurred (and that doesn’t include re-transmits).

[source]PDSN> show radius counters server 209.165.201.1 instance  114 | grep -E "Accounting-Request Sent|Accounting-Response Received|Accounting-Request Timeouts"
Wednesday October 01 18:12:24 UTC 2014
     Accounting-Request Sent:                                  14306189
     Accounting-Response Received:                             14299843
     Accounting-Request Timeouts:                              6342
  
[source]PDSN> show radius counters server 209.165.201.1 instance 114 | grep -E "Accounting server address|Accounting-Request Sent|Accounting-Response Received|Accounting-Request Timeouts"
Wednesday October 22 20:26:35 UTC 2014
   Accounting server address 209.165.201.1, port 1646:
     Accounting-Request Sent:                                  15105872        
     Accounting-Response Received:                             14299891       
     Accounting-Request Timeouts:                              158989
 
[source]PDSN> show radius counters server 209.165.201.1 instance 114 | grep Accounting
Wednesday October 22 20:33:09 UTC 2014
   Per-Context RADIUS Accounting Counters
   Accounting Response
   Server-specific Accounting Counters
   Accounting server address 209.165.201.1, port 1646:
     Accounting-Request Sent:                                  15106321       
     Accounting-Start Sent:                                    7950140        
     Accounting-Stop Sent:                                     7156129        
     Accounting-Interim Sent:                                  52             
     Accounting-On Sent:                                       0              
     Accounting-Off Sent:                                      0              
     Accounting-Request Pending:                               3              
     Accounting-Request Retried:                               283713         
     Accounting-Start Retried:                                 279341         
     Accounting-Stop Retried:                                  4372           
     Accounting-Interim Retried:                               0              
     Accounting-On Retried:                                    0              
     Accounting-Off Retried:                                   0              
     Accounting-Response Received:                             14299891      
     Accounting-Request Timeouts:                              159000         
     Accounting-Request Current Consecutive Failures in a mgr: 11             
     Accounting-Response Bad Response Received:                0              
     Accounting-Response Malformed Received:                   0              
     Accounting-Response Unknown Type Received:                0              
     Accounting-Response Dropped:                              21             
     Accounting-Response Last Round Trip Time:                 52.5 ms
     Accounting-Response Average Round Trip Time:              49.0 ms
     Accounting Total G1 (Acct-Output-Octets):                 4870358614798
     Accounting Total G2 (Acct-Input-Octets):                  714140547011
     Current Accounting-Request Queued:                        17821

In conclusion, determine which counters are incrementing, for which servers, and at what speed.
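
As a rough illustration of measuring how fast counters increment, the following Python sketch takes two saved captures of show radius counters and reports the increase per counter and the increase per hour. It is not part of StarOS: the function names parse_capture and counter_deltas are hypothetical, and the parsing assumes the English timestamp and counter line layout seen in the captures above (the sample values are abbreviated from them).

```python
import re
from datetime import datetime

# Two captures of the same counters, abbreviated from the output above.
FIRST = """\
Wednesday October 01 18:12:24 UTC 2014
     Accounting-Request Sent:                                  14306189
     Accounting-Response Received:                             14299843
     Accounting-Request Timeouts:                              6342
"""
SECOND = """\
Wednesday October 22 20:26:35 UTC 2014
     Accounting-Request Sent:                                  15105872
     Accounting-Response Received:                             14299891
     Accounting-Request Timeouts:                              158989
"""


def parse_capture(text):
    """Return (timestamp, {counter name: value}) for one capture."""
    timestamp, counters = None, {}
    for line in text.splitlines():
        line = line.strip()
        if timestamp is None:
            try:
                # Assumption: timestamp layout as shown in the captures.
                timestamp = datetime.strptime(line, "%A %B %d %H:%M:%S UTC %Y")
                continue
            except ValueError:
                pass
        match = re.match(r"(.+?):\s+(\d+)$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return timestamp, counters


def counter_deltas(first, second):
    """Return {counter name: (increase, increase per hour)}."""
    t1, c1 = parse_capture(first)
    t2, c2 = parse_capture(second)
    hours = (t2 - t1).total_seconds() / 3600
    return {name: (c2[name] - c1[name], (c2[name] - c1[name]) / hours)
            for name in c1 if name in c2}


for name, (delta, per_hour) in counter_deltas(FIRST, SECOND).items():
    print(f"{name}: +{delta} ({per_hour:.1f} per hour)")
```

With the sample values, responses grow by only 48 while timeouts grow by 152,647 over the same interval, which is the pattern described for instance 114.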

show session subsystem facility {aaamgr | sessmgr} {all | instance <instance #>}

While it is beyond the scope of this article to examine all of the output from this command, a couple of examples are worth looking at. As with any other troubleshooting, comparing the output of aaamgr instances believed to be good against those believed to be bad often reveals obvious differences in the values reported. These could show up in the total number of requests, failure/success rate, auth cancelled, and so on. As a reminder, be sure to clear the session subsystem (a single instance cannot be cleared; all instances must be cleared together) to eliminate any history that could cloud the picture of the current state.

Continuing with the issue mentioned earlier of a single aaamgr failing for accounting, here is output from a different node with that same issue, this time on aaamgr instance 36. Note the interesting fields for the failing aaamgr and how those values increase between the two captures of the command. Output from instance 37 is shown as an example of a working aaamgr.

[source]PDSN> show session subsystem  facility aaamgr instance 36
Wednesday September 10 08:51:18 UTC 2014

AAAMgr:  Instance 36
39947440 Total aaa requests              17985 Current aaa requests
24614090 Total aaa auth requests             0 Current aaa auth requests
       0 Total aaa auth probes               0 Current aaa auth probes
       0 Total aaa aggregation requests
       0 Current aaa aggregation requests
       0 Total aaa auth keepalive            0 Current aaa auth keepalive
15171628 Total aaa acct requests         17985 Current aaa acct requests
       0 Total aaa acct keepalive            0 Current aaa acct keepalive
20689536 Total aaa auth success        1322489 Total aaa auth failure
   86719 Total aaa auth purged            1016 Total aaa auth cancelled
       0 Total auth keepalive success        0 Total auth keepalive failure
       0 Total auth keepalive purged
       0 Total aaa aggregation success requests
       0 Total aaa aggregation failure requests
       0 Total aaa aggregation purged requests
   15237 Total aaa auth DMU challenged
   17985/70600 aaa request (used/max)
      14 Total diameter auth responses dropped
 6960270 Total Diameter auth requests        0 Current Diameter auth requests
   23995 Total Diameter auth requests retried
      52 Total Diameter auth requests dropped
 9306676 Total radius auth requests          0 Current radius auth requests
       0 Total radius auth requests retried
     988 Total radius auth responses dropped
      13 Total local auth requests           0 Current local auth requests
 8500275 Total pseudo auth requests          0 Current pseudo auth requests
    8578 Total null-username auth requests (rejected)
       0 Total aggregation responses dropped
15073834 Total aaa acct completed        79763 Total aaa acct purged     <== If issue started recently, this may not have yet started incrementing
       0 Total acct keepalive success        0 Total acct keepalive timeout
       0 Total acct keepalive purged
       4 CLI Test aaa acct purged
       0 IP Interface down aaa acct purged
       0 No Radius Server found aaa acct purged
       0 No Response aaa acct purged
14441090 Total acct sess alloc
14422811 Total acct sess delete
   18279 Current acct sessions
       0 Auth No Wait Suppressed
       0 Aggr No Wait Suppressed
       0 Disc No Wait Suppressed
       0 Start No Wait Suppressed
       0 Interim No Wait Suppressed
       0 Stop No Wait Suppressed
       0 Acct OnOff Custom14
       0 Acct OnOff Custom67
       0 Acct OnOff
       0 Recovery Str Suppressed
       0 Recovery Stop Suppressed
       0 Med Chrg Gtpp Suppressed
       0 Med Chrg Radius Suppressed
       0 Radius Probe Trigger
       0 Recovery Stop Acct Session Suppressed
      46 Total aaa acct cancelled
       0 Total Diameter acct requests        0 Current Diameter acct requests
       0 Total Diameter acct requests retried
       0 Total diameter acct requests dropped
       0 Total diameter acct responses dropped
       0 Total diameter acct cancelled
       0 Total diameter acct purged
15171628 Total radius acct requests      17985 Current radius acct requests
      46 Total radius acct cancelled
   79763 Total radius acct purged
   11173 Total radius acct requests retried
      49 Total radius acct responses dropped
       0 Total radius sec acct requests      0 Current radius sec acct requests
       0 Total radius sec acct cancelled
       0 Total radius sec acct purged
       0 Total radius sec acct requests retried
       0 Total gtpp acct requests            0 Current gtpp acct requests
       0 Total gtpp acct cancelled           0 Total gtpp acct purged
       0 Total gtpp sec acct requests        0 Total gtpp sec acct purged
       0 Total null acct requests            0 Current null acct requests
16218236 Total aaa acct sessions         21473 Current aaa acct sessions
    8439 Total aaa acct archived             2 Current aaa acct archived
   21473 Current recovery archives        4724 Current valid recovery records
       1 Total aaa sockets opened            1 Current aaa sockets opened
       1 Total aaa requests pend socket opened
       0 Current aaa requests pend socket open
  133227 Total radius requests pend server max-outstanding
   17982 Current radius requests pend server max-outstanding
       0 Total radius auth req queued server max-rate
       0 Max radius auth req queued server max-rate
       0 Current radius auth req queued server max-rate
       0 Total radius acct req queued server max-rate
       0 Max radius acct req queued server max-rate
       0 Current radius acct req queued server max-rate
       0 Total radius charg auth req queued server max-rate
       0 Max radius charg auth req queued server max-rate
       0 Current radius charg auth req queued server max-rate
       0 Total radius charg acct req queued server max-rate
       0 Max radius charg acct req queued server max-rate
       0 Current radius charg acct req queued server max-rate
       0 Total aaa radius coa requests       0 Total aaa radius dm requests
       0 Total aaa radius coa acks           0 Total aaa radius dm acks
       0 Total aaa radius coa naks           0 Total aaa radius dm naks
       0 Total radius charg auth             0 Current radius charg auth
       0 Total radius charg auth success     0 Total radius charg auth failure
       0 Total radius charg auth purged      0 Total radius charg auth cancelled
       0 Total radius charg acct             0 Current radius charg acct
       0 Total radius charg acct success     0 Total radius charg acct purged
       0 Total radius charg acct cancelled
       0 Total gtpp charg                    0 Current gtpp charg
       0 Total gtpp charg success            0 Total gtpp charg failure
       0 Total gtpp charg cancelled          0 Total gtpp charg purged
       0 Total gtpp sec charg                0 Total gtpp sec charg purged
  161722 Total prepaid online requests       0 Current prepaid online requests
  141220 Total prepaid online success    20392 Current prepaid online failure
       0 Total prepaid online retried      102 Total prepaid online cancelled
       8 Current prepaid online purged
...
 
[source]PDSN> show session subsystem facility aaamgr instance 37
Wednesday September 10 08:51:28 UTC 2014

AAAMgr:  Instance 37
39571859 Total aaa requests                  0 Current aaa requests
24368622 Total aaa auth requests             0 Current aaa auth requests
       0 Total aaa auth probes               0 Current aaa auth probes
       0 Total aaa aggregation requests
       0 Current aaa aggregation requests
       0 Total aaa auth keepalive            0 Current aaa auth keepalive
15043217 Total aaa acct requests             0 Current aaa acct requests
       0 Total aaa acct keepalive            0 Current aaa acct keepalive
20482618 Total aaa auth success        1309507 Total aaa auth failure
   85331 Total aaa auth purged             968 Total aaa auth cancelled
       0 Total auth keepalive success        0 Total auth keepalive failure
       0 Total auth keepalive purged
       0 Total aaa aggregation success requests
       0 Total aaa aggregation failure requests
       0 Total aaa aggregation purged requests
   15167 Total aaa auth DMU challenged
       1/70600 aaa request (used/max)
      41 Total diameter auth responses dropped
 6883765 Total Diameter auth requests        0 Current Diameter auth requests
   23761 Total Diameter auth requests retried
      37 Total Diameter auth requests dropped
 9216203 Total radius auth requests          0 Current radius auth requests
       0 Total radius auth requests retried
     927 Total radius auth responses dropped
      15 Total local auth requests           0 Current local auth requests
 8420022 Total pseudo auth requests          0 Current pseudo auth requests
    8637 Total null-username auth requests (rejected)
       0 Total aggregation responses dropped
15043177 Total aaa acct completed            0 Total aaa acct purged
       0 Total acct keepalive success        0 Total acct keepalive timeout
       0 Total acct keepalive purged
       0 CLI Test aaa acct purged
       0 IP Interface down aaa acct purged
       0 No Radius Server found aaa acct purged
       0 No Response aaa acct purged
14358245 Total acct sess alloc
14356293 Total acct sess delete
    1952 Current acct sessions
       0 Auth No Wait Suppressed
       0 Aggr No Wait Suppressed
       0 Disc No Wait Suppressed
       0 Start No Wait Suppressed
       0 Interim No Wait Suppressed
       0 Stop No Wait Suppressed
       0 Acct OnOff Custom14
       0 Acct OnOff Custom67
       0 Acct OnOff
       0 Recovery Str Suppressed
       0 Recovery Stop Suppressed
       0 Med Chrg Gtpp Suppressed
       0 Med Chrg Radius Suppressed
       0 Radius Probe Trigger
       0 Recovery Stop Acct Session Suppressed
      40 Total aaa acct cancelled
       0 Total Diameter acct requests        0 Current Diameter acct requests
       0 Total Diameter acct requests retried
       0 Total diameter acct requests dropped
       0 Total diameter acct responses dropped
       0 Total diameter acct cancelled
       0 Total diameter acct purged
15043217 Total radius acct requests          0 Current radius acct requests
      40 Total radius acct cancelled
       0 Total radius acct purged
     476 Total radius acct requests retried
      37 Total radius acct responses dropped
       0 Total radius sec acct requests      0 Current radius sec acct requests
       0 Total radius sec acct cancelled
       0 Total radius sec acct purged
       0 Total radius sec acct requests retried
       0 Total gtpp acct requests            0 Current gtpp acct requests
       0 Total gtpp acct cancelled           0 Total gtpp acct purged
       0 Total gtpp sec acct requests        0 Total gtpp sec acct purged
       0 Total null acct requests            0 Current null acct requests
16057760 Total aaa acct sessions          4253 Current aaa acct sessions
      14 Total aaa acct archived             0 Current aaa acct archived
    4253 Current recovery archives        4249 Current valid recovery records
       1 Total aaa sockets opened            1 Current aaa sockets opened
       1 Total aaa requests pend socket opened
       0 Current aaa requests pend socket open
   29266 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding
       0 Total radius auth req queued server max-rate
       0 Max radius auth req queued server max-rate
       0 Current radius auth req queued server max-rate
       0 Total radius acct req queued server max-rate
       0 Max radius acct req queued server max-rate
       0 Current radius acct req queued server max-rate
       0 Total radius charg auth req queued server max-rate
       0 Max radius charg auth req queued server max-rate
       0 Current radius charg auth req queued server max-rate
       0 Total radius charg acct req queued server max-rate
       0 Max radius charg acct req queued server max-rate
       0 Current radius charg acct req queued server max-rate
       0 Total aaa radius coa requests       0 Total aaa radius dm requests
       0 Total aaa radius coa acks           0 Total aaa radius dm acks
       0 Total aaa radius coa naks           0 Total aaa radius dm naks
       0 Total radius charg auth             0 Current radius charg auth
       0 Total radius charg auth success     0 Total radius charg auth failure
       0 Total radius charg auth purged      0 Total radius charg auth cancelled
       0 Total radius charg acct             0 Current radius charg acct
       0 Total radius charg acct success     0 Total radius charg acct purged
       0 Total radius charg acct cancelled
       0 Total gtpp charg                    0 Current gtpp charg
       0 Total gtpp charg success            0 Total gtpp charg failure
       0 Total gtpp charg cancelled          0 Total gtpp charg purged
       0 Total gtpp sec charg                0 Total gtpp sec charg purged
  160020 Total prepaid online requests       0 Current prepaid online requests
  139352 Total prepaid online success    20551 Current prepaid online failure
...
 

 
[source]PDSN> show session subsystem facility aaamgr instance 36
Wednesday September 10 09:12:13 UTC 2014

AAAMgr:  Instance 36

39949892 Total aaa requests              17980 Current aaa requests
24615615 Total aaa auth requests             0 Current aaa auth requests
       0 Total aaa auth probes               0 Current aaa auth probes
       0 Total aaa aggregation requests
       0 Current aaa aggregation requests
       0 Total aaa auth keepalive            0 Current aaa auth keepalive
15172543 Total aaa acct requests         17980 Current aaa acct requests
       0 Total aaa acct keepalive            0 Current aaa acct keepalive
20690768 Total aaa auth success        1322655 Total aaa auth failure
   86728 Total aaa auth purged            1016 Total aaa auth cancelled
       0 Total auth keepalive success        0 Total auth keepalive failure
       0 Total auth keepalive purged
       0 Total aaa aggregation success requests
       0 Total aaa aggregation failure requests
       0 Total aaa aggregation purged requests
   15242 Total aaa auth DMU challenged
   17981/70600 aaa request (used/max)
      14 Total diameter auth responses dropped
 6960574 Total Diameter auth requests        0 Current Diameter auth requests
   23999 Total Diameter auth requests retried
      52 Total Diameter auth requests dropped
 9307349 Total radius auth requests          0 Current radius auth requests
       0 Total radius auth requests retried
     988 Total radius auth responses dropped
      13 Total local auth requests           0 Current local auth requests
 8500835 Total pseudo auth requests          0 Current pseudo auth requests
    8578 Total null-username auth requests (rejected)
       0 Total aggregation responses dropped
15074358 Total aaa acct completed        80159 Total aaa acct purged
       0 Total acct keepalive success        0 Total acct keepalive timeout
       0 Total acct keepalive purged
       4 CLI Test aaa acct purged
       0 IP Interface down aaa acct purged
       0 No Radius Server found aaa acct purged
       0 No Response aaa acct purged
14441768 Total acct sess alloc
14423455 Total acct sess delete
   18313 Current acct sessions
       0 Auth No Wait Suppressed
       0 Aggr No Wait Suppressed
       0 Disc No Wait Suppressed
       0 Start No Wait Suppressed
       0 Interim No Wait Suppressed
       0 Stop No Wait Suppressed
       0 Acct OnOff Custom14
       0 Acct OnOff Custom67
       0 Acct OnOff
       0 Recovery Str Suppressed
       0 Recovery Stop Suppressed
       0 Med Chrg Gtpp Suppressed
       0 Med Chrg Radius Suppressed
       0 Radius Probe Trigger
       0 Recovery Stop Acct Session Suppressed
      46 Total aaa acct cancelled
       0 Total Diameter acct requests        0 Current Diameter acct requests
       0 Total Diameter acct requests retried
       0 Total diameter acct requests dropped
       0 Total diameter acct responses dropped
       0 Total diameter acct cancelled
       0 Total diameter acct purged
15172543 Total radius acct requests      17980 Current radius acct requests
      46 Total radius acct cancelled
   80159 Total radius acct purged
   11317 Total radius acct requests retried
      49 Total radius acct responses dropped
       0 Total radius sec acct requests      0 Current radius sec acct requests
       0 Total radius sec acct cancelled
       0 Total radius sec acct purged
       0 Total radius sec acct requests retried
       0 Total gtpp acct requests            0 Current gtpp acct requests
       0 Total gtpp acct cancelled           0 Total gtpp acct purged
       0 Total gtpp sec acct requests        0 Total gtpp sec acct purged
       0 Total null acct requests            0 Current null acct requests
16219251 Total aaa acct sessions         21515 Current aaa acct sessions
    8496 Total aaa acct archived             0 Current aaa acct archived
   21515 Current recovery archives        4785 Current valid recovery records
       1 Total aaa sockets opened            1 Current aaa sockets opened
       1 Total aaa requests pend socket opened
       0 Current aaa requests pend socket open
  133639 Total radius requests pend server max-outstanding
   17977 Current radius requests pend server max-outstanding
...
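
The good-versus-bad comparison described above can also be scripted. The following Python sketch is an illustration only: the helper names parse_counters and stand_out and the factor-of-10 threshold are assumptions, not anything provided by the platform. It parses the count/label pairs from two captures (abbreviated here from the 08:51 output for instances 36 and 37) and lists the fields where the suspect instance is at least ten times higher than the working one.

```python
import re

# A few lines from the failing instance (36) and the working instance (37).
BAD = """\
39947440 Total aaa requests              17985 Current aaa requests
15073834 Total aaa acct completed        79763 Total aaa acct purged
   11173 Total radius acct requests retried
   17982 Current radius requests pend server max-outstanding
"""
GOOD = """\
39571859 Total aaa requests                  0 Current aaa requests
15043177 Total aaa acct completed            0 Total aaa acct purged
     476 Total radius acct requests retried
       0 Current radius requests pend server max-outstanding
"""


def parse_counters(text):
    """Return {label: value} for every 'value label' pair in the text."""
    counters = {}
    for line in text.splitlines():
        # Two pairs on one line are separated by a run of spaces.
        for segment in re.split(r"\s{2,}", line.strip()):
            match = re.match(r"(\d+) (.+)$", segment)
            if match:
                counters[match.group(2)] = int(match.group(1))
    return counters


def stand_out(bad, good, factor=10):
    """Return fields where bad >= factor * good (good is floored at 1)."""
    flagged = {}
    for name, bad_value in bad.items():
        good_value = good.get(name, 0)
        if bad_value >= factor * max(good_value, 1):
            flagged[name] = (bad_value, good_value)
    return flagged


for name, (bad_value, good_value) in stand_out(
        parse_counters(BAD), parse_counters(GOOD)).items():
    print(f"{name}: {bad_value} vs {good_value}")
```

With these sample lines, the lifetime totals that are similar on both instances are ignored, while the current requests, acct purged, acct retried and current max-outstanding fields are reported.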

One should also run show task resources to check for any uneven session counts (used column) amongst all sessmgrs. If any are found, check the aaamgrs paired with those sessmgrs with this command to see whether any fields are out of line. If the issue is due to RADIUS, there is a good chance of finding something.

In the show task resources example in a previous section, there was a significantly lower session count on sessmgr 92, which was paired with aaamgr 92. The output from show session subsystem shows a significant increase in the total max-outstanding and aaa auth purged counters, and elevated Current max-outstanding counters. One can use the grep feature live on the chassis and/or Notepad++ or another powerful search editor to quickly analyze data. Run the command multiple times to see what values are increasing or remaining elevated:

[Ingress]PGW# show session subsystem facility aaamgr all
Tuesday January 10 04:42:29 UTC 2012
    4695 Total aaa auth purged
    4673 Total radius auth requests         16 Current radius auth requests
    4167 Total radius requests pend server max-outstanding
      76 Current radius requests pend server max-outstanding

[Ingress]PGW# show session subsystem facility aaamgr all | grep "max-outstanding"
Tuesday January 10 04:51:00 UTC 2012
    4773 Total radius requests pend server max-outstanding
      67 Current radius requests pend server max-outstanding

[Ingress]PGW# show session subsystem facility aaamgr all | grep "max-outstanding"
Tuesday January 10 04:56:10 UTC 2012
    5124 Total radius requests pend server max-outstanding
      81 Current radius requests pend server max-outstanding

[Ingress]PGW# show session subsystem facility aaamgr instance 92
Tuesday January 10 04:57:03 UTC 2012
    5869 Total aaa auth purged
    5843 Total radius auth requests         12 Current radius auth requests
    5170 Total radius requests pend server max-outstanding
      71 Current radius requests pend server max-outstanding

[Ingress]PGW# show session subsystem facility aaamgr instance 92
Tuesday January 10 05:10:05 UTC 2012
    6849 Total aaa auth purged
    6819 Total radius auth requests          6 Current radius auth requests
    5981 Total radius requests pend server max-outstanding
      68 Current radius requests pend server max-outstanding

[Ingress]PGW# show session subsystem facility aaamgr all | grep "max-outstanding"
Tuesday January 10 05:44:22 UTC 2012
      71 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding
      61 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding

    7364 Total radius requests pend server max-outstanding   <== instance #92
      68 Current radius requests pend server max-outstanding

      89 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding
      74 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding

[Ingress]PGW# radius test instance 92 auth server 65.175.1.10 port 1645 test test
Tuesday January 10 06:13:38 UTC 2012

Authentication from authentication server 65.175.1.10, port 1645
Communication Failure: No response received
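
When the command is run for all instances and filtered with grep, the instance numbers are not shown, so an outlier has to be spotted among many Total/Current pairs. The Python sketch below shows one possible way to do that offline. The helper names and the threshold of ten times the median are arbitrary assumptions, and the sample is abbreviated from the 05:44:22 capture above. It reports only the position of each suspicious pair in the output, not an instance number.

```python
import re
from statistics import median

SAMPLE = """\
      71 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding
      61 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding
    7364 Total radius requests pend server max-outstanding
      68 Current radius requests pend server max-outstanding
      89 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding
      74 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding
"""

LABEL = "radius requests pend server max-outstanding"


def max_outstanding_pairs(text):
    """Return [(total, current), ...] in order of appearance."""
    totals = re.findall(rf"^\s*(\d+) Total {LABEL}", text, re.M)
    currents = re.findall(rf"^\s*(\d+) Current {LABEL}", text, re.M)
    return [(int(t), int(c)) for t, c in zip(totals, currents)]


def suspects(pairs, factor=10):
    """Return positions whose Current is non-zero or whose Total is far above the median."""
    typical = median(total for total, _ in pairs)
    return [position for position, (total, current) in enumerate(pairs)
            if current > 0 or total > factor * max(typical, 1)]


pairs = max_outstanding_pairs(SAMPLE)
for position in suspects(pairs):
    print(f"pair {position}: total={pairs[position][0]} current={pairs[position][1]}")
```

With the sample values only the third pair (position 2, counting from zero) is flagged, which is the one annotated as instance #92 in the capture.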

ping

traceroute

An ICMP ping tests basic connectivity to see whether the AAA server can be reached. Depending on the network, the ping may need to be sourced with the src keyword, and it must be run from the AAA context to have value. If the ping to the server fails, try pinging intermediary elements, including the next-hop address in the context. If the ping to the next hop also fails, confirm that there is an ARP entry for the next-hop address. Traceroute can also help with routing issues.

[source]CSE2# ping 192.168.50.200
PING 192.168.50.200 (192.168.50.200) 56(84) bytes of data.
64 bytes from 192.168.50.200: icmp_seq=1 ttl=64 time=0.411 ms
64 bytes from 192.168.50.200: icmp_seq=2 ttl=64 time=0.350 ms
64 bytes from 192.168.50.200: icmp_seq=3 ttl=64 time=0.353 ms
64 bytes from 192.168.50.200: icmp_seq=4 ttl=64 time=0.321 ms
64 bytes from 192.168.50.200: icmp_seq=5 ttl=64 time=0.354 ms

--- 192.168.50.200 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4000ms
rtt min/avg/max/mdev = 0.321/0.357/0.411/0.037 ms

radius test instance x auth {radius group <group> | all | server <IP> port <port>} <username> <password>

radius test instance x accounting {radius group <group name> | all | server <IP> port <port>}

With access to the Tech Support Test commands, one can further test whether a specific aaamgr is able to reach any RADIUS server. For a basic RADIUS connectivity test that is independent of any specific aaamgr instance, use the generic version of this command, which does not specify an instance # and uses the management instance by default. If this fails, it may point to a wider issue that is not tied to specific instances.

This command sends a basic authentication request, or accounting start and stop requests, and waits for a response. For authentication, any username and password can be used, in which case a reject response is expected and confirms that RADIUS is working as designed. Alternatively, a known working username/password can be used, in which case an accept response should be received.

Here is an example output from monitor protocol and running the authentication version of the command on a lab chassis:

[source]CSE2# radius test authentication server 192.168.50.200 port 1812 test test

Authentication from authentication server 192.168.50.200, port 1812
Authentication Success: Access-Accept received
Round-trip time for response was 12.3 ms

<<<<OUTBOUND 14:53:49:202 Eventid:23901(6)
RADIUS AUTHENTICATION Tx PDU, from 192.168.50.151:32783 to 192.168.50.200:1812 (58) PDU-dict=starent-vsa1
Code: 1 (Access-Request)
Id: 5
Length: 58
Authenticator: 56 97 57 9C 51 EF A4 08 20 E1 14 89 40 DE 0B 62
     User-Name = test
     User-Password = 49 B0 92 4D DC 64 49 BA B0 0E 18 36 3F B6 1B 37
     NAS-IP-Address = 192.168.50.151
     NAS-Identifier = source

INBOUND>>>>> 14:53:49:214 Eventid:23900(6)
RADIUS AUTHENTICATION Rx PDU, from 192.168.50.200:1812 to 192.168.50.151:32783 (34) PDU-dict=starent-vsa1
Code: 2 (Access-Accept)
Id: 5
Length: 34
Authenticator: D7 94 1F 18 CA FE B4 27 17 75 5C 99 9F A8 61 78
     User-Password = testpassword

Here is an example from a live chassis:

<<<<OUTBOUND 12:45:49:869 Eventid:23901(6)
RADIUS AUTHENTICATION Tx PDU, from 10.209.28.200:33156 to 209.165.201.1:1645 (72) PDU-dict=custom150
Code: 1 (Access-Request)
Id: 6
Length: 72
Authenticator: 67 C2 2B 3E 29 5E A5 28 2D FB 85 CA 0E 9F A4 17
     User-Name = test
     User-Password = 8D 95 3B 31 99 E2 6A 24 1F 81 13 00 3C 73 BC 53
     NAS-IP-Address = 10.209.28.200
     NAS-Identifier = source
     3GPP2-Session-Term-Capability = Both_Dynamic_Auth_And_Reg_Revocation_in_MIP
 
INBOUND>>>>> 12:45:49:968 Eventid:23900(6)
RADIUS AUTHENTICATION Rx PDU, from 209.165.201.1:1645 to 10.209.28.200:33156 (50) PDU-dict=custom150
Code: 3 (Access-Reject)
Id: 6
Length: 50
Authenticator: 99 2E EC DA ED AD 18 A9 86 D4 93 52 57 4C 2F 84
     Reply-Message = Invalid username or password
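
As the traces above show, both an Access-Accept and an Access-Reject prove that the server was reached and answered. The following Python sketch illustrates that reading of a saved monitor protocol trace. The code-to-name table uses the standard RADIUS packet codes; the helper names list_pdus and server_answered are hypothetical, and the parsing assumes the OUTBOUND/INBOUND markers and Code: lines appear as in the live chassis trace, which is abbreviated here.

```python
import re

# Standard RADIUS packet codes (RFC 2865 / RFC 2866).
RADIUS_CODES = {
    1: "Access-Request",
    2: "Access-Accept",
    3: "Access-Reject",
    4: "Accounting-Request",
    5: "Accounting-Response",
    11: "Access-Challenge",
}
RESPONSE_CODES = {2, 3, 5, 11}

TRACE = """\
<<<<OUTBOUND 12:45:49:869 Eventid:23901(6)
RADIUS AUTHENTICATION Tx PDU, from 10.209.28.200:33156 to 209.165.201.1:1645 (72) PDU-dict=custom150
Code: 1 (Access-Request)
Id: 6

INBOUND>>>>> 12:45:49:968 Eventid:23900(6)
RADIUS AUTHENTICATION Rx PDU, from 209.165.201.1:1645 to 10.209.28.200:33156 (50) PDU-dict=custom150
Code: 3 (Access-Reject)
Id: 6
"""


def list_pdus(trace):
    """Return [(direction, code), ...] for each PDU in the trace."""
    pdus, direction = [], None
    for line in trace.splitlines():
        line = line.strip()
        if line.startswith("<<<<OUTBOUND"):
            direction = "outbound"
        elif line.startswith("INBOUND>>>>>"):
            direction = "inbound"
        else:
            match = re.match(r"Code: (\d+)", line)
            if match and direction:
                pdus.append((direction, int(match.group(1))))
    return pdus


def server_answered(trace):
    """True if any inbound PDU carries a RADIUS response code."""
    return any(direction == "inbound" and code in RESPONSE_CODES
               for direction, code in list_pdus(trace))


for direction, code in list_pdus(TRACE):
    print(direction, RADIUS_CODES.get(code, f"code {code}"))
print("server answered:", server_answered(TRACE))
```

A trace that contains only the outbound Access-Request, with no inbound PDU, would make server_answered return False, which corresponds to the "No response received" case.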

Here is an example output from running the accounting version of the command. A password is not needed.

[source]CSE2# radius test accounting server 192.168.50.200 port 1813 test
RADIUS Start to accounting server 192.168.50.200, port 1813
Accounting Success: response received
Round-trip time for response was 7.9 ms
 
RADIUS Stop to accounting server 192.168.50.200, port 1813
Accounting Success: response received
Round-trip time for response was 15.4 ms


<<<<OUTBOUND 15:23:14:974 Eventid:24901(6)
RADIUS ACCOUNTING Tx PDU, from 192.168.50.151:32783 to 192.168.50.200:1813 (62) PDU-dict=starent-vsa1
Code: 4 (Accounting-Request)
Id: 8
Length: 62
Authenticator: DA 0F A8 11 7B FE 4B 1A 56 EB 0D 49 8C 17 BD F6
     User-Name = test
     NAS-IP-Address = 192.168.50.151
     Acct-Status-Type = Start
     Acct-Session-Id = 00000000
     NAS-Identifier = source
     Acct-Session-Time = 0

INBOUND>>>>> 15:23:14:981 Eventid:24900(6)
RADIUS ACCOUNTING Rx PDU, from 192.168.50.200:1813 to 192.168.50.151:32783 (20) PDU-dict=starent-vsa1
Code: 5 (Accounting-Response)
Id: 8
Length: 20
Authenticator: 05 E2 82 29 45 FC BC D6 6C 48 63 AA 14 9D 47 5B

<<<<OUTBOUND 15:23:14:983 Eventid:24901(6)
RADIUS ACCOUNTING Tx PDU, from 192.168.50.151:32783 to 192.168.50.200:1813 (62) PDU-dict=starent-vsa1
Code: 4 (Accounting-Request)
Id: 9
Length: 62
Authenticator: 29 DB F1 0B EC CE 68 DB C7 4D 60 E4 7F A2 D0 3A
     User-Name = test
     NAS-IP-Address = 192.168.50.151
     Acct-Status-Type = Stop
     Acct-Session-Id = 00000000
     NAS-Identifier = source
     Acct-Session-Time = 0

INBOUND>>>>> 15:23:14:998 Eventid:24900(6)
RADIUS ACCOUNTING Rx PDU, from 192.168.50.200:1813 to 192.168.50.151:32783 (20) PDU-dict=starent-vsa1
Code: 5 (Accounting-Response)
Id: 9
Length: 20
Authenticator: D8 3D EF 67 EA 75 E0 31 A5 31 7F E8 7E 69 73 DC

The following output is for the same aaamgr instance 36 discussed earlier, where connectivity to a specific RADIUS accounting server (209.165.201.3) is broken:

[source]PDSN> radius test instance 36 accounting all test
Wednesday September 10 10:06:29 UTC 2014

RADIUS Start to accounting server 209.165.201.1, port 1646
Accounting Success: response received
Round-trip time for response was 51.2 ms

RADIUS Stop to accounting server 209.165.201.1, port 1646
Accounting Success: response received
Round-trip time for response was 46.2 ms

RADIUS Start to accounting server 209.165.201.2, port 1646
Accounting Success: response received
Round-trip time for response was 89.3 ms

RADIUS Stop to accounting server 209.165.201.2, port 1646
Accounting Success: response received
Round-trip time for response was 87.8 ms


RADIUS Start to accounting server 209.165.201.3, port 1646
Communication Failure: no response received

RADIUS Stop to accounting server 209.165.201.3, port 1646
Communication Failure: no response received

RADIUS Start to accounting server 209.165.201.4, port 1646
Accounting Success: response received
Round-trip time for response was 81.6 ms

RADIUS Stop to accounting server 209.165.201.4, port 1646
Accounting Success: response received
Round-trip time for response was 77.1 ms

RADIUS Start to accounting server 209.165.201.5, port 1646
Accounting Success: response received
Round-trip time for response was 46.7 ms

RADIUS Stop to accounting server 209.165.201.5, port 1646
Accounting Success: response received
Round-trip time for response was 46.7 ms

RADIUS Start to accounting server 209.165.201.6, port 1646
Accounting Success: response received
Round-trip time for response was 79.6 ms

RADIUS Stop to accounting server 209.165.201.6, port 1646
Accounting Success: response received
Round-trip time for response was 10113.0 ms
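
When many servers are tested at once, the failing ones are easy to miss in the output. The Python sketch below shows one possible way to summarize a saved copy of the accounting test output. The helper names parse_radius_test and problems are hypothetical, the 1000 ms round-trip limit is an arbitrary assumption, and the sample is abbreviated from the output above.

```python
import re

SAMPLE = """\
RADIUS Start to accounting server 209.165.201.1, port 1646
Accounting Success: response received
Round-trip time for response was 51.2 ms

RADIUS Stop to accounting server 209.165.201.1, port 1646
Accounting Success: response received
Round-trip time for response was 46.2 ms

RADIUS Start to accounting server 209.165.201.3, port 1646
Communication Failure: no response received

RADIUS Stop to accounting server 209.165.201.3, port 1646
Communication Failure: no response received

RADIUS Start to accounting server 209.165.201.6, port 1646
Accounting Success: response received
Round-trip time for response was 79.6 ms

RADIUS Stop to accounting server 209.165.201.6, port 1646
Accounting Success: response received
Round-trip time for response was 10113.0 ms
"""


def parse_radius_test(text):
    """Return one dict per Start/Stop test found in the output."""
    results, current = [], None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(
            r"RADIUS (Start|Stop) to accounting server (\S+), port (\d+)", line)
        if header:
            current = {"type": header.group(1), "server": header.group(2),
                       "port": int(header.group(3)), "ok": None, "rtt_ms": None}
            results.append(current)
        elif current is not None:
            if line.startswith("Accounting Success"):
                current["ok"] = True
            elif line.startswith("Communication Failure"):
                current["ok"] = False
            else:
                rtt = re.match(r"Round-trip time for response was ([\d.]+) ms", line)
                if rtt:
                    current["rtt_ms"] = float(rtt.group(1))
    return results


def problems(results, rtt_limit_ms=1000.0):
    """Return (server, type, reason) for failed or slow tests."""
    found = []
    for r in results:
        if r["ok"] is False:
            found.append((r["server"], r["type"], "no response"))
        elif r["rtt_ms"] is not None and r["rtt_ms"] > rtt_limit_ms:
            found.append((r["server"], r["type"], f"slow: {r['rtt_ms']} ms"))
    return found


for server, kind, reason in problems(parse_radius_test(SAMPLE)):
    print(server, kind, reason)
```

With the sample data this flags both tests to 209.165.201.3 as unanswered and also the Stop to 209.165.201.6, whose round-trip time of 10113.0 ms is far above the other servers.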

show radius info [radius group <group name>] instance { X | all}

This command reports the Network Processor Unit (NPU) flow ID and the UDP port used by the configured NAS IP address to connect to RADIUS servers. This is reported in the aaa group default section of the output. The port number is useful when RADIUS packets in a packet capture need to be matched with a specific aaamgr instance #. (NPU flows are complicated and are not discussed in this article, but a support engineer would be able to investigate them further.) The command also tracks outstanding requests to the server. In the same example issue used throughout this article, only a specific RADIUS server <==> NAS IP / UDP port pair had failed, as highlighted.

[source]PDSN> show radius info radius group all instance 114
Wednesday October 01 11:39:15 UTC 2014

Context source:
---------------------------------------------

  AAAMGR instance 114:  cb-list-en: 1 AAA Group: aaa-roamingprovider.com
  ---------------------------------------------
    Authentication servers:
    ---------------------------------------------
    Primary  authentication server address 209.165.201.1, port 1645
      state Active
      priority 1
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0
    Secondary  authentication server address 209.165.201.2, port 1645
      state Active
      priority 2
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0

    Accounting servers:
    ---------------------------------------------
    Primary  accounting server address 209.165.201.1, port 1646
      state Active
      priority 1
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0
    Secondary  accounting server address 209.165.201.2, port 1646
      state Active
      priority 2
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0

  AAAMGR instance 114:  cb-list-en: 1 AAA Group: aaa-maingroup.com
  ---------------------------------------------
    Authentication servers:
    ---------------------------------------------
    Primary  authentication server address 209.165.201.3, port 1645
      state Active
      priority 1
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0
    Secondary  authentication server address 209.165.201.4, port 1645
      state Active
      priority 2
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0

    Accounting servers:
    ---------------------------------------------
    Primary  accounting server address 209.165.201.3, port 1646
      state Down
      priority 1
      requests outstanding 3
      max requests outstanding 3
      consecutive failures 7
      dead time expires in 146 seconds
    Secondary  accounting server address 209.165.201.4, port 1646
      state Active
      priority 2
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0

  AAAMGR instance 114:  cb-list-en: 1 AAA Group: default
  ---------------------------------------------
  socket number: 388550648
  socket state: ready
  local ip address: 10.210.21.234
  local udp port: 25808
  flow id: 20425379
  use med interface: yes
  VRF context ID: 2

    Authentication servers:
    ---------------------------------------------
    Primary  authentication server address 209.165.201.5, port 1645
      state Active
      priority 1
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0
    Secondary  authentication server address 209.165.201.6, port 1645
      state Not Responding
      priority 2
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0

    Accounting servers:
    ---------------------------------------------
    Primary  accounting server address 209.165.201.5, port 1646
      state Active
      priority 1
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0
    Secondary  accounting server address 209.165.201.6, port 1646
      state Active
      priority 2
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0

[source]PDSN>
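
The output above is long, and only two of the twelve server entries are unhealthy. As a minimal sketch (plain Python run against saved output, with hypothetical function and field names), the following lists every server whose state is not Active or whose consecutive failure count is above zero. It assumes the line layout shown above and may need adjusting for other releases.

```python
import re

SERVER_RE = re.compile(
    r"(Primary|Secondary)\s+(authentication|accounting) server address ([\d.]+), port (\d+)"
)

def find_unhealthy_servers(output):
    """Return the server entries that are not Active or have consecutive failures."""
    servers = []
    group = None
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        group_match = re.search(r"AAA Group: (\S+)", line)
        if group_match:
            group = group_match.group(1)
            continue
        server_match = SERVER_RE.match(line)
        if server_match:
            current = {
                "group": group,
                "role": server_match.group(2),
                "ip": server_match.group(3),
                "port": int(server_match.group(4)),
                "state": None,
                "outstanding": 0,
                "failures": 0,
            }
            servers.append(current)
            continue
        if current is None:
            continue
        # "max requests outstanding" and "socket state:" do not match these prefixes.
        if line.startswith("state "):
            current["state"] = line[len("state "):]
        elif line.startswith("requests outstanding "):
            current["outstanding"] = int(line.split()[-1])
        elif line.startswith("consecutive failures "):
            current["failures"] = int(line.split()[-1])
    return [s for s in servers if s["state"] != "Active" or s["failures"] > 0]

SAMPLE = """\
  AAAMGR instance 114:  cb-list-en: 1 AAA Group: aaa-maingroup.com
    Accounting servers:
    Primary  accounting server address 209.165.201.3, port 1646
      state Down
      priority 1
      requests outstanding 3
      max requests outstanding 3
      consecutive failures 7
      dead time expires in 146 seconds
    Secondary  accounting server address 209.165.201.4, port 1646
      state Active
      priority 2
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0
  AAAMGR instance 114:  cb-list-en: 1 AAA Group: default
  socket state: ready
    Authentication servers:
    Secondary  authentication server address 209.165.201.6, port 1645
      state Not Responding
      priority 2
      requests outstanding 0
      max requests outstanding 3
      consecutive failures 0
"""

for server in find_unhealthy_servers(SAMPLE):
    print(server["group"], server["role"], server["ip"], server["state"], server["failures"])
```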

monitor subscriber

The monitor subscriber command can be used to determine whether authentication is at least attempted and whether a reply is processed for the calls being monitored. Turn on option 'S' (Sessmgr Sender Info), which reports the sessmgr or aaamgr instance that handles each message. Here is an example for a MIP call on an HA that attaches to sessmgr / aaamgr instance 132.

Incoming Call:
----------------------------------------------------------------------
 MSID/IMSI   :                             Callid      : 2719afb2
 IMEI        : n/a                         MSISDN      : n/a
 Username    : 6667067222@cisco.com        SessionType : ha-mobile-ip
 Status      : Active                      Service Name: HAService
 Src Context : source
----------------------------------------------------------------------

*** Sender Info (ON ) ***
Thursday June 11 2015
INBOUND>>>>>  From sessmgr:132 sessmgr_ha.c:861 (Callid 2719afb2) 15:42:35:742 Eventid:26000(3)
MIP Rx PDU, from 203.0.113.11:434 to 203.0.113.1:434 (190)
        Message Type: 0x01 (Registration Request)
               Flags: 0x02
            Lifetime: 0x1C20
        Home Address: 0.0.0.0
  Home Agent Address: 255.255.255.255

Thursday June 11 2015
<<<<OUTBOUND  From aaamgr:132 aaamgr_radius.c:367 (Callid 2719afb2) 15:42:35:743 Eventid:23901(6)
RADIUS AUTHENTICATION Tx PDU, from 203.0.113.1:59933 to 209.165.201.3:1645 (301) PDU-dict=custom9
 Code: 1 (Access-Request)
 Id: 12
 Length: 301

Thursday June 11 2015
INBOUND>>>>>  From aaamgr:132 aaamgr_radius.c:1999 (Callid 2719afb2) 15:42:35:915 Eventid:23900(6)
RADIUS AUTHENTICATION Rx PDU, from 209.165.201.3:1645 to 203.0.113.1:59933 (156) PDU-dict=custom9
 Code: 2 (Access-Accept)
 Id: 12

Thursday June 11 2015
<<<<OUTBOUND  From sessmgr:132 mipha_fsm.c:6617 (Callid 2719afb2) 15:42:36:265 Eventid:26001(3)
MIP Tx PDU, from 203.0.113.1:434 to 203.0.113.11:434 (112)
        Message Type: 0x03 (Registration Reply)
                Code: 0x00 (Accepted)
            Lifetime: 0x1C20
        Home Address: 10.229.6.167

There is a failure example at the end of this article as well.

Packet Capture

Sometimes there is not enough information on the ASR to determine why reachability issues are occurring, in which case a packet capture is necessary. When troubleshooting individual subscriber issues, identifying the respective packets in a trace should be easy. Otherwise, knowing the UDP port used at either end of a particular aaamgr instance <==> RADIUS server pair can help if the issue is tied to specific ports or aaamgr instances. Capturing at multiple places in the network may be necessary to determine where packets are being dropped. In the issue analyzed throughout this article, a packet capture in just the right place in the transport path between the ASR and the RADIUS server was the breakthrough in solving the issue.
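
When the capture is taken on an external device in the transport path, a capture filter narrows the trace to one aaamgr instance <==> RADIUS server pair. The following is a small Python sketch that builds such a filter in the pcap filter syntax accepted by tools like tcpdump. The helper name is hypothetical, the server address and local UDP port are taken from the show radius info example earlier purely for illustration, and the default server ports are the 1645/1646 pair used in this article's examples.

```python
def radius_capture_filter(server_ip, local_udp_port=None, server_ports=(1645, 1646)):
    """Build a pcap-style capture filter for RADIUS traffic to one server.

    If the local UDP port of the aaamgr instance is known, match on it;
    otherwise fall back to the server-side authentication/accounting ports.
    """
    parts = ["udp", "host %s" % server_ip]
    if local_udp_port is not None:
        parts.append("port %d" % local_udp_port)
    else:
        parts.append("(" + " or ".join("port %d" % p for p in server_ports) + ")")
    return " and ".join(parts)

# One aaamgr instance, identified by its local UDP port:
print(radius_capture_filter("209.165.201.3", 25808))
# All RADIUS traffic to the server when the local port is not known:
print(radius_capture_filter("209.165.201.3"))
```

The first form is the more useful one when a problem follows specific aaamgr instances, because it excludes the traffic of all other instances toward the same server.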

Remediations

This last section offers some ideas for remediating RADIUS connectivity issues. They are not presented in any particular order; they are simply a list to consider during the troubleshooting process.

If the RADIUS server is overloaded, the load can be reduced by lowering the value (default 256) configured for “radius (accounting) max-outstanding”, which limits the number of outstanding (unanswered) requests for any given aaamgr process. When the limit is reached, logs may indicate this: “Failed to assign message id for radius authentication server x.x.x.x:1812”.

Rate-limiting RADIUS messages to specific servers, via the rate-limit keyword on the respective server configuration lines, may also help reduce load.

Sometimes the problem is not connectivity but increased accounting traffic. That is not a problem with RADIUS per se; it points to another area, such as increased PPP renegotiations that cause more accounting starts and stops. One may therefore need to troubleshoot outside of RADIUS to find the cause or trigger of the symptoms being observed.

If it is decided during troubleshooting to remove a RADIUS authentication or accounting server from the list of live servers, for whatever reason, there is a (non-config) command that takes a server out of service indefinitely, until it is put back in service. This is cleaner than removing the server from the configuration manually:

{disable | enable} radius [accounting] server x.x.x.x

[source]CSE2# show radius authentication servers detail

+-----Type:       (A) - Authentication   (a) - Accounting
|                 (C) - Charging         (c) - Charging Accounting
|                (M) - Mediation       (m) - Mediation Accounting
|
|+----Preference: (P) - Primary         (S) - Secondary
||
||+---State:     (A) - Active           (N) - Not Responding
|||               (D) - Down             (W) - Waiting Accounting-On
|||              (I) - Initializing     (w) - Waiting Accounting-Off
|||               (a) - Active Pending   (U) - Unknown
|||
|||+--Admin       (E) - Enabled         (D) - Disabled
|||| Status:
||||
||||+-Admin
||||| status     (O) - Overridden       (.) - Not Overridden
||||| Overridden:
|||||
vvvvv IP             PORT GROUP
----- --------------- ----- -----------------------
APNDO 192.168.50.200 1812 default

A PSC or DPC migration or a line card switchover can often clear problems because the migration restarts the processes on the card, including npumgr, which has from time to time been the cause of problems with NPU flows.

But in an interesting twist with the aforementioned example of aaamgr 92, the AAA Unreachable failures actually STARTED when a PSC migration was done. An NPU flow went missing when a PSC migration made PSC 11 standby. When PSC 11 was made active again an hour later, the actual impact of the missing flow began for aaamgr 92. Issues like this are very difficult to troubleshoot without assistance from Technical Support.

[Ingress]PGW# show rct stat

RCT stats Details (Last 6 Actions)
Action            Type      From To   Start Time                Duration
----------------- --------- ---- ---- ------------------------  ---------- 
Migration         Planned    11   16  2012-Jan-09+16:27:38.135  36.048 sec
Migration         Planned     3   11  2012-Jan-09+17:28:57.413  48.739 sec

Mon Jan 09 17:31:11 2012 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 209.165.201.3
Mon Jan 09 17:31:16 2012 Internal trap notification 40 (AAAAuthSvrReachable) server 2 ip address 209.165.201.3

The issue was temporarily resolved with a port switchover, after which the PSC card that had the missing NPU flow for aaamgr 92 was no longer connected to an active line card.

Tue Jan 10 06:52:17 2012 Internal trap notification 93 (CardStandby) card 27
Tue Jan 10 06:52:17 2012 Internal trap notification 1024 (PortDown) card 27 port 1 ifindex 453050375port type 10G Ethernet
Tue Jan 10 06:52:17 2012 Internal trap notification 55 (CardActive) card 28
Tue Jan 10 06:52:17 2012 Internal trap notification 1025 (PortUp) card 28 port 1 ifindex 469827588port type 10G Ethernet

The last trap for the failure (the accounting server becomes reachable again):

Tue Jan 10 06:53:11 2012 Internal trap notification 43 (AAAAccSvrReachable) server 5 ip address 209.165.201.3

[Ingress]PGW# radius test instance 93 authen server 209.165.201.3 port 1645 test test
Tuesday January 10 07:18:22 UTC 2012

Authentication from authentication server 209.165.201.3, port 1645
Authentication Failure: Access-Reject received
Round-trip time for response was 38.0 ms

[Ingress]PGW# show session subsystem facility aaamgr instance 92
Tuesday January 10 07:39:47 UTC 2012
   12294 Total aaa auth purged
   14209 Total radius auth requests          0 Current radius auth requests
    9494 Total radius requests pend server max-outstanding
       0 Current radius requests pend server max-outstanding

Similarly, restarting specific aaamgrs that get "stuck" may also resolve issues, though Technical Support should do this because it involves restricted Tech Support commands. In the aaamgr 92 example introduced in the show task resources section earlier, a restart was attempted but did not help, because the root cause was not aaamgr 92 but rather the missing NPU flow that aaamgr 92 needed (it was an NPU issue, not an aaamgr issue). Here is the relevant output of the attempt. "show task table" is run to show the association between the process ID and task instance 92.

5  2012-Jan-10+06:20:53 aaamgr   16/0/04722 12.0(40466) PLB27085474/PLB38098237

[Ingress]PGW# show crash number 5
********************* CRASH #05 ***********************
Build: 12.0(40466)
Fatal Signal 6: Aborted
  PC: [b7eb6b90/X] __poll()
  Note: User-initiated state dump w/core.

******** show task table *******
     task                           parent
cpu facility      inst    pid pri  facility      inst    pid
---- -----------------------------  -------------------------
16/0 aaamgr          92   4722   0  sessctrl         0   2887

Final Example

Here is a final example of a real outage in a live network that pulls together many of the troubleshooting commands and approaches discussed in this article. Note that this node handles 3G MIP, and 4G Long Term Evolution (LTE) and evolved High Rate Packet Data (eHRPD) call types.

show snmp trap history

From the traps alone, it can be confirmed that the starting point matches the 19:25 UTC that the customer reported. As an aside, note that AAAAuthSvrUnreachable traps for primary server 209.165.201.3 did not start until hours later (it is not clear why, but it is worth noting), whereas accounting unreachable traps for that server started right away.

Sun Dec 29 19:28:13 2013 Internal trap notification 42 (AAAAccSvrUnreachable) server 5 ip address 209.165.201.3
Sun Dec 29 19:32:13 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 209.165.201.3
Sun Dec 29 19:33:05 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 2 ip address 209.165.201.3
Sun Dec 29 19:34:13 2013 Internal trap notification 43 (AAAAccSvrReachable) server 5 ip address 209.165.201.3
Sun Dec 29 19:34:13 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 209.165.201.3
Sun Dec 29 19:35:05 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 2 ip address 209.165.201.3
Sun Dec 29 19:38:13 2013 Internal trap notification 42 (AAAAccSvrUnreachable) server 6 ip address 209.165.201.8
...
Sun Dec 29 23:12:13 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 4 ip address 209.165.201.3
Sun Dec 29 23:13:03 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 4 ip address 209.165.201.3
Sun Dec 29 23:54:13 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 4 ip address 209.165.201.3
Sun Dec 29 23:54:14 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 4 ip address 209.165.201.3
Sun Dec 29 23:58:13 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 4 ip address 209.165.201.3
Sun Dec 29 23:58:14 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 4 ip address 209.165.201.3
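
To see how long each server stayed unreachable, the Unreachable and Reachable traps can be paired. The following is a minimal Python sketch (not a StarOS feature) that does this against saved trap history. It keys on the trap family and the server IP address rather than the server index, because the index reported for the same address changes in the history above. The function name is hypothetical.

```python
import re
from datetime import datetime

TRAP_RE = re.compile(
    r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) Internal trap notification \d+ "
    r"\((AAA(?:Auth|Acc)Svr)(Unreachable|Reachable)\) server \d+ ip address ([\d.]+)"
)

def outage_windows(trap_lines):
    """Pair Unreachable/Reachable traps per (trap family, server IP).

    Returns (closed windows, still-open outages). Each closed window is
    (family, ip, start time, duration in seconds).
    """
    open_since = {}
    windows = []
    for line in trap_lines:
        match = TRAP_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        key = (match.group(2), match.group(4))
        if match.group(3) == "Unreachable":
            # Keep the earliest Unreachable if several arrive before a clear.
            open_since.setdefault(key, when)
        elif key in open_since:
            start = open_since.pop(key)
            windows.append((key[0], key[1], start, (when - start).total_seconds()))
    return windows, open_since

SAMPLE = """\
Sun Dec 29 19:28:13 2013 Internal trap notification 42 (AAAAccSvrUnreachable) server 5 ip address 209.165.201.3
Sun Dec 29 19:32:13 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 209.165.201.3
Sun Dec 29 19:33:05 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 2 ip address 209.165.201.3
Sun Dec 29 19:34:13 2013 Internal trap notification 43 (AAAAccSvrReachable) server 5 ip address 209.165.201.3
Sun Dec 29 19:38:13 2013 Internal trap notification 42 (AAAAccSvrUnreachable) server 6 ip address 209.165.201.8
""".splitlines()

windows, still_open = outage_windows(SAMPLE)
for family, ip, start, seconds in windows:
    print(family, ip, start, seconds)
print("still open:", sorted(still_open))
```

For the first few traps above, this yields a 52-second authentication window and a 360-second accounting window for 209.165.201.3, and leaves the accounting outage for 209.165.201.8 open.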

show task resources

The output shows a much lower session count on DPC 8/1. Based on this alone, without any further analysis, one COULD suggest that there is an issue on DPC 8 and propose a migration to the standby DPC. But it is important to acknowledge the actual subscriber impact: in these scenarios subscribers typically connect successfully on a subsequent attempt, so the impact is not too significant and they likely will not report anything to the provider, assuming that there is no user-plane outage going on as well (which is possible, depending on what is broken).

 7/1 sessmgr        230  27% 100% 586.2M  2.49G   43  500  4123 35200 I   good
 7/1 aaamgr         237 0.9%  95% 143.9M 640.0M   22  500    --    -- -   good
 7/1 sessmgr        243  22% 100% 588.1M  2.49G   42  500  4118 35200 I   good
 7/1 sessmgr        258  19% 100% 592.8M  2.49G   43  500  4122 35200 I   good
 7/1 aaamgr         268 0.9%  95% 143.5M 640.0M   22  500    --    -- -   good
 7/1 sessmgr        269  23% 100% 586.7M  2.49G   43  500  4115 35200 I   good
 7/1 aaamgr         274 0.4%  95% 144.9M 640.0M   22  500    --    -- -   good
 7/1 sessmgr        276  30% 100% 587.9M  2.49G   43  500  4123 35200 I   good
 7/1 aaamgr         285 1.0%  95% 142.7M 640.0M   22  500    --    -- -   good
 7/1 aaamgr         286 0.8%  95% 143.8M 640.0M   22  500    --    -- -   good
 7/1 sessmgr        290  28% 100% 588.2M  2.49G   41  500  4115 35200 I   good
 
 8/0 sessmgr        177  23% 100% 588.7M  2.49G   48  500  4179 35200 I   good
 8/0 sessmgr        193  24% 100% 591.3M  2.49G   44  500  4173 35200 I   good
 8/0 aaamgr         208 0.9%  95% 143.8M 640.0M   22  500    --    -- -   good
 8/0 sessmgr        211  23% 100% 592.1M  2.49G   45  500  4173 35200 I   good
 8/0 sessmgr        221  27% 100% 589.2M  2.49G   44  500  4178 35200 I   good
 8/0 aaamgr         222 0.9%  95% 142.0M 640.0M   22  500    --    -- -   good
 8/0 sessmgr        225  25% 100% 592.0M  2.49G   43  500  4177 35200 I   good
 8/0 aaamgr         238 0.9%  95% 140.0M 640.0M   22  500    --    -- -   good
 8/0 aaamgr         243 1.0%  95% 144.9M 640.0M   22  500    --    -- -   good
 8/0 sessmgr        244  31% 100% 593.3M  2.49G   43  500  4177 35200 I   good
 8/0 aaamgr         246 0.9%  95% 138.5M 640.0M   22  500    --    -- -   good
 8/0 aaamgr         248 0.9%  95% 141.4M 640.0M   22  500    --    -- -   good
 8/0 aaamgr         258 0.9%  95% 138.3M 640.0M   22  500    --    -- -   good
 8/0 aaamgr         259 0.8%  95% 139.2M 640.0M   22  500    --    -- -   good
 8/0 aaamgr         260 0.8%  95% 142.9M 640.0M   22  500    --    -- -   good
 8/0 aaamgr         262 0.9%  95% 145.0M 640.0M   22  500    --    -- -   good
 8/0 aaamgr         264 0.9%  95% 143.4M 640.0M   22  500    --    -- -   good
 8/0 sessmgr        270  24% 100% 592.2M  2.49G   44  500  4171 35200 I   good
 8/0 sessmgr        277  20% 100% 593.7M  2.49G   43  500  4176 35200 I   good
 8/0 sessmgr        288  23% 100% 591.9M  2.49G   43  500  4177 35200 I   good
 8/0 sessmgr        296  24% 100% 593.0M  2.49G   42  500  4170 35200 I   good
 
 8/1 sessmgr        186 2.0% 100% 568.3M  2.49G   48  500  1701 35200 I   good
 8/1 sessmgr        192 2.0% 100% 571.1M  2.49G   46  500  1700 35200 I   good
 8/1 aaamgr         200 1.0%  95% 147.3M 640.0M   22  500    --    -- -   good
 8/1 sessmgr        210 2.1% 100% 567.1M  2.49G   46  500  1707 35200 I   good
 8/1 aaamgr         216 0.9%  95% 144.6M 640.0M   22  500    --    -- -   good
 8/1 sessmgr        217 2.0% 100% 567.7M  2.49G   45  500  1697 35200 I   good
 8/1 sessmgr        231 2.2% 100% 565.7M  2.49G   45  500  1705 35200 I   good
 8/1 sessmgr        240 2.0% 100% 569.8M  2.49G   45  500  1702 35200 I   good
 8/1 aaamgr         242 0.9%  95% 148.5M 640.0M   22  500    --    -- -   good
 8/1 sessmgr        252 1.8% 100% 566.5M  2.49G   44  500  1704 35200 I   good
 8/1 aaamgr         261 0.9%  95% 142.0M 640.0M   22  500    --    -- -   good
 8/1 aaamgr         263 1.0%  95% 144.1M 640.0M   22  500    --    -- -   good
 8/1 aaamgr         265 1.0%  95% 146.4M 640.0M   22  500    --    -- -   good
 8/1 aaamgr         267 1.0%  95% 144.4M 640.0M   22  500    --    -- -   good
 8/1 aaamgr         269 1.0%  95% 143.8M 640.0M   22  500    --    -- -   good
 8/1 sessmgr        274 1.9% 100% 570.5M  2.49G   44  500  1704 35200 I   good
 8/1 sessmgr        283 2.0% 100% 570.0M  2.49G   44  500  1708 35200 I   good
 8/1 sessmgr        292 2.1% 100% 567.6M  2.49G   44  500  1703 35200 I   good
 
 9/0 sessmgr          1  30% 100% 587.2M  2.49G   48  500  4161 35200 I   good
 9/0 diamproxy        1 5.2%  90% 37.74M 250.0M  420 1000    --    -- -   good
 9/0 sessmgr         14  25% 100% 587.4M  2.49G   48  500  4156 35200 I   good
 9/0 sessmgr         21  20% 100% 591.5M  2.49G   47  500  4156 35200 I   good
 9/0 sessmgr         34  23% 100% 586.5M  2.49G   48  500  4155 35200 I   good
 9/0 aaamgr          44 0.9%  95% 145.1M 640.0M   21  500    --    -- -   good
 9/0 sessmgr         46  29% 100% 592.1M  2.49G   48  500  4157 35200 I   good
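
The imbalance on 8/1 can also be found mechanically. The following is a minimal Python sketch that averages the sessmgr session counts per card/CPU from saved show task resources output and flags any CPU that falls well below the others. It assumes the 13-column row layout shown above, with the used-session count in the tenth column. The function names and the "half of the median" rule are illustrative assumptions, not a documented threshold.

```python
from collections import defaultdict
from statistics import median

def average_sessions_per_cpu(output):
    """Average the 'sessions used' column of sessmgr rows per card/CPU."""
    per_cpu = defaultdict(list)
    for line in output.splitlines():
        fields = line.split()
        # Expected columns: cpu, facility, inst, cpu used/alloc, mem used/alloc,
        # files used/alloc, sessions used/alloc, S, status.
        if len(fields) == 13 and fields[1] == "sessmgr":
            per_cpu[fields[0]].append(int(fields[9]))
    return {cpu: sum(counts) / len(counts) for cpu, counts in per_cpu.items()}

def low_cpus(averages, fraction=0.5):
    """Return CPUs whose average is below a fraction of the median average."""
    threshold = fraction * median(averages.values())
    return sorted(cpu for cpu, avg in averages.items() if avg < threshold)

SAMPLE = """\
 7/1 sessmgr        230  27% 100% 586.2M  2.49G   43  500  4123 35200 I   good
 7/1 aaamgr         237 0.9%  95% 143.9M 640.0M   22  500    --    -- -   good
 7/1 sessmgr        243  22% 100% 588.1M  2.49G   42  500  4118 35200 I   good
 8/0 sessmgr        177  23% 100% 588.7M  2.49G   48  500  4179 35200 I   good
 8/0 sessmgr        193  24% 100% 591.3M  2.49G   44  500  4173 35200 I   good
 8/1 sessmgr        186 2.0% 100% 568.3M  2.49G   48  500  1701 35200 I   good
 8/1 sessmgr        192 2.0% 100% 571.1M  2.49G   46  500  1700 35200 I   good
 9/0 sessmgr          1  30% 100% 587.2M  2.49G   48  500  4161 35200 I   good
"""

averages = average_sessions_per_cpu(SAMPLE)
print(averages)
print("low:", low_cpus(averages))
```

Note that the session count is taken from the sessmgr rows. As the next sections show, the sessmgrs on 8/1 are not necessarily the ones affected, since a sessmgr and its paired aaamgr can reside on different cards.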

monitor subscriber

A call setup was caught in which the authentication request to primary server 209.165.201.3 received no response. The call was handled by sessmgr 242 on DPC 9/1, whose paired aaamgr resides on DPC 8/1, which confirms that the 3G failures are due to AAA being unreachable on 8/1. It also confirms that the absence of AAAAuthSrvUnreachable traps for 209.165.201.3 up to that point in time does not mean that responses from that server are being handled without problems (as shown above, traps do start, but hours later).

8/1 aaamgr         242 0.9%  95% 148.5M 640.0M   22  500    --    -- -   good
9/1 sessmgr        242  20% 100% 589.7M  2.49G   43  500  4167 35200 I   good
 
----------------------------------------------------------------------
Incoming Call:
----------------------------------------------------------------------
MSID/IMSI   :                             Callid      : 4537287a
IMEI        : n/a                         MSISDN      : n/a
Username    : 6664600074@cisco.com        SessionType : ha-mobile-ip
Status      : Active                      Service Name: HAService
Src Context : Ingress
----------------------------------------------------------------------
 
INBOUND>>>>>  From sessmgr:242 sessmgr_ha.c:880 (Callid 4537287a) 23:18:19:099 Eventid:26000(3)
MIP Rx PDU, from 203.0.113.1:434 to 203.0.113.3:434 (190)
        Message Type: 0x01 (Registration Request)
 
<<<<OUTBOUND  From aaamgr:242 aaamgr_radius.c:370 (Callid 4537287a) 23:18:19:100 Eventid:23901(6)
RADIUS AUTHENTICATION Tx PDU, from 203.0.113.3:27856 to 209.165.201.3:1645 (301) PDU-dict=custom9
Code: 1 (Access-Request)
Id: 195
Length: 301
Authenticator: CD 59 0C 6D 37 2C 5D 19 FB 60 F3 35 23 BB 61 6B
      User-Name = 6664600074@cisco.com
 
INBOUND>>>>>  From sessmgr:242 mipha_fsm.c:8438 (Callid 4537287a) 23:18:21:049 Eventid:26000(3)
MIP Rx PDU, from 203.0.113.1:434 to 203.0.113.3:434 (140)
        Message Type: 0x01 (Registration Request)
               Flags: 0x02
            Lifetime: 0x1C20
 
<<<<OUTBOUND  From sessmgr:242 mipha_fsm.c:6594 (Callid 4537287a) 23:18:22:117 Eventid:26001(3)
MIP Tx PDU, from 203.0.113.3:434 to 203.0.113.1:434 (104)
        Message Type: 0x03 (Registration Reply)
                Code: 0x83 (Mobile Node Failed Authentication)
 
***CONTROL*** From sessmgr:242 sessmgr_func.c:6746 (Callid 4537287a) 23:18:22:144 Eventid:10285
CALL STATS: <6664600074@cisco.com>, msid <>, Call-Duration(sec): 0
  Disconnect Reason: MIP-auth-failure
  Last Progress State: Authenticating
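
In a longer trace, unanswered Access-Requests can be found by pairing each RADIUS Tx PDU with an Rx PDU that has the same Id between the same two endpoints. The following is a minimal Python sketch of that pairing against saved monitor subscriber output. It assumes the PDU header and Id line formats shown above, and the function name is hypothetical.

```python
import re

PDU_RE = re.compile(r"RADIUS AUTHENTICATION (Tx|Rx) PDU, from (\S+) to (\S+) \(\d+\)")
ID_RE = re.compile(r"Id: (\d+)")

def unanswered_requests(trace):
    """Return (local endpoint, server endpoint, Id) for requests with no reply."""
    pending = {}
    direction = None
    endpoints = None
    for raw in trace.splitlines():
        line = raw.strip()
        pdu = PDU_RE.match(line)
        if pdu:
            direction, src, dst = pdu.groups()
            # Normalise to (local, server) so Tx and Rx produce the same key.
            endpoints = (src, dst) if direction == "Tx" else (dst, src)
            continue
        if " PDU, from " in line:
            # A non-RADIUS PDU (for example MIP): stop tracking the previous header.
            direction = None
            continue
        ident = ID_RE.match(line)
        if ident and direction:
            key = endpoints + (int(ident.group(1)),)
            if direction == "Tx":
                pending[key] = True
            else:
                pending.pop(key, None)
            direction = None
    return sorted(pending)

FAILED_CALL = """\
<<<<OUTBOUND  From aaamgr:242 aaamgr_radius.c:370 (Callid 4537287a) 23:18:19:100 Eventid:23901(6)
RADIUS AUTHENTICATION Tx PDU, from 203.0.113.3:27856 to 209.165.201.3:1645 (301) PDU-dict=custom9
Code: 1 (Access-Request)
Id: 195
Length: 301
"""

GOOD_CALL = """\
RADIUS AUTHENTICATION Tx PDU, from 203.0.113.1:59933 to 209.165.201.3:1645 (301) PDU-dict=custom9
 Code: 1 (Access-Request)
 Id: 12
 Length: 301
RADIUS AUTHENTICATION Rx PDU, from 209.165.201.3:1645 to 203.0.113.1:59933 (156) PDU-dict=custom9
 Code: 2 (Access-Accept)
 Id: 12
"""

print(unanswered_requests(FAILED_CALL))
print(unanswered_requests(GOOD_CALL))
```

For the failed call above, the request with Id 195 from 203.0.113.3:27856 remains unanswered, while the successful call from earlier in this article leaves nothing pending.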

show sub [summary] smgr-instance X

What is interesting is that the session count for sessmgr 242 is similar to that of other, working sessmgrs. Further investigation showed that 4G calls, also hosted on this chassis, were able to connect, and so they made up for the 3G Mobile IP calls that could not. Looking back 8 hours, which is after the outage started, there are no MIP calls for sessmgr 242, while looking back 9 hours, to before the outage started, there are connected calls:

[local]PGW# show sub sum smgr-instance 242 connected-time less-than 28800  (8 hours)
Monday December 30 03:38:23 UTC 2013
 
Total Subscribers:             1504     
Active:                        1504          Dormant:                       0        
hsgw-ipv4-ipv6:                0             pgw-pmip-ipv6:                 98       
pgw-pmip-ipv4:                 0             pgw-pmip-ipv4-ipv6:            75       
pgw-gtp-ipv6:                  700           pgw-gtp-ipv4:                  3        
pgw-gtp-ipv4-ipv6:             628           sgw-gtp-ipv6:                  0        
..
ha-mobile-ip:                  0             ggsn-pdp-type-ppp:             0

[local]PGW# show sub sum smgr-instance 242 connected-time less-than 32400 (9 hours)
Monday December 30 03:38:54 UTC 2013      ...
ha-mobile-ip:                  63            ggsn-pdp-type-ppp:             0

Sessmgrs that are connected to broken aaamgrs show a higher ratio of LTE and eHRPD calls to MIP calls than sessmgrs that are connected to working aaamgrs. Compare sessmgr 272 with sessmgr 242, whose aaamgr is on DPC 8/1:

[local]PGW# show sub sum smgr-instance 272
Monday December 30 03:57:51 UTC 2013  
hsgw-ipv4-ipv6:                0             pgw-pmip-ipv6:                 125
pgw-pmip-ipv4:                 0             pgw-pmip-ipv4-ipv6:            85
pgw-gtp-ipv6:                  1530
pgw-gtp-ipv4-ipv6:             1126

ha-mobile-ip:                  1103

[local]PGW# show sub sum smgr-instance 242
Monday December 30 03:52:35 UTC 2013       
hsgw-ipv4-ipv6:                0             pgw-pmip-ipv6:                 172
pgw-pmip-ipv4:                 0             pgw-pmip-ipv4-ipv6:            115
pgw-gtp-ipv6:                  1899
pgw-gtp-ipv4-ipv6:             1348

ha-mobile-ip:                  447
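
The difference in ratio can be worked out from the counters above. The following short Python sketch does the arithmetic. It uses only the call types shown in the abridged output (the pgw-pmip and pgw-gtp counters as the LTE and eHRPD side, ha-mobile-ip as the MIP side), so the figures are indicative rather than exact.

```python
def non_mip_to_mip_ratio(counts):
    """Ratio of all listed non-MIP sessions to ha-mobile-ip sessions."""
    non_mip = sum(value for name, value in counts.items() if name != "ha-mobile-ip")
    return non_mip / counts["ha-mobile-ip"]

sessmgr_272 = {
    "pgw-pmip-ipv6": 125,
    "pgw-pmip-ipv4-ipv6": 85,
    "pgw-gtp-ipv6": 1530,
    "pgw-gtp-ipv4-ipv6": 1126,
    "ha-mobile-ip": 1103,
}
sessmgr_242 = {
    "pgw-pmip-ipv6": 172,
    "pgw-pmip-ipv4-ipv6": 115,
    "pgw-gtp-ipv6": 1899,
    "pgw-gtp-ipv4-ipv6": 1348,
    "ha-mobile-ip": 447,
}

print("sessmgr 272: %.2f" % non_mip_to_mip_ratio(sessmgr_272))  # 2866 / 1103
print("sessmgr 242: %.2f" % non_mip_to_mip_ratio(sessmgr_242))  # 3534 / 447
```

Sessmgr 272 carries roughly 2.6 LTE/eHRPD sessions per MIP session, while sessmgr 242 carries roughly 7.9, which is consistent with MIP calls failing on sessmgr 242 while 4G calls continue to connect.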

radius test instance X authentication server

All aaamgrs on 8/1 are unresponsive: the radius test instance command fails for every one of them, but it works for aaamgrs on 8/0 and on other cards:

9/1 sessmgr        242  22% 100% 600.6M  2.49G   41  500  3989 35200 I   good
4/1 sessmgr         20  27% 100% 605.1M  2.49G   47  500  3965 35200 I   good
4/0 sessmgr         27  25% 100% 592.8M  2.49G   46  500  3901 35200 I   good
 
8/1 aaamgr         242 0.9%  95% 150.6M 640.0M   22  500    --    -- -   good
8/1 aaamgr          20 1.0%  95% 151.9M 640.0M   21  500    --    -- -   good
8/0 aaamgr          27 1.0%  95% 146.4M 640.0M   21  500    --    -- -   good
 
[Ingress]PGW# radius test instance 242 auth server 209.165.201.3 port 1645 test test
Monday December 30 01:03:08 UTC 2013
 
Authentication from authentication server 209.165.201.3, port 1645
Communication Failure: No response received
 
[Ingress]PGW# radius test instance 20 auth server 209.165.201.3 port 1645 test test
Monday December 30 01:08:45 UTC 2013
 
Authentication from authentication server 209.165.201.3, port 1645
Communication Failure: No response received
 
[Ingress]PGW# radius test instance 27 auth server 209.165.201.3 port 1645 test test
Monday December 30 01:11:40 UTC 2013
 
Authentication from authentication server 209.165.201.3, port 1645
Authentication Failure: Access-Reject received
Round-trip time for response was 16.8 ms
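
Testing every aaamgr on a card by hand is tedious. The following is a small Python sketch that only generates the command lines, following the syntax of the commands shown above, so that they can be pasted into a session. The helper name is hypothetical, the instance list is the set of aaamgr instances on 8/1 seen in the earlier show task resources output plus instance 20, and the trailing "test test" is assumed to be a test username and password.

```python
def radius_test_commands(instances, server_ip, port=1645, username="test", password="test"):
    """Build one 'radius test instance ... auth server ...' line per aaamgr instance."""
    return [
        "radius test instance %d auth server %s port %d %s %s"
        % (instance, server_ip, port, username, password)
        for instance in instances
    ]

# aaamgr instances residing on 8/1 in the outputs above.
AAAMGRS_ON_8_1 = [20, 200, 216, 242, 261, 263, 265, 267, 269]

for command in radius_test_commands(AAAMGRS_ON_8_1, "209.165.201.3"):
    print(command)
```

An Access-Reject for the test credentials, as seen for instance 27, still proves that the server is reachable from that aaamgr. A "No response received" result for every instance on one card points at the card rather than at the server.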

show radius counters all

The flagship command for troubleshooting RADIUS shows a large number of timeouts that are increasing quickly:

[Ingress]PGW> show radius counters all | grep -E "Authentication server address|Access-Request Timeouts"
Monday December 30 00:42:24 UTC 2013
  Authentication server address 209.165.201.3, port 1645, group default
    Access-Request Timeouts:                                  400058    
  Authentication server address 209.165.201.5, port 1645, group default
     Access-Request Timeouts:                                  26479   

[Ingress]PGW> show radius counters all | grep -E "Authentication server address|Access-Request Timeouts"
Monday December 30 00:45:23 UTC 2013
  Authentication server address 209.165.201.3, port 1645, group default
    Access-Request Timeouts:                                  400614
  Authentication server address 209.165.201.5, port 1645, group default
     Access-Request Timeouts:                                  26679  
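
How quickly the timeouts are increasing can be quantified from the two samples above, which were taken 179 seconds apart. The following is a minimal Python sketch of that arithmetic. The function name is hypothetical, and it assumes that both samples fall on the same day.

```python
from datetime import datetime

def timeouts_per_second(time1, count1, time2, count2):
    """Rate of increase of a counter between two samples taken on the same day."""
    fmt = "%H:%M:%S"
    elapsed = (datetime.strptime(time2, fmt) - datetime.strptime(time1, fmt)).total_seconds()
    return (count2 - count1) / elapsed

# Access-Request Timeouts at 00:42:24 and at 00:45:23.
rate_server_3 = timeouts_per_second("00:42:24", 400058, "00:45:23", 400614)
rate_server_5 = timeouts_per_second("00:42:24", 26479, "00:45:23", 26679)

print("209.165.201.3: %.2f timeouts/s" % rate_server_3)  # 556 timeouts in 179 s
print("209.165.201.5: %.2f timeouts/s" % rate_server_5)  # 200 timeouts in 179 s
```

That is roughly 3.1 timeouts per second toward 209.165.201.3 and 1.1 per second toward 209.165.201.5 during the sampling interval.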
        
 
[Ingress]PGW> show radius counters all
Monday December 30 00:39:15 UTC 2013
...
   Authentication server address 209.165.201.3, port 1645, group default
     Access-Request Sent:                                      233262801       
     Access-Request with DMU Attributes Sent:                  0               
     Access-Request Pending:                                   22              
     Access-Request Retried:                                   0               
     Access-Request with DMU Attributes Retried:               0               
     Access-Challenge Received:                                0               
     Access-Accept Received:                                   213448486       
     Access-Reject Received:                                   19414836        
     Access-Reject Received with DMU Attributes:               0               
     Access-Request Timeouts:                                  399438          
     Access-Request Current Consecutive Failures in a mgr:     3
     Access-Request Response Bad Authenticator Received:       16187           
     Access-Request Response Malformed Received:               1               
     Access-Request Response Malformed Attribute Received:     0               
     Access-Request Response Unknown Type Received:            0               
     Access-Request Response Dropped:                          9039            
     Access-Request Response Last Round Trip Time:             267.6 ms
     Access-Request Response Average Round Trip Time:          201.9 ms
     Current Access-Request Queued:                            2  

    Authentication server address 209.165.201.5, port 1645, group default
     Access-Request Sent:                                      27731           
     Access-Request with DMU Attributes Sent:                  0               
     Access-Request Pending:                                   0               
     Access-Request Retried:                                   0               
     Access-Request with DMU Attributes Retried:               0               
     Access-Challenge Received:                                0               
     Access-Accept Received:                                   1390            
     Access-Reject Received:                                   101             
     Access-Reject Received with DMU Attributes:               0               
     Access-Request Timeouts:                                  26240           
     Access-Request Current Consecutive Failures in a mgr:     13              
     Access-Request Response Bad Authenticator Received:       0               
     Access-Request Response Malformed Received:               0               
     Access-Request Response Malformed Attribute Received:     0               
     Access-Request Response Unknown Type Received:            0               
     Access-Request Response Dropped:                          0               
     Access-Request Response Last Round Trip Time:             227.5 ms
     Access-Request Response Average Round Trip Time:          32.3 ms
     Current Access-Request Queued:                            0

Remediation

During the maintenance window, a DPC migration from card 8 to card 10 resolved the issue and the AAAAuthSrvUnreachable traps stopped. DPC 8 was then RMA'd, and the root cause was determined to be a hardware failure on DPC 8 (the details of that failure are not important for the purposes of this article).

Mon Dec 30 05:58:14 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 4 ip address 209.165.201.3
Mon Dec 30 05:58:14 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 209.165.201.5
Mon Dec 30 05:58:27 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 2 ip address 209.165.201.5
Mon Dec 30 05:58:27 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 4 ip address 209.165.201.3
Mon Dec 30 05:59:14 2013 Internal trap notification 43 (AAAAccSvrReachable) server 5 ip address 209.165.201.5
Mon Dec 30 06:01:14 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 4 ip address 209.165.201.3
Mon Dec 30 06:01:27 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 4 ip address 209.165.201.3

Mon Dec 30 06:01:28 2013 Internal trap notification 16 (PACMigrateStart) from card 8 to card 10
 
Mon Dec 30 06:01:49 2013 Internal trap notification 60 (CardDown) card 8 type Data Processing Card
Mon Dec 30 06:01:50 2013 Internal trap notification 1504 (CiscoFruCardStatusChanged) FRU entity Card : 10 operational status changed to Active
Mon Dec 30 06:01:50 2013 Internal trap notification 55 (CardActive) card 10 type Data Processing Card
Mon Dec 30 06:01:50 2013 Internal trap notification 17 (PACMigrateComplete) from card 8 to card 10
 
Mon Dec 30 06:02:08 2013 Internal trap notification 5 (CardUp) card 8 type Data Processing Card
Mon Dec 30 06:02:08 2013 Internal trap notification 1502 (EntStateOperEnabled) Card(8) Severity: Warning
Mon Dec 30 06:02:08 2013 Internal trap notification 93 (CardStandby) card 8 type Data Processing Card
 
Mon Dec 30 06:08:41 2013 Internal trap notification 1504 (CiscoFruCardStatusChanged) FRU entity Card : 08 operational status changed to Offline
Mon Dec 30 06:08:41 2013 Internal trap notification 60 (CardDown) card 8 type Data Processing Card
Mon Dec 30 06:08:41 2013 Internal trap notification 1503 (EntStateOperDisabled) Card(8) Severity: Critical

Mon Dec 30 06:09:24 2013 Internal trap notification 1505 (CiscoFruPowerStatusChanged) FRU entity Card : 08 Power OFF
Mon Dec 30 06:09:24 2013 Internal trap notification 1504 (CiscoFruCardStatusChanged) FRU entity Card : 08 operational status changed to Empty
Mon Dec 30 06:09:24 2013 Internal trap notification 7 (CardRemoved) card 8 type Data Processing Card
Mon Dec 30 06:09:24 2013 Internal trap notification 1507 (CiscoFruRemoved) FRU entity Card : 08 removed
Mon Dec 30 06:09:24 2013 Internal trap notification 1505 (CiscoFruPowerStatusChanged) FRU entity Card : 08 Power OFF
Mon Dec 30 06:09:50 2013 Internal trap notification 1505 (CiscoFruPowerStatusChanged) FRU entity Card : 08 Power ON
Mon Dec 30 06:09:53 2013 Internal trap notification 1504 (CiscoFruCardStatusChanged) FRU entity Card : 08 operational status changed to Offline
Mon Dec 30 06:09:53 2013 Internal trap notification 8 (CardInserted) card 8 type Data Processing Card
Mon Dec 30 06:09:53 2013 Internal trap notification 1506 (CiscoFruInserted) FRU entity Card : 08 inserted
Mon Dec 30 06:10:00 2013 Internal trap notification 1504 (CiscoFruCardStatusChanged) FRU entity Card : 08 operational status changed to Booting
Mon Dec 30 06:11:59 2013 Internal trap notification 1504 (CiscoFruCardStatusChanged) FRU entity Card : 08 operational status changed to Standby
Mon Dec 30 06:11:59 2013 Internal trap notification 5 (CardUp) card 8 type Data Processing Card
Mon Dec 30 06:11:59 2013 Internal trap notification 93 (CardStandby) card 8 type Data Processing Card
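One way to confirm from a saved trap log that the traps stopped after the migration is to count the Unreachable traps on either side of the PACMigrateComplete trap. The following is a minimal Python sketch and not a Cisco tool: the helper names (parse_traps, unreachable_before_after) are hypothetical, the sample lines are copied from the log above, and the regular expression assumes the line layout shown there.

```python
import re
from datetime import datetime

# A few lines copied from the trap log above (abbreviated for the example).
TRAP_LOG = """\
Mon Dec 30 05:58:14 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 4 ip address 209.165.201.3
Mon Dec 30 05:58:14 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 2 ip address 209.165.201.5
Mon Dec 30 05:58:27 2013 Internal trap notification 40 (AAAAuthSvrReachable) server 2 ip address 209.165.201.5
Mon Dec 30 06:01:14 2013 Internal trap notification 39 (AAAAuthSvrUnreachable) server 4 ip address 209.165.201.3
Mon Dec 30 06:01:28 2013 Internal trap notification 16 (PACMigrateStart) from card 8 to card 10
Mon Dec 30 06:01:50 2013 Internal trap notification 17 (PACMigrateComplete) from card 8 to card 10
Mon Dec 30 06:02:08 2013 Internal trap notification 5 (CardUp) card 8 type Data Processing Card
"""

# Assumed layout: "<weekday> <month> <day> <time> <year> Internal trap notification <id> (<name>) ..."
LINE_RE = re.compile(
    r"^(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4}) "
    r"Internal trap notification (\d+) \((\w+)\)"
)

def parse_traps(text):
    """Return a list of (timestamp, trap_name) tuples for lines that match."""
    traps = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            # %a and %b are locale dependent; an English/C locale is assumed.
            ts = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            traps.append((ts, m.group(3)))
    return traps

def unreachable_before_after(traps, pivot_name="PACMigrateComplete"):
    """Count *Unreachable traps up to and after the first pivot trap."""
    pivot = next(ts for ts, name in traps if name == pivot_name)
    before = sum(1 for ts, name in traps
                 if name.endswith("Unreachable") and ts <= pivot)
    after = sum(1 for ts, name in traps
                if name.endswith("Unreachable") and ts > pivot)
    return before, after

before, after = unreachable_before_after(parse_traps(TRAP_LOG))
print(f"Unreachable traps before migration: {before}, after: {after}")
```

A zero count after the pivot is consistent with the migration having cleared the problem, but it is only meaningful if the log covers enough time after the migration to have captured any recurrence.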
 
[local]PGW# show rct stat
Wednesday January 01 16:47:21 UTC 2014

RCT stats Details (Last 2 Actions)
Action            Type      From To   Start Time                Duration
----------------- --------- ---- ---- ------------------------  ----------  
Migration         Planned     8   10  2013-Dec-30+06:01:28.323  21.092 sec
Shutdown          N/A         8    0  2013-Dec-30+06:08:41.483   0.048 sec
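The migration row can be cross-checked against the trap log: the start time plus the duration should land close to the PACMigrateComplete trap. The sketch below does that arithmetic in Python. It assumes both timestamps come from the same system clock and that trap timestamps have one-second resolution; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Values copied from the "Migration" row of show rct stat above.
rct_start = datetime.strptime("2013-Dec-30+06:01:28.323", "%Y-%b-%d+%H:%M:%S.%f")
rct_duration = timedelta(seconds=21.092)

# Timestamp of the PACMigrateComplete trap from the trap log.
trap_complete = datetime.strptime("Mon Dec 30 06:01:50 2013", "%a %b %d %H:%M:%S %Y")

# Expected completion time according to show rct stat.
rct_end = rct_start + rct_duration

# Gap between the computed end and the trap; under a second is expected
# if the trap log only records whole seconds (an assumption).
gap = abs(trap_complete - rct_end)

print("Computed migration end:", rct_end.time())
print("PACMigrateComplete trap:", trap_complete.time())
print("Gap (seconds):", gap.total_seconds())
```

With the values above, the computed end falls within one second of the trap, so the two outputs describe the same migration event.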

Contributed by Cisco Engineers

Dave Damerjian
Cisco Technical Services

This Document Applies to These Products

ASR 5000 Series
