This document describes a problem that is encountered on the Serving General Packet Radio Service (GPRS) Support Node (SGSN) of the Cisco Aggregation Services Router (ASR) 5000 Series. Some possible workarounds for this issue are also described.
This specific chain of events on the ASR SGSN is described in this document:
When the HLR receives the MAP_RESET message, it sets a flag for a GPRS Location Update (GLU). When the User Equipment (UE) sends its first uplink packets, the SGSN sends a GLU message to the HLR.
At 7 AM on November 21, 2014, the SGSN had this subscriber summary:
******** show subscriber summary *******
Total Subscribers: 2386266
Active: 2386266
sgsn-pdp-type-ipv4: 942114
As shown in the example output, there are nearly 950,000 Packet Data Protocol (PDP) contexts present on the SGSN, and the UEs attempt to browse as the day progresses.
When the first uplink packets are received, the SGSN triggers a GLU message. Since there are hundreds of thousands of UEs, the STP cannot handle the amount of traffic that is generated and it moves into a perennial congestion state.
Messages are queued at the SGSN, and a maximum retransmission timeout occurs. Since all of the GLU messages do not pass from the SGSN to the HLR, the SGSN is forced to detach the mobile subscribers and request that they reattach. All of the detached subscribers then attempt to attach, which causes a sudden surge in the number of inbound attachment requests. Since network overload protection is applied, most of the attempts to attach are rejected due to congestion and the mobile subscribers are forced to make a new attempt.
As this chain of events unfolds, it produces cascading effects. Many Send Authentication Information (SAI) messages, GLU messages, and MAP-IMEI_CHECK messages are stuck in the SGSN queue or dropped. For this reason, all of the STP-1 and STP-2 links reach a congestion state. Each STP has four signaling links, but in this scenario, the first three links of STP-2 do not recover for a long time.
Here are the congestion alarms, in which you can see that all of the STP links move into the congestion state on STP-2:
Fri Nov 21 08:13:14 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-1 (point-code-782)
congested congLevel-1
Fri Nov 21 08:13:14 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-2 (point-code-782)
congested congLevel-1
Fri Nov 21 08:13:14 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-3 (point-code-782)
congested congLevel-1
Fri Nov 21 08:13:29 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:18:48 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:20:00 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:22:52 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:22:55 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:23:22 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:26:33 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:28:06 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:28:45 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 09:27:27 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
As shown, only Peer Server Process (PSP) 4 was cleared, and the rest are still in the congestion state:
Fri Nov 21 08:18:47 2014 Internal trap notification 1075 (M3UAPSPCongestionCleared)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congestion cleared congLevel-0
This section describes how to troubleshoot the issue that is described in the previous section.
As described in the previous section, one particular link in the STP receives a large amount of traffic. You can see that the first three links in STP-2 move into the congestion state and never recover, so only one link is available and the congestion alarm is cleared on SLC-3 (or peer-server-2-peer-server-process-4).
As per the SGSN load-sharing mechanism, the SGSN must send the Message Transfer Part (MTP) Level 3 (MTP3) User Adaptation Layer (M3UA) packets equally on all four links. However, from the Simple Network Management Protocol (SNMP) traps, the first three STP-2 links are perennially congested, which means that all of the traffic is routed to the SLC-3 link (the only available STP link to route traffic). This explains why the traffic distribution is skewed between the STP-2 links.
In congestion situations, one or more links toggle between congested and non-congested states, so only the available links share the traffic. For this reason, there is more utilization in one of the links. This requires a link reset in order to recover the links.
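The skewed distribution can be reasoned about with a simple model. This Python sketch is illustrative only (the link names, message counts, and round-robin scheme are assumptions, not SGSN internals); it shows how a load-sharing mechanism collapses onto the single remaining link when the other links stay congested:

# Illustrative sketch, not SGSN code: M3UA load sharing when three of four links
# toward STP-2 remain congested. Link names and message counts are hypothetical.
from collections import Counter
from itertools import cycle

links = ["SLC-0", "SLC-1", "SLC-2", "SLC-3"]   # four signaling links toward STP-2
congested = {"SLC-0", "SLC-1", "SLC-2"}        # the first three links never recover

def distribute(messages, links, congested):
    """Round-robin over the links that are currently usable."""
    usable = [link for link in links if link not in congested] or links
    counts = Counter()
    for link, _ in zip(cycle(usable), range(messages)):
        counts[link] += 1
    return counts

print(distribute(100_000, links, congested))   # all 100,000 messages land on SLC-3
print(distribute(100_000, links, set()))       # healthy case: 25,000 per link

With three of the four links congested, the entire load lands on SLC-3, which matches the skew seen in the STP data later in this section.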
The next output shows the M3UA-level statistics and the detach statistics. The important statistics to consider are those for STP-2 PSP instance 4, where abnormal traffic can be seen:
Time #1:ss7rd-m3ua-psp-data-tx #2:ss7rd-m3ua-psp-error-tx #3:ss7rd-m3ua-psp-data-rx
21-11-14 7:30 37409 0 37942
21-11-14 8:00 43677 0 43866
21-11-14 8:30 190414 0 71844
21-11-14 9:00 547418 0 104135
21-11-14 9:30 536019 0 102477
21-11-14 10:00 376797 0 132227
21-11-14 10:30 100394 0 97302
21-11-14 11:00 119652 0 114809
21-11-14 11:30 107073 0 95354
Here is the STP data:
DATE TIME LSN LOC SLC LINK TX % RX %
11/21/2014 9:00 sgsncisco 5216 3 A IPVL 11.26 62.07
11/21/2014 9:00 sgsncisco 5213 0 A1 IPVL 11.29 4.86
11/21/2014 9:00 sgsncisco 5214 1 A1 IPVL 11.27 4.85
11/21/2014 9:00 sgsncisco 5215 2 A IPVL 11.23 4.7
This output shows the detach counts at the time of the issue:
Time #13:2G-ms-init-detach #14:2G-nw-init-detach
21-11-14 6:30 136465 7400
21-11-14 7:00 149241 9557
21-11-14 7:30 165788 12630
21-11-14 8:00 179311 16963
21-11-14 8:30 125564 44759
21-11-14 9:00 112461 95299
21-11-14 9:30 240341 112461
21-11-14 10:00 288014 116298
21-11-14 10:30 203261 123300
21-11-14 11:00 67788 122945
This output shows the attach requests per second, as per the Web Element Manager (WEM); the per-second rates are derived from the hourly counters, as shown in the sketch after the table:
Time #3:2G-total-attach-req-all Request/Second
21-11-14 8:00 738279 205.078
21-11-14 9:00 14053511 3903.753
21-11-14 10:00 24395071 6776.409
21-11-14 11:00 24663454 6850.959
21-11-14 12:00 17360687 4822.413
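The Request/Second column appears to be derived directly from the hourly counters. Here is a quick Python check (an assumption about how the WEM values are computed, not a documented formula):

# Illustrative check: each counter is treated as the number of 2G attach requests
# received in the preceding hour, so Request/Second = counter / 3600 seconds.
hourly_counters = {
    "08:00": 738_279,
    "09:00": 14_053_511,
    "10:00": 24_395_071,
    "11:00": 24_663_454,
    "12:00": 17_360_687,
}
for hour, count in hourly_counters.items():
    print(f"{hour}: {count / 3600:.3f} attach requests per second")
# 11:00 -> ~6850.96/s, which is the peak 2G value quoted in the next paragraphs.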
Each new call IMSI/Packet Temporary Mobile Subscriber Identity (P-TMSI) attach and Routing Area Update (RAU) request must be processed by the IMSIMGR.
Based on a conservative observation, the system receives a peak of 6,850 2G attach requests per second and around 5,313 3G attach requests per second. The maximum value that you can set for network overload protection is 5,000 attach requests per second, so the system cannot handle such a large number of calls from the UEs and still keep the IMSIMGR in an operable state.
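A quick arithmetic check with the peak values quoted above shows why the overload protection cannot absorb this load (the variable names are illustrative):

# Rough capacity check based on the peak attach rates quoted above (illustrative only).
peak_2g_rate = 6_850             # peak 2G attach requests per second
peak_3g_rate = 5_313             # peak 3G attach requests per second
max_configurable_limit = 5_000   # maximum network overload protection setting

combined_peak = peak_2g_rate + peak_3g_rate
print(combined_peak)                            # ~12,163/s, the "approximately 12,000" below
print(combined_peak > max_configurable_limit)   # True: the peak load exceeds the limit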
This issue begins after 8 AM, when the pacing queue reaches its configured size of 1,500 requests:
network-overload-protection sgsn-new-connections-per-second 500 action
reject-with-cause congestion queue-size 1500 wait-time 5
Since there are approximately 12,000 attach requests per second, nearly 9,000 calls are processed by the IMSIMGR and then rejected. This drives the IMSIMGR CPU usage to a high state.
If the SGSN receives more than the configured number of attach requests in a second, the excess requests are buffered in the pacing queue and are only dropped when the buffer overflows due to a high inbound attach rate. Messages in the queue are processed on a First-In, First-Out (FIFO) basis and age out when the queued message lifetime crosses the configured wait-time.
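Here is a minimal Python sketch of the pacing-queue behavior described above (the class and parameter names are hypothetical, not SGSN code); it models the configured per-second limit, the FIFO buffer, drops on overflow, and age-out after the wait-time:

# Minimal sketch of the pacing queue described above (illustrative only).
from collections import deque

class PacingQueue:
    def __init__(self, per_second_limit=500, queue_size=1500, wait_time=5):
        self.limit = per_second_limit
        self.queue = deque()          # entries are (arrival_time, request)
        self.queue_size = queue_size
        self.wait_time = wait_time

    def on_attach_request(self, now, request, accepted_this_second):
        if accepted_this_second < self.limit:
            return "process"          # within the configured per-second rate
        if len(self.queue) >= self.queue_size:
            return "drop"             # buffer overflow due to a high inbound rate
        self.queue.append((now, request))
        return "queued"

    def service(self, now):
        """Process queued requests FIFO; age out anything older than wait_time."""
        processed, aged_out = [], []
        while self.queue:
            arrival, request = self.queue.popleft()
            if now - arrival > self.wait_time:
                aged_out.append(request)   # rejected, for example with cause 'congestion'
            else:
                processed.append(request)
        return processed, aged_out

pq = PacingQueue()
print(pq.on_attach_request(now=0.0, request="attach-1", accepted_this_second=500))  # 'queued'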
You can choose either the reject or the drop option based on your preference; Cisco recommends that you reject with a cause code that indicates congestion in the network, which allows the UE to understand the network conditions before it attempts an uplink procedure.
As per 3rd Generation Partnership Project (3GPP) Technical Specification (TS) 23.060, this section describes the SGSN behavior during an HLR restart. Whenever the SGSN receives a MAP reset, it is expected to send an Update Location (UL) request towards the HLR for its subscribers.
When an HLR restarts, it sends a reset message to each SGSN to which one or more of its Mobile Stations (MSs) are registered. This causes the SGSN to mark the relevant Mobility Management (MM) contexts as invalid if an SGSN-to-Mobile Switching Center (MSC)/Visitor Location Register (VLR) association exists. After receipt of the first valid Logical Link Control (LLC) frame (for A/Gb mode), or after receipt of the first valid GPRS Tunneling Protocol User plane (GTP-U) packet or uplink signalling message (for Iu mode) from a marked Mobile Station, the SGSN performs a UL to the HLR as in the attach request or inter-SGSN Routing Area (RA) update procedures. Also, if the Non-GPRS Alert Flag (NGAF) is set, the procedure in the Non-GPRS Alert clause is followed. The UL procedure and the procedure towards the MSC/VLR might be delayed by the SGSN for a maximum time determined by operator configuration, dependent upon the use of resources at that time, in order to avoid a high signalling load.
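This behavior can be summarized with a short Python sketch (illustrative only; the function names are hypothetical and the random delay simply models the operator-configurable maximum mentioned in TS 23.060):

# Sketch of the TS 23.060 behavior described above (illustrative, not SGSN code).
# After a MAP reset the affected MM contexts are marked; the first uplink activity
# from a marked MS triggers an Update Location (UL), optionally delayed by up to an
# operator-configured maximum to avoid a signalling spike toward the HLR/STP.
import random

marked_for_update = set()      # IMSIs whose HLR data must be refreshed

def on_map_reset(registered_imsis):
    """HLR restart: mark every subscriber registered from that HLR."""
    marked_for_update.update(registered_imsis)

def on_first_uplink_activity(imsi, max_delay_seconds=0):
    """First valid LLC frame (A/Gb) or GTP-U/signalling packet (Iu) from the MS."""
    if imsi not in marked_for_update:
        return None
    marked_for_update.discard(imsi)
    delay = random.uniform(0, max_delay_seconds)   # spread the GLU/UL load over time
    return ("send_update_location", imsi, delay)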
Cisco recommends that you complete these steps in order to resolve this issue:
Ideally, each STP has four links, so 125 attach requests can be processed per STP link, distributed equally across all of the STP links. However, if one of the links goes down, many reconnection attempts are seen, the queue becomes full, and packet discards occur. If more links go down, the traffic distribution becomes even more uneven.
The UE traffic does not follow a linear pattern; it usually occurs in bursts with many reconnection attempts. The SGSN sends traffic in bundles to the STP, and at those times the traffic exceeds the configured Transactions Per Second (TPS) on the STP. This causes some STP links that already process more calls to advertise a low window size, so the SGSN begins to queue the SCTP data chunks and waits for the RTO MAX timer to expire.
If the STP periodically sends a good advertised window size, then you should be able to send more SCTP data chunks if the SCTP_RTO_MAX value is reduced to five seconds or less. The queue is cleared faster and an M3UA congestion alarm is not triggered. Additionally, you should not see the Internal Flow Control flag triggered by the SCTP in order to control the packet flow.
The SGSN sends only as many packets as the STP can accept, which is based on the advertised window size. An increase in the TPS per STP link, together with a reduction of the SCTP_RTO_MAX timer, helps to avoid STP congestion.
If the advertised window size in the Stream Control Transmission Protocol (SCTP) Selective Acknowledgement (SACK) message is close to zero (or zero), then the SGSN raises an M3UA alarm in order to indicate that messages should not be sent towards that peer endpoint. This causes the link to flap or move into a congested state. Since the SGSN advertises a higher window size, it continues to receive M3UA data from the peer nodes, and those packets might accumulate in the waiting queue and be dropped if the peer point code never comes out of the congested state.
Here is an example:
The SCTP messages are queued only for those associations where the flow control flag becomes TRUE, and the SGSN then processes them in accordance with the STP response:
*Peer Server Id : 2 Peer Server Process Id: 2
Association State : ESTABLISHED
Flow Control Flag : TRUE
Peer INIT Tag : 20229
SGSN INIT Tag : 3315914061
Next TSN to Assign to
Outgoing Data Chunk : 3418060778
Lowest cumulative TSN acknowledged : 3418060634
Cumulative Peer TSN arrived from peer : 103253660
Last Peer TSN sent in the SACK : 103253658
Self RWND : 1048576
Advertised RWND in received SACK : 8
Peer RWND(estimated) : 8
Retransmission counter : 0
Zero Window Probing Flag : FALSE
Last Tsn received during ZWnd Probing : 0
Bytes outstanding on all
addresses of this association : 19480
Congestion Queue Length : 143
Ordered TSN assignment Waiting QLen : 8050
Unordered TSN assignment Waiting QLen : 0
Total number of GAP ACKs Transmitted : 279
Total number of GAP ACKs Received : 58787
Path No. : 1
Current CWND : 11840
SSThresh : 11840
Partial Bytes Acked : 0
Bytes Outstanding for this Path : 19480
Current RTO for this Path(in ms) : 60000
As shown, the reason behind the congestion is that the total number of outbound chunks exceeds the 5,000 limit (8,050 + 143 = 8,193) and the association runs with a high RTO of 60 seconds (the RTO maximum), which results in discarded SCTP data requests.
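A simplified model helps to explain why the chunks back up (this is an illustration, not the actual SCTP stack): the sender can only keep min(CWND, peer RWND) bytes outstanding, so an advertised window near zero forces new chunks to wait in the queue until the long RTO expires.

# Simplified model of the behavior shown in the output above (illustrative only).
def can_transmit(chunk_size, bytes_outstanding, cwnd, peer_rwnd):
    """True if another chunk fits inside both the congestion and receiver windows."""
    window = min(cwnd, peer_rwnd)
    return bytes_outstanding + chunk_size <= window

# Values taken from the association output above.
cwnd = 11_840                 # Current CWND
peer_rwnd = 8                 # Advertised RWND in received SACK (essentially zero)
bytes_outstanding = 19_480    # bytes already in flight on this path
rto_ms = 60_000               # Current RTO for this path

print(can_transmit(512, bytes_outstanding, cwnd, peer_rwnd))   # False: new chunks must queue
print(f"queued chunks wait up to {rto_ms / 1000:.0f} s before retransmission or timeout")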
Revision | Publish Date | Comments
---|---|---
1.0 | 16-Apr-2015 | Initial Release