This document describes a problem that is encountered on the Serving General Packet Radio Service (GPRS) Support Node (SGSN) of the Cisco Aggregation Services Router (ASR) 5000 Series. Some possible workarounds for this issue are also described.
This specific chain of events on the ASR SGSN is described in this document:
When the HLR receives the MAP_RESET message, it sets a flag for a GPRS Location Update (GLU). When the User Equipment (UE) sends its first uplink packets, the SGSN sends a GLU message to the HLR.
At 7 AM on November 21, 2014, the SGSN had this subscriber summary:
******** show subscriber summary *******
Total Subscribers: 2386266
Active: 2386266
sgsn-pdp-type-ipv4: 942114
As shown in the example output, there are nearly 950,000 Packet Data Protocol (PDP) contexts present on the SGSN, and the UEs attempt to browse as the day progresses.
When the first uplink packets are received, the SGSN triggers a GLU message. Since there are hundreds of thousands of UEs, the STP cannot handle the amount of traffic that is generated and it moves into a perennial congestion state.
Messages are queued at the SGSN, and a maximum retransmission timeout occurs. Since all of the GLU messages do not pass from the SGSN to the HLR, the SGSN is forced to detach the mobile subscribers and request that they reattach. All of the detached subscribers then attempt to attach, which causes a sudden surge in the number of inbound attachment requests. Since network overload protection is applied, most of the attempts to attach are rejected due to congestion and the mobile subscribers are forced to make a new attempt.
As this chain of events unfolds, it produces cascading effects. Many Send Authentication Information (SAI) messages, GLU messages, and MAP-IMEI_CHECK messages are stuck in the SGSN queue or dropped. For this reason, all of the STP-1 and STP-2 links reach a congestion state. Each STP has four signaling links, but in this scenario, the first three links of STP-2 do not recover for a long time.
Here are the congestion alarms, in which you can see that all of the STP links move into the congestion state on STP-2:
Fri Nov 21 08:13:14 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-1 (point-code-782)
congested congLevel-1
Fri Nov 21 08:13:14 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-2 (point-code-782)
congested congLevel-1
Fri Nov 21 08:13:14 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-3 (point-code-782)
congested congLevel-1
Fri Nov 21 08:13:29 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:18:48 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:20:00 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:22:52 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:22:55 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:23:22 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:26:33 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:28:06 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 08:28:45 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
Fri Nov 21 09:27:27 2014 Internal trap notification 1074 (M3UAPSPCongested)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congested congLevel-1
As shown, only Peer Server Process (PSP) 4 was cleared, and the rest are still in the congestion state:
Fri Nov 21 08:18:47 2014 Internal trap notification 1075 (M3UAPSPCongestionCleared)
ss7-routing-domain-1 peer-server-2 peer-server-process-4 (point-code-782)
congestion cleared congLevel-0
This section describes how to troubleshoot the issue that is described in the previous section.
As described in the previous section, one particular link in the STP receives a large amount of traffic. You can see that the first three links in STP-2 move into the congestion state and never recover, so only one link is available and the congestion alarm is cleared on SLC-3 (or peer-server-2-peer-server-process-4).
As per the SGSN load-sharing mechanism, the SGSN must send the Message Transfer Part (MTP) Level 3 (MTP3) User Adaptation Layer (M3UA) packets equally on all four links. However, from the Simple Network Management Protocol (SNMP) traps, the first three STP-2 links are perennially congested, which means that all of the traffic is routed to the SLC-3 link (the only available STP link to route traffic). This explains why the traffic distribution is skewed between the STP-2 links.
In congestion situations, one or more links toggle between congested and non-congested states, so only the available links share the traffic. For this reason, there is more utilization in one of the links. This requires a link reset in order to recover the links.
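The skewed distribution can be reasoned about with a simple model. This Python sketch is illustrative only (the link names, message counts, and round-robin scheme are assumptions, not SGSN internals); it shows how a load-sharing mechanism collapses onto the single remaining link when the other links stay congested:

# Illustrative sketch, not SGSN code: M3UA load sharing when three of four links
# toward STP-2 remain congested. Link names and message counts are hypothetical.
from collections import Counter
from itertools import cycle

links = ["SLC-0", "SLC-1", "SLC-2", "SLC-3"]   # four signaling links toward STP-2
congested = {"SLC-0", "SLC-1", "SLC-2"}        # the first three links never recover

def distribute(messages, links, congested):
    """Round-robin over the links that are currently usable."""
    usable = [link for link in links if link not in congested] or links
    counts = Counter()
    for link, _ in zip(cycle(usable), range(messages)):
        counts[link] += 1
    return counts

print(distribute(100_000, links, congested))   # all 100,000 messages land on SLC-3
print(distribute(100_000, links, set()))       # healthy case: 25,000 per link

With three of the four links congested, the entire load lands on SLC-3, which matches the skew seen in the STP data later in this section.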
The next output shows the M3UA-level statistics and the detach statistics. The important statistics to consider are those for STP-2 PSP instance 4, where abnormal traffic can be seen:
Time #1:ss7rd-m3ua-psp-data-tx #2:ss7rd-m3ua-psp-error-tx #3:ss7rd-m3ua-psp-data-rx
21-11-14 7:30 37409 0 37942
21-11-14 8:00 43677 0 43866
21-11-14 8:30 190414 0 71844
21-11-14 9:00 547418 0 104135
21-11-14 9:30 536019 0 102477
21-11-14 10:00 376797 0 132227
21-11-14 10:30 100394 0 97302
21-11-14 11:00 119652 0 114809
21-11-14 11:30 107073 0 95354
Here is the STP data:
DATE TIME LSN LOC SLC LINK TX % RX %
11/21/2014 9:00 sgsncisco 5216 3 A IPVL 11.26 62.07
11/21/2014 9:00 sgsncisco 5213 0 A1 IPVL 11.29 4.86
11/21/2014 9:00 sgsncisco 5214 1 A1 IPVL 11.27 4.85
11/21/2014 9:00 sgsncisco 5215 2 A IPVL 11.23 4.7
This output shows the detach counts at the time of the issue:
Time #13:2G-ms-init-detach #14:2G-nw-init-detach
21-11-14 6:30 136465 7400
21-11-14 7:00 149241 9557
21-11-14 7:30 165788 12630
21-11-14 8:00 179311 16963
21-11-14 8:30 125564 44759
21-11-14 9:00 112461 95299
21-11-14 9:30 240341 112461
21-11-14 10:00 288014 116298
21-11-14 10:30 203261 123300
21-11-14 11:00 67788 122945
This output shows the attach requests per second, as per the Web Element Manager (WEM); the per-second rates are derived from the hourly counters, as shown in the sketch after the table:
Time #3:2G-total-attach-req-all Request/Second
21-11-14 8:00 738279 205.078
21-11-14 9:00 14053511 3903.753
21-11-14 10:00 24395071 6776.409
21-11-14 11:00 24663454 6850.959
21-11-14 12:00 17360687 4822.413
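The Request/Second column appears to be derived directly from the hourly counters. Here is a quick Python check (an assumption about how the WEM values are computed, not a documented formula):

# Illustrative check: each counter is treated as the number of 2G attach requests
# received in the preceding hour, so Request/Second = counter / 3600 seconds.
hourly_counters = {
    "08:00": 738_279,
    "09:00": 14_053_511,
    "10:00": 24_395_071,
    "11:00": 24_663_454,
    "12:00": 17_360_687,
}
for hour, count in hourly_counters.items():
    print(f"{hour}: {count / 3600:.3f} attach requests per second")
# 11:00 -> ~6850.96/s, which is the peak 2G value quoted in the next paragraphs.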
Each new call IMSI/Packet Temporary Mobile Subscriber Identity (P-TMSI) attach and Routing Area Update (RAU) request must be processed by the IMSIMGR.
Based on a conservative observation, the system receives a peak of 6,850 2G attach requests per second and around 5,313 3G attach requests per second. The maximum value that you can set for network overload protection is 5,000 attach requests per second, so the system cannot handle such a large number of calls from the UEs and still keep the IMSIMGR in an operable state.
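A quick arithmetic check with the peak values quoted above shows why the overload protection cannot absorb this load (the variable names are illustrative):

# Rough capacity check based on the peak attach rates quoted above (illustrative only).
peak_2g_rate = 6_850             # peak 2G attach requests per second
peak_3g_rate = 5_313             # peak 3G attach requests per second
max_configurable_limit = 5_000   # maximum network overload protection setting

combined_peak = peak_2g_rate + peak_3g_rate
print(combined_peak)                            # ~12,163/s, the "approximately 12,000" below
print(combined_peak > max_configurable_limit)   # True: the peak load exceeds the limit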
This issue begins after 8 AM, when the pacing queue reaches its configured size of 1,500 requests:
network-overload-protection sgsn-new-connections-per-second 500 action
reject-with-cause congestion queue-size 1500 wait-time 5
Since there are approximately 12,000 attach requests per second, nearly 9,000 calls are processed by the IMSIMGR and then rejected. This drives the IMSIMGR CPU usage to a high state.
If the SGSN receives more than the configured number of attach requests in a second, the excess requests are buffered in the pacing queue and are only dropped when the buffer overflows due to a high inbound attach rate. Messages in the queue are processed on a First-In, First-Out (FIFO) basis and age out when the queued message lifetime crosses the configured wait-time.
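Here is a minimal Python sketch of the pacing-queue behavior described above (the class and parameter names are hypothetical, not SGSN code); it models the configured per-second limit, the FIFO buffer, drops on overflow, and age-out after the wait-time:

# Minimal sketch of the pacing queue described above (illustrative only).
from collections import deque

class PacingQueue:
    def __init__(self, per_second_limit=500, queue_size=1500, wait_time=5):
        self.limit = per_second_limit
        self.queue = deque()          # entries are (arrival_time, request)
        self.queue_size = queue_size
        self.wait_time = wait_time

    def on_attach_request(self, now, request, accepted_this_second):
        if accepted_this_second < self.limit:
            return "process"          # within the configured per-second rate
        if len(self.queue) >= self.queue_size:
            return "drop"             # buffer overflow due to a high inbound rate
        self.queue.append((now, request))
        return "queued"

    def service(self, now):
        """Process queued requests FIFO; age out anything older than wait_time."""
        processed, aged_out = [], []
        while self.queue:
            arrival, request = self.queue.popleft()
            if now - arrival > self.wait_time:
                aged_out.append(request)   # rejected, for example with cause 'congestion'
            else:
                processed.append(request)
        return processed, aged_out

pq = PacingQueue()
print(pq.on_attach_request(now=0.0, request="attach-1", accepted_this_second=500))  # 'queued'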
You can choose either the reject or the drop option based on your preference; Cisco recommends that you reject with a cause code that indicates congestion in the network, which allows the UE to understand the network conditions before it attempts an uplink procedure.
As per 3rd Generation Partnership Project (3GPP) Technical Specification (TS) 23.060, this section describes the SGSN behavior during an HLR restart. Whenever the SGSN receives a MAP reset, it is expected to send an Update Location (UL) request towards the HLR for its subscribers.
When an HLR restarts, it sends a reset message to each SGSN to which one or more of its Mobile Stations (MSs) are registered. This causes the SGSN to mark the relevant Mobility Management (MM) contexts as invalid if an SGSN-to-Mobile Switching Center (MSC)/Visitor Location Register (VLR) association exists. After receipt of the first valid Logical Link Control (LLC) frame (for A/Gb mode), or after receipt of the first valid GPRS Tunneling Protocol User plane (GTP-U) packet or uplink signalling message (for Iu mode) from a marked Mobile Station, the SGSN performs a UL to the HLR as in the attach request or inter-SGSN Routing Area (RA) update procedures. Also, if the Non-GPRS Alert Flag (NGAF) is set, the procedure in the Non-GPRS Alert clause is followed. The UL procedure and the procedure towards the MSC/VLR might be delayed by the SGSN for a maximum time determined by operator configuration, dependent upon the use of resources at that time, in order to avoid a high signalling load.
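This behavior can be summarized with a short Python sketch (illustrative only; the function names are hypothetical and the random delay simply models the operator-configurable maximum mentioned in TS 23.060):

# Sketch of the TS 23.060 behavior described above (illustrative, not SGSN code).
# After a MAP reset the affected MM contexts are marked; the first uplink activity
# from a marked MS triggers an Update Location (UL), optionally delayed by up to an
# operator-configured maximum to avoid a signalling spike toward the HLR/STP.
import random

marked_for_update = set()      # IMSIs whose HLR data must be refreshed

def on_map_reset(registered_imsis):
    """HLR restart: mark every subscriber registered from that HLR."""
    marked_for_update.update(registered_imsis)

def on_first_uplink_activity(imsi, max_delay_seconds=0):
    """First valid LLC frame (A/Gb) or GTP-U/signalling packet (Iu) from the MS."""
    if imsi not in marked_for_update:
        return None
    marked_for_update.discard(imsi)
    delay = random.uniform(0, max_delay_seconds)   # spread the GLU/UL load over time
    return ("send_update_location", imsi, delay)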
Cisco recommends that you complete these steps in order to resolve this issue:
Ideally, each STP has four links, so 125 attach requests can be processed per STP link, distributed equally across all of the STP links. However, if one of the links goes down, many reconnection attempts are seen, the queue becomes full, and packet discards occur. If more links go down, the traffic distribution becomes even more uneven.
The UE traffic does not follow a linear pattern; it usually occurs in bursts with many reconnection attempts. The SGSN sends traffic in bundles to the STP, and at those times the traffic exceeds the configured Transactions Per Second (TPS) on the STP. This causes some STP links that already process more calls to advertise a low window size, so the SGSN begins to queue the SCTP data chunks and waits for the RTO MAX timer to expire.
If the STP periodically sends a good advertised window size, then you should be able to send more SCTP data chunks if the SCTP_RTO_MAX value is reduced to five seconds or less. The queue is cleared faster and an M3UA congestion alarm is not triggered. Additionally, you should not see the Internal Flow Control flag triggered by the SCTP in order to control the packet flow.
The SGSN sends only as many packets as the STP can accept, which is based on the advertised window size. An increase in the TPS per STP link, together with a reduction of the SCTP_RTO_MAX timer, helps to avoid STP congestion.
If the advertised window size in the Stream Control Transmission Protocol (SCTP) Selective Acknowledgement (SACK) message is close to zero (or zero), then the SGSN raises an M3UA alarm in order to indicate that messages should not be sent towards that peer endpoint. This causes the link to flap or move into a congested state. Since the SGSN advertises a higher window size, it continues to receive M3UA data from the peer nodes, and those packets might accumulate in the waiting queue and be dropped if the peer point code never comes out of the congested state.
Here is an example:
The SCTP messages are queued only for those associations where the flow control flag becomes TRUE, and the SGSN then processes them in accordance with the STP response:
*Peer Server Id : 2 Peer Server Process Id: 2
Association State : ESTABLISHED
Flow Control Flag : TRUE
Peer INIT Tag : 20229
SGSN INIT Tag : 3315914061
Next TSN to Assign to
Outgoing Data Chunk : 3418060778
Lowest cumulative TSN acknowledged : 3418060634
Cumulative Peer TSN arrived from peer : 103253660
Last Peer TSN sent in the SACK : 103253658
Self RWND : 1048576
Advertised RWND in received SACK : 8
Peer RWND(estimated) : 8
Retransmission counter : 0
Zero Window Probing Flag : FALSE
Last Tsn received during ZWnd Probing : 0
Bytes outstanding on all
addresses of this association : 19480
Congestion Queue Length : 143
Ordered TSN assignment Waiting QLen : 8050
Unordered TSN assignment Waiting QLen : 0
Total number of GAP ACKs Transmitted : 279
Total number of GAP ACKs Received : 58787
Path No. : 1
Current CWND : 11840
SSThresh : 11840
Partial Bytes Acked : 0
Bytes Outstanding for this Path : 19480
Current RTO for this Path(in ms) : 60000
As shown, the reason behind the congestion is that the total number of outbound chunks exceeds the 5,000 limit (8,050 + 143 = 8,193) and the association runs with a high RTO of 60 seconds (the RTO maximum), which results in discarded SCTP data requests.
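A simplified model helps to explain why the chunks back up (this is an illustration, not the actual SCTP stack): the sender can only keep min(CWND, peer RWND) bytes outstanding, so an advertised window near zero forces new chunks to wait in the queue until the long RTO expires.

# Simplified model of the behavior shown in the output above (illustrative only).
def can_transmit(chunk_size, bytes_outstanding, cwnd, peer_rwnd):
    """True if another chunk fits inside both the congestion and receiver windows."""
    window = min(cwnd, peer_rwnd)
    return bytes_outstanding + chunk_size <= window

# Values taken from the association output above.
cwnd = 11_840                 # Current CWND
peer_rwnd = 8                 # Advertised RWND in received SACK (essentially zero)
bytes_outstanding = 19_480    # bytes already in flight on this path
rto_ms = 60_000               # Current RTO for this path

print(can_transmit(512, bytes_outstanding, cwnd, peer_rwnd))   # False: new chunks must queue
print(f"queued chunks wait up to {rto_ms / 1000:.0f} s before retransmission or timeout")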
Revision | Publish Date | Comments
---|---|---
1.0 | 16-Apr-2015 | Initial Release