This document reviews video call quality and provides a tutorial on the points to keep in mind when Quality of Service (QoS) is configured on a Cisco Unified Border Element (CUBE) or a Time-Division Multiplexing (TDM) gateway.
Contributed by Baktha Muralidharan, Cisco TAC Engineer, edited by Anoop Kumar.
This document is most beneficial for engineers familiar with voice over IP (VoIP), although others might find it useful.
There is no specific hardware or software used to write this document.
Digitized audio in its simplest form is a set of audio samples, each sample describing the sound pressure at that instant. Conversational audio can be captured and reproduced with a high degree of accuracy with just 8000 samples per second[1]. This means that as long as the network can transport the samples without excessive delay, jitter and packet loss, the audio can be faithfully reproduced at the other end.
In contrast, the presentation, processing and transport of video are much more complex. Brightness, contrast, color saturation, responsiveness (to motion) and lip-sync are just some of the attributes that determine video quality. Video samples generally require much more space, so, not surprisingly, video places a much larger demand on the bandwidth of the transport network.

Audio quality is determined by:

Microphone and speaker in the headset
Codec (compression)
Transport network

Video call quality is affected by:

Camera
Display device
Video codec
Transport network
Compatibility/Interoperability
Note: It is important to understand that, unlike audio, quite a bit goes on at the video endpoints when it comes to tuning quality.
QoS in general is a vast and complex subject that requires consideration of the overall traffic requirements (rather than just the traffic whose quality you wish to improve) and must be addressed on every network component along the path of the media flow. Achieving good video quality in a video conference is even more complex because, in addition to the network components, it involves review and tuning of the configuration at the endpoints. Broadly, video quality entails this:
The specific focus of this document is the QoS considerations on the IOS gateway or CUBE when it handles video calls.
Tuning at the endpoints involves adjusting a set of parameters on the video endpoints. The details depend on the product, but here are a few general “knobs”:
Tuning the network for video generally involves the following:
Interoperability comes into play when heterogeneous systems (video telephony as well as telepresence (TP)) participate in a conference call. The experiences provided by a TP system and a video phone system are fundamentally different. Interoperability between them is generally achieved by bridging them with a process known as cascading.
This is not a design document, nor is it a comprehensive video QoS document. Specifically, this document does not cover these topics:
Video, like audio, is real-time. Audio transmissions are constant bit rate (CBR). In contrast, video traffic tends to be bursty and is referred to as variable bit rate (VBR). Consequently, the bit rate of a video transmission cannot be constant if a certain level of quality is to be maintained[2].
Image 1
Determination of bandwidth and bursting required for video is also more involved. This is discussed later in this document.
Why is video bursty?
The answer lies in the way video is compressed. Remember that video is a sequence of images (frames) played back to produce the effect of visual motion. The compression techniques used by video codecs take an approach called delta encoding[3], which stores byte values as differences (deltas) between sequential values (samples) rather than as the values themselves. Accordingly, video is encoded (and transmitted) as consecutive frames that carry just the “moving parts” rather than whole frames.
You are probably wondering: why, doesn't audio change incrementally too? True enough, but “motion” (or dynamics) does not impact audio nearly as much as it does video. The 8-bit audio samples do not compress better when delta encoded; video samples (frames) do. The relative change from sample to sample (frame to frame) in video is much smaller than in audio. Depending on the nature and degree of motion, video samples can vary greatly in size. Image 2 illustrates video compression.
Image 2
An I‑frame is an Intra-coded picture, in effect a fully specified picture, like a conventional static image file.
A P‑frame (Predicted picture) holds only the changes in the image from the previous frame. The encoder does not need to store the unchanging background pixels in the P‑frame, thus saving space. P‑frames are also known as delta‑frames.
A B‑frame (Bi-predictive picture) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.
Cisco video gear does not measure or report on video quality as such, so video quality is perceived rather than measured. There are standardized algorithms that measure quality by means of a MOS (Mean Opinion Score). However, if the issues reported on audio quality are any indication, video quality (TAC) cases are more likely to be opened because of user-perceived quality issues than because of reports from a tool.
Factors that affect video quality include:
Generally, each of the above is selectable/controllable at the endpoints.
Quilting, combing and banding: get used to these terms, which are part of the video impairment taxonomy. Refer to this document for details on the common video impairments:
Ref:
Recommended network SLA for video[4] is as follows:
Incidentally, the recommended network SLA for transporting audio is:
Note: Clearly, video is more sensitive to packet loss than voice. This is to be expected once you understand that inter-frames require information from previous frames, which means that the loss of inter-frames can be devastating to the process of reconstructing the video image.
Generally, the SLA for video transport can be delivered with QoS policies that are very similar to those used for audio transport. There are some differences, however, owing to the nature of video traffic.
Note: Although the scope of this document is limited to the CUBE component, remember QoS is end-to-end.
Is all video the same? Not quite. The variations of video as a medium include:
Note: In the interest of brevity, illustrations are not extensively provided for each type of video listed above.
Note: Video, like audio, is carried in the Real-time Transport Protocol (RTP).
In principle the QoS mechanisms employed to deliver the SLAs for a video transport network are mostly the same as those for audio. There are some differences however, mostly due to the bursty nature of video and VBR transmission.
There are two approaches to QoS: Integrated Services (IntServ) and Differentiated Services (DiffServ).
Think of IntServ as operating at the signaling level and DiffServ at the media level. In other words, the IntServ model ensures quality by operating at the control plane; DiffServ aims to ensure quality by operating at the data plane.
In the IntServ architecture, network devices make requests for static bandwidth reservations and maintain the state of all reserved flows while performing classification, marking and queuing services for these flows; the IntServ architecture thus operates in, and integrates, both the control plane and the data plane, and as such it has been largely abandoned because of inherent scaling limitations. The protocol used to make the bandwidth reservations is RSVP (Resource reSerVation Protocol).
There is also the IntServ/DiffServ model, which is a mix of the two. This model separates control plane operations from data plane operations: RSVP operation is limited to admission control only, with DiffServ mechanisms handling classification, marking, policing and scheduling. As such, the IntServ/DiffServ model is highly scalable and flexible.
Note: This document focuses only on the DiffServ (that is, the prioritization scheme, LLQ) approach.
Bandwidth is obviously the most fundamental QoS parameter. The bandwidth required depends on several parameters, most notably:
The old trick of throwing bandwidth at the problem is not always the solution. This is especially true for video quality. For example, with CUVA (Cisco Unified Video Advantage) there is no synchronization mechanism between the two devices (phone and PC) involved. Thus QoS should be configured to minimize jitter, latency, fragmented packets, and out-of-order packets.
Note: Interactive video has the same service level requirements as VoIP because a voice call is embedded within the video stream. Streaming video has much laxer requirements because of the large amount of buffering built into the applications.
Finally, it is important to understand that, unlike VoIP, there are no clean formulas for calculating the required incremental bandwidth. This is because video packet sizes and packet rates vary significantly and are largely a function of the degree of motion within the video images being transmitted. More on this later.
Low Latency Queuing (LLQ) is the preferred queuing policy for VoIP audio. Given the stringent delay/jitter requirements of TP and the need to synchronize audio and video for CUVA, priority (LLQ) queuing is recommended for all video traffic as well. Note that, for video, the priority bandwidth is generally padded by about 20% to account for the overhead.
Not recommended for video.
LFI (Link Fragmentation and Interleaving) is a popular mechanism used to keep jitter under control on slow links, where serialization delays can be high.
Then again, interactive video is not recommended for slow links. This is because the LLQ to which the video traffic is assigned is not subject to fragmentation, which means that large interactive-video packets (such as 1500-byte full-motion I-frames) can cause serialization delays for smaller interactive-video packets.
Selective discard based on RTCP
This QoS mechanism is important for video traffic, which, as mentioned earlier, is bursty.
The optional burst parameter can be configured as part of the priority command[6].
With H.264, the worst-case burst would be a full screen of (spatially compressed) video. Based on extensive testing on TP systems, this has been found to be 64 KB. Therefore the LLQ burst parameter should be configured to permit up to 64 KB of burst per frame per screen. Thus a CTS-1000 system running at 1080p-Best (with the optional support of an auxiliary video stream[7]) would be configured with an LLQ burst parameter of 128 KB (2 x 64 KB).
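To make this concrete, here is a minimal MQC sketch of an LLQ class with an explicit burst value. The class name, the CS4 match criterion and the 5000-kbps priority rate are illustrative assumptions; only the 128,000-byte burst corresponds to the CTS-1000 example above.

! Classify TelePresence traffic by its DSCP marking (class name is illustrative)
class-map match-any TP-VIDEO
 match dscp cs4
!
! Priority-queue the class; "priority <kbps> <burst-bytes>" sets the rate and the burst
policy-map LLQ-VIDEO
 class TP-VIDEO
  priority 5000 128000
!
! Attach the policy outbound on the WAN link (interface is illustrative)
interface Serial0/0/0
 service-policy output LLQ-VIDEO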
So, how much bandwidth is required to transport a video call faithfully? Before we get down to the calculations, it is important to understand the following concepts, which are unique to video.
This basically refers to the size of the image. Other commonly used terms for this include video format and screen size. Commonly used video formats are shown below.
Format | Video resolution (pixels)
SQCIF | 128x96
QCIF | 176x144
SCIF | 256x192
SIF | 352x240
CIF | 352x288
DCIF | 528x384
4CIF | 704x576
16CIF | 1408x1152
The vast majority of video conferencing equipment runs at the CIF or 4CIF formats.
Ref: http://en.wikipedia.org/wiki/Common_Intermediate_Format
Note: There is no equivalent of (video) resolution in the audio world.
This refers to the rate at which an imaging device produces unique consecutive images called frames. Frame rate is expressed as frames per second (fps).
Note: The equivalent metric in the audio world is the sampling rate, e.g. 8000 samples per second for G.711 u-law.
Bandwidth calculations for video telephony systems and other traditional video conference systems tend to be simpler.
As an example, consider a TP call with a resolution of 1080 x 1920. The bandwidth required is calculated as follows:
2,073,600 pixels per frame
x 3 colors per pixel
x 1 byte (8 bits) per color
x 30 frames per second
= approximately 1.5 Gbps per screen. Uncompressed!
With compression, a bandwidth of 4 Mbps per screen (more than 99% compression) is enough to transport the above stream!
The following table lists the uncompressed bit rates for some common combinations of format and frame rate:
Uncompressed bit rates (Mbit/s):

Picture format | Luminance resolution (pixels x lines) | Grey, 10 frames/s | Color, 10 frames/s | Grey, 30 frames/s | Color, 30 frames/s
SQCIF | 128 x 96 | 1.0 | 1.5 | 3.0 | 4.4
QCIF | 176 x 144 | 2.0 | 3.0 | 6.1 | 9.1
CIF | 352 x 288 | 8.1 | 12.2 | 24.3 | 36.5
4CIF | 704 x 576 | 32.4 | 48.7 | 97.3 | 146.0
16CIF | 1408 x 1152 | 129.8 | 194.6 | 389.3 | 583.9
Note that the above calculations are for a single screen. A TP call can involve multiple screens, so the total bandwidth for the call is a multiple of the per-screen bandwidth.
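As a rough illustration, a three-screen TP system that needs about 4 Mbps per screen (the compressed figure used earlier) would require on the order of 3 x 4 = 12 Mbps of media bandwidth before any network overhead is added; the exact value depends on the negotiated resolution and the amount of motion, so treat this as an illustration rather than a provisioning value.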
Refer to https://supportforums.cisco.com/thread/311604 for a good bandwidth calculator for Cisco TP systems.
How is video traffic identified and distinguished? One way to classify packets on CUBE is to use DSCP markings.
The following table illustrates DSCP markings per Cisco QoS baseline as well as RFC 4594.
Traffic | Layer 3 PHB | Layer 3 DSCP
Call Signaling | CS3 | 24
Voice | EF | 46
Video conference | AF41 | 34
TelePresence | CS4 | 32
Multimedia Streaming | AF31 | 26
Broadcast video | CS5 | 40
PHB - Per-Hop Behavior. This refers to what the router does in terms of packet classification and traffic-conditioning functions such as metering, marking, shaping, and policing.
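For illustration, classification on a CUBE or WAN-edge router that follows the markings in the table above could be sketched with class maps such as these; the class names are assumptions, and only the classes relevant to a given deployment need to be defined.

! Match traffic classes by the DSCP values listed in the table above
class-map match-any VOICE
 match dscp ef
class-map match-any TELEPRESENCE
 match dscp cs4
class-map match-any VIDEO-CONF
 match dscp af41
class-map match-any SIGNALING
 match dscp cs3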
By default, prior to version 9.0, CUCM (Cisco Unified Communications Manager) marked any and all video traffic (including TelePresence) with AF41. Starting with version 9.0, CUCM preconfigures the following DSCP values:
Tuning for audio quality entails calculating the priority bandwidth and implementing an LLQ policy on the WAN link. This is generally based on the anticipated call volume and the audio codec used.
While the principles are the same, the video bandwidth through a CUBE is not as easily calculated. This is due to a number of factors, including:
Therefore, bandwidth provisioning for video systems sometimes happens in the reverse order; that is, the amount of bandwidth that the transport network can deliver, with an LLQ policy, is determined first, and the endpoints are configured based on that. Endpoint video systems are smart enough to adjust the various video parameters to the pipe size, and the endpoints then signal the call accordingly.
So, how does CUBE handle bandwidth in its (SIP) offers/answers when it signals video calls? CUBE populates the video bandwidth fields in the SDP as follows:
1. From the bandwidth attribute in the incoming SDP. In SDP there is a bandwidth attribute, which has a modifier that specifies what type of bit rate the value refers to. The attribute has the form b=<modifier>:<value>
2. From the video bandwidth configured on the CUBE. For example, the estimated maximum bandwidth is calculated based on the features used by the CTS user, and that estimated bandwidth is preconfigured on CUBE through the CLI.
3. Default video bandwidth (384 Kbps)
The call flow shown below illustrates how CUBE populates bandwidth in call signaling messages:
Specifically, the CUBE uses the following logic:
At the SDP session level, the TIAS value is the maximal amount of bandwidth needed when all declared media streams are used[8].
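For example, a session-level bandwidth line in an offer can look like either of these (the TIAS value is taken from the SDP example later in this document; the AS value is purely illustrative):

b=TIAS:768000
b=AS:768

TIAS expresses the value in bits per second, whereas AS is conventionally expressed in kilobits per second and is interpreted according to the application's notion of maximum bandwidth.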
This is another area in which video differs from audio. Audio codecs use static payload types. Video codecs, in contrast, use dynamic RTP payload types, which use the range 96 to 127.
The reason for the use of dynamic payload types has to do with the wide applicability of video codecs. Video codecs have parameters that provide a receiver with the properties of the stream that will be sent. Video payload types are defined in the SDP with the a=rtpmap attribute. Additionally, the a=fmtp: attribute may be used to specify format parameters. The fmtp string is opaque and is simply passed through to the other side.
Here is an example:

m=video 2338 RTP/AVP 97 98 99 100
c=IN IP4 192.168.90.237
b=TIAS:768000
a=rtpmap:97 H264/90000
a=fmtp:97 profile-level-id=42800d;max-mbps=40500;max-fs=1344;max-smbps=40500
a=rtpmap:98 H264/90000
a=fmtp:98 profile-level-id=42800d;max-mbps=40500;max-fs=1344;max-smbps=40500;packetization-mode=1
a=rtpmap:99 H263-1998/90000
a=fmtp:99 custom=1024,768,4;custom=1024,576,4;custom=800,600,4;cif4=2;custom=720,480,2;custom=640,480,2;custom=512,288,1;cif=1;custom=352,240,1;qcif=1;maxbr=7680
a=rtpmap:100 H263/90000
a=fmtp:100 cif=1;qcif=1;maxbr=7680
Note that the two endpoints involved in a call might use different payload types for the same codec. CUBE responds to each side with the a=rtpmap line received on the other leg. This means that the configuration command asymmetric payload full is needed for video calls to work.
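A minimal sketch of where this command sits in the CUBE configuration is shown here; it assumes a SIP-to-SIP deployment and shows only the lines relevant to this discussion.

! Enable full asymmetric payload support for SIP-to-SIP calls on CUBE
voice service voip
 allow-connections sip to sip
 sip
  asymmetric payload full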
L2 bandwidth
Unlike voice, real-time IP video traffic is in general a somewhat bursty, variable-bit-rate stream. Therefore video, unlike voice, has no clean formulas for calculating network overhead, because video packet sizes and rates vary in proportion to the degree of motion within the video image itself. From a network administrator's point of view, bandwidth is always provisioned at Layer 2, but the variability in packet sizes and the variety of Layer 2 media that the packets may traverse from end to end make it difficult to calculate the real bandwidth that should be provisioned at Layer 2. However, the conservative rule that has been thoroughly tested and widely used is to over-provision video bandwidth by 20%. This accommodates the 10% burst and the network overhead from Layer 2 to Layer 4.
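For example, under this rule a video call whose media is negotiated at 4 Mbps would be provisioned with roughly 4 x 1.2 = 4.8 Mbps at Layer 2. The 4 Mbps figure is illustrative only; substitute the rate actually negotiated by the endpoints.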
As mentioned earlier, video endpoints do not report a MOS as such. However, the following tools can be used to measure and monitor transport network performance and to monitor video quality.
IP SLAs (Service Level Agreements), a feature embedded in IOS, performs active monitoring of network performance. The IP SLA video operation differs from other IP SLA operations in that all traffic is one-way only; a responder is required to process the sequence numbers and time stamps locally and to wait for a request from the source before it sends the calculated data back.
The source sends a request to the responder when the current video operation is done. This request signals the responder that no more packets will arrive, and that the video sink function in the video operation can be turned off. When the response from the responder arrives at the source, the statistics are read from the message, and the relevant fields in the operation are updated.
CiscoWorks IPM (Internetwork Performance Monitor) uses IP SLA probes and Mediatrace[9] to measure user traffic performance and produce reports.
The VQM (Video Quality Monitor) feature, available on CUBE, is a great tool to monitor video quality between two points of interest. Results are presented as a MOS.
VQM is available in IOS 15.2(1)T and later. Note that it uses DSP resources.
[1] Based on a highest captured audio frequency of approximately 4000 Hz, which is sufficient for conversational speech. Ref: Nyquist theorem.
[2] Constant Bit Rate (CBR) transmission schemes are possible with video, but they trade off quality to maintain CBR.
[3] For inter-frame compression.
[4] Note that SLA is more stringent for TP.
[5] Life-size images and high quality audio
[6] The default value for this parameter is 200ms of traffic at priority bandwidth. The Cisco LLQ algorithm has been implemented to include a default burst parameter equivalent to 200 ms worth of traffic. Testing has shown that this burst parameter does not require additional tuning for a single IP Videoconferencing (IP/VC) stream. For multiple streams, this burst parameter may be increased as required.
[7] An auxiliary video stream is a 5-fps video channel for sharing presentations or other collateral via the data projector.
[8] Note that some systems use the “AS” (Application Specific) modifier to convey maximum bandwidth. Interpretation of this attribute is dependent on the application's notion of maximum bandwidth.
CUBE is agnostic as to the specific bandwidth modifier (TIAS or AS).
[9] Mediatrace is an IOS Software feature that discovers the routers and switches along the path of an IP flow.