Thursday, 15 September 2011

VoIP Basics: About Jitter & RTP


 If you have ever experimented with the program ping, you probably know that if you send a sequence of packets from point A to some point B, each packet needs a slightly different time to reach the destination. The varying transit times are not an issue when you are downloading a web page, but they matter when you want to transmit a stream of real-time data. For example, let's suppose that a VoIP device sends out one RTP packet every 20 milliseconds. Figure 1 shows what the stream might look like at the receiving end. The fact that the packets do not arrive precisely every 20 milliseconds means that we cannot play them out as they arrive unless we are willing to accept poor audio output quality.

 
Figure 1: An RTP stream at the receiving end, with packets arriving at irregular intervals
Formally, jitter is defined as a statistical variance of the RTP data packet inter-arrival time. In the Real Time Protocol, jitter is measured in timestamp units. For example, if you transmit audio sampled at the usual 8000 Hertz, the unit is 1/8000 of a second.
The first step to dealing with jitter successfully is to know how large it is. However, we do not need to compute the precise value. In RTP, the receiving endpoint computes an estimate using a simplified formula (a first-order estimator). The jitter estimate is sent to the other party using RTCP (the Real Time Control Protocol).
The formula for estimating jitter is as follows (if you are not much into math, skip ahead to the jitter buffer section below):
J(i) = J(i-1) + ( |D(i-1,i)| - J(i-1) )/16
The estimator computes jitter iteratively. To update the estimate J(i) after we receive the i-th packet, we take the absolute change of the relative transit time, |D(i-1,i)|, subtract the previous estimate from it, divide the result by 16, and add it to the previous value. The division by 16 reduces the influence of large random changes: a change of the inter-arrival time needs to repeat several times before it influences the jitter estimate significantly.
In the jitter estimator formula, the value D(i-1, i) is the difference of relative transit times for the two packets. The difference is computed as
D(i,j) = (Rj - Ri) - (Sj - Si) = (Rj - Sj) - (Ri - Si)
Si is the RTP timestamp from packet i, and Ri is the arrival time of packet i, expressed in the same units.
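If you prefer code to formulas, the update step can be written as a small Python function. This is a minimal sketch: the function and argument names are mine, and a real endpoint would keep the state in integer timestamp units as required by RTP.

  def update_jitter(jitter, s_prev, r_prev, s_cur, r_cur):
      """One step of the RFC 3550 jitter estimator.

      s_prev/s_cur are the RTP timestamps of the previous and current
      packet, r_prev/r_cur are their arrival times, all in the same
      units (timestamp ticks, or milliseconds as in the example below).
      """
      # D(i-1, i): difference of the relative transit times of the two packets
      d = (r_cur - r_prev) - (s_cur - s_prev)
      # J(i) = J(i-1) + (|D(i-1, i)| - J(i-1)) / 16
      return jitter + (abs(d) - jitter) / 16.0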
Still not very clear? Let's try to do the math with a few sample values. We will assume the sender sends one packet every 20 milliseconds and that the ideal transit time is 10 milliseconds. To make the example a bit easier to grasp, we will use milliseconds instead of timestamp units. We also start from zero, not from a random value. The table below shows the calculation:
   i   Si (ms)   Ri (ms)   D(i-1, i)   J(i)
   1       0        10         0       0
   2      20        30         0       0
   3      40        49        -1       0.0625
   4      60        74         5       0.3711
   5      80        90        -4       0.5979
   6     100       111         1       0.6230
   7     120       139         8       1.0841
   8     140       150        -9       1.5788
   9     160       170         0       1.4802
  10     180       191         1       1.4501
  11     200       210        -1       1.4220
  12     220       229        -1       1.3956
  13     240       250         1       1.3709
  14     260       271         1       1.3477

As you can see in the table, the jitter value starts to grow slowly despite the large differences; this is the effect of the noise reduction. When the large differences disappear (i > 8), the estimate starts to decay slowly towards the mean of the recent absolute differences, which is about 1 ms in our example.
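If you want to verify the table yourself, the whole calculation fits in a few lines of Python; it is plain arithmetic, no RTP library is involved:

  # (Si, Ri) pairs from the table above, in milliseconds
  packets = [(0, 10), (20, 30), (40, 49), (60, 74), (80, 90),
             (100, 111), (120, 139), (140, 150), (160, 170),
             (180, 191), (200, 210), (220, 229), (240, 250), (260, 271)]

  jitter = 0.0
  for (s_prev, r_prev), (s_cur, r_cur) in zip(packets, packets[1:]):
      d = (r_cur - r_prev) - (s_cur - s_prev)
      jitter = jitter + (abs(d) - jitter) / 16.0
      print(round(jitter, 4))   # prints 0.0, 0.0625, 0.3711, ..., 1.3477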
Jitter Buffer
The network delivers RTP packets asynchronously, with variable delays. To be able to play the audio stream with reasonable quality, the receiving endpoint needs to turn the variable delays into constant delays. This can be done by using a jitter buffer.
The jitter buffer implementation is quite simple: you create a buffer to hold, say, 100 milliseconds of audio (with a sampling rate of 8000 Hz, 100 milliseconds corresponds to 800 samples). You place incoming audio frames into the buffer and start the playout when the buffer is, say, at least half full.
Once you start to play the audio, it's a bit of a gamble: you risk both buffer underflow (you need to play another frame but the buffer is empty) and buffer overflow (the buffer is full and you need to throw away the packet you have just received). To reduce the risk, you can increase the size of the buffer, but you simultaneously increase latency: if you start playing when there is at least 50 milliseconds of audio, you delay the signal by those 50 milliseconds. To improve the odds, you can implement an adaptive buffer that changes its size based on the current jitter.
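To illustrate the idea, here is a minimal, non-adaptive jitter buffer sketch in Python. The class and method names are mine; a real implementation would also reorder frames by RTP sequence number, conceal lost frames, and adapt the buffer size.

  from collections import deque

  class JitterBuffer:
      """A minimal fixed-size jitter buffer holding decoded audio frames."""

      def __init__(self, capacity_frames=5):        # 5 frames x 20 ms = 100 ms
          self.frames = deque()
          self.capacity = capacity_frames
          self.playing = False

      def put(self, frame):
          """Called whenever an audio frame arrives from the network."""
          if len(self.frames) >= self.capacity:
              return False                           # overflow: drop the frame
          self.frames.append(frame)
          if not self.playing and len(self.frames) * 2 >= self.capacity:
              self.playing = True                    # at least half full: start the playout
          return True

      def get(self, frame_bytes=160):
          """Called by the playout clock every 20 ms (160 samples at 8000 Hz)."""
          if not self.playing or not self.frames:
              return b"\x00" * frame_bytes           # underflow: play silence
          return self.frames.popleft()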
Sources of Jitter
I would like to conclude this piece with an observation about the sources of jitter. In addition to varying transit times, jitter can sometimes originate right in the sending computer. This happens when the audio data is not read directly from a sound card (sound cards have a very stable clock, more precise than the computer's on-board clock) but comes from another source — for example, the audio stream is generated by a text-to-speech software or simply read from a file. In other words, we are talking about applications like voice mail and interactive voice response (IVR) systems.
When run on a standard operating system, IVR and voice mail applications can have a problem with precise timing and thus cause high jitter. Quite often, the operating system's process scheduler works with 10-millisecond quanta. Consider an application that wants to send one RTP packet every 30 milliseconds. The application spends, say, 5 milliseconds doing some processing (e.g. text-to-speech synthesis). After that, it would need to sleep for precisely 25 milliseconds so that the interval between packets is exactly 30 ms. But because of the 10 ms quantum, the length of the sleep is rounded up to the nearest multiple of 10 ms. In other words, the interval between packets ends up being 35 milliseconds. If this happens between every pair of packets, the result is really poor audio quality.
To overcome the issue, you can do two things:
  • Reconfigure the operating system or install a kernel module or driver that will support a more precise timing.
  • Or, at the very least, use an adaptive sending algorithm that tries to compensate for the incorrect sleep lengths (see section 6 of the OpenH323 tutorial for more about how to do this; a generic sketch follows below).
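The second option can look roughly like the following Python sketch. This is a generic illustration of the idea, not the actual OpenH323 code: instead of sleeping for a fixed interval, the sender schedules every packet against an absolute deadline, so an overslept packet is automatically followed by a shorter sleep.

  import time

  PACKET_INTERVAL = 0.030                  # 30 ms between RTP packets

  def send_stream(packets, send):
      """Send packets at a steady rate; 'send' is any callable that transmits one packet."""
      next_deadline = time.monotonic()
      for packet in packets:
          send(packet)
          next_deadline += PACKET_INTERVAL
          delay = next_deadline - time.monotonic()
          if delay > 0:
              # Even if sleep() oversleeps because of the 10 ms quantum,
              # the error does not accumulate: the next delay comes out shorter.
              time.sleep(delay)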
Real Time Protocol
Vladimír Toncar
 
In the previous parts of the Voice over IP Overview, we described how the voice gets digitized and how it is encoded using codecs, and we also touched on latency and bandwidth optimization issues. Now it is the right time to learn more about how audio (and possibly video) streams are sent across the network.
The protocol used to send real-time streams of data across a network is simply called the Real Time Protocol (RTP for short). RTP was originally defined by the IETF in RFC 1889, and the up-to-date definition is in RFC 3550.
When transmitting the streams of data, the protocol needs to handle the following conditions in the network:
  • The network can de-sequence packets
  • Some packets can be lost
  • Jitter is introduced (jitter is a variance of the packet inter-arrival time, covered in greater detail above).
Out of these three, RTP aims to solve only two issues, packet de-sequencing and jitter (using sequence numbers and timestamps). When it comes to packet loss, the protocol prefers "real-timeness" to reliability: if some packets get lost, they get lost; it is more important to transmit the stream in real time. Because of this, RTP works on top of UDP. TCP is not suitable for real-time protocols because of its retransmission scheme.
 
RTP header
Let's have a look at the RTP packet header and point out the most important fields.
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |V=2|P|X|  CC   |M|     PT      |       sequence number         |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           timestamp                           |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |           synchronization source (SSRC) identifier            |
 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
 |            contributing source (CSRC) identifiers             |
 |                             ....    (optional)                |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |            data...                                            |
 |                                                               |
Figure 2: RTP Packet header

Figure 2 shows a simplified RTP packet structure (we left out the optional extensions; see RFC 3550 for the full description). The important fields are as follows:
Payload type (PT): Payload type for the data carried in the packet. The PT field is 7 bits long, so it allows values between 0 and 127. There are several static values defined; for example, "0" represents G.711 uLaw, "8" represents G.711 ALaw, and "18" stands for G.729. The range from 96 to 127 is reserved for dynamic payload types. These dynamic payload types need to be negotiated by whatever signaling protocol is used to establish the VoIP call (e.g. SIP or H.323).
Sequence number: The sequence number starts at a random value and is incremented with each RTP packet sent. This helps to identify packets received out of sequence.
Timestamp: Similar to the sequence number above, the timestamp is initialized with a random value. The clock frequency depends on the payload type. With the most common narrow-band audio codecs, the frequency is 8000 Hz, and the timestamp is the tick count at the moment the first audio sample in the payload was taken.
Synchronization Source Identifier (SSRC): A 32-bit identifier of the audio/video stream producer. In a special situation, the stream can be produced by a mixer from several streams. The IDs of the contributing sources can be listed in the CSRC fields, and the CC field gives the number of contributing sources. However, you will not see this used very often in practice.
In the most typical situation (no CSRC fields, no header extension), the RTP header consists of 12 bytes.
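For illustration, the fixed 12-byte header can be unpacked with a few lines of Python. This is a sketch only: header extensions, CSRC lists and error handling are omitted, and the function name is mine.

  import struct

  def parse_rtp_header(packet):
      """Parse the fixed 12-byte RTP header and return its fields."""
      if len(packet) < 12:
          raise ValueError("packet too short for an RTP header")
      b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
      return {
          "version":      b0 >> 6,          # should always be 2
          "padding":      (b0 >> 5) & 0x01,
          "extension":    (b0 >> 4) & 0x01,
          "csrc_count":   b0 & 0x0F,        # the CC field
          "marker":       b1 >> 7,
          "payload_type": b1 & 0x7F,        # e.g. 0 = G.711 uLaw, 8 = G.711 ALaw
          "sequence":     seq,
          "timestamp":    ts,
          "ssrc":         ssrc,
      }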
 
Real Time Control Protocol
RTCP accompanies RTP and is used to transmit control information about the RTP session. RTCP packets are sent only from time to time, since there is a recommendation that the RTCP traffic should consume less than 5 percent of the session bandwidth.
The most important content types carried in RTCP packets include:
  • information about call participants (for example, name and e-mail address)
  • statistics about the quality of the transmission (for example inter-arrival jitter and the number of lost packets). The report sent by a participant who both sends and receives data is called a sender report (SR), while reports sent by participants who only receive RTP streams are called receiver reports (RR).
There is a rule that RTP should use an even UDP port number (e.g. 5000) and the related RTCP should use the next odd port (e.g. 5001).
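In code, following this convention simply means binding the two UDP sockets as a pair. A minimal Python sketch follows; the address and port number are examples only.

  import socket

  RTP_PORT = 5000                                # must be an even number

  rtp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  rtcp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  rtp_sock.bind(("0.0.0.0", RTP_PORT))           # RTP on the even port
  rtcp_sock.bind(("0.0.0.0", RTP_PORT + 1))      # RTCP on the next odd port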
