Tampereen teknillinen yliopisto. Julkaisu 973
Tampere University of Technology. Publication 973

Igor Danilo Diego Curcio

QoS Aspects of Mobile Multimedia Applications

Thesis for the degree of Doctor of Science in Technology to be presented with due permission for public examination and criticism in Tietotalo Building, Auditorium TB109, at Tampere University of Technology, on the 13th of June 2011, at 12 noon.

Tampereen teknillinen yliopisto - Tampere University of Technology
Tampere 2011



ISBN 978-952-15-2593-3 ISSN 1459-2045


Thesis Supervisor:
Prof. Moncef Gabbouj
Department of Signal Processing
Faculty of Computing and Electrical Engineering
Tampere University of Technology
Tampere, Finland

Pre-Examiners:
Prof. Pascal Frossard
Electrical Engineering Institute
École Polytechnique Fédérale de Lausanne (EPFL)
Lausanne, Switzerland

Prof. Paolo Bellavista
Dipartimento di Informatica, Elettronica e Sistemistica (DEIS)
Università di Bologna
Bologna, Italy

Opponents:
Prof. C.-C. Jay Kuo
Department of Electrical Engineering
University of Southern California
Los Angeles, CA, U.S.A.

Prof. Jussi Kangasharju
Department of Computer Science
Faculty of Science
University of Helsinki
Helsinki, Finland


Abstract

Multimedia technologies have emerged in recent years because of users’ constantly growing need for information in multimedia form (speech, audio, visual, etc.), rather than purely in textual form. A variety of applications and services that use multimedia content have been deployed on the Internet. Among the most popular applications and services that use real-time multimedia content are YouTube and Skype. However, these applications do not offer perfect Quality of Service (QoS) in every networking scenario. In fact, even today it is not uncommon to experience glitches and interruptions while watching a YouTube video on a PC, despite a high-speed network connection (e.g., a home ADSL line of several Mbps). Similarly, it is not uncommon to have a Skype video call in which the audio or video occasionally freezes or degrades, or the call is dropped entirely. The same application scenarios become even more challenging over mobile networks because of the nature of the physical layer. This creates the need for further research on optimizing real-time multimedia applications for wireless environments.

This thesis is about QoS aspects of mobile multimedia applications. The applications considered are mobile multimedia telephony and multimedia streaming. In addition, a new type of mobile application can be obtained by merging these two. This new application, called Mobile Interactive Social TV, stems from the merging of the paradigms of real-time voice, video telephony and multimedia streaming.

One of the aspects presented in this thesis is the matching between network capabilities and application requirements, through the analysis of different issues (such as bandwidth, delay, error rates, handovers, etc.) and their impact on the applications. The design, implementation and deployment of these applications over mobile networks present several technical challenges in terms of QoS.

Circuit-switched and packet-switched architectures for mobile multimedia telephony are analyzed, along with their challenges and solutions. For example, processing and network delays should be minimized so that the end user experiences a real feeling of interactivity with the other party in the same session. When end-to-end delays are variable, and when several media are transmitted (e.g., audio and video), lip-synchronization of the different media may be a challenge, and the results in this thesis show that there is a maximum user tolerance in terms of media skew, which differs from the tolerance in traditional TV systems. Also, when a media session is established, the start-up delay is an important factor. For instance, as in traditional circuit-switched phone calls, the dial-to-ring delay is critical in determining the overall experience in the user’s mind. These and other QoS issues and improvement methods are studied with simulation results using specific QoS metrics.
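The notion of media skew mentioned above can be made concrete with a small sketch: the skew is the difference between the playout times of matching audio and video units, and a player can flag when it leaves the user's tolerance window. The tolerance values below are illustrative placeholders only, not the experimentally derived thresholds reported in this thesis.

```python
# Sketch of a lip-synchronization check. The tolerance window values are
# hypothetical placeholders, not the measured tolerances from this thesis.

AUDIO_LEAD_TOL_MS = 80.0   # audio ahead of video (illustrative)
AUDIO_LAG_TOL_MS = 120.0   # audio behind video (illustrative)

def media_skew_ms(audio_playout_ms: float, video_playout_ms: float) -> float:
    """Skew between media units that should play together.
    Positive: audio leads video; negative: audio lags video."""
    return video_playout_ms - audio_playout_ms

def in_sync(audio_playout_ms: float, video_playout_ms: float) -> bool:
    """True if the skew is within the (asymmetric) tolerance window."""
    skew = media_skew_ms(audio_playout_ms, video_playout_ms)
    if skew >= 0:                          # audio leads
        return skew <= AUDIO_LEAD_TOL_MS
    return -skew <= AUDIO_LAG_TOL_MS      # audio lags

print(in_sync(1000.0, 1050.0))  # 50 ms skew: within tolerance
print(in_sync(1000.0, 1200.0))  # 200 ms skew: out of tolerance
```

The asymmetric window reflects the common observation that viewers tolerate audio lagging video better than audio leading it; the exact bounds for mobile environments are among the results of this thesis.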

Mobile media streaming is the second type of application in the focus of this thesis. Several use cases are considered, along with their media traffic characteristics. The main QoS issues for mobile streaming are tackled, and some of them are addressed with QoS improvement methods (e.g., robust cell reselections). Mobile streaming performance over GPRS, EGPRS and WCDMA is assessed and results are presented.

Mobile multimedia telephony and streaming applications may be deployed over guaranteed bit rate bearers, which ensure that the required bandwidth for the media streams is available throughout the lifetime of a session. However, it depends primarily on the mobile operator whether or not to allow the usage of these bearers for such multimedia applications. Often, only best-effort bearers are available to non-premium users; in this case, the network bandwidth for each user may vary over time. Mobile streaming applications do not generally have very stringent real-time requirements. However, like multimedia telephony applications, they do require a guaranteed bandwidth in order to perform optimally. Whenever this is not available, bit rate adaptation techniques must be used to cope with bandwidth variability and unavailability. Several adaptation models are possible, and methods for mobile media adaptation for multimedia telephony and streaming are presented together with their performance results. The Geo-Predictive adaptation method presented in this thesis currently represents the state of the art in context-based media adaptation.
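The core idea of bit rate adaptation over a best-effort bearer can be sketched as a loop that backs off quickly when the estimated throughput drops and probes upward cautiously when headroom reappears. All constants (rate bounds, safety margin, up-step) below are assumptions for illustration, not the adaptation algorithms evaluated in this thesis.

```python
# Minimal sketch of sender-side bit rate adaptation for a media stream
# over a best-effort bearer. The constants are illustrative assumptions,
# not parameters of the adaptation methods presented in this thesis.

MIN_RATE_KBPS = 32.0    # lowest usable media encoding rate (assumed)
MAX_RATE_KBPS = 384.0   # highest media encoding rate (assumed)
SAFETY_MARGIN = 0.85    # use only part of the estimated throughput
UP_STEP_KBPS = 16.0     # cautious additive increase per adaptation step

def adapt_rate(current_kbps: float, estimated_throughput_kbps: float) -> float:
    """Return the next media encoding rate given a throughput estimate."""
    target = SAFETY_MARGIN * estimated_throughput_kbps
    if current_kbps > target:
        new_rate = target                      # back off immediately
    else:
        new_rate = current_kbps + UP_STEP_KBPS  # probe upward slowly
    return max(MIN_RATE_KBPS, min(MAX_RATE_KBPS, new_rate))

rate = adapt_rate(128.0, 200.0)   # headroom available: 128 -> 144 kbps
rate = adapt_rate(rate, 100.0)    # throughput drop: back off to 85 kbps
```

The asymmetry (fast decrease, slow increase) is a common design choice in media rate control: underestimating the channel briefly costs some quality, while overestimating it causes buffer underflow or packet loss.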

Finally, a new experimental application, Mobile Interactive Social TV, which combines social interaction with the TV watching experience, is introduced in this thesis along with different deployment scenarios and the first user experience results.


Acknowledgments

The research work included in this thesis covers a total of five years within the period 2001-2010, during which I worked at Nokia Corporation in several R&D units (Nokia Research Center, Mobile Phones, Mobile Software, Technology Platforms) in the field of mobile multimedia applications research and standardization. I would like to thank Nokia and my past managers for continuously supporting and funding my Ph.D. studies over the years. In particular, I am grateful to my current manager, Dr. Jyri Huopaniemi, for allowing me to walk the last mile of this journey in a smooth way. Part of this thesis has been written with the support of the Nokia Foundation, which greatly helped me to complete the work with the right concentration and mindset.

My sincere acknowledgement goes to my supervisor, Prof. Moncef Gabbouj of the Department of Signal Processing, Faculty of Computing and Electrical Engineering, Tampere University of Technology, for his endless patience and unbeatable professional academic guidance. I would also like to thank Dr. Petri Haavisto for guiding me as supervisor during the first years of research. Thanks to the pre-examiners of this thesis, Prof. Pascal Frossard (Electrical Engineering Institute, École Polytechnique Fédérale de Lausanne) and Prof. Paolo Bellavista (DEIS, Università di Bologna), for their precious comments that helped increase the quality of the dissertation.

This work would not have been possible without the contribution of the co-authors of my publications, many of whom are still my colleagues. Special thanks to Miikka Lundan, David Léon, Sujeet Mate, Ville Lappalainen, Vinod K.M. Vadakital, Varun Singh, Prof. Jörg Ott, Miraj-E-Mostafa, Juha Kalliokulju and Miska Hannuksela. With them I had truly constructive discussions that sparked great ideas during many days and nights of work. I will not forget to mention Marko Luomi, whom I wish to thank for hiring me at Nokia Research Center in 1998 and for making all this possible. Thanks also to Ari Hourunranta for being my tutor and introducing me to the 3G-324M technology at that time, and to Viktor Varsa and David Léon for inspiring and contributing ideas on PSS and rate adaptation during the standardization period in the 3GPP SA4 Working Group. Thanks also to Francesco Cricrí for developing part of the Mobile Interactive Social TV system.


Last, I would like to express my gratitude to my mother, for always encouraging and supporting me to pursue my intellectual ambitions in life. This thesis is dedicated to her.

Tampere, 13 June 2011

Igor Danilo Diego Curcio

“Istud quod tu summum putas gradus est”

(Seneca, 62-65 A.D.)


Contents

ABSTRACT

ACKNOWLEDGMENTS

CONTENTS

LIST OF PUBLICATIONS

LIST OF FIGURES

LIST OF TABLES

LIST OF ABBREVIATIONS

1. INTRODUCTION
   1.1. OBJECTIVES AND SCOPE OF THE RESEARCH
   1.2. AUTHOR’S CONTRIBUTION TO THE PUBLICATIONS
   1.3. ORGANIZATION OF THE THESIS

2. MOBILE NETWORKS
   2.1. WIRED AND WIRELESS NETWORKS
   2.2. CIRCUIT-SWITCHED MOBILE NETWORKS
        2.2.1. GSM
        2.2.2. HSCSD, ECSD and CS UMTS
   2.3. PACKET-SWITCHED MOBILE NETWORKS
        2.3.1. General Packet Radio Service (GPRS)
        2.3.2. Enhanced GPRS (EGPRS) and GERAN improvements
        2.3.3. UMTS
        2.3.4. IMS and HSDPA
   2.4. QOS OF CIRCUIT-SWITCHED AND PACKET-SWITCHED NETWORKS
        2.4.1. GSM, HSCSD and ECSD
        2.4.2. GPRS
        2.4.3. UMTS

3. APPLICATIONS REQUIREMENTS AND NETWORK CAPABILITIES
   3.1. MOBILE MULTIMEDIA APPLICATIONS PROPERTIES
        3.1.1. Mobile multimedia telephony
        3.1.2. Mobile multimedia streaming
   3.2. APPLICATIONS QOS ISSUES AND MOBILE NETWORK ASPECTS
        3.2.1. Bandwidth
        3.2.2. Error rates and delivery of erroneous packets
        3.2.3. Delivery order
        3.2.4. Delay
        3.2.5. Delay jitter
        3.2.6. Handovers and cell changes
        3.2.7. Segmentation issues
   3.3. RECOMMENDED NETWORK CHANNELS FOR MOBILE MULTIMEDIA APPLICATIONS

4. MOBILE MULTIMEDIA TELEPHONY
   4.1. MOBILE MULTIMEDIA TELEPHONY ARCHITECTURES AND SERVICES
        4.1.1. Circuit-Switched multimedia telephony
        4.1.2. Packet-Switched multimedia telephony
   4.2. MEDIA TRAFFIC CHARACTERISTICS
        4.2.1. 3G-324M traffic
        4.2.2. MTSI traffic
   4.3. PDP CONTEXTS CONSIDERATIONS
        4.3.1. Number of PDP contexts
   4.4. MOBILE MULTIMEDIA TELEPHONY QOS METRICS
        4.4.1. Frame-based QoS metrics
        4.4.2. PSNR-based QoS metrics
        4.4.3. Delay-based QoS metrics
        4.4.4. Service flexibility-based QoS metrics
        4.4.5. Call control-based QoS metrics
        4.4.6. Other QoS metrics
   4.5. MULTIMEDIA TELEPHONY QOS IMPROVEMENTS
        4.5.1. Bit errors or packet loss handling
        4.5.2. Delay optimization
        4.5.3. Jitter buffer management
        4.5.4. Inter-media synchronization
        4.5.5. Packetization overheads
        4.5.6. Session control signaling delay

5. MOBILE MEDIA STREAMING
   5.1. MOBILE STREAMING ARCHITECTURES AND SERVICES
        5.1.1. Classification
        5.1.2. The PSS Standard
   5.2. MEDIA TRAFFIC CHARACTERISTICS
        5.2.1. Content creation and distribution
        5.2.2. Media content and rate controls
        5.2.3. Speech streaming traffic
        5.2.4. Video streaming traffic
        5.2.5. Other traffic
   5.3. PDP CONTEXTS CONSIDERATIONS
   5.4. STREAMING QOS METRICS
        5.4.1. QoE metrics for PSS
   5.5. MOBILE STREAMING QOS IMPROVEMENTS
        5.5.1. Content creation
        5.5.2. Packet loss handling
        5.5.3. Session control signaling delay
        5.5.4. Receiver buffer management
        5.5.5. Packetization overheads and optimal packet sizes
        5.5.6. Robust cell reselection management
        5.5.7. Optimization in the lower protocol layers

6. MOBILE MEDIA ADAPTATION
   6.1. PROBLEM STATEMENT
        6.1.1. Bit rate evolution plots and the STRP model
   6.2. ADAPTATION MODELS
        6.2.1. Architecture-based adaptation models
        6.2.2. Time-based adaptation models
        6.2.3. Responsibility split in rate adaptation management
   6.3. BASIC END-TO-END SIGNALING SUPPORT
        6.3.1. Application awareness of network QoS
        6.3.2. RTCP
   6.4. MEDIA ADAPTATION FOR MOBILE MULTIMEDIA TELEPHONY
        6.4.1. Sender-driven adaptation
        6.4.2. Receiver-driven adaptation
        6.4.3. Network-driven adaptation
   6.5. MEDIA ADAPTATION FOR MOBILE STREAMING
        6.5.1. Server-driven adaptation
        6.5.2. Buffering aspects
        6.5.3. Co-operative adaptation
        6.5.4. Client-driven adaptation
        6.5.5. Network-driven adaptation
        6.5.6. Geo-Predictive adaptation
        6.5.7. Implications of packet retransmission on media adaptation

7. MOBILE AND INTERACTIVE SOCIAL TELEVISION
   7.1. FUSING DIFFERENT APPLICATION PARADIGMS
   7.2. INTERACTION MODALITIES
   7.3. CONTENT AND INTERACTION MIXING ARCHITECTURES
        7.3.1. Centralized mixing architecture
        7.3.2. Endpoint mixing architecture
   7.4. PROOF-OF-CONCEPT SYSTEM
   7.5. USER EXPERIENCE
   7.6. SESSION MOBILITY

8. CONCLUSIONS AND FUTURE WORK
   8.1. FUTURE DEVELOPMENTS

BIBLIOGRAPHY

PUBLICATIONS


List of Publications

This thesis consists of a summary part and the following publications. In the summary part, the publications are referred to as [P1], [P2], etc.

[P1] Igor D.D. Curcio, Ville Lappalainen, Miraj-E-Mostafa, “QoS Evaluation of 3G-324M Mobile Videophones over WCDMA Networks”, Computer Networks, Elsevier, Vol. 37, No. 3-4, 5 Nov. 2001, pp. 425-445.

[P2] Igor D.D. Curcio, Miikka Lundan, “SIP Call Setup Delay in 3G Networks”, Proc. 7th IEEE Symposium on Computers and Communications (ISCC '02), 1-4 Jul. 2002, Taormina-Giardini Naxos (Italy), pp. 835-840.

[P3] Igor D.D. Curcio, Miikka Lundan, “Human Perception of Lip Synchronization in Mobile Environment”, Proc. 8th IEEE Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM ’07), 18-21 Jun. 2007, Helsinki, Finland.

[P4] Miikka Lundan, Igor D.D. Curcio, “Mobile Streaming Services in WCDMA Networks”, Proc. IEEE International Symposium on Computers and Communications (ISCC ’05), 27-30 Jun. 2005, Cartagena, Murcia, Spain, pp. 231-236.

[P5] Miikka Lundan, Igor D.D. Curcio, “Optimal 3GPP Packet-switched Streaming Service (PSS) over GPRS”, Multimedia Tools and Applications Journal, Vol. 35, No. 3, Dec. 2007, pp. 285-310.

[P6] Igor D.D. Curcio, David Léon, “Application Rate Adaptation for Mobile Streaming”, Proc. IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM ’05), 13-16 Jun. 2005, Taormina/Giardini Naxos, Italy, pp. 66-71.

[P7] Igor D.D. Curcio, David Léon, “Evolution of 3GPP Streaming for Improving QoS over Mobile Networks”, Proc. IEEE International Conference on Image Processing (ICIP ’05), Genova, Italy, 11-14 Sep. 2005, Vol. III, pp. 692-695.


[P8] Igor D.D. Curcio, Juha Kalliokulju, Miikka Lundan, “AMR Mode Selection Enhancement in 3G Networks”, Multimedia Tools and Applications Journal, Vol. 28, No. 3, Mar. 2006, pp. 259-281.

[P9] Varun Singh, Jörg Ott, Igor D.D. Curcio, “Rate Adaptation for Conversational 3G Video”, Proc. 2nd International Workshop on Mobile Video Delivery (MoViD), (in conjunction with the 28th IEEE Conference on Computer Communications (INFOCOM ‘09)), 24 Apr. 2009, Rio de Janeiro, Brazil.

[P10] Igor D.D. Curcio, Vinod K.M. Vadakital, Miska M. Hannuksela, “Geo-Predictive Real Time Media Delivery in Mobile Environment”, Proc. 3rd ACM International Workshop on Mobile Video Delivery (MoViD) (in conjunction with 18th ACM Multimedia Conference 2010), 25 Oct. 2010, Firenze, Italy.

[P11] Sujeet Mate, Igor D.D. Curcio, “Mobile and Interactive Social Television”, IEEE Communications Magazine, Vol. 47, No. 12, Dec. 2009, pp. 116-122.


List of Figures

Figure 1. The three axes scheme for wireless systems
Figure 2. Protocol stack for a sender and a receiver
Figure 3. Simplified architecture of a mobile network
Figure 4. GPRS user plane protocol stack
Figure 5. User plane protocol stack for UMTS networks (Iu-PS mode)
Figure 6. A typical mobile multimedia telephony system
Figure 7. System architecture of 3G-324M terminals
Figure 8. MTSI client protocol stack
Figure 9. Delay components in 3G-324M terminals
Figure 10. A typical mobile multimedia streaming system
Figure 11. PSS protocol stack
Figure 12. Bit rate variation for RealNetworks streaming over different network scenarios
Figure 13. Bandwidth repartition among payload and headers for different IP packet sizes
Figure 14. Buffer status under a cell reselection event
Figure 15. The rate adaptation problem
Figure 16. A mobile media streaming buffer model
Figure 17. Media data flow in the endpoint MIST mixing architecture


List of Tables

Table 1. A comparison of Circuit-Switched and Packet-Switched networks
Table 2. QoS profile for GPRS networks
Table 3. Residual error probabilities for reliability classes in GPRS networks
Table 4. QoS profile for UMTS networks
Table 5. Value ranges for UMTS bearer QoS profile attributes
Table 6. Delay bounds for multimedia telephony applications
Table 7. HSCSD RLP key parameters for non-transparent mode
Table 8. LLC key parameters for GPRS/GERAN
Table 9. Speech and video codecs supported by MTSI clients
Table 10. Most used symbols
Table 11. Total delays for H.263+ video on 3G-324M
Table 12. Lip synchronization thresholds for mobile environment
Table 13. Classification of use cases for unicast streaming
Table 14. Speech, audio and video codecs supported by PSS systems
Table 15. Speech traffic characteristics for streaming
Table 16. Packet size statistics (in bytes) for different rate controls and packetizations
Table 17. Streaming bit rate statistics (in kbps) for different rate controls
Table 18. Session control signaling delays (in seconds) for GPRS, EGPRS and UTRAN
Table 19. Maximum media bit rates after lower layer retransmissions and protocol headers
Table 20. Performance results for NOR, RAT and GPT


List of Abbreviations

16QAM 16-state Quadrature Amplitude Modulation
3G-324M 3G Mobile terminal based on ITU-T H.324
3GP 3GPP file format
3GPP(2) Third Generation Partnership Project (2)
8-PSK Octagonal Phase Shift Keying
AAC Advanced Audio Coding
ABU Available Bandwidth Utilization
ACK ACKnowledged or Acknowledgement
ADSL Asymmetric Digital Subscriber Line
ADU Application Data Unit
AIUR Air Interface User Rate
AL Adaptation Layer
AMBR Aggregated MBR
AMPS Advanced Mobile Phone System
AMR Adaptive Multi-Rate
AMR-WB AMR WideBand
ANSI American National Standards Institute
APP APPlication-defined RTCP packet
ARIB Association of Radio Industries and Businesses
ARQ Automatic Repeat reQuest
ASD Answer-Signal Delay
ASM Adaptive Stream Management
AVPF Audio-Video Profile with Feedback
B (frame) Bi-directional predicted video frame
BAC BAckward error Correction
BER Bit Error Rate
BLER BLock Error Rate
Bph Bytes per hour
Bps Bytes per second
BS Base Station
BSC BS Controller
BSS Base Station Subsystem
BTS Base Transceiver Station
CBR Constant Bit Rate
CBRP CBR Packet transmission


CDF Cumulative Distribution Function
CDMA Code Division Multiple Access
cdmaOne 2G mobile system (a.k.a. IS-95)
CIF Common Interchange Format (352x288 pixel resolution)
CN Core Network
C-NADU Conversational NADU
CPU Central Processing Unit
CR Cell Reselection
CRC Cyclic Redundancy Check
CRD Call-Release Delay
CS Circuit-Switched
CS-X(..Y) Coding Scheme X (up to coding scheme Y)
CSD CS Data
CSR Call Success Rate
D-AMPS Digital AMPS (a.k.a. IS-54 and IS-136 (or ANSI-136))
dB Decibels
DL DownLink
DLNA Digital Living Network Alliance
DLR Delta Loss Rate
DSP Digital Signal Processing
DVB-H Digital Video Broadcast – Handheld
ECN Explicit Congestion Notification
ECSD Enhanced CSD
EDGE Enhanced Data rates for GSM Evolution
EGPRS Enhanced GPRS
ETACS Extended TACS
EV-DO Evolution Data Optimized
FBS Free Buffer Space
FEC Forward Error Correction
FIR Full Intra Request
FOMA Freedom of Mobile Multimedia Access
fps frames per second
GBR Guaranteed Bit Rate
GERAN GSM/EDGE RAN
GGSN Gateway GPRS Support Node
GMSK Gaussian Minimum Shift Keying
GOB Group Of Blocks
GPRS General Packet Radio Service
GPT Geo-Predictive Transmission
GSM Global System for Mobile communications
GTP GPRS Tunneling Protocol
HARQ Hybrid ARQ
HO HandOver
HSCSD High Speed CSD
HSDPA High Speed Downlink Packet Access
HSN Highest SN
HSPA High Speed Packet Access


HTTP HyperText Transfer Protocol
IETF Internet Engineering Task Force
I (frame) Intra-coded video frame
IMS IP Multimedia Subsystem
IMT International Mobile Telecommunications
IP(v4/v6) Internet Protocol version 4 or version 6
IR Incremental Redundancy
ISDN Integrated Services Digital Network
ITU(-T) International Telecommunications Union (Telecommunications sector)
KB Kilo Bytes
kbps kilo bits per second
km kilometers
LC Low Complexity
LLC Logical Link Control
LTE Long Term Evolution
LTP Long Term Prediction
MAC Medium Access Control
MAC-hs MAC high speed
Mbps Mega bits per second
MBMS Multimedia Broadcast Multicast Service
MBR Maximum Bit Rate
(M)CS (Modulation) and Coding Scheme
MIST Mobile and Interactive Social Television
MML Mobile Multilink Layer
MMS Multimedia Messaging Service
MONA Media Oriented Negotiation Acceleration
MP4 MPEG-4 file format
MPD Media Presentation Description
MPEG Motion Picture Expert Group
MS Mobile Station
ms milliseconds
MSE Mean Square Error
MSS Multimedia Streaming Service
MTL Mobile-To-Land
MTM Mobile-To-Mobile
MTS Mobile Telephone System
MTSI Multimedia Telephony Service over IMS
MTU Maximum Transfer Unit
MUX MultipleXer
NACC Network Assisted Cell Change
NACK Negative ACKnowledgement
NADU Next ADU APP RTCP packet
NAL Network AL
NAT Network Address Translator
NCCR Network Controlled Cell Reselection
NMT Nordic Mobile Telephone


NOR NO Rate adaptation transmission
NSN Next SN
NTP Network Time Protocol
NTT Nippon Telegraph and Telephone
NUN Next Unit Number
OBSN Oldest Buffered SN
P(t) or P Playout curve
PC Personal Computer
PDC Personal Digital Cellular
PDCP Packet Data Convergence Protocol
PDD Post-Dialing Delay
PDF Probability Distribution Function
PDP Packet Data Protocol
PDU Protocol Data Unit
PLI Picture Loss Indication
PLR Packet Loss Rate
PoC Push to talk over Cellular
PPP Point to Point Protocol
pps Packets per second
PS Packet-Switched
PSC Picture Start Code
PSNR Peak Signal to Noise Ratio
PSS Packet-switched Streaming Service
PSTN Public Switched Telephone Network
QCIF Quarter CIF (176x144 pixel resolution)
QoE Quality of Experience
QoS Quality of Service
QP Quantization Parameter
R(t) or R Reception curve
RAN Radio Access Network
RAT Rate Adaptation Transmission
RBER Residual BER
RLC Radio Link Control
RLP Radio Link Protocol
RNC Radio Network Controller
ROHC Robust Header Compression
RR Receiver Report
RTCP Real-time Transport Control Protocol
RTCP XR RTCP eXtended Reports
RTMP Real-Time Messaging Protocol
RTP Real-time Transport Protocol
RTSP Real Time Streaming Protocol
RTT Round Trip Time
SAP Service Access Point
SCTP Stream Control Transmission Protocol
SD Standard Definition
SDP Session Description Protocol


SDPCapNeg SDP Capability Negotiation
SDU Service Data Unit
s seconds
S(t) or S Sampling curve
S60 Series 60 Nokia phones
SGSN Serving GPRS Support Node
SIP Session Initiation Protocol
SLI Slice Loss Indication
SMIL Synchronized Multimedia Integration Language
SMS Short Message Service
SN Sequence Number
SNDCP SubNetwork Dependent Convergence Protocol
SQCIF Sub-QCIF (128x96 pixel resolution)
SR Sender Report
STRP Sampling, Transmission, Reception, Playback curve model
SYN SYNchronize sequence numbers control flag in TCP
T Transparent
T(t) or T Transmission curve
TACS Total Access Communications System
TC Traffic Class
TCP Transmission Control Protocol
TDMA Time Division Multiple Access
TFRC TCP Friendly Rate Control
TMMBN/R Temporary Maximum Media stream Bit rate Notification/Request
TMMBR-A TMMBR network Assisted
TMMBR-U TMMBR Unassisted
TMN Test Model Near term
TS Time Slot
TTI Transmission Time Interval
UDP User Datagram Protocol
UE User Equipment
UEP Unequal Error Protection
UGC User Generated Content
UL UpLink
UMTS Universal Mobile Telecommunications System
UNACK UNACKnowledged
URL Uniform Resource Locator
UTRAN UMTS Terrestrial RAN
VANET Vehicular Ad-hoc NETworks
VBR Variable Bit Rate
VBRP VBR Packet transmission
VCR Video Cassette Recorder
VoIP Voice over IP
VSS Virtual Shared Space
WCDMA Wideband CDMA
WiMAX Worldwide interoperability for Microwave Access
WLAN Wireless Local Area Network


WNSRP Windowed Numbered Simple Retransmission Protocol
WUSB Wireless Universal Serial Bus


Chapter 1

Introduction

Mobile communications, Internet connectivity, and multimedia technologies are progressively merging into a single paradigm of personal and social communication. Mobile communications technologies derive from users' increasing need to have information available anytime, anywhere. Internet connectivity puts an ever increasing amount of information resources at users' disposal, including applications and services such as searching, browsing, e-mail, e-commerce and those that are socially oriented. Multimedia technologies have emerged in recent years, as users desire to have more information in multimedia form (audio, visual, etc.), rather than purely in textual form.

Mobile networks have been developed since the last century to enable users to make phone calls in full mobility. These systems have evolved from the Zero Generation (0G) analog phones (e.g., the Mobile Telephone System (MTS) developed in the U.S.A. in 1946 [191]), to the First Generation (1G) of analog networks (such as AMPS, (E)TACS, NMT, and NTT) in the 1980s, to the Second Generation (2G) of digital networks (such as GSM, PDC, D-AMPS, cdmaOne) in the early 1990s. Digital networks offer better data services and more advanced roaming capabilities than the analog systems. Furthermore, digital mobile networks have evolved to offer more advanced services for circuit- and packet-switched data transmission. Those networks are commonly referred to as 2.5G networks and were introduced around 1997 (e.g., HSCSD, GPRS and CDMA2000 1xRTT). Further developments led to the 2.75G networks (e.g., EDGE allows bit rates up to 473.6 kbps). Third Generation (3G) networks (e.g., UMTS, CDMA2000 1xEV-DO, FOMA) are able to carry even multimedia traffic at higher bit rates (up to 2.4 Mbps) and have been deployed since 2001. Further developments called 3.5G (HSPA, EV-DO Rev. A and B), 3.75G (HSPA+) and 3.9G (LTE) brought peak speeds, respectively, to 14.7 Mbps, 84 Mbps and 100 Mbps in


2007, 2008 and 2009. 4G networks (e.g., LTE Advanced) will have peak data rates of up to 1 Gbps. Their standard specifications are planned to be completed in 2011. Some views on future 5G to 7G networks are presented in [144].

Recent advances in media compression technology have made possible the transmission of real-time media over low bit rate links. However, the deployment of high-quality mobile media presents a number of technical challenges. Media processing, including compression and decompression, is CPU intensive. This, combined with the constraints of a mobile device, means that the DSP platform must be of limited size and weight, yet still be capable of processing a large quantity of data, possibly in real time.

There are three orthogonal forces that impose constraints on wireless systems (see Figure 1). The error rate is inherently present in wireless network systems. Reducing the error rate also means reducing the available bandwidth and increasing the delay. Guaranteed-QoS networks may offer guaranteed error rates; in many cases, however, networks are just best effort and the error rates are variable. Applications and services are bandwidth hungry. More bandwidth also means higher error rates and shorter delays (e.g., fewer retransmissions imply lower delays and more available bandwidth). The last variable, delay, is critical in conversational applications. Users do not tolerate network and application latencies, so it is desirable to achieve the shortest possible delays; these, however, imply higher error rates.

This three-axis system must be kept well in balance when designing a mobile network and its applications. For example, in order to cope with high error rates and deliver high-bandwidth, low-delay applications, efficient error-resilience techniques must be built into the applications to recover from errors that occur during data transmission. If the bandwidth is variable, then efficient methods for bit rate adaptation should be implemented in the applications, with the objective of delivering the best user experience at any time. If delays are variable, then optimal delay jitter buffering schemes are required. If, on the other hand, all three axes can change over time, yielding variable error rates, delays and bandwidth, the problem for an application becomes multidimensional, and the programming logic becomes more complex.

Figure 1. The three axes scheme for wireless systems (axes: error rate, bandwidth, delay)
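The retransmission remark above can be made concrete with a small numerical sketch (purely illustrative; the per-transmission loss probability, round-trip time and truncated-ARQ model are assumptions, not figures from this thesis): allowing more link-layer transmission attempts lowers the residual loss rate, but each extra attempt adds delay and consumes bandwidth.

```python
# Hypothetical illustration of the error-rate / delay / bandwidth trade-off:
# each extra link-layer retransmission lowers the residual loss rate, but
# adds delay and consumes bandwidth. All numbers are illustrative.

def residual_loss(p_loss: float, max_tx: int) -> float:
    """Probability a packet is still lost after max_tx transmission attempts."""
    return p_loss ** max_tx

def added_delay_ms(p_loss: float, max_tx: int, rtt_ms: float) -> float:
    """Expected extra delay from retransmissions (truncated ARQ)."""
    # Expected number of extra attempts before success or give-up.
    extra = sum(p_loss ** k for k in range(1, max_tx))
    return extra * rtt_ms

p = 0.10  # assumed per-transmission loss probability on the radio link
for n in (1, 2, 3, 4):
    print(n, residual_loss(p, n), round(added_delay_ms(p, n, 200.0), 1))
```

With these assumed numbers, going from one to three attempts cuts the residual loss by two orders of magnitude at the cost of roughly one extra round-trip time of delay.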


In this challenging framework, in recent years the research and standardization community has helped implementers by providing a set of tools for developing multimedia applications that offer an adequate Quality of Service (QoS). This is generically defined by 3GPP as “the collective effect of service performances which determine the degree of satisfaction of a user of a service” [25]. In this thesis, the terms QoS and QoE (Quality of Experience) will be regarded as equivalent.

Despite the research and implementation efforts, recent product deployments do not offer a perfect QoS. For example, it is not uncommon to experience glitches and interruptions when watching a YouTube video on a PC, despite a high-speed network connection (e.g., a home ADSL of several Mbps). Similarly, it is not uncommon to run into a Skype video call where the audio or video freezes, or the call is dropped completely. It is intuitive that the same application scenarios become even more challenging over mobile networks. This generates the need for further research into optimizing real-time multimedia applications for wireless environments.

1.1. OBJECTIVES AND SCOPE OF THE RESEARCH

The focus of this thesis is the research of various aspects of mobile multimedia applications that impact the user QoS. The point of view is that of the protocols. The rest of the thesis will therefore be heavily centered on protocols, their properties, algorithms and performance. Figure 2 depicts a generic Internet protocol stack according to [140].

Figure 2. Protocol stack for a sender and a receiver

In the figure, the sender could be a mobile terminal or a server behind a wired network connection. The receiver is always a mobile terminal (e.g., a mobile phone) connected via a mobile network. The protocol layers will not be analyzed in detail; only the aspects relevant in the context of mobile multimedia applications are in the scope of the thesis.

The approach will be both top-down and bottom-up. The top-down approach takes the point of view of an application developer who might not be familiar with the lower-layer protocol aspects. The bottom-up approach takes the view of a lower-layer protocol engineer

[Figure 2 shows, for both sender and receiver, a five-layer stack (application, transport, network, data link, physical) with media codecs (audio, video, etc.) at the application layer, TCP, UDP, RTP, SIP, SDP, HTTP and RTSP as session and transport protocols, IP at the network layer, and RLC/MAC at the data link layer.]


who might have limited knowledge of the application-layer aspects. The goal is to build end-to-end knowledge for delivering the best user experience.

This thesis is about QoS aspects of mobile multimedia applications. The term QoS has a broad meaning, and the space of applications has grown over the last decade. The mobile multimedia application components considered here are real-time voice, video telephony and multimedia streaming. Real-time voice is within the scope of this thesis (even though it could not strictly be classified as a multimedia application, since only a single medium is involved). The term ‘components’ has a specific meaning, because it is possible to design new types of applications from these basic components. For instance, also within the scope of this thesis is a new type of application called Mobile and Interactive Social TV, which can be defined as a combination of real-time voice, video telephony and multimedia streaming.

The design, implementation and deployment of these applications over mobile networks present several technical challenges in terms of QoS. For example, low-delay multimedia applications such as multimedia telephony should be implemented in such a way that the processing and network delays are minimized, so that the end user experiences a real feeling of interactivity with the other party connected to the same session. These aspects will be treated in this thesis. When end-to-end delays are variable, and when several media are transmitted (e.g., audio and video), the lip synchronization of the different media is a challenge; results have shown that there is a maximum user tolerance in terms of media skew, which differs from the tolerance in traditional TV systems, as will become evident in the next chapters. Also, when a multimedia session is established (for example, a multimedia telephony call or a streaming session), the session start-up delay is an important factor in the QoS space. For instance, as in traditional Circuit-Switched (CS) phone calls, the dial-to-ring delay is critical in determining the overall experience in the user's mind. This will also be one of the aspects of this research.

Mobile multimedia telephony and streaming applications may be deployed over guaranteed bit rate bearers, which ensure that the bandwidth required for the media streams is available during the whole lifetime of a session. However, it depends primarily on the mobile operator whether or not to allow the usage of these traffic classes for those multimedia applications. Often, only best-effort bearers are available to non-premium users; in this case, the network bandwidth available to each user may vary over time. Mobile streaming applications do not generally have stringent real-time requirements. However, these applications (similarly to multimedia telephony applications) do require a guaranteed bandwidth in order to perform optimally. Whenever this is not available, bit rate adaptation techniques can help fight bandwidth unavailability. These are also within the scope of this thesis, as is the performance of these applications under different network scenarios (circuit-switched, GPRS, UMTS).
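As an illustration of such bit rate adaptation, the sketch below switches between encodings based on the client buffer level. The rate ladder and thresholds are hypothetical; the actual adaptation methods studied in this thesis are signaling-based, not this toy policy.

```python
# Minimal sketch of client-buffer-driven bit rate adaptation for streaming.
# The rate ladder and the 2 s / 8 s thresholds are hypothetical values
# chosen only for illustration.

RATES_KBPS = [64, 128, 256]   # available encodings, lowest to highest

def pick_rate(buffer_s: float, current: int) -> int:
    """Return an index into RATES_KBPS given seconds of buffered media."""
    if buffer_s < 2.0 and current > 0:
        return current - 1                # buffer draining: switch down
    if buffer_s > 8.0 and current < len(RATES_KBPS) - 1:
        return current + 1                # comfortable margin: switch up
    return current                        # otherwise hold the current rate

# Example: the buffer shrinking from 5 s to 1.5 s forces a down-switch.
idx = 1
for level in (5.0, 3.0, 1.5):
    idx = pick_rate(level, idx)
print(RATES_KBPS[idx])   # 64
```

The design choice here (hysteresis between a low and a high threshold) avoids oscillating between rates when the buffer level hovers around a single threshold.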

For the sake of clarity, it is important to frame the scope of this thesis and also mention what is not in scope. Only those protocol details about the physical and data link layers will be covered that are relevant to link their properties with the upper layers in the context of the applications under consideration. This thesis will not look at media


coding issues (e.g., audio and video compression). Media transport is more within the focus of this thesis, and media characteristics will only be referenced, rather than the algorithms or syntax for the bit stream formation. However, when necessary, some media codec features and general characteristics will be mentioned, because of the QoS advantage they bring to the multimedia applications under consideration; they will be introduced without any additional background information, assuming the reader is familiar with these notions. Speech and video will be the media utilized for the research and experiments; however, video will receive the major emphasis in this thesis. Regarding networking technologies (both wired and wireless), data transmission over WiMAX(2), (W)USB and Infrared is out of the scope of this thesis, as is transmission over non-3GPP mobile networks. Support of 3GPP features in other mobile networks (e.g., EDGE Compact in ANSI-136) is also out of scope. When mentioning mobile network features and protocols, no core network details will be given; the core network will be considered a zero-loss, zero-delay pipe. The focus will be on the Radio Access Network (RAN). Regarding the space of mobile multimedia services, broadcast multimedia services (such as MBMS), Push to talk over Cellular (PoC) and Multimedia Messaging (MMS) are out of the scope of this thesis.

1.2. AUTHOR’S CONTRIBUTION TO THE PUBLICATIONS

The Author was heavily involved in the standardization of the Packet-switched Streaming Service in 3GPP and the Architectures and Protocols standard in DLNA.

Publication [P1] is about performance evaluation of 3G-324M mobile videophones over WCDMA networks. The Author is the main author of the paper. He contributed the main idea behind the work and most of the writing of the paper, including defining the test cases and the QoS metrics to be used for QoS assessment.

SIP call set-up delays over UMTS are analyzed in Publication [P2]. The Author is the main author of the paper; he contributed the main idea behind the paper, its writing, and the definition of the test plan and the QoS metrics.

The Author is also the first author of Publication [P3], which is about human perception of lip synchronization in a mobile environment. The Author proposed the idea, contributed to the writing of the paper and helped define the test plan.

Mobile streaming services are the topic of Publication [P4]. The Author contributed to the idea, the definition of the test cases and the QoS metrics to assess the system performance. The Author also guided the co-author in running the simulation work.

Publication [P5] is about finding the optimal settings for deploying multimedia streaming over GPRS networks. The Author contributed to generating the idea, writing the paper, and defining the research items to investigate and the test plan to be executed. The Author also guided the co-author in performing the simulation work. The robust handover management technique, contributed by the Author to 3GPP, is today part of the PSS standard.


Application rate adaptation for mobile streaming is the topic of Publications [P6, P7]. The Author is the main author of both papers; he contributed to the idea and is the main writer of both. The idea is today part of the 3GPP PSS and DLNA specifications, and was contributed by the Author to both standards organizations.

A network-based AMR rate adaptation mechanism is the subject of Publication [P8]. The Author is the main author of this paper. He contributed most of the writing, the algorithm idea for congestion control, and the definition of the simulation plan.

Rate adaptation algorithms for conversational 3G video are the subject of Publication [P9]. The Author introduced and proposed the research topic to the main author; he also guided the main author's work and assessed the simulation work.

Publication [P10] is about geo-predictive rate adaptation for mobile streaming. The Author is the main author. He contributed the main architectural idea, supervised the simulations and the definition of the test plan, and ensured the continuous assessment of the results. The Author wrote the entire paper except the simulation part, which was contributed by the second author.

Publication [P11] is about Mobile and Interactive Social TV. The Author guided the main author's work and the research path during the whole research period, and also contributed to the actual writing of the paper.

1.3. ORGANIZATION OF THE THESIS

The thesis is organized as follows. Chapter 2 introduces circuit-switched and packet-switched networks and those of their features relevant to the mobile multimedia applications under consideration. This chapter also introduces the QoS offered by these networks. The application requirements and network capabilities are described in Chapter 3. In Chapter 4, some QoS aspects of mobile multimedia telephony applications are analyzed, and QoS metrics for performance assessment are introduced. Chapter 5 is about mobile media streaming. Similarly to the previous chapter, the key QoS aspects for this application are presented, relevant QoS metrics are introduced, and performance is assessed. Rate adaptation for multimedia telephony and streaming systems is treated in Chapter 6, where methods for improving the QoS are introduced and assessed. The Mobile and Interactive Social TV paradigm and architectures are introduced in Chapter 7, along with a perspective on session mobility. Finally, Chapter 8 presents the conclusions and outlines future research work.


Chapter 2

Mobile Networks

This chapter builds the ground for the rest of the thesis. It surveys the main 3rd Generation Partnership Project (3GPP) mobile networks up to Release 6 from the particular viewpoint of mobile multimedia applications. The main differences between wired and wireless networks are first introduced, together with the basic mobile network architecture. The chapter continues with the introduction of both circuit-switched and Packet-Switched (PS) mobile networks, focusing on the functionalities of the user plane. The chapter concludes with a description of the QoS for 3GPP networks. In the rest of the thesis, the terms mobile station, (mobile) terminal and (mobile) client will be used interchangeably.

2.1. WIRED AND WIRELESS NETWORKS

When comparing wired networks (also referred to as fixed networks) and mobile networks (also referred to as wireless networks), the first difference to capture is architectural.

A wired network (either a CS network such as a Public Switched Telephone Network (PSTN) or Integrated Services Digital Network (ISDN), or a PS network such as the Internet) can be visualized as an interconnected network of switches, routers, bridges and gateways that connect the endpoints (e.g., home phones or Internet Protocol (IP) [86] hosts) via wired lines.

A mobile network (either CS or PS) is, like a wired network, made of a set of functionally similar network entities, but these interwork to connect endpoints (i.e., mobile users) via wireless links. This divides a mobile network architecturally into two parts: the fixed part (also referred to as the Core Network (CN)) and the mobile part (also referred to as the RAN). Figure 3 shows a simplified mobile network architecture.


In this scheme, a Mobile Station (MS) communicates with the RAN via a wireless radio interface (here generically named Interface A). The RAN can be, for example, of the UMTS Terrestrial RAN (UTRAN) type, and it embeds network elements called Node-B and Radio Network Controller (RNC). The RAN communicates with the CN, which can be of the CS or PS type (or a combination of both), through an interface (here generically named Interface B). Internetworking between the mobile network and external CS or PS networks (e.g., IP, X.25, ISDN, and PSTN) is enabled through the use of appropriate interfaces (generically named Interface C).

Figure 3. Simplified architecture of a mobile network (mobile stations, Interface A, Radio Access Network (BSS, GERAN, UTRAN), Interface B, Core Network (CS, PS), Interface C, external networks (IP, X.25, ISDN, PSTN))

A second aspect that must be taken into account when comparing wired and wireless networks is their QoS. For a service or application deployed over a fixed IP network (such as the Internet), the main factors behind a good or bad QoS are bandwidth, packet losses and delays.

In non-QoS-guaranteed networks (e.g., best-effort networks), the primary source of insufficient QoS is the shared network access, whenever no mechanism for bandwidth reservation is in place. Concurrent access by many users to the same network link limits the per-user available bandwidth. The amount of available bandwidth may vary over time, depending on instantaneous load and congestion conditions. The variable bandwidth of best-effort networks is a critical factor for multimedia applications, which require rather stable, non-oscillating network bit rates for carrying continuous media data (e.g., the audio and video of a multimedia streaming session).

Packet losses are mainly caused by congestion in the hosts/routers along the path between the endpoints (e.g., between a streaming server and a client). Congestion in a router occurs


whenever the packet arrival rate at the router is higher than its packet departure rate. This may be due to the physical processing speed of the router, or because the output network link is slower than the input network link. If a router is congested and its buffers are full, it starts to drop packets, which will likely affect the perceived QoS. Lost IP packets are normally not retransmitted by the network protocols, unless reliable transport protocols (e.g., the Transmission Control Protocol (TCP) [87]) or ad-hoc retransmission techniques at the application layer are employed.
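The tail-drop behavior described above can be sketched with a toy discrete-time queue (all parameters are hypothetical): when the arrival rate exceeds the departure rate, the buffer fills and subsequent packets are dropped.

```python
# Toy tail-drop router queue: when arrivals per tick exceed departures per
# tick, the buffer fills and further packets are dropped. All parameters
# are illustrative, not measurements.

from collections import deque

def simulate(arrivals_per_tick: int, departures_per_tick: int,
             buffer_size: int, ticks: int) -> int:
    """Return the number of packets dropped over the whole simulation."""
    queue, dropped = deque(), 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if len(queue) < buffer_size:
                queue.append(object())
            else:
                dropped += 1              # buffer full: tail drop
        for _ in range(min(departures_per_tick, len(queue))):
            queue.popleft()
    return dropped

# Arrivals (3/tick) exceed departures (2/tick): a buffer of 5 overflows,
# while a balanced queue (2/tick in and out) never drops anything.
print(simulate(3, 2, 5, 10))
print(simulate(2, 2, 5, 10))   # 0
```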

Delays in the network may depend on congestion (as explained above), out-of-sequence packet reordering and the physical capacity of the network trunks between the endpoints. Excessively long delays can also produce packet losses (because of congestion). A delay that varies over time is called delay jitter; this may be perceived by a receiving endpoint (e.g., a streaming client) whenever the inter-arrival time of the media packets is too variable. Normally, good buffer management at the receiver side can help in de-jittering the incoming data flow.
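A minimal sketch of receiver-side de-jittering, under the assumption of a fixed playout delay and evenly spaced 20 ms voice packets (both assumptions are for illustration only): packets whose network delay exceeds the playout delay miss their deadline and are effectively lost.

```python
# Fixed playout-delay de-jittering sketch. Packet i is sent at i * 20 ms
# and scheduled for playout at its send time plus a fixed playout delay;
# packets arriving after their deadline count as lost. Values illustrative.

def late_packets(arrival_ms, interval_ms=20.0, playout_delay_ms=80.0):
    """Count packets whose network delay exceeds the playout delay."""
    late = 0
    for i, t_arrive in enumerate(arrival_ms):
        deadline = i * interval_ms + playout_delay_ms
        if t_arrive > deadline:
            late += 1
    return late

# Jittery arrivals for five 20 ms voice packets (send times 0, 20, ..., 80 ms):
arrivals = [45.0, 70.0, 150.0, 95.0, 130.0]
print(late_packets(arrivals))   # 1: packet 2 misses its 120 ms deadline
```

Enlarging the playout delay absorbs more jitter at the cost of extra end-to-end latency, which is exactly the trade-off that makes buffer tuning hard for conversational applications.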

QoS of mobile networks is influenced by all the above factors. However, there are additional QoS factors that depend purely on the properties of the mobile part of the network architecture:
1. Radio link quality. In mobile networks, the air interface between the MS and the RAN is inherently affected by bit errors. A high Bit Error Rate (BER) can be caused, for example, by a weak radio signal in a given area (such as under bridges, behind buildings or hills) [63], a large distance between the MS and the Base Station (BS), weather conditions, multipath propagation (due to reflection, diffraction or scattering of radio waves), fading, interference, radio resource scarcity, or by a handover due to the movement of the user [137]. All these factors may cause packet corruption or packet losses that can produce noticeable media quality impairment.

2. Mobility. As users are mobile, mobility management is a very important issue; it may cause service interruption (or, in general, bad radio link quality) for a certain amount of time, and cause delay and packet losses in the user application. For example, when moving from one cell to another of the same or another operator's network (i.e., performing a handover, cell change or roaming), the network capacity (and in general the QoS) that was available in the old cell might no longer be available in the new cell. Handovers can also be triggered by bad signal quality or congestion in the current cell [109]. In these cases the QoS may change as the user moves. The management of network bandwidth variation is one of the critical points for successful deployments of mobile multimedia applications.

In general, packet losses caused by congestion must be identified and treated differently from packet losses caused by the radio link and mobility. This is one of the fundamental differences that distinguishes fixed Internet applications from mobile Internet applications.
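As a purely illustrative sketch (not a mechanism from the thesis), loss-differentiation heuristics studied for wireless transport often inspect the delay trend before a loss: steadily rising delays suggest a filling queue (congestion), while stable delays suggest a radio-link error. The window and threshold below are assumptions chosen for illustration:

```python
# Hypothetical loss-differentiation heuristic: classify a packet loss
# from the one-way delays observed just before it.  Rising delay is
# attributed to congestion (queue build-up); flat delay to the radio
# link.  The relative-rise threshold is an illustrative assumption.

def classify_loss(delays_before_loss, rise_threshold=0.2):
    """delays_before_loss: recent one-way delays in seconds, oldest first.

    Returns 'congestion' if delay grew by more than rise_threshold
    (relative) over the window, else 'radio'.
    """
    first, last = delays_before_loss[0], delays_before_loss[-1]
    if first > 0 and (last - first) / first > rise_threshold:
        return "congestion"
    return "radio"

print(classify_loss([0.10, 0.14, 0.19, 0.25]))  # delay rising: queue filling
print(classify_loss([0.10, 0.11, 0.10, 0.10]))  # delay stable: radio error
```

A real system would combine such hints with explicit feedback (e.g., receiver reports), since a single delay trend is an unreliable signal on its own.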


MOBILE NETWORKS


2.2. CIRCUIT-SWITCHED MOBILE NETWORKS

This section reviews the main CS mobile channels based on the Global System for Mobile communications (GSM) and the Universal Mobile Telecommunications System (UMTS). The basic idea of a CS connection is that once a call between two parties has been set up, a dedicated path between them exists and persists until the call is finished. As a consequence of the established path, there is no danger of congestion (unless a lack of trunk capacity occurs at set-up time) [212].

2.2.1. GSM

GSM networks were introduced in Europe in 1992 [159]. At that time, mobile communications were almost exclusively speech-oriented. The GSM channel capacity is 9.6 kbps, which leaves little room for the transmission of multimedia data. In fact, the GSM channel speed is suitable only for voice calls and non-real-time data applications at very low bit rates, such as e-mail and Web access. One of the few real-time applications enabled at GSM bit rates is video surveillance, which could allow video at frame rates of 1.8-4.3 fps [78]. GSM channels and such applications, which mark the low end of real-time mobile video applications, will not be considered further in this thesis.

2.2.2. HSCSD, ECSD and CS UMTS

High Speed Circuit Switched Data (HSCSD) is a technology derived from GSM, first defined in the GSM 1996 standard. HSCSD is an enhancement of GSM that removes the 9.6 kbps limit of GSM in order to enable multimedia applications and faster non-real-time data connections.

The basic idea is to allow a user to be allocated several Time Division Multiple Access (TDMA) time slots (or channels) of a carrier simultaneously. To achieve this, new functionality is introduced in the network and the MS for splitting and combining data into several data streams, which are then transferred via n (n = 1, 2, …, 8) channels over the radio interface. Once split, the data streams are carried by the n full-rate channels through the Base Transceiver Station (BTS) as if they were independent of each other, until the point in the network (the Base Station Controller (BSC)) where they are combined. Logically, the n channels at the radio interface belong to the same HSCSD configuration, and they are therefore controlled as one radio link by the network for the purposes of cellular operation, e.g., handover [4].

The data rate of a single time slot can be increased up to 14.4 kbps by puncturing (i.e., deleting) certain error correction bits of the existing 9.6 kbps channel. In theory, the available user bit rate could be as high as 115.2 kbps (8 × 14.4 kbps). In practice, however, the maximum bit rate per user is limited to 64 kbps, since this is the maximum reserved per user in the A interface of the GSM network infrastructure [4]. This limits the maximum number of 14.4 kbps time slots that can be allocated to four, achieving a user bit rate of 57.6 kbps in the uplink and/or downlink direction.
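The slot arithmetic above can be checked with a short sketch (the helper function and names are illustrative; the per-slot rate and the 64 kbps A-interface cap come from the text):

```python
# Worked check of the HSCSD figures discussed above.
SLOT_RATE_KBPS = 14.4          # punctured single-slot rate
A_INTERFACE_CAP_KBPS = 64.0    # per-user cap in the A interface

def hscsd_user_rate(n_slots):
    """Aggregate rate of n full-rate 14.4 kbps time slots, n in 1..8."""
    return n_slots * SLOT_RATE_KBPS

print(hscsd_user_rate(8))      # theoretical maximum: 115.2 kbps

# Largest slot count whose aggregate stays within the A-interface cap.
usable = max(n for n in range(1, 9)
             if hscsd_user_rate(n) <= A_INTERFACE_CAP_KBPS)
print(usable, hscsd_user_rate(usable))   # 4 slots -> 57.6 kbps
```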

HSCSD has both transparent and non-transparent modes. Transparent mode offers error protection at the channel coding level only. In this mode, retransmission of packets hit by errors is not used. As a result, the bit rates and network delays are constant [26], but the BER is variable, depending on the channel conditions. There is no QoS flexibility (resource upgrade or downgrade) for the transparent mode [26]. This means that after a connection at a certain bit rate has been established, it is either maintained at that rate or dropped (e.g., in case of sudden unavailability of resources after a handover) [4]. Non-transparent mode offers retransmission of erroneous frames, using the GSM Radio Link Protocol (RLP) [6], in addition to the error correction made by channel coding. The available throughput and transmission delay vary with the channel quality (the higher the BER, the lower the throughput and the higher the network delay) [26]; the throughput may also vary during a connection (e.g., after a handover) or when requested by the user (service level upgrade or downgrade) [4], but never exceeds the Air Interface User Rate (AIUR) [26]. Non-transparent services with limited retransmissions increase the delay and buffering requirements, making conversational video applications unattractive [114], but video surveillance possible [94].

HSCSD services can be further classified into symmetrical and asymmetrical services. A symmetrical service allocates equal bit rates to the uplink and downlink connections. Asymmetrical services can provide different data rates in the uplink and downlink directions and are only applicable in non-transparent mode [26].

Enhanced Circuit Switched Data (ECSD) [26] is a technology defined within Enhanced Data rates for GSM Evolution (EDGE) in 1999, and it follows the same basic principle as HSCSD. The user bit rates are not increased (i.e., the 64 kbps limitation of the A interface is still in place), but the same rates can be offered with a smaller number of time slots and a simpler MS implementation. The main enhancement is a new modulation scheme in the air interface, Octagonal Phase Shift Keying (8-PSK), which triples the data rate per time slot.

Although the main characteristic of UMTS networks is the transport of PS traffic, UMTS also offers CS channels. The MS protocol stack for a CS connection is similar to that depicted in Figure 5, with the difference that the IP and PDCP layers do not take part in a CS transmission. The reader can refer to section 2.3.3 for further details. CS UMTS channels have been used in Publication [P1].

2.3. PACKET-SWITCHED MOBILE NETWORKS

A PS connection is characterized by the fact that data is split into packets with a maximum size. This ensures that no user can monopolize a transmission line for long and that the line is shared among many users [212]. The fundamental difference between CS and PS connections is that CS connections statically reserve the required bandwidth in advance, whereas PS connections allocate and release bandwidth as it is needed. With CS networks, any unused bandwidth on the allocated circuit is simply wasted. With PS networks, the unused bandwidth can be allocated to other users. Furthermore, data packets of PS connections may follow different routes, while CS traffic always follows the same route. Table 1 summarizes the differences between CS and PS networks [212].

TABLE 1. A COMPARISON OF CIRCUIT-SWITCHED AND PACKET-SWITCHED NETWORKS

2.3.1. General Packet Radio Service (GPRS)

Packet data was first introduced in mobile networks in 1997 with the General Packet Radio Service (GPRS). Packet data is suitable for applications with bursty traffic, and resources are allocated from a common pool. GPRS networks are built to support packet-switched traffic based on IP, which makes it easy to connect GPRS networks to IP-based backbones, such as the public Internet. Figure 4 shows the GPRS protocol stack for the user plane [28]. In Publication [P5], mobile streaming experiments are conducted over a GPRS network.

[Figure: user plane protocol stack diagram spanning the MS, BSS, SGSN and GGSN nodes over the Um, Gb, Gn and Gi interfaces. The MS side comprises the Application, IP/X.25, SNDCP, LLC, RLC, MAC and GSM RF layers; the BSS relays between RLC/MAC/GSM RF and BSSGP/Network Service/L1bis; SNDCP and LLC terminate in the SGSN, which exchanges user data with the GGSN over GTP on UDP/TCP and IP.]

Figure 4. GPRS user plane protocol stack

Feature                            | CS networks           | PS networks
Dedicated "copper" path            | Yes                   | No
Bandwidth available                | Fixed                 | Dynamic
Potentially wasted bandwidth       | Yes                   | No
Each packet follows the same route | Yes                   | No
When congestion occurs             | At call set-up        | On every packet
Charging                           | Time & distance based | Traffic volume based

A GPRS MS can use up to 8 Time Slots (TS), which are dynamically allocated separately for the downlink and/or uplink when there is traffic to be transferred. The allocation depends on resource availability. In GPRS, different channel coding schemes are defined in the radio interface. They use GMSK modulation and are named CS-1, CS-2, CS-3 and CS-4. The four coding schemes offer decreasing error protection levels, where CS-4 uses no Forward Error Correction (FEC) [11].

In GPRS there is a Link Adaptation mechanism that works by adapting the protection of the data to be sent according to the instantaneous radio link quality. For this purpose, channel measurements are performed and the coding schemes are automatically switched to more or less robust modes with a granularity of a radio block, if needed. The selection of the initial coding scheme is determined according to the radio link quality, and the choice of the coding scheme at any instant is always controlled by the network [11].

The bit rates for a single time slot and for different coding schemes (from 1 to 4) are the following: {9.05, 13.4, 15.6, 21.4} kbps. Therefore, depending on the combination of time slots (1 to 8) and the coding scheme, the GPRS bit rates can range from 9.05 kbps up to 171.2 kbps (the full bit rates table is available in [73]).
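A quick check of the quoted GPRS rate range (the per-slot rates come from the text; the helper function is illustrative):

```python
# Worked check of the GPRS bit-rate range quoted above: single-slot
# rates for coding schemes CS-1..CS-4, aggregated over 1..8 time slots.
CS_RATES_KBPS = {"CS-1": 9.05, "CS-2": 13.4, "CS-3": 15.6, "CS-4": 21.4}

def gprs_rate(coding_scheme, n_slots):
    """Aggregate GPRS rate for 1..8 time slots of one coding scheme."""
    return CS_RATES_KBPS[coding_scheme] * n_slots

print(gprs_rate("CS-1", 1))   # minimum: one CS-1 slot, 9.05 kbps
print(gprs_rate("CS-4", 8))   # maximum: eight CS-4 slots, 171.2 kbps
```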

Between the MS and the Base Station Subsystem (BSS), transmission can occur in Unacknowledged (UNACK) or Acknowledged (ACK) mode at the Radio Link Control (RLC) layer. UNACK mode is a transparent mode, while in ACK mode the RLC layer retransmits the frames that have been lost or corrupted by errors in the air interface. Typical RLC layer Round Trip Times (RTT) are in the order of 240 ms [14]. Among other things, RLC provides segmentation and reassembly of Logical Link Control (LLC) Protocol Data Units (PDU) into RLC/MAC blocks [13].

Between the MS and the Serving GPRS Support Node (SGSN) the communication can also be in UNACK or ACK mode via the LLC layer [3].

The SubNetwork Dependent Convergence Protocol (SNDCP) [8] layer provides TCP/IP header compression [130] and V.42bis/V.44 data compression, to enhance the network capacity. The former allows a reduction of the TCP/IPv4 packet header size from 40 to 3 bytes [130]. SNDCP provides, among other things, segmentation and reassembly of SNDCP PDUs into LLC PDUs, and can run in ACK and UNACK modes.
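To see why this matters for capacity, a small illustrative calculation of per-packet header overhead before and after the 40-to-3-byte compression (the 60-byte payload is a hypothetical small data segment, not a figure from the text):

```python
# Illustrative arithmetic: the effect of compressing a 40-byte
# TCP/IPv4 header down to 3 bytes, for a hypothetical small packet.
def overhead(header_bytes, payload_bytes):
    """Fraction of each packet spent on headers."""
    return header_bytes / (header_bytes + payload_bytes)

payload = 60  # hypothetical small interactive-data segment
print(round(overhead(40, payload), 2))  # uncompressed: 40% overhead
print(round(overhead(3, payload), 2))   # compressed: about 5% overhead
```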

2.3.2. Enhanced GPRS (EGPRS) and GERAN improvements

The EDGE enhancement for GPRS networks is called Enhanced GPRS (EGPRS) [28] and was specified in 1999. EGPRS networks are also called GERAN (GSM/EDGE RAN) networks. Therefore, in the rest of this thesis, the terms EGPRS and GERAN will be used interchangeably. In Publications [P6, P7] streaming rate adaptation experiments were conducted over a simulated EGPRS network.

The major changes of EGPRS, compared to GPRS, are in layers 1 (physical) and 2 (data link) of the protocol stack, in order to increase network capacity. In layer 1, a new set of Modulation and Coding Schemes (MCS) is defined. The GPRS GMSK coding schemes (CS-1 to CS-4) are replaced with four new GMSK schemes (MCS-1 to MCS-4) with decreasing error protection. In addition, five 8-PSK coding schemes with decreasing error protection (MCS-5 to MCS-9) are defined. In practice, GMSK modulation provides robustness for wide-area coverage, while 8-PSK provides higher data rates [11].

In EGPRS, the bit rates for a single time slot and for different MCSs (from 1 to 9) are the following: {8.8, 11.2, 14.8, 17.6, 22.4, 29.6, 44.8, 54.4, 59.2} kbps. Therefore, depending on the combination of time slots (1 to 8) and the coding scheme, the EGPRS bit rates can range from 8.8 kbps up to 473.6 kbps (the full bit rates table is available in [73]).
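The EGPRS extremes quoted above can be verified the same way (the per-slot rates come from the text):

```python
# Worked check of the EGPRS bit-rate range quoted above: single-slot
# rates for MCS-1..MCS-9, aggregated over 1..8 time slots.
MCS_RATES_KBPS = [8.8, 11.2, 14.8, 17.6, 22.4, 29.6, 44.8, 54.4, 59.2]

print(min(MCS_RATES_KBPS) * 1)  # minimum: one MCS-1 slot, 8.8 kbps
print(max(MCS_RATES_KBPS) * 8)  # maximum: eight MCS-9 slots, 473.6 kbps
```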

The link adaptation mechanism works as in GPRS. However, in EGPRS it is possible to change the MCS for retransmissions, i.e., an RLC block can be sent again with higher protection than in its initial transmission [13]. This more efficient transmission scheme is called Type II Hybrid Automatic Repeat reQuest (ARQ), commonly referred to as Incremental Redundancy (IR) [11], and it effectively increases the probability of data reception at the RLC.

Another improvement offered by EGPRS is the TCP and UDP (User Datagram Protocol) (over IPv4/v6) header compression in the SNDCP layer of the protocol stack [8]. This header compression algorithm allows reducing the packet headers from a maximum size of 60 bytes to 4-7 bytes [89].

The GPRS cell reselection (or cell change) was not initially designed for services that require seamless cell change operations, such as real-time multimedia traffic. As a result, a GPRS cell change introduces a service break of several seconds [99], which can harm the QoS of a multimedia flow. A network functionality called Network Assisted Cell Change (NACC) [13] aims at reducing the cell change time [14]. With NACC, the MS informs the network of its wish to change cell. The network assists the MS before and during the cell change, and sends neighbor cell system information to the MS, which stores it for 30 seconds. Additionally, with the Network Controlled Cell Reselection (NCCR) feature [11, 13], it is possible to further optimize the cell reselection performance by shifting the cell reselection decision making from the MS to the network. NACC is limited to cell reselection procedures within the same BSC (Intra-BSC NACC). This limits the value of NACC, as Inter-BSC cell changes and GERAN-to-UTRAN cell changes are more frequent in some network configurations. For this purpose, the GERAN specifications introduce Inter-BSC and BSC-RNC NACC (also referred to as External NACC) [28]. The service interruption time has later been further reduced by the PS Handover (HO) [12] feature, which also improves buffer handling in order to reduce losses at cell change for Intra-BSS HO, Intra-SGSN HO, Inter-SGSN HO, or Inter Radio Access Technology HO (Inter-mode HO). Further GERAN developments beyond Release 6 are not within the scope of this thesis.

2.3.3. UMTS

The IMT-2000 specifications for third generation mobile networks written by 3GPP have defined the standards for UMTS networks. The air interface technology for UMTS is Wideband Code Division Multiple Access (WCDMA), a channel allocation technology different from the TDMA technology used in GSM-based mobile networks (e.g., GSM, GPRS, EGPRS). The main characteristics of WCDMA-based networks can be summarized as follows:

- Higher bit rates than (E)GPRS (at least 2048 kbps in indoor/low-range outdoor radio environments) [27];

- Delay requirements that range from the most stringent values for real-time traffic (20-400 ms) to more relaxed ones for best-effort traffic;

- BERs lower than 10^-3 for real-time services, and lower than 10^-5 for non-real-time services [27];

- Multiplexing of services with different quality of service requirements on a single bearer (for example, a speech call, a multimedia streaming session and a Web browsing session).

Publication [P8] deals with AMR mode selection over WCDMA, while Publications [P2, P9] show SIP call set-up delays and rate adaptation experiments over simulated WCDMA networks. Publication [P4] includes mobile streaming experiments over a real WCDMA network. In the following, the focus is on the RAN part, called UTRAN (UMTS Terrestrial RAN). Figure 5 shows the user plane protocol stack for UMTS PS networks (Iu-PS mode [28]). The A, B and C interfaces of Figure 3 correspond respectively to the UTRAN Uu, Iu-PS and Gi interfaces in Figure 5.

[Figure: user plane protocol stack diagram spanning the MS, UTRAN, 3G-SGSN and 3G-GGSN nodes over the Uu, Iu-PS, Gn and Gi interfaces. The MS side comprises the Application, IP/PPP, PDCP, RLC, MAC and L1 layers; the UTRAN relays between PDCP/RLC/MAC/L1 and GTP-U over UDP/IP on AAL5/ATM towards the 3G-SGSN, which forwards user data to the 3G-GGSN over GTP-U on UDP/IP.]

Figure 5. User plane protocol stack for UMTS networks (Iu-PS mode)

At layer 1, data arrives from the MAC layer at the coding/multiplexing unit in the form of transport block sets, once every Transmission Time Interval (TTI). The TTI is selected from the set {10, 20, 40, 80} ms [16]. The transport channels [16] are unidirectional (i.e., uplink-only or downlink-only), and are either shared or dedicated. The MAC layer [18] provides UNACK data transfer and does not offer a segmentation/reassembly functionality.

The RLC protocol [19] between the MS and the RAN can operate in Transparent (T), UNACK and ACK modes. Below the transport layer (excluded), RLC is the only protocol that allows transmission in ACK mode; all the other protocol layers operate in T (or UNACK) mode. The services that the RLC layer provides to the upper layer are: T/UNACK/ACK data transfer, maintenance of the QoS as defined by the upper layers (by means of a retransmission protocol), and error notification.

The efficiency of the RLC depends on two parameters [19]: the discard timer and the maximum number of retransmissions (MaxDAT). The SDU discard function is used by the RLC sender to discard from its buffer the RLC PDUs that have not been successfully transmitted within a certain period of time (the discard timer) or within a number of retransmissions (MaxDAT). This prevents RLC buffer overflow in the sender. The function can be configured in different ways, among which are [19]:

- Timer-based discard: this option is insensitive to variations in the channel rate and provides a means for exact definition of the maximum delay between RLC peer entities; however, the SDU loss rate of the connection increases as SDUs are discarded.

- Discard after n retransmissions: this makes the SDU discard function dependent on the channel rate, and it tries to achieve a constant SDU loss rate at the cost of a variable delay. When MaxDAT is large, a fully persistent retransmission scheme is achieved.
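The two discard policies can be sketched as follows (the function and parameter names are hypothetical; the semantics follow the description above):

```python
# Illustrative sketch of the two RLC SDU discard policies described
# above: a PDU is dropped either when it has waited longer than the
# discard timer, or after MaxDAT (re)transmission attempts.  The two
# modes are configured as alternatives.

def should_discard(age_s, n_retx, discard_timer_s=None, max_dat=None):
    """Timer-based discard if discard_timer_s is set; count-based
    discard if max_dat is set."""
    if discard_timer_s is not None and age_s > discard_timer_s:
        return True   # bounds peer-to-peer delay; raises the SDU loss rate
    if max_dat is not None and n_retx >= max_dat:
        return True   # targets a constant loss rate; delay varies with channel
    return False

print(should_discard(age_s=0.6, n_retx=2, discard_timer_s=0.5))  # True
print(should_discard(age_s=0.6, n_retx=2, max_dat=4))            # False
```

The trade-off in the code mirrors the text: the timer bounds delay but sacrifices SDUs on a slow channel, while MaxDAT bounds the loss rate but lets delay grow.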

The Packet Data Convergence Protocol (PDCP) layer [20] exists only for the PS domain. Its main function is to compress higher-layer protocol headers, reducing the bit rate towards the radio interface. TCP and UDP header compression is available [89]. In addition, the RObust Header Compression (ROHC) algorithm [57] is used to efficiently compress RTP/UDP/IP or UDP/IP headers from a maximum of 60 bytes to 1-6 bytes [131]. The PDCP protocol is defined only between the MS and the RAN, which increases speed. PDCP may also maintain sequence numbers to guarantee lossless PDCP data transfer, and therefore no losses during cell changes [20].
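A small illustrative calculation of what such compression saves per packet (the 32-byte payload, approximating one AMR 12.2 kbps speech frame, is an assumption for illustration):

```python
# Illustrative arithmetic for the ROHC figures above: RTP/UDP/IP
# headers compressed from a 60-byte maximum down to a few bytes,
# for a hypothetical 32-byte voice payload.
def packet_size(header_bytes, payload_bytes=32):
    return header_bytes + payload_bytes

print(packet_size(60))  # uncompressed packet: 92 bytes
print(packet_size(3))   # with ROHC: 35 bytes, roughly a 62% reduction
```

For low-rate voice streams, where headers can dominate the packet, such compression translates almost directly into extra radio capacity.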

2.3.4. IMS and HSDPA

One of the major features introduced in UMTS is the IP Multimedia Subsystem (IMS) [30] in the CN, which efficiently supports applications with multiple media components (e.g., audio, video, shared whiteboards) with different QoS requirements and the possibility to add/drop media during a session. IMS is based on the Session Initiation Protocol (SIP) [192]. To decrease the end-to-end SIP signaling time, SIP compression [188] is mandatory in networks and MSs.

In order to reduce delays and achieve higher peak rates on downlink channels, the UTRAN specifications define a concept called High Speed Downlink Packet Access (HSDPA). This technology increases the downlink speed up to 10.2 Mbps [23] in urban environments, which typically have low mobile speeds and small cell sizes. HSDPA uses a new adaptive modulation scheme and a new, faster retransmission mechanism over the air interface. In particular, the data rates are increased thanks to the 16-state Quadrature Amplitude Modulation (16QAM) [15]. HSDPA is very advantageous for asymmetric multimedia services, such as streaming.


The main differentiating factor in the protocol stack is a new MAC functionality added in the RAN: MAC-hs (MAC high speed) [17], located in the Node-B. With HSDPA, the MAC-hs protocol runs between the MS and the Node-B. The MAC-hs in the Node-B includes the scheduling functionality, which enables a more efficient implementation that can adapt the modulation faster to the most recent information on channel conditions and the fading environment [22]. TTIs for HSDPA are chosen from the set {2, 5, 10} ms, depending on the network configuration. The MAC layer in the MS includes the MAC-hs functionality of Hybrid ARQ (HARQ), which handles lower-layer MAC retransmissions between the Node-B and the MS. Retransmissions can use a different channel coding and Incremental Redundancy [17, 23].

In this thesis, Publication [P11] shows experiments of Mobile and Interactive Social TV over a 3.5G HSDPA network, whereas Publication [P10] includes results over simulated LTE HSPA.

2.4. QOS OF CIRCUIT-SWITCHED AND PACKET-SWITCHED NETWORKS

This section focuses on the QoS offered by the network channels described in sections 2.2 and 2.3; its primary intention is to build the first link between network and application QoS. The network should offer the possibility to set different levels of QoS. On the other hand, an application should have a set of means (offered by the network through appropriate interfaces) for choosing the QoS, depending on the application's QoS requirements, in order to provide the best user experience.

2.4.1. GSM, HSCSD and ECSD

GSM-based circuit-switched channels, such as GSM itself, HSCSD and ECSD, have limited possibilities of offering different QoS to an application. GSM has no relevant parameters that an application can choose in order to select the desired QoS. HSCSD and ECSD have no direct interfaces towards an application. However, the MS has the capability of choosing:

- Bit rate: the number of time slots allocated (each one offering 9.6, 14.4, 28.8, 32 or 43.2 kbps) allows choosing the required bandwidth;

- Reliability: the choice of transparent or non-transparent mode transmission determines the level of error protection and the delay of the connection.

2.4.2. GPRS

In GPRS, resources are allocated on a dynamic basis. In this way, a lengthy file transfer would require more resources. However, the resource allocation policy is controlled by the operator [96]. GPRS introduces the concept of a QoS profile for a PDP context to define a set of attributes that characterize the quality of the connection. These attributes are described in Table 2 [28, 96].

TABLE 2. QOS PROFILE FOR GPRS NETWORKS

Precedence class: Indicates a priority in case of abnormal network behaviour. For example, in case of congestion, it determines which packet to discard first. Values: [1..3], in decreasing order of precedence.

Delay class: Defines the maximum values for the mean transfer delay and the 95th-percentile delay within a GPRS network end-to-end. It includes the radio channel access delay (in uplink) or radio channel scheduling delay (in downlink), the radio channel transit delay (in uplink or downlink) and the GPRS network transit delay (multiple hops). It does not include transfer delays in external networks. Example delay bounds are shown in Table 3 of Publication [P5]. Values: [1..4].

Reliability class: Data reliability is defined in terms of residual probabilities of data loss, out-of-sequence delivery, duplicate data delivery and data corruption (undetected errors). These probabilities are defined for three classes in Table 3. The reliability class specifies the requirements of the various network protocol layers; combinations of the GTP, LLC and RLC transmission modes support the reliability class performance requirements. The combinations are shown in Table 2 of Publication [P5]. Values: [1..5].

Mean throughput class: Specifies the average rate at which data is expected to be transferred across the GPRS network during the remaining lifetime of a PDP context. The rate is measured in bytes per hour (Bph). Values: [1..18, 31], where 31 means best effort and the values from 1 to 18 define discrete rates in the range [100, 50×10^6] Bph, i.e., in the range [0.00022, 111] kbps.

Peak throughput class: Specifies the maximum rate at which data is expected to be transferred across the GPRS network for a PDP context. There is no guarantee that this peak rate can be achieved or sustained for any period of time; this depends upon the MS capability and the available radio resources. The rate is measured in bytes per second (Bps). Values: [1..9], defining discrete rates in the range [10^3, 256×10^3] Bps, i.e., in the range [8, 2048] kbps.
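The unit conversions behind the throughput-class ranges can be checked with a short sketch (the helper functions are illustrative):

```python
# Worked conversion of the GPRS throughput-class units quoted above:
# bytes/hour (mean throughput) and bytes/second (peak throughput)
# expressed as kbps.
def bph_to_kbps(bytes_per_hour):
    return bytes_per_hour * 8 / 3600 / 1000

def bps_to_kbps(bytes_per_second):
    return bytes_per_second * 8 / 1000

print(round(bph_to_kbps(50e6)))   # top mean throughput class: ~111 kbps
print(bps_to_kbps(256e3))         # top peak throughput class: 2048.0 kbps
```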

TABLE 3. RESIDUAL ERROR PROBABILITIES FOR RELIABILITY CLASSES IN GPRS NETWORKS

Reliability class | SDU loss prob. | Duplicate SDU prob. | Out-of-sequence SDU prob. | SDU corruption prob. | Example of application characteristics
1 | 10^-9 | 10^-9 | 10^-9 | 10^-9 | Error sensitive, no error correction capability, limited error tolerance capability.
2 | 10^-4 | 10^-5 | 10^-5 | 10^-6 | Error sensitive, limited error correction capability, good error tolerance capability.
3 | 10^-2 | 10^-5 | 10^-5 | 10^-2 | Not error sensitive, error correction capability and/or very good error tolerance capability.

During the negotiation of the QoS profile parameters, the MS is also able to request one or more subscription default values stored in the network, whereas the network must negotiate each attribute to a level that is in accordance with the available GPRS resources. It must be noted that the throughput values can be re-negotiated by the network at any time during a session. In addition, the RLC/MAC layer supports four radio priority levels for uplink transmission (with values from 1 to 4, where 1 represents the highest priority) [28]. The radio priority level is determined by the SGSN, based on the negotiated QoS profile, and is delivered to the MS during the PDP context activation.

2.4.3. UMTS

The QoS concepts for GERAN and UTRAN were later unified into a single UMTS QoS concept. In order to guarantee end-to-end QoS, the UMTS specifications define a new important parameter: the QoS (or traffic) class. This is considered a fundamental way to distinguish services of different types and their respective quality. There are four QoS classes defined for traffic over UMTS networks [29]. The practical difference between the four classes lies in delay and error rates. While the Conversational and Streaming (real-time) traffic classes guarantee low delays at the cost of higher error rates, the Interactive and Background (best-effort) traffic classes guarantee lower error rates at the cost of higher delays.

The conversational class has the most stringent delay requirements, whereas the background class has the least stringent ones. Characteristics of UMTS QoS include dynamic QoS behavior (QoS negotiation and re-negotiation), i.e., the QoS can be requested at the beginning of a session and modified while the session is active, and asymmetry, i.e., the QoS can use different attributes for the uplink and downlink directions [29].

The QoS profile for a UMTS bearer is defined in a slightly different way compared to the GPRS QoS profile. Table 4 contains a description of the 15 QoS profile attributes, whereas Table 5 includes their value ranges for the different traffic classes. There may be limitations on the applicability of combinations of QoS attributes in a bearer; for example, it might not be possible to use the shortest delay together with the lowest SDU error ratio [29]. The maximum bit rates in Table 5 can be achieved only when a minimum (or zero) overhead is induced by the layer 2 protocols (for example, using the transparent mode). Otherwise, the bandwidth cost of the non-transparent modes must be subtracted from the maximum achievable rate. The highest bit rates should be achievable with LTE 4G networks. For GERAN networks, the maximum bit rates are limited to 473.6 kbps.

For the Conversational and Streaming TCs, when an application generates less traffic than guaranteed for the bearer, the unused resources can be used by other users [29]. For Interactive and Background TCs, the maximum bit rate and SDU error ratio are the only practical attributes used to control the QoS (the Interactive TC can also make use of the traffic handling priority). The 3GPP QoS model assumes that the application requirements of delay jitter are expressed through the transfer delay attribute. This implies that there is no explicit delay jitter attribute [29].

The maximum SDU size is limited to 1502 bytes for PPP (Point-to-Point Protocol) connections, and 1500 bytes for other cases. This size was chosen to accommodate the requirement of IPv6 links [88]. The maximum SDU size and the IP layer Maximum Transmission Unit (MTU) have no relationship; networks should not perform IP fragmentation based on the maximum SDU size [29].


TABLE 4. QOS PROFILE FOR UMTS NETWORKS

QoS profile attribute: Description

Traffic class (TC): The type of application for which the UMTS bearer service is optimized. UMTS can make assumptions about the traffic source and optimize the transport for that traffic type.

Maximum bit rate (MBR): The maximum number of bits a user application can accept or provide in a certain period of time. All UMTS bearer service attributes may be fulfilled for traffic up to the MBR, depending on the network conditions. Its purpose is 1) to limit the delivered bit rate to applications or external networks with such limitations, and 2) to allow an MBR for applications that operate with different rates (e.g., with adaptive codecs).

Guaranteed bit rate (GBR): The guaranteed number of bits delivered to a user application in a certain period of time. It may be used to facilitate admission control based on available resources. The UMTS service attributes (e.g., delay and SDU error ratio) are guaranteed for traffic up to the GBR; for traffic exceeding this bit rate, they are not guaranteed.

Delivery order: Indicates whether the UMTS bearer shall provide in-sequence packet delivery or not.

Maximum SDU size: The maximum SDU (packet) size for which the network shall satisfy the negotiated QoS. It is used for admission control, policing, and for optimizing transport. Packets larger than the maximum SDU size may be dropped or forwarded with a decreased QoS.

SDU format information: A list of the possible exact sizes of SDUs. The RAN needs packet size information to operate efficiently in transparent RLC mode. This attribute applies only to the Conversational and Streaming TCs and is not directly visible to an application.

SDU error ratio: Indicates the fraction of packets lost or detected as erroneous. For the Conversational and Streaming TCs, the SDU error ratio is independent of loading conditions. Conversely, for the Interactive and Background TCs, where there is no resource reservation, the SDU error ratio is used as a target value.

Residual bit error ratio: Indicates the undetected bit error ratio in the delivered packets. If no error detection is requested, it indicates the bit error ratio in the delivered packets.

Delivery of erroneous SDUs: Indicates whether packets detected as erroneous shall be forwarded to the upper layers or not.

Transfer delay: Indicates the delay tolerated by the application, i.e., the maximum delay for the 95th percentile of the delay distribution of all packets delivered during the lifetime of a bearer. Delay is defined as the time from a request to transfer a packet at one SAP (Service Access Point) to its delivery at the other SAP.

Traffic handling priority: Specifies the relative importance of handling the packets of this bearer compared to the packets of other bearers. Used only for the Interactive TC.

Allocation/retention priority: In situations where resources are scarce, the network can use this attribute to prioritize bearers with a high priority over bearers with a low priority when performing admission control. This is a subscription attribute, which is not negotiated from the MS.

Evolved allocation/retention priority: Enhances the previous attribute with an increased value range of the priority level, plus additional information about the pre-emption capability and pre-emption vulnerability of the bearer. The former defines whether a lower priority bearer can be dropped to free up resources; the latter defines whether the bearer itself is a candidate for such dropping.

Source statistics descriptor: Conversational speech has a well-known traffic pattern. By being informed that the packets are generated by a speech source, the network and the mobile station may calculate a statistical multiplex gain for use in admission control. Defined only for the Conversational and Streaming TCs and not directly visible to an application.

Signalling indication: Indicates that the traffic is of signalling type (e.g., IMS signalling). This allows enhancing the network operations, for example by providing low delay or higher priority to signalling SDUs. Defined only for the Interactive TC.


TABLE 5. VALUE RANGES FOR UMTS BEARER QOS PROFILE ATTRIBUTES

Attribute: Conversational class / Streaming class / Interactive class / Background class

Maximum bit rate (kbps): <= 256 000 / <= 256 000 / <= 256 000 / <= 256 000

Guaranteed bit rate (kbps): <= 256 000 / <= 256 000 / N/A / N/A

Delivery order: Yes/No for all classes

Maximum SDU size (bytes): <= 1500 or 1502 for all classes

SDU format information (bits): List of SDU sizes / List of SDU sizes / N/A / N/A

SDU error ratio: 10^-2, 7*10^-3, 10^-3, 10^-4, 10^-5 / 10^-1, 10^-2, 7*10^-3, 10^-3, 10^-4, 10^-5 / 10^-3, 10^-4, 10^-6 / 10^-3, 10^-4, 10^-6

Residual bit error ratio: 5*10^-2, 10^-2, 5*10^-3, 10^-3, 10^-4, 10^-5, 10^-6 / 5*10^-2, 10^-2, 5*10^-3, 10^-3, 10^-4, 10^-5, 10^-6 / 4*10^-3, 10^-5, 6*10^-8 / 4*10^-3, 10^-5, 6*10^-8

Delivery of erroneous SDUs: Yes/No/- for all classes

Transfer delay (ms): 100 - max. value / 300 - max. value / N/A / N/A

Traffic handling priority: N/A / N/A / 1, 2, 3 / N/A

Allocation/Retention priority: 1, 2, 3 for all classes

Evolved Allocation/retention priority (priority level; pre-emption capability; pre-emption vulnerability): 1-15; Yes/No; Yes/No for all classes

Source statistics descriptor: Speech/unknown / Speech/unknown / N/A / N/A

Signalling indication: N/A / N/A / Yes/No / N/A

In addition to the QoS profile attributes, there is one relevant bearer-unrelated QoS attribute, the User Equipment Aggregated Maximum Bit Rate (UE AMBR), which defines the maximum bit rate that can be expected across all non-GBR PDP contexts of an MS. Excess traffic may be discarded by a rate shaping function [29].
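The rate shaping function mentioned above can be pictured as a token bucket that admits traffic up to the aggregate limit and discards the excess. The following Python sketch is purely illustrative (the class and parameter names are invented, not taken from 3GPP specifications):

```python
# Illustrative sketch (class and parameter names invented): a rate shaper of
# the kind that could enforce the UE AMBR across all non-GBR traffic, modeled
# as a simple token bucket that discards the excess.

class RateShaper:
    def __init__(self, rate_kbps: float, bucket_kbit: float):
        self.rate = rate_kbps        # long-term admitted rate (the AMBR)
        self.bucket = bucket_kbit    # burst allowance
        self.tokens = bucket_kbit    # currently available credit

    def tick(self, dt_s: float):
        """Replenish credit for `dt_s` seconds of elapsed time."""
        self.tokens = min(self.bucket, self.tokens + self.rate * dt_s)

    def admit(self, packet_kbit: float) -> bool:
        """Forward the packet if credit is available, otherwise discard it."""
        if packet_kbit <= self.tokens:
            self.tokens -= packet_kbit
            return True
        return False

shaper = RateShaper(rate_kbps=64, bucket_kbit=8)
print(shaper.admit(8))   # True: within the burst allowance
print(shaper.admit(8))   # False: excess over the aggregate rate is dropped
shaper.tick(0.125)       # 64 kbps * 0.125 s replenishes 8 kbit
print(shaper.admit(8))   # True again
```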


Chapter 3

Applications Requirements and Network Capabilities

The previous chapter dealt with mobile networks, with no particular regard to the requirements of mobile multimedia applications. The intent of this chapter is to present the opposite view: mobile multimedia applications are first considered with their QoS properties and requirements. Subsequently, high-level application properties and requirements are translated into network properties. For each of the aspects considered, advantages and limitations are shown, and a first attempt is made to match the application requirements with mobile network capabilities; a first set of optimal mobile network channels for mobile multimedia applications is then derived. From now on, the focus is on mobile multimedia telephony and mobile streaming applications.

3.1. MOBILE MULTIMEDIA APPLICATIONS PROPERTIES

3.1.1. Mobile multimedia telephony

The category of mobile multimedia telephony applications includes those that involve a single medium (e.g., Voice over IP (VoIP) or video over IP) as well as multiple media (e.g., audio, video, and others). A multimedia telephony application has the following properties [74]:

Bandwidth greedy. The transport of continuous media, such as audio and video, requires the allocation of a certain amount of bandwidth.

Conversational (delay sensitive). The parties (possibly more than two) involved in a session have the impression of being part of a natural multimodal conversation, as if they were really interacting face to face. This implies that a high level of interactivity between the parties is essential to guarantee system usability for multimedia telephony calls. A conversation can be held only if the end-to-end delay is low and preferably constant. The conversational property between two parties is lost if the end-to-end delay is larger than a few hundred milliseconds [27]. This is the most critical success factor for a multimedia telephony service. In order to guarantee low end-to-end delays, both the network and the MSs must be optimized for processing conversational traffic.

Bi-directional. Within a mobile device, the media flow is always carried symmetrically in both directions, incoming and outgoing. There may be multiple parties in a session. If the media flow becomes unidirectional, the conversational property described above no longer holds.

Real-time. The media encoding, transmission, decoding, and playback must occur in an isochronous fashion at each endpoint. This requirement implies that mobile devices need high processing power because of the additional encoding (devices for mobile streaming require decoding capability only). Real-time encoding must be performed efficiently, with the shortest delays, for bi-directional multimedia streams.

In summary, bandwidth and low delays are the main requirements of a mobile multimedia telephony application. In a mobile network, delays can be traded off against error rates: the lower the delays, the higher the error rates. Therefore, error resilience also becomes an important factor in mobile multimedia telephony systems: any mechanism for error detection and correction/concealment must run within the maximum allowed delay budget. For this reason, retransmission algorithms at the network or application level cannot normally be used, and the set of error resilience tools is more limited. Forward Error Correction and decoder error concealment algorithms are usually the only possible choices for providing resilience against the bit errors (or packet losses) produced by the air interface.

3.1.2. Mobile multimedia streaming

A streaming system is a real-time system of the non-conversational type. It is real-time because the playback of continuous media must occur in an isochronous fashion. Similarly to multimedia telephony applications, a streaming application is bandwidth greedy. However, it differs from a multimedia telephony application in at least a few aspects, and it has the following properties [73]:

One-way data distribution. The media flow is always unidirectional in the downlink direction, from the streaming server to the mobile client. The receiver device needs only decoding and playback functionalities. The user has limited control over a streaming session, and typical control commands in the uplink direction include Play, Pause, Stop, Fast-Forward, and Rewind.

Off-line media encoding. A streaming system is similar to a Video-On-Demand system, where the user can play only pre-stored content. This content is often pre-encoded off-line using specific content creation tools. The concept of live streaming, i.e., streaming from a live data source, is also possible, and the encoding can occur in near real-time (i.e., with delays much larger than those required for conversational multimedia applications). In live streaming, the user has limited seekability within the live stream (e.g., it is not possible to Fast-Forward a stream beyond the latest playback point).

Not highly delay sensitive. There is no high level of interactivity between a mobile client and a streaming server, so end-to-end delays can be relaxed. For example, the time required by the streaming client to execute a command issued by the user (such as Play) does not need to be in the order of milliseconds. Media can be streamed after an initial latency period of several seconds, since there is no bi-directional conversational requirement. This allows the mobile client to smooth out any network jitter without compromising the user QoS.

3.2. APPLICATIONS QOS ISSUES AND MOBILE NETWORK ASPECTS

In the following subsections, the relevant properties of mobile multimedia applications are analyzed more thoroughly and translated into more concrete QoS measurement points and requirements for mobile networks.

3.2.1. Bandwidth

The fundamental characteristic of multimedia applications is that they are bandwidth hungry. Multimedia traffic patterns differ from bursty non-real-time Web traffic [43]. Since multimedia applications deliver continuous media, such as speech, audio, or video, the requirement is rather on the continuous availability of the bandwidth offered by the network. Multimedia streams require fewer bits when there is little or no motion in the scene (or silence in a speech stream), and more bits when there is higher motion (or active speech). Media can be encoded at a Constant Bit Rate (CBR) or at a Variable Bit Rate (VBR). CBR streams generally offer a variable media quality, because the amount of bits allocated to media frames is constant irrespective of the complexity of the scene, so the final quality depends heavily on that complexity. CBR works well when the complexity of the scene is constant and does not fluctuate over time (e.g., for video scenes with little motion). CBR coding produces variable frame rates, which have a negative impact on the subjective quality [43]. In order to offer a constant quality, media streams are often encoded in VBR fashion. Adaptive speech codecs such as the AMR or the Wideband AMR can use multiple encoding bit rates which are known in advance [P8], and are capable of producing either a CBR or a VBR media stream. Encoded video traffic typically exhibits VBR characteristics [43], with minimum, average, and maximum encoding bit rates. Let us now suppose that a video stream in a session has an average bit rate of 64 kbps and a maximum bit rate of 128 kbps.

In QoS-guaranteed networks, one of the problems is setting the right values for the QoS profile. There are three cases [162]:


1. GBR=64 kbps and MBR=64 kbps. During the periods when the video bit rate exceeds the 64 kbps MBR, the exceeding packets will be shaped (i.e., delayed) [109] and eventually dropped by the network. In the case of a streaming application, the server will likely try to retransmit the lost packets, causing even more congestion and losses in the network. In a multimedia telephony application, the receiver may try to use decoder error resilience algorithms to recover the lost data. In this example, the only possibility is to use a CBR stream at 64 kbps.

2. GBR=64 kbps and MBR=128 kbps. The network resources exceeding 64 kbps are used in a best-effort manner [109]. Depending on the network load, there might still be packet losses. A streaming server could retransmit the lost packets and succeed in the error recovery or, in case the best-effort resources are momentarily unavailable, try to retransmit the data again. A multimedia telephony application could behave as in the first case.

3. GBR=128 kbps and MBR=128 kbps. There will be no packet losses, because resources are reserved up to the maximum bit rate of the stream. However, this approach leads to an inefficient usage of network resources and a waste of bandwidth.

The second option represents the case of elastic bandwidth, where bandwidth is used if needed (and is otherwise available to other users). Application awareness of the GBR and MBR attribute values [37, 162] is an advantage, and adaptive methods for bandwidth management can be used to make optimal use of the available best-effort bandwidth at any point in time (see Chapter 6).
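The three GBR/MBR cases above can be summarized in a small sketch. This is an illustrative model only (the function and label names are invented); real networks implement shaping and policing in far more elaborate ways:

```python
# Illustrative sketch (names invented): how a bearer treats traffic against
# its GBR/MBR attributes. Only the three regimes discussed in the text are
# modeled; real shaping and policing are considerably more complex.

def classify_rate(current_kbps: float, gbr_kbps: float, mbr_kbps: float) -> str:
    """Return the treatment of traffic sent at `current_kbps` on this bearer."""
    if current_kbps <= gbr_kbps:
        return "guaranteed"        # QoS attributes hold up to the GBR
    if current_kbps <= mbr_kbps:
        return "best-effort"       # elastic bandwidth, served if resources allow
    return "shaped-or-dropped"     # excess over the MBR is delayed, then dropped

# The example VBR video stream peaking at 100 kbps, in the three cases:
print(classify_rate(100, gbr_kbps=64, mbr_kbps=64))     # case 1
print(classify_rate(100, gbr_kbps=64, mbr_kbps=128))    # case 2
print(classify_rate(100, gbr_kbps=128, mbr_kbps=128))   # case 3
```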

Non-QoS-guaranteed network channels can also be utilized for video transmission. However, in order to achieve a minimum user quality, the network must not be kept overloaded or congested, so as to prevent service disruptions. In this case, only the MBR can be defined, and resources up to the MBR are handled in a best-effort manner. If the application sends data at a bit rate higher than the MBR, the traffic exceeding the MBR is shaped [109]. QoS for multimedia applications over best-effort channels can be provided by carefully over-dimensioning the network, or by fixing a lower subscribers-to-network-capacity ratio. If such an operation is not feasible, the risk is a sub-optimal deployment of real-time multimedia services over best-effort channels, where the QoS is not guaranteed. In this case, much of the performance improvement is left to the application, which must be built in an adaptive way (see Chapter 6).

The (E)GPRS network throughput can be estimated analytically [109]. However, dynamic factors may cause deviations from these values. The main variables that influence the network throughput are: 1) (Modulation and) Coding Schemes; 2) network load; 3) RLC and LLC modes; 4) incremental redundancy; 5) interference level (including the distance from the base station).

If the radio link quality is good, an (M)CS with low error protection can be used, leaving more room for user data, and therefore the user network throughput is higher. If the link quality is bad, an (M)CS with higher error protection is used; this reduces the network throughput and, therefore, the user bandwidth. One or more time slots can be shared by many users in GERAN, but this increases the load and reduces the throughput available per user. When RLC and/or LLC are used in ACK mode, the retransmission system consumes bandwidth, which reduces the network throughput. However, when incremental redundancy is used, the probability of erroneous blocks (i.e., the BLER) in successive transmissions is lower than in the initial transmission; therefore, the use of incremental redundancy can increase the network throughput.

For a fixed (M)CS and network load, and when neither RLC nor LLC ACK mode nor incremental redundancy is used, the network throughput per time slot at layer 3 (the IP layer) is given by

Throughput_L3 = PT - EO ,    (1)

where PT (Peak Throughput) is the air interface throughput, and EO (Encapsulation Overhead) is the sum of the overheads of the protocols below the IP layer (i.e., RLC/MAC, LLC, and SNDCP). A similar formula applies to UMTS, where only RLC/MAC and PDCP generate encapsulation overhead below the IP layer. If the RLC ACK mode is used, then Eq. (1) becomes [109]:

Throughput_L3 = (PT - EO) * (1 - BLER_RLC) ,    (2)

where BLER_RLC is the block error rate at the RLC layer. If the LLC ACK mode is also used, and for a generic but fixed (M)CS, Eq. (2) becomes:

Throughput_L3 = (PT - EO) * (1 - BLER_RLC) * (1 - BLER_LLC) ,    (3)

where BLER_LLC is the block error rate at the LLC layer. Finally, some considerations about network load are provided next. Since resources are shared in (E)GPRS networks, if N<8 users are already sharing a time slot, a new user would get a portion of that time slot, and the throughput for that time slot is reduced to Throughput_L3 / (N+1) [109] when bit rates are not guaranteed. If guaranteed bit rates are required (for example, for supporting the Conversational or Streaming TCs), then a certain number of time slots can be dedicated or, when time slots are shared, the network scheduler can assign transmission turns so as to guarantee a minimum throughput [109]. Since the load represents a throughput Reduction Factor (RF), Eq. (3) for a generic number of user time slots (NTS), taking the load into account, can be rewritten as:

Throughput_L3 = NTS * RF * (PT - EO) * (1 - BLER_RLC) * (1 - BLER_LLC) ,    (4)

where RF is a function of NTS, the total number of time slots in the system, the average system resource utilization, and the distance of the mobile station from the base station [109]. The same reference gives examples of EGPRS throughput under varying load conditions: the results show that the minimum throughput drops below 50% of the maximum under high load conditions, and that if the MS is about 5 km away from the BTS, the throughput is reduced by even more than 50% of the maximum value [109].
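Eqs. (1)-(4) can be collapsed into a single throughput function. The sketch below uses invented example figures (the peak throughput, overhead, BLER, and RF values are illustrative assumptions, not measurements from the cited reference):

```python
def throughput_l3(pt_kbps, eo_kbps, bler_rlc=0.0, bler_llc=0.0, n_ts=1, rf=1.0):
    """Layer-3 throughput per Eqs. (1)-(4): the peak air-interface throughput
    minus the encapsulation overhead, discounted by RLC/LLC retransmissions
    and by the load reduction factor RF over NTS user time slots."""
    return n_ts * rf * (pt_kbps - eo_kbps) * (1 - bler_rlc) * (1 - bler_llc)

# Invented example figures, per time slot: 22.4 kbps peak, 2.4 kbps overhead.
print(round(throughput_l3(22.4, 2.4), 2))                             # Eq. (1)
print(round(throughput_l3(22.4, 2.4, bler_rlc=0.1), 2))               # Eq. (2)
print(round(throughput_l3(22.4, 2.4, 0.1, 0.05), 2))                  # Eq. (3)
print(round(throughput_l3(22.4, 2.4, 0.1, 0.05, n_ts=3, rf=0.8), 2))  # Eq. (4)
```

Each factor can only reduce the result, which mirrors the narrative: overhead, retransmissions, and load all eat into the peak air-interface rate.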


3.2.2. Error rates and delivery of erroneous packets

Mobile network channels are not inherently error-free. High error rates can be caused, for example, by a weak radio signal in a particular area (such as under bridges, behind buildings or hills, or in elevators) or by handovers due to user mobility [63]. On the other hand, multimedia streams, such as speech, audio, and video, do not require transmission over an error-free channel, because they tolerate a certain level of noise. For example, a video stream with some erroneous bits (or some missing packets) may show visual artifacts; a speech/audio stream under the same error conditions may exhibit audible gaps or disturbing artificial noises. Although multimedia streams are to some extent error tolerant, there is a limit beyond which the user QoS is no longer acceptable; that limit should not be exceeded, in order not to compromise the user experience [186].

Errors during transmission over the air interface are essentially bit errors or missing bits. At layer 2 of the receiver's terminal, these errors appear as corrupted or missing RLC blocks. If the protocol layers above allow the reception of corrupted packets, then single bit errors can propagate up to the decoder; otherwise, the corrupted packets are discarded and not forwarded to the upper layers. In circuit-switched networks, reception of erroneous packets is possible, for example, over transparent UTRAN connections (in HSCSD/ECSD networks this is not possible, as such frames are treated by the RLP as improper frames and not forwarded to the higher layers [6]). In packet-switched networks, the delivery of erroneous IP packets depends on the packet checksum. Normally, the checksum is activated at the UDP or TCP level, and these protocols usually discard corrupted packets whenever the computed checksum does not match the checksum field [87, 187]. Although it is theoretically possible to deactivate this checksum, it is activated in most IP implementations; this means that erroneous IP packets are not forwarded to the higher layers (i.e., the packets are discarded and considered lost). The UDP-Lite protocol [142] allows partial error detection by making use of a variable-coverage checksum, and it enables the forwarding of packets with bit errors to the higher layers. UDP-Lite is suitable for applications that use codecs able to handle payloads containing bit errors (e.g., the AMR and AMR-WB speech codecs).
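The partial-coverage idea behind UDP-Lite can be illustrated as follows. The additive checksum used here is a trivial stand-in for the real Internet checksum, and the packet layout is invented; only the coverage mechanism is the point:

```python
# Illustrative sketch of the UDP-Lite idea (RFC 3828): the checksum covers
# only the first `coverage` bytes, so a bit error beyond the covered prefix
# does not cause the packet to be dropped. The additive checksum below is a
# stand-in for the real Internet checksum; the packet layout is invented.

def udplite_accept(packet: bytes, coverage: int, expected_sum: int) -> bool:
    """Deliver the packet to the application iff its covered prefix is intact."""
    return (sum(packet[:coverage]) & 0xFFFF) == expected_sum

header = b"\x12\x34\x00\x08"           # error-sensitive part (e.g., headers)
payload = bytearray(b"speechframe")    # error-tolerant part (e.g., AMR payload)
csum = sum(header) & 0xFFFF            # checksum computed over the header only

payload[5] ^= 0x01                     # a bit error hits the uncovered payload
corrupted = header + bytes(payload)
print(udplite_accept(corrupted, len(header), csum))   # True: still delivered
```

An error-tolerant decoder can then conceal the damaged payload bytes instead of losing the whole packet.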

Media codecs have certain error resilience characteristics, and one must make sure that the input of a decoder does not contain more errors (or missing data) than the decoder is able to correct and conceal. At this point, it is important to clarify two of the QoS profile attributes: the SDU Error Rate (i.e., the Packet Loss Rate (PLR)) and the Residual BER (RBER). They assume different meanings for an application depending on whether the delivery of erroneous SDUs is allowed or not. There are three cases: 1. Delivery of Erroneous SDUs = NO (erroneous SDUs are detected and discarded). In this case, the SDU Error Rate indicates the fraction L of packets not delivered (i.e., discarded or lost). Let us indicate with D the fraction of delivered packets (L+D=100%). Then the RBER indicates the undetected BER in the fraction D of delivered packets. This means that the delivered packets may still contain some errors, which should be accounted for on top of the SDU Error Rate.


2. Delivery of Erroneous SDUs = YES (erroneous SDUs are detected and delivered). In this case, the SDU Error Rate indicates the fraction E of delivered packets with errors. Let us indicate with N the fraction of delivered packets without errors, and with T the total of delivered packets (T=E+N=100%). Then the RBER indicates the BER in the total T of delivered packets. As in the previous case, this means that the delivered packets may still contain some errors, which should be accounted for on top of the SDU Error Rate.

3. Delivery of Erroneous SDUs = “-“ (erroneous SDUs are delivered without considering error detection).

The lowest residual BER required by the UMTS QoS [29] is equal to 6*10^-8. This is achieved also by GERAN in A/Gb mode, as the 24-bit CRC at the LLC layer allows it [14]. The SDU error rate in GERAN depends on whether the LLC ACK mode or UNACK mode is used. If the LLC uses the RLC ACK mode, the SDU error ratio is limited to 10^-3 (but can be as high as 10^-2, see Table 3), because of the 12-bit RLC CRC [14]. HSCSD BERs can be up to 10^-3 in transparent mode, and less than 10^-6 in non-transparent mode [78]. Clearly, the performance offered by GERAN and UTRAN networks differs in terms of error rates, and a multimedia application must be designed to cope with the highest error rates if radio access technology independence is one of the goals to achieve. It must be pointed out that, for multimedia telephony applications, in a mobile-to-mobile call the errors impact two radio links; therefore, the single-link residual BERs or SDU error rates need to be multiplied by two in order to approximate the right end-to-end value.
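The two-link doubling rule can be checked numerically. Assuming independent losses on the two radio links of a mobile-to-mobile call (the independence is an assumption made here, not a statement from the cited specifications):

```python
# Numerical check of the two-link doubling rule, assuming independent losses
# on the two radio links of a mobile-to-mobile call (the independence is an
# assumption made for this sketch).

def end_to_end_loss(p_single: float, links: int = 2) -> float:
    """Probability that a packet is corrupted on at least one of the links."""
    return 1 - (1 - p_single) ** links

p = 1e-3                    # single-link SDU error ratio
print(end_to_end_loss(p))   # ~0.002: close to 2 * p for small p
```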

To guarantee a given SDU error ratio, the larger the SDU size, the smaller the RLC BLER the radio interface has to provide, which means that the reliability requirements for the radio link become more stringent. The maximum SDU size should therefore be considered together with the required SDU error ratio. From the network viewpoint, smaller SDUs make it easier to comply with the reliability requirements. The application should be conservative when specifying a maximum SDU size, and set this parameter to be larger than the maximum expected RTP packet size (plus the UDP/IP overhead) [43].
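The relationship between SDU size and the required RLC BLER can be sketched as follows, assuming an SDU is lost whenever any of its RLC blocks is lost (the 40-byte block size used below is an illustrative assumption, not a 3GPP figure):

```python
import math

# Sketch of the SDU-size/BLER trade-off: if an SDU spans k RLC blocks and is
# lost whenever any block is lost, then SDU_ER = 1 - (1 - BLER)^k.

def required_rlc_bler(sdu_bytes: int, rlc_block_bytes: int,
                      target_sdu_er: float) -> float:
    """Invert SDU_ER = 1 - (1 - BLER)^k to get the BLER the radio must offer."""
    k = math.ceil(sdu_bytes / rlc_block_bytes)
    return 1 - (1 - target_sdu_er) ** (1 / k)

# A 1500-byte SDU needs a much smaller RLC BLER than a 100-byte one
# for the same 10^-3 target SDU error ratio:
print(required_rlc_bler(100, 40, 1e-3))
print(required_rlc_bler(1500, 40, 1e-3))
```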

Finally, if a network is not able to meet the maximum error rate tolerated by the application, the error resilience level can be increased by application means built above the network layer (e.g., redundancy or application-layer FEC, RTP retransmission, Unequal Error Protection (UEP), or interleaving; see Sections 4.5.1 and 5.5.2).

3.2.3. Delivery order

Ideally, all the packets delivered by the network should arrive in order. However, this is not always the case, and out-of-order delivery may happen sporadically [180]. Streaming applications are more tolerant to out-of-order delivery, since the more relaxed delay constraints allow for packet reordering either below the network layer (when in-order delivery is required from the network) or in the application (when in-order delivery is not required from the network, and the terminal application takes care of re-sequencing the received packets).

Multimedia telephony applications are more delay sensitive. Network in-sequence packet delivery may be too expensive in terms of additional delay, and the application may be unwilling to tolerate it. In fact, if a packet is out of order, it has to wait for n (n >= 1) packets before being re-ordered and then played back. This produces a delay accumulation in the receiver equal to n times the playback duration of each packet, plus a potentially discontinuous playback of the same duration. If the receiver wishes to remove that delay, it must skip the playback of n packets and cause a playback glitch. Therefore, for multimedia telephony, enforcing in-order packet delivery is not recommended.
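The delay accumulation described above is simply n times the packet playback duration, as the following minimal sketch shows (the 20 ms frame duration is an illustrative figure for a speech stream):

```python
# Minimal sketch of the reordering-delay accumulation: waiting for n later
# packets before re-sequencing one out-of-order packet adds n times the
# per-packet playback duration at the receiver.

def reordering_delay_ms(n_packets: int, packet_duration_ms: float) -> float:
    """Delay added while waiting for n later packets (n >= 1) to re-sequence
    one out-of-order packet."""
    return n_packets * packet_duration_ms

print(reordering_delay_ms(3, 20.0))   # a 3-packet reorder adds 60 ms
```

For a telephony call already operating near the 150 ms preferred bound, even a small reorder depth consumes a significant share of the delay budget.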

3.2.4. Delay

The delay requirements are not stringent for mobile streaming applications. The unidirectionality of the media flow allows using large client buffers, but end-to-end delays for mobile streaming should still be well below 10 seconds [27]. In multimedia telephony, the conversational characteristics impose limiting the delays as much as possible. In the delay budget, about 100 ms is due to pure network delay for one link in UMTS [27]; for a mobile-to-mobile call, at least 2x100 ms must be accounted for in the total end-to-end delay. Any additional delay is caused by the terminals (e.g., encoding + decoding delays). To preserve interactivity in multimedia telephony applications (which usually include speech), 3GPP recommends the following limits on the end-to-end one-way delay [27]:

TABLE 6. DELAY BOUNDS FOR MULTIMEDIA TELEPHONY APPLICATIONS

End-to-end delay: Comment

0-150 ms: Preferred range (< 30 ms not noticeable; < 100 ms not noticeable if echo cancellation is provided and there are no distortions on the link)

150-400 ms: Acceptable range (but with increasing degradation)

> 400 ms: Unacceptable range
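Table 6 can be read programmatically with a trivial classifier (the handling of the exact boundary values at 150 and 400 ms is a choice made here for illustration, not specified by 3GPP):

```python
# A trivial programmatic reading of Table 6; boundary handling at exactly
# 150 and 400 ms is an illustrative choice.

def rate_one_way_delay(delay_ms: float) -> str:
    """Classify an end-to-end one-way delay against the 3GPP bounds [27]."""
    if delay_ms <= 150:
        return "preferred"
    if delay_ms <= 400:
        return "acceptable (increasing degradation)"
    return "unacceptable"

print(rate_one_way_delay(120))
print(rate_one_way_delay(250))
print(rate_one_way_delay(450))
```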

The reasoning made on delay so far is statistically valid for 95% of the packets (see Table 4). What happens to the remaining 5%? In those cases, the transfer delay is larger than that guaranteed by the network. The effect of a larger transfer delay on the application can be one of the following: 1. Buffer underflow, because packets are excessively delayed by the network. 2. Packets unusable, because the packet utility function decreases over time: by the time a late packet arrives at the application, its utility might be null (e.g., if it arrives after its playout time), so that the packet is considered lost.

3. Packets lost, because they have been dropped by the network. All these cases result in a potential QoS degradation. The first one can be handled by using a receiver de-jittering buffer larger than the transfer delay in the QoS profile, in order to accommodate more than 95% of the network delay distribution [2]. In this way, the packets arriving later than the transfer delay can be captured and used without risking an underflow of the receiver buffer. The required accuracy in capturing late packets must be traded off against memory consumption and initial buffering delay. Therefore, whenever more than 95% accuracy on the network delay is required (for example, 99%), this must be handled by the application.
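The trade-off between the 95th-percentile transfer delay and a stricter application-side target can be illustrated by computing order statistics over a synthetic delay distribution (the distribution below is invented for illustration):

```python
import random

# Order-statistics view of the buffer-sizing trade-off: the transfer-delay
# attribute covers the 95th percentile, so capturing 99% of packets needs a
# larger buffer. The delay distribution below is synthetic and illustrative.

random.seed(1)
delays_ms = [100 + random.expovariate(1 / 30) for _ in range(1000)]

def percentile(samples, pct):
    """Value below which a fraction `pct` of the samples falls."""
    ordered = sorted(samples)
    return ordered[int(pct * (len(ordered) - 1))]

p95 = percentile(delays_ms, 0.95)
p99 = percentile(delays_ms, 0.99)
print(p95 < p99)   # True: the stricter target costs extra buffering and memory
```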

The second case cannot be handled by the application, since the real-time properties of continuous media impose a limited time window for the playback of media data, after which the data loses its utility.

In the third case, the network policing mechanism may drop all the packets experiencing a delay beyond the transfer delay in the QoS profile. However, this would mean an error rate increased by 5%, which is of course undesirable. Under normal QoS conditions, networks are required to deliver packets even if they experience a delay larger than the transfer delay in the QoS profile. If dropping late packets helps in maintaining the current QoS, which could no longer be guaranteed if those excessively late packets were delivered, then packet dropping is the preferred solution [1].

In HSCSD there are two variables that must be considered in the RLP when it is operating in non-transparent mode: T1 and N2 [6]. T1 is the retransmission timer that triggers RLP retransmissions. N2 is the maximum number of retransmissions. Their default and minimum values are shown in Table 7 [6]. These figures show that non-transparent HSCSD is not suitable for multimedia telephony applications. In fact, an error in an RLP frame causes its retransmission, but not before the retransmission timer fires. In the best case, this adds 2 × 380 ms for two radio links, if the same data block is hit by errors on both radio links. If the error persists, even more retransmissions are required. Typical delay values range from 0.4 s up to 1 s for mobile-to-mobile connections [160]. These are outside the delay bounds for multimedia telephony applications indicated in Table 6.

TABLE 7. HSCSD RLP KEY PARAMETERS FOR NON-TRANSPARENT MODE

RLP Parameter | Range of values | Default value
T1 | >420 ms (for 14.4 kbps TS); >380 ms (for 9.6 kbps TS) | 520 ms (for 14.4 kbps TS); 480 ms (for 9.6 kbps TS)
N2 | >0 | 6
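As a rough illustration (not part of the original analysis), the delay added by RLP retransmissions can be sketched from the Table 7 values; the function below simply multiplies the retransmission count by the timer T1:

```python
def rlp_retransmission_delay(t1_ms: int, retransmissions: int) -> int:
    """Each RLP retransmission is triggered only after timer T1 fires,
    so n retransmissions add roughly n * T1 of delay (sketch)."""
    return retransmissions * t1_ms

# Best case from the text: one retransmission on each of two radio links,
# with the minimum T1 for a 9.6 kbps TS (380 ms):
best_case_ms = rlp_retransmission_delay(380, 2)    # 760 ms
# Persistent errors with the default T1 for a 14.4 kbps TS and N2 = 6:
worst_case_ms = rlp_retransmission_delay(520, 6)   # 3120 ms
print(best_case_ms, worst_case_ms)
```

Even the best case already approaches the delay bounds for multimedia telephony given in Table 6.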

Results for GPRS with three time slots [109] show that average 1-link transfer delays (i.e., the delay for delivering LLC PDUs) are up to 1.6 s for CS-1..2 configurations, depending on the load. For a CS-1..4 configuration, average delays for 1-link are up to 0.85 s. For EGPRS with three time slots and MCS-1..9 configuration, average delays for 1-link are up to 0.7 s. For UMTS [21], the estimated delays are 390 ms for circuit-switched video calls over H.324 for bit rates in the range 32-384 kbps.

In the LLC protocol, the parameters T200 (retransmission timer), N200 (maximum number of retransmissions) and N201 (maximum number of octets in an information field) deserve attention, as they can heavily impact the performance of the network and applications. Table 8 shows their ranges and default values [3]. The timer T200 must be set to a value larger than the LLC RTT. In [14] it is recommended to use a value between 6 and 8


seconds when using LLC ACK mode in the presence of 1 s long cell changes (smaller values would trigger retransmissions too early). The four default values indicated in the table are chosen to correspond to the four GPRS QoS delay classes [3].

TABLE 8. LLC KEY PARAMETERS FOR GPRS/GERAN

LLC Parameter | Range of values | Default value
T200 | 0.1-409.5 s | 5, 10, 20 or 40 s
N200 | 1-15 retransmissions | 3 retransmissions
N201-U (UNACK mode) | 140-1520 bytes | 500 bytes
N201-I (ACK mode) | 140-1520 bytes | 1503 bytes

3.2.5. Delay jitter

This quantity expresses the delay variation over time. For example, if a certain network has an average delay of 500 ms with a 200 ms jitter, this is equivalent to stating that the delay can vary in the range [300, 700] ms. Delay jitter is caused, among other things, by retransmissions in the lower layers (e.g., RLC, LLC, SNDCP, TCP, RTP) when some form of acknowledged mode is used. Although delay jitter is a very important parameter that may have a severe impact on application performance, it cannot be explicitly specified in current UMTS mobile networks. The issue is then to find a way to define the delay jitter.

The delay in GPRS is defined as "maximum values for the mean transfer delay and 95-percentile delay", while in UMTS networks it is defined as "maximum delay for 95th percentile of the distribution of delay". The two definitions can be expressed as

D_GPRS(t) = max { mean[d(t)], d_95%(t) } , and (5)

D_UMTS(t) = max { d_95%(t) } . (6)

It is easy to see that the two definitions relate to maximum delays over a certain time interval t. The delay jitter can be derived from the above delay definitions if some assumptions on the minimum network delay can be made or, ultimately, if this is a known value. For example, if it is known that the minimum transfer delay in the network is 100 ms, and the delay QoS profile attribute is equal to 500 ms, then the delay can vary between 100 ms and 500 ms, inducing a jitter of 400 ms. Given that the definition of delay starts from maximum delays, the delay jitter can only be negative. The expression for delay jitter over GPRS or UMTS networks follows:

J(t) = d_min(t) - D(t) . (7)

If the minimum network delay is not known, or if it can vary (for example, when an application is built to work over different access networks subject to different minimum transfer delays), it can be estimated by the application itself.
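A minimal sketch (not from the thesis) of such an application-side estimate: track the smallest one-way delay observed so far as an approximation of d_min, and apply Eq. (7) against the negotiated QoS profile delay:

```python
def estimate_jitter_ms(observed_delays_ms, qos_max_delay_ms):
    """Approximate Eq. (7): jitter = d_min - D, with d_min estimated
    as the smallest one-way delay observed so far (sketch)."""
    d_min = min(observed_delays_ms)
    return d_min - qos_max_delay_ms

# Minimum observed delay 100 ms, QoS profile delay attribute 500 ms:
print(estimate_jitter_ms([230, 100, 310], 500))  # -400, i.e. a 400 ms jitter
```

The estimate improves as more delay samples are collected, since the observed minimum converges towards the true minimum network delay.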

At the service level, the human ear is highly intolerant to short-term delay variations of speech. It is therefore natural that the jitter is reduced as much as possible; a 1 ms limit is suggested as a target [27]. However, since the terminal application has means to smooth out jitter through buffering, the acceptable network delay variation is much greater than the


delay variation given by the limits of human perception. For example, for streaming applications, a value of up to 2 seconds has been suggested [27] for both audio and video. For multimedia telephony applications, the jitter tolerance depends on the jitter buffer dimension, but reasonably it should be no larger than 150 ms.

Whenever two media are carried in a streaming or multimedia telephony session, there is another requirement related to delay and jitter: inter-media synchronization (or lip sync), which must be performed within a certain threshold. Section 4.5.4 will discuss lip synchronization thresholds for mobile applications.

3.2.6. Handovers and cell changes

Since mobile applications imply mobility, and mobility implies network cell changes (or handovers), these aspects must also be considered. Handovers often produce an interruption in the data flow for a period ranging from a few tens of milliseconds up to several seconds, depending on the type of originating and destination cells, and the type of handover. They may have a severe impact on a multimedia application, depending on whether handovers are perceived as lossless or lossy data flow interruptions. If a handover is lossless, the application will experience it as an increased delay or delay jitter in the packet arrival times. If a handover is lossy, the application will experience it as an increased loss rate, where the number of lost packets is a function of the handover duration. Different types of cell reselections and handovers are classified in Publications [P4, P5]. In [14, 99] some results on cell reselection delays are reported for GERAN A/Gb mode. For Intra-BSC cell reselections the service outage time is 0.5-4 seconds. For Inter-BSC cell reselections the service outage time is 1-15 seconds, with typical values of 2-3 seconds. With NACC the goal is to reduce the service outage time to a few hundred milliseconds (but less than 1 second [14]). These values exceed the 3GPP requirements given in [9], where it is stated that the handover duration shall be less than 600 ms.

There are no standard application or service requirements regarding handovers. However, since handovers are essentially service interruption times, they must be minimized as much as possible in order to avoid QoS degradation. While a mobile streaming application can tolerate longer handovers, because its longer buffering capability can absorb larger delay variations, a multimedia telephony application can tolerate only shorter handovers, because the receiver jitter buffer is smaller.

The GERAN LLC protocol running in acknowledged mode can be seen as a means to correct the residual errors occurring over the Gb interface between BSS and SGSN. These errors cannot be corrected by the RLC/MAC retransmissions. However, the Gb interface is not an air interface but a wired interface, and the amount of errors expected over this network trunk is much lower than over the Um interface (see Figure 4). Therefore, running LLC in ACK mode is not always useful, considering the additional protocol overhead of acknowledging every LLC frame transmitted and the increased delay induced by LLC retransmissions. Especially for


multimedia telephony applications, LLC must be run in UNACK mode. However, one advantage of LLC ACK mode is that it provides lossless Intra- and Inter-BSC cell reselections [P5]. This is of great advantage to mobile streaming applications, which can tolerate the extra delays caused by the LLC layer and benefit from near-seamless cell reselections at the application layer (see section 5.5.6). In fact, unless the cell reselection is of the Inter-SGSN type, data is buffered in the network during the cell change and delivered to the mobile station with an additional delay, which is perceived by the application as increased jitter. Yet another issue that has an impact on the amount of losses during cell reselections is the PDU lifetime, which defines the remaining period of time during which a PDU is considered valid within a BSS. When the lifetime of a PDU expires, it is discarded, causing a data loss. The PDU lifetime can be set to any value up to over 10 minutes, or even to infinity [10].

3.2.7. Segmentation issues

Segmentation (and reassembly) is one of the features offered by several protocols at different layers (e.g., IP, SNDCP, and RLC). The principle is that of splitting SDUs larger than the maximum allowed size into two or more PDUs of legal size. Segmentation performed at any layer has the advantages of admitting traffic that is not conformant to the protocol specification, and of hiding lower layer protocol details from the higher layer protocols (e.g., transport optimizations or even a different lower layer protocol design). However, segmentation has the disadvantages of inducing additional delay and delay jitter (caused by the operations of splitting and reassembling SDUs), and an increased protocol header overhead that reduces the user bit rate (e.g., a 100-byte SDU split into two 50-byte PDUs often carries a few extra bytes of protocol headers in each PDU). Yet another disadvantage is the increase of the error rate: in general, segmentation at layer n produces a higher error rate at layer n+1, depending on the number of segments. In other words, the goodput at layer n+1 is lower than the goodput at layer n.

To investigate the causes of segmentation, one must start from the highest layer of the protocol stack. An application often generates large packets in order to reduce the impact of the protocol headers, and therefore to minimize the bandwidth wasted on headers. If the packets generated by the application are too small, the bandwidth wasted on protocol headers is too high, and the channel is not utilized in an efficient manner. Furthermore, small packets increase the application packet rate (given a fixed application target bit rate), which may cause problems in the sending/receiving mobile terminal or in the network. The use of large packets imposes more stringent requirements on the lower layers in terms of error rates, especially if the application negotiates specific error rates with the network. In fact, the use of large higher layer packets in conjunction with lower layer acknowledged mode transmission (e.g., RLC ACK mode) generates a certain number of segmented packets that must be delivered to the peer end of the protocol stack, possibly within a given delay budget and within the negotiated error rate. If acknowledged mode is


not used in the lower layers, large packets increase the effort required by the network to provide the given error rate. For instance, more resources may be used to guarantee the negotiated error rate. If this cannot be fulfilled, the application becomes more vulnerable, since large lost packets generally produce more perceptual media impairment than short lost packets do. To mitigate this problem, smaller packets can be used by the application, but this reduces the bandwidth available to the user application because of the excessive use of protocol headers.

Segmentation, error rates and SDU sizes at any protocol layer are linked with each other by the following expression [169]:

P_SDU = 1 - (1 - P_PDU)^n . (8)

The expression implies that the SDU error rate increases as a function of the PDU error rate, if one SDU is segmented into n PDUs (and assuming their errors are not correlated). Conversely, the PDU error rate can be expressed as a function of the SDU error rate (under the same assumption):

P_PDU = 1 - (1 - P_SDU)^(1/n) . (9)

This last expression helps in understanding the error rate requirements on lower layer protocols. For example, if the desired SDU error rate at the IP layer is 1% (= 10^-2), and each IP packet is fragmented into 5 lower layer PDUs (e.g., at the SNDCP layer), then the error rate requirement for the lower layer is equal to 2·10^-3 (i.e., 5 times lower than the requirement of the IP layer). In other words, an error rate of 2·10^-3 at the SNDCP layer will produce a much higher error rate at the IP layer, reducing the overall IP layer goodput. Segmentation must be used with care, as the requirements on the lower layers can be hard (or even impossible) to achieve, and the impact on the QoS of the applications might be worse than expected. A mobile multimedia application can avoid compromising the perceived QoS by using packets that are not too large.
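Expressions (8) and (9) are easy to check numerically; the sketch below reproduces the 1% / 5-segment example from the text:

```python
def sdu_error_rate(p_pdu: float, n: int) -> float:
    # Eq. (8): an SDU is lost if any of its n (uncorrelated) PDUs is lost.
    return 1 - (1 - p_pdu) ** n

def pdu_error_rate(p_sdu: float, n: int) -> float:
    # Eq. (9): the inverse relation of Eq. (8).
    return 1 - (1 - p_sdu) ** (1 / n)

# Target 1% SDU error rate at the IP layer, 5-way segmentation below it:
p = pdu_error_rate(0.01, 5)
print(round(p, 4))                      # 0.002, i.e. 2e-3 as in the text
print(round(sdu_error_rate(p, 5), 4))  # 0.01, back to the IP-layer target
```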

Segmentation in GERAN at the SNDCP layer is performed to guarantee that each LLC PDU is not larger than the N201 LLC field [8] (see Table 8). In SNDCP frames, the reliability class (from the QoS profile) indicates whether the LLC frame carrying the SNDCP PDU must be transmitted in protected or unprotected mode, and whether the RLC/MAC ACK or UNACK mode must be used. One implication of LLC UNACK mode is that the default value of the N201-U field is 500 bytes. This means that when an application is not aware of the LLC configuration, it should take care to generate IP packets that fit into the default N201-U size (considering intermediate layer protocol headers). If this does not happen, the SNDCP layer will segment the too large LLC SDUs. Alternatively, the MS (or the application via an interface) could take care of negotiating the maximum value for the N201-U field, in order to avoid segmentation. The N201-U issue is more sensitive for mobile streaming applications, which may make use of packets larger than 500 bytes. For multimedia telephony applications this is not a real problem, since this kind of traffic typically consists of packets smaller than 500 bytes.
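As an illustrative check (the 3-byte SNDCP header below is an assumed value, not taken from the text), an application aware of the LLC configuration could test whether its IP packets fit the default N201-U before SNDCP segments them:

```python
N201_U_DEFAULT = 500  # bytes, default for LLC UNACK mode (Table 8)

def needs_sndcp_segmentation(ip_packet_bytes: int,
                             sndcp_header_bytes: int = 3) -> bool:
    """True if the resulting LLC SDU (IP packet plus an assumed SNDCP
    header) exceeds the default N201-U and would be segmented."""
    return ip_packet_bytes + sndcp_header_bytes > N201_U_DEFAULT

print(needs_sndcp_segmentation(400))   # False: typical telephony-sized packet
print(needs_sndcp_segmentation(1400))  # True: large streaming packet
```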


3.3. RECOMMENDED NETWORK CHANNELS FOR MOBILE MULTIMEDIA APPLICATIONS

Based on the discussion in this chapter and section 2.4, it is already possible to have an idea of the best channels and network configurations for multimedia telephony and mobile streaming applications. For the former type of applications, there are two cases depending on whether the channel is CS or PS:

For CS channels, a mobile multimedia telephony service can offer bit rates up to 64 kbps for HSCSD, ECSD or UMTS1. BERs can be up to 2·10^-3 in mobile-to-mobile calls for HSCSD/ECSD transparent mode (non-transparent mode is not recommended because of the high one-way delays, which can be up to 1 s). UTRAN networks offer BERs up to 10^-1 for mobile-to-mobile calls; such rates require additional error resilience features at the receiver end. One-way delays are at least 200 ms for a mobile-to-mobile call in UMTS (and most likely higher for HSCSD/ECSD).

For PS channels, a mobile multimedia telephony service can offer theoretical bit rates up to 473.6 kbps for GERAN and up to 256 Mbps for UTRAN. However, examples of Conversational TC bearers for multimedia telephony have been given by 3GPP only for VoIP and for bit rates up to 42.8 kbps (over RLC UNACK mode) [24]. It has to be noted that GERAN does not offer guaranteed QoS, so the bit rate has to be considered as best-effort. SDU error rates are as low as 2·10^-2 for GERAN (using reliability class 3) and UTRAN for mobile-to-mobile calls; additional error resilience at the codec level is very beneficial also in this case. Long delays make current GERAN networks unsuitable for mobile multimedia telephony services: for mobile-to-mobile calls, delays can be up to 2.7 s and cell reselection service outage times can be up to 1 s. Delays in UTRAN are as low as 200 ms for mobile-to-mobile calls, and handover outage times can approach zero, as will be shown in Chapter 4.

Mobile streaming services can be deployed over all types of networks and bearers. CS bearers offer up to 64 kbps for non-transparent HSCSD and for UTRAN with the Streaming TC [24]. PS bearers can offer theoretical bit rates up to 473.6 kbps for GERAN and up to 256 Mbps for UTRAN. For actual UTRAN deployments, bit rates up to 4096 and 14336 kbps (respectively for uplink and downlink with the Interactive and Background TCs), and up to 384 kbps in uplink and up to 1024 GBR/14336 MBR kbps in downlink with the Streaming TC [24], are considered at the time of writing of this thesis. Because of the non-conversational nature of the application, if network resources cannot be reserved, higher network delays can be tolerated by the applications to deal with variable network bandwidth and variable SDU error rates (including those caused by service outages due to cell reselections).

1 CS bearers up to this rate are considered for testing (with Conversational TC and RLC T mode [24]), although UMTS supports higher rates.


Chapter 4

Mobile Multimedia Telephony

Multimedia telephony is not a new technology, as it was proposed a few decades ago for home usage [91, 209]. However, for several reasons (technical, social, marketing or others), this service has not been as successful and widespread as initially expected.

User Generated Content (UGC) is progressively breaking the traditional social barrier whereby people sometimes do not like to show themselves in video calls [91]. At the same time, mobile communication technologies have recently facilitated the introduction of multimedia calls. For example, several standardization organizations, such as ITU-T (International Telecommunication Union, Telecommunication sector), IETF (Internet Engineering Task Force) and 3GPP, have made efforts to specify mobile network architectures, protocols and codecs for multimedia telephony. This application has also recently hit the mass market. The first Nokia CS mobile multimedia call was introduced in 2005. PS mobile multimedia telephony is even more recent: in November 2009 Fring released an application for Nokia phones, and Skype with video calls for the iPhone was introduced only at the end of December 2010.

This chapter will introduce mobile multimedia telephony architectures, services and protocols. Data about typical traffic patterns, bearer considerations and QoS metrics will also be presented. QoS improvement algorithms will be introduced together with their related performance results.

4.1. MOBILE MULTIMEDIA TELEPHONY ARCHITECTURES AND SERVICES

The general characteristics of a typical mobile multimedia telephony system have been introduced in section 3.1.1. Its architecture can be represented as a point-to-point bi-directional communication system (see Figure 6). In a more complex scenario, a multi-



party call use case could be considered. However, for simplicity, this scenario will not be further elaborated in this thesis.

A mobile multimedia telephony system is mainly made of two components: a number of mobile terminals supporting a multimedia telephony application, and the mobile network. The application on mobile terminal A in Figure 6 is connected to the mobile network through a logical connection established between the network and the mobile terminal (i.e., a PDP context for IP based architectures). This connection uses physical transport channels to enable data transfer in the downlink and uplink directions.


Figure 6. A typical mobile multimedia telephony system

For mobile device A, the speech and video content is created live from the microphone and camera input. It is encoded in real time by the device application and transmitted in the uplink direction towards the network and mobile device B. Speech and video data in the opposite direction (downlink) is conveyed from mobile device B, via the network, to mobile device A. Upon media reception, device A performs media decoding and display/playback of the video and speech data. In addition, the device sends and receives information for session establishment, QoS control and media synchronization. Each multimedia device may react promptly upon reception of QoS reports, taking appropriate actions to guarantee the best possible user experience at any instant. Device B is placed at the other end, and its functionality is symmetrically identical to that provided by device A.

4.1.1. Circuit-Switched multimedia telephony

3GPP mobile standards have included the specification of circuit-switched terminals for multimedia telephony since 1999. These are called 3G-324M and are based on ITU-T H.324 terminals with Annex C (and optional Annex H) [126], with modifications specified by 3GPP [33]. The system architecture of a 3G-324M terminal is depicted in Figure 7 [32]. The mandatory elements are a wireless interface, the H.223 multiplexer with Annexes A and B [121], and the H.245 system control protocol for in-band signalling (version 3 or


later). 3G-324M terminals are specified to work at bit rates of at least 32 kbps and may carry speech (AMR [31], AMR Wideband [35], G.723.1), video (H.263 [122], H.264 [124], MPEG-4 Visual [118], H.261), real-time text (T.140), image/data transfer, and sharing of whiteboards and applications (T.120). As already mentioned in section 3.3, the maximum bit rate of CS channels for multimedia telephony is 64 kbps; therefore, 3G-324M terminals can operate in the range of 32 to 64 kbps. The optional Mobile Multilink Layer (MML) [126] allows data transfer over up to 8 independent physical connections providing the same transmission rate, in order to yield a higher aggregate bit rate. The MML provides the split functionality towards the lower protocol stack layers and the aggregation functionality towards the upper protocol stack layers. 3G-324M out-of-band call set-up/control issues and a more detailed description of its protocols and codecs are included in Publication [P1] and in [74].

[Figure content: video and audio I/O equipment; user data applications (T.120, T.140, ...); system control with H.245, CCSRL and NSRP [LAPM/V.42]; video codec H.263 [H.264, MPEG-4, H.261, ...]; speech codec AMR [AMR-WB, G.723.1, ...]; data protocols [V.14, LAPM, ...]; H.223 multiplex/demultiplex with Annexes A and B [optionally C and D]; optional receive path delay; optional multilink (H.324 Annex H); call set-up towards the 3GPP network; scope of TS 26.110.]

Figure 7. System architecture of 3G-324M terminals

3G-324M implementation guidelines give some important recommendations for the transport of multimedia streams [41]. To improve error resilience, video codecs are recommended to at least align the Picture Start Code (PSC), the GOB header or the start code prefix of the first NAL unit with the start of an AL-SDU. For the control protocol, the Windowed Numbered Simple Retransmission Protocol (WNSRP) is recommended for supporting efficient and reliable control message exchange. Furthermore, support for call set-up time reduction (Media Oriented Negotiation Acceleration, MONA) is recommended. Some H.263 video decoder features (Profile 3) are recommended for improving error resilience (Slice Structured Mode) or compression efficiency (Advanced Intra Coding, Deblocking Filter and Modified Quantization). The same reasoning applies to MPEG-4 Visual (Resynch


Marker, Header Extension Code, Data Partitioning, Reversible Variable Length Codes and Adaptive Intra Refresh for improving error resilience).

4.1.2. Packet-Switched multimedia telephony

The 3GPP mobile standards for packet-switched multimedia telephony are defined in two groups of specifications: those for PS Conversational multimedia applications [38], and those for the Multimedia Telephony Service over IMS (MTSI) [34]. The former does not include the specification of a client, although it specifies codecs (speech, video, real-time text) and transport protocols for video telephony applications (including PoC). MTSI defines a full client on top of IMS, including interworking with 3G-324M and non-IMS terminals. In the rest of this thesis the reference will be to the MTSI standard. An MTSI call can start with only one type of media, and additional media may be added (or dropped) by the users during the call. A particular case of an MTSI call is therefore a voice call (VoIP), which consists of only one type of media, i.e., speech.

[Figure content: Conversational Multimedia Application on top of speech codec, video codec and text with their payload formats; RTP/RTCP over UDP; SIP/SDP session control over TCP or UDP; all over IPv4/IPv6.]

Figure 8. MTSI client protocol stack

TABLE 9. SPEECH AND VIDEO CODECS SUPPORTED BY MTSI CLIENTS

Media codec | Properties
AMR | 8 kHz sampling frequency. 9 codec modes with bit rates ranging from 4.75 to 12.2 kbps (including a silence mode at 1.8 kbps).
AMR-WB | 16 kHz sampling frequency. 10 codec modes with bit rates ranging from 6.6 to 23.85 kbps (including a silence mode at 1.75 kbps).
H.263 | Profile 0 (baseline) Level 45, and Profile 3 (Annex I, J, K, T) Level 45. Up to QCIF resolution. Up to 128 kbps bit rate.
MPEG-4 Visual | Simple Profile Level 3. Up to 30 fps frame rate. Up to CIF resolution. Up to 384 kbps bit rate.
H.264 | Baseline Profile Level 1.1. Example resolution up to CIF. Up to 192 kbps bit rate.


Figure 8 shows the MTSI client protocol stack. Media codecs can be for continuous media (speech or video) or for discrete media (real-time text for chat applications). The supported speech codecs are AMR [31] and AMR Wideband [35]. The supported video codecs are H.263 [122], MPEG-4 Visual [118] and H.264 [124]. The supported codec for real-time text is T.140. Table 9 summarizes the supported speech and video codecs for MTSI clients.

MTSI clients use the Session Initiation Protocol (SIP) [192], the Session Description Protocol (SDP) [110] and the SDP Capability Negotiation (SDPCapNeg) [48] for session control, media negotiation and configuration. The protocol used for the transport of packetized media data is the Real-Time Transport Protocol (RTP) [199]. RTP provides real-time delivery of media data, including functionalities such as packet sequence numbering and time stamping; the latter allows inter-media synchronization in the receiving terminal. RTP runs on top of UDP and IPv4/v6, and comes with its own control protocol (RTCP) that allows QoS monitoring (see section 6.3.2). Each endpoint sends quality reports to, and receives them from, the other endpoint.

SIP provides the logical binding between the media streams of two MTSI terminals. As shown in Figure 8, SIP can run on top of TCP or UDP; however, UDP is assumed to be the preferred transport in 3GPP IPv4/v6-based networks [7]. SIP uses SDP to describe the session properties (IP addresses, ports, payload formats, type of media (audio, video, etc.), media codecs, and session bandwidth). An IETF SIP signaling example between two endpoints is included in Publication [P2], along with issues related to the reliability of signaling over UDP. Mobile applications based on IETF SIP can be implemented on top of non-IMS networks. In this case, only the applications resident in the mobile terminals run the SIP protocol, while the network simply transports the SIP messages.

In 3GPP IMS networks, where SIP has been selected to govern the core call control mechanism [7], both the network and the mobile terminal implement the SIP protocol and exchange SIP messages for establishing and releasing calls. This choice has been made to enable the transition towards all-IP mobile networks. Examples of SIP signaling for call set-up and release between a mobile terminal and a 3GPP network are given in [7, 74].

4.2. MEDIA TRAFFIC CHARACTERISTICS

4.2.1. 3G-324M traffic

The 3G-324M implementer's recommendations for CS networks [41] suggest that H.223 MUX-PDUs should be limited to 100-200 bytes (with the maximum size negotiated via H.245). Also, in order to minimize delay, the maximum number of speech frames encapsulated into a MUX-PDU should be limited to 3. In addition, it is recommended that video transport use Adaptation Layer Type 2 (AL2), the same used for speech transport, without retransmission. Finally, it is recommended to encapsulate one MPEG-4 Visual video packet into one H.223 AL-SDU.


Taking into account the guidelines above, and the fact that the 100-200 bytes of each MUX-PDU include 5 bytes of header overhead (for CRC, error protection, packet length and other information, using AL2 for speech and video and H.223 Level 2 multiplexing [P1, 121]), this leaves 95-195 bytes for speech or video payload, assuming no header overhead for RLC in T mode and MAC in non-multiplexed mode for UTRAN (see section 2.3.3). If the channel bit rate is 64 kbps, the H.223 multiplexer overhead is 5% (i.e., 3.2 kbps) for 100-byte MUX-PDUs and, after H.245 control traffic, the remaining bandwidth for media (audio and video) is a bit more than 60 kbps, which would require a multimedia telephony application to generate about 80 pps. Similarly, for 200-byte MUX-PDUs, the H.223 multiplexer overhead is 2.5% (i.e., 1.6 kbps), and the remaining bandwidth for media is a bit more than 62 kbps, which equals about 40 pps.
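The overhead and packet-rate figures above can be reproduced with a small sketch (channel rate, MUX-PDU sizes and the 5-byte header as stated in the text):

```python
def h223_mux_stats(channel_kbps: float, mux_pdu_bytes: int,
                   header_bytes: int = 5):
    """Packet rate and H.223 header overhead for a given MUX-PDU size
    on a fully loaded channel (sketch)."""
    pps = channel_kbps * 1000 / (8 * mux_pdu_bytes)
    overhead_kbps = pps * header_bytes * 8 / 1000
    return pps, overhead_kbps

print(h223_mux_stats(64, 100))  # (80.0, 3.2): 80 pps, 3.2 kbps overhead (5%)
print(h223_mux_stats(64, 200))  # (40.0, 1.6): 40 pps, 1.6 kbps overhead (2.5%)
```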

4.2.2. MTSI traffic

An example for media at a bit rate of 407.85 kbps will be shown. This rate is the sum of the highest AMR-WB mode (23.85 kbps) and the highest video bit rate allowed by the MTSI standard (384 kbps for MPEG-4 Visual). Let us suppose there are no MAC headers, and that the RLC header is 1 byte. Let us also assume an RLC payload size of at most 104 bytes (it can be flexibly set up to a size of 12000 bits, i.e., 1500 bytes [24]) in order to avoid segmentation, and a PDCP layer header of 1 byte [24]. The RTP/UDP/IPv6 header can be assumed to be compressed down to 3 bytes with ROHC [24]. Therefore, each RTP packet payload can carry up to 100 bytes of media data. The traffic is carried over the Conversational TC with RLC running in UNACK mode. Let us also suppose the application transmits RTP media payloads of at most 100 bytes per packet. If each RTP AMR packet containing 1 speech frame is 61 bytes long (excluding the 3 bytes for the compressed RTP/UDP/IP header), then the application needs to generate a speech traffic of 50 pps, and the header overhead generated for speech traffic amounts to 2 kbps. A video bit rate of 384 kbps corresponds to 480 video pps, and together with the speech traffic this gives a total of 530 pps. The header overhead generated for video amounts to 19.2 kbps. The total bearer bit rate required for this case is then 429 kbps (and the total header overhead is 21.2 kbps, i.e., about 5% of the bearer bandwidth). It has to be noted that choosing larger payload sizes decreases the incoming and/or outgoing packet rates to/from a MS. This might be beneficial for low-end MTSI terminals with limited processing capabilities.
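The MTSI bandwidth example above can be recomputed in the same way; all figures come from the text (3 bytes compressed RTP/UDP/IPv6 + 1 byte PDCP + 1 byte RLC per packet), while the helper itself is illustrative and not part of any 3GPP API:

```python
# Recomputing the MTSI example above (figures from the text; names are mine).

def flow_overhead(pps: float, header_bytes: int = 5) -> float:
    """Header overhead in bps: 3 B compressed RTP/UDP/IPv6 + 1 B PDCP + 1 B RLC."""
    return pps * header_bytes * 8

speech_pps = 50                   # one 61-byte AMR-WB frame per RTP packet, 20 ms each
video_pps = 384_000 / (100 * 8)   # 384 kbps in 100-byte RTP payloads -> 480 pps
overhead = flow_overhead(speech_pps) + flow_overhead(video_pps)
bearer = 23_850 + 384_000 + overhead
print(video_pps, overhead, bearer)  # 480.0 21200.0 429050.0
```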

4.3. PDP CONTEXTS CONSIDERATIONS

In this section only packet-switched data will be considered. There are several IP data flows involved in the communication between two multimedia telephony SIP parties. These are described hereafter:


A bi-directional SIP data flow used for signaling (user control plane). This flow is carried over UDP. The transport of SDP data occurs within SIP messages. This type of traffic is delay insensitive. The typical length of SIP packets ranges from a few tens of bytes up to a few thousand bytes per message [132]. With SIP compression [188], the packet size can be reduced by over 95% [132].

One or more uni-directional RTP media flows, related to speech or video data (user data plane). In a VoIP call there is normally a single RTP stream (i.e., speech), while in a multimedia telephony session there are usually two media streams involved (speech and video). RTP flows are carried over unreliable connections (UDP), and are sent from each endpoint towards other endpoints (and vice versa). This traffic is delay sensitive.

Zero or more uni-directional RTCP flows. Each flow (if used) is associated with an RTP flow. For example, if a session is made of speech and video RTP streams, there will be two associated RTCP flows. It is also possible to use a single RTCP flow to multiplex information from both the speech and video streams, but this is not recommended. RTCP can be optional (for example in the case of feedback suppression), even if its usage is recommended for QoS management. This traffic is delay insensitive. Typical lengths of RTCP packets are below 200 bytes [80].

Each IP flow can be carried in a PDP Context, which is a logical data connection over which the MS and the network (GGSN) can exchange IP packets. Each PDP context carries several attributes, among which are the requested and the granted QoS profile attributes. For each PDP Context, a different QoS profile may be requested [28]. In addition, each PDP context allows defining the GBR and MBR separately in the uplink and downlink directions, while all the other QoS attributes are fixed in both directions [5]. The PDP context is then mapped onto a physical radio bearer.

4.3.1. Number of PDP contexts

When several PDP contexts can be activated in a MS, the problem is to find a configuration, in terms of number of contexts, that is optimal from the application perspective as well as for the usage of radio network resources. Different cases can be considered:

1. Hard separation. In this case the SIP traffic is carried in its own PDP context; every RTP media stream is carried in its own PDP context, and every RTCP stream is carried in its own PDP context. In case of a multimedia telephony session with audio and video, the number of PDP contexts required would be 5: one for SIP traffic, one for RTP speech traffic in the uplink (UL) and downlink (DL) directions, one for RTP video traffic in the UL and DL directions, one for RTCP traffic associated with the speech stream in the UL and DL directions, and the last one for RTCP traffic associated with the video stream in the UL and DL directions;

2. Medium separation. Here the SIP traffic is carried in its own context. Every RTP stream and its associated RTCP stream are carried in one context. In case of a session with audio and video, the number of required PDP contexts would be 3: 1 for the SIP traffic, 1 for RTP speech traffic (and RTCP) in the UL and DL directions, and 1 for RTP video traffic (and RTCP) in the UL and DL;

3. Soft separation. In this case the SIP traffic is carried in its own PDP context. All the media components and their associated RTCP streams are carried in a separate PDP context. In case of a session with audio and video, the number of PDP contexts required would be 2: one for the SIP traffic, and one for the RTP and RTCP traffic for speech and video in the UL and DL directions.
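The context counts of the three separation scenarios above can be summarized in a toy helper; the scenario encoding and function name are mine, not from any 3GPP specification:

```python
# Toy enumeration of the three separation scenarios for an audio+video session.

def pdp_contexts(n_media: int, separation: str) -> int:
    """Number of PDP contexts for a session with n_media RTP streams
    (each assumed to have an associated RTCP stream), plus one context for SIP."""
    if separation == "hard":     # SIP + each RTP + each RTCP in its own context
        return 1 + 2 * n_media
    if separation == "medium":   # SIP + one context per RTP(+RTCP) pair
        return 1 + n_media
    if separation == "soft":     # SIP + one context for all media
        return 2
    raise ValueError(separation)

print([pdp_contexts(2, s) for s in ("hard", "medium", "soft")])  # [5, 3, 2]
```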

The case where SIP traffic is carried in the same PDP context as RTP and RTCP traffic is not considered here, because SIP may also be used for other types of traffic within a MS, and therefore an existing PDP context for SIP could have been allocated even before the multimedia telephony session starts. Also, SIP traffic is delay insensitive and does not require a guaranteed QoS. Finally, SIP traffic may disturb the RTCP traffic, producing additional delays for the real-time media. Therefore, it is better to keep SIP traffic separate. It is also possible to group media streams that belong to different SIP sessions onto the same PDP context [58]. This case will not be further analyzed.

Before discussing the three scenarios above, it is important to make some considerations on using separate or common PDP contexts for RTP and RTCP traffic. If RTP and RTCP streams are sent over the same PDP context, the RTT calculation done with RTCP is easier, because the network channel has the same properties for both RTP and RTCP. For the same reason, it is also easy to detect any link breakage that might occur. Also, RTP traffic may make use of the instantaneous spare RTCP bandwidth whenever possible. The disadvantage is that multiplexing RTP and RTCP packets may lead to extra delays and potentially to RTP stream losses [170]. On the other hand, when RTP and RTCP flows are sent over separate PDP contexts, the RTT calculation may not reflect reality, because RTCP would perform RTT estimations on a different channel with different properties (e.g., a different bit rate) than the one used for RTP, and this could make part of the RTCP features useless. This case also produces additional delays caused by the allocation of the additional PDP contexts and, as the number of PDP contexts is higher, the MS and network memory and processing requirements become higher. Also, if the RTCP flow is allocated over an Interactive TC PDP context, this becomes inefficient because the Interactive TC has no delay bound. RTCP is delay insensitive, but if the packets arrive several seconds late, they become useless for the purpose of QoS control. Therefore, RTCP needs a reasonable guarantee of timely delivery, and the Streaming or Conversational TCs are the best choices for this.

From the reasoning above, it turns out that it is more beneficial to place RTP and RTCP flows in the same PDP context. This choice is supported also by 3GPP [30]. Scenario 1 becomes too expensive and inefficient. Scenario 2 is more reasonable whenever speech and video require different QoS (e.g., in terms of delay). Scenario 3 is a typical case of media grouping [58], when speech and video require the same QoS profile attributes, and has been followed in the simulations of Publication [P9]. If speech and video require different QoS, the QoS profile attributes could be set to satisfy the most stringent QoS requirements among the aggregated media. However, one aspect to be considered is the reciprocal influence of the media on the overall QoS. For example, VBR high bit rate video bursts may produce losses of RTP speech packets or RTCP packets, and vice versa.

4.4. MOBILE MULTIMEDIA TELEPHONY QOS METRICS

When implementing mobile multimedia telephony applications, special attention has to be paid to the QoS metrics chosen to demonstrate that the system is of good quality. If the metrics are standardized, they also allow a fair comparison among different implementations. Since there are no general standard procedures, one needs to decide which fundamental quality parameters should be selected for QoS assessment. In this section, the discussion will be limited to video and session control signaling. For the purpose of video quality assessment, both subjective and objective metrics must be used, since they can be considered complementary. The ITU-T standards P.911, P.912 and P.920 define methods for subjective video quality assessment of multimedia applications. Test procedures for the selection of the MPEG-4 Visual codec are described in [183]. The ANSI standard [47] describes metrics to measure noise, blurring and jerky motion, as well as the Peak Signal to Noise Ratio (PSNR). The ITU standard [128] outlines some basic metrics for frame rate measurements and Mean Square Error (MSE) methods of measurement. 3G-324M terminal evaluation metrics are defined in [42]. Jitter and its perception are described in [153, 208, 224]. Several QoS metrics have been used during the research period related to this thesis; for instance, Publications [P1, P2] describe several metrics for mobile multimedia telephony.

QoS evaluation is eased if the modules in each protocol layer implement a log file system that can output as many data traces as possible for QoS assessment. An example of a log file system, and a more extensive discussion of the objective metrics described in this thesis as well as of subjective metrics, are available in [71, 72]. The metrics in this section are categorized into five classes, depending on the type of information they provide: frame-based, PSNR-based, delay-based, service flexibility-based and call control-based QoS metrics. In the following, some definitions, summarized in Table 10, will be used.

TABLE 10. MOST USED SYMBOLS

Symbol   Description
N        Number of runs (simulations)
S        Length of the video sequence (in seconds)
#A       Number of elements of a generic set A (cardinality)
D_j^I    Initial decoding time of frame j
D_j^E    Ending decoding time of frame j


4.4.1. Frame-based QoS metrics

Let us define Ek={Set of encoded frames for the run k} and Dk={Set of decoded frames for the run k}. Then the Encoding frame rate defines the average rate at which the input frames are encoded by the video encoder. It is expressed in frames per second (fps) and can be computed by the expression

$$FR_{Tot} = \frac{1}{N}\sum_{k=1}^{N}\frac{\#E_k}{S}. \qquad (10)$$

The encoding frame rate can be different from the decoding frame rate, if some frames are lost or dropped. In this case Ek must be replaced by Dk in the previous expression. It is valuable to measure the minimum achieved decoding frame rate, because it translates to a practical threshold below which the user may not accept the service anymore. The reciprocal of the average frame rate above defines the average frame inter-arrival time (see section 4.4.3). It is also useful to compute the application instantaneous frame rate. For the jth frame at the decoder output this is defined as

$$FR_j = \frac{1}{D_j^E - D_{j-1}^E}. \qquad (11)$$
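A minimal sketch of the average and instantaneous frame rates of Eqs. (10)-(11), assuming per-frame timestamps are available from a decoder or encoder log (the function and variable names are mine):

```python
# Frame-based metrics sketch: Eq. (10) average frame rate over N runs and
# Eq. (11) instantaneous frame rate from frame end-of-processing timestamps.

def avg_frame_rate(runs: list[list[float]], seq_len_s: float) -> float:
    """Eq. (10): mean over runs of (#frames / sequence length in seconds)."""
    return sum(len(frames) for frames in runs) / (len(runs) * seq_len_s)

def inst_frame_rate(end_times: list[float], j: int) -> float:
    """Eq. (11): reciprocal of the inter-arrival time of frame j."""
    return 1.0 / (end_times[j] - end_times[j - 1])

runs = [[0.1 * i for i in range(100)], [0.1 * i for i in range(90)]]
print(avg_frame_rate(runs, 10.0))           # (100 + 90) / (2 * 10) = 9.5 fps
print(inst_frame_rate([0.0, 0.1, 0.3], 2))  # 1 / 0.2 ≈ 5.0 fps
```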

4.4.2. PSNR-based QoS metrics

PSNR is a measure of the difference between the original frame and the corresponding encoded (or decoded) frame. It is known that PSNR is not always well correlated with subjective quality measurements [85]. However, it is the easiest to apply and still the most popular metric for video quality assessment. For a video sequence, the PSNR can be computed for every frame and then averaged over the entire sequence. An important issue to consider is that video is a three-dimensional entity, whereas PSNR is essentially a two-dimensional function that does not take into account the temporal dimension (the frame rate), which is important in video QoS assessment. For this purpose, there are basically two ways to compute the PSNR [71]:

PSNR at the source frame rate, which depends on both the source frame rate and the decoded frame rate. With this method, higher PSNRs are obtained at higher encoded frame rates.

PSNR at the decoded frame rate, which does not take into account the source frame rate and has no cognition of the third dimension.

There is no unique method to calculate the PSNR of the video sequence. The reader may refer to [127] for an alternative formula. Here the formula used by ARIB for the IMT-2000 Video Multimedia Codec Simulation Test for 3G-324M terminals [P1, 42] will be used. The PSNR is computed on the video sequences in raw YUV format. The combined PSNR for all the three Y (luma), U and V (chrominance) channels for a single video frame l is computed at run k by the formula


$$PSNR_{k,l} = 10\log_{10}\frac{255^2\cdot\frac{3}{2}XY}{\sum_{x=1}^{X}\sum_{y=1}^{Y}\left(Y^{o}_{x,y}-Y^{d}_{x,y}\right)^2+\sum_{x=1}^{X/2}\sum_{y=1}^{Y/2}\left(U^{o}_{x,y}-U^{d}_{x,y}\right)^2+\sum_{x=1}^{X/2}\sum_{y=1}^{Y/2}\left(V^{o}_{x,y}-V^{d}_{x,y}\right)^2} \qquad (12)$$

where $Y^{o}_{x,y}$, $U^{o}_{x,y}$, $V^{o}_{x,y}$ and $Y^{d}_{x,y}$, $U^{d}_{x,y}$, $V^{d}_{x,y}$ indicate the three channels of the original and decoded frames, respectively. $X$ and $Y$ indicate the Y channel support for 4:2:0 video. For example, for the QCIF size, $X = 176$ and $Y = 144$. The average PSNR at the decoded frame rate, expressed in decibels (dB), computed for the whole video sequence over N runs is then obtained as:

$$PSNR_{Tot} = 10\log_{10}\frac{255^2\cdot\frac{3}{2}XY}{\frac{1}{N}\sum_{k=1}^{N}\frac{1}{\#D_k}\sum_{l\in D_k}E_{k,l}} \qquad (13)$$

where $E_{k,l}$ is the denominator of Eq. (12). If the PSNR is computed at the source frame rate,

the set Dk must be replaced by the set of original frames and the decoded sequence must be reset to the original frame rate by frame insertion (or alternatively the temporal reference information can be used to select the correct frames to compare). It could be useful to plot the instantaneous PSNR to show the video quality variation over the whole sequence, and calculate the minimum PSNR that represents a minimum level of video quality acceptability.
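The combined 4:2:0 PSNR computation of Eq. (12) can be sketched as follows, here in pure Python on flat per-channel sample lists rather than raw YUV files; the function name and the toy "frame" are illustrative:

```python
# Simplified combined YUV PSNR for one 4:2:0 frame (illustrative sketch).
import math

def yuv_psnr(orig, dec):
    """orig/dec are (Y, U, V) tuples of equal-length 8-bit sample lists."""
    sse = sum((a - b) ** 2                 # summed squared error over Y, U and V
              for o, d in zip(orig, dec)
              for a, b in zip(o, d))
    n_luma = len(orig[0])                  # X * Y luma samples
    if sse == 0:
        return float("inf")                # identical frames
    return 10 * math.log10(255 ** 2 * 1.5 * n_luma / sse)

y = [100] * 8; u = [50] * 2; v = [50] * 2  # tiny 4:2:0 "frame"
print(yuv_psnr((y, u, v), (y, u, v)))                    # inf
print(round(yuv_psnr((y, u, v), ([101] * 8, u, v)), 2))  # 49.89
```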

The video quality loss, as used in Publication [P1], is computed as

$$\Delta PSNR = PSNR_{free} - PSNR_{decoded} \qquad (14)$$

where $PSNR_{free}$ is the PSNR of the error-free sequence, and $PSNR_{decoded}$ is the PSNR of the decoded sequence. In some research and standardization work, the $\Delta PSNR$ is calculated according to [55].

The standard deviation of PSNR helps to verify, in multiple-run tests, how much the PSNR results deviate from the average value $PSNR_{Tot}$ computed with the previous equations. It is expressed in dB and can be calculated by the formula

$$\sigma_{PSNR} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(PSNR_k - PSNR_{Tot}\right)^2} \qquad (15)$$

where $PSNR_k$ is the average PSNR computed at run k. When additional statistical information about PSNR is needed, it may be useful to calculate the confidence interval for the average PSNR or plot the Probability Distribution Function (PDF) and the Cumulative Distribution Function (CDF) of PSNR.

4.4.3. Delay-based QoS metrics

Since the delay is made of many different components, an approach would be to measure the different delays and try to optimize them separately. A set of measurable delays (expressed in seconds) is described in the following. Please refer to Figure 9.

Capturing delay is the delay incurred in capturing a video frame. This delay is computed only for the frames that are actually encoded, and it does not depend on whether the capture mode is on demand or whether some of the captured frames are skipped by the video encoder. It may be regarded as rather constant for a specific device and can be calculated as

$$CDelay_j = C_j^E - C_j^I \qquad (16)$$

where $C_j^I$ and $C_j^E$ are the initial and ending capturing times of frame j (following the notation of Table 10).

The encoding delay is the time to encode a frame. The total average encoding delay for N runs is

$$EDelay_{Tot} = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{\#E_k}\sum_{j\in E_k}EDelay_j \qquad (17)$$

where for the jth encoded frame, $EDelay_j = E_j^E - E_j^I$, with $E_j^I$ and $E_j^E$ the initial and ending encoding times of frame j.

The packetization delay is the time needed by the RTP packetizer (or H.223 multiplexer) to split a video frame into packets and send it to the network interface. It may overlap with the encoding delay. The total and net packetization delays are given by

$$MDelay_{Tot} = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{\#E_k}\sum_{j\in E_k}MDelay_j^{tot}, \qquad MDelay_{Net} = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{\#E_k}\sum_{j\in E_k}MDelay_j^{net} \qquad (18)$$

where, for the encoded frame j, $MDelay_j^{tot} = M_j^E - M_j^I$ (total delay) and $MDelay_j^{net} = M_j^E - E_j^E$ (net delay), with $M_j^I$ and $M_j^E$ the initial and ending packetization times of frame j.

Transmission delay is the time to transmit a video frame over the air interface. The transmission delay can overlap with the packetization delay. Total and net transmission delays are computed by

$$TDelay_{Tot} = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{\#E_k}\sum_{j\in E_k}TDelay_j^{tot}, \qquad TDelay_{Net} = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{\#E_k}\sum_{j\in E_k}TDelay_j^{net} \qquad (19)$$

where $TDelay_j^{tot} = T_j^E - T_j^I$ (total transmission delay) and $TDelay_j^{net} = T_j^E - M_j^E$ (net transmission delay). The value $T_j^I$ is computed by the packetizer, whereas $T_j^E$ is computed by the depacketizer (assuming they are synchronized on the same clock).

The depacketization delay is the time needed to depacketize (or demultiplex) a whole video frame. It may overlap with the transmission delay and can be computed with a similar procedure as the packetization delay (see Figure 9). The decoding delay is the time needed to decode a video frame. It may overlap with the depacketization delay and can be computed with a similar procedure as the packetization delay (see Figure 9). The display delay is the time needed to display a video frame. It measures only the time it takes to show the frame on the screen, not how long the frame remains displayed until the next frame arrives. It can be considered rather constant for a given display device and can be computed by

$$PDelay_j = P_j^E - P_j^I \qquad (20)$$

where $P_j^I$ and $P_j^E$ are the initial and ending display times of frame j.

The end-to-end delay is the delay occurred from capturing up to display of the processed video frame. It can be computed by the expression

$$E2EDelay_{Tot} = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{\#D_k}\sum_{j\in D_k}E2EDelay_j \qquad (21)$$

where $E2EDelay_j = P_j^E - C_j^I$. This is equivalent to the sum of the capturing, encoding and display delays and all the other net delays. The value $P_j^E$ is computed by the receiver, whereas $C_j^I$ is computed by the sender (assuming they are synchronized on the same clock).

When additional statistical information is needed about the end-to-end delay, or about the delay of each processing module in the end-to-end chain, it is useful to plot the PDF and the CDF of the delays.

The out of delay constraints rate is a metric that measures the rate of delay violations beyond a fixed time threshold T. This is useful, for example, to ensure that a certain percentage of frames arrive within a specified time. Let us define

$$OD_j = \begin{cases}1 & \text{if } Delay_j > T\\ 0 & \text{otherwise}\end{cases} \qquad (22)$$

where $Delay_j$ represents a generic delay metric. Then the out of delay constraints rate (in percentage) is

$$OD_{\%} = \frac{100}{N}\sum_{k=1}^{N}\frac{1}{\#D_k}\sum_{j\in D_k}OD_j. \qquad (23)$$
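Eqs. (22)-(23) reduce to a small helper; the sketch below is illustrative, with names of my choosing:

```python
# Out of delay constraints rate: percentage of per-frame delays above a
# threshold T (seconds), averaged over the runs.

def out_of_delay_rate(delay_runs: list[list[float]], threshold: float) -> float:
    per_run = [sum(d > threshold for d in delays) / len(delays)
               for delays in delay_runs]
    return 100.0 * sum(per_run) / len(per_run)

delay_runs = [[0.05, 0.08, 0.30, 0.12], [0.06, 0.25, 0.07, 0.09]]
print(out_of_delay_rate(delay_runs, 0.15))  # one violation per run of four -> 25.0
```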

The delay jitter describes the variation of delay over time, and can be computed separately in every module of the sender-receiver chain or end-to-end. The examples presented here are for decoder delay jitter. The average decoder delay jitter for the run k can be calculated by the formula

$$J_k = \frac{1}{\#D_k - 1}\sum_{j=2}^{\#D_k}J_{k,j} \qquad (24)$$

where for the jth decoded frame

$$J_{k,j} = \left|DDelay_j - DDelay_{j-1}\right| \qquad (25)$$

and $DDelay_j = D_j^E - D_j^I$. The absolute values of the differences are used to prevent positive and negative jitter values from cancelling each other out. The jitter defined above is a relative delay jitter, because it is computed taking into account only the decoding delays of two consecutive frames. The absolute jitter can be computed using the decoder instantaneous frame inter-arrival time for the whole video sequence. The average decoder output absolute delay jitter can be computed for the run k by

$$J^{A}_k = \frac{1}{\#D_k - 1}\sum_{j=2}^{\#D_k}J^{A}_{k,j} \qquad (26)$$

where for the jth decoded frame

$$J^{A}_{k,j} = \left|FIT_j - \overline{FIT}\right| \qquad (27)$$

with $\overline{FIT}$ the average frame inter-arrival time (the reciprocal of the average frame rate), and where the instantaneous frame inter-arrival time (delay) for the jth frame (measured at the exit of the decoder) is defined as

$$FIT_j = D_j^E - D_{j-1}^E. \qquad (28)$$


The absolute delay jitter measures how far the difference between the delivery times of frames is from the ideal time difference in a perfectly periodic sequence where frames are spaced by the instantaneous frame inter-arrival time.²
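The relative and absolute jitter computations of Eqs. (24)-(28) can be sketched from decoder timestamps as follows; the helper names are mine, and the absolute jitter here uses the average inter-arrival time as the reference spacing:

```python
# Decoder jitter sketch from frame timestamps (illustrative names).

def inter_arrival(end_times):
    """Eq. (28): instantaneous frame inter-arrival times."""
    return [b - a for a, b in zip(end_times, end_times[1:])]

def relative_jitter(decode_delays):
    """Eqs. (24)-(25): mean |DDelay_j - DDelay_{j-1}| over consecutive frames."""
    diffs = [abs(b - a) for a, b in zip(decode_delays, decode_delays[1:])]
    return sum(diffs) / len(diffs)

def absolute_jitter(end_times):
    """Eqs. (26)-(27): mean deviation of inter-arrival times from their average."""
    fit = inter_arrival(end_times)
    mean = sum(fit) / len(fit)
    return sum(abs(f - mean) for f in fit) / len(fit)

print(relative_jitter([0.020, 0.030, 0.030]))    # (0.010 + 0.0) / 2 ≈ 0.005
print(absolute_jitter([0.0, 0.10, 0.30, 0.40]))  # ≈ 0.0444
```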

The shaping delay and its function are defined in section 4.5.2. If the frame rate of the video sequence is CodingFR, then

$$SDelay = \frac{1}{CodingFR}. \qquad (29)$$

The total net delay is the total processing end-to-end delay, excluding buffering. It is defined as

$$E2EDelay_{Net} = CDelay + EDelay_{Tot} + MDelay_{Net} + TDelay_{Net} + DMDelay_{Net} + DDelay_{Net} + PDelay \qquad (30)$$

where $DMDelay_{Net}$ and $DDelay_{Net}$ are the net depacketization and decoding delays.

The overlapping factor is a metric meant to assess the degree of overlap between the different modules. It shows the relation between the overall net and total delays, and can be used to optimize the system. The generic overlapping factor is computed by

$$OF = 100\left(1 - \frac{\sum_{i}Delay_i^{net}}{\sum_{i}Delay_i^{tot}}\right) \qquad (31)$$

where the sums run over the processing modules of the end-to-end chain.

The range of its values is between 0 and 100, where 100 indicates the maximum level of overlapping. This metric can also provide a rough estimate of how much the overall delay can be reduced.
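The overlapping factor of Eq. (31) can be computed from per-module net and total delays as in the sketch below (an illustrative helper, not from the thesis software):

```python
# Overlapping factor: how much of the total per-module delay is hidden by
# pipelining (0 = no overlap, 100 = full overlap). Delays here are in ms.

def overlapping_factor(net_delays, total_delays):
    return 100.0 * (1.0 - sum(net_delays) / sum(total_delays))

# Packetization and transmission fully overlapped by earlier stages:
print(overlapping_factor(net_delays=[0, 0], total_delays=[40, 60]))    # 100.0
# Half of each module's delay is hidden:
print(overlapping_factor(net_delays=[20, 30], total_delays=[40, 60]))  # 50.0
```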

4.4.4. Service flexibility-based QoS metrics

The traffic over a wireless channel can change as a function of time, as well as of the location and speed of the mobile terminal in the cellular environment. Moreover, due to handovers, it is sometimes difficult to maintain the agreed QoS. Different channel conditions may allow using a higher bit rate, with an associated better QoS, or can put constraints on the maximum possible data rate. The same applies to error rates. Let us define the service flexibility as

$$\Delta QoS = QoS_{current} - QoS_{previous} \qquad (32)$$

where $QoS_{current}$ and $QoS_{previous}$ are the values of QoS for the current and the previous service states, and may be expressed in terms of PSNR ($QoS_{PSNR}$), video delay ($QoS_{delay}$), frame rate ($QoS_{FPS}$) or other metrics [P1].

4.4.5. Call control-based QoS metrics

The ITU-T specification [120] defines metrics for circuit-switched services over mobile networks. However, similar concepts can be applied to packet-switched services [97]. For SIP calls, the measurement phases can be divided into three parts: Post-Dialing Delay (PDD), Answer-Signal Delay (ASD) and Call-Release Delay (CRD). For further details the reader may refer to Figure 1 in Publication [P2] and the details therein. In addition to the above, the Call Success Rate (CSR) defines the ratio of successfully established calls over the total number of calls [79, 152].

² The instantaneous frame inter-arrival time can be variable, and can therefore accommodate realistic situations of variable frame rates. The case of a perfectly constant frame rate is the special case where the instantaneous frame inter-arrival time is always constant.

4.4.6. Other QoS metrics

Quality of Experience (QoE) metrics for MTSI have been defined in [34]. These are useful for optimizing application performance and for quality monitoring. Examples of such metrics (applicable only to continuous media) are video corruption duration, successive loss of RTP packets, jitter duration and synchronization loss duration. QoE metrics reporting mechanisms are illustrated in [174].

There are also various other sources of QoS metrics. Metrics specifically designed for SIP telephony services are described in [152]. VoIP monitoring is enabled by the RTCP eXtended Reports (RTCP XR) [103], with metrics such as the amount of lost and discarded packets, burst information and jitter buffer parameters. Transport layer measurement methods such as one-way and round-trip delays, packet delay variation, effective packet loss and packet loss correlation are described in [97]. IP packet rate metrics are discussed in [129]. Metrics for calculating the net bandwidth after repair mechanisms (e.g., FEC or retransmission) are discussed in [66]. Metrics to determine the packets that remain lost after all repair methods are applied are described in [52].

4.5. MULTIMEDIA TELEPHONY QOS IMPROVEMENTS

In this section, some relevant aspects of and solutions for improving the QoS of mobile multimedia telephony applications will be introduced.

4.5.1. Bit errors or packet loss handling

Bit errors and packet losses may be caused by the inherently lossy nature of the air interface and by handovers during user mobility (see section 3.2.6). There are many techniques for repairing corrupted or missing data, caused by erroneous bits in a media stream or by packet losses in error-prone environments, in order to mitigate their effects on the user experience. Error concealment algorithms are usually located in the media decoder, and use various interpolation techniques to try to reconstruct the correct data (or the best approximation of it) from corrupted or missing data. A review of several algorithms is available in [111, 184, 203]. FEC algorithms add a certain amount of redundancy to the media data before transmission, in such a way that the receiver is able to reconstruct the correct data even in case of a certain amount of data loss. Some FEC algorithms are reviewed in [60, 158, 184]. UEP methods allow giving more protection to more important data and less protection to less important data, for example by using different levels of FEC within the media stream [90]. The techniques above produce a certain additional delay at the sender and receiver sides (e.g., FEC, UEP) or at the receiver side (media error concealment). This additional delay must be taken into account when designing a multimedia telephony application. A method for interleaving is described in [215], while other options for repairing media are described in [185].
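As a minimal illustration of packet-level FEC, one XOR parity packet per block lets the receiver rebuild a single lost packet; this is a generic sketch, not a scheme from the cited works:

```python
# Toy XOR-parity FEC: one parity packet protects a block of equal-size packets
# against the loss of any single packet.

def xor_parity(packets: list[bytes]) -> bytes:
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

block = [b"spee", b"chpk", b"vide"]
fec = xor_parity(block)
# Lose packet 1; XOR of the survivors and the parity packet restores it.
recovered = xor_parity([block[0], block[2], fec])
print(recovered)  # b'chpk'
```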

The methods above may use a feedback loop between receiver and sender to function in a more interactive manner. This means that the receiver may signal to the sender that some data is missing, or report some other abnormal event or specific encoder instructions, so that the sender can take appropriate action [90]. For instance, FEC or UEP can be used adaptively, so that the amount of redundancy or protection is adapted to the channel conditions [46, 56, 133, 156]. RTP/AVPF [178] allows, among other things, Picture Loss Indication (PLI) and Slice Loss Indication (SLI) for signaling the loss of some amount of video data and for triggering the video encoder, for example, to send an Intra-coded picture refresh or take other action. RTP/AVPF also allows the signaling of specific application layer feedback messages. The standard [218] allows a receiver to explicitly send a Full Intra Request (FIR) to trigger the sender to transmit a Decoder Refresh Point (e.g., an Intra frame). H.271 [125] is similar to [178] and [218]. The above methods do not have hard real-time constraints. This means that even if the sender's reaction to the feedback message arrives at the receiver a bit late, this does not compromise the playback. However, the faster the media data is repaired, the better it is for the user experience. Among the actions a sender can take against high error rates is dynamically changing the packet size based on the error rate [90]. The loss of smaller packets produces a less visible or audible effect on the QoS; however, smaller packets increase the packet header overhead and, therefore, the needed bandwidth. Retransmission for multimedia telephony is analyzed in [90]. Studies on event-driven RTCP feedback compared to constant feedback for video telephony and VoIP are available in [80, 81].

Publication [P1] and [78] show performance results of a 3G-324M terminal operating in error-prone channel conditions over simulated HSCSD and UTRAN CS networks for channel bit rates between 14.4 and 64 kbps. The results presented focus on H.263+ QCIF video transmitted over H.223 Level 2. H.263+ was configured with Annexes F, I, J, T and decoder error concealment. Two types of connections were analyzed: connections between mobile and PSTN/ISDN networks (Mobile-To-Land, MTL), and connections between two mobile terminals (Mobile-To-Mobile, MTM). The BERs were in the range between 3·10⁻⁵ and 6·10⁻⁴, and a low-motion video sequence (Akiyo) as well as a high-motion video sequence (Carphone) were used. Results for MTM connections at 64 kbps over UTRAN show that the maximum quality loss (ΔPSNR) is 0.32 dB for low-motion video, and 0.48 dB for high-motion video under the most severe error conditions. The authors in Publication [P1] also investigate service flexibility (see section 4.4.4 for the definition of the QoS metrics). Results show that a change in the channel conditions or in the user's speed results in a quality variation (ΔQoSPSNR) in the range [-0.25, 0.25] dB for low-motion video, and in the range [-0.37, 0.37] dB for high-motion video, for 64 kbps MTM connections. Further details, including the achievable video frame rates and their ΔQoSFPS, are available in Publication [P1]. Simulation results for HSCSD are available in [78].

4.5.2. Delay optimization

Delay optimizations are often possible only at the sender and receiver terminals, through careful software design and implementation. An end-to-end delay analysis for video conferencing over PS networks is presented in [50]. An interesting perspective on the relation between delay and rate constraints, and on the fact that violating a rate constraint is equivalent to violating a delay constraint, can be found in [113]. VBR video delay is also analyzed in [141]. In the rest of this section, the focus will be on the CBR video delay analysis of 3G-324M terminals [112], excluding the network propagation delay and the layer 1-2 delays in both the sender and receiver terminals.

One drawback of CBR video is that the video delay is variable, since the frames have variable sizes depending on the compression efficiency and the characteristics of the sequence. The processing of the video signal in a multimedia telephony terminal can be divided into the following steps: 1) Capture video frame; 2) Encode video frame; 3) Multiplex compressed video bit stream into packets; 4) Transmit multiplexed packets over the channel (Layers 1-2); 5) Demultiplex received packets; 6) Decode video packets into a video frame; 7) Display video frame.

In addition to processing delays, there are buffers in each module. Therefore, the delay consists of several processing and buffering components. Buffering is needed to regulate the variable rate video stream to fit to a constant rate channel (shaping). Let us analyze each component separately.

Firstly, a video frame is captured and possible sub-sampling and format conversion are performed. Typically the frame rate in the camera is 29.97 frames per second (fps). At low bit rates, only a subset of the captured frames can be encoded and transmitted. The encoded frame rate can be fixed or variable. Since the output bit rate of the video encoder depends on the video content, and in practice it is hard to regulate the variation by quantization alone, a variable frame rate gives better opportunities to maintain the target bit rate.

Since the H.223 multiplexer can divide the video frames into several packets, the video encoder can pass data to the multiplexer before a whole frame has been encoded. This reduces the overall delay, since the encoding and multiplexing delays overlap. In general, the shorter the video packets, the shorter the delays, but the larger the overhead produced by the multiplexer.

In CBR transmission, variable rates need to be shaped to a constant output bit rate using the so-called leaky bucket algorithm [212]. In this context, the shaping buffer is located at the input of the multiplexer. If video data is read from the buffer at a constant rate, the average waiting time per frame is at least the time it takes to transmit a frame. For example, if the average video frame rate is 10 fps, the average (minimum) buffering delay per video frame is 1/10 s = 100 ms. This delay is called shaping delay [112]. The multiplexer processing delay can be considered negligible, since typically only a CRC is calculated and added to the data to fill each multiplexer packet.
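The shaping delay above can be illustrated with a small simulation. The sketch below (in Python; function and parameter names are illustrative, not taken from [112] or [212]) models a FIFO shaping buffer drained at a constant channel rate, and reproduces the 100 ms average delay of the 10 fps, 64 kbps example.

```python
def shaping_delays(frame_sizes_bits, frame_interval_s, channel_rate_bps):
    """Simulate a FIFO shaping buffer drained at a constant channel rate.

    Returns the buffering (shaping) delay of each frame, i.e. the time
    between a frame entering the buffer and its last bit leaving it.
    """
    delays = []
    backlog_clear_time = 0.0  # absolute time at which the buffer empties
    for i, size in enumerate(frame_sizes_bits):
        arrival = i * frame_interval_s
        start = max(arrival, backlog_clear_time)
        backlog_clear_time = start + size / channel_rate_bps
        delays.append(backlog_clear_time - arrival)
    return delays

# 10 fps video at an average of 64 kbps: each frame averages 6400 bits,
# so the average shaping delay per frame is 1/10 s = 100 ms.
avg_frame_bits = 64000 / 10
delays = shaping_delays([avg_frame_bits] * 20, 0.1, 64000)
print(round(sum(delays) / len(delays), 3))  # → 0.1
```

With variable frame sizes the per-frame delays fluctuate around this minimum, which is exactly why CBR video delay is variable.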

In the receiver, the H.223 demultiplexer receives data in packets. Since Layer 2 outputs blocks of data, the demultiplexer can start demultiplexing right away, but it needs to store the data until the whole packet has been received. Although it would be technically possible to forward the data to the video decoder immediately, to guarantee better error resilience it is preferable to wait and use the CRC to check whether the received packet is corrupted.

The frame is then displayed. Figure 9 shows the delay components and an overlapping scenario.

[Figure 9 arranges the capturing, encoding, multiplexing, transmission, demultiplexing, decoding and display stages on a common timeline, showing the capture, encoding, net multiplexing, net transmission, net demultiplexing, net decoding and display delay components, their overlapping, and the resulting end-to-end delay.]

Figure 9. Delay components in 3G-324M terminals

Simulation results for a 3G-324M terminal running on a 500 MHz Pentium III computer are shown in Table 11 [112]. The video bit stream was transmitted over a zero-delay, zero-loss network channel at 64 kbps. Two video codecs were compared, H.263+ and MPEG-4 Visual (Simple Profile), and also in this case the video sequences Akiyo and Carphone were used. Since the results did not show significant differences between H.263+ and MPEG-4 Visual, the table contains only the results for H.263+. The table shows the total delays (excluding capture and transmission delays). If these delays are added to the UMTS delays over CS channels (see section 3.3), an end-to-end delay of about 340 ms is obtained, which is acceptable for multimedia telephony applications (see Table 6). The overlapping factor (see section 4.4.3) measures the degree of processing overlap between different modules (0% indicates no overlapping, i.e., the different steps are processed in a completely serial manner). The reader may refer to [112] for further details, and note that this paper improves on the results achieved in Publication [P1] because a faster hardware platform was used.

TABLE 11. TOTAL DELAYS FOR H.263+ VIDEO ON 3G-324M

Video sequence | Shaping delay (ms) | Net processing delay (ms) | Total delay (ms) | Overlapping factor (%)
Akiyo          | 100                | 37                        | 137              | 41
Carphone       | 98                 | 42                        | 140              | 40
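As a sanity check on Table 11, the total delay is simply the sum of the shaping and net processing delays. In the snippet below, the roughly 200 ms UMTS CS-channel figure is an assumption, chosen only so that the totals land near the "about 340 ms" end-to-end delay quoted in the text; it is not a value measured in [112].

```python
# Delay decomposition from Table 11 (H.263+ over 3G-324M at 64 kbps).
# UMTS_CS_DELAY_MS is an assumed round figure, not a measured value.
UMTS_CS_DELAY_MS = 200

measurements = {
    "Akiyo":    {"shaping_ms": 100, "net_processing_ms": 37},
    "Carphone": {"shaping_ms": 98,  "net_processing_ms": 42},
}

totals = {}
for name, m in measurements.items():
    totals[name] = m["shaping_ms"] + m["net_processing_ms"]
    end_to_end = totals[name] + UMTS_CS_DELAY_MS
    print(f"{name}: terminal {totals[name]} ms, end-to-end ~{end_to_end} ms")
```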

4.5.3. Jitter buffer management

Packet-switched networks are generally affected by variable delay. In addition, the sender and receiver terminals may be affected by random delays (e.g., if other applications are processed simultaneously), and handovers may cause additional delay variation when moving from one cell to another (see section 3.2.6). Therefore, it is safe to assume that the received stream will suffer from delay jitter. Without buffering, packets that arrive too early, before their playback time, would need to be discarded, and the user may notice a playback glitch. Similarly, packets that arrive too late, after their scheduled playback time, are useless, because the utility of a real-time packet decreases over time; these packets must be discarded, and may also cause a noticeable playback glitch.

Typically, the solution adopted to combat delay jitter and avoid packet discarding is to place a de-jittering buffer between the RTP receiver and the media decoder, so that the delay variation can be smoothed out and made seamless while the data is played out. However, even a small buffer implies an increase in service latency (i.e., at least an increase in the delay before the first fragment of speech or video data is received). The jitter buffer should not be so large as to incur too much additional delay, nor so small as to be useless. The goal is to minimize the additional delay while at the same time minimizing the number of packets discarded at the receiver. Adaptive schemes for jitter buffer management for speech are described in [34, 207]. Other schemes are available in [184].
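As an illustration of such adaptive schemes (a sketch in the spirit of the algorithms cited above, not the specific method of [34] or [207]; class and parameter names are made up), the receiver can track a smoothed one-way delay and its variation, and set the playout deadline a few deviations above the mean:

```python
import random

class AdaptivePlayoutEstimator:
    """EWMA estimate of network delay and its variation; the target
    playout delay for a packet is delay_estimate + k * variation."""

    def __init__(self, alpha=0.998, k=4.0):
        self.alpha = alpha  # smoothing factor for the delay mean
        self.k = k          # safety margin, in units of delay variation
        self.d = None       # smoothed one-way delay estimate
        self.v = 0.0        # smoothed delay variation

    def update(self, send_ts, recv_ts):
        delay = recv_ts - send_ts
        if self.d is None:
            self.d = delay
        else:
            self.d = self.alpha * self.d + (1 - self.alpha) * delay
            self.v = self.alpha * self.v + (1 - self.alpha) * abs(self.d - delay)
        return self.d + self.k * self.v  # target playout delay (seconds)

est = AdaptivePlayoutEstimator()
random.seed(0)
# One packet every 20 ms, delays jittering around 100 ms.
for i in range(1000):
    target = est.update(i * 0.02, i * 0.02 + 0.1 + random.uniform(-0.02, 0.02))
print(target > 0.1)  # True: the target stays above the mean delay
```

Packets whose delay exceeds the current target arrive too late and are discarded; a larger k trades extra latency for fewer discards, which is exactly the trade-off described above.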

4.5.4. Inter-media synchronization

Synchronization between multiple media (i.e., lip synchronization) is not a new problem, and several solutions exist in the literature. The sync skew is often due to delay jitter in the network. In other scenarios, temporal media scaling (e.g., stream thinning) may also lead to media synchronization problems. In H.324, the receive path delay function adds delay to the audio, if needed, in order to help media synchronization (see Figure 7). The problem is essentially that of reducing the delay jitter at the receiver side (see section 4.5.3). The RTP protocol allows media synchronization to be performed in a rather easy way, especially if an SDP extension is used [59]. A survey of media synchronization techniques is presented in [116, 117]. Prior art methods for ensuring lip synchronization include, among others, gradually accelerating or retarding the media streams [135], or synchronizing only once every several video frames or only during speech talk spurts [202]. The interaction of speech and video delays is described in [115].

Human perception of lip synchronization in the mobile environment is the subject of Publication [P3]. Depending on the amount of synchronization skew, the media can still be perceived as synchronized. For the traditional TV environment [208], experience has shown that the typical acceptable sync skew is in the range [-80, 80] ms, called the in-sync region. The sync skew starts to be annoying to humans when it falls outside the range [-160, 160] ms (the out-of-sync region). The ranges [-160, -80] ms and [80, 160] ms form a transient region, where the detection of lip synchronization problems depends on the speaker size (head, shoulder and body view). In the TV environment, the values -160 ms, -80 ms, 80 ms and 160 ms are called lip synchronization thresholds.

Knowing the right lip sync thresholds for the mobile environment allows optimized mobile applications to be developed. A typical application buffers, for a certain time, media fragments that arrive too early (i.e., before time A in Figure 2 of Publication [P3]), and discards media fragments that arrive too late for playback (i.e., after time B in Figure 2 of Publication [P3]). To exploit this application behavior, knowledge of the lip synchronization thresholds is crucial in order to dimension the application buffers [206, 221] and to determine the boundaries of the in-sync region.

The findings in Publication [P3] include the determination of the detection point (i.e., the first point where the out-of-sync condition is detected by a subject), rather than the annoyance point (i.e., a point beyond the detection point where the out-of-sync condition is not only detected but also no longer tolerable). The detection point is a tighter and more conservative value than the annoyance point. We researched the thresholds A and B of Figure 2 in Publication [P3] for the mobile environment. Two cases were analyzed: the case where video is displayed before its corresponding audio, and the case where video is displayed after it. QCIF and SQCIF sizes were assessed for frame rates of 5-15 fps. The results are presented in Table 12 and show that mobile environments differ from traditional TV in that there is a higher tolerance when video comes much earlier than audio (compared to TV systems). This means that the maximum temporal skew between the two media can be [-280, 80] ms (ideal lip sync has 0 ms skew). The maximum sync skew among media streams could be signaled between multimedia telephony terminals using the mechanism in [172].

TABLE 12. LIP SYNCHRONIZATION THRESHOLDS FOR MOBILE ENVIRONMENT

Case         | Video before audio | Video after audio
SQCIF 5 fps  | -240 ms            | +80 ms
SQCIF 10 fps | -240 ms            | +80 ms
SQCIF 15 fps | -280 ms            | +80 ms
QCIF 5 fps   | -160 ms            | +80 ms
QCIF 10 fps  | -200 ms            | +80 ms
QCIF 15 fps  | -200 ms            | +80 ms
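The threshold regions can be captured in a small helper. The function below (its name and interface are illustrative) uses the TV-environment values quoted in the text as defaults; the Table 12 detection thresholds can be passed in for the mobile cases. Negative skew means video ahead of audio.

```python
# TV-environment defaults: in-sync within [-80, +80] ms, transient out to
# [-160, +160] ms, out-of-sync beyond that.
def classify_skew(skew_ms, early_det=-80, early_ann=-160,
                  late_det=80, late_ann=160):
    if early_det <= skew_ms <= late_det:
        return "in-sync"
    if early_ann <= skew_ms <= late_ann:
        return "transient"
    return "out-of-sync"

print(classify_skew(50))                    # → in-sync
print(classify_skew(-120))                  # → transient (TV thresholds)
# Mobile QCIF 15 fps detection threshold from Table 12: -200 ms.
print(classify_skew(-180, early_det=-200))  # → in-sync on mobile
```

The third call shows the key finding of [P3]: a skew that is already noticeable on TV can still pass undetected on a mobile screen when video leads audio.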


4.5.5. Packetization overheads

Example 2 of section 4.2.2 has already shown that the incoming/outgoing packet rates for low-end multimedia telephony terminals can be challenging. Since terminals may also differ in processing power, it can be advantageous to signal end-to-end the maximum packet rate supported by a terminal (or an application), so that the terminal at the other end will try to generate media traffic that better matches the receiver's processing capabilities [219]. This may mean generating larger packets at the cost of lower error resilience, but that is probably a better choice than not having a session at all. Of course, as already mentioned, using ROHC always helps to reduce the RTP/UDP/IP header overhead.

Mobile VoIP RTP traffic is particularly sensitive to RTCP traffic if the two flows are carried on the same PDP context, because the RTCP traffic may delay RTP packets or cause RTP packet losses [170]. When RTCP cannot be turned off, one solution to reduce the potential disruption of RTCP to the RTP flow is to keep the RTCP bandwidth and the size of RTCP packets as small as possible. RTCP packet sizes can be minimized by using only the minimal set of optional RTCP fields required by the application. A practical size limit for the RTCP sender is on the order of 2-5 times the speech RTP packet size. Additionally, the RTCP sender can attempt to schedule RTCP packets during speech inactivity periods. For example, if an RTCP packet is scheduled for a future time and a silence period starts, this RTCP packet can be sent immediately; subsequent RTCP packets are then scheduled according to the normal rules (i.e., as if the previous packet had been sent as originally scheduled) [38]. Other solutions include using reduced-size non-compound RTCP packets [134].
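The silence-period scheduling trick can be sketched as follows (a minimal illustration, not the full RTCP timing rules of [38]; class and method names are hypothetical):

```python
class RtcpScheduler:
    """Send a pending RTCP packet early when silence begins, but keep
    timing subsequent packets from the original schedule."""

    def __init__(self, interval_s=5.0):
        self.interval = interval_s
        self.next_scheduled = interval_s  # nominal send time of next RTCP

    def on_silence_start(self, now, send):
        if now < self.next_scheduled:
            send(now)  # send the pending RTCP packet during the silence
            # Time the next packet from the ORIGINAL schedule, not from now.
            self.next_scheduled += self.interval

    def on_timer(self, now, send):
        if now >= self.next_scheduled:
            send(now)
            self.next_scheduled += self.interval

sent = []
sched = RtcpScheduler(interval_s=5.0)
sched.on_silence_start(2.0, sent.append)  # silence at t=2: early send
sched.on_timer(5.0, sent.append)          # t=5: slot already covered
sched.on_timer(10.0, sent.append)         # t=10: next nominal slot
print(sent)  # → [2.0, 10.0]
```

The RTCP report at t=5 is thus moved into the silence period at t=2, where it cannot collide with speech RTP packets, while the long-term RTCP rate is unchanged.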

Experiments with 3G-324M in Publication [P1] have shown that the packetization overhead caused by the H.223 multiplexer for a multimedia telephony session at 64 kbps is 8%. This is slightly higher than the figure in the example of section 4.2.1, because the experiments used AL3 for the transport of video, which employs stronger error protection than AL2.

4.5.6. Session control signaling delay

Experience shows that, when making a phone call, waiting too long before hearing the first ring is annoying. In this section, the author looks at the issues related to connection delays, with a focus on MTSI terminals that use the SIP protocol.

In [100] the authors present results of SIP calls based on Internet traces, also considering proxies, redirect servers and UDP error bursts. In [79] the authors show results of SIP signaling for local calls when one of the endpoints is connected via WLAN with radio coverage from 0% to 99%. The PDDs (refer to section 4.4.5 for the metrics used here) were up to 150 ms. The authors also investigated the effect of a proxy server located 1500 km away, and found that the proxy added a delay of 2.4-2.7 s. Finally, the authors simulated local calls in a heavy-loss environment (up to 60% loss rate), and found that with losses higher than 5% the PDD increased considerably (to over 600 ms, whereas in better conditions the PDD was at most 55 ms). The CSR was 100% for loss rates up to 20%; it dropped to 77% for a loss rate of 50%, and was 0% at 60% losses.

In Publication [P2] the authors assess the impact of a 3G network on the signaling delays. In the experiments, one of the endpoints was connected to a 3G network emulator. The idea was to compare 3G calls against Intranet calls for local, national (500 km), international (1500 km) and overseas (over 10000 km) scenarios. The results for PDD, ASD and CRD were all below 250 ms. Low bit rate channels were also simulated, since SIP traffic may be routed over a thin bearer. The channel rates ranged from 2 to 256 kbps, and a constant 2% loss rate was injected into the network. The CSR was always 100%. The results show that a very small channel (2-5 kbps) is not suitable and may yield PDDs of up to one second; for bearers of at least 16 kbps the PDD is below 200 ms.


Chapter 5

Mobile Media Streaming

In the previous chapter, several relevant issues related to mobile multimedia telephony were analyzed from different angles. In this chapter a similar approach will be taken for another application: mobile multimedia streaming. Although Internet streaming is a technology more than a decade old, mobile streaming technology is relatively young. The first mobile phones with streaming capabilities appeared on the market around 2003 (e.g., the Nokia 3650 or 6600).

The basic differences between mobile streaming and multimedia telephony have already been emphasized in section 3.1, where it has been made clear that the network requirements for multimedia streaming applications are less stringent compared to multimedia telephony. This chapter will begin with architectural, services and classification considerations. Then, media traffic characteristics will be introduced followed by PDP context considerations and QoS metrics. Finally, the chapter will end with methods for QoS improvements.

5.1. MOBILE STREAMING ARCHITECTURES AND SERVICES

The high-level architecture of a typical IP-based mobile multimedia streaming system is depicted in Figure 10 (see also [P4] for a PSS end-to-end system architecture). The system enables point-to-point uni-directional communication. For the sake of clarity, distributed network topologies such as peer-to-peer streaming [182], multi-path streaming [136], or broadcast/multicast streaming (such as MBMS [169]) will not be considered in this dissertation.

A mobile streaming system consists mainly of three components: the streaming server, the mobile network, and the mobile streaming client. The streaming server is connected to a fixed IP network trunk. The server may reside either within the mobile operator's domain or outside it (e.g., in the Internet). The location of the streaming server is important when considering the end-to-end QoS of a streaming service. In fact, if the server is located in the public Internet, the QoS of the network trunk between the streaming server and the mobile network is not usually controlled by the mobile operator, and it may be of the non-QoS-guaranteed type. This can impact the perceived streaming service quality.

The media content is usually created off-line and is loaded onto the streaming server before any user can actually request its playback. Content which is expected to be highly requested can be replicated or cached in proxy servers employing appropriate techniques that make use of usage patterns. In other scenarios the content may be created live and streamed in near real-time.

[Figure 10 shows the streaming server attached to a fixed IP network, connected through the mobile network to the mobile streaming client over a bearer; media streams and session control flow in the downlink, and session control in the uplink.]

Figure 10. A typical mobile multimedia streaming system

The mobile network carries multimedia streaming traffic between the streaming server and the mobile streaming client via a logical connection (PDP context), which uses physical transport channels to enable data transfer in the downlink and uplink directions.

The mobile streaming client keeps a radio connection with the mobile network. The data flows received by the mobile client in the downlink direction are, for example, audio and video plus additional information for session establishment, control and media synchronization. The data flows sent by the client in the uplink direction are mainly session control data and QoS reports. The streaming server may react accordingly upon reception of those QoS reports, taking appropriate actions for guaranteeing the best possible media quality at any instant.

5.1.1. Classification

Streaming applications and use cases can be classified into at least the following three groups:

1. Real-time streaming. Here the server transmits media streams that have been encoded off-line or in real time. Usually the transmission rate is approximately the same as the encoding rate. The streaming client receives the media streams with little latency, decodes them and plays them back without additional delays. After playback, the data is discarded and not stored in the terminal. Therefore, the memory requirements in the client are minimal and there is no practical limitation on the temporal length of a media stream to be played back.

2. Downloading. This is also known as Download & Play. Here the server transmits media that have been encoded off-line only. The transmission rate can be lower or higher than the encoding rate. The client must wait until all the streams (i.e., files) have been received, and can subsequently decode and play them back. After playback, the data remains stored in the client device. This requires a larger amount of storage in the client device (e.g., for downloading a movie).

3. Progressive downloading. This is also known as Progressive Streaming and is a mix of the previous two. The server transmits off-line pre-encoded media at a transmission rate usually higher than the encoding rate, in order to allow the client to play back the streams while downloading. During playback, a larger client buffer is required (compared to real-time streaming), depending on whether past played data is discarded or not, and the amount of media to be downloaded for future playback. After playback, data is discarded or stored on the client. An example of progressive streaming is YouTube.

Table 13 summarizes the main characteristics of real-time streaming, downloading and progressive downloading.

TABLE 13. CLASSIFICATION OF USE CASES FOR UNICAST STREAMING

                          | Real-time streaming      | Downloading       | Progressive downloading
Content type              | Pre-encoded / Live       | Pre-encoded       | Pre-encoded
Server transmission rate  | ~ encoding rate          | Any rate          | > encoding rate
Client delay for playback | Low (seconds)            | High (even hours) | Low (seconds)
Post-playback data        | Data is discarded        | Data is stored    | Data is stored or discarded
Memory requirements       | Low (few seconds buffer) | High (up to GBs)  | Medium-high

A typical example of real-time mobile streaming is the PSS standard. The protocol used for media transport is RTP over an unreliable connection (UDP/IP), and the protocol for session set-up and control is RTSP over a reliable connection (TCP). See the next section for more details. Reliable RTSP streaming, meaning tunneling RTP traffic over RTSP and TCP/IP, is also an option for real-time streaming [200]; however, it will not be considered in the rest of the thesis.

Downloading is usually performed over a reliable connection, for example by using the HTTP protocol over TCP. A reliable connection is required because the user assumes that the download is error-free (as for any other file downloaded from the Internet).

Progressive downloading of media is usually carried over a reliable connection (e.g., TCP). The larger client buffer is required because, if playback is paused, the media stream continues to be downloaded until it ends. Playback will eventually be smoother, but the downside is that if the user does not like the video and quits the playback, the video will have been fully downloaded in vain.

In the rest of this thesis the focus will mostly be on protocols and architectures for real-time streaming and progressive downloading, with particular reference to the 3GPP PSS standard (and also the DLNA standard, wherever explicitly mentioned). Other protocols (such as SCTP) and standards for streaming (e.g., the 3GPP2 MSS) are out of the scope of this thesis.

5.1.2. The PSS Standard

The 3GPP Packet-switched Streaming Service (PSS) is defined in several specifications [36, 37, 43]. The basic use cases for PSS are music streaming, news-on-demand, video streaming, and live radio and TV programs. The signaling for the different use cases (including progressive downloading) is described in [36]. Signaling charts of typical streaming sessions are shown in [P4, P5] and [73]. Reference [37] covers the PSS transport protocols, and Figure 2 in Publication [P5] shows the functional components of a PSS client. The PSS capability exchange mechanism [37] makes it possible to adapt the streams to the mobile terminal characteristics (e.g., screen size). The scene description in PSS consists of a spatial layout (for the different media components on the screen) and a temporal relation between the different media (for the purpose of synchronization). The PSS protocol stack is depicted in Figure 11.

The session description file can be obtained via MMS or fetched via a URL. Progressive downloading of 3GP files is enabled by the HTTP protocol [101], which ensures reliable transmission via TCP. The RTSP protocol [200] is used for session set-up and control. RTSP allows the media playback to be controlled with VCR-type commands (e.g., Play, Stop, Fast Forward, Rewind, Pause). Typical RTSP message flows are shown in Figure 2 of [P4] and Figure 4 of [P5]. Figure 11 shows that discrete media are transported reliably over HTTP/TCP, while continuous media are transported unreliably over RTP (and UDP). However, continuous media can also be transported over HTTP in the case of (progressive) downloading. RTP media may also be transported over RTSP/TCP (RTSP/RTP interleaving [200]) or directly over TCP [143] to ease NAT (Network Address Translator) traversal, but this case will not be analyzed here.

Figure 11. PSS protocol stack


The set of speech, audio and video codecs supported by PSS systems is richer than the set of codecs for multimedia telephony applications, as described in Table 14. In addition, PSS supports synthetic audio, still images, bitmap graphics, vector graphics, dynamic scene description, timed text and timed graphics media types [37].

TABLE 14. SPEECH, AUDIO AND VIDEO CODECS SUPPORTED BY PSS SYSTEMS

Media codec      | Properties
AMR and AMR-WB   | Same as in Table 9.
Extended AMR-WB  | Audio encoded with a sampling frequency up to 48 kHz [39]. 16 modes with bit rates from 5.2 to 96 kbps and variable frame lengths from 13.33 to 40 ms.
Enhanced aacPlus | Audio encoded with a sampling frequency up to 48 kHz [40].
MPEG-4 AAC       | Audio encoded with a sampling frequency up to 48 kHz [119].
H.263 and MPEG-4 | Same as in Table 9.
H.264            | Constrained Baseline Profile Level 1.3 (example resolution up to CIF, up to 768 kbps) and High Profile Level 3.0 (example resolution up to 625 SD at 25 fps, up to 12.5 Mbps).

One of the advanced features that PSS provides is the time-shifting functionality [37]. It allows the user to pause and play a live stream from a point other than the current live time, and to perform seek operations such as fast-forward, slow-forward and rewind (i.e., trick mode operations). This functionality is achieved by using a time-shift buffer in the PSS server for each of the live media streams. The buffer is updated in a sliding-window manner with a given depth, offering visibility only into a limited (most recent) amount of media data.
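A minimal sketch of such a sliding-window time-shift buffer follows (the data structure and names are illustrative, not the PSS server implementation):

```python
from collections import deque

class TimeShiftBuffer:
    """Keep only the most recent depth_s seconds of media segments
    seekable; older segments slide out of the window."""

    def __init__(self, depth_s):
        self.depth_s = depth_s
        self.segments = deque()  # (timestamp_s, data) pairs, oldest first

    def push(self, ts, data):
        self.segments.append((ts, data))
        # Drop segments that have slid out of the time-shift window.
        while self.segments and self.segments[0][0] < ts - self.depth_s:
            self.segments.popleft()

    def seek(self, ts):
        """Return segments from ts onward, or None if ts is outside
        the window (e.g., requested point no longer available)."""
        if not self.segments or ts < self.segments[0][0]:
            return None
        return [d for t, d in self.segments if t >= ts]

buf = TimeShiftBuffer(depth_s=30)
for t in range(0, 100, 2):   # a live stream producing a segment every 2 s
    buf.push(t, f"seg@{t}")
print(buf.seek(10))          # → None: slid out of the 30 s window
print(buf.seek(80)[:2])      # → ['seg@80', 'seg@82']
```

A trick-mode operation such as rewind is then just a seek to an earlier timestamp, and it fails gracefully once the requested point has left the window.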

5.2. MEDIA TRAFFIC CHARACTERISTICS

5.2.1. Content creation and distribution

This aspect is related to the fact that the streaming server may or may not be located within the operator's domain. If the server is located within, and controlled by, the mobile operator, it is easier to make the streaming traffic comply with the network behavior, because the content is created and delivered by the same entity that manages the network resources. Although this case eases the deployment scenarios, it is not the most common one. In general, the streaming server is located outside the operator's domain. In the most difficult case, there may even be two separate entities that manage the server chain: content creators and content distributors.

Content creators create the content (media encoding), while content distributors deliver the content to the end users. These two entities are generally both outside the mobile operator's domain. The challenge then becomes creating content that can be optimally delivered over mobile networks. If the content creators are general Internet content providers, it is possible that the content is optimized for general Internet use rather than for mobile delivery. If the content distributor is outside the operator's network, the content may still be delivered in a PSS-friendly way, even if the network trunk between the server and the entry point of the mobile network may still pose some challenges due to the unpredictable nature of the network traffic. If the streaming server is not even PSS-friendly, then the PSS client must make worst-case assumptions regarding content creation, the compliance of the streaming server with the PSS standard, and the location of the streaming server within the end-to-end delivery chain. Here, PSS-friendly means that the streaming server is aware of how the content has been created (e.g., its traffic characteristics) and also of the underlying network characteristics and client features. This awareness allows the server to make minimal adjustments during the media delivery. If the content is made available to the streaming server in advance, the server may pre-scan the content in order to extract the relevant statistical information from the media; this analysis could also be performed at delivery time.

5.2.2. Media content and rate controls

The second aspect, linked to the first one, relates to the nature of the content and its creation. Streaming content can be live or pre-encoded. In the case of live content, and depending on the location of the encoder, the media can be generated in a more or less PSS-friendly way, depending on the level of awareness that the encoder has with respect to the network and the PSS client.

If the content is pre-encoded off-line, the problem is that of using optimal encoder settings such that the media transmission results in a good end-user experience. The encoder rate control algorithm plays an important role in this framework, because it is the mechanism that allocates the bits for each media frame. Rate control algorithms for multimedia telephony applications differ from those optimized for streaming: algorithms for streaming aim at achieving constant video quality at a constant frame rate with reasonably small initial buffer delay and buffer size requirements. An example rate control algorithm for video streaming is given in [213]. The reader interested in a comparison between a streaming rate control, a constant bit rate control and an unconstrained variable bit rate control for video (with unlimited buffer size) may refer to [43].

After encoding, the server decides the right packetization for ensuring optimal delivery. For example, speech data can be packetized using an arbitrary (but reasonable) number of speech frames per RTP packet, using bit or byte alignment, along with options such as interleaving [205]. Video data can be encapsulated using, for example, one slice per RTP packet, one GOB (row of macroblocks) per RTP packet, or one frame per RTP packet [176]. Transmission of RTP packets can occur in different ways according to a second (transport-layer) rate control strategy [43, 141]:

CBRP (Constant Bit Rate Packet) transmission: the delay between sending consecutive packets is continuously adjusted to maintain a near constant rate.

VBRP (Variable Bit Rate Packet) transmission: the transmission time of a packet depends solely on the timestamp of the media frame the packet belongs to, and it may also be subject to speed-up or slow-down depending on the adaptation capability of the server (see Chapter 6). Therefore, the media rate variation or the server adaptation strategy is directly reflected onto the channel.
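As an illustration, the two transport-layer strategies can be sketched as follows (a minimal sketch, not from the thesis; the frame list, the target rate and the function names are hypothetical):

```python
# Sketch: send-time computation for CBRP vs VBRP transmission of
# pre-encoded frames. `frames` is a hypothetical list of
# (timestamp_s, size_bytes) pairs produced by the encoder.

def vbrp_send_times(frames):
    """VBRP: each packet is sent at the media timestamp of its frame,
    so the encoder's rate variation is reflected directly onto the channel."""
    return [ts for ts, _size in frames]

def cbrp_send_times(frames, target_bps):
    """CBRP: the gap between consecutive packets is continuously adjusted
    (proportionally to packet size) so the output rate stays near target_bps."""
    times, t = [], 0.0
    for _ts, size in frames:
        times.append(t)
        t += (size * 8) / target_bps   # gap in seconds for this packet
    return times

frames = [(0.0, 800), (0.1, 200), (0.2, 1600), (0.3, 400)]  # bursty VBR source
print(vbrp_send_times(frames))          # bursts preserved on the channel
print(cbrp_send_times(frames, 64000))   # smoothed towards a constant 64 kbps
```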

5.2.3. Speech streaming traffic

This section is limited to speech traffic, and the following discussion is made in the context of the AMR and AMR-WB codecs. The packetizations considered are for 1, 10, 20 and 50 speech frames per RTP packet (corresponding respectively to 20 ms, 200 ms, 400 ms and 1 s of speech data). The RTP packet sizes described in Table 15 are for AMR when it requires the largest payload header (i.e., when using the octet-aligned mode, CRC and interleaving [205]). All the calculations include uncompressed RTP/UDP/IPv4 headers. Only the boundary bit rate cases are included in the table, in order to show the operating ranges (additional data is contained in [P5]). Considerations and recommendations for optimal speech traffic packetization for streaming are made in section 5.5.5.

TABLE 15. SPEECH TRAFFIC CHARACTERISTICS FOR STREAMING

AMR/AMR-WB mode (kbps)   Packet size (bytes)   Total bit rate (kbps)   Packet rate (pps)
1 AMR frame per RTP packet
  AMR 4.75                       56                   22.4                   50
  AMR 12.2                       75                   30.0                   50
  AMR-WB 6.6                     61                   24.4                   50
  AMR-WB 23.85                  104                   41.6                   50
10 AMR frames per RTP packet
  AMR 4.75                      182                    7.3                    5
  AMR 12.2                      372                   14.9                    5
  AMR-WB 6.6                    232                    9.3                    5
  AMR-WB 23.85                  662                   26.5                    5
20 AMR frames per RTP packet
  AMR 4.75                      322                    6.4                    2.5
  AMR 12.2                      702                   14.0                    2.5
  AMR-WB 6.6                    422                    8.4                    2.5
  AMR-WB 23.85                 1282                   25.6                    2.5
50 AMR frames per RTP packet
  AMR 4.75                      742                    5.9                    1
  AMR 12.2                     1692                   13.5                    1
  AMR-WB 6.6                    992                    7.9                    1
  AMR-WB 23.85                 3142                   25.1                    1
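The figures in Table 15 can be reproduced with a short calculation, assuming 40 bytes of uncompressed RTP/UDP/IPv4 headers (20 + 8 + 12) and the largest AMR payload header as stated above: a 2-byte preamble (CMR plus interleaving fields) and, per frame, 1 ToC byte and 1 CRC byte [205]. The per-mode frame sizes (in bits, one frame per 20 ms) are those of the AMR and AMR-WB codecs:

```python
import math

# Sketch reproducing Table 15 under the stated worst-case assumptions:
# octet-aligned AMR payload format with CRC and interleaving, and
# uncompressed RTP/UDP/IPv4 headers.

FRAME_BITS = {          # codec-mode frame sizes in bits (20 ms per frame)
    "AMR 4.75": 95, "AMR 12.2": 244,
    "AMR-WB 6.6": 132, "AMR-WB 23.85": 477,
}

def rtp_packet_size(mode, frames_per_packet):
    data = math.ceil(FRAME_BITS[mode] / 8)        # octet-aligned frame data
    return 40 + 2 + frames_per_packet * (2 + data)  # headers + preamble + frames

def total_bitrate_kbps(mode, frames_per_packet):
    pps = 50 / frames_per_packet                  # 50 speech frames per second
    return rtp_packet_size(mode, frames_per_packet) * 8 * pps / 1000

for n in (1, 10, 20, 50):
    for mode in FRAME_BITS:
        print(n, mode, rtp_packet_size(mode, n),
              round(total_bitrate_kbps(mode, n), 1))
```

Note how the per-packet cost (42 bytes of fixed overhead) dominates at 1 frame per packet, while at 50 frames per packet the total bit rate approaches the codec bit rate.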

5.2.4. Video streaming traffic

In this section, examples of video streaming traffic characteristics are shown, in order to give an idea of how variable the streaming traffic generated by a server can be. This also builds the basis and justification for adaptive streaming systems, which are the subject of the next chapter. In [161] data was collected from two different video streaming servers: 1. RTP-based PSS server with H.263+ video (Profile 0, Level 10), with CBRP and VBRP;


MOBILE MEDIA STREAMING


2. RealNetworks system with proprietary transport protocol and some server adaptation strategy [68].

In the case of the PSS server, different packetization algorithms were considered:
1.I. One frame per RTP packet, without maximum packet size limitation;
1.II. One GOB per RTP packet;
1.III. A target RTP packet payload size of 75 bytes, using H.263 Annex K slices [122].

Also in the case of the PSS server, different video rate control algorithms were used:

1.A. Fixed-QP encoding (QP=10); therefore, the rate variation of the encoded video sequence is not modified;

1.B. Long window rate control [213] (StreamRC). It maintains a fixed frame rate and consistent quality, and requires an initial buffering time before starting decoding;

1.C. TMN5 rate control [123], which is not optimized for video streaming but designed for real-time low-delay applications [213], and results in video frame rate variation.

Two different network models were used:

LAN with low, near-constant delay and no packet losses;

Simulated UTRAN with 76.8 kbps channel, RLC running in UNACK mode (with no errors).

The video sequence was captured at 15 fps in QCIF resolution and was a combination of different types of scenes (slow/fast motion with panning) with multiple scene cuts. The packet size statistics shown in Table 16 include RTP/UDP/IP headers and are related to LAN traffic [161].

TABLE 16. PACKET SIZE STATISTICS (IN BYTES) FOR DIFFERENT RATE CONTROLS AND PACKETIZATIONS

                 Average   Std. dev.   Minimum   Maximum
PSS Fixed QP
  III (Slice)      106        56          45        181
PSS StreamRC
  I (Frame)        573       398          67       4303
  II (GOB)          99        88          43        663
  III (Slice)      108        56          45        210
PSS TMN5
  I (Frame)        595       229          62       3375
  II (GOB)         102        79          43        759
  III (Slice)      109        56          45        241
RealNetworks
  N/A              521       154          64        668

The table shows that slice and GOB packetizations always generate rather small packets (less than 110 bytes long on average). This means that a high packet rate must be sustained by the server and the client. On the other hand, frame packetization generates larger packets (on average less than 600 bytes), but with a maximum size that exceeds 4 KB. Such large packets are subject to fragmentation at the IP layer (and at the lower layers too) and could produce a lower QoS (even if ACK mode is used in the lower layers) for the reasons already explained in section 3.2.7, especially if these large packet sizes are related to intra-coded video frames. RealNetworks traffic takes this aspect into account, because the generated packets are all smaller than 670 bytes.

The bit rate statistics for streaming traffic are shown in Table 17 [161]. Results for PSS are for transmissions over LAN and video slice packetization (1.III) for the VBRP and CBRP cases. Results for RealNetworks streaming are also shown.

TABLE 17. STREAMING BIT RATE STATISTICS (IN KBPS) FOR DIFFERENT RATE CONTROLS

                        Average   Std. dev.   Minimum   Maximum
PSS with VBRP
  A (QP10)                64.0      58.1         5.4      356.3
  B (StreamRC)            64.5      27.2        17.7      184.4
  C (TMN5)                63.2       1.8        54.7       71.4
PSS with CBRP
  A (QP10)                62.9       0.8        60.8       66.0
  B (StreamRC)            63.5       0.7        61.3       66.2
  C (TMN5)                63.5       1.0        59.9       67.9
RealNetworks
  LAN                     49.3       5.0        40.9       66.1
  UTRAN, 0% loss rate     49.5       5.6        39.1       70.3

As can be seen from the results, the fixed-QP encoding exhibits a large rate variation, which is fully transparent if VBRP transmission is used (from about 5 up to over 350 kbps). However, if the streaming server targets a constant output rate, the VBR nature of the encoded stream can be constrained to exhibit a CBR behavior at the output of the server (with a bit rate ranging from about 61 up to 66 kbps). Similar considerations also apply to StreamRC, which exhibits a maximum bit rate of about 185 kbps with VBRP. The TMN5 rate control has a rather CBR character, but it also exhibits variable quality and frame rate, and is more suited for multimedia telephony applications.

Figure 12. Bit rate variation for RealNetworks streaming over different network scenarios


Figure 12 shows the instantaneous bit rate statistics for RealNetworks streaming over LAN and UTRAN (0% loss rate) [161]. More data on video streaming traffic is included in Publication [P5].

5.2.5. Other traffic

Typical RTSP messages in the downlink direction (from server to client) contain responses. In an ordinary session with no exceptions or errors, the responses are of the 200/OK type, the first of which also contains the SDP information for session description. Typical packet lengths for this type of traffic, for a session with speech and video, are in the range of 80-250 bytes (including uncompressed TCP/IPv4 headers). When including more advanced header fields, the size may increase significantly. SDP data can increase the size of an RTSP packet by some 500-1000 bytes, depending on the session information. Examples of individual RTSP message sizes can be found in [150].

RTSP messages in the uplink direction carry requests from the streaming client to the streaming server. Examples are Options, Describe, Setup (one for every media stream involved, unless the fast start-up feature is used in PSS [37]), Play, Pause and Teardown. Typical packet lengths for this kind of traffic are in the range of 120-300 bytes (including uncompressed TCP/IPv4 headers).

RTCP traffic in downlink direction is related to packets sent by the streaming server and received by the client (Sender Reports (SR)). The minimum size of RTCP packets (SR) is 92 bytes, and typical packets do not exceed 200 bytes (including uncompressed UDP/IPv4 headers).

RTCP traffic in uplink direction is related to the packets sent by the streaming client and received by the server (Receiver Reports (RR)). The minimum size of RTCP packets (RR) is 72 bytes, and typical packets do not exceed 300 bytes (depending on the optional parts used).

Typical HTTP messages in the downlink direction containing media are as follows: SMIL scene descriptions are usually a few KB; still images and bitmap graphics can be up to several MB; vector graphics and synthetic audio can be up to a few hundred KB; text, timed text and synthetic audio are typically a few KB. It has to be noted that the maximum legal IP packet size is 64 KB, and packets larger than 1500 bytes are subject to fragmentation at the IP layer.

Finally, standard HTTP messages in the uplink direction (this does not include messages related to adaptive HTTP media) are usually 200/OK signals and are below 350 bytes.

5.3. PDP CONTEXTS CONSIDERATIONS

When RTSP is used for establishing a multimedia streaming session, there are several IP data flows involved in the communication between the streaming server and the streaming client. These are described also in Publication [P5] and include an RTSP data flow for signaling purposes (user control plane). This flow is bi-directional, to allow a handshake and control between server and client. Usually RTSP signaling is carried over a reliable connection (TCP/IP), but it is also possible that this traffic travels over an unreliable connection (UDP/IP) (see Figure 11).

If HTTP is used for progressive streaming or downloading media, a single IP flow is involved in the communication between server and client. The flow contains HTTP messages that travel over a reliable connection (TCP/IP). These messages are bi-directional to allow a handshake between the server and the client. In addition, the same flow contains HTTP messages that include media data that travel uni-directionally from the server to the client in the downlink direction over a reliable connection (TCP/IP). The signaling data carried over HTTP are capability exchange and SDP. The media data carried over HTTP can be the same set of continuous media as carried over RTP, or static media. All could also be encapsulated into 3GP files in order to be downloaded.

Regarding other aspects related to the number of PDP contexts, considerations similar to those made in section 4.3.1 for multimedia telephony also apply here. In Publications [P4, P5], scenario 3, as described in section 4.3.1, has been used. Furthermore, QoS attribute usage in streaming sessions is defined in [75].

5.4. STREAMING QOS METRICS

In this section, QoS metrics for multimedia streaming applications are introduced. The rationale for using QoS metrics for streaming is the same as for mobile multimedia telephony applications. Some of the metrics described in section 4.4 could certainly be reused for streaming with little or no changes; these are not discussed further in this section. Instead, this section focuses on specific QoS metrics that find applicability in the mobile multimedia streaming application domain. Some useful streaming metrics are defined in Publications [P4, P5] and [150]. These are: connection setup delay, Pause-Play delay, round-trip time, initial buffering delay, total user delay, Teardown delay, packet flow stop time (or handover time) perceived by the application, memory requirement for buffering, packet loss rate, and media bit rate.

As can be noticed, the above metrics are all delay-related except the last three. Results presented in the remainder of this chapter are shown using these metrics.

5.4.1. QoE metrics for PSS

Quality of Experience (QoE) metrics for PSS have been proposed in [76, 84, 171, 214, 217] and standardized in [37]. These are very useful for optimizing the application performance, and also for performance monitoring by operators. Examples of such metrics (applicable only to continuous media) are: video corruption duration, successive loss of RTP packets, jitter duration, synch loss duration, rebuffering duration, initial buffering duration, frame rate deviation, content switch time, average codec bit rate and buffer status. In addition, a QoE protocol has been developed to allow tuning of the metrics sending rate and internal resolution.


The former defines how frequently the streaming client must report the recorded metrics to the server, whereas the latter defines how frequently the client must sample the QoE metrics to be stored in the terminal for transmission according to the sending rate [37]. The metrics are timestamped and refer to a precise measurement period.

5.5. MOBILE STREAMING QOS IMPROVEMENTS

5.5.1. Content creation

In sections 5.2.2 and 5.2.4, rate control was mentioned as one important aspect of content creation. Another aspect is the Intra-coded frame rate to be used in this phase, which depends heavily on the expected loss rate. Since the off-line encoding is made in advance, efficient network loss models are required. In our investigations in Publication [P5] we used an Intra-coded frame rate of 0.2 fps (i.e., one Intra frame every 5 seconds), because it was shown to give the best trade-off between smooth frame rate and error resilience under a packet loss rate of 3%. The reader may refer to Figure 5 of Publication [P5]. It has to be noted that we used the TMN5 rate control, which is not optimized for streaming; a different rate control may yield different optimal Intra-coded frame rates.

5.5.2. Packet loss handling

To fight packet losses in streaming there are many techniques similar to those already mentioned in section 4.5.1. An interleaving technique for streaming is presented in [197]. Automatic Repeat reQuest (ARQ, also referred to as Backward Error Correction (BEC) or simply retransmission) is a technique that allows the receiver to give positive or negative acknowledgements to the sender, and to receive the retransmission of erroneous or missing data. These methods can be used only if the RTT in the network is low enough and the whole feedback loop can be closed within the delay bounds required by the application. In other words, retransmitted data must be received before its playout time in order for the ARQ system to be useful. An example is RTP retransmission [190], which may handle packet losses whenever the efficiency of the RLC ACK mode is limited (or when the RLC is run in UNACK mode). In this sense, RTP retransmission can fight the residual packet losses that might still be present despite using the RLC ACK mode, as well as all the losses that occur between the streaming server and the RAN (e.g., congestion losses in the fixed Internet trunk). RTP retransmission can also be used to handle packet losses caused by cell reselections (see section 5.5.6).

RTP retransmission is suitable for multimedia streaming because, unlike TCP, it is not fully persistent. After being informed by the receiver through an RTCP NACK (Negative Acknowledgement), the server has the freedom to decide whether to retransmit a missing RTP packet or not. In order to make the retransmission decision, the server estimates whether a retransmitted packet could still arrive at the client before its scheduled playout, assuming continuous real-time playback. In addition, not all packet losses have the same effect on the decoded media quality, as some lost packets can be more easily concealed than others. The server is also able to consider such aspects in the retransmission decision, and can selectively choose whether to retransmit a packet depending on how difficult loss concealment would be (Publication [P6]). RTP retransmission is used in conjunction with RTP/AVPF [178] to signal missing packets via NACKs or PLI/SLI messages (see section 4.5.1) from receiver to sender. Finally, an analysis of event-driven RTCP feedback compared to constant feedback for mobile streaming is available in [80, 81].
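A minimal sketch of such a server-side decision (a hypothetical helper, not the algorithm of Publication [P6]; the concealment-difficulty score and threshold are illustrative assumptions):

```python
# Sketch: server-side retransmission decision upon receiving an RTCP NACK.
# All times are in seconds; concealment_difficulty in [0, 1] is a
# hypothetical score (higher = harder to conceal at the decoder).

def should_retransmit(now_s, playout_deadline_s, one_way_delay_s,
                      concealment_difficulty, threshold=0.5):
    """Retransmit only if the packet can still reach the client before its
    scheduled playout (continuous real-time playback assumed) AND the loss
    is hard enough to conceal to be worth the extra bandwidth."""
    arrives_in_time = now_s + one_way_delay_s < playout_deadline_s
    worth_it = concealment_difficulty >= threshold
    return arrives_in_time and worth_it

# A loss 2 s before playout over a 0.3 s path, hard to conceal: retransmit.
print(should_retransmit(10.0, 12.0, 0.3, 0.9))   # True
# Same loss, but playout is only 0.2 s away: too late, skip it.
print(should_retransmit(10.0, 10.2, 0.3, 0.9))   # False
```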

Mobile streaming experimental results under GPRS (see Table 12 in Publication [P5]) have shown that Reliability Classes 2 and 3 (which use ACK modes) are packet-loss free and the optimal ones for deploying streaming applications over GPRS, even though this network was not designed to support real-time traffic. Reliability Classes 4 and 5 exhibit increasing loss rates, up to an average of 1% in normal conditions (i.e., under no cell reselections). EGPRS networks offer average loss rates of 0% for MCS-1..7 (see Publication [P4]), with 0.2% and 0.3% loss rates respectively for MCS-8 and MCS-9 [149]. UTRAN networks offer average loss rates of 0.1%, as reported in Publication [P4].

5.5.3. Session control signaling delay

Session set-up delay is one of the most important factors in determining the efficiency of a streaming service. When RTSP is carried over TCP, two types of connections are possible:

Persistent, where a connection is used for several RTSP request/response pairs;

Non-persistent, where a connection is used for a single RTSP request/response pair. Every non-persistent connection starts with the TCP synchronization, a three-way handshake (SYN, SYN-ACK, ACK), before any RTSP message can be sent (see Figure 4 in Publication [P5]), and this causes a considerable increase of the signaling time. For this reason, the use of persistent TCP connections is recommended in order to keep the signaling time as low as possible [37]. In the experiments carried out in Publications [P4, P5], persistent TCP connections are used, and the TCP synchronization occurs only once, before the first client RTSP message is sent. This synchronization delay is always included in the reported set-up delays. The individual TCP synchronization delay is reported to be 3 seconds over a GPRS 2+1TS connection with CS-2 [150].

TABLE 18. SESSION CONTROL SIGNALING DELAYS (IN SECONDS) FOR GPRS, EGPRS AND UTRAN

                     GPRS   EGPRS   UTRAN   UTRAN loaded
Set-up delays (s)    7.3    5.2     1.4     1.9
Pause delays (s)     1.6    1.1     0.4     -
Play delays (s)      1.0    1.1     0.2     -
Teardown delay (s)   1.7 (GPRS 2+1TS, CS-2)

Table 18 summarizes the average streaming session control signaling delays under a GPRS connection with 3+1TS, CS-2 coding scheme and Reliability Class 3 (see Publication [P5]), except the Teardown delay, which was measured under a 2+1TS configuration [150]; these are compared with results over EGPRS with 2+1TS using the MCS-7 coding scheme, and over UTRAN with a 384 kbps bearer (see Publication [P4]).

From the table above, the following observations can be made. The higher the available bandwidth for signaling, the lower the signaling delays. When the UTRAN network is loaded up to the point of yielding only a 64 kbps bit rate, the set-up delays increase. Pause delays are higher than Play delays, because the former message is sent while the network is filled with RTP packets, whereas the latter is sent when the server is not sending media data. The Teardown delay can be interpreted as the time required before a new streaming session can be started. The Play delay for EGPRS unexpectedly appears slightly higher than that for GPRS; however, the maximum values for EGPRS and GPRS are 1.4 s [149] and 1.6 s respectively, as reasonably expected.

One of the interesting recent features provided by PSS is the fast start-up capability, which aims at shortening the initial set-up delay [37, 216]. The mechanism works by pipelining the Setup and Play RTSP messages. This optimization reduces the number of RTTs from 3-4 to 1-2 (see Figure 4 in Publication [P5] for reference). A similar optimization is done in PSS for allowing fast content switching. This is very useful when a user who is watching a media clip wants to switch to another clip, or to another subtitle or audio track for the same video. This feature also allows adding or removing media components for an ongoing media session. PSS enables these operations in just the time of one RTT [37].

5.5.4. Receiver buffer management

One of the main functions of the receiver buffer is to smooth out the delay jitter that may occur between the sender and the receiver. For example, a GPRS network may naturally introduce additional jitter just by using the link adaptation mechanism [149]. De-jittering the incoming media flow is therefore essential for a good user experience. Examples of jitter buffer algorithms for audio and video streaming are given in [146, 184].

The receiver buffer can be adaptive in size, so that performance can be improved. Usually, when a streaming session starts, a certain amount of the buffer is pre-filled before playback starts. There is a trade-off between the initial playback delay and the client's resilience against delay jitter or bandwidth variations. For example, during a period of low bandwidth a buffer underflow may occur, causing a period during which the displayed video is frozen and the buffer needs to be re-filled (rebuffering) before playback can restart. It is good practice to pre-fill the buffer by an amount that keeps the streaming client in a healthy state for the whole duration of a session. This amount translates into waiting time for the user. To reduce the initial delay, some optimization techniques can be used. For example, the RealNetworks server uses a higher transmission rate at the beginning of a session in order to fill the receiver buffer in the shortest possible time (see Figure 12). This is an efficient technique, but it requires that the extra bandwidth for achieving the speed-up is available. Whenever this extra bandwidth is not available, another option is to fast-fill the receiver buffer in the time dimension, by keeping the transmission rate unaltered but switching the media stream to a lower bit rate (if available). This temporarily reduces the media quality, but ensures a reduced initial playback delay.

Results in Publication [P4] showed that mobile streaming over a UTRAN 384 kbps bearer requires a client buffer of at least 3 seconds in order to guarantee pause-less playback. At this point it is possible to compute the Total User Delay, from the time the user presses the play button up to the time the media plays back. It is the sum of the set-up delay and the buffering delay; for UTRAN it is 1.4 s + 3 s = 4.4 s (assuming a constant transmission rate equal to the encoding rate, and without optimizations such as fast start-up or fast filling). This time is only a lower bound, because results in Publication [P5] have shown that filling X seconds of media buffer at a transmission rate equal to the encoding rate usually requires more than X seconds.

Rebuffering delays during a streaming session can have several causes, as extensively reported in Publication [P5]: 1) rebuffering caused by bandwidth variability (among other reasons, because of oscillations in the coding schemes and number of time slots in (E)GPRS), delay jitter and packet losses that lead the client buffer to the underflow state; 2) rebuffering caused by a cell reselection event; 3) rebuffering with a cell reselection in the middle. All these cases generate variable waiting delays for the user, and expressions for the calculation of such rebuffering delays are given in Publication [P5].
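The initial-buffering and rebuffering dynamics can be illustrated with a toy simulation (a sketch under illustrative assumptions, not from the thesis: 64 kbps media, 3 s pre-fill, one-second time steps):

```python
# Sketch: toy receiver-buffer simulation showing initial buffering,
# playback, and rebuffering on underflow. `arrival_kbps` lists the
# received throughput per one-second step.

def simulate(arrival_kbps, media_kbps=64, prefill_s=3):
    buffered_s = 0.0          # seconds of media currently in the buffer
    state = "buffering"
    rebufferings = 0
    for rate in arrival_kbps:
        buffered_s += rate / media_kbps        # seconds of media received
        if state == "buffering":
            if buffered_s >= prefill_s:        # pre-fill reached: play
                state = "playing"
        else:
            buffered_s -= 1.0                  # playback drains 1 s per step
            if buffered_s <= 0:                # underflow: freeze and refill
                buffered_s = 0.0
                state = "buffering"
                rebufferings += 1
    return rebufferings

# Steady delivery at the encoding rate never underflows; a bandwidth
# outage (e.g. a lossy cell reselection) triggers a rebuffering.
print(simulate([64] * 30))                          # 0
print(simulate([64] * 10 + [0] * 6 + [64] * 14))    # 1
```

The trade-off from the text is visible here: a larger `prefill_s` survives longer outages at the cost of a longer initial waiting time.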

5.5.5. Packetization overheads and optimal packet sizes

While there are no theoretical limitations on the use of small packet sizes, one must be aware that too small RTP packets may produce three drawbacks [43]:
1. The RTP/UDP/IP packet header overhead becomes too large compared to the media data;
2. The bandwidth requirement for the bearer allocation increases, for a given media bit rate;
3. The packet rate increases considerably, producing challenging situations for the server, the network and the mobile client.

As an example, Figure 13 shows a chart with the bandwidth repartition between RTP payload media data and RTP/UDP/IPv6 headers for different RTP payload sizes. The space occupied by RTP payload headers is considered to be included in the RTP payload. The smallest IP packet sizes (74, 92 and 121 bytes) correspond to the minimum payload sizes for AMR at 4.75 kbps and 12.2 kbps and for AMR-WB at 23.85 kbps (1 speech frame per packet). As the figure shows, too small packet sizes (<= 160 bytes) yield an RTP/UDP/IPv6 header overhead of 38 to 81%. For large packets (>= 560 bytes) the header overhead is 4 to 11%. For small media payload and signaling packets, this overhead could be reduced by using header or data compression algorithms (see section 2.3). However, one should also be aware of the implications of using large packets, and of the opportunity of setting limits on the maximum packet sizes generated by PSS servers. In general, it must be assumed that the larger the payload sizes, the higher the end-to-end latency and the jitter at the PSS client [43], especially if RLC ACK mode is used. If RLC UNACK mode is used, large packets are more susceptible to losses. Therefore, in this case, smaller packet sizes are preferable. See section 3.2.2 for other considerations.
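The overhead figures quoted above can be checked directly, assuming uncompressed RTP/UDP/IPv6 headers totalling 12 + 8 + 40 = 60 bytes per packet:

```python
# Sketch: header overhead fraction for the IP packet sizes of Figure 13,
# assuming uncompressed RTP (12) + UDP (8) + IPv6 (40) headers.

HEADERS = 12 + 8 + 40   # bytes per packet

def header_overhead(ip_packet_size):
    """Fraction of the IP packet consumed by RTP/UDP/IPv6 headers."""
    return HEADERS / ip_packet_size

for size in (74, 92, 121, 160, 260, 560, 810, 1060, 1310, 1460):
    print(size, f"{header_overhead(size):.0%}")
```

For the 74-byte packet (one AMR 4.75 frame) the overhead is about 81%, and at 160 bytes about 38%; from 560 bytes up it falls to 11% and below, matching the ranges stated above.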

Figure 13. Bandwidth repartition among payload and headers for different IP packet sizes

A few recommendations can be given for AMR and AMR-WB speech traffic (see Table 15):

Encapsulating one speech frame per packet is possible, but not recommended. This would yield a packet rate of 50 pps, equivalent to a conversational multimedia service. If RLC ACK mode is going to be used as an option for the transport bearer, larger packets can be sent by the streaming server without compromising the speech quality (at the cost of additional end-to-end delay and jitter). However, a client does not know in advance what packetization strategy is used at the server side, and it is recommended that a client be able to handle the situation in which the server sends very small speech packets at a high rate. Another disadvantage of encapsulating one, or a small number of, speech frames per RTP packet is that the RTP/UDP/IP header overhead grows and leaves less room for the media bandwidth, for a fixed bearer size. Conversely, for a fixed media bit rate, an enlarged bearer would need to be set up.

A typical server implementation should encapsulate 10 or 20 speech frames per RTP packet. Packet sizes for AMR are from 182 to 702 bytes, and from 232 up to 1282 bytes for AMR-WB.

Encapsulating one second of speech per RTP packet (50 speech frames) yields packets larger than 1500 bytes (up to 3142 bytes). This would cause packet fragmentation at the IP layer, and a higher header overhead plus additional delay and vulnerability against losses. Therefore, it is not recommended to packetize one second of speech into an RTP packet (unless the AMR bit rates are known to be below 12.2 kbps and the AMR-WB bit rates are known to be below 12.65 kbps).

The recommended maximum number of speech frames to be encapsulated in an RTP packet should take into account the guidelines given in [43], which suggest limiting payload packets to 1376 bytes (over IPv6).

Video traffic is subject to a broader range of operating points than speech or audio. For instance, the highest H.264 profile and level allows video bit rates up to 12.5 Mbps. For these rates it is recommended to use the largest possible packet sizes, in order to decrease the instantaneous packet rates outgoing from the server and incoming to the mobile device, so as not to consume too many resources at the client side. With a video payload of 1400 bytes, a high video bit rate as mentioned above would yield packet rates of over 1100 pps, which is considerably challenging for mobile devices.
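The packet-rate figure follows from simple arithmetic:

```python
# Sketch: packet rate for a 12.5 Mbps video stream packetized with
# 1400-byte payloads (payload-only, headers ignored for this estimate).

def packet_rate_pps(bitrate_bps, payload_bytes):
    return bitrate_bps / (payload_bytes * 8)

rate = packet_rate_pps(12_500_000, 1400)
print(round(rate))   # 1116 pps, i.e. over 1100 packets per second
```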

5.5.6. Robust cell reselection management

Cell reselections and handovers can be of different natures. In section 3.2.6 a classification of the different cell reselections was given. Depending on its type, a cell reselection can produce higher delay jitter or a higher packet loss rate for its duration. Some cell reselections are lossless and do not produce any negative effect on the application. The problem is how to detect cell reselections and, ultimately, how to hide the mobility effects from the application layer. Streaming handover management solutions in WLAN proxy-based and IMS architectures have been described in [53, 54]. In [70] the authors present a WLAN scheme based on the concept of soft handover and multipath streaming, where a server sends two simultaneous streams through two access points for a certain period of time.

A more detailed description of the effects of cell reselections on signaling and media transmission is contained in Publication [P5]. Figure 6 in [P5] clearly shows the dynamics of GPRS cell reselections with Reliability Class 4: A) the packet loss rate increases first; B) then there is a period during which the packet flow stops and no data is received by the streaming client; C) subsequently, data is received again with a certain decreasing packet loss rate. Experimental results for Intra-BSC cell reselections over GPRS in Table 14 of Publication [P5] have shown that the length of period B varies depending on the reliability class: the higher the reliability used, the longer the packet flow stop time perceived by the application. For instance, when Reliability Class 2 is used, the average packet flow stop time is 3.7 seconds. However, in this case the cell reselection is lossless and the disruption is perceived by the application only as delay jitter. For the other reliability classes, the packet flow stop produces packet losses for its whole duration. Better results are achieved by EGPRS (see Table 6 in Publication [P4]), which exhibits a packet flow stop time of 2.9 seconds. The best results were achieved for soft and softer handovers in UTRAN (see Table 4 in Publication [P4]), where the packet flow stop time was zero and there were no losses or increased delay jitter.


MOBILE MEDIA STREAMING


The buffer scenario under a cell reselection (CR) is shown in Figure 14 [163]. Before the CR, the buffer is in a healthy state. During the CR, the buffer becomes partially empty because no data arrives, while the data that has already arrived is played out at a constant rate. After the CR, the buffer contains a temporal gap, and severe prediction errors occur if the first video frame received after period B is not an Intra-coded frame. In addition, the buffer is emptier than before the CR and must be refilled up to the normal level (otherwise another near-term CR would produce a buffer underflow).

Figure 14. Buffer status under a cell reselection event

In Publication [P5] a simple and robust solution for lossy handovers is described. It was originally designed in [83], and has later been adopted in the 3GPP PSS standard [43, 163]. It is based on retransmission. The client can detect a handover by monitoring the cell identifier or the amount of received data. If the client does not receive data for an amount of time X, and then starts to receive data again after an amount of time Y (Y>X being the real handover duration), it can trigger a robust handover procedure and request the server to re-play the stream from the break point. Results in Publication [P5] have shown the benefits of this technique in the case of video streaming. With the robust handover mechanism, the PSNR gain was 2.2 dB compared to the case where no robust mechanism was used. Figure 7 in Publication [P5] shows the instantaneous PSNR in both cases. During the CR period, when this technique is not used, the PSNR drops dramatically to very low values. When using the proposed solution, the PSNR is not impacted by any degradation. Publication [P5] also gives recommendations on the optimal buffer size for GPRS streaming, which is 9 seconds. This should guarantee pause-less playback also in case of handovers.
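The gap-detection part of this procedure can be sketched as follows (a minimal illustration, not the actual [P5] implementation; the function name and threshold handling are assumptions):

```python
def find_handover_breakpoints(arrival_times, x_threshold_s):
    """Scan RTP packet arrival times (in seconds). A reception gap longer
    than x_threshold_s suggests a handover; return the time of the last
    packet received before each gap, i.e., the break point from which the
    client should ask the server to re-play the stream."""
    breakpoints = []
    for prev_t, cur_t in zip(arrival_times, arrival_times[1:]):
        if cur_t - prev_t > x_threshold_s:
            breakpoints.append(prev_t)
    return breakpoints
```

For example, with arrivals at 0.0, 0.1, 0.2, 3.5 and 3.6 s and X = 1 s, a single break point at 0.2 s is detected.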

Since the buffer level after a CR is low, fast buffer filling is possible [163]. The PSS server may switch to a lower-rate stream, but keep the same sending rate for the period of time during which the buffer needs to be fast-filled. This causes more data to be written into the buffer than is read out of it over the same period of time.
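The refill dynamics follow from simple arithmetic: when the sending rate exceeds the (switched-down) stream rate, the buffer gains media time at a constant rate. A hedged sketch (helper name and parameterization are illustrative, not from the standard):

```python
def refill_time_s(deficit_s, sending_kbps, stream_kbps):
    """Time needed to regain deficit_s seconds of buffered media when the
    server keeps the old sending rate but switches to a lower-rate stream."""
    # seconds of media gained in the buffer per second of transmission
    gained_per_second = sending_kbps / stream_kbps - 1.0
    if gained_per_second <= 0:
        raise ValueError("sending rate must exceed the stream rate to refill")
    return deficit_s / gained_per_second
```

For instance, a 4-second deficit, sent at 64 kbps while playing a 48 kbps stream, is recovered in 12 seconds.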

5.5.7. Optimization in the lower protocol layers

Considerations on retransmissions at the LLC and RLC layers have been made in section 3.2.4. Despite the advantages offered by the RLC ACK mode [148], it has some disadvantages:



Increased end-to-end delay (since retransmission attempts at layer 2 increase the delay budget, and the delay jitter as well, especially if the retransmissions are fully persistent).

Less available bandwidth for media (since part of the bearer is left for RLC retransmissions).

The first issue can be handled by using a larger buffer in the client. For the second issue, Table 19 shows examples of estimated maximum media bit rates after RLC ACK mode retransmissions and layer 2, 3 and 4 header overheads. It also includes real measurements over (E)GPRS and UTRAN (see Publications [P4, P5] and [149]). For the media bit rate estimations, the following assumptions were made for UTRAN: number of RLC ACK retransmissions = 2 (0% target SDU error rate), BLER = 10%, RTP payload size = 500 bytes, RLC block size = 80 bytes, RLC header = 3 bytes, PDCP header = 1 byte, uncompressed RTP/UDP/IPv4 header = 40 bytes, ROHC RTP/UDP/IPv4 header = 4 bytes.

TABLE 19. MAXIMUM MEDIA BIT RATES AFTER LOWER LAYER RETRANSMISSIONS AND PROTOCOL HEADERS

Bearer bit rate (kbps)      Media bit rate, uncompressed headers (kbps)   Media bit rate, compressed headers (kbps)
16 (UTRAN)                  9.6 (estimated)                               12.9 (estimated)
17.6 (EGPRS 2TS MCS-1)      12                                            N/A
22.4 (EGPRS 2TS MCS-2)      17                                            N/A
27.15 (GPRS 3TS CS-1)       25                                            N/A
29.3 (EGPRS 2TS MCS-3)      23                                            N/A
32 (UTRAN)                  21.3 (estimated)                              25.4 (estimated)
35.2 (EGPRS 2TS MCS-4)      28                                            N/A
40.2 (GPRS 3TS CS-2)        35                                            N/A
44.8 (EGPRS 2TS MCS-5)      37                                            N/A
59.2 (EGPRS 2TS MCS-6)      50                                            N/A
64 (UTRAN)                  45.6 (estimated)                              51.9 (estimated)
89.6 (EGPRS 2TS MCS-7)      76                                            N/A
108.8 (EGPRS 2TS MCS-8)     89                                            N/A
118.4 (EGPRS 2TS MCS-9)     96                                            N/A
128 (UTRAN)                 114                                           N/A
384 (UTRAN)                 342                                           N/A

The required number of RLC ACK mode retransmissions is 1 for BLERs up to 1%, and 2 for BLERs up to 10%, in order to yield error-free transmission at the application layer. In general, it is recommended to set the maximum number of retransmissions in the RLC layer to 2. Although the results in Table 19 show the maximum media bit rates for different bearers, a network may be able to handle transient higher-bit-rate transmissions without causing any losses. For example, EGPRS with 2 TSs can handle a 2 kbps excess bit rate for about 71 seconds, or a 20 kbps excess bit rate for 18 seconds, without any losses [149].
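A simplified back-of-the-envelope model of the UTRAN estimates in Table 19 can be sketched as follows. This is an illustrative helper under the stated assumptions (500-byte payload, 80-byte RLC blocks with 3-byte headers, 1-byte PDCP header, 10% BLER, at most 2 retransmissions); it only approximately reproduces the table's figures, since the exact computation may differ:

```python
import math

def max_media_rate_kbps(bearer_kbps, payload_bytes=500, ip_hdr_bytes=40,
                        pdcp_hdr_bytes=1, rlc_block_bytes=80,
                        rlc_hdr_bytes=3, bler=0.10, max_retx=2):
    """Estimate the maximum media bit rate that fits a bearer once layer
    2-4 header overheads and RLC ACK-mode retransmissions are accounted for."""
    sdu = payload_bytes + ip_hdr_bytes + pdcp_hdr_bytes       # bytes entering RLC
    blocks = math.ceil(sdu / (rlc_block_bytes - rlc_hdr_bytes))  # RLC segmentation
    air_bytes = blocks * rlc_block_bytes                      # bytes on air per packet
    # average number of transmissions per block, capped at max_retx retransmissions
    retx_factor = sum(bler ** i for i in range(max_retx + 1))
    return bearer_kbps * payload_bytes / air_bytes / retx_factor
```

For a 64 kbps UTRAN bearer this model yields about 45 kbps with uncompressed headers and about 51 kbps with ROHC (`ip_hdr_bytes=4`), close to the 45.6 and 51.9 kbps estimates in Table 19.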


“There are two options: adapt or die”.

Andrew S. Grove (Intel co-founder)

Chapter 6

Mobile Media Adaptation

The objective of this chapter is to provide an in-depth analysis of one of the fundamental reasons for low QoS of multimedia applications: bandwidth. When network bandwidth availability clashes with the continuous bandwidth requirements of application media, and when network bandwidth cannot be guaranteed for the whole duration of a multimedia session, the risk is that of incurring a session with bad QoS. End users tend to form their opinion based on the device and application they use. It is not uncommon to run, for example, into a YouTube session where the playback is frozen or intermittent, or a Skype session where the media quality is not satisfactory because of insufficient bandwidth. Optimization methods are required, especially in the mobile domain, in order to make multimedia sessions work in a seamless way.

In this chapter optimization methods for mobile media adaptation will be introduced. First, the rate adaptation problem will be presented. Next, adaptation models will be classified. The chapter will end with adaptation mechanisms for mobile multimedia telephony and media streaming.

6.1. PROBLEM STATEMENT

Continuous (or pause-less) playback is the number one requirement for a successful streaming or multimedia telephony service. When the network throughput varies throughout a session, the effect on the end user's client is that of picture freezes, pauses in the audio/video playback, continuous rebufferings (i.e., re-loading from the streaming server a sufficient amount of media data to be streamed with no interruptions) and bad media quality (caused by packet losses resulting from network buffer overflows).


The term adaptive here means that an application is given a feature that enables it to adapt to varying network conditions. Examples of such variations include variations of throughput and delay, and intra/inter-operator roaming to networks with or without QoS support.

Let us define the network rate as the function RN(t) that represents the variable network rate at time t>0. Let the encoding stream rate be the function RS(t), defined as the variable output rate of a media encoder at time t>0. This general definition can easily accommodate the VBR as well as the CBR case, because the latter can be considered a special case of the former (i.e., a CBR stream is a particular VBR stream where the rate is constant). Let feedback be the function FS(t1), defined as the feedback information received at time t1>0. An example of feedback is the information received by a sender via RTCP packets. Let the stream transmission rate be the function TS(t) = f(RN(t), RS(t), FS(t1)), defined as the variable output rate of the application packetizer at time t>0. This general definition can accommodate the VBRP as well as the CBRP case, because the latter can be considered a special case of the former (i.e., a CBRP stream is a particular VBRP stream where the rate is constant).

TS(t) means that the transmitted output rate of an application depends on the stream encoding rate, on the network rate and on the feedback received. This definition very flexibly handles a wide range of cases, from the simplest ones where no adaptation is present up to cases where sophisticated adaptation logic is implemented. An example of the former is when RTCP feedback is not received or not used at all. An example of the latter is when a streaming server calculates the transmission rate based on feedback information, on multiple-rate media streams and on network rate prediction. Another example is given by the fast buffer filling techniques explained in section 5.5.4.

The feedback information can be related to a time t1<t (past feedback data) or to a time t1>=t (predicted feedback data). The first case applies, for example, to feedback reported after the occurrence of an event (e.g., travelling through a tunnel where no radio coverage is present). The second case applies when the occurrence of an event can be predicted in advance (e.g., by using geo-predictive techniques (see section 6.5.6) the tunnel occurrence can be predicted well in advance).

The hardest problem in the media adaptation space is that the function RN(t) is generally unknown. Therefore, in recent years a lot of research effort has been devoted to finding ways to estimate or predict the character of this function. Examples of rate estimation techniques developed in the literature are available in [93, 225]. The simplest way to approximate the RN(t) function is by considering it constant. However, this applies only to guaranteed bit rate networks, where the rate function is known. In the most general case of a best-effort network, the network rate function is variable and unknown.

In general terms, the rate adaptation problem can be defined as the problem of finding a function TS(t) such that

0 ≤ RN(t) − TS(t) ≤ ε, ∀ t > 0. (33)


The expression (33) indicates that the distance between the network rate and the transmitted rate must stay within a given limit. In other words, the transmission rate should adapt to the network rate variation and "follow" it. The value of ε is important because it determines the efficiency of a rate adaptation algorithm: if ε is too large, the network bandwidth is used in a sub-optimal way, and the transmission rate does not adapt well to the network rate. The same expression also indicates that the transmitted rate function should always stay "under" the network rate function. This is obvious because, if this were not the case, the transmitted rate could exceed the network rate and packet losses might occur, leading to bad QoS. Even though the transmission rate may occasionally exceed the network rate without negative consequences on the QoS (as mentioned in section 5.5.7), in this chapter the stricter assumption will be made. It has to be noted that these functions and this expression must also take into account the lower layer protocol headers. Figure 15 depicts a scenario for the rate adaptation problem.

Figure 15. The rate adaptation problem
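A minimal selection rule satisfying the constraint in (33) can be sketched in Python. This is an illustrative sketch, not a method from the thesis: it assumes a discrete set of available stream rates and a known estimate of RN(t), and it reports whether the ε bound is met:

```python
def choose_transmission_rate(rn_kbps, eps_kbps, stream_rates_kbps):
    """Pick the highest available stream rate not exceeding the estimated
    network rate RN, and report whether 0 <= RN - TS <= eps holds."""
    fitting = sorted(r for r in stream_rates_kbps if r <= rn_kbps)
    if not fitting:
        return None, False      # no stream fits: thinning or pausing is needed
    ts = fitting[-1]
    return ts, (rn_kbps - ts) <= eps_kbps
```

With streams at 64, 128 and 256 kbps and RN = 200 kbps, TS = 128 kbps is chosen; an ε of 50 kbps is then violated (72 kbps of bandwidth stays unused), while ε = 80 kbps is met. This mirrors the point that ε measures how efficiently the transmission rate tracks the network rate.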

6.1.1. Bit rate evolution plots and the STRP model

The discussion in this section will focus on streaming applications, but it could be extended to the multimedia telephony case with minor reconsideration. The mobile network can be modeled as a VBR bottleneck link, as described in Figure 1 of Publication [P7]. The encoder generates a VBR media stream. The transmission rate of the server is adapted to the network bandwidth. This might be a constant bit rate, in the case of guaranteed bit rate networks, or, in the more general case, a variable bit rate. When the network rate is lower than the server transmission rate, data accumulates in the network buffer. If the network rate is higher than the transmission rate, the network buffer empties.

Transmission over the network introduces a certain delay jitter. The streaming client is able to withstand some variations in the received throughput as it uses a jitter buffer. The buffer is built up from a short initial buffering at the beginning of the session, when the client receives the media data but delays playing it out for a certain period of time. Therefore, during periods when the received throughput drops, the client is able to play data accumulated in its buffer. However, since the set-up time of the session has to be minimized, the buffer typically holds only a few seconds of data. The client will thus run out of data, and the playback will be interrupted, if the rate cannot be precisely controlled and/or if consecutive handovers occur because of user mobility. In addition, the client buffer can compensate for the accumulated difference between the encoding rate and the transmission rate [43].

The rate adaptation problem can also be modeled with reference to the bit rate evolution plots and the STRP model (i.e., Sampling (or encoding) curve, Transmission curve, Reception curve, Playout curve) described in Publication [P7]. Figure 1 of Publication [P7] indicates the points where the different curves can be observed in a simplified streaming model. Figure 2 of Publication [P7] shows an example of a bit rate evolution plot. The horizontal axis in the graph denotes the time in seconds. The vertical axis denotes the cumulative amount of data in bits. The sampling curve S(t) indicates the progress of data generation if the media encoder were run in real time on a media stream. On a chart, two sampling curves with different slopes indicate two media streams encoded at different bit rates [175]. The transmission curve T(t) shows the cumulative amount of data sent out by the server for a media stream at a given time; it corresponds to the function TS(t). The reception curve R(t) shows the cumulative amount of data received in the client buffer at a given time. R is an approximation of RN(t) in the sense that it shows the effect of the network rate on the curve T (with possible delays or losses), but it does not show the real available network rate at any given time. The playout curve P(t) shows the cumulative amount of data the decoder has processed from the client buffer by a given time, ready to be played out. This curve is the counterpart of the sampling curve, and is actually a time-shifted version of it.

The distance between two curves at a given time shows the amount of data between two observation points in the system. For example, the distance between the transmission and reception curves corresponds to the amount of data in the network buffer, and the distance between the reception and playout curves corresponds to the amount of data in the client buffer (see Figure 2 of Publication [P7]). The rate adaptation problem can then be reduced to a curve control problem. Curve control here means constraining the distance between two curves by some limits (e.g., a maximum amount of data, or a maximum delay). This problem is equivalent to controlling two buffers: the network buffer and the client buffer (see Figure 1 of Publication [P7]).
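The curve-distance relations above can be written down directly. In this illustrative sketch (names are assumptions), the curves are callables returning cumulative bits by time t:

```python
def buffer_occupancy(T, R, P, t):
    """Given the cumulative transmission, reception and playout curves
    (bits by time t), the inter-curve distances are the buffer occupancies."""
    network_buffer_bits = T(t) - R(t)   # data queued or in flight in the network
    client_buffer_bits = R(t) - P(t)    # data waiting in the client buffer
    return network_buffer_bits, client_buffer_bits

# Example: a constant 100 kbps flow with 1 s of network delay and 2 s of
# client buffering, modeled as time-shifted linear curves.
T = lambda t: 100_000 * t
R = lambda t: 100_000 * max(t - 1, 0)
P = lambda t: 100_000 * max(t - 3, 0)
```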

In case bit stream switching or another rate adaptation action is foreseen, the server-signaled pre-decoder buffer parameters (see section 6.5.2) are to be interpreted as the limits to which the server will constrain the difference between its sampling curve and transmission curve during the session (S-T curve control). In addition, the server attempts to perform T-R curve control in order to limit the packet delays (i.e., limit the jitter buffering at the client). The variable bit rate character of the network, and therefore the variable packet delays, create the need for transmission curve adaptation. A server must also ensure that the T-R difference does not produce packet losses because of network buffer overflow (i.e., because the distance between the two curves exceeds a certain constraint).

6.2. ADAPTATION MODELS

The models for adaptation can be divided into architecture-based models and time-based models. The first type is defined based on the entity that drives the media adaptation mechanism and the related signaling between server and client. The second type is based on the time instant when the media adaptation takes place. The discussion in this section applies to multimedia streaming applications, but it can easily be generalized also to multimedia telephony applications.

6.2.1. Architecture-based adaptation models

Depending on the entity that drives the decisions and actions, the schemes can be [P10]:

Server-driven: the decisions and actions about when and how to operate the rate adaptation, as well as the client buffer, are under the control of the streaming server. In this case, the client's task is to periodically report some useful information to the server. An example of a server-driven rate adaptation signaling mechanism is the one defined for PSS, described in [37]; its performance evaluation is available in Publications [P6, P7].

Client-driven: the decisions above are under the streaming client's control. An example of client-driven rate adaptation is defined for HTTP progressive streaming and described in [37].

Network-driven: the decisions about when and how to adapt are assisted or driven by the network. An example of network-driven rate adaptation is available in Publication [P8].

Co-operative: there is a clear responsibility split between the server and client decisions and actions [166]. Examples of co-operative rate adaptation are available in [164, 165].

6.2.2. Time-based adaptation models

Since a streaming system is essentially a real-time system, the time instant when rate adaptation is performed is critical for the whole application user experience. Therefore, rate adaptation schemes can also be classified depending on when the rate adaptation operation is performed. There are reactive schemes, pro-active schemes and predictive schemes; their description is available in Publication [P10].


6.2.3. Responsibility split in rate adaptation management

One important aspect that must be considered for streaming applications is the responsibility split between server and client in order to handle rate adaptation efficiently (especially, but not only, in the case of a co-operative model). In practice, this means finding a precise separation of server and client roles in the management of the STRP curves and the related rate adaptation operations [166].

As a general principle, a given curve must be managed by one and only one entity (client or server), and whenever a need arises for multiple entities to manage a curve, the co-operation behavior must be clearly defined.

The sampling curve S(t) should be left completely under the server control for these reasons:

Only the server knows about the exact characteristics of each bit stream (e.g., switching positions, priority of frames, frame sizes, average and maximum instantaneous bit rates);

The server is able to look into the “future” of the bit stream (pro-active role);

In a multi-rate stream environment, there may not be a stream rate that matches the network rate, so the server might want to add some intelligence (e.g., thinning, or switching up and down between stream rates) in order to fit the stream rate to the network rate.

The sampling curve should not be controlled by the client, because the client does not know how much of its buffer level increase/decrease is due to variation of the bit rate within the given stream, and does not know about the accumulation of the difference between the stream average rate and the transmission rate. The sampling curve can be under the client's control only if the client is made aware of the above information.

The transmission curve T(t) should be left completely under the server control for these reasons:

In the general case, it is only the server that can measure the amount of data “on the way” using RTCP receiver reports;

There might be a need to (re-)couple the transmission and sampling curves, if the latter has limited flexibility (i.e., a limited range of bit rates).

The real-time constraints management functionality should also be left under the server control. The server should maintain these real-time constraints by adapting its sampling curve to the transmission curve:

The adaptation of S(t) to T(t) guarantees that, with adequate buffering, the client is able to play media with correct timing and without interruptions;

At every time instant t, the sampling curve S(t) should not deviate by too large an amount of bytes from the transmission curve T(t).

Without the server maintaining the real-time constraints, only a completely client-driven scheme could be used, where the client issues specific commands for controlling S(t) and T(t). This may result in a sub-optimal scheme, if the client does not have any information about the bit streams.

The reception curve R(t) is under the network control.


The playout curve P(t) should be left under the streaming client control. The buffer management functionality should be completely under the client control, as the client is responsible for providing the necessary buffering to "follow" the server. The client should then handle:

the jitter buffer, for management of any transfer delay variation between the transmission curve and the reception curve, i.e., helping the server in managing the |T(t) – R(t)| control;

any mismatch of the sampling and playout curves (e.g., mobile station clock drift, or playback slowdown due to operating system problems or excessive load in the mobile station);

the pre-decoder buffering, for |S(t) – T(t)| management (see section 6.5.2).

The responsibilities of rate adaptation between the server and the client are, therefore, clearly divided as follows. The server is responsible for:

Adaptation of the sampling rate to the transmission rate, keeping it within the rate adaptation operating range;

Adaptation of the transmission rate to the reception rate (i.e., congestion control).

When trying to perform the adaptation, the server is limited by:

Modification of the sampling curve: depending on the rate adaptation capability of the server. For example, if the server implements stream switching and if the server is transmitting at its lowest (or highest) stream rate, it would not be able to further decrease (or increase) the sampling rate;

Modification of the transmission curve: the transmission curve is constrained by the reception curve, and thus the server may not be able to increase the transmission rate. It can increase it only if it was not previously using the total available network bandwidth. For example, a server may be using the TFRC mechanism [102] (or receiving explicit network bandwidth information via client signaling) to compute its allowable transmission rate and, as a consequence, not increase its rate above what TFRC (or the actual signaled bandwidth) tolerates.

The client is responsible for:

setting the parameters of the server rate adaptation operating range;

compensating for the packet delay variation (i.e., network jitter).

The key to maintaining uninterrupted playout is the efficient management of the client buffer level. This can be accomplished by having at least implicit or estimated control over both the playout curve and the reception curve at the client. The streaming client by definition knows and controls the decoding/playout timeline. If the client is enabled to have control of the playout curve and its relation to the sampling curve, it will have control of its buffer level (client-driven approach).

The client should choose the rate adaptation parameters considering its absolute buffering limitations. In a server-driven approach, it should be up to the server to choose how to adapt its encoding rate and/or transmission rate when responding to the client feedback. Either the transmission curve, the sampling curve, or a combination of both can be adapted.

In a co-operative approach, the streaming client should be able to instruct the server to send the packets earlier or later than their sampling time. This scheme is in contrast:

with a purely server-driven S(t) control approach, where the server estimates what the client buffer level should be and how to shape the sampling curve accordingly;

with a purely client-driven S(t) control approach, where the client dictates what the sampling rate should be at any given time instant, for example by sending bit stream switch commands.

6.3. BASIC END-TO-END SIGNALING SUPPORT

6.3.1. Application awareness of network QoS

It has been mentioned that finding the exact network rate is hard, especially if the network is best effort. However, there are methods in the literature to estimate the available network rate; one way is network probing [93]. The situation is better for QoS-guaranteed networks, where the GBR and MBR values are exactly what the application is looking for. There may be a difference between the GBR/MBR values requested by an application and the negotiated (i.e., granted by the network) GBR/MBR values, because the requested values might not be supported by the network (or for other reasons). Therefore, it is very important for an application to be aware of the QoS attributes granted by the network, so that the right assumptions on the network rates can be made. In [77, 173] an application mechanism to signal the negotiated GBR, MBR and delay end-to-end in multimedia telephony applications is described. The PSS and MTSI standards [34, 37] include this type of signaling.

6.3.2. RTCP

A very important protocol that is used to convey feedback information between the parties of an on-going session is the Real-time Transport Control Protocol (RTCP), which is part of the RTP protocol specification [199]. RTCP allows monitoring the data delivery or, in other words, the QoS. The information that RTCP packets include for receivers (Receiver Reports (RR)) is the following: fraction of packets lost since the last report, cumulative number of packets lost, highest packet sequence number received (HSN) from a given source, interarrival jitter, last sender report timestamp, delay since the reception of the last sender report from a given source up to the sending time of the current report packet.

In addition to the above, senders can also report the following information in their Sender Reports (SR): the NTP timestamp indicating the wall-clock time when the report packet was sent, the RTP timestamp used for media synchronization, the total number of packets transmitted up to the time the report packet was sent, and the total number of bytes transmitted up to the same time. The latter is used to calculate the average payload data rate.
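The rate calculation from two consecutive sender reports is straightforward; a hedged sketch (the dictionary-based report representation is an assumption for illustration):

```python
def avg_payload_rate_kbps(prev_sr, cur_sr):
    """Average payload data rate between two RTCP sender reports, computed
    from their wall-clock timestamps (seconds) and cumulative byte counters."""
    dt_s = cur_sr["ntp_s"] - prev_sr["ntp_s"]
    dbytes = cur_sr["bytes_sent"] - prev_sr["bytes_sent"]
    return dbytes * 8 / dt_s / 1000.0
```

For example, 80,000 payload bytes reported over a 10-second interval correspond to 64 kbps.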

The RTCP reporting frequency in PSS and MTSI is quite flexible, and there are mechanisms to allow fast feedback [34, 37, 61, 168, 178] whenever this is required. For MTSI speech traffic, RTCP can be turned off if necessary. Usage of packets with only the required RTCP fields is recommended. As an alternative, it is possible to use shorter packets, called non-compound or (stacked) semi-compound RTCP packets [34], in order to minimize the impact of the RTCP traffic on the RTP traffic (especially for speech-only sessions).

6.4. MEDIA ADAPTATION FOR MOBILE MULTIMEDIA TELEPHONY

In this section several solutions for mobile multimedia telephony media adaptation will be introduced using an architecture-based classification.

6.4.1. Sender-driven adaptation

TCP Friendly Rate Control (TFRC) [102] is an equation-based congestion control algorithm implemented at the sender side, which competes fairly for bandwidth with other TCP flows. TFRC uses knowledge at the sender side to calculate the new transmission rate based on the average transmitted packet payload size, the RTT and the receiver's loss rate. This congestion control mechanism is primarily designed for TCP applications that use fixed packet sizes and vary their sending rate in packets per second in response to congestion. One of the drawbacks of TFRC is that it responds slowly to changes in the available bandwidth [102], for example after handovers [145]. Another drawback is that TFRC cannot distinguish between congestion losses and wireless losses [105], degrading the performance of real-time multimedia applications. Schemes for differentiating congestion and wireless losses are described in [62].
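The equation at the core of TFRC is the TCP throughput equation of RFC 5348. A sketch of it, under the common simplifications t_RTO = 4·RTT and one packet acknowledged per ACK (b = 1):

```python
import math

def tfrc_rate_bps(segment_bytes, rtt_s, loss_event_rate):
    """TCP throughput equation used by TFRC (RFC 5348), with t_RTO = 4*RTT
    and b = 1: average allowed sending rate in bits per second."""
    p = loss_event_rate
    t_rto = 4 * rtt_s
    denom = (rtt_s * math.sqrt(2 * p / 3)
             + t_rto * 3 * math.sqrt(3 * p / 8) * p * (1 + 32 * p * p))
    return segment_bytes * 8 / denom
```

For instance, 500-byte segments with a 100 ms RTT and a 1% loss event rate yield a rate in the region of 450 kbps; note how strongly the allowed rate depends on p, which is why misclassified wireless losses hurt TFRC performance.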

The simplest adaptation strategy that an RTP-based endpoint can adopt is to make use of the plain RTCP feedback [199]. The fraction and the cumulative number of lost packets allow a sender endpoint to understand that the encoding/transmission rate is likely too high and should be lowered. If this is the case, the sender could change the encoding parameters and react accordingly. Knowledge of the network buffer size (i.e., the difference between the transmission curve and the reception curve) and monitoring of the inter-arrival jitter are beneficial in order to get an idea of the data accumulation in the network, and to predict packet loss situations. For this purpose, it is possible to estimate the amount of bytes "on the way". A sender endpoint, by making use of the HSN, the Sequence Number (SN) of the next RTP packet to be sent, RTT information and a table of the recently transmitted packets, is able to estimate the amount of bytes travelling towards the receiver and, therefore, to calculate whether the network buffer is growing (i.e., there is a congestion situation or a mobility event) or not.
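The core of this bytes-on-the-way estimate can be sketched as follows (an illustrative helper; the sequence-number log representation is an assumption, and sequence-number wrap-around is ignored for brevity):

```python
def bytes_in_flight(sent_packets, next_sn, hsn):
    """Estimate the bytes still 'on the way': the total size of all packets
    already sent whose sequence numbers are above the highest sequence
    number received by the peer (HSN), as reported via RTCP."""
    return sum(size for sn, size in sent_packets if hsn < sn < next_sn)
```

For example, if packets 1-4 of 500, 500, 400 and 300 bytes have been sent (next SN is 5) and the last RTCP report carries HSN = 2, about 700 bytes are estimated to be in flight; a growing estimate across reports signals congestion or a mobility event.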


Examples of sender-driven adaptation methods for speech traffic are given in [51, 67, 156]. In [51] the authors use RTCP: the sender terminal changes the encoding rate from 8 kbps up to 32 kbps in steps of 8 kbps, using variable-size RTP packets sent at a constant rate of one packet every 125 ms. The work described in [67] also uses RTCP. The sender reacts only upon reception of three consecutive RTCP reports indicating a packet loss situation, that is, after about 15 s. In [156] the authors describe an algorithm that uses adaptive Reed-Solomon FEC to balance the effects of congestion and packet losses.

6.4.2. Receiver-driven adaptation

The Temporary Maximum Media stream Bit rate Request and Notification (TMMBR/TMMBN) mechanism [218] allows a receiver endpoint to request that a sender limit the maximum transmission bit rate of a stream to a given value or lower.

In Publication [P9], the TMMBR signalling method is extended with new functionality: with TMMBR-A (Network-Assisted TMMBR), the network notifies the sender and the receiver of the uplink and downlink bearer bit rates, respectively. With plain TMMBR, the sender becomes aware of the receiver downlink capacity, but this information arrives at the sender delayed by roughly one one-way delay from the receiver. Moreover, the receiver downlink may not be the constraining link; the sender uplink might be the bottleneck between sender and receiver. Therefore, the sender also receives information about its own uplink rate. This method could be classified among the network-assisted adaptation methods (see section 6.4.3), but since it is developed on top of TMMBR, which is a receiver-driven adaptation scheme, TMMBR-A has been included in this section.

In TMMBR-B (see Publication [P9]), the network notifies the receiver of the downlink rate. As before, the sender is informed of the current downlink capacity by the receiver; however, the sender is not aware of its own uplink rate. Hence, the TMMBR messages from the receiver are treated as an upper bound for the current sender encoding/transmission rate, and the bit rate signalled in the TMMBR message is never exceeded. This technique is similar to that standardized in MTSI [34] for video adaptation.

In the same Publication [P9], and for comparison, a third scheme is defined (TMMBR-U, Unassisted TMMBR), where the network assists neither the sender nor the receiver. The receiver sends the new maximum rate request to the sender using TMMBR, based on the average inter-arrival time of the RTP packets received between two RTCP RRs.
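A minimal sketch of the unassisted receiver logic, under the assumption that the requested rate is simply derived from the average inter-arrival time and packet size observed between two RRs (the exact estimator of [P9] may differ):

```python
def tmmbr_u_request(arrival_times, payload_sizes):
    """Suggested maximum bit rate (bits/s) to signal via TMMBR.

    arrival_times -- RTP packet arrival times (s) between two RTCP RRs
    payload_sizes -- corresponding payload sizes in bytes
    """
    gaps = [t2 - t1 for t1, t2 in zip(arrival_times, arrival_times[1:])]
    avg_gap = sum(gaps) / len(gaps)                      # average inter-arrival (s)
    avg_size = sum(payload_sizes) / len(payload_sizes)   # average payload (bytes)
    return 8 * avg_size / avg_gap

# Five 500-byte packets arriving every 40 ms -> about 100 kbps
rate = tmmbr_u_request([0.00, 0.04, 0.08, 0.12, 0.16], [500] * 5)
```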

Yet another question assessed in Publication [P9] was whether the rate adaptation mechanism defined in PSS (i.e., the NADU APP packet) [37] is also applicable to multimedia telephony applications, and what its performance is. Hence, a fourth method, named C-NADU (Conversational NADU), was also part of the study.

To complement the TMMBR-A, TMMBR-B, TMMBR-U and C-NADU methods, the number of bytes discarded at the receiver (because of packets dropped for arriving too early or too late) was also signalled from the receiver to the sender [177]. Performance results shown in
Publication [P9] (Figures 3 and 5) show that the highest available bandwidth utilization (ABU) (i.e., the goodput) is achieved by TMMBR-A, which reached 70%. TFRC, used as a comparison, reaches ABUs as low as 33%. The other results, in Tables 1 and 2 of Publication [P9], show that TMMBR-A performs best because of the network-assisted nature of the method. The Delta Loss Rate (DLR), i.e., the additional congestion losses caused by the rate adaptation mechanism (on top of wireless losses), in this case is as low as 0%. Whenever no network assistance is available, C-NADU is the second best method, with a competitive average goodput. The NADU signalling method defined in PSS is therefore an efficient method also for MTSI applications. Further investigations on these rate adaptation techniques were later carried out in [204].

Other methods for receiver-driven adaptation include, for example, the one for speech in MTSI [34], which is implemented with extensions of RTCP packets that allow signalling requests for redundancy, changes in packetization (i.e., the number of frames per RTP packet), or codec mode requests. In [107], a rate adaptation scheme that monitors the uplink and downlink channels is described. In addition, the authors found that receiver feedback every 200 ms helps the sender adapt more quickly. However, the authors used non-compound RTCP packets, as opposed to our experiments, where plain RTCP is used.

6.4.3. Network-driven adaptation

The idea behind network-driven adaptation is that the network actively participates in the adaptation process by providing key information to drive the adaptation. A method for dual-rate speech adaptation is described in [49]. There the authors reported that, compared to fixed-rate quality, noticeable quality degradation occurred if a random switching rate exceeded 30% (i.e., without a well-defined mode switching logic).

In Publication [P8] a network-driven method for AMR mode selection is presented. The main idea is that the AMR mode is selected by the network based on the congestion level and the radio network link quality. The congestion level monitor uses a sliding time window mechanism with weights that vary according to the age of the samples in the window. The method also takes possible out-of-order packets into account, and it uses AMR mode rate hysteresis in order to avoid too frequent mode switches. Simulation results in Table 3 of Publication [P8] show improvements in the network congestion level when the method is used, compared to when it is not. The recommended window size is 500 ms with moderate congestion (i.e., with a 0.8 kbps narrower network channel), and 250 ms with high congestion (i.e., with a 4.8 kbps narrower network channel). The maximum recovery time (i.e., the time it takes to gradually switch back to the highest AMR mode) is 1.75 seconds.
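The ingredients above (age-weighted sliding window plus hysteresis) can be illustrated with the following toy selector; the mode set, window size and loss thresholds are assumptions for illustration, not the exact algorithm of Publication [P8]:

```python
AMR_MODES = [4.75, 5.90, 7.40, 12.2]  # kbps, a subset of the AMR codec modes

class AmrModeSelector:
    def __init__(self, window=0.5, up_thresh=0.01, down_thresh=0.05):
        self.window = window              # sliding window length in seconds
        self.up_thresh = up_thresh        # switch up when loss is below this
        self.down_thresh = down_thresh    # switch down when loss is above this
        self.samples = []                 # (arrival_time, lost_flag) pairs
        self.mode = len(AMR_MODES) - 1    # start at the highest mode

    def observe(self, now, lost):
        self.samples.append((now, lost))
        self.samples = [(t, l) for t, l in self.samples
                        if now - t <= self.window]

    def select(self, now):
        # Age-based weighting: newer samples get linearly larger weights
        weights = [max(0.0, 1 - (now - t) / self.window)
                   for t, _ in self.samples]
        total = sum(weights)
        if total == 0:
            return AMR_MODES[self.mode]
        loss = sum(w for (_, l), w in zip(self.samples, weights) if l) / total
        # Hysteresis: separate thresholds for switching down and up
        if loss > self.down_thresh and self.mode > 0:
            self.mode -= 1
        elif loss < self.up_thresh and self.mode < len(AMR_MODES) - 1:
            self.mode += 1
        return AMR_MODES[self.mode]

sel = AmrModeSelector()
for i in range(10):                       # heavy loss within the window
    sel.observe(i * 0.05, lost=(i % 2 == 0))
mode = sel.select(0.45)                   # selector steps down one AMR mode
```

Because the mode can move only one step per decision, recovery back to the highest mode is gradual, which is consistent with the recovery-time behaviour reported above.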

In other network-driven scenarios, if the network informs the endpoints of a congestion situation, they can take early action to reduce it, without waiting for the symptoms associated with congestion (e.g., packet losses). When packets are travelling through a
network, if a router notices congestion, it can mark the IP headers of the packets with the Explicit Congestion Notification (ECN) field [189]. This signalling has the advantage of no additional header overhead compared to other mechanisms for signalling congestion [220]. It has to be noted, however, that the introduction of ECN requires changes in the network. The use of ECN in RTP-based applications can include forwarding the ECN information to the sender via RTCP [220]. ECN has been adopted in MTSI [34].
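The marking itself operates on the two ECN bits of the IP Traffic Class/TOS octet (RFC 3168); a minimal sketch of the codepoints and the router-side remarking:

```python
# ECN codepoints in the two least-significant bits of the IP TOS octet
# (RFC 3168): 00 = Not-ECT, 01/10 = ECT(1)/ECT(0), 11 = CE.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

def ecn_codepoint(tos_byte):
    return tos_byte & 0b11

def router_mark(tos_byte, congested):
    """Return the (possibly remarked) TOS byte as a congested router would.

    An ECN-capable packet (ECT) is remarked to Congestion Experienced (CE)
    instead of being dropped; non-ECT packets are left untouched.
    """
    if congested and ecn_codepoint(tos_byte) in (ECT_0, ECT_1):
        return (tos_byte & ~0b11) | CE
    return tos_byte

marked = router_mark(0b00000010, congested=True)  # ECT(0) becomes CE
```

An RTP receiver that observes CE-marked packets can then echo this to the sender in RTCP, so the sender reduces its rate before any packet is actually lost.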

6.5. MEDIA ADAPTATION FOR MOBILE STREAMING

Some of the rate adaptation methods for multimedia telephony are also applicable to mobile streaming. Those methods will therefore not be repeated in this section. Instead, signalling and mechanisms optimized for mobile streaming are analyzed in the following.

6.5.1. Server-driven adaptation

As mentioned in Publication [P7], a rate-adaptive streaming server implementation must adapt both its encoding and transmission rates in order to keep the network and client buffers in an optimum state at each time instant (and thereby realize the curve control). Encoding and transmission rates do not necessarily have to match. Given a stream encoded at a fixed bit rate, a server may decide to transmit (parts of) that stream slower or faster, depending on the situation. For example, in periods of network outage (e.g., a handover) the available bandwidth can be close to zero, and the server may decide to slow down or stop the transmission, in order to avoid unnecessary packet losses (due to network buffer overflow) that would require packet retransmissions to be repaired; conversely, the server may decide to stream media faster when more bandwidth is available, in order to avoid a client buffer underflow and quickly bring the buffer back to a healthy level [P10].

The encoding bit rate can also be subject to rate-adaptive techniques. For example, the rate of an already compressed media stream can be further reduced by using bit rate thinning techniques (temporal scalability). For video codecs that use reference and non-reference pictures, the latter do not affect the decoding of other pictures. If the sender encoding rate needs to be lowered, some non-reference pictures (e.g., the B frames), or the whole enhancement layer made of non-reference pictures, can be dropped (i.e., not transmitted). According to [198], dropping a whole enhancement layer yields a bit rate reduction of 30-40%. However, it has to be taken into account that rate thinning also reduces the frame rate.
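Bit rate thinning can be sketched as a simple filter over the transmission schedule; the frame sizes below are illustrative and, for this sketch, B frames are assumed to be the only non-reference pictures:

```python
def thin_stream(frames, drop_non_reference):
    """frames: list of (frame_type, size_bytes); returns frames to transmit.

    Dropping non-reference pictures (here, the B frames) does not break the
    decoding of the remaining pictures, but it lowers the frame rate.
    """
    if not drop_non_reference:
        return frames
    return [(t, s) for t, s in frames if t != "B"]

# Illustrative GOP with I, P and (non-reference) B frames
gop = [("I", 12000), ("B", 2500), ("B", 2400),
       ("P", 6000), ("B", 2600), ("P", 5800)]
full_bytes = sum(s for _, s in gop)
thinned_bytes = sum(s for _, s in thin_stream(gop, True))
reduction = 1 - thinned_bytes / full_bytes  # fraction of bits saved
```

With these example sizes the saving is roughly a quarter of the bits, in the same ballpark as the 30-40% reported in [198] for dropping a whole enhancement layer.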

A rate-adaptive technique based on priority scheduling is presented in [138]. Here, the most important frames (I frames) are sent in preference to other frames, so that a certain number of I frames is always stored in the client buffer; in case of a prolonged handover, a slideshow is shown instead of a rebuffering event.
Multi-rate video encoding and bit-stream switching [37] are commonly used techniques for reducing the amount of media bits to send over the air. As described in Publication [P6], a streaming server can keep a set of media streams of the same content encoded at different bit rates, and it may perform seamless switching between the different streams in adverse radio conditions. The server needs to switch at positions that avoid artifacts in the decoded stream (for example at I frames, as done in Publication [P6], or at SP frames [139] or other types of switching frames [106, 210]). With live media, it is possible to change the encoding parameters and use a different (e.g., lower) encoding rate on-the-fly.

The 3GPP PSS and DLNA3 specifications [37, 92] introduce a feature (initially proposed in [95] and derived from Publications [P6, P7]) whereby the receiver signals its playout buffer status information to the server; used together with the regular feedback offered by the RTCP protocol, this gives the server the information it needs in order to optimally choose both the transmission rate and the media encoding rate. The server can thus attempt to maintain both the network buffer and the client buffer in an optimum state. There are two types of information signalled by the client (see Publications [P6, P7]): static buffer information and periodic buffer information. The static buffer information is signalled via RTSP at the beginning of the streaming session, and it includes:

- the size of the client buffer (in bytes) allocated to a particular media, and

- the target buffer level (in milliseconds) the client wishes the server to keep.

The buffer size corresponds to the size of the reception, de-jittering (and de-interleaving, if used) buffer, and it includes any pre-decoder buffer space used by the client for that media. The target buffer level is determined by the client, which has the best knowledge about the mobile network characteristics. This parameter represents an adequate protection level against network-level interruptions (e.g., handovers), inter-arrival packet jitter, and other factors that may prevent pause-less playback. In other words, the target buffer level is the maximum time margin that the server can utilize to perform its rate adaptation operations (e.g., change the transmission rate and/or the content rate). If the server is able to keep the client buffer at the requested target level all the time, this gives a nearly 100% guarantee that the playback session will be pause-less (Publication [P6]).

Since network conditions are dynamic, a different protection level may be required at different points in time. For example, a handover from a UTRAN to a GPRS network would suggest to the client that future cell reselections within the GPRS network will produce longer breaks than UTRAN handovers. For this purpose, the client may decide to update its target buffer level during the session and signal a higher value. However, the buffer size cannot be modified during the lifetime of a streaming session.

3 DLNA includes consumer electronics, computer and mobile device manufacturer companies that defined a standard that allows multimedia applications (e.g., streaming) to interoperate over a variety of devices.


The periodic (dynamic) buffer information is signalled by the client via RTCP NADU APP packets. It includes:

- the Next Sequence Number (NSN), which is the RTP SN of the next Application Data Unit (ADU) to be decoded in the sequence of ADUs to be played out from the buffer. This information is called OBSN in Publication [P6], but has the same semantics;

- the Next Unit Number (NUN), which is the unit number (within the RTP packet) of the next ADU to be decoded. This is useful for interleaved media packetization. For audio codecs, an ADU is defined as an audio frame. For H.264, an ADU is a NAL unit. For H.263 and MPEG-4 Visual Simple Profile, each packet carries a single ADU, so in these cases the NUN field is set to zero;

- the playout delay (in milliseconds), which is the difference between the scheduled playout time of the next ADU to be decoded (the NSN packet) and the sending time of the NADU APP packet, as measured by the media playout clock;

- the Free Buffer Space (FBS), which is the amount of free buffer space (in 64-byte blocks) available at the PSS client at the moment of reporting.

The NSN and NUN fields allow the server to estimate the client buffer level, whereas the playout delay and the FBS additionally allow a more precise estimate of how much playout time the receiver has left (i.e., the client buffer underflow point). The buffer size and the estimated client buffer level allow the server to avoid overflowing or underflowing the client buffer. In case a stream switch is required, the server has the necessary information to perform the switch seamlessly. The server keeps a table for each transmitted packet containing its SN, timestamp and size. This information can be deleted once the packet has been played out [43]. For a deeper analysis on how to calculate the adaptation information (packet playback times or buffer underflow points) for different scenarios (with or without interleaving and re-ordering), the reader may refer to [37].
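The server-side bookkeeping can be sketched as follows. This is a hypothetical simplification: packets still in flight are counted as if already buffered at the client, and interleaving/re-ordering (the NUN field) is ignored:

```python
class ClientBufferEstimator:
    """Server-side estimate of the client buffer from NADU feedback."""

    def __init__(self, client_buffer_size):
        self.size = client_buffer_size
        self.table = {}  # sn -> (timestamp, size) per transmitted packet

    def on_send(self, sn, timestamp, size):
        self.table[sn] = (timestamp, size)

    def on_nadu(self, nsn, playout_delay_ms, free_buffer_blocks):
        # Packets before the NSN have been played out and can be forgotten
        for sn in [s for s in self.table if s < nsn]:
            del self.table[sn]
        # Remaining entries approximate the bytes buffered at the client
        buffered_bytes = sum(size for _, size in self.table.values())
        free_bytes = free_buffer_blocks * 64  # FBS is in 64-byte blocks
        return buffered_bytes, free_bytes, playout_delay_ms

est = ClientBufferEstimator(client_buffer_size=65536)
for sn in range(100, 110):
    est.on_send(sn, timestamp=sn * 3600, size=700)
buffered, free, margin = est.on_nadu(nsn=104, playout_delay_ms=2500,
                                     free_buffer_blocks=900)
```

Here `buffered` bounds how much the server may still push before risking an overflow (checked against `free`), while `margin` is the playout time left before an underflow.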

In Publications [P6, P7] performance results of the described signalling method are shown. The basic idea for client underflow prevention is simple: if the buffer level in the time dimension decreases, the server switches down to a lower content rate. Decreasing the content rate (while keeping the same transmission rate) allows the server to send packets earlier and increase the client buffer level faster (in the time dimension). The results are for streaming over EGPRS with two MCS-7 time slots (89.6 kbps) in the downlink direction. Handovers of different durations (up to 4 seconds) were also simulated. The network was of the best-effort type, and the bearer was shared with two other users performing Web browsing. Three streams of the same content were encoded at different bit rates and used for stream switching. Figure 3 in Publication [P6] shows how the transmission rate is adapted to the reception rate, as well as the stream switches during the session. Figures 4 and 5 in Publication [P6] show the client buffer level in the time and space dimensions. From the figures it can be clearly seen that, when using a target buffer level of 12 seconds, the streaming client never experiences a buffer underflow or overflow, even in the presence of network rate variations or multiple handovers. Figure 5 of Publication [P7] shows the client
buffer level in the time dimension (in case no RTCP APP NADU signalling is used) and the fact that it suffers from buffer underflows because of the variable network bandwidth and the handovers. Subsequent research [104] confirmed the performance of the adaptive scheme for PSS.

6.5.2. Buffering aspects

According to Publication [P7], in order to realize the adaptation efficiently, the streaming server must have a clear picture of the streaming client buffer. The server, at any time during a session, must keep the client buffer filled up to a certain safety level, and ensure that the buffer neither overflows nor underflows, even in adverse radio conditions or whenever roaming or handovers occur.

Different receiver buffer models are possible. For instance, the buffer can serve various purposes, and it can be made of multiple physical parts. Some of the functions covered by the receiver buffer are de-jittering, de-interleaving, pre-decoding and post-decoding. In some implementations, the de-jitter buffer could be implemented in software, whereas the pre-decoder and post-decoder buffers could be implemented in hardware. In this thesis, without loss of generality, a single logical multi-function buffer model will be considered, with integrated de-jittering, de-interleaving, pre-decoder and post-decoder functions (see Figure 16), which is applicable to the PSS and DLNA streaming standards, whose buffer models have purposely been aligned. A buffer model with non-integrated functions is also described in the DLNA specifications [92].

With reference to Figure 16, some definitions of the different buffer parts and their functions are given in the following:

Figure 16. A mobile media streaming buffer model. [The figure shows the buffer pipeline from empty to full, delimited by the NSN and HSN markers, comprising the de-jitter and de-interleaving buffer, the pre-decoder buffer (bytes, signalled via SDP/RTSP), the decoding delay, and the post-decoder buffer (time); the client buffer size (bytes) and the target buffer level (time) are signalled via RTSP, the free buffer space (bytes) via RTCP; the initial buffering (time) is also indicated.]

Client buffer: the total buffer space used to store data received from the streaming server before decoding;


De-jitter and de-interleaving buffer: the client buffer space (excluding the pre-decoder buffer) that is used to store data for de-jittering and de-interleaving. This includes RTP header and payload, and is used to handle packet delay variation (i.e., the difference between the transmission curve and the reception curve) and de-interleaving;

Pre-decoder buffer: the hypothetical pre-decoder buffer that is used to indicate how much buffer space is required for streaming a certain media. It stores raw compressed data before decoding;

Post-decoder buffer: the buffer space used to store decompressed data before rendering;

Receiver buffer: the total buffer space, including the client buffer and the post-decoder buffer.

The pre-decoder buffer can be defined as the difference between the playout curve and the reception curve. The hypothetical pre-decoder buffer refers to the pre-decoder buffer as defined in the hypothetical buffering model [37]. This model assumes a zero-delay network, and therefore that the transmission curve is equal to the reception curve. It also assumes that the playout curve follows this model, which means that the sampling curve is assumed to be equal to the playout curve (apart from a shift equal to the initial pre-decoder buffering period [37, 43]). In this model, the hypothetical pre-decoder buffer can be traced at streaming time by the server as the difference between the sampling curve and the transmission curve, and this difference has to fit into the buffer defined by the pre-decoder buffering parameters (the initial pre-decoder buffering time and the pre-decoder buffer size). This requirement must hold regardless of whether a predetermined or an adapted transmission schedule is used; that is, rate adaptation must be transparent to this requirement [43]. If there is no stream switching or other rate adaptation action, the hypothetical pre-decoder buffer parameters are linked to the media stream and its transmission schedule; these parameters can be calculated from the media stream (or used as constraints at encoding time). If stream switching or other rate adaptation actions are used, the signalled pre-decoder buffer parameters should be ignored, as the buffer model is the one defined in section 6.5.1.
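The buffer-fitting requirement can be sketched as a check over the two curves; the curves below are illustrative cumulative byte counts sampled at common time instants:

```python
def fits_predecoder_buffer(sampling_curve, transmission_curve, buffer_size):
    """Check the hypothetical pre-decoder buffer constraint.

    Each curve is a list of (time_s, cumulative_bytes) at the same instants.
    Under the zero-delay network assumption, the buffer occupancy at any
    instant is the gap between the sampling and transmission curves, and it
    must never exceed the signalled pre-decoder buffer size.
    """
    for (_, sampled), (_, transmitted) in zip(sampling_curve,
                                              transmission_curve):
        if sampled - transmitted > buffer_size:
            return False  # transmission schedule violates the buffer model
    return True

sampling = [(0, 0), (1, 20000), (2, 40000), (3, 60000)]
schedule = [(0, 0), (1, 12000), (2, 34000), (3, 58000)]
ok = fits_predecoder_buffer(sampling, schedule, buffer_size=10000)
```

Any adapted transmission schedule the server produces would have to pass the same check, which is what "rate adaptation must be transparent to this requirement" means in practice.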

It has to be noted that the initial buffering depicted in Figure 16 is different from the initial pre-decoder buffering time. The former is the minimum amount of buffering performed before playback is started for the user. In a mobile streaming implementation, this buffering time should be reasonably short, but long enough to guarantee pause-less playback until the target buffer level is reached by the server.

6.5.3. Co-operative adaptation

In a co-operative scheme, the streaming server and the client co-operate towards the same goal. In [95, 164, 165] two schemes are proposed. One mechanism is based on the concept of clock shift, and it allows the client to request the server to transmit packets faster or slower depending on the situation. By being able to shift the server transmission clock, the client can modify the real-timeness of the stream. In this co-operative model, the client requests only relative clock shifts, and it is completely up to the server to adapt by changing the sampling curve or the transmission curve (or both). A forward shift [164] increases the client
buffer level (e.g., to tolerate future handovers). This allows the client to receive packets ahead of time and play them during an outage period. A backward shift allows decreasing the transmission rate and therefore the buffer level (e.g., when the receiver wants to request the retransmission of some lost packets and some room in the bandwidth budget is needed for that purpose).

A subsequent version of the above scheme was defined in [95, 165]. Here the shift concept is based on three variables: the minimum shift, used to prevent client buffer underflow; the target shift, used to enable fast filling of the client buffer (e.g., for handovers or retransmissions); and the maximum number of send-ahead bytes, used to prevent client buffer overflow. This scheme does not require continuous client buffer status feedback, but only shift corrections based on events (e.g., whenever the network conditions change and require an update of the parameters). The method allows the server to send media within a given rate adaptation operating range, which is fully defined by the three above-mentioned parameters. Simulation results over EGPRS are shown in [167]. This scheme formed the foundation of the current 3GPP PSS rate adaptation signalling [37] described in section 6.5.1.
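A toy reading of the operating range defined by the three parameters (this is an assumption-laden sketch, not the exact scheme of [95, 165]; in particular, playback drain of the send-ahead budget is ignored here):

```python
def schedule_transmission(packets, min_shift, target_shift, max_send_ahead):
    """packets: list of (sched_time_s, size_bytes). Returns send times.

    The server tries to transmit with the target clock shift (sending early
    to fill the client buffer fast), but falls back to the minimum shift
    whenever sending that early would exceed the send-ahead byte budget
    that protects the client buffer from overflowing.
    """
    send_times, bytes_ahead = [], 0
    for sched_time, size in packets:
        if bytes_ahead + size <= max_send_ahead:
            shift = target_shift          # room in the budget: send early
            bytes_ahead += size           # (drain by playback not modelled)
        else:
            shift = min_shift             # budget exhausted: minimal advance
        send_times.append(sched_time - shift)
    return send_times

times = schedule_transmission([(10, 4000), (11, 4000), (12, 4000)],
                              min_shift=0.5, target_shift=3.0,
                              max_send_ahead=8000)
```

The first two packets fit in the 8000-byte send-ahead budget and go out three seconds early; the third drops back to the minimum shift.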

6.5.4. Client-driven adaptation

With client-driven media adaptation mechanisms, the client decides when and how to adapt (e.g., when to switch to a stream with a different bit rate). The advantage of this approach is that the client is closer to the mobile network than the server and, therefore, is generally more aware of network-related events and information (e.g., handovers to other networks). The disadvantage is that the client is unaware of the media bit stream characteristics (e.g., the available stream bit rates, the points at which switching is possible, etc.) and, without this information, the rate adaptation functionality can only yield sub-optimal performance.

One of the first commercial client-driven rate adaptation schemes is the Adaptive Stream Management (ASM) of the SureStream technology by RealNetworks [68]. This mechanism was not initially engineered for mobile networks, although it has been claimed to be quite efficient for the fixed Internet. A client-driven RTSP stream switching scheme is described in [108]. This solution introduces new RTSP methods for defining streams that are part of a switch-set, for determining which entity, client or server, has the switch control, and for sending a command from client to server to switch streams. A method similar to TMMBR, but optimized for streaming, is described in [82].

A drawback of RTP/UDP streaming is that it suffers from firewall and NAT traversal issues, compared to TCP-based progressive streaming. Recent technology developments have looked at progressive streaming over TCP to enable client-driven approaches by providing the streaming client with information about the bit stream characteristics. Examples are the proprietary solutions developed by Microsoft, Apple and Adobe. Each of these technologies also offers media adaptation functionality. Microsoft IIS Smooth Streaming [157] and the Apple adaptive bit rate streaming [179] use the HTTP protocol, whereas the Adobe Flash
Dynamic Streaming [45] uses a proprietary protocol called RTMP [44]. The Microsoft solution cuts media into short chunks and encodes them at different bit rates. The chunks are hosted by an HTTP server in MP4 files, and the client dynamically selects the chunks to be delivered, performing stream switching based on its rate adaptation logic. The Apple solution is based on the concept of a Playlist File, which is a file that the server creates and makes accessible to the client. The file contains the information about all the streams and allows the client to drive the chunk switching. The Adobe solution also uses multiple streams to allow switching.

Adaptive HTTP (progressive) streaming was also standardized by 3GPP as part of the PSS specifications [37]. One of the advantages offered by HTTP streaming is that the server does not require changes (or requires only minimal modifications), provided that the information about the encoded streams (the Media Presentation Description (MPD)) is made available to the streaming client by some means. In other words, standard HTTP servers can deliver media to clients that use adaptive HTTP streaming. The MPD includes information about how to access the content (e.g., random access points for seek operations) encoded at one or more bit rates, resolutions, languages, codecs, etc. The MPD can also be generated on-the-fly for live content, and it may provide time-shifting access [37]. In [147] a client-driven algorithm for PSS HTTP adaptive streaming is presented. The results shown are promising, but further research is needed to assess its performance in highly lossy environments, in order to understand the effects of the TCP retransmission timer (the latest one proposed by the IETF, based on recent Internet observations, is 1 second [181]) on actual mobile network deployments and the user experience.
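The core of a client-driven adaptive HTTP streaming loop can be sketched in a few lines; the representation bit rates (as if read from an MPD) and the safety factor are illustrative assumptions:

```python
# Bit rates of the available representations, as advertised in the MPD
representations_kbps = [128, 256, 512, 1024]  # sorted ascending

def pick_representation(measured_throughput_kbps, safety=0.8):
    """Pick the highest representation that fits the measured throughput.

    The throughput is typically estimated from the download time of the
    previous chunk; the safety factor leaves headroom for rate variation.
    """
    usable = measured_throughput_kbps * safety
    candidates = [r for r in representations_kbps if r <= usable]
    return candidates[-1] if candidates else representations_kbps[0]

chosen = pick_representation(700)  # 700 * 0.8 = 560 -> the 512 kbps stream
```

After each chunk download the client re-measures the throughput and repeats the selection, which is all the server-side logic that plain HTTP delivery requires.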

6.5.5. Network-driven adaptation

Similarly to what is described in section 6.4.3, network-driven adaptation is also an option for mobile streaming. For example, the work in [65] describes a solution and a signalling scheme, which is compared to pure client-driven signalling. In this case, the network sends bandwidth information to the streaming server via a proxy, which translates the network signalling into RTSP signalling.

6.5.6. Geo-Predictive adaptation

In a best-effort network, the throughput received by the client can vary over time even in the same physical location. In addition, when the user moves between locations, the probability of a constant received throughput for the whole duration of a streaming session decreases dramatically. Geo-Predictive streaming adaptation, described in Publication [P10], is a novel rate adaptation technique that makes use of user context information, location and motion data, together with past network throughput information, to predict a future network state. For instance, the authors of [222] demonstrated that the bandwidth along a path is more predictable if the location information is taken into account,
and that there is no significant correlation between the bandwidths at different points in time within a given trip. Furthermore, the authors found that the bandwidth uncertainty may be reduced considerably when observations from past trips are taken into account. They also showed the efficiency of geo-prediction on TFRC adaptation for streaming [223]. In [201] an algorithm for context-aware rate adaptation in a VANET environment is discussed.

This new way of looking at the problem radically changes the approach to rate adaptation and streaming delivery: the adaptation becomes predictive, as opposed to reactive or pro-active methods. In a reactive or pro-active system [P10], a tunnel encountered along the user's route would likely cause session discontinuity and packet losses, followed by the client-server reaction after the disruptive event. In a predictive system, it is possible to detect in advance that a tunnel lies ahead on the user's route and that it will produce a very low (or zero) network bit rate. With this information available before the disruptive event occurs, the client and the server can take the best action well in advance in order to handle the event and guarantee pause-less playback. Figure 1 in Publication [P10] shows the basic architecture of a Geo-Predictive streaming system. Note that the geo-predictive server and the streaming server may be co-located in the same physical server. However, for simplicity, they are considered here as separate logical and physical entities.

From the perspective of the 3GPP PSS standard, the server operations are minimally changed, and most of the new required signalling and features are concentrated in the streaming client and the Geo-Predictive server. See Publication [P10] for a more detailed list of their functions and responsibilities.

With reference to a 3GPP PSS system [37], three parameters play an important role in the client-to-server communication: the free buffer space, the total buffer size and the target buffer level. For example, if a tunnel is nearby and no network coverage is expected inside it, the client may temporarily expand the size of its buffer by an amount sufficient to overcome the network outage, preventing a buffer underflow and a disruption in the user experience. The new buffer size also implies a new amount of free buffer space and a new desired target buffer level. This data is communicated to the server well in advance, and the server will therefore try to keep the client buffer in a healthy state by pushing more data, so as to reach the new target buffer level as quickly as possible.
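The client-side arithmetic behind this buffer expansion can be sketched as follows; the sizing rule and all figures are illustrative assumptions, not values from Publication [P10]:

```python
def plan_for_outage(outage_duration_s, media_rate_kbps,
                    current_buffer_bytes, current_target_level_s):
    """Grow the buffer and target level to bridge a predicted outage.

    The extra bytes are what the media consumes during the outage; the new
    target level covers the outage on top of the previous safety margin.
    """
    extra_bytes = int(outage_duration_s * media_rate_kbps * 1000 / 8)
    new_buffer_size = current_buffer_bytes + extra_bytes
    new_target_level = current_target_level_s + outage_duration_s
    return new_buffer_size, new_target_level

# A predicted 20 s tunnel outage with a 136 kbps stream needs
# 340 000 extra bytes of buffer to keep playback running through it
size, target = plan_for_outage(20, 136, current_buffer_bytes=262144,
                               current_target_level_s=12)
```

The new buffer size, free buffer space and target buffer level are then signalled to the server, which pushes data faster until the new level is reached.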

Publication [P10] shows preliminary experimental results of Geo-Predictive rate adaptation with multi-rate stream switching over a simulated LTE HSPA channel for H.264 video. Two video streams at different bit rates were used. Three cases were compared: NOR (no rate adaptation), RAT (3GPP PSS Rate Adaptation Transmission with stream switching) and GPT (Geo-Predictive rate adaptation Transmission). In particular, Figure 10 of Publication [P10] shows the client buffer fullness (in seconds) for GPT and how it is handled in the proximity of a long outage. The rest of the results are summarized in Table 20. NOR is the baseline scenario and does not provide a good user experience because of rebufferings. In the worst case, the streaming client suffers a 20.2-second rebuffering, i.e., a media playback disruption during which data is fetched after a network bandwidth outage. In addition, NOR suffers from packet losses that produce bad media quality at the streaming client. RAT and GPT both avoid rebufferings and packet losses. However, GPT always offers better media quality than RAT, because a single stream is used throughout the transmission, with no need to switch streams and lower the media quality. With GPT, the media quality remains constant and the highest possible, and no playback disruptions occur. Further studies on geo-predictive streaming were later carried out in [64].

TABLE 20. PERFORMANCE RESULTS FOR NOR, RAT AND GPT

           Number of        Cumulative length      Average media      Packet loss
           rebufferings     of rebufferings (s)    bit rate (kbps)    rate (%)
           NOR  RAT  GPT    NOR   RAT   GPT        NOR  RAT  GPT      NOR  RAT  GPT
Route 1     1    0    0     14.0   0     0         136  128  136      2.3   0    0
Route 2     1    0    0     19.5   0     0         136  130  136      3.3   0    0
Route 3     1    0    0     20.2   0     0         136  130  136      3.5   0    0

6.5.7. Implications of packet retransmission on media adaptation

Through the integration of adaptive streaming and RTP retransmission, the server can make an optimal decision on whether or not to retransmit a lost packet. Since retransmission occupies a certain amount of extra bandwidth, the rate adaptation function can indicate whether this extra bandwidth is available. Furthermore, the rate adaptation module is potentially capable of “creating” the required bandwidth for retransmission if it is not available (Publication [P6]). From these considerations, it is clear that the integrated use of rate adaptation and RTP retransmission mechanisms may enhance the quality of a streaming session.
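This decision logic can be sketched as follows; the function names, priority scheme and parameters are hypothetical simplifications, not the actual algorithm of Publication [P6].

```python
def should_retransmit(packet_priority: int,
                      retransmission_cost_kbps: float,
                      link_capacity_kbps: float,
                      current_send_rate_kbps: float,
                      min_priority: int = 1) -> bool:
    """Decide whether a lost RTP packet is worth retransmitting, given
    the bandwidth headroom reported by the rate-adaptation module."""
    if packet_priority < min_priority:
        return False  # not important enough (e.g. a disposable frame)
    headroom = link_capacity_kbps - current_send_rate_kbps
    return headroom >= retransmission_cost_kbps

def make_room(current_send_rate_kbps: float,
              retransmission_cost_kbps: float,
              link_capacity_kbps: float) -> float:
    """Rate adaptation can 'create' the missing headroom by lowering
    the media rate; return how many kbps to shave off the encoder."""
    needed = current_send_rate_kbps + retransmission_cost_kbps - link_capacity_kbps
    return max(0.0, needed)
```

If `should_retransmit` fails only because of missing headroom, the server can call `make_room` to lower the media rate and then retransmit.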


Chapter 7

Mobile and Interactive Social Television

Television viewing has been part of our lives and a means of socializing for many decades. It is not purely an individual activity; it has had a social dimension since its first introduction to the market. People often prefer to watch TV programs, such as a football game or a movie, together with other people, in order to consume the content as a social experience. As stated in [195], the content very often represents a medium for social interaction between people, since it provides common interests; in situations in which people choose to watch TV together, socializing around the content might be more important than the content itself. Usually, people wishing to share the same content must be located in a relatively restricted geographical area, in order to be able to meet at home or at some other mutually convenient location. If these people, for some reason, are in different geographical locations, or do not have time to meet in a common place, then a shared watching experience is not feasible. All of these observations represent the main motivation behind the concept of Mobile and Interactive Social TV (MIST).

This chapter is about the MIST concept; the next sections cover the application paradigm, the description of two different architectures and the first user experience experiments. The chapter ends with considerations on session mobility.

7.1. FUSING DIFFERENT APPLICATION PARADIGMS

The MIST concept [P11, 69, 155] is relatively new when compared to the concept of traditional TV operating in a static (non-mobile) context (e.g., in the home). The basic elements contributing to MIST are inherited from two existing paradigms: the one related to multimedia telephony (also referred to in this chapter as multimedia conferencing), and that


related to multimedia streaming or broadcasting of TV programs. MIST fuses elements of both paradigms into the same concept and application, where content consumption and user interaction functions are both present in a mobile context. Here the term user interaction includes not only speech and video, typical of multimedia telephony applications, but also text, emoticons, customized sounds, etc. According to this view, multimedia streaming is associated with the content consumption part, whereas multimedia telephony (extended with rich interaction mechanisms) is associated with the user interaction part. In other words, in this new paradigm each application is used for a specific purpose. The MIST concept overview is shown in Figure 1 of Publication [P11].

An early experiment in fusing DVB-H and SMS interaction is reported in [196]. In this system, the SMS messages sent by the users to the service are embedded in the video stream, which is broadcast along with the users’ messages.

In MIST, the technical challenges and user needs differ from those of a traditional TV system. MIST makes it possible to use mobile devices for watching TV or video content not only with people who may be far apart, but also with those who are on the move, as if they were in the same physical room. The new ingredient, compared to typical recent mobile TV applications and systems, is that rich interaction possibilities between participants watching common content together facilitate a stronger feeling of virtual presence. Synchronized playback of TV content among participants, such that all of them watch the same event with minimal time difference, is very important. Rich interaction coupled with synchronized content playback ensures that all participants watch the same content and interact at nearly the same instant of time, and therefore have a common shared context for the viewing experience. This shared context is the key to creating a feeling of watching together.
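A minimal sketch of such synchronized playback, assuming the clients share a synchronized wall clock (e.g. via NTP): each client derives the media position it should be playing from a common session start time and measures its own drift. The class and method names are illustrative, not part of the MIST system.

```python
import time

class SyncedPlayback:
    """Derive a common playout position from a shared session start time,
    so every client renders (nearly) the same content frame.
    Assumes client clocks are synchronized, e.g. via NTP."""

    def __init__(self, session_start_wallclock: float):
        self.session_start = session_start_wallclock

    def target_position(self, now=None) -> float:
        """Media position (in seconds) every participant should be playing."""
        now = time.time() if now is None else now
        return max(0.0, now - self.session_start)

    def drift(self, local_position: float, now=None) -> float:
        """Positive drift: this client is ahead; negative: behind."""
        return local_position - self.target_position(now)

sync = SyncedPlayback(session_start_wallclock=1000.0)
# At wall-clock time 1042.5, every client should be at media position 42.5 s.
```

A client measuring a large drift could then skip ahead or slow down its playout to restore the shared context.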

7.2. INTERACTION MODALITIES

The most important features of mobile devices are location-independence (allowing people to use mobiles wherever they are) and time-independence (at any moment of the day). The MIST service should enable people to join a social TV session from any place and at any moment.

Interaction should ideally satisfy the requirement of creating a virtual presence of the participants that is as close as possible to real presence, without distracting the viewers’ attention from the shown content (at least no more than in the real-world TV viewing experience). In a MIST system, the following five modalities of interaction could mainly be considered [69]: text chat, emoticons, graphic animations, audio-only conferencing and multimedia conferencing. The real-time property of the interaction is fundamental for creating a virtual presence of people. For low-bandwidth and low-computational-complexity solutions, text chat, graphic animations and emoticons would be preferable. With higher-bandwidth networks and high-end mobile devices, audio and multimedia conferencing would be the best options, since they provide a stronger feeling of social presence to the participants.

The main goal of interaction in a social TV system is to allow participants to do what they would usually do when watching TV content together in the same real-world TV room, such as commenting on the events shown on the screen and looking at the faces and gestures of the others [69]. A multi-party multimedia conference system coupled with synchronized content streaming could provide the virtual presence of the participants in a mobile TV room.

Using real-time audio and video for interaction in such a system is challenging because the combined bandwidth required for video streaming and conferencing is high. Occasional packet losses are also quite common in such a communication environment, as shown in the previous chapters of this thesis. Therefore, media quality, bandwidth and latency are the biggest issues to tackle when designing such a system. In addition, the computational complexity of video processing is higher than for other types of data, which can be an important issue especially for mobile devices with limited computational capabilities.

7.3. CONTENT AND INTERACTION MIXING ARCHITECTURES

The interaction system can be fused with the TV content streaming in different ways. In this thesis two approaches will be considered [P11, 69]: performing the mixing at the server side or at the client side. The two approaches are depicted in Figures 2-3 of Publication [P11].

7.3.1. Centralized mixing architecture

The real-time audio and video interaction feeds may be created by sending the locally captured interaction media from multiple mobile devices to a central node (the conference server) which combines them into a single video stream that is then sent back to each of the devices.

The MIST architecture with centralized media mixing has three main entities, as illustrated in Figure 2 of Publication [P11]: the Content Provider, the Interaction Server (or multimedia conference server), and the clients. The Content Provider sends the TV content (such as a movie) to the Interaction Server. Each mobile client captures the interaction media using the embedded camera and microphone, encodes the video and audio streams and sends them to the Interaction Server. This server creates a new media stream by combining the interaction media received from every client wishing to participate in the same virtual shared space with the TV content stream received from the Content Provider. This new stream is sent to all the clients. Each mobile client receives the combined stream, decodes it and plays it out. The main advantage of this architecture is that each mobile client has to receive and decode only one video stream [69].


Community streaming with interactive visual overlays is analyzed in [211]. The authors use a single separable video stream in which one region, containing the game content, is fixed, while another region, containing interaction video or avatars, can be modified through partial transcoding of only the necessary parts of the bit stream.

7.3.2. Endpoint mixing architecture

The MIST system with endpoint mixing has an architecture similar to the centralized mixing approach when it comes to handling the interaction among participating users. In fact, the main actors are still the content provider, the interaction server and the clients. The difference lies in the way these actors are interconnected. In the endpoint mixing architecture, as illustrated in Figure 3 of Publication [P11], the TV content provider is connected to the clients and no longer to the interaction server. The content provider sends the media content to the clients. Each client sends the captured and encoded interaction media to the interaction server. This server combines the media received from each client wishing to participate in the same social TV session and creates a new interaction feed, the multi-party conference stream, which is sent to all the clients. Each client receives two multimedia streams and combines the received interaction feed and the movie before playing them out [69]. This approach requires each client to receive, decode, combine and display multiple incoming media streams simultaneously, making it a more challenging solution than the centralized mixing approach, because the mobile devices should be able to receive and decode three, four, or even five individual media streams. The media flow for the endpoint mixing architecture is depicted in Figure 17. The advantage of this architecture is that each individual client or group of clients can connect to the desired content provider and, therefore, has a wider choice of content.

Figure 17. Media data flow in the endpoint mixing architecture. (Block diagram: the TV content stream and the multimedia conference stream are each passed through Reception and separate Decoders, combined in a Mixer, and sent to Rendering.)
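The per-frame flow of Figure 17 can be sketched as a client-side loop; the decoder, mixer and renderer classes below are stand-in placeholders for real codec and display components, not an actual device API.

```python
class StubDecoder:
    """Placeholder: a real client would wrap a hardware or software codec."""
    def decode(self, packet):
        return f"frame({packet})"

class StubMixer:
    """Placeholder: overlays the interaction frames on the TV frame."""
    def overlay(self, tv_frame, conf_frames):
        return (tv_frame, tuple(conf_frames))

class StubRenderer:
    def show(self, composite):
        pass  # a real client would draw the composite picture here

def endpoint_mixing_step(tv_packet, conference_packets,
                         tv_decoder, conf_decoders, mixer, renderer):
    """One iteration of the client pipeline: decode the TV stream and
    each interaction stream separately, mix, then render."""
    tv_frame = tv_decoder.decode(tv_packet)
    # Each participant's interaction feed needs its own decoder instance.
    conf_frames = [dec.decode(pkt)
                   for dec, pkt in zip(conf_decoders, conference_packets)]
    composite = mixer.overlay(tv_frame, conf_frames)
    renderer.show(composite)
    return composite

composite = endpoint_mixing_step("tv_pkt", ["p1_pkt", "p2_pkt"],
                                 StubDecoder(), [StubDecoder(), StubDecoder()],
                                 StubMixer(), StubRenderer())
```

The list of conference decoders makes explicit why this architecture is heavier on the client than centralized mixing: one decoder instance must run per incoming stream.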


An analysis of the CPU load for the Nokia S60 clients in the endpoint mixing architecture is presented in [69]. During the analyzed time interval, the mobile device was performing the following operations simultaneously: capturing the live media (video and audio from the embedded front-camera and microphone), encoding the video into H.264 and the audio into AMR, sending the interaction media to the interaction server, receiving and decoding the conference media (H.264 video and AMR audio), decoding a local movie (H.263-encoded), mixing the decoded pictures of the two videos and rendering the resulting combined video.

It was observed that the CPU load caused by the MIST client application accounted for most of the resource usage [69]. The media decoding and playback operations in the downlink direction consumed approximately 55% of the total CPU capacity, whereas the interaction media flow in the uplink direction caused a load of approximately 25%. The decoding of the local H.263 video file did not appear in the CPU load because it was handled by the hardware-accelerated H.263 decoder.

7.4. PROOF-OF-CONCEPT SYSTEM

In Publication [P11], a MIST proof-of-concept system is presented. The MIST system allows users to invite other mobile users and create a Virtual Shared Space (VSS) as a shared context. In this space, the users can talk to and see each other while simultaneously watching TV/video content on their mobiles. The virtual shared space is created by synchronizing the interactions and content playback between users. The system allows rich interaction modalities, such as audio, video and text chat, for communication between the users. Figure 4 in Publication [P11] shows the steps involved in creating and interacting with a VSS.

The system allows users to select people of interest from their phonebook and invite them to join the social TV watching session. Users joining late can browse the sessions already in progress on the service and dial into the one they desire. Users can interact during the selection of the TV or video content to be watched, and also while the content is being played back. Once content playback starts, all the participants see exactly the same content, synchronized between them. It is possible to speak to the other participants or gesture to them by popping the participant video onto the screen. Popping in and removing the individual interaction videos from the screen is user-controlled.

For video, managing the small display size is the challenge. The user live video and content video should be laid out such that, on one hand, maximal space is devoted to content playback, ensuring that the viewing experience is not unnecessarily compromised; this can be achieved by minimizing the loss of visible content due to the user video. On the other hand, a meaningful and engaging video presence must be ensured. To make optimal use of the small mobile display, each individual interaction video is linked to the voice activity of its participant, so that the interaction videos of silent participants are kept small. An interaction video window grows in size only when its participant is speaking (see Figure 5 in Publication [P11], where the user in the top left corner is speaking). This keeps unnecessary clutter away from the content being watched.
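The voice-activity-driven window sizing can be sketched as a simple mapping from activity level to pixel size; the threshold and the pixel values below are hypothetical, not those of the actual prototype.

```python
def interaction_window_size(voice_activity: float,
                            small_px: int = 48,
                            large_px: int = 120,
                            threshold: float = 0.1) -> int:
    """Scale a participant's video window with their voice activity
    (0.0 = silent, 1.0 = speaking loudly): silent participants stay
    small so they do not clutter the content being watched."""
    if voice_activity < threshold:
        return small_px
    # Grow linearly between the small and large sizes while speaking.
    frac = min(1.0, (voice_activity - threshold) / (1.0 - threshold))
    return round(small_px + frac * (large_px - small_px))
```

In practice the activity value would come from the voice activity detector of the speech codec, smoothed over time to avoid the window flickering between sizes.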

There is a shared remote control between the participants, which can be used to collectively choose the content from the channel list and also for playback control. Each user has the same control rights as the others: if Pause is pressed by one user, the content pauses for all the participants of the social watching session, and the same applies when Play or Stop is pressed.
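A sketch of how such a shared remote control might propagate one user's command to every participant; the class and the notification scheme are illustrative assumptions, not the actual MIST implementation.

```python
class SharedRemoteControl:
    """Shared playback state for a social watching session: any
    participant's command is applied once and broadcast to everyone."""

    def __init__(self, participants):
        self.participants = list(participants)
        self.state = "stopped"

    def press(self, user, button):
        assert user in self.participants  # every user has equal control
        self.state = {"play": "playing",
                      "pause": "paused",
                      "stop": "stopped"}[button]
        # Notify all clients so their players change state together.
        return [(p, self.state) for p in self.participants]

rc = SharedRemoteControl(["alice", "bob", "carol"])
rc.press("bob", "pause")  # the content pauses for all three participants
```

A real system would also need to resolve near-simultaneous commands from different users, e.g. by letting the server serialize them in arrival order.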

The proof-of-concept MIST system has been tested in both WLAN and 3.5G HSDPA environments (see Publication [P11]). Over WLAN, the response time of user actions such as selecting the movie, switching between different views, or issuing playback control commands (Play, Pause, Stop) was within acceptable limits (around half a second). Over the 3.5G network, the MIST system proved to be stable when tested with mobile clients in several locations (one client in Finland and the other in the UK), with a response time of less than one second.

7.5. USER EXPERIENCE

The study in [194] evaluates various peer-to-peer interaction enablers in mobile TV services. It concluded that social features do enrich the viewing experience despite the inherent possibility of distractions. In our study, we have tried to explore the MIST concept further to capture additional user requirements and preferences for such a system. Publication [P11] and [155] present a preliminary consumer experience study, qualitative in nature, which provides initial feedback and opinions regarding the MIST proof-of-concept system.

Results have shown that some users found audio conferencing to be more socially engaging, while other users found keying in responses as text messages far more laborious. Still others discovered a preference for asymmetric interaction modalities, where they could talk to the other participants while the participants’ responses were rendered as text (and vice versa). The video conferencing capability was considered to add a higher feeling of social presence compared to audio-only or text-only interaction. The study found that although users desire rich interaction capabilities, they do not want them enabled all the time, i.e., during the whole content duration. Table 1 in Publication [P11] shows the user preferences for the content types and the interaction modalities they prefer to use. The results also matched the conclusions of [193], where it was observed that voice+video, voice-only and text-only interaction provide a feeling of presence in descending order. Overall, 100% of the study participants expressed their desire for the availability of audio, video and text interaction modalities. The preference for audio over text and vice versa was 22% each, while the remaining 55% had no clear preference. Users also had some concerns about the MIST system. The first and most common concern was related to privacy: the users wanted to be able to control the participant media, i.e., their own interaction content (their personal audio, video and text) being shared with other participants. The reader interested in the sample questions used in the consumer experience study may refer to [155].


7.6. SESSION MOBILITY

It could be desirable to integrate a MIST system with home TV systems, such that users could have “traditional” social TV sessions at home or on the move. This feature requires that the whole session be seamlessly transferred from one device to another with the smallest possible service discontinuity, in order to guarantee the best user experience. In other words, there is a need for session mobility [154]. The reason for this could be a cheaper access cost (WLAN as opposed to cellular), a better user experience (a large screen and big speakers as opposed to mobile device screens and speakers) or user mobility. Session mobility is a mechanism to enable service mobility, i.e., to continue the service uninterrupted from a different device.

Traditional physical mobility puts a single user device at the center of a network (made of cells, or of different network types) and attempts to offer service continuity (for example through handovers to different cells or networks). Cell reselection management was analyzed in Section 5.5.6.

Differently from traditional physical mobility, service mobility puts the user media at the center of a system (made of multiple devices, possibly running over several types of networks), and attempts to offer service continuity by moving media seamlessly from one device to another. In the context of a multimedia environment, session mobility decouples the media from the device, i.e., one is in the presence of movable media [154].

Complete session transfer means that the originating device transfers all the individual media of the ongoing multimedia session to the target device. Partial session transfer means transferring only a part of the current multimedia session from the originating device to the target device (for example, transferring the audio to bigger speakers while keeping the video on the mobile, allowing mobility within the home while continuing to enjoy the session).

Multimedia session transfer requires capturing the session state and context information on the originating device and transferring it to the target device. The rough steps for performing session mobility are: 1) device and service discovery, to discover the target device and its capabilities; 2) session state capture, i.e., the representation of all the critical parameters of the session that need to be communicated to the target device (for example, in the case of transferring a video streaming session, the parameters needed by the target device are the position of the last frame rendered on the originating device, the video codec details, etc.); 3) session state transfer, where the session state is sent to the target device using specific protocols.
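Steps 2 and 3 above can be sketched as a serialization round-trip; all field names and the example URL are illustrative, not taken from a specific session mobility protocol.

```python
import json

def capture_session_state(position_s, video_codec, audio_codec,
                          stream_url, transfer="complete", media=None):
    """Step 2 of session mobility: serialize the parameters the target
    device needs to resume the session."""
    state = {
        "stream_url": stream_url,
        "position_s": position_s,  # last frame rendered on this device
        "video_codec": video_codec,
        "audio_codec": audio_codec,
        "transfer": transfer,      # "complete" or "partial"
        # For a partial transfer, list which media move to the target
        # device (e.g. only the audio goes to the big speakers).
        "media": media or ["audio", "video"],
    }
    return json.dumps(state)

def restore_session_state(blob):
    """Step 3, on the target device: parse the received state so playback
    can resume from the captured position."""
    return json.loads(blob)
```

The serialized state would be delivered over whatever transfer protocol the deployment uses; the JSON encoding here is only a convenient stand-in.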



Chapter 8

Conclusions and Future Work

There are several challenges for the deployment of multimedia applications on mobile devices. In this thesis, some of these challenges were described and analyzed, and solutions were proposed for mobile multimedia telephony and mobile streaming applications.

The beginning of the thesis set out the mobile networks landscape, summarizing the main critical elements relevant for the considered applications over circuit-switched and packet-switched networks. The mobile networks considered were based on GSM, GPRS and UMTS. The QoS capabilities of these networks were also described.

Next, a matching between mobile multimedia application requirements and network capabilities was carried out. Even though mobile multimedia telephony and mobile streaming are both real-time multimedia applications, they have different requirements and, therefore, the same considerations do not always apply. The QoS aspects linked to networks were related to bandwidth, error rates and delivery of erroneous packets, delivery order, delay and its jitter, handovers and packet segmentation. Each of these spawned different considerations depending on the application. The top-down and bottom-up analysis allowed drawing a first set of suitable and unsuitable mobile network channels for the multimedia applications in the focus of this thesis.

A closer look at mobile multimedia telephony was then taken according to two different types of architectures, circuit-switched and packet-switched, which are supported by different standards and protocols. In the first case, 3G-324M terminals were considered, whereas in the second case MTSI terminals were studied. When the traffic is packet-switched, the PDP contexts must be allocated taking into account several aspects that were analyzed in the thesis. Also, when developing mobile multimedia telephony applications, special attention must be paid to benchmarking and QoS assessment via the adoption of specific QoS metrics. These have to cover the most significant key performance indicators to be measured. The recommended metrics belong to several classes: frame-based, PSNR-based, delay-based, service-flexibility-based and call-control-based. Finally, methods and algorithms for QoS improvement were introduced. The algorithms relate to the areas of bit error and packet loss handling, delay optimization, jitter buffer management, inter-media synchronization, packetization overhead, receiver feedback and session control signaling delay. Simulation results for selected methods were presented.

The next contribution of the thesis is in the area of mobile media streaming applications and their QoS aspects. Different types of such applications are possible, ranging from real-time streaming to progressive streaming and downloading. Each of these use cases has different characteristics, performance peculiarities and mobile terminal requirements. The thesis focused on real-time streaming and progressive downloading, and in particular on the 3GPP PSS standard. Streaming traffic characteristics were discussed, with considerations on content creation and rate control strategies. Specific QoS metrics can also be applied to mobile streaming. Some are common to mobile multimedia telephony, but others are tailored to mobile streaming. Finally, a few mobile streaming QoS improvement methods were introduced. These range from RTP retransmission to robust handover management (the latter technique is today part of the 3GPP PSS standard, and was contributed by the Author). Performance results of PSS over GPRS and UMTS bearers were also presented, together with recommended settings for this application.

One of the fundamental problems of bandwidth-hungry applications is how to deal with variable network bandwidth whenever this cannot be guaranteed. The next thesis contribution was dedicated to mobile media adaptation for both multimedia telephony and streaming. Different approaches exist, and these have been classified into endpoint-driven, network-driven, server-driven, co-operative and geo-predictive adaptation methods. Simulation results have shown that the proposed network-driven method for speech adaptation and the proposed server-driven method for mobile streaming are effective for rate adaptation of real-time media. The server-driven signaling scheme for mobile streaming described in this thesis is today part of the 3GPP PSS and DLNA standards, thanks to the contribution of the Author. The latest research on geo-predictive rate adaptation shows the state of the art in context-based adaptive media for mobile streaming, and offers the best performance compared to the other methods.

The last contribution of the thesis was dedicated to a new experimental application, Mobile and Interactive Social TV, which fuses the mobile TV and multimedia telephony paradigms and introduces a social dimension to TV watching. This application has several challenges, for example the definition of a common shared context among the users to be used as a common interactivity platform. Implementations of MIST terminals present several challenges, especially from the processing power point of view, but also on the QoS side, because one must decide whether to assign the same QoS level as a streaming application, the same as a multimedia telephony application, or an entirely new hybrid QoS level. The first user experience results have shown that this application is very useful for providing a new level of social communication and entertainment.


8.1. FUTURE DEVELOPMENTS

Results on rate adaptation and MIST give hints on possible future research paths. Geo-predictive rate adaptation is a very promising topic. It is disruptive in some sense, because it changes the approach to the problem: a system should no longer try to detect an event (for example a tunnel where little or no radio coverage is present) after its occurrence and minimize the reaction time, but rather predict the event well in advance and take all the measures needed to conceal or hide that event from the user experience. The system thus becomes predictive instead of reactive, thanks to an increased context-awareness. Integrating location, maps and sensing technologies into streaming players increasingly allows the prediction of events along the route of a mobile user. Research questions include, for example, what minimal infrastructure and signaling are needed to implement an optimal prediction scheme with minimal changes to streaming servers and clients. HTTP streaming also requires further performance investigations in lossy environments. The MIST technology is largely unexplored and deserves special attention in the future, both from the social and the technical side.



Bibliography

[1] 3GPP, “Liaison Statement on Meaning of the ‘Transfer Delay’ QoS Attribute for Packet-Switched Streaming Bearers”, TSG-SA WG4 meeting #25bis, Berlin, Germany, 24-28 Feb. 2003, Tdoc S4-030200.

[2] 3GPP, “Liaison Statement on Meaning of the ‘Transfer Delay’ QoS Attribute for Packet-Switched Streaming Bearers”, TSG-SA WG4 meeting #26, Paris, France, 5-9 May 2003, Tdoc S4-030361.

[3] 3GPP TSG-CN, “General Packet Radio Service (GPRS). Mobile Station – Serving GPRS Support Node (MS-SGSN). Logical Link Control (LLC) layer specification” (Release 1997), TS 04.64, v. 6.10.0 (2001-12).

[4] 3GPP TSG-CN, “High speed circuit switched data (HSCSD). Stage 2” (Release 9), TS 23.034, v. 9.0.0 (2009-12).

[5] 3GPP TSG-CN, “Mobile Radio Interface Layer 3 Specification. Core Network Protocols. Stage 3” (Release 6), TS 24.008, v. 6.20.0 (2010-03).

[6] 3GPP TSG-CN, “Radio Link Protocol (RLP) for circuit switched bearer and teleservices” (Release 9), TS 24.022, v. 9.0.0 (2009-12).

[7] 3GPP TSG-CN, “Signaling flows for the IP multimedia call control based on SIP and SDP. Stage 3” (Release 5), TS 24.228, v. 5.15.0 (2006-09).

[8] 3GPP TSG-CN, “Mobile station (MS) - Serving GPRS support node (SGSN). Subnetwork dependent convergence protocol (SNDCP)” (Release 4), TS 44.065, v. 4.3.0 (2004-09).

[9] 3GPP TSG-GERAN, “Digital Cellular Telecommunications System (Phase 2+). Radio Subsystem Synchronization” (Release 1999), TS 05.10, v. 8.12.0 (2003-08).

[10] 3GPP TSG-GERAN, “Base Station System (BSS) – Serving GPRS Support Node (SGSN). BSS GPRS Protocol (BSSGP)” (Release 1999), TS 08.18, v. 8.12.0 (2004-05).

[11] 3GPP TSG-GERAN, “General Packet Radio Service (GPRS). Overall description of the GPRS radio interface. Stage 2” (Release 4), TS 43.064, v. 4.5.0 (2004-04).

[12] 3GPP TSG-GERAN, “Packet-switched handover for GERAN A/Gb mode; Stage 2” (Release 6), TS 43.129, v. 6.13.0 (2009-11).

[13] 3GPP TSG-GERAN, “General packet radio service (GPRS). Mobile station (MS) – Base station system (BSS) interface. Radio link control/medium access control (RLC/MAC) protocol” (Release 4), TS 44.060, v. 4.23.0 (2005-11).

[14] 3GPP TSG-GERAN, “Seamless Support of Streaming Services in GERAN A/Gb mode” (Release 6), TR 44.933, v. 1.3.0 (2003-09).

[15] 3GPP TSG-RAN, “Spreading and Modulation (FDD)” (Release 5), TS 25.213, v. 5.6.0 (2005-06).

[16] 3GPP TSG-RAN, “Services provided by the physical layer” (Release 1999), TS 25.302, v. 3.16.0 (2003-09).

[17] 3GPP TSG-RAN, “High Speed Downlink Packet Access (HSDPA). Overall description. Stage 2” (Release 5), TS 25.308, v. 5.7.0 (2004-12).

[18] 3GPP TSG-RAN, “Medium Access Control (MAC) protocol specification” (Release 1999), TS 25.321, v. 3.17.0 (2004-06).

[19] 3GPP TSG-RAN, “Radio Link Control (RLC) protocol specification” (Release 1999), TS 25.322, v. 3.18.0 (2004-06).

[20] 3GPP TSG-RAN, “Packet Data Convergence Protocol (PDCP) Specification” (Release 4), TS 25.323, v. 4.6.0 (2002-09).

[21] 3GPP TSG-RAN, “Delay Budget within the Access Stratum” (Release 4), TR 25.853, v. 4.0.0 (2001-03).

[22] 3GPP TSG-RAN, “High Speed Downlink Packet Access. Overall UTRAN Description” (Release 5), TR 25.855, v. 5.0.0 (2001-09).

[23] 3GPP TSG-RAN, “High Speed Downlink Packet Access. Physical Layer Aspects” (Release 5), TR 25.858, v. 5.0.0 (2002-03).

[24] 3GPP TSG-RAN, “Typical examples of Radio Access Bearer (RABs) and Radio Bearers (RBs) supported by Universal Terrestrial Radio Access (UTRA)” (Release 9), TR 25.993, v. 10.0.0 (2010-12).

[25] 3GPP TSG-SSA, “Vocabulary for 3GPP Specifications” (Release 10), TR 21.905, v. 10.2.0 (2010-03).

[26] 3GPP TSG-SSA, “High speed circuit switched data (HSCSD). Stage 1” (Release 9), TS 22.034, v. 9.0.0 (2009-12).

[27] 3GPP TSG-SSA, “Service aspects. Service and service capabilities” (Release 9), TS 22.105, v. 9.1.0 (2010-09).

[28] 3GPP TSG-SSA, “General Packet Radio Service (GPRS). Service description. Stage 2” (Release 5), TS 23.060, v. 5.13.0 (2006-12).

[29] 3GPP TSG-SSA, “Quality of Service (QoS) concept and architecture” (Release 9), TS 23.107, v. 9.1.0 (2010-06).

[30] 3GPP TSG-SSA, “IP Multimedia Subsystem (IMS). Stage 2” (Release 5), TS 23.228, v. 5.15.0 (2006-06).

[31] 3GPP TSG-SSA, “Mandatory speech CODEC speech processing functions; AMR speech CODEC; general description” (Release 9), TS 26.071, v. 9.0.0 (2009-12).

[32] 3GPP TSG-SSA, “Codec for Circuit Switched Multimedia Telephony Service. General Description” (Release 9), TS 26.110, v. 9.0.0 (2009-12).

[33] 3GPP TSG-SSA, “Codec for Circuit Switched Multimedia Telephony Service. Modifications to H.324” (Release 9), TS 26.111, v. 9.0.0 (2009-12).

[34] 3GPP TSG-SSA, “IP Multimedia Subsystem (IMS); Multimedia Telephony; Media handling and interaction” (Release 9), TS 26.114, v. 9.4.0 (2010-12).

[35] 3GPP TSG-SSA, “Speech codec speech processing functions; Adaptive Multi-Rate – Wideband (AMR-WB) speech codec; General description” (Release 9), TS 26.171, v. 9.0.0 (2009-12).

[36] 3GPP TSG-SSA, “Transparent end-to-end packet switched streaming service (PSS); General description” (Release 9), TS 26.233, v. 9.0.0 (2009-12).

[37] 3GPP TSG-SSA, “Transparent end-to-end Packet-switched Streaming Service (PSS). Protocols and codecs” (Release 9), TS 26.234, v. 9.5.0 (2010-12).

[38] 3GPP TSG-SSA, “Packet switched conversational multimedia applications; transport protocols” (Release 9), TS 26.236, v. 9.0.0 (2009-12).

[39] 3GPP TSG-SSA, “Audio codec processing functions; Extended Adaptive Multi-Rate – Wideband (AMR-WB+) codec; Transcoding functions” (Release 9), TS 26.290, v. 9.0.0 (2009-09).

[40] 3GPP TSG-SSA, “General audio codec audio processing functions; Enhanced aacPlus general audio codec; General description” (Release 9), TS 26.401, v. 9.0.0 (2009-12).

[41] 3GPP TSG-SSA, “Codec for Circuit Switched Multimedia Telephony Service. Terminal implementor’s guide” (Release 9), TR 26.911, v. 9.0.0 (2009-12).

[42] 3GPP TSG-SSA, “QoS for Speech and Multimedia Codec; Quantitative performance evaluation of H.324 Annex C over 3G” (Release 4), TR 26.912, v. 4.0.0 (2001-03).

[43] 3GPP TSG-SSA, “Transparent end-to-end Packet-switched Streaming Service (PSS). RTP usage model” (Release 9), TR 26.937, v. 9.0.0 (2009-12).

[44] Adobe, “Real Time Messaging Protocol Chunk Stream”, draft-rtmpcs-01.txt, Jun. 2009.

[45] Adobe, “Dynamic stream switching with Flash Media Server 3”, http://www.adobe.com/devnet/flashmediaserver/articles/dynamic_stream_switching.html (accessed on 29 May 2011).

[46] S. Ahmad, R. Hamzaoui, M. Al-Akaidi, “Adaptive Unicast Video Streaming With Rateless Codes and Feedback”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 20, No. 2, Feb. 2010, pp. 275-285.

[47] ANSI T1.801.03-2003, “Digital Transport of One-Way Video Signals – Parameters for Objective Performance Assessment”, 2003.

[48] F. Andreasen, “Session Description Protocol (SDP) Capability Negotiation”, IETF Request for Comments 5939, Sep. 2010.

[49] S.A. Atungsiri, S. Tateesh, A. Kondoz, “Multirate Coding for Mobile Communications Link Adaptation”, IEE Proceedings Communications, Vol. 144, No. 3, Jun. 1997, pp. 211-216.

[50] M. Baldi, Y. Ofek, “End-to-End Delay Analysis of Videoconferencing over Packet-Switched Networks”, IEEE/ACM Transactions on Networking, Vol. 8, No. 4, Aug. 2000, pp. 479-492.

[51] A. Barberis, C. Casetti, J.C. De Martin, M. Meo, “A Simulation Study of Adaptive Voice Communications on IP Networks”, Computer Communications, Vol. 24, No. 9, 1 May 2001, pp. 757-767.

[52] A. Begen, D. Hsu, M. Lague, “Post-Repair Loss RLE Report Block Type for RTP Control Protocol (RTCP) Extended Reports (XRs)”, IETF Request for Comments 5725, Feb. 2010.

[53] P. Bellavista, A. Corradi, L. Foschini, “Proactive Management of Distributed Buffers for Streaming Continuity in Wired-Wireless Integrated Networks”, Proc. IEEE/IFIP Network Operations & Management Symposium (NOMS ’06), 3-7 Apr. 2006, Vancouver, Canada, pp. 351-360.

[54] P. Bellavista, A. Corradi, L. Foschini, “IMS-Compliant Management of Vertical Handoffs for Mobile Multimedia Session Continuity”, IEEE Communications Magazine, Vol. 48, No. 4, Apr. 2010, pp. 114-121.

[55] G. Bjontegaard, “Calculation of Average PSNR Differences Between RD-curves”, ITU-T SG16, Q16, VCEG, 2-4 Apr. 2001, Austin, TX, U.S.A., Document VCEG-M33.

[56] J.-C. Bolot, S. Fosse-Parisis, D. Towsley, “Adaptive FEC-Based Error Control for Internet Telephony”, Proc. 18th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM ’99), 21-25 Mar. 1999, New York, NY, U.S.A., Vol. 3, pp. 1453-1460.

[57] C. Bormann (ed.), “Robust header compression (ROHC): framework and four profiles: RTP, UDP, ESP and uncompressed”, IETF Request for Comments 3095, Jul. 2001.

[58] G. Camarillo, A. Monrad, “Mapping of Media Streams to Resource Reservation Flows”, IETF Request for Comments 3524, Apr. 2003.

[59] G. Camarillo, H. Schulzrinne, “The Session Description Protocol (SDP) Grouping Framework”, IETF Request for Comments 5888, Jun. 2010.

[60] G. Carle, E.W. Biersack, “Survey of Error Recovery Techniques for IP-Based Audio-Visual Multicast Applications”, IEEE Network, Vol. 11, No. 6, Nov./Dec. 1997, pp. 24-36.

[61] S. Casner, “Session Description Protocol (SDP) Bandwidth Modifiers for RTP Control Protocol (RTCP) Bandwidth”, IETF Request for Comments 3556, Jul. 2003.

[62] S. Cen, P.C. Cosman, G.M. Voelker, “End-to-end Differentiation of Congestion and Wireless Losses”, IEEE/ACM Transactions on Networking, Vol. 11, No. 5, Oct. 2003, pp. 703-717.

[63] D. Chalmers, M. Sloman, “A Survey of Quality of Service in Mobile Computing Environments”, IEEE Communications Surveys & Tutorials, 2nd Quarter 1999.

[64] S. Chatterjee, Crowdsourcing Mobile Internet Access Characteristics to Improve Video Streaming, M.Sc. Thesis, Aalto University, Helsinki, Finland, Jan. 2011.

[65] S. Chemiakina, L. D’Antonio, F. Forti, R. Lalli, J. Petersson, A. Terzani, “QoS Enhancement for Adaptive Streaming Services Over WCDMA”, IEEE Journal on Selected Areas in Communications, Vol. 21, No. 10, Dec. 2003, pp. 1575-1584.

[66] P. Chimento, J. Ishac, “Defining Network Capacity”, IETF Request for Comments 5136, Feb. 2008.

[67] L. Christianson, K. Brown, “Rate Adaptation for Improved Audio Quality in Wireless Networks”, Proc. IEEE 6th International Workshop on Mobile Multimedia Communications (MoMuC ’99), 15-17 Nov. 1999, San Diego, CA, U.S.A., pp. 363-367.

[68] G.J. Conklin, G.S. Greenbaum, K.O. Lillevold, A.F. Lippman, Y.A. Reznik, “Video Coding for Streaming Media Delivery on the Internet”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 3, Mar. 2001, pp. 269-281.

[69] F. Cricrí, S. Mate, I.D.D. Curcio, M. Gabbouj, “Mobile and Interactive Social Television – A Virtual TV Room”, Proc. 10th IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM ’09), 15-19 Jun. 2009, Kos, Greece.

[70] G. Cunningham, S. Murphy, L. Murphy, P. Perry, “Seamless handover of streamed video over UDP Between Wireless LANs”, Proc. 2nd IEEE Consumer Communications and Networking Conference (CCNC ’05), 3-6 January 2005, Dublin, Ireland, pp. 284-289.

[71] I.D.D. Curcio, “Practical Metrics for QoS Evaluation of Mobile Video”, Proc. IASTED Internet and Multimedia Systems & Applications Conference (IMSA 2000), 19-23 Nov. 2000, Las Vegas, NV, U.S.A., pp. 199-208.

[72] I.D.D. Curcio, “Mobile Video QoS Metrics”, International Journal of Computers and Applications (ACTA Press), Vol. 24, No. 2, 2002, pp. 41-51.

[73] I.D.D. Curcio, “Multimedia Streaming over Mobile Networks: European Perspective”, Wireless Internet Handbook: Technologies, Standards and Applications, B. Furht, M. Ilyas (Eds.), CRC Press, 2003, pp. 77-104.

[74] I.D.D. Curcio, “Mobile Video Telephony”, Wireless Internet Handbook: Technologies, Standards and Applications, B. Furht, M. Ilyas (Eds.), CRC Press, 2003, pp. 469-495.

[75] I.D.D. Curcio, E. Aksu, R.-S. Wang, K. Miller, “Method in a Communication System, a Communication System and a Communication Device”, Patent, United States, US 7,701,915, 20 Apr. 2010 (also EP 1 721 427 and other states), priority date 27 Jun. 2003.

[76] I.D.D. Curcio, E. Aksu, Y.-K. Wang, “Timing of Quality of Experience Metrics”, Patent, Australia, 2004317111, 8 Jan. 2009 (also KR 10-0808981 and other states), priority date 13 Feb. 2004.

[77] I.D.D. Curcio, M. Hannuksela, V. Varsa, “Method in a Communication System, a Communication System and a Communication Device”, Patent, South Korea, 10-0731963, 19 Jun. 2007 (also JP 4105695 and other states), priority date 25 Sep. 2002.

[78] I.D.D. Curcio, A. Hourunranta, “QoS of Mobile Videophones in HSCSD Networks”, Proc. of the 8th IEEE International Conference on Computer Communications and Networks (ICCCN ’99), 11-13 Oct. 1999, Boston, MA, U.S.A., pp. 447-451.

[79] I.D.D. Curcio, M. Lundan, “Study of Call Setup in SIP-based Videotelephony”, Proc. 5th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2001), 22-25 Jul. 2001, Orlando, FL, U.S.A., Vol. IV, pp. 1-6.

[80] I.D.D. Curcio, M. Lundan, “Event-Driven RTCP Feedback for Mobile Multimedia Applications”, Proc. IEEE 3rd Finnish Wireless Communications Workshop (FWCW ’02), 29 May 2002, Helsinki, Finland.

[81] I.D.D. Curcio, M. Lundan, “On RTCP Feedback for Mobile Multimedia Applications”, Proc. IEEE International Conference on Networking (ICN ’02), 26-29 Aug. 2002, Atlanta, GA, U.S.A., pp. 637-648.

[82] I.D.D. Curcio, M. Lundan, “Bandwidth Adaptation”, Patent, Finland, 116498, 30 Nov. 2005 (also US 7,346,007, EP 1 552 655 and other states), priority date 23 Sep. 2002.

[83] I.D.D. Curcio, M. Lundan, “Enhancing Streaming Media Reception for a Mobile Device During Cell Reselection”, Patent, United States, US 7,733,830, 8 Jun. 2010 (also CN 1706146 and other states), priority date 14 Oct. 2002.

[84] I.D.D. Curcio, M. Lundan, E. Aksu, “Transmission of Embedded Information Relating to a Quality of Service”, Patent, EP 1 661 366, 17 Feb. 2010 (also IN 232262 and other states), priority date 2 Sep. 2003.

[85] N. Damera-Venkata, T.D. Kite, W.S. Geisler, B.L. Evans, A.C. Bovik, “Image Quality Assessment Based on a Degradation Model”, IEEE Transactions on Image Processing, Vol. 9, No. 4, Apr. 2000, pp. 636-650.

[86] DARPA Internet Program, “Internet Protocol”, IETF Request for Comments 791, Sep. 1981.

[87] DARPA Internet Program, “Transmission Control Protocol”, IETF Request for Comments 793, Sep. 1981.

[88] S. Deering, R. Hinden, “Internet Protocol, Version 6 (IPv6)”, IETF Request for Comments 2460, Dec. 1998.

[89] M. Degermark, B. Nordgren, S. Pink, “IP header compression”, IETF Request for Comments 2507, Feb. 1999.

[90] J. Devadoss, V. Singh, J. Ott, C. Liu, Y.-K. Wang, I.D.D. Curcio, “Evaluation of Error Resilience Mechanisms for 3G Conversational Video”, Proc. IEEE 10th International Symposium on Multimedia (ISM ’08), 15-17 Dec. 2008, Berkeley, CA, U.S.A., pp. 378-383.

[91] E.M. Dickson, “Potential Impacts of the Video Telephone”, IEEE Transactions On Communications, Vol. 23, No. 10, Oct. 1975, pp. 1172-1176.

[92] DLNA, “DLNA Guidelines – Volume 1: Architectures and Protocols”, Aug. 2009.

[93] C. Dovrolis, P. Ramanathan, D. Moore, “Packet-Dispersion Techniques and a Capacity-Estimation Methodology”, IEEE/ACM Transactions on Networking, Vol. 12, No. 6, Dec. 2004, pp. 963-977.

[94] J. Dunlop, “Potential for compressed video transmission over the GSM HSCSD service”, Electronics Letters, Vol. 33, No. 2, 16 Jan. 1997, pp. 121-122.

[95] Ericsson, Nokia, “Signaling for rate adaptation in PSS”, 3GPP TSG-SA WG4 #27 meeting, Munich, Germany, 7-11 Jul. 2003, Tdoc S4-030501.

[96] ETSI, “Digital cellular telecommunication system (Phase 2+). General Packet Radio Service (GPRS). Service description. Stage 1” (Release 1997), TS 02.60, v. 6.3.1 (2000-11).

[97] ETSI TIPHON, “Release 3; Part 5: Quality of Service (QoS) Measurement Methodologies”, TS 101 329-5, v. 1.1.2 (2002-01).

[98] ETSI, “Overview of 3GPP Release 5”, v. 0.1.1 (2010-02).

[99] ETSI, “Overview of 3GPP Release 6”, v. 0.1.1 (2010-02).

[100] T. Eyers, H. Schulzrinne, “Predicting Internet Telephony Call Setup Delay”, Proc. 1st IP Telephony Workshop (IPTEL 2000), 12-13 Apr. 2000, Berlin, Germany.

[101] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, “Hypertext Transfer Protocol – HTTP/1.1”, IETF Request for Comments 2616, Jun. 1999.

[102] S. Floyd, M. Handley, J. Padhye, J. Widmer, “TCP Friendly Rate Control (TFRC): Protocol specification”, IETF Request for Comments 5348, Sep. 2008.

[103] T. Friedman, R. Caceres, A. Clark (Eds.), “RTP Control Protocol Extended Reports (RTCP XR)”, IETF Request for Comments 3611, Nov. 2003.

[104] P. Fröjdh, U. Horn, M. Kampmann, A. Nohlgren, M. Westerlund, “Adaptive Streaming with the 3GPP Packet-Switched Streaming Service”, IEEE Network, Vol. 20, No. 2, Mar./Apr. 2006, pp. 34-40.

[105] Y. Fu, R. Hu, G. Tian, Z. Wang, “TCP-Friendly Rate Control for Streaming Service Over 3G Network”, Proc. International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM ’06), 22-24 Sep. 2006, Wuhan, China.

[106] N. Färber, B. Girod, “Robust H.263 Compatible Video Transmission for Mobile Access to Video Servers”, Proc. IEEE International Conference on Image Processing (ICIP ’97), 26-29 Oct. 1997, Santa Barbara, CA, U.S.A., Vol. 2, pp. 73-76.

[107] H. Garudadri, H. Chung, N. Srinivasamurthy, P. Sagetong, “Rate Adaptation for Video Telephony in 3G Networks”, IEEE 16th International Packet Video Workshop, Lausanne, Switzerland, 12-13 Nov. 2007, pp. 342-348.

[108] P. Gentric, “RTSP Stream Switching”, IETF Internet Draft, draft-gentric-mmusic-stream-switching-01.txt, Jan. 2004, Expired.

[109] T. Halonen, J. Romero, J. Melero (Eds.), GSM, GPRS and EDGE Performance. Evolution towards 3G/UMTS, Second Edition, John Wiley & Sons, 2003.

[110] M. Handley, V. Jacobson, C. Perkins, “SDP: Session Description Protocol”, IETF Request for Comments 4566, Jul. 2006.

[111] M.M. Hannuksela, Error-Resilient Communication Using the H.264/AVC Video Coding Standard, Ph.D. Thesis, Tampere University of Technology, Tampere, Finland, Publication 796, Mar. 2009.

[112] A. Hourunranta, I.D.D. Curcio, “Delay in Mobile Videophones”, Proc. IEEE 7th Mobile Multimedia Communications Workshop (MoMuC 2000), 23-26 Oct. 2000, Tokyo, Japan, pp. 1B-3-1/1B-3-7.

[113] C.-Y. Hsu, A. Ortega, M. Khansari, “Rate Control for Robust Video Transmission over Burst-Error Wireless Channels”, IEEE Journal on Selected Areas in Communications, Vol. 17, No. 5, May 1999, pp. 756-773.

[114] J. Hämäläinen, Design of GSM high speed data services, Ph.D. Thesis, Tampere University of Technology, Tampere, Finland, 1996.

[115] S. Iai, T. Kurita, N. Kitawaki, “Quality Requirements for Multimedia Communication Services and Terminals – Interaction of Speech and Video Delays”, Proc. IEEE Global Telecommunications Conference (GLOBECOM ’93), 29 Nov.-2 Dec. 1993, Houston, TX, U.S.A., Vol. 1, pp. 394-398.

[116] Y. Ishibashi, S. Tasaka, “A Comparative Survey of Synchronization Algorithms for Continuous Media in Network Environments”, Proc. IEEE 25th Annual Conference on Local Computer Networks (LCN ’00), 8-10 Nov. 2000, Tampa, FL, U.S.A., pp. 337-348.

[117] Y. Ishibashi, S. Tasaka, H. Ogawa, “A Comparison of Media Synchronization Quality among Reactive Control Schemes”, Proc. 20th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM ’01), 22-26 Apr. 2001, Anchorage, AK, U.S.A., Vol. 1, pp. 77-84.

[118] ISO/IEC, “Information Technology – Coding of audio-visual objects – Part 2: Visual”, 14496-2, 2004.

[119] ISO/IEC, “Information Technology – Coding of audio-visual objects – Part 3: Audio”, 14496-3, 2005.

[120] ITU-T, “Network grade of service parameters and target values for circuit-switched public land mobile services”, Recommendation E.771, Oct. 1996.

[121] ITU-T, “Multiplexing protocol for low bit rate multimedia communication”, Recommendation H.223, Jul. 2001.

[122] ITU-T, “Video coding for low bit rate communication”, Recommendation H.263, Jan. 2005.

[123] ITU-T, “Video Codec Test model near-term, Version 5 (TMN5)”, H.263 Ad Hoc Group, 1996.

[124] ITU-T, “Advanced video coding for generic audiovisual services”, Recommendation H.264, Mar. 2010.

[125] ITU-T, “Video back-channel messages for conveyance of status information and requests from a video receiver to a video sender”, Recommendation H.271, May 2006.

[126] ITU-T, “Terminal for Low Bit-Rate Multimedia Communication”, Recommendation H.324, Apr. 2009.

[127] ITU-T, “Principles of a reference impairment system for video”, Recommendation P.930, Aug. 1996.

[128] ITU-T, “Multimedia communications delay, synchronization and frame rate measurement”, Recommendation P.931, Dec. 1998.

[129] ITU-T, “Internet protocol data communication service – IP packet transfer and availability performance parameters”, Recommendation Y.1540, Nov. 2007.

[130] V. Jacobson, “Compressing TCP/IP headers for low-speed serial links”, IETF Request for Comments 1144, Feb. 1990.

[131] H. Jin, R. Hsu, J. Wang, “Performance Comparison of Header Compression Schemes for RTP/UDP/IP Packets”, Proc. IEEE Wireless Communications and Networking Conference (WCNC ’04), Atlanta, GA, U.S.A., 21-25 Mar. 2004, Vol. 3, pp. 1691-1696.

[132] H. Jin, A.C. Mahendran, “Using SigComp to Compress SIP/SDP Messages”, Proc. IEEE International Conference on Communications (ICC ’05), Seoul, Korea, 16-20 May 2005, Vol. 5, pp. 3107-3111.

[133] M. Johanson, “Adaptive Forward Error Correction for Real-Time Internet Video”, Proc. 13th International Packet Video Workshop (PV ’03), Nantes, France, Apr. 2003.

[134] I. Johansson, M. Westerlund, “Support for Reduced-Size Real-Time Transport Control Protocol (RTCP): Opportunities and Consequences”, IETF Request for Comments 5506, Apr. 2009.

[135] T.V. Johnson, A. Zhang, “Dynamic Playout Scheduling Algorithms for Continuous Multimedia Streams”, Multimedia Systems, Vol. 7, 1999, pp. 312-325.

[136] D. Jurca, J. Chakareski, J.-P. Wagner, P. Frossard, “Enabling Adaptive Video Streaming in P2P Systems”, IEEE Communications Magazine, Vol. 45, No. 6, Jun. 2007, pp. 108-114.

[137] H. Kaaranen, A. Ahtiainen, L. Laitinen, S. Naghian, V. Niemi (Eds.), UMTS Networks: Architecture, Mobility and Services, Second edition, John Wiley & Sons, 2005.

[138] M. Kampmann, N. Baldo, “Adaptive Wireless Video Streaming using Transmission Rate Control and Priority-Based Packet Scheduling”, 14th International Packet Video Workshop (PV ’04), 13-14 December 2004, Irvine, CA, U.S.A.

[139] M. Karczewicz, R. Kurceren, “The SP- and SI-Frames Design for H.264/AVC”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, Jul. 2003, pp. 637-644.

[140] J. Kurose, K.W. Ross, Computer Networking: A Top-Down Approach, 5th edition, Addison Wesley, 2010.

[141] T.V. Lakshman, A. Ortega, A.R. Reibman, “VBR Video: Tradeoffs and Potentials”, Proceedings of the IEEE, Vol. 86, No. 5, May 1998, pp. 952-973.

[142] L-A. Larzon, M. Degermark, S. Pink, L-E. Jonsson (Ed.), G. Fairhurst (Ed.), “The Lightweight User Datagram Protocol (UDP-Lite)”, IETF Request for Comments 3828, Jul. 2004.

[143] J. Lazzaro, “Framing Real-Time Transport Protocol (RTP) and RTP Control Protocol (RTCP) Packets over Connection-Oriented Transport”, IETF Request for Comments 4571, Jul. 2006.

[144] X. Li, A. Gani, R. Salleh, O. Zakaria, “The Future of Mobile Wireless Communication Networks”, Proc. IEEE International Conference on Communication Software and Networks (ICCSN ’09), 27-28 Feb. 2009, Macau, China, pp. 554-557.

[145] D. Li, K. Sleurs, E. Van Lil, A. Van de Capelle, “Improving TFRC Performance against Bandwidth Change during Handovers”, Proc. 4th International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM ‘08), 12-14 Oct. 2008, Dalian, China.

[146] G. Liang, B. Liang, “Effect of Delay and Buffering on Jitter-Free Streaming Over Random VBR Channels”, IEEE Transactions on Multimedia, Vol. 10, No. 6, Oct. 2008, pp. 1128-1141.

[147] C. Liu, I. Bouazizi, M. Gabbouj, “Rate Adaptation for Adaptive HTTP Streaming”, Proc. 2nd Annual ACM Conference on Multimedia Systems (MMSys ’11), Santa Clara, CA, U.S.A., 23-25 Feb. 2011.

[148] A. Lo, G. Heijenk, I. Niemegeers, “Evaluation of MPEG-4 Video Streaming over UMTS/WCDMA Dedicated Channels”, Proc. 1st International Conference on Wireless Internet (WICON ’05), Budapest, Hungary, 10-14 Jul. 2005, pp. 182-189.

[149] M. Lundan, “Streaming over EGPRS”, Proc. 9th IEEE Symposium on Computer and Communications (ISCC ’04), 28 Jun.-2 Jul. 2004, Alexandria, Egypt, pp. 969-974.

[150] M. Lundan, I.D.D. Curcio, “RTSP Signaling in 3GPP Streaming over GPRS”, Proc. Finnish Signal Processing Symposium (FINSIG ’03), Tampere, Finland, 19 May 2003, TICSP Series #20, pp. 149-153.

[151] M. Lundan, I.D.D. Curcio, “3GPP Streaming over GPRS Rel. ’97”, Proc. IEEE 12th International Conference on Computer Communications and Networks (ICCCN ’03), 20-22 Oct. 2003, Dallas, TX, U.S.A., pp. 101-106.

[152] D. Malas, A. Morton, “Basic Telephony SIP End-to-end Performance Metrics”, IETF Request for Comments 6076, Jan. 2011.

[153] Y. Mansour, B. Patt-Shamir, “Jitter Control in QoS Networks”, IEEE/ACM Transactions on Networking, Vol. 9, No. 4, Aug. 2001, pp. 492-502.

[154] S. Mate, U. Chandra, I.D.D. Curcio, “Movable-Multimedia: Session Mobility in Ubiquitous Computing Ecosystem”, Proc. ACM 5th International Conference on Mobile and Ubiquitous Multimedia (MUM 2006), 4-6 Dec. 2006, Stanford, CA, U.S.A.

[155] S. Mate, I.D.D. Curcio, “Consumer Experience Study of Mobile and Interactive Social Television”, Proc. 10th IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM ’09), 15-19 Jun. 2009, Kos, Greece.

[156] J. Matta, C. Pépin, K. Lashkari, R. Jain, “A Source and Channel Rate Adaptation Algorithm for AMR in VoIP Using the E-model”, Proc. 13th ACM International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV ’03), 1-3 Jun. 2003, Monterey, CA, U.S.A., pp. 92-99.

[157] Microsoft, “IIS Smooth Streaming Technical Overview”, Mar. 2009.

[158] M. Mitzenmacher, “Digital Fountains: A Survey and Look Forward”, Proc. IEEE Information Theory Workshop (ITW ‘04), 24-29 October 2004, San Antonio, TX, U.S.A., pp. 271-276.

[159] M. Mouly, M.B. Pautet, The GSM System for Mobile Communications, Published by the authors, 1992.

[160] J. Nieweglowski, T. Leskinen, “Video in Mobile Networks”, Proc. European Conference On Multimedia Applications, Services and Techniques (ECMAST ’96), 28-30 May 1996, Louvain-la-Neuve, Belgium, pp. 120-133.

[161] Nokia, “Video streaming traffic characteristics – Application modeling in the RTP usage model”, 3GPP TSG-SA WG4 PSM AHG #4 Meeting, Munich, Germany, 17-18 Jan. 2002, Tdoc S4-AHP092r.

[162] Nokia, “Maximum bit rate parameter in SDP for PSS and conversational multimedia applications”, 3GPP TSG-SA WG4 #25 meeting, San Francisco, CA, U.S.A., 20-24 Jan. 2003, Tdoc S4-030033.

[163] Nokia, “Robust handover management for PSS”, 3GPP TSG-SA WG4 #25 meeting, San Francisco, CA, U.S.A., 20-24 Jan. 2003, Tdoc S4-030040.

[164] Nokia, “New client to server signaling for co-operative rate adaptation”, 3GPP TSG-SA WG4 #25bis meeting, Berlin, Germany, 24-28 Feb. 2003, Tdoc S4-030126.

[165] Nokia, “New client to server signaling for co-operative rate adaptation”, 3GPP TSG-SA WG4 #26 meeting, Paris, France, 5-9 May 2003, Tdoc S4-030329.

[166] Nokia, “Some issues on rate adaptation”, 3GPP TSG-SA WG4 #26 meeting, Paris, France, 5-9 May 2003, Tdoc S4-030348.

[167] Nokia, “Simulation results for scheduling parameter signaling mode”, 3GPP TSG-SA WG4 #27 meeting, Munich, Germany, 7-11 Jul. 2003, Tdoc S4-030503.

[168] Nokia, “On RTCP reporting for bit-rate adaptation”, 3GPP TSG-SA WG4 #28 meeting, Erlangen, Germany, 1-5 Sep. 2003, Tdoc S4-030652.

[169] Nokia, “SDU Requirements for PtM MBMS Radio Bearers”, MBMS Joint Meeting, Baden, Austria, 13-14 Oct. 2003, Tdoc MBMS-030012.

[170] Nokia, “RTCP Optimization Usage for VoIMS”, 3GPP TSG-SA WG4 #29 meeting, Tampere, Finland, 24-28 Nov. 2003, Tdoc S4-030770.

[171] Nokia, “QoE metrics extension”, 3GPP TSG-SA WG4 #31 meeting, Montreal, Canada, 17-21 May 2004, Tdoc S4-040237.

[172] Nokia, “Signaling of Applications Desired Synchronization Jitter”, 3GPP TSG-SA WG4 #37 meeting, Bordeaux, France, 14-18 Nov. 2005, Tdoc S4-050753.

[173] Nokia, “End-to-end signaling of negotiated QoS parameters for IMS multimedia sessions”, 3GPP TSG-SA WG4 #37 meeting, Bordeaux, France, 14-18 Nov. 2005, Tdoc S4-050754.

[174] Nokia, “Quality of Experience (QoE) Reporting in MTSI”, 3GPP TSG-SA WG4 #50 meeting, Sophia Antipolis, France, 18-22 Aug. 2008, Tdoc S4-080473.

[175] Nokia, Ericsson, “Introduction to rate adaptation for Rel. 5-6 PSS”, 3GPP TSG-SA WG4 #25bis meeting, Berlin, Germany, 24-28 Feb. 2003, Tdoc S4-030171.

[176] J. Ott, C. Bormann, G. Sullivan, S. Wenger, R. Even (Ed.), “RTP Payload Format for ITU-T Rec. H.263 Video”, IETF Request for Comments 4629, Jan. 2007.

[177] J. Ott, I. Curcio, V. Singh, “Real-Time Transport Control Protocol Extension Report for Run Length Encoding of Discarded Packets”, IETF Internet Draft draft-ott-xrblock-rtcp-xt-discard-metrics-00.txt, 14 Mar. 2011, Work in Progress.

[178] J. Ott, S. Wenger, N. Sato, C. Burmeister, J. Rey, “Extended RTP Profile for Real-time Transport Control Protocol (RTCP)-based Feedback (RTP/AVPF)”, IETF Request for Comments 4585, Jul. 2006.

[179] R. Pantos (Ed.), W. May, “HTTP Live Streaming”, IETF Internet Draft draft-pantos-http-live-streaming-05.txt, 19 Nov. 2010, Work in progress.

[180] V. Paxson, “End-to-End Internet Packet Dynamics”, IEEE/ACM Transactions on Networking, Vol. 7, No. 3, Jun. 1999, pp. 277-292.

[181] V. Paxson, M. Allman, J. Chu, M. Sargent, “Computing TCP’s Retransmission Timer”, IETF Request for Comments XXXX (draft-paxson-tcpm-rfc2988bis-02.txt), 14 Mar. 2011.

[182] J. Peltotalo, Solutions for Large-Scale Content Delivery over the Internet Protocol, Ph.D. Thesis, Tampere University of Technology, Tampere, Finland, Publication 925, Nov. 2010.

[183] F. Pereira, T. Alpert, “MPEG-4 Video Subjective Test Procedures and Results”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 1, Feb. 1997, pp. 32-51.

[184] C. Perkins, RTP: Audio and Video for the Internet, Addison-Wesley, 2003.

[185] C. Perkins, O. Hodson, “Options for Repair of Streaming Media”, IETF Request for Comments 2354, Jun. 1998.

[186] K. Piamrat, C. Viho, A. Ksentini, J.-M. Bonnin, “Quality of Experience Measurements for Video Streaming over Wireless Networks”, Proc. 6th IEEE Conference on Information Technology: New Generations (ITNG ’09), 27-29 Apr. 2009, Las Vegas, NV, U.S.A., pp. 1184-1189.

[187] J. Postel, “User Datagram Protocol”, IETF Request for Comments 768, 28 Aug. 1980.

[188] R. Price, C. Bormann, J. Christoffersson, H. Hannu, Z. Liu, J. Rosenberg, “Signaling Compression (SigComp)”, IETF Request for Comments 3320, Jan. 2003.

[189] K. Ramakrishnan, S. Floyd, D. Black, “The Addition of Explicit Congestion Notification (ECN) to IP”, IETF Request for Comments 3168, Sep. 2001.

[190] J. Rey, D. Léon, A. Miyazaki, V. Varsa, R. Hakenberg, “RTP Retransmission Payload Format”, IETF Request for Comments 4588, Jul. 2006.

[191] H.I. Romnes, R.R. O’Connor, “General Mobile Telephone System”, Transactions of the American Institute of Electrical Engineers, Vol. 66, No. 1, 1947, pp. 1658-1666.

[192] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, E. Schooler, “SIP: Session Initiation Protocol”, IETF Request for Comments 3261, Jun. 2002.

[193] E.-L. Sallnas, The Effect of Modality on Social Presence, Presence and Performance in Collaborative Virtual Environments, Doctoral Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2004.

[194] R. Schatz, S. Egger, “Social Interaction Features for Mobile TV Services”, Proc. IEEE Broadband Multimedia Systems and Broadcast Symposium, 31 Mar.-2 Apr. 2008, Las Vegas, NV, U.S.A.

[195] R. Schatz, S. Wagner, S. Egger, N. Jordan, “Mobile TV becomes Social – Integrating Content with Communications”, Proc. IEEE 29th International Conference on Information Technology Interfaces (ITI ’07), 25-28 Jun. 2007, Cavtat / Dubrovnik, Croatia, pp. 263-270.

[196] R. Schatz, S. Wagner, N. Jordan, “Mobile Social TV: Extending DVB-H Services with P2P-Interaction”, IEEE 2nd International Conference on Digital Telecommunications (ICDT ’07), 1-5 Jul. 2007, San Jose, CA, U.S.A.

[197] T. Schierl, M. Kampmann, T. Wiegand, “H.264/AVC Interleaving for 3G Wireless Video Streaming”, Proc. IEEE International Conference on Multimedia & Expo (ICME ’05), 6-8 Jul. 2005, Amsterdam, The Netherlands.

[198] T. Schierl, T. Wiegand, “H.264/AVC Rate Adaptation for Internet Streaming”, Proc. Packet Video Workshop (PV ’04), 13-14 Dec. 2004, Irvine, CA, U.S.A.

[199] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications”, IETF Request for Comments 3550, Jul. 2003.

[200] H. Schulzrinne, A. Rao, R. Lanphier, “Real Time Streaming Protocol (RTSP)”, IETF Request for Comments 2326, Apr. 1998.

[201] P. Shankar, T. Nadeem, J. Rosca, L. Iftode, “CARS: Context-Aware Rate Selection for Vehicular Networks”, Proc. IEEE International Conference on Network Protocols (ICNP ‘08), 19-22 Oct. 2008, Orlando, FL, U.S.A.

[202] Y. Shibata, N. Seta, S. Shimizu, “Media Synchronization Protocols for Packet Audio-Video System on Multimedia Information Networks”, Proc. 28th Annual Hawaii International Conference on System Sciences, 1995, pp. 594-601.

[203] A. Shokrollahi, “Raptor Codes”, IEEE Transactions on Information Theory, Vol. 52, No. 6, Jun. 2006, pp. 2551-2567.

[204] V.R. Singh, Rate-Control for Conversational H.264 Video Communication in Heterogeneous Networks, M.Sc. Thesis, Aalto University, Helsinki, Finland, May 2010.

[205] J. Sjoberg, M. Westerlund, A. Lakaniemi, Q. Xie, “RTP Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs”, IETF Request for Comments 4867, Apr. 2007.

[206] C.J. Sreenan, J.-C. Chen, P. Agrawal, B. Narendran, “Delay Reduction Techniques for Playout Buffering”, IEEE Transactions on Multimedia, Vol. 2, No. 2, Jun. 2000, pp. 88-100.

[207] E. Steinbach, Y. Liang, M. Kalman, B. Girod, “Adaptive Media Playout”, Multimedia over IP and Wireless Networks, M. Van Der Schaar, P.A. Chou (Eds.), Academic Press, 2007, pp. 527-556.

[208] R. Steinmetz, “Human Perception of Jitter and Media Synchronization”, IEEE Journal on Selected Areas in Communications, Vol. 14, No. 1, Jan. 1996, pp. 61-72.

[209] R.R. Stokes, “Human Factors and Appearance Design Considerations of the Mod II PICTUREPHONE Station Set”, IEEE Transactions on Communication Technology, Vol. 17, No. 2, Apr. 1969, pp. 318-323.

[210] X. Sun, F. Wu, S. Li, W. Gao, Y.-Q. Zhang, “Seamless Switching of Scalable Video Bitstreams for Efficient Streaming”, IEEE Transactions on Multimedia, Vol. 6, No. 2, Apr. 2004, pp. 291-303.

[211] W.-T. Tan, G. Cheung, A. Ortega, B. Shen, “Community Streaming With Interactive Visual Overlays: System and Optimization”, IEEE Transactions on Multimedia, Vol. 11, No. 5, Aug. 2009.

[212] A.S. Tanenbaum, Computer Networks, Fourth Edition, Prentice-Hall, 2003.

[213] V. Varsa, M. Karczewicz, “Long Window Rate Control for Video Streaming”, Proc. 11th International Packet Video Workshop, 30 Apr.-1 May 2001, Kyungju, South Korea, pp. 154-159.

[214] Vidiator, Nokia, 3, Apple, Vodafone, “QoE Metrics”, 3GPP TSG-SA WG4 #30 meeting, Malaga, Spain, 23-27 Feb. 2004, Tdoc S4-040019.

[215] Y.-C. Wang, C.-H. Lin, M.-F. Tsai, C.-K. Shieh, “Cross-Layer Unequal Error Protection Mechanism with an Interleaved Strategy for Video Streaming Over Wireless Networks”, Proc. IEEE Wireless Communications and Networking Conference (WCNC ’10), Sydney, Australia, 18-21 Apr. 2010.

[216] Y.-K. Wang, I. Bouazizi, M.M. Hannuksela, I.D.D. Curcio, “Mobile Video Applications and Standards”, Proc. 1st ACM Mobile Video Workshop (MV ’07) (in conjunction with the ACM Multimedia Conference), 28 Sep. 2007, Augsburg, Germany.

[217] Y.-K. Wang, I.D.D. Curcio, E. Aksu, M. Hannuksela, “Refined Quality Feedback in Streaming Services”, Patent, United States, US 7,743,141, 22 Jun. 2010 (also AU 2005241687 and other states), priority date 7 May 2004.

[218] S. Wenger, U. Chandra, M. Westerlund, B. Burman, “Codec Control Messages in the RTP Audio-Visual Profile with Feedback (AVPF)”, IETF Request for Comments 5104, Feb. 2008.

[219] M. Westerlund, “A Transport Independent Bandwidth Modifier for the Session Description Protocol (SDP)”, IETF Request for Comments 3890, Sep. 2004.

[220] M. Westerlund, I. Johansson, C. Perkins, P. O’Hanlon, K. Karlberg, “Explicit Congestion Notification (ECN) for RTP over UDP”, IETF Internet Draft draft-ietf-avtcore-ecn-for-rtp-00.txt, 28 Jan. 2011, Work in progress.

[221] Y. Xu, Y. Chang, Z. Liu, “Calculation and Analysis of Compensation Buffer Size in Multimedia Systems”, IEEE Communications Letters, Vol. 5, No. 8, Aug. 2001, pp. 355-357.

[222] J. Yao, S.S. Kanhere, M. Hassan, “An Empirical Study of Bandwidth Predictability in Mobile Computing”, Proc. ACM International Workshop on Wireless Network Testbeds, Experimental Evaluation and Characterization (WiNTECH ‘08), 19 Sep. 2008, San Francisco, CA, U.S.A., pp. 11-18.

[223] J. Yao, S.S. Kanhere, M. Hassan, “Quality Improvement of Mobile Video using Geo-Intelligent Rate Adaptation”, Proc. IEEE Wireless Communications and Networking Conference (WCNC ’10), 18-21 Apr. 2010, Sydney, Australia.

[224] Y.F. You, F. Gong, D. Winkelstein, N. Hillery, “A Method of Delay and Jitter Measurement in an ATM Network”, Proc. IEEE International Conference on Communications, 1996, Vol. 1, pp. 328-332.

[225] Z. ZiXuan, B.S. Lee, C.P. Fu, J. Song, “Packet Triplet: A Novel Approach to Estimate Path Capacity”, IEEE Communications Letters, Vol. 9, No. 12, Dec. 2005, pp. 1076-1078.

Publications


[P1] Igor D.D. Curcio, Ville Lappalainen, Miraj-E-Mostafa, “QoS Evaluation of 3G-324M Mobile Videophones over WCDMA Networks”, Computer Networks, Elsevier, Vol. 37, No. 3-4, 5 Nov. 2001, pp. 425-445.

© 2001 Elsevier. Reprinted with permission.


[Reprinted article [P1], Computer Networks, Vol. 37, No. 3-4, 2001, pp. 425-445.]
�8���� ����#����>����#��$ ����#���� �- �����" 0�����#� � ����#� � ���� ��� � ����- ������� ��- �3 #��� �- ���� �� ������ �� ������ ������ � �������� �- �� *AA ���#?#������� �� ���� #��� �� �����!�� ����� #����#�����-" ���- ���� �����������- ������� ���� ������������ ��� ��� #������- ���! ��#����� �� ��������� � �� �77 3�#��#�� ���#?#��� �����53��6 #��� ������ 5�@6" 3���� ������� �#������� �:����� ��#� #���$ ������!>�!!��! ��������#� ��� ������� #���$ ����>������ ��������#� ��!����>����!����$ ��������� ���� �����!����$ ��������! �� M" �+>M" � G� $��H$ �����������- ����#��$ ��� �� ��������!#�� #������� ��� ������ ��� &"1�+ GC�H" 3��?��� ������� �� *������ � 5*�6 ���#?#���� G FBH$�!� �����#� ���� �� ���� ������� � �3 #���#����� ��� *��" G�H ���#���� �� ���� �� ��� ����������� ���� �� �������;��� ��� �� ���" *���� ����� �� *������ �+++ 5*++6 ��������-" *���"GCE$CBH ���#��� ����� #�#� ��#��� �3 � ����� � ���� �� ������ ��4������� ��� #���#�����$ ��� ��������! �����#���-"

/��� ��� !+�

���� ���� � � ���- ������ ��� �� #��� #������ ���$ ���� �������-$ �� #������- �!������� �� �3 #��� #����� � �77 ��������� ����-#��� ����" ���� �� � #�#����#��� �3 #��� ��!���� � �� ����� �� ������ ����#� �� �����������#� #����#���#�$ ��;�! �-�#���������������� ��� ������ ����#� 5<� +6 G1H" 3���#��� �� �� �3 ����#��� � �� ���- ���� ��!�� �������$ �� ���� �:����� � �����!57�3@6 �������" 7�3@ �����#�� #�����:- ��� �3 #��� ���� ��� �� #����������! #��� ����������� ���#����� � �8���� ���� � #�������! �� ��������� �!�� #����#��" 3��7�3@�������! #��� ��4���� ��������! ���#���� 5 �06 !������� ����� #����#�� ���#������ ����#��� ����� �!����!2 ���� ������ ��� �8����#� #������� � �� ����������!�� #���" M������$ �3 #��� ���� �� ��7�3@ #�#�� � �� ��� �� �#��� �� �� �����$�� #�� �� ����� � *��" GCEH"

����#� ��4������ ��� �� �3 #��� �� �� ��������� �!�� #����#�� � �!���� � �� ����



QQ������#�� �!�� ��������== 5N� 6 �� QQ�����#�� �!�� ��������== 5*� 6 � ���� ���������� ������� #�����- 5 3�6" 3�� 3�?��� � �#����� � �� ������ #�����- ��������� ������ 5<� )6 G H$ ��#� � ���� � �� #������� �����#�� �����!��" 0���������$ �� �����#��� ���#?# #������ � � �� ������������� ����! �� �3 #��� ���� � ����� �0�� �� ������$ ��� ���� � ����� �� ����#���� �� ���� ������" 3�� ������ ������#� � ���-������ ���� � �##������ �� #��� �������������#���" M"�� ��� M"��C ��� �� ����� ����� ���4������� ��� �� ��� ��� �� #����������!�� ��������� �!���� ��4���- ��#�� �� �� ��� ����#���" 3���$ �� ��� QQM�� R M��C==#������ � ���#?�� � �� �:��! ���� ���������� 5I*�6 ?��� �� <� )" �3 #��� ���� ����9- ���#���� �����$ �� �:�����$ ��� �8�����!�� �������"

/����� �����(!�(����� ����3�� ���� �����#� �!����! ��� �� �3 #���

������! � ����� ��� � ��?��� � *��" G H" � ��@ ����� �� � ���� ������$ ��#� � ����� #����� #���$ ��@ ���� ��� 5 �N76�����!�� ��� ������- ���� � #����- ���� ��������� ������ �� �� ��� ���� 5���������#�� ��#��! ������6" �N7� ��� ��?������ ���#���� ���� �8���� ������#��� � *���"G�EFC+H"

0!" � ����� �� ������ #�������� �����#���� ��� �� �3 #��� ���� �� ��� ���� � ������� ���������$ ��� �� ��� ���� #�� ��������;�� �� �������21" 3�� ��!���! ����� ���� 5��6 ����

�� ���#����� �- ����! � �)3N7 �����!� ��QQN� ==>QQ*� == �� 3�$ QQM�� R M��C== ��I*� ��� ���� ��4���� #������� � <� )"

�" 3�� �������� � ������ � �� �N7 ���������� �����!� 5 ��6 � �� ��!���!��� ��� ������� � �� ������! ������� �� ���� 57'�@> ��@6 ������"

" ���� ��#���! �N7 ��$ �� ������!��� ���� ��#� �� ���� �������� � �������� ?���� �� <� ) ��� ����� �� �)3N7�����!� � �� ������! ��"

�" 3�� ������! �� ���?�� �� #��� -������ �� I*� ?���$ ���� �����! �� ����#���� ���$ � �� ���� ��$ ��������� �- �����!��#� �� ��'' �I@0 *�)� �����!� � ����#����� ���#��"

C" 3�� ������! �� ���� ����� �� �')*3� @� ��� �I@@)�3 �����!�� � �� ��#������ ���#�� � ��#�� �� ����! �������! �� ���� ��� �������! �� �� #��������#���-"

/" 3�� �����#�� ������ � � /� �� C/ ���� ��!�#�� #������ ������ �� ����� ���� �������!�� ������ �� #��� � ���� ��� QQN� == ��QQ*� == #����#��$ �����#���-"

0!" �" ��������� ���#����� ��� ������������� �3 #���"



3�� ��� �� �������� ��� ����-�� � ����!���! �� ���� �� ������ �- � �� �� �N7�����!�� F ������� #������ �����!� 5���6$ #������#����! �����!� 5�7�6 ��� �������! �����!�5�@�6 � �� ���� ������ ��� �- ��''7*I�))� @�$ �')*3 @� ��� �I@@)�3�����!�� ������ �� ��!���! ��� ��� ����!���! ��"

0�����#� � ����#� 5 "1 �M; ����6 �� �����#����#�� ���� � QQN� >*� == #����#�� ����� ����� ��� �� ���?#���� � �� �:��!���#?#���� ��� ����������� �� ��� 7'�@��� ��@" @� ��� �� �:��! ��@ #�#�������� �� ��� �� ������#�$ ��� �� �!� ������ #������! ����#� ��4������� � �� ���� ���"0�� �� ������$ �� ��#��� ��� �� � �#���� ��������#� ���� �� QQN� >*� == ������ � �� *AA���#?#����"

3�� ���#��� �� #��� ���� ������ ���!��������� ����" M������$ � � ���� #��� ������� ��������� ��#� �� ������ #����$ ����#���� #��#��!$�#�� #��#�����$ ����!$ #���!�!$ �#"$ ����! ������� ��#� ���� #�����:"

/����� ��� ��5��5��9 )6*�3�� ����: � �� M" �� ��?��� �� �:�����

����$ ��#� ������� ������� � �� ��@#�#�� � � ���� ���!�! ���� C/ ���� �� �1A�+ ����" 3�� �:����� ���� � ����� ��M" ��> " *��" G�H ��?��� �� ��������!������ ����� �� ��� ������� ��� #�����������! ��@ ������� � C/ �� /� ����"M���$ ��@ ����� ���������� ��@ 5@� ��@6" 3�� ��@ ����F������ �!����! ���?��� � *��" GC1H$ ���� ��#� �� ����� ���������#� �!����! G H � ����- ������" 3������� �� ���� �� � ����� �3 #��� ������! �� ��@ ���"

3�� #��� ���� �����#��� ��� � ����� �3#��� ������! �� ��@ ��� ��� ����� � �������#��� ���#���� ����� ��� � ������������� �3 #���" 3�� #���! �� <� ) � �� �:��#�- �� ���� � �� ����� ���� �����#� �� ��� ��@ ����F������ �����#�" 3���� � ������#��� ���#?# #������ ��� �3 � �� ������������ ��-�� 1 ����#�� 5N '176 ?��� �� �� ��@ ����F������ �����#� <� )" ��������$�� ������� ?��� �� ��� ��-�� #������-

5''�6 ) ��� ���� � ����� ����#��� ���#?##������" 3��������$ �� �����! ������ ��#������ ��� �� #����������! #������ �I*� � �� <� ) �� �� ����� ���� �����#� ������� ��� ��������! �� ��@" ���� ����� �����#��� ��� � ����� �3 #��� ������! ��@ ��� ��� �� ����� � � ������� ?!��� ���� �� ����� �������� 9�� � �� ����� �0!" �"

3�� #�������� ���#����� ��� �� ��@ ��!����� ��� ����� ������� �3 #��� � ����� ��� ��@ ��!���� ��� ����� ������� #���$�:#�� �� ���#���!� �� ��!���� ��� �������� ����"

3�� ����: � �� M" �� ������ �� �� ������� ��� &"1�+ � �� M" ��> ������ � #���������-" 3���$ �� ������ ��� &"1�+ � ���> �0 ����� ������ � #������ �� #������- �� ��M" ��> ������" M������$ �� ���� ��� �� �����!���� ��� � ��� ��� �� �� ���� *� ���� � �77 3�� �@"

� ��������� ��!��������

���� ����! ����-;�� �� ��#��#��� �� �� ��� ������� ������� ���� � ����� �������� ��� #��� #����� �����$ �� ���#��� �� ���������� � ��#� �� #����� �� �� ����������� ��� �������� �� ����� ���������� ��������� �������" 3�� ������� ������� ������ � 0!" "

3�� ?!��� ����� ��� �� ��� ��� ��������� ���#����� ����! � �������" 0�� �� ���� ������#- �� ��!��� ����� ���- ������- �������� ��� #�#����#��� #����#���"

3�� ��� ��������� ���� � �� ����������� ������;�� � 3���� 1" I�� M"�/ � #���# ��� ����� ������ ���������� ��� � ����������#� � ���� ��� � ������� M"�/ ���� #���#5"�" � #���# ���� ��- ������ #���! �������� ������� ����� ������#� �#��4���6" 3����� ���� ����� � !��� � �� ���� #���# ���#��������� SN& ����� � �� ��� �� + ���"&��� ������ ��� � �� 0 �;� 51E/ � 1�� �:���6" � ��� ��������$ �� ���� #����� � ����� ���� ��4���#��2 ��-�$ ��#� � � -�#�������������������� �������� ��4���#�$ ���



��������$ ��#� � � �����>�!������ ���4���#�" 3���� ��4���#�� ���� ����#�� ���� � ���� ������� �� ��4���#�� �� ���� ���� ����!�� ���������� �� �� M"�/ �������" �������� ����� �� �� �� #��#������ ���� ��!�����4���#�� �� ��� ���!�� ��4���#�$ ��#� ����#����- ����" 3�� ��� ���� ��#���� ��������� ������ �� �� ���� ��4���#�� ����� �������� �� ����� #������� � �� �� �� �����������"

� ������ � ��#� �� ���� ��� ���������������� ��� ���������� �������� ������ �� ���� #���#$ ��#� �8�# �� 4���- �� �����" 0�� �:�����$ �� ��� #����� � �� ������#���� #�� �� ��(���� � !�� �!��� �����4���- � �� �:����� �� �� ����� ���$ ��� �#������" 0�� �� ������� �� �� ���-$ �� ���� ���������� �� ���#����#��! ��!���� ���#����� *��" G�CH ��� ���� ������" 3�� �������!������ #���! ����� ���� ���� � �� M"�/ �#���#2

• advanced prediction mode (Annex F);
• advanced intra coding mode (Annex I);
• deblocking filtering mode (Annex J);
• modified quantization mode (Annex T).

� �� ������! �� ��� ������$ ���#���� ����#� ��� ���� ������ ��� ������:�����! M"�� '���� �" ���� ��$ � �����#������ � � ���#?�� � ���$ ����� ����� ���<)* � ������� �- �(�#�! ������ �� ��������:�� � ����� �##����! � �� ���#?��� ����� �����" ��� �� ����� ���#?# ���������� ���� ���� �� �##��� ���� !�������! �� � ����� ������ 5��� 3���� 16" � ������ ���$ ������ �� ��� ������ ���������:�� �� ��#���� � ����� ��� ������������#� ��� ���� ��� � �� �����#�� ��#�����"3�� ��#���� ���� � ��� ����� � SN& �������� ��� ��������"

3�� ����#�� #������ � ���� ��� � ��� /����� ��� ����� ������ �� ��� C+ ��>�" 3��#����#�� -��� ��� ����-;��2 �� ?�� � ����������� ��� 7�3@> ��@ ������� 5���� ��� ���������� � �� �����(!�(����$ �� �3'6$ ��� ����#��� � ������ �� ����� ������� 5���� ����� �������� � �� �����(!�(����� �� �3�6" ���� �� ����� �� ��$ ��#� �� ����� #������� �������$ ���� � �� �8�#�� #��� #����� �!�����!" 0�� �� ������ �� ������� �� � �3'#����#�� � �� ���� ��� ���� 7�3@ �� ��@������! ���"

������� #������ #������ ���� ���� � ������ � ������� ����#���� ��� ��������� ��#������" 3�� ����� ������ ��� �� ��������#������� �� �� ���� ��� �3' #���� �������#�- �������� � �� 3N �� ��" 1 �� ������ ���� ����� ������ � ��� �3' ��� �3�#����" � �� ?�� #��� ���- ��� ���� ��� ����������$ ���� � �� ��#��� #��� �� ���� �������� �������2 ��� ����� ������ � ����� ������� ��� � ���� ����$ ��� �� ���� ������������� � ���� ���� ��� �� ���� ����� �������" 0�� �� ������$ � �� �3� #���� �� ������ ������ ���� �(�#�� � �� � �������#� 5����! ���� �8����$ ������ ������6$��#� ��� ��#� ������� ���� ���" 3�� ������

1 ��2>>�������"�#��"#��>��#���>'<������>�����T�����"

0!" " ��!��� �� �� ������� �������"



�� � ����� �� �� �������� �� �� #������� � ����� ��� ������� ��� �����" ������!� �� ��- �� ����-� �� �� #���$ ������� ����� ������ !�� � !��� ����� �� ������ �����"

�� ��������� 1+ ������� ���� ��� ��#� ���������� ?��$ ��#� �� ����! ���� � �8���������� ����� �� �� ?��"

" #�����$ �% ���!��� �������

�8���� ���#� ��� ��� �������� ���� ������� �������! �� ���������#� �� ���� ���������� ���� ����� �������" 3�� ��� ����-���� ��(�#�� ���# ��� ���� 4���- ���������� �� 7�@* G1+$1/$�/H$ ��#� � ������� �#����� �� �� ���(�#�� 4���- �����������G�+H$ �� � �� ����� � ����- ����! �� ���#���������" �� ���� �� �������! ������� ���#�����! �� 7�@* ������ ��!��� �����#���� ������2

PSNR = 10 \log_{10} \left( \frac{255^{2}\, M N}{\sum_{m=1}^{M} \sum_{n=1}^{N} \left[ x(m,n) - \hat{x}(m,n) \right]^{2}} \right)
where x(m, n) and x̂(m, n) are the pixel values of the original and decoded frames, respectively; the measure is computed on the luminance and chrominance components of the frames (YUV 4:2:0 format is used). All frames are in QCIF format, with M = 176 and N = 144. The PSNR is expressed in dB and is calculated for each decoded frame of the video sequence. The average PSNR value for a single stream is

3���� 1

������� ���������

����#� 7�����#���� ��* ����#� ����� �� �����!� � ��� ��

�"A ���� 5����#� ���������� � ����6

&��� #���# M"�/ � ��� ����� ��� + ���

0���� �;� �� 0 51E/� 1�� �:���6

I�!��� ���� ��4���#�� ��-� 5 ++ ������$ 1+ �6

�������� 5 B� ������$ 1�"E �6

���#������ ���� ��4���#�� ��-� 5A++ ������$ + �6

�������� 511�/ ������$ B"1 �6

����� #������ � ���� � ��� /� ����

����� ������ ��� C+ ��>�

<)* � ���� �3' ��>�2 B)� +C ��� )� +�

� ���� �3' C+ ��>�2 C)� +C ��� �)� +�

� ���� �3� ��>�2 ��B)� +C ��� �� )� +�

� ���� �3� C+ ��>�2 ��C)� +C ��� ���)� +�

/� ���� �3' ��>�2 E)� +C ��� �)� +�

/� ���� �3' C+ ��>�2 /)� +C ��� �)� +�

/� ���� �3� ��>�2 ��E)� +C ��� ���)� +�

/� ���� �3� C+ ��>�2 ��/)� +C ��� ���)� +�

0��4���#- 1A�+ �M;

��� ��� �"+A/ �#��

3��������� ���#�� N����

��������! ���� �+ ��

����! 1� ���� 3���� #���$ � ����

������ �� ��#� ����� ����� 1B+ �



obtained by averaging the PSNR values of all the decoded frames of the video sequence. In addition, since 10 runs were performed for each simulation, the overall PSNR is computed as the average over the 10 runs and reported in the tables (see next sections) together with its standard deviation (also expressed in dB).
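As a concrete illustration, the per-frame PSNR and its aggregation over a sequence and over the simulation runs can be sketched in plain Python. This is a minimal sketch (function and variable names are my own, not from the original testbed), assuming 8-bit frames passed as flat lists of pixel values:

```python
import math
import statistics

def frame_psnr(orig, decoded):
    """PSNR in dB between one original and one decoded 8-bit frame,
    each given as a flat sequence of pixel values."""
    mse = sum((o - d) ** 2 for o, d in zip(orig, decoded)) / len(orig)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(255.0 ** 2 / mse)

def sequence_psnr(orig_frames, decoded_frames):
    """Average PSNR over all decoded frames of a video sequence."""
    values = [frame_psnr(o, d) for o, d in zip(orig_frames, decoded_frames)]
    return sum(values) / len(values)

def run_statistics(psnr_per_run):
    """Overall PSNR over the simulation runs: (average, standard deviation)."""
    return statistics.mean(psnr_per_run), statistics.stdev(psnr_per_run)
```

For a QCIF frame (176 × 144 pixels) a single corrupted pixel of magnitude 16 still yields a PSNR of roughly 68 dB, so realistic channel error patterns dominate the measure quickly.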

The loss in video quality due to transmission errors is also evaluated. It is computed as ΔPSNR (expressed in dB) by the formula

\Delta PSNR = PSNR_{decoded} - PSNR_{ideal}

where PSNR_{ideal} is the PSNR of the ideal (error-free) video sequence, and PSNR_{decoded} is the PSNR of the decoded sequence.
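In code form the ΔPSNR quality-loss measure reduces to a single subtraction; a sketch (the function name is my own):

```python
def delta_psnr(psnr_decoded: float, psnr_ideal: float) -> float:
    """Quality loss in dB of the decoded stream relative to the error-free
    ('ideal') stream; negative values indicate degradation from channel errors."""
    return psnr_decoded - psnr_ideal
```

For example, a decoded-stream PSNR of 34.2 dB against an error-free PSNR of 35.5 dB gives a ΔPSNR of about −1.3 dB.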

�� ��� �� #������ �� ��������� ����-���� �� #������$ ��#� #�� �� #�������� � ���������� �� �� ������" �����$ �� ���� �������� � �������! �� 5��� ���$ �##����� ��� ������! ���" �� ��?�� �� ����- ��� ������� � ��

d_{i} = t_{i} - s_{i}

����� � � �� �� ���� �� ���� ��#���� ����� ��#��� �� ����� �$ ��� � � �� �� ���� ����� � �� �� ����� � � ������:�� �- �� �������:��" �����������!�-$ �� ����� �� �� ������ �� ���� ��#���� ?����� �� ��#���! ���� ����� �$ ��� �� �� �� ���� �� ?�� � ���� ����� � � ������:��" 3�� ������� ����-������ ��� �:������� � �����#����"

������ �� ��#�� �� � ���##�$ �� ��#���!�� 5"�"$ �� � �6 ��� �� ������:�! �� 5"�"$� � ��6 #�� ������� ��#� ���� 5"�"$ �� � ��6"������ �� ���������! �##��� �� ��$ ������� ����� �� ��� ������� ������ �� ��#���� ��� ��������:�� � ���������" 3� �� ���� �:�#$ �����������! �##��� �������� �� ��#���� ��������� ��#���� ��� 5�� �� ����� �6 � �� �������:�� ������ ?����! �� ��#���! �� �� ����� �"
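One possible bookkeeping for the per-frame delay measurements described here can be sketched as below; a minimal sketch, assuming per-frame timestamps are collected at the multiplexer input, the demultiplexer output and the decoder output (all names hypothetical):

```python
def frame_delays(mux_times, demux_times, decode_times):
    """Split the delay of each video frame into its demultiplexing and
    decoding components (all times in milliseconds).

    mux_times[i]    -- instant the first bit of frame i is multiplexed
    demux_times[i]  -- instant frame i leaves the demultiplexer
    decode_times[i] -- instant the decoder finishes decoding frame i
    """
    demux = [d - m for m, d in zip(mux_times, demux_times)]
    decode = [c - d for d, c in zip(demux_times, decode_times)]
    total = [a + b for a, b in zip(demux, decode)]
    return demux, decode, total

def average_delay(delays):
    """Average delay over all frames of a stream."""
    return sum(delays) / len(delays)
```

Note that if demultiplexing and decoding of consecutive frames overlap in time, the two components measured this way need not add up to the wall-clock span of the run.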

�� ���� �� ����� ���$ �:������� � ������ �����#��� 5���6$ �� � ������� �� ������� ����4���- "�"$ ��������� �� ����"

3�� ��D# ���� � ������� #������ #�� #���!� ��� ���#�� �� �� �� ���� �� �� �� ��#��� �������� �� �� ����� ������ � �� #������� ����������" ��������$ ��� � ��������� � �������� �D#�� � ����� �� �!���� ���"

�8���� #������ #������ #�� ����� � ��� ��!��� � ���$ �� ����#��� ���� ���$ �� #���� #������� �� �� ��:��� ������� ������" ����#� 9�:��- � �� ������ ���� ������ ���� ���� ��������" �� ��?��

���� � ���#����� � ����������

����� ���#����� ��� ���������� ��� ������ �� ������ �� #����� ��� �� ������� ����#�$ ��� ����:������� � ���� �� 7�@* 5����7�@*6$ ��������- 5��������-6 �� ����� ��� 5�������6"

The multiplexed stream contains audio data, video data and the multiplexer overheads. The latter is generated by the H.223 multiplexer and is composed of headers, CRCs and sequence numbering control fields. Additionally, the multiplexed stream contains H.245 control channel overheads, which are counted as part of the multiplexer overheads. To evaluate how much bandwidth the audio, video and multiplexer overheads fill in the transmitted streams, we measured their shares. In all conditions, the following equation always holds:

B_{Multiplexed} = B_{Audio} + B_{Video} + B_{Overhead}
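Assuming the multiplexed stream size decomposes as B_multiplexed = B_audio + B_video + B_overhead (byte counts; the function and parameter names below are mine), the share of each component can be sketched as:

```python
def bitrate_shares(bytes_mux, bytes_audio, bytes_video):
    """Percentage of the multiplexed stream occupied by audio, video and
    multiplexer overhead (header, CRC and control bytes)."""
    overhead = bytes_mux - bytes_audio - bytes_video
    return tuple(100.0 * b / bytes_mux
                 for b in (bytes_audio, bytes_video, overhead))
```

For instance, bitrate_shares(1000, 150, 750) returns (15.0, 75.0, 10.0), i.e. a 10% multiplexer overhead.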

Finally, subjective video quality is rated using the following scale: 1 is poor, 2 is satisfactory, 3 is fair, 4 is good and 5 is excellent. In the next section, results using the above approach are presented.

Simulation results

):�������� ������ �� ��� �������� ��� �������� � ���� �� 7�@* ������ ��� ��� ��������������$ �7�@*$ ����$ �����!� ��� ����������� ���� ����-$ ����� ��� ��� ������� ����!�" �� ������- ������� ������$ � !������ ������� ���- ���������� �������� ����� � � �� ��� ������ 5����#���- � �� M"�/ � #����#6$ ��#� ���� �8�# �� ���� 4���-" *�������� ���#���� � �� �������!"

7�@* ������ ��� �� ��-� ��� �������� ���4���#�� ��� ������;�� � 3����� �FE ��� 0!�" �FE" 3����� � ��� ���� �� �����!� ��� ������������� ������ �� �� 7�@* ������ ��� ��#����#�� -��� 5�3' ��� �3�6$ �� � ����



5 � ��� /� ����6$ �� ����� ������ 5 ��� C+ ��>�6 ��� ���� <)*� ��� ��#� #����#�� -��" 7�@*������ ��� ���������� ���������� ��� ���� ��������"

� !������$ ��#� � �� �3� #��� �� �!��� ��8�#�� �- ������ �#�$ �� ������ ������ �����!��- ����� ��� ���� � �� �3' #��� ���"

����! �� �:������� �� ����� �� �� �� �����#� �� ��������� ������ ��� ��#� ���� ������� �� �������� ��4���#�$ ��#� � ���� �D#��� #��#��� �� #������� ���#�� ���� ���� � � ���� ����" 3�� � ��� � �� ��# �� ����� #���#������ �#��4��� ���� ��� ��� ���� �� �������� ����� � #��#��� �� #������� ��� � ��

3���� �

���� �� 7�@* 5� �<6 ��� ����#� 9�:��- 5��-� �3' #���6

0��� � ����

��>� C+ ��>�

B)� +C )� +� C)� +C �)� +�

3� � ���� ��>� B) � +C F +"+/ �+"+� +"+E

)� +� �+"+/ F �+"11 +"+1

C+ ��>� C)� +C +"+� +"11 F +"11

�)� +� �+"+E �+"+1 �+"11 F

/� ���� ��>� E) � +C 1"E 1"EA 1"/B 1"B+

�)� +� 1"/C 1"E1 1"/1 1"E�

C+ ��>� /)� +C 1"E� 1"B+ 1"/A 1"B1

�)� +� 1"E� 1"EB 1"/E 1"EA

3���� �

�����!� 7�@* ������ ��� ��-� 5� �<6

� ���� ��!" �"�" /� ���� ��!" �"�"

)���� ���� C"C+ +"+C )���� ���� E"�/ +"+E

�3' ��>� B) � +C C"�� +"+A �3' ��>� E) � +C E"1E +"+A

)� +� C" B +"11 �)� +� E"+A +"+E

�3' C+ ��>� C) � +C C"�A +"+/ �3' C+ ��>� /) � +C E"1B +"+E

�)� +� C" E +"11 �)� +� E"1/ +"+A

�3� ��>� ��B)� +C C" E +"11 �3� ��>� ��E)� +C E"1A +"+�

�� )� +� C"1� +" 1 ���)� +� E"+ +"1E

�3� C+ ��>� ��C)� +C C" � +"�B �3� C+ ��>� ��/)� +C E"1E +"+C

���)� +� C"1A +"�� ���)� +� /"A� +"1A

3����

�����!� 7�@* ������ ��� �������� 5� �<6

� ���� ��!" �"�" /� ���� ��!" �"�"

)���� ���� +"C+ +"+ )���� ���� �"1� +"+�

�3' ��>� B) � +C +" 1 +"� �3' ��>� E) � +C �"+/ +"+E

)� +� �A"A� +"�B �)� +� 1"A +"1

�3' C+ ��>� C) � +C +" A +"1 �3' C+ ��>� /) � +C �"+ +"1+

�)� +� +"+A +"1+ �)� +� 1"BA +"+B

�3� ��>� ��B)� +C +"�+ +"�C �3� ��>� ��E)� +C �"+1 +"+E

�� )� +� �A"A+ +" ���)� +� 1"/C +"1A

�3� C+ ��>� ��C)� +C +"�� +"�B �3� C+ ��>� ��/)� +C 1"BA +"1�

���)� +� �A"CA +" � ���)� +� 1"/� +"1A



#����� �����" 3�� ���� ����$ �� ���� �8����� ���� �� ������ ��� ���� ��#� ����" 0�� ��������$ �� 7�@* ������ ��� �!��� ��� ����������� ��� ��-�"

3�� ������� ������ ���� �� � �� �3' #����� ��:��� 7�@* ���� � +"1 �< ��� ��-� ���+"/ �< ��� ��������$ ���� � �� �3� #��� ����:��� ���� � +"� �< ��� ��-� ��� +"A �< ���

3���� C

���� �� 7�@* 5� �<6 ��� ����#� 9�:��- 5��-� �3� #���6

0��� � ����

��>� C+ ��>�

��B)� +C �� )� +� ��C)� +C ���)� +�

3� � ���� ��>� ��B)� +C F +"� +"+C +"1B

�� )� +� �+"� F �+"1A �+"+C

C+ ��>� ��C)� +C �+"+C +"1A F +"1�

���)� +� �+"1B +"+C �+"1� F

/� ���� ��>� ��E)� +C 1"B� �"+C 1"B/ �"++

���)� +� 1"// 1"BA 1"E1 1"B�

C+ ��>� ��/)� +C 1"B+ �"+ 1"B� 1"AB

���)� +� 1"C/ 1"B+ 1"/1 1"EC

3���� E

���� �� 7�@* 5� �<6 ��� ����#� 9�:��- 5�������� �3� #���6

0��� � ����

��>� C+ ��>�

��B)� +C �� )� +� ��C)� +C ���)� +�

3� � ���� ��>� ��B)� +C F +" + �+"+� +"/1

�� )� +� �+" + F �+" � +" 1

C+ ��>� ��C)� +C +"+� +" � F +"/�

���)� +� �+"/1 �+" 1 �+"/� F

/� ���� ��>� ��E)� +C 1"B1 �"11 1"EA �"�

���)� +� 1"�C 1"EC 1"� �"+E

C+ ��>� ��/)� +C 1"/A 1"AA 1"// �" +

���)� +� 1"�� 1"E� 1"�1 �"+C

3���� /

���� �� 7�@* 5� �<6 ��� ����#� 9�:��- 5�������� �3' #���6

0��� � ����

��>� C+ ��>�

B)� +C )� +� C)� +C �)� +�

3� � ���� ��>� B) � +C F +" / �+"+A +"��

)� +� �+" / F �+"�C �+"1�

C+ ��>� C)� +C +"+A +"�C F +" 1

�)� +� �+"�� +"1� �+" 1 F

/� ���� ��>� E) � +C 1"EC �"11 1"// 1"AE

�)� +� 1"/� 1"AA 1"C� 1"B�

C+ ��>� /)� +C 1"E� �"+B 1"/ 1"A�

�)� +� 1"CB 1"AC 1"C+ 1"B+



��������" ������� ������� �� 7�@* ��������� ����-� ���� �� �4��� ��� +" �<"

3����� � ��� ��� 0!�" / ��� E #���� �������� ��� #����#��� � /� ���� ��� �� �3'��� �3� #����" 3�� ������ ������ ��� ���3' #��� ���� � ��:��� �7�@* ����� �� +"��< ��� ��� ��-� ��� ��������" � �� �3�#���$ �� ��:��� �7�@* � +" �< ��� ��-���� +"C �< ��� ��������" ������� ������� ��7�@* ������ ��� ���� � +"� �<"

*����� � �� #��� �� ����#� 9�:��- �������� � 3����� �FE" 3���� ������� #���!�� ��� #������ ��� �-�����;��2 � #���!� � ��#������ #����� 5"�"$ <)*6$ � #���!� ��� ����� �� �� ����� ������ ��� � #���!� ��� � ��� �� �� #����#��" *����� � �� ������ ���� �� ����7�@* ������" � ����� �������#��� � !�� � 4���-$ ���� � ��!��� �������������� � ���� � �� ���" 0�� �� ���� ������#-$ ������ ��� ������� ��� #���!�� ���� � � � ���� 5�8���� <)* �� �����6 �� ���� �� /� ���� 5�#����! �8���� <)* �� �����6"*����� ��� #���!�� ���� /� � � ���� ��� ����-#�������� �- �����! �� �!� �� �� ������������� � �� �����"

3���� � ����� �� ��� �� � ���� �3' #�����#��� �� �� ��-� ��4���#�$ #���!�! ��#������ #����� �� �� ����� �� �� ����� ������� !��� � ������ � 7�@* � � ���!� ��G�+"1$ +"1H �<" � �� � ��� �#������ � /� ����$�� !�� � 4���- � � �� ���!� �� G1"/$ 1"BH �<" � �� �3� #��� 5��� 3���� C6$ � #���!� �#������ #����� �� ����� � � ���� ������ � ������� � � ���!� �� G�+"�$ +"�H �<$ ���� ���#����� � �� � ��� ������� �� 4���- ����1"/ � �"+ �<"

3����� / ��� E ���� ����� ������ ��� ���������" � �� �3' #��� � � ����$ �� ����7�@*

� ������ �+"C ��� +"C �<$ ���� �� ��!���� ��� � ��� ����� �� ���� 4���- ���� 1"C � �"1�<" � �� �3� #���$ �� #����������! ���!����� G�+"/$ +"/H �< ��� � ����$ ��� G1"�$ �"�H �< ����� ��!���� � /� ����"

0���� ���� ��� ������� � 3���� B" 3���� ������� ������ �� ��� �� M"�/ � ��#���� 5�����#���- �� ��� #�����6 � ��������� ��� ��� ��

Fig. 4. Average ΔPSNR for MTL connections at 32 kbit.

Fig. 6. Average ΔPSNR for MTL connections at 64 kbit.

Fig. 5. Average ΔPSNR for MTM connections at 32 kbit.

Fig. 7. Average ΔPSNR for MTM connections at 64 kbit.



��������� ��� #�����" 3�� ������ ��� �� �3'��� �3� #���� ��� �8���� <)*� ��� ������ ������- �����" 3���$ �����!� ������ ��� ����� � �������" ��#� �������� ��� � �!��� ����� �� ����� ��� ��-�$ �� ����� ���� ��� �������� �����#����- �����" 3�� ������ #�� ��� �� ��� ������ � ���$ � #���!� � �� ���� �����#�� �����#�� 5�� �#�����6 �� ���� 1"EF1"B ���" � ����!���� � �� � ��� �##���$ �� ����� ��� #���� �#������ ���� � ���"

����- ������ ��� �������� � 3���� A" 3�������� ������� � ����- ����� #�� ?�� �������� �������� � *��" G�AH" �� � �� #��� ������� ��� ������$ �� ����-� � �� �3' ����3� #���� � �8���� <)*� ��� ������ ��������" 3���$ ���- �� ����������� �����!������� ���� ���� �������" � � ����$ �� �����!��:�����#�� ����- � �1� �� ��� ��-�" ������������� � � �����>�!������ ��4���#�$ �������!� ����- ����� � ���- ���� �+ �� �!������ � ��-�" � /� ����$ �� ����- ������ ��� ���� ��4���#�� ��#����� ��������- � �� ���������$ �����! �� �� ����- � ��������� ������ ����� �� ����" M������$ �� ���� � ����!�-�������� �� ��� ���������� �� �� �� ���������"

3�� ������ ��� �� ����� ����- ������ � �����������! �� �� ��#���! ��� ������:�! �����-�$ �� ���#���� � �� ������� ��#��" 0�� ��/� ���� #���$ �� ��#���� ��� #����� � ���� �!������ ������ �� ��� �����:����- �� ��������� �� ��$ ��������� ���� �� ����� ��

����" 3�� ������:�� ����- ������� ����- ���� ����� �� �� � � ����� ���$ ���$ � 4��#�����" 3�� ��#���! ����- ������� �� �� ��������� ���$ ��#� ������� �� �� ���� � ��� ����� ����� �� ����" ��#������ ����� ��� #������� �#����� � �� ��#���! ����-" M������$ ���#����� � �������- ����� ��� ��� �� �� ��#��8�# �� �� ��� ����-$ ��#� �� ������:�!����- #����� ���� /+U �� �� ��� ����-" 0����-$�� �#������ ��#���! ����- � �8�� �- �� ��#�� �� ������:�� #�� ��� �� ������:�!������ 5"�"$ �� �� ����-� ������� ����6 ����������� ��� ��� ��-�"

<���� �� 3���� A$ � #���� �� � �� � ���#���!�� ���� � � /� ����$ �� ��������- � �����C+ �� ��� ��-� ��� ���� �E+ �� ��� ���������"

� 0!�" B ��� A$ �� ���������� ����- ���������� ���� ������� ��� � ��� /� ���� #����#���"�� �:�� ����-� ��� ��-� ��� �������� � ������ �� �� !����� ��� ���� ���������!" 3��������$ ����#���- � /� ����$ �� � ��� �� ��������� �� ���� ����- � �� !��- #���������� �� ���� �� �� ��4���#�" � �� ?!����$����-� ��� �� ?�� ������ ���� ���� ��!��#��"3���� ��� �C+ ��� � �� ��� ��-� 5 � ��� /�����6$ ��� �����#���- �B1 ��� �1� �� ��� ���������"

� 0!�" 1+ ��� 11$ �� �����!� ������� ���!�5� U6 ��� ����$ ����$ ��� ������:�� ��� �������� ��� �8���� � ����" 3�� ���� ��� ������ #����� � �"A ���� 51CU ��� BU �����#����- �� �� ������� �������6" 3�� ����� ��������� ��� ���� � ���� �� ��� C�"� ����5ECU ��� B�U ��� � ��� /� ���� #����#���6"3�� �������� #����� �- �� ������:�� � ������#���- ���� "1 ��� �"A ���� 51+U ��� BU ���� ��� �������6"

� ����� � ��� ��(�#�� �����������$ ������ ��4����� � ��� �� ��� #�����!��� � ������ ������� 4���- �� �� ��#���� ��4���#��" 3�� ����#� ���� ���� ��?��� � ��#�� /" 3�� #��������#���� ��� � ��� �� �� ��(�#�� ������" 0���:�����$ � �� �������� � ��� �#������ ���� �� /� ����$ �� 4���- �� ��-�$ �� �� �����!����� ��� �� <)* #����$ ����� � �#����� ���� "15���6 � �"1 5!���6" 0�� ��������$ �� �����!��#����� � 4���- � �� �� ������2 �� 4���-

Table 9

Average delay values for different bitrates for MTL and MTM calls (in ms)

� ���� /� ����

��-� �1� 1//

�������� � C 1/�

Table 8

Average frame rate for different bitrates for MTL and MTM calls (in fps)

� ���� /� ����

��-� E"B 11"A

�������� /"+ 1+"�



�#������ ���� 1"B 5����� �����#��-6 � �"�5������ �����#��- ��� ���6" M������$ �������� ��- ������$ �� 4���- �� �������� �/� ���� ���� �� ���#� �� �� ��-� � � ����"3�� �8�#� �� ��������� ������ ���� ������-

�:�����#�� ���� ������! ��� �������� �����-�" 3� ������� �� ���(�#�� ���� 4���-$���� ������ ��!�� �:��#�� ���� �� ��#������4���#�� ��� ����� � 0!�" 1�F1A" 3�� 7�@*

Fig. 9. Cumulative delay for 64 kbit.

Fig. 8. Cumulative delay for 32 kbit.

Fig. 11. Average bitrate usage for 64 kbit.

Fig. 10. Average bitrate usage for 32 kbit.



0!" 1�" ��-� /� ���� �3' <)* � �)� +�"

0!" 1C" ��-� /� ���� �3� <)* � ���)� +�"

0!" 1/" �������� � ���� �3' <)* � )� +�"

0!" 1E" �������� � ���� �3� <)* � �� )� +�"

0!" 1B" �������� /� ���� �3' <)* � �)� +�"

0!" 1A" �������� /� ���� �3� <)* � ���)� +�"

0!" 1�" ��-� � ���� �3' <)* � )� +�"

0!" 1 " ��-� � ���� �3� <)* � �� )� +�"



������ �� �� ������ ��!�� ��� �� #���� � �������!� 7�@* �� �� #����������! ����� #��� ���������" �� ���� #����� � ���� �� ��!�� ������� � �� ���� <)* #���� ����! �� ���� �����-;�� ��� ��� ��� � ����� �� ��>�" 0�� �� ����<)* #���� �� ���� 4���- �������� ��� ��������"

Conclusions

Overall, the presented simulation results showed that mobile multimedia terminals working at 32 and 64 kbit over mobile networks are performing quite well. We have analyzed the behavior of MTL and MTM connections at different bitrates and under different channel conditions (i.e., BERs). In the case of a change of the channel conditions at the same bitrate, the loss in PSNR is always in a range of [−0.6, 0.6] dB, and video connections work at BERs of the order of E−04. When the channel capacity is increased to 64 kbit, the quality gain in terms of PSNR is always in the range of 1–2 dB and the frame rate increases, approaching 10–12 fps. For the same cases, the delay increases from 50–70 ms to 165 ms. Subjective audio-visual performance (using the five-grade scale) is between satisfactory and fair at 32 kbit, while double capacity (64 kbit) brings user satisfaction between fair and good. Evidently, higher bitrates such as 128 or 384 kbit are needed if good quality with larger picture sizes (e.g., CIF) has to be achieved in the future.

Acknowledgements

The author wishes to express his gratitude to the anonymous reviewers for their valuable comments to improve the quality of the paper.

References

[1] 3GPP TSG-SA, Circuit Bearer Services (BS) Supported by a Public Land Mobile Network (PLMN), 3GPP TS 22.002 v3.6.0 (Release 1999), 2001-03.

[2] 3GPP TSG-CN, Circuit Switched Multimedia Telephony, 3GPP TR 23.972 v3.0.0 (Release 1999), 2000-03.

[3] 3GPP TSG-CN, Mobile Radio Interface Layer 3 Specification; Core Network Protocols – Stage 3, 3GPP TS 24.008 v3.7.0 (Release 1999), 2001-03.

[4] 3GPP TSG-SA, Mandatory Speech Codec Speech Processing Functions; AMR Speech Codec; General Description, 3GPP TS 26.071 v3.0.1 (Release 1999), 1999-08.

[5] 3GPP TSG-SA, Codec for Circuit Switched Multimedia Telephony Service; General Description, 3GPP TS 26.110 v3.0.1 (Release 1999), 1999-08.

[6] 3GPP TSG-SA, Codec for Circuit Switched Multimedia Telephony Service; Modifications to H.324, 3GPP TS 26.111 (Release 1999), 2000-12.

[7] 3GPP TSG-CN, General on Terminal Adaptation Functions (TAF) for Mobile Stations (MS), 3GPP TS 27.001 v3.8.0 (Release 1999), 2000-03.

[8] 3GPP TSG-CN, General Requirements on Interworking Between the Public Land Mobile Network (PLMN) and the Integrated Services Digital Network (ISDN) or Public Switched Telephone Network (PSTN), 3GPP TS 29.007 v3.8.0 (Release 1999), 2001-03.

GAH 0" ���#�$ �" ��������$ M" ����$ ������� �������

��� ��:�!������� ����� #�����#���� �-����$ )))

������#���� ��!�;�� �/ 5A6 51AAB6 C/F/A"

G1+H �@� 31"B+1"+ �1AA/$ �!�� 3������� �� I�����-

&��� �!���� F 7�������� ��� I�(�#�� 7��������#�

���������$ 1AA/"

G11H �" <������$ )" <�����$ � ���������� ���� �����#� ���

��� !������� ����� �-����$ ����� @������ �����

#���� � 51AAE6 1AF�A"

G1�H '" <�����$ 0" <��#��$ &" 7�����$ �" *�����$ N�3�

�:�������� �-��� � ��- F ?�� �������� �� ������

�� ����#�� � � �� !������� ����� �-���$ 7��#����!�

�� /� ��������� �������� �� ����� �������

������#���� 5�����=AA6$ ��� ��!�$ ��$ N��$ 1CF

1E @�������$ 1AAA$ ��" �CF �A"

G1 H �" <���!��$ �" *�����$ ," ����$ *" 3�����$ �������

�7)��� ���� �� 3�:�� �������� ��7 #���$ ������

���� ��������#� �� �#���#$ ����#�$ ��� �!��� 7���

#����! 5 ����7=AA6$ 7����:$ �V$ N��$ 1CF1A ��-$ 1AAA"

G1�H 7" ��������$ 3" O�����$ '" M��;�$ I���!���� ���4���#-�

����� ������: ��������� �� M"�/ ��#���� ����

���� �!��- ���4���#-�����#�� ������� �������$ )))

3�����#��� �� ��#�� ��� �-���� ��� &��� 3�#�����

�!- A 5C6 51AAA6 E+1FE1�"

G1CH �" ���$ �" �����$ 0" O������$ *���� M"�/ ����

#�����#��� ���� ����� #�������$ 7��#����!� �� ��

��������� ��������#� �� ��!� 7��#����! 5 � 7=AA6$

O���$ ,����$ ��F�B I#����$ 1AAA"

G1/H "�"�" ���#�$ 7��##�� ���#� ��� ��� �������� ��

����� ����$ ��3)� ����� ��� ������� �-����

��� ����#���� ��������#� 5 ��� �+++6$ '�� &�!��$

@&$ N��$ 1AF� @�������$ �+++$ ��" 1AAF�+B"

G1EH "�"�" ���#�$ �" M���������$ ��� �� ����� �����

������ � M���� �������$ ))) B� ���������

)�*�*� +���� ! ��� , ���+!� �!%��� �� -�""�. /�01//0 ��

Page 176: QoS Aspects of Mobile Multimedia Applicationsmoncef/publications/curcio.pdf · QoS Aspects of Mobile Multimedia Applications Thesis for the degree of Doctor of Science in Technology

��������#� �� ������� ������#���� ��� @������

5 ���@=AA6$ <����$ ��$ N��$ 11F1 I#����$ 1AAA$

��" ��EF�C1"

G1BH "�"�" ���#�$ &" '���������$ ��� 9�:��- �� �����

���������� ���� � ����� �������$ 7��#����!� ��

C� ,�� ��������#� �� �������� �#��#�� 5,� � �+++6$

���" 1$ ����# �-$ @,$ N��$ �E 0������-F ���#�$

�+++$ ��" / F /A"

G1AH )" �������$ 7" <���!$ 0" I���(��$ �" 7������$ �"

*�����$ ����� F �� ���� �����#� ��� ����� �����

������� #�����#����$ ))) 3�����#��� �� &��

�#���� 3�#�����!- �E 5�6 51AAB6 11+CF111B"

G�+H @" �������&�����$ 3"�" O�$ �"�" ������$ <"'" )����$

�"�" <���$ ��!� 4���- ��������� ����� �� � ��!���

���� �����$ ))) 3�����#��� �� ��!� 7��#����! A

5�6 5�+++6 / /F/C+"

G�1H )3� $ N�3� � "�+$ N������� ����� 3���#�����#��

��� �-��� 5N�3�6$ )������ �� �� ��� 7������

������ N�3�"

G��H �" )������!$ ," 0���(�$ �" ����;��$ 7��������#� �����

���� �� � ������� ����� ����� �� ����$ 7��#����

�!� �� �� �E� ))) &��#���� 3�#�����!- ��������#�

5&3�=AE6$ 7����:$ �V$ N��$ ��- 1AAE$ ��" 1++AF1+1 "

G� H @" 0������$ <" ����$ ," &��������$ ):������ �� 3N�3

*�#��������� M" �� ��� ������������ ���� �������

���$ ))) ������#���� ��!�;�� / 5/6 51AAB6 1�+F

1�B"

G��H @" 0������$ )" �����#�$ <" ����$ *���� M"�/ #�����

��� ���� ��������� ���� ������� #�������$ 7��#����!�

�� �� ��������� 7#��� ����! �-������ 57��=A/6$

���������$ �������$ ���#� 1AA/$ ��" CECFCEB"

G�CH �" ������$ �" O��$ 0" O������$ �� �D#�� #�����

����#�������� ���#������� ���� ������ ��!��

��� ��� ��� � ��� ���� #���!$ ))) 3�����#��� ��

��!� 7��#����! B 51�6 51AAA6 1B1/F1B� "

G�/H �" ���!��$ *" )!����$ O" �����$ �" '���$ <" <��#�$

*���� #��������� ��� ��������� �� �7)��� ����$

7��#����!� �� �� E� ��� ��������� ����

���� ��������#� 5�������=AA6$ I������$ 0'$ N��$

+ I#����FC @�������$ 1AAA$ ��" 11 F1�+"

G�EH O" M��$ M" M����$ 3" N���$ �" I!���$ �" M!���$ �"

�$ ������ �:�������� �-��� �����$ 7��#����!�

�� �� /� ��������� �������� �� ����� �������

������#��� 5�����=AA6$ ��� ��!�$ ��$ N��$

1CF1E @�������$ 1AAA$ ��" C+F C�"

G�BH M" M����$ �" 3������ 5)��"6$ ����� ��� N�3�2 *���

�##��� ��� 3��� �������� ����� ������#���$

���-$ @�� S���$ �+++"

G�AH �" M���������$ "�"�" ���#�$ ����- � ����� �����

������$ 7��#����!� �� �� E� ��������� �������� ��

����� ������� ������#���� 5����� �+++6$

������$ 3��-�$ ,����$ � F�/ I#����$ �+++$ ��" 1<� >1F

1<� �E"

G +H ��������� ������� �I> )� 1��A���$ ��������

3�#�����!- F �����# ����! �� �����&���� I�(�# F

7�� �2 &����$ 1AAA"

G 1H 3N�3$ *�#��������� �"E� "1$ ���� *�� ����#�

����� ��� ������� ������#���� 3������! �

C" ��� /" ��>�$ ���#� 1AA/"

G �H 3N�3$ *�#��������� M"�� $ ������:�! 7���#��

��� '�� < *�� ������� ������#���$ ���#�

1AA/"

G H 3N�3$ *�#��������� M"�� ����: �$ ������:�!

7���#�� ��� '�� < *�� ������� ����� �������

#��� ���� '�� )�����7���� ��������$ 0������-

1AAB"

G �H 3N�3$ *�#��������� M"�� ����: <$ ������:�!

7���#�� ��� '�� < *�� ������� ����� �������

#��� ���� ������� )�����7���� ��������$ 0������-

1AAB"

G CH 3N�3$ *�#��������� M"�� ����: �$ ������:�!

7���#�� ��� '�� < *�� ������� ������#���

���� M!��- )�����7���� ��������$ 0������- 1AAB"

G /H 3N�3$ *�#��������� M"�� ����: �$ I����� ����

���:�! 7���#�� ��� '�� < ��� �������

������#��� ���� M!��- )�����7���� ��������$ ��-

1AAA"

G EH 3N�3$ *�#��������� M"��C$ ������ 7���#�� ���

������� ������#���$ 0������- �+++"

G BH 3N�3$ *�#��������� M"�/1$ &��� ����# ��� �����

����� ����#�� � � /� ��>�$ ���#� 1AA "

G AH 3N�3$ *�#��������� M"�/ $ &��� ����! ��� '��

< *�� ������#���$ 0������- 1AAB"

G�+H 3N�3$ *�#��������� M"�/ ����: N$ )����#��

*������#� 7#��� ����#��$ @������� �+++"

G�1H 3N�3$ *�#��������� M"�/ ����: &$ ��� 7���

���� ��#� 5�7�6$ @������� �+++"

G��H 3N�3$ *�#��������� M"�/ ����: �$ �������

����������� )����#���� ��������$ @�������

�+++"

G� H 3N�3$ *�#��������� M" �+$ @���������� ��@

&���� 3�������� �-���� ��� 3������ )4�����$ ��-

1AAA"

G��H 3N�3$ *�#��������� M" � $ 7�#�������� ������

�� ������#���� �-����$ @������� �+++"

G�CH 3N�3$ *�#��������� M" ��$ 3������ ��� '�� <�

*�� ������� ������#���$ 0������- 1AAB"

G�/H 3N�3$ *�#��������� 7"A 1$ ������� ������#��

��� ����-$ �-�#����;��� ��� 0���� *�� ��������

���$ ��#����� 1AAB"

G�EH 3N�3$ *�#��������� �"E/1$ �!����! �-��� @�" E F

��@ N��� 7�� 0��#���� ���#����$ ��#�����

1AAA"

G�BH 3N�3$ *�#��������� �"E/�$ �!����! �-��� @�" E F

��@ N��� 7�� ������� 0��#��� �� �����!�� ���

�!����$ �������� 1AAE"

��� )�*�*� +���� ! ��� , ���+!� �!%��� �� -�""�. /�01//0

Page 177: QoS Aspects of Mobile Multimedia Applicationsmoncef/publications/curcio.pdf · QoS Aspects of Mobile Multimedia Applications Thesis for the degree of Doctor of Science in Technology

G�AH 3N�3$ *�#��������� �"E/ $ �!����! �-��� @�" E F

��@ N��� 7�� 0����� ��� �����$ �������� 1AAE"

GC+H 3N�3$ *�#��������� �"E/�$ �!����! �-��� @�" E F

��@ N��� 7�� �!����! 7��#������$ �������� 1AAE"

GC1H 3N�3$ *�#��������� �"A 1$ ��@ N����@�����

�����#� '�-�� ���#?#��� ��� <��# ���� ������$

��- 1AAB"

GC�H 3N�3$ *�#��������� &"1�+$ 7��#������ ��� )�������

�! ������#��� ������ �� �������#�� ������

���� 3������� N��! �!�� �������� � � ������ �� /�

�� C/ ��>�$ 0������- 1AAB"

GC H ," O�����$ ," M�����������$ �������!�� �� ������� ������

�� �������!� �������� �������$ 7��#����!� �� �� C�

��������� �������� �� ����� ������� �������

#��� 5�����=AB6$ 1�F1� I#���� 1AAB$ <����$ ������-"

GC�H �" '�����!�$ 3�� M" �� ������� #�����#���

�������$ ))) ������#��� ��!�;�� � 51�6 51AA/6

�/FC1"

GCCH 3" I(�������$ *" 7����� 5)��"6$ ������� ���� ���

3��� �������� ����� ������#����$ ���#� M����$

'�����$ 1AAB"

GC/H O" 7�������$ M" M����$ M" O������$ " @����$ 3"

������$ � ���������#� ����-�� �� 3��� ��� ����

����� �� �����#� ������� ��� N�3� �!� � ���

����#��$ 7��#����!� �� �� B� ))) ��������� �-��

����� �� 7�������$ ����� ��� ����� *���

������#���� 57 �*�=AE6$ ���" 1$ M�����$ 0�����$

��" ��F�/"

GCEH �"�)�������$ ��#����#��� ����� ������� ����

����- F ��������!$ 7��#����!� �� �� ��3)� ����

������ ��������#� �� ����� ��� ������� �-����

��� ����#���� 5 ��� �+++6$ 1AF� @������� �+++$

'�� &�!��$ @&$ N��"

GCBH �"�)�������$ ��#����#��� ����� ������� ����

����- F ������ ��4������� ��� #��� #�����$ 7��#����

�!� �� �� �� ��������� �-������ �� �������

7������� ������� ������#���� �+++ 5�7��=++6$

<��!���$ 3������$ 1�F1C @�������$ �+++"

GCAH *"'" 7#����;$ �"'" �#����!$ '"<" �����$ 3����- ��

������ ���#��� #�����#���� F � �����$ )))

3�����#��� �� ������#���� 5��- 1AB�6 BCCFBB�"

G/+H �" �������$ �" ����;��$ M" I�������$ � �� �-��� ���

�������� �� �� ����� �#�����!-$ �B� ))) &��#�

���� 3�#�����!- ��������#� 5&3�=AB6$ ���" �$ 1BF�1 ��-$

1AAB$ ��" AB FABE"

G/1H �" 3������$ M" M����$ 7" ���;-���$ )3� ����� ���

N�3�$ 7��#����!� �� �� C� ))) ��������� �-��

����� �� ������ ���#��� 3�#��4��� ��� ����#����$

���" �$ �F� ��������$ 1AAB$ ��" /1/F/�+"

(�� �� ������$ ���� � ���� 5 ��-6� 1A/B$ ��� ������ ��� A -���� ���������� #������� �� � �������#� �������� ��!����$ ���(�# ����!�� ��� �������� 3�#�����!- ���#������� 1AB/ � 1AAE" M� ��#���� ��'����� ��!��� � ������� �#��#����� N�����- �� ����� 5 ��-6 �1AAE" � 1AAB �� (���� @��� *������#� �����$ 3������ 50�����6����� �� ��� ������ �� � ������#���!���� � ����� �������" ��#�1AAA �� � �� @��� ����� 7���������� �� ����� �� � ��� ������ � ��

���� �� ������� ����� ������� ����#���� ��� ����- ������#� 5���6" ��" ���#� � �� ��� ������ ��#� 1AA+ ����� ))) ������ ��#� 1AA1" M� ��� ���� �������� ������������� � �� ���� �� ������� ��!�����!$ ��������������$��� �� ����� ���� ��� � 7" M� � #������- � 7�"�" #������� �� �!��� 7��#����! '�������- �� 3������ N�����- ��3�#�����!-" M� ����� �� ����� �#���� ���������#� ��������� �� ����������� ����!� �������$ ��� �� �!����! �������� ����$ ��� ������� ����#�� ��� ����#���� ��''� 7 ����� �������"

*���� +�,,������� ��� ���� �'����������$ 0�����$ �� �������� 1$1AE�" M� ��#���� �� �"�#" 5QQ�����#��==6 � ������� �#��#� ����3������ N�����- �� 3�#�����!-$0�����$ � 1AAE" � 1AA/$ �� (����@��� *�����#� ����� � 3������$����� �� ����� �� � ������#� ��!����� �� ���� 7��#������ !����" M� �#������- ?���;�! �� ��#���� ����"M� ��� ������� ��� #���������������� 1+ �#��?# (������ ��� #��������#� ��#���" M� #����� ������#� ������� ��� � �� ���� �� ���� #���!$

����- � �� ����� ��!����! ��#��#��� ����������� ����D#�� ����������� �� ���� #���! ��!�����"

����-�.������%� ��� ���� �����!��� @��� ��#� �� ��!���! �� 1AAB"M� � �#�� � ����������� ����������! � ��� ������� �������-��� ������� �����!�! �-��� � �77 ��� ��7 0����" ���( ��� !��� <�#����� �� �#��#� ��� ����� ��)�!�����! ���� <��!������ N������- �� )�!�����! ��� 3�#�����!-��� ���� ���� �� 3�#�����!- �1AAC ��� 1AAE �����#���-" M� � ����������! �� ����!������ ����� �������� �#�����!�� � 3������N�����- �� 3�#�����!-"

)�*�*� +���� ! ��� , ���+!� �!%��� �� -�""�. /�01//0 ��C

Page 178: QoS Aspects of Mobile Multimedia Applicationsmoncef/publications/curcio.pdf · QoS Aspects of Mobile Multimedia Applications Thesis for the degree of Doctor of Science in Technology
Page 179: QoS Aspects of Mobile Multimedia Applicationsmoncef/publications/curcio.pdf · QoS Aspects of Mobile Multimedia Applications Thesis for the degree of Doctor of Science in Technology

[P2] Igor D.D. Curcio, Miikka Lundan, “SIP Call Setup Delay in 3G Networks”, Proc. 7th IEEE Symposium on Computers and Communications (ISCC '02), 1-4 Jul. 2002, Taormina-Giardini Naxos (Italy), pp. 835-840.

© 2002 IEEE. Reprinted with permission.


SIP Call Setup Delay in 3G Networks

Igor D.D. Curcio and Miikka Lundan
Nokia Corporation

P.O. Box 88, 33721 Tampere, FINLAND

E-mail: {igor.curcio, miikka.lundan}@nokia.com

Abstract

In this paper, call setup time in SIP-based videotelephony is analyzed. We used a 3G network emulator to measure post-dialing delay, answer-signal delay and call-release delay. The results are compared to local, national, international and overseas Intranet LAN calls. Furthermore, we have also studied the behavior of SIP calls over lossy channels with restricted bandwidth, typical of mobile network signaling bearers.

1. Introduction

SIP (Session Initiation Protocol) is an IETF application layer control protocol for creating, modifying and terminating sessions with one or more participants [9].

In the panorama of protocols for IP telephony, SIP is not the only possible choice. For instance, ITU-T H.323 [12] is an alternative candidate. However, SIP has recently gained the increasing interest of universities, standardization organizations and companies. One of the main drivers for this fact has been the decision of 3GPP (Third Generation Partnership Project) which, in the year 2000, selected SIP as the call control protocol for 3G IP-based mobile networks (3GPP Release 5) [1], [2], [3].

The SIP protocol enables a wide set of applications ranging from Multimedia over IP to Instant Messaging, Presence, and Rich Calls. In this paper we focus on Multimedia over IP applications, considering one of the most challenging ones: SIP-based videotelephony. This study is also applicable to other use cases where the call setup procedure follows the same steps, and therefore it is not intended to be restricted to the sole case of SIP videotelephony.

Call setup delay is part of the more general QoS framework. Media-related delay issues are deeply analyzed in [10]. [7] presents results of H.323 and SIP calls based on Internet traces, considering also proxy and redirect servers and UDP error bursts. In the work in [4] we have shown results of SIP call setup time in an Intranet LAN/WLAN environment for local, international and overseas calls, with and without proxy, using IPv4/IPv6, and under packet loss conditions.

In this paper, we study the SIP call setup time in 3G networks and we compare it with call setup time over LAN networks. We are interested in the different delay components involved during the lifetime of a SIP call, namely post-dialing delay, answer-signal delay and call-release delay. In addition, we measure SIP call setup times in narrow lossy channels, typical of mobile networks, where the bandwidth allocated for the signaling bearer is limited.

The paper is structured as follows: section 2 describes the basics of call signaling with the SIP protocol. In section 3 the main network configurations used in our tests are discussed. Section 4 contains the metrics used for call setup measurements. Section 5 describes the main findings of our research, showing the results. Finally, in section 6 we draw some conclusions.

2. SIP call signaling

In this section we will describe the basic signaling flow diagrams for calls between two SIP User Agents (UA). A mapping between signaling and call setup phases is also described. Finally, we discuss the reliability of messages using UDP.

2.1. Signaling flows and call setup phases

The basic SIP signaling flow for a call between two UAs is depicted in Figure 1.

Proceedings of the Seventh International Symposium on Computers and Communications (ISCC’02) 1530-1346/02 $17.00 © 2002 IEEE


[Figure 1 - Call signaling legs for a SIP call setup and release: message ladder between caller and callee showing INVITE, 100/TRYING, 180/RINGING, 200/OK and ACK, followed by media transfer, then BYE and 200/OK; the intervals T1, T2 and T3 mark the post-dialing, answer-signal and call-release delays.]

A SIP call setup is essentially a 3-way handshake between caller and callee. For instance, the main legs are INVITE (to initiate a call), 200/OK (to communicate a definitive successful response) and ACK (to acknowledge the response). However, implementations can make use of provisional responses, such as 100/TRYING and 180/RINGING, when it is expected that a final response will take more than 200 ms. 100/TRYING indicates that the request has been received by the next-hop server and that some unspecified action is being taken on behalf of this call (for example, a database query). 180/RINGING indicates that the callee is trying to alert the user. This scenario is shown in Figure 1.
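As an illustration of the request format behind these legs, the following is a minimal sketch of how a UA might serialize an INVITE; the addresses, Call-ID and header values are hypothetical placeholders, not taken from the paper:

```python
# Minimal sketch of a SIP INVITE request (no SDP body). All addresses and
# identifiers below are hypothetical placeholders for illustration only.

def make_invite(caller: str, callee: str, call_id: str) -> str:
    """Build a bare-bones SIP INVITE request as CRLF-terminated text."""
    return (
        f"INVITE sip:{callee} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP caller.example.com\r\n"
        f"From: <sip:{caller}>\r\n"
        f"To: <sip:{callee}>\r\n"
        f"Call-ID: {call_id}\r\n"
        f"CSeq: 1 INVITE\r\n"
        f"Content-Length: 0\r\n\r\n"
    )

msg = make_invite("alice@example.com", "bob@example.com", "abc123")
print(msg.splitlines()[0])  # request line of the INVITE
```

A real INVITE additionally carries an SDP offer (section 3), which is what brings its size to the 605 bytes listed in Table 1.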

After the call has been established, the actual media transfer (speech and video) can take place. The release of the call is made by means of the BYE method, and the successful call release can be communicated through a 200/OK.

Call setup and release times are defined in the ITU-T Recommendation E.721 [11], which is targeted to ISDN networks. However, the main definitions can also be used for SIP-based calls. For instance, the phases of measurement of a SIP call can be divided into three parts:

Post-Dialing Delay (PDD). It is also called post-selection delay or dial-to-ring delay. This is the time elapsed between when the caller clicks the button of his terminal to call the callee, and the time the caller hears his terminal ringing. In our case the PDD corresponds to the time T1 (see Figure 1).

Answer-Signal Delay (ASD). This is the time elapsed between when the callee picks up the phone and the time the caller receives indication of this. In our case the ASD corresponds to the time T2 (see Figure 1). It has to be underlined that the caller receives notification that the callee has picked up the phone when the former receives the 200/OK. However, the call-signaling handshake is completed when the callee receives the ACK from the caller. This is the reason why we have considered the ASD in this way.

Call-Release Delay (CRD). This is the time elapsed between when the releasing party (the caller in our example in Figure 1) hangs up the phone and the time the same party can initiate/receive a new call. In our tests the CRD corresponds to the time T3.
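The three metrics can be derived mechanically from timestamped signaling events observed at the caller. The sketch below is our own illustrative instrumentation (the event names are invented), under one plausible reading of where T2 starts and ends; the sample values reproduce the 3G averages reported in section 5:

```python
# Illustrative sketch: computing PDD, ASD and CRD (seconds) from timestamps
# of signaling events observed at the caller. Event names are hypothetical.

def call_delays(t: dict) -> dict:
    return {
        # T1: "call" button pressed (INVITE sent) -> 180/RINGING received
        "PDD": t["ringing_rcvd"] - t["invite_sent"],
        # T2: answer indication (200/OK received) -> ACK sent, which
        # completes the 3-way handshake
        "ASD": t["ack_sent"] - t["ok_rcvd"],
        # T3: hang-up (BYE sent) -> 200/OK for the BYE received
        "CRD": t["bye_ok_rcvd"] - t["bye_sent"],
    }

events = {"invite_sent": 0.0, "ringing_rcvd": 0.062,
          "ok_rcvd": 5.000, "ack_sent": 5.045,
          "bye_sent": 20.0, "bye_ok_rcvd": 20.050}
d = call_delays(events)   # PDD 62 ms, ASD 45 ms, CRD 50 ms
```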

2.2. Reliability of signaling using UDP

The SIP protocol has been defined to work with both TCP and UDP transport protocols. Our main focus is on the latter protocol. UDP is well known as an unreliable transport protocol. In practice, this means that in case a transport packet is lost, the network does not retransmit it. While TCP implements a retransmission procedure, giving a higher level of reliability to the SIP signaling, in the case of UDP it is the responsibility of the SIP protocol to handle retransmissions of lost packets.

The retransmissions of SIP methods and responses follow well-specified procedures [9], and these are reported in the following. These algorithms are of fundamental importance in understanding the call setup times in different scenarios, especially in highly congested, lossy or narrow channels.

INVITE method. A SIP UA should retransmit an INVITE request with an interval that starts at T1 seconds and doubles after each packet transmission. T1 is an estimate of the Round-Trip Time (RTT). The client stops retransmissions if it receives a provisional (1xx) or definitive (2xx) response, or once it has sent a total of 7 request packets. A UA client may send a BYE or CANCEL request after the 7th retransmission (i.e., after 64*T1 seconds). In our implementation the value of T1 is set to 0.5 seconds.
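The INVITE timing rule can be sketched as a short schedule generator. Assuming no response ever arrives and T1 = 0.5 s as in our implementation, it reproduces the 7-packet pattern ending near 64*T1 seconds:

```python
# Sketch of the INVITE retransmission schedule: the interval starts at T1
# and doubles after each transmission, with at most 7 request packets sent
# before the client gives up (assuming no response ever arrives).

def invite_schedule(t1: float = 0.5, max_packets: int = 7) -> list:
    """Return the send times (seconds) of the INVITE and its retransmissions."""
    times, t, interval = [], 0.0, t1
    for _ in range(max_packets):
        times.append(t)
        t += interval
        interval *= 2   # exponential growth of the retransmission interval
    return times

print(invite_schedule())  # [0.0, 0.5, 1.5, 3.5, 7.5, 15.5, 31.5]
```

With T1 = 0.5 s, the last (7th) packet goes out at 31.5 s, consistent with the 64*T1 = 32 s give-up point quoted above.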

BYE method. In this case, a SIP client should retransmit requests with an exponential backoff for congestion control reasons. For example, if the first packet sent is lost, the second packet is sent T1 seconds later, and eventually the next one after 2*T1 seconds (4*T1 seconds and so on), until the interval reaches a value T2. Subsequent retransmissions are spaced by T2 seconds. T2 represents the amount of time a BYE server transaction will take to respond to a request, if it does not respond immediately. If the client receives a provisional response, it continues to retransmit the request, but with an interval of T2 seconds (this is done to ensure reliable delivery of the final response). Retransmissions cease when the client has sent a total of 11 packets (i.e., after 64*T1 seconds), or when it has received a definitive response. Responses to BYE are not acknowledged via ACK. In our implementation the values of T1 and T2 are set to 0.5 and 4 seconds, respectively.
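The capped backoff for BYE can be sketched the same way. With T1 = 0.5 s and T2 = 4 s, the interval doubles until it reaches T2, and the 11th packet goes out around 64*T1 seconds (assuming no definitive response ever arrives):

```python
# Sketch of the BYE retransmission schedule: exponential backoff capped at
# T2, at most 11 packets, assuming no definitive response ever arrives.

def bye_schedule(t1: float = 0.5, t2: float = 4.0, max_packets: int = 11) -> list:
    """Return the send times (seconds) of the BYE and its retransmissions."""
    times, t, interval = [], 0.0, t1
    for _ in range(max_packets):
        times.append(t)
        t += interval
        interval = min(interval * 2, t2)  # double, but never exceed T2
    return times

print(bye_schedule())
# [0.0, 0.5, 1.5, 3.5, 7.5, 11.5, 15.5, 19.5, 23.5, 27.5, 31.5]
```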

ACK method. ACK is not retransmitted, but in case of loss the UA server retransmits the 200/OK.

Informational (provisional) responses (1xx). UA servers do not transmit informational responses reliably. For instance, our implementation does not retransmit informational responses (100/TRYING, 180/RINGING). However, a UA server which transmits a provisional response will retransmit it upon reception of a duplicate request.

Successful responses (2xx). A UA server does not retransmit responses to BYE. In all the other cases a UA server which transmits a final response should retransmit it with the same spacing as the ACK. Response retransmissions cease when an ACK request is received or when the response has been transmitted 11 times (i.e., after 64*T1 seconds). The value of a final response is not changed by the arrival of a BYE or CANCEL request.

3. Network configurations tested

In our experiments we tested our UA, a simulator version of a SIP-based videophone, under different network configurations. This section summarizes the test cases:

• 3G calls: local 3G calls using a 3GPP Rel. '99 emulator;

• Low bit-rate (lossy) channel calls: direct calls with a narrow channel and packet losses using the NISTNet simulator [13];

• Intranet calls: local calls between two UAs in Tampere (Finland). National calls between a UA in Tampere and another in Oulu (Finland), where the distance was approximately 500 km. International calls between a UA in Tampere and another in Budapest (Hungary), where the distance was approximately 1500 km. Overseas calls between a UA in Tampere and another one in Dallas (U.S.A.), where the distance was over 10000 km.

The UAs were running on PCs under the Windows operating system. The NISTNet simulator was running under Linux OS. The 3GPP Rel. '99 emulator was made of an interconnected set of Linux PCs via an ATM card at 155 Mbps, and it implemented the full 3GPP protocol stacks. In the configuration used (see Figure 2) there were a Mobile Station (MS), a Radio Access Network (RAN) and a 3G core network made of a Serving GPRS Support Node (SGSN), a Home Location Register (HLR) and a Gateway GPRS Support Node (GGSN). Only one SIP terminal was configured as an MS, while the other was connected via Intranet LAN. The PDP (Packet Data Protocol) context allocated for the SIP calls was operating at 384 kbps for uplink and downlink using RLC (Radio Link Control) unacknowledged mode. We should point out that the 3G network tested was a 3GPP Rel. '99 network, not a Rel. 5 network. In the latter, in fact, the whole IP Multimedia Subsystem (IMS) is governed by the SIP call control protocol where, for a call establishment, a greater number of messages are exchanged between the different network entities end-to-end [1][2][3]. This clarification helps to understand the nature of our experiments, which were targeted to assess the performance of SIP call setup delay over a 3GPP Rel. '99 network, where the signaling is carried over the user plane of the 3GPP protocol stack.

[Figure 2 - 3G network configuration: a SIP Terminal acting as Mobile Station, connected through the RAN and the 3G core network (SGSN, HLR, GGSN) to a second SIP Terminal on the Intranet.]

Since the application is handling speech and video, we used real codec parameters. These parameters are carried in the SIP messages via the Session Description Protocol (SDP) [8]. SIP uses the offer/answer model for SDP negotiation [14]. We have implemented it in such a way that the first codec offer is placed in the INVITE method, while the codec answer is placed in the 200/OK response. Depending on the number of codecs used during the negotiation, the actual packet size and negotiation time can vary. As a result, the call setup time can change as well. For this purpose we decided to explicitly list the speech and video codecs used, to allow the SDP information to be really carried in the packets transmitted and, therefore, to simulate cases that are as close as possible to reality. The speech codecs used were G.711 µ-Law, G.711 A-Law and AMR. The video codecs used were H.263+ and MPEG-4 Visual. The packet sizes for the different SIP methods and responses are listed in the table below:

Table 1 - Packet sizes for different SIP methods and responses

SIP method/response              Size including UDP/IP header (bytes)
INVITE (with SDP - 5 codecs)     605
100/TRYING                       233
180/RINGING                      234
200/OK (with SDP - 2 codecs)     493
ACK                              268
BYE                              268
200/OK                           226

In our tests we used neither SIP message compression nor UDP/IP header compression. These two compression methods help in reducing call setup times, especially in low bit-rate channels.
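A back-of-the-envelope serialization estimate shows why message size and compression matter on narrow bearers. The sketch below only divides message size by bandwidth; it ignores retransmissions, simulator queuing/traffic shaping and protocol processing, so it is an illustrative bound rather than a prediction of the measured delays:

```python
# Rough serialization delay of a SIP message on a narrow signaling bearer:
# bytes * 8 / bandwidth. Ignores retransmissions, queuing and compression,
# so this is only an illustrative bound, not a model of the measurements.

def serialization_ms(size_bytes: int, bandwidth_bps: int) -> float:
    return size_bytes * 8 * 1000.0 / bandwidth_bps

# The 605-byte INVITE of Table 1:
print(serialization_ms(605, 2_000))    # 2420.0 ms on a 2 kbps bearer
print(serialization_ms(605, 64_000))   # 75.625 ms on a 64 kbps bearer
```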

4. Call setup metrics

In this paper we are interested in computing the post-dialing delay, answer-signal delay and call-release delay of SIP calls. ETSI TIPHON [6] defines call establishment measurement methods for post-dialing delay and answer-signal delay. Figure 3 shows the call setup sequence used by general telephony services. However, with IP-based terminals, it is likely that there is no hook-off (step A) and dial tone (step B) sequence. In our case, the act of a user pressing the "call" button is regarded as step C (last digit dialed).

[Figure 3 - Call setup sequence: events A (off-hook), B (dial tone), C (last digit dialed), D1 (far-end rings), D2 (near-end rings), E (call answered), F (connection time).]

From the user's perspective the most significant sequence is D2-C. This is exactly the same as the time T1 we have defined in section 2. Some systems are implemented in a way that they present a ring-back to the caller before the connection has been established, to give an impression that the PDD is low. This is unacceptable in TIPHON systems [6], and it justifies our choice not to compute the PDD as the time difference between the first INVITE and the 100/TRYING provisional response. The ASD, on the other hand, is computed as F-E. In the next section we will show the experimental results of our tests.

5. Simulation results

For each test case we computed the average and 95th percentile of post-dialing delay, answer-signal delay and call-release delay. In our computations the PDD also included the translation between domain names and IP addresses via DNS. We performed 20 calls per test case.
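The per-test-case statistics can be reproduced as follows. The paper does not state which percentile estimator was used, so linear interpolation between order statistics is our assumption:

```python
# Sketch: average and 95th percentile of a set of measured delays.
# The interpolation scheme is an assumption (the paper does not specify it).

def mean_and_p95(samples: list) -> tuple:
    s = sorted(samples)
    n = len(s)
    rank = 0.95 * (n - 1)                # fractional index of the percentile
    lo = int(rank)
    frac = rank - lo
    # interpolate between the two surrounding order statistics
    p95 = s[lo] if lo + 1 >= n else s[lo] + frac * (s[lo + 1] - s[lo])
    return sum(s) / n, p95

avg, p95 = mean_and_p95([10.0, 20.0, 30.0, 40.0])   # hypothetical sample set, ms
```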

5.1. 3G SIP calls vs. Intranet calls

In a first set of tests, we have measured SIP signaling delays in a 3G network environment and compared them with SIP signaling delays in an Intranet LAN.

Results for PDD are summarized in Figure 4, and show that the PDD for local SIP calls is 24 ms. For national calls the PDD is 38 ms, while for international calls it is 153 ms; for overseas calls the PDD is 240 ms. The PDD for 3G calls was 62 ms, a value which is rather close to the PDD value for national calls. The 95th percentiles of PDD follow the same trend, as shown in the figure (in particular, the value for 3G calls is 71 ms).

Figure 5 shows the simulation results for ASD. They follow the same trend as the PDD but, as expected, are slightly lower, because the ASD is composed of the transmission or reception of only two SIP messages, differently from the PDD that is made up of three messages. The ASD for local calls is 23 ms, while for national calls it is 31 ms. The ASD for international and overseas calls is respectively 147 ms and 237 ms. The ASD for 3G calls is 45 ms, rather close to the ASD value for national calls. The 95th percentiles of ASD follow the same trend as in the PDD case (in particular, the value for 3G calls is 60 ms).

[Figure 4 - Post-dialing delay for SIP calls: average and 95th percentile (milliseconds) for LAN Tre-Tre, LAN Tre-Oulu, 3G, LAN Tre-Buda and LAN Tre-Dallas.]


CRD results are included in Figure 6. For local calls the CRD is 11 ms, about 50% of the corresponding PDD. For the other cases, the CRD results are similar to those of the ASD. For instance, the CRD for national calls is 30 ms, while for international calls it is 138 ms. The CRD for overseas calls is 230 ms. The CRD for 3G calls is 50 ms, also in this case close to the value for national calls. The 95th percentiles of CRD are similar to the previous cases (in particular, the value for 3G calls is 70 ms).

5.2. Low bit-rate channel calls

The idea of this set of tests was to assess the call setup time in case of limited bandwidth, because in realistic 3G network configurations the bearer allocated for signaling could be a few kbps. We used the NISTNet simulator to limit the available bandwidth, and we also simulated a 2% packet loss rate. We tested our SIP UA over channel bandwidths of 2, 5, 9.2, 16, 32, 64, 128 and 256 kbps. The call success rate was always 100%, in the sense that no call establishment or release failed due to the narrower channels we used. We measured PDD and ASD. We could not measure CRD, because the media packets were queued up in the simulator (not discarded), and they blocked the channel for a long time.

Results for PDD for the different bandwidths are shown in Figure 7. The PDD for 2 kbps bandwidth was just below one second. For 5 kbps bandwidth the PDD was around 420 ms. Values of PDD decrease for increasing bandwidth, becoming lower than 100 ms for bandwidths of at least 64 kbps. At the maximum bandwidth of 256 kbps, the PDD is around 50 ms. The 95th percentile of PDD closely followed the same trend.

Figure 8 shows the ASD for the different channels. This value is constantly equal to 45 ms for channels of at least 5 kbps, but it increases to 166 ms for very narrow channels (2 kbps). The 95th percentiles of ASD for channels of at least 5 kbps follow the same trend as the ASD; for the 2 kbps channel the 95th percentile ASD was 559 ms.

[Figure 5 - Answer-signal delay for SIP calls: average and 95th percentile (milliseconds) for LAN Tre-Tre, LAN Tre-Oulu, 3G, LAN Tre-Buda and LAN Tre-Dallas.]

[Figure 6 - Call-release delay for SIP calls: average and 95th percentile (milliseconds) for LAN Tre-Tre, LAN Tre-Oulu, 3G, LAN Tre-Buda and LAN Tre-Dallas.]

[Figure 7 - Post-dialing delay for SIP calls (low bit-rate bandwidth): average and 95th percentile (milliseconds) for channel bandwidths of 2, 5, 9.2, 16, 32, 64, 128 and 256 kbps.]

[Figure 8 - Answer-signal delay for SIP calls (low bit-rate bandwidth): average and 95th percentile (milliseconds) for channel bandwidths of 2, 5, 9.2, 16, 32, 64, 128 and 256 kbps.]


6. Conclusions

We have presented results for SIP call setup over 3G networks, and compared them with Intranet LAN results. Our simulations show that the SIP 3G call setup over the 3GPP Rel. '99 emulator is rather close to the call setup time in case of national Intranet calls. In our simulations, for all the cases the post-dialing delay, answer-signal delay and call-release delay values for 3G calls are below 100 ms. Simulations of SIP calls over low bit-rate channels show that the signaling delay is below 1 second, even in channels as narrow as 2 kbps. Globally, these SIP signaling values are well in line with the Grade of Service (GOS) bounds proposed by the ETSI TIPHON QoS classes [5].

In general, the most challenging part of the end-to-end network is the RAN, which is subject to air-interface losses and narrow channels. These factors have a great impact on the overall SIP call setup time. The first factor causes more retransmissions of SIP messages whenever they are transmitted over unreliable transport protocols such as UDP. Narrow channels dedicated to SIP signaling must sometimes deal with large SIP messages, which yields long call setup times. However, it is expected that UDP/IP header compression and SIP message compression algorithms will greatly reduce the SIP call setup delay over 3GPP networks.

7. Acknowledgements

The authors would like to thank all their colleagues in Nokia who have contributed to this work, in particular Jiangtao Li, Kalevi Orre and Raquel Rodriguez.

8. References

[1] 3GPP TSG SSA, IP Multimedia Subsystem (IMS) – Stage 2 (Release 5), TS 23.228 v. 5.4.1, 2002-04.

[2] 3GPP TSG CN, Signaling Flows for the IP Multimedia Call Control Based on SIP and SDP - Stage 3 (Release 5), TS 24.228 v. 2.0.2, 2002-03.

[3] 3GPP TSG CN, IP Multimedia Call Control Protocol Based on SIP and SDP - Stage 3 (Release 5), TS 24.229 v. 2.0.1, 2002-03.

[4] I. Curcio and M. Lundan, Study of Call Setup in SIP-based Videotelephony, 5th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2001), 22-25 July 2001, Orlando (FL, U.S.A.), Vol. IV, pp. 1-6.

[5] ETSI Telecommunications and Internet Protocol Harmonization Over Networks (TIPHON), End to End Quality of Service in TIPHON Systems; Part 2: Definition of Quality of Service (QoS) Classes, TS 101 329-2 v.1.1.1, July 2000.

[6] ETSI Telecommunications and Internet Protocol Harmonization Over Networks (TIPHON), Part 5: Quality of Service (QoS) Measurement Methodologies, Draft TS 101 329-5 v.0.2.6, July 2000.

[7] T. Eyers and H. Schulzrinne, Predicting Internet Telephony Call Setup Delay, First IP Telephony Workshop (IPTEL 2000), 12-13 April 2000, Berlin, Germany.

[8] M. Handley and V. Jacobson, SDP: Session Description Protocol, IETF RFC 2327, April 1998.

[9] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley and E. Schooler, SIP: Session Initiation Protocol, IETF RFC 3261, March 2002.

[10] A. Hourunranta and I.D.D. Curcio, Delay in Mobile Videophones, Seventh International Workshop on Mobile Multimedia Communications (MoMuC 2000), 23-26 October 2000, Tokyo, Japan, pp. 1B-3-1/1B-3-7.

[11] ITU-T, Recommendation E.721, Network Grade of Service Parameters and Target Values for Circuit-Switched Services in the Evolving ISDN, May 1999.

[12] ITU-T, Recommendation H.323, Packet-based Multimedia Communications Systems, September 1999.

[13] NIST, NISTNet, http://www.antd.nist.gov/nistnet/.

[14] J. Rosenberg and H. Schulzrinne, An Offer/Answer Model with SDP, IETF RFC 3264, March 2002.


[P3] Igor D.D. Curcio, Miikka Lundan, “Human Perception of Lip Synchronization in Mobile Environment”, Proc. 8th IEEE Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM ’07), 18-21 Jun. 2007, Helsinki, Finland.

© 2007 IEEE. Reprinted with permission.


Human Perception of Lip Synchronization in Mobile Environment

Igor D.D. Curcio
Nokia Research Center
Tampere, Finland
E-mail: [email protected]

Miikka Lundan
Nokia Corporation
Tampere, Finland
E-mail: [email protected]

Abstract

In the literature, many studies about human perception of lip synchronization refer to experiments based on TV sets. In these cases, researchers give some hints on how easily humans perceive lip synchronization problems. The in-sync region is typically known to be in the range -80ms to +80ms; within this range most test subjects do not detect any lip synchronization skew. The out-of-synch region typically falls outside the range between -160ms and +160ms, where nearly everybody detects the lip synchronization error. Between these boundaries there is a transient area, where the detection of a lip synchronization problem depends on the speaker size (head, shoulder or body view). These earlier assumptions, based on the TV environment, do not hold for the mobile environment, where the viewing characteristics and the device itself are different; therefore, further research is needed. The purpose of this paper is to find new results for human perception of lip synchronization in the mobile environment. The results will show that new lip synchronization thresholds apply when using mobile phones and mobile media concepts.

1. Introduction

In multimedia systems, the simultaneous presence of several media brings the challenge of how to relate the different media streams. While current real-time transport protocols [1] enable implementers to create a timing relation between different media (e.g., audio and video) by means of timestamps, the human perception of media synchronization (i.e., lip synchronization) depends on several factors.

In an ideal, perfectly synchronous system, audio and video streams are transmitted and received synchronously, and the playback device (TV or mobile device) shows audio and video in a synchronous manner. Such perfectly synchronized audio-video streams are perceived by humans with the greatest fidelity, as in nature.

In current communication systems, the transmission and reception of multiple streams may suffer variable delays for several reasons (e.g., network jitter, transmitter or receiver processing delays). This makes the general case of media synchronization harder, because audio and video streams do not generally arrive at the receiver in a synchronized manner. When two (or more) media transmitted in a synchronized manner arrive at the receiver side with a temporal distance larger than the original, we say that the two media experience a synchronization skew.

Depending on the size of the synchronization skew, the two media can still be perceived as synchronized by humans. For the TV environment [2], experience shows that the typical acceptable synchronization skew is in the range [-80, 80]ms; this range is called the in-synch region. The synchronization skew starts to be annoying when it falls outside the range [-160, 160]ms: here people perceive a lip synchronization problem between the different media, and the region outside this range is called the out-of-synch region. In the ranges [-160, -80]ms and [80, 160]ms there is a transient region, where the detection of lip synchronization problems depends on the speaker size (head, shoulder or body view). In the TV environment, the numbers -160ms, -80ms, 80ms and 160ms are the lip synchronization thresholds.
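As an illustration, the TV-environment regions described above can be expressed as a small classifier. This is only a sketch: the function name is ours, and the thresholds are the values from [2] quoted in the text.

```python
# Classify a synchronization skew (video time minus audio time, in ms)
# into the perceptual regions reported for the TV environment in [2].
# Negative skew means the video is ahead of the audio.

def tv_sync_region(skew_ms: float) -> str:
    """Return the perceptual region for a given skew in milliseconds."""
    if -80 <= skew_ms <= 80:
        return "in-sync"
    if -160 <= skew_ms <= 160:
        return "transient"      # detection depends on speaker size
    return "out-of-sync"

print(tv_sync_region(-50))   # in-sync
print(tv_sync_region(120))   # transient
print(tv_sync_region(-300))  # out-of-sync
```

The rest of the paper replaces these symmetric TV thresholds with asymmetric ones measured for mobile devices.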

The knowledge of the right lip synchronization thresholds allows implementers to develop optimized applications for devices used by humans. A typical application buffers, for a certain amount of time, media fragments that arrive too early, and discards media fragments that arrive too late for real-time synchronized playback. In order to exploit this application behavior, knowledge of the lip synchronization thresholds is crucial in order to dimension application buffers [3][4] and to determine the

1-4244-0992-6/07/$25.00 © 2007 IEEE


boundary of the in-synch region for humans during media playback.

Other techniques for ensuring lip synchronization include gradually accelerating or retarding the media streams [5], and synchronizing only once every several video frames or only during speech talk spurts [6].

The purpose of this paper is to find the lip synchronization thresholds in the mobile environment. These will be shown to be different from those in typical TV systems.

The paper is organized as follows: section 2 introduces the mobile media concept; section 3 describes the testing conditions; section 4 presents the test results; and section 5 concludes the paper.

2. The Mobile Media Concept

Lip synchronization depends on several factors. In this paper we focus on the human perception of lip synchronization in the mobile environment. Therefore, it is essential to understand the meaning of Mobile Media as opposed to the traditional TV media concept.

There are three major differences between the television and mobile environments: the size of the image, the distance of the user from the screen, and the video frame rate.

The physical size of a TV screen is at least ten times larger than that of a mobile device screen. Two common mobile image sizes are QCIF (Quarter Common Interchange Format) and SQCIF (Sub-QCIF). Mobile devices are progressively evolving and embedding larger video screens; however, we used these two typical sizes in our experiments. For QCIF the image size is 176*144 pixels, whereas for SQCIF the image size is 128*96 pixels.

The TV viewing distance is generally several meters, while typical viewing distances of a mobile device are less than one meter. The viewing distance in our tests was not strictly fixed; the test subjects could decide their own distance, which normally ranged between 0.3m and 0.8m.

TV frame rates are typically 25-30 frames per second (fps). In the mobile environment, video frame rates are generally up to 15 fps. In our experiments, we used 5, 10 and 15 fps.

3. Testing methodology and conditions

Video or audio frames arriving too early or too late cause lip synchronization problems. Since the speed of light is higher than the speed of sound, humans more easily tolerate video arriving before audio [2]. In fact, there are places where this natural out-of-sync effect happens, for example in the back seats of large theaters, where the facial movement of an actor can be seen before the voice is heard.

In this paper, we assess both the case of video received before audio, and the case of audio received before video, to see the differences.

We selected the same out-of-sync step as in [2], which was 40ms (Figure 1). This means that in each step, the synchronization was 40ms worse than in the previous case. The labels b1 and b2 denote a media arriving before the ideal time, which is 0ms. The labels a1 and a2 denote a media arriving after the ideal time.

Figure 1. Out-of-sync steps
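The labelled skew steps of the test design above can be sketched as follows. This is our own illustration, not the authors' tooling; Figure 1 shows only two steps in each direction, while the actual test cases extended further in the same 40ms steps (see the tables in section 4).

```python
def skew_steps(step_ms: int = 40, n: int = 2) -> dict:
    """Generate labelled out-of-sync steps around the ideal time (0 ms).

    Labels b1, b2, ... denote media arriving before the ideal time
    (negative skew); a1, a2, ... denote media arriving after it.
    """
    steps = {"0": 0}
    for i in range(1, n + 1):
        steps[f"b{i}"] = -i * step_ms
        steps[f"a{i}"] = i * step_ms
    return steps

print(skew_steps())
# {'0': 0, 'b1': -40, 'a1': 40, 'b2': -80, 'a2': 80}
```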

The test sequence was a common news clip in QCIF and SQCIF formats with typical head-and-shoulder motion. The length of the test sequence was 37 seconds. Each test case used a different out-of-synch step, and this step was maintained for the whole duration of the test sequence.

Each test case was performed with 20 people (240 people in total). The test subjects were randomly chosen among people who were not video experts. In order to avoid the subjects concentrating on the mouth of the news speaker, we did not reveal what kind of error we were looking for in the tests. We gave only a little instruction about the nature of the error, so that the subjects did not concentrate only on looking at the video or listening to the audio.

There are typically two points where a test subject reacts to an out-of-sync test sequence: the detection point and the annoyance point. The detection point is the first point where the out-of-synch is detected by a subject. The annoyance point lies beyond the detection point and is the point where the out-of-synch is not only detected, but is no longer tolerable, becoming annoying. For example, a subject may report the out-of-sync detection point at -160ms when video is received before audio, but may also report that this becomes annoying only when the synchronization skew is stretched to -200ms. From this perspective, the detection point is a tighter and more conservative value than the annoyance point.



We asked the subjects to report the detection points (the thresholds of the out-of-sync region), not the annoyance points (the tolerance to out-of-sync). The goal was to determine the in-sync region according to the detection points. It is clear that the in-sync region according to the annoyance points is larger than the in-sync region according to the detection points because, as said above, the detection points are tighter than the annoyance points; the in-sync region according to the detection points is thus entirely contained within the in-sync region according to the annoyance points.

The main reason why we chose to study the detection points of out-of-sync, instead of the annoyance points, was that we wanted to find tighter thresholds that mobile video players should target in real implementations. Also, the annoyance points may fluctuate over time, depending on how much out-of-sync content people see, whereas the detection points are much more stable values.

We did not ask the subjects to view the video clip at a certain distance, but each person could choose his/her own distance. The normal viewing distance was between 0.3m and 0.8m, which is the normal mobile phone holding distance.

When people detected the lip synchronization error, we asked which media comes first. In this way, we ensured that people did not guess the answer, and that the reported error was due to lip synchronization rather than to other causes (e.g., low frame rates or frame skipping).

We used a commercial mobile phone in our tests. The test sequence was built using the audio stream as the anchor time reference and a time-shifted video stream relative to the audio stream.

4. Test results

Due to the different nature of the video-early and video-late cases (relative to audio), we chose two different out-of-sync thresholds. Since the video-early case is a natural phenomenon, we chose a tighter threshold to measure when out-of-sync occurs: the threshold was set to the value where 80% of the people detected the out-of-sync. In the video-late case, instead, the threshold was set to the value where 50% of the people detected the out-of-sync. We lowered the video-late percentage to just half of the subjects because this case is so easy to detect that a higher percentage would have led to a too tight threshold. More motivations for these choices are given later in the paper.

In the tests of section 4.1 of this paper, the video frames are displayed earlier than their ideal synchronized display time (i.e., before audio playback). In Figure 2, the time A is the maximum time by which a video frame can be shown in advance. If the frame arrives before that time, it should be stored in a buffer until the time A is reached. Any video frame displayed before time A falls in the out-of-synch region, and a human will perceive an error in the lip synchronization. In the tests of section 4.2 of this paper, the video frames are displayed later than their ideal synchronized display time (i.e., after audio). The time B in Figure 2 is the latest time at which a frame is not too late. Frames arriving later than time B fall in the out-of-synch region and should be discarded (i.e., not played back), or played back with a lip synchronization error.

If a video frame arrives between times A and B for display, then it is rendered immediately (if it arrives at time 0, this means perfectly in-sync with the audio stream).

The thresholds A and B in [2] were determined as equally distant from the ideal display time. Their values were respectively –80ms and 80ms. However, in general, we will see in later sections that the thresholds for detection of the out-of-sync regions are not –X and

Figure 2. Two out-of-sync cases (time axis in ms around the ideal display time 0: frames earlier than time A (video early) are buffered; frames between A and B fall in the in-sync region; frames later than time B (video late) are discarded)


+X for some X value, but –A and +B, where A and B are different values.

We used two different methods to evaluate the out-of-sync thresholds: the Cumulative Distribution Function (CDF) and the negative CDF. The CDF describes the point where the majority of people detect the out-of-sync; this is the minimum point that mobile video players should target. The negative CDF describes the point where the majority of people do not detect the out-of-sync; if players can guarantee this operational point, then the quality of the lip-sync is high.
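As an illustration of the two evaluation methods, the sketch below (our own, not the authors' tooling) derives the CDF, the negative CDF and an 80% detection threshold from a detection histogram. The sample data is the SQCIF 5 f/s video-early column of Table 1; the outputs match the corresponding entries of Tables 2 and 3.

```python
# Percent of subjects first detecting the error at each skew step (ms),
# taken from the SQCIF 5 f/s video-early column of Table 1.
detections = {160: 0, 200: 0, 240: 15, 280: 15, 320: 30,
              360: 20, 400: 20, 440: 0, 480: 0}

cdf, total = {}, 0
for skew in sorted(detections):
    total += detections[skew]
    cdf[skew] = total                 # % detecting at this skew or earlier

neg_cdf = {s: 100 - p for s, p in cdf.items()}  # % not yet detecting

# Smallest skew at which at least 80% have detected (video-early criterion)
threshold = min(s for s, p in cdf.items() if p >= 80)

print(cdf[360], neg_cdf[280], threshold)  # 80 70 360
```

The computed threshold of 360ms reproduces the example in the text ("for SQCIF video at 5 fps, 80% of the test subjects have detected a lip synchronization problem when the synchronization skew was 360ms or less").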

4.1. Video early

In this section we report the experiments where the video is displayed before the audio. Table 1 shows the distribution of the percentages of subjects that detected problems in lip synchronization at certain time thresholds. A preliminary analysis of Table 1 shows that the highest percentage of subjects detects lip synchronization problems "later" as the frame rate increases; in other words, higher frame rates allow a larger synchronization skew than lower frame rates. Intuitively, this is confirmed by the fact that low frame rates have an embedded lip synchronization problem when associated with good quality audio, as the video is perceived with some missing parts. This is evident from the grey-marked cells in Table 1, which show that for the SQCIF format at 5-15 fps the majority of subjects detected problems in lip synchronization when the synchronization skew was in the range 320-400ms. For the QCIF format, the range was 240-280ms.

Table 1. Percentage of distribution for detection of out-of-synch for video early

                    SQCIF                   QCIF
             5 f/s  10 f/s  15 f/s   5 f/s  10 f/s  15 f/s
  <= 160 ms     0      0       0      10     10       0
     200 ms     0      5       5      20      5      10
     240 ms    15     10       5      25     10       5
     280 ms    15     10      10      15     35      35
     320 ms    30     15      20       5     20      25
     360 ms    20     30      25      10     20      15
     400 ms    20     10      30       5      0       0
     440 ms     0     10       5       0      0       0
     480 ms     0     10       0       0      0       0

Table 2 shows the tabular cumulative distribution function (CDF) of the out-of-synch detection thresholds for different formats and frame rates when video is displayed before the corresponding audio. A generic point (X,Y) in the CDF in this case means that "Y% of the test subjects detected a lip synchronization problem at X ms or earlier". For example, for SQCIF video at 5 fps, 80% of the test subjects detected a lip synchronization problem when the synchronization skew was 360ms or less.

An analysis of Table 2 shows that as the video frame rate increases, the out-of-synch detection threshold also increases. This is in line with the previous results and conclusions related to Table 1. It is also possible to see that the biggest step in the threshold increase happens between 5 and 10 fps for both formats (360-to-400ms for SQCIF and 280-to-320ms for QCIF, respectively). The threshold increase when shifting from 10 fps to 15 fps is noticeable, but less relevant.

Table 2. Numerical CDF results for out-of-synch detection in video early case

                    SQCIF                   QCIF
             5 f/s  10 f/s  15 f/s   5 f/s  10 f/s  15 f/s
  <= 160 ms     0      0       0      20     10      10
     200 ms     0      5       5      40     15      20
     240 ms    15     15      10      65     25      25
     280 ms    30     25      20      80     60      60
     320 ms    60     40      40      85     80      85
     360 ms    80     70      65      95    100     100
     400 ms   100     80      95     100    100     100
     440 ms   100     90     100     100    100     100
     480 ms   100    100     100     100    100     100

Table 3 shows the negative CDF results. We need such data in order to determine the new detection thresholds for out-of-synch in the mobile environment. A generic point (X,Y) in the negative CDF in this case means that "Y% of the test subjects did not detect a lip synchronization problem at X ms or earlier". In other words, this helps us state that when two media are out-of-synch in the range [0, X]ms, no lip synchronization problem is detected by a viewer of a mobile device when video is played back earlier than audio.

Table 3. Numerical negative CDF results for out-of-synch detection in video early case

                    SQCIF                   QCIF
             5 f/s  10 f/s  15 f/s   5 f/s  10 f/s  15 f/s
  <= 160 ms   100    100     100      80     90      90
     200 ms   100     95      95      60     85      80
     240 ms    85     85      90      35     75      75
     280 ms    70     75      80      20     40      40
     320 ms    40     60      60      15     20      15
     360 ms    20     30      35       5      0       0
     400 ms     0     20       5       0      0       0
     440 ms     0     10       0       0      0       0
     480 ms     0      0       0       0      0       0

The final goal is to search for thresholds that make the media out-of-synch unnoticeable to at least 80% of the test subjects. We chose 80% instead of a


higher percentage (i.e., 90% or 95%) for the following reasons:

1) The out-of-sync period in our experiments lasted very long (37 seconds), so it was easier to detect. In real-life situations it is expected that out-of-synch periods would last only a few seconds, making them more tolerable and even unnoticeable to the user.

2) The test subjects were looking for errors (i.e., watching very critically). In real life, people are not looking for errors.

3) The out-of-synch tolerance thresholds are supposedly a little higher than the thresholds for detection of lip synchronization problems. In fact, a user may normally detect a problem in lip synchronization but tolerate it for some time. This allows the relaxation of the detection thresholds.

The thresholds for out-of-synch detection in mobile environment are marked in grey in Table 3. A summary will be presented in the conclusion section of this paper.

4.2. Video Late

Table 4 shows the histogram of the distribution of percentages of subjects that detected problems in lip synchronization at certain time thresholds. Here the video was shown after the audio.

A preliminary analysis of the results shows that the detection of the lip synchronization problem happens much earlier than in the video early case. This phenomenon was expected, since physics (the speeds of sound and light) makes this case unnatural, and therefore the human brain does not tolerate it so easily. Also, at low frame rates the impression that video is late is natural, since in the audio a "word" starts on time but the corresponding lip motion does not, because some essential video frames might be skipped. The earlier detection times and the smaller variance between detections make the difference between frame rates smaller (grey-marked cells in Table 4). The majority of subjects detected problems for the SQCIF format when the synchronization skew was 80-120ms; for QCIF the synchronization skew was 80ms.

Table 4. Percentage of distribution for detection of out-of-sync for video late

                    SQCIF                   QCIF
             5 f/s  10 f/s  15 f/s   5 f/s  10 f/s  15 f/s
      40 ms    15      5       5      25     20      15
      80 ms    30     40      45      30     25      70
     120 ms    35     45      20      20     15       5
     160 ms     5      0      25      20     30      10
     200 ms    15     10       5       5     10       0

Table 5 shows the tabular CDF of the out-of-synch detection thresholds for different formats and frame rates when video is displayed after the corresponding audio. A generic point (X,Y) in the CDF in this case means, as in the video early case, that "Y% of the test subjects detected a lip synchronization problem at X ms or earlier".

The first three (SQCIF) cases show that at least 80% of the subjects detected the lip synchronization error in the range 120-160ms. For the QCIF format, the result is not so uniform: for frame rates of 5 and 10 fps, 160ms is still a good limit, while for 15 fps over 80% of the people detected the lip synchronization error at 80ms or earlier.

Table 5. Numerical CDF results of out-of-sync detection in video late case

                    SQCIF                   QCIF
             5 f/s  10 f/s  15 f/s   5 f/s  10 f/s  15 f/s
      40 ms    15      5       5      25     20      15
      80 ms    45     45      50      55     45      85
     120 ms    80     90      70      75     60      90
     160 ms    85     90      95      95     90     100
     200 ms   100    100     100     100    100     100

Table 6 shows the negative CDF results; we also need such data in order to determine the new detection thresholds for the out-of-synch region in the mobile environment. A generic point (X,Y) in the negative CDF in this case means, similarly to the video early case, that "Y% of the test subjects did not detect a lip synchronization problem at X ms or earlier". In other words, this helps us state that when two media are out-of-synch in the range [0, X]ms, no lip synchronization problem is detected by a viewer of a mobile device when the video is late.

Table 6. Numerical negative CDF results for out-of-synch detection in video late case

                    SQCIF                   QCIF
             5 f/s  10 f/s  15 f/s   5 f/s  10 f/s  15 f/s
      40 ms    85     95      95      75     80      85
      80 ms    55     55      50      45     55      15
     120 ms    20     10      30      25     40      10
     160 ms    15     10       5       5     10       0
     200 ms     0      0       0       0      0       0


We are searching for thresholds that make the media out-of-synch unnoticeable to at least 50% of the test subjects. We selected 50% instead of a higher percentage (i.e., 80% or 90%) for the following reasons:

1) The same three reasons as in the video early case, described in section 4.1.

2) Low frame rates and frame skipping give the impression that audio naturally comes before video (which is not the case). Lip motion does not start at the same time as audio, since some essential frames are skipped. When the test subjects were asked which media comes first, they always answered audio. In the video early case the tests could be continued when the answer was wrong, but in the video late case it was not possible to distinguish between a true lip synchronization error detection and a problem caused by low frame rates.

3) Video late is an unnatural process in our brain.

The out-of-synch detection limit is 80ms in all cases (although the 15 fps QCIF case shows 40ms, or roughly 60ms as a value interpolated for 50% of the test subjects, this is too strict a limit in practice). The tests show that in the video late case the small screen size does not give any benefit that would allow relaxing the detection thresholds. This is probably because we normally look at a mobile screen from much closer than a television screen.

4.3. Low Frame Rate Cases

In all low frame rate cases (5 fps), some people claimed that an in-sync case was out-of-sync. This is probably because 5 fps does not give a sense of real motion in a video sequence. These people were excluded from the test results, since they were complaining about the frame rate rather than reporting a lip synchronization problem.

5. Conclusions

Table 7 summarizes the recommended lip synchronization thresholds for audio/video synchronization in mobile devices. It is clearly easier to tolerate video frames arriving earlier than audio than video frames arriving too late. Gender or age made no difference in the perception of lip synchronization errors.

Table 7. Lip synchronization thresholds for mobile environment

Case Video early Video late

SQCIF 5 fps - 240 ms + 80 ms

SQCIF 10 fps - 240 ms + 80 ms

SQCIF 15 fps - 280 ms + 80 ms

QCIF 5 fps - 160 ms + 80 ms

QCIF 10 fps - 200 ms + 80 ms

QCIF 15 fps - 200 ms + 80 ms

In the following, a simple decision algorithm for a mobile device media player is given.

Example: case QCIF 10 fps

    if video arrives before audio and the synch skew is > 200 ms then
        buffer the frame
    else if video is skewed relative to the (anchor) audio within [-200, 80] ms then
        display the video frame
    else
        do not display the video frame

Even though the video late case showed no "gains" compared to the results in [2], we have to point out that we used a much stricter evaluation method than in [2]. Using the same method as in [2] would have given at least 40ms larger values in most of the cases.
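The decision algorithm above can be sketched in runnable form, assuming the QCIF 10 fps thresholds of Table 7 (-200 ms video early, +80 ms video late); the function and constant names are ours.

```python
# Thresholds from Table 7 for the QCIF 10 fps case: skew is the video
# frame's display time minus the anchor audio time, in milliseconds
# (negative skew = video early).
EARLY_MS, LATE_MS = -200, 80

def handle_frame(skew_ms: float) -> str:
    if skew_ms < EARLY_MS:
        return "buffer"    # too early: hold until inside the in-sync region
    if skew_ms <= LATE_MS:
        return "display"   # within [-200, 80] ms: render immediately
    return "discard"       # too late: skip, or accept a visible lip-sync error

print(handle_frame(-250), handle_frame(0), handle_frame(120))
# buffer display discard
```

For the other format/frame-rate cases of Table 7, only the EARLY_MS constant changes.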

Further developments of this research include testing with different types of video material (e.g., with medium/high motion) in order to confirm or complement the new lip synchronization thresholds for the mobile environment.

6. References

[1] H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", IETF RFC 3550, July 2003.

[2] R. Steinmetz, "Human Perception of Jitter and Media Synchronization", IEEE Journal on Selected Areas in Communications, Vol. 14, No. 1, January 1996, pp. 61-72.


[3] Y. Xu, Y. Chang and Z. Liu, "Calculation and Analysis of Compensation Buffer Size in Multimedia Systems", IEEE Communications Letters, Vol. 5, No. 8, August 2001, pp. 355-357.

[4] C.J. Sreenan, J-C. Chen, P. Agrawal and B. Narendran, "Delay Reduction Techniques for Playout Buffering", IEEE Transactions on Multimedia, Vol. 2, No. 2, June 2000, pp. 88-100.

[5] T.V. Johnson and A. Zhang, "Dynamic Playout Scheduling Algorithms for Continuous Multimedia Streams", Multimedia Systems, Vol. 7, 1999, pp. 312-325.

[6] Y. Shibata, N. Seta and S. Shimizu, "Media Synchronization Protocols for Packet Audio-Video System on Multimedia Information Networks", Proceedings of the 28th Annual Hawaii International Conference on System Sciences, 1995, pp. 594-601.


[P4] Miikka Lundan, Igor D.D. Curcio, “Mobile Streaming Services in WCDMA Networks”, Proc. IEEE International Symposium on Computers and Communications (ISCC ’05), 27-30 Jun. 2005, Cartagena, Murcia, Spain, pp. 231-236.

© 2005 IEEE. Reprinted with permission.


Mobile Streaming Services in WCDMA networks

Miikka Lundan and Igor D.D. Curcio
Nokia Corporation
P.O. Box 88, 33721 Tampere, Finland
E-mail: {miikka.lundan, igor.curcio}@nokia.com

Abstract

Third generation (3G) mobile phones are leading the telecommunications world into a new era. 3G networks offer increased speed for mobile users and enable new services. Mobile streaming applications benefit from these new capabilities, although streaming does not strictly require a 3G network to work. The purpose of this paper is to show how 3G can improve the performance of streaming applications. In particular, we will reference the 3GPP Packet-switched Streaming Service (PSS).

1. Introduction

3G networks are called Universal Mobile Telecommunication System (UMTS), and in Europe the radio technology is Wideband Code Division Multiple Access (WCDMA); in the USA, the system is called CDMA2000. In Europe the common UMTS standard was developed under the Third Generation Partnership Project (3GPP); in the USA, the counterpart 3GPP2 body was founded. The first 3GPP standard was Release 99. The specification work continued with Release 4, which introduced new multimedia services such as the Packet-switched Streaming Service (PSS). In 3GPP Release 5, IP was adopted as the only protocol in the core network, making all-IP mobile networks a reality. 3GPP Release 6 was finalized in the second half of 2004, and it includes improvements in the networks and services (for example, an advanced PSS service and the possibility of multicast/broadcast data are part of the Release 6 specifications).

PSS has reached its third level of specification with Release 6 and is in a very mature state. Release 4 introduced a very simple service with basic protocols and codecs. Release 5 extended the codecs and introduced the capability exchange, which enabled streaming services for different kinds of devices. The latest Release 6 brings a great number of improvements to the service. Rate adaptation and RTP retransmission can handle the inherent problems caused by the wireless interface and mobility. The content itself and its transmission can be protected and secured, and feedback mechanisms are included to describe the quality of the user experience.

This paper is organized as follows. Section 2 describes the 3G network architecture used in our testing activity. Section 3 introduces the QoS concept in 3G networks. Section 4 describes the 3GPP PSS standard. Section 5 contains experimental results in terms of signaling delays, media bit rates, packet loss rates, handover duration and buffering. In Section 6 we compare the results obtained with those of GPRS and EGPRS network configurations. We conclude and summarize our findings in Section 7.

2. 3G network architecture

Figure 1 shows the end-to-end UMTS network architecture [2]. The architecture can be divided into two parts: the UTRAN (UMTS Terrestrial Radio Access Network) and the Core Network (CN). The UTRAN is divided into Radio Network Subsystems (RNS), each consisting of several Base Stations (BS) and a controlling element, the Radio Network Controller (RNC). The Base Station (also known as Node B) implements the WCDMA physical radio access channels and transfers information from the transport channels to the physical channels according to the arrangements determined by the RNC. The Core Network consists of a circuit-switched part and a packet-switched part. For simplicity, Figure 1 shows only the packet-switched domain and its two elements: the Serving GPRS Support Node (SGSN) and the Gateway GPRS Support Node (GGSN). The SGSN supports packet communication towards the access network and is mainly responsible for mobility management. The GGSN maintains the connection towards other packet-switched networks such as the Internet.

Proceedings of the 10th IEEE Symposium on Computers and Communications (ISCC 2005) 1530-1346/05 $20.00 © 2005 IEEE

Figure 1 also shows one special feature of UMTS networks: a mobile phone can be attached to the network through several BSs, which may be controlled by different RNCs. This feature enables seamless handovers between cells.

The WCDMA technology is used between the BS and the mobile phone. GSM networks use Frequency Division Multiple Access (FDMA), where each mobile phone uses a different frequency when connecting to the BS. This is further enhanced with Time Division Multiple Access (TDMA), where one frequency is divided into time slots; each user gets one or more time slots of one frequency, and GSM uses eight time slots per frequency. Code Division Multiple Access (CDMA) takes a different approach: each user uses a different code when connecting to a BS, so mobile phones can share the same frequency band without time slots. If the required bit rate is low, little transmission power is needed; if the required bit rate is high, the power requirement is correspondingly higher.

WCDMA uses spread spectrum techniques, which produce a transmitted signal whose spectrum is much wider than the bandwidth of the actual information. This allows more users to share the same frequency and also enables mechanisms such as fast power control, diversity and soft handovers.
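The code-sharing idea behind CDMA can be illustrated with a small toy example (ours, not from the paper): two users spread their bits with orthogonal codes, transmit on the same band, and each receiver recovers its own bits by correlating the summed signal with its code.

```python
# Toy direct-sequence CDMA sketch: length-4 Walsh codes (illustrative values).
CODE_A = [1, 1, 1, 1]
CODE_B = [1, -1, 1, -1]

def spread(bits, code):
    """Spread each data bit (+1/-1) into len(code) chips."""
    return [b * c for b in bits for c in code]

def despread(chips, code):
    """Correlate the summed channel signal with one user's code."""
    n = len(code)
    out = []
    for i in range(0, len(chips), n):
        corr = sum(chips[i + j] * code[j] for j in range(n))
        out.append(1 if corr > 0 else -1)
    return out

bits_a = [1, -1, 1]
bits_b = [-1, -1, 1]
# Both users transmit simultaneously; the channel simply adds the signals.
channel = [x + y for x, y in zip(spread(bits_a, CODE_A), spread(bits_b, CODE_B))]

assert despread(channel, CODE_A) == bits_a   # user A recovers its bits
assert despread(channel, CODE_B) == bits_b   # user B recovers its bits
```

Orthogonality of the codes is what lets both streams coexist on the same frequency without time slots, which is the property the paragraph above describes.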

3. UMTS Quality of Service (QoS) profile

The Release 99 QoS profile [3] is common to UMTS and EGPRS. It provides more QoS parameters than the QoS profile used for GPRS in Release 97, and is used at least up to 3GPP Release 6. In the UMTS QoS profile the data traffic is divided into four classes:

Conversational Class: minimum fixed delay, no buffering, symmetric traffic, guaranteed bit rate.
Streaming Class: minimum variable delay, buffering allowed, asymmetric traffic, guaranteed bit rate.
Interactive Class: moderate variable delay, buffering allowed, asymmetric traffic, no guaranteed bit rate.
Background Class: large variable delay, buffering allowed, asymmetric traffic, no guaranteed bit rate.

The Conversational Class is the most demanding in terms of QoS requirements and is meant for real-time call type services. The Streaming Class is slightly less demanding and targets near-real-time unidirectional services. The Interactive and Background Classes are meant for non-real-time services.

Some of the most important QoS parameters for streaming are:

Maximum bit rate (kbps) defines the highest possible bit rate that the bearer may use when delivering data.
Guaranteed bit rate (kbps) defines the bit rate that a bearer must provide for data transmission.
Transfer delay (ms) defines the maximum delay that the bearer should provide.
SDU error ratio indicates the fraction of packets lost or detected as erroneous within a bearer connection.
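The parameters above can be grouped into a simple data structure. The sketch below is illustrative only: the field names and the example values are ours, not taken from 3GPP TS 23.107.

```python
# Illustrative sketch of a Release 99 QoS profile for a streaming bearer.
# Field names and example values are assumptions, not normative 3GPP syntax.
from dataclasses import dataclass

@dataclass
class QoSProfile:
    traffic_class: str            # Conversational/Streaming/Interactive/Background
    max_bitrate_kbps: int         # highest bit rate the bearer may use
    guaranteed_bitrate_kbps: int  # 0 means no guarantee (Interactive/Background)
    transfer_delay_ms: int        # maximum delay the bearer should provide
    sdu_error_ratio: float        # fraction of SDUs lost or detected as erroneous

# A Streaming-class bearer such as those discussed in this paper might be
# requested as follows (values chosen for illustration):
streaming_bearer = QoSProfile(
    traffic_class="Streaming",
    max_bitrate_kbps=384,
    guaranteed_bitrate_kbps=128,
    transfer_delay_ms=300,
    sdu_error_ratio=1e-4,
)
assert streaming_bearer.guaranteed_bitrate_kbps <= streaming_bearer.max_bitrate_kbps
```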

Figure 1 – End-to-end UMTS system architecture for multimedia streaming (diagram: a PC-based 3GPP PSS streaming client connects via BS, RNC and RNS within the UTRAN, through the SGSN and GGSN of the core network and the Internet, to a 3GPP PSS streaming server)



Maximum and guaranteed bit rates can be up to 2048 kbps according to the specifications (and even up to 16,000 kbps for Release 5 networks and later), but the first 3G mobile phones may provide only lower bit rates (e.g., up to 384 kbps connections), and operators may limit the data services even further (e.g., to 128 kbps).

4. Packet-switched Streaming Service (PSS)

The Packet-switched Streaming Service (PSS) was introduced as part of the 3GPP Release 4 standard [4][5]. Specification [4] defines the concepts of simple and extended PSS clients, although only the use case of the simple client is described; the description of the extended client is left to post-Release 4 specifications.

Figure 2 shows a typical PSS session. While browsing the Internet or via WAP (Wireless Application Protocol), the mobile user (UE in the figure) finds a URI (Uniform Resource Identifier) pointing to specific content.

Figure 2 - A typical PSS session [4] (message sequence diagram: the UE retrieves a Web/WAP page with a URI, activates a secondary PDP context with QoS = Streaming, exchanges RTSP DESCRIBE, SETUP and PLAY messages with the WAP/Web/Presentation/RTSP server, receives IP/UDP/RTP content from the media server, and ends with RTSP TEARDOWN and secondary PDP context deactivation; the PDP context messages are only examples of how to establish and terminate the bearer, and other alternatives exist)

This URI specifies a streaming server and the address of the content on that server. With this URI the user is able to establish a multimedia session and request the media description using the Real Time Streaming Protocol (RTSP) [6] DESCRIBE message. The media description is sent to the user by a media server via SDP (Session Description Protocol) [7] inside an RTSP response. If the described media content is suitable for the user device, the client can start the media set-up by sending an RTSP SETUP message for each media stream it has chosen. After a successful media set-up, the client requests playback of the media flow by sending an RTSP PLAY message to the server, which then starts to send one or more streams over the IP network.
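The request sequence described above can be sketched as plain-text RTSP messages. This is a hypothetical illustration: the URL, session identifier and port numbers are made up, and real clients send additional headers.

```python
# Hypothetical sketch of the RTSP requests a PSS client sends, in order.
# The URL "rtsp://example.com/trailer.3gp" and session id are illustrative.
def rtsp_request(method, url, cseq, extra=None):
    """Build a minimal RTSP/1.0 request as CRLF-separated header lines."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    lines += extra or []
    return "\r\n".join(lines) + "\r\n\r\n"

url = "rtsp://example.com/trailer.3gp"   # illustrative content URI
session_flow = [
    rtsp_request("DESCRIBE", url, 1, ["Accept: application/sdp"]),
    rtsp_request("SETUP", url + "/video", 2,
                 ["Transport: RTP/AVP;unicast;client_port=4568-4569"]),
    rtsp_request("SETUP", url + "/audio", 3,
                 ["Transport: RTP/AVP;unicast;client_port=4570-4571",
                  "Session: 12345678"]),
    rtsp_request("PLAY", url, 4, ["Session: 12345678", "Range: npt=0-"]),
    rtsp_request("TEARDOWN", url, 5, ["Session: 12345678"]),
]
for msg in session_flow:
    print(msg.splitlines()[0])
```

The four request/response pairs before playback (DESCRIBE, two SETUPs, PLAY) are exactly the message pairs whose delays are measured in Section 5.1.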

A simple PSS client includes the basic set of session establishment, set-up, control and transport protocols and media codecs [5]. Figure 3 shows an overview of the protocol stack for a more complete Release 6 PSS client [22]. The session set-up is done by means of the RTSP protocol carried over TCP (UDP transport is optional), including an SDP part that describes the media codecs and their parameters. The same protocols are also used for session control during a streaming session. The streaming of continuous media is based on RTP (Real-time Transport Protocol) for media transmission and RTCP (Real-time Transport Control Protocol) for feedback about the transmission quality [8].

The mandatory speech and video codecs for Release 4 PSS are AMR (Adaptive Multi-Rate) narrowband [9] and H.263 [10] baseline (Profile 0, Level 10); in our testing activity we used these two codecs. For audio, the codec is MPEG-4 AAC Low Complexity (AAC-LC) [11]. Some optional codecs are also defined: AMR Wideband [12], AAC Long Term Prediction (AAC-LTP) [11], H.263 Profile 3 Level 10 [13] and MPEG-4 Visual Simple Profile Level 0 [14]. PSS also supports discrete media formats for still images, bitmap graphics, vector graphics and text.

Figure 3 - Release 6 PSS protocol stack (RTP and RTSP over UDP/TCP over IP; HTTP over TCP for capability exchange and presentation description; payload formats for video, audio, speech and timed text; plus still images, bitmap and vector graphics, text, synthetic audio and scene description)

The PSS standard is further developed in Releases 5 and 6. The Release 5 specifications [15,16,17] define new formats for discrete media and a capability exchange mechanism, which enables content delivery to a wider set of mobile devices. The Release 6 PSS specifications [22,23,24] include progressive downloading, Digital Rights Management (DRM) and Secure RTP (SRTP) for confidentiality and integrity protection, application-level client-server bit rate adaptation that can handle bit rate variations caused by non-guaranteed




channel bit rates and inter-system handovers, RTP retransmission, which handles the problems related to the lossy nature of the air interface, and Quality of Experience metrics, which give service providers more precise information about the quality of a streaming session.

5. Experimental results

In our testing activity, the WCDMA phone worked as a modem between the PSS client laptop and the network to which the server was connected (see also Figure 1). We deliberately concentrate on application-layer protocol results and leave out lower-layer issues (such as logical and transport channels) as well as radio bearer issues. We used the same one-minute high-motion movie trailer test sequence as in our previous GPRS and EGPRS tests described in [18, 19, 20]. Even though we were testing an early implementation of WCDMA technology, it was already very different from (E)GPRS. The network offered much higher bit rates, which enable better video quality and make the session set-up faster. The most common handovers (or cell reselections, as they are called in (E)GPRS) are seamless, which is a big improvement for streaming. In the first commercial 3G networks only the Background and Interactive traffic classes were available; neither of these classes can guarantee bit rates or delay bounds. The testing environment was moderately loaded (~33% load); most of the time, our phone was the only phone in the cells. The cells could handle up to three 384 kbps data connections. If multiple phones are connected simultaneously to the same cell, they have to share the resources; the user bit rates are then lower, but still much higher than in (E)GPRS. The results in the subsequent sections are averages over ten test runs.

5.1 RTSP signaling delays

The session set-up starts with a TCP handshake, in which three TCP messages are exchanged between client and server. In our tests, the actual user set-up signaling included four message pairs (DESCRIBE, video SETUP, audio SETUP and PLAY). From the user's perspective, the total connection time is the sum of the RTSP signaling delay and the initial buffering delay, which is discussed in Section 5.5. The Pause and Play delays are examples of RTSP signaling that occurs during a PSS session. If the user gives the Pause command during a session, playback is paused immediately, but it takes some time before the whole streaming system is ready for the next command.
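As a back-of-envelope illustration (our own estimate, not a figure from the paper), the set-up signaling can be decomposed into round trips: one for the TCP handshake and one per RTSP request/response pair.

```python
# Back-of-envelope decomposition of the set-up delay (illustrative estimate):
# one TCP handshake round trip plus four RTSP request/response pairs.
# Server processing time is ignored, so this only gives a rough upper bound
# on the per-round-trip time implied by the 1.4 s average of Table 1.
SETUP_DELAY_S = 1.4          # measured average set-up delay (Table 1)
ROUND_TRIPS = 1 + 4          # TCP handshake + DESCRIBE, 2x SETUP, PLAY
rtt_estimate = SETUP_DELAY_S / ROUND_TRIPS
print(f"~{rtt_estimate:.2f} s per round trip")
```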

Table 1 – Set-up delays for 384 kbps bearer

                       Set-up delay   Pause delay   Play delay
Average (sec.)             1.4            0.4          0.2
Minimum (sec.)             1.2            0.3          0.2
Maximum (sec.)             1.9            0.5          0.3
STD deviation (sec.)       0.2            0.07         0.02

Results in Table 1 show that the set-up delay is on average 1.4 seconds and always below 2 seconds. The Pause and Play delays during a session are in the range of 0.2–0.5 seconds.

5.2 Media bit rates

The total bearer bit rate cannot be used entirely for media, since IP and lower-layer packetization require bits for headers, and the error management system (e.g., layer 2 retransmissions) may also use part of the gross bearer bit rate. Table 2 shows the media bit rates used in our tests. 384 kbps is the maximum bit rate supported by the tested 3G phones, but an operator may limit the service to lower bit rates, such as 128 kbps. The difference between media bit rate and network bit rate is left for lower-layer headers and retransmissions.
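The header overhead alone explains most of the gap between bearer and media rates. The calculation below is illustrative: the 500-byte payload size is an assumption of ours, not a value measured in the paper.

```python
# Illustrative overhead calculation (assumed packet sizes, not measured here):
# RTP/UDP/IPv4 headers take 12 + 8 + 20 = 40 bytes per packet, so the usable
# media rate is below the bearer rate even before layer-2 retransmissions.
HEADER_BYTES = 12 + 8 + 20             # RTP + UDP + IPv4 headers

def media_rate(bearer_kbps, payload_bytes):
    """Bearer rate minus RTP/UDP/IP header overhead for a given payload size."""
    fraction = payload_bytes / (payload_bytes + HEADER_BYTES)
    return bearer_kbps * fraction

# With ~500-byte payloads, a 384 kbps bearer leaves roughly 356 kbps for media;
# the remaining gap down to the 342 kbps of Table 2 goes to lower-layer
# headers and retransmissions.
print(round(media_rate(384, 500)))   # 356
```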

Table 2 - Maximum media bit rates (kbps)

Network   128   384
Media     114   342

5.3 Packet losses

The lossy nature of the air interface causes most of the packet losses in the RTP streams. Table 3 shows the observed average packet loss rate without handovers, so it can be regarded as the normal WCDMA performance with no mobility involved.

Table 3 – Packet loss rate

                          Packet loss rate
Average (%)                     0.1
Standard deviation (%)          0.2

Results show that WCDMA is able to yield near 0% packet loss rate to the application layer.



5.4 Handover events

WCDMA has four different types of handovers (HO) [1]:
1. Soft
2. Softer
3. Hard or Inter-frequency
4. Inter-system

The two simplest handover types (Soft and Softer) are seamless and therefore do not cause any packet losses. A soft handover occurs between two base stations that use the same frequency; a softer handover happens within one base station, between two sector cells. A hard handover is required if the frequency changes, which is why it is also called an Inter-Frequency handover (IFHO); the (E)GPRS cell reselection is an inter-frequency cell reselection. In Inter-System handovers (ISHO) the mobile phone switches between WCDMA and (E)GPRS networks. The difference between a handover and a cell reselection lies in the initiator of the procedure: for handovers the network initiates the cell change, while for cell reselections the mobile phone does. In our testing environment we were able to test only the first two handover types. Our test results showed that both Soft and Softer handovers are seamless, so no packet loss occurs during a streaming session. The packet arrival delay was also the same with or without handovers, so the handover does not cause any extra delay to the packet transmission. Table 4 shows the handover results.

Table 4 – Handover and packet arrival delay

                    Handover   Packet arrival delay   Packet arrival delay
                      time       (during handover)       (no handover)
Average (ms.)           0             56.5                   56.5
95-percent. (ms.)       0            143.3                  144.0

The handover time is the data flow interruption perceived at the application layer; as can be seen, it is zero. The packet arrival delay is also the same during and outside of a handover period.

The hard handover is similar to the (E)GPRS cell reselection and is not seamless. Although we were not able to test this handover type, we expect its duration to be bounded by that of the cell reselection in (E)GPRS networks [19,20], i.e., in the range of 2.9–3.4 seconds.

The Inter-system handover is the most difficult handover for an application, and the packet flow interruption is expected to be longer than for the IFHO.

5.5 Buffering delays

The initial streaming client buffering occurs right after the connection set-up. It is done to guarantee pauseless playback, and it helps against network jitter, handovers and variable media bit rates. Since WCDMA usually offers seamless handovers, the buffer does not need to handle packet flow interruptions. Buffering is still required to fight the delay and jitter caused by the network (especially when the Streaming traffic class is not available). Three seconds is the smallest buffering delay required to guarantee pauseless playback and avoid rebufferings in the middle of a streaming session.
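A simple check of our own makes the role of the buffer concrete: a client that pre-buffers B seconds of media survives any transmission gap shorter than B, so a 3 s buffer trivially covers seamless WCDMA handovers but sits right at the limit for (E)GPRS cell reselections of 2.9–3.4 s.

```python
# Simple model (ours, not from the paper): playback continues through a
# transmission gap only if the gap is shorter than the pre-buffered duration.
def survives(buffer_s, gap_s):
    """True if a gap of gap_s seconds does not drain a buffer_s-second buffer."""
    return gap_s < buffer_s

assert survives(3.0, 0.0)       # WCDMA soft/softer handover: no gap at all
assert survives(3.0, 2.9)       # shortest (E)GPRS reselection: just survives
assert not survives(3.0, 3.4)   # longest (E)GPRS reselection: rebuffering needed
```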

We also measured the memory consumption of the streaming client during our tests. The buffer usage depends on the media bit rate. With the highest media bit rate used (342 kbps), the maximum buffer usage was 142.3 KB for video and 4.7 KB for audio. The right size for a static media (pre-decoder) buffer would then be around 142.3 + 4.7 = 147.1 KB of memory (enough to accommodate speech and video content for up to 3 seconds). If the service is limited to 128 kbps, the buffer size requirements are 43.9 KB for video and 4.7 KB for audio (for a fixed AMR mode), for a total of 48.6 KB.
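The measured figures can be cross-checked against the nominal bit rate (an illustrative calculation of ours): buffering t seconds of media at r kbps requires r·t·1000/8 bytes.

```python
# Cross-check (illustrative): bytes needed to hold t seconds of media at r kbps.
def buffer_kb(rate_kbps, seconds):
    """Buffer size in KB (1 KB = 1024 bytes) for `seconds` of media at rate_kbps."""
    return rate_kbps * seconds * 1000 / 8 / 1024

# 342 kbps for 3 s gives ~125.2 KB; the 142.3 KB measured for video is a bit
# larger because the bit rate is variable and packetization adds overhead.
print(round(buffer_kb(342, 3), 1))
```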

5.6 Network loaded environment

If the cell capacity has been reached and more voice calls or higher-priority data connections enter the cell, the lower-priority data connections start to lose bandwidth. For example, a streaming session that starts at 384 kbps may, at some point during its lifetime, have an available bit rate of only 64 kbps. Buffers in the network can handle short overload periods, but if the overload lasts long and the buffers become full, data packets start to be dropped. Without PSS bit rate adaptation, the streaming quality degrades dramatically.

If the streaming server is capable of adapting to the available bandwidth [25], then only the RTSP signaling is affected by the network load. Table 5 shows the session set-up signaling with a 64 kbps connection.



Table 5 – Set-up delays for 64 kbps bearer

                       Set-up delay
Average (sec.)             1.9
Minimum (sec.)             1.7
Maximum (sec.)             2.1
STD deviation (sec.)       0.2

Compared to the 384 kbps case (Table 1), the delay is on average 0.5 seconds higher.

6. Comparison of (E)GPRS and WCDMA

From the subjective media quality perspective, GPRS provides an entry-level streaming service, EGPRS improves it to an adequate level, and WCDMA raises the service quality to an enjoyable grade. The values in Table 6 show the remarkable differences between WCDMA and (E)GPRS networks. Connection set-up in WCDMA is 3.7 times faster than in EGPRS and 5.2 times faster than in GPRS. A single PLAY message goes through a WCDMA network 5.5 times faster than through (E)GPRS. Media bit rates can be 3.6 times higher than in EGPRS, and the difference between WCDMA and GPRS media bit rates is enormous (9.7 times). Despite the huge difference in media bit rates, the buffer usage is moderately low: WCDMA requires only a 1.5 times larger buffer than EGPRS and a 3.1 times larger one than GPRS at the maximum media bit rate. All these improvements come without an increase in packet loss rates, which are close to zero for all the networks. One of the most significant improvements for real-time media is the seamless cell change: typical WCDMA handovers (Soft and Softer) are lossless, while EGPRS and GPRS cell reselections produce losses of around 3 seconds of media content.

Table 6 - WCDMA, EGPRS and GPRS results

Network                 WCDMA      EGPRS 2+1   GPRS 3+1
Connection set up       1.4 s      5.2 s       7.3 s
Pause                   0.4 s      1.1 s       1.6 s
Play (after Pause)      0.2 s      1.1 s       1.1 s
Max media bit rate      342 kbps   96 kbps     35 kbps
Buffer usage            147 KB     99 KB       48 KB
Avg. packet loss rate   0.1 %      0.0 %       0.0 %
Handover time           0 s        2.9 s       3.4 s
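The speed-up factors quoted in the comparison follow directly from the Table 6 values; the sanity check below reproduces them (the 9.7× bit rate figure rounds to 9.8 at one decimal).

```python
# Reproducing the comparison ratios from Table 6 (a sanity check, not new data).
setup = {"WCDMA": 1.4, "EGPRS": 5.2, "GPRS": 7.3}   # connection set-up, seconds
rate = {"WCDMA": 342, "EGPRS": 96, "GPRS": 35}      # max media bit rate, kbps

print(round(setup["EGPRS"] / setup["WCDMA"], 1))  # 3.7  (set-up speed-up vs EGPRS)
print(round(setup["GPRS"] / setup["WCDMA"], 1))   # 5.2  (set-up speed-up vs GPRS)
print(round(rate["WCDMA"] / rate["EGPRS"], 1))    # 3.6  (bit rate gain vs EGPRS)
print(round(rate["WCDMA"] / rate["GPRS"], 1))     # 9.8  (text rounds this to 9.7)
```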

7. Conclusions

Our tests suggest that streaming over the WCDMA Background traffic class works well and the quality is good. Session set-up is fast enough to give the feeling of using a wired connection. High bit rates, and therefore improved image quality and motion, make streaming resemble a small-screen television service. Seamless handovers enable truly mobile usage and consumption of multimedia content over 3G networks.

References

[1] Tero Ojanperä and Ramjee Prasad, WCDMA: Towards IP Mobility and Mobile Internet, Artech House, 2001.

[2] Harri Holma and Antti Toskala, WCDMA for UMTS, Radio Access for Third Generation Mobile Communications, Third Edition, Wiley, 2004.

[3] 3GPP, TS 23.107, V.3.9.0 (Release 1999), 2002-09.
[4] 3GPP, TS 26.233, V.4.2.0 (Release 4), 2002-03.
[5] 3GPP, TS 26.234, V.4.5.0 (Release 4), 2003-01.
[6] IETF, RTSP, RFC 2326, April 1998.
[7] IETF, SDP, RFC 2327, April 1998.
[8] IETF, RTP, RFC 3550, July 2003.
[9] 3GPP, TS 26.071, V.4.0.0 (Release 4), 2001-03.
[10] ITU-T, Recommendation H.263, February 1998.
[11] ISO/IEC 14496-3, 2001.
[12] ITU-T, Recommendation G.722.2, 2002.
[13] ITU-T, Recommendation H.263 Annex X, April 2001.
[14] ISO/IEC 14496-2, 2001.
[15] 3GPP, TS 26.233, V.5.0.0 (Release 5), 2002-03.
[16] 3GPP, TS 22.233, V.5.0.0 (Release 5), 2002-03.
[17] 3GPP, TS 26.234, V.5.7.0 (Release 5), 2005-03.
[18] Miikka Lundan and Igor D.D. Curcio, "RTSP Signaling in 3GPP Streaming over GPRS", Finnish Signal Processing Symposium (FINSIG '03), Tampere, Finland, 19 May 2003, TICSP Series #20, pp. 149-153.
[19] Miikka Lundan and Igor D.D. Curcio, "3GPP Streaming over GPRS Rel. '97", 12th IEEE International Conference on Computer Communications and Networks (ICCCN '03), Dallas, TX, USA, 20-22 October 2003, pp. 101-106.
[20] Miikka Lundan, "Streaming over EGPRS", 9th IEEE Symposium on Computers and Communications (ISCC '04), Alexandria, Egypt, 28 June-2 July 2004, pp. 969-974.
[21] Igor D.D. Curcio, "Multimedia Streaming over Mobile Networks: European Perspective", in B. Furht and M. Ilyas (Eds.), Wireless Internet Handbook: Technologies, Standards and Applications, CRC Press, 2003, pp. 77-104.
[22] 3GPP, TS 26.234, V.6.3.0 (Release 6), 2005-03.
[23] 3GPP, TS 26.233, V.6.0.0 (Release 6), 2004-09.
[24] 3GPP, TS 22.233, V.6.3.0 (Release 6), 2003-09.
[25] Igor D.D. Curcio and David Leon, "Application Rate Adaptation for Mobile Streaming", IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM '05), Taormina/Giardini Naxos, Italy, 13-16 June 2005.



[P5] Miikka Lundan, Igor D.D. Curcio, "Optimal 3GPP Packet-switched Streaming Service (PSS) over GPRS", Multimedia Tools and Applications Journal, Vol. 35, No. 3, Dec. 2007, pp. 285-310.

© 2007 Springer. With kind permission from Springer Science+Business Media.


Optimal 3GPP packet-switched streaming service (PSS) over GPRS networks

Miikka Lundan & Igor D. D. Curcio

© Springer Science + Business Media, LLC 2007

Abstract 3GPP packet-switched streaming service (PSS) is a standardized packet-based mobile streaming service, which is based on the IETF RTSP/SDP standards. PSS can be implemented over GPRS networks; however, these cannot usually guarantee any data rates or delay bounds, but they allow sufficient bandwidth for mobile streaming. GPRS cell reselections pose additional challenges for streaming, since data transmission breaks of several seconds may occur, and data may even be lost. The level of error protection of GPRS is good enough for mobile streaming if the correct quality of service (QoS) profile is configured. In this paper, we study the effect of different QoS parameter configurations to find optimal values for PSS over GPRS. The paper also shows a method for optimizing cell reselection management at the application layer, in order to provide seamless mobility. Results show that despite all the limitations of a GPRS environment, PSS is feasible with a decent quality of service.

Keywords GPRS · Streaming · 3GPP PSS · RTSP · RTP

1 Introduction

Multimedia services today put more visual mobile communication at the user's disposal. Internet protocol (IP) based streaming is one of the new services that mobile terminals offer, where the capability of receiving video traffic, together with color screen features, allows the creation of a completely new application category for mobile phones. Third generation mobile terminals are conceived to be the first devices providing these new audio-visual services. However, pre-3G multimedia services are already possible today.

Multimed Tools Appl, DOI 10.1007/s11042-007-0130-y

M. Lundan, I. D. D. Curcio (*), Nokia Corporation, Tampere, Finland, e-mail: [email protected]

M. Lundan, e-mail: [email protected]


The aim of this study is to show that at least one of these services, multimedia streaming, is already feasible over earlier mobile phones and networks.

Streaming enables users to watch video clips without fully storing them in the memory of the mobile terminal before playback. When the user wants to watch a media stream, he/she contacts a streaming server (through a streaming client player residing on the mobile terminal). This is done by means of RTSP (real-time streaming protocol) and SDP (session description protocol), which are specified by the Internet Engineering Task Force (IETF) [21, 23]. These protocols allow the user to get all the information about the media streams (e.g., bit rates, codecs and length) that he/she wants to play. RTSP is normally carried over reliable TCP/IP connections. After the initial RTSP negotiation, the media flow can start. This is usually transmitted via RTP (real-time transport protocol) [22] over unreliable UDP/IP encapsulation. Media data is often buffered in the receiving mobile terminal for several seconds. Being a non-conversational real-time application, streaming can tolerate more relaxed delay requirements than video conferencing applications. After media data is shown to the user, it can be discarded. This greatly reduces the memory requirements, and since the data is not stored in the mobile terminal, there is no practical limitation on the temporal length of a media stream to be played back. In addition, the user has the possibility to pause, rewind and fast-forward the media stream as if it were stored in the mobile terminal.

Streaming over mobile networks is specified by 3GPP (Third Generation Partnership Project) in [3]. The specification describes the components of the packet-switched streaming service (PSS), including session establishment/set-up and control protocols, data transport protocols, as well as codecs for different media types. The first network technology that enables the use of the protocols required for PSS is GPRS (general packet radio service), which supports IP based traffic. Although GPRS has not been designed optimally to support low-delay real-time traffic, it has the capability to support 3GPP PSS applications. In our research, the main aim is to address questions about feasibility, optimal network configuration, user-experienced connection set-up delay, and the effects of GPRS cell reselections on media consumption. For a long time, GPRS will be the backbone of mobile data transmission networks. Although enhanced GPRS (EGPRS) and 3G networks can provide faster and more reliable connections, GPRS services will remain the most common and people will continue to use them. Therefore, it is commercially important to offer as many services as possible through GPRS networks. Also, some challenges are common to the different radio network technologies; for example, lossy cell reselections can occur under GPRS, EGPRS and 3G networks. If such challenges can be solved already over GPRS, the solution can be inherited by more advanced radio technologies too.

This paper is structured as follows. Section 2 introduces some related work on this topic. Section 3 contains general information about GPRS networks and their QoS parameters. Sections 4 and 5 are about the packet-switched streaming service standardized in 3GPP and its related protocols. In Section 6 we introduce some concepts about streaming traffic characteristics. Section 7 contains the testing methodology applied throughout the paper. In Section 8 we describe the experimental results: RTSP signaling delays, packet loss rates, the effects of cell reselections, buffering and frame rates. In Section 9 we discuss optimal QoS parameters for GPRS streaming and suggest improvements to the service. Finally, in Section 10 we draw some conclusions.



2 Related work

Although streaming has been quite a hot topic in research articles of the last decade, mobile streaming, and especially GPRS-related mobile streaming, is not so widely discussed. This section surveys some publications related to the topic. Some of the earlier works focused on studies of IP voice over GPRS [31, 35]. Delay for voice over GPRS is analyzed in [35]. This study showed that, with a proper configuration, it is possible to provide IP voice services over GPRS with acceptable delays. The study also pointed out some modifications to GPRS that would improve the quality. Speech quality in a GPRS environment is studied in [31]. This paper shows the benefits of the AMR (adaptive multi-rate) speech codec compared to earlier codecs. The biggest obstacles against a good perceived quality are the lossy nature of GPRS networks and the packetization overhead.

Video traffic over GPRS is analyzed in [18, 19]. The usage of the MPEG-4 video codec for multimedia services over GPRS is studied in [18]. This paper shows which GPRS coding schemes should be used under certain network error rates. The results shown are aligned with those described in [19] (mentioned also in Section 3). More details about this topic can be found in [36]. Video streaming on embedded devices over GPRS networks is the topic of the research in [32]. This study makes use of personal digital assistant (PDA) based systems to test video streaming over GPRS. The results achieved are in line with those in this paper, but the bit rate and frame rate results are more conservative in [32]. The capacity of GPRS networks is studied in [20]. According to this study, four active streaming users required eight fixed packet data channels, whereas when using on-demand packet data channels, the quality of streaming is not acceptable.

A comparison of GPRS to other wireless networks (e.g., wireless local area networks (WLANs)) is also interesting. The effect of packet size on packet loss rates and delays is studied in [30]. The authors conclude that even though a difference in performance between small and large packet sizes is observed, it is not significant, and streaming over WLAN does not require packet size optimizations. The authors also conclude that there is enough capacity in a WLAN network to handle lower layer retransmissions without impact on the application layer. Although the end-to-end delays of a GPRS network are higher, buffering at the streaming client is usually able to hide any delay variation, and lower layer retransmissions are also possible in a GPRS streaming service. Without efficient buffering, packet optimization is needed. A seamless handover mechanism for WLAN streaming is introduced in [13]. The basic idea is very similar to that of third generation mobile networks: one WLAN mobile node can be attached to two different WLAN access points simultaneously during a soft handover. The authors show that packet delays can be used as a signal of upcoming packet losses, and therefore a mobile node can start looking for a new access point before packet losses occur. However, this approach requires that the streaming server sends two simultaneous streams through the two access points for a period of time, which is disadvantageous and complex to implement. A similar WLAN handover approach is analyzed in [12], where multi-path streaming is used. These types of seamless handovers are not possible with GPRS networks, but handovers can be hidden from the end user, as will be shown in this paper.

In our previous publications we have researched RTSP signaling delays over GPRS networks [33]. These results show that although RTSP delays are slightly larger compared to those typical of circuit-switched networks, they are within acceptable limits for new mobile services, such as streaming. Low bit rate streaming over GPRS is analyzed in [34].


This study showed that streaming is feasible even in cases where the GPRS network is not providing full capacity. In addition, [14] provides a good tutorial introduction to streaming over mobile networks.

3 General packet radio service (GPRS)

3.1 Architecture

General packet radio service (GPRS) networks [1, 16] provide an end-to-end mobile packet radio communication system that allows packet mode data transmission and reception. GPRS uses the same radio architecture as GSM (global system for mobile communications) [17]. Although GPRS was designed mainly to carry non-real-time traffic, it is also suitable for multimedia applications with no real-time conversational requirements, since it supports IP and provides enough downlink bandwidth capacity for applications such as streaming.

Two reasons make the implementation of a streaming service challenging over GPRS networks: cell reselection (CR) delay and non-guaranteed bit rates. A GPRS cell reselection takes much longer than a typical GSM handover; on average a GPRS cell reselection takes 2–5 s, while a GSM handover takes only 120–220 ms [2]. The mobile terminal makes the CR autonomously. The parameters used by the mobile terminal for CR are sent from the network, and they can be different in each cell. There are several types of CRs:

1. Inter Cell CR. The mobile terminal changes from a cell to another cell within the same BSC (base station controller).

2. Inter BSC CR. The mobile terminal changes to a cell that is served by another BSC.

3. Inter SGSN CR. The mobile terminal changes to a cell that is served by another SGSN (serving GPRS support node).

Beyond that, it is still possible to change between GGSNs (gateway GPRS support nodes) and between networks (roaming). In this paper, whenever cell reselections are mentioned, we always mean the inter cell reselection case.

Non-guaranteed bit rates are typical in GPRS connections, mainly due to two reasons. Firstly, whenever a connection needs more error protection (because of bad radio link conditions), the network switches to a more robust coding scheme (CS), where more bits are allocated to provide better air interface error resilience [15]. In GPRS there are four coding schemes, namely CS-1 through CS-4 [15, 17]. According to the results shown in [19], CS-1 provides enough protection to allow operation in noisy conditions (Eb/No 7–11 dB); CS-2 should be used in better environments (Eb/No 14–18 dB) and CS-3 in conditions above that. Secondly, if more users enter the same cell, the network load increases and there may be a need to share one or more of the time slots (TS) allocated to a user. A GPRS connection then shares (or loses) one or more time slots when the load increases. Table 1 contains GPRS bit rates. The maximum achievable bit rate can be as high as 171.2 kbps.
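As an illustrative sketch (not part of the paper's tooling), the Table 1 values follow directly from multiplying the per-time-slot rate of each coding scheme by the number of allocated time slots:

```python
# Per-time-slot air interface rates (kbps) for the four GPRS coding schemes.
PER_TS_KBPS = {"CS-1": 9.05, "CS-2": 13.4, "CS-3": 15.6, "CS-4": 21.4}

def gprs_bit_rate(coding_scheme: str, time_slots: int) -> float:
    """Aggregate bit rate in kbps for a given coding scheme and TS count."""
    if not 1 <= time_slots <= 8:
        raise ValueError("GPRS allocates between 1 and 8 time slots")
    return round(PER_TS_KBPS[coding_scheme] * time_slots, 2)

print(gprs_bit_rate("CS-2", 3))  # 40.2, the 3+1 TS CS-2 downlink used later in the tests
print(gprs_bit_rate("CS-4", 8))  # 171.2, the maximum achievable GPRS bit rate
```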

Figure 1 shows the basic architecture of a GPRS network. There are two new elements compared to a circuit switched GSM core network: the GGSN, which is connected to an IP backbone (e.g., the Internet), and the SGSN, which is connected to the GPRS core network.

The GGSN performs the routing of mobile addressed packets coming from an external network to the relevant SGSN, the routing of packets from the mobile terminal to the correct external network, the collection of charging and traffic data and the allocation of


dynamic IP addresses to the mobile terminals. The SGSN performs the protocol conversion from the IP backbone to the protocols used in the base station subsystem (BSS) and the mobile terminal, authentication and mobility management, data routing to external networks, collection of charging and traffic data, ciphering and data compression.

Figure 1 describes the system architecture and configuration we have used to implement 3GPP PSS over GPRS. The 3GPP PSS client application runs in a PC environment and a GPRS phone works as an infrared modem. The 3GPP PSS server is connected to the IP backbone at the far end of the architecture shown. In our experiments, we used a corporate shared test network, which is a standard (although smaller compared to a commercial one) GPRS network containing all the elements shown in Fig. 1. The test network is not an indoor laboratory network, but a full coverage network with real base stations and other network elements. The coverage of the network is limited to a certain city district. Since the test network is available to all company employees, the traffic characteristics are very diverse (e.g., voice and data calls, phones leaving from and arriving to the network). The test network was subject to a certain load all the time, but differed from the typical load pattern of an operator network. The air interface quality was good most of the time, but occasionally some coding scheme and time slot configuration changes occurred. We used two cells in our tests, and created the cell reselections by adding noise to the cell to which the phone was currently attached. When the noise level reached a certain threshold, the phone reselected the other cell.

3.2 Quality of service (QoS) parameters

There are four QoS parameters in GPRS Rel. '97 [1, 16]: precedence class, reliability class, delay class and throughput class (divided into two values). The precedence class indicates the relative priority of maintaining the service. For example, under network congestion, packets that may be discarded can be identified. The following precedence levels are defined:

Fig. 1 GSM/GPRS network

Table 1 GPRS bit rates for different time slot and coding scheme configurations (kbps)

        1 TS    2 TS    3 TS    4 TS    5 TS    6 TS    7 TS    8 TS

CS-1    9.05    18.1    27.15   36.2    45.25   54.3    63.35   72.4
CS-2   13.4     26.8    40.2    53.6    67.0    80.4    93.8   107.2
CS-3   15.6     31.2    46.8    62.4    78.0    93.6   109.2   124.8
CS-4   21.4     42.8    64.2    85.6   107.0   128.4   149.8   171.2


High precedence (Service commitments will be maintained ahead of all other precedence levels).

Normal precedence (Service commitments will be maintained ahead of low priority users).

Low precedence (Service commitments will be maintained after the high and normal priority commitments have been fulfilled).

The reliability class indicates the transmission characteristics that are required by an application. The reliability class defines the probability of loss, duplication, mis-sequencing or corruption of service data units (SDU). There are five reliability classes, from 1 to 5, where 1 is highly non-transparent (i.e., air interface errors are not acceptable in this configuration and every layer works in acknowledged mode (ACK), which retransmits the erroneous or missing data) and 5 is fully transparent (i.e., air interface errors are passed through to the upper layers without any attempt to recover them). Table 2 shows the reliability class protection levels.

For example, for Reliability Class 3 the GPRS tunneling protocol (GTP) works in unacknowledged (UNACK) mode, the logical link control (LLC) layer works in UNACK mode with data protection, the radio link control (RLC) works in acknowledged (ACK) mode, and the typical traffic type is non-real-time (Non-RT).

The delay class parameter defines the end-to-end transfer delay incurred in the transmission of SDUs through the GPRS network. Table 3 shows the delay bounds of the different delay classes.

The throughput class parameter indicates the data throughput requested by the user. The throughput is defined by two negotiable parameters: maximum bit rate (described by the peak throughput class) and mean bit rate (described by the mean throughput class). The peak throughput class is a value between 1 and 9, and defines bit rates between 8 and 2,048 kbps (as powers of 2). Table 4 shows the peak throughput classes.

Table 2 Reliability class protection levels

Class   GTP mode   LLC frame mode   LLC data protection   RLC block mode   Traffic type

1       ACK        ACK              Protect.              ACK              Non-RT
2       UNACK      ACK              Protect.              ACK              Non-RT
3       UNACK      UNACK            Protect.              ACK              Non-RT
4       UNACK      UNACK            Protect.              UNACK            RT
5       UNACK      UNACK            Unprotect.            UNACK            RT

Table 3 Delay bounds for GPRS networks

Delay class        128-byte packets                  1,024-byte packets
                   Mean delay (s)   95% perc. (s)    Mean delay (s)   95% perc. (s)

1 (Predictive)     0.5              1.5              2                7
2 (Predictive)     5                25               15               75
3 (Predictive)     50               250              75               375
4 (Best effort)    Unspecified


In practice, the largest useful peak throughput class for typical GPRS connections with three TS in downlink and one TS in uplink (also referred to as 3+1 TS) using CS-2 is 4 (i.e., 64 kbps). The mean throughput class is a value between 1 and 31 that defines bit rates between ∼0.22 bps and ∼111 kbps. The value 31 is used for best effort bit rates. This value includes, for bursty transmissions, periods in which no data is transmitted. For typical 3+1 TS CS-2 GPRS streaming connections, the useful mean throughput classes are 15–17. Table 5 shows the mean throughput classes.
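Since the peak throughput classes are powers of two starting at 8 kbps, the Table 4 mapping can be sketched as follows (an illustration, not text from the specification):

```python
def peak_throughput_kbps(peak_class: int) -> int:
    """Bit rate (kbps) for a GPRS Rel. '97 peak throughput class (1..9)."""
    if not 1 <= peak_class <= 9:
        raise ValueError("peak throughput class must be between 1 and 9")
    return 8 * 2 ** (peak_class - 1)

print(peak_throughput_kbps(4))  # 64, the largest useful class for a 3+1 TS CS-2 connection
print(peak_throughput_kbps(9))  # 2048, the top of the range
```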

4 Packet-switched streaming service (PSS)

The first standardized packet-switched streaming service (PSS) is in 3GPP Release 4 specifications [3, 5]. Release 5 specifications [6–8] define additional capabilities to the streaming service. The major new features are an enhanced capability exchange mechanism and new media formats. The former enables providing content for a wide set of mobile

Table 4 Peak throughput classes

Class   Bit rate (kbps)
1       8
2       16
3       32
4       64
5       128
6       256
7       512
8       1,024
9       2,048

Table 5 Mean throughput classes

Class   Bit rate
1       ∼0.22 bps
2       ∼0.44 bps
3       ∼1.11 bps
4       ∼2.2 bps
5       ∼4.4 bps
6       ∼11.1 bps
7       ∼22 bps
8       ∼44 bps
9       ∼111 bps
10      ∼0.22 kbps
11      ∼0.44 kbps
12      ∼1.11 kbps
13      ∼2.2 kbps
14      ∼4.4 kbps
15      ∼11.1 kbps
16      ∼22 kbps
17      ∼44 kbps
18      ∼111 kbps
31      Best effort


devices. The latter include synthetic audio, bitmap graphics, scalable vector graphics and timed text. Release 6 PSS specifications [9] were finalized by 3GPP at the end of 2004. Rel. 7 PSS (to be finalized in 2007) is expected to be the most advanced streaming service defined for the mobile environment [37]. Besides being backward compatible with the early PSS releases, Rel. 6 PSS adds new advanced media codecs (such as H.264 [29], AMR-WB+ [10] and Enhanced AAC+ [11]), and also a new set of features to make this service more robust and more appealing for deployment. Among the new features are bit-rate adaptation to adapt the application to different network environments, RTP retransmission to increase the level of error-robustness, and a quality of experience (QoE) protocol and metrics to enable operators to monitor the user QoE.

In our experiments, we were interested in understanding the performance of a basic PSS service. Therefore, we concentrated essentially on Release 4 PSS functionalities, despite having a Rel. 6 compliant system. Streaming services compliant to the latest releases are expected to perform better, and therefore PSS Rel. 4 performance results can be regarded as a lower bound for a streaming service expectation. In [5], the simple and extended PSS clients are defined, while [9] describes the protocols and codecs used. The main features and architecture of a mobile streaming service are described in this section. Further details for Rel. 4 and 5 PSS can be found in [14]. The end-to-end architecture for the PSS service includes a content server, a streaming client and a network between these two elements. The content server may reside within the operator's network or in the public Internet. The streaming client has to be designed for mobile use. Figure 2 shows the functional components of a PSS client. Most of the functional components were already in Rel. 4, but Rel. 5 added vector graphics decoder and capability exchange functionalities, while Rel. 6 included timed text decoder and synthetic audio decoder functionalities. Rel. 5 and 6 have made major enhancements to functionalities that originated from Rel. 4.

Figure 3 describes the Release 6 PSS protocol stack. Data transport can be either over RTP/UDP for continuous media (video, audio, speech and timed text) or HTTP/TCP for discrete media (scene descriptions, still images, graphics, text, timed text and synthetic audio). Session control and set-up can occur using two protocols depending on the media types. RTSP is used for RTP traffic, while HTTP is used for discrete media traffic.

5 Protocols for PSS

The session control protocol for 3GPP PSS is RTSP. The message flows between server and client for 3GPP PSS are shown in Fig. 4. After TCP synchronization [24] (which is a three-way handshake), the PSS client asks the PSS server to describe the media content using the RTSP DESCRIBE message. The server then sends a response message, which includes the SDP part. SDP contains information about audio and video codecs, media bit rates, stream location and length. Using the SDP information, the PSS client asks the server to set up the media by means of the SETUP message, which specifies the transport mechanism to be used for the streamed media. In this phase, the RTP and RTCP ports are defined for audio and video. The first SETUP message response includes a session number, which remains unchanged during the whole connection. The session number is used as an identifier in subsequent commands, so that the server knows the owner of every command. Each media requires its own SETUP message; therefore, there can be one or more SETUP messages during connection set-up. As a last operation, the PSS client sends the PLAY message to the server, which starts the media flow. If the user wants to stop the media flow, or after the media stream is over, the client sends a TEARDOWN message to


the server, which is acknowledged by a final response of the server. In addition to these messages, the user can pause the media flow by using the PAUSE message, and then eventually restart it by using a successive PLAY message.
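The message sequence above can be sketched as plain RTSP/1.0 requests. The URL, client ports and session identifier below are hypothetical placeholders:

```python
def rtsp_request(method: str, url: str, cseq: int, extra_headers=None) -> str:
    """Build a minimal RTSP/1.0 request with a CSeq header."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"] + (extra_headers or [])
    return "\r\n".join(lines) + "\r\n\r\n"

url = "rtsp://server.example/trailer.3gp"   # hypothetical content URL
session = "12345678"                        # would come from the first SETUP response

print(rtsp_request("DESCRIBE", url, 1, ["Accept: application/sdp"]))
print(rtsp_request("SETUP", url + "/trackID=1", 2,
                   ["Transport: RTP/AVP;unicast;client_port=4002-4003"]))
print(rtsp_request("PLAY", url, 3, [f"Session: {session}", "Range: npt=0-"]))
print(rtsp_request("TEARDOWN", url, 4, [f"Session: {session}"]))
```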

After RTSP streaming session set-up, several IP data flows start the communication between streaming server and streaming client:

1. One or more uni-directional RTP media flows, related to speech, audio or video (in downlink). An audio session is typically a single media flow, while a multimedia session requires at least two media flows.

2. One or more RTCP flows from server to client (in downlink). Each RTCP flow is related to an RTP flow. If there are two RTP flows, there are generally also two RTCP flows. The purpose of the RTCP sender reports (SR) [22] is to summarize the data transmission (information about timestamps and sent data) and give information to the receiver on how feedback messages are received; RTCP SRs also allow the receiver to perform media synchronization.

3. One or more RTCP flows from client to server (in uplink). Also here the number of RTCP flows depends on the number of RTP flows. The purpose of RTCP receiver reports (RR) [22] is to give information to the server about how a media stream is received. With RRs, the receiver is able to tell the server the amount of packet losses and delay, and the server can act according to the reported information.

Fig. 2 Functional components of a 3GPP Release 6 PSS client

Fig. 3 Release 6 PSS protocol stack

Fig. 4 RTSP message flows in 3GPP PSS
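As a simplified illustration of the receiver-report statistics (following the RTCP definitions in RFC 3550, not the paper's implementation), the loss figures a client reports are derived from expected versus received packet counts since the previous report:

```python
def rr_loss_fields(expected_interval: int, received_interval: int):
    """Interval loss count and the 8-bit fixed-point loss fraction of an RTCP RR."""
    lost = max(0, expected_interval - received_interval)
    # RFC 3550 carries the loss fraction as lost/expected scaled to 0..255
    fraction = (lost << 8) // expected_interval if expected_interval > 0 else 0
    return lost, fraction

lost, fraction = rr_loss_fields(expected_interval=250, received_interval=245)
print(lost, fraction)  # 5 packets lost -> fraction 5/256, i.e. about 2%
```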

6 Streaming traffic characteristics

Streaming data flows are mainly made of three types of traffic: RTSP signaling, RTP media and RTCP reports. Different traffic may be carried over different PDP (packet data protocol) contexts. For example, RTSP traffic could be carried over a primary PDP context, and RTP+RTCP traffic could be carried over a secondary PDP context. GPRS Rel. '97 allows using only one primary PDP context with a single IP address, which carries all three types of traffic. Successive releases of GPRS allow for multiple PDP contexts.

The typical RTSP message size, without any lower layer headers, is 40–200 bytes (when including more advanced header fields, the size may increase significantly). As said above, one of the RTSP responses includes an SDP message part, which increases the size of the message by approximately 500–1,000 bytes, depending on the session information. The size of ordinary RTCP RR packets is 44–140 bytes (without lower layer headers), depending on the usage of optional fields. The size of RTP packets depends on media type, packetization strategy and media bit rate. An AMR (adaptive multi-rate) [4] speech frame contains 20 ms of audio data. A typical packetization strategy for streaming is to encapsulate ten AMR frames into one RTP packet. The RTP payload format for AMR [25] defines the packetization principles. Table 6 shows typical RTP packet sizes and bit rates with lower layer headers using ten AMR frames per packet. The packet size is in octet-aligned mode with CRC and interleaving.
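The Table 6 bit rates can be reproduced from the packetization rule above: ten 20-ms AMR frames per packet give exactly five RTP packets per second (a sketch, not the paper's tooling):

```python
FRAMES_PER_PACKET = 10
FRAME_DURATION_MS = 20

def amr_ip_bit_rate(packet_size_bytes: int) -> float:
    """IP-level bit rate (kbps) for AMR with ten frames per RTP packet."""
    packets_per_second = 1000 / (FRAMES_PER_PACKET * FRAME_DURATION_MS)  # = 5
    return round(packet_size_bytes * 8 * packets_per_second / 1000, 1)

print(amr_ip_bit_rate(182))  # 7.3 kbps for the 4.75 kbps AMR mode
print(amr_ip_bit_rate(372))  # 14.9 kbps for the 12.2 kbps AMR mode
```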

Over a GPRS channel of 40.2 kbps (3+1 TS CS-2) with audio only streaming, all AMR modes can be used; however, if video is included too, then in practice the application may be forced to avoid the highest AMR bit rates, in order to leave enough bandwidth for video. Table 7 shows example bit rates (including lower layer headers) for a 500-byte video payload packet size using different video bit rates.

Thirty-six kilobits per second is the largest possible bit rate for video only streaming over a GPRS channel of 40.2 kbps. If audio is also included, the largest possible video bit rate is around 28 kbps. Five hundred-byte packets were the best trade-off between error resilience and bandwidth efficiency. Smaller packet sizes would help to increase error resilience, but if a 100-byte packet size were used, the maximum video bit rate would have been 20 kbps in order to compensate for the packet header overhead. Using larger packets is

Table 6 Packet sizes and bit rates for AMR encapsulation

AMR mode (kbps)   Packet size (bytes)   IP level bit rate (kbps)
4.75              182                   7.3
5.15              192                   7.7
5.9               212                   8.5
6.7               232                   9.3
7.4               252                   10.1
7.95              262                   10.5
10.2              322                   12.9
12.2              372                   14.9


more bandwidth efficient, but occasional packet losses due to air interface errors may cause severe errors and degradation in the user experience. The packetization for H.263 is described in the H.263+ RTP payload specification [26].
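The Table 7 values can similarly be approximated by assuming roughly 42 bytes of per-packet overhead on top of each 500-byte payload (20 IP + 8 UDP + 12 RTP plus a 2-byte payload header; the exact split is our assumption, chosen to match the table):

```python
PAYLOAD_BYTES = 500
HEADER_BYTES = 42  # assumed: 20 IP + 8 UDP + 12 RTP + 2 payload header

def video_ip_bit_rate(video_kbps: float) -> float:
    """IP-level bit rate (kbps) for a given video bit rate with 500-byte payloads."""
    return round(video_kbps * (PAYLOAD_BYTES + HEADER_BYTES) / PAYLOAD_BYTES, 1)

print(video_ip_bit_rate(28))  # 30.4 kbps: highest video rate that leaves room for audio
print(video_ip_bit_rate(36))  # 39.0 kbps: video-only limit on a 40.2 kbps channel
```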

7 Performance evaluation parameters and metrics

In our testing activity, the PSS server and client ran on separate laptops. The GPRS mobile terminal worked as a modem between the PSS client laptop and the mobile network, and we used an infrared connection between the mobile terminal and the client laptop. The test sequence was a 1-min movie trailer with very high motion. The media bit rates (including both audio and video) ranged from 12 to 32 kbps. This section also includes more information about the encoding parameters and the quality metrics used for analyzing the results. We used the following network condition: 3+1 TS and CS-2 (40.2 kbps). We estimate that the network load was between normal and low. Results of streaming over GPRS with lower bit rates can be found in [32].

7.1 Media encoding

We used the AMR speech codec for audio encoding and H.263 (Profile 0, Level 10) [27, 28] to encode the video material. The packetization for these media followed the IETF specifications [25] and [26], respectively, for audio and video. Packet sizes are in general configurable and dynamically adjustable at run-time; however, we used a predefined setting. The speech packets were of constant size, depending on the speech encoding bit rate (AMR mode). Their sizes were 131, 161 and 181 bytes (these values also include payload header information), respectively, for 4.75, 5.90 and 6.70 kbps speech. The video packets were of variable size. The video packetization strategy was to encapsulate one video frame into one RTP packet, with a limit of 1,430 bytes per packet (including payload header information). If a video frame was larger than 1,430 bytes (e.g., an intra-coded frame, also referred to in the following as INTRA), it was fragmented into multiple RTP packets. The overall average packet size was 515 bytes with a standard deviation of 373 bytes. Video bit rates ranged from 7.25 to 25.3 kbps. The content we used in these tests was created using the parameters described in Table 8.

The intra-coded frame refresh rate of 5 s was chosen because it was found to be a good compromise between frame rate (i.e., video smoothness) and error propagation. If the intra-coded frame refresh rate is too frequent, the frame rate has to be lowered in order to maintain the average encoding bit rate. If the intra-coded frame refresh rate is too low, the error propagation of packet losses is visible to the user for too long. Figure 5 shows the

Table 7 Video bit rates for 500-byte payload packets

Video bit rate (kbps)   IP level bit rate (kbps)
12                      13.0
16                      17.3
20                      21.7
24                      26.0
28                      30.4
32                      34.7
36                      39.0


quality difference (expressed in terms of peak-signal-to-noise ratio (PSNR)) between different intra-coded frame rates.

7.2 Evaluation metrics

We used the following evaluation metrics for GPRS streaming. The main emphasis is on the end-user side that receives media streams using a client streaming player (Fig. 4):

Connection setup delay (seconds): This is the time elapsed between when the user asks the streaming client to start the streaming session and the time when the session between client and server is set up and ready to initiate the media transfer. In RTSP it means the time difference between when the client sends to the server the first and receives from the server the last of the following messages: DESCRIBE-200/OK-SETUP-200/OK-PLAY-200/OK (the sequence

Fig. 5 PSNR (dB) for intra-coded frame refresh intervals of 3, 5, 7 and 10 s at a 3% packet loss rate

Table 8 Media encoding parameters

Parameter                                Value

AMR audio
Silence compression                      ON
Number of speech frames per RTP packet   10
Alignment                                Byte
CRC                                      OFF
Bit sorting                              OFF
Interleaving                             OFF

H.263 video
Format                                   QCIF (176×144 pixels) or SQCIF (128×96 pixels)
Profile                                  0, Level 10
Rate control                             TMN 5
Annexes                                  No (H.263 baseline)
Packetization                            1 video frame/RTP packet (max packet size = 1,430 bytes)
Intra-coded frame refresh rate           0.2 fps (= 1 intra-coded frame every 5 s)
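The packetization rule above (one video frame per RTP packet, fragmented at 1,430 bytes) can be sketched as follows; the frame sizes in the example are hypothetical:

```python
MAX_PACKET_BYTES = 1430

def packetize_frame(frame: bytes) -> list:
    """Split one encoded video frame into payloads of at most 1,430 bytes."""
    return [frame[i:i + MAX_PACKET_BYTES]
            for i in range(0, len(frame), MAX_PACKET_BYTES)]

p_frame = bytes(515)    # an average-sized inter-coded frame
i_frame = bytes(3200)   # a large intra-coded (INTRA) frame
print(len(packetize_frame(p_frame)))  # 1 packet
print(len(packetize_frame(i_frame)))  # 3 packets (1430 + 1430 + 340 bytes)
```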


SETUP-200/OK of messages is repeated n times for n different media). In addition to RTSP, the connection set-up delay includes TCP synchronization.

Pause–play delay (seconds): The pause delay is the time between pausing the playback and when the system is ready for the next command. In RTSP, it means the time difference between the PAUSE and the 200/OK messages. A pause command can be either user driven or PSS client driven. The play delay is the time between when the PSS client requests to play and the actual playback starting. In RTSP, it means the time difference between the PLAY and 200/OK messages. The play command can be either user driven or PSS client driven.

Round trip time (seconds): The time between sending an RTSP request and receiving an RTSP response.

Packet loss rate (percentage): Lost media packets during streaming (with and without network cell reselection).

Packet flow stop time (seconds): Cell reselection length in time. This is the delay between two arriving packets before and after the cell reselection event.

Initial buffering delay (seconds): Time for initial buffering to enable streaming without rebuffering.

Total user delay (seconds): The connection setup delay plus the initial buffering delay.

Buffering requirement (kilobytes): The memory requirement for the dynamic pre-decoder buffer to store the amount of data equal to the initial buffering delay.

Video frame rate (frames per second): The average video frame rate.

8 Test results

This section contains all the results from our tests. First we show RTSP signaling results, followed by packet loss rates and cell reselection results. After that we show buffering results. We performed ten runs and averaged the results over the number of runs.

8.1 TCP/RTSP connection set-up delays

The connection set-up delay includes RTSP signaling and TCP synchronization delays, but it does not include the first buffering operation. From the user's perspective, the initial latency is the sum of connection set-up and buffering. In our tests, we found that in all cases the connection set-up signaling could survive a cell reselection event, which means that even though cell reselection added extra delays, there were no failed connection set-ups. We used Reliability Class 3 in all signaling tests. Table 9 shows the results of connection set-up delay.

The connection setup delay is the time from when the user presses the start button until the first media packets arrive at the phone. More details about RTSP signaling can be found in [33].


8.2 RTSP pause–play delay

When the PSS client wants to pause the media flow, it sends a PAUSE message to the server. As said above, a pause command can be either user driven or PSS client driven (e.g., in the case of rebuffering). Although the playback can be paused immediately on the end user's screen, it takes some time before the whole system is ready for the next command (e.g., a play command). Table 10 shows the signaling delay of a PAUSE message and the subsequent PLAY.

The PAUSE signaling delay should be interpreted to mean that after 1.6 s the system is ready for the next command. The reason why the PAUSE delay is larger than the PLAY delay is that PLAY is sent with no traffic on the network channel, while PAUSE is sent when the channel is partially filled with RTP packets, which slightly increases the delay.

8.3 RTSP round trip times

The round trip time (RTT) describes how fast the messages travel through the system end-to-end, and therefore gives an estimate of how fast the system can change its state. Table 11 shows the RTT of RTSP PLAY messages during connection set-up. The packets used for these tests are 90 bytes in the uplink and 200 bytes in the downlink direction.

RTTs depend on the network bit rate and the load; therefore 3+1 TS CS-2 is obviously the fastest (0.8 s) for a given network load. The table shows that it takes 0.8–1.2 s to change the state.

8.4 RTP packet losses

Packet losses in RTP streaming are due to three different reasons: (1) the lossy nature of the GPRS channel; (2) the traffic bit rate can be greater than the available network bit rate; (3) a cell reselection event. An accurate analysis of the loss behavior should take these three factors into account when researching the causes of degraded media quality. We will therefore try to address the three issues separately, with the objective of quantifying the weight of each factor (cell reselections will be analyzed in the next sub-section).

Table 9 Connection set-up delays

  Connection set-up    3+1 TS CS-2
  Average (s)          7.3
  Minimum (s)          5.3
  Maximum (s)          8.8
  95-Percentile (s)    8.8

Table 10 Pause–play delays

  Signaling delays (3+1 TS CS-2)    Pause    Play
  Average (s)                       1.6      1.0
  Minimum (s)                       1.2      0.7
  Maximum (s)                       2.0      1.6
  95-Percentile (s)                 1.9      1.4

Multimed Tools Appl

Lossy nature of the GPRS channel  Table 12 shows the observed average packet loss rates (PLR) for both speech and video for Reliability Classes (RC) 2, 3, 4 and 5, for video streamed at 16 kbps over CS-1. The values do not include packet losses under cell reselection conditions, so the table has to be regarded as normal GPRS performance with no mobility involved. The results are from a typical GPRS network with good coverage.

The results show that RC 2 and RC 3 yield 0% PLR at the application layer, because RLC ACK mode retransmits erroneous or lost blocks. Reliability Class 2 provides a totally packet-loss-free environment, since LLC ACK mode provides even better error protection. The performance of RC 4 and RC 5 is worse, because both use RLC UNACK mode and there is no retransmission at the data link layer. RC 5 seems to perform better than RC 4 for speech, but the expectation was actually that RC 5 would perform worse, since RC 5 uses no data protection in the LLC layer, while RC 4 does. The deviation is due to one isolated test case where the PLR was higher; this can also be seen from the higher standard deviation value. However, we believe that statistically RC 5 performance would converge to values no better than RC 4. Summarizing, RC 3 is the best reliability class using LLC unacknowledged mode for wireless streaming. With RC 3, only 10% of the test cases experienced packet losses, while RC 4 and RC 5 (100% of the test cases in both groups had packet losses) performed worse and are not recommended for PSS.

Traffic bit rate > available network bit rate  Packet losses also occur when the media bit rate is higher than the bit rate offered by the network. In this case, the most common result is that the media quality is fair for a few seconds (the time needed for the network buffers to fill), and then drops because the network buffers overflow, discarding packets. This can happen if the coding scheme changes because of a higher error protection need, or if more users enter the cell and time slots must be shared. Table 13 shows maximum application bit rates mapped to GPRS channel bit rates. We considered only RC 3, since it behaves as an error-free channel and is well suited for our investigations in this case.

From our experiments, the maximum allowed bit rates for 3+1 TS connections are 25 and 35 kbps, depending on the coding scheme used.
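
One plausible reason why the usable media bit rate sits a few kbps below the gross channel rate is per-packet protocol overhead; the sketch below illustrates this with assumed packet sizes (the 40-byte RTP/UDP/IP header and the 500-byte payload are illustrative values, not measurements from these tests).

```python
# Illustrative sketch (assumed values, not from the tests): how much of the
# gross GPRS channel bit rate is left for media after RTP/UDP/IP headers.

HEADER_BYTES = 40  # IPv4 (20) + UDP (8) + RTP (12), uncompressed

def goodput_kbps(channel_kbps: float, payload_bytes: int) -> float:
    """Media goodput remaining after per-packet header overhead."""
    fraction = payload_bytes / (payload_bytes + HEADER_BYTES)
    return channel_kbps * fraction

# With ~500-byte payloads, CS-2 at 40.2 kbps leaves roughly 37 kbps of
# goodput, which is consistent with a 35 kbps maximum media bit rate.
print(round(goodput_kbps(40.2, 500), 1))
```

Smaller payloads would increase the relative overhead and push the usable media rate further below the channel rate.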

Table 12 PLRs for different reliability classes

  Packet loss rate     Speech                       Video
                       RC 2   RC 3   RC 4   RC 5    RC 2   RC 3   RC 4   RC 5
  Average (%)          0.0    0.0    0.6    0.4     0.0    0.0    0.9    1.0
  Minimum (%)          0.0    0.0    0.0    0.0     0.0    0.0    0.4    0.4
  Standard dev. (%)    0.0    0.1    1.6    0.3     0.0    0.1    0.4    0.3

Table 11 RTT

  Round trip time      3+1 TS, CS-1 (27.2 kbps)    3+1 TS, CS-2 (40.8 kbps)
  Average (s)          1.2                         0.8
  Minimum (s)          0.8                         0.8
  Maximum (s)          1.5                         0.9
  95-Percentile (s)    1.5                         0.9

8.5 GPRS cell reselections

Cell reselections can occur during one of the following phases: RTSP signaling, initial buffering or media flow. A cell reselection during RTSP signaling caused longer signaling delays, but no connection rejections occurred. Since RTSP signaling is transmitted over a reliable TCP connection, the retransmission mechanism recovered the lost packets.

A cell reselection during the first buffering caused reduced video quality, since RTP packets were lost during the cell reselection. Rebuffering does not occur because the buffer is not empty; it is just waiting for more data to start playback. There are two possible effects:

1. The first INTRA frame is corrupted and the errors propagate until the next INTRA frame. If the first INTRA frame is totally or partially lost, correct decoding of the following predictive frames is difficult, since the data that prediction should refer to is corrupted.

2. The first INTRA frame is displayed, but the video is frozen until the next INTRA frame, when motion really starts. If predictive frames are lost, there is no motion until new data arrives. Some prediction errors might occur before the next INTRA frame arrives.

A cell reselection during media flow caused either:

1. Lower video quality. This was the most frequent case, where the user sees a gap in the media flow (no speech, and simply a still image shown during the gap). If the first frame after the cell reselection is not an INTRA frame, there are severe prediction errors until the next INTRA frame arrives.

2. Rebuffering. In this case the data flow stopped before the actual cell reselection was over, the display was paused, and the buffers were refilled.

It has to be pointed out that the above degradation effects are not the same in all implementations; they depend on the error resilience techniques implemented in a PSS client. Seen from the application layer, a cell reselection can be split into three phases: (a) the pre cell reselection period; (b) the cell reselection break; and (c) the post cell reselection period. In periods a and c, a cell reselection causes a certain number of packets to be lost in some cases and for a certain reliability class (see Fig. 6). The packet loss rate is higher closer to period b. In period b, all packets are lost. The length of period b is variable (see Table 14). Reliability Class 3 survived periods a and c with 0% packet loss rate, but RC 4 and RC 5 lost packets. Figure 6 shows that in the last seconds before the cell reselection break, on average two packets are lost.

We have computed the packet loss rate during the real cell reselection for the same test cases as before (the 16 kbps sequence over CS-1). To give an estimate of the data lost during a cell reselection in period b, where all packets are lost, Table 14 shows the length of period b in seconds. The average flow stop time is from 2.2 to 3.7 s, showing that RC 2 is the class that produces the longest gaps in the transmission flow. This is because RC 2 is more complex (e.g., it uses LLC layer retransmissions) than the other reliability classes, so the cell reselection takes more time. The benefit of RC 2 is that it provides loss-less cell reselections. Although no data is received during the cell reselection, the data is buffered in the network, and after the cell reselection is over, the data transmission continues from the first buffered packet.

Table 13 Traffic bit rate thresholds for RC 3

  Network bit rate (kbps)    Media bit rate (kbps)
  3+1 CS-1 (27.15)           25
  3+1 CS-2 (40.2)            35

8.6 Media buffering

Buffering can be either initial buffering or rebuffering. Initial buffering happens at the beginning of the session and is required to enable smooth media playback in the presence of network bandwidth variations and delay jitter, and to allow seamless mobility. With a proper initial buffering size, it is possible to avoid rebufferings. Rebuffering occurs in the middle of a media flow if the buffer level gets too low. Cell reselections, increased delay jitter, and a variable channel rate are some of the most common reasons for rebufferings.

8.6.1 Initial buffering

Initial buffering occurs right after the connection set-up. Table 15 shows the delays for initial buffering. The first column shows results for RC 3 and an 8 s buffer size, while the second column shows results for RC 3 and a 12 s buffer size. In our tests, an 8 s buffer was the minimum buffer size that enabled smooth data flow without rebuffering. In order to handle one cell reselection, we had to use a 12 s buffer. In theory, it should take 8 s to fill 8 s of buffer, but network delays or variations in the available network bandwidth can cause extra buffering delays, as can be seen from Table 15, where the maximum initial buffering delay is 13.1 s for the 8 s buffer.

Fig 6 Packet losses for cell reselection using RC 4. [Plot: average number of lost packets per second (video, audio and total) from 5 s before to 5 s after the cell reselection break B, with periods A, B and C marked.]

Table 14 Packet flow stop during cell reselection break (period b)

  Packet flow stop     RC 2    RC 3    RC 4    RC 5
  Average (s)          3.7     3.4     2.3     2.2
  Minimum (s)          2.8     1.6     2.1     1.5
  Maximum (s)          5.1     8.2     2.6     2.6
  Standard dev. (s)    1.0     2.75    0.2     0.5

After having computed the connection set-up delay and the initial buffering delay, it is easy to compute the total delay that the user experiences from the start of the connection until the media starts to play, simply by adding the values in Table 9 to those in Table 15. Table 16 shows an example of the user-experienced connection delay. We assume no cell reselection during the connection set-up (reasonably the most frequent case) and a client buffer size of 12 s.
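
The addition above can be checked directly against the published averages; a minimal sketch using the values from Tables 9 and 15:

```python
# Total user-experienced connection delay = connection set-up delay (Table 9)
# + initial buffering delay (Table 15, RC 3 with a 12 s buffer).
connection_setup_avg_s = 7.3    # Table 9, average
initial_buffering_avg_s = 12.8  # Table 15, 12 s buffer, average

total_user_delay_avg_s = connection_setup_avg_s + initial_buffering_avg_s
print(round(total_user_delay_avg_s, 1))  # 20.1, matching Table 16
```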

8.6.2 Rebufferings

There is a tradeoff between rebuffering length and the number of rebufferings. In our client it was not possible to set the initial buffering time and the rebuffering time separately. If a small initial buffering was used, the rebuffering time was also short; however, rebufferings could occur more frequently. If we set a large initial buffering time, there were fewer rebufferings, but the total user-experienced connection delay was also larger. In our experiments, we tried to make the number of rebufferings equal to zero.

In general, rebuffering delays can be caused by different events. The most frequent are:

1. Rebuffering without a cell reselection event. This rebuffering is caused by the bandwidth variability, the lossy nature of the GPRS channel and the delay jitter of the incoming packets. Lost packets and packets arriving too late reduce the buffer level, which ultimately requires a rebuffering to refill the buffer. The formula for the length of this delay is:

Rebuffering delay = Pause–play signaling time + buffering delay

2. Rebuffering caused by a cell reselection event. During the cell reselection, no data is transmitted to the client. Even though data may not be lost, the buffer level drops. If the cell reselection takes more time than what is left in the buffer, an immediate rebuffering occurs. If the cell reselection takes less time than what is left in the buffer, the buffer level is lower and eventually Case 1 above may occur. The formula for the length of this delay is:

Rebuffering delay = Pause–play signaling time + buffering delay + CR duration − buffer level before CR

Table 15 Initial buffering delays for 3GPP PSS

  Initial buffering delay    RC 3 (8 s buffer)    RC 3 (12 s buffer)
  Average (s)                10.4                 12.8
  Minimum (s)                8.6                  11.8
  Maximum (s)                13.1                 14.7
  95-Percentile (s)          13.0                 14.4

Table 16 User-experienced connection delay

  Total user delay (3+1 TS)
  Average (s)          20.1
  Minimum (s)          17.2
  Maximum (s)          23.6
  95-Percentile (s)    23.2

3. Rebuffering with a cell reselection in the middle. This is a limit case, where during Case 1 above a cell reselection event happens, enlarging the user's waiting time for the continuation of the media streams. The formula for the length of this delay is:

Rebuffering delay = Pause–play signaling time + buffering delay + CR duration

Usually, the rebuffering time is larger than the initial buffering time because it includes the pause and play signaling time (2 s per message). Case 1 above means that if a rebuffering occurs during playback, the user should on average wait 16 s before seeing the video restart, if we consider a 12 s buffer and 4 s for signaling. Case 2 above means that if during playback a cell reselection event occurs that is long enough to empty the buffer, the effect on the media stream is a rebuffering of at most 16+x s, where x is the time the buffer is empty before the cell reselection is over. The maximum value of x is the cell reselection length, and the minimum value is the cell reselection length minus the buffer level. If the buffer level is higher than the cell reselection length, there is no need to rebuffer the data. Case 3 above means that if a rebuffering occurs during playback, and within the rebuffering a cell reselection event also occurs, the user should on average wait a period of 16+y s, where y is the length of the cell reselection.
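
The three cases can be sketched with a single helper; this is an illustrative reading of the formulas above (the 4 s signaling and 12 s buffering inputs are the values assumed in the text, and the function name is ours):

```python
# Sketch of the three rebuffering-delay cases. All values are in seconds.

def rebuffering_delay(signaling_s: float, buffering_s: float,
                      cr_duration_s: float = 0.0,
                      buffer_level_at_cr_s: float = 0.0) -> float:
    """Case 1: no cell reselection (cr_duration_s == 0).
    Case 2: a CR empties the buffer -> add the part of the CR that the
            remaining buffer level does not cover.
    Case 3: a CR happens during an ongoing rebuffering
            (buffer_level_at_cr_s == 0) -> add the full CR duration."""
    extra = max(cr_duration_s - buffer_level_at_cr_s, 0.0)
    return signaling_s + buffering_s + extra

# Case 1: 4 s PAUSE+PLAY signaling + 12 s buffering = 16 s wait.
print(rebuffering_delay(4.0, 12.0))          # 16.0
# Case 2: 3 s CR with 1 s of buffer left adds 2 s on top of the 16 s.
print(rebuffering_delay(4.0, 12.0, 3.0, 1.0))  # 18.0
# Case 3: 3 s CR during the rebuffering itself adds the full 3 s.
print(rebuffering_delay(4.0, 12.0, 3.0))     # 19.0
```

Note that when the buffer level exceeds the CR duration the text says no rebuffering occurs at all; the sketch only models the delay once a rebuffering has been triggered.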

8.6.3 Buffer capacity

We have measured the memory consumption of the 12 s buffering during the execution of our tests using the maximum media bit rates described in Table 13. The results are summarized in Table 17:

The first column of results in the table indicates the maximum memory consumption for speech and video, respectively, for the 25 kbps media bit rate. Similar results are shown for the 35 kbps media bit rate in the second column. The right size of a media (pre-decoder) dynamic buffer can be calculated by adding the speech and video values (e.g., 10.6+44.0=54.6 KB for 35 kbps media data).
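
As a rough cross-check, the pre-decoder buffer size can also be estimated as buffering time times media bit rate; a minimal sketch (the function is ours, and the estimate ignores packetization and container overhead, which is why it falls slightly below the measured 54.6 KB):

```python
# Rough estimate: pre-decoder buffer size = media bit rate x buffer duration.

def buffer_size_kb(media_kbps: float, buffer_s: float) -> float:
    """Kilobytes needed to hold buffer_s seconds of media at media_kbps."""
    return media_kbps * buffer_s / 8.0  # kbit -> KB

# 35 kbps media with a 12 s buffer: about 52.5 KB, close to the
# 10.6 + 44.0 = 54.6 KB measured in Table 17.
print(buffer_size_kb(35.0, 12.0))  # 52.5
```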

8.7 Video quality

Table 18 shows typical video frame rates for different video bit rates. Since the frame rate is an encoding parameter, these values are indicative and can change depending on the content and the encoding procedure; therefore they must not be regarded as the maximum achievable frame rates. The values were logged from the video decoder. The QCIF (quarter common intermediate format) size is 176×144 pixels and the SQCIF (sub-QCIF) size is 128×96 pixels.

Table 17 Buffer usage for media

  Buffer usage    3+1 TS CS-1 (25 kbps)    3+1 TS CS-2 (35 kbps)
  Media           Speech     Video         Speech     Video
  Maximum (KB)    8.3        32.8          10.6       44.0

The table shows that average frame rates increase with increasing bit rates, as expected. For the 25 kbps media case (3+1 TS CS-1) the average frame rate is 7.5 fps, and for the 3+1 TS CS-2 case (35 kbps media) the average frame rate is 9.8 fps (slightly higher frame rates can be achieved by stealing bandwidth from the AMR speech stream by encoding it at 4.75 kbps instead of the 5.9–6.7 kbps we used). Comparing the QCIF results with those for the SQCIF format, it is easy to see that video frame rates can be raised by using the SQCIF format, as the image size is smaller.

Table 19 shows the PSNR values for the QCIF image size in a Reliability Class 3 QoS environment. The difference between the lowest bit rate and the highest bit rate is 2.6 dB.

9 Optimal network settings and improvements to the service

In our tested network, the CSs, the number of available TSs, and the cells changed many times over time without any mobile terminal control. This was expected behavior, but we realized that these phenomena happen even on a lightly loaded network. If the CSs and the number of TSs vary all the time, the available bit rate varies all the time as well, even if only slightly.

Considering also the possible CS-1/CS-2 variation over time, the gross bit rates achieved in downlink were in the range 9.05–40.20 kbps, while the uplink bit rates were in the range 9.05–13.40 kbps. The asymmetry between downlink and uplink bit rates is not negligible, because it heavily affects the set-up time of the streaming session. Oscillations in the CSs and the number of TSs (if persistent for more than 4–5 s) were immediately noticeable at the application layer. For example, a reduction in the available bit rate caused a rebuffering.

From our tests, the best reliability classes for streaming were RC 2–3. Table 20 shows the optimal QoS parameters for GPRS Rel. '97 streaming. Since PSS is essentially real-time traffic, the precedence class should be 1 (high). The shortest delays are also required, therefore the delay class should also be 1.

Table 18 Frame rates for different bit rates

  Video bit rate (kbps)    Media bit rate (kbps)    Avg frame rate QCIF (fps)    Avg frame rate SQCIF (fps)
  7.25                     12                       4.3                          N/A
  11.25                    16                       5.0                          6.5
  15.25                    20                       6.0                          7.7
  19.10                    25                       7.5                          12.1
  28.30                    35                       9.8                          N/A

Table 19 PSNR of the streamed video

  Video bit rate (kbps)    Media bit rate (kbps)    PSNR (dB)
  7.25                     12                       32.3
  11.25                    16                       33.8
  15.25                    20                       34.4
  19.10                    25                       34.5
  28.30                    35                       34.9

Cell reselections cause data losses that may last several seconds under Reliability Classes 3 to 5. In addition to the total data losses, the media quality right after a cell reselection is also affected, due to the predictive nature of the media coding. Most video frames contain only the difference with respect to the previous video frame. If a video frame is lost, the difference values are then wrongly applied to the previously received video frame and the user can see severe errors. Therefore, a method to avoid these data losses during cell reselections is needed. We suggest a simple retransmission-based improvement to handle this problem.

The streaming client can detect a cell reselection either by monitoring the cell identifiers or by monitoring the amount of received data. If the client does not receive any data for a certain period of time and then starts to receive data again, the client has gone through a cell reselection period. From the RTP packets received before the cell reselection, the client can find the time when the cell reselection started. The PSS client can then request the server to resend the fragment of media that was lost.
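
The data-monitoring variant of this detection can be sketched as a simple gap search over packet arrival times; the function name and the 1.5 s threshold below are illustrative assumptions, not values from the paper:

```python
# Hypothetical sketch of cell-reselection detection from RTP arrival times:
# a silent interval longer than a threshold, followed by resumed reception,
# is treated as a cell reselection. The timestamp of the last packet before
# the gap marks where retransmission should resume (via RTSP PAUSE/PLAY).

GAP_THRESHOLD_S = 1.5  # assumed detection threshold

def detect_reselection(arrival_times_s):
    """Return (gap_start, gap_end) of the first inter-arrival gap larger
    than GAP_THRESHOLD_S, or None if reception was continuous."""
    for prev, cur in zip(arrival_times_s, arrival_times_s[1:]):
        if cur - prev > GAP_THRESHOLD_S:
            return (prev, cur)
    return None

# Packets every 0.2 s, then a ~3 s reception break starting at t = 2.0 s.
times = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 5.1, 5.3]
print(detect_reselection(times))  # (2.0, 5.1)
```

A real client would use RTP sequence numbers and timestamps rather than wall-clock arrival times alone, but the gap-detection principle is the same.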

After the cell reselection period is over, the client sends a message to the server with the time of the last received frame. The information can be delivered using RTSP PAUSE–PLAY messages. Although a PAUSE message is sent, there is no need to really pause the media display, unless the client buffer becomes empty. The PAUSE is needed for the server to know that the PLAY message is for an old stream and not for a new stream. The PLAY message includes the time from which the client wants to restart the playback. An example of such a PAUSE and PLAY message pair is:

PAUSE rtsp://example.com/foo RTSP/1.0
CSeq: 6
Session: 354832

PLAY rtsp://example.com/foo RTSP/1.0
CSeq: 7
Session: 354832
Range: npt=28.00-

With a GPRS connection, the initial buffering requirement was 9 s to ensure pause-less playback. These 9 s can be divided into:

1. Three seconds to cover delays and jitter caused by the GPRS network;
2. Three seconds for the cell reselection itself (cell reselections were on average 3 s);
3. Three seconds for cell reselection management signaling in GPRS.

We tested the effect of this cell reselection management technique with a 32 kbps media sequence. A 3 s CR causes a reduction in video PSNR compared to the case where the missing part is retransmitted. Table 21 shows the quality reduction over the whole video sequence. The results show that the CR management improved the video quality by over 2 dB.

Table 20 Optimal GPRS Rel. '97 QoS parameters

  Parameter                Value
  Service precedence       1
  Delay class              1
  Mean throughput class    16 (~22 kbps) or 17 (~44 kbps)
  Peak throughput class    3 (32 kbps) or 4 (64 kbps)
  Reliability class        2–3

Table 21 Video quality with and without cell reselection management

                       PSNR (dB)
  No CR management     27.1
  CR management        29.3

Figure 7 shows the PSNR of each frame. During the cell reselection period, the PSNR of the video sequence without cell reselection management is on average 17.8 dB lower, and the maximum difference between the two curves is 33.9 dB.

Fig. 7 clearly shows what happens during a CR. If there is no CR management, a still image is shown on the screen. The PSNR values (dark line) start to decrease, since the still image drifts further and further from the real frame content over time. After the CR, there is a small period of prediction error, where the PSNR is between 15 and 12 dB (the lowest point), before the PSNR values return to normal. With the CR management described, the PSNR values are not impacted by any degradation for the whole duration of the video sequence.

10 Conclusions

This paper showed that the 3GPP packet-switched streaming service (PSS) is feasible over GPRS networks, if the right QoS parameters are chosen. The PSS service is expected to be better in a 3G environment, since the bit rates are higher and can be guaranteed; however, GPRS can provide adequate streaming services. In our performance tests, we were able to reach 35 kbps media bit rates, which was enough for enjoyable mobile streaming, by using 3+1 time slot configurations and coding scheme 2. The connection set-up took on average 7.3 s without initial buffering. Intra-session commands (pause or play) take on average 1.0–1.6 s before the whole system enters a new state, although the user sees an immediate action on the terminal screen. The round trip time of one RTSP message was on average 0.8–1.2 s. The average length of a cell reselection was 2.2–3.7 s, depending on the reliability class. Reliability Class 2 caused the longest cell reselections, but they were lossless, although a transmission break occurred. In general, cell reselections were not too long, and the application managed them fairly well. Reliability Classes 2 and 3 provided 0% packet loss rates at the application layer. The buffering requirement for 35 kbps media was 55 KB. Video frame rates for the same media bit rate were in the order of 10 fps, with a video bit rate of 28.3 kbps. The results are based on a PSS Release 4 implementation. A new algorithm for cell reselection management has been introduced. Simulation results show that GPRS cell reselections can be made invisible to the application, and PSNR results have shown the advantages of such a technique. New PSS features like bit rate adaptation, stream switching and enhanced client feedback will make the system perform even better. These are specified as advanced features in Release 6 3GPP PSS.

Fig 7 PSNR values with CR management. [Plot: per-frame PSNR (0–50 dB) over frames 1–1675 for the "No CR management" and "CR management" curves.]

References

1. 3GPP, General packet radio service (GPRS), Service description, Stage 2 (Release 1997), TS 03.60, V. 6.11.0, 2002-09

2. 3GPP, Technical Specification Group GERAN; Radio subsystem synchronization (Release 1999), TS 05.10, V. 8.12.0, 2003-09

3. 3GPP, Technical Specification Group Services and System Aspects, Transparent end-to-end packet-switched streaming service (PSS); Protocols and codecs (Release 4), TS 26.234, V. 4.5.0, 2002-12

4. 3GPP, Technical Specification Group Services and System Aspects, AMR speech codec, General description (Release 4), TS 26.071, V. 4.0.0, 2001-03

5. 3GPP, Technical Specification Group Services and System Aspects, Transparent end-to-end packet-switched streaming service (PSS); General description (Release 4), TS 26.233, V. 4.2.0, 2002-03

6. 3GPP, Technical Specification Group Services and System Aspects, Transparent end-to-end packet-switched streaming service (PSS); General description (Release 5), TS 26.233, V. 5.0.0, 2002-03

7. 3GPP, Technical Specification Group Services and System Aspects, Transparent end-to-end packet-switched streaming service; Stage 1 (Release 5), TS 22.233, V. 5.0.0, 2002-03

8. 3GPP, Technical Specification Group Services and System Aspects, Transparent end-to-end packet-switched streaming service (PSS); Protocols and codecs (Release 5), TS 26.234, V. 5.7.0, 2004-05

9. 3GPP, Technical Specification Group Services and System Aspects, Transparent end-to-end packet-switched streaming service (PSS); Protocols and codecs (Release 6), TS 26.234, V. 6.10.0, 2006-12

10. 3GPP, Technical Specification Group Service and System Aspects, Audio codec processing functions; Extended Adaptive Multi-Rate-Wideband (AMR-WB+) codec; Transcoding functions (Release 6), TS 26.290, V. 6.3.0, 2005-06

11. 3GPP, Technical Specification Group Service and System Aspects, General audio codec audio processing functions; Enhanced aacPlus general audio codec; General description (Release 6), TS 26.401, V. 6.2.0, 2004-05

12. Chen C-M, Chen Y-C, Lin C-W (2005) Seamless roaming in wireless networks for video streaming. In: IEEE International Symposium on Circuits and Systems (ISCAS '05), Taiwan, 23-26 May 2005, pp 3255-3258

13. Cunningham G, Murphy S, Murphy L, Perry P (2005) Seamless handover of streamed video over UDP between wireless LANs. In: IEEE Second Consumer Communications and Networking Conference (CCNC '05), Dublin, Ireland, 3-6 January 2005, pp 284-289

14. Curcio IDD (2003) Multimedia streaming over mobile networks: European perspective. In: Furht B, Ilyas M (eds) Wireless internet handbook. Technologies, standards, and applications, CRC, pp 77-104

15. ETSI, Digital cellular telecommunications system (Phase 2+), Channel coding (Release 1997), TS 05.03, V. 6.2.1, 1999-08

16. ETSI, General packet radio service (GPRS), Service description (Release 1997), TS 02.60, V. 6.3.1, 2000-11

17. ETSI, Overall description of the GPRS radio interface, Stage 2 (Release 1997), TS 03.64, V. 6.4.0, 1999-11

18. Fabri S, Cellatoglu A, Kondoz A (1999) Transmission of multimedia services over GPRS using MPEG-4 coded video. In: IEEE 50th Vehicular Technology Conference (VTC 1999), vol 1, Amsterdam, Holland, 19-22 September 1999, pp 401-405

19. Fabri SN, Worrall S, Sadka A, Kondoz A (2000) Real-time video communications over GPRS. In: 3G Mobile Communication Technologies, IEE Conf. Publ. No. 471, pp 426-430

20. Hoymann C, Stuckmann P (2002) On the feasibility of video streaming applications over GPRS/EGPRS. In: IEEE Global Telecommunications Conference (GLOBECOM '02), vol 3, Taipei, Taiwan, 17-21 November 2002, pp 2478-2482

21. IETF (1998) Real-time streaming protocol (RTSP), RFC 2326, April

22. IETF (2003) RTP: a transport protocol for real-time applications, RFC 3550, July

23. IETF (1998) SDP: session description protocol, RFC 2327, April

24. IETF (1981) Transmission control protocol, RFC 793, September

25. IETF (2002) RTP payload format and file storage format for the adaptive multi-rate (AMR) and adaptive multi-rate wideband (AMR-WB) audio codecs, RFC 3267, March

26. IETF (1998) RTP payload format for the 1998 version of ITU-T recommendation H.263 (H.263+), RFC 2429, October

27. ITU-T (1998) Video coding for low bit rate communication, recommendation H.263, February

28. ITU-T (2001) Video coding for low bit rate communication, profiles and levels definition, recommendation H.263 Annex X, April

29. ITU-T Recommendation H.264 and ISO/IEC 14496-10 (2003) Advanced video coding for generic audiovisual services. Information technology - coding of audio-visual objects - part 10: advanced video coding

30. Korhonen J, Wang Y (2005) Effect of packet size on loss rate and delay in wireless links. In: IEEE Wireless Communications and Networking Conference (WCNC '05), vol 3, Singapore, 13-17 March 2005, pp 1608-1613

31. Lakaniemi A, Parantainen J (2000) On voice quality of IP voice over GPRS. In: IEEE International Conference on Multimedia and Expo (ICME 2000), vol 2, New York City, NY, USA, 30 July-2 August 2000, pp 752-754

32. Lim KP et al (2003) Video streaming on embedded devices through GPRS network. In: IEEE International Conference on Multimedia and Expo (ICME 2003), Baltimore, MD, USA, 6-9 July 2003, pp 169-172

33. Lundan M, Curcio IDD (2003) RTSP signaling in 3GPP streaming over GPRS. In: Finnish Signal Processing Symposium (FINSIG '03), Tampere, Finland, 19 May 2003, TICSP Series #20, pp 149-153

34. Lundan M, Curcio IDD (2003) 3GPP streaming over GPRS Rel. '97. In: 12th IEEE International Conference on Computer Communications and Networks (ICCCN '03), Dallas, TX, USA, 20-22 October 2003, pp 101-106

35. Parantainen J, Hamiti S (1999) Delay analysis for IP speech over GPRS. In: IEEE 50th Vehicular Technology Conference (VTC 1999), vol 2, Amsterdam, Holland, 19-22 September 1999, pp 829-833

36. Sadka AH (2002) Compressed video communications. Wiley

37. 3GPP, Technical Specification Group Services and System Aspects, Transparent end-to-end packet-switched streaming service (PSS); Protocols and codecs (Release 7), TS 26.234, V. 7.2.0, 2003-07

Miikka Lundan received his M.Sc. degree in Computer Science from Tampere University of Technology (TUT) in 2001. He is currently pursuing his Ph.D. degree on mobile streaming at TUT. He has authored several international publications and two patents. He joined Nokia Corporation in 1999. Between 1999 and 2001, he worked in a SIP-based video telephony project. In 2001, he joined the group that was standardizing 3GPP PSS (Packet-switched Streaming Service) at Nokia. In 2004, he joined Nokia S60 and started to work in the user experience field in the multimedia area. Since 2005, he has been a product manager of S60 Multimedia in the camera and content management area.

Igor D.D. Curcio, born in Milan (Italy) in 1968, worked from 1986 to 1997 for several companies as a freelance Software Engineer, Project Manager and Information Technology Educator. He received the Laurea degree in Computer Science from the University of Catania (Italy) in 1997. In 1998, he joined Nokia Corporation, where he has covered several research and management positions in the areas of real-time mobile multimedia. He is now Senior Research Program Manager at Nokia Research Center. He has been active in several standardization organizations (such as 3GPP, IETF, DLNA, DVB, ARIB), where he has chaired sub-working groups and task forces. Mr. Curcio holds 3 international patents and several pending patent applications. He has been an ACM member since 1990 and an IEEE member since 1991. He has also published about 30 research papers in the areas of software engineering, Video-on-Demand, and QoS of mobile video and streaming. Mr. Curcio is currently a Ph.D. candidate at the Signal Processing Laboratory of Tampere University of Technology. His current interest areas include mobile video applications, such as streaming, conferencing, Mobile TV, P2P real-time media, and home/automotive multimedia.


[P6] Igor D.D. Curcio, David Léon, “Application Rate Adaptation for Mobile Streaming”, Proc. IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM ’05), 13-16 Jun. 2005, Taormina/Giardini Naxos, Italy, pp. 66-71.

© 2005 IEEE. Reprinted with permission.


Application Rate Adaptation for Mobile Streaming

Igor D.D. Curcio

Nokia Corporation

P.O. Box 88, 33721, Tampere (Finland)

E-mail: [email protected]

David Leon

Nokia Research Center

6000 Connection Drive, Irving, TX (U.S.A.)

E-mail: [email protected]

Abstract

As 2.5G and 3G systems adoption grows, multimedia streaming is one of the services that operators will increasingly seek to provide. The multimedia streaming solutions that have been successfully deployed commercially on the Internet in the past are all based on proprietary technologies. However, existing streaming applications need to be considerably rethought to work well over wireless. The 3GPP standardization body has standardized streaming services for the Release 4, 5 and 6 specifications. The Release 6 advanced streaming service overcomes the technical challenges posed by wireless and best-effort networks. A standard solution for advanced multimedia streaming will ultimately drive operator and user adoption. This paper describes the solution standardized in 3GPP(2) for adaptive streaming.

1. Introduction

One of the most appealing mobile multimedia services is certainly streaming. 3GPP (Third Generation Partnership Project) has standardized streaming services since the Release 4 specifications, which defined a basic streaming service. Release 5 3GPP streaming has enriched the basic service with new tools and media types, in order to make the streaming service more competitive and attractive. Mobile phone manufacturers have already launched mobile phones with streaming media players. However, the streaming technology, both in the standards and in commercial products, currently lacks a fundamental feature: adaptivity.

The need for adaptive streaming is driven on the one hand by the fact that most of the currently deployed mobile networks (e.g., GPRS, EGPRS, UMTS) offer only best-effort QoS (Quality of Service). This implies that resources, such as bit rates, are not guaranteed, and continuous media streaming, which requires rather constant bit rate channels, becomes more challenging.

On the other hand, adaptivity is increasingly required because the deployment of mobile networks that offer QoS (e.g., EGPRS and UMTS networks with Streaming Traffic Class) has been delayed over the past years. Given that streaming services are already fully deployed in many countries, and networks with QoS have not been available by the time of these deployments, 3GPP and 3GPP2 have worked out a solution to provide adaptive streaming.

The term adaptive here means that a streaming service is able to adapt to varying network conditions. Examples of such variations include variations of throughput, delay, and intra/inter-operator roaming to networks with or without QoS support.

The causes for throughput and delay variation in the current best-effort mobile networks are mainly load and radio network conditions. In fact, it is known that if the number of users in a cell grows, the capacity per user decreases. Similarly, whenever radio conditions are not good, the user throughput normally decreases.

These situations are not very critical for non-real-time traffic. However, they become critical for continuous media transmission, such as streaming, where the user experience drops if the service loses its inherent property of continuity. For instance, continuous (or pause-less) playback is the number one requirement for a successful streaming service. When the network throughput is varying all the time during a session, the effect on the end user's client is that of picture freezes, pauses in the audio/video playback, continuous rebufferings (i.e., re-loading from the streaming server a sufficient amount of media data to be streamed with no interruptions) and bad media quality (caused by packet losses due to network buffer overflow).

Adaptive streaming avoids the above phenomena and ensures pause-less playback to the end user, yielding a superior user experience compared to conventional streaming (e.g., 3GPP streaming in Releases 4 and 5). To do so, a streaming server keeps a set of media streams of the same content encoded at different bit rates and performs seamless switching between them in adverse radio conditions. For example, if a streaming session starts at 64 kbps with good network throughput, and subsequently the network throughput halves, the streaming server can switch to a lower bit rate stream (e.g., 32 kbps) to guarantee pause-less playback and avoid the network buffer overflow that would cause packet losses and bad media quality to the user. The server can switch back up to the 64 kbps media stream when the network conditions are good again.

In order to realize the adaptation in an efficient way, the streaming server must have a clear picture of the streaming client buffer. The streaming server, at any time during a session, must keep the client buffer full up to a certain security level, and ensure that the buffer does not overflow or underflow even in adverse radio conditions or whenever roaming or handovers occur.

The remainder of this paper is organized as follows. Section 2 describes the basis of 3GPP PSS, including the reference specifications and protocols. Section 3 focuses on 3GPP adaptive streaming, whereas Section 4 describes the signaling required by an application that supports adaptivity. Section 5 shows some performance results. Section 6 is about adaptive streaming and its integration with RTP retransmission. Section 7 concludes the paper.

2. 3GPP PSS Specifications and Protocols

The 3GPP Packet-switched Streaming Service (PSS) specifications [1][5] are the framework in which a set of multimedia transport protocols and algorithms is defined that are proven to enhance the mobile multimedia streaming user experience.

As key building blocks, 3GPP PSS mandates streaming protocols that have been defined in the IETF (Internet Engineering Task Force). These are the Real-Time Streaming Protocol (RTSP) [2] and the Session Description Protocol (SDP) [3] for session set-up/control, and the Real-time Transport Protocol (RTP) [4] for the actual transport of the media streams. Along with its associated payload formats, RTP aims to provide services useful for the transport of real-time audio and video over IP networks. These services include timing recovery, loss detection, payload and source identification, media synchronization (e.g., lip-sync), and reception quality feedback through its associated RTP Control Protocol (RTCP).

RTP is typically run on top of the User Datagram Protocol (UDP) [6], which makes it an unreliable transport protocol; in other words, unlike the Transmission Control Protocol (TCP), packet delivery is not guaranteed because lost packets are not automatically retransmitted. Using RTP, applications also control packet transmission scheduling themselves, and thus are in control of the throughput and packet delivery delay. Such integrated control is essential for an application with real-time constraints such as streaming.

3GPP builds upon the flexibility of the RTP protocol by standardizing features that are needed for interoperability and efficiency. One of the most important new features introduced in Release 6 is application Bit Rate Adaptation, also referred to as Adaptive Streaming.

3. 3GPP Adaptive Streaming

Figure 1 illustrates the functional entities that play a role in rate adaptation. The content, i.e., the audio and video data, is usually pre-encoded off-line and kept as data files at the application (streaming) server. In the case of a live event, the multimedia content is produced and encoded in real time.

Figure 1. Functional entities in adaptive streaming

When a streaming client requests a given presentation (through the RTSP protocol), the server starts streaming the pre-encoded or real-time encoded media data encapsulated in RTP packets. The need for rate adaptation arises from the fact that the throughput delivered by the network is variable.

Though the variation in throughput may come from the Internet path to the server, when streaming to a mobile client the bottleneck in terms of bandwidth is generally the radio network. Variations in bit rate may be caused by variable radio throughput due to radio conditions, network load because of other users in the same cell, and mobility. In particular, handovers (e.g., GPRS cell reselections) cause a period of no throughput at all as the radio link is torn down and re-established. This problem is more acute in GERAN (GSM/EDGE Radio Access Network) type networks.

The radio network can be modeled as a bottleneck link with variable bit rate. When the link rate is lower than the incoming rate sent by the server, data accumulates in the network buffer. When the link rate is higher than the server transmission rate, the network buffer empties.

The streaming client is able to withstand some variations in the received throughput as it uses a so-called play-out buffer. The play-out buffer is built up from a short initial buffering at the beginning of the session, when the client receives the media data but delays playing it out for a certain period of time. Therefore, during periods when the received throughput drops, the client is able to play data accumulated in its play-out buffer. However, since the set-up time of the session has to be minimized, the play-out buffer typically holds only a few seconds of data. The receiver will thus run out of data and the play-out will be interrupted if the rate cannot be precisely controlled and/or if consecutive handovers occur because of user mobility.
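The interaction of the two buffers can be sketched as a toy discrete-time simulation (my own illustration, not from the paper): a server pushing data at a constant transmission rate into a variable-rate bottleneck link, and a client that starts consuming at the content rate only after an initial buffering period. All rates and buffer levels are in bits.

```python
# Illustrative sketch of the two-buffer model described above: network
# buffer fed by the server, play-out buffer drained by the decoder.

def simulate(tx_rate, link_rates, content_rate, initial_buffering_s):
    """Step the network and client buffers once per second.

    tx_rate, content_rate -- bits per second
    link_rates            -- per-second bottleneck link capacity (bits/s)
    Returns (network_buffer, client_buffer) histories in bits.
    """
    net_buf = client_buf = 0
    net_hist, client_hist = [], []
    for t, link in enumerate(link_rates):
        net_buf += tx_rate                 # server pushes data into the network
        delivered = min(net_buf, link)     # link drains at its current capacity
        net_buf -= delivered
        client_buf += delivered            # data arrives in the play-out buffer
        if t >= initial_buffering_s:       # playback starts after initial buffering
            client_buf = max(0, client_buf - content_rate)
        net_hist.append(net_buf)
        client_hist.append(client_buf)
    return net_hist, client_hist

# A three-second link outage (roughly a handover) drains the play-out buffer,
# which refills once the link recovers:
net, cli = simulate(tx_rate=32000,
                    link_rates=[64000] * 10 + [0] * 3 + [64000] * 10,
                    content_rate=32000, initial_buffering_s=4)
```

With these numbers the play-out buffer absorbs the outage (it never empties) while the network buffer temporarily grows, matching the qualitative behaviour described in the text.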

The problem of rate adaptation can thus be approached as the control of two buffers (the network buffer and the client buffer). The server must adapt both its transmission rate and the encoding rate of the content in order to keep both buffers in an optimum state at each time instant. If the encoding is done in real time (live content), the sender can modify the media encoding rate by changing the rate control parameters of the encoder. On the other hand, if the server provides streaming content that has been encoded off-line, the server needs some mechanism to modify the media rate of pre-encoded content; this is usually done by switching between different versions of the same file that have been encoded at different bit rates. In this case, the server further needs to switch at a position that will avoid artifacts in the decoded stream.

The 3GPP PSS specifications introduce a new technical feature: signaling of the play-out buffer status information from the receiver to the server. Employed together with the regular feedback given by the RTCP protocol, this allows the server to have the information it needs in order to optimally choose both the transmission rate and the media encoding rate. The server can thus attempt to maintain both the network buffer and the client buffer in an optimum state.

4. Application signaling

This section describes the signaling details [1] of the functionality that provides adaptive streaming, from the PSS server and client perspective. Figure 2 depicts the signaling of a typical streaming session.

The PSS client starts the session by sending an RTSP DESCRIBE message to the PSS server, which answers with a 200 OK response embedding an SDP description. Among other things, if the server supports adaptivity, it will include the 3GPP-Adaptation-Support SDP attribute to tell the client that adaptivity is supported. This attribute is present at media level only, and it enables the server to perform bit rate adaptation for each media separately. This increases flexibility compared to a rate adaptation system controlled purely at session level (e.g., performing bit rate adaptation is easy when each media is mapped to one PDP context). The 3GPP-Adaptation-Support SDP attribute carries the reporting-frequency parameter, which tells the client how frequently to report its buffer status to the server via RTCP.

A PSS client that supports adaptivity will signal, in a subsequent RTSP method such as SETUP, PLAY, OPTIONS or SET_PARAMETER, the 3GPP-Adaptation header with the following information:

• URL of the media on which the client wants the server to perform bit rate adaptation;
• Size of the buffer (in bytes) allocated to that particular media;
• Target buffer level (in milliseconds) the client wishes the server to keep.

The buffer size corresponds to the size of the de-jittering buffer and includes any pre-decoder buffer space used by the client for that media. The target buffer level is determined by the client, which has the best knowledge of the mobile network characteristics. This parameter represents an adequate protection level against network-level interruptions (e.g., handovers), inter-arrival packet jitter, and other factors that may prevent pause-less playback. In other words, the target buffer level is the maximum time margin that the server can utilize to perform its rate adaptation operations (e.g., change the transmission rate and/or the content rate). If the server is able to keep the client buffer at the client-requested target level all the time, the playback session is guaranteed to be pause-less.
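The per-media negotiation above can be sketched as a small header formatter. The attribute layout (url, size, target-time) follows the general shape of the Release 6 syntax, but the exact ABNF is defined in TS 26.234 [1], so treat the formatting details here as illustrative; the URL is a made-up example.

```python
# Sketch of composing the 3GPP-Adaptation RTSP header described above.
# Attribute names are illustrative; see TS 26.234 for the normative ABNF.

def adaptation_header(media_url, buffer_size_bytes, target_time_ms):
    """Build a 3GPP-Adaptation header value carrying the per-media
    buffer size (bytes) and target buffer level (milliseconds)."""
    return ('3GPP-Adaptation: url="%s";size=%d;target-time=%d'
            % (media_url, buffer_size_bytes, target_time_ms))

# Values taken from the simulation in Section 5 (115000-byte buffer,
# 12-second target level):
hdr = adaptation_header("rtsp://example.com/media/video", 115000, 12000)
```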


Figure 2. Signaling in a typical streaming session

The presence of the 3GPP-Adaptation header (with the same values initially signaled by the client) in the RTSP response signifies that the server acknowledges the client request.

After the PSS client and server have negotiated the parameters to be used for rate adaptation (there is a set of such parameters for each media), the client starts sending feedback information via RTCP [4], in order to enable the server to perform bit rate adaptation in a timely manner and with minimal media quality disruption to the end user.

The client uses RTCP APP packets to convey the necessary buffer status information to the server. This information consists of:

• Oldest Buffered Sequence Number (OBSN), which is the RTP sequence number of the first packet in the sequence of packets to be played out from the buffer;
• Playout delay (in milliseconds), which is the difference between the scheduled play-out time of the oldest packet and the sending time of the OBSN APP packet.

The OBSN field allows the server to estimate the client buffer level, whereas the play-out delay provides a more precise estimate of the client buffer underflow point. Together, the buffer size and the estimated client buffer level allow the server to avoid overflowing or underflowing the client buffer. In case a stream switch is required, the server has all the necessary information to perform the switching seamlessly.
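How a server might exploit an OBSN/playout-delay report can be sketched as follows. The helper is hypothetical (not from the specification) and, for brevity, ignores 16-bit RTP sequence number wraparound: packets from OBSN onwards are counted as still buffered or in flight, and the report time plus the play-out delay gives the instant at which the client would underflow if nothing more arrived.

```python
# Sketch (assumed helper, not from the paper): estimating the client
# buffer level and underflow deadline from an OBSN report.

def estimate_client_buffer(sent_packets, obsn, playout_delay_ms, report_time_ms):
    """sent_packets maps RTP sequence number -> payload size in bytes for
    packets already transmitted. Packets with sequence number >= OBSN are
    still buffered (or in flight); older ones have been played out."""
    buffered_bytes = sum(size for seq, size in sent_packets.items()
                         if seq >= obsn)
    # The oldest buffered packet plays out playout_delay_ms after the report:
    underflow_time_ms = report_time_ms + playout_delay_ms
    return buffered_bytes, underflow_time_ms

# Four 300-byte packets sent; the client reports OBSN 102 with a 4 s delay:
sent = {100: 300, 101: 300, 102: 300, 103: 300}
bytes_buf, deadline = estimate_client_buffer(sent, obsn=102,
                                             playout_delay_ms=4000,
                                             report_time_ms=20000)
```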

Since network conditions are dynamic, a different protection level may be required at different points in time. For example, a handover from a WCDMA to a GPRS network would suggest to the client that future cell reselections within the GPRS network will produce longer breaks than WCDMA cell reselections. For this purpose, the client may decide to update its target buffer level by sending a new 3GPP-Adaptation header, embedding a higher target buffer level value, in an RTSP PLAY, OPTIONS or SET_PARAMETER method. However, the buffer size cannot be modified during the lifetime of a streaming session.

The 3GPP adaptive streaming solution is compliant with the solution developed in 3GPP2 [8].

5. Performance results

This section includes some simulation results of adaptive streaming; for further information, please refer to [5]. The basic idea for client underflow prevention is simple: if the buffer level decreases over time, the server switches down to a lower content rate. Decreasing the content rate (while keeping the same transmission rate) allows the server to send packets earlier and increase the client buffer level faster.
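The switch-down rule just described can be sketched as a simple decision function, using the 20/35/50 kbps encodings of the simulation below. This is my own minimal illustration: a real server would add hysteresis, align switches to I-frames, and weigh the transmission rate as well.

```python
# Sketch of the content-rate switching rule: switch down while the client
# buffer is draining, switch back up when it is growing again.

AVAILABLE_RATES = [20, 35, 50]  # kbps, as in the simulation in this section

def choose_rate(current_rate, buffer_trend_s):
    """buffer_trend_s: change in buffer duration (seconds) since the last
    client report; negative means the buffer is draining."""
    rates = sorted(AVAILABLE_RATES)
    i = rates.index(current_rate)
    if buffer_trend_s < 0 and i > 0:
        return rates[i - 1]       # switch down to refill the buffer faster
    if buffer_trend_s > 0 and i < len(rates) - 1:
        return rates[i + 1]       # conditions improved: switch back up
    return current_rate           # already at the edge, or buffer is stable
```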

In the simulation we used the following parameters:

• Client initial buffering period: 8 seconds;
• Video content of 3 minutes duration, encoded at 20, 35 and 50 kbps; the video packet size was 300 bytes (excluding packetization overhead);
• Server switching only at I-frames;
• Client buffer size: 115000 bytes;
• Target buffer level: 12 seconds;
• RTCP transmission interval: 1 second.

The network was modeled using an EGPRS network emulator where two MCS-7 time slots were allocated to the streaming user. In addition, two other users generated traffic on the same shared channel by performing Web browsing. The air interface bit rate was 89.6 kbps, but because of protocol layer overhead, varying radio conditions and network load, the real throughput perceived by the streaming client was lower. There were three handovers (HO) during the simulations:

• At time 19.8 s, lasting 2.2 s;
• At time 101.6 s, lasting 4.0 s;
• At time 116.8 s, lasting 1.8 s.

[Figure 2 diagram: the PSS client sends RTSP DESCRIBE, SETUP and PLAY requests, each answered by the PSS server with a 200 OK (the DESCRIBE response carrying the SDP); media streaming then proceeds as an RTP/RTCP media flow with RTCP feedback from the client.]


Figure 3. Transmitted (TX bw) / received (RX bw) bit rate and selected bit stream (5 s averages, rate in kbps vs. time in seconds; HO marks the handovers)

Figure 4. Client buffer level in time (buffer duration in seconds vs. time; HO marks the handovers)

Results are shown in Figures 3-5. The start of each handover period is marked with a vertical line in the figures. As a result of these handovers, the average network bit rate was very low at those times.

The plot in Figure 3 shows the bit rate received by the streaming client over time (averaged over 5-second intervals) and the adapted transmission bit rate. It can be seen that the transmission rate (TX bw) is adapted to the reception rate (RX bw) through estimation of the network throughput. The plot also shows the bit stream (20, 35 or 50 kbps) selected by the server at a given time instant. The average content rate during the session was 40 kbps.

Figure 5. Client buffer level in bytes (number of bytes vs. time; HO marks the handovers)

The buffer level in seconds is shown in Figure 4. The target buffer level is 12 s and is the minimum protection against throughput variations that the server aims at providing. When the network conditions are good and the server maximises the throughput available from the network, the buffer duration will be higher than the target level.

The buffer level in bytes is shown in Figure 5. Despite high bandwidth variations, the sender is capable, through the rate adaptation signaling, of controlling the receiver buffer level, and thus provides a better end-user experience by avoiding buffer underflow and overflow.

6. Adaptive Streaming and RTP Retransmission

In a streaming session, packet losses may happen for different reasons: they may occur before the radio access network (in the Core Network, or even in an IP backbone outside the 3GPP domain) or in the radio access network itself. To correct such packet losses at the application layer, an RTP retransmission feature has been introduced into 3GPP PSS Release 6 [7].

RTP retransmission is suitable for multimedia streaming because, unlike TCP, it is not fully persistent. After being informed by the receiver through RTCP NACK (Negative ACKnowledgement) packets, the server has the freedom to decide whether to retransmit an RTP packet or not. In order to make the retransmission decision, the server estimates whether a retransmitted packet could still arrive at the client before it would be scheduled for play-out, assuming continuous real-time playback. In addition, not all packet losses have the same effect on the decoded media quality, as some packet losses can be more easily concealed than others. The server is also able to consider such aspects in the retransmission decision, and can selectively choose whether to retransmit a packet depending on how difficult loss concealment would be.

Through the integration of adaptive streaming and RTP retransmission, the server can make the optimum decision on whether to retransmit a lost RTP packet. In fact, since retransmission occupies a certain amount of extra bandwidth, the rate adaptation functionality can indicate whether this extra bandwidth for retransmission is available. Furthermore, the rate adaptation module is potentially capable of "creating" the required bandwidth for retransmission if it is not available. Therefore, the integrated use of rate adaptation and RTP retransmission mechanisms enhances the quality of a streaming session for the end user.
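The retransmission decision described above can be sketched as a predicate combining the play-out deadline, the bandwidth headroom reported by the rate adaptation module, and concealability. This is my own simplified illustration (a one-way delay estimated as half the RTT, and a per-second bandwidth budget), not the server algorithm from the specification.

```python
# Sketch of the selective retransmission decision: retransmit a NACKed
# packet only if it can still arrive before its scheduled play-out time,
# spare bandwidth is available, and the loss is hard to conceal.

def should_retransmit(now_ms, playout_time_ms, rtt_ms,
                      spare_bandwidth_bps, packet_bits, concealable):
    arrival_estimate = now_ms + rtt_ms / 2      # rough one-way delivery time
    if arrival_estimate >= playout_time_ms:
        return False                            # would arrive too late anyway
    if packet_bits > spare_bandwidth_bps:
        return False                            # no headroom; rate adaptation
                                                # could free some by switching down
    if concealable:
        return False                            # decoder can hide this loss
    return True
```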

7. Conclusions

Though multimedia streaming over the Internet has so far been led by proprietary systems, the main drivers for service adoption of mobile multimedia streaming will be interoperability and performance. By defining a standard solution for multimedia transport, mobile streaming will become increasingly popular.

The solution for adaptive streaming developed in 3GPP, today part of the Release 6 specifications, includes specific application-level signaling that enables the streaming server to tightly track the behavior of the streaming client, and allows fast adaptive reactions to guarantee the best user experience. This solution is aligned with the adaptive streaming solution developed by 3GPP2.

One of the great advantages of 3GPP adaptive streaming developed in Release 6 is that it does not require any network or terminal QoS support (i.e., Streaming Traffic Class), since adaptive streaming is a mechanism built end-to-end at the application layer and completely independent of the lower layers. This makes it useful today in best-effort mobile networks (GPRS and EGPRS/WCDMA with no QoS support), and in the future in mixed network environments where QoS and non-QoS mobile networks will co-exist.

8. References

[1] 3GPP, TSG-SA, Transparent end-to-end Packet Switched Streaming Service (PSS); Protocols and codecs (Release 6), TS 26.234, v. 6.0.0, June 2004.
[2] IETF, Real Time Streaming Protocol (RTSP), RFC 2326, April 1998.
[3] IETF, SDP: Session Description Protocol, RFC 2327, April 1998.
[4] IETF, RTP: A Transport Protocol for Real-Time Applications, RFC 3550, July 2003.
[5] 3GPP, TSG-SA, Transparent end-to-end Packet Switched Streaming Service (PSS); RTP usage model (Release 6), TR 26.937, v. 6.0.0, March 2004.
[6] IETF, User Datagram Protocol, RFC 768, 28 August 1980.
[7] J. Rey, D. Leon, A. Miyazaki, V. Varsa and R. Hakenberg, RTP Retransmission Payload Format, IETF Internet Draft draft-ietf-avt-rtp-retransmission-11.txt, March 2005, work in progress.
[8] 3GPP2, Multimedia Streaming Services for cdma2000 Spread Spectrum Systems, TS C.P0046-0, v. 0.1.8, December 2004.


[P7] Igor D.D. Curcio, David Léon, “Evolution of 3GPP Streaming for Improving QoS over Mobile Networks”, Proc. IEEE International Conference on Image Processing (ICIP ’05), Genova, Italy, 11-14 Sep. 2005, Vol. III, pp. 692-695.

© 2005 IEEE. Reprinted with permission.


EVOLUTION OF 3GPP STREAMING FOR IMPROVING QOS OVER MOBILE NETWORKS

Igor D.D. Curcio1, David Leon2

1Nokia Corporation, P.O. Box 88, 33721 Tampere (Finland), Email: [email protected]

2Nokia Research Center, 6000 Connection Drive, Irving, TX (U.S.A.), Email: [email protected]

ABSTRACT Streaming is one of the key 3G mobile multimedia services provided by network operators. Traditional streaming applications need to be redesigned in order to work well over wireless. 3GPP has standardized streaming services for Release 4, 5 and 6 specifications. The last release overcomes the technical challenges posed by wireless and best effort networks. This paper describes the solution standardized in Release 6 for adaptive streaming.

1. INTRODUCTION 3GPP (Third Generation Partnership Project) streaming is certainly one of the most interesting mobile multimedia services standardized during the last years. The latest 3GPP releases 4 and 5 have enriched the basic streaming service with new tools and media types, in order to make streaming more competitive and attractive.

Adaptive streaming is a requirement given that today most of the currently deployed mobile networks (e.g., GPRS, EGPRS, UMTS) offer only best-effort QoS (Quality of Service). Adaptive here means that a streaming application is capable of adapting to varying network conditions, such as variations of throughput, delay, and intra/inter-operator roaming to networks with or without QoS support.

Continuous (or pause-less) playback is probably the most important requirement for streaming. When the network throughput varies during a session, picture freezes, pauses in the media playback, continuous rebufferings (i.e., re-loading from the server a sufficient amount of data to be streamed with no interruptions) and bad media quality (caused by packet losses derived by network buffers overflow) can happen.

Adaptive streaming aims at avoiding the above phenomena and ensures pause-less playback to the end user, yielding a superior user experience compared to conventional streaming. Section 2 of this paper focuses on the bit rate adaptation problem. Section 3 describes the signaling required by an application that supports adaptivity. Section 4 shows some performance results. Section 5 concludes the paper.

2. BIT RATE ADAPTATION PROBLEM

The need for rate adaptation in the 3GPP Packet-switched Streaming Service (PSS) [1][2] arises mainly from the fact that the throughput delivered by the network may be variable. When streaming to a mobile client, the bottleneck in terms of bandwidth is generally the radio network because of variable radio throughput due to radio conditions, network load because of other users in the same cell, and mobility. In particular, handovers (e.g., GPRS cell reselections) cause a period of no throughput at all as the radio link is torn down and re-established.

The radio network can be modeled as a bottleneck link with variable bit rate (see Figure 1). When the link rate is lower than the server transmission rate, data accumulates in the network buffer. When the link rate is higher than the server transmission rate, the network buffer empties.

The streaming client is able to withstand some variations in the received throughput as it uses a so-called playout buffer. The playout buffer is built up from a short initial buffering at the beginning of the session when the client receives the media data, but delays playing it out for a certain period of time. Therefore, during periods when the received throughput drops, the client is able to play data accumulated in its playout buffer. However, since the set-up time of the session has to be minimized, the playout buffer typically holds only a few seconds of data. The client will thus run out of data and the playout will be interrupted, if the rate cannot be precisely controlled and/or if consecutive handovers occur because of user mobility.

The rate adaptation problem can be described with reference to the bit rate evolution plots (i.e., sampling (or encoding) curve, transmission curve, reception curve, playout curve), and the term curve control can be used in place of rate control [2]. Figure 1 indicates the points where the different curves can be observed in a simplified streaming model.

0-7803-9134-9/05/$20.00 ©2005 IEEE III-692


[Figure 1 diagram: the encoder/file at the application server defines the sampling curve; the server transmission (transmitter curve) feeds a buffer in the network (e.g., at SGSN or RNC) over the wireless link; the client buffer is observed through the receiver and playout curves.]

Note: the sampling curve is decided by the application server at streaming time and is not completely pre-determined by the encoder; examples are bitstream switching or any other content dropping from the bitstream before transmission (i.e., thinning).

Figure 1 – Curves in a simplified streaming model

[Figure 2 – Example bit rate evolution plot. Horizontal axis: time (sec); vertical axis: cumulative data (bits). Curves shown: sampling, transmitter, receiver and playout; the vertical gaps between curves mark the amount of data in the network buffer and in the client buffer.]

Figure 2 shows an example bit rate evolution plot. The horizontal axis in the graph denotes time in seconds; the vertical axis denotes the cumulative amount of data in bits. The playout curve shows the cumulative amount of data the decoder has processed from the client buffer by a given time. The sampling curve indicates the progress of data generation if the media encoder were run in real time (it is the counterpart of the playout curve, and is in fact a time-shifted version of it). The transmission curve shows the cumulative amount of data sent out by the server at a given time. The reception curve shows the cumulative amount of data received in the client buffer at a given time.

The distance between two curves at a given time shows the amount of data between two observation points in the system. For example, the distance between the transmission and reception curves corresponds to the amount of data in the network buffer, and the distance between the reception and playout curves corresponds to the amount of data in the client buffer (see Figure 2). Curve control here means constraining by some limits the distance between two curves (e.g., by a maximum amount of data, or a maximum delay). This problem is equivalent to controlling two buffers: the network buffer and the client buffer (see Figure 1).
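As an illustration, the two buffer occupancies can be read directly off the cumulative curves (the sample values below are hypothetical):

```python
# Sketch: buffer occupancies as distances between cumulative curves.
# Each list holds cumulative bits observed at the same time instants.

def buffer_levels(transmitted, received, played):
    """Network buffer = transmission curve - reception curve.
    Client buffer  = reception curve - playout curve."""
    network = [tx - rx for tx, rx in zip(transmitted, received)]
    client = [rx - pl for rx, pl in zip(received, played)]
    return network, client

tx = [0, 50_000, 100_000, 150_000]
rx = [0, 40_000,  90_000, 150_000]   # link slower at first, then catches up
pl = [0,      0,  40_000,  80_000]   # playout starts after initial buffering
net, cli = buffer_levels(tx, rx, pl)
print(net)  # [0, 10000, 10000, 0]
print(cli)  # [0, 40000, 50000, 70000]
```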

In case bit stream switching or another rate adaptation action is foreseen, the server-signaled pre-decoder buffer parameters are to be interpreted as the limits within which the server will constrain the difference between the sampling curve and the transmission curve during the session (sampling curve-transmission curve control).

In addition to sampling curve-transmission curve control, the server attempts transmission curve-reception curve control in order to limit the packet transfer delays (i.e. limit the jitter buffering required at the client). The variable bit rate over time on the transmission path, and thus variable packet transfer delays, creates the need for transmission curve adaptation. Unknown future packet transfer delays make it hard for the server to control the transmission curve-reception curve difference.

By placing only sampling curve-transmission curve control requirements on the server, any parameter that is not controllable directly by the server is excluded. In this case, there are no uncertain parameters used in this curve control. A sampling curve-transmission curve control algorithm can work independently of the transmission curve-reception curve control algorithms, such as “standard” congestion control algorithms (e.g., TCP Friendly Rate Control). Thus, it can be implemented on top of them.

A rate adaptive implementation of a streaming server must adapt both its transmission rate and the encoding rate of the content in order to keep the network and client buffers in an optimum state at each time instant (and therefore realize the curve control). If the encoding is done in real time (live content), the server can modify the media encoding rate by changing the encoder parameters. Instead, if off-line encoded content is provided, the server needs some mechanism to modify the media rate of the content, which is usually done by switching between different versions of the same file that have been encoded at different bit rates.
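A minimal sketch of such a switching decision for off-line encoded content, assuming a hypothetical helper `pick_content_rate` and a hypothetical headroom factor (neither is specified by the paper):

```python
# Sketch: pick among pre-encoded versions of the same content.
# The rates and the 0.9 headroom factor are illustrative assumptions.

AVAILABLE_RATES_KBPS = [20, 35, 50]  # pre-encoded versions of the same clip

def pick_content_rate(estimated_throughput_kbps, headroom=0.9):
    """Choose the highest encoded version that fits under the estimated
    network throughput, leaving some headroom for overhead and variation."""
    budget = estimated_throughput_kbps * headroom
    candidates = [r for r in AVAILABLE_RATES_KBPS if r <= budget]
    # Fall back to the lowest version when even that exceeds the budget.
    return max(candidates) if candidates else min(AVAILABLE_RATES_KBPS)

print(pick_content_rate(60))  # 50
print(pick_content_rate(42))  # 35 (budget is 37.8 kbps)
print(pick_content_rate(15))  # 20 (lowest version as a floor)
```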

The 3GPP PSS specifications introduce a new technical feature of signaling from the client to the server, which gives the server the information it needs to optimally choose both the transmission rate and the media encoding rate at any time.

3. SIGNALING TO SUPPORT RATE ADAPTATION

This section describes the signaling details [1] of the adaptive streaming functionality from the PSS client perspective. The client sends two types of signaling: static and dynamic.

Static signaling is sent at the beginning of the session using the RTSP protocol [4]. The signaled values do not often change (but may change occasionally). The relevant parameters are: the media URL on which the client wants the server to perform rate adaptation; the buffer size (in bytes) allocated to that particular media; and the target buffer level (in milliseconds) the client wishes the server to maintain. The buffer size corresponds to the size of the reception, de-jittering and, if used, any de-interleaving buffer space. This is the memory space for storing complete Application Data Units (ADUs) including the RTP timestamp and sequence number (SN) (or decoding order number). The target buffer level is determined by the client, which has the best knowledge of the mobile network characteristics. This parameter represents an adequate protection level against network interruptions (e.g., handovers), inter-arrival jitter, and other factors that may prevent pause-less playback.

Dynamic signaling is sent during the session. The client uses RTCP [3] NADU (Next ADU) APP packets [1] to convey the necessary buffer status information to the server. The necessary information consists of:

• Next Sequence Number (NSN), which is the RTP SN of the next ADU to be decoded in the sequence of ADUs to be played out from the buffer.

• Next Unit Number (NUN), which is the unit number (within the RTP packet) of the next ADU to be decoded. This is useful for interleaved media packetization. For audio codecs, an ADU is defined as an audio frame. For H.264, an ADU is a NAL unit. For H.263 and MPEG-4 Visual Simple Profile, each packet carries a single ADU, so the NUN field is set to zero in the latter cases.

• Playout delay (in milliseconds), which is the difference between the scheduled playout time of the next ADU to be decoded and the sending time of the NADU APP packet, as measured by the media playout clock.

• Free Buffer Space (FBS), which is the amount of free buffer space (in 64-byte blocks) available at the PSS client at the moment of reporting.

The NSN and NUN fields allow the server to estimate the client buffer level, whereas the playout delay allows a more precise estimation of the client buffer underflow point. The buffer size and the estimated client buffer level allow the server to avoid overflowing and underflowing the client buffer.
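For illustration only, the reported fields can be held in a simple container (this is not the wire format, which is defined in TS 26.234; the class and field names are ours):

```python
# Sketch: the NADU APP packet fields as a plain data container.
# Field names follow the text above; the actual bit layout is in TS 26.234.

from dataclasses import dataclass

@dataclass
class NaduReport:
    nsn: int               # RTP sequence number of the next ADU to decode
    nun: int               # unit number within the packet (0 unless interleaved)
    playout_delay_ms: int  # time until the next ADU is played out
    fbs_blocks: int        # free buffer space, in 64-byte blocks

    def free_bytes(self) -> int:
        # FBS is reported in 64-byte blocks, so convert to bytes.
        return self.fbs_blocks * 64

report = NaduReport(nsn=1042, nun=0, playout_delay_ms=9500, fbs_blocks=700)
print(report.free_bytes())  # 44800 bytes free at the client
```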

Under dynamic network conditions, the client may decide to update its target buffer level by sending a message to the server. For example, a handover from a WCDMA to a GPRS network may mean to the client that future cell reselections within the GPRS network would produce longer breaks than under WCDMA.

4. SIMULATION RESULTS

This section presents simulation results for adaptive streaming. More information is available in [2][5]. To describe the implementation, we distinguish between transmission and content rates. At any point in time, the transmission rate is "how much" is sent on the network, whereas the content rate is "what" is sent on the network. Transmission rate control and content rate control together regulate the server behavior. When RTCP reports are received, the server may adapt its current transmission and content rates based on the feedback.

The server keeps track of the client buffer status through the NSN reports. This information is used to avoid underflow and overflow of the client buffer and it allows the server to know how much playout time the client currently has (underflow condition) and how many bytes are in the buffer (overflow condition).

The server keeps in memory the following information about each packet it sends: sequence number, timestamp and size. The server can delete this information after the packet has been played out by the client (i.e., when it is not in the client buffer anymore).

In order to avoid buffer overflow, the server can estimate through the NSN how many bytes are currently waiting for playout at the client. By comparing this value to the total buffer size signaled in the RTSP at the start of the session, it can determine whether the client is close to overflow and should thus decrease its transmission rate. As explained above, the server chooses its transmission rate in order to maximize the network throughput and to guarantee pause-less playback to the PSS client. However, if the server gets closer to the client overflow point, it will send at a lower rate than the optimum rate supported by the network in order to avoid the overflow.

In order to avoid underflow, the server monitors the current client buffer delay. This can be estimated through the NADU APP packet since this contains information about the next packet to be played out (NSN) and the delay until this will be played out (playout delay). If the buffer level in time decreases, the server switches to a lower content rate. Decreasing the content rate (but keeping the same transmission rate) allows the server to send packets earlier and increase the buffer level faster.
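The two decisions above can be sketched as follows (the threshold, function name and returned actions are hypothetical illustrations, not the 3GPP-specified behavior):

```python
# Sketch of the server-side overflow/underflow decisions described above.
# The 0.9 overflow threshold is an illustrative assumption.

def adapt(sent_packets, nsn, playout_delay_ms,
          client_buffer_size, target_level_ms,
          content_rates_kbps, current_rate_kbps):
    """sent_packets: {seq_no: size_bytes} for packets not yet played out.
    Returns (new_content_rate, slow_down_transmission)."""
    # Overflow side: bytes still waiting at the client are the sent packets
    # with sequence number >= NSN; compare against the signaled buffer size.
    buffered_bytes = sum(size for sn, size in sent_packets.items() if sn >= nsn)
    slow_down = buffered_bytes > 0.9 * client_buffer_size

    # Underflow side: if the buffered playout time falls below the target,
    # switch to a lower content rate so the buffer refills faster.
    rate = current_rate_kbps
    if playout_delay_ms < target_level_ms:
        lower = [r for r in content_rates_kbps if r < current_rate_kbps]
        if lower:
            rate = max(lower)
    return rate, slow_down

sent = {100: 300, 101: 300, 102: 300, 103: 300}
rate, slow = adapt(sent, nsn=102, playout_delay_ms=8000,
                   client_buffer_size=115_000, target_level_ms=12_000,
                   content_rates_kbps=[20, 35, 50], current_rate_kbps=50)
print(rate, slow)  # 35 False
```

With only 8 s of buffered playout against a 12 s target, the sketch switches from the 50 kbps to the 35 kbps version while leaving the transmission rate untouched.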

In the simulations we used the following parameters:

• Client initial buffering period: 8 seconds;
• H.263 video content of 3 minutes duration, encoded at 20, 35 and 50 kbps. The video packet size was 300 bytes (excluding packetization overhead);
• Server switching only at I frames;
• Client buffer size: 115000 bytes;
• Target buffer level: 12 seconds;
• RTCP transmission interval: 1 second.

The network was modeled using an EGPRS network emulator in which two MCS-7 time slots were allocated to the streaming user. In addition, two other users generated Web browsing traffic on the same channel. The air interface bit rate was 89.6 kbps, but because of protocol layer overhead, varying radio conditions and network load, the real throughput perceived by the streaming client was lower. There were three handovers (HO) during the simulations: at time 19.8 s, lasting 2.2 s; at time 101.6 s, lasting 4.0 s; and at time 116.8 s, lasting 1.8 s.

Results are shown in Figures 3-5. The start of the handover periods is marked with vertical lines in the figures. As a result of these handovers, the average network bit rate was very low at these times.

The plot in Figure 3 shows the bit rate received by the streaming client over time and the adapted transmission bit rate. It can be seen that the transmission rate (TX bw) is adapted to the reception rate (RX bw) through estimation of the network throughput. The plot also shows the bit stream (20, 35 or 50 kbps) selected by the server at a given time instant. The average content rate during the session was 40 kbps.

The buffer level in seconds is shown in Figure 4. The target buffer level of 12 s is the minimum protection against throughput variations that the server aims at providing. When the network conditions are good and the server maximizes the throughput available from the network, the buffer duration will be higher than the target level.

For comparison, Figure 5 shows the buffer level for a server that does not implement adaptive streaming and sends content at a constant bit rate (50 kbps). As the simulations are dynamic, the throughput and the time location of the handovers are different. However, the mobile network behavior has similar characteristics as in the previous case. The figure shows that, because of network load and handovers, the initial client buffer level decreases during the connection. After the first and the third handovers the client buffer underflows and the client needs to rebuffer, which interrupts continuous playback.

5. CONCLUSIONS

3GPP Release 6 specifications include a powerful mechanism for bit rate adaptation. It can be used to guarantee end-to-end performance for PSS applications without the need for the network Streaming traffic class or any other mechanism for guaranteed QoS. This paper has presented the technical details of the solution and verified its performance by means of simulation results.

6. REFERENCES

[1] 3GPP, TSG-SA, Transparent end-to-end Packet Switched Streaming Service (PSS). Protocols and codecs (Release 6), TS 26.234, v. 6.3.0, (03-2005).

[2] 3GPP, TSG-SA, Transparent end-to-end Packet Switched Streaming Service (PSS). RTP usage model (Release 6), TR 26.937, v. 6.0.0, (03-2004).

[3] IETF, RTP: A Transport Protocol for Real-Time Applications, RFC 3550, July 2003.

[4] IETF, Real Time Streaming Protocol (RTSP), RFC 2326, April 1998.

[5] Igor D.D. Curcio and David Leon, "Application Rate Adaptation for Mobile Streaming", IEEE Int. Symp. on a World of Wireless, Mobile and Multimedia Networks (WoWMoM '05), Taormina/Giardini-Naxos (Italy), 13-16 June 2005.

[Figure 3 – Transmitted/received bit rate and bit stream. Axes: time (s) vs. rate (kbps), 5 s averages; curves: RX bw, TX bw and the selected bitstream; the three handovers (HO) are marked.]

[Figure 4 – Client buffer level (in time). Axes: time (s) vs. buffer duration (s); the handovers (HO) are marked.]

[Figure 5 – Client buffer level (in time) (no rate adaptation). Axes: time (s) vs. buffer duration (s); the handovers (HO) are marked.]



[P8] Igor D.D. Curcio, Juha Kalliokulju, Miikka Lundan, “AMR Mode Selection Enhancement in 3G Networks”, Multimedia Tools and Applications Journal, Vol. 28, No. 3, Mar. 2006, pp. 259-281.

© 2006 Springer. With kind permission from Springer Science+Business Media.


AMR mode selection enhancement in 3G networks

Igor D.D. Curcio, Juha Kalliokulju, Miikka Lundan

© Springer Science + Business Media, LLC 2006

Abstract This paper describes methods for mode selection in multirate speech codecs, such as the AMR (Adaptive Multi-Rate), that is the mandatory speech codec selected in 3GPP (3rd Generation Partnership Project) mobile networks. Originally, the multirate functionality has been developed for coping with changing radio conditions. The algorithms described in this paper find applicability in IP-based mobile networks, where speech encoded data is encapsulated using the RTP (Real Time Protocol). The main advantages offered by these techniques are improved speech quality and congestion control along the network path between two mobile terminals.

Keywords 3GPP · AMR · Mode selection · Voice over IP · VoIP · RTP

1. Introduction

The evolution of mobile networks through the first, second and third generation is allowing the deployment of richer services based on multimedia and real-time capabilities of terminals and networks. Some of these services can be implemented over Circuit-Switched (CS) mobile networks, such as GSM (Global System for Mobile Communications). However, some other services can be implemented with higher efficiency and flexibility over Packet Switched (PS) mobile networks, such as GPRS (General Packet Radio Service), 3GPP GERAN (GSM/EDGE (Enhanced Data rates for GSM Evolution) Radio Access Network) and UMTS (Universal Mobile Telecommunication System). Packet switched mobile networks make use of the Internet Protocol (IP) for delivering data to the users of the services.

Voice over IP (VoIP) is one of the most challenging applications to deploy over IP-based mobile networks for several reasons. The real-time and low-delay characteristics of the service pose strict requirements on the overall network architecture and terminal implementation. Secondly, the lossy characteristic of the transmission channel imposes the use of special error resilience techniques in order to always provide the best possible speech quality. Any error resilience/protection technique increases the end-to-end delay budget that is critical for conversational VoIP services. Moreover, in PS networks, differently from CS networks, congestion in network routers plays an important role when considering end-to-end Quality of Service (QoS). As VoIP is carried on top of UDP (User Datagram Protocol) encapsulated into RTP [18] packets, the transport protocol does not provide any means for congestion control. For this reason, ad hoc mechanisms to prevent and control congestion must be implemented in the network and/or in the terminals. These mechanisms can be enhanced with the feature of being media-aware, i.e., considering the characteristics of the media flow.

Multimed Tools Appl (2006) 28: 259–281, DOI 10.1007/s11042-006-7714-9

I.D.D. Curcio (*), J. Kalliokulju, M. Lundan
Nokia Corporation, P.O. Box 88, 33721 Tampere, Finland
e-mail: [email protected]

J. Kalliokulju, e-mail: [email protected]
M. Lundan, e-mail: [email protected]

3GPP has selected the Adaptive Multi-Rate (AMR) codec as the mandatory speech codec in 3G mobile networks. AMR is a codec working at eight different bit rates ranging from 4.75 to 12.2 kbps. Each encoding rate corresponds to a mode of the codec. In CS networks, lower modes (bit rates) offer the possibility for higher error protection over the radio interface, in order to provide the best possible perceived voice quality relative to the codec bit rate. Such modes are suitable for higher error rates in the radio link. A Wideband (WB) version of AMR has also been standardized by 3GPP. AMR-WB works with nine different modes, with rates ranging from 6.6 up to 23.85 kbps. If the AMR codecs are used in IP-based mobile networks, speech data is encapsulated into RTP packets for transmission over a mobile network.

In the next sections, we will propose new mode selection algorithms for AMR and AMR-WB speech codecs that can be employed in IP-based 3G mobile networks. The paper is organized as follows: Section 2 describes the basics of AMR and AMR-WB. The methods for selecting the AMR modes based on radio link quality and other techniques are described in Section 3. The proposed new algorithms for AMR mode selection over IP-based mobile networks are included in Section 4, while Section 5 includes simulation results. Section 6 is about related work, and Section 7 concludes the paper.

2. AMR and modes

This section discusses the AMR speech codec and its advanced version AMR-WB. The main motivation for the introduction of multirate codecs is to overcome the degraded speech quality due to errors caused by the radio interface. In CS networks, the radio frames are of fixed size and incorporate the encoded speech bits and the error protection bits. With the introduction of multirate codecs, the error protection part can be increased at the cost of reducing the speech coding bit rate. This encoding technique improves the overall speech quality experienced by the end users.

2.1. AMR speech codec

The AMR speech codec consists of four different functions: source rate controlled multi-rate speech codec, voice activity detection (VAD), comfort noise generation and error concealment [29]. The speech encoder has eight different encoding rates in addition to a low background noise encoding rate. The coding rates vary from 4.75 kbps to 12.20 kbps as shown in Table 1 (1.8 kbps for background noise coding) [29, 33]. The codec can change the encoding mode every 20 ms (i.e., every frame).

The AMR codec is able to operate in source controlled rate mode, where the background noise is coded with a lower bit rate between two talk spurts. The end of a talk spurt is detected by the voice activity detector. During the silence periods the comfort noise generator functions produce parameters that describe the characteristics of the real background noise. At the receiving side, error concealment functions are used to hide the effect of missing speech frames.

In case of packetized speech over IP-based mobile networks, the AMR RTP payload format described in [20] is used, and its size is shown in the last column of Table 1 (see footnote 1).

2.2. AMR wideband speech codec

The AMR Wideband speech codec [30] has the same functions as the AMR codec.The main differences between these two codecs are the following:

& The AMR codec operates on narrow audio bandwidth, limited below 3,400 Hz,whereas AMR-WB brings significant quality improvements by extending theaudio bandwidth up to 7,000 Hz.

& AMR uses a sampling rate of 8 KHz, using 160 speech samples per frame.AMR-WB, on the other hand, employs a sampling rate of 16 KHz, using 320speech samples per frame.

& The range of encoding bit rates for AMR is between 4.75 and 12.2 kbps,whereas for AMR-WB the bit rates vary from 6.60 up to 23.85 kbps, in additionto low background noise coding (see Table 2) [30, 31].

AMR-WB greatly enhances the quality of speech signals. The 12.65 kbps mode is the lowest mode that can offer high quality wideband speech (the two lowest modes are intended to be used only under bad radio link conditions or high loss rates). Furthermore, the highest mode of the AMR-WB codec provides speech quality equal to the ITU-T G.722 wideband codec at 64 kbps.

In case of packetized speech, the AMR-WB RTP payload format described in [20] is used, and its size is shown in the last column of Table 2 (see footnote 1).

3. Selection of an AMR mode

The problem of selecting an appropriate AMR mode is strictly related to the metrics used for measuring link quality. Algorithms for AMR mode selection may consider selecting the mode based on the Bit Error Rate (BER) or C/I estimates, as is done in circuit switched networks [7]. These two estimates typically characterize the quality of radio links. If the mobile network is IP-based, a metric that must also be taken into consideration is the Packet Loss Rate (PLR) occurring in the network routers along the path between the terminals. The reason is that a packet loss event due to congestion is not visible at the radio link level, i.e., in the BER and C/I estimates. Conversely, a bad radio link quality that results in one or several erroneous bits in a speech packet is visible at the IP level if the packet is regarded as lost due to bit errors. Since the voice quality in an IP-based mobile network is affected both by the PLR and by the packet losses induced by the radio link quality, it is critical to take both sources of packet losses into account for a more accurate AMR mode selection.

1 Assuming one speech frame per RTP packet, bit alignment, no CRC, no bit sorting and no interleaving.

The algorithms for AMR mode selection can be implemented in the network (centralized approach) or in the mobile terminals (distributed approach). In the former case, the network is responsible for controlling the delivery of the best possible speech quality to all the mobile terminals at any instant. In the latter case, the terminal has complete responsibility for the link quality measurements and the control of the speech coding mode. The drawback of the distributed approach is that the full benefit can only be obtained if all the terminals implement the advanced AMR mode selection algorithms.

Over IP-based mobile networks, some methods for quality measurement have been developed for Voice over IP. For example, by using the RTCP receiver reports, packet loss rate measurements can be implemented in the terminal following a distributed approach. In this case, the mobile terminal can monitor the quality of the connection by measuring the packet loss rate and request, in case of low speech quality, a

Table 2 Source coding rates of the AMR-WB speech codec

Mode (kbps)   Size of a speech frame (bits)   RTP payload size (bytes)
23.85         477                             62
23.05         461                             60
19.85         397                             52
18.25         365                             48
15.85         317                             42
14.25         285                             38
12.65         253                             34
8.85          177                             24
6.60          132                             19
1.75 a        40                              7

a Assuming silence indicator (SID) frames are continuously transmitted.

Table 1 Source coding rates of the AMR speech codec

Mode (kbps)   Size of a speech frame (bits)   Corresponding fixed rate codec   RTP payload size (bytes)
12.20         244                             GSM EFR, TIA TDMA-US1            32
10.20         204                                                              27
7.95          159                                                              22
7.40          148                             IS-641, TIA/EIA IS-641           20
6.70          134                             PDC-EFR                          19
5.90          118                                                              17
5.15          103                                                              15
4.75          95                                                               14
1.80 a        39                                                               7

a Assuming silence indicator (SID) frames are continuously transmitted.
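The frame sizes in Tables 1 and 2 follow directly from the 20 ms frame length: each frame carries rate × 20 ms bits. A one-line check:

```python
# The AMR/AMR-WB frame sizes are just the mode bit rate over one 20 ms frame.

def frame_bits(mode_kbps: float) -> int:
    # kbit/s * 20 ms = bits per frame (rounded to absorb float error)
    return round(mode_kbps * 20)

print(frame_bits(12.2))   # 244 (AMR highest mode)
print(frame_bits(4.75))   # 95  (AMR lowest mode)
print(frame_bits(23.85))  # 477 (AMR-WB highest mode)
print(frame_bits(6.6))    # 132 (AMR-WB lowest mode)
```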




different AMR mode. This method is certainly efficient; however, it has some drawbacks:

• The RTCP protocol may not be available in certain low-end mobile terminals, which would be penalized in this sense for not being able to use this advanced functionality for QoS management.

• If the RTCP report rate is not adequate, RTCP would convey to the terminal PLR information of low utility for fast mode adaptation.

• Since the use of RTCP may require a dedicated PDP (Packet Data Protocol) context, the network might not allow the allocation of PDP contexts for feedback data, in order to save capacity and reduce the load on the system. It is also foreseen that the first mobile terminals supporting VoIP may not be able to allocate PDP contexts to feedback and control data, but will use them as efficiently as possible for user data transmission.

Having considered these issues about the use of RTCP in QoS control for speech, it is clear that a centralized solution which makes no use of RTCP is required to take full advantage of the AMR speech codec properties. In the next section we will describe algorithms that use only the information contained in the RTP packets to provide better speech quality by adapting the AMR mode in the presence of network congestion or radio link quality fluctuations.

4. AMR mode selection enhancement

The main objective of the existing [12] AMR mode selection is to provide the best possible perceived speech quality for different radio link quality conditions. The current algorithms are designed for circuit switched networks, and therefore do not apply directly to IP-based mobile networks. In packet switched networks the principles of the transcoder-free operation [34] adaptation algorithms can be applied, with the exception that TRAUs (transcoder and rate adapter units) do not exist and base stations are not aware of the content of the packets. Therefore, mobile terminals need to be able to adjust the AMR operation based on the link quality indications coming from the network. Since IP-based mobile networks add a new variable, the loss rate due to congestion in the routers, mode selection algorithms need to be enhanced to work well also in PS mobile networks. An enhanced AMR mode selection should be able to cope with the following combination of cases:

1. Good radio link quality, low packet loss rate in the fixed IP-network.
2. Bad radio link quality, low packet loss rate in the fixed IP-network.
3. Bad radio link quality, high packet loss rate in the fixed IP-network.
4. Good radio link quality, high packet loss rate in the fixed IP-network.

Here, by fixed IP-network, we mean the core network part of the mobile network, or any IP backbone network between the two mobile terminals. The first case results in good speech quality even without a control algorithm. In the second case, the radio link is the dominant source of packet losses compared to a low PLR (by "low" here we refer to a loss rate that does not heavily degrade the perceived speech quality), and therefore modifications of traditional algorithms perform best in that case. Case three is difficult to handle, since the dominant factor for low speech quality could depend either on the radio link quality or on the PLR. The fourth case is the hardest to handle by traditional selection algorithms. The link adaptation algorithm would not notice speech quality deteriorated by a high PLR caused by network congestion, and would not change the AMR mode since the radio link quality is good.

The intent of the paper is to find methods to handle cases 2, 3 and 4. Beforepresenting the new algorithms, a few considerations on transmission of speech andcongestion over IP are introduced.

4.1. AMR streams over IP mobile networks

Transmission of a speech data stream can be made at a fixed or variable packet rate. In the former case, the RTP packets containing encoded speech at a specific bit rate are all of the same size (except for silence packets, which are smaller). In the latter case, the RTP packets have different sizes. For example, encapsulating a variable number of 20 ms speech frames into one RTP packet would yield a variable packet rate transmission. It is known that encapsulation schemes that packetize a low number of speech frames per RTP packet and use fixed packet rates offer the best performance in terms of end-to-end delay and perceived speech quality (for example, [22] shows that encapsulating three AMR frames per RTP packet produces a reduction in Mean Opinion Score (MOS) of over one point, at a Frame Error Ratio (FER) of 3%, compared to the case where one AMR frame is encapsulated into one RTP packet). In the rest of the paper, we assume that AMR speech data is transmitted at a fixed packet rate.

This assumption is also valid in the case of multi-rate codecs like AMR and AMR-WB. Whenever a codec mode is changed, the packet rate does not change. Only the bit rate and the corresponding packet size for the new codec mode vary. For example, if the AMR-WB mode is changed from 12.65 to 8.85 kbps, the packet rate does not change (50 packets per second), but the packet size changes from 34 to 24 bytes (payload only), if a packet contains one speech frame.

Fixed packet rate and variable packet size is, therefore, the best assumption for AMR speech traffic that travels over IP-based mobile networks. Under these conditions, an AMR flow encoded at 12.2 kbps will obviously require more bandwidth than another AMR flow encoded at 10.2 kbps, if the packets are transmitted at the same rate.

4.2. Congestion control for speech traffic

The network routers along the path between two mobile terminals can be configured to measure the length of queues in packet mode or byte mode [19]. When configured to operate in packet mode, the queue length is measured in number of packets. In case of congestion, the router decides to drop packets from the queue with the largest number of packets, ignoring the packet sizes. This will cause the queues with large packets to get a larger share of bandwidth than those with small packets [21]. Differently, when configured in byte mode, the queue length is measured in number of bytes. In case of congestion, the router decides to drop packets from the queue that occupies the largest buffer space in terms of bytes, i.e., smaller packets would be less likely to be dropped than larger ones [13]. In this mode, packet sizes are taken into account when measuring the queue length, and the number of packets in the queue is not strictly relevant.

264 Multimed Tools Appl (2006) 28: 259–281

Springer

The byte mode is implemented in the Random Early Detection (RED) algorithm [15] and its variants [1, 2, 11, 16]; it is also the recommended default queue management mechanism in IP routers [19]. RED in byte mode is designed so that a flow's fraction of the aggregate random packet drops roughly equals its fraction of the aggregate arrival rate in bytes per second [14, 23]. In addition, with this option, the average queue size accurately reflects the average delay at the router [15]. The byte mode is also available in the Weighted Fair Queuing (WFQ) algorithm for scheduling and buffer management [5, 27], to implement Integrated and Differentiated Services routers [10].

In the following, we assume that the byte mode is implemented in the network routers, as it allows a better congestion control of AMR flows with constant packet rates and variable packet sizes.
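The effect of byte mode on small speech packets can be illustrated with a toy version of RED's drop decision. This is our own illustrative sketch (thresholds, maximum probability and reference packet size are arbitrary assumptions, not router defaults): the base drop probability grows linearly between the two queue thresholds, and byte mode scales it by the packet size, so a small AMR packet is proportionally less likely to be dropped than a full-size data packet.

```python
# Toy sketch of byte-mode RED (illustrative only; min_th, max_th, max_p and
# mean_pkt are arbitrary assumptions, not real router defaults).

def red_drop_probability(avg_queue_bytes, packet_bytes,
                         min_th=5000, max_th=15000,
                         max_p=0.1, mean_pkt=1500):
    """Drop probability for one arriving packet under byte-mode RED."""
    if avg_queue_bytes < min_th:
        return 0.0                       # queue short enough: never drop
    if avg_queue_bytes >= max_th:
        return 1.0                       # queue too long: always drop
    # base probability grows linearly with the average queue size
    p = max_p * (avg_queue_bytes - min_th) / (max_th - min_th)
    # byte mode: scale by packet size, so small AMR packets are favoured
    return min(1.0, p * packet_bytes / mean_pkt)

p_amr = red_drop_probability(10000, 72)     # small AMR speech packet
p_data = red_drop_probability(10000, 1500)  # full-size data packet
```

At the same average queue occupancy, the 72-byte speech packet sees a drop probability more than an order of magnitude below that of the 1500-byte packet, which is the behaviour the mode selection algorithm relies on.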

4.3. Enhanced AMR mode selection algorithms

The methods described in the following enhance the ordinary link adaptation algorithms, in the sense that they are complementary. In fact, the techniques presented consider not only radio link quality, but also congestion in the backbone IP networks along the path between the two VoIP terminals. These mode adaptation algorithms have the following characteristics:

- They are network initiated: the algorithms are implemented in the network, not in the terminal. This allows a network operator to have better control over the speech quality delivered to all the users, not only to the ones having the newest terminals.

- They do not use RTCP: some networks may not allow allocating Packet Data Protocol (PDP) contexts for feedback control, and even certain categories of low-end mobile terminals might not implement the RTCP functionality in the RTP protocol, or RTCP may not be available end-to-end (for example in calls to PSTN terminals; see figure 3).

- They allow fast mode adaptation: a reaction time in the order of several seconds (compared to a faster reaction time in the order of hundreds of milliseconds) may not provide sufficient control capability and, in the worst case, it can even deteriorate the voice quality.

- They complement link adaptation algorithms: we consider mainly the problem of reduced speech quality due to packet losses caused by congestion in network routers. However, the proposed algorithms take into account the uplink radio link quality as well as the bad quality resulting from packet losses. The algorithm does not differentiate between packets lost in the radio interface and packets lost in the backbone IP network. For the downlink radio quality, the traditional codec mode request from the receiving terminal is used.

- They help to reduce congestion: users prefer a lower rate, nearly error-free speech quality (where little or no packet losses due to congestion occur) over a higher rate, erroneous speech quality (where a noticeable congestion level and packet losses occur). For example, an AMR stream at 7.4 kbps with 1% PLR offers a better perceived speech quality than a stream at 10.2 kbps with 3% PLR (see figure 1 [32]). If congestion can be reduced, the network can eventually sustain more traffic.


The AMR mode selection enhancements are designed to be used with 3G All-IP networks where the voice is carried over IP, as shown in figures 2 and 3. In figure 2, a call between two All-IP mobile terminals is depicted. Mobile station 1 connects to BTS 1 and further to RNC1 (Radio Network Controller). RNC1 is connected to the packet core network, which in turn is connected to the IP network. The setting is similar on the MS2 side. The enhanced algorithm for AMR control is running in a CL (Congestion Level) monitor that is connected (or embedded) to RNC1 and RNC2 in the case of WCDMA (Wideband Code Division Multiple

Fig. 2 Two All-IP terminals in a call via IP network (with WCDMA radio access network)

Fig. 1 MOS for different AMR bit rates for varying Packet Loss Rates (=FER) [32]


Access) radio interface. The difference when compared to GSM (where the CL monitor is connected to the BTS) is due to the different nature of the radio interface. In WCDMA, the transmission to/from one terminal can happen via several BTSs, and the RNC is the first element where the signals are combined (split).

When the call is made between an All-IP terminal and a fixed PSTN terminal, a gateway is required to perform the speech transcoding. In such a case, the enhanced algorithm is running in a gateway, as shown in figure 3.

The enhancement of the AMR codec mode control operation is described in figure 4 for the case where AMR speech packets are sent from MS2 to MS1, as shown with a dotted line. The CL monitor takes into account the performance characteristics of the IP network and the radio link quality between MS2-RNC2/BTS2 (the impact of the radio link quality between MS2-RNC2/BTS2 on the mode control is inherently shown in the figure, as a bad radio link quality results in packet losses, whereas a good radio link quality does not, and thus CL detects only the losses in the IP network). The radio link quality between RNC1/BTS1-MS1 is measured by the terminal MS1. The link quality control is based on the codec mode request coming from the receiving mobile terminal (MS1), as in the traditional circuit-switched operation. The CL monitor needs to take into account both radio link quality (between RNC1/BTS1-MS1) and PLR information, and make the selections based

Fig. 3 Call from All-IP terminal to fixed PSTN terminal via IP network (with WCDMA radio access network)

Fig. 4 Downgrading AMR codec mode (the CL monitor at RNC/BTS detects bad network quality; the "lower codec mode" command propagates as CMR, CMC and CMI messages between MS1, RNC1/BTS1, RNC2/BTS2 and MS2, while the AMR packets flow in the opposite direction)


on them. If either a high PLR or a bad radio link quality is detected, the RNC1/BTS1 sends a CMR to RNC2/BTS2 to command MS2 to lower the sending AMR bit rate. This scenario is the general case, where the packets from MS1 to MS2 are routed along different paths compared to packets from MS2 to MS1. However, if the operator decides to route the packets for both directions via the same route, the performance of the mode control can be further improved (as described in figure 5), as congestion affects both directions similarly. In figure 5, AMR speech packets are sent from MS2 to MS1. If the CL monitor detects bad network and/or radio link quality between MS2-RNC2/BTS2 (the impact of the radio link quality between MS2-RNC2/BTS2 on the mode control is handled as in the previous case related to figure 4), it sends a direct command to MS1 to lower the encoding rate, to improve the perceived speech quality in MS2 in congested network conditions. In addition to commanding MS1 to lower the encoding rate, RNC1/BTS1 takes into account both radio link quality (between RNC1/BTS1-MS1) and PLR information, and sends a CMR to RNC2/BTS2. The CMR is required to adjust (lower) the encoding rate of MS2. If the RNC2/BTS2 obeys the CMR, the network entity converts the request to the CMC, which is sent to MS2. The figures describe only the case of bit rate downgrading. Other cases are covered in more detail at the end of this section, where the detailed algorithms and the decision criteria for the AMR bit rate control are described.

The CL unit monitors the amount of packet losses; if there is a large amount of packet losses, it can be assumed that the IP network is congested or there is a bad uplink radio channel, and the AMR codec has to be switched to a lower mode. In the case of no losses, the network is not congested and the codec mode can be shifted to a higher bit rate mode. In an IP-based mobile network, the presence of the RTCP protocol is not needed; a simpler algorithm built on top of UDP/IP, which is able to inspect the RTP packets, would suffice. This would enable media-awareness characteristics in the mobile network. In case the algorithm is implemented over a network that employs header compression, the RTP/UDP/IP headers are compressed by the mobile station before transmission of uplink traffic over the air interface; subsequently, the headers are decompressed in the RNC (by the PDCP layer [35]) for uplink traffic. Therefore, the header compression operation is transparent to the AMR enhancement algorithm described, and does not impact

Fig. 5 Downgrading AMR codec mode when the network has been configured to carry uplink and downlink traffic via the same route


its functioning, which makes use of the information in the RTP header of the speech packets.

The packet loss ratio (from now on referred to interchangeably as the Congestion Level, CL) must be measured over a short time T [25], where this value is a configuration parameter in the mobile network. A short measurement interval is needed because of the bursty nature of errors in 3G networks and the nature of speech signals. In fact, for example, it is not important whether the speech quality was bad one minute earlier, if the current quality is good. In other words, the history of CL measurements must be limited to a short period of time. CL can be computed as CL = 100 * PL/EP (which expresses a percentage value), where PL is the number of Packets Lost during time T, and it is computed using the sequence number field in the RTP header of the speech packets. If there are gaps in the sequence numbers of the received packets, a packet loss has occurred, and PL is incremented according to the amount of packets lost. EP is the Expected number of Packets in transit over the RNC. EP is easily computed from the sequence number field, taking into account that the sequence number of the first RTP speech packet can start randomly. If all packets during time T are lost, the RTP sequence numbers cannot be used for determining the number of lost packets. In such a case, the number of lost packets is T divided by 0.02 s (the whole window period divided by a speech frame duration, i.e., 20 ms), and the congestion level is 100%. This works correctly whether the window contains active speech packets, silence packets, or a combination of both.

The packet loss computation function (CL) must be built in a way that it keeps memory of the packet loss information collected during a time T also after this period has elapsed. One idea could be to compute CL over n disjoint time windows of duration T seconds each. However, every time window would be independent of the previous one, leading to unstable and inaccurate packet loss measurements. Since the flow of speech packets is continuous, the CL computation should also have a continuous character, in the sense that CL should have a reasonable amount of memory of past CL measurements. As stated above, all the history of the CL information since the beginning of a speech call is not relevant in our framework; just the most recent history will suffice. To enable the system to keep memory of the past CL measurements, we choose to measure CL always over a pair of adjacent time windows of total length 2*T seconds (each window is T seconds long). An (n)th "old" window of duration T seconds is deleted after an (n + 2)th "new" window of duration T seconds is made available for CL estimation (see figure 6). In this way, the CL measurements in the (n + 2)th time window contain the most recent information on packet losses; in addition, the algorithm always keeps memory of the (n + 1)th window, and the information related to the (n)th window is deleted at time (n + 3)T (or, in other words, only the last two time windows are considered for congestion level measurements).

Fig. 6 Congestion level measurement windows


A weighting function can be used to assign different importance to the most recent time window, compared to the past time window of CL measurements. In general, the CL computed at time (n + 3)T is defined as

CL_(n+3)T = a * CL_(n+2) + (1 − a) * CL_(n+1),

where a is the weight of the most recent CL measurement, and CL_n is the congestion level of the (n)th time window. The purpose of assigning different weights is that of allowing a faster reaction time at the beginning of a congestion period. For example, in the case where the new window shows congestion and the old window contains no congestion, it is possible that the mode change does not occur with the expected speed (if both weights were assigned to be equal). The weights are also useful in the opposite case, that is, at the beginning of an error-free period (i.e., the case where the new window shows no congestion, but the old window contains congestion). In this latter case, it is possible that equal weights would yield unnecessary AMR mode lowering.
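As a worked example (the loss percentages below are illustrative, not from the paper), the weighted combination reacts asymmetrically at the start and the end of a congestion period whenever the newest window carries the larger weight:

```python
# Worked example of the weighted update CL = a*CL_new + (1 - a)*CL_old over a
# pair of windows; the 4 % loss figure is illustrative, not a measured value.

def weighted_cl(cl_new, cl_old, a=0.7):
    """Combine the newest and previous window measurements; a weights the newest."""
    return a * cl_new + (1 - a) * cl_old

# Start of a congestion period: new window 4 % losses, old window clean.
onset = weighted_cl(4.0, 0.0)      # larger than the equal-weight average 2.0
# Start of an error-free period: new window clean, old window 4 % losses.
recovery = weighted_cl(0.0, 4.0)   # smaller than the equal-weight average 2.0
```

With a = 0.7 the onset value crosses a downgrade threshold sooner, and the recovery value re-enters the upgrade range sooner, than an equal-weight average would.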

At this point we can sketch the algorithm for CL computation that makes use of a pair of windows. In this algorithm, SN(Pi) is a function that returns the sequence number of the RTP packet Pi. LSN contains the Lowest packet Sequence Number in a time interval.

CLPast = 0;
For each time window n >= 0 {
    If (n = 0)
        LSN = SN_P(i-1) = SN(first packet arrived in the time interval [nT, (n+1)T]) - 1;
    else
        LSN = SN_P(i-1);                     // LSN contains the last SN received in the previous window
    PLn = 0;
    For each packet Pi arrived in the time interval [nT, (n+1)T] {
        LSN = Min(LSN, SN(Pi));              // Compute the LSN in the time interval
        If (SN(Pi) > SN_P(i-1) + 1) {        // There is at least a packet loss
            PLn = PLn + SN(Pi) - SN_P(i-1) - 1;  // PL is the number of packet losses in the time interval
            SN_P(i-1) = SN(Pi);
        } else if (SN(Pi) < SN_P(i-1) + 1) { // An out-of-order packet has arrived
            If (SN(Pi) > LSN)                // The out-of-order packet belongs to the current measurement window
                PLn = PLn - 1;               // Decrease by 1 the number of packets lost
        } else                               // SN(Pi) = SN_P(i-1) + 1. A normal speech packet has arrived (no losses)
            SN_P(i-1) = SN(Pi);
    }
    // SN_P(i-1) contains the highest SN received in the time interval
    If (no packets arrived in the time interval [nT, (n+1)T])
        PLn = EPn = (time interval [nT, (n+1)T]) / 0.02; // Assumed that all the expected active speech packets are lost
    else
        EPn = SN_P(i-1) - LSN;               // EP is the expected number of packets in the time interval
    If (PLn < 0)
        PLn = 0;                             // Misordering of packets results in false losses
    CLn = 100 * PLn / EPn;                   // Compute congestion level (percentage) in time window n
    If (n = 0)
        CL = CLn;                            // There is no past information, and the current window has 100% weight
    else
        CL = a * CLn + (1 - a) * CLPast;     // Compute current congestion level in RNC
    CLPast = CLn;                            // Save the (n)th congestion level
}

One issue that has been taken into account in the above algorithm is the fact that out-of-order packets may transit through the RNC. If packets do not maintain the correct order of increasing sequence numbers, false statistics can be generated. To deal with this situation, the algorithm is capable of handling out-of-order packets received within the computation window. A straightforward way to do this is to subtract one from PL whenever the sequence number of the received packet in the RNC is smaller than the highest sequence number received until that moment within the window in question. This way, PL closely approximates the actual number of packet losses. In case of out-of-order packets, the expected number of packets EP is computed with an off-by-one packet precision.
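The per-window bookkeeping can be transcribed compactly in Python. The function below is our own simplified rendering of the pseudocode (the names and the single first_expected argument are ours): gaps in the RTP sequence numbers count as losses, and an out-of-order packet whose sequence number falls inside the current window takes one previously counted loss back.

```python
# Simplified transcription (ours) of the per-window loss count of the
# pseudocode: sequence-number gaps count as losses, and in-window
# out-of-order arrivals are credited back.

def window_losses(seq_numbers, first_expected):
    """Return (PL, EP) for one window.

    seq_numbers    -- RTP sequence numbers in arrival order within the window
    first_expected -- highest sequence number seen before this window
    """
    lsn = first_expected          # lowest SN relevant to this window
    highest = first_expected      # highest in-order SN seen so far
    pl = 0
    for sn in seq_numbers:
        lsn = min(lsn, sn)
        if sn > highest + 1:      # gap: at least one packet lost
            pl += sn - highest - 1
            highest = sn
        elif sn < highest + 1:    # out-of-order arrival
            if sn > lsn:          # it belongs to this measurement window
                pl -= 1           # take one counted loss back
        else:                     # sn == highest + 1: in-order, no loss
            highest = sn
    pl = max(pl, 0)               # misordering must not yield negative losses
    ep = highest - lsn            # expected number of packets in the window
    return pl, ep

# Packet 104 never arrives; 103 arrives late but inside the window.
pl, ep = window_losses([101, 102, 105, 103], first_expected=100)
cl = 100 * pl / ep                # congestion level of the window, in percent
```

In the example, the jump from 102 to 105 first counts two losses; the late arrival of 103 credits one back, leaving PL = 1 out of EP = 5 expected packets, i.e., CL = 20%.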

Another issue to be considered in the enhanced algorithm is the Initial Activation Speed S (in seconds), which defines the initial amount of time after which the algorithm can react. For example, if S = 0.5 s, the algorithm will start to be active no earlier than half a second from the beginning of the call. This value is needed to allow the collected statistics to reach a reasonable steady state. In order to react over a fair number of measurements, we assume that S is at least as large as the size of a pair of windows of size T or, in other words, S ≥ 2T. The additional logic related to the Initial Activation Speed can easily be embedded into the above algorithm.

In order to avoid a high switching frequency between consecutive higher/lower modes when the CL is close to the threshold value, thresholds with hysteresis can be used (see [34] or [8] for example) by the decision algorithm that operates on the CL values. Hysteresis determines the sensitivity of the mode-switching algorithm between two adjacent AMR modes. This mechanism defines a range within which no AMR mode switching occurs, and outside which an AMR mode switching occurs. If the threshold is K and the hysteresis is H, the no-switching range is CL ∈ [K − H, K + H], whereas the up mode-switching range is CL ∈ [0, K − H), and the down mode-switching range is CL ∈ (K + H, MAX_CL], where MAX_CL is the maximum congestion level allowed in the network. Threshold and hysteresis can be made dependent on factors such as the round-trip time. Then, the decision algorithm for the AMR mode selection (CL monitor located as in figure 4) could obey the following function:

AMR_bitrate(threshold, hysteresis, CL) =
    if (CL < threshold − hysteresis) then switch to higher AMR bitrate
    else if (CL > threshold + hysteresis) then switch to lower AMR bitrate
    else keep the current AMR bitrate

The above algorithm ignores the downlink radio link quality between RNC1/BTS1-MS1 (see figure 4). The complete solution for the decision algorithm could obey the following function, which takes into account all links between the terminals MS1 and MS2. The quality of the IP network between RNC1/BTS1-RNC2/BTS2, as well as the radio link quality between RNC2/BTS2-MS2, is computed by the CL. The CMR is based on the radio link quality between RNC1/BTS1-MS1. This function allows handling all four combinations of cases described at the beginning of Section 4.

AMR_bitrate(threshold, hysteresis, CL, CMR) =
    if ((CL < threshold − hysteresis) AND
        (CMR does not request lowering the AMR mode)) then
            switch to higher AMR bitrate
    else if ((CL > threshold + hysteresis) OR
        (CMR requests lowering the AMR mode)) then
            switch to lower AMR bitrate
    else if ((threshold − hysteresis < CL < threshold + hysteresis) AND
        (CMR requests increasing the AMR bitrate)) then
            switch to higher AMR bitrate
    else keep the current AMR bitrate
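The complete decision function can likewise be sketched in Python; modelling the CMR simply as one of "lower", "higher" or None (no request) is our own simplifying assumption:

```python
# Sketch of the complete decision function combining CL with the codec mode
# request (CMR) from the receiving terminal. Representing CMR as "lower",
# "higher" or None is an assumption of this sketch.

def amr_decision_with_cmr(cl, cmr, threshold=1.0, hysteresis=0.1):
    """Combine congestion level and downlink CMR into one switching action."""
    if cl < threshold - hysteresis and cmr != "lower":
        return "higher"   # low congestion and no downlink objection
    if cl > threshold + hysteresis or cmr == "lower":
        return "lower"    # either link asks for a lower bitrate
    if threshold - hysteresis < cl < threshold + hysteresis and cmr == "higher":
        return "higher"   # inside the hysteresis band, downlink allows more
    return "keep"
```

Note that a "lower" CMR vetoes any upgrade, so a bad downlink radio channel always wins over a clean IP path.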

The goal is to have smooth transitions when upgrading and downgrading the AMR mode bit rates, so that the speech quality to the end user is gracefully degraded or improved, without abrupt changes, but in such a way that the quality is always maximized.

The algorithm for the CL monitor in the RNC is therefore characterized by only two basic parameters (S and T), and by a variable number of thresholds and hystereses, depending on the actual number of AMR modes in use.

5. Simulation results

In this section, by means of simulations, we will show the benefits of the AMR mode selection enhancement and the benefits of using different T values (window sizes).


The impact of the downlink radio channel quality has been omitted, as it operates according to the existing GSM/UMTS circuit-switched principles. The speech stream was encoded using the AMR codec with silence compression. Each speech payload packet has the length indicated in Table 1, plus 40 bytes for RTP/UDP/IP headers. For 12.2 kbps, the required channel bit rate was 28.8 kbps. We simulated network congestion by using a narrower channel bit rate than the speech stream bit rate. First, we used a 0.8 kbps narrower network channel, which caused only moderate congestion. After that, we used a 4.8 kbps narrower channel, which caused heavy congestion. For congestion level (CL) measurements, we used a pair of windows of 2*T seconds. We assigned different weights to the windows, with the convention of giving a smaller weight to older windows. In fact, we assigned the weight a = 0.3 to the (n)th window, and the weight a = 0.7 to the (n + 1)th window. A single congestion level (CL) threshold was used, and it was equal to 1% with a hysteresis of 10% of it. This yielded a congestion level threshold range of 1 ± 0.1%. In other words, if CL < 0.9%, the enhancement algorithm switched to the next higher AMR bit rate (if available); if CL > 1.1%, the algorithm switched to the next lower AMR bit rate (if available); and if 0.9% ≤ CL ≤ 1.1%, the algorithm kept the current AMR bit rate. It has to be pointed out that short time windows produce CL estimations that are very sensitive to short-time packet loss events, whereas long time windows produce CL estimations that are less sensitive to short-time packet loss events. The sizes of the time windows (T) in our performance experiments were equal to 250, 500, 1,000 and 2,000 ms. We used a 60-second test sequence, which contained 68% speech activity and 32% silence. The longest speech burst was 11 s, whereas the longest silence burst was 6 s.
Figure 7 shows the speech sequence characteristics (black bars show the speech parts, and the white areas the silence parts).

Fig. 7 Speech sequence characteristics

Table 3 Measured congestion level for enhanced AMR mode selection algorithm

AMR mode selection    CL for 0.8 kbps narrower channel (%)    CL for 4.8 kbps narrower channel (%)
Not enhanced          1.40                                    4.40
Enhanced              0.27                                    1.40


Table 3 shows the results with and without enhanced AMR mode selection. The case without enhancement indicates that there is no mode selection available, and the channel bit rate is always smaller than the speech stream sending bit rate.

The simulation results clearly show that the enhanced AMR mode selection reduces the congestion level. With moderate congestion, CL was reduced by 1.13 percentage points, and with heavy congestion by 3 percentage points.

In a second set of tests, we searched for the optimal T value (window size). Again, we used one congestion level threshold to assess the difference. Figure 8 shows the percentage of time spent in each of the AMR modes over the whole 60 s test sequence with a moderate congestion level. The figure shows that over 80% of the time was spent encoding the AMR speech sequence at 12.2 kbps (which is the maximum available bit rate), if T was between 250 and 1,000 ms.

The figure shows that a T value of 250 ms implies a too fast reaction of the enhancement algorithm, and the AMR modes tend to fall to the lowest bit rates. We remind the reader that our objective is that of maximizing speech quality, by using the highest AMR modes for the longest possible period and, at the same time, minimizing congestion, by switching temporarily to lower AMR modes. If T = 250 ms, the lowest AMR mode used is at 5.9 kbps. On the other hand, if T = 2,000 ms, the algorithm reacts too slowly, and the effect is that of spending just about 25% of the time in the highest of the AMR modes (12.2 kbps), and the rest of the time between 6.7 and 10.2 kbps, with some moments of lower quality at 5.9 kbps. This does not maximize the speech quality. If T = 500 ms or T = 1,000 ms, the lowest AMR mode used is at 7.95 kbps, meaning that only the three highest AMR modes are

Fig. 9 Dynamic of the AMR mode changes (moderate congestion)

Fig. 8 AMR mode percentage time spent with moderate congestion level


used (and the lower AMR modes are never used), maximizing speech quality. The difference between the T = 500 ms and T = 1,000 ms cases is very small. However, a deeper analysis suggests that the T = 500 ms case performs better, because it yields a smaller congestion level (see Table 7). Figure 9 shows the dynamic evolution of the AMR mode switching during the 60 s speech sequence for moderate congestion. Here mode 1 corresponds to the 4.75 kbps AMR mode. The T = 250 ms case (black line) shows that the reaction is too fast, and the AMR mode gets too low. The recovery time (the time it takes to switch back to the highest mode) is also fast, but the global quality is not as good as in the T = 500 ms and T = 1,000 ms cases. The longest recovery time for T = 250 ms lasts 1.75 s. The T = 2,000 ms case (light gray line) shows a too slow reaction time (the longest recovery time is around 10 s), and in one case the recovery stops because of new congestion problems (between seconds 35 and 40). The T = 500 ms (thin line) and T = 1,000 ms cases behave in similar ways compared to each other, and offer the best results.

Table 4 shows the average bit rate for each T value, which describes the average speech quality. The results are in line with the previous statements, where we found that T = 500 ms yielded the best results. The average bit rate is 0.1 kbps better than in the T = 1,000 ms case, and 0.2 kbps better than in the T = 250 ms case. The difference from the T = 2,000 ms case is 2.7 kbps (this suggests that the latter T value is far from being optimal). The average bit rate is obtained by calculating the amount of time each mode is used over the 60 s sequence, and multiplying this value by the corresponding AMR mode bit rate. The 95% confidence interval also shows that in the T = 500 ms case, the variation between the different AMR modes is the smallest among the T values used. This can also be seen from figure 9, where the T = 500 ms case is the most stable.
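The averaging step can be written out as follows (the time fractions in the example are illustrative, not the measured ones):

```python
# How a time-weighted average bit rate is obtained: each AMR mode's bit rate
# is weighted by the fraction of the sequence spent in that mode. The shares
# below are illustrative placeholders, not the measured distribution.

def average_bitrate(time_shares):
    """time_shares: {mode_bitrate_kbps: fraction_of_time}; fractions sum to 1."""
    return sum(rate * share for rate, share in time_shares.items())

avg = average_bitrate({12.2: 0.85, 10.2: 0.10, 7.95: 0.05})
```

A flow spending 85% of the time at 12.2 kbps, 10% at 10.2 kbps and 5% at 7.95 kbps thus averages about 11.8 kbps, of the same order as the T = 500 ms row of Table 4.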

Figure 10 shows similar results as figure 8, but this time the congestion is heavy (4.8 kbps). T = 250 ms yields the best results in the sense that the time spent at the highest of the AMR modes (12.2 kbps) was over 55%. On the other hand, the

Fig. 10 AMR mode percentage time spent with heavy congestion level

Table 4 Average bit rate for speech (moderate congestion)

Window size (T value) (ms)    Average bitrate (kbps)    95% confidence interval (kbps)
250                           11.6                      ±0.19
500                           11.8                      ±0.14
1,000                         11.7                      ±0.15
2,000                         9.1                       ±0.27


usage of the 4.75 kbps AMR mode is also the highest if T = 250 ms; this equalizes the differences a little. Confirming that T = 250 ms is the best choice with high congestion levels, Table 7 shows that its congestion level is the lowest.

Figure 11 shows the dynamic evolution of the AMR mode switching during the 60 s speech sequence for heavy congestion (mode 1 again corresponds to the 4.75 kbps AMR mode).

Heavy congestion naturally causes more AMR mode changes than low congestion. Since all T values cause lower AMR modes (the AMR mode at 4.75 kbps, number 1 in figure 11, is the lowest mode), the recovery time becomes more important. The recovery time is the time it takes to switch back to the highest mode. The recovery time is also an indication of how fast the enhanced algorithm reacts to congestion. Table 5 shows the down peak length in time and the lowest AMR mode reached with different T values. The results show that although T = 250 ms causes a drop to AMR at 4.75 kbps, the recovery time is much shorter than with any other T value. This is the reason why T = 250 ms yields the best result. Since the reaction to congestion is fast, the recovery is also fast, and the highest AMR mode is used for a longer time than with the other T values.

Table 6 shows the average bit rate for each T value, which describes the average speech quality. The results resemble our previous results. For heavy congestion we found that T = 250 ms performed the best. The average bit rate is 0.7 kbps better than for the T = 500 ms and T = 1,000 ms cases. The difference compared to the T = 2,000 ms case is 1.1 kbps. The average is computed in the same way as in Table 4. The 95% confidence interval shows that, in order to maintain the best average bitrate, the T = 250 ms case requires fast AMR mode changes, which cause the

Fig. 11 Dynamic of the AMR mode changes (heavy congestion)

Table 5 Total reaction and recovery time and lowest AMR mode

Window size (T value) (ms)    Maximum recovery time (s)    Lowest AMR mode in kbps (number in figure 11)
250                           1.75                         4.75 (1)
500                           3.5                          4.75 (1)
1,000                         5                            5.9 (3)
2,000                         10                           5.9 (3)


highest confidence interval value. Figure 11 also shows that the AMR modes change very rapidly, which enables fast recovery times.

Table 7 shows the average congestion levels for the different window sizes, calculated over the whole 60 s period without the time window weighting (i.e., using the true values received from the network). With moderate congestion (0.8 kbps), the congestion levels for windows between 250 and 1,000 ms are almost the same. The 500 ms window is the best if we combine the AMR mode percentage times (figure 8) and the congestion level results. With heavy congestion (4.8 kbps), the 250 ms window yields the best congestion level results. Combining the AMR mode percentage times (figure 10) and the congestion level results, the 250 ms window turns out to be the best.

Summarizing, the performance results show that the optimum T value is around 250–500 ms. If heavy congestion is expected, the T value should be chosen around 250 ms; if only moderate congestion is expected, the T value can be near 500 ms. In an unpredictable environment, dynamic T values could also be selected based on the instantaneous congestion level.
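The dynamic selection suggested above can be sketched as follows. This is illustrative only: the function name and the 2.0 kbps switchover threshold are our assumptions, since the paper only contrasts a moderate (0.8 kbps) and a heavy (4.8 kbps) congestion case.

```python
def select_window_size(congestion_level_kbps):
    """Pick the averaging window T (in ms) from the instantaneous
    congestion level, following the paper's guideline: ~250 ms under
    heavy congestion, ~500 ms under moderate congestion.

    The 2.0 kbps threshold is an illustrative assumption."""
    if congestion_level_kbps >= 2.0:   # heavy congestion expected
        return 250
    return 500

print(select_window_size(4.8))  # heavy case from the paper -> 250
print(select_window_size(0.8))  # moderate case from the paper -> 500
```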

6. Related work

Rate and link adaptation issues have been analyzed in some existing publications. In [7] the authors describe mode adaptation for the GSM AMR speech codec, based on radio link quality measures (Carrier-to-Interference (C/I) ratio). Christianson and Brown [9] elaborate on a source terminal-initiated rate adaptation scheme for the Pulse Code Modulation (PCM) codec based on RTP Control Protocol (RTCP) receiver reports. In [9], the rate adaptation algorithm reacts only upon reception of three consecutive RTCP reports indicating a packet loss situation, that is, after about 15 s. The algorithm described in [4] employs RTCP reports as the feedback mechanism used by the source terminal, and it changes the encoding rate from 8 up to 32 kbps in steps of 8 kbps, using variable-size RTP packets sent at a constant rate of one packet every 125 ms. The paper [8] highlights the utility of hysteresis thresholds for a multi-rate Differential PCM (DPCM) speech codec operating at 40,

Table 6 Average bitrate for speech (high congestion)

Window size (T value), (ms)    Average bitrate (kbps)    95% confidence interval (kbps)
250      10.1    ±0.33
500       9.4    ±0.31
1,000     9.4    ±0.29
2,000     9.0    ±0.27

Table 7 Measured congestion levels for different window sizes

Window size (T value), (ms)    CL for 0.8 kbps narrower channel (%)    CL for 4.8 kbps narrower channel (%)
250      0.27    1.40
500      0.27    1.50
1,000    0.30    1.80
2,000    1.40    1.80


48 and 64 kbps. Thresholds are used to reduce the mode switching frequency, which may affect the speech quality perceived by the user, and the signaling required to perform this operation. The authors of [3] discuss the quality of dual-rate speech (4.8 and 6.4 kbps) and report that noticeable quality degradation occurs, compared to the fixed-rate quality at 6.4 kbps, if a random switching rate exceeds 30%. This is not directly comparable to our AMR speech tests, which use eight modes and a well-defined mode switching logic. The paper [26] shows that the quality of the MPEG-4 CELP codec is not affected by the response time of the mode switch, for network Round Trip Times of 0.2, 0.5 and 1.0 s and codec bit rates of 11.5, 15.9 and 19.5 kbps. The work presented in [17] presents a distributed RTCP-based decision algorithm for network load estimation; however, the paper does not show simulation results. The paper [24] describes a speech rate adaptation algorithm that makes use of adaptive Forward Error Correction (FEC) (in particular Reed–Solomon codes) to balance the effects of congestion and packet losses. Speech quality is measured using the E-model defined by the ITU-T. However, the authors do not compare their scheme with other rate adaptation schemes that use no FEC. A similar approach is also described in [6]. The use of FEC does not allow a direct comparison with our paper. In the work in [28], the authors monitor the arrival packet jitter at the receiving end, determining one of eight possible network states, to which each AMR mode is mapped.

7. Conclusions

The development of 3rd generation mobile communication networks is going towards All-IP networks to better support data traffic. In the longer term, voice will become 'just one component of data' and there will no longer be separate networks for voice-only support. The performance and design of packet-based networks differ from those of networks designed for voice only. The change of network also reflects on the speech codec selection and operation.

The AMR and AMR-WB codecs are seen as feasible codecs also for IP-based packet-switched networks. However, the existing mode control algorithms do not take into account the characteristics of the 'new' network, i.e., possible packet losses in the backbone IP network. This paper described enhanced algorithms for AMR mode selection to be implemented in 3G networks. The mechanisms are network-based, which enables all users to benefit from improved voice quality. The mechanisms can be used together with link adaptation algorithms to monitor the transmission quality of VoIP connections subject to packet losses along the network path between two (mobile) terminals. A clear advantage of these algorithms lies in the fact that they attempt to deliver at any instant the best possible speech quality to the end users of the service. The network operator benefits from the new algorithms in the form of lower congestion levels in the network, which is also proven with simulations.

The proposed enhancements to the AMR mode selection algorithms ensure that the performance of the speech service does not suffer from the introduction of new types of networks. The service can be further advanced, as speech will be one component of a richer multimedia connection. Moreover, the proposed algorithm provides a simple and efficient tool for operators to collect performance data from the network, which can be used for operation and maintenance purposes.


Page 269: QoS Aspects of Mobile Multimedia Applicationsmoncef/publications/curcio.pdf · QoS Aspects of Mobile Multimedia Applications Thesis for the degree of Doctor of Science in Technology

Acknowledgments The authors would like to thank the reviewers for their precious comments, which helped improve the quality of the paper.

References

1. Anjum FM, Tassiulas L (1999, March 21–25) Fair bandwidth sharing among adaptive and non-adaptive flows in the internet. Proceedings of the 18th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '99), Vol. 3, pp 1412–1420

2. Athuraliya S, Low SH, Li VH, Yin Q (2001, May–June) REM: active queue management. IEEE Netw 15(3):48–53

3. Atungsiri SA, Tateesh S, Kondoz A (1997, June) Multirate coding for mobile communications link adaptation. IEE Proc Commun 133(3):211–216

4. Barberis A, Casetti C, De Martin JC, Meo M (2001, May 1) A simulation study of adaptive voice communications on IP networks. Comput Commun 24(9):757–767

5. Benmohamed LM, Dravida S, Harshavardhana P, Cheong Lau W, Mittal AK (1998, October–December) Designing IP networks with performance guarantees. Bell Labs Tech J 3(4):273–295

6. Bolot JC, Fosse-Parisis S, Towsley D (1999, March 21–25) Adaptive FEC-based error control for internet telephony. Proceedings of the 18th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '99), Vol. 3, New York, USA, pp 1453–1460

7. Bruhn S, Blocher P, Hellwig K, Sjoberg J (1999, May 16–20) Concepts and solutions for link adaptation and inband signaling for the GSM AMR speech coding standard. Proceedings of the 49th IEEE Vehicular Technology Conference (VTC '99), Vol. 3, pp 2451–2455

8. Casares Giner V (2001, May–June) Variable bit rate voice using hysteresis thresholds. Telecomm Syst 17(1–2):31–62

9. Christianson L, Brown K (1999, November 15–17) Rate adaptation for improved audio quality in wireless networks. Proceedings of the 6th International Workshop on Mobile Multimedia Communications (MoMuC '99), San Diego, California, USA, pp 363–367

10. Cisco IOS quality of service solutions configuration guide, Release 12.2 (2001, August). http://www.cisco.com/univercd/cc/td/doc/product/software/ios122/122cgcr/fqos_c/qcfintro.htm

11. De Cnodder S, Elloumi O, Pauwel K (2000, July 3–6) RED behaviour with different packet sizes. Proceedings of the 5th IEEE Symposium on Computers and Communications (ISCC 2000), pp 793–799

12. ETSI TR 101 505, Adaptive Multi-Rate (AMR) speech codec, Study Phase Report, GSM 06.76 v. 7.0.2, 2002-01

13. Floyd S (1997, March) RED: discussion of byte and packet modes, with comments from January 1998 and October 2000. http://www.aciri.org/floyd/REDaveraging.txt

14. Floyd S, Fall K (1997, February 15) Router mechanisms to support end-to-end congestion control. Unpublished manuscript, http://www.icir.org/floyd/papers/collapse.ps

15. Floyd S, Jacobson V (1993, August) Random early detection gateways for congestion avoidance. IEEE/ACM Trans Netw 1(4):397–413

16. Floyd S, Gummadi R, Shenker S (2001, August 1) Adaptive RED: an algorithm for increasing the robustness of RED's active queue management. Paper under submission, http://www.icir.org/floyd/papers/adaptiveRed.pdf

17. Galiotos P, Dagiuklas T, Arkadianos D (2002, July 3–5) QoS management for an enhanced VoIP platform using R-factor and network load estimation functionality. 5th IEEE International Conference on High Speed Networks and Multimedia Communications, pp 305–314

18. IETF RFC 3550 (2003, July) RTP: a transport protocol for real-time applications

19. IETF RFC 2309 (1998, April) Recommendations on queue management and congestion avoidance in the internet

20. IETF RFC 3267 (2002, June) RTP payload format and file storage format for the AMR and AMR-WB audio codecs

21. Jain R (1990, May) Congestion control in computer networks: issues and trends. IEEE Netw 4(3):24–30

22. Lakaniemi A, Ojala P, Toukomaa H (2002, October 6–9) AMR and AMR-WB RTP payload usage in packet switched conversational multimedia services. IEEE Workshop Proceedings on Speech Coding, pp 147–149

23. Mahajan R, Floyd S (2001, April) Controlling high bandwidth flows at the congested router. ICSI technical report, TR-01-001, http://www.aciri.org/floyd/papers/red-pd.TR.pdf


24. Matta J, Pepin C, Lashkari K, Jain R (2003, June 1–3) A source and channel rate adaptation algorithm for AMR in VoIP using the E-model. Proceedings of the Network and Operating Systems Support for Digital Audio and Video Conference, Monterey, California, USA, pp 92–99

25. Nananukul S, Koodli R, Dixit S (2000, June 26–29) Controlling short-term packet loss ratios using an adaptive pushout scheme. IEEE Conference on High Performance Switching and Routing, Heidelberg, Germany, pp 49–54

26. Nomura T, Iwadare M (1999, June 20–23) Voice over IP systems with speech bitrate adaptation based on MPEG-4 wideband CELP. IEEE Workshop on Speech Coding, Porvoo, Finland, pp 132–134

27. Parekh AK, Gallager RG (1993, June) A generalized processor sharing approach to flow control in integrated services networks: the single-node case. IEEE/ACM Trans Netw 1(3):344–357

28. Seo JW, Woo SJ, Bae KS (2001, May 7–11) Study on the application of an AMR speech codec to VoIP. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), Salt Lake City, Utah, USA, Vol. 3, pp 1373–1376

29. 3GPP TSG-SSA, AMR speech codec. General description (Release 4), TS 26.071 v. 4.0.0, 2001-03

30. 3GPP TSG-SSA, AMR wideband speech codec. General description (Release 5), TS 26.171 v. 5.0.0, 2001-03

31. 3GPP TSG-SSA, AMR wideband speech codec. Frame structure (Release 5), TS 26.201 v. 5.0.0, 2001-03

32. 3GPP TSG-SSA, Performance characterization of the AMR speech codec (Release 4), TS 26.975 v. 4.1.0, 2001-09

33. 3GPP TSG-SSA, AMR speech codec frame structure (Release 4), TS 26.101 v. 4.2.0, 2002-03

34. 3GPP TSG-GERAN, Link Adaptation (Release 5), TS 45.009 v. 5.5.0, 2002-06

35. 3GPP TSG-RAN, Packet Data Convergence Protocol (PDCP) Specification (Release 4), TS 25.323 v. 4.6.0, 2002-09


Igor D.D. Curcio, born in Milan (Italy) in 1968, worked from 1986 to 1997 for several companies as a freelance software engineer, project manager and Information Technology educator. He received the Laurea degree in Computer Science from the University of Catania (Italy) in 1997. In 1998 he joined Nokia Corporation, where he has held several research and management positions in the areas of real-time mobile multimedia. He is now Technology Manager in the Home Domain area. He has been active in several standardization organizations (such as 3GPP, IETF, DLNA, DVB, ARIB), where he has held sub-working group or task force Chair positions. Mr. Curcio holds 3 international patents and several pending patent applications. He has been an ACM member since 1990 and a senior IEEE member since 2004. He has also published several papers in the areas of software engineering, Video-on-Demand, and QoS of mobile video and streaming. Mr. Curcio is currently a Ph.D. candidate at the Signal Processing Laboratory of Tampere University of Technology. His current interest areas include mobile broadcast home domain applications.


Juha Kalliokulju graduated from Tampere University of Technology in 1993. Between 1993 and 1996 he was involved in the development of WCDMA physical layer algorithms at Nokia Mobile Phones. From 1996 to 1999 he worked as an Engineering Manager for 3rd generation user data protocols and QoS development, and from 1999 to 2004 he continued the 3G work as a Principal Scientist at Nokia Mobile Phones. During that period he actively participated in QoS and radio access network standardization in 3GPP and IETF. He has also been actively involved in new service development, being responsible for the technical realization of enriching the voice call. Currently, he is working as a Product Marketing Manager at Nokia, responsible for European operator customers.

Miikka Lundan received his M.Sc. degree in Computer Science from Tampere University of Technology (TUT) in 2001. He is currently pursuing his Ph.D. degree on mobile streaming at TUT. He has authored several international publications and two patents. He joined Nokia Corporation in 1999. Between 1999 and 2001 he worked in a SIP-based video telephony project. In 2001 he joined the group that was standardizing 3GPP PSS (Packet-switched Streaming Service) at Nokia. In 2004 he joined Nokia S60 and started to work in the user experience field in the multimedia area. From 2005 onwards he has been a product manager of S60 Multimedia in the camera and content management area.


[P9] Varun Singh, Jörg Ott, Igor D.D. Curcio, "Rate Adaptation for Conversational 3G Video", Proc. 2nd International Workshop on Mobile Video Delivery (MoViD), in conjunction with the 28th IEEE Conference on Computer Communications (INFOCOM '09), 24 Apr. 2009, Rio de Janeiro, Brazil.

© 2009 IEEE. Reprinted with permission.


Rate Adaptation for Conversational 3G Video

Varun Singh, Jörg Ott
Helsinki University of Technology (TKK), Espoo, Finland
Email: {varun,jo}@netlab.tkk.fi

Igor D.D. Curcio
Nokia Research Center, Tampere, Finland
Email: [email protected]

Abstract—Wireless cellular environments, such as UMTS, are often affected by congestion and errors, which are inherent to wireless transmission channels due to fading, interference, resource scarcity, mobility, etc. For a conversational video application to be successful, i.e., to provide good viewing quality to the receiver at all times, the sender must be able to quickly adapt its sending/encoding rate (and other related parameters) to that offered by the link. Moreover, for a rate adaptation scheme to be successful, the receiver must provide timely feedback in order to mitigate further losses due to congestion. In this paper, we investigate different rate adaptation mechanisms and redefine them for 3GPP networks, reusing existing RTCP extensions standardized in the IETF and in 3GPP where possible.

I. INTRODUCTION

The third generation mobile system provides conversational video communication in the Media Telephony Service for IMS (MTSI) [1]. This 3GPP standard supports the use of H.264/AVC [2] encapsulated in RTP for carrying video traffic. A typical conversational mobile multimedia system, such as MTSI, requires that end-to-end delays do not exceed values in the order of 400 ms [3] to provide acceptable media quality for playback and a good user experience. Fading, interference, mobility, handovers, cell loading and other factors often cause the available bandwidth for each user to fluctuate, which causes congestion in the network. Moreover, packet losses may occur due to radio effects causing bit errors, congestion-induced drops from router queues, and packets discarded due to late arrival at the receiver. Since packet losses are detrimental to video quality perception and expensive to repair, they need to be avoided as much as possible.

Mobile multimedia applications thus need to adapt to the bandwidth constraints by adjusting their encoding and/or transmission rate. However, congestion control in wireless 3G networks for conversational video applications is challenging, because the application-defined maximum delay (400 ms) and the minimal network-incurred latency leave only very little room for a congestion control algorithm to operate. Traditional congestion indicators such as packet losses are not applicable because 1) air interface losses and congestion losses may be hard to differentiate and, more importantly, 2) increased queuing delays in the network may cause the receiver to discard packets even before congestion losses occur. Therefore, a sender has to anticipate upcoming congestion from various cues (including, but not limited to, the per-packet delay used in many delay-based congestion control algorithms) to prevent network queues from building up in the first place.

This requires extreme sensitivity to the reported transmission characteristics.

In this paper, we choose a suitable operating environment as defined by 3GPP in [3] to help evaluate the performance of our new algorithm and our enhancements of existing rate adaptation signaling schemes against those already defined, such as TCP Friendly Rate Control (TFRC) [4], [5], [6] and Temporary Maximum Media Stream Bit rate Request (TMMBR) [1], [7]. We introduce these and discuss the related work in Section II. In Section III we introduce our new rate adaptation algorithm, and we explain the features and configuration of the simulation environment in Section IV. Section V presents the results and compares them against each other; we draw conclusions and discuss directions for future work in Section VI.

II. OVERVIEW OF RATE ADAPTATION TECHNIQUES

The decision-making process of rate adaptation can take place at the sender, the receiver, or at some intermediate node (edge or core) in the network. Sender-driven rate adaptation requires that the receiver be aware of the current network situation, i.e., latency experienced by a packet, current jitter buffer state at the receiver, current decoding rate, packets lost, etc., and signal this information to the sender, which decides to adapt the rate based on the received parameters. In a receiver-driven rate adaptation scheme, the receiver gauges the current situation based on the parameters available to it and signals the new required bandwidth to the sender which, on receiving the new rate, adapts to it. In network-driven rate adaptation, an element in the mobile network signals to the sender/receiver that the rate is going to drop or increase due to better or worse network conditions arising from handovers, cell loading, etc. In these cases the network is aware of the conditions beforehand and can therefore signal the new data rate to the appropriate node.

TCP Friendly Rate Control (TFRC) is an equation-based congestion control algorithm implemented at the sender [4] and is a profile in the Datagram Congestion Control Protocol (DCCP) used to compete fairly for bandwidth with other flows. TFRC uses knowledge at the sender to calculate the new bandwidth based on the average packet size, RTT and loss rate [8]. [5] extends [6] for multimedia applications by using the RTP/RTCP feedback loop to control the algorithm and redefines the timing rules in [9] for very short RTTs (< 20 ms).
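For reference, the TFRC send rate mentioned above is computed from the TCP throughput equation of RFC 3448. The sketch below is ours, not taken from the paper; the default t_RTO = 4R is the common simplification suggested in the RFC.

```python
from math import sqrt

def tfrc_rate(s, R, p, t_rto=None, b=1):
    """TFRC throughput equation (RFC 3448):
    X = s / (R*sqrt(2bp/3) + t_RTO * (3*sqrt(3bp/8)) * p * (1 + 32p^2))
    s: packet size (bytes), R: round-trip time (s), p: loss event rate,
    t_RTO: retransmit timeout (s), commonly 4*R, b: packets per ACK.
    Returns the allowed sending rate in bytes per second."""
    if t_rto is None:
        t_rto = 4 * R
    denom = (R * sqrt(2 * b * p / 3)
             + t_rto * (3 * sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))
    return s / denom

# Example: 500-byte packets, 100 ms RTT, 1% loss event rate
print(tfrc_rate(500, 0.1, 0.01))
```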

Temporary Maximum Media Bit-rate Request (TMMBR) / Temporary Maximum Media Bit-rate Notification (TMMBN):

978-1-4244-3968-3/09/$25.00 ©2009


Fig. 1. Receiver-side Queuing model

In addition to the Feedback Control Information defined in RFC 4585 [9], RFC 5104 [7] (codec control messages for AVPF) defines several more codec-related feedback messages, such as TMMBR and TMMBN. TMMBR is generated by the receiver in a point-to-point (PtP) scenario and is sent to request the sender to limit its maximum bandwidth to that value, i.e., the sender may choose the value received in the TMMBR or a lower value. TMMBN is a notification sent by any entity (sender, receiver, network) to the other to notify the bounding rate it is using.

Next Application Data Unit (NADU) for streaming video: The maximum delay budget (400 ms) and the minimal network latency provide a small opportunity for the receiver to queue packets for a very short period of time. For example, in the 3G simulation framework [3] the system allows a maximum 400 ms delay budget (from video capture to display) and a 240 ms static one-way delay. If there is no congestion due to queuing at the intermediate nodes in the core network, then the maximum time the receiver can cache is max_cache_time = 400 − 240 = 160 ms. This potentially means that the receiver can queue up to 2–3 frames of a 15 fps video stream. NADU is a signaling mechanism which intimates to the sender the playout delay of the first packet in the RTP queue and its sequence number [10], [11]. NADU is already defined in 3GPP [12] for the video streaming scenario, where it provides the sender with playback buffer information. Figure 1 shows the receiver-side queuing model and some of the terms associated with the signaling of NADU, like Highest Sequence Number (HSN), Next Sequence Number (NSN), buffer fill level, number of Packets in Transit (PiT), number of Packets in the Buffer (PiB), Last RTP packet sent from the sender (LPS) just before receiving the RR, Playout Delay, etc.
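The delay-budget arithmetic above can be checked directly; the constants come from the 3GPP simulation framework [3], and only the variable names are ours.

```python
DELAY_BUDGET_MS = 400          # max capture-to-display delay (MTSI)
STATIC_ONE_WAY_DELAY_MS = 240  # static one-way delay in the framework

# Time the receiver may hold packets without busting the budget
max_cache_time_ms = DELAY_BUDGET_MS - STATIC_ONE_WAY_DELAY_MS

# Whole frames cacheable for a 15 fps stream (frame interval ~66.7 ms)
frame_interval_ms = 1000 / 15
frames_cacheable = int(max_cache_time_ms // frame_interval_ms)

print(max_cache_time_ms, frames_cacheable)  # 160 ms, 2 whole frames
```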

III. NEW RATE ADAPTATION SCHEMES

In this section, we introduce two rate adaptation algorithms for conversational video. The first one consists of a superset of NADU signaling based on RTCP XR Discard Metrics packet reports and ordinary RTCP RR information; a new sender-side algorithm is also described. We will denote the algorithm and signaling as C-NADU, which stands for Conversational NADU. The second one consists of variants of TMMBR that couple uplink and downlink information coming from the network. TMMBR using both uplink and downlink information will be denoted in the following as TMMBR-A,

using only downlink information will be denoted as TMMBR-B, and unassisted TMMBR will be denoted as TMMBR-U. Finally, reactive scheduling of RTCP reports is also a way to improve system performance, and is part of our solution.

A. Conversational NADU (C-NADU)

We define two modes of operation for this new sender-side rate adaptation scheme: congestion avoidance and congestion mitigation. In congestion avoidance, the sender (or receiver) tries to detect if the link is undergoing light congestion and, based on the input, slightly increases or decreases the sending rate. For example, slight reductions or increases in round-trip time (RTT), jitter, packets in transit (PiT), etc. can be indicators of light congestion or under-utilization. In the case of congestion mitigation, however, the rate adaptation module realizes that there is already heavy congestion and needs to take corrective action immediately; for example, high packet loss might indicate the presence of heavy congestion. Accordingly, in congestion avoidance only small changes to the bandwidth are made, whereas in congestion mitigation more drastic changes might be made to mitigate the congestion. The rate adaptation algorithm takes input from many parameters signaled from the receiver to the sender via various extensions defined for RTCP, namely:

• Normal RTCP Receiver Report (RR) [13]:
  – Fraction Loss (FL)
  – Inter-arrival Jitter (Jitter)
  – Calculated RTT (RTT)
  – Highest Sequence Number (HSN)

• NADU packet [12] reports:
  – Next Sequence Number (NSN): the RTP sequence number of the next packet to be decoded from the receiver queue. If no packets are available for playout, then NSN = HSN + 1 (this packet has not yet been received by the receiver).
  – Playout Delay of NSN (PD_NSN): the difference between the scheduled playout time of the NSN packet and the time the receiver sends the RTCP report [12]. If no packets are available for playout, the receiver can signal PD_NSN = 0xFFFF.

• RTCP XR Discard Metrics [14] packet reports:
  – the number of bytes discarded (bytes_discarded), related to the packets dropped at the receiver due to late arrival.

In addition to the above signaling information, the sender maintains a ring buffer with the sizes of all video packets sent since the last RR received. Figure 1 describes the receiver-side queuing model and visually represents some of the entities received in the RR together with local state information.

Using the aforementioned parameters the sender is able to calculate the following:

1) PiT = LPS − HSN packets.
2) Receiver buffer fill level, if NSN < HSN:
   • number of Packets in Receiver Buffer, PiB = (HSN − NSN) + 1 packets.
   • time to drain the buffer, Buffer Fill-level_in_ms = RTP_TS(HSN) − RTP_TS(NSN) ms.
   • size of packets from the ring buffer, Buffer Fill-level_in_bytes = Σ_{i=NSN..HSN} sizeof(i) bytes.
3) Playout delay experienced by the HSN if no underflow or losses occur, PD_HSN = PD_NSN + Buffer Fill-level_in_ms + RTT/2 ms.¹
4) Perceived receiver rate at the sender,
   ReceiverRate_perceived (kbps) = [Σ_{i=HSN_lastRR..HSN+PiT} sizeof(i) × (1.0 − FL) × 8] / [1000 × (t_now − t_lastRR)].
5) Perceived receiver goodput, i.e., the perceived rate that was played back,
   Goodput_perceived (kbps) = [(Σ_{i=HSN_lastRR..HSN} sizeof(i) × (1.0 − FL)) − bytes_discarded] × 8 / [1000 × (t_now − t_lastRR)].

In addition to the ring buffer, the sender keeps a short history of some of the above parameters, namely PiT, PiB, Jitter and RTT, by calculating the correlation of the current value with the moving average of the last 3 values or with the 90th-percentile values of lossless reports:

1) Correlated RTT: using the 90th-percentile value of all loss-less RTTs it is possible to calculate the correlation of the current RTT, CorrRTT = 90-percentile_lossless(RTT) / RTT_now.
2) Correlated PiT and PiB are calculated to ascertain if the queues in the network and at the receiver are increasing or decreasing: CorrPiT = PiT_avg_last3 / PiT_now and CorrPiB = PiB_avg_last3 / PiB_now.
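The sender-side bookkeeping can be sketched as follows. Variable names mirror the paper's symbols; the ring buffer modeled as a dict from RTP sequence number to packet size, and the function signature itself, are our assumptions rather than the authors' implementation.

```python
def sender_estimates(ring, lps, hsn, nsn, hsn_last_rr, fl,
                     bytes_discarded, t_now, t_last_rr,
                     ts_hsn_ms, ts_nsn_ms):
    """Compute PiT, PiB, buffer fill level and perceived rates from one
    RTCP RR + NADU + XR Discard report.  `ring` maps RTP sequence
    number -> packet size in bytes for packets sent since the last RR."""
    pit = lps - hsn                                  # packets in transit
    pib = (hsn - nsn) + 1 if nsn < hsn else 0        # packets in receiver buffer
    fill_ms = ts_hsn_ms - ts_nsn_ms if nsn < hsn else 0
    fill_bytes = sum(ring[i] for i in range(nsn, hsn + 1)) if nsn < hsn else 0

    dt = t_now - t_last_rr                           # seconds since last RR
    sent_to_hsn = sum(ring[i] for i in range(hsn_last_rr, hsn + 1))
    sent_incl_transit = sum(ring[i] for i in range(hsn_last_rr, hsn + pit + 1))

    rate_kbps = sent_incl_transit * (1.0 - fl) * 8 / (1000 * dt)
    goodput_kbps = (sent_to_hsn * (1.0 - fl) - bytes_discarded) * 8 / (1000 * dt)
    return pit, pib, fill_ms, fill_bytes, rate_kbps, goodput_kbps

# Toy example: ten 500-byte packets with sequence numbers 100..109
ring = {i: 500 for i in range(100, 110)}
print(sender_estimates(ring, lps=109, hsn=107, nsn=105, hsn_last_rr=100,
                       fl=0.0, bytes_discarded=0, t_now=1.0, t_last_rr=0.0,
                       ts_hsn_ms=200, ts_nsn_ms=67))
```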

In Algorithm 1, line 15, δ_undershoot is calculated only for the first loss event of a new downward trend; this is done to quickly mitigate congestion caused by higher-rate packets in transit. Lines 9, 22 and 45 use constants (α, β, Ψ) when no conclusive information is available, in cases of extreme congestion or underflow.

B. TMMBR-A, TMMBR-B and TMMBR-U

In TMMBR-A, the network notifies the sender and the receiver of the uplink and downlink rates, respectively. The sender is aware of the downlink capacity, but this information arrives at the sender delayed by the order of one one-way delay from the receiver. However, the downlink may not be the constraining link; therefore, the sender also receives information about the uplink rate. In TMMBR-B, the network notifies the receiver of the downlink rate. As before, the sender is notified about the current downlink capacity by the receiver; however, the sender is not aware of the uplink rate. Hence, the TMMBR messages from the receiver are considered an upper bound for the current encoding rate, and the TMMBR message rate is never exceeded.

In TMMBR-U, the network assists neither the sender nor the receiver. The receiver sends the new bandwidth request to

¹ One-way delays are presumed to be symmetric, even though video data is flowing only in one direction, which makes the delays asymmetric.

Algorithm 1 Sender-side Rate Adaptation Algorithm
Require: Encoder maintains a ring-buffer with the sizes of packets sent since the HSN of the last RR
Ensure: Reception of the latest RR from the receiver
 1: Parse (RR) ⇒ (RTT_now, Jitter, FL, HSN_now)
 2: if available, Parse (NADU) ⇒ (NSN, PD_NSN)
 3: if available, Parse (RTCP XR Discard Metric) ⇒ (bytes_discarded)
 4: Calculate PiB_now, PiT_now, CorrRTT, CorrPiT, CorrPiB
 5:   and ReceiverRate_perceived, GoodPut_perceived, PD_HSN
 6: if (HSN_now = HSN_last_RR) then
 7:   // No packets were received!
 8:   NewBw ← CurrentBw × α,
 9:   ∀ α ∈ (0, 1), we use α = 0.5
10: else
11:   if ((FL > 0) || (bytes_discarded > 0)) then
12:     // Congestion mitigation!
13:     if (CurrentBw > GoodPut_perceived) then
14:       NewBw ← GoodPut_perceived × δ_undershoot,
15:       ∀ δ_undershoot ∈ (0, 1]
16:     else
17:       // High congestion!
18:       if (CorrRTT < 1.0) then
19:         NewBw ← CurrentBw × CorrRTT
20:       else
21:         NewBw ← CurrentBw × β,
22:         ∀ β ∈ (0, 1), we use β = √2/2
23:       end if
24:     end if
25:   else
26:     // Congestion avoidance!
27:     if (CorrPiT < 1.0) then
28:       NewBw ← CurrentBw × CorrPiT
29:     else if (CorrPiB < 1.0) then
30:       NewBw ← CurrentBw × CorrPiB
31:     else if ((CorrPiT > 1.0) AND (CorrPiB > 1.0)) then
32:       NewBw ← CurrentBw × CorrPiT
33:     end if
34:     if (PD_HSN ≠ 0xFFFF) then
35:       NewBw ← CurrentBw × PD_max / PD_HSN,
36:       ∀ PD_max = 400 ms
37:     else
38:       // Underflow!
39:       if (CurrentBw < ReceiverRate_perceived) then
40:         NewBw ← ReceiverRate_perceived
41:       else if (CorrRTT > 1.0) then
42:         NewBw ← CurrentBw × CorrRTT
43:       else
44:         NewBw ← CurrentBw × Ψ,
45:         ∀ Ψ ∈ (1, 2), we use Ψ = 1.1
46:       end if
47:     end if
48:   end if
49: end if
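The decision logic of Algorithm 1 can be transcribed compactly as follows. This is a sketch for illustration: the constants are the paper's, but the function signature and the use of `None` to stand in for the 0xFFFF "no packets queued" marker are our assumptions.

```python
from math import sqrt

ALPHA, BETA, PSI, PD_MAX = 0.5, sqrt(2) / 2, 1.1, 400  # constants from Algorithm 1

def new_bandwidth(cur_bw, fl, bytes_discarded, hsn, hsn_last_rr,
                  goodput, recv_rate, corr_rtt, corr_pit, corr_pib,
                  pd_hsn, delta_undershoot=1.0):
    """One pass of the C-NADU sender-side rate decision (Algorithm 1).
    pd_hsn=None stands in for the 0xFFFF 'no packets for playout' marker."""
    if hsn == hsn_last_rr:                       # no packets were received
        return cur_bw * ALPHA
    if fl > 0 or bytes_discarded > 0:            # congestion mitigation
        if cur_bw > goodput:
            return goodput * delta_undershoot
        # high congestion
        return cur_bw * corr_rtt if corr_rtt < 1.0 else cur_bw * BETA
    # congestion avoidance
    if corr_pit < 1.0:
        new_bw = cur_bw * corr_pit
    elif corr_pib < 1.0:
        new_bw = cur_bw * corr_pib
    elif corr_pit > 1.0 and corr_pib > 1.0:
        new_bw = cur_bw * corr_pit
    else:
        new_bw = cur_bw
    if pd_hsn is not None:                       # playout delay scaling
        new_bw = cur_bw * PD_MAX / pd_hsn
    else:                                        # underflow
        if cur_bw < recv_rate:
            new_bw = recv_rate
        elif corr_rtt > 1.0:
            new_bw = cur_bw * corr_rtt
        else:
            new_bw = cur_bw * PSI
    return new_bw
```

For example, with loss reported and goodput below the current rate the sender undershoots to the perceived goodput, while with no RR progress at all it halves the rate.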


Fig. 2. Simulation environment

the sender using TMMBR, based on the average inter-arrival time of RTP packets between two RTCP RRs. The receiver also enhances the performance of TMMBR in all scenarios (TMMBR-A, TMMBR-B and TMMBR-U) by signaling the number of discarded bytes [14] to the sender, as this helps in undershooting and thus temporarily alleviating the stress on the network queues. Due to link-induced losses, the sender implements some light congestion avoidance techniques based on increasing RTT, discarded bytes [14] and packet loss.

C. Reactive scheduling of RTCP RRs from receiver to sender

RFC 4585 [9] allows throttling of RTCP to 2.5% of the available bandwidth for each end-point in a point-to-point scenario, which is quicker than the 5 ± 2.5 s restriction described in [13]. [15] describes that sending feedback every 200 ms, or up to every 380 ms, helps in quicker adaptation to congestion, but uses non-compound RTCP [16] to conserve RTCP bandwidth. However, we do not use non-compound RTCP [16] reporting, as a normal RTCP packet carries essential information, such as RTT, HSN, jitter, etc., to the sender.

For reactive scheduling of RTCP we consider the bad packet rate, which takes into account both the lost and the discarded packets at the receiver. We define a threshold between 20% and 30%, beyond which the feedback interval is halved. However, while reducing the feedback interval we respect the lower bound set by the minimum RTCP interval of [9].
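This scheduling rule can be sketched as follows; the function and parameter names are ours, and the paper's 20-30% threshold band is fixed here at 25% for illustration:

```python
def rr_interval(lost, discarded, expected,
                base_interval=0.5, min_interval=0.25, bad_threshold=0.25):
    """Return the next RTCP RR interval in seconds (illustrative sketch)."""
    # Bad packet rate counts both lost and late-discarded packets.
    bad_rate = (lost + discarded) / max(expected, 1)
    if bad_rate > bad_threshold:
        # Congested: halve the interval (i.e., double the feedback rate),
        # but never go below the minimum RTCP interval constraint.
        return max(base_interval / 2.0, min_interval)
    return base_interval
```

With 20 lost and 15 discarded packets out of 100 expected, the bad packet rate of 35% triggers the faster 250 ms reporting interval.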

IV. SIMULATION ENVIRONMENT

Our simulation environment is built using ns2 [17] for the core network. The simulator interfaces with the Nokia H.264 codec [18] so that the rate adaptation algorithms can be evaluated in a real-world setting. We have extended ns2 as described in [19] to provide real-time exchange of RTP/RTCP messages between the codec and ns2 by routinely synchronizing their clocks. Furthermore, the receiver RTP layer is extended to generate feedback messages, while the sender is extended to respond to them. Fig. 2 shows an overview of the simulation environment. The sender/receiver generate video data encapsulated in RTP packets [13]. The decoder generates RTCP feedback based on [7], [13], [14], [20], and conforms to the timing rules described in RFC4585 [9]. Furthermore, the 3G core network is presumed to be a well-provisioned, error-free network. The four 3G links are used as access links between the codec and the core network. The 3G links conform to the behavior described in [21]. The Radio Link Control (RLC) [22] frame sizes and their scheduling control the amount of data (inclusive of all headers) that can flow on the 3G links. The RLC frame sizes and the scheduling opportunities of the frames conform to those defined by 3GPP for evaluation of rate adaptation [3].

There are four different RLC pattern files: two for the sender side, uplink (UL) / downlink (DL), and two for the receiver side, uplink/downlink. The simulation environment can also produce 0.5% to 1.5% link-layer losses (3G Link) using error patterns defined in [21]. To simulate the 0.5% losses, the RLC frames [3] are further broken down into 40-byte frames and sent over the 3G link. If a 40-byte frame is dropped, reconstruction of the associated IP packet fails; therefore, a 0.5% loss rate may cause a higher IP-layer packet loss [19]. It should also be noted that no header compression was used over the 3G links.
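The loss amplification caused by 40-byte segmentation can be illustrated with a toy Monte Carlo sketch (our own construction, not the simulator's code):

```python
import random

def ip_loss_rate(packet_size, n_packets, seg_loss=0.005, seg_size=40, seed=7):
    """Fraction of IP packets lost when each packet is split into 40-byte
    RLC segments and any lost segment makes the whole packet unrecoverable."""
    rng = random.Random(seed)
    segs_per_packet = -(-packet_size // seg_size)  # ceiling division
    lost = sum(
        any(rng.random() < seg_loss for _ in range(segs_per_packet))
        for _ in range(n_packets)
    )
    return lost / n_packets
```

For 1000-byte packets (25 segments each), a 0.5% segment loss amplifies to roughly 1 − 0.995^25 ≈ 11.8% at the IP layer, which illustrates why the IP-layer loss exceeds the nominal link loss.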

The uplink and downlink queues in the network are long queues with a 200 ms time-to-live for a packet in the queue. Therefore, only complete IP packets are transmitted through the core network. Apart from the queuing delay caused by the RLC scheduling of each packet at the UL/DL queues, the packets are queued for a further 240 ms as a static one-way delay just before they are delivered to the receiver. Instead of using fixed packet sizes as described in [3], we use a medium-motion media sequence (the "Foreman" QCIF sequence) encoded at 15 frames/second, and the sender encapsulates 1 frame per IP packet (for simplicity, even though the H.264 codec [18] supports slicing of frames). Furthermore, in all scenarios the sender begins with an initial sending rate of 128 kbps and is not restricted by a maximum encoding rate.
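A minimal sketch of this queue model (the function name and tuple layout are our assumptions): packets that wait longer than the 200 ms time-to-live are dropped, and survivors arrive after the additional 240 ms static one-way delay.

```python
from collections import deque

def drain_queue(queue, now, ttl=0.200, one_way_delay=0.240):
    """queue: deque of (enqueue_time_s, packet). Returns (delivered, dropped),
    where delivered entries carry their arrival time at the receiver."""
    delivered, dropped = [], []
    while queue:
        enq_time, pkt = queue.popleft()
        if now - enq_time > ttl:
            dropped.append(pkt)  # exceeded the 200 ms time-to-live
        else:
            delivered.append((now + one_way_delay, pkt))
    return delivered, dropped
```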

We have chosen two types of scenarios to evaluate the rate adaptation schemes. The first is a highly dynamic 3G link based on the 3G traces [3]; the sender's uplink is a concatenated pattern based on the excellent, poor and elevator scenarios (60 s each), while the receiver uses the elevator RLC pattern file concatenated three times. The second is a more stable scenario with slowly changing links, with the link bitrate changing at 0, 20 and 40 seconds to 192, 96 and 128 kbps, respectively, at all links. The second scenario is chosen to test the stability of the algorithms. In the case of TMMBR, bandwidth updates are generated at the end of every 1 s interval in the dynamic scenario (by averaging the available RLC bytes in that interval), while in the more stable scenario they are generated every time the bandwidth changes.

V. PERFORMANCE EVALUATION

TFRC is implemented as defined in [8]. Not all scenarios and extensions mentioned in [6] were developed. However, all extensions suggested in [5] were implemented, along with the signaling mechanism for conveying the loss event rate, the timestamp of the last received packet and the current decoding rate at the receiver (TFRC-FB). TFRC-FB is sent along with each RR, every 500 ms.

For TMMBR, we introduce three cases: TMMBR-A, TMMBR-B and TMMBR-U (see Sec. III-B). In TMMBR-A and TMMBR-B the network assists the sender or receiver


Fig. 3. Plot of link rate, encoder rate and goodput (left column), and histogram of probability of per-instance %utilization (right column) with Average BW Utilization (ABU), for the dynamic 3G links: (a)-(b) TFRC, ABU = 33%; (c)-(d) TMMBR-U, ABU = 40%; (e)-(f) TMMBR-A, ABU = 60%; (g)-(h) TMMBR-B, ABU = 50%; (i)-(j) NADU-C, ABU = 55%.


Fig. 4. Plot of link rate, encoder rate and goodput in the stable, slowly changing bandwidth scenario: (a) TFRC, (b) TMMBR-A, (c) NADU-C.

Fig. 5. Histogram of probability of per-instance %utilization for the stable, slowly changing links: (a) TFRC, ABU = 40%; (b) TMMBR-A, ABU = 70%; (c) NADU-C, ABU = 60%.

or both. In TMMBR-U there is no network assistance, and the receiver notifies the sender with a recommendation for the sending rate based on losses or an increase in the inter-arrival time of packets. In all cases, the receiver signals the number of bytes discarded to the sender. Furthermore, due to link-induced losses, the sender implements some light congestion avoidance techniques based on increasing RTT and packet loss. We do not run simulations for TMMBR-B in the second scenario (slowly changing bandwidth) because the uplink and downlink traces in this scenario are exactly the same; therefore, the adaptation of TMMBR-B follows that of TMMBR-A.

C-NADU uses the algorithms described in Section III and the signaling defined in [12]. The NADU feedback is sent with every RTCP RR, even if the buffer is empty due to no new packets arriving or an underflow. However, the bytes-discarded extension [14] is only sent by the receiver when it actually discards packets due to late arrival. Feedback messages are sent every 500 ms, except when interval losses exceed 30%, after which the RRs are sent every 250 ms; the SR sending rate, however, is not affected and remains at 500 ms.

Figures 3 (left column) and 4 show the instantaneous variation of the encoder rate and decoder goodput relative to the link bandwidth, which is the minimum of the UL and DL bandwidth, for the rate adaptation schemes. Tables I and II present the average encoder rate, average goodput, average PSNR and the delta loss rate (DLR) for the two scenarios. The latter is defined as the additional loss rate caused by the operation of the rate adaptation algorithm. This delta loss

TABLE I
SCENARIO 1: 3G LINKS USING RAN TRACES (180 S SIMULATION)

            Avg. enc. rate   Avg. goodput   DLR    Avg. PSNR
            (kbps)           (kbps)         (%)    (dB)
TFRC        98.6             84.1           6.9    29.3
TMMBR-U     99.7             89.8           3.7    30.5
TMMBR-A     97.7             90.1           1.3    32.3
TMMBR-B     98.5             90.5           2.9    31.7
C-NADU      99.4             92             2.2    31.9

TABLE II
SCENARIO 2: 3G LINKS WITH STABLE AND SLOW BW CHANGES

            Avg. enc. rate   Avg. goodput   DLR    Avg. PSNR
            (kbps)           (kbps)         (%)    (dB)
TFRC        75.7             66.1           4.4    30.5
TMMBR-A     87.6             82.9           0      31.8
C-NADU      88.5             80.9           2.1    31.2

rate occurs whenever the uplink and downlink network buffers overflow, and it is therefore induced by congestion losses on top of the inherent losses caused by the wireless nature of the link. It has to be pointed out that in our simulations the air interface loss rate in normal conditions was 1.9% for TFRC, 1.8% for TMMBR-U, 1.9% for TMMBR-A, 2% for TMMBR-B and 1.8% for NADU-C in the dynamic 3G link scenarios. In Figures 3 (right column) and 5, we present the percentage of bandwidth utilization in terms of probability, i.e., %BW Utilization = goodput / actual link rate.

TMMBR-A, due to its knowledge about the network conditions at the UL and DL, provides the best adaptation


(1.3% and 0% delta loss rate, and 60% and 70% Average BW Utilization (ABU)²), while TFRC, basing its knowledge solely on normal RRs, suffers from the maximum packet loss (6.9% and 4.4%) and under-utilizes the link (33% and 40% ABU) in both scenarios. In the dynamic 3G scenario, TMMBR-B receives the upper-bound bandwidth information of the downlink, and is therefore able to provide better utilization (50%) of the link when compared to TFRC. However, due to probing (based on RTT and inter-arrival times of packets at the receiver), it causes a delta loss rate of 2.9%. C-NADU, on the other hand, without any assistance from the network, produces better results in terms of delta loss rate (2.2% and 2.1%) and ABU (55% and 60%) when compared to TFRC and unassisted TMMBR (TMMBR-U), which produces a 3.7% delta loss rate and only 40% ABU.

VI. CONCLUSION

Network-assisted rate adaptation provides the best adaptation, which can be useful in scenarios such as handovers and cell loading, where the operator has knowledge of an event before it takes place. In this case TMMBR-A (TMMBR with network-assisted adaptation) has shown the best performance. When no direct information about the uplink and downlink bit rates is available from the network, our new algorithm (C-NADU) has shown performance close to that of TMMBR-A and better than unassisted TMMBR (TMMBR-U). Moreover, by using cross-layer technologies it could be possible to get some of this information from within the device instead of signaling it explicitly. Results also show that TFRC adapted for real-time media is still not well suited for multimedia applications, as it under-utilizes the link. We believe that C-NADU can be extended to operate in the general Internet, because it does not get link updates like TMMBR and makes decisions based on perceived network conditions.

Extensions to the current work will involve adapting the algorithms to consider video slices, proactive RTCP scheduling to send feedback early, and considering scenarios with short intermediate queues. Furthermore, we plan to develop these rate adaptation mechanisms for the general Internet environment with cross-traffic.

REFERENCES

[1] 3GPP TS 26.114, "IP Multimedia Subsystem (IMS): Multimedia telephony; media handling and interaction." [Online]. Available: http://www.3gpp.org/ftp/specs/html-info/26114.htm

[2] ITU-T Rec. H.264, "Advanced video coding for generic audiovisual services."

[3] 3GPP S4-080771, "MTSI video dynamic rate adaptation: Evaluation framework ver 1.0," 3rd Generation Partnership Project (3GPP), Proposal S4-080771, Oct. 2008. [Online]. Available: http://www.3gpp.org/FTP/tsg sa/WG4 CODEC/TSGS4 51/Docs/S4-080771.zip

[4] S. Floyd, M. Handley, J. Padhye, and J. Widmer, "Equation-based congestion control for unicast applications," in SIGCOMM '00: Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. New York, NY, USA: ACM, 2000, pp. 43–56.

² This is a weighted average utilization; > 100% utilization is counted as 100% utilization.

[5] L. Gharai, "RTP with TCP Friendly Rate Control," work in progress, January 2008. [Online]. Available: http://tools.ietf.org/id/draft-ietf-avt-tfrc-profile-10.txt

[6] S. Floyd, M. Handley, J. Padhye, and J. Widmer, "TCP Friendly Rate Control (TFRC): Protocol Specification," Internet Engineering Task Force, RFC 5348, Sep. 2008. [Online]. Available: http://www.rfc-editor.org/rfc/rfc5348.txt

[7] S. Wenger, U. Chandra, M. Westerlund, and B. Burman, "Codec Control Messages in the RTP Audio-Visual Profile with Feedback (AVPF)," RFC 5104 (Proposed Standard), Feb. 2008. [Online]. Available: http://www.ietf.org/rfc/rfc5104.txt

[8] M. Handley, S. Floyd, J. Padhye, and J. Widmer, "TCP Friendly Rate Control (TFRC): Protocol Specification," Internet Engineering Task Force, RFC 3448, Jan. 2003. [Online]. Available: http://www.rfc-editor.org/rfc/rfc3448.txt

[9] J. Ott, S. Wenger, N. Sato, C. Burmeister, and J. Rey, "Extended RTP Profile for Real-time Transport Control Protocol (RTCP)-Based Feedback (RTP/AVPF)," RFC 4585 (Proposed Standard), Jul. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4585.txt

[10] I. Curcio and D. Leon, "Application rate adaptation for mobile streaming," in WOWMOM '05: Proceedings of the Sixth IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks, pp. 66–71, 13–16 June 2005.

[11] ——, "Evolution of 3GPP streaming for improving QoS over mobile networks," in ICIP 2005: IEEE International Conference on Image Processing, vol. 3, pp. III-692–5, 11–14 Sept. 2005.

[12] 3GPP TS 26.234, "Transparent end-to-end Packet-switched Streaming Service (PSS); Protocols and codecs." [Online]. Available: http://www.3gpp.org/ftp/Specs/html-info/26234.htm

[13] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications," RFC 3550 (Standard), Jul. 2003. [Online]. Available: http://www.ietf.org/rfc/rfc3550.txt

[14] J. Ott, I. Curcio, and V. Singh, "Real-time Transport Control Protocol Extension Report for Run Length Encoding of Discarded Packets," work in progress, June 2009. [Online]. Available: http://tools.ietf.org/id/draft-ott-avt-rtcp-xt-discard-metrics-00.txt

[15] H. Garudadri, H. Chung, N. Srinivasamurthy, and P. Sagetong, "Rate adaptation for video telephony in 3G networks," Packet Video 2007, pp. 342–348, Nov. 2007.

[16] I. Johansson and M. Westerlund, "Support for Reduced-Size RTCP: Opportunities and Consequences," work in progress, May 2009. [Online]. Available: http://tools.ietf.org/id/draft-ietf-avt-rtcp-non-compound-08.txt

[17] "Homepage of the Network Simulator (ns2) and the Network Animator (nam)," http://nsnam.isi.edu/nsnam.

[18] Nokia, "Nokia's public H.264 codec." [Online]. Available: http://www.nokia.com

[19] J. Devadoss, V. Singh, J. Ott, C. Liu, Y.-K. Wang, and I. Curcio, "Evaluation of error resilience mechanisms for 3G conversational video," in IEEE International Symposium on Multimedia, pp. 378–383, 15–17 December 2008.

[20] T. Friedman, R. Caceres, and A. Clark, "RTP Control Protocol Extended Reports (RTCP XR)," RFC 3611 (Proposed Standard), Nov. 2003. [Online]. Available: http://www.ietf.org/rfc/rfc3611.txt

[21] 3GPP S4-050560, "Software Simulator for MBMS Streaming over UTRAN and GERAN," 3rd Generation Partnership Project (3GPP), Proposal S4-050560, Sep. 2005. [Online]. Available: http://www.3gpp.org/FTP/tsg sa/WG4 CODEC/TSGS4 36/Docs/S4-050560.zip

[22] 3GPP, "Radio Link Control (RLC) protocol specification," 3rd Generation Partnership Project (3GPP), TS 25.322, Sep. 2008. [Online]. Available: http://www.3gpp.org/ftp/Specs/html-info/25322.htm


[P10] Igor D.D. Curcio, Vinod K.M. Vadakital, Miska M. Hannuksela, "Geo-Predictive Real Time Media Delivery in Mobile Environment", Proc. 3rd ACM International Workshop on Mobile Video Delivery (MoViD) (in conjunction with 18th ACM Multimedia Conference 2010), 25 Oct. 2010, Firenze, Italy.

© 2010 ACM, Inc. http://doi.acm.org/10.1145/1878022.1878036.


Geo-Predictive Real-Time Media Delivery in Mobile Environment

Igor D.D. Curcio

Nokia Research Center P.O. Box 1000

33721 Tampere, Finland

[email protected]

Vinod Kumar Malamal Vadakital Department of Information Technology

Tampere University of Technology Tampere, Finland

[email protected]

Miska M. Hannuksela Nokia Research Center

P.O. Box 1000 33721 Tampere, Finland

[email protected]

ABSTRACT
Multimedia streaming is one of the most popular services today. When the user is in a mobile scenario, the delivery of multimedia streaming services becomes more challenging. Mobile streaming suffers from discontinuous playback that sometimes impairs the user experience. Among other factors, this is due to the high network bandwidth variation that a user can experience along a path. In some cases, the available bandwidth is close to zero, for example when traversing tunnels or areas where the network capacity drops below what is required for a multimedia session to be pause-less. Typically, media adaptation and rate control are used to fight variable bandwidth. However, these are usually reactive algorithms, where an event is first detected (e.g., a drop in available bandwidth) and then an action is taken, either by the streaming client or by the server. This action may result only in a mitigation of the problem, not its complete removal. In this paper we introduce the novel concept of Geo-Predictive mobile streaming. This is a collaborative service that makes use of prediction rather than reaction. Network coverage maps are built with the aid of mobile users; with these maps, the available network bandwidth for each location is recorded in a server, and when a user travels from point A to point B, it is possible to predict well in advance the bandwidth that will be experienced along that route. Bandwidth drops can thus be known in advance, and media adaptation algorithms can be triggered so that a pause-less media playback experience can be guaranteed to the end users anywhere and at all times.

Categories and Subject Descriptors
C.2.2 [Network Protocols]: Applications (SMTP, FTP, etc.), Protocol Architecture (OSI model).

General Terms Algorithms, Performance, Experimentation, Standardization.

Keywords Mobile Streaming, Geo-Prediction, Media Adaptation.

1. INTRODUCTION
Multimedia applications and services are available to mobile users in several countries. Example applications range from simple multimedia messaging (MMS) to imaging and video capture, media downloading and playback, and real-time video sharing and streaming. Some of these mobile applications and services have been the focus of standardization committees in recent years (e.g., [1], [3]). Among the set of mobile multimedia applications, those that deal with real-time media transmission and consumption encounter the major challenges. For instance, multimedia streaming and bi-directional video telephony pose several difficulties due to the medium/low-delay nature of these applications. Mobile networks, on the one hand, have been carefully designed to support guaranteed bit rates for the mentioned real-time applications [4]; on the other hand, existing implementations and deployments of 2.5G or 3G networks have shown that there may always be areas where guaranteed bit rates for multimedia applications are not available. Therefore, best-effort traffic classes and bearers [4] are often the network channels over which real-time multimedia traffic is carried. Even if networks are well provisioned, the physical topology of the territory may prevent good radio coverage and good received throughput all the time. Examples include skyscrapers, tunnels and rural areas. For real-time multimedia applications, the three main critical dimensions for media quality are delay, error rate and throughput. Streaming applications are less delay sensitive, since they typically buffer some amount of data before starting playback, so that network jitter can be smoothed out. This paper does not deal with error rates. Packet losses can be repaired by means of packet retransmission or Forward Error Correction (FEC) techniques (e.g., [1], [3]), depending on the amount of available delay before playback.
Throughput and its variability in the time and space dimensions are the subject of this paper. In a mobile streaming application, whenever the received throughput is not sufficient (i.e., when the network throughput is not at least as high as the bit rate used for encoding the media and its transport), glitches in the continuous playback may occur. In a best-effort network, the received throughput can vary over time in the same location. In addition, when the user moves to different locations, the probability of having a constant received throughput for the whole duration of a streaming session decreases dramatically. In this paper we study the problem of fighting variable received throughput to guarantee a constant media quality and pause-less playback to mobile streaming users. The problem has been studied for years and some solutions have been proposed (see next

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MoViD’10, October 25, 2010, Firenze, Italy. Copyright 2010 ACM 978-1-4503-0165-7/10/10...$10.00.



sections). However, in some cases these solutions offer unsatisfactory performance because they handle short-term throughput outages and lack long-term visibility and prediction. An example is when a user travels through a tunnel, where the throughput outage is sudden and may last even minutes. This work describes a predictive rate adaptation solution for mobile multimedia streaming. By means of user context information, for instance location and motion data, together with past throughput information, it is possible to radically change the approach to network bit rate adaptation and streaming delivery. The adaptation becomes predictive (as opposed to the reactive algorithms proposed in past literature). The geo-predictive system proposed in this paper allows, for example, detecting that a tunnel is in the proximity of the user and that it will produce a very low (or zero) bit rate. Therefore, it allows taking the best action well in time to handle the situation. Section 2 introduces some related work in the area. Section 3 includes a review of existing techniques on mobile media adaptation. The essential notions of geo-predictive streaming are introduced in Section 4. The simulation environments and results are respectively shown in Sections 5 and 6, while Section 7 concludes the paper.

2. RELATED WORK
Service maps were introduced in [6]. In this work the authors took the approach of a mobile data management service that allows mobile users to obtain a detailed view of the available networks (e.g., WLAN) and the services they offer, depending on geographic position, mobility paths, etc. In [7], the authors developed an algorithm for context-aware rate adaptation for VANETs (Vehicular Ad-Hoc Networks) based on 802.11 cross-layer information exchange between the application layer and the MAC layer. The distance and speed of the other vehicles were used as the context information. The reported results are mainly for vehicle-to-vehicle and vehicle-to-infrastructure communication. The paper aims at predicting the Packet Error Rate as a function of a) the distance between vehicles, b) their speed and c) the transmission rate. Our paper does not require other cars to be within a certain range, since the solution offered here is in the form of a mobile service with a centralized server. The authors of [8] demonstrated that the bandwidth along a path is more predictable if the location information is taken into account, and that there is no significant correlation between the bandwidths at different points in time within a given trip. Furthermore, the authors found that the bandwidth uncertainty may reduce considerably whenever observations from past trips are taken into account. In [9], the same authors introduced the concept of geo-intelligence, which seeks to exploit a correlation between location and link behavior in a high-speed vehicular mobility scenario. Geo-profiles, i.e., a statistical profile for each location, can then be used to predict the network behavior as a function of the location. However, differently from our work, the focus of their work was on a traffic scheduler for scheduling downlink user traffic amongst multiple WWAN links in a multi-homed on-board network inside a vehicle.

3. MEDIA ADAPTATION TECHNIQUES
This section includes a short review and taxonomy of media adaptation techniques for real-time streaming. For the sake of clarity, in the following, real-time streaming means the transmission of a real-time media stream over an unreliable transport protocol, such as RTP. (Pseudo-)streaming or progressive-download media transmission over reliable protocols (e.g., TCP) is purposely out of the scope of this paper. In the most basic case, no adaptation is used during media streaming from a server to a client via a mobile network. In this case, whenever the network throughput becomes lower than the server transmission bit rate, packets accumulate in the network buffers for a certain time and are then lost. The user experiences this as discontinuous playback, poor media quality and sometimes also session interruption. This phenomenon happens if the media stream is encoded at a Constant Bit Rate (CBR), but also if the media is encoded at a Variable Bit Rate (VBR). The problem remains, i.e., how to fit a continuous media stream of a certain average bit rate into a network pipe of variable bit rate without causing disruptions in the streaming client playback. If we look at the server output media stream, there are two variables that can influence the media flow that exits the server: the media encoding bit rate and the transmission bit rate. The former is defined as the rate at which the media (e.g., audio or video) is encoded and played back. The latter is defined as the rate at which the media is transmitted from the server. The two do not necessarily have to match. Given a fixed encoding bit rate stream, a server may decide to transmit (parts of) that stream slower or faster, depending on the situation.

For example, in periods of network outage (e.g., a handover) the available bandwidth can be close to zero, and the server may intelligently decide to slow down or stop the transmission, in order to avoid unnecessary packet losses (due to network buffer overflow) that would surely require packet retransmission in order to be repaired; conversely, the server may decide to stream media faster when more bandwidth is available, in order to avoid a client buffer underflow and quickly bring the buffer to a healthy level. The encoding bit rate can also be the subject of rate-adaptive techniques. For example, the already compressed media stream rate can be further reduced using bit rate thinning techniques: dropping the B pictures of a video stream results in a lower amount of bits to send over the network pipe, with little or imperceptible quality degradation. Multi-rate video encoding and bit-stream switching [1] are also commonly used techniques for reducing the amount of media bits to send over the air. In the case of live media encoding, it is of course also possible to change the encoding parameters and use a different (e.g., lower) encoding bit rate on-the-fly. Therefore, given that both the encoding and transmission bit rates are variable and can be manipulated at the sender side, we can say that the server output media bit rate is a function of both the media encoding rate and the transmission rate. Once the capability of changing the server output bit rate is established, there are still two issues to handle in order to define a complete media adaptation scheme: 1) the entity that drives the media adaptation mechanism and the related signaling mechanism between server and client; 2) the time instant when the media adaptation should take place. Depending on the entity that drives the decisions and actions, the schemes for rate adaptation can be classified into:

• Server-driven, if the decisions and actions about when and

how to operate the rate adaptation, as well as the client buffer control, are under the streaming server's control. In this case,



the client's task is that of periodically reporting some useful information to the server. Examples of server-driven rate adaptation signaling mechanisms are available in [1] and [10], and performance evaluations are available in [5] and [16].

• Client-driven, if the above decisions are under the streaming client's control. An example of client-driven rate adaptation is available in [11].

• Co-operative, if there is a clear responsibility split between the server and client decisions and their actions [12]. An example of a co-operative rate adaptation technique is available in [13].

Discussions on the advantages and disadvantages of the different approaches are available in [12]. Now, since a streaming system is essentially a real-time system, the time instant when the rate adaptation is performed is critical for the whole application user experience. Therefore, a classification of rate adaptation schemes is also possible depending on when the rate adaptation is performed:

• Reactive schemes: here the rate adaptation action is triggered

upon the occurrence of an event (e.g., a handover or a sudden drop of network bandwidth). The streaming client sends some signals to the streaming server, either on a periodic basis or on an event-occurrence basis. The server analyzes the information and then reacts appropriately.

• Proactive schemes: here the rate adaptation action is of the reactive type (i.e., it happens after an event has occurred). However, the server can have a proactive role in the rate adaptation action (e.g., looking in advance at the future of the media bit stream to be transmitted, in order to understand its characteristics and better adapt to outages in bandwidth availability).

• Predictive schemes: in these cases the client and/or the server have mechanisms for looking into the future of the media bit rate to be transmitted, and also into the future of the network bandwidth characteristics.

The first two types must use an estimation window that may impact performance. This window must be kept small in order to guarantee a reaction in the shortest possible time. Reactive and proactive schemes both rely on past information reported by the streaming client in order to estimate the current network situation. The advantage of predictive rate adaptation schemes is that they allow more time for a server and client to perform rate adaptation. This yields enormous benefits: for a mobile user traveling through a tunnel (where there is usually no network coverage), it may mean virtually continuous playback all the time and everywhere. This is the subject of the next sections.

4. GEO-PREDICTIVE STREAMING
The received quality of a media stream can be improved by using Geo-Predictive media adaptation. This novel way of doing media streaming makes use of geographical information in order to predict the future network state and take the most appropriate action to guarantee pause-less media playback at the client. Figure 1 shows the basic system architecture for Geo-Predictive streaming.

Figure 1: Architecture of a Geo-Predictive Streaming system.

4.1 Basic operations
The main operations of the streaming server, the geo-predictive server, and the client are reported in the following. Note that the geo-predictive server and the streaming server may be co-located in the same physical server; however, for simplicity, we consider them here as separate logical and physical entities.

4.1.1 Streaming server
1. The streaming server sets up the initial session.
2. It performs the ordinary media streaming functions.
3. It receives bit rate adaptation requests and periodic reports from the streaming client, as in [1].
4. It is responsible for the media adaptation and delivery of the (adapted) media to the streaming client.

4.1.2 Streaming client
1. The client is responsible for the ordinary media reception and playback operations.
2. In addition, during mobility, it sends its route, speed, location information, and experienced received throughput to the Geo-Predictive server. The client may also perform network probing [14] to obtain a better estimate of the available network bandwidth at that location.
3. It receives Network Coverage Data from the Geo-Predictive server.
4. Based on the speed and the Network Coverage Data, the client calculates whether its buffer is going to suffer an underflow (e.g., because a tunnel is in proximity).
5. It calculates the new buffering parameters to be reported to the streaming server.
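Step 4 of the client logic can be pictured with a small sketch. All names, units, and the buffer model below are our illustration, not from the paper: the Network Coverage Data is reduced to a list of route segments the client is about to traverse, each with a predicted throughput, and the playback buffer (in seconds of media) is simulated across them.

```python
def segments_from_coverage(coverage_m, speed_mps):
    """Convert (segment_length_m, throughput_kbps) coverage samples into
    (duration_s, throughput_kbps) segments, assuming constant speed."""
    return [(length / speed_mps, tput) for length, tput in coverage_m]


def predict_underflow(buffer_s, media_rate_kbps, segments, max_buffer_s=10.0):
    """Return True if the playback buffer is predicted to underflow.

    buffer_s: current buffer level in seconds of media.
    segments: list of (duration_s, throughput_kbps) ahead of the client.
    The buffer fills at throughput/media_rate buffered-seconds per second
    and drains at 1, capped at max_buffer_s (a simplifying assumption).
    """
    level = buffer_s
    for duration_s, throughput_kbps in segments:
        net = throughput_kbps / media_rate_kbps - 1.0
        level = min(level + net * duration_s, max_buffer_s)
        if level <= 0:
            return True  # underflow predicted inside this segment
    return False
```

For example, with a 136 kbps stream, 6 s buffered, and a 100 m tunnel ahead at 10 m/s, the default 10 s buffer cap leads to a predicted underflow, while a temporarily enlarged buffer avoids it; this is exactly the situation that triggers the buffer parameter update of step 5.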

4.1.3 Geo-Predictive Server
1. This server is responsible for keeping a Network Coverage Database of locations and other data associated with each location (e.g., network throughput, measurement time, speed of the vehicle). Each location along the user route is sampled at regular intervals δ, forming a finite well-ordered set L = {l1, l2, …, ln}. All locations between (li − δ/2) and (li + δ/2) are called the near-neighborhood of li; δ is the location sampling granularity of the geo-predictive system.
2. Upon reception of route, speed, and experienced throughput information from the client, the server updates the Network Coverage Database (e.g., replacing old data).
3. The server also performs a look-up operation to extract enough prediction data (based on route and speed) to send to the client. The amount of data to send depends on the granularity of location sampling and the speed of the user.
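The look-up of step 3 can be sketched as follows, modelling locations one-dimensionally as distances from the route origin. The function names, the dictionary-based database, and the prediction horizon are our assumptions for illustration; the paper only states that the amount of data scales with sampling granularity and user speed.

```python
def nearest_sample(position_m, delta_m):
    """Map a position to the index i of its sampled location l_i, i.e. the
    sample whose near-neighborhood (l_i - delta/2, l_i + delta/2) contains it."""
    return round(position_m / delta_m)


def prediction_window(position_m, speed_mps, horizon_s, delta_m, database):
    """Return the coverage entries the client will need within horizon_s.

    The number of samples grows with speed and shrinks with coarser
    sampling granularity delta_m, as described in the text.
    """
    start = nearest_sample(position_m, delta_m)
    n_samples = int(speed_mps * horizon_s / delta_m) + 1
    return [database.get(i) for i in range(start, start + n_samples)]
```

A faster user or a finer granularity δ makes `n_samples` larger, which matches the statement that the amount of prediction data depends on both.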


4.2 Realization in the 3GPP PSS
To make this idea work in a 3GPP PSS system [1], three parameters play an important role in the client-to-server communication: the free buffer space, the total buffer size, and the target buffer level. For example, if a tunnel is a few seconds ahead and no network coverage is expected, the client may temporarily expand its buffer by an amount sufficient to overcome the network outage and prevent a buffer underflow and a disruption of the user experience. The new buffer size also implies a new free buffer space and a new desired target buffer level. This data is communicated to the streaming server well in advance, and the server will then try to keep the client buffer in a healthy state by pushing more data in order to reach the target buffer level as quickly as possible.
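The temporary buffer expansion can be sketched as simple arithmetic on the three parameters above. The function name, the byte units, and the safety margin are our assumptions; the paper does not give a formula.

```python
def expanded_buffer_params(total_size_bytes, occupied_bytes,
                           outage_s, media_rate_kbps, margin=1.2):
    """Return (new_total, new_free, new_target) buffer parameters in bytes.

    The buffer is grown by enough media to play through an outage of
    outage_s seconds at media_rate_kbps, scaled by a safety margin
    (an assumption of this sketch, e.g. to cover signalling delays).
    """
    extra = int(outage_s * media_rate_kbps * 1000 / 8 * margin)
    new_total = total_size_bytes + extra
    new_free = new_total - occupied_bytes
    # Target level: hold at least an outage worth of media before it starts.
    new_target = occupied_bytes + extra
    return new_total, new_free, new_target
```

For a 10 s outage at 136 kbps and no margin, the buffer grows by 170,000 bytes; the new free space and target level follow directly, and these are the values reported to the server ahead of the outage.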

5. SIMULATION ENVIRONMENT
Simulations were performed to evaluate the efficiency of geo-predictive rate adaptation compared to PSS rate adaptation and normal transmission. The available radio network throughput is computed from the patterns provided by 3GPP [15] for an LTE HSPA channel. Of the five sets of patterns provided, only the 'Fair' and 'Bad' downlink channel conditions were used. The throughput patterns provided were one minute in duration. The two throughput patterns were concatenated ten times to obtain a route that could be traversed in ten minutes assuming a user speed of 10 km/h. Whether the throughput was in the fair or the bad state was determined by a two-state Markov model. The model, along with its state transition matrix, is shown in Figure 2. In the figure, state g denotes the fair throughput condition and state b denotes the bad throughput condition. The state transition matrix was chosen arbitrarily.

Figure 2: Two-state Markov model (states g and b) used to simulate the routes {R1, R2, R3}.
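A two-state channel model like the one in Figure 2 can be simulated as below. The specific transition probabilities are placeholders of ours (the paper states only that its matrix was chosen arbitrarily), as are the function and parameter names.

```python
import random


def simulate_states(n_steps, p_stay_good=0.9, p_stay_bad=0.95, seed=42):
    """Generate a sequence of 'g'/'b' channel states from a two-state
    Markov chain. p_stay_good plays the role of p(gg) and p_stay_bad of
    p(bb); the off-diagonal probabilities follow as 1 - p(gg) and 1 - p(bb).
    All values here are placeholders, not the paper's matrix.
    """
    rng = random.Random(seed)  # fixed seed for reproducible routes
    state, out = "g", []
    for _ in range(n_steps):
        out.append(state)
        if state == "g":
            state = "g" if rng.random() < p_stay_good else "b"
        else:
            state = "b" if rng.random() < p_stay_bad else "g"
    return out
```

Each generated state then selects which one-minute throughput pattern ('Fair' or 'Bad') is appended to the route, which is how routes with different throughput conditions such as {R1, R2, R3} can be produced.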

Three such routes, {R1, R2, R3}, each with different throughput conditions, were generated. The throughput patterns indicate the total number of bits that could be transmitted over the physical channel (for multiple users), every millisecond, along the generated routes. To obtain the throughput condition for a single media flow and a single user, a downlink scheduling pattern [2] was applied; it provided the times at which the media flow could use the physical channel. Figure 3 shows the network available bandwidth (throughput condition) for route R1, generated by concatenating the fair and bad throughput patterns and applying the scheduling pattern. The average throughput rate, rg, and the average throughput rate during the bad period, rb, were calculated from the respective patterns; they were 217 kbps (for all routes) and 88 kbps (80 kbps for routes R2 and R3), respectively. A ten-minute QVGA video sequence, with a frame rate of 12.5 frames per second, was encoded using an H.264/AVC encoder. Two coded bit streams of approximately (rg - 80) and rb kbps were generated (indicated as horizontal lines in Figure 3, Stream 0 and Stream 1 respectively). The Intra frame insertion interval was set to once every four seconds.

Figure 3: Network available bandwidth, and the minimum rates (rg - 80) and rb kbps, for the route R1.

6. RESULTS AND DISCUSSION
For space reasons, the following results are limited to route R1. The first algorithm simulated was the normal streaming scenario with no rate adaptation (NOR), where the stream was encoded at the (rg - 80) rate. Each IP packet encapsulated one video frame; every packet had an associated size and a sampling time. Each packet was assumed to experience a constant delay in the core network. A network buffer was located at the edge of the core network, and a packet entered it with no delay after its sampling time. The network buffer was assumed to be 5rg bits in size; therefore, a packet had five seconds after its arrival time to be transmitted from the network buffer, after which it was discarded. If a packet was not discarded, it was transmitted according to the available bandwidth.
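The edge-buffer behaviour described above can be sketched as a FIFO queue with a per-millisecond bandwidth budget and a deadline-based discard. The model below is our simplification (names and units are ours): the five-second limit is expressed directly as a deadline rather than as the 5rg-bit buffer size.

```python
def simulate_edge_buffer(packets, bandwidth_bits_per_ms, deadline_ms=5000):
    """packets: list of (arrival_ms, size_bits), sorted by arrival time.
    bandwidth_bits_per_ms: available bits in each successive millisecond.
    Returns (delivered, discarded) packet counts.
    """
    delivered = discarded = 0
    queue = []  # FIFO of [arrival_ms, bits_left]
    next_pkt = 0
    for t, bw in enumerate(bandwidth_bits_per_ms):
        # Admit packets that have arrived by time t.
        while next_pkt < len(packets) and packets[next_pkt][0] <= t:
            queue.append(list(packets[next_pkt]))
            next_pkt += 1
        # Discard head-of-line packets that exceeded their deadline.
        while queue and t - queue[0][0] > deadline_ms:
            queue.pop(0)
            discarded += 1
        # Spend this millisecond's bandwidth on the FIFO queue; a packet
        # not fully sent keeps its remainder for the next time slot.
        while bw > 0 and queue:
            sent = min(bw, queue[0][1])
            queue[0][1] -= sent
            bw -= sent
            if queue[0][1] == 0:
                queue.pop(0)
                delivered += 1
    return delivered, discarded
```

During the bad-bandwidth period the budget per millisecond drops, the queue backs up, deadlines expire, and packets are discarded, which is the packet-loss mechanism NOR suffers from in Figures 4 and 5.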


Figure 4: Available network and client received rates (NOR).


Figure 5: Client buffer fullness for NOR.


If a packet could not be completely transmitted in its scheduled time slot, the remaining packet data was held in the network buffer and sent at the next available time slot; this process was iterated until the whole packet was transmitted. Figure 4 shows the available network bandwidth, the streaming server transmission bit rate, and the average media encoding rate. As can be seen, the transmission rate closely follows the average media bit rate (horizontal line). However, during the period of bad available bandwidth, the sender does nothing to compensate for the bad period (which is about 25 seconds long); it continues to transmit at the same bit rate, producing congestion, packet losses in the network, and bad media quality at the client. The client buffer fullness for this case is plotted in Figure 5: there is a buffer underflow when the available bandwidth stays below the average media bit rate for a long period of time.

The second algorithm used the 3GPP PSS rate-adaptation transmission (RAT). It was assumed that the streaming server becomes aware of the client reception rate only after a 300-millisecond delay; the client calculated the reception rate as an average over a one-second period. As mentioned earlier, two coded video streams of the same content were available at the server side: the first encoded at a rate of (rg - 80) kbps for the entire ten-minute duration, and the second encoded at rb kbps for the same duration. When the available bandwidth, as computed by the client and reported to the server, went 25% above or below the currently transmitted media rate, a stream switch was performed at the server. The network buffer size and the playout buffer size were unchanged from the NOR case.
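The 25% switching rule can be sketched as below. This is our reading of the rule for illustration; the function name, the tie-breaking when no stream fits, and the threshold parameterization are assumptions.

```python
def choose_stream(current_kbps, reported_bw_kbps, streams_kbps,
                  threshold=0.25):
    """Return the encoding rate (kbps) to transmit next.

    A switch happens only when the client-reported bandwidth deviates by
    more than `threshold` (25%) from the currently transmitted media rate;
    otherwise the current stream is kept.
    """
    if abs(reported_bw_kbps - current_kbps) <= threshold * current_kbps:
        return current_kbps
    # Pick the highest available stream not exceeding the reported
    # bandwidth, falling back to the lowest stream if none fits.
    fitting = [r for r in streams_kbps if r <= reported_bw_kbps]
    return max(fitting) if fitting else min(streams_kbps)
```

With two streams at 88 and 137 kbps, a reported bandwidth of 130 kbps keeps the 137 kbps stream (within 25%), while a drop to 60 kbps forces the down-switch to 88 kbps seen in Figure 7.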

Figure 6: Available network and client received rates (RAT).


Figure 7: Switching of streams. Stream 0 is the bit stream coded at (rg - 80) kbps and Stream 1 is the bit stream coded at rb kbps.

Figure 6 shows the available network bandwidth and the transmission rate using RAT. The stream switching from (rg - 80) to rb, shown in Figure 7, can be clearly seen when the available bandwidth drops drastically. From this figure it is intuitive to conclude that the global average media quality is lower, because the down-switches to the lower bit rate stream yield a lower media quality. The buffer fullness at the client also varies because of stream switching: when there is a switch to a lower quality bit stream, the buffer fullness may be lower for some transient time, but there are no buffer underflows. The client buffer fullness when RAT is used is shown in Figure 8.


Figure 8: Client buffer fullness for RAT.

The third algorithm simulated was the geo-predictive rate adaptation transmission (GPT) described in Section 4. The available bandwidth was known a priori by the GPT algorithm. With this knowledge, the client was completely aware of the time period [ts, te] and the duration, Δt = (te - ts), of the next bad available bandwidth condition. At the time instant ts - Δt - ε, the client sent an update of the buffer parameters to the streaming server, as described in Sections 4.1.2 and 4.2; ε accounts for additional delays (e.g., the one-way delay between client and server). The client request was then translated by the server into a transmission rate increase, so that a healthy buffer level could be maintained at the streaming client, with no overflows, underflows, or playback disruptions. After the time instant te, the client again signaled new buffering parameters (the original values from before the adaptation started) to the server, which then continued to send data at the original transmission rate. This procedure was repeated for every bad available bandwidth period in the session.
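The GPT signalling schedule reduces to simple arithmetic on the known outage periods. The helper below is a sketch with names of our choosing: for each outage [ts, te] it returns when the expanded buffer parameters are sent (ts - Δt - ε) and when the original parameters are restored (te).

```python
def gpt_signal_times(outages, epsilon_s=1.0):
    """outages: list of (ts, te) bad-bandwidth periods, known a priori.

    Returns (signal_time, restore_time) pairs: the client sends expanded
    buffer parameters dt + epsilon seconds before each outage of duration
    dt = te - ts, and restores the original parameters at te.
    """
    schedule = []
    for ts, te in outages:
        dt = te - ts
        schedule.append((ts - dt - epsilon_s, te))
    return schedule
```

For an outage from t = 100 s to t = 125 s and ε = 2 s, the buffer update goes out at t = 73 s, giving the server Δt + ε seconds to fill the enlarged buffer before the bandwidth drops.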


Figure 9: Available network and client received rates (GPT).

Figure 9 plots the available bandwidth along with the media transmission rate. As can be seen in the figure, the media transmission rate is increased before the predicted outage in available bandwidth, and during the outage the transmission bit rate is dropped. The increase in transmission bit rate can be recognized


by the steep buffer increase, shown in Figure 10, just before the outage in available bandwidth. During the low available bandwidth period, when the transmission bit rate is dropped, the data in the buffer is consumed until it reaches approximately the same level as before the outage occurred.


Figure 10: Client buffer fullness for GPT.

Table 1 summarizes the results for the different rate adaptation schemes described in this paper. The table shows the number of rebufferings, the cumulative rebuffering time, the average transmitted media bit rate, and the packet loss rate. NOR is the basic comparison scenario, and it does not provide a good user experience because of the rebufferings; in the worst case the streaming client suffers a 20.2 s rebuffering, i.e., a media playback disruption while fetching data after a network bandwidth outage. In addition, NOR suffers from packet losses that produce bad media quality at the streaming client. RAT and GPT both avoid rebufferings and packet losses. However, GPT always offers better media quality than RAT, because a single stream is used in the transmission without the need to switch to a lower quality stream. With GPT the media quality remains constant and the highest possible, and no playback disruptions occur.

Table 1: Performance results for NOR, RAT and GPT

7. CONCLUSIONS
Geo-Predictive streaming makes use of prediction to foresee network coverage outage periods and to change the streaming server transmission parameters so as to improve performance. This technique has been shown to be effective in guaranteeing pause-less playback and the highest media quality to the end user. The Geo-Predictive functionality could be even more tightly integrated into the main streaming server functions as part of future mobile streaming standards, and it could also be used with the existing 3GPP rate adaptation scheme. Further work may be directed towards comparing geo-predictive streaming against TCP-based progressive download or pseudo-streaming, in order to quantify the pros and cons of both streaming technologies.

8. REFERENCES
[1] 3GPP TS 26.234, Transparent end-to-end Packet-switched Streaming Service (PSS); Protocols and codecs, v. 9.3.0, 06-2010.

[2] Qualcomm, Ericsson, Nokia, MTSI Video Dynamic Rate Adaptation: Evaluation Framework, ver. 0.91, 3GPP TSG-SA4#51, 3-7 Nov. '08, Shenzhen, China, Tdoc S4-080761.

[3] 3GPP TS 26.346, Multimedia Broadcast/Multicast Service (MBMS); Protocols and codecs, v. 9.3.0, 06-2010.

[4] 3GPP TS 23.107, Quality of Service (QoS) concept and architecture, v. 9.1.0, 06-2010.

[5] Curcio I.D.D. and Leon D. 2005. Evolution of 3GPP Streaming for Improving QoS over Mobile Networks. In Proc. of IEEE Int. Conf. on Image Processing (ICIP) (Genova, Italy, 11-14 Sep. ‘05, Vol. III, pp. 692-695).

[6] Kutscher D., and Ott J. 2006. Service Maps for Heterogeneous Network Environments. Proc. IEEE Conf. on Mobile Data Management (MDM) (Nara, JP, 10-12 May).

[7] Shankar P., Nadeem T., Rosca J., and Iftode L. 2008. CARS: Context-Aware Rate Selection for Vehicular Networks, In Proc. of IEEE Int. Conf. on Network Protocols (ICNP) (Orlando, FL, U.S.A., 19-22 Oct. ‘08).

[8] Yao J., Kanhere S.S, and Hassan M. 2008. An Empirical Study of Bandwidth Predictability in Mobile Computing. In Proc. of ACM Int. Workshop on Wireless Network Testbeds, Experimental Evaluation and Characterization (WiNTECH) (San Francisco, CA, U.S.A., 19 Sep. ‘08, pp. 11-18).

[9] Yao J., Kanhere S.S. and Hassan M. 2009. Geo-intelligent Traffic Scheduling for Multi-Homed On-Board Networks. In Proc. of 4th ACM Int. Workshop on Mobility in the Evolving Internet Architecture (MobiArch) (Krakow, PL, 22 Jun. ‘09).

[10] Codec Control Messages in the RTP Audio-Visual Profile with Feedback (AVPF), IETF RFC 5104, Feb. ‘08.

[11] Gentric P., RTSP Stream Switching, IETF I-D, draft-gentric-mmusic-stream-switching-01.txt, Jan. ‘04, Expired.

[12] Nokia, Some issues on rate adaptation, 3GPP TSG-SA4#26 meeting, 5-9 May ‘03, Paris, France, Tdoc S4-030348.

[13] Nokia, New client to server signalling for co-operative rate adaptation, 3GPP TSG-SA4#25bis meeting, 24-28 Feb. '03, Berlin, Germany, Tdoc S4-030126.

[14] Dovrolis C., Ramanathan P. and Moore D. 2004. Packet-Dispersion Techniques and a Capacity-Estimation Methodology. IEEE/ACM Trans. on Networking, Vol. 12, No. 6, Dec. ‘04, pp. 963-977.

[15] 3GPP, LS Response to Request for Evaluation Framework Link Level Data, TSG-SA4#49 meeting, 30 June-3 July '08, Philadelphia, PA, U.S.A., Tdoc S4-080322.

[16] Curcio I.D.D. and Leon D. 2005. Application Rate Adaptation for Mobile Streaming. In Proc. of IEEE Int. Sym. on a World of Wireless, Mobile and Multimedia Networks (WoWMoM) (Taormina/Giardini Naxos, Italy, 13-16 Jun. ‘05, pp. 66-71).

              Number of      Cumulative Length     Average Media      Packet Loss
              Rebufferings   of Rebufferings (s)   Bit Rate (kbps)    Rate (%)
              NOR RAT GPT    NOR   RAT  GPT        NOR  RAT  GPT      NOR  RAT  GPT
Route 1       1   0   0      14.0  0    0          136  128  136      2.3  0    0
Route 2       1   0   0      19.5  0    0          136  130  136      3.3  0    0
Route 3       1   0   0      20.2  0    0          136  130  136      3.5  0    0



[P11] Sujeet Mate, Igor D.D. Curcio, “Mobile and Interactive Social Television”, IEEE Communications Magazine, Vol. 47, No. 12, Dec. 2009, pp. 116-122.

© 2009 IEEE. Reprinted with permission.


IEEE Communications Magazine • December 2009, p. 116. 0163-6804/09/$25.00 © 2009 IEEE

INTRODUCTION

Thanks to developments in processing power for mobile devices and improvements in wireless bit-pipe size over the last decades, services that were initially designed for a static environment (e.g., a home) can now be implemented in mobile devices (e.g., phones). At the same time, services with traditionally passive-consumption-oriented paradigms (e.g., television) are moving toward participative and interactive services (e.g., interactive television). Interactive mobile services are therefore now possible.

Television has been around for many decades. Since its introduction, TV viewing has had a social dimension associated with it [1]. People invite friends or family members to watch some interesting program or movie together. The main reason is to make the watching experience more enjoyable and social, making it not just an individual action but a social experience. When watching content together, the content itself very often represents the subject of common interest and becomes the "medium for social interaction" between people [2]. Wai-Tian et al. in [3] have combined community streaming with interactive visual overlays. In the typical case, people wishing to watch some content together need to gather at a mutually convenient time and place. This is neither always practical nor desirable. Therefore, a mobile-based system, which allows users to interact with each other and watch content at the same time, provides a means of having a social viewing experience with people of interest even if they are in different cities or countries. Such a participative and interactive TV/video watching paradigm on mobiles, which allows geographically dispersed people to meet in a virtual shared space (VSS) to watch TV while being able to interact with each other, is Mobile and Interactive Social TV (MIST) [4–6] (Fig. 1).

The concept of Mobile and Interactive Social TV is relatively new compared to social TV viewing in a static context (e.g., in the living room). The mobile context imposes additional technical challenges and user requirements compared to the static context. The MIST system described in this article offers rich interaction possibilities to participants watching content together, to facilitate a stronger feeling of virtual presence (Fig. 1). Rich interaction between participants, coupled with synchronized content playback with minimal time difference among them, ensures that the participants have a common shared context of the viewing experience. This shared context is the key to creating a feeling of watching together.

In this article we focus on two aspects. First, we discuss the requirements for a MIST type of system, and then discuss the MIST system architecture. We then describe the features of the MIST proof-of-concept system developed by us. Second, we describe the motivation and results of a consumer experience study of the MIST concept.

ABSTRACT

Services that were traditionally designed for a static environment can now be implemented in mobile devices. At the same time, services with traditionally passive-consumption-oriented paradigms are moving toward participative and interactive services. One such service is Mobile and Interactive Social TV (MIST), which allows geographically dispersed people to meet in a virtual shared space and watch TV while being able to interact with each other. This service allows users to create an experience of watching together by providing its participants a common shared context. We present two novel architectures of a MIST system. In both of the architectures, the interaction is represented by rich audio-visual media, allowing users to hear and see each other. In the first architecture, the mixing of the TV content with the interaction media is performed at the server side. In the second architecture, the mixing is performed in each client device. There are many questions that arise from the consumer perspective regarding a radical change in experience when compared to traditional laid-back TV watching. Mobile and Interactive Social TV is relatively new when compared to the concept of traditional TV watching in a static context. To develop understanding of the consumer experience with the MIST concept, a focus group study approach was conducted. The study revealed that the feeling of social presence of people of interest when watching content with them was considered to add value to the viewing experience. The key system requirement is the ability for selective enabling/disabling of individual interaction features as per the user preferences and context. The context was considered to be influenced by both the relation with other participants and the content being consumed.


Sujeet Mate and Igor D. D. Curcio, Nokia Research Center

Mobile and Interactive Social Television



MIST SYSTEM REQUIREMENTS AND ARCHITECTURE

As described above, a MIST system provides means for interaction between participating users to facilitate virtual presence between people. The watching together experience is thus very closely linked to the interaction possibilities between the users. The closer the feeling of interaction to face-to-face interaction, the better the feeling of virtual presence between geographically distributed users. The challenge on the flip side is to minimize the distraction that may be caused by the interaction modalities to the viewing experience, compared to watching TV/video with friends or family in a living room. Different interaction modalities can be employed to create a feeling of virtual presence.

Modern mobile phones increasingly have more processing power, memory, and network bandwidth availability to support rich multimedia applications. Mobiles also have better display resolution to make TV/video viewing more enjoyable. As interaction modalities, we have considered text chat, emoticons, audio conferencing, and videoconferencing, since these provide a higher feeling of social presence to the participants. The function of the interaction modalities is to enable the participants to do what they would usually do when watching together and being physically collocated. This would include, but not be limited to, commenting on the events shown in the content, looking at each other and making gestures, and talking to each other. Videoconferencing functionality between participants can be used to facilitate the above described interactions. In our MIST architecture, we have coupled videoconferencing functionality, text chat, and synchronized content playback.

We considered two architectures for the MIST system. The first one is a centralized mixing architecture, and the second one is an endpoint mixing architecture. They are described in the next sections.

CENTRALIZED MIXING ARCHITECTURE

The centralized mixing architecture is a thin client approach that attempts to minimize the processing and other resource requirements at the client side. This architecture has three main entities: the content provider, the interaction server, and the mobile clients (Fig. 2). The content provider sends the content to be watched to the interaction server. Each mobile client captures the interaction media (consisting of participant audio, video, and text) and sends it to the interaction server. The interaction server combines the content media and interaction media received from the participants. The combined media stream is then transmitted to all the mobile clients that are watching together. The centralized mixing architecture resembles a star network topology, with the interaction server at the center of the star combining interaction media from the mobile clients and the content from the content provider. The mobile client has been implemented on Nokia N95 devices. The interaction server, along with the content provider, is implemented on Linux. It consists of an HTTP server for session control and a media mixing engine for generating the composite media stream sent to each participant. An instance of a media mixing engine is created for each group of users watching content together. The communication between the HTTP server and the media mixing engine is achieved using interprocess communication (IPC). The content provider, for the purpose of testing, is collocated on the interaction server and stores the TV/video content locally.

Figure 1: Mobile and Interactive Social Television concept overview (participants in different cities talk, chat, and see each other while watching).

To ensure a high synchronization level between the different participants and low latency for a highly interactive application like MIST, the Real-Time Transport Protocol (RTP) over the User Datagram Protocol (UDP) is used to transport multimedia data. The session negotiation and setup for the multimedia sessions between the mobile clients and the interaction server is done using the Session Initiation Protocol (SIP)/Session Description Protocol (SDP). HTTP is used for the control channel between the mobile client and the interaction server. The control channel is used to establish the MIST TV/video viewing session among the participants. The HTTP control channel is used for functions such as choosing the content, playback controls (PLAY, PAUSE, STOP), switching to videoconference-only mode, and zooming in on an individual user, among other things.
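The control-channel functions listed above can be pictured as simple HTTP requests from the mobile client to the interaction server. The endpoint path, command names, and the helper below are our illustration only; the article does not specify the wire format of the control channel.

```python
def control_request(session_id, command, **params):
    """Build the URL (path plus query string) for one hypothetical
    control-channel command of a MIST viewing session."""
    # Command set mirroring the functions named in the text; the string
    # identifiers themselves are our invention.
    allowed = {"choose_content", "play", "pause", "stop",
               "video_only", "zoom_user"}
    if command not in allowed:
        raise ValueError(f"unknown control command: {command}")
    query = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    path = f"/mist/session/{session_id}/{command}"
    return f"{path}?{query}" if query else path
```

For instance, a client zooming in on one participant might issue a request such as `control_request("42", "zoom_user", user="anna")`; the server would apply the command to the shared session state of all participants.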

ENDPOINT MIXING ARCHITECTURE
The endpoint mixing architecture is similar to the centralized mixing approach, except that the interaction media and watched content are combined at the mobile client (as opposed to the interaction server in centralized mixing) (Fig. 3). Therefore, in this case the content provider can send the content to be watched directly to the mobile clients. The flow of interaction media between the mobile clients and the interaction server remains unchanged from the centralized mixing scenario. Combining interaction media and content media at the mobile client entails receiving two media streams, one from the interaction server and the other from the content provider. The primary challenge for a mobile device in this case is to process two incoming video streams and combine them with low delay on a resource-constrained device.

COMPARISON
In the centralized mixing architecture, the mobile client needs to receive and process only one composite media stream. For H.264 QVGA video at 15 frames/s, this results in approximately 160 kb/s of media data processing. In the endpoint mixing architecture, the content provider can send the streams directly to the mobile clients, so the mobile client would need to process one content media stream at QVGA in addition to the interaction media stream. This proved to be very challenging on a mobile device. To reduce this complexity in endpoint mixing, pre-downloaded content was used; but even then, the need for two video decoder instances resulted in reduced available battery time. On the other hand, endpoint mixing requires no direct connection between the content provider and the interaction server. This gives each individual user a wider choice by allowing him/her to obtain content from any desired content provider, which is an advantage of the endpoint mixing architecture over the centralized mixing architecture. The need to minimize the processing power requirement while maximizing battery availability is very specific to MIST systems, and is not as critical for the social TV concept in a static context. Taking these factors into consideration, the centralized mixing approach was used as the base for conducting further investigation of the consumer experience of the MIST proof-of-concept system.

It needs to be noted that we have not dwelt on the legal and copyright aspects of using content in the above described architectures in a commercial or public service. The terms for doing this are likely to vary depending on the content type and country-specific legislation.

MIST PROOF-OF-CONCEPT SYSTEM
The system allows users to invite people of interest to participate in a social watching session from their mobiles. The idea, as described above, is to create a virtual shared space (VSS). A user can initiate the creation of a VSS by inviting the users he/she wishes. The SIP URI of each of the invitees is sent as a list to the interaction server (IS). The IS invites each user on the list. All the invitees that accept the invitation join the VSS. Users who reject the invitation can join the VSS later at their convenience. Figure 4 shows the steps involved in creating and interacting with a VSS.

Figure 2. Centralized mixing architecture. [Figure labels: content provider; TV/video content flow; interaction server; participant media; interaction and TV/video content; individual participant (mobile client); virtual shared space.]


IEEE Communications Magazine • December 2009 119

In the VSS, the participants can talk to each other, see each other, send text messages to the shared space, and, at the same time, view the content on the screen. The social watching session starts like a conventional multiparty videoconference. The participants can talk to select the content or just socialize before starting to watch the content. Once watching starts, all the participants see exactly the same content, which is synchronized between the participants. It is possible to speak to the other participants or gesture to them by popping the participant video onto the screen. The popping in and removal of the individual interaction video from the screen is user controlled. To make optimal use of the small mobile display, the individual interaction video is linked to the voice activity of each participant, such that the interaction videos of participants who are silent are kept small. The interaction video window grows in size only when a participant is speaking (in Fig. 5 the user in the top left corner is speaking). This keeps unnecessary clutter away from the content being watched. There is a shared remote control between the participants, which can be used to collectively choose the content from the channel list and also for playback control. Each user has equal control: if PAUSE is pressed by one user, the content pauses for all the participants of the social watching session; the same is true if PLAY or STOP is pressed.
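The voice-activity-linked sizing of the interaction video windows can be sketched as follows. The window sizes and the speech-detection threshold are illustrative assumptions, not values from the MIST system:

```python
# Hedged sketch: grow a participant's interaction video window only
# while that participant is speaking, keeping silent participants small
# so the content being watched stays uncluttered.
SMALL = (48, 36)         # thumbnail size for silent participants (assumed)
LARGE = (96, 72)         # enlarged window while speaking (assumed)
SPEECH_THRESHOLD = 0.3   # normalized voice-activity level (assumed)

def window_size(voice_activity: float):
    """Return the display size for one participant's interaction video."""
    return LARGE if voice_activity > SPEECH_THRESHOLD else SMALL

# Usage: only the active speaker's window is enlarged.
activity = {"Anna": 0.8, "Alex": 0.05, "Marja": 0.0}
sizes = {name: window_size(level) for name, level in activity.items()}
```

A real client would feed this decision from a voice activity detector on each participant's audio stream and animate the size changes; the point here is only the mapping from voice activity to window size.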

The proof-of-concept MIST system has been tested in both WLAN and third-generation with high-speed packet access (3.5G) environments. Considering first the WLAN bearer, the general user feedback has been positive. The response time of user actions, such as selecting the movie, switching between different views, or playback control commands (like PLAY, PAUSE, STOP), is within acceptable limits for the users at around 0.5 s. Considering the 3.5G network, the MIST system proved to be stable when tested with mobile clients in different locations (one in Tampere, Finland, and the other in Bristol, United Kingdom), with a response time of less than 1 s. The response times for WLAN and 3.5G were slightly more than the round-trip time (RTT) experienced in the access network. Thus, the response time of the system will vary in step with the network conditions.

CONSUMER EXPERIENCE STUDY

As mentioned earlier, MIST is a relatively new concept, and many questions arise regarding such a system. Some of the questions relate to its utility and desirability to consumers in the real world. Another crucial aspect is the potential for acceptance of a MIST system in the absence of a de facto benchmark system. According to Schatz and Egger [7], social features coupled with mobile TV watching enrich the viewing experience despite the inherent possibility of causing distractions. A focus group study was conducted in order to gather user feedback and opinions about our proof-of-concept MIST system presented above.

The main goals of the consumer experience study, which gathered consumer perceptions and requirements, were as follows:

• Assessing the perceived user benefit ofMIST systems

• Finding out the effect of interaction onmedia consumption patterns

• Exploring the type of content most suitedfor a MIST system

• Getting feedback about the key requirements of a functional MIST type of system

The methodology was designed to gather subjective and qualitative inputs from the study subjects to increase consumer experience understanding of mobile-based social TV systems in general, and also of our proof-of-concept MIST system. A focus group method was used in our study. A focus group study is a form of qualitative research in which a group of people are asked about their attitude toward a product, service, concept, advertisement, idea, or packaging. Questions are asked in an interactive group setting where participants are free to talk with other group members. For the focus group study, nine persons from four countries, consisting of seven males and two females between the ages of 25 and 35, participated. The criteria used for selecting the people were that the study subjects were average Internet users and had used mobile phones with Internet and multimedia

Figure 3. Endpoint mixing architecture. [Figure labels: content provider; TV/video content flow to each client; interaction server; interaction (videoconference); participant media; individual participant (mobile client); virtual shared space.]

Figure 4. Virtual shared space creation and user interaction sequence diagram. [Diagram steps: to create the virtual shared space, the host (Anna) chooses participants; the interaction server invites each user on the list (dial-out) — here Alex, Jo, Marja, and Mark; each invited user accepts or rejects the invitation; users who accepted join the VSS; user interactions with the VSS include choosing content, play/pause playback, sending text, etc.]


capability. The users were put in groups of two and three persons such that they knew each other beforehand. This was to facilitate free discussion and mutual comfort between them during the hands-on trial of the MIST system. The subjects were interviewed and participated in the discussion according to the following methodology:

• Initially, the users were introduced to the concept in abstract form, without mentioning any details about the system.

• Opinions about value addition, concerns, and requirements were collected from the test subjects.

• In the next step, our proof-of-concept MIST system [4, 5] was shown to the users. They were then allowed to use the system by themselves, with some initial help as and when required.

• First impressions about the system were collected again after showing and trying out the system.

• Following that, more specific questions were asked. These were about the suitability of content types for such a social and interactive type of communication, about their preferences for interaction modalities, and about their comfort level with sharing participant media with other people participating in the social TV session.

The resulting responses from the focus groups and individual user feedback were analyzed together to find out strengths and weaknesses. It should be noted that the study was limited to a small number of subjects, and would need to be extended to a larger pilot study with diverse groups before consumer preferences can be determined quantitatively.

Overall, the MIST concept and the proof-of-concept MIST system were received positively by the users. The feeling of social presence when watching content was found to add value by all the study subjects. Two factors were found to be key in determining the preference for richness of interaction modalities while watching the content. The first factor is the relationship of the social TV user with the other participants involved in the particular social TV session. The second factor is the type of content being consumed.

In the absence of a de facto standard or a benchmark system, when the MIST concept was introduced to the study subjects, there was initial skepticism about the usability and desirability of the system. One upside of this initial skepticism was the delight experienced during first-hand use of the system (sample reactions from the users: "I did not expect it to be so good," "looks far better than I imagined"). Such initial skepticism can nevertheless be a challenge for large-scale acceptance of such a concept. On the other hand, not having a de facto standard or a benchmark system gives application designers greater freedom in defining the consumer experience.

EFFECT OF INTERACTION MODALITIES

Our study revealed that preference for interaction modalities is subject to personal preferences and also to the context of interaction. Some users found audio conferencing to be more socially engaging, and thus more distracting when consuming content (sample reaction: "text is preferred since it is less intrusive and easy to ignore"). Other users found keying in responses as text messages far more laborious (sample reaction: "text would not be suitable, it will divert attention"). Interestingly, for some users we also discovered a preference for asymmetric interaction modalities, where they could talk to the other participants while the participant responses are rendered as text, and vice versa.

Videoconferencing capability was considered to add a greater feeling of social presence than audio-only or text-only interaction (sample reactions: "text interaction is not suitable for sports as it is less spontaneous," "talking and seeing gives spontaneity and liveliness"). The use of videoconferencing was found to be more context-dependent than audio conferencing and text-only interaction, in decreasing order. This could be explained by the fact that the least engaging interaction method was perceived to be used

Figure 5. Voice activated participant video zoom and shared "remote control."



most liberally by the users. The above discussed differences in social presence match the work done in [8], where it was observed that voice-video, voice-only, and text-only interaction provide a feeling of presence in descending order. The study brought out that there is an implicit feeling of etiquette when watching content socially with friends, even when it is on their mobiles. This etiquette was transferred transparently from real-world interaction experiences into the VSS. Overall, 100 percent of the consumer study participants expressed the desire for availability of audio, video, and text interaction modalities. The preference for audio over text and vice versa was 22 percent (2 out of 9) each, while the remaining 55 percent (5 out of 9) had no clear preference.

CONTENT SUITABLE FOR MIST SYSTEMS

The content that was considered suitable by the study subjects was in line with general preferences observed in typical mobile use of different services, like Internet browsing and video viewing. Long-format content was less preferred than short-duration content. This could be attributed to the lack of a long contiguous time interval and to the mindshare occupied while being involved in participative as well as interactive content consumption. User-generated videos (especially home and family videos, short video clips, funny clips, user-generated videos for other purposes, etc.), sports content, short TV episodes (TV program episodes, celebrity TV, etc.), and news content were considered most suited for mobile social consumption.

USER REQUIREMENTS

The study revealed that although users desire rich interaction capabilities, they do not want all of them enabled all the time (i.e., during the whole duration of the content). For example, some users prefer participant video overlay only during half-time periods in sports or during commercial breaks between programs. Table 1 shows the user preferences for the content types and the interaction modalities they would like to use. In the tabular representation, interaction level refers to the preferred frequency of interaction with other participants. The content type definitions used in the table are the same as described earlier.

USER CONCERNS

The first and most common concern was related to the privacy of the participating user. The users wanted to be able to control the participant media, that is, their own interaction content (their personal audio, video, and text) being shared with other participants. These concerns influence some design choices, like voice-activated participant audio/video rendering vs. manual enabling or press-to-talk. The users more concerned about privacy preferred the manual option, while others preferred automatic voice-based activation. It is interesting to note that these users wanted the system to retain both options and to choose between them depending on the context; in this case the context is their comfort level in relation to the other participants, whether they are alone at home, and so on. One potential issue pointed out was related to finding the best trade-off in position and distance of the phone with respect to (the body of) the user. To work around this problem, headsets equipped with microphones could be used, and the zoom setting of the camera (recording participant interaction video) can be chosen such that the distance users can comfortably keep from the mobile is sufficient to get a good overall view of the user in the camera view. Another concern was the cost, and also the wide availability of such a system so as to be able to include the people of interest. The most common technical concerns raised were about the quality and availability of suitable networks, since it was pointed out by the study subjects that good media quality would be essential for the usability of the system. The other prominent concern was about the effect of using this system on the mobile device's talk time and standby time.

CONCLUSIONS

In this article we have presented the Mobile and Interactive Social Television system that creates a virtual shared space for participants watching

Table 1. User preferences for content type and interaction modalities.

Content type | Users choosing the content type | Preferred interaction modalities | Suitable interaction level | Percentage selecting the type as suitable for social consumption on mobiles
Sports | 9 | Audio, video, text | High | 100
News | 6 | Audio, video, text | High | 67
TV content (TV episodes, celebrity TV, etc.) | 6 | Audio, video, text | Medium | 67
Movies (cinema, documentaries, etc.) | 2 | Audio, text | Low | 29
User generated | 9 | Audio, video, text | High | 100


content together on mobiles. Rich real-time interaction modalities like voice and videoconferencing, in addition to text chat, can be used by participants involved in social watching. Centralized mixing and endpoint mixing architectures for realizing a MIST system were presented, and their relative advantages as well as drawbacks were discussed. It was observed that for resource-constrained devices like mobiles, the centralized mixing architecture can deliver better performance. The proof-of-concept system using the centralized mixing approach was used for conducting our consumer studies.

The consumer experience studies give insight into the four key questions of the study. The first key question is about the perceived user benefit. The study reveals that there is a clear and positive user benefit experienced by the subjects, based on their reactions to the concept and first-hand use of the MIST system. The "feeling of social presence" was considered to add value by the study subjects. The effect of interaction on media consumption was affected by the type of content being consumed and the comfort level between the participants: the higher the comfort level between the participants, the more openness there is to using rich interaction modalities, tempered only by the type of content being watched together. Content that is of short duration and time-sensitive (e.g., news) was considered suitable for MIST-type consumption. The key requirement for designing the system was found to be the ability of the system to allow personalization and customization of the interaction features and other MIST system controls, such as content playback control.

REFERENCES

[1] B. Lee and R. S. Lee, “How and Why People Watch TV: Implications for the Future of Interactive Television,” J. Adv. Research, vol. 35, no. 6, 1995, pp. 9–18.

[2] R. Schatz et al., “Mobile TV Becomes Social — Integrating Content with Communications,” Proc. ITI ’07, June 2007, pp. 263–70.

[3] W. T. Tan et al., “Community Streaming with Interactive Visual Overlays: System and Optimization,” IEEE Trans. Multimedia, vol. 11, no. 5, Aug. 2009, pp. 986–97.

[4] F. Cricri et al., “Mobile and Interactive Social Television — A Virtual TV Room,” 10th IEEE Symp. World of Wireless, Mobile and Multimedia Networks, Kos, Greece, 15–19 June 2009.

[5] S. Mate and I. Curcio, “Consumer Experience Study of Mobile and Interactive Social Television,” 10th IEEE Symp. World of Wireless, Mobile and Multimedia Networks, Kos, Greece, 15–19 June 2009.

[6] R. Schatz, S. Wagner, and N. Jordan, “Mobile Social TV: Extending DVB-H Services with P2P-Interaction,” Proc. 2nd Int’l. Conf. Digital Telecommun., 2007, pp. 14–19.

[7] R. Schatz and S. Egger, “Social Interaction Features for Mobile TV Services,” Proc. IEEE Broadband Multimedia Sys. and Broadcast Symp., Las Vegas, NV, Apr. 2008.

[8] E. Sallnas, The Effect of Modality on Social Presence, Presence and Performance in Collaborative Virtual Environments, doctoral thesis, KTH, Stockholm, Sweden, 2004.

BIOGRAPHIES

SUJEET MATE ([email protected]) is a senior researcher at Nokia Research Center, Tampere, Finland. He received his B.E. degree in electrical engineering from REC Surat, India, and his M.S. degree in electrical engineering from the University of Texas at Dallas. He has been active in developing applications for real-time mobile multimedia, and in particular interactive video, conferencing, and imaging services. His interests include multimedia application and service architectures for wireless networks, system prototyping for Internet-driven multimedia services, social TV, and context-sensitive multimedia applications.

IGOR D. D. CURCIO [S’91, M’03, SM’04] ([email protected]) worked for several companies as a freelance software engineer, project manager, and IT trainer from 1986 to 1997. He received his Laurea degree in computer science from the University of Catania, Italy, in 1997. In 1998 he joined Nokia, where he has covered several research and management positions in the areas of real-time mobile multimedia. He is now a principal member of research staff at Nokia Research Center. He has been active in several standardization organizations (3GPP, IETF, DLNA, DVB, ARIB), where he has covered sub-working group and task force chair positions and contributed over 200 standardization papers. He holds about 20 international patents. He has been an ACM member since 1990 and has published more than 40 papers in the areas of mobile multimedia applications, video on demand, and software engineering. His current interest areas include mobile video applications and services, streaming, conferencing, mobile TV, P2P multimedia, social media, multimodal sensing, and context applications.

