
2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 16–19, 2011, New Paltz, NY

AN ADAPTIVE STREAMING SYSTEM FOR MPEG-4 SCALABLE TO LOSSLESS AUDIO

Rongshan Yu, Haiyan Shu, Susanto Rahardja

Institute for Infocomm Research, Agency for Science, Technology & Research, Singapore 138632

ABSTRACT

In this paper, we propose an adaptive streaming system for MPEG-4 Scalable to Lossless (SLS) encoded audio. In the proposed system, the fine grain scalable (FGS) feature of SLS is utilized to achieve optimal audio streaming quality over networks with limited and possibly time-varying bandwidth. To this end, the streaming system selects an optimal target quality according to the available network resources and the rate-quality relationship of the SLS encoded frames, and the SLS frames are then truncated to that target quality before they are transmitted to the client. The proposed system adapts to both the network conditions and the rate-distortion characteristics of the SLS bit-stream, so that the quality of the streamed audio is optimized without over-utilizing the network resources.

1. INTRODUCTION

Audio streaming refers to a method that continuously transmits and presents digital audio to clients from a streaming server over computer networks. Compared to other audio delivery techniques such as file-based downloading, streaming is characterized by an “instantaneous” quality: the digital audio is presented to the end user almost immediately after it starts being transmitted from the server to the client. In order to utilize the network bandwidth more efficiently, the audio to be streamed is compressed to lower data rates prior to streaming by using audio coding technologies [1]. Typically, in an audio encoder, the audio content is segmented into consecutive audio frames of constant time duration, and these frames are further processed so that redundancies and/or irrelevant information are removed, resulting in a compressed audio bit-stream with a reduced data rate compared to that of the original content.

Typically, network streaming systems use audio encoders such as MPEG-1 Audio Layer III (mp3) [2] or MPEG-4 Advanced Audio Coding (AAC) [3] to produce a Constant Bit-Rate (CBR) bit-stream that consists of compressed audio frames of equal size throughout the audio content. Due to the non-stationary nature of audio signals, audio decoded from a CBR bit-stream usually exhibits quality fluctuations at multiple time scales. As a result, streaming of CBR audio may not be optimal from the perspective of the quality of the streamed audio. An alternative solution to this problem is to use a Variable Bit-Rate (VBR) audio encoder [4], which generates variable-bit-rate but constant-quality bit-streams. However, although VBR coding solves the quality fluctuation problem, VBR audio is in general not network friendly: the bit-rate fluctuation of VBR audio depends only on the audio signal itself and therefore may not match the available bandwidth during the streaming session.

The introduction of Fine Granular Scalable (FGS) audio coding such as MPEG-4 Scalable to Lossless (SLS) coding [5] provides a potential solution to the aforementioned problems. Unlike other audio codecs, an SLS compressed audio frame can be further truncated to lower data rates at little or no additional computational cost. This feature allows a streaming system to adapt the streaming quality/rate in real time depending on the available bandwidth. This real-time quality adjustment makes it possible for a streaming system to fully utilize the network resources towards constant and optimal streaming quality.

In this paper, an adaptive streaming system based on SLS encoded audio is proposed. In the proposed system, a target quality is first selected, and the audio frames to be streamed are truncated accordingly so that this target quality is achieved. To ensure the best possible quality of the streamed audio, the target quality is jointly determined by the available streaming bandwidth and the rate-quality relationship of the SLS encoded frames. As will be shown in this paper, the optimal target quality can be obtained by simply solving a linear equation if the rate-quality relationships of the SLS encoded frames are represented by piecewise-linear functions interpolated from discrete rate-distortion (R-D) data points. Therefore, the proposed system can be implemented at very low computational cost.

2. MPEG-4 SLS

MPEG-4 SLS [5] is one of the latest additions to the MPEG-4 audio coding tool family from ISO/IEC. It allows the scaling up of a perceptually coded representation such as MPEG-4 AAC to a lossless representation, with a wide range of intermediate bit-rate representations. It also has a non-core mode in which the MPEG-4 AAC core is not present and the quality is scaled up virtually from bit-rate zero.



One of the major merits of SLS is that the bit-stream generated by the encoder can be further truncated to lower data rates, as illustrated in Fig. 1. The truncation can be done either in the streaming server or in a network gateway during the streaming session. Since the truncation consumes very little computation compared to alternative approaches such as transcoding, this feature is particularly useful for a streaming server or gateway that needs to handle a large number of simultaneous streaming sessions.
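To make the truncation step concrete, the following minimal Python sketch (class and function names are hypothetical and not taken from the SLS reference software) shows how a server or gateway could shorten each lossless frame payload to a target size before transmission; because the operation amounts to a byte slice per frame, its cost is negligible compared with transcoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlsFrame:
    """One SLS access unit: header plus embedded (FGS-ordered) payload."""
    header: bytes          # core/header portion that must always be kept
    enhancement: bytes     # fine-grain scalable enhancement data


def truncate_frame(frame: SlsFrame, target_size: int) -> bytes:
    """Return the frame shortened to at most `target_size` bytes.

    The header is always kept; only the embedded enhancement data is cut,
    mirroring the bit-stream truncation illustrated in Fig. 1.
    """
    keep = max(0, target_size - len(frame.header))
    return frame.header + frame.enhancement[:keep]


def truncate_stream(frames: List[SlsFrame], target_sizes: List[int]) -> List[bytes]:
    """Truncate every frame n to its reduced size r'_n."""
    return [truncate_frame(f, r) for f, r in zip(frames, target_sizes)]
```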

Fig. 1. Top: Lossless SLS bit-stream with frame sizes $r_n$, where $n$ is the frame index. Bottom: Truncated bit-stream with reduced bit-rates $r'_n$.

3. ADAPTIVE STREAMING OF MPEG-4 SLS

In this section, the proposed adaptive streaming system is introduced. The system comprises three parts: the rate-distortion model, buffer control, and the streaming scheme. Details are given below.

3.1. Rate-Distortion Model for MPEG-4 SLS

In order to effectively adjust the frame sizes of the SLS compressed audio for optimal streaming quality under the streaming bandwidth constraint, it is essential for the streaming system to be aware of the rate vs. quality relationship of the SLS encoded audio at the frame level. Due to the non-stationary nature of audio, this rate-quality relationship may be highly non-uniform and highly dynamic across time and across audio sequences. As a result, it is not easy to convey this information to the streaming server.

In the proposed system, the distortion of the decoded audio is explicitly measured frame by frame during the encoding process, and the results are recorded in an R-D table. The resulting R-D table can be stored as metadata of the compressed SLS file to aid the streaming process. Since the R-D table is only used at the server side and is not transmitted to the decoder, it does not introduce any additional burden on the network resources during streaming.
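As a purely illustrative example of how such per-frame metadata might be stored alongside the compressed file, the sketch below serializes the measured (rate, MNR) points to a JSON sidecar; the format and file naming are our own assumptions, not part of the SLS bit-stream syntax.

```python
import json
from typing import Dict, List, Tuple


def save_rd_table(path: str, table: Dict[int, List[Tuple[float, float]]]) -> None:
    """Write the per-frame R-D table, frame index -> [(rate_bits, mnr_db), ...]."""
    with open(path, "w") as f:
        json.dump({str(n): pts for n, pts in table.items()}, f)


def load_rd_table(path: str) -> Dict[int, List[Tuple[float, float]]]:
    """Read the table back on the server; it never travels to the client."""
    with open(path) as f:
        raw = json.load(f)
    return {int(n): [tuple(p) for p in pts] for n, pts in raw.items()}
```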

In the current implementation, the distortion of an SLS encoded frame is calculated as the average Masking-to-Noise Ratio (MNR) over all scale-factor bands [1]. It is well known that MNR alone is not sufficient to represent the perceptual quality of compressed audio; hence, in practical applications, more sophisticated objective quality measures, such as those proposed in [6, 7], can be incorporated into the proposed R-D table to improve the overall perceptual quality.

Fig. 2. Interpolated R-D function vs. actual R-D data measured during the encoding process (average MNR in dB vs. bit-rate in kbps).

Since the R-D table can only record a limited number of data points, R-D data not recorded in the table are determined in the streaming server by piecewise-linear interpolation between the measured data points. As shown in Fig. 2, the R-D relationship of an SLS bit-stream is largely linear. Although piecewise-linear interpolation may introduce approximation errors, such errors are usually tolerable if the locations and density of the rate-distortion data points in the R-D table are carefully selected.
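As an illustration of how a server might use such a table, the sketch below (data layout and names are our assumptions, not part of the SLS specification) stores the measured per-frame (rate, MNR) points and evaluates the piecewise-linear interpolated R-D function in both directions: the quality reached at a given truncated size, and the size needed for a given target quality.

```python
import bisect
from typing import List, Tuple


class FrameRdTable:
    """Sparse per-frame R-D points measured at encode time, e.g. every 32 kbps.

    `points` holds (rate_bits, mnr_db) pairs; MNR is assumed to increase
    monotonically with rate, so both axes are sorted consistently and
    intermediate values are obtained by piecewise-linear interpolation.
    """

    def __init__(self, points: List[Tuple[float, float]]):
        self.points = sorted(points)
        self.rates = [r for r, _ in self.points]
        self.mnrs = [q for _, q in self.points]

    def quality_at_rate(self, rate: float) -> float:
        """Interpolated MNR (dB) when the frame is truncated to `rate` bits."""
        return self._interp(self.rates, self.mnrs, rate)

    def rate_at_quality(self, q: float) -> float:
        """Interpolated frame size (bits) needed to reach quality q, i.e. r_bar_j(q)."""
        return self._interp(self.mnrs, self.rates, q)

    @staticmethod
    def _interp(xs: List[float], ys: List[float], x: float) -> float:
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        k = bisect.bisect_right(xs, x)
        t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
        return ys[k - 1] + t * (ys[k] - ys[k - 1])
```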

3.2. Transmission buffer control

In an audio streaming system, first-in-first-out (FIFO) buffers are typically used at both the transmitter and the receiver to absorb discrepancies between the rate of the VBR audio bit-stream and the actual network throughput. A buffer control algorithm is generally employed to control the rates at which audio data enter and leave these buffers so that they neither underflow (data requested from an empty buffer) nor overflow (data written to a full buffer). In this paper, we focus only on receiver-side buffer underflow, because receiver/transmitter buffer overflow can easily be avoided if sufficient memory is available, and transmitter-side buffer underflow can simply be resolved by either reducing the transmission rate or inserting stuffing bits. It can be shown that preventing receiver buffer underflow is equivalent to performing buffer control at the transmitter side so that the transmitter buffer level does not exceed a certain level determined jointly by the initial delay, i.e., the amount of time the receiver waits after data transmission starts before it begins to play the audio, and the network throughput. More specifically, the following relationship holds for the levels of the transmitter and receiver buffers if one ignores the network delay for simplicity:


$$B_R(i + \Delta) = \sum_{j=i+1}^{i+\Delta} C_j - B_T(i), \qquad (1)$$

where $B_R$ and $B_T$ are, respectively, the levels of the receiver and transmitter buffers, $\Delta$ is the initial delay, and $C_i$ is the amount of data transmitted to the network by the transmitter during frame interval $i$. Clearly, to prevent the receiver buffer from underflowing, we need to keep the right-hand side of the above equation greater than zero at all times; that is, the transmitter buffer level should never exceed the effective buffer level $\sum_{j=i+1}^{i+\Delta} C_j$.

We now turn to the details of how to leverage the scalability feature of SLS so that the streaming system can stream at the maximum possible quality without introducing receiver buffer underflow, which would interrupt continuous playback. We start our derivation by formally setting up the link between the transmitter buffer level $B_T(j)$ and the SLS streaming quality $q$. Considering a look-ahead window of length $L$ starting from the current frame $i$, the following relationship holds for the transmitter buffer level:

$$B_T(i + L - 1) = B_T(i) + R(q) - \sum_{j=i+1}^{i+L} C_j. \qquad (2)$$

Here, $R(q)$ is the aggregated R-D function for the look-ahead window under consideration, defined as

$$R(q) = \sum_{j=i}^{i+L-1} r_j(q), \qquad (3)$$

where $r_j(q)$ is the size of frame $j$ when the SLS audio is streamed at quality $q$.
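A direct transcription of (1) into Python (a sketch with our own variable names) makes the underflow condition explicit: the receiver buffer stays non-empty as long as the data delivered over the initial-delay horizon exceeds the current transmitter buffer level.

```python
from typing import Sequence


def receiver_buffer_level(delivered: Sequence[float], b_t_i: float) -> float:
    """Receiver buffer level B_R(i + delta) from (1).

    `delivered` holds C_{i+1} .. C_{i+delta}, the data put on the network
    over the initial-delay horizon, and `b_t_i` is B_T(i).
    """
    return sum(delivered) - b_t_i


def receiver_would_underflow(delivered: Sequence[float], b_t_i: float) -> bool:
    """True if (1) predicts underflow, i.e. B_T(i) exceeds the effective buffer level."""
    return receiver_buffer_level(delivered, b_t_i) <= 0.0
```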

Now, in order to maintain an appropriate transmitter buffer level, the following constraint is imposed to determine the maximum allowed SLS streaming quality $q_T$ at the beginning of the look-ahead window:

$$B_T(i + L - 1) = \alpha B_T(i), \qquad (4)$$

where $0 \le \alpha < 1$, so that the transmitter buffer level decreases over time in order to avoid receiver buffer underflow. Substituting (2) into the above equation, $q_T$ is thus given by

$$R(q_T) = \sum_{j=i+1}^{i+L} C_j - (1 - \alpha) B_T(i). \qquad (5)$$

In practical applications, the amounts of data $C_j$, $j = i+1, \ldots, i+L$, that will be transmitted to the network are in general unknown a priori at time $i$. However, depending on the underlying streaming network infrastructure, they can still be predicted from, e.g., the available streaming bandwidth provided by a bandwidth estimation algorithm, that is,

$$\sum_{j=i+1}^{i+L} C_j \approx L F R_i, \qquad (6)$$

where $R_i$ is the available streaming bandwidth in bits per second estimated at the current frame $i$, and $F$ is the duration of an SLS frame in seconds. The target quality selection needs to be performed periodically during the streaming session in order to cater for potential bandwidth fluctuations. Mathematically, it can be shown that the transmitter buffer level is bounded by

$$B_T(i) < \frac{1}{1 - \alpha} L F R_{\max}, \qquad (7)$$

where $R_{\max} = \max(R_j)$, $j = i+1, \ldots, i+L$, is the largest estimated streaming bandwidth within the current look-ahead window. Therefore, receiver buffer underflow can be avoided if one can guarantee that the effective buffer size is larger than this upper bound, i.e.,

$$\sum_{j=i+1}^{i+\Delta} C_j \ge \frac{1}{1 - \alpha} L F R_{\max}. \qquad (8)$$

Assuming the minimum available bandwidth for streaming is $R_{\min}$, (8) is satisfied as long as

$$\Delta F R_{\min} \ge \frac{1}{1 - \alpha} L F R_{\max}, \qquad (9)$$

or

$$\Delta \ge \frac{L R_{\max}}{(1 - \alpha) R_{\min}}. \qquad (10)$$

In addition, we found in our experiments that this lower bound on the initial delay is somewhat pessimistic: buffer underflow rarely occurred for the audio sequences we tested as long as there was a reasonable initial delay and $\alpha < 1$.
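The two design quantities derived above translate into simple helpers. The sketch below (an illustration under the stated assumptions, not the authors' implementation) computes the per-window rate budget of (5) with the bandwidth estimate (6) substituted in, and the initial-delay lower bound (10) rounded up to whole frames.

```python
import math


def window_rate_budget(bandwidth_bps: float, frame_dur_s: float,
                       window_len: int, b_t: float, alpha: float) -> float:
    """Right-hand side of (5) with the approximation (6):
    R(q_T) = L*F*R_i - (1 - alpha)*B_T(i), in bits."""
    return window_len * frame_dur_s * bandwidth_bps - (1.0 - alpha) * b_t


def min_initial_delay_frames(window_len: int, r_max: float, r_min: float,
                             alpha: float) -> int:
    """Lower bound (10) on the initial delay, in whole frame intervals."""
    return math.ceil(window_len * r_max / ((1.0 - alpha) * r_min))
```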

3.3. Streaming scheme

The solution of (5) for the target quality is non-trivial because the aggregated R-D function $R(q)$ is, in general, non-linear. However, if the R-D functions are approximated by the piecewise-linear interpolated R-D functions $\bar{r}_j(q)$ suggested in Section 3.1, the aggregated R-D function $R(q)$ is piecewise linear as well. As a result, (5) becomes a linear equation and its solution is straightforwardly given by

$$q_T = \frac{R_i - R_L}{R_U - R_L}\, q_L + \frac{R_U - R_i}{R_U - R_L}\, q_U, \qquad (11)$$

where $R_L$ and $R_U$ are, respectively, the lower and upper ends of the linear segment of $R(q)$ in which the estimated available bandwidth $R_i$ falls, and $q_L$ and $q_U$ are the corresponding qualities. Once the target quality is obtained, the size of each streamed audio frame is easily determined from the interpolated rate-quality function as $\bar{r}_j(q_T)$.
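Putting the pieces together, a minimal sketch of the per-window target-quality selection could look as follows (names are hypothetical, and the per-frame interpolated rate functions are assumed to be supplied as callables): the aggregated R(q) of (3) is evaluated on the grid of recorded quality points, the segment containing the rate budget is located, q_T is obtained by solving that linear segment as in (11), and each frame is then truncated to its interpolated size at q_T.

```python
from typing import Callable, List, Sequence


def select_target_quality(frame_rate_fns: List[Callable[[float], float]],
                          q_grid: Sequence[float],
                          rate_budget: float) -> float:
    """Pick q_T such that the aggregated window rate R(q_T) meets `rate_budget`.

    frame_rate_fns[j](q) -- interpolated per-frame size r_bar_j(q) in bits
    q_grid               -- ascending list of recorded quality points
    rate_budget          -- window budget from (5)/(6), in bits
    """
    def aggregate(q: float) -> float:          # R(q) of (3), piecewise linear in q
        return sum(r(q) for r in frame_rate_fns)

    if rate_budget <= aggregate(q_grid[0]):
        return q_grid[0]                       # budget below lowest quality: clamp
    if rate_budget >= aggregate(q_grid[-1]):
        return q_grid[-1]                      # enough budget for lossless: clamp

    for q_lo, q_hi in zip(q_grid, q_grid[1:]):
        r_lo, r_hi = aggregate(q_lo), aggregate(q_hi)
        if r_lo <= rate_budget <= r_hi:
            if r_hi == r_lo:                   # flat segment: any quality in it works
                return q_hi
            # Solve the linear segment for q_T (cf. (11)).
            t = (rate_budget - r_lo) / (r_hi - r_lo)
            return q_lo + t * (q_hi - q_lo)
    return q_grid[-1]


def frame_sizes_for_quality(frame_rate_fns: List[Callable[[float], float]],
                            q_target: float) -> List[int]:
    """Truncated size r_bar_j(q_T) for every frame in the window, in bits."""
    return [int(r(q_target)) for r in frame_rate_fns]
```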


4. SIMULATIONS

The effectiveness of the proposed system is verified by simulation. In the simulation, an SLS encoder with an AAC core at 32 kbps/channel is used. The test sequence is “es01” from the MPEG-4 audio testing sequences. The R-D table of the MPEG-4 SLS encoded audio is generated with a step size of 32 kbps from the AAC core rate up to lossless quality, and the qualities of the audio frames are measured in MNR.

First, we investigate the effectiveness of the buffer control mechanism proposed in Section 3.2 for different values of the buffer control parameter α. The simulation results are shown in Fig. 3. Here, the available bandwidth is set to 64 kbps, the sliding window length L is set to 20 frames, and the initial delay is set to 10 frames. The target quality is updated at every audio frame. It is observed that when α = 1, i.e., when the transmitter buffer level is not considered in determining the target SLS quality, the receiver buffer level drifts randomly and can easily underflow. The underflow is effectively avoided when α < 1.

Fig. 3. Receiver buffer status vs. frame number for different settings of α (α = 1, 0.5, 0); es01 with 32 kbps core, available bandwidth 64 kbps, L = 20.
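For readers who wish to reproduce the qualitative behaviour of Fig. 3, the following simplified loop (our own sketch, which ignores network delay and jitter as in Section 3.2) evolves the transmitter and receiver buffer levels frame by frame; the quality-selection policy, e.g., one derived from the rate budget in (5), is passed in as a callable, and a negative receiver level in the returned trace marks an underflow event.

```python
from collections import deque
from typing import Callable, Sequence


def simulate_buffers(frame_bits: Callable[[int, float], float],
                     pick_quality: Callable[[float], float],
                     channel_bits: Sequence[float],
                     initial_delay: int):
    """Frame-by-frame evolution of transmitter/receiver buffer levels (bits).

    frame_bits(i, q)   -- truncated size r_i(q) of frame i at quality q
    pick_quality(b_t)  -- target quality chosen from the transmitter level B_T
    channel_bits[i]    -- C_i, bits the network accepts in interval i
    Returns the receiver buffer trace; negative values mark underflow.
    """
    b_t = b_r = 0.0
    pending = deque()                 # sizes of frames queued but not yet played
    trace = []
    for i, c_i in enumerate(channel_bits):
        q_t = pick_quality(b_t)       # adapt quality to the current buffer level
        size = frame_bits(i, q_t)
        b_t += size
        pending.append(size)
        sent = min(b_t, c_i)          # network drains the transmitter buffer
        b_t -= sent
        b_r += sent
        if i >= initial_delay:        # playback starts after the initial delay
            b_r -= pending.popleft()  # decoder consumes one frame per interval
        trace.append(b_r)
    return trace
```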

Next, we evaluate the performance of the proposed streaming scheme in terms of the quality of the streamed audio. Here, α is set to 0.5, and the other settings from the previous simulation are retained. Fig. 4 shows the quality-versus-time performance for three cases: CBR streaming, and the proposed system with sliding window lengths L = 20 and L = 100. For clarity, only frames 150 to 250 are shown. The results show that the proposed system yields a much smoother streamed audio quality, and that the qualities of critical frames (frames with very poor perceptual quality under CBR streaming) are dramatically improved as well. It is also evident from the figure that, in general, a longer sliding window leads to smoother streamed audio quality and thus better perceptual quality. In practical applications, however, care should be taken to avoid a sliding window that is too long, as it not only increases the complexity of the target quality calculation but also increases the streaming bit-rate fluctuation, which may exceed the capability of the buffer control algorithm.

Fig. 4. Comparison of different streaming schemes: MNR (dB) vs. frame number for CBR streaming and adaptive streaming with L = 20 and L = 100; es01 with 32 kbps core, available bandwidth 64 kbps, α = 0.5.

5. CONCLUSION

In this paper, we present an adaptive streaming system for MPEG-4 SLS encoded audio. The proposed system achieves constant-quality audio streaming by dynamically adjusting the rate of the streamed audio on the fly during a streaming session, according to the R-D behavior of the SLS encoded audio. Furthermore, to ensure smooth audio playback, a buffer control mechanism is implemented that avoids receiver buffer underflow by keeping the transmitter buffer status in the loop when the streaming rate is adjusted. The proposed system fully utilizes the available network resources for optimal streaming quality, and its superior performance compared to CBR streaming is confirmed by the simulation results.

6. REFERENCES

[1] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proceedings of the IEEE, vol. 88, no. 4, pp. 451–515, Apr. 2000.

[2] ISO/IEC 11172-3:1992, “Information technology – Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s – Part 3: Audio,” 1992.

[3] ISO/IEC 14496-3:2009, “Information technology – Coding of audio-visual objects – Part 3: Audio,” Oct. 2009.

[4] A. Szwabe and C. Jedrzejek, “Perceptually transparent audio compression based on a variable bit rate AAC coder,” in Proc. 4th EURASIP Conference focused on Video/Image Processing and Multimedia Communications, vol. 2, pp. 685–690, 2003.

[5] ISO/IEC 14496-3:2005/Amd 3:2006, “Scalable Lossless Coding (SLS),” June 2006.

[6] ITU-R Recommendation BS.1387, “Method for objective measurements of perceived audio quality,” 2001.

[7] J. C. Hardin and C. D. Creusere, “Objective analysis of temporally varying audio quality metrics,” in Proc. 42nd Asilomar Conference on Signals, Systems and Computers, pp. 1245–1249, 2008.
