
Voice Quality Evaluation of Recent Open Source Codecs

Anssi Rämö, Henri Toukomaa

Nokia Research Center, Tampere, Finland
[email protected], [email protected]

Abstract

This paper introduces Silk, CELT, and BroadVoice, which are available on the internet as open source voice codecs. Their voice quality is evaluated with a subjective listening test. AMR, AMR-WB, G.718, G.718B, and G.722.1C were used as standardized references. In addition, the peculiar bandwidth and bitrate characteristics of Skype's Silk codec are examined in more detail.

Index Terms: speech coding, subjective listening test, open source

1. Introduction

There is an ongoing effort in the IETF to standardize a new open source voice codec for internet telephony (VoIP). In this paper three relatively recent open source voice codecs, Silk, CELT, and BroadVoice, are evaluated against the 3GPP-standardized AMR and AMR-WB and the ITU-T-standardized G.718, G.718B, and G.722.1C codecs. All the codecs were tested in the same listening test with both clean and noisy mono speech signals. Several other open source codecs exist, but older ones such as iLBC and Speex were already tested in our earlier papers ([1], [2]) and were not included in this listening test. In addition to the pure listening test results, a more detailed analysis is made of the Silk codec, since it is not a traditional fixed bitrate codec: it adapts both its bitrate and signal bandwidth on a frame-by-frame basis, resulting in a highly variable bitrate codec.

This paper is organized as follows. First, some general information about user experience is given in Section 2. Section 3 then describes the test methodology and listening test setup in detail. Section 4 shows how the Silk codec achieves different bitrates and how this affects the signal bandwidth and voice quality. Sections 5 and 6 present the subjective listening test results for the CELT and BroadVoice codecs, respectively. Finally, conclusions are drawn in Section 7.

2. Improved user experience

For over a century telephony has relied on narrowband (NB) voice quality, which is barely good enough to transport the most important elements of human speech. Wideband (WB) has been coming to mobile services and devices for a few years, but the final breakthrough has not yet happened. The current trend in voice codec development is to use even wider bandwidths and higher sampling rates for improved user experience. Traditionally sampling rates have increased by doubling (8, 16, and 32 kHz), whereas Skype's Silk codec uses some unconventional bandwidths and sampling rates to describe its bandwidth. In addition, the Silk codec uses critical sampling, so for example at a 16 kHz sampling rate all frequencies up to 8 kHz are present. See Table 1 for the detailed bandwidth ranges and sampling rates.

Abbr.  Meaning        Pass-band      Sampling rate
NB2    Narrowband     300-3400 Hz    8 kHz
NB1    Narrowband     150-4000 Hz    8 kHz
MB1    Mediumband     80-6000 Hz     12 kHz
WB2    Wideband       50-7000 Hz     16 kHz
WB1    Wideband       80-8000 Hz     16 kHz
SWB1   Superwideband  80-12000 Hz    24 kHz
SWB2   Superwideband  50-14000 Hz    32 kHz
FB1    Fullband       20-20000 Hz    48 kHz

Table 1. Abbreviations used for different signal bandwidths and respective sampling rates (1 = Silk bandwidth, 2 = ITU-T definition).

3. Test Description

The listening test was conducted in the Nokia Research Center Listening Test Laboratory [3]. The main research question was: how do naïve listeners experience differently encoded NB, WB, and SWB signals without any preparatory information? 40 naïve listeners took part in the listening test. Each listener evaluated over 30 conditions with 6 voice samples from all the scenarios described in Section 3.2. Thus each listener scored about 200 individually processed samples in random order. Since each sample took about 10 seconds to listen to and evaluate, and there were mandatory comfort breaks every twenty minutes, the listening took about one hour per listener. Each condition obtained 240 votes. In order to give the listeners some initial sense of scale, the test started with 12 introductory (practice) samples, which represented the full range of the conditions. These preparatory results were omitted from the final results.

3.1. Extended Range MOS Test Method

A modified version of the traditional ACR (Absolute Category Rating) [4] MOS method was used for the listening test. The MOS scale was extended to be 9 categories wide. Only the extreme categories were given a verbal description: 1 "very bad" and 9 "excellent". We have found that the 9-point MOS scale saturates less easily than the standard 5-point MOS scale with this kind of high-quality samples. In practice this new scale lies somewhere between MUSHRA (ITU-R BS.1534) and 5-point MOS. The assessment is not free-sliding, but nine different values still give the listener more ways to discriminate between the samples. In practice a 9-point MOS test is also much faster to conduct with naïve listeners than MUSHRA.
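As an illustration of how the 240 votes per condition turn into the MOS values plotted later, the sketch below averages raw 9-point scores per condition and adds a normal-approximation 95% confidence interval. The paper itself only reports mean scores, so the interval computation and the helper function are our own illustration, not the authors' analysis code.

```python
import numpy as np
from collections import defaultdict

def aggregate_votes(votes):
    """votes: iterable of (condition_name, score) pairs, scores on the 1-9 ACR scale.
    Returns {condition: (mos, ci95)} where mos is the mean score and ci95 the
    half-width of a normal-approximation 95% confidence interval."""
    by_condition = defaultdict(list)
    for condition, score in votes:
        by_condition[condition].append(score)
    results = {}
    for condition, scores in by_condition.items():
        scores = np.asarray(scores, dtype=float)
        mos = scores.mean()
        # Standard error of the mean; 1.96 is the 95% point of the normal distribution.
        ci95 = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
        results[condition] = (mos, ci95)
    return results

# Toy example with made-up scores (each real condition collected 240 votes):
example = [("codec A", 7), ("codec A", 6), ("codec B", 6), ("codec B", 5)]
print(aggregate_votes(example))
```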

3.2. Test samples

The test material contained female and male voice samples with clean voice and voice in background noise. Each listened sample contained a sentence pair that lasted a total of 8 seconds. Table 2 shows all six sample types that were used for the voice quality assessment, together with the background noise types and relative background noise levels.

Set  Speaker   Background noise  Noise level
1    Female 1  clean             n/a
2    Male 1    clean             n/a
3    Female 2  Music (Clapton)   -20 dB
4    Male 2    Cafeteria         -20 dB
5    Female 3  Street            -15 dB
6    Male 3    Office            -15 dB

Table 2. Sample sets used for the listening test.

In addition to the normal bar graphs (Figure 3), the codec and reference conditions are also collected into an X-Y line graph (Figure 5), where bullets mark the individual MOS results and an interpolated line connects the bullets where a scalable codec or codec family is shown. The MOS scale is shown on the left side of the graph and the bitrate along the bottom.
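A minimal sketch of how such an X-Y summary can be drawn from (bitrate, MOS) pairs per codec family, using matplotlib; the codec names and data points below are placeholders for illustration only, not results from the paper.

```python
import matplotlib.pyplot as plt

# Placeholder (bitrate in kbit/s, MOS) points per codec family.
families = {
    "Codec family A": [(8.0, 4.5), (16.0, 6.0), (32.0, 7.0)],
    "Codec family B": [(12.0, 5.0), (24.0, 6.5), (48.0, 7.5)],
}

for name, points in families.items():
    bitrates, mos = zip(*points)
    plt.plot(bitrates, mos, marker="o", label=name)  # bullets joined by an interpolated line

plt.xlabel("Bitrate (kbit/s)")
plt.ylabel("MOS (extended 1-9 scale)")
plt.legend()
plt.savefig("codec_family_mos.png")
```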

4. Skype's Silk voice codec

The Silk codec is currently used by the popular VoIP service Skype. It uses the technology described in [5]. The main parameters are a 20 ms frame size and 5 ms look-ahead, and the target bitrate can be set to any value between 6 and 40 kbit/s. Silk also supports redundant information, and up to five frames can be coded and packetized together. Packetization improves coding efficiency slightly but increases delay. Redundant information increases the bitrate but improves quality if there are frame losses. The complexity can also be adjusted; lighter computation naturally means slightly degraded quality.

The Silk codec's bitrate and signal bandwidth vary over time. For example, Figures 1 and 2 show that when the bitrate is set to either 16 kbit/s or 8 kbit/s, the bitrate starts from a much higher value and within a couple of tens of seconds converges to the requested average bitrate. Similarly, Silk uses a wider bandwidth at the beginning, before it notices that the requested bitrate does not allow such a wide bandwidth and reduces the bandwidth by internally resampling to a lower sampling frequency. The lowest frequency (the high-pass filtering cut-off) is also adapted according to the input signal properties. Due to this behaviour, we processed the test samples in such a way that a 30 second presignal with low-level background music was processed before the actual test sample (24 seconds). Thus the Silk codec had enough time to converge to the requested bitrate and select the bandwidth that was possible at that bitrate.
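A minimal sketch of this sample preparation, assuming the codec is driven as an external encode/decode tool: concatenate the 30 s presignal with the test sample, run the codec, and keep only the part of the decoded output that corresponds to the test sample. The command names silk_encoder and silk_decoder are placeholders, not the actual Silk tools used in the paper.

```python
import numpy as np
import soundfile as sf
import subprocess

def prepare_and_trim(presignal_wav, sample_wav, out_wav, fs=24000):
    """Prepend a presignal so the codec converges before the test sample starts,
    then cut the presignal portion away from the decoded output."""
    pre, fs_pre = sf.read(presignal_wav)
    test, fs_test = sf.read(sample_wav)
    assert fs_pre == fs_test == fs, "all inputs assumed to share one sampling rate"

    sf.write("concat.wav", np.concatenate([pre, test]), fs)

    # Placeholder encoder/decoder invocations; the real tool names and flags
    # are not taken from the paper.
    subprocess.run(["silk_encoder", "concat.wav", "concat.bit"], check=True)
    subprocess.run(["silk_decoder", "concat.bit", "decoded.wav"], check=True)

    decoded, _ = sf.read("decoded.wav")
    sf.write(out_wav, decoded[len(pre):], fs)  # drop the 30 s presignal portion
```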

The measured bitrates can be seen in Table 3, which lists several bitrates. The first column gives the bitrate that was requested from the command line. The second column gives the average bitrate over the tested sequence (the 30 second "initial" presignal was omitted from all measured bitrates). The third column gives the bitrate during active speech. The difference between the average and active bitrate is quite large with clean speech, but with noisy speech the difference is quite small. Compare e.g. Figures 1 and 2, where one can see that with clean speech the pauses between speech bursts are coded at a relatively low bitrate, whereas with noisy speech the background noise is coded at only a slightly reduced bitrate. It should be noted that, in addition to this behaviour, Silk also supports proper DTX, but that was not used in this listening test or analysis. The fourth column gives the maximum measured bitrate during the test sequence. Finally, the 90% bitrate is shown, meaning that 9 out of 10 frames are below this bitrate. This last bitrate is used in the listening test result Figure 5, since it tells approximately how much net bandwidth is needed in order to avoid excessive frame loss due to the variable bitrate.
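The statistics in Table 3 can be reproduced from per-frame bit counts roughly along these lines. The paper does not state how active-speech frames were identified, so the energy-threshold decision below is our own assumption; the 20 ms frame length comes from the Silk parameters given above.

```python
import numpy as np

def bitrate_stats(frame_bits, frame_energies, frame_ms=20, energy_threshold=1e-4):
    """frame_bits: bits spent on each 20 ms frame (presignal frames already removed).
    frame_energies: mean-square energy per frame, used only for a crude
    active-speech decision (assumed here, not specified in the paper)."""
    bits = np.asarray(frame_bits, dtype=float)
    rates = bits * 1000.0 / frame_ms              # per-frame bitrate in bit/s
    active = np.asarray(frame_energies) > energy_threshold
    return {
        "average": rates.mean(),                  # Table 3, column 2
        "active": rates[active].mean(),           # column 3: active speech only
        "maximum": rates.max(),                   # column 4
        "p90": np.percentile(rates, 90),          # column 5: 90% of frames fall below this
    }
```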

Figure 1. Silk spectrum and bitrate with a noisy speech header and a clean speech test signal, when the bitrate was set to 16 kbit/s with an SWB input signal. At 16 kbit/s the tested voice signal is still WB. (Plots: frequency in Hz vs. time in s, and bitrate in kbit/s vs. time in s.)

Requested  Average  Active   Maximum  90%
bitrate    bitrate  bitrate  bitrate  bitrate
 6000       5830     6440    12000     6800
 8000       7280     8250    13200     8800
12000      11080    12530    18400    13600
16000      14530    16570    24000    18000
20000      18190    20860    30000    22400
24000      21650    24930    37200    26800
32000      28820    33320    48400    36000
40000      35710    41300    58800    44600

Table 3. Silk bitrates (bit/s) measured after the presignal.

4.1. Silk codec's true bandwidth

The Silk codec supports, both internally and externally, sampling rates of 8, 12, 16, and 24 kHz and bitrates from 6 to 40 kbit/s [6]. The external sampling rate, however, does not tell the actual internally used bandwidth of the codec. Skype claims in [6] that wideband is supported from 7 kbit/s upwards and superwideband from 12 kbit/s. Based on our evaluation, the bitrates required for these bandwidths are significantly higher. For example, Figure 2 shows that at 8 kbit/s Silk provides only NB signal bandwidth. By iterating in 500 bit/s steps we found that with our speech database (including both noisy and clean speech) wideband is reached at 9.5 kbit/s and SWB requires 24.5 kbit/s. See Table 4 for a complete list.

Figure 4 shows how the frequency response widens with increasing bitrate. A typical WB bandwidth can be seen at 16 kbit/s. The full SWB 80-12 000 Hz bandwidth is reached when the bitrate is 32 kbit/s. Even at 24 kbit/s the highest frequencies from 10 kHz to 12 kHz are significantly attenuated. At 12 kbit/s, which is claimed to be the first SWB bitrate, the codec barely reaches a 6 kHz bandwidth, indicating that Silk is still internally running at a 12 kHz (MB) sampling rate.
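One way to estimate such an effective bandwidth from decoded output, roughly in the spirit of Figure 4, is to find the highest frequency whose long-term spectrum stays within a fixed margin of the mid-band level. The paper does not give its measurement procedure, so the Welch estimate and the -20 dB margin below are our own assumptions.

```python
import numpy as np
from scipy.signal import welch

def effective_bandwidth(x, fs, margin_db=20.0):
    """Estimate the highest frequency still within `margin_db` of the level
    around 1 kHz, as a crude proxy for the codec's effective audio bandwidth."""
    f, pxx = welch(x, fs=fs, nperseg=2048)
    pxx_db = 10 * np.log10(pxx + 1e-20)
    ref = pxx_db[np.argmin(np.abs(f - 1000.0))]   # reference level near 1 kHz
    above = np.where(pxx_db >= ref - margin_db)[0]
    return f[above[-1]] if len(above) else 0.0    # highest bin above the margin
```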

4.2. Silk Voice Quality

From Figure 3 the Silk voice quality can be compared to all the other tested codecs and the clean references. As can be seen, Silk performs quite well compared to the state-of-the-art AMR and AMR-WB.


Figure 3. Codecs' MOS scores shown as bars

Figure 2. Silk spectrum and bitrate with a noisy speech header and a cafeteria-noise speech test signal, when the bitrate was set to 8 kbit/s. At 8 kbit/s the tested voice signal is NB. (Plots: frequency in Hz vs. time in s, and bitrate in kbit/s vs. time in s.)

Especially at around 16 kbit/s and above, Silk is better than AMR-WB at comparable bitrates. This is due to the fact that Silk wideband is critically sampled up to 8 kHz instead of the ITU-T or 3GPP defined 7 kHz. This added bandwidth (from 7 to 8 kHz) shows up in the results in Silk's favour. Silk seems to provide quite artifact-free voice quality over the whole 16-24 kbit/s range with WB signals. At 32 and 40 kbit/s Silk is SWB and competes quite equally against G.718B and G.722.1C, although it has a slightly narrower bandwidth than the ITU-T standardized codecs.

Since the Silk bitrate varies from frame to frame, drawing the MOS values into the X-Y graph in Figure 5 at the average bitrate, or even the active-speech average bitrate (see Table 3), would not be fair to the other codecs, so we chose to plot Silk at its 90% bitrate. The 90% value indicates the bitrate below which 90% of the frames fall; at the same time it means that 10% of the frames have a higher bitrate, which may cause frame losses in low-bitrate channels.

Figure 4. Silk frequency response with a clean speech signal at bitrates of 8, 12, 16, 24, and 32 kbit/s (amplitude response vs. frequency in Hz).

5. CELT codec voice quality

The CELT codec is a low-delay generic audio codec [7]. It supports sampling rates from 32 kHz upwards. In this test we used only the 32 kHz sampling rate with a 512-sample frame (16 ms). The codec also requires half a frame (8 ms) of look-ahead. LTP was enabled, variable bitrate was disabled, and the complexity was left at its default value.
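As a quick sanity check of these figures, the frame duration and the resulting algorithmic delay follow directly from the frame size and sampling rate quoted above; buffering, packetization, and network delays are not included in this sketch.

```python
# CELT configuration used in the test: 32 kHz sampling, 512-sample frames.
fs = 32000            # sampling rate in Hz
frame_samples = 512   # samples per frame

frame_ms = 1000 * frame_samples / fs              # 16.0 ms frame duration
lookahead_ms = frame_ms / 2                       # half a frame of look-ahead -> 8.0 ms
algorithmic_delay_ms = frame_ms + lookahead_ms    # 24.0 ms total algorithmic delay

print(frame_ms, lookahead_ms, algorithmic_delay_ms)
```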

The results in Figures 3 and 5 show that at least 32 kbit/s is needed for usable quality. At that bitrate its voice quality is still behind ITU-T G.722.1C at 32 kbit/s and about the same as AMR-WB at 12.65 kbit/s or Silk at 12 kbit/s. At 40 kbit/s the voice quality is already very good and similar to G.718B at 32 kbit/s. For truly transparent quality 48 kbit/s or more is needed.


Figure 5. Silk voice quality with different operating points

Bandwidth             Internal sampling rate  Bitrate range in steady state
Narrowband (NB)       8 kHz                   5.0-9.0 kbit/s
Mediumband (MB)       12 kHz                  9.5-14.0 kbit/s
Wideband (WB)         16 kHz                  14.5-24.0 kbit/s
Superwideband (SWB)   24 kHz                  ≥ 24.5 kbit/s

Table 4. Silk internal sampling rates and supported signal bandwidths with different bitrates.

6. BroadVoice codec voice quality

Broadcom has also provided the BroadVoice codec for IETF standardization [8]. BroadVoice consists of two fixed bitrate codecs, BV16 and BV32, but they can be considered just two modes of the same codec, since most of the algorithms and source code are shared. The BroadVoice 16 kbit/s variant is a narrowband codec and the 32 kbit/s variant is wideband; they are described in more detail in [9] and [10]. Both codecs have a delay of 5 ms with no look-ahead, making them very low delay, and their complexities are 12 and 17 MIPS, respectively. From the results in Figures 3 and 5 it can be seen that both BroadVoice codecs provide reasonable NB and WB absolute voice quality. Unfortunately their bitrates are not competitive: the AMR 10.2 kbit/s mode provides quality similar to BroadVoice at 16 kbit/s, and AMR-WB at 12.65 kbit/s provides quality similar to BroadVoice at 32 kbit/s. However, BroadVoice is a good candidate for applications that require extremely low delay.

7. Conclusions

The Silk codec provides usable voice quality at quite competitive bitrates compared to AMR or AMR-WB. However, there are two things to consider when using the Silk codec. The first is its highly variable bitrate, which may cause problems depending on the transmission network and channel. The second is that the provided signal bandwidth also changes over time, and for wideband quality significantly higher bitrates than AMR-WB are needed (14.5 kbit/s on average, with the peak bitrate approaching 20 kbit/s). Silk's promise of SWB bandwidth at 12 kbit/s is also not valid; in practice a bitrate of 24.5 kbit/s or higher is needed. The CELT codec provides a good alternative to ITU-T G.722.1C or G.718B at a slightly higher bitrate. Finally, BroadVoice is not competitive bitrate-wise compared to AMR, AMR-WB, or Silk; however, due to its extremely low delay, it may have some special applications.

8. References

[1] A. Rämö and H. Toukomaa, "On comparing speech quality of various narrow- and wideband speech codecs," in Proc. of ISSPA, Sydney, Australia, 2005.
[2] A. Rämö, "Voice quality evaluation of various codecs," in Proc. of ICASSP, Dallas, TX, USA, March 2010.
[3] M. Kylliainen et al., "Compact high performance listening spaces," in Proc. of Euronoise, Italy, 2003.
[4] ITU-T Rec. P.800, "Methods for subjective determination of transmission quality," 1996.
[5] K. Vos, S. Jensen, and K. Soerensen, "SILK speech codec," IETF draft, 2010.
[6] "SILK data sheet," PDF document, https://developer.skype.com/silk, 2009.
[7] J.-M. Valin et al., "A high-quality speech and audio codec with less than 10 ms delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 58-67, January 2010.
[8] J.-H. Chen, W. Lee, and J. Thyssen, "RTP payload format for BroadVoice speech codecs," IETF RFC 4298, 2005.
[9] ANSI/SCTE 24-21, "BV16 speech codec specification for voice over IP applications in cable telephony," 2006.
[10] ANSI/SCTE 24-23, "BV32 speech codec specification for voice over IP applications in cable telephony," 2007.
