Application Note - Voice Quality With ITU P.863 'POLQA'_2012_07


A Rohde & Schwarz Company

Voice Quality with ITU-T P.863 POLQA

Application Note

July 2012

SwissQual License AG, Allmendweg 8, CH-4528 Zuchwil, Switzerland

t +41 32 686 65 65  f +41 32 686 65 66  e [email protected]

Part Number: 12-070-200912-4



SwissQual has made every effort to ensure that any instructions contained in the document are adequate and free of errors and omissions. SwissQual will, if necessary, explain issues which may not be covered by the documents. SwissQual's liability for any errors in the documents is limited to the correction of errors and the aforementioned advisory services.

Copyright 2000 - 2012 SwissQual AG. All rights reserved.

No part of this publication may be copied, distributed, transmitted, transcribed, stored in a retrieval system, or translated into any human or computer language without the prior written permission of SwissQual AG.

Confidential materials.

All information in this document is regarded as commercially valuable, protected and privileged intellectual property, and is provided under the terms of existing Non-Disclosure Agreements or as commercial-in-confidence material.

When you refer to a SwissQual technology or product, you must acknowledge the respective text or logo trademark somewhere in your text.

SwissQual, Seven.Five, SQuad, QualiPoc, NetQual, VQuad, Diversity as well as the following logos are registered trademarks of SwissQual AG.

Diversity Explorer, Diversity Ranger, Diversity Unattended, NiNA+, NiNA, NQAgent, NQComm, NQDI, NQTM, NQView, NQWeb, QPControl, QPView, QualiPoc Freerider, QualiPoc iQ, QualiPoc Mobile, QualiPoc Static, QualiWatch-M, QualiWatch-S, SystemInspector, TestManager, VMon, VQuad-HD are trademarks of SwissQual AG.

SwissQual acknowledges the following trademarks for company names and products:

Adobe, Adobe Acrobat, and Adobe Postscript are trademarks of Adobe Systems Incorporated.

Apple is a trademark of Apple Computer, Inc.

DIMENSION, LATITUDE, and OPTIPLEX are registered trademarks of Dell Inc.

ELEKTROBIT is a registered trademark of Elektrobit Group Plc.

Google is a registered trademark of Google Inc.

Intel, Intel Itanium, Intel Pentium, and Intel Xeon are trademarks or registered trademarks of Intel Corporation.

INTERNET EXPLORER, SMARTPHONE, TABLET are registered trademarks of Microsoft Corporation.

Java is a U.S. trademark of Sun Microsystems, Inc.

Linux is a registered trademark of Linus Torvalds.

Microsoft, Microsoft Windows, Microsoft Windows NT, and Windows Vista are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

NOKIA is a registered trademark of Nokia Corporation.

Oracle is a registered US trademark of Oracle Corporation, Redwood City, California.

SAMSUNG is a registered trademark of Samsung Corporation.

SIERRA WIRELESS is a registered trademark of Sierra Wireless, Inc.

TRIMBLE is a registered trademark of Trimble Navigation Limited.

U-BLOX is a registered trademark of u-blox Holding AG.

UNIX is a registered trademark of The Open Group.



Contents

Voice Quality with ITU-T P.863 POLQA

1 Introduction: What is P.863 POLQA?
    More complex telecommunication networks and handsets
    Demand for wideband audio transmission
    Narrowband vs. super-wideband tests

2 Technical Details of POLQA
    POLQA as a model of a subjective listening test
    POLQA in narrow-band and super-wideband mode
    POLQA internal processing steps
        Time Alignment
        Psycho-acoustic model
    POLQA prediction performance and typical results

3 Narrow-band Voice Quality measurements with P.863 POLQA in Diversity
    Idea of the narrowband test
    Speech reference signals for narrow-band tests
    What are the differences to the previous ITU P.862 PESQ?
    Test definition and result presentation

4 Wideband Voice Quality measurements with P.863 POLQA in Diversity
    Idea of the wideband test
    Wideband speech reference signals
    What are the differences to narrowband?
    Where wideband quality can be assessed
    Wideband analysis in Diversity

5 Real field measurements
    Results in GSM / UMTS networks compared to P.862 PESQ
    Results in real field networks compared to P.862 PESQ
    Sample dependency of P.863 POLQA scores in real field measurements
    Results in real field networks in super-wideband mode of P.863 POLQA

6 Conclusion



Figures

Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks
Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model
Figure 3: Basic scheme of the main components of P.863 POLQA
Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts
Figure 5: Illustration of assigned signal parts and the optimal path of signal correspondences
Figure 6: Example of an aligned pair of reference and degraded signal
Figure 7: Block-scheme of POLQA as in ITU-T P.863
Figure 8: Application of masking slopes to the Bark spectrum
Figure 9: Consideration of fully and partially masked spectral parts
Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking
Figure 11: Insertion and capturing in a speech test setup
Figure 12: IRS in send and receive direction as specified in ITU-T P.48
Figure 13: P.863 POLQA narrowband main result representation in NQDI
Figure 14: P.863 POLQA narrowband detail result representation in NQDI
Figure 15: P.863 POLQA test selection in NQDI
Figure 16: P.863 POLQA statistical report in MS EXCEL
Figure 17: P.863 POLQA wideband main result representation in NQDI
Figure 18: P.863 POLQA wideband audio bandwidth representation in NQDI
Figure 19: P.863 POLQA and P.862.1 PESQ presentation in NQDI
Figure 20: P.863 POLQA and P.862.1 PESQ presentation in NQDI with signal interruptions
Figure 21: Distribution of predicted MOS scores by P.862.1 PESQ
Figure 22: Distribution of predicted MOS scores by P.863 POLQA
Figure 23: Distribution of predicted MOS scores by P.863 POLQA SWB mode and NB mode
Figure 24: Distribution of predicted MOS scores by P.863 POLQA SWB mode in wideband networks

Tables

Table 1: Improvement in performance of P.863 POLQA to P.862 PESQ
Table 2: Typical predicted MOS-LQ values for common transmission techniques
Table 3: Typical P.863 POLQA scores for common transmission techniques
Table 4: Comparison of P.862.1 PESQ scores to P.863 POLQA in high qualitative UMTS/GSM setups
Table 5: Comparison of P.862.1 PESQ scores to P.863 POLQA in common real field setups
Table 6: Comparison of different speech samples in common real field setups
Table 7: Comparison of the NB and SWB mode of P.863 POLQA in common real field setups



1 Introduction: What is P.863 POLQA?

SwissQual has been driving the development of new objective perceptual quality prediction algorithms since it was founded in 2000. Immediately after SwissQual's foundation, its voice quality predictor SQuad was developed specifically to meet the requirements of mobile and Voice-over-IP scenarios. SQuad still forms the backbone of SwissQual's entire voice quality suite to this day. From the beginning it overcame disadvantages of ITU-T P.862 PESQ in these application areas. To keep up with the latest advancements in network and processing technologies, SQuad has been continuously maintained and improved over the years to deliver precise quality scores to the customer.

As early as 2005, ITU-T started a project to standardize a new objective voice quality model. This project, called P.OLQA, was intended to extend the scope of the existing ITU-T P.862 PESQ and to overcome its known problems and disadvantages.

The P.OLQA project was finalized in 2010 by a competition between six candidate models, including the latest SQuad algorithm. In a detailed analysis based on more than 45000 speech files, the SQuad algorithm was selected as one of the winning models that passed the challenging thresholds set by ITU-T.

Together with the two other selected models from Opticom and TNO, SQuad was integrated into a joint model, POLQA, which combines the strengths of the three underlying algorithms and now forms the new ITU-T P.863 POLQA, approved in January 2011.

SwissQual is one of the most active drivers of the development of objective measures in international standardization bodies. As a consequence, SwissQual leads the corresponding working group at ITU-T and has both initiated and set several standards in recent years, such as ITU-T P.563 (a no-reference voice quality measure), ITU-T J.341 (a full-reference measure for HDTV), ETSI TR 102 506 (a method for an Estimation of Quality per Call) and now the brand-new ITU-T P.863 POLQA.

POLQA is becoming an integral and central part of SwissQual's voice quality analysis suite and will be the recommended voice quality predictor for both narrowband and wideband speech.

The existing and widely introduced SQuad algorithm remains a part of Diversity and can still be used if desired. SQuad can still be combined with the previous ITU-T P.862 PESQ. This gives all customers the possibility to continue their ongoing measurement campaigns and to plan a transition to ITU-T P.863 POLQA on their own schedule.

More complex telecommunication networks and handsets

Telecommunication networks are increasingly equipped with highly non-linear components, and long distance calls usually pass through several such components and even through different networks.

Today, speech quality is no longer determined by the speech codec used or by lost frames alone. There is also an interaction with various other components that perform automatic signal level adjustments, smart loss concealment and similar strategies in order to increase intelligibility in critical situations. Unfortunately, these components are not used just once in a connection; there are usually several of them, potentially interfering with each other.

In addition, the standardization of speech codecs has progressed, and recently standardized coding schemes are now being integrated in the networks. These new schemes, such as EVRC and EVRC-B used in CDMA networks as well as wideband coding schemes such as AMR-WB and EVRC-WB, were considered from the very beginning of the development of the latest SQuad version and are now covered by POLQA as well. In addition to traditional schemes for voice source coding, audio compression methods (e.g. MP3, AAC) are increasingly being used in telecommunication services as well.

Besides more complex handsets for traditional telephony applications, new packet-based transmission technologies and multi-stage connections are becoming widespread. This leads to many new distortion types that did not appear in traditional quasi circuit-switched networks. We will see an increasing amount of time-warping and asynchronous re-sampling effects as well as non-exact replacements of missed packets or speech frames. All of this considerably changes the physical signal without necessarily affecting how its quality is perceived. Rating these types of signal distortions correctly is a clear shortcoming of PESQ and is now solved by POLQA.

Demand for wideband audio transmission

Telecom industries are now initiating the evolution from narrowband telephony to wideband speech transmission. The codecs for wideband are ready and approved by the standardization bodies, the handsets are not restricted in processing power and the core networks are being upgraded.

Of course, narrowband speech is the normal experience for telephone users and has been accepted for decades. As the mobile telephone becomes an increasingly multimedia-based device, the traditional telephone sound seems less and less acceptable. Consumer expectations are changing while, at the same time, increased processing power allows wider audio bandwidths. The standardization bodies provide the corresponding coding schemes and the core networks are being upgraded. The first step, wideband transmission up to 7000Hz, is already being overtaken by the emergence of so-called super-wideband transmission technology, which opens the band up to 14000Hz, and of course by the family of audio codecs including MP3 and AAC, which even allow transmission above the hearing threshold.

SwissQual already upgraded the entire audio processing chain in its products to address wideband audio transmission in 2009, and wideband test applications using the SQuad algorithm have been available in SwissQual products since early 2010.

SwissQual's Diversity platform was the first measurement platform for mobile network testing and benchmarking to integrate a real-time test of wideband speech. SwissQual has therefore been able to gain the most experience with wideband testing in the field and can use this know-how for a smooth and transparent integration of P.863 POLQA in its products.

Narrowband vs. super-wideband tests

P.863 POLQA offers the same two operational modes supported by SQuad: one for traditional narrowband telephony and one for super-wideband (which translates to no audio bandwidth limitation in practice).

Choosing the operational mode is less a matter of how the channel is set up. The question is rather which kind of reference the received signal is to be compared with and what is inserted into the channel.

For narrowband tests, the reference signal is a perfect signal in telephony bandwidth (in principle 300 to 3400Hz). This signal is also the one inserted into the channel. The resulting MOS prediction gives a quality that is relative to the narrowband signal. This means that a perfectly transmitted signal will be scored close to MOS = 5 (practically 4.5). A potential wideband or super-wideband capability of the channel is not tested, since the input signal is just narrowband.

This narrowband test case has been well defined for years; it is compatible with existing measures such as P.862 PESQ and SQuad in narrowband mode. The entire quality scale is used for these narrowband conditions.

In super-wideband mode, the reference signal is a perfect signal in almost full bandwidth (50 to 14000Hz). Here this full-bandwidth signal is inserted into the channel. The resulting MOS prediction gives a quality that is relative to the super-wideband signal. This means that a perfectly transmitted super-wideband (!) signal will be scored close to MOS = 5 (practically 4.75). If the actual channel is just narrowband, the super-wideband signal becomes limited in audio bandwidth by the channel, and the score relative to the super-wideband reference is lower than 5.0.

This super-wideband mode should be used in all scenarios where super-wideband or wideband systems are to be compared against each other or qualified against a narrowband network.



2 Technical Details of POLQA

POLQA as a model of a subjective listening test

ITU-T P.863 POLQA is a so-called full-reference model. The quality estimation is based on the comparison of the transmitted signal with the high quality reference signal.

[Figure 1 labels: a high quality speech signal is fed into the transmission channel; the full reference measurement receives the channel output and a copy of the high quality speech signal.]

Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks

POLQA follows the approach common to other measures such as SQuad. It compares the received and potentially degraded signal with an undistorted reference signal. This allows a very detailed and fine analysis of any kind of difference between the two signals. To account for human perception, a model of the listening device (e.g. a handset or headphones) is applied first. That way, the analysis uses the signal exactly as it would be heard through such a device.

Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model

The more important step, however, is the application of a psycho-acoustic model that transforms the signal into an internal sound representation, taking frequency and intensity warping and masking effects into account. In this internal representation, the differences between the degraded and reference speech are calculated. They correspond to the differences that would be perceptible in a direct subjective comparison. Since speech perception and recognition is more than just listening to sound stimuli, a cognitive model is the last step of the quality prediction. Here, individual distortions are weighted according to speech perception. For example, in the case of human voice, a listener is more tolerant of certain distortion types as long as they can be considered natural, even if they differ significantly from the reference signal.

POLQA predicts the voice quality as it is perceived in an ITU-T P.800 subjective listening only test (LOT). These are the most widely used listening tests in telecommunications. A listener scores the quality of a presented voice sample on a so-called Absolute Category Rating (ACR) scale from 1 (bad) to 5 (excellent). The listener does not compare the signal directly to a reference; he or she compares it to an internal reference, i.e. his or her expectation of how it should sound if it were perfect.
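As a minimal illustration of the ACR scale: the Mean Opinion Score (MOS) of a sample is the arithmetic mean of the individual listener votes. The function name `mean_opinion_score` and the example votes are made up for this sketch and do not come from any real listening test.

```python
def mean_opinion_score(votes):
    """Average ACR votes (1 = bad ... 5 = excellent) into a MOS."""
    if not votes:
        raise ValueError("at least one vote is required")
    for v in votes:
        if v not in (1, 2, 3, 4, 5):
            raise ValueError(f"not a valid ACR category: {v!r}")
    return sum(votes) / len(votes)


# Hypothetical votes of eight listeners for one speech sample
print(mean_opinion_score([4, 3, 4, 5, 4, 3, 4, 4]))
```

An objective model such as POLQA aims to predict this averaged value (reported as MOS-LQO) without running the listening test.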

[Figure 2 blocks: the reference signal and the distorted signal each pass through a model of the device (i.e. handset) and a psycho-acoustic model (frequency and intensity warping, masking); a distance / similarity measure and a cognitive model then yield the MOS-LQO.]



POLQA in narrow-band and super-wideband mode

Please note that POLQA has two different operational modes, one for narrowband and one for super-wideband. The structure of the model is the same; the main difference is the answer to the question: what is the reference?

In narrowband mode the reference signal is a narrowband (telephony band) signal. This simply means that all degradations are rated in relation to this reference. The assumed listening situation corresponds to narrowband telephony: an ordinarily shaped handset on one ear.

The band limitation of the telephony band itself is not considered a degradation, since the listener expects exactly that limited signal. Therefore POLQA compares to a telephony band signal too. In addition, POLQA in narrowband mode does not take into account distortions that lie outside the spectral range of a telephone handset; frequencies below 200Hz are usually not transmitted at all.

As a consequence, a signal that is only band-limited but otherwise undistorted is scored by POLQA with a high value in this context. A perfect narrow-band signal will receive a POLQA score of 4.5. This narrow-band mode of POLQA and its maximum value of 4.5 make it backwards compatible with ITU-T P.862 PESQ as well as with SQuad with respect to the use of the scale and the targeted telephony test scenario.

Additionally, POLQA supports super-wideband signals in a super-wideband operational mode. Here, of course, the reference signal that POLQA compares to is super-wideband. In this mode, POLQA scores like a human listener wearing headphones and expecting HiFi quality. Consequently, to get high scores with POLQA, the signal needs to be clean and almost unlimited in bandwidth. For an ideal super-wideband signal, POLQA gives a score of 4.75.

It is very important to understand that a narrowband telephony signal scored in super-wideband mode will get significantly lower scores than it would get in narrowband mode. This is logical: because the signal is compared to a super-wideband reference, all missing spectral components are considered distortions. It is the same as listening to telephony speech through headphones: the listener expects HiFi quality, but only telephony bandwidth is received.

ITU-T P.863 POLQA is the first model recommended by ITU-T for super-wideband speech. In the context of speech, super-wideband can even be considered "unlimited" speech, since there are no relevant speech components above 14000Hz.

An interesting intermediate case is traditional wideband, which is limited to 7000Hz. However, wideband is a term that has been used in different ways. The first trials with extended audio bandwidths started in the 1980s and opened the band up to 7000Hz (using a sampling frequency of 16kHz). The lower band limit was sometimes 50Hz, sometimes 100Hz. An early coding standard was based on an ADPCM scheme (ITU-T G.722) and remained untouched for many years. Now wideband speech transmission is coming to mobile networks, enabled by AMR-WB and EVRC-WB. There is still an upper limit of 7000Hz; the lower end, however, is limited only by the electro-acoustical components in the mobile phones.

These traditional wideband scenarios are scored by POLQA in its super-wideband mode as well. This means that the wideband signal is compared against a super-wideband reference. Consequently, the parts above 7000Hz are missing in the signal, leading to a measured degradation. However, the parts above 7000Hz contribute only to a lesser extent to the perceived speech quality. Therefore, an ideal signal that is merely limited at 7000Hz will be scored by POLQA at around 4.5, making it backwards compatible with ITU-T P.862.2 PESQ-WB.

It is important to note that a traditional wideband channel must be scored in POLQA super-wideband mode. It is not in line with ITU-T P.863 to use a 7000Hz reference signal.
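The mode rules of this section can be summarized in a small lookup. The following Python sketch is illustrative only: `PolqaMode` and `select_mode` are hypothetical names and not part of any SwissQual or ITU-T interface, and the decision rule is a simplification of the guidance above. The band limits and maximum scores are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolqaMode:
    name: str
    reference_band_hz: tuple  # (lower, upper) limit of the reference signal
    max_score: float          # score of an ideal signal in this mode


NARROWBAND = PolqaMode("narrowband", (300, 3400), 4.5)
SUPER_WIDEBAND = PolqaMode("super-wideband", (50, 14000), 4.75)

KNOWN_SYSTEMS = {"narrowband", "wideband", "super-wideband"}


def select_mode(systems_under_test):
    """Pick one operational mode for a set of systems to be compared.

    Simplified rule: as soon as a wideband or super-wideband system is
    involved, everything is scored against the super-wideband reference.
    There is deliberately no 7000Hz ("wideband") reference.
    """
    systems = {s.lower() for s in systems_under_test}
    unknown = systems - KNOWN_SYSTEMS
    if not systems or unknown:
        raise ValueError(f"unknown or empty system list: {sorted(unknown)}")
    if systems == {"narrowband"}:
        return NARROWBAND
    return SUPER_WIDEBAND


print(select_mode(["narrowband", "wideband"]).name)
```

Note that a purely narrowband campaign may still be scored in super-wideband mode if its results are later to be compared with wideband systems; the helper only encodes the default choice.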



POLQA internal processing steps

The following block scheme provides a brief overview of POLQA. It distinguishes between the time-alignment part and the psycho-acoustic perceptual and cognitive model.

[Figure 3 blocks: input (reference) and degraded output; space / time alignment; idealization; perceptual model (applied to both signals); internal representation of the ideal; internal representation of the output; difference in internal representation; cognitive model; quality.]

Figure 3: Basic scheme of the main components of P.863 POLQA

Time Alignment

Why does POLQA perform time alignment?

POLQA and other objective measures with the same basic structure compare the (spectral) short-term characteristics of the reference signal and the degraded signal frame by frame. The alignment marks corresponding sections in both signals. Only in this way can the correct frames be compared with each other.

What makes it challenging?

Aligning two signals is simple when the delay between them is constant and the transmission is linear. In that case, only an offset has to be compensated. Unsynchronized devices (clock drift) are more complicated: they lead to a steadily increasing or decreasing delay. Here the compensation is not constant, but at least it changes linearly over time. Even more challenging are processing components that transmit individual parts of the signal with different delays. These can lead to stretched or compressed speech pauses, but also to stretched or compressed speech parts. This stretching or compressing can be done while preserving the pitch or by simply "warping" the entire signal part.

In all these cases, each individual short frame of the degraded signal (usually 32ms in length) has to be assigned to a corresponding frame in the reference signal.
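The two simple cases described above can be sketched in a few lines of Python with NumPy. This is not the P.863 alignment procedure; it is a textbook approach (peak of the cross-correlation for a constant delay, a straight-line fit for clock drift), and all function names are hypothetical.

```python
import numpy as np


def estimate_constant_delay(reference, degraded):
    """Delay of `degraded` relative to `reference`, in samples."""
    corr = np.correlate(degraded, reference, mode="full")
    # In "full" mode, zero lag sits at index len(reference) - 1.
    return int(np.argmax(corr)) - (len(reference) - 1)


def fit_linear_drift(times_s, delays):
    """Fit delay(t) = offset + slope * t to delays measured at several times."""
    slope, offset = np.polyfit(times_s, delays, 1)
    return float(offset), float(slope)


def map_frames(num_degraded_samples, delay, sample_rate, frame_ms=32.0):
    """Pair each degraded frame start with its reference frame start.

    Assumes a constant delay and non-overlapping frames; frames that would
    start before the beginning of the reference are skipped. There is no
    upper-bound check against the reference length in this sketch.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    pairs = []
    for deg_start in range(0, num_degraded_samples - frame_len + 1, frame_len):
        ref_start = deg_start - delay
        if ref_start >= 0:
            pairs.append((deg_start, ref_start))
    return pairs
```

With variable delays inside one file, a single global estimate like this is no longer sufficient, which is why a per-part assignment is needed.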

How can it be done in a robust and fast way?

First, POLQA identifies signal parts where the delay can be assumed to be constant and flags them as landmarks. These parts can be of different lengths; in the simplest case one single part covers the entire signal (if the delay is constant over the entire file).



[Figure 4 labels: REFERENCE and PROCESSED signal; correspondence with confidence.]

Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts

In a second step, the areas between these landmarks are analyzed. For this purpose, the signal is subdivided further and further into a series of smaller parts. Each part has an assigned corresponding part in the other signal.

Each assigned signal part is given a value that rates the confidence of the assignment. In less confident areas a wider signal range is analyzed, whereas the assignment correspondences of parts with a high confidence are considered as fixed.

This approach allows a very efficient and robust search structure, since the search range becomes more and more restricted as more landmarks are set. The result is a kind of matrix with corresponding signal parts and associated search ranges.

Figure 5: Illustration of assigned signal parts and the optimal path of signal correspondences

A Viterbi-like algorithm then calculates the most likely path through this matrix and fixes the corresponding signal parts.
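The idea of such a path search can be sketched as a small dynamic program. The sketch below is not the P.863 algorithm: the data layout (a list of candidate delays per signal part, each with a confidence value) and the cost function (sum of confidences minus a penalty for delay jumps between neighbouring parts) are assumptions chosen for illustration.

```python
def best_delay_path(candidates, confidences, jump_penalty=0.01):
    """Choose one delay per signal part with a Viterbi-style search.

    candidates[i][j]  -- j-th candidate delay (in samples) for part i
    confidences[i][j] -- confidence of that candidate (higher is better)
    """
    score = [list(confidences[0])]  # best accumulated score per candidate
    back = []                       # back-pointers to the previous part
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for j, delay in enumerate(candidates[i]):
            # Accumulated score when arriving from each previous candidate
            arrivals = [
                score[-1][k] - jump_penalty * abs(delay - prev_delay)
                for k, prev_delay in enumerate(candidates[i - 1])
            ]
            best_k = max(range(len(arrivals)), key=arrivals.__getitem__)
            row.append(arrivals[best_k] + confidences[i][j])
            ptr.append(best_k)
        score.append(row)
        back.append(ptr)

    # Trace the best path backwards from the last part
    j = max(range(len(score[-1])), key=score[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


# Hypothetical example: the second part has a slightly more confident
# candidate at 50 samples, but the path through 10 and 12 is more consistent.
delays = best_delay_path(
    candidates=[[10], [10, 50], [12, 50], [12]],
    confidences=[[1.0], [0.6, 0.7], [0.9, 0.2], [1.0]],
)
print(delays)
```

Landmarks with high confidence correspond to parts with a single candidate, which is what keeps the search range small.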

The end result of the time alignment step is a correspondence table with the start and end times of each signal part and its correspondence in the reference. Parts of the degraded signal with no correspondence in the reference (i.e. inserted or added parts), as well as parts of the reference signal that are missing in the degraded signal, are marked as well. The following signal graph illustrates a practical example. The upper graph shows the (complete) reference signal, the lower graph shows the received and degraded signal.

Figure 6: Example of an aligned pair of reference and degraded signal



The green areas denote signal parts assigned with high confidence; the blue ones are those with lower confidence. The red signal part indicates a part of the reference signal that was lost during transmission and is no longer present in the degraded signal. Unassigned silent parts (white) are not used for direct comparison but rather to analyze how annoying the noise floor in these parts is.


Psycho-acoustic model

Like all models that share the basic approach of POLQA, the psycho-acoustic model starts with a global level alignment followed by a frame-wise spectral analysis of overlapping frames. As is usual in these models, a short-term level scaling is applied as well, and a cosine-based window followed by an FFT converts the audio signal from the time domain to the spectral domain.
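A minimal sketch of these first steps in Python with NumPy is given below. The Hann window (one common cosine-based window), the 32ms frame length, the 50% overlap and the RMS-based level scaling are assumptions made for illustration; the exact parameters used by P.863 are not stated here. The function names are hypothetical.

```python
import numpy as np


def scale_to_rms(signal, target_rms):
    """Global level alignment: scale the whole signal to a target RMS."""
    rms = np.sqrt(np.mean(np.square(signal)))
    if rms == 0.0:
        return np.array(signal, dtype=float)  # silent input stays unchanged
    return np.asarray(signal, dtype=float) * (target_rms / rms)


def frame_power_spectra(signal, sample_rate, frame_ms=32.0, overlap=0.5):
    """Power spectrum of each overlapping, Hann-windowed frame.

    Returns an array of shape (num_frames, frame_len // 2 + 1).
    The signal must be at least one frame long.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = max(1, int(frame_len * (1.0 - overlap)))
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2
```

For a one-second signal sampled at 8kHz this yields 256-sample frames with a hop of 128 samples, and each spectrum has a resolution of 31.25Hz per bin.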

The block scheme of the POLQA psycho-acoustic model is shown in the figure below.

[Figure 7 blocks, as far as recoverable: input (reference) and output (degraded); idealization; scaling towards degraded; scaling towards playback level; windowed FFT; frequency warping to pitch scale; frequency response compensation; noise suppression; partial, local and global scaling; masking; intensity warping to loudness scale; frequency response, noise and reverb estimation; perceptual subtraction; asymmetry processing; Lp time integration; disturbance indicators D and Da (disturbances in speech); FRQ, NOI and RVB indicators (FRQ: spectral shaping, band limitation; NOI: stationary and switched noises; RVB: room reverberations); cognitive model (combination of individual indicators, training on subjective reference scores, mapping into MOS scale); predicted listening quality MOS-LQO.]

Figure 7: Block-scheme of POLQA as in ITU-T P.863

The basic approach of the psycho-acoustic model, i.e. the use of critical bands and loudness compression, looks similar to well-known state-of-the-art models.



However, there are three parts that make P.863 POLQA different from established standards such as P.862 PESQ.

Removing / Reduction of individual distortion types and separate consideration of them

Idealization of the reference signal

Sharpened loudness spectra

Removing / Reduction of individual distortion types

It is too simple to assume that every difference in the signal, after transformation to an internal psycho-acoustic representation, will be treated correctly by this purely physiological view. Some kinds of distortions are not well covered by the established psycho-acoustic models. Examples are certain linear distortions, such as so-called frequency responses (leading to a colored or shaped spectral distribution), echoes and reverberations, as well as strong additive background noises.

The reason for these shortcomings becomes clearer when we look at how psycho-acoustic models weredesigned and evaluated. These models mainly describe the spectral integration into critical bands (due to the

so-called frequency-to-place transformation on the basilar membrane in the human ear), the sensitivity indifferent spectral areas, the non-linear perception of intensity as well as spectral and temporal maskingeffects. These models were widely developed and validated with signals like sine waves and narrow-bandnoises. They do not include any assumptions about speech recognition. For example an echo, creating adistortion due to slightly delayed repetition of the same speech, cannot be distinguished from a pure noise ofthe same intensity using this approach.

Therefore, it makes sense to detect and quantify those distortion types prior to the application of the psycho-acoustic model. The long-term frequency response is calculated and compensated in the signals. An indicator FRQ is calculated separately and considered in the final MOS prediction. The same applies to background noises. They are measured and widely removed from the signal. The amount of noise is later considered through a NOI indicator. In a similar way, echoes and reverberations are calculated for correction of the final predicted MOS.

By applying these corrections, the signal is now much closer to a signal to which the psycho-acoustic model can be applied. It has been freed of spectral shaping and strong noises.
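The idea of measuring a linear distortion separately and then compensating it can be illustrated with a small sketch. It is not the P.863 algorithm: the band layout, the choice to shape the reference, and the `frq_indicator` formula are assumptions made purely for illustration.

```python
import math

def long_term_band_power(frames):
    """Average the power in each band over all frames."""
    n_bands = len(frames[0])
    return [sum(f[b] for f in frames) / len(frames) for b in range(n_bands)]

def estimate_response_db(ref_frames, deg_frames, floor=1e-12):
    """Per-band long-term gain of the degraded signal over the reference (dB)."""
    ref_avg = long_term_band_power(ref_frames)
    deg_avg = long_term_band_power(deg_frames)
    return [10.0 * math.log10((d + floor) / (r + floor))
            for r, d in zip(ref_avg, deg_avg)]

def compensate_reference(ref_frames, response_db):
    """Shape the reference with the estimated response, so that the linear
    distortion no longer appears as a frame-by-frame difference."""
    gains = [10.0 ** (g / 10.0) for g in response_db]
    return [[p * g for p, g in zip(frame, gains)] for frame in ref_frames]

def frq_indicator(response_db):
    """Hypothetical FRQ-like value: mean absolute deviation (dB) of the
    response from its own average, i.e. how far it is from being flat."""
    mean_db = sum(response_db) / len(response_db)
    return sum(abs(g - mean_db) for g in response_db) / len(response_db)

# Toy data: two frames, three bands; the top band is attenuated by about 6 dB.
ref = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
deg = [[1.0, 1.0, 0.25], [2.0, 2.0, 0.5]]
resp = estimate_response_db(ref, deg)
comp = compensate_reference(ref, resp)
frq = frq_indicator(resp)
```

After compensation the shaped reference matches the degraded band powers, so the spectral shaping is reported only through the separate indicator instead of being counted again in every frame.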

Idealization of the reference signal

A truly new part compared to existing and established methods is a so-called idealization of the reference signal. The idea is to remove slight distortions such as noises and to align the spectral shape and timbre towards an ideal. This makes sense, since a listener in a scoring situation, as in a P.800 ACR test, does not compare the degraded signal to the input signal (the actually used reference), but rather to a conception of how that talker should sound. This step is modeled for the first time in P.863 POLQA.

Common objective models compare the signal to be scored with a reference and weight all (perceptually relevant) differences as distortion. As a consequence, if the signal for scoring is identical to the reference (totally transparent transmission or just a copy of the reference signal), no differences will be found and the predicted MOS is at the maximum (e.g. 4.5 for narrow-band).

POLQA is different. The internal representation of the ideal reference signal is not equal to the internal representation of the signal used as reference. This means that a non-optimal reference, e.g. one having a low noise floor, will have that noise removed. If POLQA gets this (noisy) reference signal for scoring, it compares that signal with the internally calculated ideal and may provide a score lower than the maximum. POLQA is not a model that rates differences between two signals; it rates an absolute quality and uses the idealized reference as an upper bound. Absolute quality is the difference to an imagined or expected ideal, just like in subjective absolute category rating tests.

This may be somewhat surprising for a technical user, but POLQA is simply a consistent model of the subjective listening test. In a listening test too, listeners will never give a very high score to a signal that is a bit noisy.



Sharpened loudness spectra

The usual approach for transforming a signal to an internal representation is:

1. Application of a time-to-frequency transformation by estimation of a short-term spectral power density. This can be done by a filter-bank or a Fourier transformation. POLQA uses an FFT approach after applying a Hann window. Window length is 32 ms with 50% overlap.

2. Subdivision of the spectrum into bands. This is usually motivated by the (in principle) logarithmic perception of frequency. This may be done in a simplified manner using 3rd-octave bands or, as is more common and used in POLQA too, using critical bands according to Zwicker. At the end of this aggregation, the hearing spectrum is sub-divided into 24 bands with increasing bandwidth towards higher frequencies. For each band a power or intensity value is computed. This frequency scale using critical bands is known as the Bark scale.

3. The intensity is transformed to a perceived loudness scale. Basically, the intensity is compressed at higher sound intensities, similarly to a decibel scale. In addition, the varying sensitivity at different frequencies is taken into account. Intensities below the hearing threshold are discarded as well. This loudness scale is known as the Sone scale.
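The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not the P.863 implementation: the Bark conversion uses the widely quoted Zwicker-Terhardt approximation, the compression exponent of 0.23 and the threshold value are assumptions, and the frequency-dependent sensitivity weighting is omitted.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def hz_to_bark(f):
    """Zwicker-Terhardt approximation of the critical-band rate."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def split_frames(signal, length, hop):
    """Split the signal into overlapping frames of equal length."""
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, hop)]

def bark_power(frame, fs, window, n_bands=24):
    """Steps 1 and 2: windowed FFT power, summed into 1-Bark-wide bands."""
    spec = fft([s * w for s, w in zip(frame, window)])
    bands = [0.0] * n_bands
    for k in range(1, len(frame) // 2):          # skip DC
        band = min(int(hz_to_bark(k * fs / len(frame))), n_bands - 1)
        bands[band] += abs(spec[k]) ** 2
    return bands

def loudness(bands, threshold=1e-6, exponent=0.23):
    """Step 3: discard sub-threshold power, compress the rest (power law)."""
    return [(p ** exponent if p > threshold else 0.0) for p in bands]

fs = 8000                    # narrowband sampling rate (assumption)
n = 256                      # 32 ms at 8 kHz
tone = [math.sin(2.0 * math.pi * 1000.0 * i / fs) for i in range(fs // 10)]
win = hann(n)
spectra = [loudness(bark_power(f, fs, win))
           for f in split_frames(tone, n, n // 2)]    # 50% overlap
loudest_band = max(range(24), key=lambda b: spectra[0][b])
```

For the 1 kHz test tone the energy falls into the band with index 8, which matches a critical-band rate of roughly 8.5 Bark at that frequency.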

Of course, all objective models following this approach will apply the standard range of signal compensations in addition to the plain psycho-acoustic transformation. These compensations include further and individual short-term level scaling, spectral compensation and weighting functions. However, the psycho-acoustic models almost always follow the approach outlined above.

Common models such as P.862 PESQ apply the spectral masking thresholds directly to this internal representation. The result is a so-called smeared spectrum. In principle, this models the self-masking effects of the signal. That means that quieter parts are masked by louder parts at neighboring frequencies. This effect is widely used and described for audio coding, as in e.g. MP3, where masked parts of the signal are considered redundant and are not transmitted. Furthermore, the quantization noise is shaped so as to be masked by the signal and thus not to add perceptible distortions.

In POLQA this approach was revised, since we are less interested in the self-masking effects of the signal than in the perception of remaining (or unmasked) differences between two signals. The chosen approach can be thought of as a sharpened loudness spectrum, in contrast to a smeared loudness spectrum.

In a first step the masking slopes are calculated (Figure 8):

Figure 8: Application of masking slopes to the Bark spectrum

The second step consists of analyzing which masking slopes mask other parts of the spectrum, either fully or partially (Figure 9):


Figure 9: Consideration of fully and partially masked spectral parts

In a third step the fully masked spectral parts are removed and the partially masked parts are reduced in their loudness (Figure 10):

Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking

Finally, we get a loudness spectrum that represents the individual spectral parts as they contribute to perception. This means that fully masked parts are taken out, while partially masked parts are attenuated.

These modified spectra of the reference and the degraded signal are then compared, and differences are considered as perceptible differences. The big advantage of the sharpened approach is that the high resolution of the spectrum is retained. This allows a high spectral resolution in the analysis, as required e.g. for a valid qualitative assessment of the reproduction of fine spectral structures in upper bands by compression algorithms.
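The three sharpening steps can be mimicked with a toy example. The slope values, the offset, the width of the partial-masking zone and the attenuation rule below are illustrative assumptions; the note does not give the values used in P.863.

```python
def masking_threshold(levels, i, down_slope=27.0, up_slope=10.0, offset=6.0):
    """Step 1: highest masking slope that reaches band i from any other band.

    A masker in band j starts `offset` dB below its own level and falls off
    with `up_slope` dB per Bark towards higher bands and `down_slope` dB per
    Bark towards lower bands (all values are illustrative).
    """
    best = float("-inf")
    for j, level in enumerate(levels):
        if j == i:
            continue
        slope = up_slope if i > j else down_slope
        best = max(best, level - offset - slope * abs(i - j))
    return best

def sharpen(levels, partial_range=10.0):
    """Steps 2 and 3: classify each band, then remove or attenuate it."""
    out, labels = [], []
    for i, level in enumerate(levels):
        thr = masking_threshold(levels, i)
        if level <= thr:
            out.append(0.0)                       # fully masked: removed
            labels.append("masked")
        elif level < thr + partial_range:
            out.append(level * (level - thr) / partial_range)
            labels.append("partially masked")     # attenuated
        else:
            out.append(level)                     # passes unchanged
            labels.append("unmasked")
    return out, labels

# Five neighboring bands with a strong component in the second band.
levels = [20.0, 60.0, 48.0, 30.0, 5.0]
sharp, labels = sharpen(levels)
```

In this example only the dominant band survives unchanged, its upper neighbor is attenuated, and the remaining bands are removed, so the resulting spectrum keeps its fine structure instead of being smeared.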

Aggregation and Cognitive Effects

The steps above are performed for a short-term spectral comparison across all frames in the speech signal. At the end of this main loop across all frames, we have a quality indicator for each frame window. This quality indicator is based on the differences of the short-term representations of the reference and degraded signal. It is dimensionless, but may be imagined as a kind of signal similarity over time.

These individual quality indicators represent the quality at a certain point in the audio signal. They are aggregated over all frames. This aggregation is not a plain averaging; it contains non-linear weightings and slopes in time. These aggregated quality indicators are then weighted and corrected using the previously calculated descriptors for spectral shaping (FRQ), additive noises (NOI) and echoes and reverberations (RVB).

Finally, the aggregated overall quality indicator is mapped to the MOS scale. In the narrowband operational mode the indicator is mapped to a range from 1.0 to 4.5, in case of super-wideband to a range of 1.0 to 4.75. The upper bound represents the typical maximal MOS obtained in subjective listening tests.
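A non-linear aggregation and a bounded mapping can be sketched as follows. The Lp average reflects the "Lp time integration" block of Figure 7; the exponent values, the exponential mapping and its `steepness` parameter are hypothetical and only show how the stated ranges of 1.0 to 4.5 and 1.0 to 4.75 can be enforced.

```python
import math

def lp_average(values, p):
    """Lp average over frames: a larger p gives more weight to the worst frames."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)

# Upper bounds of the MOS scale per operational mode, as stated in the text.
MOS_MAX = {"narrowband": 4.5, "super-wideband": 4.75}

def map_to_mos(disturbance, mode, steepness=1.0):
    """Hypothetical monotonic mapping: zero disturbance yields the upper
    bound of the mode, a very large disturbance approaches 1.0."""
    top = MOS_MAX[mode]
    return 1.0 + (top - 1.0) * math.exp(-steepness * disturbance)

# Per-frame disturbance values: mostly clean, with one badly disturbed frame.
frame_disturbance = [0.1, 0.1, 0.1, 2.0]
plain = lp_average(frame_disturbance, 1)      # ordinary mean
weighted = lp_average(frame_disturbance, 4)   # emphasizes the bad frame
mos_nb = map_to_mos(weighted, "narrowband")
mos_swb = map_to_mos(weighted, "super-wideband")
```

With p = 4 the single disturbed frame dominates the result, which illustrates why the aggregation is described as more than a plain average.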


POLQA prediction performance and typical results

The main trigger for the development of ITU-T P.863 was that the existing P.862 PESQ did not cover today's quality variations in telecommunication networks. This decreasing accuracy of the prediction performance of P.862 PESQ required an evolved voice quality prediction model to cope with:

New types of speech codecs and codecs not yet used in telecommunications, e.g. audio codecs

Enhanced frame loss concealment techniques

Voice Quality Enhancement (VQE) systems, non-linear processing for increasing intelligibility

Re-sampling, time-warping

In addition, the P.863 development was intended to extend the scope of P.862, mainly by:

Extension to super-wideband (50 to 14000 Hz)

Qualitative prediction of intermediate bandwidth, changes in audio bandwidth, bandwidth extension

Acoustical interfaces, echoes, reverberations

Sound presentation level

Due to the wide scope of P.863, the development and evaluation required a huge amount of test data. Test data means speech samples covering this variety of degradations, scored by human listeners in defined subjective experiments. In the end, a total of 62 subjectively scored data sets containing more than 45000 voice samples were used for the evaluation of P.863 POLQA.

These data sets¹ were used for calculating the prediction performance by means of residual square errors or correlation coefficients. The residual square error or, as in earlier times, Pearson's correlation coefficient is the indicator for the accuracy of the objective measure; it is given by the remaining prediction error relative to the true scores obtained in the subjective tests. These values give an overview of the performance in general. However, the numbers actually reached depend on the construction of the data set and the kind of conditions it contains. There are always test conditions that a model can predict easily and accurately (e.g. noises, waveform codecs and so on) and others where the deviation is higher (usually combinations of distortions). The occurrence of such conditions in a data set has a strong influence on these figures. This is due not only to the objective prediction method but also to uncertainties of the listeners in the auditory tests.

For the P.863 POLQA evaluation, ITU-T chose a statistical approach that is based on an r.m.s.e. calculation but takes the uncertainty of the subjectively derived MOS values into account. Based on these figures, the performance of P.863 POLQA was compared to that of P.862.1 and P.862.2 PESQ.
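One common way to take the listener uncertainty into account is an epsilon-insensitive r.m.s.e., in which prediction errors that fall inside the 95% confidence interval of the subjective MOS count as zero. The sketch below follows that idea; the exact formula used in the P.863 evaluation is not given in this note, so the function, the `dof` parameter and the sample values are assumptions.

```python
import math

def rmse_star(mos_subjective, mos_predicted, ci95, dof=1):
    """Epsilon-insensitive r.m.s.e. per condition.

    Only the part of each absolute error that exceeds the confidence
    interval of the subjective score contributes. `dof` is the number of
    degrees of freedom removed from the denominator (assumed to be 1 here).
    """
    if not (len(mos_subjective) == len(mos_predicted) == len(ci95)):
        raise ValueError("all inputs must have the same length")
    n = len(mos_subjective)
    if n <= dof:
        raise ValueError("not enough conditions for the chosen dof")
    squared = [max(0.0, abs(s - p) - c) ** 2
               for s, p, c in zip(mos_subjective, mos_predicted, ci95)]
    return math.sqrt(sum(squared) / (n - dof))

# Made-up scores for four test conditions.
subjective = [4.2, 3.1, 2.0, 1.5]
predicted = [4.0, 3.3, 2.6, 1.5]
confidence = [0.25, 0.15, 0.20, 0.10]
error = rmse_star(subjective, predicted, confidence)
```

In the example, the first and last conditions contribute nothing because the prediction lies within the confidence interval of the listeners' score.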

Table 1: Improvement in performance of P.863 POLQA to P.862 PESQ

| rmse* | P.862.1 'PESQ' | P.863 'POLQA' in NB mode | Improvement by |
|---|---|---|---|
| Classical narrowband exp. | 0.157 | 0.123 | 22% |
| Advanced narrowband exp. | 0.227 | 0.154 | 32% |

| rmse* | P.862.2 'PESQ-WB' | P.863 'POLQA' in SWB mode | Improvement by |
|---|---|---|---|
| Wideband experiments | 0.345 | 0.150 | 57% |
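The improvement figures in Table 1 are consistent with the relative reduction of rmse*. The short check below recomputes them from the table values; the function name is only illustrative.

```python
def improvement_percent(rmse_old, rmse_new):
    """Relative reduction of the prediction error in percent."""
    return 100.0 * (rmse_old - rmse_new) / rmse_old

# (rmse* of P.862, rmse* of P.863) as listed in Table 1
table_1 = {
    "Classical narrowband exp.": (0.157, 0.123),
    "Advanced narrowband exp.": (0.227, 0.154),
    "Wideband experiments": (0.345, 0.150),
}
improvements = {name: round(improvement_percent(old, new))
                for name, (old, new) in table_1.items()}
```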

¹ A data set, also often called experiment or database, is a set of speech files processed or transmitted under different real field or simulated conditions and scored subjectively. A data set usually consists of about 200 individual speech samples. The prediction accuracy is calculated by comparison of the MOS scores given by the listeners and the prediction by the objective measure, e.g. P.863 POLQA.



The so-called classical set of narrowband experiments covers 22 data sets already used in ITU-T for standardization efforts from the mid-90s until about 2003. They contain common codec and noise distortions, mobile channels of the 2nd and 3rd generation as well as VoIP as it was state of the art at the millennium. Even though these databases cover distortions that were already used during the development of P.862 PESQ, the new method P.863 POLQA shows even higher prediction accuracy here.

The advanced set of narrowband experiments is more focused on the latest coding technologies, frame loss concealment strategies, noise reduction and of course 3rd and 4th generation mobile as well as the newest VoIP implementations. This set is based on 15 data sets. The improvement reached with the new method P.863 POLQA is evident. This set covers a wide range of test conditions of the latest technologies for which P.863 was designed.

Finally, there was a set of common wideband data as well. It covers 7 different data sets. Here the improvement over P.862.2 PESQ-WB is extremely high.



3 Narrow-band Voice Quality measurements with P.863 'POLQA' in Diversity

Idea of the narrowband test

The idea of a narrowband test is a test situation in which a listener listens to a speech signal using a conventionally shaped telephone handset. That means he or she is restricted to the telephony bandwidth and expects such a band-limited signal as well. As a consequence, a perfect sounding but band-limited signal will get a high score, since it exactly matches that listener's expectation of excellent quality in this context. Even if a channel has a wider audio bandwidth, the listener would not experience this, since the handset limits the transfer.

Therefore, a narrowband speech test, independent of whether it is performed with P.863 POLQA, with P.862 PESQ or with SQuad in narrowband mode, always models a conventional narrowband telephony situation.

This test approach has been very commonly used over the last years or even decades and is perfectly suited for characterization of narrowband networks and systems. However, no qualification relative to wideband or super-wideband systems is possible.

Of course, neither P.863 POLQA nor SQuad listens to the speech signal using a handset. The handset transfer characteristic is modeled in the algorithms themselves. That means only spectral parts that are perceptible via such a handset will be analyzed by the quality predictor. Both P.863 POLQA and SQuad analyze signals as they are recorded at an electrical interface to the network. This can either be an ISDN or PSTN line but also a headphone connector of a mobile device. The specification of such a receiving handset is called IRS receive (IRS: Intermediate Reference System) and is described in ITU-T P.48. The IRS receive characteristic can be imagined as a weak telephone band-pass with a slight preference for higher frequencies towards 3 kHz.

[Figure 11: Insertion and capturing in a speech test setup (SPL: sound pressure level, OVL: relative to overload point). Blocks shown: reference speech signal, model of microphone, electrical interfaces, network / real channel, model of handset (for the captured signal and for the copy of the reference speech signal) and the psycho-acoustic model of the MOS predictor (i.e. POLQA). Levels shown: 89 dB(A) SPL, -26 dB ovl, -26 dB ovl, 79 dB(A) SPL. Reference circuit with the assumption of a channel gain of 0 dB and an Overall Loudness Rating of 10 dB.]

Similar to this, the sending direction is modeled in this narrow-band setup as well. The source speech signal is inserted into an electrical interface, either a PSTN or ISDN line or the microphone input of a mobile device. In reality, at this point the signal has already passed the microphone and some voice processing components. To emulate this part of the signal path, a model of a typical narrowband microphone is applied. This is called IRS send, since it models the device in sending direction. It can also be imagined as a weak telephony band-pass but with a quite strong pre-emphasis up to 3 kHz. This makes the speech sound a bit sharp but with higher intelligibility in background noise situations.²

Figure 11 schematizes the idea behind a narrow-band test. The modeled sending device allows a direct electric coupling to the channel under test and guarantees reproducible results independent of the microphone actually used. The frequency responses for the two filters modeling the device are given in Figure 12. It is clearly visible that there is a bandwidth limitation to the telephony band, although a slightly wider band than just 300 to 3400 Hz can pass.

[Figure 12: IRS in send and receive direction as specified in ITU-T P.48. Two panels: IRS send direction and IRS rcv direction; x-axis f / Hz (0 to 4000), y-axis a / dB (-30 to 10).]

While for ISDN and PSTN interfaces defined level and impedance requirements are given and fulfilled by the interface devices, for mobile phones only the headset connector is available as a proprietary interface. SwissQual's connector interface for mobile phones is adjusted for this type of interface. It applies the correct level, adjusts the frequency response and matches the impedance of each individual phone type, and it enables a quasi-standard electrical network termination point even for mobile handsets.

Speech reference signals for narrow-band tests

ITU-T P.863 POLQA is a so-called full-reference model. Its basic approach is the one common to such measures, e.g. SQuad or P.862 PESQ: it compares the received and potentially degraded signal with an undistorted reference signal. This allows a very detailed and fine analysis of any kind of difference between the two signals. To account for human perception, a model of the listening device (in case of narrow-band a handset) is first applied in the model itself. This way, the signal is compared exactly as it would be heard using such a device. In a narrow-band test case the signal is compared to an optimal narrow-band reference.

ITU-T P.800 and P.862.3 give constraints and requirements for the speech samples to be used. These mainly concern the temporal structure and signal level of the speech signal. SwissQual's measurement systems provide a set of speech samples in different languages. All speech samples follow the same rules for composition and pre-processing; they are all composed of a meaningful female and a male sentence. The reference speech file is 6 s in length and contains more than 3.5 s of active speech. There is a speech pause between the two sentences as required by P.862.3 and P.863. The signal is adjusted to a speech r.m.s. level of -26 dB rel. OVL³, which corresponds to an analogue level of -20 dBm at a 600 Ohm four-wire interface.
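How a level of -26 dB relative to the overload point relates to 16-bit sample values can be shown with a short sketch. It uses the plain r.m.s. over the whole signal, whereas speech levels are normally measured over active speech only, so this is a simplification; the function names and the 440 Hz test tone are assumptions.

```python
import math

OVERLOAD = 32768.0   # overload point of 16-bit PCM

def rms_level_dbov(samples):
    """R.m.s. level relative to the overload point, in dB (signal must not be silent)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / OVERLOAD)

def scale_to_level(samples, target_dbov=-26.0):
    """Apply one gain factor so that the r.m.s. level meets the target."""
    gain = 10.0 ** ((target_dbov - rms_level_dbov(samples)) / 20.0)
    return [max(-32768, min(32767, round(s * gain))) for s in samples]

# One second of a 440 Hz tone at 8 kHz sampling rate, amplitude 10000.
tone = [round(10000 * math.sin(2.0 * math.pi * 440.0 * i / 8000))
        for i in range(8000)]
before = rms_level_dbov(tone)      # about -13.3 dB rel. OVL
scaled = scale_to_level(tone)
after = rms_level_dbov(scaled)     # close to -26 dB rel. OVL
```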

In addition, P.862.3 defines the insertion and capturing process for the speech signal. These definitions describe the insertion point and the expected spectral characteristics. In narrow-band channels the signals are inserted after a pre-filtering according to the IRS send characteristic as shown and explained above.

² This characteristic is taken from older carbon microphones: the pre-emphasis should compensate for the low-pass characteristic of the inductively loaded analogue lines at that time.

³ The value of -26 dB relates to an overload point of 32767/-32768 as is used in 16-bit resolution in the digital signal domain.



What are the differences to the previous ITU P.862 PESQ?

Actually, the differences are quite small for common applications in cellular networks. A customer may only see a slightly changed MOS-LQ value for error-free or high-quality transmission using EFR or AMR with higher bitrates. Instead of a typical value in the range of 4.0 for EFR, they may now obtain values in the range of 3.9. This is mainly due to the fact that the actual bandwidth limitation in the narrowband channel is also considered by P.863 POLQA, i.e. limitations relative to the IRS receive frequency response. In case the actually used bandwidth is slightly narrower, the MOS will be lower by a very narrow margin as well.

An improvement will be seen for EVRC type codecs as used in CDMA. The new P.863 POLQA shows an even better comparability to EFR/AMR codecs.⁴

Furthermore, POLQA is trained for scoring complex channels including more than just a codec, e.g. noise reduction, variable gain and filtering as well as strong time warping.

The following table shows the main differences in scores between SQuad version 08, P.862.1 PESQ and P.863 POLQA.

The results are based on typical speech samples and are an average across six speech samples (i.e. American English as used in SwissQual Diversity). Except for the transparent transmission conditions, all samples were pre-filtered by IRS send.

Table 2: Typical predicted MOS-LQ values for common transmission techniques for SQuad 08, as well as P.862.1 PESQ and P.863 POLQA.

| Condition | P.862.1 (narrowband) | SQuad-LQ 08 (narrowband) | P.863 (narrowband) |
|---|---|---|---|
| Linear distortions | | | |
| Transparent transmission ~40 to ~3800 Hz | 4.50 | 4.50 | 4.50 |
| Transparent transmission ~180 to ~3500 Hz (G.712) | 4.40 | 4.50 | 4.30 |
| Transparent transmission ~200 to ~3500 Hz (IRS send) | 4.50 | 4.50 | 4.40 |
| Transparent transmission 300 to 3400 Hz (box block) | 4.10 | 4.30 | 3.60 |
| IRS send + G.711 (A-Law standard PCM) | 4.40 | 4.40 | 4.30 |
| Codec conditions | | | |
| IRS send + EFR / AMR 12.2 kbps | 4.15 | 4.15 | 4.20 |
| IRS send + EFR (real loss-free connection) | 4.10 | 4.15 | 4.10 |
| IRS send + QCELP 13 kbps | 3.90 | 4.00 | 4.00 |
| IRS send + EVRC 9.5 kbps | 3.75 | 3.90 | 3.90 |
| IRS send + EVRC-B 9.3 kbps | 3.75 | 4.00 | 3.90 |
| IRS send + AMR 7.95 kbps | 3.90 | 4.00 | 3.95 |
| IRS send + AMR 6.70 kbps | 3.75 | 3.90 | 3.85 |
| AMR 4.75 kbps | 3.40 | 3.70 | 3.65 |

⁴ ITU-T and 3GPP do not recommend the use of the P.862 family for EVRC-type codecs.



The codecs are used as reference SW implementations. In addition, one EFR condition is shown as it behaves in a real loss-free channel, using a commercial Nokia handset as access device to the network. The channel was terminated by an ISDN card device running G.711 A-Law.

First, P.863 POLQA gives a very slightly more pessimistic prediction than SQuad08. However, for practical use cases this absolute difference is negligible. Compared to P.862.1, the higher rates of AMR match very well, even though the lower rates are scored higher by P.863. In addition, the EVRC type codecs are scored higher and more realistically by P.863 and especially SQuad08 compared to P.862.1.

P.863 POLQA considers linear distortions and bandwidth limitations in its score. For super-wideband mode this is obvious. There, a signal is always compared to a super-wideband reference (50 to 14000 Hz). It is important to note that P.863 POLQA in narrow-band mode considers a full narrow-band signal (~50 to 3800 Hz) as reference. To this signal an IRS receive filter is applied in P.863 POLQA itself. That means limitations lowering this bandwidth will lead to a predicted distortion. With P.863 POLQA the actual channel filters and band-pass characteristics in the microphone and loudspeaker path of the mobile phone used are taken into account more than was the case for P.862 PESQ.⁵

SwissQual's SQuad08 also considers linear distortion in narrow-band mode; however, it is less sensitive than P.863 POLQA and is supposed to be less dependent on the actually used phone and its internal filtering.

SwissQual's speech quality suite offers two methods for predicting listening quality: the known SQuad08 and the new ITU-T P.863 POLQA. Both models may be combined with ITU-T P.862 PESQ as an option.

The entire framework as known from SQuad, including the voice samples, the insertion and capturing procedure and, of course, all of the additional signal analysis results, is used and available for P.863 POLQA in the same way.

Test definition and result presentation

The definition of tests, the timing and the selection of speech files are exactly the same as for Speech tests using SQuad. The only difference is the naming: speech tests with P.863 are called Speech POLQA.

P.863 POLQA is embedded in the same framework that is used for SQuad. For P.863 POLQA tests as well, all additional information such as levels, noise analysis, delay variations and frequency response are calculated and are available. Consequently, the obtained results are presented in the same format in SwissQual's NQDI. This underlines the close relationship between the speech quality measures. Only the MOS prediction is made either by SQuad or by P.863 POLQA; the value measured by both is Listening Quality as indicated by the type of test.

Figure 13: P.863 POLQA narrowband main result representation in NQDI

To differentiate P.863 POLQA tests from SQuad and P.862 PESQ, the method actually used is given in parentheses behind the label Listening Quality. For immediate visual feedback, the POLQA logo is shown right below the predicted MOS score.

⁵ Since P.863 POLQA measures the actual spectral loss of the speech signal, the actual impact of band limitations depends on the actual spectral power distribution of the speech sample. That means there are samples more or less affected by this filtering due to their spectral characteristic, e.g. losing more or less high frequency parts.



In addition to the global values for the entire speech sample, graphs illustrate the quality profile over the sample duration, the signal envelopes as well as the signal gain.

Figure 14: P.863 POLQA narrowband detail result representation in NQDI

P.863 POLQA is treated as a separate method for listening quality measurements in NQDI. The test selection tab sheet in NQDI can be used to select individual P.863 POLQA tests.

Figure 15: P.863 POLQA test selection in NQDI

For reporting, the group of Voice reports in NQDI includes an LQ narrowband statistic report. It reports not only the P.863 results but also the results of all other algorithms such as SQuad and P.862 PESQ in the same table. The results for each algorithm are given in a separate column.

Figure 16: P.863 POLQA statistical report in MS EXCEL



4 Wideband Voice Quality measurements with P.863 'POLQA' in Diversity

Idea of the wideband test

The idea of a wideband or, in correct terms, a super-wideband test is a test situation in which a listener listens to a speech signal using HiFi headphones. This means that he or she is not restricted to any bandwidth. The headphone is able to transmit the entire perceptible audio bandwidth.

As a consequence, a perfect sounding and not band-limited signal will get a high score, since it exactly matches the listener's expectation of excellent quality in such a setup. On the one hand, the headphone equipment itself sets a high expectation; on the other hand, the listener knows the unlimited speech signal, since it is presented in this experimental context.

The modeling is similar to the narrowband telephony case as shown in Figure 11 but is adapted to the changed setup. This means that the MOS predictor, e.g. P.863 POLQA, will not listen through a telephony handset, but rather models a headphone as listening device. It is modeled in a simplified manner as a flat filter from 50 to 14000 Hz.⁶

Similarly, a wideband or super-wideband device no longer follows the IRS send filter characteristic in the microphone path. It is also close to a flat filter with a band limitation at a higher point in frequency.

As a consequence, the channel or system under test receives a super-wideband input signal. In case there is a narrowband channel or device, this channel or device will restrict the bandwidth. At the other end the predictor listens with a headset and compares the received signal to the unlimited reference signal.

This leads to recognition of missed spectral parts, and this missing information consequently leads to a drop in quality. It can be imagined as listening to HiFi signals through a headphone and suddenly being presented a narrow-band signal. Here a human listener will also perceive it as lower in quality.

The transmitted bandwidth becomes much more important for a super-wideband test. Restrictions in audio bandwidth are always compared relative to a super-wideband reference signal.

Note: The use of the super-wideband test scenario is NOT restricted to wideband or super-wideband channels or devices. The scenario just defines the reference, which is super-wideband in this case. Test scenarios in super-wideband will be the common test case in the near future; they are required not only for a valid evaluation of wideband systems but also for correct ranking of systems or networks with different bandwidths down to narrowband.

A super-wideband test scenario implies some technical requirements. Within SwissQual's product lines the whole audio processing chain, from the handset's audio connector across the analogue circuits in Diversity as well as the digital signal processing, has been designed for and extended to higher sampling frequencies and audio bandwidth from the beginning.

Along with Diversity Release 10.2.0, SwissQual launched a super-wideband test application for the first time, along with SQuad. Now, in Release 10.6.5, the wideband test application has been completed by the integration of the new ITU-T Recommendation P.863 'POLQA'.

Wideband speech reference signals

As already mentioned, ITU-T P.863 POLQA is a so-called full-reference model. It compares the received and potentially degraded signal with an undistorted reference signal, but here the undistorted reference is practically unlimited in bandwidth.

⁶ For narrow-band mode P.863 POLQA applies an IRS receive filter that emulates a narrow-band handset (see Figure 12: IRS in send and receive direction as specified in ITU-T P.48).



This is the difference to the narrow-band case. The recorded signal is compared relative to a super-wideband reference. In the same way, the recorded signal is not post-filtered, which avoids any band limitation and models a receiving HiFi headphone.

That means, in case of a full-band audio channel (i.e. a VoIP connection using full audio bandwidth or an application using MP3 with sufficient bitrate as in video or audio streaming), the recorded signal matches the reference in its bandwidth. In case of a common wideband or even a narrow-band channel or device, the bandwidth becomes limited during transmission. In case this signal is recorded and compared to the full reference, the spectral loss is weighted as degradation.

Of course, the exploration of a wideband channel also requires the insertion of a signal with sufficient bandwidth. To actually feed wideband signals into the channel, new voice samples were recorded. They have no perceptual bandwidth limitation and are stored at 32 kHz sampling frequency in a separate reference folder Speech-Wideband or Speech-Wideband POLQA respectively. As usual, the samples are constructed out of a male and a female spoken sentence and have a constant length of 6 s. Thus, continuity with the narrowband tests is fully maintained.

For the time being, SwissQual provides samples in:

German (German pronunciation)

German (Swiss pronunciation)

British English

Italian

Dutch

Each language sample is provided without any pre-filtering (except for a 50 to 14000 Hz band-pass) and is named e.g. GE_fm_wide.wav. As specified for wideband devices, the microphone path is considered as flat in the transmission band. This means that no IRS send filter, as required for narrow-band, is applied. The signal remains flat, without any further band limitation and without any pre-emphasis as in the IRS.

What are the differences to narrowband?

In traditional telephony scenarios, the expectation is set to a perfect but narrowband voice signal. A signalthat is close or identical to such a signal is scored with a high quality value (usually a MOS-LQ of around 4.5on a five-point scale).

7Additional degradations will decrease the quality value up to a minimum of 1.0.

Within a wideband scenario, the expectation of excellent quality is a perfect wideband speech signal. Since the same scale is used here, such a perfect wideband signal is scored with 4.5 too. Obviously, a narrowband signal in the same context will not fulfill the expectation of high quality due to its band limitation. Consequently, it will be scored lower in this context.

This is, roughly speaking, the main difference. There are other effects, such as a different perception of noise, since noise components in the higher frequency ranges are masked less, or not at all, by the voice. But the main difference is the lower scoring of narrowband signals.

Most important for customers are the typical values obtained with the wideband application compared to narrowband measurements.

The following table shows typical values obtained in the two test scenarios for the same type of conditions by averaging the predicted scores of five different speech samples.

7 For more detailed information, please refer to the white paper 'About MOS and Quality Measurements', published by SwissQual AG in 2011.



Table 3: Typical P.863 POLQA scores for common transmission techniques in a wideband and a narrowband context

| Condition | P.863 in super-wideband (50-14000 Hz) | P.863 in narrowband (300-3400 Hz) |
|---|---|---|
| Transparent transmission 50-14000 Hz or wider | 4.75 | - |
| Transparent transmission 50-7000 Hz (common wideband) | 4.3 | - |
| AMR-WB 12.65 kbps (50-7000 Hz) | 3.8 | - |
| Transparent transmission 50-3800 Hz (full narrowband) | 3.6 | 4.5 |
| Transparent transmission ~250-3500 Hz (IRS send) | 3.5 | 4.4 |
| Transparent transmission 300-3400 Hz (telephony box block) | 3.0 | 3.6 |
| IRS send + G.712 + G.711 (A-law standard PCM channel) | 3.5 | 4.3 |
| IRS send + EFR / AMR 12.2 kbps | 3.2 | 4.2 |
| IRS send + EVRC 9.5 kbps | 3.0 | 3.9 |
| IRS send + EVRC-B 9.3 kbps | 3.0 | 3.9 |
| IRS send + AMR 7.95 kbps | 2.9 | 3.9 |

It can be seen that the rank order of the systems remains largely independent of the test scenario. The upper range of the wideband scale is used only for the high-quality wideband voice samples. The common narrowband scenarios are compressed to the lower 60% of the scale and thus show a smaller gradient as well.
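The smaller gradient can be checked directly against the values in Table 3. The following is a minimal sketch in Python; the names TABLE_3 and score_spread are illustrative only, and the numbers are copied from the eight conditions in Table 3 that have a score in both contexts.

```python
# Scores copied from Table 3: (super-wideband context, narrowband context).
# Only conditions that were scored in both contexts are listed.
TABLE_3 = {
    "Transparent 50-3800 Hz": (3.6, 4.5),
    "Transparent ~250-3500 Hz (IRS send)": (3.5, 4.4),
    "Transparent 300-3400 Hz": (3.0, 3.6),
    "IRS send + G.712 + G.711": (3.5, 4.3),
    "IRS send + EFR / AMR 12.2 kbps": (3.2, 4.2),
    "IRS send + EVRC 9.5 kbps": (3.0, 3.9),
    "IRS send + EVRC-B 9.3 kbps": (3.0, 3.9),
    "IRS send + AMR 7.95 kbps": (2.9, 3.9),
}


def score_spread(column: int) -> float:
    """Return max minus min of one score column (0 = SWB, 1 = NB)."""
    values = [scores[column] for scores in TABLE_3.values()]
    return round(max(values) - min(values), 2)


spread_swb = score_spread(0)  # narrowband conditions judged in the SWB context
spread_nb = score_spread(1)   # the same conditions judged in the NB context
print(f"spread in super-wideband context: {spread_swb}")
print(f"spread in narrowband context:     {spread_nb}")
```

For these eight conditions the scores span about 0.7 MOS in the super-wideband context and about 0.9 MOS in the narrowband context, which illustrates why the narrowband test discriminates narrowband systems more clearly.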

For optimizing and benchmarking pure narrowband networks and applications, the common narrowband test application can be used without any problems. The individual systems are more clearly discriminated due to the wider scale range used.

For optimizing wideband applications and networks, and especially for benchmarking wideband networks against narrowband ones, a wideband test application is required.

Firstly, the degradations in wideband mode can only be assessed in a wideband test application, and secondly, a wideband signal can only show its better quality against narrowband in wideband mode.

Note: Narrowband MOS-LQ values and wideband MOS-LQ values must never be mixed or directly compared. They refer to different interpretations of the MOS scale.
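One way to enforce this rule in post-processing scripts is to carry the measurement context along with every score. The sketch below is an illustration only: the class MosLq, its method and the context labels are hypothetical and are not part of any SwissQual tool. The example values are taken from Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MosLq:
    """A MOS-LQ value tagged with the context it was measured in."""
    value: float
    context: str  # hypothetical labels: "narrowband" or "super-wideband"

    def difference_to(self, other: "MosLq") -> float:
        """Return self - other, refusing to mix measurement contexts."""
        if self.context != other.context:
            raise ValueError(
                f"cannot compare {self.context} MOS-LQ with {other.context} MOS-LQ"
            )
        return self.value - other.value


amr_nb = MosLq(4.2, "narrowband")        # IRS send + AMR 12.2 kbps, Table 3
amr_swb = MosLq(3.2, "super-wideband")   # same condition, other context
amr_wb = MosLq(3.8, "super-wideband")    # AMR-WB 12.65 kbps, Table 3

# Same context: the comparison is meaningful.
print(round(amr_wb.difference_to(amr_swb), 2))

# Mixed contexts: the comparison is rejected.
try:
    amr_nb.difference_to(amr_swb)
except ValueError as err:
    print(err)
```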

Where wideband quality can be assessed

Although wideband is a normal use case in everyday communication such as TV and FM radio, it is still not popular in telecommunications.

It was used for commercial video conferencing systems, but it was Internet telephony that enabled wideband telephony for normal users for the first time. Today, common VoIP clients support a wide range of wideband codecs and use them if a sufficient bitrate is available for the service.

The next step in wideband telephony is the evolution of cellular networks and handsets. The networks and user devices are being equipped with AMR-WB, allowing an audio bandwidth of up to 7000 Hz while still remaining in the typical bitrate range used for GSM and UMTS.

Typical applications where wideband is already in use are mobile connections in GSM / UMTS. Here the first operators have enabled AMR-WB in TrFO mode. Usually, in GSM the AMR-WB bitrate is restricted to 12.65 kbps, while in UMTS AMR-WB bitrates up to 23.85 kbps are used. Another application that can be tested in real field applications with Diversity today is VoIP connections. Here even super-wideband transmission is possible. There are different codecs in use, with both standardized and proprietary solutions. Both were considered in the huge training set for SQuad and P.863 POLQA.

The main focus of Diversity's wideband test solution is of course the evaluation and benchmarking of wideband channels in cellular networks.

An additional application area for wideband voice testing in Diversity is video streaming. Video streaming usually uses audio codecs; these do not have any bandwidth restriction, except in very low bitrate conditions. Consequently, 'Speech Wideband' as a test case has also been applied to video streaming, starting with Release 10.2 of Diversity and completed in Release 11.0 with the full support of ITU-T P.863 POLQA.

Wideband analysis in Diversity

The super-wideband test application forms a standalone test, 'Speech Wideband POLQA', while the tests 'Speech' and 'Speech POLQA' remain narrowband.

While 'Speech Wideband' runs the SQuad algorithm, 'Speech Wideband POLQA' enables P.863 as the voice quality estimator. The same is true for 'Speech' and 'Speech POLQA'.

All these test types ('Speech', 'Speech Wideband', 'Speech POLQA' and 'Speech Wideband POLQA') can be selected as separate tests in SwissQual's Test Manager and in the post-processing tools NQDI and NQView. The presentation in NQDI looks almost the same; however, the test name differs to distinguish between the tests and the algorithms used.

Figure 17: P.863 POLQA wideband main result representation in NQDI

The application type (highlighted in red) explains the modeled listening situation in detail. In addition, since a potential bandwidth reduction has a serious impact in a wideband scenario, the actual bandwidth of the channel is measured and reported as well (highlighted in green). There are three classes:

narrowband (up to ~3800 Hz)

wideband (up to 8000 Hz)

super-wideband (up to 14000 Hz)
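The three classes can be expressed as a simple threshold rule. The sketch below is illustrative only: the function name is hypothetical, and using the nominal limits as hard thresholds is an assumption, since the text does not state how Diversity decides between the classes.

```python
def bandwidth_class(upper_bound_hz: float) -> str:
    """Map a measured upper band limit to one of the three reported classes."""
    if upper_bound_hz <= 0:
        raise ValueError("upper band limit must be positive")
    if upper_bound_hz <= 3800:      # narrowband (up to ~3800 Hz)
        return "narrowband"
    if upper_bound_hz <= 8000:      # wideband (up to 8000 Hz)
        return "wideband"
    return "super-wideband"         # anything wider, nominally up to 14000 Hz


for limit in (3400, 7000, 14000):
    print(limit, "Hz ->", bandwidth_class(limit))
```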

The remaining values are the same as usual and well known from SQuad, and are visible in narrowband tests as well. They provide information about the speech level, the noise floor, the amount of missed voice and the gain applied by the channel.

The tab sheet 'Speech Details' clearly shows the audio bandwidth of the measured audio channel, in this case a common wideband channel up to almost 8000 Hz (Figure 18).

Figure 18: P.863 POLQA wideband audio bandwidth representation in NQDI

The lower and upper bounds are marked with blue lines. As is clearly visible, Diversity and ITU-T P.863 make use of real super-wideband signals. The frequency scale here ends at 16000 Hz; this corresponds to an internal sampling frequency of 32 kHz.



5 Real field measurements

One of the most important questions is how P.863 POLQA results relate to previous P.862 PESQ measurements under real field conditions. Of course, P.862 PESQ and P.863 POLQA are different algorithms and treat distortions in the signal differently. However, in the end the predicted MOS should accurately describe the quality of the voice or of the voice channel. This means that in cases where P.862 PESQ delivered accurate predictions, the newer and improved P.863 POLQA should predict almost the same value. For distortions where P.862 PESQ produced less accurate predictions, P.863 POLQA, as an improved method, will predict more accurately and therefore differently from P.862 PESQ (see footnote 8).
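Footnote 8 states that the transformation from the raw P.862 outcome to a MOS-like scale is defined in P.862.1. A minimal sketch of that mapping is given below. The logistic form and its coefficients are quoted from ITU-T P.862.1 as commonly published and should be checked against the Recommendation before use; the function name is illustrative.

```python
import math


def pesq_raw_to_mos_lqo(raw_score: float) -> float:
    """Map a raw P.862 score to MOS-LQO with the P.862.1 logistic function."""
    # Coefficients as commonly quoted from ITU-T P.862.1 (verify before use).
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


# The raw P.862 score nominally ranges from -0.5 to 4.5.
for raw in (-0.5, 2.0, 4.5):
    print(f"raw {raw:+.1f} -> MOS-LQO {pesq_raw_to_mos_lqo(raw):.2f}")
```

The mapping is monotonic, so it changes the absolute values but not the rank order of the raw scores.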

In real field measurements the channel consists of more than just a codec. Even under perfect radio conditions there can be other factors that limit the maximum quality. These could be further bandwidth limitations that are due to the actual device used, or further speech processing steps such as noise and gain control that are applied in the device or in the network. There might also be transcoding, i.e. a second encoding/decoding step, for example in case of mobile-to-mobile connections or in special gateways from the mobile core to PSTN networks. For these reasons, the MOS scores obtained in a plain codec emulation as given in Table 3 are usually only reached in real field cases where the device and the network can be considered transparent and do not apply further speech signal processing, e.g. noise or gain control.

Results in GSM / UMTS networks compared to P.862 PESQ

The following arbitrarily picked sample shows the correspondence. It is a perfect transmission from a perfectly matched PSTN line at the land side to a GSM network using AMR at 12.2 kbps. The audio level is perfectly adjusted, there are no perceptible audio bandwidth limitations and no speech is missed (no temporal clipping, no interruptions). Most notably, the devices can be considered quite transparent, as they do not apply aggressive noise reduction or gain control mechanisms. The results of the two measures can be considered identical.

Figure 19: P.863 POLQA and P.862.1 PESQ presentation in NQDI

A good example of the difference between the two algorithms is the treatment of interruptions and lost speech. Here P.862 PESQ is suspected of scoring inaccurately and usually too optimistically. In the example almost 4% of the original speech was lost; nevertheless P.862 PESQ scores 3.2, while P.863 POLQA predicts only 2.7, which appears closer to the perceived quality here.

Figure 20: P.863 POLQA and P.862.1 PESQ presentation in NQDI with signal interruptions

When a larger number of quality scores obtained in a drive test is analyzed, the picture remains almost the same. The following figures are based on a drive test and a collection of data from a European operator. The speech sample used was American English and each given number is based on a collection of around 100 individual scores.

8 P.862 PESQ defines the algorithm technically. The actual transformation from the P.862 outcome to a MOS-like scale is defined in P.862.1. All predicted MOS scores in this document are computed in accordance with P.862 and were converted to the MOS domain according to P.862.1.

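Table 4: Comparison of P.862.1 PESQ scores to P.863 POLQA in high-quality UMTS/GSM setups

| Direction | Network | Device | Average P.862.1 PESQ | Average P.863 POLQA | Maximum P.862.1 PESQ | Maximum P.863 POLQA |
|---|---|---|---|---|---|---|
| Downlink | UMTS 2100 | Device A | 3.97 | 3.97 | 4.19 | 4.19 |
| Downlink | UMTS 2100 | Device B | 4.04 | 4.06 | 4.17 | 4.17 |
| Downlink | GSM 900 | Device A | 3.78 | 3.77 | 4.19 | 4.18 |
| Downlink | GSM 900 | Device B | 3.87 | 3.87 | 4.17 | 4.20 |
| Uplink | UMTS 2100 | Device A | 3.92 | 3.80 | 4.13 | 4.02 |
| Uplink | UMTS 2100 | Device B | 4.01 | 3.83 | 4.12 | 4.04 |
| Uplink | GSM 900 | Device A | 3.74 | 3.60 | 4.11 | 4.01 |
| Uplink | GSM 900 | Device B | 3.78 | 3.59 | 4.10 | 3.99 |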

Looking just at Downlink, which is usually the less critical direction, the difference between the PESQ and POLQA averages is on average just 0.02, which is completely negligible. There are small differences in the averages between the phones and between the two technologies GSM and UMTS. But the behavior is always the same for either method, i.e. GSM 900 is scored lower by 0.2 MOS on average with both methods.
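The differences quoted for Table 4 can be recomputed from its average columns. The sketch below is an illustration only: the assignment of rows to Downlink/Uplink and to UMTS 2100/GSM 900 is an assumption inferred from the surrounding text, and all names are hypothetical.

```python
# Average scores from Table 4 as (P.862.1 PESQ, P.863 POLQA), Device A then B.
# The row-to-label assignment is an assumption inferred from the text.
AVERAGES = {
    ("Downlink", "UMTS 2100"): [(3.97, 3.97), (4.04, 4.06)],
    ("Downlink", "GSM 900"): [(3.78, 3.77), (3.87, 3.87)],
    ("Uplink", "UMTS 2100"): [(3.92, 3.80), (4.01, 3.83)],
    ("Uplink", "GSM 900"): [(3.74, 3.60), (3.78, 3.59)],
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def method_gap(direction: str) -> float:
    """Mean of PESQ minus POLQA over all rows of one direction."""
    return mean(
        pesq - polqa
        for (d, _), rows in AVERAGES.items() if d == direction
        for pesq, polqa in rows
    )


def technology_gap(direction: str, method: int) -> float:
    """Mean UMTS score minus mean GSM score (method: 0 = PESQ, 1 = POLQA)."""
    umts = mean(row[method] for row in AVERAGES[(direction, "UMTS 2100")])
    gsm = mean(row[method] for row in AVERAGES[(direction, "GSM 900")])
    return umts - gsm


for direction in ("Downlink", "Uplink"):
    print(f"{direction}: PESQ - POLQA = {method_gap(direction):+.3f}")
for method, name in enumerate(("PESQ", "POLQA")):
    print(f"Downlink UMTS - GSM ({name}) = {technology_gap('Downlink', method):+.3f}")
```

Under this row assignment the Downlink gap between the two methods is close to zero, the Uplink gap is about 0.16 MOS, and GSM 900 scores about 0.2 MOS below UMTS 2100 in Downlink with either method.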

In Uplink the situation is slightly different. Here P.863 POLQA scores slightly lower than PESQ, on average by 0.15 MOS. This effect has several causes, the main one being the more restricted audio bandwidth of the microphone path of the mobile device, which is used in Uplink. By contrast, the Downlink uses the (wider) loudspeaker path of the phone. The former P.862 PESQ compensates for the frequency response of the channel and therefore mostly ignores that band limitation. P.863 POLQA considers changes in bandwidth as they are perceived by a user, and consequently a limitation leads to a slightly lower score here.

Besides the average values, the distribution of the predicted values provides information about the behavior of the measures. The following two graphs are based on the downlink scores of Device A in UMTS 2100 as above.

[Histogram: Listening Quality distribution (P.862.1 'PESQ') PDF. Horizontal axis: Listening Quality, in bins from 1-1.1 to 4.8-4.9. Vertical axis: share of values, 0% to 50%.]

Figure 21: Distribution of predicted MOS scores by P.862.1 PESQ (Device A, UMTS, Downlink as in Table 4)



[Histogram: Listening Quality distribution (P.863-NB 'POLQA') PDF. Horizontal axis: Listening Quality, in bins from 1-1.1 to 4.8-4.9. Vertical axis: share of values, 0% to 50%.]

Figure 22: Distribution of predicted MOS scores by P.863 POLQA (Device A, UMTS, Downlink as in Table 4)

Both distribution functions are very close and concentrate a large majority of the scores in the range of 4.0 to 4.2, which corresponds to the best quality in error-free connections. It is logical that a certain quality cannot be exceeded. This limit is set by the coding scheme, the channel limits and other included voice processing. Even in undistorted conditions these insert a certain amount of degradation, which defines the upper level that cannot be exceeded in this setup. This causes the steep decline towards higher values on the right-hand side. Usually, the majority of scores are in this region, which corresponds to error-free transmission.

Towards lower values, the distribution falls off more gradually. Values in this region indicate degradations in addition to the unavoidable distortions. In cellular networks these problems are usually interruptions (due to handovers), fallbacks to lower bitrates in case of AMR (due to bad radio conditions) and frame losses that were concealed artificially by the AMR decoder. In principle there could be other distortions as well, e.g. transcoding in case of special routing or noise bursts coupled into analogue parts of the PSTN.
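A distribution like those in Figures 21 and 22 can be produced from a list of individual scores by counting them per bin. The sketch below is illustrative only: the function name is hypothetical, the bin width of 0.1 MOS is an assumption based on the axis labels, and the example scores are invented for demonstration and are not measured data.

```python
from collections import Counter


def mos_distribution(scores, bin_width=0.1, lowest=1.0):
    """Return {bin lower edge: share in percent} for a list of MOS scores."""
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter()
    for score in scores:
        # The small offset keeps values on a bin edge in the upper bin
        # despite floating-point rounding.
        index = int((score - lowest) / bin_width + 1e-9)
        counts[index] += 1
    total = len(scores)
    return {
        round(lowest + index * bin_width, 1): 100.0 * count / total
        for index, count in sorted(counts.items())
    }


# Hypothetical scores, invented for illustration only (not measured data).
example_scores = [4.15, 4.12, 4.18, 4.05, 4.10, 4.02, 3.85, 3.40, 4.16, 2.90]
distribution = mos_distribution(example_scores)
for lower_edge, share in distribution.items():
    print(f"{lower_edge:.1f}-{lower_edge + 0.1:.1f}: {share:.0f}%")
```

With such invented data the typical shape appears: most scores fall into the one or two bins just below the codec-defined maximum, with a thin tail towards lower values.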

Regarding the absolute maximum as shown in Table 4, there is no difference between the phones and the technologies used, meaning that the reachable quality is identical for both and the slightly differing averages are caused by individual test conditions, e.g. slightly different RF coupling or a few more bad channels in the averaging process. It should be noted that the maximum reached is the same as that obtained by just processing the same speech sample over an AMR 12.2 kbps codec in offline emulation. This indicates that no further distortions are introduced by the phone or by constant speech processing components in the network.

Results in real field networks compared to P.862 PESQ

The previous section discussed results of an almost transparent network without fur
