Front-end Audio Processing: Reflections on Issues, Requirements, and Solutions Tomas Gaensler mh acoustics Summit NJ/Burlington VT

Front-end Audio Processing: Reflections on Issues, Requirements, and Solutions

Tomas Gaensler

mh acoustics

www.mhacoustics.com

Summit NJ/Burlington VT, USA

Front-end Audio Processing

Processing to enhance perceived and/or measured sound quality

in communication and recording devices

Not So Famous Quotes (Acoustic Jewelry/Bluetooth Headset)

Gary Elko (mh/Bell Labs colleague)

At IWAENC 1995: “Acoustic Echo cancellation will not be needed in the future when people wear acoustic jewelry”

Arno Penzias (1978 Nobel prize laureate)

“No one would want acoustic jewelry because people would think the users talking to themselves are crazy”

I’m glad the success of Bluetooth headsets shows that both were completely wrong!

Classical Front-end Architectures - POTS

[Block diagram: receive side (Rx) and send side (Tx), each with a BPF, and a switch-loss block]

Carbon microphone with expansion effect that reduces noise

Large coupling loss in handset mode

Switch loss in telephones that support speakerphone mode

Classical Front-end Architectures – Cellphone 1995

[Block diagram: receive side (Rx) and send side (Tx). Blocks: BPF/ADC, BPF/DAC, EQ (two instances), Encoder/Decoder, Vol, AEC, NLP]

Classical Front-end Architectures – Cellphone 2005 - 2010

[Block diagram: receive side (Rx) and send side (Tx). Blocks: BPF/ADC, BPF/DAC, EQ (two instances), Encoder/Decoder, Vol, AEC, TXLEV, NS, NLP, RXLEV, NS]

Cellphones and Handsfree

Common problems:

Far-end listener does not hear near-end talker

Near-end listener does not understand far-end talker

Why?

Form factor – Size

Limited understanding of physics and acoustics(?)

RX/TX Levels, Coupling and Doubletalk

Far-end: 95-100 dB SPL at loudspeaker, 85-90 dB SPL at mic

Near-end talker: 55-70 dB SPL at mic

Echo louder than near-end:

Linear AEC: ERLE 20-30 dB

After cancellation, Residual Echo to Near-end Ratio (RENR): RENR = 90 - 20 - 70 = 0 dB

>20 dB of residual echo suppression required

Duplexness suffers
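The RENR arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and argument names are made up for illustration, and the levels are the ones quoted on the slide (echo at the mic, ERLE, near-end speech at the mic).

```python
def residual_echo_to_near_end_ratio(echo_at_mic_db, erle_db, near_end_db):
    """RENR in dB: echo level at the mic, reduced by the ERLE, relative to near-end speech."""
    return echo_at_mic_db - erle_db - near_end_db


# Slide example: 90 dB SPL echo at the mic, 20 dB ERLE, 70 dB SPL near-end talker.
renr_db = residual_echo_to_near_end_ratio(90.0, 20.0, 70.0)

# Same echo and ERLE with a quiet near-end talker (55 dB SPL, lower end of the slide's range).
quiet_talker_renr_db = residual_echo_to_near_end_ratio(90.0, 20.0, 55.0)

print(f"RENR = {renr_db:.0f} dB, quiet talker: {quiet_talker_renr_db:.0f} dB")
```

With a quiet talker the residual echo ends up above the near-end speech, which is why additional residual echo suppression is needed and duplexness suffers.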

TX: Dynamic Range and Noise

[Figure: level diagram mapping SPL [dB] to digital level. RAIL (e.g. 32768 or 1) at 110 dB SPL; 94 dB SPL; speech level at 70; room noise level at 43; mic circuit noise (1/f) at 29 (mic SNR = 65 dB); Q-noise (white) at 26 (14 bits)]

Echo level: 90 dB SPL (peak echo 105-110 dB)

No saturation of echo in TX path

[Diagram label: ADC]

Near-end speech level: 70 dB SPL

Actual speech to room noise ratio is only about 27 dB at best

Gain is required to get loud enough output

Perceived noise level is ~20 dB above normal room noise level
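The numbers in the level diagram follow from simple dB bookkeeping. The sketch below reproduces them under two assumptions that are not spelled out on the slide: about 6 dB of dynamic range per bit, and a microphone SNR specified relative to 94 dB SPL (1 Pa). All names are illustrative.

```python
RAIL_DB_SPL = 110.0     # SPL that maps to digital full scale (from the level diagram)
DB_PER_BIT = 6.0        # assumption: rule of thumb of about 6 dB per bit
MIC_REF_DB_SPL = 94.0   # assumption: mic SNR is specified relative to 94 dB SPL (1 Pa)


def quantization_noise_db_spl(bits, rail_db_spl=RAIL_DB_SPL):
    """Equivalent SPL of the quantization noise floor for a given word length."""
    return rail_db_spl - DB_PER_BIT * bits


def mic_noise_db_spl(mic_snr_db):
    """Equivalent SPL of the microphone self-noise."""
    return MIC_REF_DB_SPL - mic_snr_db


q_noise = quantization_noise_db_spl(14)   # Q-noise (white), 14 bits
mic_noise = mic_noise_db_spl(65.0)        # mic circuit noise, SNR = 65 dB
speech_to_room_noise = 70.0 - 43.0        # speech level minus room noise level

print(q_noise, mic_noise, speech_to_room_noise)
```

With the rail placed at 110 dB SPL so that echo peaks do not saturate, near-end speech sits 40 dB below full scale, which is why gain is needed later in the TX path.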

TX: Fixed-point Processing and Quantization Noise

Q-noise increases by 6 log2(N) dB!

N = 64: Q-noise increases by 36 dB

Double-precision “required”

[Block diagram: ADC, AFB (FFT), SFB (FFT)]

[Figure: level diagram mapping SPL [dB] to digital level. RAIL (e.g. 32768 or 1) at 110 dB SPL; speech level at 70; Q-noise from 64-point FFT processing at 50, which is 6 log2(64) dB above the LSB for 16 bits at 14]
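The 6 log2(N) rule can be turned into a quick estimate of where the quantization noise ends up after fixed-point FFT processing. This is a sketch of the slide's rule of thumb only, not a model of any particular FFT implementation; the function names and the 6 dB per bit figure are assumptions.

```python
import math


def fft_qnoise_growth_db(n_fft):
    """Slide's rule of thumb: Q-noise grows by 6*log2(N) dB for an N-point fixed-point FFT."""
    return 6.0 * math.log2(n_fft)


def qnoise_after_fft_db_spl(n_fft, bits=16, rail_db_spl=110.0):
    """Equivalent SPL of the Q-noise after the FFT, assuming about 6 dB per bit."""
    lsb_db_spl = rail_db_spl - 6.0 * bits
    return lsb_db_spl + fft_qnoise_growth_db(n_fft)


growth_64 = fft_qnoise_growth_db(64)
floor_64 = qnoise_after_fft_db_spl(64)
print(f"N=64: growth {growth_64:.0f} dB, noise floor {floor_64:.0f} dB SPL")
```

For N = 64 and 16-bit words the noise floor rises above the room noise level quoted earlier (43 dB SPL), which is the argument for double precision, block scaling, or floating point.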

RX: Dynamic Range and Distortion

[Block diagram: EQ, digital gain, DAC, analog gain; tap labeled “To AEC”]

Small loudspeakers have rather high cut-off frequency (high-pass)

EQ often required to get acceptable “sound” (frequency response). However, EQ means:

Loss of signal loudness and dynamic range

Increased (analog) distortion

Many manufacturers compensate for the loss of signal level with excessive digital gain and therefore get (digital) saturation

What Can or Should be Done?

Minimize acoustical coupling by good physical design

TX

Use noise suppression but not excessively

Double-precision, block scaling, or floating-point

RX

Compression instead of fixed gain

10% or less loudspeaker/driver THD is desired
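To illustrate "compression instead of fixed gain" in the RX path, the sketch below compares output levels (in dBFS) for a fixed digital gain and for a simple static compression curve. The threshold, ratio, and gain values are hypothetical, and a real compressor would drive this curve from a smoothed level estimate rather than from instantaneous levels.

```python
def fixed_gain_saturates(in_db_fs, gain_db):
    """True if a fixed gain pushes this input level above digital full scale (0 dBFS)."""
    return in_db_fs + gain_db > 0.0


def compressed_level_db_fs(in_db_fs, gain_db, threshold_db_fs=-12.0, ratio=4.0):
    """Static compression curve: full gain below the threshold, reduced slope above it."""
    out = in_db_fs + gain_db
    if out > threshold_db_fs:
        out = threshold_db_fs + (out - threshold_db_fs) / ratio
    return out


GAIN_DB = 18.0  # hypothetical make-up gain after loudspeaker EQ
for level in (-30.0, -6.0, 0.0):
    print(level,
          fixed_gain_saturates(level, GAIN_DB),
          compressed_level_db_fs(level, GAIN_DB))
```

Quiet passages get the full gain in both cases, while loud passages stay below full scale with compression instead of saturating, so the reference signal sent to the AEC remains a linear function of what the loudspeaker plays.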

What about Non-linear AEC Algorithms?

Interesting problem proposed and worked on for many years

Not practical in most AEC applications since:

Complicated model

Gain and therefore saturation possibly in both TX and RX paths

Added complexity and system cost

Often slow convergence

Difficult to fine-tune in field

Even when non-linear cancellation works perfectly, the user still perceives a distorted loudspeaker signal!

Classical Front-end Architectures – Cellphone 2005 - 2010

[Block diagram: receive side (Rx) and send side (Tx). Blocks: BPF/ADC, BPF/DAC, EQ (two instances), Encoder/Decoder, Vol, AEC, TXLEV, NS, NLP, RXLEV, NS]

Why RX NS?

Why TX NS?

Single Channel Noise Suppression

Basic single channel noise suppressor

An extremely successful signal processing invention by Manfred Schroeder in the 1960s

Musical tones – is it a (solved) problem?

How do we evaluate and improve quality?

How about convergence rate?

Background to Single Channel Noise Suppressors

[Block diagram: NS block with input $y(n) = s(n) + v(n)$ (speech plus noise) and output $\hat{s}(n)$ (“enhanced” speech)]

Block processing:

$X(k,m) = \sum_{n=0}^{K-1} w(n)\, x(m+n)\, e^{-j 2\pi k n / K}$

Frequency domain model:

$Y(k,m) = S(k,m) + V(k,m)$

Linear time-varying filter:

$\hat{S}(k,m) = H(k,m)\, Y(k,m)$

Wiener filter:

$H(k,m) = \frac{P_s(k,m)}{P_s(k,m) + P_v(k,m)} = \frac{P_y(k,m) - P_v(k,m)}{P_y(k,m)}$
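A minimal sketch of the Wiener gain and the filtering step for one frame, written per frequency bin. The gain floor that limits the maximum suppression is an assumption added here (it is common practice but not shown on the slide); the 12 dB default and all function names are illustrative.

```python
def wiener_gain(p_y, p_v, max_suppression_db=12.0):
    """H = (P_y - P_v) / P_y, limited from below by a gain floor (floor is an assumption)."""
    floor = 10.0 ** (-max_suppression_db / 20.0)
    if p_y <= 0.0:
        return floor
    return max((p_y - p_v) / p_y, floor)


def enhance_frame(y_bins, p_y, p_v):
    """S_hat(k, m) = H(k, m) * Y(k, m) for all bins k of one frame m."""
    return [wiener_gain(py, pv) * y for y, py, pv in zip(y_bins, p_y, p_v)]


# One toy frame with two bins: a speech-dominated bin and a noise-only bin.
y_bins = [3.0 + 1.0j, 0.5 - 0.5j]
p_y = [10.0, 0.5]   # noisy-signal power estimates
p_v = [1.0, 0.5]    # noise power estimates
s_hat = enhance_frame(y_bins, p_y, p_v)
print(s_hat)
```

The speech-dominated bin passes almost unchanged, while the noise-only bin is attenuated down to the floor instead of being zeroed, which helps keep the residual noise sounding natural.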

Background to Single Channel Noise Suppressors

Estimation of spectra is often done recursively:

$P_y(k,m) = \beta\,[P_y(k,m-1) - |Y(k,m)|^2] + |Y(k,m)|^2$

$P_v(k,m) = \gamma\,[P_v(k,m-1) - |Y(k,m)|^2] + |Y(k,m)|^2$, when speech is “not” present

$\beta$, $\gamma$: time-dimension averaging constants

Frequency smoothing:

$\bar{H}(k,m) = \sum_{k'=-k_b}^{k_b} b(k',k)\, H(k-k',m)$

$b(k',k)$: frequency-dimension averaging constants

$\beta$, $\gamma$, $b(k',k)$ and $k_b$ are critical for musical tone control
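The recursive spectral estimates and the frequency smoothing of the gains can be sketched as follows. The names `beta` and `gamma` for the two time-averaging constants, their default values, the fixed weight vector (the same for every bin), and the renormalization at the band edges are all assumptions made for this illustration.

```python
def smooth_power(p_prev, y_bin, constant):
    """P(k, m) = c * [P(k, m-1) - |Y(k, m)|^2] + |Y(k, m)|^2."""
    mag2 = abs(y_bin) ** 2
    return constant * (p_prev - mag2) + mag2


def update_spectra(p_y, p_v, frame, speech_present, beta=0.9, gamma=0.98):
    """Update P_y every frame; update P_v only when speech is judged absent."""
    new_p_y = [smooth_power(p, y, beta) for p, y in zip(p_y, frame)]
    if speech_present:
        new_p_v = list(p_v)
    else:
        new_p_v = [smooth_power(p, y, gamma) for p, y in zip(p_v, frame)]
    return new_p_y, new_p_v


def smooth_gains(h, weights):
    """Weighted average of H over neighboring bins; len(weights) = 2*k_b + 1."""
    k_b = len(weights) // 2
    smoothed = []
    for k in range(len(h)):
        acc, norm = 0.0, 0.0
        for i, b in enumerate(weights):
            kk = k + i - k_b
            if 0 <= kk < len(h):      # skip neighbors outside the band
                acc += b * h[kk]
                norm += b
        smoothed.append(acc / norm)   # renormalize at the edges (assumption)
    return smoothed


p_y, p_v = update_spectra([1.0, 1.0], [1.0, 1.0], [2.0, 0.0], speech_present=False)
h_bar = smooth_gains([1.0, 0.2, 1.0], [0.25, 0.5, 0.25])
print(p_y, p_v, h_bar)
```

Larger averaging constants and wider frequency smoothing reduce the variance of the suppression gains, and with it musical tones, at the cost of slower tracking.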

Musical Tones – Is it a (Solved) Problem?

Examples:

Original (“Sally Sievers’ reel, June-Sept. 1964” by Manfred Schroeder and Mohan Sondhi at Bell Labs)

Original + noise (iSNR ~ 6 dB)

Schroeder – 1960s

“Generic spectral subtraction” – Boll 1979

IS-127 – 1995

“A problem of last century”, only a constraint in design

Controlling variance of suppression gains

Any NS algorithm should be constrained not to have musical tones

Must only have a small impact on voice quality

Quality Metrics

Most importantly: Listen!

SNR

Total

Segmental

During speech

Distortion metrics:

ISD (Itakura-Saito distance)

ITU-T P.862: PESQ/MOS-LQO
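As a rough illustration of how total and segmental SNR differ, the sketch below computes both from a clean reference and a processed signal. The frame length and the clamping of per-frame values to a fixed dB range are common conventions assumed here, not values taken from the slides; an SNR "during speech" would additionally need a speech activity decision per frame.

```python
import math


def total_snr_db(clean, processed):
    """Total SNR: signal energy over error energy across the whole signal."""
    signal = sum(c * c for c in clean)
    error = sum((c - p) ** 2 for c, p in zip(clean, processed))
    return 10.0 * math.log10(signal / error)


def segmental_snr_db(clean, processed, frame_len=160, lo_db=-10.0, hi_db=35.0):
    """Mean of per-frame SNRs, each clamped to [lo_db, hi_db] (assumed convention)."""
    frame_values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        p = processed[start:start + frame_len]
        signal = sum(x * x for x in c)
        error = sum((x - y) ** 2 for x, y in zip(c, p))
        if signal == 0.0:
            value = lo_db
        elif error == 0.0:
            value = hi_db
        else:
            value = 10.0 * math.log10(signal / error)
        frame_values.append(min(hi_db, max(lo_db, value)))
    if not frame_values:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_values) / len(frame_values)


# Toy signal: one loud frame and one quiet frame with the same additive error.
clean = [1.0] * 160 + [0.1] * 160
processed = [c + 0.01 for c in clean]
total = total_snr_db(clean, processed)
segmental = segmental_snr_db(clean, processed)
print(f"total {total:.1f} dB, segmental {segmental:.1f} dB")
```

The total SNR is dominated by the loud frame, while the segmental SNR weights the quiet frame equally, which is usually closer to how degradations are perceived.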

Quality Metric – P.862 (PESQ/MOS-LQO)

MOS-LQO (MOS Listening Quality Objective)

Alg-1/2 – Wiener methods with 12 dB noise suppression

[Figure: P.862.2 MOS-LQO (axis 1.5 to 4.5) versus SNR (0 to 60 dB) for unproc, Alg-1 and Alg-2]

What can the best noise suppressor achieve?

Quality Metric – “My Rule of Thumb”

[Figure: P.862.2 MOS-LQO (axis 1.5 to 4.5) versus SNR (0 to 60 dB) for unproc, Alg-1, Alg-2 and Bound (12 dB)]

Ideal MOS (PESQ) performance bound is given by shifting the unprocessed PESQ curve to the left

Example for 12 dB suppression: 12 dB shift to the left
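The rule of thumb amounts to evaluating the unprocessed curve at an SNR that is higher by the amount of suppression. A sketch, where `toy_unprocessed_mos` is a made-up placeholder curve and not measured PESQ data:

```python
def suppression_bound(unprocessed_mos, suppression_db=12.0):
    """Bound(snr) = unprocessed(snr + suppression): the curve shifted left by suppression_db."""
    def bound(snr_db):
        return unprocessed_mos(snr_db + suppression_db)
    return bound


def toy_unprocessed_mos(snr_db):
    """Hypothetical monotonic placeholder curve, NOT measured data."""
    return min(4.5, max(1.0, 1.0 + 0.07 * snr_db))


bound_12 = suppression_bound(toy_unprocessed_mos, 12.0)
for snr in (0.0, 10.0, 20.0):
    print(snr, toy_unprocessed_mos(snr), bound_12(snr))
```

In practice the unprocessed curve would come from P.862.2 measurements of the noisy, unprocessed signals, and a noise suppressor with 12 dB of suppression would be judged by how close it gets to the shifted curve.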

Convergence Rate

Important performance criterion:

Non-stationary noise conditions

Frame loss

Main objective:

Maximize convergence rate while maintaining speech quality

Convergence Rate – A Useful Test

a) Input sequence

b) IS-127

c) Wiener Based

d) A spectral subtraction m-script retrieved from the internet

Convergence Rate and MOS-LQO

a) “Normal”

b) “Fast”

c) MOS-LQO

Current Applications and Drivers of NS Technology

Where is NS going in industry now?

Beyond “12 dB” of suppression

Multi-microphone solutions

Two- or more channel suppressors

Linear beamforming

Applications

Mobile phones (a few two-microphone models have reached the market)

Bluetooth headsets: great "new" application for signal processing (Ericsson BT headset 2000)

Background to Linear Beamforming

N : Number of microphones

Broadside linear beamforming (e.g. delay-sum)

Directional gain: 10 log10(N) dB

White Noise Gain (WNG)>0

Practical size: “large” (~30cm)

Endfire differential beamforming

Directional gain: 20 log10(N) dB

WNG<0

Practical size: “small” (1.5-5cm)

[Figure: microphone array feeding a processing block, with the endfire and broadside directions indicated]

Differential beamformers more suitable for small form-factors
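The two directional-gain rules above are easy to tabulate for small arrays. This sketch just evaluates the slide's formulas with base-10 logarithms; the function names are illustrative.

```python
import math


def broadside_gain_db(n_mics):
    """Broadside (e.g. delay-sum) directional gain: 10*log10(N) dB."""
    return 10.0 * math.log10(n_mics)


def endfire_differential_gain_db(n_mics):
    """Endfire differential directional gain: 20*log10(N) dB."""
    return 20.0 * math.log10(n_mics)


for n in (2, 3, 4):
    print(n, round(broadside_gain_db(n), 1), round(endfire_differential_gain_db(n), 1))
```

A differential array reaches with two microphones roughly what a broadside array needs four for, at the cost of negative white noise gain.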

Background to Linear Beamforming

What do we gain?

Less reverberation (increased intelligibility)

Less (environmental) noise

No (or low) distortion on axis

Possible interference rejection by spatial zero(s)

Some Issues:

Performance is given by critical distance!

Increase in sensor noise (WNG, differential beamforming)

Beamforming: Critical Distance

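Critical distance (reverberation radius): distance at which the reverberant-to-direct path energy ratio is 0 dB:

$r_c = 0.1 \left( \frac{Q V}{\pi T_{60}} \right)^{1/2}$

$Q$: directivity factor $= 10^{DI/10}$

DI = Directivity Index: gain of direct to reverberant energy over an omni-directional microphone

Order: order of finite differences used (1st: 2 mics, 2nd: 3 mics, etc.)

Order   DI [dB]   r_c / r_0
0       0
1       6         2.0
2       9.5       3.0
3       12        4.0

[Figure: critical distance r_c of the directional array compared with r_0 of an omni-directional microphone]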
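A short numerical sketch of the critical-distance formula, assuming SI units (room volume in cubic meters, reverberation time T60 in seconds, distance in meters). The room used in the example is hypothetical.

```python
import math


def directivity_factor(di_db):
    """Directivity factor = 10^(DI/10)."""
    return 10.0 ** (di_db / 10.0)


def critical_distance_m(volume_m3, t60_s, di_db=0.0):
    """r_c = 0.1 * sqrt(Q * V / (pi * T60)), with Q the directivity factor."""
    q = directivity_factor(di_db)
    return 0.1 * math.sqrt(q * volume_m3 / (math.pi * t60_s))


# Hypothetical room: 50 m^3 with T60 = 0.5 s.
r_omni = critical_distance_m(50.0, 0.5)
for order, di_db in ((1, 6.0), (2, 9.5), (3, 12.0)):
    r_c = critical_distance_m(50.0, 0.5, di_db)
    print(order, round(r_c, 2), round(r_c / r_omni, 1))
```

The ratio to the omni-directional case depends only on the directivity index (it is the square root of the directivity factor), so it does not change with the room chosen for the example.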

First-Order Differential Beamforming

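$E(\omega,\theta) \approx P_0\,\omega\,[T_1 + \frac{d}{c}\cos(\theta)]$, $H_L(\omega) = \frac{1}{\omega}$

$Y(\omega) = E(\omega,\theta)\,H_L(\omega) \approx P_0\,[T_1 + \frac{d}{c}\cos(\theta)] \propto P_0\,[\alpha_1 + (1-\alpha_1)\cos(\theta)]$, $\alpha_1 = \frac{T_1}{T_1 + d/c}$

Beamformer response: $\alpha_1 + (1-\alpha_1)\cos(\theta)$

[Block diagram: microphones m1 and m2 spaced d apart, incident wave $P_0$, delay $T_1$, subtraction, lowpass filter $H_L(\omega)$; intermediate signal $E(\omega,\theta)$, output $Y(\omega)$]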

Classical First-Order Beamformer Responses

[Figure: polar responses. Cardioid ($\alpha_1 = 0.5$), Hypercardioid ($\alpha_1 = 0.25$), Dipole ($\alpha_1 = 0.0$)]
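The three classical patterns can be explored numerically from the first-order response alpha_1 + (1 - alpha_1) cos(theta), with alpha_1 = 0.5, 0.25 and 0.0 for cardioid, hypercardioid and dipole. The sketch below evaluates the null direction and the directivity index; the DI expression assumes a spherically isotropic (diffuse) noise field, and the function names are illustrative.

```python
import math


def first_order_response(alpha_1, theta_rad):
    """Normalized first-order beamformer response at angle theta from endfire."""
    return alpha_1 + (1.0 - alpha_1) * math.cos(theta_rad)


def null_angle_deg(alpha_1):
    """Angle where the response is zero, or None if the pattern has no null."""
    if alpha_1 >= 1.0:
        return None
    cos_null = -alpha_1 / (1.0 - alpha_1)
    if abs(cos_null) > 1.0:
        return None
    return math.degrees(math.acos(cos_null))


def directivity_index_db(alpha_1):
    """DI in a diffuse field (assumption): 10*log10(1 / (a^2 + (1 - a)^2 / 3))."""
    return 10.0 * math.log10(1.0 / (alpha_1 ** 2 + (1.0 - alpha_1) ** 2 / 3.0))


patterns = {"Cardioid": 0.5, "Hypercardioid": 0.25, "Dipole": 0.0}
for name, alpha_1 in patterns.items():
    print(name, round(null_angle_deg(alpha_1), 1), round(directivity_index_db(alpha_1), 2))
```

Under these assumptions the hypercardioid has the highest directivity index of the three, about 6 dB, which matches the first-order entry in the DI table.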

Beamforming Demo: DEWIND processing
