Speech Segregation
DeLiang Wang
Perception & Neurodynamics Lab, Ohio State University
http://www.cse.ohio-state.edu/pnl/
ICASSP'10 tutorial 2
Outline of presentation
I. Introduction: Speech segregation problem
II. Auditory scene analysis (ASA)
III. Speech enhancement
IV. Speech segregation by computational auditory scene analysis (CASA)
V. Segregation as binary classification
VI. Concluding remarks
ICASSP'10 tutorial 3
Real-world audition
What?
• Speech
  • message
  • speaker: age, gender, linguistic origin, mood, …
• Music
• Car passing by
Where?
• Left, right, up, down
• How close?
Channel characteristics
Environment characteristics
• Room reverberation
• Ambient noise
ICASSP'10 tutorial 4
Sources of intrusion and distortion
additive noise from other sound sources
reverberation from surface reflections
channel distortion
ICASSP'10 tutorial 5
Cocktail party problem
• Term coined by Cherry
• “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry, 1957)
• “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp, 1992)
• Ball-room problem by Helmholtz: “complicated beyond conception” (Helmholtz, 1863)
• Speech segregation problem
ICASSP'10 tutorial 6
Listener performance
Speech reception threshold (SRT)
• The speech-to-noise ratio needed for 50% intelligibility
• Each 1 dB gain in SRT corresponds to a 5-10% increase in intelligibility (Miller et al., 1951), depending on the speech materials
Source: Steeneken (1992)
ICASSP'10 tutorial 7
Effects of competing source
Source: Wang and Brown (2006)
SRT difference (23 dB!)
ICASSP'10 tutorial 8
Some applications of speech segregation
• Robust automatic speech and speaker recognition
• Processor for hearing prostheses
  • Hearing aids
  • Cochlear implants
• Audio information retrieval
ICASSP'10 tutorial 9
Approaches to speech segregation
• Monaural approaches
  • Speech enhancement
  • CASA
  • Focus of this tutorial
• Microphone-array approaches
  • Spatial filtering (beamforming)
    – Extract target sound from a specific spatial direction with a sensor array
    – Limitation: configuration stationarity. What if the target switches or changes location?
  • Independent component analysis
    – Find a demixing matrix from mixtures of sound sources
    – Limitation: strong assumptions, chief among them stationarity of the mixing matrix
ICASSP'10 tutorial 10
Part II: Auditory scene analysis
• Human auditory system
• How does the human auditory system organize sound?
• Auditory scene analysis account
ICASSP'10 tutorial 11
Auditory periphery
A complex mechanism for transducing pressure variations in the air to neural impulses in auditory nerve fibers
ICASSP'10 tutorial 12
Beyond the periphery
• The auditory system is complex, with four relay stations between the periphery and cortex rather than one as in the visual system
• In comparison to the auditory periphery, central parts of the auditory system are less understood
• The number of neurons in the primary auditory cortex is comparable to that in the primary visual cortex, despite the fact that the auditory nerve contains far fewer fibers than the optic nerve (thousands vs. millions)
The auditory system (Source: Arbib, 1989)
The auditory nerve
ICASSP'10 tutorial 13
Auditory scene analysis
• Listeners are capable of parsing an acoustic scene (a sound mixture) to form a mental representation of each sound source – stream – in the perceptual process of auditory scene analysis (Bregman, 1990)
  • From acoustic events to perceptual streams
• Two conceptual processes of ASA:
  • Segmentation: decompose the acoustic mixture into sensory elements (segments)
  • Grouping: combine segments into streams, so that segments in the same stream originate from the same source
ICASSP'10 tutorial 14
Simultaneous organization
Simultaneous organization groups sound components that overlap in time. ASA cues for simultaneous organization:
• Proximity in frequency (spectral proximity)
• Common periodicity
  • Harmonicity
  • Temporal fine structure
• Common spatial location
• Common onset (and, to a lesser degree, common offset)
• Common temporal modulation
  • Amplitude modulation (AM)
  • Frequency modulation (FM)
• Demo:
ICASSP'10 tutorial 15
Sequential organization
Sequential organization groups sound components across time. ASA cues for sequential organization:
• Proximity in time and frequency
• Temporal and spectral continuity
• Common spatial location; more generally, spatial continuity
• Smooth pitch contour
  • Smooth formant transition?
• Rhythmic structure
Demo: streaming in African xylophone music
• Notes in the pentatonic scale
ICASSP'10 tutorial 16
Primitive versus schema-based organization
Primitive grouping: innate data-driven mechanisms, consistent with those described by Gestalt psychologists for visual perception – feature-based or bottom-up
• It is domain-general, and exploits the intrinsic structure of environmental sound
• The grouping cues described earlier are primitive in nature
Schema-driven grouping: learned knowledge about speech, music, and other environmental sounds – model-based or top-down
• It is domain-specific, e.g. organization of speech sounds into syllables
ICASSP'10 tutorial 17
Organization in speech: Spectrogram
[Spectrogram of “… pure pleasure …”, annotated with onset synchrony, offset synchrony, continuity, and harmonicity]
ICASSP'10 tutorial 18
Interim summary of ASA
• Auditory peripheral processing amounts to a decomposition of the acoustic signal
• ASA cues essentially reflect the structural coherence of natural sound sources
• A subset of cues is believed to be strongly involved in ASA
  • Simultaneous organization: periodicity, temporal modulation, onset
  • Sequential organization: location, pitch contour, and other source characteristics (e.g. vocal tract)
ICASSP'10 tutorial 19
Part III. Speech enhancement
Speech enhancement aims to remove or reduce background noise
• Improve signal-to-noise ratio (SNR)
• Assumes stationary noise, or at least that noise is more stationary than speech
• A tradeoff between speech distortion and noise distortion (residual noise)
Types of speech enhancement algorithms
• Spectral subtraction
• Wiener filtering
• Minimum mean square error (MMSE) estimation
• Subspace algorithms
Material in this part is mainly based on Loizou (2007)
ICASSP'10 tutorial 20
Spectral subtraction
It is based on a simple principle:
• Assuming additive noise, one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum
• The noise spectrum can be estimated (and updated) during periods when the speech signal is absent or when only noise is present
  • This requires voice activity detection or speech pause detection
ICASSP'10 tutorial 21
Basic principle
In the signal domain:
y(n) = x(n) + d(n)
• x: speech signal; d: noise; y: noisy speech
In the DFT domain:
Y(ω) = X(ω) + D(ω)
Hence we have the estimated signal magnitude spectrum:
|X̂(ω)| = |Y(ω)| − |D̂(ω)|
Negative magnitude estimates can occur due to noise estimation errors, so half-wave rectification is applied to ensure nonnegative magnitudes:
|X̂(ω)| = |Y(ω)| − |D̂(ω)| if |Y(ω)| − |D̂(ω)| > 0, and 0 otherwise
ICASSP'10 tutorial 22
Basic principle (cont.)
Assuming that speech and noise are uncorrelated, we have the estimated signal power spectrum:
|X̂(ω)|² = |Y(ω)|² − |D̂(ω)|²
In general:
|X̂(ω)|ᵖ = |Y(ω)|ᵖ − |D̂(ω)|ᵖ
Again, half-wave rectification needs to be applied
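To make the |·|ᵖ-domain subtraction concrete, here is a minimal NumPy sketch; the Hann window, frame sizes, and the assumption that the first few frames are speech-free are illustrative choices, not details specified in the tutorial.

```python
import numpy as np

def spectral_subtraction(y, noise_frames=6, n_fft=512, hop=256, p=2):
    """Subtract an estimated noise spectrum from noisy speech y (1-D array)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    Y = np.fft.rfft(frames, axis=1)
    # Estimate |D(w)|^p from the first few frames, assumed speech-free.
    D_p = np.mean(np.abs(Y[:noise_frames]) ** p, axis=0)
    # Subtract in the |.|^p domain; half-wave rectify negative results.
    X_p = np.maximum(np.abs(Y) ** p - D_p, 0.0)
    # Reuse the noisy phase, invert, and overlap-add.
    X = X_p ** (1.0 / p) * np.exp(1j * np.angle(Y))
    x = np.zeros(len(y))
    for i, frame in enumerate(np.fft.irfft(X, n=n_fft, axis=1)):
        x[i * hop:i * hop + n_fft] += frame
    return x
```

With p = 1 this performs magnitude subtraction, and with p = 2 power subtraction, as on the preceding slides.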
ICASSP'10 tutorial 23
Flow diagram
[Block diagram: noisy speech y(n) → FFT → |·|ᵖ → subtraction of the noise estimate |D̂(ω)|ᵖ (from noise estimation/update) → |·|^(1/p) → combined with the phase of Y(ω) → IFFT → enhanced speech]
ICASSP'10 tutorial 24
Effects of half-wave rectification
[Figure: (a) noisy spectrum |Y(f)| and the noise spectrum estimated at frame t; (b)–(d) subtracted spectra |X̂(f)| at frames t, t+1, and t+2, over 0–4000 Hz]
ICASSP'10 tutorial 25
Musical noise
[Spectrogram, 0–5 kHz over about 1.8 s: isolated peaks cause musical noise]
ICASSP'10 tutorial 26
Over-subtraction to reduce musical noise
• By over-subtracting the noise spectrum, we can reduce the amplitude of isolated peaks and in some cases eliminate them altogether. This by itself, however, is not sufficient, because the deep valleys surrounding the peaks still remain in the spectrum
• For that reason, spectral flooring is used to “fill in” the spectral valleys:
• α is the over-subtraction factor (α > 1), and β is the spectral floor parameter (β < 1)
|X̂(ω)|² = |Y(ω)|² − α|D̂(ω)|² if |Y(ω)|² − α|D̂(ω)|² > β|D̂(ω)|², and β|D̂(ω)|² otherwise
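A minimal sketch of this rule; the default α and β simply mirror values from the demo on the next slide.

```python
import numpy as np

def over_subtract(Y_pow, D_pow, alpha=3.0, beta=0.1):
    """Power-domain over-subtraction with spectral flooring.

    Y_pow: noisy power spectrum |Y|^2; D_pow: estimated noise power |D|^2;
    alpha: over-subtraction factor (> 1); beta: spectral floor (< 1).
    """
    X_pow = Y_pow - alpha * D_pow
    floor = beta * D_pow
    # Fill spectral valleys with a scaled noise floor instead of zeros,
    # which suppresses the isolated peaks that cause musical noise.
    return np.where(X_pow > floor, X_pow, floor)
```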
ICASSP'10 tutorial 27
Effects of parameters: Sound demo
• Half-wave rectification: α = 1, β = 0
• α = 3, β = 0
• α = 8, β = 0
• α = 8, β = 0.1
• α = 8, β = 1
• α = 15, β = 0
• Noisy sentence (+5 dB SNR)
• Original (clean) sentence
ICASSP'10 tutorial 28
Wiener filter
• Aim: to find the optimal filter that minimizes the mean square error between the desired signal (clean signal) and the estimated output
• Input to this filter: noisy speech
• Output of this filter: enhanced speech
ICASSP'10 tutorial 29
Wiener filter in frequency domain
• Wiener filter for noise reduction:
X̂(ω) = H(ω)Y(ω)
• H(ω) denotes the filter
• Minimizing the mean square error between the filtered noisy speech and the clean speech leads to, for frequency ωk:
H(ωk) = Pxx(ωk) / [Pxx(ωk) + Pdd(ωk)]
• Pxx(ωk): power spectrum of x(n)
• Pdd(ωk): power spectrum of d(n)
ICASSP'10 tutorial 30
Wiener filter in terms of a priori SNR
• Define the a priori SNR at frequency ωk:
ξk = Pxx(ωk) / Pdd(ωk)
• The Wiener filter becomes:
H(ωk) = ξk / (ξk + 1)
• More attenuation at lower SNR and less attenuation at higher SNR, as the sketch below illustrates
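A quick numerical check of this gain, assuming NumPy; the example SNRs are arbitrary.

```python
import numpy as np

def wiener_gain(xi):
    """H(w_k) = xi_k / (xi_k + 1), with xi_k = Pxx(w_k) / Pdd(w_k)."""
    return xi / (xi + 1.0)

# A priori SNRs of -10, 0, and +10 dB yield gains of roughly
# -20.8, -6.0, and -0.8 dB: strong attenuation only at low SNR.
xi = 10.0 ** (np.array([-10.0, 0.0, 10.0]) / 10.0)
print(20 * np.log10(wiener_gain(xi)))
```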
ICASSP'10 tutorial 31
Iterative Wiener filtering
• The optimal Wiener filter depends on the input signal power spectrum, which is not available. In practice, we can estimate the Wiener filter iteratively
• We can consider the following procedure at iteration i to estimate H(ω):
  • Step 1: obtain an estimate Hi(ω) of the Wiener filter based on the enhanced signal x̂i(n) obtained at iteration i
    – Initialize x̂0(n) with the noisy speech signal
  • Step 2: filter the noisy signal through the newly obtained Wiener filter according to
    X̂i+1(ωk) = Hi(ωk)Y(ωk)
    to get the new enhanced signal x̂i+1(n). Repeat the above procedure
ICASSP'10 tutorial 32
MMSE estimator
• The Wiener filter is the optimal (in the mean square error sense) complex spectrum estimator, not the optimal magnitude spectrum estimator
• Ephraim and Malah (1984) proposed an MMSE estimator that is the optimal magnitude spectrum estimator
• Unlike the Wiener estimator, the MMSE estimator does not require a linear model between the observed data and the estimator, but assumes probability distributions of the speech and noise DFT coefficients:
  • Fourier transform coefficients (real and imaginary parts) have a Gaussian probability distribution. The mean of the coefficients is zero, and the variances of the coefficients are time-varying due to the nonstationarity of speech
  • Fourier transform coefficients are statistically independent, and hence uncorrelated
ICASSP'10 tutorial 33
MMSE estimator (cont.)
• In the frequency domain: Y(ωk) = X(ωk) + D(ωk), or
Yk e^(jθy(k)) = Xk e^(jθx(k)) + Dk e^(jθd(k))
• The MMSE derivation leads to
X̂k = (√π/2) (√vk/γk) exp(−vk/2) [(1 + vk) I0(vk/2) + vk I1(vk/2)] Yk
• In(·) is the modified Bessel function of order n
• vk = [ξk/(1 + ξk)] γk
• a posteriori SNR: γk = Yk²/λd(k), where λd(k) = E{|D(ωk)|²}
ICASSP'10 tutorial 34
MMSE gain function
• MMSE spectral gain function:
G(ξk, γk) = X̂k/Yk = (√π/2) (√vk/γk) exp(−vk/2) [(1 + vk) I0(vk/2) + vk I1(vk/2)]
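A sketch of this gain in Python. It uses SciPy's exponentially scaled Bessel functions i0e and i1e, which fold the exp(−vk/2) factor into the Bessel terms and avoid overflow at large vk.

```python
import numpy as np
from scipy.special import i0e, i1e  # i0e(x) = exp(-x) * I0(x), etc.

def mmse_gain(xi, gamma):
    """MMSE spectral gain G(xi_k, gamma_k).

    xi: a priori SNR; gamma: a posteriori SNR (arrays over frequency).
    Since exp(-v/2) * I0(v/2) = i0e(v/2) for v > 0, the explicit
    exponential factor cancels against the scaled Bessel functions.
    """
    v = xi / (1.0 + xi) * gamma
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma) * (
        (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0))
```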
ICASSP'10 tutorial 35
[Figure: 20 log(gain) in dB versus instantaneous SNR γk − 1 in dB, for a priori SNR ξk = −15, −10, and 5 dB]
Gain for given a priori SNR
ICASSP'10 tutorial 36
[Figure: 20 log(gain) in dB versus a priori SNR ξk in dB, for a posteriori SNR γk − 1 = −20, 0, and 20 dB, with the Wiener gain for comparison]
Gain for given a posteriori SNR
ICASSP'10 tutorial 37
Estimating a priori SNR
• The suppression curves suggest that the a posteriori SNR has a small effect and that the a priori SNR is the main factor influencing suppression
• The a priori SNR can be estimated recursively (frame-wise) using the so-called “decision-directed” approach at frame m:
ξ̂k(m) = a |X̂k(m − 1)|²/λd(k, m − 1) + (1 − a) max[γk(m) − 1, 0]
• 0 < a < 1, and a = 0.98 is found to work well
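The update transcribes directly into code; X_prev (the previous frame's enhanced magnitudes) and noise_var (λd per frequency bin) are assumed to be tracked by the caller.

```python
import numpy as np

def decision_directed(X_prev, noise_var, gamma, a=0.98):
    """xi_hat(m) = a|X(m-1)|^2 / lambda_d(m-1) + (1-a) max(gamma(m)-1, 0)."""
    return (a * np.abs(X_prev) ** 2 / noise_var
            + (1.0 - a) * np.maximum(gamma - 1.0, 0.0))
```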
ICASSP'10 tutorial 38
Other remarks and sound demo
• It is noted that when the a priori SNR is estimated using the “decision-directed” approach, the enhanced speech has no “musical noise”
• A log-MMSE estimator also exists, which might be perceptually more meaningful
• Sound demo:
  • Noisy sentence (5 dB SNR):
• MMSE estimator:
• Log-MMSE estimator:
ICASSP'10 tutorial 39
Subspace-based algorithms
• This class of algorithms is based on singular value decomposition (SVD) or eigenvalue decomposition of either data matrices or covariance matrices
• The basic idea behind the SVD approach is that the singular vectors corresponding to the largest singular values contain speech information, while the remaining singular vectors contain noise information
• Noise reduction is therefore accomplished by discarding the singular vectors corresponding to the smallest singular values
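A toy illustration of this idea on a Hankel data matrix built from one signal frame; the matrix shape and retained rank k are arbitrary choices here, and practical subspace algorithms apply noise-dependent gain shaping rather than the hard truncation shown.

```python
import numpy as np

def svd_denoise(y, rows=64, k=12):
    """Rank-k SVD reconstruction of a Hankel matrix built from frame y."""
    cols = len(y) - rows + 1
    H = np.stack([y[i:i + cols] for i in range(rows)])  # H[i, j] = y[i + j]
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    # Keep the singular vectors with the largest singular values (speech
    # subspace); discard the rest (noise subspace).
    H_hat = (U[:, :k] * s[:k]) @ Vt[:k]
    # Average along anti-diagonals to restore Hankel structure and
    # return to a 1-D signal.
    x = np.zeros(len(y))
    counts = np.zeros(len(y))
    for i in range(rows):
        x[i:i + cols] += H_hat[i]
        counts[i:i + cols] += 1
    return x / counts
```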
ICASSP'10 tutorial 40
Subjective evaluations
• In terms of speech quality, a subset of algorithms improves overall quality in a few conditions relative to the unprocessed condition. No algorithm produces an improvement in multitalker babble
• In terms of intelligibility, no algorithm produces significant improvement over unprocessed noisy speech
ICASSP'10 tutorial 41
Interim summary on speech enhancement
• Algorithms are derived analytically
  • Optimization theory
• Noise estimation is key
  • These algorithms are particularly needed for highly non-stationary environments
• Speech enhancement algorithms cannot deal with multitalker mixtures
• Inability to improve speech intelligibility
ICASSP'10 tutorial 42
Part IV. CASA-based speech segregation
Fundamentals of CASA for monaural mixtures
CASA for speech segregation
• Feature-based algorithms
• Model-based algorithms
ICASSP'10 tutorial 43
Cochleagram: Auditory spectrogram
Spectrogram
• Plot of log energy across time and frequency (linear frequency scale)
Cochleagram
• Cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, realized by either a hair cell model or simple compression operations (log and cube root)
• Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
• A waveform signal can be constructed (inverted) from a cochleagram
[Figure panels: spectrogram; cochleagram]
ICASSP'10 tutorial 44
Neural autocorrelation for pitch perception
Licklider (1951)
ICASSP'10 tutorial 45
Correlogram
• Short-term autocorrelation of the output of each frequency channel of the cochleagram
• Peaks in summary correlogram indicate pitch periods (F0)
• A standard model of pitch perception
Correlogram & summary correlogram of a vowel with F0 of 100 Hz
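A minimal sketch of the computation, assuming the cochleagram channel responses are available as rows of a NumPy array; the window length (twice the maximum lag) is an illustrative choice.

```python
import numpy as np

def correlogram(channels, start, max_lag):
    """Short-term autocorrelation of each channel's response.

    channels: 2-D array, one filter output per row; start: first sample
    of the analysis window; max_lag: number of lags to evaluate.
    """
    acf = np.zeros((channels.shape[0], max_lag))
    for c, r in enumerate(channels):
        seg = r[start:start + 2 * max_lag]
        for lag in range(max_lag):
            acf[c, lag] = np.dot(seg[:max_lag], seg[lag:lag + max_lag])
    return acf

# The summary correlogram is acf.sum(axis=0); for a 100 Hz vowel its
# largest non-zero-lag peak falls at the 10 ms pitch period.
```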
ICASSP'10 tutorial 46
Onset and offset detection
• An onset (offset) corresponds to a sudden intensity increase (decrease), which can be detected by taking the time derivative of the intensity
• To reduce intensity fluctuations, Gaussian smoothing (low-pass filtering) is typically applied (as in edge detection for image analysis):
G(t, σ) = 1/(√(2π)σ) exp(−t²/(2σ²))
• Note that (s(t) ⊛ G(t, σ))′ = s(t) ⊛ G′(t, σ), where s(t) denotes intensity and
G′(t, σ) = −t/(√(2π)σ³) exp(−t²/(2σ²))
ICASSP'10 tutorial 47
Onset and offset detection (cont.)
• Hence onset and offset detection is a three-step procedure (see the sketch below):
  • Convolve the intensity s(t) with G′ to obtain O(t)
  • Identify the peaks and the valleys of O(t)
  • Onsets are those peaks above a certain threshold, and offsets are those valleys below a certain threshold
[Figure: detected onsets and offsets]
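A sketch of the three steps; σ and the thresholds are illustrative assumptions.

```python
import numpy as np

def onsets_offsets(s, sigma=8.0, threshold=0.05):
    """Convolve intensity s(t) with G'(t, sigma), then pick peaks/valleys."""
    t = np.arange(-4 * sigma, 4 * sigma + 1)
    dG = -t / (np.sqrt(2 * np.pi) * sigma ** 3) * np.exp(-t ** 2 / (2 * sigma ** 2))
    o = np.convolve(s, dG, mode="same")  # smoothed time derivative O(t)
    # Local maxima above threshold are onsets; local minima below
    # -threshold are offsets.
    peaks = (o[1:-1] > o[:-2]) & (o[1:-1] > o[2:]) & (o[1:-1] > threshold)
    valleys = (o[1:-1] < o[:-2]) & (o[1:-1] < o[2:]) & (o[1:-1] < -threshold)
    return np.where(peaks)[0] + 1, np.where(valleys)[0] + 1
```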
ICASSP'10 tutorial 48
Segmentation versus grouping
• Mirroring Bregman’s two-stage conceptual model, a CASA model generally consists of a segmentation stage and a subsequent grouping stage
• Segmentation stage decomposes an acoustic scene into a collection of segments, each of which is a contiguous region in the cochleagram with energy primarily from one source
  • Based on cross-channel correlation, which encodes the correlated responses (temporal fine structure) of adjacent filter channels, and on temporal continuity
  • Based on onset and offset analysis
• Grouping aggregates segments into streams based on various ASA cues
ICASSP'10 tutorial 49
Cross-channel correlation for segmentation
[Figure: correlogram and cross-channel correlation for a mixture of speech and trill telephone (frequency axis 80–5000 Hz, 0.0–1.5 s); segments generated based on cross-channel correlation and temporal continuity]
ICASSP'10 tutorial 50
Ideal binary mask
• A main CASA goal is to retain the parts of a mixture where the target sound is stronger than the acoustic background (i.e. where the target masks the interference), and discard the other parts (Hu & Wang, 2001; 2004)
  • What counts as the target depends on intention, attention, etc.
• In other words, the goal is to identify the ideal binary mask (IBM), which is 1 for a time-frequency (T-F) unit if the SNR within the unit exceeds a threshold, and 0 otherwise
  • It does not actually separate the mixture!
  • More discussion on the IBM in Part V
ICASSP'10 tutorial 51
Ideal binary mask illustration
ICASSP'10 tutorial 52
CASA for speech segregation
Monaural CASA systems for speech segregation are based on harmonicity, onset/offset, AM/FM, and trained models (Weintraub, 1985; Brown & Cooke, 1994; Ellis, 1996; Hu & Wang, 2004; Barker et al., 2005; Radfar et al., 2007)
The Hu-Wang and Barker et al. models will be explained as representatives of feature-based and model-based methods, respectively
ICASSP'10 tutorial 53
CASA system architecture
Typical architecture of CASA systems
ICASSP'10 tutorial 54
Voiced speech segregation
For voiced speech, lower harmonics are resolved while higher harmonics are not
For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech
A voiced speech segregation model by Hu and Wang (2004) applies different grouping mechanisms to low-frequency and high-frequency signals:
• Low-frequency signals are grouped based on periodicity and temporal continuity
• High-frequency signals are grouped based on amplitude modulation and temporal continuity
ICASSP'10 tutorial 55
Pitch tracking
Pitch periods of target speech are estimated from an initially segregated speech stream, based on the dominant pitch within each frame
Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints:
• Target pitch should agree with the periodicity of the T-F units in the initial speech stream
• Pitch periods change smoothly, thus allowing for verification and interpolation
ICASSP'10 tutorial 56
Pitch tracking example
(a) Dominant pitch (Line: pitch track of clean speech) for a mixture of target speech and ‘cocktail-party’ intrusion
(b) Estimated target pitch
ICASSP'10 tutorial 57
T-F unit labeling
In the low-frequency range:
• A T-F unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch
In the high-frequency range:
• Due to their wide bandwidths, high-frequency filters respond to multiple harmonics. These responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863)
• A T-F unit in the high-frequency range is labeled by comparing its AM rate with the estimated target pitch
ICASSP'10 tutorial 58
Amplitude modulation illustration
(a) The output of a gammatone filter (center frequency: 2.6 kHz) in response to clean speech
(b) The corresponding autocorrelation function
ICASSP'10 tutorial 59
Final segregation
New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e. common AM). Then they are grouped into the foreground stream according to AM rates
Other units are grouped according to temporal and spectral continuity
ICASSP'10 tutorial 60
Voiced speech segregation example
ICASSP'10 tutorial 61
Model-based speech segregation
Basic observations
• Pure primitive CASA may not be sufficient for speech separation, and listeners use top-down information
• Pure schema-based CASA needs models for all sources and ignores organization in the input
Barker, Cooke, and Ellis (2005) proposed a model-based approach attempting to integrate primitive and schema-based processes in CASA
• Key idea: use speech recognition to group the segments (fragments) generated by primitive organization
ICASSP'10 tutorial 62
CASA recognition
• By extending the automatic speech recognition (ASR) framework, the goal of CASA recognition is to find the word sequence and the speech/background separation that jointly have the maximum a posteriori probability given Y
• Y represents noisy speech and X clean speech
ICASSP'10 tutorial 63
Introducing speech acoustics
since W is independent of S and Y given X
Here, P(X) cannot be dropped from the integral
ICASSP'10 tutorial 64
CASA recognition with HMM modeling
For HMM recognition, a hidden state sequence is introduced, leading to a formulation with the following components:
[Diagram: language model (bigrams, dictionary); acoustic model (schemas); segregation model (primitive grouping); search algorithm (modified decoder); segregation weight (connection to observations)]
ICASSP'10 tutorial 65
Acoustic model and segregation weighting
• It is impractical to evaluate the acoustic model and the segregation weighting term over the entire search space of X. Simplifying assumptions must be made
• Segregation weighting is determined by primitive organization to make the search feasible
ICASSP'10 tutorial 66
Experimental evaluation
ICASSP'10 tutorial 67
Results
• Factory noise: stationary background, hammer blows, machine noise, etc.
• MFCC + CMN: standard ASR approach
• Missing data recognition focuses on reliable T-F regions
• The model-based approach provides about 1.5 dB improvement over missing data recognition at low SNR levels
  • No significant CASA is done
ICASSP'10 tutorial 68
Interim summary on CASA
• Main progress has occurred in voiced speech segregation based on pitch tracking and model-based grouping
  • Recent work has started to address unvoiced speech separation (e.g. Hu & Wang, 2008)
• Integration of feature-based and model-based segregation is promising
  • Substantial effort attempts to incorporate speaker models into segregation (e.g. Shao & Wang, 2006)
ICASSP'10 tutorial 69
Part V. Segregation as binary classification
What is the goal of speech segregation?
• Ideal binary mask
• Speech intelligibility tests
Classification approach to speech segregation
• An MLP-based algorithm to separate reverberant speech
• A GMM-based algorithm to improve intelligibility
ICASSP'10 tutorial 70
What is the goal of speech segregation?
From the perspective of perceptual information processing, the analysis of the computational goal is critically important (Marr, 1982)
Computational-theory analysis:
• What is the goal of audition?
  • By analogy to vision (Marr, 1982), the purpose of audition is to produce an auditory description of the environment for the listener
• What is the goal of CASA?
  • The goal of ASA is to segregate sound mixtures into separate perceptual representations (or auditory streams), each of which corresponds to an acoustic event (Bregman, 1990)
  • By extrapolation, the goal of CASA is to develop computational systems that extract individual streams from sound mixtures
ICASSP'10 tutorial 71
Computational-theory analysis of ASA
To form a stream, a sound must be audible on its own
The number of streams that can be computed at a time is limited
• Magical number 4 for simple sounds such as tones and vowels (Cowan, 2001)?
• 1+1, or figure-ground segregation, in a noisy environment such as a cocktail party?
Auditory masking further constrains the ASA output
• Within a critical band, a stronger signal masks a weaker one
ICASSP'10 tutorial 72
Computational-theory analysis of ASA (cont.)
ASA outcome depends on sound types (overall SNR is 0 dB)
• Noise-Noise: pink, white, pink+white
• Speech-Speech:
• Noise-Speech:
• Tone-Speech:
ICASSP'10 tutorial 73
Some alternative goals
Extract all underlying sound sources, or the target sound source (the gold standard)
• Implicit in speech enhancement and spatial filtering
• Segregating all sources is implausible, and probably unrealistic with one or two microphones
Enhance ASR
• Close coupling with a primary motivation of speech segregation
• Perceiving is more than recognizing (Treisman, 1999)
Enhance human listening
• Advantage: close coupling with auditory perception
• There are applications that involve no human listening
ICASSP'10 tutorial 74
Ideal binary mask as CASA goal
• As mentioned in Part IV, the ideal binary mask has been suggested as a main goal of CASA
• Definition:
IBM(t, f) = 1 if 10 log₁₀[s(t, f)/n(t, f)] > θ, and 0 otherwise
• s(t, f): target energy in unit (t, f)
• n(t, f): noise energy
• θ: a local SNR criterion (LC) in dB, typically chosen to be 0 dB
ICASSP'10 tutorial 75
Properties of IBM
The IBM notion is consistent with the computational-theory analysis of ASA
• Audibility and capacity
• Auditory masking
• Effects of target and noise types (spectral overlap)
Optimality: under certain conditions, the IBM with θ = 0 dB is the optimal binary mask from the perspective of SNR gain (Li & Wang, 2009)
ICASSP'10 tutorial 76
Subject tests of IBM
• Recent studies have found large speech intelligibility improvements from applying ideal binary masking, for normal-hearing (Brungart et al., 2006; Li & Loizou, 2008) and hearing-impaired (Anzalone et al., 2006; Wang et al., 2009) listeners
  • The improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners, and above 9 dB for hearing-impaired (HI) listeners
  • The improvement for modulated noise is significantly larger than for stationary noise
ICASSP'10 tutorial 77
Test conditions of Wang et al.’09
SSN: unprocessed monaural mixtures of speech-shaped noise (SSN) and Dantale II sentences (0 dB; −10 dB)
CAFÉ: unprocessed monaural mixtures of cafeteria noise (CAFÉ) and Dantale II sentences (0 dB; −10 dB)
SSN-IBM: IBM applied to SSN (0 dB; −10 dB; −20 dB)
CAFÉ-IBM: IBM applied to CAFÉ (0 dB; −10 dB; −20 dB)
Intelligibility results are measured in terms of SRT
ICASSP'10 tutorial 78
Intelligibility results
12 NH subjects (10 male, 2 female) and 12 HI subjects (9 male, 3 female)
• SRT means for the 4 conditions (SSN, CAFE, SSN-IBM, CAFE-IBM) for NH listeners: (−8.2, −10.3, −15.6, −20.7) dB
• SRT means for the 4 conditions for HI listeners: (−5.6, −3.8, −14.8, −19.4) dB
[Figure: Dantale II SRT (dB) for NH and HI listeners in the SSN, CAFE, SSN-IBM, and CAFE-IBM conditions]
ICASSP'10 tutorial 79
Speech perception of noise with binary gains
Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is −∞ dB (i.e. the mixture contains noise only, with no target speech)
[Figure: four panels of noise gated by binary masks; axes: time (0–2 s) and center frequency (55–7743 Hz) or channel number (2–32); intensity scale 0–96 dB]
ICASSP'10 tutorial 80
IBM-gated noise results
Despite a great reduction of spectrotemporal information, a pattern of binary gains is apparently sufficient for human speech recognition
Mean numbers for the 4 conditions: (97.1%, 92.9%, 54.3%, 7.6%)
[Figure: percent correct versus number of channels (4, 8, 16, 32)]
ICASSP'10 tutorial 81
Main points
Speech intelligibility results support the IBM as an appropriate goal of CASA in general, and of speech segregation in particular
Hence solving the speech segregation problem would amount to binary classification
• This is a strong claim with major consequences
• This formulation opens the problem to a variety of pattern classification methods
ICASSP'10 tutorial 82
Segregation of reverberant speech
Room reverberation poses an additional challenge to the speech segregation problem
• Segregation performance drops significantly in reverberation
• A common approach to dealing with reverberation is inverse filtering, which is sensitive to different room configurations
Jin and Wang (2009) proposed a model to segregate reverberant voiced speech based on classification
ICASSP'10 tutorial 83
System overview
A multilayer perceptron (MLP) is trained for each frequency channel in order to label T-F units as target or interference dominant
ICASSP'10 tutorial 84
Feature extraction
A 6-dimensional pitch-based feature vector is constructed for each T-F unit (channel c, frame m):
y(c, m) = ( A(c, m, τm), [f̄(c, m)τm], |f̄(c, m)τm − [f̄(c, m)τm]|, AE(c, m, τm), [f̄E(c, m)τm], |f̄E(c, m)τm − [f̄E(c, m)τm]| )
• τm: given pitch period at frame m; A(c, m, τm): autocorrelation function
• f̄(c, m): average instantaneous frequency
• [·]: nearest integer, indicating the harmonic number
• |·|: deviation from the nearest harmonic (subscript E for envelope)
ICASSP'10 tutorial 85
MLP learning
The objective function is to maximize SNR:
J = Σc,m [dc(m) − ac(m)]² Ec(m) / Σc,m Ec(m)
• d: desired output; a: actual output; E: energy
This is a generalized mean-square-error criterion – each squared error is weighted by the normalized energy (i.e. cost-sensitive learning)
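Written as a loss to minimize, the criterion is a few lines of NumPy; desired, actual, and energy are arrays over all T-F units (c, m).

```python
import numpy as np

def weighted_mse(desired, actual, energy):
    """Energy-weighted squared error over all T-F units (c, m)."""
    return np.sum((desired - actual) ** 2 * energy) / np.sum(energy)
```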
ICASSP'10 tutorial 86
MLP based unit labeling
The MLP encodes the posterior probability of a T-F unit being target dominant given the observed features
Labeling criterion:
P(H1 | y(c, m)) > 1/2
• H1: hypothesis that the unit is dominated by the target
ICASSP'10 tutorial 87
Segmentation and grouping
Segmentation is based on cross-channel correlation at low frequencies, and onset/offset analysis at high frequencies
Label each segment based on the labels of its T-F units
Units not in any segment are grouped according to temporal and spectral continuity
[Figure: labeled segments in the T-F domain; frequency 50–8000 Hz, time 0.2–1.6 s]
ICASSP'10 tutorial 88
Segregation results
Train on different room configurations:
• Case 1: train on one configuration and test on others in the same room (same reverberation time)
• Case 2: train on one configuration from each of the 6 rooms
• Case 3: train on room 6
Generalizes well to different utterances and different speakers
ICASSP'10 tutorial 89
Discussion and demo
MLP-based classification leads to substantial SNR gain
• It generalizes well to different utterances and different speakers
• Limitations: known pitch, and voiced speech only
See the Jin and Wang lecture on Tuesday afternoon (SP-L2)
Demo: two speech utterances (trained just in Room 6 with T60 = 0.6 s)
[Sound demo table: rows T60 = 0.0, 0.3, 0.6 s; columns Original, Hu-Wang, Inverse filter, Jin-Wang'09, Ideal binary mask]
ICASSP'10 tutorial 90
GMM-based classification
Instead of treating voiced speech and unvoiced speech separately, a simpler approach is to perform classification on noisy speech regardless of voicing
A classification model by Kim, Lu, Hu, and Loizou (2009) deals with speech segregation in a speaker- and masker-dependent way:
• AM spectrum (AMS) features are used
• Classification is based on Gaussian mixture models (GMMs)
• Speech intelligibility evaluation is performed with normal-hearing listeners
ICASSP'10 tutorial 91
Diagram of Kim et al.’s model
ICASSP'10 tutorial 92
Feature extraction and GMM
Peripheral analysis is done by a 25-channel mel-frequency filter bank
A 15-dimensional AMS feature vector is extracted within each T-F unit
• This vector is then concatenated with the delta vectors over time and frequency to form a 45-dimensional feature vector for each unit
One GMM (λ1) models target-dominant T-F units and another GMM (λ0) models noise-dominant units
• Each GMM has 256 Gaussian components
ICASSP'10 tutorial 93
Training and classification
To improve efficiency, each GMM is subdivided into two models during training: one for relatively high local SNRs and one for relatively low SNRs
With the 4 trained GMMs, segregation comes down to Bayesian classification, with the prior probabilities P(λ0) and P(λ1) estimated from the training data (see the sketch below)
The training and test data are mixtures of IEEE sentences and 3 masking noises: babble, factory, and speech-shaped noise
• Separate GMMs are trained for each speaker (a male and a female) and each masker
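A hedged sketch of the Bayesian decision with two GMMs, using scikit-learn in place of whatever training code the authors used; the precomputed 45-D AMS features and the two-model (rather than four-model) setup are simplifying assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_unit_classifier(feats_target, feats_noise, n_components=256):
    """Fit one GMM to target-dominant units and one to noise-dominant units."""
    lam1 = GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(feats_target)
    lam0 = GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(feats_noise)
    # Estimate the class prior from the training-data proportions.
    prior1 = len(feats_target) / (len(feats_target) + len(feats_noise))
    return lam0, lam1, prior1

def classify_units(feats, lam0, lam1, prior1):
    """Label a unit 1 when the lam1 posterior exceeds the lam0 posterior."""
    log_post1 = lam1.score_samples(feats) + np.log(prior1)
    log_post0 = lam0.score_samples(feats) + np.log(1.0 - prior1)
    return (log_post1 > log_post0).astype(int)
```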
ICASSP'10 tutorial 94
A separation example
Target utterance
-5 dB mixture with babble
Estimated mask
Masked mixture
ICASSP'10 tutorial 95
Sound demo: clean; 0-dB mixture with babble; segregated
Intelligibility results and demo
UN: unprocessed
IdBM: ideal binary mask
sGMM: trained on a single noise
mGMM: trained on multiple noises
ICASSP'10 tutorial 96
Discussion
The GMM classifier achieves a hit rate (target-dominant units correctly classified) higher than 80% in most cases, while keeping the false-alarm (FA) rate relatively low
• As expected, mGMM results are worse than sGMM results
• HIT − FA correlates well with intelligibility
This is the first monaural separation algorithm to achieve significant speech intelligibility improvements
The main limitation is speaker and masker dependency
• Also, AMS features would not be applicable to multitalker mixtures
ICASSP'10 tutorial 97
Concluding remarks
Monaural speech segregation is of fundamental importance
• Compared to beamforming, it does not assume configuration stationarity, and is hence more versatile in application
• It may hold the key to robustness to room reverberation
Speech enhancement algorithms are based on statistical analysis of general signal properties
• For example, the uncorrelatedness of speech and noise
CASA-based speech segregation is based on analysis of perceptual and speech properties
• For example, heavy use of pitch and model training
ICASSP'10 tutorial 98
Concluding remarks (cont.)
Speech enhancement aims to increase the SNR of noisy speech
• Reasonable for improving speech quality, but unsuccessful at improving speech intelligibility
Binary masking is a key concept of CASA, and has led to the new formulation of segregation as binary classification
• Classification-based segregation shows promising results for improving speech intelligibility
ICASSP'10 tutorial 99
Remarks on SNR and intelligibility
SNR measures signal similarity to clean speech
Recent intelligibility studies provide evidence that lower SNR can result in higher intelligibility!
• LC = 0 dB is best for SNR gain (Li & Wang, 2009)
• However, intelligibility results in Brungart et al. (2006), Wang et al. (2009), and Kim et al. (2009) indicate that negative LC values produce higher intelligibility
• Hence the pursuit of increased SNR could be wrongheaded
It is important to identify appropriate computational goals, and to design evaluation metrics accordingly
• This way, one can avoid spending effort in the wrong directions
ICASSP'10 tutorial 100
Review of presentation
I. Introduction: Speech segregation problem
II. Auditory scene analysis (ASA)
III. Speech enhancement
IV. Speech segregation by computational auditory scene analysis (CASA)
V. Segregation as binary classification
VI. Concluding remarks
ICASSP'10 tutorial 101
Further information
This tutorial is partly drawn from the following two books (Loizou, 2007; Wang & Brown, 2006)
ICASSP'10 tutorial 102
Bibliography
Anzalone et al. (2006) Ear & Hearing 27: 480-492.
Arbib (1989) The Metaphorical Brain 2. Wiley.
Barker, Cooke, & Ellis (2005) Speech Comm. 45: 5-25.
Bregman (1990) Auditory Scene Analysis. MIT Press.
Bronkhorst & Plomp (1992) JASA 92: 3132-3139.
Brown & Cooke (1994) Comp. Speech & Lang. 8: 297-336.
Brungart et al. (2006) JASA 120: 4007-4018.
Cherry (1957) On Human Communication. Wiley.
Cowan (2001) Beh. & Brain Sci. 24: 87-185.
Ellis (1996) PhD thesis, MIT.
Ephraim & Malah (1984) IEEE T-ASSP 32: 1109-1121.
Helmholtz (1863) On the Sensations of Tone. Dover.
Hu & Wang (2001) WASPAA.
Hu & Wang (2004) IEEE T-NN 15: 1135-1150.
Hu & Wang (2008) JASA 124: 1306-1319.
Jin & Wang (2009) IEEE T-ASLP 17: 625-638.
Kim et al. (2009) JASA 126: 1486-1494.
Li N & Loizou (2008) JASA 123: 1673-1682.
Li Y & Wang (2009) Speech Comm. 51: 230-239.
Licklider (1951) Experientia 7: 128-134.
Loizou (2007) Speech Enhancement: Theory and Practice. CRC.
Marr (1982) Vision. Freeman.
Miller, Heise, & Lichten (1951) JEP 41: 329-335.
Radfar, Dansereau, & Sayadiyan (2007) EURASIP J-ASMP: Article 84186.
Shao & Wang (2006) IEEE T-ASLP 14: 289-298.
Steeneken (1992) PhD thesis, Univ. of Amsterdam.
Treisman (1999) Neuron 24: 105-110.
Wang & Brown, Ed. (2006) Computational Auditory Scene Analysis. Wiley & IEEE Press.
Wang et al. (2008) JASA 124: 2303-2307.
Wang et al. (2009) JASA 125: 2336-2347.
Weintraub (1985) PhD thesis, Stanford.
ICASSP'10 tutorial 103
Resources and acknowledgments
Source programs for some studies referred to in this tutorial are available at OSU Perception & Neurodynamics Lab’s website
http://www.cse.ohio-state.edu/pnl
Bregman and Ahad (1995) produced a CD that accompanies Bregman’s 1990 book, which can be ordered from MIT Press
Houtsma, Rossing, and Wagenaars (1987) produced a CD that demonstrates basic auditory perception phenomena, which can be ordered from the Acoustical Society of America
Thanks to M. Cooke and P. Loizou for making available their slides, some of which are incorporated in this tutorial