Source: labrosa.ee.columbia.edu/Montreal2004/talks/deliang2.pdf

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues

DeLiang Wang

Perception & Neurodynamics Lab

The Ohio State University

Outline of presentation

• Introduction
• Human performance
• Reverberation effects
  • On pitch
  • On onset/offset
  • On binaural cues
• Monaural enhancement of reverberant signal
• Binaural segregation of reverberant signal
• Discussion and summary

Reverberation as linear transmission system

[Figure: room impulse response vs. time (ms), with early reflections and late reflections marked]

$$x(t) = \int_{-\infty}^{\infty} h(\tau)\, s(t - \tau)\, d\tau$$

x(t): reverberant signal; s(t): source signal; h(τ): room impulse response function
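The convolution model above can be sketched in discrete time. This is a minimal illustration, not a measured room: the impulse response is assumed to be white noise with an exponential decay matched to a chosen T60, and the names `synthetic_rir` and `reverberate` are hypothetical.

```python
import numpy as np

def synthetic_rir(t60, fs, seed=0):
    """Toy room impulse response (an assumption, not a measured room):
    white noise whose energy decays by 60 dB after t60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(round(t60 * fs))
    t = np.arange(n) / fs
    # Amplitude envelope 10^(-3 t / t60) corresponds to -60 dB energy at t = t60.
    h = rng.standard_normal(n) * 10.0 ** (-3.0 * t / t60)
    h[0] = 1.0  # direct path
    return h

def reverberate(s, h):
    """Discrete counterpart of x(t) = integral of h(tau) s(t - tau) d tau."""
    return np.convolve(s, h)

fs = 16000
s = np.zeros(fs // 10)
s[0] = 1.0                       # unit impulse as the source signal
h = synthetic_rir(t60=0.3, fs=fs)
x = reverberate(s, h)            # for an impulse source, x reproduces h
```

Feeding a unit impulse through the model returns the impulse response itself, which is a quick sanity check on the convolution.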

Reverberation and speech quality

Room reverberation causes two distinct perceptual effects on speech quality

• Early reflections lead to coloration or spectral deviation, determined by the signal-to-reverberant energy ratio; they also boost loudness
• Late reflections (long-term reverberation) smear the time-frequency components of speech, and are characterized by the reverberation time (T60)
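Since late reverberation is characterized by T60, it may help to see how T60 can be read off an impulse response. The sketch below uses Schroeder backward integration with a line fit between -5 dB and -25 dB; that fitting range and the function names are assumptions for illustration, and the impulse response is an idealized exponential decay rather than a measured one.

```python
import numpy as np

def energy_decay_curve_db(h):
    """Schroeder backward integration of h^2, normalized to 0 dB at t = 0."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])

def estimate_t60(h, fs, start_db=-5.0, stop_db=-25.0):
    """Fit a line to the decay curve between start_db and stop_db
    and extrapolate it to a 60 dB drop."""
    edc = energy_decay_curve_db(h)
    idx = np.where((edc <= start_db) & (edc >= stop_db))[0]
    slope, _ = np.polyfit(idx / fs, edc[idx], 1)  # dB per second, negative
    return -60.0 / slope

fs = 16000
t = np.arange(fs) / fs                 # one second of response
h = 10.0 ** (-3.0 * t / 0.3)           # idealized decay with T60 = 0.3 s
t60_hat = estimate_t60(h, fs)
```

For this idealized decay the estimate should land very close to the nominal 0.3 s; real responses have noise floors and non-exponential decays that make the fitting range matter.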

Human performance

• Though speech perception in quiet seems robust to reverberation, speech intelligibility in noise suffers in the presence of reverberation (Plomp’76; Culling et al.’03)
  • Culling et al. showed that reverberation (T60 = 0.4 s) produces a 5 dB increase in speech reception threshold when naturally intonated speech is presented together with a competing talker
• Hearing-impaired listeners are particularly susceptible to reverberation
• The binaural advantage for speech perception in noise is diminished by reverberation
  • The Culling et al. study found no advantage at all

Culling et al. (2003)

Human performance

• Darwin and Hukin (2000) compared reverberation effects on spatial, pitch, and vocal-tract size cues for sequential organization and found that
  • ITD cues are seriously impaired by reverberation
  • Pitch cues (F0 trajectory) are more resistant
  • A combination of pitch and vocal-tract size cues is very resistant to reverberation

Pitch tracking of a single utterance

[Figure: spectrograms (frequency, 0 to 5000 Hz, vs. time, 0 to 2.5 s) of a clean female utterance and the reverberant female utterance (T60 = 0.3 s), with pitch tracking (pitch as time lag vs. time in sec) for the clean and reverberant conditions]

[Figure: the same panels for a clean male utterance and the reverberant male utterance (T60 = 0.3 s)]

• Pitch is pretty robust to reverberation, especially for slowly changing pitch tracks and long voiced speech segments

• Noticeable artifacts: elongated pitch tracks
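The pitch tracks on this slide are expressed as time lags. As a rough illustration of what a lag-domain pitch estimate is, the sketch below picks the largest autocorrelation peak within a plausible pitch range for one frame. It is a generic autocorrelation estimator, not the tracker used for the figures, and the search range (80 to 400 Hz) and test signal are assumptions.

```python
import numpy as np

def pitch_lag(frame, fs, f_min=80.0, f_max=400.0):
    """Lag (in samples) of the largest autocorrelation value
    inside the lag range implied by [f_min, f_max]."""
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs / f_max)
    hi = int(fs / f_min)
    return lo + int(np.argmax(acf[lo:hi + 1]))

fs = 16000
f0 = 200.0                                   # hypothetical pitch
n = np.arange(1040)                          # 65 ms frame, 13 pitch periods
frame = sum(np.cos(2.0 * np.pi * k * f0 * n / fs) for k in (1, 2, 3))
lag = pitch_lag(frame, fs)                   # pitch period in samples
f0_hat = fs / lag
```

A slowly changing pitch keeps this peak stable across frames, which is consistent with the robustness noted above for long voiced segments.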

Pitch tracking of two utterances

[Figure: spectrogram (frequency in Hz, 0 to 5000, vs. time, 0 to 2.5 s) of a reverberant mixture (T60 = 0.3 s), with pitch tracking (pitch as time lag vs. time in sec) showing one-source tracking and two-source tracking]

Multipitch tracking using the Wu et al. algorithm (2003). Even with multiple reverberant sources, pitch tracking works reasonably well

Reverberation effects on harmonic structure

• From Darwin and Hukin (2000). The utterance is “Could you please write the word bead down now.” T60 = 0.4 s

• Primarily in the low-frequency range

Implications on pitch-based grouping

[Figure: pitch tracks (pitch as time lag vs. time in sec) for the clean and reverberant conditions, with histograms of selected peaks for the clean condition and for T60 = 0.3 s]

Smearing of harmonic structure is worse in the high-frequency range. The figure shows the histogram of peak positions that are nearest to the detected pitch periods for frequencies greater than 800 Hz. This smearing effect would degrade the performance of pitch-based grouping.

Reverberation effects on temporal envelope

[Figure: (a) smoothed temporal envelope of anechoic utterance and (b) smoothed temporal envelope of reverberant utterance; amplitude (dB) vs. time (s)]

• Response envelope of a gammatone filter centered near 1 kHz to the utterance “That noise problem grows more annoying each day.” (a) T60 = 0 and (b) T60 = 0.3 s

• Amplitude modulation (AM) depth is reduced, but the AM pattern is reasonably maintained
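The reduction in AM depth can be illustrated with a toy model. The sketch below assumes the intensity envelope of the reverberant signal is the clean intensity envelope smoothed by an exponential energy decay with T60 = 0.3 s; this is a simplification, not the gammatone analysis used for the figure, and the 4 Hz modulation rate is hypothetical.

```python
import numpy as np

def modulation_depth_db(intensity_env):
    """Peak-to-trough ratio of an intensity envelope, in dB."""
    return 10.0 * np.log10(np.max(intensity_env) / np.min(intensity_env))

fs_env = 1000                                     # envelope sampling rate (Hz)
t = np.arange(2 * fs_env) / fs_env
clean_env = 1.0 + 0.5 * np.cos(2.0 * np.pi * 4.0 * t)   # 4 Hz modulation

# Assumed energy decay of the room: -60 dB after t60 seconds.
t60 = 0.3
n_tail = int(round(t60 * fs_env))
tail = 10.0 ** (-6.0 * np.arange(n_tail) / (fs_env * t60))
tail /= tail.sum()

# Keep only the steady-state part, where the tail fully overlaps the envelope.
rev_env = np.convolve(clean_env, tail)[n_tail:len(clean_env)]

clean_depth = modulation_depth_db(clean_env)
rev_depth = modulation_depth_db(rev_env)
```

In this model the modulation rate is unchanged and only its depth shrinks, which matches the observation that the AM pattern is reasonably maintained.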

Onset and offset detection

[Figure: cochleograms (frequency in Hz, 50 to 8000, vs. time in s) of (a) the anechoic utterance and (b) the reverberant utterance]

Cochleogram representation. Red/black marks indicate detected onsets/offsets. The utterance: “That noise problem grows more annoying each day.”

Reverberation effects on onset/offset detection

• Both the times and strengths of onsets and offsets are affected
  • Onset times are slightly shifted
  • Onsets of weak phones (e.g. unvoiced stops) are smeared
  • Offset times are shifted forward (delayed)

• Reverberation introduces spurious offsets
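The asymmetry between onset and offset shifts can be reproduced with a toy example. The sketch below is a simplified threshold-based stand-in, not the detector used for the cochleogram figure: an event is taken to start and end where the intensity envelope crosses 10 dB below its peak, and reverberation is modeled as smoothing by an exponential energy decay. The threshold and the rectangular burst are assumptions.

```python
import numpy as np

def onset_offset(intensity_env, drop_db=10.0):
    """First and last sample whose level is within drop_db of the peak."""
    level_db = 10.0 * np.log10(intensity_env / np.max(intensity_env))
    above = np.where(level_db >= -drop_db)[0]
    return int(above[0]), int(above[-1])

fs_env = 1000                          # envelope sampling rate (Hz)
clean = np.full(1000, 1e-3)            # low-level floor
clean[200:600] = 1.0                   # one burst from 0.2 s to 0.6 s

# Assumed energy decay of the room: -60 dB after t60 seconds.
t60 = 0.3
n_tail = int(round(t60 * fs_env))
tail = 10.0 ** (-6.0 * np.arange(n_tail) / (fs_env * t60))
tail /= tail.sum()
rev = np.convolve(clean, tail)[:len(clean)]

clean_on, clean_off = onset_offset(clean)
rev_on, rev_off = onset_offset(rev)
onset_shift = rev_on - clean_on        # small
offset_shift = rev_off - clean_off     # much larger: the tail delays the offset
```

The onset moves by a few samples because the envelope rises quickly, while the decaying tail holds the level above threshold well after the burst ends.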

Reverberation effects on binaural cues: ITD

• Shinn-Cunningham and Kawakyu (2003) showed that the responses of a neural model to ITD (interaural time difference) are poor indicators of source azimuth in the presence of reverberation

• Integration over time enhances the estimation robustness

ITD estimation in time-frequency (T-F) units

[Figure: azimuth histograms (channel center frequency, 0 to 5000 Hz, vs. azimuth in degrees) for a target source at 45°, anechoic and T60 = 0.3 s; across-frequency integration (azimuth in degrees vs. time in sec) for the clean and T60 = 0.3 s conditions]

ITD estimation in individual T-F units using a cross-correlation model (Roman et al.’03). The input is natural speech. The distribution of local azimuth estimates is much noisier in the reverberant condition
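To make the cross-correlation idea concrete, the sketch below estimates ITD for one pair of left/right signals as the lag that maximizes their cross-correlation. It is a generic estimator under assumed conventions (positive lag means the right ear lags the left), not the specific model cited above; in a T-F analysis it would be applied per channel and frame.

```python
import numpy as np

def itd_samples(left, right, max_lag):
    """Lag in samples maximizing sum_n left[n] * right[n + lag].
    Positive values mean the right-ear signal is delayed."""
    lags = np.arange(-max_lag, max_lag + 1)
    # Restrict to samples where every shifted index stays inside the signal,
    # so the wrap-around of np.roll never enters the sum.
    core = slice(max_lag, len(left) - max_lag)
    scores = [np.dot(left[core], np.roll(right, -lag)[core]) for lag in lags]
    return int(lags[int(np.argmax(scores))])

fs = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(2000)
delay = 8                                        # hypothetical interaural delay
right = np.concatenate([np.zeros(delay), left[:-delay]])

lag_hat = itd_samples(left, right, max_lag=16)
itd_seconds = lag_hat / fs                       # 0.5 ms for this example
```

With reflections added to each ear, the correlation peak broadens and competing peaks appear, which is one way to picture the noisier local estimates in the reverberant condition.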

Interaural intensity difference estimation in T-F units

[Figure: IID histograms (channel center frequency, 0 to 5000 Hz, vs. IID in dB) for a target source at 45°, anechoic and T60 = 0.3 s; mean IID (dB) vs. channel center frequency (Hz) for one utterance, clean and reverberant]

The distribution of IID (interaural intensity difference) is also much noisier in reverberation, and the mean IID values lose their distinctive characteristics
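A minimal sketch of an IID computation for one T-F unit follows. The sign convention (right-ear energy over left-ear energy) is an assumption, and reverberation is crudely imitated by adding independent, equal-power noise to both ears, which is only a stand-in for diffuse reflected energy.

```python
import numpy as np

def iid_db(left, right):
    """Interaural intensity difference in dB (right relative to left)."""
    return 10.0 * np.log10(np.sum(right ** 2) / np.sum(left ** 2))

rng = np.random.default_rng(0)
left = rng.standard_normal(4000)
right = 2.0 * left                         # right ear 6 dB stronger

iid_clean = iid_db(left, right)            # about 6.02 dB

# Stand-in for diffuse reverberant energy: independent noise at each ear.
left_rev = left + rng.standard_normal(4000)
right_rev = right + rng.standard_normal(4000)
iid_rev = iid_db(left_rev, right_rev)      # pulled toward 0 dB
```

Energy that arrives equally at both ears compresses the IID toward zero, so a cue that separates azimuths well in anechoic conditions becomes less informative.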

A two-stage enhancement algorithm (Wu’03)

Identify an inverse filter to reduce coloration distortion by maximizing kurtosis of LPC residue (Gillespie et al.’01)

Estimate and subtract the effects of long-term reverberation

[Figure: clean speech (kurtosis = 12.2) and reverberant speech (kurtosis = 3.6), plotted against time (ms)]
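The kurtosis values quoted on this slide suggest why kurtosis is a useful objective: a peaky signal has high kurtosis, and reverberation pushes it toward the Gaussian value of 3. The sketch below illustrates this with a sparse impulse train standing in for an LPC residue and an assumed synthetic impulse response; the LPC analysis and the inverse-filter adaptation themselves are omitted.

```python
import numpy as np

def kurtosis(x):
    """Non-excess kurtosis: fourth central moment over squared variance
    (equals 3 for a Gaussian signal)."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return float(np.mean(d ** 4) / np.mean(d ** 2) ** 2)

fs = 16000
excitation = np.zeros(8000)
excitation[::100] = 1.0                  # sparse pulses, stand-in for a residue

# Assumed toy impulse response: decaying white noise with T60 = 0.3 s.
rng = np.random.default_rng(0)
n = int(round(0.3 * fs))
rir = rng.standard_normal(n) * 10.0 ** (-3.0 * np.arange(n) / n)
rir[0] = 1.0

reverberant = np.convolve(excitation, rir)

k_clean = kurtosis(excitation)           # large: energy sits in a few samples
k_rev = kurtosis(reverberant)            # much smaller after smearing
```

An inverse filter that raises the kurtosis of the residue is therefore moving it back toward the peaky, clean-speech shape.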

Results of Wu’s enhancement algorithm

[Figure: original speech, reverberant speech, inverse-filtered speech, and enhanced speech]

Binaural segregation of reverberant speech

Roman and Wang (2004) proposed a figure-ground segregation strategy to identify the T-F units dominated by target using spatial information, without imposing restrictions on the number, location or content of interfering sources

Basic idea

• First perform cancellation of reverberant target (with detected target location) using adaptive filtering
• Then label those T-F units that have been largely attenuated in the first stage, since they are more likely to originate from the target location
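The labeling step can be sketched as a threshold on per-unit attenuation. This is only an illustration of the idea: the energies are made-up numbers, the adaptive cancellation stage is assumed to have already run, and the 6 dB threshold and the name `attenuation_mask` are hypothetical.

```python
import numpy as np

def attenuation_mask(energy_in, energy_out, threshold_db=6.0):
    """Label a T-F unit 1 (target-dominated) when target cancellation
    attenuated it by more than threshold_db, else 0."""
    attenuation_db = 10.0 * np.log10(energy_in / energy_out)
    return (attenuation_db > threshold_db).astype(int)

# Rows: frequency channels, columns: time frames (illustrative values only).
energy_in = np.array([[1.0, 1.0, 1.0],
                      [2.0, 4.0, 0.5]])
energy_out = np.array([[0.1, 0.9, 0.5],
                       [0.2, 3.0, 0.01]])

mask = attenuation_mask(energy_in, energy_out)
# Units attenuated by about 10 dB or more are labeled 1; the rest are 0.
```

Because the decision only asks whether energy disappeared when the target direction was cancelled, it needs no model of how many interferers there are or where they sit.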

[Figure: block diagram of the system; labels: H1S+N1, H2S+N2, W, DFT MATRIX (twice), BINARY MASK]

Segregation results

An example with a target speaker at 0° and 4 other interfering speakers at (-135°, -45°, 45°, 135°) and T60 = 0.3 s

ASR results

The segregation output is fed to a missing data recognizer (Cooke et al.’01)

[Figure: recognition results for (a) the 5-speaker configuration and (b) a nonspeech intrusion (rock music at 45°); conditions shown: baseline performance, estimated binary mask, ideal binary mask]

Summary and discussion

• Reverberation corrupts auditory cues
  • Pitch estimation is relatively robust, but harmonic structure is smeared, particularly in the high-frequency range
  • AM depth is reduced but the AM pattern is reasonably maintained
  • Onset times, and especially offset times, are shifted; onset and offset synchrony is weakened
  • Binaural cues become unreliable
• A two-stage monaural algorithm for reverberant speech enhancement
• A binaural algorithm for segregating reverberant speech
• Issues
  • What is “ground truth” pitch for a reverberant signal?
  • Dereverberation versus enhancement
  • How to deal with both segregation and reverberation monaurally?

Acknowledgment

• N. Roman and G. Hu for performing some computer experiments
• Funding by AFOSR/AFRL and NSF
