Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues
DeLiang Wang
Perception & Neurodynamics LabThe Ohio State University
2
Outline of presentation
IntroductionHuman performanceReverberation effects
On pitch On onset/offsetOn binaural cues
Monaural enhancement of reverberant signalBinaural segregation of reverberant signalDiscussion and summary
3
Reverberation as linear transmission system
Early reflections
Time (ms)
Late reflections
( ) ( ) ( )∫∞
∞−−= τττ dtshtx x(t): reverberant signal; s(t): source signal
h(τ): room impulse response function
4
Reverberation and speech quality
Room reverberation causes two distinct perceptual effects on speech quality
Early reflections lead to coloration or spectral deviation, determined by signal-to-reverberant energy ratio; it also boasts loudnessLate reflections (long-term reverberation) smear the time-frequency components of speech, and are characterized by the reverberation time (T60)
5
Human performance
• Though speech perception in quiet seems robust to reverberation, speech intelligibility in noise suffers in the presence of reverberation (Plomp’76; Culling et al.’03)• Culling et al. showed that reverberation
(T60 = 0.4 s) produces 5 dB increase in speech reception threshold when naturally intonated speech is presented together with a competing talker
• Hearing impaired listeners are particularly susceptible to reverberation
• The binaural advantage for speech perception in noise is diminished by reverberation• The Culling et al. study found no
advantage at all
Culling et al. (2003)
6
Human performance
• Darwin and Hukin (2000) compared reverberation effects on spatial, pitch, and vocal-tract size cues for sequential organization and found that• ITD cues are seriously impaired by reverberation• Pitch cues (F0 trajectory) are more resistant • A combination of pitch and vocal-tract size cues is very
resistant to reverberation
7
Outline of presentation
IntroductionHuman performanceReverberation effects
On pitch On onset/offsetOn binaural cues
Monaural enhancement of reverberant signalBinaural segregation of reverberant signalDiscussion and summary
8
Pitch tracking of a single utterance
0.0 2.5
5000
0
Freq
uenc
y
0.0 2.5
5000
0
Freq
uenc
y0.0 2.5
40
60
80
100
Time (sec)Pi
tch
(tim
e la
g)
CleanReverberant
Clean Female Utterance
Reverberant Female Utterance (T60=0.3 s)
Pitch Tracking
0.0 2.5
5000
0
Freq
uenc
y
0.0 2.5
5000
0
Freq
uenc
y
0.0 2.5
80
120
160
Time (sec)
Pitc
h (ti
me
lag)
CleanReverberant
Clean Male Utterance
Reverberant Male Utterance (T60=0.3 s)
Pitch Tracking
• Pitch is pretty robust to reverberation, especially for slowly changing pitch tracks and long voiced speech segments
• Noticeable artifacts: elongated pitch tracks
9
Pitch tracking of two utterances
0.0 2.5
5000
0
Freq
uenc
y (H
z)
0.0 2.540
80
120
160
Time (sec)
Pitc
h (ti
me
lag)
One-source trackingTwo-source tracking
Reverberant Mixture (T60=0.3 s)
Pitch Tracking
Multipitch tracking using the Wu et al. algorithm (2003). Even with multiple reverberant sources, pitch tracking works reasonably well
10
Reverberation effects on harmonic structure
• From Darwin and Hukin (2000). The utterance is “Could you please write the word bead down now.” T60 = 0.4 s
• Primarily in the low-frequency range
11
Implications on pitch-based grouping
Pitc
h (ti
me
lag)
0.0 1.570
100
130
160
190
Pitc
h (ti
me
lag)
0.0 1.570
100
130
160
190
0 50 100 15070
100
130
160
190
Time (sec)
Pitc
h (ti
me
lag)
CleanReverberant
Histogram of selected peaks (Clean)
Histogram of selected peaks (T60=0.3 s)
Pitch Tracks
Smearing of harmonic structure is worse in the high-frequency range. The figure shows the histogram of peak positions that are nearest to the detected pitch periods for frequencies greater than 800 Hz. This smearing effect would degrade the performance of pitch-based grouping.
12
Reverberation effects on temporal envelope
0 0.5 1 1.5 2
0
20
40
60
(a) Smoothed temporal envelope of anechoic utterance
Am
plitu
de (d
B)
0 0.5 1 1.5 2-20
0
20
40
(b) Smoothed temporal envelope of reverberant utterance
Am
plitu
de (d
B)
Time (s)
• Response envelope of a gammatone filter centered near 1 kHz to the utterance “That noise problem grows more annoying each day.” (a) T60 = 0 and (b) T60 = 0.3 s
• Amplitude modulation (AM) depth is reduced, but the AM pattern is reasonably maintained
13
(b) Reverberant utterance
Time (s)
Freq
uenc
y (H
z)
0.5 1.0 1.5 2 2.550
363
1246
3255
8000
(a) Anechoic utterance
Freq
uenc
y (H
z)
0.5 1.0 1.5 2 2.550
363
1246
3255
8000
Onset and offset detection
Cochleogram representation. Red/black marks indicate detected onsets/offsets.The utterance: “That noise problem grows more annoying each day.”
14
Reverberation effects on onset/offset detection
• Both the times and strengths of onsets and offsets are affected• Onset times are slightly shifted• Onsets of weak phones (e.g. unvoiced stops) are smeared • Offset times are shifted forward (delayed)
• Reverberation introduces spurious offsets
15
Reverberation effects on binaural cues: ITD
• Shinn-Cunningham and Kawakyu (2003) showed that the responses of a neural model to ITD (interaural time difference) are poor indicators of source azimuth in the presence of reverberation
• Integration over time enhances the estimation robustness
16
ITD estimation in time-frequency (T-F) units
Azimuth (degrees)
Cha
nnel
Cen
ter F
requ
ency
(Hz)
-90 0 45 900
5000
Azimuth (degrees)
Cha
nnel
Cen
ter F
requ
ency
(Hz)
-90 0 45 900
5000
AZIMUTH HISTOGRAM: Target source at 450, anechoic
AZIMUTH HISTOGRAM: Target source at 450, T60=0.3 s
Azi
mut
h (d
egre
es)
0 0.5 1 1.5 2 2.5
-90
0
45
90
Time (sec)
Azi
mut
h (d
egre
es)
0 0.5 1 1.5 2 2.5
-90
0
45
90
Across Frequency Integration (Clean)
Across Frequency Integration (T60
=0.3 s)
ITD estimation in individual T-F units using a cross-correlation model (Roman et al.’03). The input is natural speech. The distribution of local azimuth estimates is much noisier in the reverberant condition
17
Interaural intensity difference estimation in T-F units
IID (dB)
Cha
nnel
Cen
ter F
requ
ency
(Hz)
-10 -5 0 5 10 15 200
5000
IID (dB)
Cha
nnel
Cen
ter F
requ
ency
(Hz)
-10 -5 0 5 10 15 200
5000
IID HISTOGRAM: Target source at 450, anechoic
IID HISTOGRAM: Target source at 450, T60=0.3 s
0 5000-2
0
2
4
6
8
10
12
14
16
Channel Center Frequency (Hz)
IID (d
B)
CleanReverberant
Mean IID for one utterance
The distribution of IID (interaural intensity difference) is also much noisier in reverberation, and the mean IID values lose characteristics
18
Outline of presentation
IntroductionHuman performanceReverberation effects
On pitch On onset/offsetOn binaural cues
Monaural enhancement of reverberant signalBinaural segregation of reverberant signalDiscussion and summary
19
A two-stage enhancement algorithm (Wu’03)
Identify an inverse filter to reduce coloration distortion by maximizing kurtosis of LPC residue (Gillespie et al.’01)
Estimate and subtract the effects of long-term reverberation
Clean speech (kurtosis = 12.2)
Reverberant speech (kurtosis = 3.6)
Time (ms)
20
Results of Wu’s enhancement algorithm
Original speech
Reverberant speech
Enhanced speech
Inverse-filtered speech
21
Binaural segregation of reverberant speechRoman and Wang (2004) proposed a figure-ground segregation strategy to identify the T-F units dominated by target using spatial information, without imposing restrictions on the number, location or content of interfering sourcesBasic idea
First perform cancellation of reverberant target (with detected target location) using adaptive filteringThen label those T-F units that have been largely attenuated in the first stage since they are more likely to originate from the target location
-
H2S+N2
H1S+N1 D
FT M
ATR
IX
W DFT
MA
TRIX
. . .
. . .
. . .
. . .
BIN
AR
Y M
ASK
22
Segregation resultsAn example with a target speaker at 0ο and 4 other interfering speakers at (-135ο, -45ο, 45ο, 135ο) and T60 = 0.3 s
23
ASR results
The segregation output is fed to a missing data recognizer (Cooke et al.’01)
(a) 5 speaker configuration
(b) Nonspeech intrusion: rock music at 45º
Baseline performanceEstimated binary maskIdeal binary mask
24
Summary and discussion
Reverberation corrupts auditory cuesPitch estimation is relatively robust, but harmonic structure issmeared, particularly in high-frequencyAM depth is reduced but the AM pattern is reasonably maintainedOnset times, and especially offset times, are shifted; onset and offset synchrony is weakenedBinaural cues become unreliable
A two-stage monaural algorithm for reverberant speech enhancementA binaural algorithm for segregating reverberant speechIssues
What is “ground truth” pitch for a reverberant signal?Dereverberation versus enhancementHow to deal with both segregation and reverberation monaurally?
25
Acknowledgment
N. Roman and G. Hu for performing some computer experimentsFunding by AFOSR/AFRL and NSF