Speech Intelligibility Context
E. Godoy, Speaking Style Conversion
Speech is often heard in adverse conditions
  - Noisy environments
  - Listener has difficulty hearing/understanding
How to transform speech to make it more intelligible?
  - To make speech synthesis systems more effective
December 11, 2012
Example of speech with environmental barriers: the speech is not very intelligible!
[Audio samples: noise, no noise]
Intelligible Speaking Styles
I. Lombard speech
  - Speaker is immersed in noise
  - Human reflex to increase the speech loudness
II. Clear speech
  - Listener faces a barrier (noise, hearing, language, …)
  - Speaker adapts strategy to increase speech clarity
[Audio samples: normal vs. Lombard; casual vs. clear]
VC to improve speech intelligibility?
Voice Conversion
  - Modify speech to change the speaker identity
  - Learn transformation from source to target speaker
Speaking Style Conversion
  - Modify speech to improve intelligibility
  - Determine transformation from normal to intelligible style
Spectral Envelope: still very important!
Overview: Analyses-to-Modifications
I. Acoustic analyses to identify (mainly spectral) characteristics of Lombard & Clear styles
  i. Average Spectra
  ii. Vowel Spaces
II. Results of the analyses inspire spectral modifications to improve intelligibility
  i. Spectral energy band boosting (corrective filters)
  ii. Formant shifting (frequency warping)
Corpora
Lombard-normal: Grid
  - 8 speakers (4 male, 4 female)
  - 50 sentences each
  - LombardNinf96: most extreme (Lu & Cooke)
Clear-casual: LUCID read sentences
  - 8 speakers (4 male, 4 female)
  - 50 sentences each
  - Read speech: most exaggerated (Baker & Hazan)
Average Relative Spectra
Recall Amplitude Scaling in DFWA:
  \log(A_q(f)) = \log(S_q^{y}(f)) - \log(S_q^{x}(W_q^{-1}(f)))
Average relative spectra are similar: the difference between normal (X) and intelligible (Y) style
  \log(S^{R}(f)) = \log(S_q^{Y}(f)) - \log(S_q^{X}(f))
Average across all frames
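The average relative spectrum can be sketched in a few lines. This is a minimal sketch, not the author's implementation: it assumes each style is available as an array of magnitude spectral envelopes (frames × frequency bins), and it takes the difference of the mean log-spectra, which equals the mean of per-frame differences when the frames are paired. The function name, array layout, and `eps` guard are hypothetical.

```python
import numpy as np

def average_relative_spectrum_db(env_normal, env_intelligible, eps=1e-12):
    """Mean log-envelope of style Y minus mean log-envelope of style X, in dB.

    env_normal, env_intelligible: arrays of shape (n_frames, n_bins) holding
    magnitude spectral envelopes (assumed layout). The two styles may have
    different frame counts but must share the same frequency bins.
    """
    x = np.asarray(env_normal, dtype=float)
    y = np.asarray(env_intelligible, dtype=float)
    if x.ndim != 2 or y.ndim != 2 or x.shape[1] != y.shape[1]:
        raise ValueError("expected (n_frames, n_bins) arrays with equal n_bins")
    # 20*log10 because the inputs are magnitudes; eps avoids log(0).
    mean_log_x = np.mean(20.0 * np.log10(x + eps), axis=0)
    mean_log_y = np.mean(20.0 * np.log10(y + eps), axis=0)
    return mean_log_y - mean_log_x
```

Positive values mark frequency regions where the intelligible style carries more energy than the normal style.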
Average Relative Spectra (by Speaker)
[Figures: "GRID Average Relative Spectra for each speaker" (Lombard-normal) and "LUCID Average Relative Spectra for each speaker" (Clear-casual); axes: frequency (Hz), speaker index, level (dB)]
Average Relative Spectra (Overall)
Lombard speech: Spectral energy boosting “where formants are” (~500-4500Hz)
Clear speech: Varies depending on speaker strategy, extent of differences mild overall
[Figure: "Average Relative Spectra: All frames, All speakers"; level (dB) vs. frequency (0-8000 Hz); curves: Lombard-normal, Clear-casual]
Vowel Spaces (average for all speakers)
Lombard speech: Vowel Space Translation
Clear speech: Vowel Space Expansion
[Figures: "Clear-casual: Vowel Space, ALL Speakers" (casual vs. clear) and "Lombard-normal: Vowel Space, ALL Speakers" (normal vs. Lombard); axes: F1 (Hz) vs. F2 (Hz)]
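One way to make the translation vs. expansion distinction concrete is to compare the centroid and the area of the two vowel polygons. The sketch below is illustrative only: the function names are hypothetical, the (F1, F2) values are made up rather than taken from Grid or LUCID, and the shoelace formula assumes the vowels are listed in order around the polygon.

```python
import numpy as np

def polygon_area(points):
    """Shoelace area of a polygon whose vertices are given in order."""
    p = np.asarray(points, dtype=float)
    x, y = p[:, 0], p[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def compare_vowel_spaces(base, styled):
    """Return (centroid shift in Hz, area ratio) between two vowel spaces.

    base, styled: (n_vowels, 2) arrays of mean (F1, F2) per vowel, with the
    same vowels in the same order in both arrays.
    """
    base = np.asarray(base, dtype=float)
    styled = np.asarray(styled, dtype=float)
    shift = styled.mean(axis=0) - base.mean(axis=0)
    ratio = polygon_area(styled) / polygon_area(base)
    return shift, ratio

# Made-up corner vowels (F1, F2) in Hz, for illustration only.
base = np.array([[300.0, 2300.0], [700.0, 1300.0], [350.0, 900.0]])
translated = base + np.array([50.0, 100.0])     # pure translation
centre = base.mean(axis=0)
expanded = centre + 1.2 * (base - centre)       # pure expansion about the centroid
```

A pure translation moves the centroid and leaves the area ratio at 1; a pure expansion keeps the centroid and gives a ratio above 1.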
Inspiration for Speech Modifications
1. Spectral energy band boosting (Lombard)
2. Vowel space expansion (Clear)
Features associated with increased speech intelligibility
Though not observed together in human speech production…
Signal processing algorithms can accomplish both!
Spectral Energy Band Boosting
Corrective Filters
[Figures: "Spectral Energy Band Boosting, Varying Gain 0:0.5:3" and "Average Correction Filter for All Speakers" (all frames; Enhanced Lombard: high SII); level (dB) vs. frequency (0-8000 Hz)]
Lombard-inspired & Enhanced (high SII) Corrective Filter: Varying Gain
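A corrective filter with a variable gain can be sketched as a dB curve that is scaled and applied to a frame's magnitude spectrum. This is a sketch under stated assumptions: the raised-cosine shape, the 5 dB peak, and the energy renormalisation are placeholders, not the filter derived in the slides; only the ~500-4500 Hz band and the 0:0.5:3 gain sweep come from the deck. All function names are hypothetical.

```python
import numpy as np

def band_boost_db(freqs_hz, lo_hz=500.0, hi_hz=4500.0, peak_db=5.0):
    """Hypothetical boost curve: raised-cosine bump between lo_hz and hi_hz."""
    f = np.asarray(freqs_hz, dtype=float)
    boost = np.zeros_like(f)
    inside = (f >= lo_hz) & (f <= hi_hz)
    phase = (f[inside] - lo_hz) / (hi_hz - lo_hz)   # 0..1 across the band
    boost[inside] = peak_db * 0.5 * (1.0 - np.cos(2.0 * np.pi * phase))
    return boost

def apply_corrective_filter(magnitude, filter_db, gain=1.0, keep_energy=True):
    """Scale a dB filter by `gain` and apply it to a magnitude spectrum."""
    mag = np.asarray(magnitude, dtype=float)
    out = mag * 10.0 ** (gain * np.asarray(filter_db, dtype=float) / 20.0)
    if keep_energy:
        # Renormalise so the frame energy is unchanged (an assumption here).
        out = out * np.sqrt(np.sum(mag ** 2) / np.sum(out ** 2))
    return out

freqs = np.linspace(0.0, 8000.0, 257)
boost = band_boost_db(freqs)
gains = np.arange(0.0, 3.5, 0.5)   # the 0:0.5:3 sweep named in the figure title
filtered = [apply_corrective_filter(np.ones_like(freqs), boost, g) for g in gains]
```

With `keep_energy=True`, a larger gain redistributes energy toward the boosted band rather than simply making the frame louder.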
Frequency Warping for VS Expansion
Curve fitting formant shifts inspires warping…
[Figures: "Clear-casual: Vowel Space, ALL Speakers" (casual vs. clear; F1 vs. F2 in Hz) and "LUCID: Frequency differences for F1, F2; ALL" (F1diff and F2diff in Hz vs. casual F1 and F2 in Hz)]
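Frequency warping of a spectral envelope can be sketched as follows. The slides fit a curve to the observed formant shifts; this sketch substitutes a piecewise-linear warp through a few anchor points as a simple stand-in. The anchor values are made up, and the function names and the 8000 Hz upper limit are assumptions.

```python
import numpy as np

def make_warp(anchors_src_hz, anchors_dst_hz, nyquist_hz=8000.0):
    """Knots of a piecewise-linear warp W with W(0)=0 and W(nyquist)=nyquist."""
    src = np.concatenate(([0.0], anchors_src_hz, [nyquist_hz]))
    dst = np.concatenate(([0.0], anchors_dst_hz, [nyquist_hz]))
    if np.any(np.diff(src) <= 0) or np.any(np.diff(dst) <= 0):
        raise ValueError("anchors must be strictly increasing and inside (0, nyquist)")
    return src, dst

def warp_envelope(envelope, freqs_hz, src, dst):
    """Move envelope features from the src frequencies to the dst frequencies.

    The warped envelope at f is the original envelope at W^{-1}(f).
    """
    inverse = np.interp(freqs_hz, dst, src)        # W^{-1}(f) on the grid
    return np.interp(inverse, freqs_hz, envelope)  # resample the envelope

# Made-up example: a single spectral peak at 500 Hz is shifted to 600 Hz.
freqs = np.linspace(0.0, 8000.0, 801)
envelope = np.exp(-0.5 * ((freqs - 500.0) / 50.0) ** 2)
src, dst = make_warp([500.0, 1500.0], [600.0, 1700.0])
warped = warp_envelope(envelope, freqs, src, dst)
```

Pinning the warp at 0 Hz and at the upper band edge keeps the mapping monotonic, so the inverse used for resampling is well defined.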
Sound Samples
With Noise (SSN, 0 dB): Original | Warp | Boost | BW
No Noise: Original | WarpE | Boost | BW
Want more?
See Maria’s presentation for more details …
Voice & Speaking Style Conversion Parallels
Voice Conversion
  - Dynamic Frequency Warping + Amplitude Scaling (based on acoustic-phonetic spaces of source & target speakers)
Speaking Style Conversion
  - Frequency Warping + Corrective Filter
  1. Clear-speech inspired frequency warping for vowel space expansion
  2. Lombard-speech inspired corrective filters to increase loudness
Objective Metrics for Evaluation
I. Loudness
  - Energy in frequency bands weighted based on human hearing
II. Speech Intelligibility Index (SII)
  - Energy & modulations in frequency bands relative to a noise masker
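The idea behind an SII-style metric can be sketched as a band-wise audibility score averaged over bands and, for an extended variant, over short-time frames. This is a heavily simplified sketch, not the standard procedure and not the metric used in the slides: it keeps only the (SNR + 15) / 30 audibility mapping, uses uniform band weights as a placeholder for a band-importance table, and omits hearing thresholds, masking spread, and level distortion. Function names are hypothetical.

```python
import numpy as np

def band_audibility(speech_db, noise_db):
    """SII-style band audibility: (SNR + 15) / 30, clipped to [0, 1]."""
    snr = np.asarray(speech_db, dtype=float) - np.asarray(noise_db, dtype=float)
    return np.clip((snr + 15.0) / 30.0, 0.0, 1.0)

def sii_like(speech_db, noise_db, importance=None):
    """Importance-weighted sum of band audibility along the last axis."""
    aud = band_audibility(speech_db, noise_db)
    if importance is None:
        # Uniform weights are a placeholder, not a band-importance table.
        importance = np.full(aud.shape[-1], 1.0 / aud.shape[-1])
    return np.sum(aud * importance, axis=-1)

def extended_sii_like(speech_db_frames, noise_db_frames, importance=None):
    """Average the per-frame index over time; inputs are (frames, bands)."""
    per_frame = sii_like(speech_db_frames, noise_db_frames, importance)
    return float(np.mean(per_frame))
```

Because the score depends only on band levels relative to the masker, it rewards added energy such as Lombard-style boosting but cannot register changes that leave band energies roughly unchanged.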
Loudness Distributions
  - Lombard speech: “louder” for voiced (bi-modal)
  - Clear speech: not “louder” than casual speech
  - Transients: neither style distinguishes on average
[Figures: two "Loudness Histogram" panels (casual vs. clear; normal vs. Lombard); x-axis: loudness value]
Extended SII Distributions
  - extSII highly correlated with average loudness
  - Lombard speech objectively more intelligible
  - Clear speech intelligibility gain not captured by extSII: limitations of objective intelligibility metrics
[Figures: two "extended SII Histogram" panels (casual vs. clear; normal vs. Lombard); x-axis: SII]
Observations from Analyses
Lombard Speech
  - Spectral boosting in the region that includes the formants: increase in loudness (also extSII)
  - Vowel space translation, but no expansion
Clear Speech
  - Small changes in average spectra (slight spectral “flattening”)
  - Consistent vowel space expansion: greater vowel discrimination
Comparison between styles
  - Acoustic differences translate into perceptual distinctions linked to intelligibility gains
  - Spectral boosting & vowel space expansion: mutually exclusive