Speech and Speaker Recognition – SLIDES by John H.L. Hansen, 2007 (John.Hansen@utdallas.edu)
Lombard Speech Recognition – Hynek Bořil (hxb076000@utdallas.edu)



Overview

Outline
- Model of Speech Production
- Automatic Speech Recognition (ASR): Feature Extraction, Acoustic Models
- Lombard Effect (LE): Definition & Motivation; Acquisition of a Corpus Capturing Lombard Effect; Analysis of Speech under LE; Methods Increasing ASR Robustness to LE


Speech Production

Model of speech production → understanding the structure of the speech signal → design of speech processing algorithms


Speech Production – Linear Model

[Figure: linear model of speech production. Voiced excitation: IMPULSE TRAIN GENERATOR I(z) (driven by the pitch period) → GLOTTAL PULSE MODEL G(z), gain AV. Unvoiced excitation: RANDOM NOISE GENERATOR N(z), gain AN. A voiced/unvoiced switch selects the excitation uG(n), which drives the VOCAL TRACT MODEL V(z) (controlled by the vocal tract parameters) and the RADIATION MODEL R(z), producing the output speech pL(n).]


Speech Production – Linear Model

[Figure: linear production model (block diagram as above) with the voiced excitation illustrated: an impulse train with period 1/F0 in the time domain corresponds to a harmonic spectrum |I(F)G(F)| with components at F0, 2F0, …, rolling off at −12 dB/oct due to the glottal pulse shape.]


Speech Production – Linear Model

[Figure: linear production model with the unvoiced excitation illustrated: the random noise generator produces a noise-like time signal with an approximately flat spectrum |N(F)|.]


Speech Production – Linear Model

The vocal tract is modeled as an all-pole filter:

V(z) = G / (1 − Σ_{k=1..N} a_k z^(−k))

[Figure: linear production model with the vocal tract and radiation stages illustrated: |V(F)| shows the formant resonances of the all-pole model; the radiation model contributes a +6 dB/oct spectral rise |R(F)|.]


Speech Production – Linear Model

[Figure: complete linear production model combining the previous stages. Voiced branch: impulse train with period 1/F0 → harmonic spectrum |I(F)G(F)| at F0, 2F0, … with a −12 dB/oct glottal roll-off. Unvoiced branch: noise excitation with a flat spectrum |N(F)|. Either excitation passes through the all-pole vocal tract V(z) = G / (1 − Σ_{k=1..N} a_k z^(−k)) with formant spectrum |V(F)| and the radiation model |R(F)| (+6 dB/oct), yielding the output speech spectrum.]
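The source–filter model above can be sketched in code: an impulse-train excitation driving a stable all-pole filter. This is only an illustrative sketch, not material from the lecture; the two resonance frequencies and bandwidths (700/1200 Hz) are hypothetical values chosen for the example.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000          # sampling rate (Hz)
f0 = 100           # fundamental frequency (Hz)

# Voiced excitation: impulse train with period fs/f0 samples (1/F0 seconds)
exc = np.zeros(fs)                 # one second of excitation
exc[::fs // f0] = 1.0

def resonator_poly(fc, bw, fs):
    """Denominator polynomial of a two-pole resonator at center fc (Hz), bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / fs)   # pole radius < 1 -> stable filter
    theta = 2.0 * np.pi * fc / fs
    return np.array([1.0, -2.0 * r * np.cos(theta), r * r])

# All-pole vocal tract V(z) = G / (1 - sum a_k z^-k): cascade of two
# resonators at hypothetical formant positions (700 Hz and 1200 Hz)
a = np.polymul(resonator_poly(700.0, 100.0, fs),
               resonator_poly(1200.0, 150.0, fs))
speech = lfilter([1.0], a, exc)    # filter the excitation through V(z)
```

Varying f0 changes the harmonic spacing while the formant envelope stays fixed — the separation of source and filter that the model formalizes.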


Speech Production – Linguistic/Speaker Information in the Speech Signal

How is linguistic information coded in the speech signal?
Phonetic content:
- Energy: voiced phones (v) carry higher energy than unvoiced phones (uv)
- Low formants: locations and bandwidths (reflecting changes in the configuration of the vocal tract during speech production)
- Spectral tilt: differs across phones; generally flatter for unvoiced phones (due to changes in excitation and formant locations)
Other cues:
- Pitch contour: important for distinguishing words in tonal languages (e.g., Chinese dialects)

How is speaker identity coded in the speech signal?
- Glottal waveform
- Vocal tract parameters
- Prosody (intonation, rhythm, stress, …)


Speech Production – Phonetic Content in Features

Example 1 – First two formants in US vowels (Bond et al., 1989)
[Figure: F2 (Hz) vs. F1 (Hz) for neutral speech; the vowels /u/, /i/, /ae/, /a/ occupy distinct regions of the F1–F2 plane.]

Example 2 – Spectral slopes in Czech vowels

Vowel | # N  | T (s) | Slope (dB/oct)     | σ (dB/oct)
/a/   | 454  | 69.03 | −6.8 (−6.9; −6.7)  | 1.13
/e/   | 1064 | 69.33 | −5.6 (−5.7; −5.6)  | 1.06
/i/   | 509  | 58.92 | −5.0 (−5.1; −4.9)  | 1.15
/o/   | 120  | 9.14  | −8.0 (−8.1; −7.8)  | 0.91
/u/   | 102  | 5.73  | −6.1 (−6.3; −6.0)  | 0.77


Automatic Speech Recognition (ASR) – Architecture of an HMM Recognizer

[Diagram: SPEECH SIGNAL → FEATURE EXTRACTION (MFCC/PLP) → ACOUSTIC MODEL – SUB-WORD LIKELIHOODS (GMM/MLP) → DECODER (VITERBI) → ESTIMATED WORD SEQUENCE; the decoder also consults the LEXICON (HMM) and the LANGUAGE MODEL (BIGRAMS).]

Feature extraction – transformation of the time-domain acoustic signal into a representation more effective for the ASR engine: dimensionality reduction and suppression of irrelevant (disturbing) signal components (speaker-, environment-, and recording-chain-dependent characteristics) while preserving the phonetic content.

Sub-word models – Gaussian Mixture Models (GMMs): mixtures of Gaussians used to model the distribution of feature-vector parameters; Multi-Layer Perceptrons (MLPs): neural networks, much less common than GMMs.


Automatic Speech Recognition (ASR) – HMM-Based Recognition Stages

[Diagram (HTK Book, 2006): speech signal → feature extraction (windowing, …, cepstrum) → observation vectors o1, o2, o3, … → acoustic models (HMMs → word sequences) combined with the language model → speech transcription.]


Automatic Speech Recognition (ASR) – Feature Extraction: MFCC

Mel Frequency Cepstral Coefficients (MFCC)
Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980

MFCC is the first choice in current commercial ASR.

Pipeline: s(n) → PREEMPHASIS → WINDOW (HAMMING) → |FFT|² → MEL FILTER BANK → Log(·) → IDCT → c(n)

- Preemphasis: compensates for the spectral tilt (speech production/microphone channel)
- Windowing: suppresses transient effects in short-term segments of the signal
- |FFT|²: energy spectrum (phase is discarded)
- Mel filter bank: the mel scale models the logarithmic perception of frequency in humans (approximately linear at low frequencies); the triangular filters provide dimensionality reduction
- Log + IDCT: extraction of the cepstrum – deconvolution of the glottal waveform, vocal tract function, and channel characteristics
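The MFCC pipeline above can be sketched for a single frame. This is a minimal illustration in NumPy/SciPy, assuming the common O'Shaughnessy mel formula and default sizes (20 filters, 13 coefficients, 512-point FFT) that the slides do not specify.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # O'Shaughnessy mel formula (a common convention, assumed here)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filt=20, n_ceps=13, n_fft=512):
    """MFCC of one frame: preemphasis -> window -> |FFT|^2 -> mel bank -> log -> DCT."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # preemphasis
    frame = frame * np.hamming(len(frame))                      # Hamming window
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2              # |FFT|^2
    # Triangular mel filter bank: edges equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    logmel = np.log(fbank @ power + 1e-10)                      # Log(.)
    return dct(logmel, type=2, norm='ortho')[:n_ceps]           # IDCT -> c(n)
```

In a full front end this runs on overlapping frames (e.g., 25 ms every 10 ms), and Δ/ΔΔ coefficients are appended, as in the recognizer described later.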



Automatic Speech Recognition (ASR) – Feature Extraction: MFCC & PLP

Perceptual Linear Predictive Coefficients (PLP)
Hermansky, Journal of the Acoustical Society of America, 1990

- An alternative to MFCC, used less frequently
- Many stages are similar to MFCC
- Linear prediction – smoothing of the spectral envelope (may improve robustness)

MFCC: s(n) → PREEMPHASIS → WINDOW (HAMMING) → |FFT|² → MEL FILTER BANK → Log(·) → IDCT → c(n)
PLP: s(n) → WINDOW (HAMMING) → |FFT|² → BARK FILTER BANK → EQUAL-LOUDNESS PREEMPHASIS → INTENSITY–LOUDNESS (cube root) → LINEAR PREDICTION → CEPSTRAL RECURSION → c(n)


Automatic Speech Recognition (ASR) – Acoustic Models: GMM-HMM

Gaussian Mixture Models (GMMs)
Motivation: distributions of cepstral coefficients can be well modeled by a mixture (weighted sum) of Gaussian functions.

Example – the distribution of c0 in a certain phone and the corresponding Gaussian, defined uniquely by its mean, variance, and weight.
[Figure: histogram of c0 (# samples) overlaid with the fitted probability density function (pdf); the weight scales the Gaussian.]

Multidimensional observations (c0, …, c12) → multidimensional Gaussians, defined uniquely by means, covariance matrices, and weights.
GMMs are typically used to model parts of phones.

Hidden Markov Models (HMMs)
States (GMMs) + transition probabilities between states.
Models of whole phones; lexicon word models are built from phone models.
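The GMM idea above — modeling a coefficient's distribution as a weighted sum of Gaussians — can be illustrated with a toy one-dimensional EM fit. This is a sketch for intuition, not the training procedure (e.g., HTK's Baum–Welch) used for the recognizers in these slides.

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=50):
    """Fit a 1-D Gaussian mixture with plain EM (toy sketch)."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over the data
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6  # variance floor
    return w, mu, var
```

For the 13-dimensional cepstral vectors of the slides, the same update runs with mean vectors and covariance matrices (often diagonal) per component.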


Lombard Effect – Definition & Motivation

What is the Lombard effect?
When exposed to a noisy adverse environment, speakers modify the way they speak in an effort to maintain intelligible communication (Lombard effect, LE).

Why is the Lombard effect interesting?
- Better understanding of the mechanisms of human speech communication (Can we intentionally change particular parameters of speech production to improve intelligibility, or is LE an automatic process learned through a feedback loop? How do the type of noise and the communication scenario affect LE?)
- Mathematical modeling of LE → classification of LE level, speech synthesis in noisy environments, increased robustness of automatic speech recognition and speaker identification systems


Lombard Effect – Motivation & Goals

Ambiguity in past LE investigations
- LE has been studied since 1911; however, many investigations disagree on the observed impacts of LE on speech production
- Analyses were typically conducted on very limited data – a couple of utterances from a few subjects (1–10)
- Lack of a communication factor – a majority of studies ignore the importance of communication for evoking LE (an effort to convey a message over noise) → the occurrence and level of LE in the recordings is 'random' → contradicting analysis results
- LE was studied for only a few world languages (English, Spanish, French, Japanese, Korean, Mandarin Chinese); no comprehensive study exists for any Slavic language

1st Goal
- Design of a Czech Lombard Speech Database addressing the need for a communication factor and well-defined simulated noisy conditions
- Systematic analysis of LE in spoken Czech


Lombard Effect – Motivation & Goals

ASR under LE
- Mismatch between LE speech corrupted by noise and acoustic models trained on clean neutral speech
- The strong impact of noise on ASR is well known, and a vast number of noise-suppression/speech-enhancement algorithms have been proposed in recent decades (yet no ultimate solution has been reached)
- The negative impact of LE on ASR often exceeds that of noise; recent state-of-the-art ASR systems mostly ignore this issue

LE-equalization methods
- LE-equalization algorithms typically operate in the following domains: robust features, LE-to-neutral transformation, model adjustments, improved training of acoustic models
- The algorithms display various degrees of efficiency and are often bound by strong assumptions preventing real-world application (applying fixed transformations to phonetic groups, requiring a known level of LE, etc.)

2nd Goal
- Proposal of novel LE-equalization techniques with a focus both on the level of LE suppression and on the extent of the bounding assumptions


LE Corpora

Available Czech corpora
- Czech SPEECON – speech recordings from various environments, including office and car
- CZKCC – car recordings, including parked-car (engine off) and moving-car scenarios
- Both databases contain speech produced in quiet and in noise → candidates for a study of LE; however, not good ones, as shown later

Design/acquisition of an LE-oriented database – Czech Lombard Speech Database '05 (CLSD'05)
Goals:
- Communication in a simulated noisy background → high-SNR recordings
- Phonetically rich data / extensive small-vocabulary material
- Parallel utterances in neutral and LE conditions


Data Acquisition – Recording Setup

Simulated noisy conditions
- Noise samples are mixed with the speech feedback and played to the speaker and operator through headphones
- The operator qualifies the intelligibility of speech in noise – if an utterance is not intelligible, the operator asks the subject to repeat it → speakers are required to convey the message over the noise → communication LE
- Noises: mostly car noises from the Car2E database, normalized to 90 dB SPL

Speaker sessions
- 14 male / 12 female speakers
- Each subject was recorded in both neutral and simulated noisy conditions

[Diagram: speaker with close-talk and middle-talk microphones and noise + speech feedback in headphones → H&T recorder → operator with noise + speech monitor, who accepts the utterance (OK – next) or rejects it (BAD – again).]
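Mixing noise with speech at a controlled level, as in the simulated noisy conditions above, can be sketched digitally. Note this works with a power-ratio SNR in dB; the slides specify playback level in dB SPL, so this is only an illustrative digital-domain analogue, not the actual setup.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio of the mix is `snr_db`."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # gain such that p_speech / (gain^2 * p_noise) = 10^(snr_db/10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

The same scaling idea underlies the SNR figures quoted for the corpora later in the deck (e.g., 10.7/12.6 dB for the car recordings).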



Data Acquisition – Impact of Headphones

Environmental sound attenuation by headphones
- Attenuation characteristics measured on a dummy head
- A source of wide-band noise; measurement of the sound transfer to the dummy head's auditory canals with and without headphones
- Attenuation characteristics = subtraction of the two transfer functions


Data Acquisition – Impact of Headphones

Environmental sound attenuation by headphones
- Directional attenuation measured in a reflectionless sound booth
- Real attenuation in the recording room

[Figures: headphone attenuation (dB) as a function of frequency (Hz) and incidence angle (°); polar plots of directional attenuation at 1, 2, 4, and 8 kHz; attenuation curves for 0°, 90°, 180°, and the recording room over roughly 100 Hz–10 kHz.]





Speech Production under Lombard Effect

Speech features affected by LE
- Vocal tract excitation: the glottal pulse shape changes, the fundamental frequency rises
- Vocal tract transfer function: center frequencies of low formants increase, formant bandwidths reduce
- Vocal effort (intensity) increases
- Other: voiced phonemes are prolonged, the voiced/unvoiced energy ratio increases, …

[Figure: linear production model block diagram, with the all-pole vocal tract V(z) = G / (1 − Σ_{k=1..N} a_k z^(−k)).]


Analysis of Speech Features under LE – Fundamental Frequency

[Figures: distributions of fundamental frequency (70–570 Hz) for female (F) and male (M) speakers. Czech SPEECON: office vs. car (number of samples ×10,000). CZKCC: engine off vs. engine on (number of samples ×1,000). CLSD'05: neutral vs. LE (number of samples ×10,000).]
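The F0 distributions above are built from per-frame pitch estimates. A minimal autocorrelation-based estimator (an illustrative sketch, not the tool used for these analyses), restricted to the 70–570 Hz range plotted in the figures:

```python
import numpy as np

def estimate_f0(x, fs, fmin=70.0, fmax=570.0):
    """Estimate F0 of a (voiced) frame by picking the autocorrelation peak."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]  # lags 0 .. len(x)-1
    lo, hi = int(fs / fmax), int(fs / fmin)            # search window in lags
    lag = lo + np.argmax(ac[lo:hi + 1])                # strongest periodicity
    return fs / lag
```

Real pitch trackers add voicing decisions and octave-error handling; this sketch assumes a clean voiced frame.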


Analysis of Speech Features under LE – Formant Locations

[Figures: F2 (Hz) vs. F1 (Hz) of the Czech vowels /i/, /i'/, /e/, /e'/, /a/, /a'/, /o/, /o'/, /u/, /u'/ extracted from digit utterances, comparing neutral (N) and LE conditions: CZKCC female digits, CZKCC male digits, CLSD'05 female digits, CLSD'05 male digits.]


Analysis of Speech Features under LE – Formant Bandwidths

First-formant bandwidth B1 and its standard deviation σ1 (all values in Hz; M – males, F – females; the two column pairs per gender correspond to the neutral and LE conditions)

CZKCC

Vowel | B1M  | σ1M | B1M  | σ1M | B1F  | σ1F | B1F  | σ1F
/a/   | 207* | 74  | 210* | 84  | 275  | 97  | 299  | 78
/e/   | 125* | 70  | 130* | 78  | 156  | 68  | 186  | 79
/i/   | 124* | 49  | 127* | 44  | 105  | 44  | 136  | 53
/o/   | 275  | 87  | 222  | 67  | 263* | 85  | 269* | 73
/u/   | 187  | 100 | 170  | 89  | 174* | 96  | 187* | 101

CLSD'05

Vowel | B1M  | σ1M | B1M  | σ1M | B1F  | σ1F | B1F  | σ1F
/a/   | 269  | 88  | 152  | 59  | 232  | 85  | 171  | 68
/e/   | 168  | 94  | 99   | 44  | 169  | 73  | 130  | 49
/i/   | 125  | 53  | 108  | 52  | 132* | 52  | 133* | 58
/o/   | 239  | 88  | 157  | 81  | 246  | 91  | 158  | 62
/u/   | 134* | 67  | 142* | 81  | 209  | 95  | 148  | 66

SPEECON, CZKCC: no consistent bandwidth changes
CLSD'05: significant bandwidth reduction in many voiced phonemes


Analysis of Speech Features under LE – Phoneme Durations

CZKCC (mean phoneme duration T and its standard deviation σT in the engine-off (OFF) and engine-on (ON) sessions; Δ – relative change)

Word  | Phoneme | # OFF | TOFF (s) | σTOFF (s) | # ON | TON (s) | σTON (s) | Δ (%)
Nula  | /a/     | 349   | 0.147    | 0.079     | 326  | 0.259   | 0.289    | 48.50
Jedna | /a/     | 269   | 0.173    | 0.076     | 251  | 0.241   | 0.238    | 39.36
Dva   | /a/     | 245   | 0.228    | 0.075     | 255  | 0.314   | 0.311    | 38.04
Štiri | /r/     | 16    | 0.045    | 0.027     | 68   | 0.080   | 0.014    | 78.72
Sedm  | /e/     | 78    | 0.099    | 0.038     | 66   | 0.172   | 0.142    | 72.58

CLSD'05 (neutral (N) vs. LE sessions)

Word   | Phoneme | # N | TN (s) | σTN (s) | # LE | TLE (s) | σTLE (s) | Δ (%)
Jedna  | /e/     | 583 | 0.031  | 0.014   | 939  | 0.082   | 0.086    | 161.35
Dvje   | /e/     | 586 | 0.087  | 0.055   | 976  | 0.196   | 0.120    | 126.98
Čtiri  | /r/     | 35  | 0.041  | 0.020   | 241  | 0.089   | 0.079    | 115.92
Pjet   | /e/     | 555 | 0.056  | 0.033   | 909  | 0.154   | 0.089    | 173.71
Sedm   | /e/     | 358 | 0.080  | 0.038   | 583  | 0.179   | 0.136    | 122.46
Osm    | /o/     | 310 | 0.086  | 0.027   | 305  | 0.203   | 0.159    | 135.25
Devjet | /e/     | 609 | 0.043  | 0.022   | 932  | 0.120   | 0.088    | 177.20

- Significant increase in duration for some phonemes, especially voiced ones
- Some unvoiced consonants show a duration reduction
- Duration changes in CLSD'05 considerably exceed those in SPEECON and CZKCC


Lombard Effect – Initial ASR Experiments

ASR evaluation – WER (Word Error Rate): S – word substitutions, D – word deletions, I – word insertions, N – number of words in the reference:

WER = (S + D + I) / N × 100 %

Digit recognizer
- Monophone HMM models
- 13 MFCC + ∆ + ∆∆
- 32 Gaussian mixtures per model state

Set      | # Spkrs | # Digits | WER (%)
Czech SPEECON
Office F | 22      | 880      | 5.5 (4.0–7.0)
Office M | 31      | 1219     | 4.3 (3.1–5.4)
Car F    | 28      | 1101     | 4.6 (3.4–5.9)
Car M    | 42      | 1657     | 10.5 (9.0–12.0)
CZKCC
OFF F    | 30      | 1480     | 3.0 (2.1–3.8)
OFF M    | 30      | 1323     | 2.3 (1.5–3.1)
ON F     | 18      | 1439     | 13.5 (11.7–15.2)
ON M     | 21      | 1450     | 10.4 (8.8–12.0)
CLSD'05
N F      | 12      | 4930     | 7.3 (6.6–8.0)
N M      | 14      | 1423     | 3.8 (2.8–4.8)
LE F     | 12      | 5360     | 42.8 (41.5–44.1)
LE M     | 14      | 6303     | 16.3 (15.4–17.2)

Car F/M and ON F/M: noisy car recordings (SPEECON/CZKCC – 10.7/12.6 dB SNR). LE F/M: clean recordings (40.9 dB SNR).
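WER = (S + D + I) / N · 100 % is computed from a Levenshtein alignment of the reference and hypothesis word strings; a compact dynamic-programming sketch:

```python
import numpy as np

def wer(ref, hyp):
    """Word error rate in percent: min (S + D + I) over all alignments, divided by len(ref)."""
    n, m = len(ref), len(hyp)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0] = np.arange(n + 1)   # all-deletions boundary
    d[0, :] = np.arange(m + 1)   # all-insertions boundary
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitution
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)  # deletion, insertion
    return 100.0 * d[n, m] / n
```

Note that insertions can push WER above 100 %, which is why the metric is an error rate rather than an accuracy.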




Slide 37

LE Suppression in ASR – Model Adaptation

Model Adaptation
Often effective when only limited data from the given conditions are available

Maximum Likelihood Linear Regression (MLLR) – if only a limited amount of data per class is available, acoustically close classes are grouped and transformed together:

μ_MLLR = A μ + b

Maximum a posteriori approach (MAP) – initial models are used as informative priors for the adaptation; with N adaptation frames, sample mean μ̄, prior mean μ_0, and prior weight τ:

μ_MAP = (N μ̄ + τ μ_0) / (N + τ)

Adaptation Procedure
First, neutral speaker-independent (SI) models are transformed by MLLR, employing clustering (binary regression tree)
Second, MAP adaptation – only for nodes with a sufficient amount of adaptation data
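The two mean-update rules can be sketched as a minimal NumPy illustration; the helper names and `tau` (the MAP prior weight) are illustrative, and the regression-tree clustering is omitted:

```python
import numpy as np

def mllr_mean(mu, A, b):
    """MLLR: linearly transform a Gaussian mean, mu' = A @ mu + b."""
    return A @ mu + b

def map_mean(mu_prior, data, tau=10.0):
    """MAP: interpolate the prior mean with the sample mean of the
    adaptation data, mu' = (N * mean(data) + tau * mu_prior) / (N + tau)."""
    n = len(data)
    return (n * np.mean(data, axis=0) + tau * mu_prior) / (n + tau)
```

With little data (small N) the MAP estimate stays near the prior mean; as N grows it converges to the sample mean, which is why MAP is applied only to well-populated regression-tree nodes.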

Slide 38

LE Suppression in ASR – Model Adaptation

[Bar chart: WER (%) (0–90) for Baseline digits LE, Adapted digits LE, Baseline sentences LE, Adapted sentences LE under four schemes: SI adapt to LE (same spkrs), SI adapt to LE (disjunct spkrs), SD adapt to neutral, SD adapt to LE – model adaptation to conditions and speakers]

Adaptation Schemes
Speaker-independent adaptation (SI) – group dependent/independent
Speaker-dependent adaptation (SD) – to neutral/LE

Slide 39

LE Suppression in ASR – Data-Driven Design of Robust Features

Filter Bank Approach
Analysis of the importance of frequency components for ASR
Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and suppress disturbing components
Initial FB uniformly distributed on a linear scale – equal attention to all components
Consecutively, a single FB band is omitted → impact on WER?
Omitting bands carrying more information will result in a considerable WER increase

Implementation
MFCC front-end; Mel scale replaced by a linear scale, triangular filters replaced by rectangular filters without overlap
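The uniform rectangular filterbank and the leave-one-band-out probe can be sketched as follows (hypothetical helper; 8 kHz sampling and 20 bands assumed):

```python
import numpy as np

def rect_filterbank(n_bands=20, n_fft=512, fs=8000, omit=None):
    """Uniform rectangular filterbank on a linear scale; optionally zero
    out one band to measure its contribution via the resulting WER."""
    n_bins = n_fft // 2 + 1
    edges = np.linspace(0.0, fs / 2.0, n_bands + 1)   # band cut-offs (Hz)
    bin_freqs = np.linspace(0.0, fs / 2.0, n_bins)    # FFT bin centers (Hz)
    fb = np.zeros((n_bands, n_bins))
    for b in range(n_bands):
        fb[b, (bin_freqs >= edges[b]) & (bin_freqs < edges[b + 1])] = 1.0
    if omit is not None:
        fb[omit, :] = 0.0                             # leave-one-band-out probe
    return fb
```

Running the recognizer once per omitted band then yields the per-band WER curves shown on the following slides.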

Slide 40

Data-Driven Design of Robust Features – Importance of Frequency Components

[Plots: WER (%) vs. omitted band (1–20) – neutral speech (WER 3–5 %) and LE speech (WER 20–40 %); below, the 20-band filterbank layout over filterbank cut-off frequencies up to 4 kHz; features: c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12]

Slide 44

Data-Driven Design of Robust Features – Importance of Frequency Components

[Plots: WER (%) vs. omitted band – neutral speech (3–5 %) and LE speech (20–40 %)]

Features: c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
The area of 1st and 2nd formant occurrence carries the highest portion of phonetic information – F1 more important for neutral speech, F1–F2 for LE speech recognition
Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech → tradeoff
Next step – how much of the low-frequency content should be omitted for LE ASR?

Slide 45

Optimizing Filter Banks – Omitting Low Frequencies

[Plots: WER (%) vs. omitted low-frequency bandwidth (0–1200 Hz) – neutral speech (0–12 %) and LE speech (0–30 %); 19-band filterbank layout over cut-off frequencies up to 4 kHz; features: c0–c12, ∆, ∆∆]

Slide 49

Data-Driven Design of Robust Features – Omitting Low Frequencies

Effect of Omitting Low Spectral Components – WER (%) on the devel set:

| Set             | Neutral       | LE               |
| LFCC, full band | 4.8 (4.1–5.5) | 29.0 (27.5–30.5) |
| LFCC, 625 Hz    | 6.6 (5.8–7.4) | 15.6 (14.4–16.8) |

Increasing the FB low cut-off results in an almost linear WER increase on neutral speech while considerably enhancing ASR performance on LE speech
Optimal low cut-off found at 625 Hz

Slide 50

Data-Driven Design of Robust Features – Increasing Filter Bank Resolution

[Plot: WER (%) (15–30) on LE speech vs. omitted band (0–13), 625 Hz low cut-off marked]

Increasing Frequency Resolution
Idea – emphasize the high-information portion of the spectrum by increasing FB resolution
Experiment – FB decimation from 19 → 12 bands (decreasing computational costs)
Increasing the number of filters at the peak of the information distribution curve → deterioration of LE ASR (17.2 % → 26.9 %)
Slight F1–F2 shifts due to LE affect the cepstral features
No simple recipe on how to derive an efficient FB from the information distribution curves

Slide 51

Data-Driven Design of Robust Features – Increasing Filter Bank Resolution

[Plot: WER (%) (13–27) on LE speech vs. critical frequency (500–4000 Hz) for Bands 1–6]

Consecutive Filter Bank Repartitioning
Consecutively, from lowest to highest, each FB high cut-off is varied, while the upper rest of the FB is redistributed uniformly across the remaining frequency band
The cut-off yielding a local WER minimum is fixed and the procedure is repeated for the adjacent higher cut-off → WER reduction by 2.3 % for LE, by 1 % on neutral (example – 6-band FB)

Slide 52

Data-Driven Design of Robust Features – Standard vs. Novel Features

State-of-the-Art LE Front-End – Expolog (Bou-Ghazale & Hansen, 2000)
FB redistributed to improve stressed speech recognition (including loud and Lombard speech)
Increased resolution in the area of F2 occurrence

f_Expolog = 700 (10^(f/3988) − 1),      0 ≤ f ≤ 2000 Hz
f_Expolog = 2595 log10(1 + f/700),      2000 < f ≤ 4000 Hz

[Plots: Expolog frequency (Hz) vs. linear frequency (Hz) – the Expolog warping curve]
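The piecewise Expolog warping transcribes directly into a function; a sketch using the 700/3988/2595 constants of the mapping (the two branches are continuous at 2 kHz up to rounding):

```python
import math

def expolog(f):
    """Expolog frequency warping: exponential below 2 kHz, Mel-like
    logarithmic between 2 and 4 kHz."""
    if f <= 2000.0:
        return 700.0 * (10.0 ** (f / 3988.0) - 1.0)   # exponential branch
    return 2595.0 * math.log10(1.0 + f / 700.0)        # logarithmic branch
```

The exponential low-frequency branch compresses the region below F1 and stretches the F2 area, which is where this front-end gains its LE robustness.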

Slide 53

Data-Driven Design of Robust Features – Standard vs. Novel Features

Evaluation in ASR Task
Standard MFCC, PLP; variations MFCC-LPC, PLP-DCT – altered cepstrum extraction schemes
Expolog – Expolog FB replacing the trapezoid FB in PLP
20Bands-LPC – uniform rectangular FB employed in the PLP front-end
Big1-LPC – derived from 20Bands-LPC, first 3 bands merged – decreased resolution at frequencies disturbing for LE ASR
RFCC-DCT – repartitioned FB, 19 bands, starting at 625 Hz, employed in MFCC
RFCC-LPC – repartitioned FB employed in PLP

Slide 54

Data-Driven Design of Robust Features – Standard vs. Novel Features

[Bar chart: WER (%) (0–80) on female digits for MFCC, MFCC-LPC, PLP, PLP-DCT, Expolog, 20Bands-LPC, Big1-LPC, RFCC-DCT, RFCC-LPC under the conditions Neutral, LE, CLE, CLEF0 – features' performance on female digits]

Slide 55

LE Suppression in ASR – Frequency Warping

Maximum Likelihood (ML) Approach
Vocal tract length normalization (VTLN): mean formant locations are inversely proportional to vocal tract length (VTL) → compensation for inter-speaker VTL variations by frequency transformation (warping)
The warping factor α is searched to maximize the likelihood of the observations given the acoustic models:

α̂ = arg max_α Pr(O_α | W, Θ)

The factor is typically searched in the interval 0.8–1.2 (corresponds to the ratio of VTL differences between males and females)

Formant-Driven (FD) Approach
Warping factor determined from the estimated mean formant locations
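In practice the ML criterion above is evaluated on a grid of candidate factors; a minimal sketch where the hypothetical `score` callback stands in for the forced-alignment log-likelihood of the α-warped features:

```python
import numpy as np

def ml_warp_factor(observation, score, alphas=np.arange(0.8, 1.21, 0.02)):
    """VTLN: pick the warping factor maximizing the acoustic-model
    likelihood of the warped observation, alpha = argmax_a Pr(O_a | W, Theta)."""
    lls = [score(observation, a) for a in alphas]   # one alignment pass per alpha
    return float(alphas[int(np.argmax(lls))])
```

Each candidate α requires its own feature extraction and alignment pass, which is the computational cost the formant-driven approach avoids.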

Slide 56

Frequency Warping – VTLN Principle

[Scatter plot: formant frequencies F1–F4 (Hz) of Speaker1 (VTL1) vs. normalized Speaker NORM (VTLNORM); case: VTL1 = VTLNORM]

Slides 57–59

Frequency Warping – VTLN Principle

[Same plot for the case VTL1 > VTLNORM]

Slide 60

Frequency Warping – VTLN vs. Lombard Effect

[Same formant plot; under LE the formant shifts are uncertain – marked "?"]

Slide 61

Frequency Warping – VTLN vs. Lombard Effect

[Formant plot with F4 uncertain – what to choose: a good approximation of the low formants?]

Slide 62

Frequency Warping – Generalized Transform

[Formant plot – what to choose: a good approximation of the higher formants?]

Slide 63

Frequency Warping – Generalized Transform

[Formant plot – generalized transform, case: VTL1 ? VTLNORM]

Slide 64

Frequency Warping – Evaluation: VTLN vs. Generalized Transform

ML VTLN – WER (%):

| Set                      | Females Neutral | Females LE       | Males Neutral | Males LE         |
| # Digits                 | 2560            | 2560             | 1423          | 6303             |
| Baseline                 | 4.3 (3.5–5.0)   | 33.6 (31.8–35.5) | 2.2 (1.4–2.9) | 22.9 (21.8–23.9) |
| Utterance-dependent VTLN | 3.6 (2.9–4.3)   | 28.2 (26.4–29.9) | 1.8 (1.1–2.4) | 16.6 (15.7–17.6) |
| Speaker-dependent VTLN   | 4.0 (3.2–4.7)   | 27.7 (26.0–29.5) | 1.8 (1.1–2.4) | 17.4 (16.5–18.3) |

Generalized transform – WER (%):

| Set           | Females Neutral | Females LE       | Males Neutral | Males LE         |
| # Digits      | 2560            | 2560             | 1423          | 6303             |
| Baseline bank | 4.2 (3.4–5.0)   | 35.1 (33.3–37.0) | 2.2 (1.4–2.9) | 23.2 (22.1–24.2) |
| Warped bank   | 4.4 (3.6–5.2)   | 23.4 (21.8–25.0) | 1.8 (1.1–2.4) | 15.7 (14.8–16.6) |

The generalized transform better addresses LE-induced formant shifts
The formant-driven approach is less computationally demanding (no need for multiple alignment passes as in VTLN), but requires reliable formant tracking → a problem at low SNRs → the ML approach is more stable

Slide 65

LE Suppression in ASR – Two-Stage Recognizer (TSR)

[Block diagram: Speech Signal → Neutral/LE Classifier → Neutral Recognizer or LE Recognizer → Estimated Word Sequence]

Tandem Neutral/LE Classifier – Neutral/LE Dedicated Recognizers
Improving ASR features for LE often results in a performance tradeoff on neutral speech
Idea – combine separate systems ‘tuned’ for neutral and LE speech, directed by a neutral/LE classifier
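The two-stage scheme reduces to a simple dispatch; a sketch with hypothetical classifier and recognizer callables:

```python
def two_stage_recognize(signal, classifier, neutral_asr, le_asr):
    """Route the utterance to the recognizer matching the detected
    talking style (neutral vs. Lombard)."""
    if classifier(signal) == "LE":
        return le_asr(signal)      # LE-tuned front-end/models
    return neutral_asr(signal)     # standard neutral recognizer
```

Because the routing happens per utterance, the neutral recognizer's performance is untouched whenever the classifier is correct; overall accuracy is bounded by the classifier's utterance error rate.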

Slide 66

Two-Stage Recognizer (TSR) – Neutral/LE Classification

[Plot: magnitude (dB) vs. log frequency (Hz) – spectral slope of a female vowel /a/]

Proposal of Neutral/LE Classifier
Search for a set of features providing good discriminability between neutral and LE speech
Requirements – speaker-, gender-, and phonetic-content-independent classification
Extension of the set of analyzed features by the slope of the short-term spectra

Slide 67

Two-Stage Recognizer (TSR) – Neutral/LE Classification

Mean Spectral Slopes in Voiced Male/Female Speech (0–8000 Hz):

| Set | # N  | T (s) | Slope (dB/oct)      | σ (dB/oct) | # LE | T (s) | Slope (dB/oct)      | σ (dB/oct) |
| M   | 2587 | 618   | −7.42 (−7.48;−7.36) | 1.53       | 3532 | 1114  | −5.32 (−5.37;−5.27) | 1.55       |
| F   | 5558 | 1544  | −6.15 (−6.18;−6.12) | 1.30       | 5030 | 1926  | −3.91 (−3.96;−3.86) | 1.77       |

Overlap of Neutral/LE Spectral Slope Distributions (%):

| Set | 0–8000 Hz | 60–8000 Hz | 60–5000 Hz | 1k–5k Hz | 0–1000 Hz | 60–1000 Hz |
| M   | 26.00     | 28.13      | 29.47      | 100.00   | 27.81     | 27.96      |
| F   | 26.20     | 28.95      | 16.76      | 100.00   | 25.75     | 22.18      |
| M+F | 28.06     | 30.48      | 29.49      | 100.00   | 27.54     | 26.00      |

Classification Feature Set
A feature set providing superior classification performance on the development data set was found: SNR, spectral slope (60–1000 Hz), F0, σF0
Training of GMM and multi-layer perceptron (MLP) classifiers

Slide 68

Two-Stage Recognizer (TSR) – Neutral/LE Classification

Binary Classification Task
Acoustic observation o (classification feature vector) → class posteriors Pr(N), Pr(LE); decide for the class with the higher posterior

GMM Classifier (GMM_N, GMM_LE) – each mixture component is a multivariate Gaussian:

P(o) = (2π)^(−n/2) |Σ_i|^(−1/2) exp(−½ (o − μ_i)ᵀ Σ_i⁻¹ (o − μ_i))

MLP Classifier – hidden layers with sigmoid activations, softmax output:

f(q_j) = 1 / (1 + e^(−q_j))        (sigmoid)
f(q_i) = e^(q_i) / Σ_j e^(q_j)     (softmax)
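The GMM-based decision can be sketched with a single diagonal-covariance Gaussian per class standing in for each GMM; the parameters below (rough SNR and spectral-slope values) are illustrative only:

```python
import numpy as np

def gauss_logpdf(o, mu, var):
    """Log density of a diagonal-covariance Gaussian."""
    o, mu, var = map(np.asarray, (o, mu, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var)

def classify(o, params_n, params_le):
    """Return 'N' or 'LE' by comparing class log-likelihoods
    (equal class priors assumed)."""
    return "N" if gauss_logpdf(o, *params_n) > gauss_logpdf(o, *params_le) else "LE"
```

A full implementation would sum over mixture components and weight by class priors; the comparison of log-likelihoods is the same.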

Slide 69

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Feature Set

[Plots: normalized neutral/LE histograms of the development data with GMM PDFs (PDF_N, PDF_LE) and ANN posteriors (Pr(N), Pr(LE)) for SNR (dB) and spectral slope (dB/oct)]

Slide 70

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Feature Set

[Plots: normalized neutral/LE histograms with GMM PDFs and ANN posteriors for the F0 features (Hz)]

Slide 71

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Performance

UER – Utterance Error Rate – ratio of incorrectly classified utterances to all utterances

GMM classifier:

| Set          | Train          | CV            | Open          |
| # Utterances | 2202           | 270           | 1371          |
| UER (%)      | 9.9 (8.7–11.1) | 5.6 (2.8–8.3) | 1.6 (0.9–2.3) |

MLP classifier:

| Set          | Devel FM      | Open FM       | Devel DM      | Open DM       |
| # Utterances | 2472          | 1371          | 2472          | 1371          |
| UER (%)      | 6.6 (5.6–7.6) | 2.5 (1.7–3.3) | 8.1 (7.0–9.2) | 2.8 (1.9–3.6) |

Classification data sets – utterance statistics:

| Set   | # Utterances | T̄_Utter (s) | σ_T,Utter (s) |
| Devel | 2472         | 4.10        | 1.60          |
| Open  | 1371         | 4.01        | 1.50          |

Slide 72

Two-Stage Recognizer (TSR) – Overall Performance

WER (%):

| Set             | Real – neutral | Real – LE        |
| # Female digits | 1439           | 1837             |
| PLP             | 4.3 (3.3–5.4)  | 48.1 (45.8–50.4) |
| RFCC-LPC        | 6.5 (5.2–7.7)  | 28.3 (26.2–30.4) |
| MLP TSR         | 4.2 (3.2–5.3)  | 28.4 (26.4–30.5) |
| FM–GMLC TSR     | 4.4 (3.3–5.4)  | 28.4 (26.4–30.5) |
| DM–GMLC TSR     | 4.4 (3.3–5.4)  | 28.4 (26.3–30.4) |

Discrete recognizers are good either on neutral or on LE speech

[Block diagram: Speech Signal → Neutral/LE Classifier → Neutral Recognizer or LE Recognizer → Estimated Word Sequence]

Slide 73

LE Suppression in ASR – Comparison of Proposed Methods

[Bar chart: WER (%) (0–60) – Baseline Neutral, Baseline LE, and after LE suppression – for Model Adapt to LE (SI), Model Adapt to LE (SD), Voice Conversion (CLE), Modified FB (RFCC-LPC), VTLN Recognition (Utt. Dep. Warp), Formant Warping, MLP TSR]

Comparison of proposed techniques for LE-robust ASR

Slide 74

Thank You for Your Attention!