Spectral/Temporal Acoustic Features for Automatic Speech Recognition

Stephen A. Zahorian, Hongbing Hu, Jiang Wu
Department of Electrical and Computer Engineering
Binghamton University, State University of New York
November 16th, 2010



Overview of talk
- Background/Introduction
- Review of traditional spectral/temporal features
- DCTC/DCS features
- Experimental results
- Conclusions


Most Typical Speech Features for ASR
- Spectral Features (Static Features)
  - Represent the vocal tract information
  - MFCCs (Mel-Frequency Cepstral Coefficients)
- Temporal Features (Dynamic Features)
  - Capture time variation (trajectory) of spectral features
  - Delta and Delta-Delta terms of MFCCs
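Delta terms of this kind are commonly computed with a linear-regression formula over neighboring frames. The sketch below shows one way to do that in Python with NumPy. The function name `delta`, the window half-width of 2 frames, and the edge padding by repetition are illustrative assumptions, not details given in the talk.

```python
import numpy as np

def delta(features, half_width=2):
    """Regression-based delta of a (frames x coefficients) feature matrix."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    # Repeat the first and last frames so every frame has full context.
    padded = np.pad(features, ((half_width, half_width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, half_width + 1))
    out = np.zeros_like(features)
    for k in range(1, half_width + 1):
        ahead = padded[half_width + k : half_width + k + n_frames]
        behind = padded[half_width - k : half_width - k + n_frames]
        out += k * (ahead - behind)
    return out / denom

# Static features plus delta and delta-delta terms, stacked per frame.
static = np.random.default_rng(0).normal(size=(50, 13))  # stand-in for MFCCs
d1 = delta(static)
d2 = delta(d1)
full = np.hstack([static, d1, d2])
```

Applying `delta` twice gives the delta-delta terms, so 13 static coefficients become a 39-dimensional vector per frame.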


MFCCs (Mel-Frequency Cepstral Coefficients)

Mel-Frequency Scale

The coefficients c_i are calculated from the log filter-bank amplitudes using the cosine transform

[Figure: Mel scale filter banks (20); x-axis: Frequency (kHz), 0 to 4]

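$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

$$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi i}{N}(j - 0.5)\right)$$

N: Number of banks
m_j: Log amplitudes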
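As a rough illustration of the two formulas on this slide, the following Python sketch converts frequency to the Mel scale and applies the cosine transform to a vector of log filter-bank amplitudes. It assumes the common sqrt(2/N) normalization and lets the coefficient index start at 0. The function names and the made-up amplitudes are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mfcc_from_log_amplitudes(log_amps, n_coeffs=13):
    """c_i = sqrt(2/N) * sum_j m_j * cos(pi * i / N * (j - 0.5)), j = 1..N."""
    m = np.asarray(log_amps, dtype=float)
    n_banks = m.shape[-1]
    i = np.arange(n_coeffs)[:, None]        # coefficient index (starts at 0 here)
    j = np.arange(1, n_banks + 1)[None, :]  # filter-bank index 1..N
    basis = np.cos(np.pi * i / n_banks * (j - 0.5))
    return np.sqrt(2.0 / n_banks) * (m @ basis.T)

log_amps = np.log(np.linspace(1.0, 2.0, 20))  # made-up amplitudes for 20 banks
coeffs = mfcc_from_log_amplitudes(log_amps)
```

A flat log spectrum puts all of its energy into the zeroth coefficient, which is one reason the higher coefficients are read as descriptors of spectral shape.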


Speech Recognition Architecture

[Diagram: Speech Waveform -> Feature Extraction -> Speech Features -> Classification (Recognition) by the Recognizer (HMM/NN) -> Phonemes ("i n i: d sil e") -> Words ("I need a")]


Hidden Markov Models (HMMs)
- Speech vectors are generated by a Markov model
- The overall probability is calculated as the product of the transition and output probabilities
- Likelihood can be approximated by only considering the most likely state sequence
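The last two points can be sketched in a few lines of Python. `path_log_prob` multiplies transition and output probabilities along one state sequence (as a sum of logs), and `viterbi` finds the most likely sequence. The 3-state left-to-right model and all probabilities are made-up toy values, and the model is assumed to start in state 0.

```python
import numpy as np

def path_log_prob(log_trans, log_out, path):
    """Log of the product of transition and output probabilities along one path.

    log_trans[a, b]: log transition probability from state a to state b.
    log_out[t, s]:   log output probability of frame t in state s.
    The model is assumed to start in path[0] with probability 1.
    """
    total = log_out[0, path[0]]
    for t in range(1, len(path)):
        total += log_trans[path[t - 1], path[t]] + log_out[t, path[t]]
    return total

def viterbi(log_trans, log_out, start_state=0):
    """Most likely state sequence and its log probability."""
    n_frames, n_states = log_out.shape
    score = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)
    score[0, start_state] = log_out[0, start_state]
    for t in range(1, n_frames):
        cand = score[t - 1][:, None] + log_trans  # cand[a, b]: come from a, go to b
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(n_states)] + log_out[t]
    path = [int(np.argmax(score[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score[-1]))

# Toy 3-state left-to-right model with no skips (all numbers are made up).
with np.errstate(divide="ignore"):
    log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                                 [0.0, 0.7, 0.3],
                                 [0.0, 0.0, 1.0]]))
log_out = np.log(np.array([[0.9, 0.05, 0.05],
                           [0.6, 0.3, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.1, 0.2, 0.7]]))
best_path, best_log_prob = viterbi(log_trans, log_out)
```

Working in log probabilities turns the product into a sum and avoids numerical underflow on long utterances.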


DCTC Features
- Discrete Cosine Transform Coefficients (DCTCs)
- Given the spectrum X with the frequency f normalized to a [0, 1] range, the ith DCTC is calculated:

$$\mathrm{DCTC}(i) = \int_0^1 a(X(g(f)))\,\phi_i(f)\,df$$

Basis vector: $$\phi_i(f) = \cos[\pi i\, g(f)]\,\frac{dg}{df}$$

a(X): nonlinear amplitude scaling (log)
g(f): nonlinear frequency warping (Mel-like function)

[Figure: First 3 DCTC basis vectors (BV0, BV1, BV2); x-axis: Frequency [kHz], 0 to 8; y-axis: Amplitude]
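A minimal numerical sketch of the DCTC computation is given below. It assumes the log spectrum is sampled on the linear normalized frequency axis, so that the warping enters only through the basis vectors cos(pi i g(f)) dg/df. The particular Mel-like `warp`, the 8 kHz band edge, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def warp(f):
    """Hypothetical Mel-like warping g(f) of normalized frequency; g(0)=0, g(1)=1."""
    f_max_hz = 8000.0  # assumed upper band edge
    mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    return mel(f * f_max_hz) / mel(f_max_hz)

def dctc_basis(n_terms, n_bins):
    """Rows are cos(pi * i * g(f)) * dg/df sampled at bin centers."""
    f = (np.arange(n_bins) + 0.5) / n_bins   # normalized frequency in (0, 1)
    g = warp(f)
    dg_df = np.gradient(g, f)                # numerical derivative of the warping
    i = np.arange(n_terms)[:, None]
    return np.cos(np.pi * i * g[None, :]) * dg_df[None, :]

def dctc(log_spectrum, n_terms=13):
    """Approximate the integral over f in [0, 1] by a Riemann sum."""
    x = np.asarray(log_spectrum, dtype=float)
    basis = dctc_basis(n_terms, x.shape[-1])
    return x @ basis.T / x.shape[-1]

spectrum = np.abs(np.fft.rfft(np.random.default_rng(0).normal(size=512)))[1:]
features = dctc(np.log(spectrum + 1e-10))    # a(X) taken as log
```

The dg/df factor gives more weight to low frequencies, where a Mel-like warping is steepest, which matches the larger low-frequency amplitudes of the plotted basis vectors.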


DCS Features
- Discrete Cosine Series Coefficients (DCSCs)
- Represent the spectral evolution of DCTCs over time and encode the modulation spectrum

$$\mathrm{DCSC}(i, j) = \int_0^1 \mathrm{DCTC}(i, h(t))\,\theta_j(t)\,dt$$

Basis vectors: $$\theta_j(t) = \cos[\pi j\, h(t)]\,\frac{dh}{dt}$$

h(t): time “warping” function, giving non-uniform time resolution

[Figure: First 3 DCSC basis vectors (BV0, BV1, BV2); x-axis: Time [ms], -60 to 60; y-axis: Amplitude]
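The time-domain expansion can be sketched the same way. The slide does not specify h(t), so the code below assumes that dh/dt is a Kaiser window normalized to unit area, which concentrates time resolution at the block center. The window choice, `beta`, and the function names are hypothetical.

```python
import numpy as np

def dcs_basis(n_terms, n_frames, beta=5.0):
    """Rows are cos(pi * j * h(t)) * dh/dt over one block of frames.

    dh/dt is taken to be a Kaiser window normalized to unit area (an assumption),
    so h(t) rises from 0 to 1 and resolves time most finely at the block center.
    """
    dh_dt = np.kaiser(n_frames, beta)
    dh_dt = dh_dt / dh_dt.sum()              # weights sum to 1 over the block
    h = np.cumsum(dh_dt) - 0.5 * dh_dt       # midpoint value of h in each frame
    j = np.arange(n_terms)[:, None]
    return np.cos(np.pi * j * h[None, :]) * dh_dt[None, :]

def dcsc(dctc_block, n_terms=3):
    """DCSC(i, j): project each DCTC trajectory (frames x DCTCs) onto the basis."""
    block = np.asarray(dctc_block, dtype=float)
    basis = dcs_basis(n_terms, block.shape[0])
    return block.T @ basis.T                 # shape (DCTCs, DCS terms)

block = np.random.default_rng(0).normal(size=(20, 5))  # 20 frames of 5 DCTCs
coeffs = dcsc(block)
```

With a symmetric window, a DCTC that stays constant over the block contributes only to the j = 0 term, so the higher DCS terms respond purely to change over time.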


Example
- Original spectrogram, and its rebuilt version with different selection of features.

[Figure: three spectrogram panels: Original spectrogram; Rebuilt with 13 DCTC and 3 DCS terms; Rebuilt with 8 DCTC and 5 DCS terms. x-axis: Time (Sec), 0.5 to 1.5; y-axis: Frequency (Hz), 0 to 8000]


DCTC/DCS Computation

[Diagram: Spectrogram -> DCTC/DCS Features. The original spectrogram (Time (Sec), 0.5 to 1.5; Frequency (Hz), 0 to 8000) is marked with a Frame Length and a Block Length; DCTC 1 to DCTC 5 are shown along the frequency axis and DCS1 to DCS3 along the time axis]
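The two-stage computation in the diagram can be sketched as a frame-level transform followed by a block-level transform. To stay short, the example below uses plain (unwarped) cosine bases in both frequency and time, a simplification of the warped bases described earlier. The block size, block step, and function names are hypothetical.

```python
import numpy as np

def cosine_basis(n_terms, n_points):
    """Unwarped cosine basis sampled at n_points midpoints of [0, 1]."""
    x = (np.arange(n_points) + 0.5) / n_points
    k = np.arange(n_terms)[:, None]
    return np.cos(np.pi * k * x[None, :]) / n_points

def dctc_dcs_features(log_spec, n_dctc=5, n_dcs=3, block_frames=20, block_step=4):
    """Map a (frames x bins) log spectrogram to one feature vector per block."""
    log_spec = np.asarray(log_spec, dtype=float)
    n_frames, n_bins = log_spec.shape
    # Step 1: spectral analysis, one DCTC vector per frame.
    dctcs = log_spec @ cosine_basis(n_dctc, n_bins).T        # (frames, n_dctc)
    # Step 2: temporal analysis, DCS terms of each DCTC over a sliding block.
    time_basis = cosine_basis(n_dcs, block_frames)           # (n_dcs, block_frames)
    starts = range(0, n_frames - block_frames + 1, block_step)
    feats = [(time_basis @ dctcs[s:s + block_frames]).ravel() for s in starts]
    return np.array(feats)                                   # (blocks, n_dcs * n_dctc)

log_spec = np.random.default_rng(0).normal(size=(100, 128))
features = dctc_dcs_features(log_spec)
```

Each block yields n_dctc times n_dcs values, so 5 DCTCs and 3 DCS terms give a 15-dimensional feature vector per block.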


Experimental Evaluation

Database: TIMIT database (“SI” and “SX” only)
Phoneme: Reduced 48 phone set mapped down from the TIMIT 62 phone set
Training data: 3696 sentences (462 speakers)
Testing data: 1344 sentences (168 speakers)
Recognizer: HMMs
- Left-to-right Markov models with no skip
- 48 monophone HMMs are created using the HTK toolkit
- Bigram phone information was used as the language model

Cambridge University/Microsoft HTK toolkit (Ver3.4)
- Provides powerful tools for data preparation, HMM training and testing, and result analysis


Experimental Evaluation
- TIMIT database
  - 630 total speakers, 10 sentences each
  - 462 speakers for training, 168 test speakers
- 3 state HMM phone models
- Results given as phone accuracy for 39 “standard” phone categories
- Number of mixtures per state “relatively” high to maximize accuracy


Evaluation with Static Only Features
- Vary frame length from 5 ms to 30 ms (5 ms as the frame space)
- Vary number of DCTCs (7, 10, 13, 16, 19)
- 8 GMM mixtures for each state of HMMs

[Figure: Static only features; 3-D plot of Recognition Rate (%), 48 to 58, versus Frame length (ms), 5 to 30, and Number of used DCTC, 5 to 20]


Evaluation with Dynamic Features
- Use small number of DCTCs (1, 2, 3, or 4), and vary the number of DCSs
- Vary the number of frames per block, so that DCS terms are computed over 50, 100, or 300 ms
- 10 ms frame length, 5 ms frame space
- 8 GMM mixtures for each HMM state


[Figure: four 3-D plots titled Dynamic features w/ 1 DCTC, w/ 2 DCTCs, w/ 3 DCTCs, and w/ 4 DCTCs; each shows Recognition rate (%), 30 to 60, versus Block length (ms), 100 to 300, and Number of DCS, 3 to 8]


Evaluation with Spectral/Temporal Features
- Use 40 features total, and 40 GMM mixtures.
- Vary frame length and the number of frames per block
- 2 ms frame space
- 8 ms block space
- Vary the combination of different numbers of DCTCs and DCSs, but fix number of parameters to 40

Frame Length Effect [bar chart, vertical axis 64 to 72]
  10 ms: 69.225
  15 ms: 69.0592
  25 ms: 68.4554

Block Length Effect [bar chart, vertical axis 64 to 72]
  50 ms: 68.8663
  100 ms: 70.2012
  300 ms: 67.6721


Evaluation with Spectral/Temporal Features

Best Result of Each Condition [bar chart, vertical axis 64 to 72]

Condition   DCTCs   DCSs   Best result
1           8       5      69.69
2           9       5      70.38
3           10      4      70.71
4           11      4      71.19
5           12      4      71.43
6           13      4      71.41
7           14      3      71.11
8           15      3      71.11


Conclusions from these results
- Features which represent trajectories of global spectral shape carry considerable information for ASR.
- There are tradeoffs between “static” spectral features and “dynamic” spectral trajectory features
- Spectral resolution can be relatively low for spectral ASR features
- “Information” in trajectory features is more “dilute” than in spectral features


Questions?