Electrical and Computer Engineering, Binghamton University, State University of New York

Spectral/Temporal Acoustic Features for Automatic Speech Recognition

Stephen A. Zahorian, Hongbing Hu, Jiang Wu
Department of Electrical and Computer Engineering
Binghamton University
November 16th, 2010


Overview of talk
  Background/Introduction
  Review of traditional spectral/temporal features
  DCTC/DCS features
  Experimental results
  Conclusions


Most Typical Speech Features for ASR

Spectral Features (Static Features)
  Represent the vocal tract information
  MFCCs (Mel-Frequency Cepstral Coefficients)

Temporal Features (Dynamic Features)
  Capture time variation (trajectory) of spectral features
  Delta and Delta-Delta terms of MFCCs


MFCCs (Mel-Frequency Cepstral Coefficients)

Mel-Frequency Scale

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

[Figure: Mel scale filter banks (20 filters); x-axis: Frequency (kHz), 0 to 4]

The coefficients $c_i$ are calculated from the log filter-bank amplitudes using the cosine transform:

$$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi i}{N}(j - 0.5)\right)$$

  N: number of filter banks
  $m_j$: log filter-bank amplitudes
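The two formulas on this slide can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, the filter-bank analysis that produces the log amplitudes is assumed to have been done already, and returning coefficients starting at index 0 is a choice made here, not something stated on the slide.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700), as on the slide."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel, useful for placing filter-bank centre frequencies."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mfcc_from_log_amplitudes(m, num_coeffs):
    """c_i = sqrt(2/N) * sum_{j=1..N} m_j * cos(pi * i / N * (j - 0.5)).

    m holds the N log filter-bank amplitudes of one frame.
    Returns c_0 .. c_{num_coeffs-1} (index range is an assumption of this sketch).
    """
    m = np.asarray(m, dtype=float)
    N = m.shape[0]
    j = np.arange(1, N + 1)                      # filter-bank index 1..N
    i = np.arange(num_coeffs).reshape(-1, 1)     # coefficient index, one row each
    basis = np.cos(np.pi * i / N * (j - 0.5))    # (num_coeffs, N) cosine basis
    return np.sqrt(2.0 / N) * basis @ m

# Usage with made-up log amplitudes from a 20-channel filter bank.
log_amplitudes = np.log(np.linspace(1.0, 2.0, 20))
coeffs = mfcc_from_log_amplitudes(log_amplitudes, 13)
```

A flat set of log amplitudes puts all of its energy in the index-0 coefficient, which is a quick sanity check on the cosine basis.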


Speech Recognition Architecture

Speech Waveform -> Feature Extraction -> Speech Features -> Classification (Recognition) -> Phonemes -> Words

  Recognizer: HMM/NN
  Example phoneme output: i n i: d sil e
  Example word output: I need a


Hidden Markov Models (HMMs)
  Speech vectors are generated by a Markov model.
  The overall probability along a state sequence is calculated as the product of the transition and output probabilities.
  The likelihood can be approximated by considering only the most likely state sequence.
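The difference between the full likelihood and its most-likely-path approximation can be illustrated with a toy model. This is a minimal sketch under stated assumptions: the function names, the 3-state left-to-right transition matrix, and the output probabilities are all made up for illustration, and plain probabilities are used where a practical recognizer would work in the log domain to avoid underflow.

```python
import numpy as np

def forward_likelihood(pi, A, B):
    """Likelihood summed over all state sequences (forward recursion).

    pi: initial state probabilities, shape (S,)
    A:  transition probabilities, A[r, s] = P(next state s | state r)
    B:  output probabilities, B[t, s] = P(frame t | state s)
    """
    alpha = pi * B[0]
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]
    return float(alpha.sum())

def viterbi_likelihood(pi, A, B):
    """Probability of the single most likely state sequence (max replaces sum)."""
    delta = pi * B[0]
    for t in range(1, B.shape[0]):
        delta = (delta[:, None] * A).max(axis=0) * B[t]
    return float(delta.max())

# Toy left-to-right model with no skips (all values are illustrative).
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.2, 0.1],
              [0.5, 0.6, 0.1],
              [0.1, 0.7, 0.3],
              [0.1, 0.2, 0.8]])
total = forward_likelihood(pi, A, B)
best_path = viterbi_likelihood(pi, A, B)
```

Each path probability is a product of transition and output probabilities; the forward recursion sums these products over all paths, while the Viterbi recursion keeps only the largest, so `best_path` can never exceed `total`.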


DCTC Features

Discrete Cosine Transform Coefficients (DCTCs)

Given the spectrum X with the frequency f normalized to a [0, 1] range, the ith DCTC is calculated as:

$$\mathrm{DCTC}(i) = \int_0^1 a\big(X(g(f))\big)\,\phi_i(f)\,df$$

Basis vectors: $\phi_i(f) = \cos[\pi i\, g(f)]\,\frac{dg}{df}$

  a(X): nonlinear amplitude scaling (log)
  g(f): nonlinear frequency warping (Mel-like function)

[Figure: First 3 DCTC basis vectors (BV0, BV1, BV2); x-axis: Frequency [kHz], 0 to 8; y-axis: Amplitude]
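A rough numerical sketch of the DCTC integral follows. Several things here are assumptions rather than details from the slides: the specific Mel-like warping g(f) (a normalized log warp with an 8 kHz upper frequency, taken from the plot axis), the floor applied before the log, the midpoint-rule integration, and all function names. The spectrum is sampled on a uniform grid in f and the warping is carried entirely by the basis vectors, which is one reading of the formula above.

```python
import numpy as np

F_MAX_HZ = 8000.0            # assumed upper frequency, matching the 0-8 kHz plot axis
K = F_MAX_HZ / 700.0         # constant of the assumed Mel-like warp

def warp(f):
    """Assumed Mel-like warping g(f), mapping [0, 1] onto [0, 1]."""
    return np.log1p(K * f) / np.log1p(K)

def warp_derivative(f):
    """Analytic dg/df for the warp above."""
    return K / ((1.0 + K * f) * np.log1p(K))

def dctc_basis(num_terms, num_bins):
    """phi_i(f) = cos(pi * i * g(f)) * dg/df sampled at bin midpoints of [0, 1]."""
    f = (np.arange(num_bins) + 0.5) / num_bins
    i = np.arange(num_terms).reshape(-1, 1)
    return np.cos(np.pi * i * warp(f)) * warp_derivative(f)

def dctc(power_spectrum, num_terms):
    """Midpoint-rule approximation of the DCTC integral with a(X) = log(X).

    power_spectrum: shape (num_bins,) or (num_frames, num_bins).
    """
    x = np.log(np.maximum(np.asarray(power_spectrum, dtype=float), 1e-10))
    basis = dctc_basis(num_terms, x.shape[-1])
    return x @ basis.T / x.shape[-1]             # df = 1 / num_bins

# Usage with a made-up, smoothly decaying spectrum of 512 bins.
spectrum = np.exp(-np.linspace(0.0, 4.0, 512))
coeffs = dctc(spectrum, 13)
```

Because dg/df is largest at low frequencies for this warp, the basis vectors weight the low-frequency part of the spectrum more heavily, which is the intended effect of a Mel-like warping.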


DCS Features

Discrete Cosine Series Coefficients (DCSCs)

Represent the spectral evolution of DCTCs over time and encode the modulation spectrum:

$$\mathrm{DCSC}(i, j) = \int_0^1 \mathrm{DCTC}(i, h(t))\,\theta_j(t)\,dt$$

Basis vectors: $\theta_j(t) = \cos[\pi j\, h(t)]\,\frac{dh}{dt}$

  h(t): time “warping” function (non-uniform time resolution)

[Figure: First 3 DCSC basis vectors (BV0, BV1, BV2); x-axis: Time [ms], -60 to 60; y-axis: Amplitude]
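The DCS basis vectors can be sketched the same way over a block of frames. The slides do not specify h(t); this sketch assumes that dh/dt follows a Kaiser window, so that time resolution is highest at the block centre. That choice, the `beta` value, and the function names are all assumptions for illustration.

```python
import numpy as np

def dcs_basis(num_terms, num_frames, beta=5.0):
    """theta_j(t) = cos(pi * j * h(t)) * dh/dt, sampled once per frame.

    Assumption: dh/dt is proportional to a Kaiser window, normalized so
    that h runs from 0 to 1 across the block.
    """
    window = np.kaiser(num_frames, beta)
    dh = window / window.sum()                   # dh/dt times the frame step
    h = np.cumsum(dh) - 0.5 * dh                 # h evaluated at frame midpoints
    j = np.arange(num_terms).reshape(-1, 1)
    return np.cos(np.pi * j * h) * dh            # (num_terms, num_frames)

def dcs(dctc_block, num_terms):
    """DCS terms for one block of DCTC trajectories.

    dctc_block: shape (num_frames, num_dctc), one row of DCTCs per frame.
    Returns shape (num_dctc, num_terms): entry [i, j] approximates DCSC(i, j).
    """
    block = np.asarray(dctc_block, dtype=float)
    basis = dcs_basis(num_terms, block.shape[0])
    return (basis @ block).T

# Usage with a made-up block of 50 frames and 8 DCTCs per frame.
rng = np.random.default_rng(0)
block = rng.normal(size=(50, 8))
terms = dcs(block, 5)                            # shape (8, 5)
```

With this construction the first basis vector is a weighted average over the block, and higher terms capture progressively faster changes in each DCTC trajectory.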


Example

Original spectrogram, and its rebuilt version with different selections of features.

[Figure: three spectrogram panels, Frequency (Hz), 0 to 8000, versus Time (Sec), 0.5 to 1.5: Original spectrogram; Rebuilt with 13 DCTC and 3 DCS terms; Rebuilt with 8 DCTC and 5 DCS terms]


DCTC/DCS Computation

[Figure: Original Spectrogram, Frequency (Hz), 0 to 8000, versus Time (Sec), 0.5 to 1.5, with Frame Length and Block Length marked. Each frame of the spectrogram yields DCTC 1 to DCTC 5; each block of frames yields DCS1 to DCS3; together these form the DCTC/DCS features.]
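The frame/block pipeline in the diagram can be sketched end to end as follows. To keep the example short and self-contained it uses plain, unwarped cosine bases in both frequency and time, which is a simplification and not the warped bases described on the earlier slides. The function names, the block step, and the random input are hypothetical.

```python
import numpy as np

def cosine_basis(num_terms, num_points):
    """Unwarped cosine basis sampled at midpoints of [0, 1] (simplifying assumption)."""
    u = (np.arange(num_points) + 0.5) / num_points
    k = np.arange(num_terms).reshape(-1, 1)
    return np.cos(np.pi * k * u) / num_points

def dctc_dcs_features(log_spectrogram, num_dctc, num_dcs, frames_per_block, block_step):
    """Turn a log spectrogram (frames x frequency bins) into DCTC/DCS feature vectors.

    Step 1: each frame is reduced to num_dctc coefficients across frequency.
    Step 2: each sliding block of frames_per_block frames is reduced to
            num_dcs coefficients per DCTC trajectory across time.
    Returns one feature vector of length num_dctc * num_dcs per block.
    """
    num_frames, num_bins = log_spectrogram.shape
    dctc = log_spectrogram @ cosine_basis(num_dctc, num_bins).T    # (frames, num_dctc)
    time_basis = cosine_basis(num_dcs, frames_per_block)           # (num_dcs, frames_per_block)
    features = []
    for start in range(0, num_frames - frames_per_block + 1, block_step):
        block = dctc[start:start + frames_per_block]               # (frames_per_block, num_dctc)
        features.append((time_basis @ block).T.ravel())            # DCTC-major ordering
    return np.array(features)

# Usage: 8 DCTCs x 5 DCSs gives 40 features per block (input data is made up).
rng = np.random.default_rng(0)
log_spec = rng.normal(size=(100, 64))
feats = dctc_dcs_features(log_spec, num_dctc=8, num_dcs=5,
                          frames_per_block=50, block_step=4)
```

The frame length sets the spectral analysis window, while `frames_per_block` and `block_step` set the block length and block spacing in units of frames.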


Experimental Evaluation

Database: TIMIT database (“SI” and “SX” only)
  Phoneme set: reduced 48 phone set mapped down from the TIMIT 62 phone set
  Training data: 3696 sentences (462 speakers)
  Testing data: 1344 sentences (168 speakers)

Recognizer: HMMs
  Left-to-right Markov models with no skip
  48 monophone HMMs are created using the HTK toolkit
  Bigram phone information was used as the language model

Cambridge University/Microsoft HTK toolkit (Ver 3.4)
  Provides powerful tools for data preparation, HMM training and testing, and result analysis


Experimental Evaluation

TIMIT database
  630 total speakers, 10 sentences each
  462 speakers for training, 168 test speakers

3-state HMM phone models
Results given as phone accuracy for 39 “standard” phone categories
Number of mixtures per state “relatively” high to maximize accuracy


Evaluation with Static Only Features
  Vary frame length from 5 ms to 30 ms (5 ms frame space)
  Vary number of DCTCs (7, 10, 13, 16, 19)
  8 GMM mixtures for each HMM state

[Figure: Static only features; 3-D plot of Recognition Rate (%) versus Frame length (ms) and Number of DCTCs used]


Evaluation with Dynamic Features
  Use a small number of DCTCs (1, 2, 3, or 4), and vary the number of DCSs
  Vary the number of frames per block, so that DCS terms are computed over 50, 100, or 300 ms
  10 ms frame length, 5 ms frame space
  8 GMM mixtures for each HMM state

[Figure: four 3-D plots of Recognition rate (%) versus Block length (ms) and Number of DCS; panels include “Dynamic features w/ 1 DCTC” and “Dynamic features w/ 2 DCTCs”]


Evaluation with Spectral/Temporal Features
  Use 40 features total, and 40 GMM mixtures
  Vary frame length and the number of frames per block
  2 ms frame space
  8 ms block space
  Vary the combination of different numbers of DCTCs and DCSs, but fix the number of parameters to 40

Frame Length Effect
  Frame length    Accuracy (%)
  10 ms           69.225
  15 ms           69.0592
  25 ms           68.4554

Block Length Effect
  Block length    Accuracy (%)
  50 ms           68.8663
  100 ms          70.2012
  300 ms          67.6721


Evaluation with Spectral/Temporal Features

Condition 1: 8 DCTCs and 5 DCSs
Condition 3: 10 DCTCs and 4 DCSs
Condition 4: 11 DCTCs and 4 DCSs

Best Result of Each Condition
  Condition    Accuracy (%)
  1            69.69
  2            70.38
  3            70.71
  4            71.19
  5            71.43
  6            71.41
  7            71.11
  8            71.11






Conclusions from these results
  Features which represent trajectories of global spectral shape carry considerable information for ASR.
  There are tradeoffs between “static” spectral features and “dynamic” spectral trajectory features.
  Spectral resolution can be relatively low for spectral ASR features.
  “Information” in trajectory features is more “dilute” than in spectral features.


Questions?
