
SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS


Page 1: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Nelson Morgan, Barry Y Chen, Qifeng Zhu, Andreas Stolcke
International Computer Science Institute, Berkeley, CA, USA

Presenter: Chen Hung-Bin

2004 Special Workshop in Maui (SWIM)

Page 2: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Outline

• Introduction
• Conventional Features
• Multi-Layered Perceptrons (MLPs)
• Three different temporal resolutions
• Experiments
• Conclusion

Page 3: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Introduction

• In this paper, we describe a three-stage process of scaling up to the larger conversational telephone speech (CTS) task.

• One goal was to improve conversational telephone speech (CTS) recognition by modifying the acoustic front end.
– We found that approaches developed for the recognition of natural numbers scaled quite well to two different levels of CTS complexity:

• recognition of utterances primarily consisting of the 500 most frequent words in Switchboard

• and large vocabulary recognition of Switchboard conversations

Page 4: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Conventional Features

• Mel Frequency Cepstral Coefficients (MFCC)
• Perceptual Linear Prediction (PLP)
• Hidden Activation TRAPS (HATS)
• Modulation-filtered spectrogram (MSG)
• Relative Spectral Perceptual Linear Prediction (RASTA-PLP)

Page 5: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Perceptual Linear Prediction

• Equal loudness preemphasis (weighting by the equal-loudness curve)

E(f) = \frac{f^4 (f^2 + 1.44 \times 10^6)}{(f^2 + 1.6 \times 10^5)^2 (f^2 + 9.61 \times 10^6)}

The human ear is most sensitive around 4 kHz.

Page 6: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Perceptual Linear Prediction

• Intensity-loudness power law

Perceived loudness, L(w), is approximately the cube root of the intensity, I(w):

L(w) = I(w)^{1/3}
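As an illustration (not code from the paper), here is a minimal Python/NumPy sketch that applies the equal-loudness weighting E(f) from the previous slide and the cube-root intensity-loudness compression to a vector of filterbank energies; the centre frequencies and energies below are made-up placeholders.

import numpy as np

def equal_loudness(f_hz):
    # Equal-loudness curve E(f) with the HTK-style constants shown earlier
    fsq = f_hz ** 2
    return (fsq / (fsq + 1.6e5)) ** 2 * (fsq + 1.44e6) / (fsq + 9.61e6)

# Hypothetical critical-band centre frequencies (Hz) and band energies
cf = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0])
energies = np.array([1.0, 2.0, 4.0, 3.0, 2.5])

# Equal-loudness pre-emphasis followed by the intensity-loudness power law
weighted = energies * equal_loudness(cf)
loudness = np.cbrt(weighted)   # L(w) = I(w)^(1/3)
print(loudness)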

Page 7: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Perceptual Linear Prediction

• HTK implementation:
– Fill filterbank channels
– Apply the equal-loudness curve
– Do an IDFT to get autocorrelation values
– Convert from LPC to cepstral coefficients

// Mel to Hz conversion and equal-loudness curve
for( i=1; i<=pOrder; i++ ) {
    // cf = fbank centre frequencies (Mel); convert back to Hz
    f_hz_mid = 700*(exp(cf[i]/1127)-1);
    fsq = f_hz_mid * f_hz_mid;
    fsub = fsq / (fsq + 1.6e5);
    EQL[i] = fsub * fsub * ((fsq + 1.44e6) / (fsq + 9.61e6));
}

// Weight each filterbank output by the equal-loudness curve
for( i=1; i<=pOrder; i++ ) {
    p[i+1] = bins[i] * EQL[i];
}

E(f) = \frac{f^4 (f^2 + 1.44 \times 10^6)}{(f^2 + 1.6 \times 10^5)^2 (f^2 + 9.61 \times 10^6)}

Page 8: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

RASTA-PLP

• modulation filtering

RASTA filter:

H(z) = 0.1 \, z^4 \cdot \frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{1 - 0.94 z^{-1}}

Perceptually Inspired Signal-processing Strategies forRobust Speech Recognition in Reverberant Environments, 1998
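As a rough sketch (not from the paper), the band-pass filter above can be applied to each log critical-band energy trajectory with scipy.signal.lfilter; the z^4 advance is dropped here, which simply leaves a fixed frame delay, and the 100 x 15 trajectory below is a random placeholder.

import numpy as np
from scipy.signal import lfilter

# RASTA band-pass filter: H(z) = 0.1*(2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.94 z^-1)
b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
a = np.array([1.0, -0.94])

def rasta_filter(log_energies):
    # Filter each band's trajectory along time; input shape (frames, bands)
    return lfilter(b, a, log_energies, axis=0)

# Placeholder: 100 frames x 15 critical bands of log energies
traj = np.log(np.random.rand(100, 15) + 1e-6)
filtered = rasta_filter(traj)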

Page 9: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Multi-Layered Perceptrons (MLPs)

• A multilayer perceptron is a feedforward neural network with one or more hidden layers

• The signals are propagated in a forward direction on a layer-by-layer basis

• The network consists of
– an input layer of source neurons
– at least one hidden layer of computational neurons
– an output layer of computational neurons
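To make the forward pass concrete, here is a minimal NumPy sketch of a one-hidden-layer MLP with sigmoid hidden units and a softmax output over phone classes; the layer sizes (9 frames x 39 dims = 351 inputs, 400 hidden units, 46 phone classes) and the random weights are illustrative assumptions, not values from the paper.

import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # Hidden layer: sigmoid activations
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    # Output layer: softmax posteriors over classes
    z = h @ W2 + b2
    z -= z.max()          # numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.normal(size=(351, 400)), np.zeros(400)
W2, b2 = 0.01 * rng.normal(size=(400, 46)), np.zeros(46)
posteriors = mlp_forward(rng.normal(size=351), W1, b1, W2, b2)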

Page 10: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

TempoRAl Patterns (TRAPs)

• A spectral-energy-based vector at time t with variables
– based on posterior probabilities of speech categories for long and short time functions of the time-frequency plane
– these features may be represented as multiple streams of probabilistic information

• Working with narrow spectral subbands and long temporal windows
– Naive One Stage Approach
– Two Stage Linear Approaches
– Two Stage Non-Linear Approaches

Hidden Activation TRAPS (HATS)

Page 11: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Naive One Stage Approach

• Baseline approach
– 51 frames of all 15 bands of log critical band energies (LCBEs) as inputs to an MLP
– These inputs are built by stacking 25 frames before and after the current frame onto the current frame, and the target phoneme comes from the current frame
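A small NumPy sketch of this input construction, assuming a (frames x 15) matrix of log critical-band energies; the array contents and the chosen centre frame are placeholders.

import numpy as np

def stack_context(lcbe, center, context=25):
    # Take 25 frames before and after the centre frame: 51 x 15 = 765 inputs
    window = lcbe[center - context : center + context + 1]
    return window.reshape(-1)

# Placeholder: 1000 frames of 15 log critical-band energies
lcbe = np.log(np.random.rand(1000, 15) + 1e-6)
mlp_input = stack_context(lcbe, center=500)   # target = phone label of frame 500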

Page 12: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Two Stage Linear Approaches

• 15 bands × 51 frames
– First, calculate principal component analysis (PCA) transforms for each critical band
– Second, combine what was learned in each critical band to estimate phone posteriors
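The first (per-band) stage can be sketched in NumPy as plain PCA via eigendecomposition of the covariance matrix; the number of retained components (10) and the random per-band trajectories are assumptions for illustration only.

import numpy as np

def pca_transform(data, n_components):
    # Project the data onto the top principal components of its covariance
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return centered @ top

# Placeholder: one band's 51-frame temporal vectors over 5000 training frames
band_traj = np.random.rand(5000, 51)
band_features = pca_transform(band_traj, n_components=10)
# Second stage: concatenate the outputs of all 15 bands and combine them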

Page 13: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Two Stage Non-Linear Approaches

Page 14: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Augmenting PLP Front End Features

• We used three different temporal resolutions.
– The original PLP features were derived from short-term spectral analysis
– The PLP/MLP features used 9 frames of PLP features
– The TRAPS features used 51 frames of log critical band energies

[Block diagram: the MLP-based feature stream is reduced to 17 dimensions and appended to the 39-dimensional PLP features, giving a 56-dimensional front end; a 42-dimensional stream is also labelled in the diagram.]
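As a rough sketch of this augmentation (assumed details: 46 MLP output classes, log posteriors, PCA projection; only the 39, 17 and 56 dimensions come from the slide), the MLP outputs can be reduced and appended to the PLP vector like this:

import numpy as np

# Placeholder per-frame features: 39-dim PLP and 46-dim MLP log posteriors
plp = np.random.rand(1000, 39)
mlp_log_post = np.log(np.random.dirichlet(np.ones(46), size=1000) + 1e-12)

# PCA-reduce the MLP outputs to 17 dimensions, then append to PLP -> 56 dims
centered = mlp_log_post - mlp_log_post.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
proj = eigvecs[:, np.argsort(eigvals)[::-1][:17]]
augmented = np.hstack([plp, centered @ proj])
print(augmented.shape)   # (1000, 56)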

Page 15: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Inverse entropy weighted combination (INVENT)

• The combined output posterior probability
– the MLP feature with lower entropy is more important than an MLP feature with high entropy

The combined posterior for class q_k at frame n is

\hat{P}(q_k \mid X_n) = \sum_{i=1}^{I} w_{i,n} \, P(q_k \mid x_{i,n}, \Theta_i)

with inverse-entropy weights

w_{i,n} = \frac{1 / \tilde{h}_{i,n}}{\sum_{j=1}^{I} 1 / \tilde{h}_{j,n}}

where the entropy of stream i at frame n is

h_{i,n} = -\sum_{k=1}^{K} P(q_k \mid x_{i,n}, \Theta_i) \log P(q_k \mid x_{i,n}, \Theta_i)

and \tilde{h}_{i,n} is h_{i,n}, replaced by a large value (~10000) when the entropy exceeds a threshold.

x_{i,n} : feature vector of stream i at frame n
\Theta_i : parameter set of stream i
I : number of streams (3 in this case)
K : number of classes
X_n : the combined observation (x_{1,n}, \ldots, x_{I,n})
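A minimal NumPy sketch of this inverse-entropy combination, assuming a (streams x classes) array of per-frame posteriors; the entropy threshold of 3.0 and the 46 classes are illustrative assumptions, while the large replacement value 10000 follows the slide.

import numpy as np

def invent_combine(posteriors, threshold=3.0, big=10000.0):
    # posteriors: shape (streams, classes), one frame
    eps = 1e-12
    h = -np.sum(posteriors * np.log(posteriors + eps), axis=1)   # per-stream entropy
    h_mod = np.maximum(np.where(h > threshold, big, h), eps)     # static threshold
    w = (1.0 / h_mod) / np.sum(1.0 / h_mod)                      # inverse-entropy weights
    return w @ posteriors                                        # combined posteriors

# Placeholder: three streams of posteriors over 46 classes
streams = np.random.dirichlet(np.ones(46), size=3)
combined = invent_combine(streams)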

Page 16: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

softmax

• Therefore we cannot use the entropy-based weighting directly.
– We convert the spectrum into a probability mass function (PMF) using the equations below

x_i = \frac{X_i}{\sum_{j=1}^{N} X_j}, \qquad i = 1, \ldots, N

where X_i is the energy of the i-th spectral bin. The entropy of the resulting PMF is

H = -\sum_{i=1}^{N} x_i \log_2 x_i
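A small Python sketch of this conversion and entropy computation; the placeholder frame and FFT length are assumptions, not from the slide.

import numpy as np

def spectral_entropy(spectrum):
    # Normalise the energy spectrum into a PMF, then compute its entropy in bits
    x = spectrum / np.sum(spectrum)
    return -np.sum(x * np.log2(x + 1e-12))

energy_spectrum = np.abs(np.fft.rfft(np.random.randn(400))) ** 2
print(spectral_entropy(energy_spectrum))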

Page 17: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Average of the posteriors combination (AVG)

• For the average combination:

P(q_k \mid x) = w_1 P(q_k \mid x_1) + w_2 P(q_k \mid x_2), \qquad w_1 = w_2 = 0.5
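For completeness, the equal-weight average is trivial in NumPy; the 46-class posteriors below are placeholders.

import numpy as np

p1 = np.random.dirichlet(np.ones(46))   # stream 1 posteriors
p2 = np.random.dirichlet(np.ones(46))   # stream 2 posteriors
combined = 0.5 * p1 + 0.5 * p2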

Page 18: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Experiments goal

• The PLP/MLP and the TRAPS features, developed for a very small task, were then applied to successively larger problems

• Our methods work on the small vocabulary continuous numbers task even when we did not train explicitly on continuous numbers

• There were several advantages to this approach
– First, since the recognition vocabulary consisted of common words, it was likely that error rate reductions would apply to the larger task as well
– Second, there were many examples of these 500 words in the training data, so less training data was required than would be needed for the full task

Page 19: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

THE 500WORD CTS TASK

• The 500-word test set was a subset of the 2001 Hub-5 evaluation data.
– Given the 500 most common words in Switchboard I, we chose utterances from the 2001 evaluation data in which 90% or more of the words in each utterance were among those 500 words

• Training set
– consisted of 217 female and 205 male speakers
– contained one third of the total number of utterances
– The female speech consisted of
• 0.92 hours from English CallHome,
• 10.63 hours from Switchboard I with transcriptions,
• 0.69 hours from the Switchboard Cellular Database.
– The male speech consisted of
• 0.19 hours from English CallHome,
• 10.08 hours from Switchboard I,
• 0.59 hours from Switchboard Cellular,
• 0.06 hours from the Switchboard Credit Card Corpus.

Page 20: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

THE 500WORD CTS TASK

• We used the tuning set to tune system parameters like word transition weight and language model scaling

• And we determined word error rates on the test set

• tuning set – 0.97 hours– 8242 total word tokens

• test set – 1.42 hours– 11845 total word tokens

• Acoustic and language models
– Triphone gender-independent HMMs trained with the SRI speech recognition system, using a simple bigram language model

Page 21: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Results on Top 500Words Task

• Baseline PLP features
– We trained gender-dependent triphone HMMs on the 23-hour RUSH training set
– and then tested this system on the 500-word test set, achieving a 43.8% word error rate

Word error rate (WER) and relative reduction of WER of systems on the top 500 word test set.

Page 22: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

OGI NUMBERS TASK

• The training set for this stage was an 18.7-hour subset of the old “short” SRI Hub training set
– 48% of the training data was male and 52% female
– 4.4 hours of this training set comes from English CallHome
– 2.7 hours from Hand Transcribed Switchboard
– 2.0 hours from Switchboard Credit Card Corpus
– 9.6 hours from Macrophone (read speech)

• tuning set ?
• testing set

– 1.3 hours of speech– 2519 utterances and 9699 word tokens

• Acoustic and language models
– Triphone gender-independent HMMs trained with the SRI speech recognition system, using a simple bigram language model

Page 23: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Results on Numbers Task

• The testing dictionary contained thirty words for numbers and two words for hesitation

Word error rate (WER) and relative reduction of WER on Numbers using different combination approaches.

Page 24: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

FULL CTS VOCABULARY

• The training set was selected in the same way as in the 500-word CTS task
• This set contained a total of 68.95 hours of CTS

– Female speech
• 2.75 hours of English CallHome
• 31.30 hours from Mississippi State transcribed Switchboard I
• 2.03 hours of Switchboard Cellular data

– Male speech
• 0.56 hours of English CallHome
• 30.28 hours from Switchboard I
• 1.83 hours from Switchboard Cellular
• 0.20 hours of Switchboard Credit Card Corpus

Page 25: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

FULL CTS VOCABULARY

• tuning set ?
• testing set

– 6.33 hours of speech – 62890 total word tokens

• Acoustic and language models
– Triphone gender-independent HMMs trained with the SRI speech recognition system, using a simple bigram language model

Page 26: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

Results on Full CTS Task

• 2001 Hub-5 evaluation set

Word error rate (WER) and relative reduction of WER on the full CTS task using different combination approaches.

Page 27: SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

CONCLUSION

• Word error rate was significantly reduced for the larger tasks as well

• The combination methods, which gave equivalent performance for the smaller task, were also comparable on the larger tasks.