8/2/2019 20060518_jojo_20060517 MLP1
1/27
SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS
Nelson Morgan, Barry Y Chen, Qifeng Zhu, Andreas Stolcke
International Computer Science Institute, Berkeley, CA, USA
Presenter: Chen Hung-Bin
2004 Special Workshop in Maui (SWIM)
Outline
Introduction
Conventional Features
Multi-Layered Perceptrons (MLPs)
three different temporal resolutions
Experiments
Conclusion
Introduction
In this paper, we describe a three-stage process of scaling to the larger conversational telephone speech (CTS) task.
One goal was to improve conversational telephone speech (CTS) recognition by modifying the acoustic front end.
We found that approaches developed for the recognition of natural numbers scaled quite well to two different levels of CTS complexity:
recognition of utterances primarily consisting of the 500 most frequent words in Switchboard
and large vocabulary recognition of Switchboard conversations
Conventional Features
Mel Frequency Cepstral Coefficients (MFCC)
Perceptual Linear Prediction (PLP)
Hidden Activation TRAPS (HATS)
Modulation-filtered spectrogram (MSG)
Relative Spectral Perceptual Linear Prediction (RASTA-PLP)
Perceptual Linear Prediction
Equal loudness preemphasis

$$E(f) = \frac{(f^2 + 1.44 \times 10^6)\, f^4}{(f^2 + 1.6 \times 10^5)^2 \,(f^2 + 9.61 \times 10^6)}$$
4kHz
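As a rough illustration, the equal-loudness weighting can be evaluated numerically. The sketch below assumes the form used in HTK's PLP front end, E(f) = (f^2 + 1.44e6) f^4 / ((f^2 + 1.6e5)^2 (f^2 + 9.61e6)) with f in Hz; the function name `equal_loudness` is illustrative and does not come from the slides.

```python
def equal_loudness(f_hz: float) -> float:
    """Equal-loudness weight E(f) for a frequency f given in Hz.

    Assumed form (as in HTK's PLP code):
    E(f) = (f^2 + 1.44e6) * f^4 / ((f^2 + 1.6e5)^2 * (f^2 + 9.61e6))
    """
    f2 = f_hz * f_hz
    numerator = (f2 + 1.44e6) * f2 * f2
    denominator = (f2 + 1.6e5) ** 2 * (f2 + 9.61e6)
    return numerator / denominator


# Print the weight at a few frequencies; low frequencies are attenuated most.
for f in (250.0, 1000.0, 4000.0):
    print(f, equal_loudness(f))
```

The weight is zero at 0 Hz, grows with frequency over this range, and approaches 1 as f becomes large.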
Perceptual Linear Prediction
Intensity-loudness power law
Perceived loudness, L(w), is approximately the cube root of the intensity, I(w):

$$L(w) = I(w)^{1/3}$$
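A minimal sketch of the cube-root compression, assuming the intensity is a non-negative number; the name `loudness` is illustrative only.

```python
def loudness(intensity: float) -> float:
    """Approximate perceived loudness as the cube root of the intensity."""
    return intensity ** (1.0 / 3.0)


# A 1000-fold increase in intensity gives roughly a 10-fold increase in loudness.
print(loudness(1.0), loudness(1000.0))
```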
Perceptual Linear Prediction
HTK :
Fill filterbank channels
equal-loudness curve
Do IDFT to get autocorrelation values
Convert from LPC to cepstral coefficients
// Mel to Hz conversion
for( i=1; i
RASTA-PLP
Modulation filtering with the RASTA filter:

$$H(z) = 0.1 \, \frac{2z^{2} + z - z^{-1} - 2z^{-2}}{1 - 0.94\, z^{-1}}$$
Perceptually Inspired Signal-processing Strategies for
Robust Speech Recognition in Reverberant Environments, 1998
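The RASTA filter can be sketched as a difference equation applied to the time trajectory of one band. This sketch assumes a causal version of H(z) = 0.1 (2z^2 + z - z^-1 - 2z^-2) / (1 - 0.94 z^-1), delayed by two frames, with zero initial state; `rasta_filter` is an illustrative name, not code from the cited work.

```python
def rasta_filter(trajectory, pole=0.94):
    """Band-pass filter one band's time trajectory (a list of floats).

    Causal form, i.e. H(z) delayed by two frames:
    y[n] = pole * y[n-1] + 0.1 * (2x[n] + x[n-1] - x[n-3] - 2x[n-4])
    Samples before the start of the trajectory are assumed to be zero.
    """
    def x(k):
        return trajectory[k] if k >= 0 else 0.0

    output = []
    previous = 0.0
    for n in range(len(trajectory)):
        fir = 0.1 * (2.0 * x(n) + x(n - 1) - x(n - 3) - 2.0 * x(n - 4))
        previous = pole * previous + fir
        output.append(previous)
    return output


# A constant (DC) input is suppressed: the output decays toward zero.
filtered = rasta_filter([1.0] * 200)
print(filtered[0], filtered[-1])
```

Because the numerator coefficients sum to zero, slowly varying components such as a fixed channel offset in the log spectrum are removed.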
Multi-Layered Perceptrons (MLPs)
A multilayer perceptron is a feedforward neural network with one or more hidden layers.
The signals are propagated in a forward direction on a layer-by-layer basis.
The network consists of
an input layer of source neurons
at least one hidden layer of computational neurons
an output layer of computational neurons
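The layer-by-layer forward propagation can be sketched as follows. This is a toy example under stated assumptions: one hidden layer with sigmoid units and a softmax output layer (the slides do not specify the activations here), and all weights are made-up values chosen only to make the code runnable.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Propagate inputs through one hidden layer, then the output layer."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(w_hidden, b_hidden)
    ]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + bias
        for row, bias in zip(w_out, b_out)
    ]
    return softmax(logits)


# Hypothetical toy network: 2 inputs, 3 hidden units, 2 output classes.
posteriors = mlp_forward(
    [0.5, -1.0],
    [[0.1, 0.2], [-0.3, 0.4], [0.0, 0.5]],
    [0.0, 0.1, -0.1],
    [[0.2, -0.1, 0.3], [0.0, 0.1, -0.2]],
    [0.0, 0.0],
)
print(posteriors)
```

With a softmax output, the values can be read as posterior probabilities over the output classes, which is how the MLP outputs are used in the combination rules later in the deck.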
TempoRAl Patterns (TRAPs)
spectral-energy based vector at time t with variables
Based on posterior probabilities of speech categories for long and short time functions of the time-frequency plane
These features may be represented as multiple streams of probabilistic information
Working with narrow spectral subbands and long temporal windows
Naive One Stage Approach
Two Stage Linear Approaches
Two Stage Non-Linear Approaches: Hidden Activation TRAPS (HATS)
Naive One Stage Approach
baseline approach
51 frames of all 15 bands of log critical band energies (LCBEs) as inputs to an MLP.
These inputs are built by stacking the 25 frames before and the 25 frames after the current frame together with the current frame, and the target phoneme comes from the current frame.
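The stacking step can be sketched as below. The sketch assumes each frame is a list of 15 LCBE values and that frames beyond the ends of the utterance are padded by repeating the first or last frame; the padding rule and the name `stack_context` are assumptions, not details from the slides.

```python
def stack_context(frames, center, context=25):
    """Concatenate frames[center - context .. center + context] into one vector.

    Indices outside the utterance are clamped to the first or last frame
    (an assumed padding rule).
    """
    last = len(frames) - 1
    stacked = []
    for t in range(center - context, center + context + 1):
        clamped = min(max(t, 0), last)
        stacked.extend(frames[clamped])
    return stacked


# Toy utterance: 100 frames of 15 bands, each frame filled with its own index.
utterance = [[float(t)] * 15 for t in range(100)]
mlp_input = stack_context(utterance, center=50)
print(len(mlp_input))  # 51 frames x 15 bands
```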
Two Stage Linear Approaches
15 Bands x 51 Frames
first, calculate principal component analysis (PCA) transforms
second, combine what was learned at each critical band posteriors
Two Stage Non-Linear Approaches
Augmenting PLP Front End Features
We used three different temporal resolutions.
The original PLP features were derived from short term spectral analysis
the PLP/MLP features used 9 frames of PLP features
and the TRAPS features used 51 frames of log critical band energies
[Figure labels: dimension reduction to 17 dimensions; 39 dimensions; 56 dimensions; 42 dimensions]
Inverse entropy weighted combination (INVENT)
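The combined output posterior probability:

$$P(q_k \mid X_n, \Theta) = \sum_{i=1}^{I} w_n^i \, P(q_k \mid x_n^i, \theta_i)$$

The MLP feature with lower entropy is more important than an MLP feature with high entropy:

$$h_n^i = -\sum_{k=1}^{K} P(q_k \mid x_n^i, \theta_i) \, \log_2 P(q_k \mid x_n^i, \theta_i)$$

$$w_n^i = \frac{1/\tilde{h}_n^i}{\sum_{j=1}^{I} 1/\tilde{h}_n^j}, \qquad \tilde{h}_n^i = \begin{cases} 10000 & : h_n^i > th \\ h_n^i & : h_n^i \le th \end{cases}$$

where
X_n = {x_n^1, ..., x_n^I}: observation
I: number of streams (3 in this case)
\theta_i: parameter set
n: frame number
x_n^i: i-th stream feature vector

[Figure: three streams, i = 1, 2, 3, each with outputs k = 1, ..., K]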
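A minimal sketch of inverse entropy weighting for one frame, assuming each stream supplies a list of K posteriors. The threshold value of 1.0, the small floor that avoids division by zero, and the name `invent_combine` are assumptions made for illustration; the slides give only the replacement value of 10000.

```python
import math


def entropy(posteriors):
    """Entropy in bits of one stream's posterior distribution."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0.0)


def invent_combine(streams, threshold=1.0, big=10000.0):
    """Combine per-stream posteriors with inverse-entropy weights.

    Entropies above the (assumed) threshold are replaced by `big`, so that
    uncertain streams get a very small weight.
    """
    entropies = [entropy(p) for p in streams]
    # The 1e-12 floor is an added safeguard for perfectly confident streams.
    adjusted = [big if h > threshold else max(h, 1e-12) for h in entropies]
    inverse = [1.0 / h for h in adjusted]
    total = sum(inverse)
    weights = [v / total for v in inverse]
    num_classes = len(streams[0])
    combined = [
        sum(w * p[k] for w, p in zip(weights, streams))
        for k in range(num_classes)
    ]
    return combined, weights


# Toy frame: a confident stream and a nearly uninformative one.
combined, weights = invent_combine([[0.9, 0.05, 0.05], [1 / 3, 1 / 3, 1 / 3]])
print(weights, combined)
```

Because the weights sum to one and each stream's posteriors sum to one, the combined vector is again a valid distribution over the K classes.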
softmax
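Therefore we cannot use the entropy based weighting directly. We convert the spectrum into a probability mass function (PMF) using the equation

$$x_i = \frac{X_i}{\sum_{j=1}^{N} X_j}, \qquad i = 1, \ldots, N$$

where X_i is the energy of the i-th component of the spectrum. The entropy is then

$$H = -\sum_{i=1}^{N} x_i \log_2 x_i$$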
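The normalization to a PMF and the resulting entropy can be sketched as follows, assuming the spectrum is given as a list of non-negative energies with a positive sum; `spectral_entropy` is an illustrative name.

```python
import math


def spectral_entropy(energies):
    """Normalize spectral energies to a PMF and return its entropy in bits."""
    total = sum(energies)
    pmf = [e / total for e in energies]
    return -sum(x * math.log2(x) for x in pmf if x > 0.0)


# A flat spectrum has maximal entropy; a single peak has zero entropy.
print(spectral_entropy([1.0] * 8))
print(spectral_entropy([5.0, 0.0, 0.0, 0.0]))
```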
Average of the posteriors combination (AVG)
For the average combination

$$P(q_k \mid x) = w_1 P(q_k \mid x^1) + w_2 P(q_k \mid x^2), \qquad w_1 = w_2 = 0.5$$
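The average combination amounts to an equal-weight sum of the two streams' posteriors. A small sketch, with made-up posterior values and the illustrative name `average_combine`:

```python
def average_combine(p1, p2, w1=0.5, w2=0.5):
    """Weighted sum of two posterior vectors; equal weights give the average."""
    return [w1 * a + w2 * b for a, b in zip(p1, p2)]


# Hypothetical posteriors from two streams for a two-class toy problem.
print(average_combine([0.8, 0.2], [0.4, 0.6]))
```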
Experiments goal
The PLP/MLP and the TRAPS features, developed for a very small
task, were then applied to successively larger problems
Our methods work on the small vocabulary continuous numbers task even when we did not train explicitly on continuous numbers
There were several advantages to using the intermediate 500-word task:
First, since the recognition vocabulary consisted of common words, it was likely that error rate reduction would apply to the larger task as well
Second, there were many examples of these 500 words in the training data, so less training data was required than would be needed for the full task
THE 500-WORD CTS TASK
The 500-word test set was a subset of the 2001 Hub-5 evaluation data. Given the 500 most common words in Switchboard I, we chose utterances from the 2001 evaluation data in which 90% or more of the words were among these 500 words
training set
consisted of 217 female and 205 male speakers
contained one third of the total number of utterances
The female speech consisted of
0.92 hours from English CallHome,
10.63 hours from Switchboard I with transcriptions,
0.69 hours from the Switchboard Cellular Database.
The male speech consisted of
0.19 hours from English CallHome,
10.08 hours from Switchboard I,
0.59 hours from Switchboard Cellular,
0.06 hours from the Switchboard Credit Card Corpus.
THE 500-WORD CTS TASK
We used the tuning set to tune system parameters like word transition weight and language model scaling
And we determined word error rates on the test set
tuning set
0.97 hours
8242 total word tokens
test set
1.42 hours
11845 total word tokens
language model
Triphone gender-independent HMMs using the SRI speech recognition system and using a simple bigram language model
Results on Top 500 Words Task
baseline PLP features
we trained gender dependent triphone HMMs on the 23 hour RUSH training set
and then tested this system on the 500 word test set, achieving a 43.8% word error rate
Word error rate (WER) and relative reduction of WER on the top 500 word test set of systems
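Relative reduction of WER is conventionally the drop in WER expressed as a percentage of the baseline WER. The sketch below uses hypothetical input values only; they are not results from the paper, and `relative_wer_reduction` is an illustrative name.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values: a baseline of 50.0% WER improved to 45.0% WER.
print(relative_wer_reduction(50.0, 45.0))
```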
OGI NUMBERS TASK
The training set for this stage was an 18.7-hour subset of the old short SRI Hub training set
48% of the training data was male and 52% female
4.4 hours of this training set comes from English CallHome
2.7 hours from Hand Transcribed Switchboard
2.0 hours from Switchboard Credit Card Corpus
9.6 hours from Macrophone (read speech)
tuning set ?
testing set 1.3 hours of speech
2519 utterances and 9699 word tokens
language model
Triphone gender-independent HMMs using the SRI speech recognition system and using a simple bigram language model
Results on Numbers Task
The testing dictionary contained thirty words for numbers
and two words for hesitation
Word error rate (WER) and relative reduction of WER on Numbers using different combination approaches.
FULL CTS VOCABULARY
As in the 500-word CTS task
This set contained a total of 68.95 hours of CTS
female speaker
2.75 hours of English CallHome
31.30 hours from Mississippi State transcribed Switchboard I
2.03 hours of Switchboard Cellular data
male speaker
0.56 hours of English CallHome
30.28 hours from Switchboard I
1.83 hours from Switchboard Cellular
0.20 hours of Switchboard Credit Card Corpus
FULL CTS VOCABULARY
tuning set ?
testing set 6.33 hours of speech
62890 total word tokens
language model
Triphone gender-independent HMMs using the SRI speech recognition system and using a simple bigram language model
Results on Full CTS Task
2001 Hub-5 evaluation set
Word error rate (WER) and relative reduction of WER on Numbers using different combination approaches.
CONCLUSION
Word error rate was significantly reduced for the larger tasks as well
The combination methods, which gave equivalent performance for the smaller task, were also comparable on the larger tasks.