
SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS

    Nelson Morgan, Barry Y Chen, Qifeng Zhu, Andreas Stolcke

    International Computer Science Institute, Berkeley, CA, USA

    Presenter: Chen Hung-Bin

    2004 Special Workshop in Maui (SWIM)


    Outline

    Introduction

    Conventional Features

    Multi-Layered Perceptrons (MLPs)

    three different temporal resolutions

Experiments

Conclusion


    Introduction

In this paper, we describe a three-stage process of scaling to the larger conversational telephone speech (CTS) task.

One goal was to improve CTS recognition by modifying the acoustic front end.

We found that approaches developed for the recognition of natural numbers scaled quite well to two different levels of CTS complexity:

recognition of utterances primarily consisting of the 500 most frequent words in Switchboard

and large vocabulary recognition of Switchboard conversations


    Conventional Features

    Mel Frequency Cepstral Coefficients (MFCC)

    Perceptual Linear Prediction (PLP)

    Hidden Activation TRAPS (HATS)

    Modulation-filtered spectrogram (MSG)

Relative Spectral Perceptual Linear Prediction (RASTA-PLP)


    Perceptual Linear Prediction

Equal loudness preemphasis:

E(f) = \frac{(f^2 + 1.44 \times 10^6)\, f^4}{(f^2 + 1.6 \times 10^5)^2 \, (f^2 + 9.61 \times 10^6)}

(telephone-band analysis up to 4 kHz)
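As a concrete illustration, here is a minimal C sketch of the curve above (the function name and its use are mine, not the paper's):

#include <math.h>

/* PLP equal-loudness weight at frequency f in Hz, per E(f) above. */
double equal_loudness(double f)
{
    double f2 = f * f;
    return ((f2 + 1.44e6) * f2 * f2) /
           (pow(f2 + 1.6e5, 2.0) * (f2 + 9.61e6));
}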


    Perceptual Linear Prediction

    Intensity-loudness power law

Perceived loudness, L(\omega), is approximately the cube root of the intensity, I(\omega):

L(\omega) = I(\omega)^{1/3}
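A minimal sketch of this compression step in C (the array name is illustrative; cbrt is from the C99 math library):

#include <math.h>

/* Apply L = I^(1/3) to each equal-loudness-weighted power value. */
void power_law(double *intensity, int n)
{
    for (int i = 0; i < n; i++)
        intensity[i] = cbrt(intensity[i]);
}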


    Perceptual Linear Prediction

HTK implementation steps:

Fill filterbank channels

Apply the equal-loudness curve

Do IDFT to get autocorrelation values

Transfer from LPC to cepstral coefficients

// Mel to Hz conversion (completing the truncated slide snippet;
// the channel arrays and their count are illustrative names)
for (i = 1; i <= numChans; i++)
    hz[i] = 700.0 * (exp(mel[i] / 1127.0) - 1.0);
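The last step, LPC-to-cepstrum conversion, follows the standard recursion; a sketch in a textbook form, not HTK's exact code, with illustrative array names:

/* Cepstra c[1..nc] from predictor coefficients a[1..p] of
   H(z) = 1 / (1 - sum_k a_k z^-k):
   c_m = a_m + sum_{k=max(1,m-p)}^{m-1} (k/m) c_k a_{m-k}   */
void lpc_to_cepstrum(const double *a, int p, double *c, int nc)
{
    for (int m = 1; m <= nc; m++) {
        double sum = (m <= p) ? a[m] : 0.0;
        for (int k = (m > p) ? m - p : 1; k < m; k++)
            sum += ((double)k / m) * c[k] * a[m - k];
        c[m] = sum;
    }
}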


    RASTA-PLP

Modulation filtering with the RASTA filter:

H(z) = 0.1 \cdot \frac{2z^{2} + z - z^{-1} - 2z^{-2}}{1 - 0.94\, z^{-1}}

(From "Perceptually Inspired Signal-processing Strategies for Robust Speech Recognition in Reverberant Environments", 1998.)
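Delaying the output by two frames makes the transfer function above causal, giving a simple difference equation. A minimal C sketch of filtering one log critical band trajectory (names are mine):

/* y[t] = 0.94*y[t-1] + 0.1*(2*x[t] + x[t-1] - x[t-3] - 2*x[t-4]),
   i.e. the RASTA band-pass above, delayed by two frames.        */
void rasta_filter(const double *x, double *y, int n)
{
    for (int t = 0; t < n; t++) {
        y[t] = 0.1 * (2.0 * x[t]
                      + (t >= 1 ? x[t - 1] : 0.0)
                      - (t >= 3 ? x[t - 3] : 0.0)
                      - 2.0 * (t >= 4 ? x[t - 4] : 0.0));
        if (t >= 1)
            y[t] += 0.94 * y[t - 1];
    }
}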


    Multi-Layered Perceptrons (MLPs)

A multilayer perceptron is a feedforward neural network with one or more hidden layers. The signals are propagated in a forward direction on a layer-by-layer basis.

    The network consists of

    an input layer of source neurons

    at least one hidden layer of computational neurons

    an output layer of computational neurons
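To make the structure concrete, here is a minimal C sketch of the forward pass with one hidden layer (the layer sizes and names are illustrative, not the paper's configuration):

#include <math.h>

#define NI 4   /* input units  */
#define NH 8   /* hidden units */
#define NO 3   /* output units */

/* Propagate x through sigmoid hidden units and a softmax output
   layer, so y holds class posterior estimates.                  */
void mlp_forward(const double x[NI],
                 const double w1[NH][NI], const double b1[NH],
                 const double w2[NO][NH], const double b2[NO],
                 double y[NO])
{
    double h[NH], z = 0.0;
    for (int j = 0; j < NH; j++) {
        double s = b1[j];
        for (int i = 0; i < NI; i++)
            s += w1[j][i] * x[i];
        h[j] = 1.0 / (1.0 + exp(-s));   /* sigmoid activation */
    }
    for (int k = 0; k < NO; k++) {
        double s = b2[k];
        for (int j = 0; j < NH; j++)
            s += w2[k][j] * h[j];
        y[k] = exp(s);
        z += y[k];
    }
    for (int k = 0; k < NO; k++)
        y[k] /= z;                      /* softmax normalization */
}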


    TempoRAl Patterns (TRAPs)

Spectral-energy based vector at time t

Based on posterior probabilities of speech categories for long and short time functions of the time-frequency plane

These features may be represented as multiple streams of probabilistic information

Working with narrow spectral subbands and long temporal windows

Naive One Stage Approach

Two Stage Linear Approaches

Two Stage Non-Linear Approaches: Hidden Activation TRAPS (HATS)


    Naive One Stage Approach

    baseline approach

51 frames of all 15 bands of log critical band energies (LCBEs) as inputs to an MLP.

These inputs are built by stacking the 25 frames before and after the current frame onto the current frame, and the target phoneme comes from the current frame. (A minimal sketch of this stacking is given below.)
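A sketch of the input stacking in C (array names are mine; edge frames are assumed already padded):

#define BANDS   15
#define CONTEXT 25   /* frames on each side: 51 frames total */

/* Copy the 51-frame, 15-band window centered on frame t into one
   765-dimensional MLP input vector.                             */
void stack_input(const double lcbe[][BANDS], int t,
                 double out[(2 * CONTEXT + 1) * BANDS])
{
    int i = 0;
    for (int dt = -CONTEXT; dt <= CONTEXT; dt++)
        for (int b = 0; b < BANDS; b++)
            out[i++] = lcbe[t + dt][b];
}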


    Two Stage Linear Approaches

    15 Bands x 51 Frames

first, calculate principal component analysis (PCA) transforms for each critical band

second, combine what was learned at each critical band to estimate phone posteriors


    Two Stage Non-Linear Approaches


    Augmenting PLP Front End Features

    We used three different temporal resolutions.

    The original PLP features were derived from short term spectral analysis

    the PLP/MLP features used 9 frames of PLP features

    and the TRAPS features used 51 frames of log critical band energies

[Figure: block diagram of the augmented front end. The MLP outputs are dimension-reduced to 17, and the streams shown are 39-, 56-, and 42-dimensional (39 + 17 = 56).]


    Inverse entropy weighted combination (INVENT)

The combined output posterior probability is

P(q_k \mid X_n) = \sum_{i=1}^{I} w_n^i \, P(q_k \mid x_n^i, \Theta_i)

The MLP feature stream with lower entropy is weighted more heavily than a stream with high entropy. The per-frame entropy of stream i is

h_n^i = -\sum_{k=1}^{K} P(q_k \mid x_n^i, \Theta_i) \log P(q_k \mid x_n^i, \Theta_i)

and the weights are the normalized inverse entropies

w_n^i = \frac{1/\tilde{h}_n^i}{\sum_{i'=1}^{I} 1/\tilde{h}_n^{i'}}

where \tilde{h}_n^i is h_n^i floored away from zero (the slide uses a constant of about 10000 when inverting a zero entropy).

Notation:

x_n^i : stream i feature vector

n : frame number

\Theta_i : parameter set

I : number of streams (3 in our case)

X_n = (x_n^1, \ldots, x_n^I) : observation

q_k : the k-th of K speech classes
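A minimal C sketch of this combination rule for two streams (the stream count, class count, and entropy floor are illustrative; the 1e4 constant mirrors the slide's 10000):

#include <math.h>

#define NSTREAMS 2
#define NCLASSES 46   /* e.g. a phone set; illustrative */

/* Inverse entropy weighted combination: streams with low-entropy
   (confident) posteriors receive larger weights.                */
void invent_combine(const double p[NSTREAMS][NCLASSES],
                    double out[NCLASSES])
{
    double inv[NSTREAMS], norm = 0.0;
    for (int i = 0; i < NSTREAMS; i++) {
        double h = 0.0;
        for (int k = 0; k < NCLASSES; k++)
            if (p[i][k] > 0.0)
                h -= p[i][k] * log(p[i][k]);   /* per-frame entropy */
        inv[i] = (h > 1e-4) ? 1.0 / h : 1e4;   /* floor zero entropy */
        norm += inv[i];
    }
    for (int k = 0; k < NCLASSES; k++) {
        out[k] = 0.0;
        for (int i = 0; i < NSTREAMS; i++)
            out[k] += (inv[i] / norm) * p[i][k];
    }
}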


    softmax

Therefore we cannot use the entropy based weighting directly. We convert the spectrum into a probability mass function (PMF) using the equation

x_i = \frac{X_i}{\sum_{j=1}^{N} X_j}

where X_i is the energy of the i-th component of the spectrum X = (X_1, \ldots, X_N), and then compute its entropy

H(x) = -\sum_{i=1}^{N} x_i \log_2 x_i
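A minimal C sketch of this conversion and the resulting entropy in bits (names are mine):

#include <math.h>

/* Normalize spectral energies X[0..n-1] to a PMF and return
   H = -sum_i x_i log2(x_i).                                  */
double spectral_entropy(const double *X, int n)
{
    double total = 0.0, H = 0.0;
    for (int i = 0; i < n; i++)
        total += X[i];
    for (int i = 0; i < n; i++) {
        double x = X[i] / total;   /* PMF value x_i */
        if (x > 0.0)
            H -= x * log2(x);
    }
    return H;
}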


    Average of the posteriors combination (AVG)

For the average combination,

P(q_k \mid X) = w_1 P(q_k \mid x^1) + w_2 P(q_k \mid x^2), \qquad w_1 = w_2 = 0.5
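The corresponding sketch for the equal-weight average is a one-liner per class:

/* Average combination with w1 = w2 = 0.5. */
void avg_combine(const double *p1, const double *p2,
                 double *out, int nclasses)
{
    for (int k = 0; k < nclasses; k++)
        out[k] = 0.5 * (p1[k] + p2[k]);
}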


    Experiments goal

The PLP/MLP and the TRAPS features, developed for a very small task, were then applied to successively larger problems.

Our methods work on the small vocabulary continuous numbers task even when we did not train explicitly on continuous numbers.

There were several advantages to using this task:

First, since the recognition vocabulary consisted of common words, it was likely that error rate reductions would apply to the larger task as well.

Second, there were many examples of these 500 words in the training data, so less training data was required than would be needed for the full task.


THE 500-WORD CTS TASK

The 500 word test set was a subset of the 2001 Hub-5 evaluation data. Given the 500 most common words in Switchboard I, we chose utterances from the 2001 evaluation data in which 90% or more of the words in each utterance came from this list.

    training set

    consisted of 217 female and 205 male speakers

contained one third of the total number of utterances. The female speech consisted of:

    0.92 hours from English CallHome,

    10.63 hours from Switchboard I with transcriptions,

    0.69 hours from the Switchboard Cellular Database.

The male speech consisted of:

0.19 hours from English CallHome,

    10.08 hours from Switchboard I,

    0.59 hours from Switchboard Cellular,

    0.06 hours from the Switchboard Credit Card Corpus.


THE 500-WORD CTS TASK

We used the tuning set to tune system parameters such as the word transition weight and language model scaling, and we determined word error rates on the test set.

    tuning set

    0.97 hours

    8242 total word tokens

    test set

    1.42 hours

    11845 total word tokens

models

Triphone gender-independent HMMs trained with the SRI speech recognition system, using a simple bigram language model


    Results on Top 500Words Task

    baseline PLP features

we trained gender dependent triphone HMMs on the 23 hour RUSH training set

and then tested this system on the 500 word test set, achieving a 43.8% word error rate

[Table: word error rate (WER) and relative reduction of WER of the systems on the top 500 word test set.]


    OGI NUMBERS TASK

The training set for this stage was an 18.7-hour subset of the old short SRI Hub training set.

48% of the training data was male and 52% female.

4.4 hours of this training set came from English CallHome

    2.7 hours from Hand Transcribed Switchboard

    2.0 hours from Switchboard Credit Card Corpus

    9.6 hours from Macrophone (read speech)

    tuning set ?

    testing set 1.3 hours of speech

    2519 utterances and 9699 word tokens

models

Triphone gender-independent HMMs trained with the SRI speech recognition system, using a simple bigram language model


    Results on Numbers Task

The testing dictionary contained thirty words for numbers and two words for hesitation.

Word error rate (WER) and relative reduction of WER on Numbers using different combination approaches.


    FULL CTS VOCABULARY

The setup was the same as in the 500-word CTS task.

This set contained a total of 68.95 hours of CTS. The female speech consisted of:

2.75 hours of English CallHome

31.30 hours from Mississippi State transcribed Switchboard I

2.03 hours from the Switchboard Cellular data

The male speech consisted of:

    0.56 hours of English CallHome

    30.28 hours from Switchboard I

    1.83 hours from Switchboard Cellular

    0.20 hours of Switchboard Credit Card Corpus


    FULL CTS VOCABULARY

    tuning set ?

    testing set 6.33 hours of speech

    62890 total word tokens

models

Triphone gender-independent HMMs trained with the SRI speech recognition system, using a simple bigram language model


    Results on Full CTS Task

    2001 Hub-5 evaluation set

Word error rate (WER) and relative reduction of WER on the full CTS task using different combination approaches.


    CONCLUSION

    Word error rate was significantly reduced for the larger tasks as well

The combination methods, which gave equivalent performance for the smaller task, were also comparable on the larger tasks.