Graphical Models Lecture 13


  • A Practical Introduction to Graphical Models and their use in ASR

    6.345

    Lecture # 13

    Session 2003

  • Graphical models for ASR

    HMMs (and most other common ASR models) have some drawbacks:

    Strong independence assumptions

    Single state variable per time frame

    May want to model more complex structure

    Multiple processes (audio + video, speech + noise, multiple streams of acoustic features, articulatory features)

    Dependencies between these processes or between acoustic observations

    Graphical models provide:

    General algorithms for a large class of models

    No need to write new code for each new model

    A language with which to talk about statistical models

  • Outline

    First half: intro to GMs

    Independence & conditional independence

    Bayesian networks (BNs)

    * Definition

    * Main problems

    Graphical models in general

    Second half: dynamic Bayesian networks (DBNs) for speech recognition

    Dynamic Bayesian networks -- HMMs and beyond

    Implementation of ASR decoding/training using DBNs

    More complex DBNs for recognition

    GMTK

  • (Statistical) independence

    Definition: Given the random variables X and Y,

    X ⊥ Y  ⇔  p(x|y) = p(x)  ⇔  p(x,y) = p(x)p(y)  ⇔  p(y|x) = p(y)

  • (Statistical) conditional independence

    Definition: Given the random variables X, Y, and Z,

    X ⊥ Y | Z  ⇔  p(x|y,z) = p(x|z)  ⇔  p(x,y|z) = p(x|z)p(y|z)  ⇔  p(y|x,z) = p(y|z)
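    To make these definitions concrete, here is a minimal numerical sketch (the joint distribution below is invented for illustration): X and Y are dependent marginally but independent given a common cause Z, exactly the situation in the next example.

    import numpy as np

    # Hypothetical joint p(x, y, z) over binary variables, built as
    # p(x, y, z) = p(z) p(x|z) p(y|z), so X and Y share the common cause Z.
    p_z = np.array([0.4, 0.6])
    p_x_z = np.array([[0.9, 0.1],    # p(x|z), rows indexed by z
                      [0.2, 0.8]])
    p_y_z = np.array([[0.7, 0.3],    # p(y|z), rows indexed by z
                      [0.1, 0.9]])
    joint = np.einsum('k,kx,ky->xyk', p_z, p_x_z, p_y_z)   # joint[x, y, z]

    p_xy = joint.sum(axis=2)         # p(x, y)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    # Marginal independence would require p(x, y) = p(x) p(y): fails here.
    print(np.allclose(p_xy, np.outer(p_x, p_y)))    # False

    # Conditional independence p(x, y | z) = p(x|z) p(y|z) holds for each z.
    p_xy_z = joint / p_z             # divide by p(z) along the last axis
    print(all(np.allclose(p_xy_z[:, :, k], np.outer(p_x_z[k], p_y_z[k]))
              for k in range(2)))    # True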

  • Is height independent of hair length?

    [Scatter plot: height H (5 to 7) vs. hair length L (short, mid, long), with points marked x = female and x = male]

  • Is height independent of hair length?

    Generally, no

    If gender known, yes

    This is the common cause scenario

    [Graph: gender → hair length, gender → height]

    p(h | l) ≠ p(h)          (H, L dependent)

    p(h | l, g) = p(h | g)   (H ⊥ L | G)

  • Is the future independent of the past (in a Markov process)?

    Generally, no

    If present state is known, then yes

    [Graph: Qi-1 → Qi → Qi+1]

    p(qi+1 | qi-1) ≠ p(qi+1)             (Qi+1, Qi-1 dependent)

    p(qi+1 | qi, qi-1) = p(qi+1 | qi)    (Qi+1 ⊥ Qi-1 | Qi)

  • Are burglaries independent of earthquakes?

    Generally, yes

    If alarm state known, no

    Explaining-away effect: the earthquake explains away the burglary

    [Graph: earthquake → alarm ← burglary]

    p(b | e) = p(b)          (B ⊥ E)

    p(b | e, a) ≠ p(b | a)   (B, E dependent given A)

  • Are alien abductions independent of daylight savings time?

    Generally, yes

    If Jim doesn't show up for lecture, no

    Again, explaining-away effect

    [Graph: DST → Jim absent ← alien abduction]

    p(a | d) = p(a)          (A ⊥ D)

    p(a | d, j) ≠ p(a | j)   (A, D dependent given J)

  • Is tongue height independent of lip rounding?

    Generally, yes

    If F1 is known, no

    Yet again, explaining-away effect...

    [Graph: tongue height → F1 ← lip rounding]

    p(h | r) = p(h)           (H ⊥ R)

    p(h | r, f1) ≠ p(h | f1)  (H, R dependent given F1)

  • More explaining away...

    [Graph: C1 (pipes), C2 (faucet), C3 (caulking), C4 (drain), C5 (upstairs) → L (leak)]

    p(ci | cj) = p(ci)          (Ci ⊥ Cj, i ≠ j)

    p(ci | cj, l) ≠ p(ci | l)   (Ci, Cj dependent given L, i ≠ j)

  • Bayesian networks

    The preceding slides are examples of simple Bayesian networks

    Definition:

    Directed acyclic graph (DAG) with a one-to-one correspondence between nodes (vertices) and variables X1, X2, ..., XN

    Each node Xi with parents pa(Xi) is associated with the local probability function p(xi | pa(xi))

    The joint probability of all of the variables is given by the product of the local probabilities, i.e. p(x1, ..., xN) = ∏i p(xi | pa(xi))

    [Graph: A → B; B → C; B → D; C → D, with local probabilities p(a), p(b|a), p(c|b), p(d|b,c)]

    p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)

    A given BN represents a family of probability distributions
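    As a small sketch of how the local probabilities multiply out to the joint (CPT numbers invented for illustration), the four-variable example above can be written directly in code:

    import itertools

    # Hypothetical CPTs for the binary BN A -> B, B -> C, B -> D, C -> D.
    p_a = [0.6, 0.4]                              # p(a)
    p_b_a = [[0.7, 0.3], [0.2, 0.8]]              # p(b|a), indexed [a][b]
    p_c_b = [[0.5, 0.5], [0.1, 0.9]]              # p(c|b), indexed [b][c]
    p_d_bc = [[[0.9, 0.1], [0.4, 0.6]],           # p(d|b,c), indexed [b][c][d]
              [[0.3, 0.7], [0.05, 0.95]]]

    def joint(a, b, c, d):
        # p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)
        return p_a[a] * p_b_a[a][b] * p_c_b[b][c] * p_d_bc[b][c][d]

    # Sanity check: a properly factored joint sums to 1 over all assignments.
    print(sum(joint(*v) for v in itertools.product([0, 1], repeat=4)))  # 1.0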

  • Bayesian networks, cont'd

    Missing edges in the graph correspond to independence assumptions

    Joint probability can always be factored according to the chain rule:

    p(a,b,c,d) = p(a) p(b|a) p(c|a,b) p(d|a,b,c)

    But by making some independence assumptions, we get a sparse factorization, i.e. one with fewer parameters:

    p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)
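    For example, if all four variables are binary, the full chain-rule factorization needs 1 + 2 + 4 + 8 = 15 free parameters, while the sparse factorization above needs only 1 + 2 + 2 + 4 = 9.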

  • Medical example

    [Graph over: parent smoker, profession, smoker, genes, lung cancer]

    Things we may want to know:

    What independence assumptions does this model encode?

    What is p(lung cancer | profession)? p(smoker | parent smoker, genes)?

    Given some of the variables, what are the most likely values of others?

    How do we estimate the local probabilities from data?

  • Determining independencies from a graph

    There are several ways... e.g. the Bayes-ball algorithm ("Bayes-Ball: The Rational Pastime", Shachter 1998)

    Ball bouncing around graph according to a set of rules

    Two nodes are independent given a set of observed nodes if a ball can't get from one to the other (see the sketch below)
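    Here is a compact sketch of those rules (the ball-passing behavior is standard; the function and graph encoding are my own):

    from collections import deque

    def d_separated(x, y, observed, parents):
        # True if x and y are independent given `observed`, by Bayes-ball.
        # `parents` maps every node to the list of its parents.
        children = {n: [] for n in parents}
        for n, ps in parents.items():
            for p in ps:
                children[p].append(n)
        # A state is (node, how the ball arrived): 'down' = from a parent,
        # 'up' = from a child. Start at x as if the ball came from below.
        queue, visited = deque([(x, 'up')]), set()
        while queue:
            node, came = queue.popleft()
            if (node, came) in visited:
                continue
            visited.add((node, came))
            if node == y:
                return False                 # ball reached y: dependent
            if came == 'up' and node not in observed:
                # Unobserved, from a child: pass up and bounce back down.
                queue.extend((p, 'up') for p in parents[node])
                queue.extend((c, 'down') for c in children[node])
            elif came == 'down':
                if node not in observed:     # unobserved chain: pass through
                    queue.extend((c, 'down') for c in children[node])
                else:                        # observed, from a parent: bounce
                    queue.extend((p, 'up') for p in parents[node])
                    # (this bounce is what makes explaining away work)
            # An observed node reached from a child blocks the ball.
        return True

    On the burglary example from a few slides back:

    parents = {'earthquake': [], 'burglary': [],
               'alarm': ['earthquake', 'burglary']}
    print(d_separated('burglary', 'earthquake', set(), parents))      # True
    print(d_separated('burglary', 'earthquake', {'alarm'}, parents))  # False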

  • Bayes-ball, cont'd

    Boundary conditions:

    [Diagrams: ball-passing rules at observed and unobserved nodes]

  • Bayes-ball in medical example

    [Graph over: parent smoker, profession, smoker, genes, lung cancer, as in the medical example above]

    According to this model:

    Are a person's genes independent of whether they have a parent who smokes? What about if we know the person has lung cancer?

    Is lung cancer independent of profession given that the person is a smoker?

    (Do the answers make sense?)

  • Inference

    Definition: Computation of the probability of one subset of the variables given another subset

    Inference is a subroutine of:

    Viterbi decoding: q* = argmax_q p(q | obs)

    Maximum-likelihood estimation of the parameters θ of the local probabilities: θ* = argmax_θ p(obs | θ)


  • Graphical models (GMs)

    In general, GMs represent families of probability distributions via graphs

    directed, e.g. Bayesian networks

    undirected, e.g. Markov random fields

    combination, e.g. chain graphs

    To describe a particular distribution with a GM, we need to specify:

    Semantics: Bayesian network, Markov random field, ...

    Structure: the graph itself

    Implementation: the form of the local functions (Gaussian, table, ...)

    Parameters of local functions (means, covariances, table entries...)

    Not all types of GMs can represent all sets of independence properties!

  • Example of undirected graphical models: Markov random fields

    Definition:

    Undirected graph

    Local function (potential) defined on each maximal clique

    Joint probability given by normalized product of potentials

    Independence properties can be deduced via simple graph separation

    [Graph: undirected, over A, B, C, D, with maximal cliques {A, B} and {B, C, D}]

    p(a,b,c,d) ∝ ψAB(a,b) ψBCD(b,c,d)
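    A minimal sketch of the "normalized product of potentials" definition for this example (potential values are arbitrary, chosen for illustration):

    import itertools
    import numpy as np

    # Nonnegative potentials on the maximal cliques {A,B} and {B,C,D};
    # all variables binary. Potentials are not probabilities.
    psi_ab = np.array([[2.0, 1.0],
                       [1.0, 3.0]])                  # psi_AB[a, b]
    psi_bcd = np.arange(1.0, 9.0).reshape(2, 2, 2)   # psi_BCD[b, c, d]

    # The normalizer Z sums the unnormalized product over all assignments.
    Z = sum(psi_ab[a, b] * psi_bcd[b, c, d]
            for a, b, c, d in itertools.product([0, 1], repeat=4))

    def p(a, b, c, d):
        # p(a,b,c,d) = psi_AB(a,b) psi_BCD(b,c,d) / Z
        return psi_ab[a, b] * psi_bcd[b, c, d] / Z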

  • Dynamic Bayesian networks (DBNs)

    BNs consisting of a structure that repeats an indefinite (or dynamic) number of times

    Useful for modeling time series (e.g. speech)

    [Diagram: a four-variable structure (A, B, C, D) repeated in frames i-1, i, i+1]

  • DBN representation of n-gram language models

    Bigram:

    [Diagram: ... → Wi-1 → Wi → Wi+1 → ...]

    Trigram:

    [Diagram: the same chain, with each Wi depending on the two previous words Wi-1 and Wi-2]
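    In equation form, the bigram graph encodes p(w1, ..., wn) = ∏i p(wi | wi-1) and the trigram graph encodes p(w1, ..., wn) = ∏i p(wi | wi-1, wi-2).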


  • Representing an HMM as a DBN

    HMM view:

    [Diagram: states 1, 2, 3, each emitting obs; self-loop probabilities 1→1 = .7, 2→2 = .8, 3→3 = 1; transitions 1→2 = .3, 2→3 = .2]

    In the HMM picture, a node is a state and an arc is an allowed transition; in the DBN picture, a node is a variable and an arc is an allowed dependency.

    DBN view:

    [Diagram: ... → Qi-1 → Qi → Qi+1 → ..., with obsi depending on Qi in each frame]

    P(qi | qi-1):

                qi = 1   qi = 2   qi = 3
    qi-1 = 1      .7       .3        0
    qi-1 = 2       0       .8       .2
    qi-1 = 3       0        0        1

    P(obsi | qi): one emission distribution per state qi = 1, 2, 3
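    To make the DBN semantics concrete, here is a forward-algorithm sketch for this 3-state chain. The transition table is the one above; the start distribution and the discrete emission table are invented for illustration:

    import numpy as np

    A = np.array([[0.7, 0.3, 0.0],       # P(qi | qi-1), states 0, 1, 2
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
    B = np.array([[0.9, 0.1],            # hypothetical P(obs | q), 2 symbols
                  [0.5, 0.5],
                  [0.1, 0.9]])
    pi = np.array([1.0, 0.0, 0.0])       # hypothetical: start in state 0

    def forward(obs):
        # p(obs_1..obs_T) via alpha_t(q) = P(obs_t|q) sum_q' alpha_{t-1}(q') P(q|q')
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()

    print(forward([0, 0, 1, 1]))         # likelihood of a short sequence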

  • Casting HMM-based ASR as a GM problem

    [Diagram: ... → Qi-1 → Qi → Qi+1 → ..., with obsi depending on Qi in each frame]

    Viterbi decoding = finding the most probable settings for all qi given the acoustic observations {obsi}

    Baum-Welch training = finding the most likely settings for the parameters of P(qi | qi-1) and P(obsi | qi)

    Both are special cases of the standard GM algorithms for Viterbi decoding and EM training (see the sketch below)
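    A matching Viterbi sketch, reusing the same (partly hypothetical) tables from the forward example so the block stands alone:

    import numpy as np

    A = np.array([[0.7, 0.3, 0.0],       # P(qi | qi-1) from the HMM slide
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
    B = np.array([[0.9, 0.1],            # hypothetical P(obs | q)
                  [0.5, 0.5],
                  [0.1, 0.9]])
    pi = np.array([1.0, 0.0, 0.0])       # hypothetical start distribution

    def viterbi(obs):
        # Most probable state sequence q* = argmax_q p(q, obs).
        delta = pi * B[:, obs[0]]        # best score ending in each state
        back = []                        # best predecessor at each step
        for o in obs[1:]:
            scores = delta[:, None] * A  # scores[prev, next]
            back.append(scores.argmax(axis=0))
            delta = scores.max(axis=0) * B[:, o]
        path = [int(delta.argmax())]
        for ptr in reversed(back):       # trace the best path backwards
            path.append(int(ptr[path[-1]]))
        return path[::-1]

    print(viterbi([0, 0, 1, 1]))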

  • Variations

    Input-output HMMs

    [Diagram: hidden chain ... → Qi-1 → Qi → Qi+1 → ..., with an input stream and an output stream Xi attached at each frame]

    Factorial HMMs

    [Diagram: two hidden chains Qi and Ri, with each observation Xi depending on both]

  • Switching parents

    Definition: A variable X is a switching parent of variable Y if the value of X determines the parents and/or implementation of Y

    Example:

    [Graph: B and C are candidate parents of D; A is the switching parent]

    A=0: D has parent B, with a Gaussian distribution

    A=1: D has parent C, with a Gaussian distribution

    A=2: D has parent C, with a mixture-of-Gaussians distribution
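    A sketch of the switching logic (all parameters hypothetical; only the structure mirrors the example):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_d(a, b, c):
        # Sample D given switching parent A and candidate parents B, C.
        if a == 0:                       # A=0: parent B, single Gaussian
            return rng.normal(loc=2.0 * b, scale=1.0)
        if a == 1:                       # A=1: parent C, single Gaussian
            return rng.normal(loc=-1.0 * c, scale=0.5)
        comp = rng.choice(2, p=[0.3, 0.7])   # A=2: parent C, 2-comp. mixture
        return rng.normal(loc=(c - 1.0, c + 1.0)[comp], scale=0.8)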


  • HMM-based recognition with a DBN

    [DBN unrolled over frames 0 ... last frame. Variables in each frame: word, word transition, word position, state transition, phone state, observation; an end-of-utterance variable appears in the last frame]

    What language model does this GM implement?


  • Training and testing DBNs

    Why do we need different structures for training and testing? Isn't training just the same as testing, but with more of the variables observed?

    Not always!

    Often, during training we have only partial information about some of the variables, e.g. the word sequence but not which frame goes with which word


  • More complex GM models for recognition

    HMM + auxiliary variables (Zweig 1998, Stephenson 2001)

    Noise clustering

    Speaker clustering

    Dependence on pitch, speaking rate, etc.

    [Diagrams: a plain HMM frame (Q → obs); the same frame with an auxiliary variable aux influencing obs; and a frame with multiple auxiliary variables a1, a2, ..., aN]

    Articulatory/feature-based modeling

    Multi-rate modeling, audio-visual speech recognition (Nefian et al. 2002)


  • Modeling inter-observation dependencies: Buried Markov models (Bilmes 1999)

    First note that the observation variable is actually a vector of acoustic observations (e.g. MFCCs)

    [Diagram: state chain Qi-1, Qi, Qi+1 with vector-valued observations obs]

    Consider adding dependencies between observations

    Add only those that are discriminative with respect to classifying the current state/phone/word

  • Feature-based modeling

    Phone-based view:

    Brain: "Give me a [ ]!"

    Lips, tongue, velum, glottis: "Right on it, sir!"

    (Articulatory) feature-based view:

    Brain: "Give me a [ ]!"

    Tongue: "Umm... yeah, OK."

    Lips: "Huh?"

    Velum, glottis: "Right on it, sir!"


  • A feature-based DBN for ASR

    [DBN diagram over frames i and i+1: in each frame, a phone state variable with articulatory feature variables A1, A2, ..., AN and the acoustic observation O]

    p(o | a1, ..., aN)

  • GMTK: Graphical Modeling Toolkit (J. Bilmes and G. Zweig, ICASSP 2002)

    Toolkit for specifying and computing with dynamic Bayesian networks

    Models are specified via:

    Structure file: defines variables, dependencies, and form of associated conditional distributions

    Parameter files: specify parameters for each distribution in the structure file

    Variable distributions can be

    Mixture Gaussians + variants

    Multidimensional probability tables

    Sparse probability tables

    Deterministic (decision trees)

    Provides programs for EM training, Viterbi decoding, and various utilities


  • Example portion of structure file

    variable : phone {
        type: discrete hidden cardinality NUM_PHONES;
        switchingparents: nil;
        conditionalparents: word(0), wordPosition(0) using
            DeterministicCPT("wordWordPos2Phone");
    }

    variable : obs {
        type: continuous observed OBSERVATION_RANGE;
        switchingparents: nil;
        conditionalparents: phone(0) using mixGaussian
            collection(global) mapping("phone2MixtureMapping");
    }


  • Some issues...

    For some structures, exact inference may be computationally infeasible → approximate inference algorithms

    Structure is not always known → structure learning algorithms


  • References

    J. Bilmes, "Graphical Models and Automatic Speech Recognition," in Mathematical Foundations of Speech and Language Processing, Institute of Mathematical Analysis Volumes in Mathematics Series, Springer-Verlag, 2003.

    G. Zweig, "Speech Recognition with Dynamic Bayesian Networks," Ph.D. dissertation, UC Berkeley, 1998.

    J. Bilmes, "What HMMs Can Do," UWEETR-2002-0003, Feb. 2002.