Speech Recognition

What makes speech recognition hard?

Speech Recognition

• Task: Identify sequence of words uttered by speaker, given acoustic waveform.

• Uncertainty introduced by noise, speaker error, variation in pronunciation, homonyms, etc.

• Thus speech recognition is viewed as a problem of probabilistic inference.

Example: “I’m firsty, um, can I haf somefing to dwink?”

From Russell and Norvig, Artificial Intelligence

Speech Recognition System Architecture (from Buchsbaum & Giancarlo paper)

Here, “lattice” means “Hidden Markov Model”

Acoustic feature extraction

Acoustic Features -> Phones model

Phones -> Word pronunciation model

Language model

Acoustic feature extraction

From Russell and Norvig, Artificial Intelligence

Hidden Markov Models

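• Markov model: Given state $X_t$, what is the probability of transitioning to the next state $X_{t+1}$?

• E.g., word bigram probabilities give $P(\text{word}_{t+1} \mid \text{word}_t)$.

• Hidden Markov model: There are observations (e.g., signal S) and “hidden” states (e.g., words). The HMM specifies the transition probabilities between hidden states and the probability of each observation given the hidden state; from these we infer the probabilities of the hidden states given the observations.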
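
To make the HMM bullet concrete, here is a minimal sketch of forward filtering, which computes the probability of each hidden state given the observations so far. The two hidden words, the two signal symbols, and every number in the tables are made-up assumptions for illustration; a real recognizer would learn them from data.

```python
# Toy HMM. All names and probabilities are illustrative assumptions.
states = ["yes", "no"]                      # hidden states (words)
prior = {"yes": 0.5, "no": 0.5}             # P(X_1)
# transition[a][b] = P(X_{t+1} = b | X_t = a)
transition = {"yes": {"yes": 0.7, "no": 0.3},
              "no":  {"yes": 0.4, "no": 0.6}}
# emission[x][s] = P(S_t = s | X_t = x)
emission = {"yes": {"hiss": 0.8, "hum": 0.2},
            "no":  {"hiss": 0.1, "hum": 0.9}}


def normalize(dist):
    """Scale a dict of non-negative scores so that they sum to 1."""
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}


def forward_filter(observations):
    """Return P(X_t | s_1, ..., s_t) for the last time step t."""
    # First step: weight the prior by the first observation.
    belief = normalize({x: prior[x] * emission[x][observations[0]]
                        for x in states})
    for s in observations[1:]:
        # Predict: push the current belief through the transition model.
        predicted = {b: sum(belief[a] * transition[a][b] for a in states)
                     for b in states}
        # Update: weight each state by how well it explains the observation.
        belief = normalize({b: predicted[b] * emission[b][s] for b in states})
    return belief


belief = forward_filter(["hiss", "hiss", "hum"])
print(belief)
```

After the final "hum" observation the belief shifts toward "no", because that state explains "hum" much better than "yes" does.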

Phone model

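$P(\text{phone} \mid \text{frame features}) = \alpha\, P(\text{frame features} \mid \text{phone})\, P(\text{phone})$, where $\alpha$ is a normalizing constant.

$P(\text{frame features} \mid \text{phone})$ is often represented by a Gaussian mixture model.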
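
As a rough illustration of the phone model, the sketch below scores a single frame feature against two phones using Bayes' rule with a Gaussian mixture likelihood. It assumes a one-dimensional feature for readability (real frame features are vectors), and the phone labels, mixture parameters, and priors are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_likelihood(x, components):
    """P(feature | phone) as a weighted sum of Gaussians.

    components: list of (weight, mean, variance); weights sum to 1.
    """
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


# Hypothetical mixture parameters and priors for two phones.
phone_models = {
    "aa": [(0.6, 1.0, 0.5), (0.4, 2.0, 1.0)],
    "iy": [(0.5, -1.0, 0.5), (0.5, -2.5, 1.0)],
}
phone_prior = {"aa": 0.3, "iy": 0.7}


def phone_posterior(x):
    """P(phone | feature): likelihood times prior, then normalize."""
    unnormalized = {p: gmm_likelihood(x, phone_models[p]) * phone_prior[p]
                    for p in phone_models}
    total = sum(unnormalized.values())   # dividing by this applies alpha
    return {p: v / total for p, v in unnormalized.items()}


posterior = phone_posterior(1.2)
print(posterior)
```

Even though "iy" has the larger prior here, a feature value near 1.2 is far more likely under the "aa" mixture, so the posterior favors "aa".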

From Russell and Norvig, Artificial Intelligence

Acoustic Features -> Phones model

Word Pronunciation model

Now we want

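$P(\text{words} \mid \text{phones}_{1:t}) = \alpha\, P(\text{phones}_{1:t} \mid \text{words})\, P(\text{words})$, where $\alpha$ is a normalizing constant.

Represent $P(\text{phones}_{1:t} \mid \text{words})$ as an HMM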

Phones -> Word pronunciation model
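
One way to picture a pronunciation model is as a small graph of phone states with transition probabilities, where the probability of a phone sequence given the word is the total probability of the paths that produce it. The sketch below does this for a single word with alternative pronunciations. The graph, the phone symbols, and the probabilities are illustrative assumptions, and each state emits exactly one phone, which is a simplification of a full HMM.

```python
# Hypothetical pronunciation graph for the word "tomato".
# Each numbered state emits exactly one phone (a simplifying assumption).
START, END = "start", "end"
phone_of = {0: "t", 1: "ow", 2: "ah", 3: "m", 4: "ey", 5: "aa", 6: "t", 7: "ow"}
edges = {
    START: {0: 1.0},
    0: {1: 0.2, 2: 0.8},     # first vowel: "ow" or reduced "ah"
    1: {3: 1.0},
    2: {3: 1.0},
    3: {4: 0.5, 5: 0.5},     # second vowel: "ey" or "aa"
    4: {6: 1.0},
    5: {6: 1.0},
    6: {7: 1.0},
    7: {END: 1.0},
}


def pronunciation_probability(phones):
    """P(phones | word): total probability of paths emitting this sequence."""
    # alpha[state] = probability of being in `state` after the phones so far.
    alpha = {START: 1.0}
    for phone in phones:
        next_alpha = {}
        for state, p in alpha.items():
            for succ, q in edges.get(state, {}).items():
                if succ != END and phone_of[succ] == phone:
                    next_alpha[succ] = next_alpha.get(succ, 0.0) + p * q
        alpha = next_alpha
    # The sequence only counts if the word can end right after it.
    return sum(p * edges.get(state, {}).get(END, 0.0)
               for state, p in alpha.items())


print(pronunciation_probability("t ah m aa t ow".split()))
print(pronunciation_probability("t ow m ey t ow".split()))
```

A sequence that stops partway through the word, or uses a phone the graph does not allow, gets probability zero under this model.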

Example of Phones -> Word pronunciation model

From Russell and Norvig, Artificial Intelligence

From Russell and Norvig, Artificial Intelligence

Language model
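
A language model assigns a probability to a word sequence. As a minimal sketch of the word-bigram case mentioned earlier, the code below estimates bigram probabilities by counting over a tiny corpus and multiplies them to score a sentence. The corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions for illustration, and no smoothing is applied, so unseen word pairs get probability zero.

```python
from collections import Counter

# Tiny made-up training corpus.
corpus = [
    "i want something to drink",
    "i want to drink",
    "can i have something to drink",
]


def train_bigrams(sentences):
    """Count how often each word occurs as a context and each word pair occurs."""
    context_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(words[:-1])
        pair_counts.update(zip(words[:-1], words[1:]))
    return context_counts, pair_counts


def bigram_prob(prev, word, context_counts, pair_counts):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / context_counts[prev]


def sentence_prob(sentence, context_counts, pair_counts):
    """P(sentence) as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        p *= bigram_prob(prev, word, context_counts, pair_counts)
    return p


context_counts, pair_counts = train_bigrams(corpus)
print(bigram_prob("i", "want", context_counts, pair_counts))
print(sentence_prob("i want to drink", context_counts, pair_counts))
```

A practical system would smooth these estimates so that a single unseen word pair does not force the whole sentence probability to zero.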

To build a speech recognition system, need:

• Lots of data

• Acoustic signal processing tools

• Methods for learning various probability models

• Methods for “maximum likelihood” calculation (i.e., search or “decoding”):

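Suppose we have observations (features from acoustic signal) $O = (o_1, o_2, o_3, \ldots, o_n)$.

We want to find $W^* = (w_1, w_2, w_3, \ldots, w_n)$ such that

$W^* = \arg\max_{W \in L} P(W \mid O) = \arg\max_{W \in L} P(O \mid W)\, P(W)$

The second form follows from Bayes' rule, $P(W \mid O) = P(O \mid W)\, P(W) / P(O)$, because $P(O)$ does not depend on $W$ and so does not affect the argmax.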
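
The argmax can be sketched directly when the set of candidate word sequences is small enough to enumerate. In the example below, the two candidate transcriptions and all of their scores are invented for illustration; the sum of log probabilities is used instead of the product, a common way to avoid numerical underflow.

```python
import math

# Hypothetical scores for one utterance; none of these numbers are measured.
acoustic_likelihood = {            # P(O | W)
    "recognize speech": 1e-5,
    "wreck a nice beach": 2e-5,
}
language_prior = {                 # P(W)
    "recognize speech": 1e-4,
    "wreck a nice beach": 1e-6,
}


def decode(candidates):
    """Return the candidate W maximizing log P(O | W) + log P(W)."""
    def score(w):
        return math.log(acoustic_likelihood[w]) + math.log(language_prior[w])
    return max(candidates, key=score)


best = decode(list(acoustic_likelihood))
print(best)
```

Here the acoustically better match loses to the candidate that the language model finds far more plausible. Real systems cannot enumerate every word sequence, which is why a search method is needed.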

In this formula:

• $P(W)$: language model

• $P(O \mid W)$: combine phone models, segmentation models, word pronunciation models

• $\arg\max$: search or “decoding” method
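
The slides do not name a particular search method. One standard choice for HMMs is the Viterbi algorithm, which finds the single most probable hidden-state sequence by dynamic programming. The sketch below is an assumed illustration on a toy model; the states, observation symbols, and probabilities are made up, and all probabilities are kept non-zero so the logarithms are defined.

```python
import math


def viterbi(observations, states, prior, transition, emission):
    """Most probable hidden-state sequence for the observations (log domain)."""
    # best[x] = (log-probability of the best path ending in x, that path)
    best = {x: (math.log(prior[x]) + math.log(emission[x][observations[0]]), [x])
            for x in states}
    for obs in observations[1:]:
        new_best = {}
        for x in states:
            # Choose the best predecessor state for x.
            prev = max(states,
                       key=lambda a: best[a][0] + math.log(transition[a][x]))
            score = (best[prev][0] + math.log(transition[prev][x])
                     + math.log(emission[x][obs]))
            new_best[x] = (score, best[prev][1] + [x])
        best = new_best
    # Return the path with the highest final score.
    return max(best.values(), key=lambda item: item[0])[1]


# Toy model; every value here is an illustrative assumption.
states = ["yes", "no"]
prior = {"yes": 0.5, "no": 0.5}
transition = {"yes": {"yes": 0.7, "no": 0.3},
              "no":  {"yes": 0.4, "no": 0.6}}
emission = {"yes": {"hiss": 0.8, "hum": 0.2},
            "no":  {"hiss": 0.1, "hum": 0.9}}

path = viterbi(["hiss", "hiss", "hum"], states, prior, transition, emission)
print(path)
```

The cost grows linearly with the number of observations and quadratically with the number of states, which is what makes this kind of search feasible where enumerating all sequences is not.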

Emotion recognition in speech (by OES high-school students!)

http://www.youtube.com/watch?v=NnbsGyViN3Y