
Search and Decoding

Steve Renals

Automatic Speech Recognition (ASR), Lecture 10, 23 February 2009


Overview

Today’s lecture

Search in (large vocabulary) speech recognition

Viterbi decoding

Approximate search


HMM Speech Recognition

[Block diagram: recorded speech is converted to acoustic features; the acoustic model, lexicon and language model (estimated from training data) define the search space, which is searched to produce the decoded text (transcription).]


The Search Problem in ASR (1)

Find the most probable word sequence W = w_1, w_2, ..., w_M given the acoustic observations X = x_1, x_2, ..., x_n:

    W* = argmax_W P(W | X) = argmax_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.

Words are composed of state sequences, so we may express this criterion by summing over all state sequences Q = q_1, q_2, ..., q_n:

    W* = argmax_W P(W) sum_Q P(Q | W) p(X | Q)

The acoustic observation sequence is conditionally independent of the word sequence given the HMM state sequence.


The Search Problem in ASR (2)

Viterbi criterion: approximate the sum over all state sequences by using the most probable state sequence:

    W* = argmax_W P(W) max_{Q in Q_W} P(Q | W) p(X | Q)

Q_W is the set of all state sequences corresponding to word sequence W

The task of the search (or decoding) algorithm is to determine W* using the above equation, given the acoustic, pronunciation and language models

In a large vocabulary task, evaluating all possible word sequences is infeasible (even using an efficient exact algorithm)

Reduce the size of the search space by pruning unlikely hypotheses

Eliminate repeated computations


Viterbi Decoding

Naive exhaustive search: with a vocabulary size V and a sequence of M words, there are V^M different alternatives to consider!

Viterbi decoding (forward dynamic programming) is an efficient, recursive algorithm that performs an optimal exhaustive search

For HMM-based speech recognition, the Viterbi algorithm is used to find the most probable path through a probabilistically scored time/state lattice

Exploits the first-order Markov property: only need to keep the most probable path at each state

[Figure: states a and x at time t-1 lead to states b and y at time t, which both lead into state c at time t+1; only the better incoming path survives: max(P_ab f_bc, P_xy f_yc).]
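As an illustration, a minimal Python sketch of this recursion in log probabilities; the tables log_start, log_trans and log_emit are hypothetical stand-ins for the HMM parameters rather than any particular toolkit's data structures:

    NEG_INF = float("-inf")

    def viterbi(obs, states, log_start, log_trans, log_emit):
        """Most probable state path for an observation sequence.

        obs       : list of observation symbols
        states    : list of state ids
        log_start : dict state -> log initial probability
        log_trans : dict (prev_state, state) -> log transition probability
        log_emit  : dict (state, symbol) -> log emission probability
        """
        # delta[t][s]: log probability of the best path ending in state s at time t
        # psi[t][s]  : predecessor of s on that best path (the backpointer)
        delta = [{s: log_start.get(s, NEG_INF) + log_emit.get((s, obs[0]), NEG_INF)
                  for s in states}]
        psi = [{}]
        for t in range(1, len(obs)):
            delta.append({})
            psi.append({})
            for s in states:
                # First-order Markov property: keep only the best incoming path
                prev = max(states,
                           key=lambda r: delta[t - 1][r] + log_trans.get((r, s), NEG_INF))
                delta[t][s] = (delta[t - 1][prev] + log_trans.get((prev, s), NEG_INF)
                               + log_emit.get((s, obs[t]), NEG_INF))
                psi[t][s] = prev
        # Backtrace from the most probable final state
        last = max(states, key=lambda s: delta[-1][s])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(psi[t][path[-1]])
        return list(reversed(path)), delta[-1][last]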


Time-state trellis

[Figure: time-state trellis with states i, j, k replicated at frames t-1, t, t+1.]

Set up the problem as a trellis of states and times

Use the Viterbi approximation

At each state-time point keep the single most probable path, discard the rest

The most probable path is the one at the end state at the final time

Typically use log probabilities


Compiling a Recognition Network

[Figure: word network for "one | two | three" followed by "ticket | tickets", expanded into phone networks (one = w ah n, two = t uw, three = th r iy), then into HMM states.]

Build a network of HMM states, from a network of phones, from a network of words
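A small sketch of this expansion, with an illustrative three-word lexicon and a fixed number of HMM states per phone (a real system would use context-dependent models and join the words into a network with transitions, which is omitted here):

    # Illustrative lexicon from the figure above; 3 emitting states per phone is a
    # typical left-to-right HMM topology.
    LEXICON = {
        "one":   ["w", "ah", "n"],
        "two":   ["t", "uw"],
        "three": ["th", "r", "iy"],
    }

    def compile_network(words, lexicon, states_per_phone=3):
        """Expand each word into its phone sequence, then into per-phone HMM states."""
        return {word: [(phone, i)
                       for phone in lexicon[word]
                       for i in range(states_per_phone)]
                for word in words}

    network = compile_network(["one", "two", "three"], LEXICON)
    print(network["two"])   # [('t', 0), ('t', 1), ('t', 2), ('uw', 0), ('uw', 1), ('uw', 2)]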


Connected Word Recognition

[Figure: word models word1 ... word4 in parallel; word ends loop back to word starts, weighted by language model probabilities such as P(word4).]


Time Alignment Path

[Figure: time alignment path through the time-state trellis, with the state axis partitioned into Word1 ... Word4; the best path passes through a sequence of word models over time.]


Backtrace to Obtain Word Sequence

[Figure: the same time-state trellis, with the best path traced back through its word ends.]

Backpointer array keeps track of the word sequence for a path:
backpointer[word][wordStartFrame] = (prevWord, prevWordStartFrame)

Backtrace through the backpointer array to obtain the word sequence for a path
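A sketch of that backtrace; the backpointer structure and the example frame numbers are illustrative, and filling the array during decoding is not shown:

    def backtrace(backpointer, last_word, last_word_start_frame):
        """Recover the word sequence by following (prevWord, prevWordStartFrame) links."""
        words = []
        word, frame = last_word, last_word_start_frame
        while word is not None:
            words.append(word)
            word, frame = backpointer[word][frame]
        return list(reversed(words))

    # Example: "two tickets" decoded with hypothetical word start frames
    backpointer = {
        "two":     {0:  (None, None)},
        "tickets": {25: ("two", 0)},
    }
    print(backtrace(backpointer, "tickets", 25))   # ['two', 'tickets']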


Incorporating a bigram language model

[Figure: word models for "and" (ae n d), "but" (b uh t) and "cat" (k ae t); each word end connects back to each word start, weighted by a bigram probability such as P(cat | and), P(but | cat) or P(cat | cat).]

Trigram or longer span models require a word history.
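A sketch of applying the bigram probability at a word-end transition, as in the figure; the values in BIGRAM are made up for illustration:

    import math

    BIGRAM = {
        ("and", "cat"): 0.20,   # P(cat | and)
        ("cat", "but"): 0.10,   # P(but | cat)
        ("cat", "cat"): 0.01,   # P(cat | cat)
    }

    def word_end_transition_score(path_log_prob, prev_word, next_word,
                                  bigram=BIGRAM, floor=1e-10):
        """Add the log bigram probability when a path leaves prev_word and enters next_word."""
        return path_log_prob + math.log(bigram.get((prev_word, next_word), floor))

    score = word_end_transition_score(-42.0, "and", "cat")   # -42.0 + log P(cat | and)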


Computational Issues

Viterbi decoding performs an exact search in an efficientmanner

Exact search is not possible for large vocabulary tasks. If the vocabulary size is V:

Word boundaries are not known: V words may potentially start at each frame

Cross-word triphones need to be handled carefully, since the acoustic score of a word-final phone depends on the initial phone of the next word

Long-span language models (eg trigrams) greatly increase the size of the search space

Solutions:

Beam search (prune low probability hypotheses)

Dynamic search structures

Multipass search

Best-first search

Finite State Transducer (FST) approaches


Sharing Computation: Prefix Pronunciation Tree

Need to build an HMM for each word in the vocabulary

Individual HMMs for each word result in phone models duplicated in different words

Share computation by arranging the lexicon as a tree

[Figure: prefix pronunciation tree sharing the common prefix d iy k of DECOY, DECODE, DECODES and DECODER, with DO branching off after the initial d.]
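A sketch of building such a prefix tree from a small lexicon (pronunciations are illustrative); each node also records which words it can still lead to, which is used when pruning (see the beam search slide):

    LEXICON = {
        "DO":      ["d", "uw"],
        "DECOY":   ["d", "iy", "k", "oy"],
        "DECODE":  ["d", "iy", "k", "ow", "d"],
        "DECODES": ["d", "iy", "k", "ow", "d", "z"],
        "DECODER": ["d", "iy", "k", "ow", "d", "axr"],
    }

    def build_prefix_tree(lexicon):
        """Nested dict keyed by phone; a node's 'words' lists the words it can still lead to."""
        root = {"children": {}, "words": []}
        for word, phones in lexicon.items():
            node = root
            for phone in phones:
                node = node["children"].setdefault(phone, {"children": {}, "words": []})
                node["words"].append(word)   # lets internal nodes bound LM scores before word ends
        return root

    tree = build_prefix_tree(LEXICON)
    print(sorted(tree["children"]["d"]["children"]["iy"]["words"]))
    # ['DECODE', 'DECODER', 'DECODES', 'DECOY']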


Beam Search

Basic idea: Prune search paths which are unlikely to succeed

Remove nodes in the time-state trellis whose path probability is more than a factor δ less probable than the best path (only consider paths within the beam)

Both the language model and the acoustic model can contribute to pruning

The pronunciation tree can limit pruning, since language model probabilities are only known at word ends: each internal node can keep a list of the words it contributes to

Search errors: errors that arise because the most probable hypothesis was incorrectly pruned

Need to balance search errors with speed
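A minimal sketch of beam pruning over a set of active hypotheses, working in log probabilities; the hypothesis names and the beam width are illustrative:

    import math

    def prune_to_beam(hypotheses, beam_log_width):
        """Keep only hypotheses whose log probability is within beam_log_width of the best.

        hypotheses     : dict mapping a state (or path id) to its log probability
        beam_log_width : log of the factor delta, eg math.log(1e4)
        """
        best = max(hypotheses.values())
        threshold = best - beam_log_width
        return {h: lp for h, lp in hypotheses.items() if lp >= threshold}

    active = {"path_a": -100.0, "path_b": -104.0, "path_c": -130.0}
    print(prune_to_beam(active, math.log(1e4)))   # path_c falls outside the beam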


Multipass Search

Rather than compute only the single best hypothesis, the decoder can output alternative hypotheses

N-best list: list of the N most probable hypotheses

Word Graph/Word Lattice:

Nodes correspond to times (frames)

Arcs correspond to word hypotheses (with associated acoustic and language model probabilities)

Multipass search using progressively more detailed models

Eg: use a bigram language model on the first pass, a trigram on the second pass

Transmit information between passes as word graphs

Later passes rescore word graphs produced by earlier passes
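A sketch of a word-lattice arc and a second-pass rescoring step; Arc, the frame numbers and the replacement language model are illustrative, and a real trigram rescoring would also expand lattice states by word history:

    import math
    from dataclasses import dataclass

    @dataclass
    class Arc:
        start_frame: int
        end_frame: int
        word: str
        acoustic_logprob: float
        lm_logprob: float          # first-pass (eg bigram) language model score

    def rescore(arcs, new_lm_logprob):
        """Replace each arc's first-pass LM score with one from a more detailed model."""
        return [Arc(a.start_frame, a.end_frame, a.word, a.acoustic_logprob,
                    new_lm_logprob(a.word)) for a in arcs]

    lattice = [Arc(0, 25, "two", -30.0, math.log(0.05)),
               Arc(25, 60, "tickets", -45.0, math.log(0.01))]
    rescored = rescore(lattice, lambda w: math.log({"two": 0.07, "tickets": 0.02}[w]))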


Word Search Tree

[Figure: search tree in which each node branches into the M vocabulary words Wd1 ... WdM, one level per word in the hypothesis.]

View recognition search as searching a tree

Viterbi decoding is breadth-first search — time-synchronous

Pruning deactivates part of the search tree

Also possible to use best-first search (stack decoding) — time-asynchronous


Static and dynamic networks

Previous approaches constructed the search space dynamically: less probable paths are not explored.

Dynamic search is resource-efficient, but results in

complex software

tight interactions between pruning algorithms and data structures

Static networks are efficient for smaller vocabularies, but not immediately applicable to large vocabularies

Efficient static networks would enable

application of network optimization algorithms in advance

decoupling of search network construction from decoding


Weighted Finite State Transducers

Finite state automaton that transduces an input sequence to an output sequence

States connected by transitions. Each transition has

an input label

an output label

a weight

[Figure: example WFST with states 0-6; transitions are labelled input:output/weight, eg a:X/0.1, b:Y/0.2, c:Z/0.5, d:W/0.1, e:Y/0.7, f:V/0.3, g:U/0.1.]
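A minimal sketch of such a transducer and of transducing an input sequence through it; the tuple representation is ad hoc for illustration (toolkits such as OpenFST provide this properly), and the example assumes a single deterministic path:

    # A transducer as a start state, a set of final states, and transitions
    # (src_state, input_label, output_label, weight, dst_state); values loosely
    # follow the example figure above.
    T = {
        "start": 0,
        "final": {4},
        "arcs": [
            (0, "a", "X", 0.1, 1),
            (1, "b", "Y", 0.2, 2),
            (2, "c", "Z", 0.5, 3),
            (3, "d", "W", 0.1, 4),
        ],
    }

    def transduce(t, inputs):
        """Follow transitions matching the input sequence; return (outputs, total weight)."""
        state, outputs, weight = t["start"], [], 1.0
        for symbol in inputs:
            arc = next((a for a in t["arcs"] if a[0] == state and a[1] == symbol), None)
            if arc is None:
                return None            # input sequence not accepted
            outputs.append(arc[2])
            weight *= arc[3]
            state = arc[4]
        return (outputs, weight) if state in t["final"] else None

    print(transduce(T, ["a", "b", "c", "d"]))   # (['X', 'Y', 'Z', 'W'], weight ≈ 0.001)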


WFST Algorithms

Composition: used to combine transducers at different levels. For example, if G is a finite state grammar and P is a pronunciation dictionary, then P transduces a phone string to any word string, whereas P ◦ G transduces a phone string to word strings allowed by the grammar

Determinisation: removes non-determinism from the network by ensuring that each state has no more than a single outgoing transition for a given input label

Minimisation: transforms a transducer to an equivalent transducer with the fewest possible states and transitions

Several libraries for WFSTs, eg:

Open FST: http://www.openfst.org/

MIT: http://people.csail.mit.edu/ilh/fst/

AT&T: http://www.research.att.com/~fsmtools/fsm/
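A sketch of composition on the same ad hoc tuple representation as the earlier transducer example: states of the composed machine are pairs of states, an arc is created when an output label of the first transducer matches an input label of the second, and weights are multiplied. Epsilon handling, determinisation and minimisation (which the toolkits above perform) are omitted:

    def compose(a, b):
        """Compose transducer a with transducer b (no epsilon transitions)."""
        arcs = []
        for (s1, i1, o1, w1, d1) in a["arcs"]:
            for (s2, i2, o2, w2, d2) in b["arcs"]:
                if o1 == i2:           # output of a must match input of b
                    arcs.append(((s1, s2), i1, o2, w1 * w2, (d1, d2)))
        return {
            "start": (a["start"], b["start"]),
            "final": {(fa, fb) for fa in a["final"] for fb in b["final"]},
            "arcs": arcs,
        }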


WFST-based decoding

Represent the following components as WFSTs

Context-dependent acoustic models (C)

Pronunciation dictionary (D)

n-gram language model (L)

The decoding network is defined by their composition: C ◦ D ◦ L

Successively determinize and combine the component transducers, then minimize the final network

Problem: although the final network may be of manageable size, the construction process may be very memory intensive, particularly with 4-gram language models or vocabularies of over 50,000 words

Used successfully in several systems


Evaluation

How accurate is a speech recognizer?

Use dynamic programming to align the ASR output with a reference transcription

Three types of error: insertion, deletion, substitution

Word error rate (WER) sums the three types of error. If there are N words in the reference transcript, and the ASR output has S substitutions, D deletions and I insertions, then:

    WER = 100 · (S + D + I) / N %        Accuracy = (100 − WER) %

Speech recognition evaluations: common training and development data, release of new test sets on which different systems may be evaluated using word error rate

NIST evaluations enabled an objective assessment of ASR research, leading to consistent improvements in accuracy

They may also have encouraged incremental approaches at the cost of subduing innovation ("Towards increasing speech recognition error rates")
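A sketch of WER computed with the standard dynamic-programming (Levenshtein) alignment; only the total error count is returned, whereas scoring tools also report S, D and I separately:

    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j]: minimum edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                      # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                      # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dele = d[i - 1][j] + 1
                ins = d[i][j - 1] + 1
                d[i][j] = min(sub, dele, ins)
        return 100.0 * d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("two tickets to dublin", "two ticket dublin"))   # 50.0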


Summary

Search in speech recognition

Viterbi decoding

Connected word recognition

Incorporating the language model

Pruning

Prefix pronunciation trees

Weighted finite state transducers

Evaluation


References

Aubert (2002) - review of decoding techniques

Mohri et al (2002) - WFSTs applied to speech recognition

Moore et al (2006) - Juicer (example of a WFST-based decoder)
