Page 1:

Automatic Speech Recognition:

Trials, Tribulations and Triumphs

Sadaoki Furui

Professor Emeritus

Tokyo Institute of Technology

[email protected]

Page 2:

Outline

• Progress of automatic speech recognition (ASR) technology
  – 4 generations
  – Synchronization with computer technology
  – Recent activities at Tokyo Tech
  – Major ASR applications
• How to narrow the gap between machine and human speech recognition
• Data-intensive approach for knowledge extraction
• Summary

Page 3:

Radio Rex – 1920's ASR

A sound-activated toy dog named "Rex" (from Elmwood Button Co.) could be called by name from his doghouse.

Page 4:

Generations of ASR technology

• Prehistory ASR (1925)
• 1G (1952-1970): heuristic approaches (analog filter bank + logic circuits)
• 2G (1970-1980): pattern matching (LPC, FFT, DTW)
• 3G (1980-1990): statistical framework (HMM, n-gram, neural net)
• 3.5G (1990-): discriminative approaches, machine learning, robust training, adaptation, rich transcription
• 4G (?): extended knowledge processing

[Timeline legend: our research at NTT Labs (+ Bell Labs) and Tokyo Tech; collaboration with other labs.]

Page 5:

1st generation technology (1950's-1960's): "Heuristic approaches"

• General
  – The earliest attempts to devise digit/syllable/vowel/phoneme recognition systems
  – Spectral resonances extracted by an analogue filter bank and logic circuits
  – Statistical syntax at the phoneme level
• Early systems
  – Bell Labs, RCA Labs, MIT Lincoln Labs (USA)
  – University College London (UK)
  – Radio Research Lab, Kyoto Univ., NEC Labs (Japan)

Page 6:

(David C. Moschella: "Waves of Power")

[Figure: number of users (millions, 10 to 3,000, log scale) versus year (1970-2030). IT progress moves through system-centered, PC-centered, network-centered, and knowledge-resource-centered eras, enabled in turn by the microprocessor, the PC, the laptop PC, and the smartphone; the 2G, 3G, 3.5G, and 4G ASR generations track the same timeline.]

Page 7:

Flanagan writes, "… the computer in the background is a Honeywell DDP516. This was the first integrated circuits machine that we had in the laboratory. … with memory of 8K words (one-half of which was occupied by the Fortran II compiler) …" (November 1970)

Page 8:

2nd generation technology (1970's) (1): "Pattern matching approaches"

• DTW (dynamic time warping)
  – Vintsyuk in Russia (USSR) proposed the use of DP
  – Sakoe & Chiba at NEC Labs started to use DP
• Isolated word recognition
  – Isolated word or discrete utterance recognition became a viable and usable technology, based on fundamental studies in Russia and Japan (Velichko & Zagoruyko, Sakoe & Chiba, and Itakura)
• IBM Labs: large-vocabulary ASR
• Bell Labs: speaker-independent ASR

Page 9:

2nd generation technology (1970's) (2)

• Continuous speech recognition
  – Reddy at CMU conducted pioneering research based on dynamic tracking of phonemes
• DARPA program
  – Focus on speech understanding
  – Goal: 1000-word ASR, a few speakers, continuous speech, constrained grammar, less than 10% semantic error
  – Hearsay I & II systems at CMU
  – Harpy system at CMU
  – HWIM system at BBN

Page 10:

(David C. Moschella: "Waves of Power")

[Figure: number of users (millions, 10 to 3,000, log scale) versus year (1970-2030). IT progress moves through system-centered, PC-centered, network-centered, and knowledge-resource-centered eras, enabled in turn by the microprocessor, the PC, the laptop PC, and the smartphone; the 2G, 3G, 3.5G, and 4G ASR generations track the same timeline.]

Page 11:

3rd generation technology (1980's): "Statistical framework"

• Connected word recognition
  – Two-level DP, one-pass method, level-building (LB) approach, frame-synchronous LB approach
• Statistical framework
  – HMM
  – N-gram
  – Cepstrum + Δcepstrum
• Neural net
• DARPA program (Resource Management task)
  – SPHINX system at CMU
  – BYBLOS system at BBN
  – DECIPHER system at SRI
  – Lincoln Labs, MIT, AT&T Bell Labs

Page 12:

Structure of speech production and recognition system based on information transmission theory

[Figure: in transmission-theory terms, an information source feeds a channel and then a decoder; in speech-recognition terms, text generation W passes through the acoustic channel (speech production S, acoustic processing X), and linguistic decoding in the speech recognition system produces Ŵ.]

$$\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(X \mid W)\, P(W)}{P(X)}$$

Page 13:

MFCC (Mel-frequency cepstral coefficients)-based front-end processor

[Figure: speech → FFT → FFT-based spectrum → mel-scale triangular filters → log → DCT → MFCC; the acoustic vector consists of the MFCCs together with their Δ and Δ² coefficients.]
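To make the pipeline concrete, here is a minimal single-frame sketch in Python/NumPy of the FFT → mel filterbank → log → DCT chain; the frame length, filter count, and coefficient count are illustrative defaults, not values from the slide:

```python
import numpy as np

def mfcc(frame, sr=16000, n_filt=26, n_ceps=13):
    """Single-frame MFCC sketch: FFT -> mel filterbank -> log -> DCT."""
    # Power spectrum of a Hamming-windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # Mel-scale triangular filterbank, evenly spaced on the mel axis.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    bins = np.floor(edges * len(frame) / sr).astype(int)
    fbank = np.zeros((n_filt, len(spec)))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    logmel = np.log(fbank @ spec + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep n_ceps coefficients.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filt)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_filt) @ logmel

# Delta and delta-delta coefficients would be appended across frames in a
# full front end; here we compute static MFCCs for one 25 ms frame.
frame = np.random.default_rng(0).standard_normal(400)
print(mfcc(frame)[:4])
```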

Page 14:

Structure of phoneme HMMs
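Phoneme HMMs are typically small left-to-right models whose best state path is found with Viterbi decoding. A minimal sketch, assuming log-domain scores; the three-state topology and the toy transition and emission values are illustrative assumptions, not values from the talk:

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Viterbi decoding for an HMM.

    log_emit:  (T, S) log emission probabilities per frame and state
    log_trans: (S, S) log transition probabilities
    log_init:  (S,)   log initial-state probabilities
    Returns the most likely state sequence and its log probability.
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best score ending in each state
    psi = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: from state i to j
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(np.max(delta))

# Toy 3-state left-to-right phoneme HMM (self-loops plus forward transitions).
NEG = -1e9  # stand-in for log(0)
log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                             [0.0, 0.7, 0.3],
                             [0.0, 0.0, 1.0]]) + 1e-12)
log_init = np.array([0.0, NEG, NEG])  # must start in the first state
log_emit = np.log(np.random.default_rng(0).dirichlet([1, 1, 1], size=8))
print(viterbi(log_emit, log_trans, log_init))
```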

Page 15:

Statistical language modeling

Probability of the word sequence $w_1^k = w_1 w_2 \ldots w_k$:

$$P(w_1^k) = \prod_{i=1}^{k} P(w_i \mid w_1 w_2 \ldots w_{i-1}) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1})$$

$$P(w_i \mid w_1^{i-1}) = N(w_1^i) \,/\, N(w_1^{i-1})$$

where $N(w_1^i)$ is the number of occurrences of the string $w_1^i$ in the given training corpus.

Approximation by Markov processes:

Bigram model: $P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-1})$

Trigram model: $P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-2} w_{i-1})$

Smoothing of the trigram by unigram and bigram:

$$\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$$
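A minimal count-based sketch of these formulas follows; the interpolation weights λ are fixed by hand here, whereas in practice they would be estimated on held-out data (e.g., by EM):

```python
from collections import Counter

def train_counts(corpus):
    """Collect unigram/bigram/trigram counts from tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            uni[words[i]] += 1
            bi[(words[i - 1], words[i])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
    return uni, bi, tri

def interp_prob(w, u, v, uni, bi, tri, lambdas=(0.5, 0.3, 0.2)):
    """P^(w | u v) = l1*P(w|u,v) + l2*P(w|v) + l3*P(w), as on the slide.

    The bigram count approximates the trigram history count; this sketch
    ignores sentence-boundary histories for simplicity.
    """
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / total
    return l1 * p3 + l2 * p2 + l3 * p1

corpus = [["speech", "recognition", "works"],
          ["speech", "recognition", "improves"]]
uni, bi, tri = train_counts(corpus)
print(interp_prob("works", "speech", "recognition", uni, bi, tri))
```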

Page 16:

(David C. Moschella: "Waves of Power")

[Figure: number of users (millions, 10 to 3,000, log scale) versus year (1970-2030). IT progress moves through system-centered, PC-centered, network-centered, and knowledge-resource-centered eras, enabled in turn by the microprocessor, the PC, the laptop PC, and the smartphone; the 2G, 3G, 3.5G, and 4G ASR generations track the same timeline.]

Page 17:

3.5th generation technology (1990's)

• Error minimization (discriminative) approaches
  – MCE (Minimum Classification Error) approach
  – MMI (Maximum Mutual Information) criterion
  – MPE (Minimum Phone Error) criterion
• Robust ASR
  – Background noise, voice individuality, microphones, transmission channel, room reverberation, etc.
  – VTLN, MLLR, HLDA, fMPE, PMC, etc.
• DARPA program
  – ATIS task
  – Broadcast news (BN) transcription integrated with information extraction and retrieval technology
  – Switchboard task
• Applications
  – Automating and enhancing operator services

Page 18:

Main causes of acoustic variation in speech

• Speaker: voice quality, pitch, gender, dialect
• Speaking style: stress/emotion, speaking rate, Lombard effect
• Task/context: man-machine dialogue, dictation, free conversation, interview
• Phonetic/prosodic context
• Microphone: distortion, electrical noise, directional characteristics
• Noise: other speakers, background noise, reverberation
• Channel: distortion, noise, echoes, dropouts

Page 19:

3.5th generation technology (2000's)

• DARPA programs
  – EARS (Effective Affordable Reusable Speech-to-Text) program for rich transcription; GALE
  – Detecting, extracting, summarizing, and translating important information
• Spontaneous speech recognition
  – CSJ lecture project in Japan
  – Meeting projects in the US and Europe
• Robust ASR
  – Utterance verification and confidence measures
  – Combining systems or subsystems
• Machine learning
  – Graphical models (DBN)
  – Deep neural networks (DNN)

Page 20:

Word error rate (WER) as a function of the size of acoustic model training data (8/8 = 510 hours)

[Figure: WER (%, 25-27.5 range) decreases steadily as the training corpus grows from 1/8 to 8/8 of the data.]

Task: CSJ (Corpus of Spontaneous Japanese)

Page 21:

Out-of-vocabulary (OOV) rate, word error rate (WER) and adjusted test-set perplexity (APP) as a function of the size of language model training data (8/8 = 6.84M words)

Task: CSJ (Corpus of Spontaneous Japanese)

Page 22:

WFST (Weighted Finite State Transducer)-based "T3 decoder"

Problems of WFST-based decoders:
– Large memory requirement
– Little flexibility: it is difficult to change partial models

[Figure: the recognition cascade composes four transducers, H ∘ C ∘ L ∘ G, mapping speech to a word sequence: H (HMM) to context-dependent phones, C (triphone context) to context-independent phones, L (lexicon) to words, and G (N-gram) to the word sequence. The cascade is built by composition and optimization; T3 also supports on-the-fly composition and parallel decoding.]
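As an illustration of the composition idea (not the T3 implementation itself), here is a tiny Python sketch that composes two transducers represented as arc dictionaries; real decoders add epsilon filters, weight pushing, and determinization, all omitted here:

```python
from collections import deque

def compose(A, B):
    """Compose two weighted FSTs (tropical weights: costs add along arcs).

    Each FST maps state -> list of (input, output, weight, next_state).
    An arc of the composition consumes A's input and emits B's output
    whenever A's output symbol matches B's input symbol.
    """
    start = (0, 0)
    arcs, queue, seen = {}, deque([start]), {start}
    while queue:
        p, q = queue.popleft()
        arcs[(p, q)] = []
        for i1, o1, w1, n1 in A.get(p, []):
            for i2, o2, w2, n2 in B.get(q, []):
                if o1 == i2:
                    nxt = (n1, n2)
                    arcs[(p, q)].append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return arcs

# L: phone sequence -> word ("-" is a filler symbol standing in for epsilon,
# which G passes through with a self-loop; real WFSTs use epsilon filters).
L = {0: [("k", "-", 0.0, 1)], 1: [("ae", "-", 0.0, 2)], 2: [("t", "cat", 0.0, 3)]}
# G: a one-word "language model" assigning the word 'cat' an n-gram cost.
G = {0: [("-", "-", 0.0, 0), ("cat", "cat", 1.2, 1)]}
print(compose(L, G))  # phones k ae t -> word 'cat' with cost 1.2
```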

Page 23:

Structure of the T3 decoder

[Figure: audio and configuration enter the frontend filters, which feed the decoder; the decoder couples an active search (GC) to GPU-based acoustic-model (AM) scoring, and reads its network either through on-the-fly WFST control from disk or from a static network built offline by WFST tools out of the language model, lexicon, triphone inventory, and HMM acoustic model. The recognizer outputs a word sequence or lattice.]

Page 24:

RTF vs. word accuracy for the various decoders and acoustic models (WSJ 5k task)

[Figure: real-time factor versus word accuracy. Decoders: T3, Sphinx3, HDecode, Juicer; acoustic models: Sphinx, HDecode; cascade construction: det(C ∘ det(L ∘ G)).]

Page 25:

DBN representing a factor-analyzed HMM

[Figure: dynamic Bayesian network over two time slices t and t+1, with discrete HMM states q_t; continuous latent state vectors x_t; observation vectors o_t; and discrete mixture indicators ω_t^x and ω_t^o for the state and observation spaces.]

(A.-V. I. Rosti et al., 2002)

Page 26:

DBN for acoustic factorization

[Figure: dynamic Bayesian network over time slices t-1, t, t+1 with words W_t, HMM states q_t, and observations O_t, plus latent factors l_t for the acoustic condition and the speaker.]

(M. J. F. Gales, 2001)

Page 27:

Deep neural network

[Figure: schematic of a deep neural network with several hidden layers.]

(Dong Yu, 2012)

Page 28:

Restricted Boltzmann machine

[Figure: a visible layer and a hidden layer, fully connected between layers, with no within-layer connections in either layer.]

(Dong Yu, 2012)
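A minimal NumPy sketch of training such an RBM with one step of contrastive divergence (CD-1); the layer sizes, learning rate, and random data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    v0: (batch, n_vis) data; W: (n_vis, n_hid); b, c: visible/hidden biases.
    With no within-layer connections, each layer is sampled in one shot.
    """
    ph0 = sigmoid(v0 @ W + c)                       # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                     # reconstruct visibles
    ph1 = sigmoid(pv1 @ W + c)                      # hidden probs on reconstruction
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)
data = (rng.random((16, n_vis)) < 0.5).astype(float)
for _ in range(100):
    cd1_step(data, W, b, c)
recon = sigmoid(sigmoid(data @ W + c) @ W.T + b)
print("reconstruction MSE:", np.mean((data - recon) ** 2))
```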

Page 29:

Why deep networks are helpful

• Many simple non-linearities = one complicated non-linearity
• More efficient representation: fewer computational units are needed for the same function
• The space of transformations is more constrained by the structure of the model, so the network is less likely to overfit
• Lower-layer features are typically task-independent (e.g., edges) and thus can be learned in an unsupervised way
• Higher-layer features are task-dependent (e.g., object parts or objects) and are easier to learn given the low-level features
• Higher layers are easier to classify with linear models

(Dong Yu, 2012)

Page 30:

CD-DNN-HMM: 3 key components

• Model senones (tied triphone states) directly
• Many layers of nonlinear feature transformation
• A long window of frames, including Δ and Δ² features

(Dong Yu, 2012)
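In a hybrid CD-DNN-HMM the network estimates senone posteriors, while the decoder needs emission likelihoods; dividing by the senone priors (pseudo-likelihoods) bridges the two. A small sketch, with toy posteriors and priors as assumptions:

```python
import numpy as np

def scaled_likelihoods(log_posteriors, log_priors):
    """Convert DNN senone posteriors into HMM emission scores.

    The network outputs P(senone | acoustic window); the decoder needs
    p(x | senone), so one divides by the senone prior:
        log p(x|s) = log P(s|x) - log P(s) + const.
    """
    return log_posteriors - log_priors

# Toy example: 3 frames, 4 senones (priors come from the training alignment).
rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 4))
log_post = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
log_prior = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
print(scaled_likelihoods(log_post, log_prior))
```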

Page 31:

Empirical evidence: Summary (Dahl, Yu, Deng, Acero 2012, Seide, Li, Yu 2011 + new result)

• Voice Search SER (24 hours training)

  AM        Setup            Test
  GMM-HMM   MPE              36.2%
  DNN-HMM   5 layers x 2048  30.1% (-17%)

• Switchboard WER (309 hours training)

  AM        Setup                  Hub5'00-SWB    RT03S-FSH
  GMM-HMM   BMMI (9K 40-mixture)   23.6%          27.4%
  DNN-HMM   7 x 2048               15.8% (-33%)   18.5% (-33%)

• Switchboard WER (2,000 hours training)

  AM            Setup                   Hub5'00-SWB                RT03S-FSH
  GMM-HMM (A)   BMMI (18K 72-mixture)   21.7%                      23.0%
  GMM-HMM (B)   BMMI + fMPE             19.6%                      20.5%
  DNN-HMM       7 x 3076                14.4% (A: -34%, B: -27%)   15.6% (A: -32%, B: -24%)

(Dong Yu, 2012)

Page 32:

Deeper models more powerful? (Seide, Li, Yu 2011, Seide, Li, Chen, Yu 2011)

[Table: word error rates (%) with ML alignment for increasing depth, comparing DBN pre-training, pure backpropagation (BP), layer-wise BP-based model growing (LBP), and discriminative pretraining. With DBN pre-training, WER falls from 24.2 with one hidden layer through 20.4, 18.4, 17.8, 17.2, and 17.1 to 17.0 at the greatest depth; the other regimes stay within roughly 0.1-2 points of these figures at each depth. Single-hidden-layer networks with many more units reach only 22.1-22.6, so depth itself, not merely parameter count, drives the gain.]

(Dong Yu, 2012)

Page 33:

Progress of speech recognition technology since 1980

[Figure: speaking style (isolated words, connected speech, fluent speech, read speech, spontaneous speech, unrestricted natural conversation) plotted against vocabulary size (2 to over 20,000 words). Capability frontiers for 1980, 1990 and 2000 sweep outward through applications such as voice commands, digit strings, word spotting, name dialing, form fill by voice, directory assistance, car navigation, office dictation, system-driven dialogue, network agent & intelligent messaging, and 2-way dialogue transcription toward natural conversation.]

Page 34:

Many applications now

Page 35:

Major speech recognition applications

• Spoken dialog systems for accessing information services and voice search (e.g. directory assistance, flight information systems, stock price quotes, virtual agents, How May I Help You?, GOOG-411, livesearch411, Vlingo, and Google VS)
• Systems for transcription, summarization and information extraction from speech documents (e.g. broadcast news, meetings, lectures, presentations, congressional records, court records, and voicemails)

Page 36:

Voice search system architecture

[Figure: a spoken query enters the ASR module, which draws on an acoustic model, a pronunciation model, and a language model to produce m-best results; the search module matches these against the listing search index to produce combined n-best results; confidence measures, a disambiguation module, and a dialog manager then determine the final listing.]

(Y.-Y. Wang et al., 2008)

Page 37:

Context of Vlingo mobile interface usage (self-reported)

• While driving: 42%
• In public (bars, restaurants, stores, public transportation): 8%

Page 38:

Vlingo BlackBerry usage by function

  speak to text box   41.9%
  web search          18.7%
  voice dial          10.4%
  email                5.7%
  launch app           5.2%
  tell a friend        1.8%
  Facebook             1.3%
  note to self         1.2%
  twitter              0.4%

"Speak to text box" usage is the case where users speak into an existing application (hence, using the speech interface as a keyboard).

Page 39:

Average spoken words per Vlingo task

[Figure: average number of spoken words per task, by function.] When speaking to Vlingo, users accomplish most tasks with only a small number of words.

Page 40:

Vlingo user behavior by expertise

[Figure: percentage of utterances, by user expertise and follow-up action.] Initial users tend to clear results when faced with recognition issues, whereas expert users are much more likely to correct, either by typing or by selecting from alternate results.

Page 41:

Google Voice Search accuracy evolution over time

[Figure: Webscore (%) versus acoustic model number.] Size of transcribed data: AM1: mismatched; AM2: 1K hours; AM3: 2K hours; AM4: 7K hours; AM5: more. Model structure and training methods were also improved. Vocabulary size: 46K.

(J. Schalkwyk et al., 2010)

Page 42:

Perplexity and WER as a function of 3-gram LM size

[Figure: perplexity and WER curves over 3-gram LM sizes up to 1 billion n-grams.]

Page 43:

Perplexity and WER as a function of 5-gram LM size

[Figure: perplexity and WER curves over 5-gram LM sizes up to 1 billion n-grams.]

Page 44:

Human vs. machine recognition word error rates across a variety of tasks (Lippmann, 1997)

  Task                        Machine (%)   Human (%)
  Connected digits            0.72          0.009
  Letters                     5             1.6
  Resource Management         3.6           0.1
  WSJ (Wall Street Journal)   7.2           0.9
  Switchboard                 43            4

Page 45:

ASR error analysis for a voice search application

[Figure: distribution of error causes, pronunciation among them.]

(Y.-Y. Wang et al., 2008)

Page 46:

Major ASR problems

• Robustness against various speech variations: speakers, noise, language, topics, etc.
• The out-of-vocabulary (OOV) problem: the system does not know when it does not know
• Few advances in basic understanding
• Building a system for a new language takes a long time and requires a large amount of expensive speech databases

Page 47:

Knowledge sources for speech recognition

• Knowledge (meta-data): domain and topics, context, semantics, speakers, environment, etc.
• How to incorporate knowledge sources into the statistical ASR framework is a key issue
• Unsupervised or lightly supervised training is crucial

[Figure: speech (data) feeds recognition and information extraction, yielding knowledge; generalization and abstraction turn this into meta-data that feeds back into recognition.]

Human speech recognition is a matching process whereby an audio signal is matched to existing knowledge (comprehension).

Page 48:

Next-generation ASR using comprehensive knowledge sources

[Figure: speech passes through feature extraction (spectrum, waveform, pitch, energy, over multiple time windows) into a decoder that outputs recognition results (sentence, meaning). The decoder consults an adapted set of models: multilingual, emotion, prosody, segmental (phone, syllable), language (N-gram, syntax, semantics), speaker, speaking-style, topic, and noise models.]

Page 49:

(David C. Moschella: "Waves of Power")

[Figure: number of users (millions, 10 to 3,000, log scale) versus year (1970-2030). IT progress moves through system-centered, PC-centered, network-centered, and knowledge-resource-centered eras, enabled in turn by the microprocessor, the PC, the laptop PC, and the smartphone; the 2G, 3G, 3.5G, and 4G ASR generations track the same timeline.]

Page 50:

Data-intensive ("Big data") ASR

[Figure: a huge speech database covering many kinds of variation is mapped onto the current DB along dimensions such as individuality, meaning (message), emotion, and others.]

"We have actually made fair progress on simulation tools, but not very much on data analysis tools." (Jim Gray)

"Quantity decides quality." "Resources are better spent collecting more data."

Page 51:

Technical issues

• How to collect rich speech DBs that are orders of magnitude larger, covering all possible variations
  – Current typical speech DB (for a single task): 1,000-10,000 h of speech, ~100 Mbyte-1 Gbyte
  – Current model complexity (for a single task): HMMs with several million parameters; 100 million-1 billion trigrams
  – Many sources of variation
• How to build and utilize huge DBs (scalable approaches)
  – Cheap, fast and good-enough metadata/transcription
  – Well-structured models (machine learning)
  – Efficient data selection for annotation (active learning)
  – Unsupervised, semi-supervised or lightly-supervised training/adaptation
  – High-performance computing

Page 52:

Speaker and noise factorization

In the mel-cepstral domain, the mismatch function relating clean speech x and noisy speech y is given by:

y = x + h + C log(1 + exp(C⁻¹(n − x − h)))

where n and h are the additive and convolutional noise, respectively, and C is the DCT matrix.

The model is first adapted to the target speaker by MLLR, and then adapted to the target noise environment via model-based vector Taylor series (VTS) compensation. These transforms can be jointly estimated using ML.

(Y.-Q. Wang and M. J. F. Gales, 2011)
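A direct NumPy transcription of the mismatch function; the truncated DCT matrix and its pseudo-inverse are illustrative stand-ins for the front end's actual matrices:

```python
import numpy as np

def vts_mismatch(x, h, n, C, Cinv):
    """Mel-cepstral mismatch function from the slide:
        y = x + h + C log(1 + exp(C^{-1} (n - x - h)))
    x: clean speech, h: convolutional noise, n: additive noise (cepstra);
    C / Cinv: (truncated) DCT matrix and its (pseudo-)inverse.
    """
    return x + h + C @ np.log1p(np.exp(Cinv @ (n - x - h)))

# Toy dimensions: 13 cepstra from 26 log-mel filterbank channels.
rng = np.random.default_rng(0)
k = np.arange(13)[:, None]
m = np.arange(26)[None, :]
C = np.cos(np.pi * k * (m + 0.5) / 26)   # DCT-II rows
Cinv = np.linalg.pinv(C)                 # pseudo-inverse for the truncation
x, h, n = rng.standard_normal((3, 13))
print(vts_mismatch(x, h, n, C, Cinv))
```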

Page 53:

Subspace mixture model

$$p(x \mid j, s) = \sum_{m=1}^{M_j} c_{jm} \sum_{i=1}^{I} w_{jmi}\, \mathcal{N}\big(x;\ \mu_{jmi}^{(s)},\ \Sigma_i\big), \qquad \mu_{jmi}^{(s)} = M_i v_{jm} + N_i v^{(s)}$$

where j is the phonetic state index; m the substate index; c_jm the substate mixture weight; i the Gaussian index; w_jmi the Gaussian mixture weight; s the speaker; M_i the mean projection matrix; v_jm the state-substate-specific vector; N_i v^(s) the speaker-specific offset; and v^(s) the speaker vector.

(D. Povey et al., 2010)

Similar to joint factor analysis, eigenvoices, and cluster adaptive training.
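A toy NumPy evaluation of this likelihood; all projection matrices, vectors, and weights are randomly initialized purely for illustration:

```python
import numpy as np

def log_gauss(x, mu, cov):
    """Multivariate normal log density."""
    d = len(x)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1]
                   + diff @ np.linalg.solve(cov, diff))

def sgmm_loglik(x, weights_m, weights_i, M, N, v_jm, v_s, covs):
    """p(x | j, s) = sum_m c_jm sum_i w_jmi N(x; M_i v_jm + N_i v_s, Sigma_i)."""
    total = 0.0
    for m, c_jm in enumerate(weights_m):
        for i, w_i in enumerate(weights_i[m]):
            mu = M[i] @ v_jm[m] + N[i] @ v_s  # subspace mean + speaker offset
            total += c_jm * w_i * np.exp(log_gauss(x, mu, covs[i]))
    return np.log(total)

# Toy sizes: D=4 features, subspace dim S=3, I=2 shared Gaussians, 2 substates.
rng = np.random.default_rng(0)
D, S, I, Msub = 4, 3, 2, 2
M = rng.standard_normal((I, D, S))
N = rng.standard_normal((I, D, S))
covs = np.stack([np.eye(D)] * I)
v_jm = rng.standard_normal((Msub, S))
v_s = rng.standard_normal(S)
weights_m = np.array([0.5, 0.5])
weights_i = np.full((Msub, I), 0.5)
print(sgmm_loglik(rng.standard_normal(D), weights_m, weights_i, M, N, v_jm, v_s, covs))
```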

Page 54:

Cheap, fast and good enough: non-expert transcription through crowdsourcing

(S. Novotney et al., 2010)

Page 55:

Efficient data selection for annotation (active learning)

• Select the most informative utterances, i.e. those having
  – Low certainty (word posterior probability, confidence score, relative likelihood, margin) (Hakkani-Tür et al., 2002)
  – Large word-lattice entropy (Varadarajan et al., 2009)
  – Large classification disagreement among committees (query by committee) (Seung et al., 1992; Hamanaka et al., 2010)
• Manually transcribe the selected utterances and use them for model training
• Problem: how to remove outliers
  – Density-based approaches
• Combine with semi-supervised training for the remaining utterances (Riccardi et al., 2003)

Page 56:

Committee-based utterance selection

1. Randomly and equally divide the transcribed data T into K data sets.
2. Train K recognizers (the committee) on the K data sets.
3. Recognize all utterances in the untranscribed data U with each recognizer.
4. Select N hours of utterances with a relatively high degree of disagreement (vote entropy) among the K recognizers, as sketched below.
5. Transcribe the selected utterances, add them to T, and go back to Step 1.
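A minimal Python sketch of Step 4's vote-entropy scoring; the hypothesis strings and the selection interface are illustrative assumptions (a real system would compare word sequences per utterance segment):

```python
import numpy as np
from collections import Counter

def vote_entropy(hypotheses):
    """Disagreement among K committee recognizers for one utterance.

    hypotheses: list of K recognition strings; the entropy is 0 when all
    recognizers agree and grows with disagreement.
    """
    counts = np.array(list(Counter(hypotheses).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def select(utts, committee_hyps, n):
    """Pick the n most-disagreed-upon utterances for transcription."""
    ranked = sorted(utts, key=lambda u: vote_entropy(committee_hyps[u]),
                    reverse=True)
    return ranked[:n]

hyps = {"utt1": ["a b c", "a b c", "a b c"],
        "utt2": ["a b c", "a d c", "x b c"]}
print(select(["utt1", "utt2"], hyps, 1))  # -> ['utt2']
```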

Page 57:

Comparison with other methods

• Random selection: randomly select utterances to be transcribed
• WPP (word posterior probability)-based method
• Committee-based method (8 committees for the AM and 1 for the LM; "AM8-LM1" in the figure)

[Figure: word accuracy (69-76%) versus amount of transcribed training data (25-100 h) for the three methods, with a reference line showing the accuracy when all 191 h of training data were transcribed.]

The amount of data T needed to achieve a word accuracy of 74%: random: 95 h; WPP-based: 72 h; committee-based: 63 h. (Database: Corpus of Spontaneous Japanese)

Page 58:

Combination of informativeness and representativeness (N. Itoh et al., 2012)

1. Create an initial model from a limited amount of transcribed corpus.
2. Decode all utterances u_i (i = 1, …, N_u) in the pool of untranscribed speech data with the initial model, obtaining N-best phone n-gram sequences.
3. Informativeness: calculate the N-best entropy H(u_i) for each utterance.
4. Representativeness: extract phone-based tf-idf features t(u_i) and calculate the dot products between t(u_i) and t(u_j), j = 1, …, N_u, j ≠ i.
5. Select utterances considering both informativeness and representativeness; a scoring sketch follows.
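A compact sketch of the two scores and one way to combine them; the linear trade-off weight alpha and the random features are illustrative assumptions, not the paper's exact combination rule:

```python
import numpy as np

def nbest_entropy(nbest_logprobs):
    """Informativeness: entropy over the N-best hypotheses of one utterance."""
    p = np.exp(nbest_logprobs - np.logaddexp.reduce(nbest_logprobs))
    return float(-(p * np.log(p + 1e-12)).sum())

def representativeness(tfidf, i):
    """Mean similarity (dot product) of utterance i to all other utterances."""
    sims = tfidf @ tfidf[i]
    return float((sims.sum() - sims[i]) / (len(tfidf) - 1))

def score(nbest_logprobs, tfidf, i, alpha=0.5):
    """Combine informativeness and representativeness (alpha is hypothetical)."""
    return alpha * nbest_entropy(nbest_logprobs) + \
        (1 - alpha) * representativeness(tfidf, i)

rng = np.random.default_rng(0)
tfidf = rng.random((5, 20))
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # unit-norm tf-idf rows
print(score(rng.standard_normal(10), tfidf, 0))
```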

Page 59:

Lightly-supervised AM training for broadcast news ASR

• BN has closed captions, which are close to, but not exact, transcriptions, and are only coarsely time-aligned with the audio signal.
• A speech recognizer is used to automatically transcribe unannotated BN data. The closed captions are aligned with the transcriptions, and only segments that agree are selected for training (sketched below).
• Using the closed captions to provide supervision via the LM is sufficient; there is only a small advantage in using them to filter "incorrect" hypotheses/words.

(L. Lamel et al., 2000)
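The agreement filtering can be sketched with a standard sequence alignment; difflib here stands in for the biased-LM decoding and alignment machinery of the actual system, and min_len is an illustrative threshold:

```python
from difflib import SequenceMatcher

def agreeing_segments(captions, hypothesis, min_len=3):
    """Keep only word runs where closed captions and ASR output agree.

    Matching blocks of at least min_len words are trusted as training
    transcripts; the rest of the audio is discarded.
    """
    cap, hyp = captions.split(), hypothesis.split()
    matcher = SequenceMatcher(a=cap, b=hyp, autojunk=False)
    return [" ".join(cap[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks() if m.size >= min_len]

captions = "the president said the economy is improving this quarter"
hyp = "the president said that economy is improving this year"
print(agreeing_segments(captions, hyp))
# -> ['the president said', 'economy is improving this']
```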

Page 60:

Semi-supervised batch-mode adaptation

[Figure: an initial model M is copied and used to recognize untranscribed speech data; the recognition hypotheses, filtered by confidence measures, drive a model update, and the decode-update loop iterates.]

1. Run a decoder to obtain recognition hypotheses.
2. Use the hypotheses as the reference for adaptation (e.g. MLLR).

Page 61:

Problems of the batch-mode adaptation

• Errors are unavoidable in the recognition.
• Model parameters are estimated from hypotheses that include errors.
• In the next decoding step, the adapted model tends to make the same errors.
• Over the iterations, the errors are reinforced.

The question is how to improve the adaptation performance by reducing the influence of the recognition errors.

Page 62:

Semi-supervised cross-validation (CV) adaptation

[Figure: the evaluation speech data are divided into K subsets D(1), …, D(K); K models M(1), …, M(K) are copied from the initial model M, and across the iterations T(1), …, T(K) the hypotheses for each subset are produced by a model that is updated on the other subsets.]

The influence of recognition errors is reduced by separating the data used for the decoding step from the data used for the model update step (a sketch follows).
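A structural sketch of the CV loop in Python; decode and adapt are hypothetical stand-ins for the recognizer and the model-update step (e.g., MLLR):

```python
import numpy as np

def cv_adapt(utts, decode, adapt, initial_model, k=40, iters=3):
    """Cross-validation adaptation sketch.

    Hypotheses for fold i are always produced by a model adapted on the
    other folds, so a model never sees its own errors as supervision.
    decode(model, utt) -> hypothesis; adapt(model, pairs) -> new model.
    """
    folds = np.array_split(np.array(utts, dtype=object), k)
    models = [initial_model] * k  # all start from the unadapted model
    for _ in range(iters):
        hyps = {u: decode(models[i], u)  # fold i decoded by model i
                for i, fold in enumerate(folds) for u in fold}
        # Each new model is updated on all folds except its own.
        models = [adapt(initial_model,
                        [(u, hyps[u]) for j, fold in enumerate(folds)
                         if j != i for u in fold])
                  for i in range(k)]
    return models

# Toy stand-ins for a real recognizer and adapter (hypothetical interfaces).
decode = lambda model, utt: f"hyp({utt})"
adapt = lambda model, pairs: f"{model}+{len(pairs)}utts"
print(cv_adapt([f"u{i}" for i in range(8)], decode, adapt, "M0", k=4, iters=1)[:2])
```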

Page 63:

Number of iterations and WER

[Figure: word error rate (18.5-23%) versus number of iterations (0-8) for the baseline, CV adaptation (CV: K = 40), and Ag adaptation (Ag: K = 10, K' = 6, N = 8).]

Relative WER reductions: batch adaptation: 12%; CV adaptation: 17%; Ag adaptation: 16%.

Page 64:

Summary

• Speech recognition technology has made very significant progress over the past 30 years, helped along by advances in computer technology.
• The majority of technological changes have been directed toward increasing the robustness of recognition.
• The major successful applications are spoken dialog and transcription systems.
• A much greater understanding of the human and physical speech processes is required before automatic speech recognition systems can approach human performance.
• Significant advances will come from data-intensive approaches to knowledge extraction ("big data" ASR).
• Active learning and unsupervised, semi-supervised or lightly-supervised training/adaptation technologies (machine learning) are crucial.