Towards Superhuman Speech Recognition
Mukund Padmanabhan and Michael Picheny
Human Language Technologies Group, IBM Thomas J. Watson Research Center
ASR Workshop, Paris, France, 18-20 Sept 2000
Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group

Page 1: Towards Superhuman Speech Recognition

IBM

ASR Workshop Paris, France 18-20 Sept 2000

Towards Superhuman Speech Recognition

Mukund Padmanabhan and Michael Picheny
Human Language Technologies Group
IBM Thomas J. Watson Research Center

Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group

Page 2: Towards Superhuman Speech Recognition

Common UI Folklore

“Except when interacting with video games, a user does not take very well to surprises”
Human-Computer Interaction, Dix, Finlay, Abowd and Beale

“Golden Rule #3: Make the interface consistent”
Elements of User Interface Design, Mandel

“Computer users usually seek predictable responses and are discouraged if they must engage in clarification dialogs frequently”
Designing the User Interface, Shneiderman

Page 3: Towards Superhuman Speech Recognition

Speech Recognition Progress

[Figure: error rate (log scale, 0.1 to 100) by year (1985 to 2005) for RM, ATIS, WSJ, SWB, BN, VoiceMail and TIDigits]

Page 4: Towards Superhuman Speech Recognition

Human Performance (Lippmann, 1997)

[Figure, four panels comparing machine and human error rates:
  Digits: string error rate, machine vs. human
  Letters: word error rate, machine vs. human
  Wall Street Journal: word error rate vs. speech-to-noise ratio (10, 16 and 22 dB, and quiet), machine vs. human
  Switchboard: word error rate, machine vs. human]

Page 5: Towards Superhuman Speech Recognition

Problem Categorization

• Dictation (WSJ): Well formed; Computer; Full BW; 7%
• Broadcast News: Varied, primarily well formed; Audience; Mixed, primarily full BW; 12%
• DARPA Communicator: Spontaneous; Computer; Telephone BW; 16%
• SWB: Spontaneous; Person; Telephone BW; 20-30%
• Voicemail: Spontaneous; Person; Telephone BW; 30%
• Meetings: Spontaneous; People; Far-field; 55%

Page 6: Towards Superhuman Speech Recognition

Domain Dependence

               Training Data
               Transaction   Switchboard   Voicemail
YP             4.39          6.44          8.55
Digits         1.34          1.86          2.36
Switchboard    --            39            57
Voicemail      --            47            36

Page 7: Towards Superhuman Speech Recognition

Observations
1. Spontaneous speech: largest effect on WER (Switchboard, Voicemail, Meetings, real-world speech)
2. Multi-environment speech sources (16K, 8K, far-field microphone, noisy ...)
3. Multi-domain speech sources (dictation, travel, call center, small vocab, broadcast news)
4. Domain-dependence of performance

Focus areas

Improve spontaneous speech models
1. Articulatory modeling
2. Prosodic features
3. Segmental graphical models
4. Joint parameter estimation
5. Speaker separation for multi-speaker speech
6. Data collection for "meeting speech"

Multi-environment
1. Non-linear feature space transformation
2. Hidden observations

Multi-domain
1. Multistyle training
2. Domain independent LM

Objective: Develop a speech recognition system that mimics human performance (independent of environment and domain, working as well for spontaneous as for carefully enunciated speech)

Page 8: Towards Superhuman Speech Recognition

Page 9: Towards Superhuman Speech Recognition

Page 10: Towards Superhuman Speech Recognition

Page 11: Towards Superhuman Speech Recognition

Page 12: Towards Superhuman Speech Recognition

•30% Improvement

•No initial decoding

Page 13: Towards Superhuman Speech Recognition

Page 14: Towards Superhuman Speech Recognition

Page 15: Towards Superhuman Speech Recognition

Page 16: Towards Superhuman Speech Recognition

A Language Model that Works Well on Many Domains

• Different (static) language models work best on different domains

• Use dynamic adaptation to make a generic LM act like a domain-specific LM
  – Generic LM: linear interpolation of a collection of domain-specific LMs (SWB, BN, digit/date grammar, etc.)
  – Adapt by dynamically adjusting interpolation weights

• Want to be able to adapt quickly
  – At the word/sentence level, not at the document level

Um, yeah. Well, anyway, I’ll be arriving at four twenty two p.m. on flight fifty six. Say hi to mom. Oh, and don’t forget to buy IBM at one forty-four.
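
The generic LM on this slide, a linear interpolation of domain-specific LMs, can be sketched as a weighted sum of component probabilities. The Python below is only an illustration under stated assumptions, not the authors' implementation: the toy unigram tables, the domain names, the FLOOR constant and the function names are all made up for the example.

```python
from typing import Dict

# Toy unigram tables standing in for domain-specific LMs.
# The domains and numbers are hypothetical, chosen only for illustration.
DOMAIN_LMS: Dict[str, Dict[str, float]] = {
    "conversational": {"um": 0.30, "yeah": 0.30, "four": 0.05, "flight": 0.05},
    "digits_dates": {"um": 0.01, "yeah": 0.01, "four": 0.40, "flight": 0.02},
}

# Small probability given to words a toy table does not list (an assumption).
FLOOR = 1e-4


def component_prob(domain: str, word: str) -> float:
    """Probability of `word` under one domain-specific LM."""
    return DOMAIN_LMS[domain].get(word, FLOOR)


def interpolated_prob(word: str, weights: Dict[str, float]) -> float:
    """Generic LM: weighted sum of the domain LM probabilities.

    `weights` maps each domain name to its interpolation weight;
    the weights must sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * component_prob(d, word) for d, w in weights.items())


if __name__ == "__main__":
    uniform = {"conversational": 0.5, "digits_dates": 0.5}
    digit_heavy = {"conversational": 0.1, "digits_dates": 0.9}
    for name, weights in (("uniform", uniform), ("digit-heavy", digit_heavy)):
        print(name, interpolated_prob("four", weights))
```

Shifting weight toward the digit/date component raises the probability of "four", which is the effect dynamic adaptation aims for when the speaker moves from small talk to a flight time.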

Page 17: Towards Superhuman Speech Recognition

Adapting Language Model Interpolation Weights

• Simply re-estimate weights to maximize likelihood of adaptation data (like dynamic deleted interpolation)
  – Can be quite slow because have to accumulate a lot of evidence

• Add hidden variable to model that tracks which domain LM is currently being used (Bayesian adaptation)
  – Rate of adaptation can be fast, depend on context, and can be trained on domain labelled data.
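
The two adaptation strategies on this slide can be contrasted in a short sketch. It rests on assumptions the slides do not spell out: maximum-likelihood re-estimation is written as standard EM for mixture weights, and the hidden-variable approach is written as a simple forward filter in which the active domain persists with a fixed probability `stay`. The toy unigram tables, the FLOOR constant and all function names are hypothetical.

```python
from typing import Dict, List

# Toy unigram tables standing in for domain LMs (hypothetical numbers).
LMS: Dict[str, Dict[str, float]] = {
    "conversational": {"um": 0.30, "yeah": 0.30, "four": 0.02, "twenty": 0.02, "two": 0.02},
    "digits_dates": {"um": 0.01, "yeah": 0.01, "four": 0.30, "twenty": 0.30, "two": 0.30},
}
FLOOR = 1e-4  # probability for words missing from a toy table (an assumption)


def prob(domain: str, word: str) -> float:
    return LMS[domain].get(word, FLOOR)


def em_weights(words: List[str], weights: Dict[str, float], iterations: int = 10) -> Dict[str, float]:
    """Re-estimate interpolation weights to maximize adaptation-data likelihood (EM)."""
    w = dict(weights)
    for _ in range(iterations):
        counts = {d: 0.0 for d in w}
        for x in words:
            joint = {d: w[d] * prob(d, x) for d in w}
            total = sum(joint.values())
            for d in w:
                counts[d] += joint[d] / total  # posterior responsibility of domain d
        w = {d: counts[d] / len(words) for d in w}
    return w


def track_domain(words: List[str], prior: Dict[str, float], stay: float = 0.9) -> List[Dict[str, float]]:
    """Forward-filter a hidden 'current domain' variable, one word at a time.

    Assumes at least two domains; with probability `stay` the domain is
    unchanged, otherwise it switches uniformly to one of the others.
    """
    domains = list(prior)
    switch = (1.0 - stay) / (len(domains) - 1)
    belief = dict(prior)
    history = []
    for x in words:
        # Predict step: let the domain persist or switch before seeing the word.
        predicted = {
            d: sum(belief[e] * (stay if e == d else switch) for e in domains)
            for d in domains
        }
        # Update step: reweight by how well each domain LM explains the word.
        joint = {d: predicted[d] * prob(d, x) for d in domains}
        total = sum(joint.values())
        belief = {d: joint[d] / total for d in domains}
        history.append(belief)
    return history


if __name__ == "__main__":
    utterance = ["um", "yeah", "four", "twenty", "two"]
    start = {"conversational": 0.5, "digits_dates": 0.5}
    print("EM weights:", em_weights(utterance, start))
    for word, belief in zip(utterance, track_domain(utterance, start)):
        print(word, belief)
```

In this sketch the EM weights summarize the whole utterance with a single setting, while the filtered belief changes from word to word, which matches the slide's goal of adapting at the word/sentence level rather than the document level.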

Page 18: Towards Superhuman Speech Recognition

Page 19: Towards Superhuman Speech Recognition

Page 20: Towards Superhuman Speech Recognition

Other Factors Driving Progress

[Figure, three panels:
  Vocabulary Independent SR (Hon, 1992): values on a 6 to 16 scale for VC, RM1 and RM2
  Speed Over Time: speed in MHz (log scale, 1 to 1000) by year (1975 to 2000), points labelled 86, 286, 386, 486, 586 and 686
  Competition: values on a 0 to 30 scale for CU, AT&T, LIMSI and SRI]

Page 21: Towards Superhuman Speech Recognition

What Types of Data Do We Need?

Condition: Total Amount
  Targets: 5000 hours speech; 10 GB LM data
  Currently Available (U.S.): 1000 hours speech; 1 GB LM data

Condition: Styles
  Targets: Imperatives; Queries; Fluent conversation; Declamatories
  Currently Available (U.S.): C&C tasks; ATIS/DC; SWB/BN/Meetings; WSJ/Voicemail/BN

Condition: Environments
  Targets: High bandwidth/High SNR; Low bandwidth/High SNR; Low SNR
  Currently Available (U.S.): WSJ/BN; SWB/Voicemail; Meetings

Condition: Domains
  Targets: Low perplexity; Medium perplexity; High perplexity
  Currently Available (U.S.): Digits, spelling; DC/ATIS; SWB/VM/WSJ/BN

Page 22: Towards Superhuman Speech Recognition

Some Concrete Suggestions

Target: 5000 Hours of transcribed spontaneous speech

2000 Hours/year

50000 hours/year (25)

5000 hours of speech

Cost ~ $1M

Test data: Mixture of current and new sources
• Switchboard, Voicemail, BN, DC, OGI
• SPEECON, Meetings

Sources of new data:
• Supergirl By David Odell Script - Revised Screenplay Word Document
• Superman: The Motion Picture By Mario Puzo Early Draft Script
• Superman: The Motion Picture By Mario Puzo Shooting Script
• Superman II Directed By Richard Donner Script - Early Version
• Superman II Directed By Richard Lester Script Later Version
• Superman II Shooting Script
• Superman IV: The Quest for Peace By Christopher Reeve, Script -
• Superman: The Man Of Steel By Alex Ford & J Ellison Script - Unproduced
• Superman Lives By Kevin Smith Script - Unproduced
• Superman Lives By Dan Gilroy Script synopsis Unproduced

Page 23: Towards Superhuman Speech Recognition

Conclusions

• Speech recognition performance not adequate
• Human performance figures suggest that we still have enormous room for improvement
• Presented several new algorithms to attack problem aggressively
• Suggested training and test methodology to drive research
• Communal participation critical to push ahead