Towards Superhuman Speech Recognition

IBM

ASR Workshop Paris, France 18-20 Sept 2000

Mukund Padmanabhan and Michael Picheny
Human Language Technologies Group
IBM Thomas J. Watson Research Center

Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group



Common UI Folklore

“Except when interacting with video games, a user does not take very well to surprises”
Human-Computer Interaction, Dix, Finlay, Abowd and Beale

“Golden Rule #3: Make the interface consistent”
Elements of User Interface Design, Mandel

“Computer users usually seek predictable responses and are discouraged if they must engage in clarification dialogs frequently”
Designing the User Interface, Shneiderman



Speech Recognition Progress

[Figure: error rate (log scale, 0.1 to 100) versus year (1985 to 2005) for RM, ATIS, WSJ, SWB, BN, VoiceMail and TIDigits.]



Human Performance (Lippmann, 1997)

[Figure: four panels comparing machine and human error rates.
Digits: string error rate, machine vs. human (axis 0 to 0.8).
Letters: word error rate, machine vs. human (axis 0 to 6).
Wall Street Journal: word error rate vs. speech-to-noise ratio (10, 16, 22 dB and quiet), machine vs. human (axis 0 to 14).
Switchboard: word error rate, machine vs. human (axis 0 to 50).]



Problem Categorization

Task                Speech style                   Spoken to  Channel                   Error rate
Dictation (WSJ)     Well formed                    Computer   Full BW                   7%
Broadcast News      Varied, primarily well formed  Audience   Mixed, primarily full BW  12%
DARPA Communicator  Spontaneous                    Computer   Telephone BW              16%
SWB                 Spontaneous                    Person     Telephone BW              20-30%
Voicemail           Spontaneous                    Person     Telephone BW              30%
Meetings            Spontaneous                    People     Far-field                 55%



Domain Dependence

              Training Data
Test set      Transaction  Switchboard  Voicemail
YP            4.39         6.44         8.55
Digits        1.34         1.86         2.36
Switchboard   --           39           57
Voicemail     --           47           36
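
As a rough worked example of the domain-dependence figures above, the sketch below computes how much the error rate rises when the training data does not match the test domain. It assumes that the rows are test sets, that the columns are training sets, and that the Switchboard and Voicemail entries are percent error rates; the slide does not state any of this explicitly, and the function name is made up for illustration.

```python
# Values copied from the Switchboard and Voicemail rows above.
# Assumption: key is (test set, training set), value is percent error.
error_rate = {
    ("Switchboard", "Switchboard"): 39.0,
    ("Switchboard", "Voicemail"): 57.0,
    ("Voicemail", "Switchboard"): 47.0,
    ("Voicemail", "Voicemail"): 36.0,
}


def relative_increase(test_set: str, train_set: str) -> float:
    """Relative increase in error over matched training, in percent."""
    matched = error_rate[(test_set, test_set)]
    mismatched = error_rate[(test_set, train_set)]
    return 100.0 * (mismatched - matched) / matched


swb_penalty = relative_increase("Switchboard", "Voicemail")
vm_penalty = relative_increase("Voicemail", "Switchboard")
print(f"Switchboard test, Voicemail training: +{swb_penalty:.1f}% relative")
print(f"Voicemail test, Switchboard training: +{vm_penalty:.1f}% relative")
```

Under these assumptions, mismatched training costs roughly 46% relative on Switchboard and roughly 31% relative on Voicemail.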



Observations
1. Spontaneous speech: largest effect on WER (Switchboard, Voicemail, Meetings, real-world speech)
2. Multi-environment speech sources (16K, 8K, far-field microphone, noisy ...)
3. Multi-domain speech sources (dictation, travel, call center, small vocab, broadcast news)
4. Domain-dependence of performance

Focus areas

Improve spontaneous speech models
1. Articulatory modeling
2. Prosodic features
3. Segmental graphical models
4. Joint parameter estimation
5. Speaker separation for multi-speaker speech
6. Data collection for "meeting speech"

Multi-environment
1. Non-linear feature space transformation
2. Hidden observations

Multi-domain
1. Multistyle training
2. Domain independent LM

Objective: Develop speech recognition system that mimics human performance (independent of environment, domain, works as well for spontaneous as for carefully enunciated speech)

•30% Improvement

•No initial decoding


A Language Model that Works Well on Many Domains

• Different (static) language models work best on different domains
• Use dynamic adaptation to make a generic LM act like a domain-specific LM
  – Generic LM: linear interpolation of collection of domain-specific LMs (SWB, BN, digit/date grammar, etc.)
  – Adapt by dynamically adjusting interpolation weights
• Want to be able to adapt quickly
  – At the word/sentence level, not at the document level

Um, yeah. Well, anyway, I’ll be arriving at four twenty two p.m. on flight fifty six. Say hi to mom. Oh, and don’t forget to buy IBM at one forty-four.
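
The generic LM described above can be sketched as a weighted sum of domain-specific word probabilities. This is only a minimal illustration, not the system from the talk: the two toy unigram models, their probabilities, and the names `DOMAIN_LMS` and `interpolated_prob` are all made up, and a real system would interpolate n-gram models.

```python
from typing import Dict

# Hypothetical toy unigram "domain LMs" (assumption for illustration only).
DOMAIN_LMS: Dict[str, Dict[str, float]] = {
    "conversational": {"um": 0.4, "yeah": 0.4, "four": 0.1, "flight": 0.1},
    "travel": {"um": 0.05, "yeah": 0.05, "four": 0.4, "flight": 0.5},
}


def interpolated_prob(word: str, weights: Dict[str, float]) -> float:
    """P(word) = sum over domains d of weights[d] * P_d(word).

    The weights are assumed to be non-negative and to sum to 1.
    """
    return sum(weights[d] * DOMAIN_LMS[d].get(word, 0.0) for d in DOMAIN_LMS)


# A generic LM uses fixed weights; adaptation shifts them towards one domain.
generic = {"conversational": 0.5, "travel": 0.5}
travel_like = {"conversational": 0.1, "travel": 0.9}

p_generic = interpolated_prob("flight", generic)      # 0.5*0.1 + 0.5*0.5
p_adapted = interpolated_prob("flight", travel_like)  # 0.1*0.1 + 0.9*0.5
print(p_generic, p_adapted)
```

Shifting the weights towards the travel model raises the probability of "flight", which is the effect dynamic adaptation aims for partway through an utterance like the one above.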



Adapting Language Model Interpolation Weights

• Simply re-estimate weights to maximize likelihood of adaptation data (like dynamic deleted interpolation)
  – Can be quite slow because a lot of evidence has to be accumulated
• Add hidden variable to the model that tracks which domain LM is currently being used (Bayesian adaptation)
  – Rate of adaptation can be fast, can depend on context, and can be trained on domain-labelled data
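
The two strategies above can be contrasted in a small sketch. The first function re-estimates the interpolation weights with EM over all adaptation words. The second treats the active domain as a hidden state and updates a belief over it word by word (forward filtering). Everything here is an assumption made for illustration: the toy unigram models, the `stay` probability, and the function names are hypothetical, and the talk does not specify the exact form of its hidden-variable model.

```python
from typing import Dict, List

# Hypothetical toy unigram domain LMs over a closed vocabulary (assumption).
LMS: Dict[str, Dict[str, float]] = {
    "chat": {"um": 0.45, "yeah": 0.45, "four": 0.05, "flight": 0.05},
    "travel": {"um": 0.05, "yeah": 0.05, "four": 0.40, "flight": 0.50},
}
DOMAINS: List[str] = list(LMS)


def em_reestimate(words: List[str], weights: Dict[str, float],
                  iterations: int = 10) -> Dict[str, float]:
    """Maximum-likelihood re-estimation of interpolation weights via EM."""
    for _ in range(iterations):
        counts = {d: 0.0 for d in DOMAINS}
        for w in words:
            joint = {d: weights[d] * LMS[d][w] for d in DOMAINS}
            total = sum(joint.values())
            for d in DOMAINS:
                counts[d] += joint[d] / total  # posterior that d produced w
        weights = {d: counts[d] / len(words) for d in DOMAINS}
    return weights


def predict(belief: Dict[str, float], stay: float) -> Dict[str, float]:
    """One transition step: keep the domain with prob. `stay`, else switch."""
    switch = (1.0 - stay) / (len(DOMAINS) - 1)
    return {d: sum(belief[p] * (stay if p == d else switch) for p in DOMAINS)
            for d in DOMAINS}


def track_domain(words: List[str], prior: Dict[str, float],
                 stay: float = 0.9) -> Dict[str, float]:
    """Forward filtering over the hidden domain; returns next-word weights."""
    belief = dict(prior)
    for w in words:
        predicted = predict(belief, stay)
        joint = {d: predicted[d] * LMS[d][w] for d in DOMAINS}
        total = sum(joint.values())
        belief = {d: joint[d] / total for d in DOMAINS}
    return predict(belief, stay)


words = ["um", "yeah", "four", "flight"]
uniform = {d: 1.0 / len(DOMAINS) for d in DOMAINS}
em_weights = em_reestimate(words, uniform)
tracked = track_domain(words, uniform)
print("EM weights:     ", em_weights)
print("Tracked weights:", tracked)
```

On this toy input the EM weights stay close to an even split, because they average over the whole word sequence. The tracked weights lean heavily towards the travel model after the last two words. That matches the slide's point that a hidden-variable model can adapt faster, though the numbers themselves come from invented probabilities.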



Other Factors Driving Progress

[Figure: three panels.
Vocabulary Independent SR (Hon, 1992): axis 6 to 16; category labels as extracted: VCRM1RM2.
Speed Over Time: speed in MHz (log scale, 1 to 1000) vs. year (1975 to 2000), with points labelled 86, 286, 386, 486, 586 and 686.
Competition: axis 0 to 30, for CU, AT&T, LIMSI and SRI.]



What Types of Data Do We Need?

Condition     Targets                     Currently Available (U.S.)
Total Amount  • 5000 hours speech         • 1000 hours speech
              • 10 GB LM data             • 1 GB LM data
Styles        • Imperatives               • C&C tasks
              • Queries                   • ATIS/DC
              • Fluent conversation       • SWB/BN/Meetings
              • Declamatories             • WSJ/Voicemail/BN
Environments  • High bandwidth/High SNR   • WSJ/BN
              • Low bandwidth/High SNR    • SWB/Voicemail
              • Low SNR                   • Meetings
Domains       • Low perplexity            • Digits, spelling
              • Medium perplexity         • DC/ATIS
              • High perplexity           • SWB/VM/WSJ/BN



Some Concrete Suggestions

Target: 5000 Hours of transcribed spontaneous speech

2000 Hours/year

50000 hours/year (25)

5000 hours of speech

Cost ~ $1M

Test data: Mixture of current and new sources
• Switchboard, Voicemail, BN, DC, OGI
• SPEECON, Meetings

Sources of new data:
• Supergirl By David Odell Script - Revised Screenplay Word Document
• Superman: The Motion Picture By Mario Puzo Early Draft Script
• Superman: The Motion Picture By Mario Puzo Shooting Script
• Superman II Directed By Richard Donner Script - Early Version
• Superman II Directed By Richard Lester Script Later Version
• Superman II Shooting Script
• Superman IV: The Quest for Peace By Christopher Reeve, Script -
• Superman: The Man Of Steel By Alex Ford & J Ellison Script - Unproduced
• Superman Lives By Kevin Smith Script - Unproduced
• Superman Lives By Dan Gilroy Script synopsis Unproduced



Conclusions

• Speech recognition performance is not adequate
• Human performance figures suggest that we still have enormous room for improvement
• Presented several new algorithms to attack the problem aggressively
• Suggested training and test methodology to drive research
• Communal participation is critical to push ahead
