IBM
ASR Workshop Paris, France 18-20 Sept 2000
Towards Superhuman Speech Recognition
Mukund Padmanabhan and Michael Picheny
Human Language Technologies Group
IBM Thomas J. Watson Research Center
Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group
Common UI Folklore
“Except when interacting with video games, a user does not take very well to surprises”
Human-Computer Interaction, Dix, Finlay, Abowd and Beale
“Golden Rule #3: Make the interface consistent”
The Elements of User Interface Design, Mandel
“Computer users usually seek predictable responses and are discouraged if they must engage in clarification dialogs frequently”
Designing the User Interface, Shneiderman
Speech Recognition Progress
[Figure: error rate (log scale, 0.1 to 100) by year, 1985 to 2005, for RM, ATIS, WSJ, SWB, BN, VoiceMail and TIDigits]
Human Performance (Lippmann, 1997)
[Figure, four panels comparing machine and human error rates:
  Digits: string error rate
  Letters: word error rate
  Wall Street Journal: word error rate versus speech-to-noise ratio (10, 16, 22 dB and quiet)
  Switchboard: word error rate]
Problem Categorization

Task                Speech style                   Addressee  Channel                   Error rate
Dictation (WSJ)     Well formed                    Computer   Full BW                   7%
Broadcast News      Varied, primarily well formed  Audience   Mixed, primarily full BW  12%
DARPA Communicator  Spontaneous                    Computer   Telephone BW              16%
SWB                 Spontaneous                    Person     Telephone BW              20-30%
Voicemail           Spontaneous                    Person     Telephone BW              30%
Meetings            Spontaneous                    People     Far-field                 55%
Domain Dependence

Test data     Training data
              Transaction  Switchboard  Voicemail
YP            4.39         6.44         8.55
Digits        1.34         1.86         2.36
Switchboard   --           39           57
Voicemail     --           47           36
Observations
1. Spontaneous speech: largest effect on WER (Switchboard, Voicemail, Meetings, real-world speech)
2. Multi-environment speech sources (16K, 8K, far-field microphone, noisy ...)
3. Multi-domain speech sources (dictation, travel, call center, small vocab, broadcast news)
4. Domain-dependence of performance
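The observations above are stated in terms of word error rate (WER). As a reminder of how that figure is usually computed, here is a minimal sketch: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the reference length. The function name and the example word lists are illustrative assumptions, not taken from the talk.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(reference)


# Hypothetical example: one substitution and one deletion in five words.
REFERENCE = "arriving at four twenty two".split()
HYPOTHESIS = "arriving at for twenty".split()
EXAMPLE_WER = word_error_rate(REFERENCE, HYPOTHESIS)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.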
Focus areas
Improve spontaneous speech models
1. Articulatory modeling
2. Prosodic features
3. Segmental graphical models
4. Joint parameter estimation
5. Speaker separation for multi-speaker speech
6. Data collection for "meeting speech"
Multi-environment
1. Non-linear feature space transformation
2. Hidden observations
Multi-domain
1. Multistyle training
2. Domain independent LM
Objective: Develop a speech recognition system that mimics human performance (independent of environment and domain, working as well for spontaneous as for carefully enunciated speech)
• 30% improvement
• No initial decoding
A Language Model that Works Well on Many Domains
• Different (static) language models work best on different domains
• Use dynamic adaptation to make a generic LM act like a domain-specific LM
  – Generic LM: linear interpolation of a collection of domain-specific LMs (SWB, BN, digit/date grammar, etc.)
  – Adapt by dynamically adjusting interpolation weights
• Want to be able to adapt quickly
  – At the word/sentence level, not at the document level
Um, yeah. Well, anyway, I’ll be arriving at four twenty two p.m. on flight fifty six. Say hi to mom. Oh, and don’t forget to buy IBM at one forty-four.
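The generic LM described on this slide can be sketched as a weighted sum of component models. The sketch below assumes toy unigram models stored as dictionaries and a fixed floor probability for unseen words; the names `interpolated_prob`, `TRAVEL_LM` and `FINANCE_LM` and all probabilities are hypothetical, and real component LMs would condition on word history.

```python
from typing import Dict, List

Lm = Dict[str, float]  # toy unigram model: word -> probability (assumption)

# Hypothetical component models with made-up probabilities.
TRAVEL_LM: Lm = {"flight": 0.4, "arriving": 0.3, "at": 0.2, "buy": 0.1}
FINANCE_LM: Lm = {"buy": 0.5, "ibm": 0.3, "at": 0.2}


def interpolated_prob(word: str, lms: List[Lm], weights: List[float],
                      floor: float = 1e-6) -> float:
    """P(word) = sum_i weights[i] * P_i(word), with a floor for unseen words."""
    if len(lms) != len(weights):
        raise ValueError("need exactly one weight per component LM")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * lm.get(word, floor) for w, lm in zip(weights, lms))


# Shifting weight toward the travel LM makes the generic LM behave more like
# a travel-specific LM on a travel word.
P_TRAVEL_HEAVY = interpolated_prob("flight", [TRAVEL_LM, FINANCE_LM], [0.9, 0.1])
P_FINANCE_HEAVY = interpolated_prob("flight", [TRAVEL_LM, FINANCE_LM], [0.1, 0.9])
```

In the example utterance the best weights change mid-sentence (flight details, then a stock order), which is why adaptation at the word or sentence level matters.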
Adapting Language Model Interpolation Weights
• Simply re-estimate weights to maximize likelihood of adaptation data (like dynamic deleted interpolation)
  – Can be quite slow because a lot of evidence has to be accumulated
• Add a hidden variable to the model that tracks which domain LM is currently being used (Bayesian adaptation)
  – Rate of adaptation can be fast, can depend on context, and can be trained on domain-labelled data
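The two options above can be contrasted in a small sketch. It is not the authors' implementation: it assumes toy unigram component LMs, uses standard EM for the maximum-likelihood weights, and models the hidden domain variable as a Markov chain with a single hypothetical `stay` probability, updated by forward filtering. All names and numbers are illustrative.

```python
from typing import Dict, List, Tuple

Lm = Dict[str, float]
FLOOR = 1e-6  # assumed probability for words a component LM has not seen

# Hypothetical component models with made-up probabilities.
TRAVEL_LM: Lm = {"flight": 0.4, "arriving": 0.3, "at": 0.2, "buy": 0.1}
FINANCE_LM: Lm = {"buy": 0.5, "ibm": 0.3, "at": 0.2}


def em_reestimate(words: List[str], lms: List[Lm], weights: List[float],
                  iterations: int = 10) -> List[float]:
    """Maximum-likelihood interpolation weights for the adaptation words."""
    for _ in range(iterations):
        counts = [0.0] * len(lms)
        for word in words:
            joint = [lam * lm.get(word, FLOOR) for lam, lm in zip(weights, lms)]
            total = sum(joint)
            for i, value in enumerate(joint):
                counts[i] += value / total  # posterior that LM i produced word
        weights = [c / len(words) for c in counts]
    return weights


def bayes_track(words: List[str], lms: List[Lm], prior: List[float],
                stay: float = 0.9) -> Tuple[List[List[float]], List[float]]:
    """Forward filtering over a hidden 'current domain' variable.

    Returns the weights used to score each word and the final belief.
    Assumes at least two component LMs.
    """
    n = len(lms)
    switch = (1.0 - stay) / (n - 1)
    belief = list(prior)
    used_weights = []
    for word in words:
        # Predict: the domain may switch before each word.
        predicted = [sum(belief[j] * (stay if i == j else switch)
                         for j in range(n)) for i in range(n)]
        used_weights.append(predicted)
        # Update: reweight by how well each domain LM explains the word.
        joint = [p * lm.get(word, FLOOR) for p, lm in zip(predicted, lms)]
        total = sum(joint)
        belief = [value / total for value in joint]
    return used_weights, belief


WORDS = ["arriving", "at", "flight", "buy", "ibm"]
ML_WEIGHTS = em_reestimate(WORDS, [TRAVEL_LM, FINANCE_LM], [0.5, 0.5])
USED_WEIGHTS, FINAL_BELIEF = bayes_track(WORDS, [TRAVEL_LM, FINANCE_LM], [0.5, 0.5])
```

In this toy run the EM weights average over the whole utterance, while the filtered belief swings toward the finance LM within a word or two of the topic change. In a fuller model the switching probabilities could depend on context and be estimated from domain-labelled data, as the slide suggests.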
Other Factors Driving Progress
[Figure, three panels:
  Vocabulary Independent SR (Hon, 1992): labels VC, RM1, RM2
  Speed Over Time: processor speed in MHz (log scale, 1 to 1000) by year, 1975 to 2000, for the 86, 286, 386, 486, 586 and 686 processor generations
  Competition: CU, AT&T, LIMSI, SRI]
What Types of Data Do We Need?
Condition     Targets                   Currently Available (U.S.)
Total Amount  5000 hours speech         1000 hours speech
              10 GB LM data             1 GB LM data
Styles        Imperatives               C&C tasks
              Queries                   ATIS/DC
              Fluent conversation       SWB/BN/Meetings
              Declamatories             WSJ/Voicemail/BN
Environments  High bandwidth/High SNR   WSJ/BN
              Low bandwidth/High SNR    SWB/Voicemail
              Low SNR                   Meetings
Domains       Low perplexity            Digits, spelling
              Medium perplexity         DC/ATIS
              High perplexity           SWB/VM/WSJ/BN
Some Concrete Suggestions
Target: 5000 hours of transcribed spontaneous speech
2000 hours/year
50000 hours/year (25)
5000 hours of speech
Cost ~ $1M
Test data: Mixture of current and new sources
• Switchboard, Voicemail, BN, DC, OGI
• SPEECON, Meetings
Sources of new data:
Conclusions
• Speech recognition performance is not adequate
• Human performance figures suggest that we still have enormous room for improvement
• Presented several new algorithms to attack the problem aggressively
• Suggested training and test methodology to drive research
• Communal participation is critical to push ahead