Upload
ofershochet
View
224
Download
0
Embed Size (px)
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 1/25
IBM Research
© 2006 IBM Corporation
Superhuman Speech Recognition:Technology Challenges & Market Adoption
David NahamooIBM FellowSpeech CTO, IBM Research
July 2, 2008
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 2/25
IBM Research
© 2006 IBM Corporation2
•This chart represents all revenue for speech related ecosystem activity.
•Revenue exceeded $1B for the 1st time in 2006
•Note also that hosted services will represent ½ of speech related revenue in 2011
WW Voice-Driven Conversation Access Technology Forecast
*Opus Research 02_2007
Overall Speech Market Opportunity
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 3/25
IBM Research
© 2006 IBM Corporation3
Market UsageNeed-based SegmentationSpeech Segments
Mobile Internet, Yellow Pages
SMS, IM, email
Embedded - Automotive, Mobile
Devices, Appliances, Entertainment
Contact Centers, Tourism, Global
Digital Communities, Media (XCast)
Media, Medical, Legal, Education,
Government, Unified Messaging
Contact Centers, Government
Contact Centers, Media,
Government
Contact Centers
Information Search & Retrieval
Command & Control
Multilingual Communication
Information Access and provision
Security
Intelligence
Transaction/Problem Solving
Speech Search & Messaging
Speech Control
Speech Translation
Speech Transcription
Speech Biometrics
Speech Analytics
Speech Self Service
Speech Market Segments
• Improved accuracy• Much larger vocabulary speech recognition system
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 4/25
IBM Research
© 2006 IBM Corporation4
New Opportunity Areas
Contact Centers Analytics
– Quality Assurance, Real Time Alerts, Compliance
Media Transcription
– Closed captioning
Accessibility
– Government, Lectures
Content Analytics
– Audio-indexing, cross-lingual information retrieval, multi-media mining
Dictation
– Medical, Legal, Insurance, Education
Unified Communication
– Voicemail, Conference calls, email and SMS on hand held
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 5/25
IBM Research
© 2006 IBM Corporation5
Human
Baseline for
conversations
Target zone
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 6/25
IBM Research
© 2006 IBM Corporation6
Performance Results (2004 DARPA EARS Evaluation)
14
15
16
17
18
19
20
0.1 1 10 100
xRT
W E
R
IBM
V3
V2
V1-V4
V4
IBM: Best Speed-Accuracy Tradeoff
(Last public evaluation of English Telephony Transcription)
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 7/25
IBM Research
© 2006 IBM Corporation7
MALACH
Multilingual Access to Large Spoken Ar CHives
• Funded by NSF, 5-year project (Started in Oct. 2001)
Project Participants
– IBM, Visual History Foundation, Johns Hopkins University, University of Maryland,
Charles University and University of West Bohemia
Objective
– Improve access to large multilingual collections of spontaneous speech by
advancing the state-of-the-art in technologies that work together to achieve
this objective: Automatic Speech Recognition, Computer-Assisted Translation
, Natural Language Processing and Information Retrieval
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 8/25
IBM Research
© 2006 IBM Corporation8
MALACH: A challenging speech corpus
Emotional speech
• young man they ripped his teeth and beard out they beat him
Disfluencies
• A- a- a- a- band with on- our- on- our- arm
Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors,liberators, rescuers and witnesses of the Nazi Holocaust, recorded in 32
languages.
Goal: improved access to large multilingual spoken archives
Challenges:
Frequent interruptions:
• CHURCH TWO DAYS these were the people who were to go to
march TO MARCH and your brother smuggled himself SMUGGLED
IN IN IN IN
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 9/25
IBM Research
© 2006 IBM Corporation9
Word Error Rates
20
30
40
50
60
70
80
90
Jan. '02 Oct. '02 Oct. '03 June '04 Nov. '04
Effects of Customization (MALACH Data)
State-of-the-art ASR system
trained on SWB data (8KHz)
MALACH Training data seen
by AM and LM
fMPE, MPE, Consensus
decoding
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 10/25
IBM Research
© 2006 IBM Corporation10
Improvement in Word Error Rate for IBM embedded ViaVoice
0
1
2
3
4
5
6
WER across 3 car speeds and 4 grammars
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 11/25
IBM Research
© 2006 IBM Corporation11
Progress in Word Error Rate – IBM WebSphere Voice Server
Grammar Tasks over Telephone
2001 - 2006
0
1
2
3
4
5
6
WVS
Word Error Rate %
45% relative improvement in WER the last 2.5 years
20% relative improvement in speed in the last 1.5 years
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 12/25
IBM Research
© 2006 IBM Corporation12
Multi-Talker Speech Separation Task
male and female speaker at 0dB
Lay white at X 8 soon
Bin Green with F 7 now
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 13/25
IBM Research
© 2006 IBM Corporation13
Two Talker Speech Separation Challenge Results
R e c
o g n i t i o n E r r o r
Examples:
Mixture
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 14/25
IBM Research
© 2006 IBM Corporation14
1992 1993 1994 1995 1996 1997 1998 1999 2000
1
10
100
Voicemail
SWITCHBOARD
BROADCAST NEWS
BROADCAST-HUMAN
SWITCHBOARD-HUMAN
Comparison of Human & Machine Speech Recognition
WSJ Broadcast Conv Tel Vmail SWB Call center Meeting0
10
20
30
40
50
60
70
Clean Speech Spontaneous Speech
Human-Machine Human-Human
W o r d
E r r o r
R a t e
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 15/25
IBM Research
© 2006 IBM Corporation15
IBM’s Superhuman Speech Recognition
Universal Recognizer
• Any accent
• Any topic
• Any noise conditions
• Broadcast, phone, in car, or live
• Multiple languages
• Conversational
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 16/25
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 17/25
IBM Research
© 2006 IBM Corporation17
Acoustic Modeling Today
Approach: Hidden Markov Models
– Observation densities (GMM) for P( feature | class )
• Mature mathematical framework, easy to combine with linguistic information
• However, does not directly model what we want i.e., P( words | acoustics )
Training: Use transcribed speech data
– Maximum Likelihood
– Various discriminative criteria
Handling Training/Test Mismatches:
– Avoid mismatches by collecting “custom” data
– Adaptation & adaptive training algorithms
Significantly worse than humans for tasks with little or no linguistic
information - e.g., digits/letters recognition Human performance extremely robust to acoustic variations
– due to speaker, speaking style, microphone, channel, noise, accent, & dialect variations
Steady progress over the years
Continued progress using current methodology very likely in thefuture
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 18/25
IBM Research
© 2006 IBM Corporation18
Towards a Non-Parametric Approach to Acoustics General Idea: Back to pattern recognition basics!
– Break test utterance into sequence of larger segments (phone, syllable, word,phrase)
– Match segments to closest ones in training corpus using some metric (possiblyusing long distance models)
– Helps to get it right if you’ve heard it before
Why prefer this approach over HMMs?
– HMMs compress training by x1000; too many modeling assumptions
• 1000hrs ~ 30Gb; State-of-the-art acoustic models ~ 30Mb• Relaxing assumptions have been key to all recent improvements in acoustic modeling
How can we accomplish this? – Store & index training data for rapid access of training segments close to test segments
– Develop a metric D( train_seq, test_seq): obvious candidate is DTW with appropriate metric and warping rules
Back to the Future?
– Reminiscent of DTW & Segmental models from late 80’s – ME was missing
– Limited by computational resources (storage/cpu/data) then & so HMMs won
Implications:
– Need 100x more data for handling larger units (hence 100x morecomputing resources)
– Better performance with more data – likely to have “heard it before”
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 19/25
IBM Research
© 2006 IBM Corporation19
Utilizing Linguistic Information in ASRUtilizing Linguistic Information in ASR
Today’s standard ASR does not explicitly use linguistic information
– But recent work at JHU, SRI and IBM all show promise – Semantic structured LM improves ASR significantly for limited domains
Reduces WER by 25% across many tasks (Air Travel, Medical)
A large amount of linguistic knowledge sources now available, but not usedfor ASR Inside IBM
WWW text: Raw text: 50 million pages ~25 billion words, ~10% useful after cleanup News text: 3-4 billion words, broadcast or newswires Name entity annotated text: 2 million words tagged Ontologies Linguistic knowledge used in rule-based MT system
External WordNet, FrameNet, Cyc ontologies PennTreeBank, Brown corpus (syntactic & semantic annotated) Online dictionaries and thesaurus Google
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 20/25
IBM Research
© 2006 IBM Corporation20
•Acoustic Confusability: LM should be optimized to distinguish between acoustic confusable
sets, rather than based on N-gram counts
•Automatic LM adaptation at different levels: discourse, semantic structure, and phrase
levels
Coherence:
semantic,syntactic,pragmaticWash your clothes withsoap/ soup.
David and his/her fatherwalked into the room.
I ate a one/ nine poundsteak.
SemanticParserDialogue
State
Documen
t Type
Speaker(turn,
gender, ID) Syntactic
Parser
WordClass
Embedded
Grammar
NamedEntity
WorldKnowledge
W
1
, ..., WN
Super Structured LM for LVCSRSuper Structured LM for LVCSR
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 21/25
IBM Research
© 2006 IBM Corporation21
Combination Decoders “ROVER” is used in all current systems
– NIST tool that combines multiple system outputs through voting Individual systems currently designed in an ad-hoc manner
Only 5 or so systems possible
“I feel shine today”
“I veal fine today”
“I feel fine toady”
“I feel fine today”
An army (“Million”) of simple decoders
• Each makes uncorrelated errors
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 22/25
IBM Research
© 2006 IBM Corporation22
• Feature definition is key challenge• Maximum entropy model used to compute word probabilities.• Information sources combined in unified theoretical framework.
• Long-span segmental analysis inherently robust to both
stationary and transient noise
Million Feature Paradigm: Acoustic information for ASR
Discard transient noise;
Global adaptation for
stationary noise
Segmental analysis
Broadband features
Narrowband features Onset features
Trajectory features
Information Sources Noise Sources
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 23/25
IBM R h
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 24/25
IBM Research
© 2006 IBM Corporation24
Generalization Dilemma
P e r f o r m a n c
e
In-Domain Out-of-Domain
Test Conditions
Complex model:
brute force learning
Simple
model
Want to get here:
Model combination:
Can we at least get best
of both worlds?
Correct
complex
model(simple model
on the right
manifold)
The Gutter of Data Addiction
8/14/2019 SuperHuman Speech Recognition Jul 2 2008
http://slidepdf.com/reader/full/superhuman-speech-recognition-jul-2-2008 25/25