Development of CMU Sphinx from 2004 to July 2006
An Observer’s Perspective
Arthur Chan, Evandro Gouvea,
David Huggins-Daines, Mosur Ravishankar,
Alex Rudnicky, Yitao Sun
[For the off-line viewer: this is Arthur Chan’s conclusion.
Joint consideration of the 3 software components is crucial.
Read on and you’ll see his argument.]
Perspective
• Mainly Arthur Chan’s observations, from two roles:
  – As a developer: “The Grand Janitor”
  – As an observer of events: a historian
What is CMU Sphinx?
• Definition 1:
  – Large-vocabulary speech recognizers with high accuracy and speed performance.
• Definition 2:
  – A collection of tools and resources that enables developers and researchers to build successful speech recognition systems.
Family of CMU Sphinx
• Decoders
  – Sphinx {II – IV}
  – PocketSphinx (by Dave, since Oct 2005)
• Acoustic Model Trainer
  – SphinxTrain
• Language Model Trainer (new)
• Documentation
  – Hieroglyphs
  – Robust/SphinxTrain Tutorial
The Sphinx Developers
• Sphinx is maintained by
  – Volunteer programmers/researchers who like speech recognition
  – All contributions go to the same codebase
  – Goal: sustainable development of Sphinx
• Sphinx Developer Meetings are held
  – Regularly (as in an aperiodic function)
  – Secretly (in the sense that everyone knows)
  – To decide the way to go in Sphinx
Outline (~30 pages)
• Software of Speech Recognition
  – How should we develop it? What should comprehensive software do?
• CMU Sphinx, Before/After
• Lessons Learned
• (Optional) Team and Structure.
What It Means to Software
• Philosophy behind the old black box:
  “When you don’t know, search.”
• The old Black Box:
  – Strongly focuses on the decoder
  – Tends to ignore other important components (e.g. models)
The Noisy Channel Point of View
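\[
\begin{aligned}
\hat{W} &= \arg\max_{W \in W^*} P(W \mid A) \\
        &= \arg\max_{W \in W^*} \frac{P(A \mid W)\,P(W)}{P(A)} \\
        &= \arg\max_{W \in W^*} P(A \mid W)\,P(W)
\end{aligned}
\]
[Diagram: A → Decoder → W]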
What it means to software
• We need to represent and estimate the parameters of the acoustic model: $P(A \mid W)$
• We need to represent and estimate the parameters of the language model: $P(W)$
• Given the models, we need to search through all possible word sequences, i.e. decoding:
  $\hat{W} = \arg\max_{W \in W^*} P(A \mid W)\,P(W)$
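To make the three pieces concrete, here is a minimal Python sketch of the decoding rule over a tiny, explicitly listed candidate set. The candidate sentences and all probabilities are made-up numbers for illustration, and `decode` is a hypothetical helper, not part of any Sphinx decoder. A real decoder searches a vastly larger space and never enumerates it like this.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the word sequence W maximizing log P(A|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

# Hypothetical acoustic scores log P(A|W) for one utterance A.
acoustic_logprob = {
    ("recognize", "speech"): math.log(0.30),
    ("wreck", "a", "nice", "beach"): math.log(0.35),
}
# Hypothetical language model scores log P(W).
lm_logprob = {
    ("recognize", "speech"): math.log(0.010),
    ("wreck", "a", "nice", "beach"): math.log(0.001),
}

candidates = list(acoustic_logprob)
best = decode(candidates, acoustic_logprob, lm_logprob)
# The acoustic model alone would pick a different candidate.
acoustic_only = max(candidates, key=lambda w: acoustic_logprob[w])
print(best, acoustic_only)
```

In this toy case the acoustic model slightly prefers the wrong sentence and the language model corrects it, which is one way to see why the models matter as much as the search.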
A New Black Box
[Diagram: Acoustic Signal → Speech Recognizer → Word Sequence. The recognizer uses an Acoustic Model (built by the AM Trainer) and a Language Model (built by the LM Trainer).]
The New Black Box
• Philosophy of the New Black Box:
  “When you don’t know, search with your knowledge.”
• Advantages of the New Black Box
  – Programmers tend to consider the problems jointly
  – Reduces communication issues between module owners
The New Black Box vs The Old Black Box
• The Old Black Box
  – Narrows the ways we think about the problem
  – Motivates research solely on search algorithms
• The New Black Box
  – Doesn’t ignore the fact that search is important
  – But gives the correct emphasis to all the necessary components
Sphinx and Friends (2004 Jan)
[Diagram: Acoustic Signal → Sphinx Siblings → Word Sequence. The Acoustic Model comes from SphinxTrain; the Language Model comes from the CMU-Cambridge LM Toolkit V2.]
Issues at the time
• Sleeping decoders (Sphinx Siblings)
  – Strength:
    • Comprehensive product line
  – Issues: decoders came in many versions, and code tended to be duplicated
    • Sphinx 2 -> fast but not accurate
    • Sphinx 3.0 -> very accurate but very slow
    • Sphinx 3.3 -> accurate, faster than Sphinx 3.0 but slower than 1xRT
    • Sphinx 4 -> not yet completed
Issues at the time (cont.)
• AM Trainer (SphinxTrain)
  – Strength: it works
  – Weakness: what we supported was simple
    • Where is speaker adaptation?
• LM Trainer (CMU-Cambridge LM Toolkit V2)
  – Strength: it works
  – Weakness:
    • Software was sleeping -> development had stopped
    • Important functionality wasn’t in the package, e.g. LM interpolation
General Comments at the time:
• “Sphinx cannot do feature Y.”
• “You have no idea what you are up to.”
• “No one is working on Sphinx any more.”
• “Our job is not difficult but very challenging.” – Prof. Alex Rudnicky
• “Sphinx is cursed.” (I made this one up.)
• “The riddle of Sphinx couldn’t be solved.” – Made up by Arthur Toth at SphinxLunch
Sphinx and Friends (now)
[Diagram: Acoustic Signal → Sphinx Brothers → Word Sequence. The Acoustic Model comes from SphinxTrain (debugged); the Language Model comes from the CMU-Cambridge LM Toolkit V3 alpha.]
Sphinx Brothers now
• Sphinx 2
  – Can now use CDHMM
  – Can now use FST
• Sphinx 3.X (gimmicky name of Sphinx 3)
  – Can run faster given the right magic tuning string
  – Merging of Sphinx 3.0 and Sphinx 3.3
  – Supports speaker adaptation
  – Re-architected
Sphinx Brothers now (cont.)
• Sphinx 4
  – With great effort from the Sun developers and, mainly, the super speech advisors
  – Beta completed
  – Quite popular with users and new startups
• PocketSphinx (by Dave)
  – Newly added member of the family
  – First open source embedded LVCSR
Project L
• Project L: Project Ladon
  – Goal: extension and re-development of the CMU-Cambridge LM Toolkit V2
• Final product:
  – CMU-Cambridge LM Toolkit Version 3 (alpha)
Story of V3: 3 “Young” Persons and Their Inspiring Stories
• Young Student: wrote the Perl scripts
  – Utterly frustrated by training LMs, decided to write a set of new Perl scripts
• Young Faculty: convinced us to license the code under BSD
  – Wanted to see the LM toolkit under BSD again but had no time.
• Young Staff: added 32-bit LM support
  – Had nothing to do on the flight back to HK.
  – Wanted to do something he thought was useless.
Function of V3 alpha
• Supports more than 65k words (32-bit LM)
• Perl wrapper by Young Student
  – One-step LM training
  – Simplified process for LM interpolation and class-based LM training
• New functionality
  – LM interpolation (lm_combine) (by Wen, Moss, Dave)
  – Random text generation with 3-grams (by Arthur Toth)
  – Modified Kneser-Ney smoothing (by Prof. Yannick Estève from LIUM)
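As a rough illustration of what LM interpolation means, the sketch below linearly mixes two word distributions for the same history with a weight `lam`. This is the generic textbook formulation, not a description of how `lm_combine` is implemented. The function name, the two toy models and the weight are all assumptions made for the example.

```python
def interpolate(p_a, p_b, lam):
    """Linear interpolation: P(w|h) = lam * P_a(w|h) + (1 - lam) * P_b(w|h)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    vocab = set(p_a) | set(p_b)
    # Words missing from one model get probability 0 from that model.
    return {w: lam * p_a.get(w, 0.0) + (1.0 - lam) * p_b.get(w, 0.0)
            for w in vocab}

# Hypothetical distributions over the next word for one history.
p_news = {"stocks": 0.6, "rain": 0.4}
p_weather = {"stocks": 0.1, "rain": 0.9}

mixed = interpolate(p_news, p_weather, lam=0.5)
print(mixed)
```

Because both inputs sum to 1 and the weights sum to 1, the mixture is still a valid distribution. In practice the weight is usually tuned on held-out text from the target domain.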
Blessing for this change
• Supported by the license
• Permission from all copyright owners
  – Prof. Rosenfeld (who also makes decisions on licensing issues for CMU)
  – Dr. Robinson (who also makes decisions on licensing issues for Cambridge)
  – Dr. Clarkson
  – Blessing mails sent to the public mailing list
• V3 will be re-licensed under BSD
SphinxTrain now
• Now supports speaker adaptation
  – MLLR
  – MAP
  – VTLN
• Fixed many bugs
  – Still many to go
• Integrated into the tutorials.
• NR code finally removed, so we can distribute it now.
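For readers unfamiliar with the adaptation methods listed above, here is a small sketch of the standard MAP update for a single Gaussian mean: the adapted mean is a weighted average of the speaker-independent prior mean and the adaptation data. It is the textbook formula only, and it is not taken from the SphinxTrain code. The function name, the prior weight `tau`, and the frame values are illustrative assumptions.

```python
def map_adapt_mean(prior_mean, frames, gammas, tau):
    """MAP mean update: (tau * prior + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)."""
    total_gamma = sum(gammas)
    adapted = []
    for d, mu in enumerate(prior_mean):
        # Occupancy-weighted sum of the adaptation frames in dimension d.
        weighted_sum = sum(g * x[d] for g, x in zip(gammas, frames))
        adapted.append((tau * mu + weighted_sum) / (tau + total_gamma))
    return adapted

# Hypothetical 2-dimensional example with two adaptation frames.
prior_mean = [0.0, 0.0]
frames = [[1.0, 1.0], [3.0, 1.0]]
gammas = [1.0, 1.0]   # occupation probabilities of this Gaussian per frame
adapted_mean = map_adapt_mean(prior_mean, frames, gammas, tau=2.0)
print(adapted_mean)
```

With little adaptation data the result stays near the prior mean, and with more data it moves toward the speaker's own statistics. MLLR, by contrast, ties many Gaussians to a shared linear transform of the means.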
Technology explored in last few years
• Search / GMM Computation
• Speaker Adaptation and Normalization
• Embedded Speech Recognition
• AM Training
• LM Training
Future Opportunities - Think the Three Modules Together
• Technology
  – N-gram (N>3) (LMtk + SX)
  – On-line adaptation (SX + ST)
  – On-line training (SX + ST)
• Software
  – Integrated package with comprehensive support for SR (SX + ST + LMtk)
  – Dictation (SX + ST + LMtk)
Before/After, the difference
• More time spent securing training (both AM and LM)
• Architecture has been re-thought within and across modules.
• Our food chain is secured in the repository
  – AM, LM and decoder code are under one code base (cmusphinx)
Some Good Signs
• SourceForge’s Project of the Month (March 2006)
• Starting to be decently competitive again
• Someone used our decoder(s) and they look happy
  – Users actually say “Thank You”.
• Some companies used our recognizer
  – (Some of them dare to make profits.)
Some Observations
• We still need to catch up in accuracy.
  – Mainly through better algorithmic support for domain-specific development
• Some observations
  – Today’s 10xRT system becomes a 1xRT system 5 years later
  – Today’s most accurate system becomes the baseline (BL) of next year’s most accurate system
• Now seems to be just another starting point.
Lesson 1
• Anyone who tries to solve a legacy problem becomes a legacy problem
  – Corollary 1: Many legacy decisions could actually be clever
  – Corollary 2: Not every change is good
Lesson 2: on Research
• Most of the WER decline comes from better acoustic models and language models
  – Corollary 1:
    • The trainers are actually the key piece of development.
  – Corollary 2:
    • We should now focus on 1) acoustic segmenter, 2) speaker adaptation and 3) discriminative training.
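Since the lesson is stated in terms of WER (word error rate), a short sketch of the usual definition may help: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and the sample sentences are assumptions for illustration, and this is not the scoring tool used by the Sphinx team.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

The example has one substitution and one deletion against six reference words, so the WER is 2/6.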
Lesson 3: on Development
• Why does some of our code never go into Sphinx?
  – Code without source control is close to useless
  – Corollary 1: If you want your code to survive, check it in.
  – Corollary 2: If you don’t know what source control is, you probably need to learn it.
Lesson 4: My Favorite, the Current Sphinx Motto
• “Never over/under-estimate yourself, you never know what kind of mess you could make.”
  – Dr. Evandro Gouvêa
Hiring: The Grand Janitor 2nd – Mixture of Several Jobs.
• Release Manager - Kick other people to fix various things
• Speech Scientist - Tell users to give up when they randomly read some useless papers.
• System Architect - Rewrite the code in many different ways but do the same thing
• Mediator of Conflicts - Write pseudo-philosophical comments
• Core Developer - Write crappy code and occasionally debug it
• Advisor - Do what Dr. Phil does on your friends, your users and, most importantly, your boss(es) and ex-bosses
Acknowledgement – CMU-Cambridge LM Toolkit
• Contributors:
  – David Huggins-Daines,
  – Ananlada Chotimongkol,
  – Arthur Toth,
  – Xu Wen,
  – Prof. Yannick Estève at LIUM.
How does it work? The Wrong Model
• 1, A leader yells: “Sphinx Team Assemble!!”
• 2, The team then assembles and follows the leader’s commands.
• 3, Things get done.
• 4, Once again Sphinx Team has saved the day!
How does it really work? 1-3/10 steps
• 1, Someone in the team dreams up a new feature.
• 2, He communicates with the team:
  – “What do you guys think?”
• 3, Developers start to give their “two cents” on the problem, e.g.
  – Arthur: “According to Harry G. Frankfurt, what you talk about is B.S.”
  – Evandro: “Don’t underestimate yourself, you don’t know what kind of mess you will make.”
  – Dave: “That doesn’t sound like the best idea……”
The guy doesn’t give up and others give OK (4-6/10 steps)
• 4, He goes on to implement the code.
• 5, He checks the code in.
• 6, Peer review happens right after the code check-in. Example comments:
  – Arthur: “That is not the right balance according to Yin and Yang.”
  – Evandro: “I wonder whether you know C programming.”
  – Dave: “What is the rationale behind your change?”
  – Yitao: “*Sigh*, I need to recompile Speechalyzer and Smartnote again.”
Automatic Tests (the final tests)
• 7, Run make check
  – Make sure there is no FAIL in testing
  – Requires passing 70 to 80 tests.
• 8, Standard regression tests (make perf-std)
  – Run tests on 3 corpora and make sure the results match past results
• 9, Machines automate both 7 and 8
  – Mails are sent to everyone daily
• 10, The code could finally screw up people around the world!
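The idea behind step 8 can be sketched in a few lines of Python: compare the current result on each corpus against a stored baseline and flag anything that differs by more than a tolerance. This is only an illustration of the concept, not what `make perf-std` actually runs. The corpus names, the WER figures, the tolerance and the function name are all hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return {corpus: (baseline, current)} for results that no longer match."""
    mismatches = {}
    for name in sorted(baseline):
        new_value = current.get(name)  # None if the corpus was not run
        if new_value is None or abs(new_value - baseline[name]) > tolerance:
            mismatches[name] = (baseline[name], new_value)
    return mismatches

# Hypothetical WER (%) on three corpora: stored baseline vs. current build.
baseline = {"corpus_a": 12.5, "corpus_b": 20.1, "corpus_c": 7.8}
current = {"corpus_a": 12.5, "corpus_b": 20.4, "corpus_c": 7.8}

regressions = find_regressions(baseline, current, tolerance=0.1)
for name, (old, new) in regressions.items():
    print(f"MISMATCH {name}: baseline {old}, current {new}")
```

A nightly job could mail this report to the team, in the spirit of step 9.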
“The Sphinx Developers”
• Members are all funded by CMU.
  – Different purposes, but everyone checks in to the same code base
• Common goal priority:
  – Accuracy
  – Speed & accuracy trade-off
  – Memory
  – Interface
  – Features
  – User-friendliness
Characteristic of our Development
• The role of manager/lead developer is significantly weakened
• Releases can take some time
  – Requires good release management
• Good architecture is very important
• Requires skillful and knowledgeable programmers
• Highly practical: results are worth more than words and opinions
Missions of the team
• Take care of CMU’s daily need for quality SR
• Continue to improve the system
• Bridge industry and academia.