Development of CMU Sphinx From 2004 to 2006 Jul An Observer’s Perspective Arthur Chan Evandro Gouvea David Huggins-Daines Mosur Ravishankar Alex Rudnicky

Development of CMU Sphinx From 2004 to 2006 Jul

An Observer’s Perspective

Arthur Chan, Evandro Gouvea

David Huggins-Daines, Mosur Ravishankar

Alex Rudnicky, Yitao Sun

The Main Theme for Today:

What is the role of Software in Speech Recognition?

[For the off-line viewer]

[This is Arthur Chan’s conclusion: joint consideration of the 3 software components is crucial.]

Read on, you’ll see his argument.

Perspective

• Mainly Arthur Chan’s observations, from two roles:

– As a developer: “The Grand Janitor”

– As an observer of events: a historian

What is CMU Sphinx?

• Definition 1:

– Large-vocabulary speech recognizers with high accuracy and speed.

• Definition 2:

– A collection of tools and resources that enables developers/researchers to build successful speech recognition systems.

Family of CMU Sphinx

• Decoders

– Sphinx {II – IV}

– PocketSphinx (by Dave, since Oct 2005)

• Acoustic Model Trainer

– SphinxTrain

• Language Model Trainer (new)

• Documentation

– Hieroglyphs

– Robust/SphinxTrain Tutorial

The Sphinx Developers

• Sphinx is maintained by

– Volunteer programmers/researchers who like speech recognition

– All contributions go to the same codebase

– Goal: sustainable development of Sphinx

• Sphinx Developer Meetings are held

– Regularly (as in an aperiodic function)

– Secretly (in the sense that everyone knows)

– To decide the way forward for Sphinx

Outline (~30 pages)

• Software of Speech Recognition

– How should we develop it? What should comprehensive software do?

• CMU Sphinx, Before/After

• Lessons Learned

• (Optional) Team and Structure.

Software of Speech Recognition Systems

The Old Black Box

[Diagram: Acoustic Signal (A) → Speech Recognizer → Word Sequence (W)]

What It Means to Software

• Philosophy behind the old black box:

“When you don’t know, search.”

• The old Black Box:

– Strongly focuses on the decoder

– Tends to ignore other important components (e.g. models)

The Noisy Channel Point of View

$$
\hat{W} = \arg\max_{W \in W^*} P(W \mid A)
        = \arg\max_{W \in W^*} \frac{P(A \mid W)\,P(W)}{P(A)}
        = \arg\max_{W \in W^*} P(A \mid W)\,P(W)
$$

[Diagram: A → Decoder → W]

What it means to software

• We need to represent and estimate the parameters of the acoustic model, $P(A \mid W)$

• We need to represent and estimate the parameters of the language model, $P(W)$

• Given the models, we need to search through all possible word sequences, i.e. decoding: $\hat{W} = \arg\max_{W \in W^*} P(A \mid W)\,P(W)$
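The three pieces above (acoustic model, language model, search) can be sketched in a few lines. This is only a toy illustration, not how any Sphinx decoder is written: the candidate sentences and their probabilities are made-up values, and a real decoder searches a huge hypothesis space instead of scoring a fixed list.

```python
import math

# Hypothetical toy scores for one fixed acoustic signal A.
# A real system would compute these from trained models.
acoustic_logprob = {  # log P(A | W)
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.025),
}
language_logprob = {  # log P(W)
    "recognize speech": math.log(0.0100),
    "wreck a nice beach": math.log(0.0001),
}


def decode(candidates):
    """Return the W that maximizes log P(A|W) + log P(W)."""
    return max(
        candidates,
        key=lambda w: acoustic_logprob[w] + language_logprob[w],
    )


best = decode(list(acoustic_logprob))
print(best)
```

In this made-up example the acoustic model alone slightly prefers the wrong sentence; the language model is what tips the decision, which is the point of considering the components jointly.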

A New Black Box

[Diagram: Acoustic Signal → Speech Recognizer → Word Sequence, with the Acoustic Model (built by the AM Trainer) and the Language Model (built by the LM Trainer) as additional components]

The New Black Box

• Philosophy of the New Black Box:

“When you don’t know, search with your knowledge.”

• Advantages of the New Black Box

– Programmers tend to consider the problems jointly

– Reduces communication issues between module owners

The New Black Box vs The Old Black Box

• The Old Black Box

– Narrows our ways of thinking about the problem

– Motivates research solely on search algorithms

• The New Black Box

– Doesn’t ignore the fact that search is important

– But gives correct emphasis to all the necessary components

Current CMU Sphinx thinks in terms of the New Black Box

Before : CMU Sphinx (2004 Jan)

Sphinx and Friends (2004 Jan)

[Diagram: Acoustic Signal → Sphinx Siblings → Word Sequence, with Acoustic Model (SphinxTrain) and Language Model (CMU-Cambridge LM Toolkit V2)]

Issues at the time

• Sleeping Decoders (Sphinx Siblings)

– Strength: comprehensive product line

– Issues: decoders came in many versions, and code tended to be duplicated

• Sphinx 2 -> fast but not accurate

• Sphinx 3.0 -> very accurate but very slow

• Sphinx 3.3 -> accurate, faster than Sphinx 3.0 but slower than 1xRT

• Sphinx 4 -> not yet completed

Issues at the time (cont.)

• AM Trainer (SphinxTrain)

– Strength: it works

– Weakness: what we supported was simple

• Where is speaker adaptation?

• LM Trainer (CMU-Cambridge LM Toolkit V2)

– Strength: it works

– Weakness:

• Software was sleeping -> development had stopped

• Important functionalities weren’t in the package, e.g. LM interpolation

General Comments at the time:

• “Sphinx cannot do feature Y.”

• “You have no idea what you are up to.”

• “No one is working on Sphinx any more.”

• “Our job is not difficult but very challenging” –Prof. Alex Rudnicky

• “Sphinx is cursed.” (I made this one up.)

• “The riddle of Sphinx couldn’t be solved” –Made up by Arthur Toth in SphinxLunch

After : CMU Sphinx (now)

Sphinx and Friends (now)

[Diagram: Acoustic Signal → Sphinx Brothers → Word Sequence, with Acoustic Model (SphinxTrain, debugged) and Language Model (CMU-Cambridge LM Toolkit V3 alpha)]

Sphinx Brothers now

• Sphinx 2

– Can now use CDHMM

– Can now use FST

• Sphinx 3.X (gimmicky name of Sphinx 3)

– Can run faster given a magic tuning string

– Merges Sphinx 3.0 and Sphinx 3.3

– Supports speaker adaptation

– Re-architected

Sphinx Brothers now (cont.)

• Sphinx 4

– With great effort from Sun developers and, mainly, super speech advisors

– Beta completed

– Quite popular with users and new startups

• PocketSphinx (by Dave)

– Newly added member of the family

– First open source embedded LVCSR

Project LL

• Project L: Project Ladon

– Goal: extensions and re-development of CMU-Cambridge LM Toolkit V2

• Final product: CMU-Cambridge LM Toolkit Version 3 (alpha)

Story of V3: 3 “Young” Persons and their Inspiring Stories

• Young Student: wrote the Perl scripts

– Utterly frustrated by training LMs, decided to write a set of new Perl scripts

• Young Faculty: convinced us to license the code under BSD

– Wanted to see the LM toolkit under BSD again but had no time

• Young Staff: added 32-bit LM support

– Had nothing to do on the flight back to HK

– Wanted to do something he thought was useless

Function of V3 alpha

• Supports more than 65k words (32-bit LM)

• Perl wrapper by Young Student

– One-step LM training

– Simplified process for LM interpolation and class-based LM training

• New functionalities

– LM interpolation (lm_combine) (by Wen, Moss, Dave)

– Random text generation in 3-gram (by Arthur Toth)

– Modified Kneser-Ney smoothing (by Prof. Yannick Estève from LIUM)
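For readers unfamiliar with LM interpolation, here is a minimal sketch of the idea using linear interpolation of two unigram tables. It is not the lm_combine implementation; the function name, the toy vocabularies and the weight are all hypothetical, and a real toolkit works on full N-gram models with back-off.

```python
def interpolate(p1, p2, lam):
    """Linearly interpolate two probability tables: lam*p1 + (1-lam)*p2."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocab = set(p1) | set(p2)
    return {
        w: lam * p1.get(w, 0.0) + (1.0 - lam) * p2.get(w, 0.0)
        for w in vocab
    }


# Hypothetical unigram probabilities from a general and a domain corpus.
general = {"the": 0.6, "sphinx": 0.1, "decoder": 0.3}
domain = {"the": 0.4, "sphinx": 0.4, "decoder": 0.2}

mixed = interpolate(general, domain, lam=0.7)
print(mixed)
```

Because both inputs sum to one and the weights sum to one, the mixed table is still a valid distribution. In practice the weight is usually tuned on held-out text.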

Blessing for this change

• Supported by the license

• Permissions from all copyright owners

– Prof. Rosenfeld (also makes decisions on licensing issues for CMU)

– Dr. Robinson (also makes decisions on licensing issues for Cambridge)

– Dr. Clarkson

– Blessing mails sent to the public mailing list

• V3 will be re-licensed under BSD

SphinxTrain now

• Now supports speaker adaptation

– MLLR

– MAP

– VTLN

• Fixed many bugs

– Still many to go

• Integrated into the tutorials

• NR code finally removed, so we can distribute it now
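As a reminder of what MLLR adaptation listed above does at its core: Gaussian means are moved by an affine transform, mu' = A mu + b, estimated from a speaker's adaptation data. The sketch below only applies such a transform; it does not estimate it, it is not SphinxTrain code, and the matrix and vectors are made-up values.

```python
def mllr_adapt_mean(mean, A, b):
    """Apply an affine transform mu' = A mu + b to one Gaussian mean."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


# Hypothetical 2-dimensional transform and mean vector.
A = [[1.0, 0.1],
     [0.0, 0.9]]
b = [0.5, -0.2]

adapted = mllr_adapt_mean([2.0, 1.0], A, b)
print(adapted)
```

One shared transform can adapt many Gaussians at once, which is why MLLR works with little adaptation data.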

Technology explored in last few years

• Search / GMM Computation

• Speaker Adaptation and Normalization

• Embedded Speech Recognition

• AM Training

• LM Training

Future Opportunities - Think of the Three Modules Together

• Technology

– N-gram (N>3) (LMtk + SX)

– On-line adaptation (SX + ST)

– On-line training (SX + ST)

• Software

– Integrated package with comprehensive support for SR (SX + ST + LMtk)

– Dictation (SX + ST + LMtk)

Before/After, the difference

• Spent more time to secure training (both AM and LM)

• Architecture has been re-thought within module and across modules.

• Our food-chain is secured in the repository

– AM, LM and decoder code are under one code-base (cmusphinx)

Some Good Signs

• Sourceforge’s Project of the Month (March 2006)

• Start to be decently competitive again

• Someone used our decoder(s) and they look happy

– Users actually say “Thank You”.

• Some companies used our recognizer

– (Some of them dare to make profits.)

Some Observations

• We still need to catch up in accuracy.

– Mainly through better algorithmic support for domain-specific development

• Some Observations

– Today’s 10xRT system becomes a 1xRT system 5 years later

– Today’s most accurate system becomes the baseline (BL) of next year’s most accurate system

• Now seems to be just another starting point.
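The "xRT" figures above are real-time factors: processing time divided by audio duration. A small sketch with made-up timings shows how a 10xRT system becomes 1xRT if the same workload runs ten times faster; the 10x speedup is an illustrative assumption, not a measurement.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (the xRT figure)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical: 60 seconds of audio takes 600 seconds to decode today.
today = real_time_factor(600.0, 60.0)

# Assume the same decoder runs 10 times faster on later hardware.
later = real_time_factor(600.0 / 10.0, 60.0)

print(today, later)
```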

Conclusion on Our Technology

CMU Sphinx = Open Source SR under the BSD license

Lessons Learned

Lesson 1

• Anyone who tries to solve a legacy problem becomes a legacy problem

– Corollary 1: Many legacy decisions could actually be clever

– Corollary 2: Not every change is good

Lesson 2: on Research

• Most of the WER decline comes from better acoustic models and language models

– Corollary 1: The trainers are actually the key piece of development.

– Corollary 2: We should now focus on 1) the acoustic segmenter, 2) speaker adaptation and 3) discriminative training.
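Since this lesson is stated in terms of WER, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. This is a generic illustration, not the scoring tool used by the Sphinx team, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(wer)
```

Here one substitution ("sat" vs "sit") and one deletion ("the") give 2 errors over 6 reference words.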

Lesson 3: on Development

• Why does some of our code never go into Sphinx?

– Code without source control is close to useless

– Corollary 1: If you want your code to survive, check it in.

– Corollary 2: If you don’t know what source control is, you probably need to learn it.

Lesson 4: My Favorite, the current Sphinx Motto

• “Never over/under-estimate yourself, you never know what kind of mess you could make.”

–Dr. Evandro Gouvêa

Acknowledgement – Current Team

?Arthur David Evandro Yitao

Hiring: The Grand Janitor 2nd – Mixture of Several Jobs.

• Release Manager - Kick other people to fix various things

• Speech Scientist - Tell users to give up when they randomly read some useless papers

• System Architect - Rewrite the code in many different ways but do the same thing

• Mediator of Conflicts - Write pseudo-philosophical comments

• Core Developer - Write crappy code and occasionally debug it

• Advisor - Do what Dr. Phil does on your friends, your users and most importantly, your boss(es) and ex-bosses

Acknowledgement – Advisors

?Alex Rich Alan Ravi

Acknowledgement – CMU-Cambridge LM Toolkit

• Contributors:

– David Huggins-Daines

– Ananlada Chotimongkol

– Arthur Toth

– Xu Wen

– Prof. Yannick Estève at LIUM

Discussion

Thanks

Backup

The Organization of the team

How does it work? The Wrong Model

• 1, A leader yells: “Sphinx Team Assemble!!”

• 2, The team then assembles and follows the commands of the leader.

• 3, Things get done.

• 4, Once again the Sphinx Team has saved the day!

How does it really work? 1-3/10 steps

• 1, Someone in the team dreams up a new feature.

• 2, He communicates with the team:

– “What do you guys think?”

• 3, Developers start to give their “two cents” on the problem, e.g.

– Arthur: “According to Harry G. Frankfurt, what you talk about is B.S.”

– Evandro: “Don’t underestimate yourself, you don’t know what kind of mess you will make.”

– Dave: “That doesn’t sound like the best idea……”

The guy doesn’t give up and others give OK (4-6/10 steps)

• 4, He goes on to implement the code.

• 5, He checks the code in.

• 6, Peer review happens right after the code check-in; example comments:

– Arthur: “That is not the right balance according to Yin and Yang.”

– Evandro: “I wonder whether you know C programming.”

– Dave: “What is the rationale behind your change?”

– Yitao: “*Sigh*, I need to recompile Speechalyzer and Smartnote again.”

Automatic Tests (the final tests)

• 7, Run make check

– Make sure there is no FAIL in testing

– Requires passing 70 to 80 tests

• 8, Standard regression tests (make perf-std)

– Run tests on 3 corpora and make sure the results match past results

• 9, Machines automate both 7 and 8

– Mails are sent to everyone daily

• 10, The code could finally screw up people around the world!
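The regression step above (compare results on 3 corpora against past results) can be sketched as follows. This is not the actual make perf-std target; the corpus names, WER values, function name and tolerance are hypothetical placeholders.

```python
def mismatched_corpora(baseline, current, tolerance=1e-9):
    """Return sorted names of corpora whose result differs from the baseline."""
    missing = set(baseline) - set(current)
    changed = {
        name
        for name in baseline
        if name in current and abs(current[name] - baseline[name]) > tolerance
    }
    return sorted(missing | changed)


# Hypothetical stored results and results from tonight's automated run.
baseline_wer = {"corpus_a": 0.072, "corpus_b": 0.185, "corpus_c": 0.310}
tonight_wer = {"corpus_a": 0.072, "corpus_b": 0.191, "corpus_c": 0.310}

failures = mismatched_corpora(baseline_wer, tonight_wer)
print(failures)
```

A nightly job could mail this list to everyone, matching step 9: an empty list means the check-in did not change recognition results.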

“The Sphinx Developers”

• Members are all funded by CMU.

– Different purposes, but they check in to the same code-base

• Common goal priority:

– Accuracy

– Speed & accuracy trade-off

– Memory

– Interface

– Features

– User-friendliness

Characteristic of our Development

• The role of manager/lead developer is significantly weakened

• Releases can take some time

– Requires good release management

• Good architecture is very important

• Requires skillful and knowledgeable programmers

• Highly practical: results are worth more than words and opinions

Missions of the team

• Take care of CMU’s daily need of quality SR

• Continue to improve the system

• Bridge the industry and academia.

Conclusion on Team

• Current development is

– Decentralized

– Automated

– Skill-demanding

• We probably want to keep it this way
