A Crash Course on Speech Recognition and Using CMU Sphinx to Build ASRs

A crash course on Automatic Speech Recognition and using CMU Sphinx to build small ASRs

By Shyam.k
Swathanthra Malayalam Computing


2006-06-01

Introduction

Why Speech?

  Most effective and natural form of human communication

  Systems can be more user-friendly, and more people will be able to access the technology with ease

  Applications are enormous and I leave that to your imagination (language translators and tutors, IVRS, indexing audio recordings, etc.)

ASR is still an unsolved problem

  More research is needed to realize its full potential

  But it is mature enough to use under controlled conditions



Introduction-2

Why is Speech Recognition hard?

  Tremendous range of variability in speech, large vocabulary...

Disciplines in Speech Technology

  Physiology of speech production and hearing

  Signal processing

  Linear Algebra

  Probability Theory, statistical estimation and modeling

  Information Theory

  Linguistics



Speech production and Acoustic Phonetic approach

Analogy of speech production with an electric network, and modelling the vocal tract

But speech is not that deterministic

Statistical approaches showed much better results than the acoustic phonetic approach, and so these methods lost their importance



Statistical pattern recognition approach

Speech Recognition is a type of pattern recognition problem

Raw sample streams of audio are not well suited for matching

Features, or mathematical expressions, are formulated which, when applied to an audio stream, represent changes in speech

So extraction of such features is the very first step

  spectral analysis... but not enough

  cepstral analysis... aha! but not yet there

  Mel Frequency Cepstral Coefficients (MFCC)... WOW!
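
The feature extraction step above can be sketched in a few lines of NumPy. This is only a minimal illustration of the usual MFCC recipe (framing, power spectrum, mel filterbank, log, DCT), not the front end that Sphinx itself uses. The frame length, hop size, filter count and the function names are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1. Cut the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Spectral analysis: power spectrum of every frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank energies, then log (the "cepstral" step starts here).
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_energies = np.log(power @ bank.T + 1e-10)
    # 4. DCT-II of the log energies gives the cepstral coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T

# Usage: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)  # one 13-dimensional feature vector per frame
```

Each row of `features` is one feature vector; the later matching and modelling steps work on these vectors rather than on raw samples.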



Template matching

A simple idea in statistical pattern recognition is to pre-record each word to be recognized, compute its feature vectors, and compare those vectors with the input's to find the closest match.

The same theory is expanded for the whole ASR system

  The concept of matching the templates is enhanced into the famous Dynamic Programming (DP) algorithm

Simple templates are replaced by models which are initially trained

  There are different types of models: acoustic, pronunciation and language models.
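
Template matching with dynamic programming can be sketched as below. The slide does not name a specific algorithm, so dynamic time warping (DTW) is used here as an assumed, common instance of DP-based template matching. The function names and the toy one-dimensional "features" are made up for the example.

```python
import numpy as np

def dtw_distance(a, b):
    """Cost of the best alignment between two feature sequences.

    a, b: arrays of shape (frames, dims). The sequences may differ in
    length, which is what lets a slowly spoken word match its template.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Best way to reach (i, j): repeat a template frame, repeat an
            # input frame, or advance both.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(features, templates):
    """Return the template word whose features align most cheaply."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))

# Toy templates: a rising and a falling contour.
templates = {
    "yes": np.array([[0.0], [1.0], [2.0], [3.0]]),
    "no": np.array([[3.0], [2.0], [1.0], [0.0]]),
}
# The same rising contour, spoken more slowly.
slow_yes = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [3.0], [3.0]])
print(recognize(slow_yes, templates))
```

In a real system the rows would be feature vectors such as MFCCs rather than single numbers.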



Hidden Markov Models

Speech can be considered invariant over a very small interval of time

  An HMM models speech by treating it as a sequence of such small intervals, with each segment invariant within itself

Each such segment can be modeled by an HMM state

  Each state has a model built on a probability distribution that describes the feature vector variation (aka speech variation) over that segment

Thus we have the probability that a given speech segment belongs to a particular HMM state, based on that state's probability distribution
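
As a small illustration of "the probability for a segment to be a particular state", the sketch below scores one feature vector against three states. The slides do not say which distribution is used; a diagonal-covariance Gaussian per state is an assumption made here, and the state names, means and variances are invented values.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of feature vector x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

# Hypothetical states, each described by (mean, variance) of its features.
states = {
    "s0": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
    "s1": (np.array([2.0, 2.0]), np.array([1.0, 1.0])),
    "s2": (np.array([4.0, 4.0]), np.array([1.0, 1.0])),
}

frame = np.array([1.9, 2.2])  # one feature vector from a short speech segment
scores = {name: log_gaussian(frame, mean, var) for name, (mean, var) in states.items()}
best_state = max(scores, key=scores.get)
print(best_state)
```

Log probabilities are used instead of raw probabilities because products over many frames would otherwise underflow.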



HMM example

Figure: a simple HMM with three states
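
A three-state HMM of this kind can be written down and decoded in a few lines. The sketch below assumes a left-to-right topology (each state either repeats or moves on to the next) and one-dimensional Gaussian emissions. The transition probabilities, means and observations are invented for the example, and Viterbi decoding is used to find the most likely state for each frame.

```python
import numpy as np

# Assumed left-to-right topology: stay in a state or move to the next one.
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
with np.errstate(divide="ignore"):
    log_trans = np.log(trans)          # forbidden transitions become -inf

means = np.array([0.0, 5.0, 10.0])     # one Gaussian mean per state
var = 1.0                              # shared variance, for simplicity

def viterbi(obs):
    """Most likely state sequence for a 1-D observation sequence."""
    n_states = len(means)
    log_emit = -0.5 * (np.log(2.0 * np.pi * var)
                       + (obs[:, None] - means[None, :]) ** 2 / var)
    score = np.full(n_states, -np.inf)
    score[0] = log_emit[0, 0]          # the model must start in state 0
    back = np.zeros((len(obs), n_states), dtype=int)
    for t in range(1, len(obs)):
        cand = score[:, None] + log_trans        # cand[i, j]: arrive in j from i
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(n_states)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(len(obs) - 1, 0, -1):         # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

observations = np.array([0.1, -0.2, 4.8, 5.3, 9.9, 10.2])
state_path = viterbi(observations)
print(state_path)
```

The resulting path assigns each frame to one state, which is the segmentation of speech into near-invariant intervals described on the previous slide.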



Sphinx2 is a fast speech recognition system that uses semi-continuous HMMs

PocketSphinx is the fastest recognition system, though it is not as accurate as Sphinx2 or Sphinx3

Sphinx3 uses continuous HMMs



CMU Sphinx - Training

SphinxTrain is the training package

It requires the following files

  phone list - specifies the phones used in our particular application, with each phone on a separate line

  dictionary - specifies how each and every word in our vocabulary is made up of the phones specified in the above list

  filler dictionary - specifies the special words such as silence, breath, cough, etc.

  transcripts - specify the content of each audio file in the database, using the words in the dictionary

  and, obviously, the speech files, with the same file names as given in the transcripts
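
Because these files must agree with each other, it can help to cross-check them before training. The sketch below is not part of SphinxTrain; it is a hypothetical helper that assumes the common layout of one `WORD PHONE PHONE ...` entry per dictionary line and one `<s> WORDS </s> (file_id)` entry per transcript line. The sample phones, words and file ids are made up.

```python
def check_training_files(phones, dictionary, fillers, transcripts):
    """Return a list of inconsistencies between the training inputs."""
    problems = []
    phone_set = set(phones.split())

    # Every pronunciation may only use phones from the phone list.
    lexicon = {}
    for line in (dictionary + "\n" + fillers).splitlines():
        if not line.strip():
            continue
        word, *pronunciation = line.split()
        lexicon[word] = pronunciation
        for phone in pronunciation:
            if phone not in phone_set:
                problems.append(f"{word}: phone {phone} not in phone list")

    # Every transcript word must have a dictionary (or filler) entry.
    for line in transcripts.splitlines():
        if not line.strip():
            continue
        text, _, file_id = line.rpartition("(")
        file_id = file_id.rstrip(")").strip()
        for word in text.split():
            if word not in lexicon:
                problems.append(f"{file_id}: word {word} not in dictionary")
    return problems

# Made-up sample inputs.
phones = "HH AH L OW W ER D SIL"
dictionary = "HELLO HH AH L OW\nWORLD W ER L D"
fillers = "<s> SIL\n</s> SIL\n<sil> SIL"
transcripts = "<s> HELLO WORLD </s> (spk1_001)\n<s> HELLO THERE </s> (spk1_002)"

problems = check_training_files(phones, dictionary, fillers, transcripts)
print(problems)
```

Here the second utterance is flagged because `THERE` has no dictionary entry.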



CMU Sphinx - Training: minimal usage

Perl scripts

  Perl scripts are provided for easy usage of SphinxTrain, so that once we have the files ready at the appropriate locations we just need to run the Perl scripts to get the acoustic models ready. So let's go the easy way!

The files such as the phone list, dictionary, transcripts, etc. are kept in the project/etc directory

  Speech files are kept in a separate directory for each user inside the project/wav directory

Once we have these ready, we can set up the project by calling sphinx_tutorial.pl and thereafter make_feats.pl (if necessary)

Then we train the model either by just calling the RunAll.pl script in the scripts_pl directory, or by calling each component program of training manually.
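
Before running the training scripts, the directory layout described above can be verified with a short check. This is a hypothetical helper, not something shipped with SphinxTrain. It assumes the `project/etc` and `project/wav/<speaker>/` layout from the slide, and the speaker and utterance names are invented.

```python
import tempfile
from pathlib import Path

def missing_audio(project, file_ids):
    """List transcript file ids that have no matching .wav under project/wav."""
    wav_root = Path(project) / "wav"
    return [fid for fid in file_ids if not (wav_root / f"{fid}.wav").is_file()]

# Build a throwaway project skeleton to demonstrate the check.
with tempfile.TemporaryDirectory() as root:
    project = Path(root) / "project"
    (project / "etc").mkdir(parents=True)                  # phone list, dictionaries, transcripts
    (project / "wav" / "speaker1").mkdir(parents=True)     # one directory per speaker
    (project / "wav" / "speaker1" / "utt001.wav").touch()

    missing = missing_audio(project, ["speaker1/utt001", "speaker2/utt001"])

print(missing)
```

An empty result means every utterance named in the transcripts has its speech file in place.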



CMU Sphinx - Decoding

There are different types of decoders, namely Sphinx2, Sphinx3, Sphinx4 and PocketSphinx

  We can select one of these decoders according to the type of application we have

Once we have the trained models, we just need to call the decoder, providing those model files and the other data necessary for decoding

Sphinx has a nice API set which enables us developers to integrate Sphinx into our own applications.
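
As a rough illustration of "providing those model files" to a decoder, the sketch below only assembles a command line and does not run anything. It assumes the `pocketsphinx_continuous` command-line tool with its `-hmm`, `-lm`, `-dict` and `-infile` options, which may differ between PocketSphinx versions. The helper name and all paths are hypothetical.

```python
def decoder_command(hmm_dir, lm_file, dict_file, wav_file):
    """Assemble the arguments a decoder needs: models plus the audio to decode."""
    return [
        "pocketsphinx_continuous",
        "-hmm", hmm_dir,      # trained acoustic model directory
        "-lm", lm_file,       # language model
        "-dict", dict_file,   # pronunciation dictionary
        "-infile", wav_file,  # audio file to decode
    ]

# Hypothetical paths, for illustration only.
command = decoder_command("model/hmm", "model/words.lm", "model/words.dic", "test.wav")
print(" ".join(command))
```

If the tool is installed, a list like this could be handed to `subprocess.run`; applications that need tighter integration would use the decoder's API instead.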
