ASR_Sphinx0.5

ASR building using Sphinx

CS6745: Building ASR and TTS Systems

Gopala Krishna A (gopalakrishna@students)

S P Kishore ([email protected])

2

The Components of ASR

Acoustic Model (AM)

Language Model (LM)

Phonetic Lexicon (Pronunciation dictionary)

3

Installing the Sphinx Trainer

Download the Sphinx III trainer from http://172.16.16.93/ASR/SphinxTrain-0.9.1-beta.tar.gz

(Source: http://www.speech.cs.cmu.edu/SphinxTrain/SphinxTrain-0.9.1-beta.tar.gz)

Untar and install SphinxTrain (as root)

$ tar -xvzf SphinxTrain-0.9.1-beta.tar.gz
$ cd SphinxTrain
$ ./configure
$ make

4

Installing the Sphinx II decoder

Download the Sphinx II decoder from http://172.16.16.93/ASR/sphinx2-0.5.tar

(Source: http://www.sorcerer.mirrors.pair.com/sources/sphinx2/0.5/sphinx2-0.5.tar.bz2)

Untar and install the decoder (as root)

$ tar -xvf sphinx2-0.5.tar
$ cd sphinx2-0.5
$ ./configure
$ make clean all
$ make test
$ make install

5

CMU Statistical Language Modeling toolkit

Download the CMU-SLM toolkit from http://172.16.16.93/ASR/CMU-Cam_Toolkit_v2.tar.gz

(Source: http://mi.eng.cam.ac.uk/~prc14/CMU-Cam_Toolkit_v2.tar.gz)

Untar the tgz and install as root

$ tar -xvzf CMU-Cam_Toolkit_v2.tar.gz
$ cd CMU-Cam_Toolkit_v2/
$ cd src

Uncomment the line #BYTESWAP_FLAG = -DSLM_SWAP_BYTES in the Makefile, then run

$ make install

6

Before getting started….

Download the speech data
Available at http://172.16.16.93/ASR/TEL_Landline.tgz

Language phoneset and Phonetizer
Phoneset available at http://172.16.16.93/ASR/TELUGU.phone
Phonetizer available at http://172.16.16.93/ASR/IT3-Phonetizer

7

Before getting started…..contd

NIST Scorer (for scoring the decoder performance)
Available at http://172.16.16.93/ASR/nist.tar.gz

Script for testing
Available at http://172.16.16.93/ASR/sphinx2-test

Scripts for scoring and alignment
Available at http://172.16.16.93/ASR/scorer.sh
Available at http://172.16.16.93/ASR/sphinx2-align

8

Speech Databases…. format

Language // Tamil, Telugu or Marathi
  Data
    Cellphone
      ID-**** // 4-digit userid
        FileRanking.txt // has info about recording quality
        recorded-ID.txt // has the transcription
        recordings // has the recordings in various formats
          WAV/ // has 52 wav files of the recordings: recorded-***.wav (3-digit fileid)
          …..
    Landline
      …..

9

Directory structure

10

Training and Testing Datasets

To train the models and to evaluate their performance on unseen data, the speakers need to be divided into Training and Testing sets.

The division is usually 70% of the speakers for training and 30% for testing.
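The 70/30 division by speaker can be sketched as below. This is only an illustrative sketch and not part of the course scripts: the function name split_speakers, the fixed random seed and the example speaker IDs are assumptions. Splitting by speaker (rather than by file) keeps every test speaker unseen during training.

```python
import random

def split_speakers(speaker_ids, train_fraction=0.7, seed=0):
    """Split speaker IDs into disjoint training and testing lists."""
    ids = sorted(set(speaker_ids))      # de-duplicate, stable starting order
    rng = random.Random(seed)           # fixed seed so the split is repeatable
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return sorted(ids[:n_train]), sorted(ids[n_train:])

# Hypothetical speaker directory names, for illustration only.
speakers = [f"ID-{n:04d}" for n in range(1, 11)]
train, test = split_speakers(speakers)
print(len(train), "training speakers,", len(test), "testing speakers")
```
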

11

Wav file collection

This is done to collect Training and Testing data sets (good quality wav files without mistakes)

Copy and untar the file collection module from http://172.16.16.93/ASR/collect.tgz

Extract the wav files as follows

$ cd COLLECT
$ ./runall.pl <Directory containing the speaker IDs>

This could take 5-10 minutes

12

Wav file collection …..contd

The following are created after running runall.pl:

Use the *.raw files in Training/ for training, the 'Training/transcript' file as the corresponding transcription and train_fileids as the fileids file

Use the *.raw files in Testing/ for testing, the 'Testing/transcript' file as the corresponding transcription and test_fileids as the fileids file

trainwords_uniq.txt is the unique word list of the training transcription (used for creating dictionary)

13

Acoustic Model Training

Create a new directory (training workspace)

$ mkdir TASK_NAME

Set the environment variable in the directory (leave the path unquoted so that ~ is expanded)

$ export SPHINXTRAINDIR=~/SphinxTrain

Make sure you give the correct path of the SphinxTrain directory

Create the directory structure

$ $SPHINXTRAINDIR/scripts_pl/setup_SphinxTrain LangName

14

Directory wav/

Copy the Training/*.raw files [1] to this directory

1 : refer slide 11

15

Directory etc/

Contents to be put in the etc/ directory

etc/langname.transcription: Copy the Training/transcript [1] file

etc/langname.filler: Should contain the silence specifiers

<s> SIL
</s> SIL
<sil> SIL

1 : refer slide 11

16

Directory etc/……contd

etc/langname.phone [1] : Should contain the phoneset, one phone per line. Append 'SIL' as a phone

etc/langname.fileids : Should list the filenames of all the files in the order they appear in the langname.transcription file (use the train_fileids file)

1 : http://172.16.16.93/ASR/TELUGU.phone

17

etc/langname.dic

etc/langname.dic : Should contain the phone breakdown of each word entry. Proceed as follows -

Get all the unique words in the training transcription (excluding <s>, </s> and filenames), one per line. You may use trainwords_uniq.txt [2]

Use the IT3-Phonetizer [1] to split the words into their constituent phones

$ ./IT3-Phonetizer lang.phone lang.wordlist langname.dic

1 : refer slide 6, http://172.16.16.93/ASR/TELUGU.phone

2 : refer slide 11
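If trainwords_uniq.txt is not at hand, the unique word list can be rebuilt from the transcription roughly as follows. This is a sketch under assumptions: the function name unique_words is hypothetical, and each transcription line is assumed to look like "<s> words </s> (fileid)".

```python
def unique_words(transcript_lines):
    """Return the sorted unique words of a transcription."""
    words = set()
    for line in transcript_lines:
        for token in line.split():
            # Skip the sentence markers.
            if token in ("<s>", "</s>"):
                continue
            # Skip the "(fileid)" token (assumed to contain no spaces).
            if token.startswith("(") and token.endswith(")"):
                continue
            words.add(token)
    return sorted(words)

# Made-up transcription lines, for illustration only.
lines = [
    "<s> this is a test sentence </s> (file0001)",
    "<s> this is another sentence </s> (file0002)",
]
for word in unique_words(lines):
    print(word)   # one word per line, as the wordlist for the phonetizer
```
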

18

Some modifications

Modify etc/sphinx_train.cfg to change the number of tied states to 1000

$CFG_N_TIED_STATES = 1000

To the command line in the file scripts_pl/03.makeuntiedmdef/make_untied_mdef.pl, add the parameters -minocc 1 and -maxtriphones 20000

19

Some modifications

In the file bin/make_feats, replace the final command line with the following

bin/wave2feat -verbose -c $1 -raw -di wav -ei raw -do feat -eo feat -srate 8000 -nfft 256 -lowerf 130 -upperf 3400 -nfilt 31 -ncep 13 -dither

Extract the features for the wav files by executing

$ bin/make_feats etc/*.fileids

20

Training Checklist

Make sure the entries in langname.fileids are in the same order as the filenames in langname.transcription (check the first few files)

Ensure that the same transliteration is used in all three: langname.transcription, langname.dic and langname.phone

Remove duplicate entries, numerals and silence specifiers (like <s>) in langname.dic
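Two of the checklist items can be automated along the following lines. This is a sketch, not a course script: the function names are hypothetical, each transcription line is assumed to end with its "(fileid)", and each dictionary line is assumed to be a word followed by its phones.

```python
import re

def fileids_in_order(fileids, transcript_lines):
    """True if the k-th transcription line ends with the k-th file id."""
    if len(fileids) != len(transcript_lines):
        return False
    for fileid, line in zip(fileids, transcript_lines):
        match = re.search(r"\(([^()]*)\)\s*$", line)
        if match is None or match.group(1) != fileid.strip():
            return False
    return True

def clean_dictionary(dic_lines):
    """Drop duplicate words, numerals and silence specifiers such as <s>."""
    seen, kept = set(), []
    for line in dic_lines:
        fields = line.split()
        if not fields:
            continue                      # ignore blank lines
        word = fields[0]
        if word in seen or word.isdigit() or word.startswith("<"):
            continue
        seen.add(word)
        kept.append(" ".join(fields))     # keep the first entry for each word
    return kept
```

A transliteration check (every phone in langname.dic also present in langname.phone) could be added in the same way.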

21

Steps involved in AM training

STEP 0: Verify
./scripts_pl/00.verify/verify_all.pl

STEP 1: Vector Quantization
./scripts_pl/01.vector_quantize/slave.VQ.pl

STEP 2: Context Independent (CI) training
./scripts_pl/02.ci_schmm/slave_convg.pl

STEP 3: State Tying
./scripts_pl/03.makeuntiedmdef/make_untied_mdef.pl

22

Steps in AM Training ….contd

STEP 4: Context Dependent (CD) training
./scripts_pl/04.cd_schmm_untied/slave_convg.pl

STEP 5: Tree Building
./scripts_pl/05.buildtrees/make_questions.pl
./scripts_pl/05.buildtrees/slave.treebuilder.pl

STEP 6: Tree Pruning
./scripts_pl/06.prunetree/slave.state-tie-er.pl

23

Steps in AM Training ….contd

STEP 7: CD training
./scripts_pl/07.cd-schmm/slave_convg.pl

STEP 8: Deleted Interpolation
./scripts_pl/08.deleted-interpolation/deleted_interpolation.pl

STEP 9: Converting to Sphinx 2 format
./scripts_pl/09.make_s2_models/make_s2_models.pl
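The steps are run one after another from the training workspace, and a later step is only meaningful if the earlier ones succeeded. A small driver for this could look as follows; it is a sketch, with the name run_training and the runner parameter being assumptions, and it only uses the script paths listed on these slides.

```python
import subprocess

# Step scripts in the order listed on the slides, relative to the workspace.
TRAINING_STEPS = [
    "scripts_pl/00.verify/verify_all.pl",
    "scripts_pl/01.vector_quantize/slave.VQ.pl",
    "scripts_pl/02.ci_schmm/slave_convg.pl",
    "scripts_pl/03.makeuntiedmdef/make_untied_mdef.pl",
    "scripts_pl/04.cd_schmm_untied/slave_convg.pl",
    "scripts_pl/05.buildtrees/make_questions.pl",
    "scripts_pl/05.buildtrees/slave.treebuilder.pl",
    "scripts_pl/06.prunetree/slave.state-tie-er.pl",
    "scripts_pl/07.cd-schmm/slave_convg.pl",
    "scripts_pl/08.deleted-interpolation/deleted_interpolation.pl",
    "scripts_pl/09.make_s2_models/make_s2_models.pl",
]

def run_training(workdir, runner=subprocess.call):
    """Run each step in order; return the first failing step, or None."""
    for step in TRAINING_STEPS:
        status = runner(["./" + step], cwd=workdir)
        if status != 0:
            return step       # stop here: later steps depend on this one
    return None
```

Calling run_training("TASK_NAME") from the directory that holds the workspace would then report the first step that exits with a non-zero status, if any.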

24

Training the Language Model

In theory the LM should be trained on a large unbiased corpus. For practical feasibility, we instead train it on a corpus derived from the testing and training transcriptions.

Statistical language modeling computes the smoothed trigram, bigram and unigram probabilities from the corpus.

To prepare the corpus:

Concatenate the test and training transcriptions, one sentence per line

Remove punctuation and filenames

Prefix and suffix each sentence with <s> and </s>
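The corpus preparation can be sketched as below. The function name build_lm_corpus and the punctuation set are assumptions; the punctuation set is deliberately small because the transliteration scheme may itself use symbols that must be kept, so adjust it to your data. Each input line is assumed to end with its "(fileid)".

```python
import re

def build_lm_corpus(transcript_lines, punctuation='.,?!;"'):
    """Turn transcription lines into '<s> words </s>' sentences."""
    strip_punct = str.maketrans("", "", punctuation)
    corpus = []
    for line in transcript_lines:
        line = re.sub(r"\([^()]*\)\s*$", "", line)            # drop trailing (fileid)
        line = line.replace("</s>", " ").replace("<s>", " ")  # drop existing markers
        words = line.translate(strip_punct).split()
        if words:                                             # skip empty sentences
            corpus.append("<s> " + " ".join(words) + " </s>")
    return corpus

# Made-up transcription lines from the training and testing sets.
training = ["<s> this is a test sentence </s> (file0001)"]
testing = ["how are you? (file0002)"]
for sentence in build_lm_corpus(training + testing):
    print(sentence)   # redirect to corpus.txt for the toolkit
```
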

25

Training the LM ….contd

Run the following commands on the corpus (eg. corpus.txt) in the directory /CMU-Cam_Toolkit_v2/bin

$ cat corpus.txt | ./text2wfreq > corpus.wfreq
$ cat corpus.wfreq | ./wfreq2vocab > corpus.vocab
$ cat corpus.txt | ./text2idngram -vocab corpus.vocab > corpus.idngram
$ ./idngram2lm -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.lm

26

Pronunciation Dictionary

The decoder must be provided with the phone split of all the unigrams of the LM.

Run the IT3-Phonetizer [1] on the wordlist containing the unigrams to get the langname.dic for the entries

$ ./IT3-Phonetizer lang.phone unigram.wordlist langname.dic

1 : refer slide 6, http://172.16.16.93/ASR/TELUGU.phone

27

Running the decoder

Modify the script sphinx2-test [1] with the appropriate values for the parameters

TASK= Training directory path
HMM= ${TASK}/model_parameters/langname.s2models
CTLFILE= List of all the filenames of the testing raw files

Arguments for the s2batch:

-matchfn : output filename
-datadir : directory containing the testing files (in raw format)
-lmfn : path of the language model
-dictfn : path of the dictionary

1 : refer slide 7

28

Running the decoder….contd

Arguments you change for the command s2batch:

-matchfn : output filename
-datadir : directory containing the testing files (in raw format)
-lmfn : path of the language model
-dictfn : path of the dictionary
-langwt : a value between 6 and 13 (the larger the LM, the smaller the value of langwt)
-logfn : logfile

Now run the script

$ ./sphinx2-test

29

Evaluating the output

Use the original transcription (eg. test.txt) of the testing files to evaluate the output of the decoder. Each line has the format

this is a test sentence (file0001)

Modify the output of the decoder (eg. output.txt) to the above format, i.e. remove the scores at the end

Modify and run the scorer.sh [1] as follows

NIST : path of the NIST directory
REF : the testing transcription (test.txt)
HYP : the decoder output (output.txt)
score.rpt : the performance report of the decoder

Run the script

$ ./scorer.sh

1 : refer slide 7

30

Interpreting the NIST report

The scorer aligns the decoder output with the reference transcript of the test utterances

It computes the mean word error rate (WER) per utterance by penalizing the insertions, deletions and substitutions in the alignment

The report also gives the WER per speaker and indicates the good and the bad speakers in the test set
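The word error rate for one utterance is the minimum number of insertions, deletions and substitutions needed to turn the reference into the hypothesis, divided by the number of reference words. The sketch below computes it with a standard edit-distance table; the function name word_error_rate is an assumption, and the NIST scorer may differ in details such as alignment tie-breaking and reporting.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: fewest edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                  # deletion
                dist[i][j - 1] + 1,                  # insertion
                dist[i - 1][j - 1] + substitution,   # match or substitution
            )
    return dist[-1][-1] / len(ref)

# One deletion ("a") and one insertion ("now") against 5 reference words.
print(word_error_rate("this is a test sentence", "this is test sentence now"))
```
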

31

Forced Alignment

A technique to improve the Acoustic Model

Download the sphinx2-align [1] and modify the parameter paths accordingly

TASK : Training directory
HMM : ${TASK}/model_parameters/TELUGU.s2models
CTLFILE : The list of all the training files to be aligned
TACTLFN : Transcript to be aligned. The format is -

*align_all* // This should be the first line
this is sentence one // Remove <s>, </s> & filenames

DICT : ${TASK}/etc/langname.dic

1 : refer slide 7
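The TACTLFN transcript can be derived from the training transcription roughly as follows. This is a sketch: the function name make_align_transcript is hypothetical and each input line is assumed to look like "<s> words </s> (fileid)". Lines are never dropped, so the output stays in step with the CTLFILE.

```python
import re

def make_align_transcript(transcript_lines):
    """Build the alignment transcript: '*align_all*' then the bare sentences."""
    out = ["*align_all*"]                       # must be the first line
    for line in transcript_lines:
        line = re.sub(r"\([^()]*\)\s*$", "", line)   # remove trailing (fileid)
        words = [w for w in line.split() if w not in ("<s>", "</s>")]
        out.append(" ".join(words))
    return out

# Made-up transcription line, for illustration only.
lines = ["<s> this is sentence one </s> (file0001)"]
print("\n".join(make_align_transcript(lines)))
```
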

32

Forced Alignment…..contd

Arguments for the $S2batch

-osentfn : output file
-datadir : directory containing the raw files
-logfn : logfile for the alignment

Replace etc/langname.transcription with the aligned transcript (pointed to by -osentfn)

Retrain the Acoustic models [1], then test and score the new models to see the improved performance

1 : refer slide 20

33

Limited Domain Speech-to-Speech/ASR

Target:

Exploiting the limited domain

Integrating ASR with MT and TTS systems

(Schematic figure shown alongside)

34

The Language

Identify the kinds of templates and the various entities that recur in the domain. Ex: Considering a Tourist domain

Template1: How can I go to the <Location>?
Template2: Can I catch a <Mode> to <Place>?

Values for Location: Market, Railway Station, Hospital
Values for Mode: Train, Bus, Aeroplane
Values for Place: Chennai, Delhi, Hyderabad

Implement a procedure to generate the legitimate utterances of the domain language. Use the same transliteration as that used for the Acoustic models
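One way to implement such a procedure is to substitute every combination of entity values into each template. The sketch below uses the two templates and the values from this slide; the names expand and generate_utterances are assumptions, and the output is in English only for illustration.

```python
import re
from itertools import product

TEMPLATES = [
    "How can I go to the <Location>?",
    "Can I catch a <Mode> to <Place>?",
]
VALUES = {
    "Location": ["Market", "Railway Station", "Hospital"],
    "Mode": ["Train", "Bus", "Aeroplane"],
    "Place": ["Chennai", "Delhi", "Hyderabad"],
}

def expand(template, values):
    """Fill the <Slot> markers of one template with every value combination."""
    slots = re.findall(r"<(\w+)>", template)
    sentences = []
    for combo in product(*(values[slot] for slot in slots)):
        sentence = template
        for slot, value in zip(slots, combo):
            sentence = sentence.replace(f"<{slot}>", value, 1)
        sentences.append(sentence)
    return sentences

def generate_utterances(templates, values):
    """All legitimate utterances of the domain, template by template."""
    return [s for template in templates for s in expand(template, values)]

for utterance in generate_utterances(TEMPLATES, VALUES):
    print(utterance)
```

For a real system the generated sentences would still need to be written in the transliteration of the acoustic models and stripped of punctuation before LM training.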

35

Components for the limited domain ASR

AM : Existing AMs built for the languages

LM : LM trained on the set of legitimate sentences allowed by your application

Lexicon : Specified for the unigram terms of the LM

36

Biasing the decoder to LM

To exploit the limited domain, increase the langwt parameter of the sphinx2-test to increase the speed and accuracy of the decoder.
