ASR building using Sphinx
CS6745: Building ASR and TTS Systems
Gopala Krishna A (gopalakrishna@students)
S P Kishore ([email protected])
2
The Components of ASR
Acoustic Model (AM)
Language Model (LM)
Phonetic Lexicon (Pronunciation dictionary)
3
Installing the Sphinx Trainer
Download the Sphinx III trainer from http://172.16.16.93/ASR/SphinxTrain-0.9.1-beta.tar.gz
(Source: http://www.speech.cs.cmu.edu/SphinxTrain/SphinxTrain-0.9.1-beta.tar.gz)
Untar and install SphinxTrain (as root):
$ tar -xvzf SphinxTrain-0.9.1-beta.tar.gz
$ cd SphinxTrain
$ ./configure
$ make
4
Installing the Sphinx II decoder
Download the Sphinx II decoder from http://172.16.16.93/ASR/sphinx2-0.5.tar
(Source: http://www.sorcerer.mirrors.pair.com/sources/sphinx2/0.5/sphinx2-0.5.tar.bz2)
Untar and install the decoder (as root):
$ tar -xvf sphinx2-0.5.tar
$ cd sphinx2-0.5
$ ./configure
$ make clean all
$ make test
$ make install
5
CMU Statistical Language Modeling toolkit
Download the CMU-SLM toolkit from http://172.16.16.93/ASR/CMU-Cam_Toolkit_v2.tar.gz
(Source: http://mi.eng.cam.ac.uk/~prc14/CMU-Cam_Toolkit_v2.tar.gz)
Untar the archive and install (as root):
$ tar -xvzf CMU-Cam_Toolkit_v2.tar.gz
$ cd CMU-Cam_Toolkit_v2/
$ cd src
Uncomment the line "#BYTESWAP_FLAG = -DSLM_SWAP_BYTES" in the Makefile, then run:
$ make install
6
Before getting started...
Download the speech data
  Available at http://172.16.16.93/ASR/TEL_Landline.tgz
Language phoneset and phonetizer
  Phoneset available at http://172.16.16.93/ASR/TELUGU.phone
  Phonetizer available at http://172.16.16.93/ASR/IT3-Phonetizer
7
Before getting started... (contd.)
NIST scorer (for scoring the decoder performance)
  Available at http://172.16.16.93/ASR/nist.tar.gz
Script for testing
  Available at http://172.16.16.93/ASR/sphinx2-test
Scripts for scoring and alignment
  Available at http://172.16.16.93/ASR/scorer.sh
  Available at http://172.16.16.93/ASR/sphinx2-align
8
Speech databases: format
Language                      // Tamil, Telugu or Marathi
  Data
    Cellphone
      ID-****                 // 4-digit userid
        FileRanking.txt       // has info about recording quality
        recorded-ID.txt       // has the transcription
        recordings            // has the recordings in various formats
          WAV/                // has 52 wav files of the recordings: recorded-***.wav (3-digit fileid)
          …..
    Landline
      …..
9
Directory structure
10
Training and Testing Datasets
To train the models and to evaluate their performance on unseen data, the speakers need to be divided into training and testing sets.
The division is usually 70% of the speakers for training and 30% for testing.
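The 70/30 speaker split can be sketched in a few lines of Python. This is only an illustration: the function name, the fixed random seed and the example speaker IDs are assumptions, and the collect scripts described on the next slides may divide the speakers differently.

```python
import random


def split_speakers(speaker_ids, train_fraction=0.7, seed=0):
    """Shuffle speaker IDs reproducibly and return (train_ids, test_ids)."""
    ids = sorted(set(speaker_ids))   # de-duplicate and fix the starting order
    rng = random.Random(seed)        # local RNG so the split is repeatable
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


if __name__ == "__main__":
    # Hypothetical speaker directories in the ID-**** naming shown earlier.
    speakers = [f"ID-{n:04d}" for n in range(1, 11)]
    train, test = split_speakers(speakers)
    print(len(train), "training speakers;", len(test), "testing speakers")
```

Splitting by speaker rather than by file keeps every test speaker unseen during training, which is what the evaluation on unseen data requires.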
11
Wav file collection
This is done to collect the Training and Testing data sets (good quality wav files without mistakes).
Copy and untar the file collection module from http://172.16.16.93/ASR/collect.tgz
Extract the wav files as follows:
$ cd COLLECT
$ ./runall.pl <Directory containing the speaker IDs>
This could take 5-10 minutes.
12
Wav file collection (contd.)
The following are created after running runall.pl:
Use the *.raw files in Training/ for training, the 'Training/transcript' file as the corresponding transcription and train_fileids as the fileids file.
Use the *.raw files in Testing/ for testing, the 'Testing/transcript' file as the corresponding transcription and test_fileids as the fileids file.
trainwords_uniq.txt is the unique word list of the training transcription (used for creating the dictionary).
13
Acoustic Model Training
Create a new directory (training workspace) and change into it:
$ mkdir TASK_NAME
$ cd TASK_NAME
Set the environment variable in that directory (make sure you give the correct path of the SphinxTrain directory):
$ export SPHINXTRAINDIR="$HOME/SphinxTrain"
Create the directory structure:
$ $SPHINXTRAINDIR/scripts_pl/setup_SphinxTrain LangName
14
Directory wav/
Copy the Training/*.raw files [1] to this directory.
[1] refer slide 11
15
Directory etc/
Contents to be put in the etc/ directory:
etc/langname.transcription: Copy the Training/transcript [1] file
etc/langname.filler: Should contain the silence specifiers, one per line:
<s> SIL
</s> SIL
<sil> SIL
[1] refer slide 11
16
Directory etc/ (contd.)
etc/langname.phone [1]: Should contain the phoneset, each phone on a new line. Append 'SIL' as a phone.
etc/langname.fileids: Should have the filenames of all the files, in the order they appear in the langname.transcription file (use the train_fileids file).
[1] http://172.16.16.93/ASR/TELUGU.phone
17
etc/langname.dic
etc/langname.dic: Should contain the phone breakdown (pronunciation) of each word entry. Proceed as follows:
Get all the unique words in the training transcription (excluding <s>, </s> and filenames), each on a new line. You may use trainwords_uniq.txt [2].
Use the IT3-Phonetizer [1] to split the words into their constituent phones:
$ ./IT3-Phonetizer lang.phone lang.wordlist langname.dic
[1] refer slide 6 (http://172.16.16.93/ASR/TELUGU.phone)
[2] refer slide 11
18
Some modifications
Modify etc/sphinx_train.cfg, changing the number of tied states to 1000:
$CFG_N_TIED_STATES = 1000;
To the command line in the file scripts_pl/03.makeuntiedmdef/make_untied_mdef.pl, add the parameters -minocc 1 and -maxtriphones 20000.
19
Some modifications (contd.)
In the file bin/make_feats, replace the final command line with the following:
bin/wave2feat -verbose -c $1 -raw -di wav -ei raw -do feat -eo feat -srate 8000 -nfft 256 -lowerf 130 -upperf 3400 -nfilt 31 -ncep 13 -dither
Extract the features for the wav files by executing:
$ bin/make_feats etc/*.fileids
20
Training Checklist
Make sure the entries in langname.fileids are in the same order as the filenames in langname.transcription (check the first few files).
Ensure that the same transliteration is used in all three: langname.transcription, langname.dic and langname.phone.
Remove duplicate entries, numerals and silence specifiers (like <s>) from langname.dic.
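Two of the checklist items lend themselves to an automated check. The sketch below is a hedged illustration, not part of SphinxTrain: both function names are hypothetical, the transcription lines are assumed to end with the file name in parentheses, for example "(file0001)", and "numerals" is interpreted as any dictionary word containing a digit.

```python
import re

# Assumed transcription layout: "<s> some words </s> (file0001)"
UTT_ID = re.compile(r"\(([^()\s]+)\)\s*$")


def check_fileid_order(fileids, transcript_lines):
    """Return the line indices where fileids and transcription disagree."""
    mismatches = []
    for i, (fid, line) in enumerate(zip(fileids, transcript_lines)):
        match = UTT_ID.search(line)
        utt = match.group(1) if match else None
        # Compare on the base name so "wav/file0001" matches "(file0001)".
        if utt != fid.strip().split("/")[-1]:
            mismatches.append(i)
    if len(fileids) != len(transcript_lines):
        # Flag the first index that exists in only one of the two files.
        mismatches.append(min(len(fileids), len(transcript_lines)))
    return mismatches


def clean_dictionary(lines):
    """Drop duplicate words, words containing digits and <...> specifiers."""
    seen, kept = set(), []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        word = parts[0]
        if word in seen or word.startswith("<") or any(c.isdigit() for c in word):
            continue
        seen.add(word)
        kept.append(" ".join(parts))
    return kept
```

An empty list from check_fileid_order means the two files line up; any index it returns is a good place to start looking before launching the verification step.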
21
Steps involved in AM training
STEP 0: Verify
./scripts_pl/00.verify/verify_all.pl
STEP 1: Vector Quantization
./scripts_pl/01.vector_quantize/slave.VQ.pl
STEP 2: Context Independent (CI) training
./scripts_pl/02.ci_schmm/slave_convg.pl
STEP 3: State Tying
./scripts_pl/03.makeuntiedmdef/make_untied_mdef.pl
22
Steps in AM training (contd.)
STEP 4: Context Dependent (CD) training
./scripts_pl/04.cd_schmm_untied/slave_convg.pl
STEP 5: Tree Building
./scripts_pl/05.buildtrees/make_questions.pl
./scripts_pl/05.buildtrees/slave.treebuilder.pl
STEP 6: Tree Pruning
./scripts_pl/06.prunetree/slave.state-tie-er.pl
23
Steps in AM training (contd.)
STEP 7: CD training
./scripts_pl/07.cd-schmm/slave_convg.pl
STEP 8: Deleted Interpolation
./scripts_pl/08.deleted-interpolation/deleted_interpolation.pl
STEP 9: Converting to Sphinx 2 format
./scripts_pl/09.make_s2_models/make_s2_models.pl
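Since each step depends on the output of the previous one, it can help to run the scripts from a small driver that stops at the first failure. The following is a minimal sketch under stated assumptions: the script paths are copied from the steps above, while the function name, the use of `perl` to invoke them and the convention that a non-zero exit status means failure are assumptions. It is meant to be run from the training workspace.

```python
import subprocess

# Script paths as listed in STEP 0 to STEP 9, relative to the workspace.
STEPS = [
    "scripts_pl/00.verify/verify_all.pl",
    "scripts_pl/01.vector_quantize/slave.VQ.pl",
    "scripts_pl/02.ci_schmm/slave_convg.pl",
    "scripts_pl/03.makeuntiedmdef/make_untied_mdef.pl",
    "scripts_pl/04.cd_schmm_untied/slave_convg.pl",
    "scripts_pl/05.buildtrees/make_questions.pl",
    "scripts_pl/05.buildtrees/slave.treebuilder.pl",
    "scripts_pl/06.prunetree/slave.state-tie-er.pl",
    "scripts_pl/07.cd-schmm/slave_convg.pl",
    "scripts_pl/08.deleted-interpolation/deleted_interpolation.pl",
    "scripts_pl/09.make_s2_models/make_s2_models.pl",
]


def run_training(steps=STEPS, runner=None):
    """Run the steps in order; return the first failing script, or None."""
    if runner is None:
        # Assumption: the scripts are invoked through perl and signal
        # failure with a non-zero exit status.
        def runner(script):
            return subprocess.call(["perl", script])
    for script in steps:
        print("running", script)
        if runner(script) != 0:
            return script
    return None
```

Passing a custom `runner` makes it possible to do a dry run (for example one that only returns 0) before starting a training run that may take hours.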
24
Training the Language Model
In theory, the LM should be trained on a large unbiased corpus. For practical feasibility, we approximate this by training it on a corpus derived from the testing and training transcriptions.
Statistical language modeling computes the smoothed trigram, bigram and unigram probabilities from the corpus.
Prepare the corpus as follows:
Concatenate the test and training transcriptions, each sentence on a new line
Remove punctuation and filenames
Prefix and suffix each sentence with <s> and </s>
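The corpus preparation above could be scripted along the following lines. This is a sketch rather than the course's own tooling: the function names are hypothetical, the transcription lines are assumed to carry the file name in trailing parentheses, and only a small, conservative set of punctuation marks is stripped so that characters used by the transliteration scheme are left alone.

```python
import re

FILEID = re.compile(r"\([^()]*\)\s*$")        # trailing "(file0001)"
MARKERS = re.compile(r"</?s>", re.IGNORECASE)  # existing <s> and </s>
# Assumption: these marks never belong to the transliteration itself.
PUNCT_TABLE = str.maketrans("", "", ".,;!?\"")


def normalise(line):
    """Return one '<s> ... </s>' corpus line, or None for an empty utterance."""
    line = FILEID.sub("", line)
    line = MARKERS.sub(" ", line)
    words = line.translate(PUNCT_TABLE).split()
    if not words:
        return None
    return "<s> " + " ".join(words) + " </s>"


def build_corpus(train_lines, test_lines):
    """Concatenate both transcriptions into a list of corpus sentences."""
    corpus = []
    for line in list(train_lines) + list(test_lines):
        sentence = normalise(line)
        if sentence is not None:
            corpus.append(sentence)
    return corpus
```

Writing the returned list to a file, one sentence per line, gives the corpus that the language modeling commands operate on.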
25
Training the LM (contd.)
Run the following commands on the corpus (e.g. corpus.txt) in the directory CMU-Cam_Toolkit_v2/bin:
$ cat corpus.txt | ./text2wfreq > corpus.wfreq
$ cat corpus.wfreq | ./wfreq2vocab > corpus.vocab
$ cat corpus.txt | ./text2idngram -vocab corpus.vocab > corpus.idngram
$ ./idngram2lm -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.lm
26
Pronunciation Dictionary
The decoder must be given the phone breakdown of all the unigrams of the LM.
Run the IT3-Phonetizer [1] on the wordlist containing the unigrams to get the langname.dic for the entries:
$ ./IT3-Phonetizer lang.phone unigram.wordlist langname.dic
[1] refer slide 6 (http://172.16.16.93/ASR/TELUGU.phone)
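One way to obtain the unigram wordlist is to read it out of the ARPA-format language model. The sketch below assumes the standard ARPA layout, in which the unigrams follow a "\1-grams:" line as "logprob word [backoff]"; the function name and the decision to skip <s>, </s> and <UNK> are assumptions made for illustration.

```python
def arpa_unigrams(lines, skip=("<s>", "</s>", "<UNK>")):
    """Return the words listed in the \\1-grams: section of an ARPA LM."""
    words = []
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers such as \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if in_unigrams and line:
            fields = line.split()          # logprob, word, optional backoff
            if len(fields) >= 2 and fields[1] not in skip:
                words.append(fields[1])
    return words


if __name__ == "__main__":
    # Tiny hand-written example in ARPA layout (not real model output).
    sample = [
        "\\data\\", "ngram 1=3", "ngram 2=1", "",
        "\\1-grams:", "-0.5 <s> -0.3", "-0.7 amma -0.2", "-0.9 illu", "",
        "\\2-grams:", "-0.1 <s> amma", "", "\\end\\",
    ]
    print("\n".join(arpa_unigrams(sample)))
```

The printed words, one per line, can be saved as unigram.wordlist and passed to the phonetizer.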
27
Running the decoder
Modify the script sphinx2-test [1] with the appropriate values for the parameters:
TASK= Training directory path
HMM= ${TASK}/model_parameters/langname.s2models
CTLFILE= List of all the filenames of the testing raw files
Arguments for s2batch:
-matchfn : output filename
-datadir : directory containing the testing files (in raw format)
-lmfn : path of the language model
-dictfn : path of the dictionary
[1] refer slide 7
28
Running the decoder (contd.)
Arguments you change for the command s2batch:
-matchfn : output filename
-datadir : directory containing the testing files (in raw format)
-lmfn : path of the language model
-dictfn : path of the dictionary
-langwt : a value between 6 and 13 (the larger the LM size, the smaller the value of langwt)
-logfn : logfile
Now run the script:
$ ./sphinx2-test
29
Evaluating the output
Use the original transcription (e.g. test.txt) of the testing files to evaluate the output of the decoder. Each line has the format:
this is a test sentence (file0001)
Modify the output of the decoder to the above format, i.e. remove the scores at the end (e.g. output.txt).
Modify scorer.sh [1] as follows:
NIST : path of the NIST directory
REF : the testing transcription (test.txt)
HYP : the decoder output (output.txt)
score.rpt : the performance report of the decoder
Run the script:
$ ./scorer.sh
[1] refer slide 7
30
Interpreting the NIST report
The scorer aligns the decoder output with the reference transcript of the test utterances.
It computes the mean word error rate (WER) per utterance by penalizing the insertions, deletions and substitutions in the alignment, i.e. WER = (insertions + deletions + substitutions) / number of words in the reference.
The report also gives the WER per speaker and indicates the good and the bad speakers in the test set.
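To make the error measure concrete, the word error rate of a single utterance can be computed with a word-level edit distance. This is a minimal sketch of the standard dynamic-programming calculation, not the NIST scorer's own implementation; the function names are chosen here for illustration.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions (word level)."""
    prev = list(range(len(hyp) + 1))      # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]                         # distance to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion of a reference word
                cur[j - 1] + 1,           # insertion of a hypothesis word
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def word_error_rate(ref_sentence, hyp_sentence):
    """WER = edit distance divided by the number of reference words."""
    ref = ref_sentence.split()
    hyp = hyp_sentence.split()
    if not ref:
        raise ValueError("reference sentence is empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(word_error_rate("this is a test sentence", "this is test sentences"))
```

In the example, one reference word is deleted and one is substituted, so two errors are counted against five reference words.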
31
Forced Alignment
A technique to improve the Acoustic Model.
Download sphinx2-align [1] and modify the parameter paths accordingly:
TASK : Training directory
HMM : ${TASK}/model_parameters/TELUGU.s2models
CTLFILE : The list of all the training files to be aligned
TACTLFN : Transcript to be aligned. The format is:
*align_all*             // This should be the first line
this is sentence one    // Remove <s>, </s> & filenames
DICT : ${TASK}/etc/langname.dic
[1] refer slide 7
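The transcript for TACTLFN can be derived from the training transcription. The sketch below follows the format described above (a first line reading *align_all*, then the sentences without <s>, </s> and filenames); the function name is hypothetical and the file names are assumed to appear in trailing parentheses.

```python
import re

FILEID = re.compile(r"\([^()]*\)\s*$")   # trailing "(file0001)"


def make_align_transcript(transcript_lines):
    """Return the lines of a forced-alignment transcript."""
    out = ["*align_all*"]                # must be the first line
    for line in transcript_lines:
        line = FILEID.sub("", line)
        words = [w for w in line.split() if w not in ("<s>", "</s>")]
        # One output line per input line, so the transcript stays in step
        # with the list of files given in CTLFILE.
        out.append(" ".join(words))
    return out


if __name__ == "__main__":
    lines = ["<s> this is sentence one </s> (file0001)"]
    print("\n".join(make_align_transcript(lines)))
```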
32
Forced Alignment (contd.)
Arguments for the $S2batch:
-osentfn : output file
-datadir : directory containing the raw files
-logfn : logfile for the alignment
Replace etc/langname.transcription with the aligned transcript (pointed to by -osentfn).
Retrain the acoustic models [1], then test and score the new models to see the improved performance.
[1] refer slide 20
33
Limited Domain Speech-to-Speech/ASR
Target:
Exploiting the limited domain
Integrating ASR with MT and TTS systems
Schematic figure shown alongside
34
The Language
Identify the kinds of templates and the various entities that recur in the domain.
Example: considering a Tourist domain
Template1: How can I go to the <Location>?
Template2: Can I catch a <Mode> to <Place>?
Values for Location: Market, Railway Station, Hospital
Values for Mode: Train, Bus, Aeroplane
Values for Place: Chennai, Delhi, Hyderabad
Implement a procedure to generate the legitimate utterances of the domain's language. Use the same transliteration as that of the acoustic models.
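A procedure of this kind can be sketched as a template expansion: every slot in a template is replaced by each of its permitted values. The templates and values below are the ones from the Tourist example; the function name and the `<Slot>` parsing are illustrative assumptions, and the output would still need to be converted to the transliteration used by the acoustic models.

```python
import itertools
import re

SLOT = re.compile(r"<(\w+)>")

TEMPLATES = [
    "How can I go to the <Location>?",
    "Can I catch a <Mode> to <Place>?",
]
VALUES = {
    "Location": ["Market", "Railway Station", "Hospital"],
    "Mode": ["Train", "Bus", "Aeroplane"],
    "Place": ["Chennai", "Delhi", "Hyderabad"],
}


def expand(template, values):
    """Yield every sentence obtained by filling the template's slots."""
    slots = SLOT.findall(template)        # slot names, left to right
    for combo in itertools.product(*(values[s] for s in slots)):
        filled = iter(combo)
        # re.sub visits the slots left to right, matching the combo order.
        yield SLOT.sub(lambda m: next(filled), template)


if __name__ == "__main__":
    for template in TEMPLATES:
        for sentence in expand(template, VALUES):
            print(sentence)
```

The generated sentences form the set of legitimate utterances on which a limited-domain LM can be trained.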
35
Components for the limited domain ASR
AM: Existing AMs built for the languages
LM: LM trained on the set of legitimate sentences allowed by your application
Lexicon: Specified for the unigram terms of the LM
36
Biasing the decoder to the LM
To exploit the limited domain, increase the langwt parameter in the sphinx2-test script; this biases the decoder toward the LM and increases its speed and accuracy.