Intern Presentation
From: 15th May – 10th July 2014
Institute: DA-IICT
Under: Professor Hemant Patil
Building a USS based TTS System
The following are the major steps:
1. Collection of text for the system.
2. Recording of the data (studio quality preferred).
3. Labelling of the data.
4. System building.
Collection of text data
• This step is as important as the others: the text must be collected keeping in mind that the maximum number of syllables, phones, and words is covered. The text is therefore gathered from different newspapers, magazines, and other sources.
• The text is generally collected in the form of sentences, which makes the labelling part easier.
• Once the collection is done, the text is read through and corrected manually for any mistakes that may have crept in during collection. This proof-reading is an important step in the data-collection process.
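A quick way to sanity-check coverage is to count how often each unit occurs in the collected text. A minimal sketch (corpus.txt and its contents are stand-ins, and it counts whitespace-separated words rather than syllables):

```shell
# Stand-in corpus; in practice this is the collected newspaper/magazine text.
printf 'ram ghar gaya\nram school gaya\n' > corpus.txt

# One word per line, then count how often each word occurs,
# most frequent first; rare words hint at coverage gaps.
tr -s '[:space:]' '\n' < corpus.txt | sort | uniq -c | sort -rn
```

The same pipeline works on syllable-per-line output once the text has been syllabified.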
Recording of the data
• Once the text files are ready, we have to record the associated speech files, which will be used to give voice to our TTS system.
• The recording is another important aspect of building a TTS system.
• The quality of the recorded speech determines the quality of the TTS system built, so the speech files are generally recorded in a studio so that there is no noise in them.
• The speech files are recorded by a speaker, which brings us to another important aspect: speaker selection.
• The speaker is selected on factors such as clarity of voice, proper pronunciation, and how appealing the voice is to the people who will use such systems, i.e. blind or visually challenged people.
• Once the recording is done comes the next and most important step: labelling, or segmentation, of the data.
Labelling of Data
• This is the most important step in the process of building a TTS system.
• The labelling can be done using tools such as WaveSurfer for Windows users and Audacity for Ubuntu users.
• Labelling can be done manually, by listening to the speech files and then labelling them, or with one of the many Group Delay based automatic segmentation algorithms that have been used for this purpose.
• Labelling is the most laborious and most important part of building a USS based TTS system, as the quality of the system's final utterances depends on the quality of the labelling.
• Labelling can be done at various levels: we can label words, phones, syllables, diphones, pentaphones, etc.
• As Indian languages are syllable based, we do the labelling at the syllable level.
• This brings us to the question: what is a syllable?
What is a syllable?
The dictionary definition of a syllable is as follows:
• a unit of pronunciation having one vowel sound, with or without surrounding consonants, forming the whole or a part of a word; for example, there are two syllables in water and three in inferno.
• Syllables are often considered the phonological "building blocks" of words.
• The general structure of a syllable is C*VC*, where * means there can be zero or more consonants. A syllable has at least one vowel.
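The C*VC* pattern maps directly onto a regular expression. A minimal sketch, assuming phone sequences are written as literal C/V labels (a hypothetical encoding used only for illustration):

```shell
# Succeeds when a C/V-encoded string matches C*VC*:
# any number of consonants, exactly one vowel, any number of consonants.
is_syllable() { echo "$1" | grep -Eq '^C*VC*$'; }

is_syllable "CVC"  && echo "CVC: syllable"       # vowel with consonants around it
is_syllable "CC"   || echo "CC: not a syllable"  # no vowel, so rejected
```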
Labelling using WaveSurfer
WaveSurfer Interface
• WaveSurfer is an audio editor widely used for studies of acoustic phonetics. It is a simple but fairly powerful program for interactive display of sound pressure waveforms, spectral sections, spectrograms, pitch tracks, and transcriptions. It can read and write a number of transcription file formats used in industrial speech research, including TIMIT.
• WaveSurfer is free software, distributed under a permissive free-software licence.
• WaveSurfer provides basic audio editing operations, such as excision, copying, pasting, zero-crossing adjustment, and effects such as fading, normalization, echo, inversion, reversal, replacement with silence, and DC removal.
A wave file displayed in WaveSurfer
1) The waveform of the wave file, displayed as amplitude vs. time.
2) The spectrogram of the wave file.
3) The label of the corresponding syllable in the wave file.
4) This bar shows the part of the wave file currently being displayed in the WaveSurfer window.
System Building using Festival
What is Festival?
• The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Substantial contributions have also been provided by Carnegie Mellon University and other sites. It is distributed under a free software license similar to the BSD License.
• It offers a full text-to-speech system with various APIs, as well as an environment for development and research of speech synthesis techniques. It is written in C++ with a Scheme-like command interpreter for general customization and extension.
• Festival is designed to support multiple languages, and comes with support for English (British and American pronunciation), Welsh, and Spanish. Voice packages exist for several other languages, such as Castilian Spanish, Czech, Finnish, Hindi, Italian, Marathi, Polish, Russian, and Telugu.
• Before we go to system building, the following things have to be prepared:
1) Speech (.wav) files.
2) .lab files, i.e. the labelled files.
3) A file containing all the text used for system building.
4) Each .lab file and .wav file pair should have the same name, corresponding to the same speech file.
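Point 4 can be checked mechanically before building. A minimal sketch, using demo wav/ and lab/ folders created on the spot (real runs would point at the actual data folders):

```shell
# Demo data: s2.wav deliberately has no label file.
mkdir -p demo/wav demo/lab
touch demo/wav/s1.wav demo/wav/s2.wav demo/lab/s1.lab

# Report every .wav file that lacks a same-named .lab file.
for w in demo/wav/*.wav; do
  base=$(basename "$w" .wav)
  [ -f "demo/lab/$base.lab" ] || echo "missing label: $base"
done
# prints: missing label: s2
```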
1) In the home directory of your Ubuntu system, create a new folder with any name you want. The general format is "Institute Name_Language of the TTS system being built_Name of the speaker", e.g. "LNMIIT_Hindi_Aditi". You can create this folder with the following command in the terminal:
mkdir name_of_the_folder
2) Now copy the TTS_Building Script folder into the new folder you just created. This can be done manually or by typing the following command in the terminal:
sudo cp Source_folder_path Destination_folder_path
3) Go to the TTS_Building Script folder using the following command:
cd TTS_Building Script
4) Run the following command:
./Data_Prepration.sh
NOTE: If you get a "permission denied" error, type the following command:
chmod -R 776 name_of_the_folder
and then run again:
./Data_Prepration.sh
Do as directed in the terminal. Here the text processing takes place.
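The steps above can be collected into one script. A sketch, where the folder name is illustrative and the TTS_Building Script folder is a stand-in created on the spot so the sketch runs end to end (in practice it is the script folder supplied with the build kit):

```shell
# Stand-in for the supplied script folder, so this sketch is self-contained.
mkdir -p "TTS_Building Script"
printf '#!/bin/sh\necho "data preparation done"\n' > "TTS_Building Script/Data_Prepration.sh"

mkdir -p LNMIIT_Hindi_Aditi                        # step 1: illustrative name
cp -r "TTS_Building Script" LNMIIT_Hindi_Aditi/    # step 2: copy the script folder in
cd "LNMIIT_Hindi_Aditi/TTS_Building Script"        # step 3: enter it
chmod -R 776 .                                     # avoids "permission denied"
sh ./Data_Prepration.sh                            # step 4: prints "data preparation done"
```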
5) Copy the VoiceBuildingData folder (from step 4) into the voices folder, make a wav folder inside VoiceBuildingData, and put all the wave files into it.
6) Enter the voices directory and copy the following three things from the given Automatic voices folder:
(1) the RequiredFields folder
(2) build_voices.sh
(3) copy_voicedatatovoices.sh
7) Make a voice directory in the voices folder using the following commands (NOTE: this step is needed only when running for the first time):
mkdir name_of_the_folder
$FESTVOXDIR/src/unitsel/setup_clunits name_of_the_folder
(NOTE: the same name is used, but without the underscores, when giving this command.)
8) Enter the voices folder at the command prompt (i.e. the folder created in the very first step).
9) Copy the voice data to the voices folder by running the following command, but before that change the directory name inside the file:
./copy_VoicedataToVoices.sh
(NOTE: the directory name should be the name of the directory created in step 7.)
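The copying in step 5 can be sketched as follows, with stand-in folders created on the spot so the sketch runs end to end (real runs use the folders produced by step 4; step 7's setup_clunits call needs a Festvox installation, so it is left out here):

```shell
# Stand-in folders; in real runs VoiceBuildingData comes from step 4.
mkdir -p VoiceBuildingData voices wav
touch wav/s1.wav wav/s2.wav

cp -r VoiceBuildingData voices/                    # step 5: copy into voices
mkdir -p voices/VoiceBuildingData/wav              # step 5: make the wav folder
cp wav/*.wav voices/VoiceBuildingData/wav/         # step 5: wave files in place
ls voices/VoiceBuildingData/wav                    # lists s1.wav and s2.wav
```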
10) Now manually modify festvox/*phoneset.scm and festival/clunits/all.desc for the updated phonesets. This is done by the following steps.
(i) Editing of *phoneset.scm:
Go to the VoiceBuildingData folder and open the UniquePhone_Syll file. Copy all the data in it, then go to the directory created in step 7, into the festvox folder, and open the *phoneset.scm file (where * is the name of the directory created in step 7).
You have to do the following editing in this file. There will be some lines like this in the file:
(cplace l a p b d v g 0)
;; consonant voicing
(cvox + - 0)
)
(
After this bracket, paste all the data you copied from the UniquePhone_Syll_Phoneset file.
Also comment out this line by adding a semicolon:
;(pau - 0 - - - 0 0 -) ;; silence ...
And lastly, in
(PhoneSet.silences '(SIL))
replace the word SIL with the word you use to label the pauses between words. Now add the following lines (this step is system dependent: different people may use different notations):
( SIL + l 2 1 - l d + )
( SSIL + l 2 1 - l d + )
( LSIL + l 2 1 - l d + )
(NOTE: these three lines are added for silence, short silence, and long silence, which may vary from user to user.)
(ii) Editing of all.desc for the updated phonesets:
Go to the VoiceBuildingData folder, open the UniquePhone_Syll file, and do the following.
Open find-and-replace, enter \n in FIND and a single space " " in REPLACE WITH.
Then copy the result and close the file without saving the changes.
Paste the copied data into the all.desc file at the following path: name_of_dir_created_in_step_7/festival/clunits.
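The same find-and-replace can be done in one command. A sketch with stand-in file contents (the real UniquePhone_Syll comes from the VoiceBuildingData folder):

```shell
# Stand-in contents: one label per line, as in UniquePhone_Syll.
printf 'a\nka\nma\n' > UniquePhone_Syll

# Replace every newline with a space; the original file is left untouched,
# and the space-separated output can be pasted into all.desc.
tr '\n' ' ' < UniquePhone_Syll
```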
11) Go to the directory created in step 7 and open the *lexicon.scm file. Add the rules for your language's text processing after the "Hand written to sound rules here" comment.
12) Now, as you are training the system, change the parser file name to *train.pl:
(format myfilepointer "perl %s %s %s" (path-append DAIICT_Gujarati_Dhiraj::dir "bin/il_parser_bme_train.pl"))
13) Run the command ./build_voices.sh
14) When the training of the data is finished, change the parser file name to *test.pl, then go to the terminal, start festival manually, and provide the text you want the TTS system to say.
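Calling Festival by hand then looks roughly like this (a sketch of an interactive session; SayText is Festival's standard Scheme command for speaking a string, and the sentence is illustrative):

```
$ festival
festival> (SayText "This is a test of the TTS system.")
festival> (quit)
```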
Made by – Apurva Singh
Y13UC043
B.Tech First year
LNMIIT, Jaipur.
References
• http://en.wikipedia.org/wiki/Festival_Speech_Synthesis_System
• http://en.wikipedia.org/wiki/WaveSurfer
• http://en.wikipedia.org/wiki/Syllable
• Discrete-Time Speech Signal Processing by Thomas F. Quatieri