Speech synthesis


Speech synthesis is the artificial production of human speech. The computer or instrument used for this purpose is called a speech synthesizer. Text-to-speech (TTS) synthesis is the production of speech from normal language text.

[Figure: simple text-to-speech synthesis. Input text → text and linguistic analysis → phonetic levels → prosody and speech generation → synthesised speech.]

Stephen Hawking
• Suffering from the motor neuron disease ALS; lost his ability to speak.
• First computer-based speech system was provided by Intel®.
• Main interface program is EZ Keys, written by Words Plus Inc.
• Cursor is controlled by cheek movements, detected by an IR sensor mounted on his spectacles.
• Formed words are sent to the speech synthesiser, hardware made by Speech+.

• Speech synthesiser voice output can also be stored.
• Current configuration:
– Lenovo ThinkPad X220 tablet (2 copies).
– Intel® Core™ i7-2620M CPU @ 2.7 GHz.
– Intel® 150 GB Solid-State Drive 520 Series.
– Windows 7.
– Speech synthesizers (3 copies), manufacturer: Speech+, CA.

History of speech synthesizers
• The first device to be considered a speech synthesiser was the VODER, introduced by Homer Dudley in 1939 at the New York World's Fair.
• The first formant synthesizer, PAT (Parametric Artificial Talker), was introduced by Lawrence in 1953.

Architecture of TTS systems
[Figure: architecture of a TTS system.
Text-to-phoneme module: text input (text in orthographic form) → normalization → grapheme-to-phoneme conversion → phoneme string → prosodic modelling → phoneme string + prosodic annotation.
Phoneme-to-speech module: acoustic synthesis (various methods) → synthetic speech output.
Resources shown: abbreviation lexicon, exceptions lexicon, orthographic rules, grammar rules, prosodic model.]
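The chain of modules in the architecture diagram can be sketched as a list of stages applied in order. This is only an illustration of the data flow: the Utterance class, the stage functions and the fixed 80 ms duration are hypothetical placeholders, and each stage body is a stand-in for the real lexicon and rule look-ups.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Utterance:
    text: str                                           # text in orthographic form
    words: List[str] = field(default_factory=list)      # output of normalization
    phonemes: List[str] = field(default_factory=list)   # phoneme string
    prosody: List[Tuple[str, int]] = field(default_factory=list)  # (phoneme, duration in ms)


def normalize(utt: Utterance) -> Utterance:
    # Placeholder: a real stage would consult the abbreviation lexicon.
    utt.words = utt.text.lower().split()
    return utt


def grapheme_to_phoneme(utt: Utterance) -> Utterance:
    # Placeholder: one pseudo-phoneme per letter instead of lexicon and rules.
    utt.phonemes = [ch for word in utt.words for ch in word if ch.isalpha()]
    return utt


def prosodic_modelling(utt: Utterance) -> Utterance:
    # Placeholder: the same hypothetical 80 ms duration for every phoneme.
    utt.prosody = [(p, 80) for p in utt.phonemes]
    return utt


# The text-to-phoneme module is the ordered list of its stages.
TEXT_TO_PHONEME: List[Callable[[Utterance], Utterance]] = [
    normalize,
    grapheme_to_phoneme,
    prosodic_modelling,
]


def text_to_phoneme(text: str) -> Utterance:
    utt = Utterance(text)
    for stage in TEXT_TO_PHONEME:
        utt = stage(utt)
    return utt


utt = text_to_phoneme("Hi Dr")
print(utt.phonemes)  # ['h', 'i', 'd', 'r']
```

The annotated phoneme string in utt.prosody is what a phoneme-to-speech module would consume; keeping the stages in a list makes it easy to swap one method for another.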

Challenges in speech synthesis
• Text-to-phoneme conversion: the conversion of input text into a linguistic representation, also called grapheme-to-phoneme conversion.
• Text processing: digits, numerals, fractions, dates and abbreviations are expanded into full words.
• Pronunciation: the next task is to find the correct pronunciation.
• Homographs (words spelled alike but pronounced differently, such as "lead" the metal and "lead" the verb) should be pronounced correctly.

• Prosody
– Finding the correct intonation, stress, and duration for written text.

Text normalization
• Text processing: digits, numerals, fractions, dates and abbreviations are expanded into full words.
• Examples:
– 1750 would be expanded as "seventeen fifty" (if a year) and "one thousand seven hundred and fifty" (if a measure).
– 5/13 would be expanded as "five thirteenths" (if a fraction) and "May thirteenth" (if a date).
– Numbers are especially difficult, e.g. 233 4488 (plausibly a telephone number, which is read digit by digit).
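The number examples above can be sketched in a few lines. This is a minimal illustration that assumes the context ("year", "measure" or "phone") has already been decided by some earlier analysis; the function names and the kind labels are hypothetical, and only integers up to 9999 are handled.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def cardinal(n: int) -> str:
    """Words for 0..9999, read as a quantity."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    if n < 100:
        return two_digits(n)
    thousands, rest = divmod(n, 1000)
    hundreds, tail = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if tail:
        parts.append(f"and {two_digits(tail)}")
    return " ".join(parts)


def year(n: int) -> str:
    """Words for a four-digit year, read as two pairs of digits."""
    if 2000 <= n <= 2009:
        return cardinal(n)              # "two thousand and five"
    first, second = divmod(n, 100)
    if second == 0:
        return f"{two_digits(first)} hundred"
    if second < 10:
        return f"{two_digits(first)} oh {ONES[second]}"
    return f"{two_digits(first)} {two_digits(second)}"


def expand(token: str, kind: str) -> str:
    """Expand a numeric token given its (assumed known) kind."""
    if kind == "year":
        return year(int(token))
    if kind == "measure":
        return cardinal(int(token))
    if kind == "phone":
        return " ".join(ONES[int(d)] for d in token if d.isdigit())
    raise ValueError(f"unknown kind: {kind}")


print(expand("1750", "year"))      # seventeen fifty
print(expand("1750", "measure"))   # one thousand seven hundred and fifty
print(expand("233 4488", "phone")) # two three three four four eight eight
```

The hard part in practice is choosing the kind from context, which this sketch deliberately leaves out.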

• Any text that has a special pronunciation is stored in a lexicon:
– Abbreviations (Mr, Dr, St)
– Acronyms (UN, spelled letter by letter, versus UNESCO, read as a word)
– Special symbols (&, %)
– Particular conventions (£5, $5 million, 12°C)
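A lexicon of this kind can be approximated with plain dictionaries plus a couple of patterns for the written conventions. The entries and the expand_token and normalize names below are illustrative assumptions, not a real lexicon; note that "St" is mapped to "saint" here although it can equally mean "street", which is exactly the kind of ambiguity a real system has to resolve from context.

```python
import re

# Hypothetical toy lexicon entries.
ABBREVIATIONS = {"Mr": "mister", "Dr": "doctor", "St": "saint"}
SPELLED_ACRONYMS = {"UN"}          # read letter by letter
SYMBOLS = {"&": "and", "%": "percent"}


def expand_token(tok: str) -> str:
    core = tok.rstrip(".")         # accept both "Dr" and "Dr."
    if core in ABBREVIATIONS:
        return ABBREVIATIONS[core]
    if core in SPELLED_ACRONYMS:
        return " ".join(core)      # "UN" -> "U N"
    if tok in SYMBOLS:
        return SYMBOLS[tok]
    m = re.fullmatch(r"£(\d+)", tok)
    if m:
        return f"{m.group(1)} pounds"      # singular "1 pound" not handled
    m = re.fullmatch(r"(\d+)°C", tok)
    if m:
        return f"{m.group(1)} degrees Celsius"
    return tok                     # anything else passes through unchanged


def normalize(text: str) -> str:
    return " ".join(expand_token(t) for t in text.split())


print(normalize("Dr. Smith & Mr Jones paid £5 at 12°C"))
# doctor Smith and mister Jones paid 5 pounds at 12 degrees Celsius
print(normalize("UN and UNESCO"))
# U N and UNESCO
```

The digits left in the output would then go through number expansion as a separate step.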

Grapheme-to-phoneme conversion
• It is the conversion of input text into a linguistic representation.
• English spelling is complex but largely regular; other languages vary in how regular their spelling is.
• Gross exceptions must be in the lexicon.
• Lexicon features:
– Look-up should be quick.
– Rules are needed anyway for unknown words.
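The lexicon-plus-rules idea can be sketched as a dictionary look-up with a rule-based fallback. The exceptions, the ARPAbet-like phoneme symbols and the tiny letter-to-sound rule set below are toy assumptions chosen for illustration; real rule sets are far larger and context-sensitive.

```python
# Exceptions lexicon: words whose pronunciation the rules would get wrong.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "of": ["AH", "V"],
}

# Toy letter-to-sound rules: one default phoneme per letter ...
_LETTERS = "abcdefghijklmnopqrstuvwxyz"
_PHONES = "AE B K D EH F G HH IH JH K L M N AA P K R S T AH V W KS Y Z".split()
RULES = {letter: [phone] for letter, phone in zip(_LETTERS, _PHONES)}
RULES["x"] = ["K", "S"]
# ... plus a few digraphs, which take priority over single letters.
RULES.update({"sh": ["SH"], "ch": ["CH"], "th": ["TH"],
              "ee": ["IY"], "oo": ["UW"]})


def letter_to_sound(word: str) -> list:
    """Apply the rules left to right, trying the longest grapheme first."""
    phones, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in RULES:
                phones.extend(RULES[chunk])
                i += size
                break
        else:
            i += 1                 # no rule for this character: skip it
    return phones


def g2p(word: str) -> list:
    word = word.lower()
    if word in EXCEPTIONS:         # dictionary look-up is a fast hash access
        return EXCEPTIONS[word]
    return letter_to_sound(word)   # fallback for unknown words


print(g2p("two"))    # ['T', 'UW']
print(g2p("sheet"))  # ['SH', 'IY', 'T']
```

Checking the lexicon first keeps the rule set small, since the rules only need to cover the regular part of the spelling system.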

• Much easier for some languages (Spanish, Italian, Welsh, Czech, Korean).
• Much harder for others (English, French).
• Especially if the writing system is only partially alphabetic (Arabic, Urdu).
• Or not alphabetic at all (Chinese, Japanese).

Prosody modelling
The voice parameters affected by emotions are usually categorized into three main types:
• Voice quality: largely constant voice characteristics over the spoken utterance, such as loudness and breathiness.
• Pitch contour: the pitch contour and its dynamic changes carry important emotional information.
• Time characteristics: the general rhythm, speech rate, the lengthening and shortening of stressed syllables, the length of content words, and the duration and placing of pauses.

• The secondary emotional states are:
– Anger: the voice is very breathy and has tense articulation with abrupt changes.
– Happiness or joy: the voice is breathy and light without tension.
– Fear or anxiety: articulation is precise, the voice is irregular, and energy at lower frequencies is reduced.
– Sadness or sorrow: the articulation precision and the speech rate are decreased.
– Disgust or contempt: the average pitch level and the speech rate are lower than in normal speech, and the number of pauses is high.
– Whispering and shouting: whispering is produced by speaking with high breathiness and without fundamental frequency. Shouted speech has an increased pitch range and intensity, with greater variability.

Acoustic synthesis
• Methods, techniques and algorithms:
– Articulatory synthesis
– Formant synthesis
– Concatenative synthesis: PSOLA method, microphonemic method, linear prediction based methods, sinusoidal models

Articulatory synthesis
• Refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there.
• Wolfgang von Kempelen and others used bellows, reeds and tubes to construct mechanical speaking machines.
• Modern versions simulate electronically the effect of articulator positions, vocal tract shape, etc.

Formant synthesis
• A formant is an acoustic resonance of the human vocal tract.
• Probably the most widely used synthesis method during the last decades.
• Synthesised speech output is created using additive synthesis and an acoustic model.
• SoftVoice synthesizers simulate the human speech production mechanism using digital oscillators, noise sources, and filters (formant resonators), much like an electronic music synthesizer.
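The oscillator-plus-resonator idea can be sketched with an impulse train passed through a cascade of second-order digital resonators, one per formant. This is a simplified illustration, not the SoftVoice design: it covers only the voiced source (no noise source), the resonator coefficients follow the commonly used two-pole form, and the formant frequencies and bandwidths are rough illustrative values for an /a/-like vowel.

```python
import math


def resonator(signal, freq, bandwidth, rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / rate
    c = -math.exp(-2 * math.pi * bandwidth * t)
    b = 2 * math.exp(-math.pi * bandwidth * t) * math.cos(2 * math.pi * freq * t)
    a = 1 - b - c                      # unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0, duration, rate):
    """Voiced source: one unit impulse per pitch period."""
    period = int(rate / f0)
    n = round(duration * rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0=120.0, duration=0.3, rate=16000):
    """Filter the source through one resonator per (frequency, bandwidth)."""
    signal = impulse_train(f0, duration, rate)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalize to the range [-1, 1]


# Illustrative formant values in Hz (assumed, not measured).
samples = synthesize_vowel([(700, 80), (1200, 90), (2600, 120)])
print(len(samples))  # 4800
```

Changing the formant list changes the vowel quality, and changing f0 changes the pitch, which is what makes this family of methods easy to control by rule.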

Formant synthesis demo: Microsoft Windows
• In Control Panel select the "Speech" icon.
• Type in your text and preview the voice.
• You may have a choice of voices.

Concatenative synthesis
• Concatenates segments of pre-recorded natural human speech.
• Requires a database or lexicon of previously recorded human speech covering all the possible segments to be synthesised.
• A segment might be a phoneme, syllable, word, phrase, or any combination.
• Diphone segments can be digitally manipulated for length, pitch and loudness.
• Segment boundaries need to be smoothed to avoid distortion.
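Boundary smoothing can be illustrated with a linear crossfade between consecutive segments. This is a minimal sketch of one simple smoothing technique, not the method of any particular system; the inventory contents, the unit names and the crossfade_concat and synthesize functions are hypothetical.

```python
def crossfade_concat(segments, overlap):
    """Join sample lists, blending `overlap` samples at each boundary."""
    out = list(segments[0])
    for seg in segments[1:]:
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)      # weight of the incoming segment rises linearly
            out[start + i] = (1 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])            # append the part that was not blended
    return out


def synthesize(units, inventory, overlap=2):
    """Look up each unit in the recorded inventory and concatenate."""
    return crossfade_concat([inventory[u] for u in units], overlap)


# Toy inventory: two constant "recordings" make the blend easy to see.
inventory = {"a": [1.0] * 4, "b": [0.0] * 4}
print(synthesize(["a", "b"], inventory))
# length 6: the two middle samples step down from 1.0 towards 0.0
```

Without the overlap the signal would jump from 1.0 to 0.0 in a single sample, which is the kind of discontinuity heard as a click.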

Concatenative synthesis methods
• PSOLA (Pitch Synchronous Overlap Add)
– This algorithm concatenates segments smoothly and provides good control of pitch and duration.
– It is used in commercial synthesis systems.
– Time-domain PSOLA is the most commonly used variant due to its computational efficiency.
• Micro-phoneme method
– The concatenation is made by linear amplitude-based interpolation between the prototypes.

• Linear prediction based methods
– Originally designed for speech coding systems, but also used for speech synthesis.
– The covariance and autocorrelation methods are used to estimate the predictor coefficients.
• Sinusoidal models
– Based on the assumption that the voice signal can be represented as a sum of sine waves with time-varying amplitudes and frequencies.
– Sinusoidal models have been used successfully in singing voice synthesis using a MIDI interface.
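The sinusoidal assumption can be written down directly: each component has its own amplitude and frequency track, and the phase of each component is accumulated from its instantaneous frequency. The sketch below is an illustration under assumed values; the sinusoidal_synthesis and harmonic names, the gliding 200 Hz fundamental and the decay rate are all hypothetical.

```python
import math


def sinusoidal_synthesis(tracks, n_samples, rate):
    """Sum sinusoids given (amplitude(t), frequency(t)) function pairs."""
    out = [0.0] * n_samples
    for amp_fn, freq_fn in tracks:
        phase = 0.0
        for n in range(n_samples):
            t = n / rate
            out[n] += amp_fn(t) * math.sin(phase)
            # Advance the phase by the instantaneous frequency at time t.
            phase += 2 * math.pi * freq_fn(t) / rate
    return out


def harmonic(k):
    """k-th harmonic of a fundamental gliding upwards from 200 Hz."""
    amplitude = lambda t: math.exp(-3 * t) / k   # decaying, weaker for higher k
    frequency = lambda t: k * (200 + 20 * t)     # slow upward glide in Hz
    return amplitude, frequency


rate = 8000
signal = sinusoidal_synthesis([harmonic(k) for k in (1, 2, 3)], rate, rate)
print(len(signal))  # 8000
```

Accumulating the phase, rather than computing sin(2πf(t)t) directly, keeps the waveform continuous when the frequency changes over time.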

Speech synthesis demo
http://www.research.att.com/~ttsweb/tts/demo.php
http://cepstral.com/demos/

APPLICATIONS
Applications for the blind
• Used as a reading and communication aid for the blind.
• Current systems are mostly software based and can be combined with a scanner and an OCR (optical character recognition) system.
Applications for the deafened and vocally handicapped
• Provide an opportunity to communicate with people who do not understand sign language.
• HAMLET helps users to express their feelings.
• The HAMLET system is used with a high-quality TTS system such as DECtalk.

Educational applications
• Can be programmed for special tasks such as teaching spelling and pronunciation for different languages.
• A speech synthesizer connected to a word processor is helpful for proofreading.
Applications for telecommunication and multimedia
• Synthesized speech is used in all kinds of telephone enquiry systems.
• VoiceXML: Internet surfing using voice.

PRODUCTS
• INFOVOX: the INFOVOX speech synthesizer is perhaps one of the best-known multilingual TTS products. The latest full commercial version available is INFOVOX IVOX.

• DECtalk: available for American English, Spanish and German, in nine different voice personalities: four female, four male and one child.

• Bell Labs Text-to-Speech: available in English, French, Spanish, Italian, German, Russian, Romanian, Chinese and Japanese.
http://www.research.att.com/~ttsweb/tts/demo.php

• SoftVoice: SoftVoice is best known for the SAM (Software Automatic Mouth) synthesizer for Apple (MacinTalk), Amiga and Atari computers. The fifth-generation SoftVoice is also available for Windows in 20 different languages.
• CNET PSOLA: one of the promising methods for concatenative synthesis, developed by France Telecom's CNET (Centre National d'Études des Télécommunications).

• Apple PlainTalk: Apple developed three different speech synthesis systems for Macintosh computers.

• Microsoft Whistler: Whistler (Whisper Highly Intelligent Stochastic Talker) is a trainable speech synthesis system under development at Microsoft Research, Redmond, USA. The system is designed to produce synthetic speech that sounds natural and resembles the acoustic and prosodic characteristics of the original speaker.

THANK YOU