
Speech synthesis

Uploaded by rajan-r


Page 1: Speech synthesis

Speech Synthesis

Page 2: Speech synthesis

Speech synthesis

• Speech synthesis is the artificial production of human speech.
• The computer or instrument used for this purpose is called a speech synthesizer.
• Text-to-speech (TTS) synthesis is the production of speech from normal language text.

Simple text-to-speech synthesis: input text -> text and linguistic analysis -> phonetic levels -> prosody and speech generation -> synthesised speech

Page 3: Speech synthesis

Stephen Hawking

• Suffering from the motor neuron disease ALS.
• Lost his ability to speak.
• The first computer-based speech system was provided by Intel®.
• The main interface program is EZ Keys, written by Words Plus Inc.
• The cursor is controlled by cheek movements, detected by an IR sensor mounted on his spectacles.
• The words formed are sent to a speech synthesiser, hardware made by Speech+.

Page 4: Speech synthesis

Stephen Hawking

• Speech synthesiser voice output can also be stored.
• Current configuration:
  • Lenovo ThinkPad X220 tablet (2 copies).
  • Intel® Core™ i7-2620M CPU @ 2.7GHz.
  • Intel® 150Gb Solid-State Drive 520 Series.
  • Windows 7.
• Speech synthesizers (3 copies): manufacturer Speech+ CA.

Page 5: Speech synthesis

History of speech synthesizers

• The first device to be considered a speech synthesiser was the VODER, introduced by Homer Dudley in 1939 at the New York World's Fair.
• The first formant synthesizer, PAT (Parametric Artificial Talker), was introduced by Lawrence in 1953.

Page 6: Speech synthesis

Architecture of TTS systems

[Figure: block diagram of a TTS system. Text-to-phoneme module: text input (text in orthographic form), normalization, grapheme-to-phoneme conversion, phoneme string. Phoneme-to-speech module: prosodic modelling, phoneme string + prosodic annotation, acoustic synthesis, synthetic speech output. Resources shown: abbreviation lexicon, exceptions lexicon, orthographic rules, grammar rules, prosodic model, various synthesis methods.]

Page 7: Speech synthesis

Challenges in speech synthesis

• Text-to-phoneme conversion
  – The conversion of input text into a linguistic representation, also called grapheme-to-phoneme conversion.
• Text processing
  – Digits, numerals, fractions, dates, and abbreviations are expanded into full words.
• Pronunciation
  – The next task is to find the correct pronunciation.
  – Homographs must be pronounced correctly.

Page 8: Speech synthesis

Challenges in speech synthesis

• Prosody
  – Finding the correct intonation, stress, and duration for written text.

Page 9: Speech synthesis

Text normalization

• Text processing
  – Digits, numerals, fractions, dates, and abbreviations are expanded into full words.
• Examples:
  – 1750 would be expanded as "seventeen fifty" (if a year) or "one thousand seven hundred and fifty" (if a measure).
  – 5/13 would be expanded as "five thirteenths" (if a fraction) or "May thirteenth" (if a date).
  – Numbers are especially difficult, e.g. 233 4488.
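The two readings of 1750 on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not a full normalizer: the names `two_digits`, `cardinal` and `year` are hypothetical, only values from 0 to 9999 are handled, and deciding whether a token is a year or a measure is left to the caller.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out 0..9999 as a quantity, British style with 'and'."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, tail = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if tail or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(tail))
    return " ".join(parts)


def year(n):
    """Spell out a four-digit year as two two-digit groups."""
    high, low = divmod(n, 100)
    if low == 0:
        # 1900 -> "nineteen hundred", but 2000 -> "two thousand"
        return cardinal(n) if high % 10 == 0 else two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


print(year(1750))      # seventeen fifty
print(cardinal(1750))  # one thousand seven hundred and fifty
```

The hard part in practice is the context decision, which this sketch deliberately leaves out.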

Page 10: Speech synthesis

Text normalization

• Any text that has a special pronunciation is stored in a lexicon:
  – Abbreviations (Mr, Dr, St)
  – Acronyms (UN is spelled out, but UNESCO is read as a word)
  – Special symbols (&, %)
  – Particular conventions (£5, $5 million, 12°C)
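A lexicon of this kind can be sketched as a plain dictionary look-up. The entries, the spoken forms and the function name `expand_special` below are illustrative assumptions. Ambiguous entries such as "St" (Saint or Street) need context and are left out, and the currency rule ignores singular forms.

```python
import re

# Hypothetical mini-lexicon; a real system would hold thousands of entries.
LEXICON = {
    "Mr": "mister",
    "Dr": "doctor",
    "&": "and",
    "%": "percent",
    "UN": "you en",       # acronym spelled out letter by letter
    "UNESCO": "unesco",   # acronym read as a word
}


def expand_special(text):
    """Replace lexicon entries and one currency convention with words."""
    text = re.sub(r"£(\d+)", r"\1 pounds", text)   # "£5" -> "5 pounds"
    tokens = re.findall(r"\w+|[^\w\s]", text)      # words and single symbols
    return " ".join(LEXICON.get(tok, tok) for tok in tokens)


print(expand_special("Dr Smith & Mr Jones"))  # doctor Smith and mister Jones
```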

Page 11: Speech synthesis

Grapheme-to-phoneme conversion

• The conversion of input text into a linguistic representation.
• English spelling is complex but largely regular; other languages are more or less regular than English.
• Gross exceptions must be in the lexicon.
• Lexicon features:
  – Look-up should be quick.
  – Rules are still needed for unknown words.
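The combination of an exceptions lexicon with fallback rules can be sketched as follows. The rule table is a toy: the ARPAbet-like symbols, the three lexicon entries and the function name `to_phonemes` are assumptions for illustration, and real letter-to-sound rules are context dependent.

```python
# Hypothetical exceptions lexicon: words the rules would get wrong.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "colonel": ["K", "ER", "N", "AH", "L"],
}

# Toy letter-to-sound rules; two-letter graphemes are tried first.
RULES = {
    "sh": "SH", "ch": "CH", "th": "TH", "ee": "IY", "oo": "UW",
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}


def to_phonemes(word):
    """Look the word up in the lexicon, else apply the rules left to right."""
    word = word.lower()
    if word in EXCEPTIONS:                 # quick dictionary look-up
        return list(EXCEPTIONS[word])
    phonemes, i = [], 0
    while i < len(word):
        for size in (2, 1):                # longest match first
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.extend(RULES[chunk].split())
                i += len(chunk)
                break
        else:
            i += 1                         # no rule: skip the character
    return phonemes


print(to_phonemes("one"))    # ['W', 'AH', 'N']  (from the lexicon)
print(to_phonemes("sheet"))  # ['SH', 'IY', 'T'] (from the rules)
```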

Page 12: Speech synthesis

Grapheme-to-phoneme conversion

• Much easier for some languages (Spanish, Italian, Welsh, Czech, Korean).
• Much harder for others (English, French).
• Especially hard if the writing system is only partially alphabetic (Arabic, Urdu).
• Or not alphabetic at all (Chinese, Japanese).

Page 13: Speech synthesis

Prosody modelling

The voice parameters affected by emotions are usually categorized into three main types:

• Voice quality covers characteristics that stay largely constant over the spoken utterance, such as loudness and breathiness.
• Pitch contour and its dynamic changes carry important emotional information.
• Time characteristics cover the general rhythm, the speech rate, the lengthening and shortening of stressed syllables, the length of content words, and the duration and placing of pauses.

Page 14: Speech synthesis

Prosody modelling

• Emotional states and their effect on the voice:
  – Anger: the voice is very breathy and has tense articulation with abrupt changes.
  – Happiness or joy: the voice is breathy and light, without tension.
  – Fear or anxiety: articulation is precise, the voice is irregular, and energy at lower frequencies is reduced.
  – Sadness or sorrow: articulation precision and speech rate are decreased.
  – Disgust or contempt: the average pitch level and the speech rate are lower than in normal speech, and the number of pauses is high.
  – Whispering and shouting: whispering is produced by speaking with high breathiness and without fundamental frequency. Shouting causes an increased pitch range and intensity, with greater variability in them.

Page 15: Speech synthesis

Acoustic synthesis

• Methods, techniques and algorithms:
  – Articulatory synthesis
  – Formant synthesis
  – Concatenative synthesis: PSOLA method, microphonemic method, linear prediction based methods, sinusoidal models

Page 16: Speech synthesis

Articulatory synthesis

• Refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there.
• Wolfgang von Kempelen and others used bellows, reeds and tubes to construct mechanical speaking machines.
• Modern versions simulate electronically the effect of articulator positions, vocal tract shape, etc.

Page 17: Speech synthesis

Formant synthesis

• A formant is an acoustic resonance of the human vocal tract.
• Probably the most widely used synthesis method during the last decades.
• Synthesised speech output is created using additive synthesis and an acoustic model.
• SoftVoice synthesizers simulate the human speech production mechanism using digital oscillators, noise sources, and filters (formant resonators), much like an electronic music synthesizer.
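The source-plus-resonators idea can be sketched with a pulse train passed through a cascade of two-pole digital resonators, one per formant. This is a simplified illustration, not the method of any product named in these slides: the function names are hypothetical, and the formant frequencies and bandwidths in the usage line are rough textbook-style values for an "ah"-like vowel.

```python
import math


def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole digital resonator modelling a single formant."""
    r = math.exp(-math.pi * bandwidth / sample_rate)          # pole radius
    b = 2 * r * math.cos(2 * math.pi * freq / sample_rate)    # y[n-1] weight
    c = -r * r                                                # y[n-2] weight
    a = 1 - b - c                                             # unity gain at DC
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0, duration, sample_rate):
    """Crude voiced source: one unit pulse per pitch period."""
    period = round(sample_rate / f0)
    count = round(duration * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(count)]


def synthesize_vowel(formants, f0=120, duration=0.3, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0, duration, sample_rate)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    return signal


# Illustrative formant values in Hz (assumed, not taken from the slides).
samples = synthesize_vowel([(700, 130), (1220, 70), (2600, 160)])
```

Changing the formant list changes the vowel quality, while `f0` changes the pitch; a noise source in place of the pulse train would give unvoiced sounds.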

Page 18: Speech synthesis

Formant synthesis demo: Microsoft Windows

• In Control Panel, select the “Speech” icon.
• Type in your text and preview the voice.
• You may have a choice of voices.

Page 19: Speech synthesis

Concatenative synthesis

• Concatenates segments of pre-recorded natural human speech.
• Requires a database or lexicon of previously recorded human speech covering all the possible segments to be synthesised.
• A segment might be a phoneme, syllable, word, phrase, or any combination.
• Diphone segments can be digitally manipulated for length, pitch and loudness.
• Segment boundaries need to be smoothed to avoid distortion.
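The look-up and boundary smoothing described on this slide can be sketched as below. Everything here is a hypothetical illustration: `database` stands for a mapping from diphone names to lists of samples, "_" marks silence, and a linear cross-fade is only one simple way to smooth a joint.

```python
def diphone_names(phonemes):
    """Turn a phoneme list into diphone names, padded with silence."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]


def crossfade_concat(segments, overlap):
    """Join sample lists, cross-fading `overlap` samples at each joint."""
    out = list(segments[0])
    for seg in segments[1:]:
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)          # fade-in weight of the new segment
            out[start + i] = (1 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])
    return out


def synthesize(phonemes, database, overlap=2):
    """Look up each diphone recording and join them smoothly."""
    recordings = [database[name] for name in diphone_names(phonemes)]
    return crossfade_concat(recordings, overlap)


print(diphone_names(["k", "a", "t"]))  # ['_-k', 'k-a', 'a-t', 't-_']
```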

Page 20: Speech synthesis

Concatenative synthesis methods

• PSOLA (Pitch Synchronous Overlap Add)
  – This algorithm concatenates segments smoothly and provides good control of pitch and duration.
  – It is used in commercial synthesis systems.
  – Time-domain PSOLA is the most commonly used version due to its computational efficiency.
• Micro-phoneme method
  – Concatenation is done by linear amplitude-based interpolation between the prototypes.
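A heavily simplified time-domain PSOLA pitch change can be sketched as follows. It assumes a constant, known pitch period with pitch marks at every multiple of `period` samples; a real system has to estimate the pitch marks from the recording. The names `hann` and `psola_repitch` are hypothetical.

```python
import math


def hann(length):
    """Symmetric Hann window: zero at both ends, one in the middle."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def psola_repitch(signal, period, new_period):
    """Change the pitch period of `signal` while keeping its duration.

    Two-period grains centred on the (assumed) analysis pitch marks are
    re-spaced `new_period` samples apart and overlap-added.
    """
    window = hann(2 * period + 1)
    last_mark = (len(signal) - 1) // period - 1   # last mark with a full grain
    out = [0.0] * len(signal)
    norm = [0.0] * len(signal)                    # summed window weights
    t = period                                    # first synthesis mark
    while t + period < len(signal):
        # Use the grain around the analysis mark nearest to the output time.
        j = min(max(round(t / period), 1), last_mark)
        centre = j * period
        for k in range(-period, period + 1):
            w = window[k + period]
            out[t + k] += w * signal[centre + k]
            norm[t + k] += w
        t += new_period
    return [o / n if n > 1e-9 else 0.0 for o, n in zip(out, norm)]


# Toy voiced signal: one damped oscillation per 80-sample pitch period.
voiced = [math.exp(-(n % 80) / 20) * math.sin(2 * math.pi * (n % 80) / 16)
          for n in range(800)]
higher = psola_repitch(voiced, period=80, new_period=60)   # raises the pitch
```

Because each grain keeps its original shape, the resonance inside a period is preserved while the spacing between periods changes, which is why the method can alter pitch without re-recording.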

Page 21: Speech synthesis

Concatenative synthesis methods

• Linear prediction based methods
  – Originally designed for speech coding systems, but also used for speech synthesis.
  – The covariance and autocorrelation methods are used to estimate the predictor coefficients.
• Sinusoidal models
  – Based on the assumption that a voice signal can be represented as a sum of sine waves with time-varying amplitudes and frequencies.
  – Sinusoidal models have been used successfully in singing voice synthesis with a MIDI interface.
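The sum-of-sine-waves assumption can be written down directly. In this sketch each partial is a pair of functions giving amplitude and frequency at time t in seconds; the function name `sinusoidal_synth` and the example partials are assumptions for illustration.

```python
import math


def sinusoidal_synth(partials, num_samples, sample_rate):
    """Sum sine waves whose amplitude and frequency vary over time.

    `partials` is a list of (amp_fn, freq_fn) pairs; each function maps
    time in seconds to an amplitude or to a frequency in Hz.
    """
    out = [0.0] * num_samples
    for amp_fn, freq_fn in partials:
        phase = 0.0
        for n in range(num_samples):
            t = n / sample_rate
            out[n] += amp_fn(t) * math.sin(phase)
            # Integrate the instantaneous frequency to get the phase.
            phase += 2 * math.pi * freq_fn(t) / sample_rate
    return out


# A one-second glide from 220 Hz to 440 Hz with a fading octave above it.
glide = sinusoidal_synth(
    [(lambda t: 0.6, lambda t: 220 + 220 * t),
     (lambda t: 0.3 * (1 - t), lambda t: 440 + 440 * t)],
    num_samples=8000, sample_rate=8000)
```

Accumulating the phase, rather than computing sin(2πft) directly, keeps the waveform continuous when the frequency changes.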

Page 22: Speech synthesis

Speech synthesis demo

Page 23: Speech synthesis

Speech synthesis demo

Page 24: Speech synthesis

APPLICATIONS

• Applications for the blind
  – Used as a reading and communication aid for the blind.
  – Current systems are mostly software based, so they can be combined with a scanner and an OCR (optical character recognition) system.
• Applications for the deafened and vocally handicapped
  – Provides an opportunity to communicate with people who do not understand sign language.
  – HAMLET helps users to express their feelings.
  – The HAMLET system is used with a high-quality TTS system such as DECtalk.

Page 25: Speech synthesis

APPLICATIONS

• Educational applications
  – Can be programmed for special tasks such as teaching spelling and pronunciation in different languages.
  – A speech synthesizer connected to a word processor is helpful for proofreading.
• Applications for telecommunications and multimedia
  – Synthesized speech is used in all kinds of telephone enquiry systems.
  – VoiceXML: Internet surfing using voice.

Page 26: Speech synthesis

PRODUCTS

• INFOVOX
  – The INFOVOX speech synthesizer is perhaps one of the best-known multilingual TTS products.
  – The latest full commercial version available is INFOVOX IVOX.

Page 27: Speech synthesis

PRODUCTS

• DECtalk
  – Available for American English, Spanish and German.
  – Available in nine different voice personalities: four female, four male and one child.

Page 28: Speech synthesis

PRODUCTS

• Bell Labs Text-to-Speech
  – Available in English, French, Spanish, Italian, German, Russian, Romanian, Chinese and Japanese.

Page 29: Speech synthesis

PRODUCTS

• SoftVoice
  – SoftVoice is best known for the SAM (Software Automatic Mouth) synthesizer for Apple (MacinTalk), Amiga and Atari computers.
  – The fifth-generation SoftVoice is also available for Windows in 20 different languages.
• CNET PSOLA
  – One of the most promising methods for concatenative synthesis, developed by France Telecom's CNET (Centre National d'Études des Télécommunications).

Page 30: Speech synthesis

PRODUCTS

• Apple PlainTalk
  – Apple developed three different speech synthesis systems for Macintosh personal computers.

Page 31: Speech synthesis

PRODUCTS

• Microsoft Whistler
  – Microsoft Whistler (Whisper Highly Intelligent Stochastic TaLkER) is a trainable speech synthesis system under development at Microsoft Research, Redmond, USA.
  – The system is designed to produce synthetic speech that sounds natural and resembles the acoustic and prosodic characteristics of the original speaker.

Page 32: Speech synthesis

THANK YOU