
5- Speech Synthesis

  • Upload
    amadis

Page 1: 5- Speech Synthesis

5- Speech Synthesis

Speech Synthesis Concept

Phone Units

Phone Sequence to Speech

Speech Naturalness
– Concatenative Approaches
– Rule-Based Approaches

Page 2: 5- Speech Synthesis

Speech Synthesis Concept

Text → [Text to Phone Sequence] → [Phone Sequence to Speech] → Speech

Text to Phone Sequence: Natural Language Processing (NLP)

Phone Sequence to Speech: Speech Processing

Page 3: 5- Speech Synthesis

Phone Units

Paragraph ( )

Sentence ( )

Word (depends on the language; usually more than 100,000)

Syllable

Diphone and triphone

Phoneme (between 10 and 100)

Page 4: 5- Speech Synthesis

Phone Units (Cont’d)

Diphone: we model the transition between two phonemes.

p1 p2 p3 p4 p5 . . .

[Figure: the phoneme sequence above with one Phoneme span and one Diphone span marked]
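To make the diphone idea concrete, here is a minimal sketch that pairs each phoneme with its successor, so that every pair stands for one transition. The function name `to_diphones` and the labels `p1`...`p5` are illustrative assumptions, not part of the slides.

```python
def to_diphones(phonemes):
    """Return adjacent phoneme pairs, one pair per phoneme-to-phoneme transition."""
    return list(zip(phonemes, phonemes[1:]))


phones = ["p1", "p2", "p3", "p4", "p5"]
diphones = to_diphones(phones)
print(diphones)
```

A sequence of n phonemes yields n - 1 diphones under this simple pairing.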

Page 5: 5- Speech Synthesis

Phone Units (Cont’d)Phone Units (Cont’d)

In farsi we have 30 Phoneme. so we have In farsi we have 30 Phoneme. so we have 30*30 Diphone Theoretically.30*30 Diphone Theoretically.

Practically the only Diphone that we don’t Practically the only Diphone that we don’t have in farsi is have in farsi is /zho/ /zho/

we have 27000 Triphone Theoretically. we have 27000 Triphone Theoretically. But practically we have about 15000 But practically we have about 15000 Triphone in farsi.Triphone in farsi.
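The inventory sizes above follow from counting all ordered sequences of phonemes. A small sketch of that arithmetic, where the helper name `theoretical_units` is an assumption and the observed triphone count is the approximate figure quoted on the slide:

```python
def theoretical_units(n_phonemes, unit_length):
    """Number of ordered phoneme sequences of the given length."""
    return n_phonemes ** unit_length


N_FARSI_PHONEMES = 30          # figure quoted on the slide
OBSERVED_TRIPHONES = 15000     # approximate figure quoted on the slide

n_diphones = theoretical_units(N_FARSI_PHONEMES, 2)
n_triphones = theoretical_units(N_FARSI_PHONEMES, 3)
triphone_coverage = OBSERVED_TRIPHONES / n_triphones

print(n_diphones, n_triphones, round(triphone_coverage, 2))
```

Under these figures, a little more than half of the theoretical triphones actually occur.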

Page 6: 5- Speech Synthesis

Phone Units (Cont’d)

Syllable = Onset (consonant) + Rhyme

A syllable is a set of phonemes that contains exactly one vowel.

Syllables in Farsi: CV, CVC, CVCC

Farsi has about 4,000 syllables.

Syllables in English: V, CV, CVC, CCVC, CCVCC, CCCVC, CCCVCC, . . .

The number of syllables in English is very large.
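The syllable templates can be checked mechanically by mapping each phoneme to C or V. The sketch below assumes a hypothetical ASCII vowel set and hypothetical function names; only the three Farsi patterns come from the slide.

```python
FARSI_PATTERNS = {"CV", "CVC", "CVCC"}   # patterns listed on the slide

# Hypothetical ASCII stand-ins for vowels, for illustration only.
EXAMPLE_VOWELS = {"a", "e", "o", "i", "u"}


def cv_pattern(phonemes, vowels):
    """Map a phoneme sequence to its consonant/vowel pattern, e.g. 'CVC'."""
    return "".join("V" if p in vowels else "C" for p in phonemes)


def is_farsi_syllable(phonemes, vowels=EXAMPLE_VOWELS):
    """True if the sequence matches one of the Farsi syllable patterns."""
    return cv_pattern(phonemes, vowels) in FARSI_PATTERNS


print(cv_pattern(["d", "a", "s", "t"], EXAMPLE_VOWELS))   # CVCC
print(is_farsi_syllable(["s", "t", "a", "r"]))            # CCVC, not a Farsi pattern
```

Because every accepted pattern contains a single V, the check also enforces the one-vowel rule.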

Page 7: 5- Speech Synthesis

Phone Sequence to Speech

Concatenative approaches: a trade-off between naturalness, memory usage, and amount of computation.

Rule-based approaches: the most important rule-based approach is the Klatt method.

Page 8: 5- Speech Synthesis

Phone Sequence to Speech (Cont’d)

Text → [Text to Phone Sequence] → [Phone Sequence to Primitive Utterance] → [Primitive Utterance to Natural Speech] → Speech

Text to Phone Sequence: NLP

Phone Sequence to Primitive Utterance, Primitive Utterance to Natural Speech: Speech Processing

Page 9: 5- Speech Synthesis

Speech Naturalness

Removal of undesirable noise, distortion, and discontinuities from the speech

Prosody generation
– Speech energy
– Duration
– Intonation
– Stress

Page 10: 5- Speech Synthesis

Speech Naturalness (Cont’d)

Intonation and stress have a strong effect on speech naturalness.

Intonation: variation of the pitch frequency over the course of an utterance

Stress: an increase in the pitch frequency at a specific time
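The two definitions can be pictured as a pitch (F0) contour: a slowly varying baseline for intonation plus a local rise for stress. The sketch below is only an illustration; the linear baseline, the triangular stress bump, and all parameter names and values are assumptions, not a method given in the slides.

```python
def f0_contour(n_frames, start_hz, end_hz, stress_frame, stress_boost_hz, stress_half_width):
    """Linear pitch baseline (intonation) plus a triangular rise (stress).

    Requires n_frames >= 2 and stress_half_width > 0.
    """
    contour = []
    for i in range(n_frames):
        # Intonation: pitch moves linearly from start_hz to end_hz.
        baseline = start_hz + (end_hz - start_hz) * i / (n_frames - 1)
        # Stress: local pitch increase, largest at stress_frame.
        distance = abs(i - stress_frame)
        bump = stress_boost_hz * max(0.0, 1.0 - distance / stress_half_width)
        contour.append(baseline + bump)
    return contour


contour = f0_contour(n_frames=11, start_hz=120.0, end_hz=100.0,
                     stress_frame=5, stress_boost_hz=20.0, stress_half_width=2)
print([round(f, 1) for f in contour])
```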

Page 11: 5- Speech Synthesis

Concatenative Approaches

In these approaches we store units of natural speech and use them to reconstruct the desired speech.

We can select the appropriate phone unit for speech synthesis.

We can store compressed parameters instead of the main (uncompressed) waveform.

Page 12: 5- Speech Synthesis

Concatenative Approaches (Cont’d)

Benefits of storing compressed parameters instead of the main waveform
– Less memory use
– General state instead of a specific stored utterance
– Generating prosody easily

Page 13: 5- Speech Synthesis

Concatenative Approaches (Cont’d)

Phone Unit    Type of Storing
Paragraph     Main waveform
Sentence      Main waveform
Word          Main waveform
Syllable      Coded/main waveform
Diphone       Coded waveform
Phoneme       Coded waveform

Page 14: 5- Speech Synthesis

Concatenative Approaches Concatenative Approaches (Cont’d)(Cont’d)

Pitch Synchronous Overlap-Add (PSOLA) is a well-known method for smoothing the transitions between phonemes.

Overlap-add is a standard DSP method.

PSOLA is a basic building block for voice conversion.

In the analysis stage of this method, we select frames that are synchronous with the pitch marks.
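A minimal sketch of the pitch-synchronous analysis and overlap-add steps is shown below, in pure Python. It assumes the pitch marks are already known (finding them is a separate problem), takes each frame to span from the previous mark to the next one, and uses a Hann window; these choices and all function names are assumptions for illustration rather than the exact procedure from the slides.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / n)) for i in range(n)]


def analysis_frames(signal, pitch_marks):
    """One windowed frame per interior pitch mark, spanning its two neighbours."""
    frames = []
    for prev, nxt in zip(pitch_marks, pitch_marks[2:]):
        window = hann(nxt - prev)
        frames.append([signal[prev + i] * window[i] for i in range(nxt - prev)])
    return frames


def overlap_add(frames, starts, length):
    """Sum the frames into an output buffer at the given start positions."""
    out = [0.0] * length
    for frame, start in zip(frames, starts):
        for i, value in enumerate(frame):
            if 0 <= start + i < length:
                out[start + i] += value
    return out


# Toy signal with a period of 8 samples and pitch marks every 8 samples.
signal = [math.sin(2.0 * math.pi * i / 8) for i in range(40)]
marks = [0, 8, 16, 24, 32]

frames = analysis_frames(signal, marks)
# Re-using the original frame positions reconstructs the overlapped region.
rebuilt = overlap_add(frames, marks[:-2], len(signal))
```

With evenly spaced marks, the overlapping Hann windows sum to one, so `rebuilt` matches `signal` wherever two frames overlap. Placing the same frames at different start positions is how overlap-add schemes of this kind change the pitch period.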

Page 15: 5- Speech Synthesis

Rule-Based Approach Stages

Determine the speech model and its parameters.

Determine the type of phone units.

Determine the parameter values for each phone unit.

Replace the sequence of phone units with its equivalent parameter sequence.

Feed the parameter sequence into the speech model.
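The last three stages can be sketched as a table lookup followed by a call into the model. Everything specific here is hypothetical: the table entries are placeholder values rather than measurements, and `toy_model` is a stand-in that only counts frames instead of producing audio.

```python
# Stage 3: hypothetical parameter table (placeholder values, not measured data).
PARAMETER_TABLE = {
    "m": {"F1": 250.0, "F2": 1200.0, "duration_ms": 60},
    "a": {"F1": 700.0, "F2": 1200.0, "duration_ms": 100},
    "i": {"F1": 300.0, "F2": 2300.0, "duration_ms": 80},
}


def to_parameter_sequence(phones, table):
    """Stage 4: replace each phone unit with its parameter set."""
    return [table[p] for p in phones]


def run_model(parameter_sequence, model):
    """Stage 5: feed the parameter sequence into the speech model."""
    return [model(params) for params in parameter_sequence]


def toy_model(params, frame_ms=5):
    """Stand-in for a real synthesizer: number of frames this unit would occupy."""
    return params["duration_ms"] // frame_ms


parameter_sequence = to_parameter_sequence(["m", "a", "i"], PARAMETER_TABLE)
frames_per_unit = run_model(parameter_sequence, toy_model)
print(frames_per_unit)
```

In a real rule-based system the model would be a synthesizer such as a formant model, and rules would also smooth the parameters across unit boundaries.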

Page 16: 5- Speech Synthesis

KLATT 80 Model

Page 17: 5- Speech Synthesis

KLATT 88 Model

Page 18: 5- Speech Synthesis

[Figure: The KLSYN88 cascade/parallel formant synthesizer (block diagram). Labels recoverable from the figure:]

Glottal sound sources: filtered impulse train, KLGLOTT88 model (default), modified LF model; spectral tilt low-pass resonator; aspiration noise generator. Control parameters: F0, AV, OQ, FL, DI, SQ, SS, TL, AH, AF.

Cascade vocal tract model, laryngeal sound sources: nasal pole-zero pair, tracheal pole-zero pair, first to fifth formant resonators. Parameters: FNP, FNZ, FTP, FTZ, F1, B1, BNP, BNZ, BTP, BTZ, DF1, DB1, F2, B2, F3, B3, F4, B4, F5, B5.

Parallel vocal tract model, laryngeal sound sources (normally not used): first-difference preemphasis; nasal, tracheal, and first to fourth formant resonators. Amplitude controls: ANV, A1V, A2V, A3V, A4V, ATV.

Parallel vocal tract model, frication sound sources: frication noise generator; second to sixth formant resonators; bypass path. Amplitude controls: A2F, A3F, A4F, A5F, A6F, AB. Other labels: B2F, B3F, B4F, B5F, B6F, F6, CP.
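Each formant resonator block in a Klatt-style synthesizer is commonly implemented as a second-order digital filter controlled by a formant frequency and a bandwidth (pairs such as F1 and B1). The sketch below follows the usual difference equation y[n] = A·x[n] + B·y[n-1] + C·y[n-2]; the function names and the example formant values are assumptions, and this is not the full KLSYN88 synthesizer.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients (A, B, C) of a second-order resonator with unity gain at DC."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a signal through one resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, sample_rate_hz):
    """Pass the signal through the resonators in series (cascade configuration)."""
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz)
    return signal


# Illustrative (frequency, bandwidth) pairs in Hz; not taken from the slides.
example_formants = [(500.0, 60.0), (1500.0, 90.0)]
step_response = cascade([1.0] * 5000, example_formants, sample_rate_hz=10000)
```

In the cascade configuration the resonators are connected in series and need no individual amplitude controls, whereas the parallel branches in the figure each carry their own amplitude parameter.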

Page 19: 5- Speech Synthesis

Three Voicing Source Models in KLATT 88

The old KLSYN impulsive source

The KLGLOTT88 model

The modified LF model
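Of the three sources, the impulsive one is the simplest to picture: one impulse per pitch period, which is then filtered. The sketch below generates only the unfiltered impulse train at a constant F0; the function name and the example values are assumptions, and the subsequent filtering is omitted.

```python
def impulse_train(f0_hz, sample_rate_hz, n_samples):
    """Unit impulses spaced one pitch period apart, zeros elsewhere."""
    period = sample_rate_hz / f0_hz      # pitch period in samples
    out = [0.0] * n_samples
    k = 0
    while round(k * period) < n_samples:
        out[round(k * period)] = 1.0
        k += 1
    return out


train = impulse_train(f0_hz=100.0, sample_rate_hz=10000, n_samples=1000)
print(sum(train))   # number of pitch pulses in the buffer
```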