Progress on Bangla Text-To-Speech System Presented By: Dr. M. Shahidur Rahman Professor, Dept. of Computer Science & Engg. Shahjalal University of Science & Technology [email protected]

2

Outline

• Introduction to TTS
• How TTS works
• Present Bangla TTS systems
• Problems of the present Bangla TTS
• Directions to improve the performance of Bangla TTS
• Discussion…

3

What is a TTS?

• The goal of text-to-speech (TTS) synthesis is to convert arbitrary input text into intelligible and natural-sounding speech
– TTS is not a "cut-and-paste" approach that strings together isolated words
– Instead, TTS employs linguistic analysis to infer correct pronunciation and prosody (i.e., NLP) and acoustic representations of speech to generate waveforms (i.e., DSP)

4

TTS Applications

Applications:
• Services for the visually impaired community
• Services for illiterate people and people with difficulties in reading
• Enable use of computers and IT services
– Reading email aloud
– Using a word processor
– Using the Internet

Existing TTS systems:
• Festival
• Bell Labs TTS

5

How TTS Works

6

Different TTS Systems

Phoneme-Based TTS System
• Phonemes are:
– The minimal distinctive phonetic units
– Relatively small in number (39 phonemes in English)
• Disadvantage
– Phonemes ignore transitional sounds

7

Different TTS Systems (cont’d)

Diphone-Based TTS System
• Diphones are:
– Made up of 2 phonemes
– Incorporate transitional sound
– Produce better-sounding speech
– Ex. কক = ক + কঅ + অক + ক
• Disadvantage
– Over 1500 diphones in the English language
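The pairing of neighbouring phonemes into diphones can be sketched as follows. This is a minimal illustration, not the system's actual code: the romanized phoneme labels, the "_" boundary symbol and the function name `to_diphones` are assumptions made here. The last line shows why an inventory of 39 phonemes leads to "over 1500" units: the upper bound on ordered pairs is 39 × 39.

```python
def to_diphones(phonemes, boundary="_"):
    """Pair each phoneme with its successor, padding both ends with a boundary."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [left + "-" + right for left, right in zip(padded, padded[1:])]

# Romanized stand-in for the slide's example (consonant, vowel, consonant).
units = to_diphones(["k", "o", "k"])
print(units)  # half-unit, two transitions, half-unit

# Upper bound on the number of ordered phoneme pairs for 39 phonemes.
max_diphones = 39 * 39
print(max_diphones)
```

Not every ordered pair occurs in a real language, so the usable inventory is normally smaller than this bound.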

8

Text Pre-Processing

• Convert raw text, which may include numbers, abbreviations, etc., into the equivalent of written-out words
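A minimal sketch of this normalization step, assuming a tiny hand-made abbreviation table and digit-by-digit reading of numbers. The table contents and the function name `normalize` are hypothetical, and English words are used only for readability; a real Bangla front end would need full number, date and abbreviation rules.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "Prof.": "Professor"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Expand known abbreviations, then spell out every digit."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)

    def spell(match):
        # Simplification: read digit by digit, not as a full number.
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group())

    return re.sub(r"\d+", spell, text)

print(normalize("Dr. Rahman has 30 slides"))
```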

9

Word to Diphone Converter (Phonetization)

Purpose
• Translate words to their diphone representations (Ex. রাজা -> Diphones: { র + রআ + আজ + জআ })
• Mark the text into prosodic units such as phrases, clauses and sentences

Resource
– Dictionary of words and their diphones
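Marking prosodic units can be approximated from punctuation alone. The sketch below is an assumption-laden simplification (the function name `prosodic_units` and the punctuation sets are chosen here, not taken from the slides): sentences are split at the Bangla danda "।" and at ".", "!" or "?", and clauses at commas, semicolons and colons.

```python
import re

def prosodic_units(text):
    """Return a list of sentences, each given as a list of clauses."""
    sentences = [s.strip() for s in re.split(r"[।.!?]+", text) if s.strip()]
    return [
        [c.strip() for c in re.split(r"[,;:]+", sentence) if c.strip()]
        for sentence in sentences
    ]

print(prosodic_units("ক, খ। গ"))
```

A real system would also use syntactic cues, since not every phrase boundary is marked by punctuation.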

10

Prosody

[Diagram blocks: Diphone Retrieval, Acoustic Manipulation, Concatenation; resources: Diphone Database, Prosody Parameters]

11

Properties of Speech

[Figure: waveform of an example utterance (cat.wav) with periodic and non-periodic regions labeled]

12

Altering Pitch/Duration/Amplitude

• For smooth concatenation, altering pitch, duration and amplitude at the concatenation point is very important.

13

Altering Pitch

[Figure: pitch period extracted from the original diphone × Hanning window = Hanned pitch period]
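The windowing step can be sketched in a few lines. This is an illustrative version using only the standard library; the function names `hann` and `window_pitch_period` are chosen here, and the extraction of the pitch period itself (pitch marking) is assumed to have happened already.

```python
import math

def hann(n):
    """Symmetric Hann (Hanning) window of length n."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def window_pitch_period(samples):
    """Multiply an extracted pitch period by a Hann window of the same length."""
    return [s * w for s, w in zip(samples, hann(len(samples)))]

# A constant signal makes the window shape visible: zero at the ends, one in the middle.
print(window_pitch_period([1.0] * 5))
```

The taper to zero at both ends is what lets neighbouring segments be added together without clicks.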

14

PSOLA – Pitch Synchronous Overlap and Add

[Figure: Hanned pitch periods combined by 50% overlap and add]

• Pitch up: overlap > 50%
• Pitch down: overlap < 50%
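A simplified overlap-add step, assuming all segments have equal length and are already Hann-windowed. Real PSOLA places each segment at a pitch mark; here the spacing is derived from a single `overlap` fraction, which is an assumption made to keep the sketch short. The function name `overlap_add` is also chosen here.

```python
def overlap_add(segments, overlap=0.5):
    """Sum equal-length segments, each shifted by a hop derived from `overlap`.

    `overlap` is the fraction of a segment shared with the next one.
    """
    if not segments:
        return []
    n = len(segments[0])
    hop = max(1, round(n * (1.0 - overlap)))  # distance between segment starts
    out = [0.0] * (hop * (len(segments) - 1) + n)
    for k, segment in enumerate(segments):
        start = k * hop
        for i, sample in enumerate(segment):
            out[start + i] += sample
    return out

segments = [[1.0] * 4] * 3
print(overlap_add(segments, 0.5))
print(len(overlap_add(segments, 0.75)), len(overlap_add(segments, 0.25)))
```

With more than 50% overlap the segments sit closer together, so the pitch periods repeat faster and the pitch rises; with less than 50% overlap they are spread out and the pitch falls.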

15

Altering Duration

• Increase number of PSOLA iterations (overlaps) to increase duration

• Decrease number of PSOLA iterations (overlaps) to decrease duration
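Changing the number of overlap-add iterations amounts to repeating or skipping pitch periods before they are added together. A minimal sketch, where `change_duration` and its `factor` parameter are names introduced here for illustration (factor > 1 lengthens, factor < 1 shortens, and factor must be positive):

```python
def change_duration(periods, factor):
    """Repeat or drop pitch periods so the sequence is about `factor` times as long."""
    if not periods:
        return []
    count = max(1, round(len(periods) * factor))
    last = len(periods) - 1
    return [periods[min(last, int(i / factor))] for i in range(count)]

periods = ["a", "b", "c", "d"]          # stand-ins for windowed pitch periods
print(change_duration(periods, 2.0))    # every period used twice
print(change_duration(periods, 0.5))    # every second period kept
```

Because whole pitch periods are repeated or removed, the pitch itself stays the same while the duration changes.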

16

Altering Amplitude

Multiply the signal by a constant:
• If constant > 1, amplitude increases
• If constant < 1, amplitude decreases
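As a small sketch, assuming samples are floats in the range -1.0 to 1.0. The clipping step and the name `scale_amplitude` are additions made here to keep the result in range; they are not described on the slide.

```python
def scale_amplitude(samples, gain, limit=1.0):
    """Multiply every sample by `gain`, clipping to [-limit, limit]."""
    return [max(-limit, min(limit, s * gain)) for s in samples]

print(scale_amplitude([0.5, -0.25], 2.0))   # louder
print(scale_amplitude([0.5, -0.25], 0.5))   # quieter
```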

17

Concatenation

Diphones → Word
• Using PSOLA at the joining ends
• Ensures smooth transition

Words → Sentence
• Straight joining at the end points due to presence of pauses
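The two joining strategies can be contrasted in a short sketch. Note the substitution: a plain linear crossfade stands in for PSOLA smoothing at diphone joins, which is a simplification chosen here and not the method named on the slide. The function names and the `pause_samples` parameter are also assumptions.

```python
def crossfade_join(a, b, n):
    """Join two units, blending the last n samples of a with the first n of b."""
    n = min(n, len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of b rises while the weight of a falls
        mixed.append((1 - w) * a[len(a) - n + i] + w * b[i])
    return list(a[:len(a) - n]) + mixed + list(b[n:])

def join_words(words, pause_samples):
    """Join word signals end to end with silence in between."""
    out = []
    for k, word in enumerate(words):
        if k:
            out.extend([0.0] * pause_samples)
        out.extend(word)
    return out

print(crossfade_join([1.0] * 4, [1.0] * 4, 2))
print(join_words([[1.0, 2.0], [3.0]], 2))
```

Words can be joined without smoothing because the pause between them hides any discontinuity; inside a word there is no pause, so the join has to be blended.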

18

Putting All Together

[Diagram: TTS system pipeline: Text Pre-processing → (words) → Prosody → Concatenation]

19

Types of Concatenative speech synthesis

• Concatenative synthesis with a fixed inventory
– Contains one sample for each unit, and performs prosodic modification to match the required prosody
• Unit-selection-based synthesis
– Stores several instances of each unit, thus improving the chances of finding a well-matched unit
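The idea of picking a well-matched instance can be sketched with a target cost only. The field names `pitch_hz` and `duration_ms`, the weights and the function `select_unit` are hypothetical; full unit-selection systems usually also score how well neighbouring units join, which is omitted here.

```python
def select_unit(candidates, target_pitch, target_duration, w_pitch=1.0, w_dur=1.0):
    """Pick the stored instance whose pitch and duration are closest to the target."""
    def cost(unit):
        pitch_error = abs(unit["pitch_hz"] - target_pitch) / target_pitch
        duration_error = abs(unit["duration_ms"] - target_duration) / target_duration
        return w_pitch * pitch_error + w_dur * duration_error
    return min(candidates, key=cost)

# Two hypothetical recordings of the same unit.
instances = [
    {"id": "a", "pitch_hz": 100.0, "duration_ms": 80.0},
    {"id": "b", "pitch_hz": 140.0, "duration_ms": 120.0},
]
print(select_unit(instances, target_pitch=135.0, target_duration=110.0)["id"])
```

With a fixed inventory there would be only one instance per unit, and the remaining mismatch would have to be removed by prosodic modification.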

20

Progress of Bangla TTS

• KATHA
– Developed at BRAC University
– Unit-based system using the Festival framework
– 4355 diphones
– Takes 2 sec to generate a 10 sec utterance
• BANGLA VAANI
– Syllable-based synthesis system
– Developed in Kolkata
• SUBACHAN
– Developed at SUST
– Diphone-based synthesis system
– 527 diphones
– Takes 45 ms to generate a 10 sec utterance
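The timing figures above can be compared as real-time factors (synthesis time divided by audio duration). The helper name `real_time_factor` is introduced here; the numbers are the ones quoted on the slide.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Synthesis time per second of generated audio (lower is faster)."""
    return synthesis_seconds / audio_seconds

katha = real_time_factor(2.0, 10.0)        # 2 sec for a 10 sec utterance
subachan = real_time_factor(0.045, 10.0)   # 45 ms for a 10 sec utterance
print(katha, subachan, katha / subachan)
```

On these figures Subachan synthesizes roughly 44 times faster than Katha.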

Speech Signal From Katha and Subachan

• (Voice of Katha) তিনি প্রধানত কবি হলেও বেশ কিছু প্রবন্ধ-নিবন্ধ রচনা ও প্রকাশ করেছেন ("Although primarily a poet, he also wrote and published a number of essays and articles")

• (Voice of Subachan) তিনি প্রধানত কবি হলেও বেশ কিছু প্রবন্ধ-নিবন্ধ রচনা ও প্রকাশ করেছেন

• (Voice of Katha) জীবনানন্দ দাশ বিংশ শতাব্দীর অন্যতম প্রধান আধুনিক বাংলা কবি ("Jibanananda Das is one of the foremost modern Bengali poets of the twentieth century")

• (Voice of Subachan) জীবনানন্দ দাশ বিংশ শতাব্দীর অন্যতম প্রধান আধুনিক বাংলা কবি

21

22

Problems: Homograph Ambiguity

• Homographs are words that share the same spelling but differ in meaning and pronunciation

23

Solution: Homograph Disambiguation

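• Collect all possible homograph words
• Determine the POS tag of the homograph words
– Ex. ছেলেরা মাঠে বল (bol) খেলছে। ("The boys are playing ball in the field": noun)
– Ex. তুমি যাবে কি না বল (bolo)। ("Say whether or not you will go": verb)
• Bayes' theorem can also be applied to determine the most likely pronunciation of a word.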
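Both ideas can be sketched briefly. Everything below is illustrative: the POS tag names, the pronunciation table, the toy training pairs and the function names are assumptions, and a real system would take its tags from a POS tagger and its counts from a corpus.

```python
from collections import Counter

# Hypothetical table: (word, POS tag) -> romanized pronunciation.
HOMOGRAPHS = {("বল", "NOUN"): "bol", ("বল", "VERB"): "bolo"}

def pronounce_by_pos(word, pos):
    """Look up the pronunciation once the POS tag is known."""
    return HOMOGRAPHS.get((word, pos))

# Toy data: (pronunciation, a word seen near the homograph).
TRAINING = [("bol", "খেলছে"), ("bol", "মাঠে"), ("bol", "খেলছে"),
            ("bolo", "তুমি"), ("bolo", "না")]

def posterior(context_word):
    """P(pronunciation | context word) by Bayes' theorem with add-one smoothing."""
    prior = Counter(pron for pron, _ in TRAINING)
    joint = Counter(TRAINING)
    vocabulary = {ctx for _, ctx in TRAINING}
    scores = {}
    for pron, n in prior.items():
        p_pron = n / len(TRAINING)
        p_ctx = (joint[(pron, context_word)] + 1) / (n + len(vocabulary))
        scores[pron] = p_pron * p_ctx
    total = sum(scores.values())
    return {pron: score / total for pron, score in scores.items()}

print(pronounce_by_pos("বল", "VERB"))
print(posterior("খেলছে"))
```

The POS lookup is enough when the tagger is reliable; the Bayesian score gives a fallback when the tag is ambiguous or missing.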

24

Problems: Improper Concatenation

[Figure: signal from the utterance of রাশেদ, with the improperly concatenated region marked]

25

Solution: Improper Concatenation

• PSOLA
• Reducing the number of concatenation points
– Ex 1. Sentence -> কামাল ভাল ছেলে। Diphones -> কা + আমা + আল, ভা + আলো, ছে + এলে, instead of ক + কআ + আম + মআ + আল + …
– Ex 2. পৃথিবী -> পৃ + ইথি + ইবী

• Vowel sounds are periodic, and thus suitable points for concatenation

• Use the 1000 most frequently spoken words
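The reduction in joins can be illustrated with romanized phonemes. This is a sketch under stated assumptions: the vowel set is a simplified stand-in, and `vowel_joined_units` is a name introduced here. Each unit starts and ends on a vowel (except at word edges), so every join falls inside a periodic vowel sound.

```python
VOWELS = {"a", "e", "i", "o", "u"}  # assumption: simplified romanized vowel set

def vowel_joined_units(phonemes):
    """Split a word into units that overlap on vowels, so joins fall inside vowels."""
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not vowel_positions:
        return [list(phonemes)]
    units = [list(phonemes[:vowel_positions[0] + 1])]
    for start, end in zip(vowel_positions, vowel_positions[1:]):
        units.append(list(phonemes[start:end + 1]))
    tail = list(phonemes[vowel_positions[-1]:])
    if len(tail) > 1:  # trailing consonants after the last vowel
        units.append(tail)
    return units

word = ["k", "a", "m", "a", "l"]          # romanized stand-in for the first word of Ex 1
units = vowel_joined_units(word)
print(units, "joins:", len(units) - 1)
print("plain diphone joins:", len(word))  # boundary-padded diphones: len(word) + 1 units
```

For this five-phoneme word the vowel-joined scheme needs 2 joins, against 5 for the plain diphone sequence.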

26

Duration Modeling

27

Duration Modeling

28

Thank you all!

Suggestions??

29

Sound Synthesized by Katha

• Katha

30

Sound Synthesized by Subachan

• Subachan