Speech Processing 15-492/18-492tts.speech.cs.cmu.edu/courses/11492/slides/multilingual_all.pdf · Writing Systems Romanized writing systems Latin-1 (iso-8599-1) Covers many Western

Speech Processing 15-492/18-492

Multilinguality

Dealing with *all* LanguagesDealing with *all* Languages

Over 6000 LanguagesOver 6000 Languages Maybe not all commercially interesting … nowMaybe not all commercially interesting … now

Major languages (economic)Major languages (economic) Cell phone manufacturers list 46 languagesCell phone manufacturers list 46 languages But even those not all coveredBut even those not all covered

What you needWhat you need

ASRASR Acoustic model (lots of speakers)Acoustic model (lots of speakers) Pronunciation LexiconPronunciation Lexicon Language modelLanguage model

TTSTTS Acoustic model (one speaker)Acoustic model (one speaker) Pronunciation LexiconPronunciation Lexicon Text analysisText analysis

Writing SystemsWriting Systems

Romanized writing systemsRomanized writing systems Latin-1 (iso-8599-1)Latin-1 (iso-8599-1) Covers many Western Europeans languagesCovers many Western Europeans languages

Cyrillic Cyrillic Covers many Eastern European LanguagesCovers many Eastern European Languages

Arabic ScriptsArabic Scripts Arabic(s), Farsi, Urdu, etcArabic(s), Farsi, Urdu, etc

DevenagariDevenagari Covers many Northern India LanguagesCovers many Northern India Languages

Chinese HanziChinese Hanzi Covers some Chinese dialects but different versionsCovers some Chinese dialects but different versions

Many other scripts some non-standardMany other scripts some non-standard

Writing SystemsWriting Systems

Letter based Letter based Latin, CyrillicLatin, Cyrillic

Consonant basedConsonant based Arabic, HebrewArabic, Hebrew

Mora basedMora based Half syllable or syllableHalf syllable or syllable Indian scripts, Japanese native scriptsIndian scripts, Japanese native scripts

Syllable based Syllable based Hangul, ChineseHangul, Chinese

StandardsStandards

Writing standardsWriting standards Taught at schools, newspapers, computer Taught at schools, newspapers, computer

supportsupport Typically standardized spellingTypically standardized spelling

May be mostly spokenMay be mostly spoken Occasionally writtenOccasionally written

Language Specific IssuesLanguage Specific Issues

No explicit markingsNo explicit markings Stress, accent, tonesStress, accent, tones

No word boundariesNo word boundaries Chinese, ThaiChinese, Thai

No (short) vowelsNo (short) vowels Arabic, HebrewArabic, Hebrew

Rich morphologyRich morphology Many different words in the languagesMany different words in the languages Finnish, Turkish, GreenlandicFinnish, Turkish, Greenlandic

Genre Specific IssuesGenre Specific Issues

No capitals, punctuationsNo capitals, punctuations UnpunctuatedUnpunctuated Plain vs polite formPlain vs polite form Speech vs text formSpeech vs text form Many foreign phrasesMany foreign phrases

(technology directed genre’s)(technology directed genre’s) Many new abbreviationsMany new abbreviations

E.g. SMS messagesE.g. SMS messages

Character EncodingCharacter Encoding

Unicode vs utf8 vs latinUnicode vs utf8 vs latin Documents mix themDocuments mix them

Sometime accent omittedSometime accent omitted For ease of typingFor ease of typing

Lots of standardsLots of standards Unicode, EUC, BIG5, TIS42, …Unicode, EUC, BIG5, TIS42, … Everyone has their own standardEveryone has their own standard

Some create their own standardsSome create their own standards Mixed character setsMixed character sets

Phoneme SetsPhoneme Sets

Hard to find consensus for new languagesHard to find consensus for new languages Typically lots of different dialectsTypically lots of different dialects

What level of distinction?What level of distinction? Some good for speech but not really phoneticSome good for speech but not really phonetic /t/ vs /dx/ in “water”/t/ vs /dx/ in “water”

Often doesn’t include foreign phonesOften doesn’t include foreign phones /w/ in German is common for younger people/w/ in German is common for younger people

WordsWords

May be hard to defineMay be hard to define No word boundariesNo word boundaries

Rich morphologyRich morphology Words have many variations of compoundsWords have many variations of compounds Yomenakatta -> could not readYomenakatta -> could not read Yomemasendeshita -> could not read (polite)Yomemasendeshita -> could not read (polite)

Gender specific speechGender specific speech Boku vs atashiBoku vs atashi

Language mixturesLanguage mixtures

Pronunciation lexiconsPronunciation lexicons

““proper” speech vs “actual” speechproper” speech vs “actual” speech Hard to generalizeHard to generalize

ChineseChinese Cross lingual pronunciationsCross lingual pronunciations

““Human” (English/German)Human” (English/German)

““Industry” wayIndustry” way

Collect at least 300 hours of spoken speechCollect at least 300 hours of spoken speech At least 20 different speakersAt least 20 different speakers Mixture of gender, age, etcMixture of gender, age, etc Through desired channel (phone/desktop)Through desired channel (phone/desktop)

Collect at least 5 hours from one speakerCollect at least 5 hours from one speaker High quality recording studioHigh quality recording studio

Data should be targeted to applicationData should be targeted to application Build pronunciation lexiconBuild pronunciation lexicon

Expert phonologistExpert phonologist

Industry wayIndustry way

Probably 3-6 monthsProbably 3-6 months Lead developerLead developer Local language expertLocal language expert Lots of human transcribersLots of human transcribers

Costs?Costs? Many hundreds of thousandsMany hundreds of thousands

Or cheaper (?) …Or cheaper (?) …

Find existing dataFind existing data Linguistic Data Consortium (UPenn)Linguistic Data Consortium (UPenn) ELRA (European equivalent)ELRA (European equivalent) Appen, AustraliaAppen, Australia Find local people who have collected dataFind local people who have collected data

Found data might be in wrong formatFound data might be in wrong format Data cleaning is often the most expensiveData cleaning is often the most expensive

Standardized DatasetsStandardized Datasets

Global PhoneGlobal Phone– 20+ languages, for ASR/TTS20+ languages, for ASR/TTS

LDC/DARPA/IARPA setsLDC/DARPA/IARPA sets– Mostly English, Arabic and ChineseMostly English, Arabic and Chinese

BABEL datasetBABEL dataset– 35 low resource languages (telephone conversations)35 low resource languages (telephone conversations)

LibrivoxLibrivox– Audio booksAudio books

VoxforgeVoxforge– Open source collected languagesOpen source collected languages

MozillaMozilla– Open source multilingual setsOpen source multilingual sets

CMU Wilderness DatasetCMU Wilderness Dataset 500+ Languages500+ Languages

– 20 hours aligned for each language 20 hours aligned for each language – Single speakerSingle speaker– Mined from read audio books (Bible)Mined from read audio books (Bible)– 20+ languages, for ASR/TTS20+ languages, for ASR/TTS

Actual wayActual way

Often mixtureOften mixture Found data for initial modelFound data for initial model Collect data with actual/initial applicationCollect data with actual/initial application

Multilingual SystemsMultilingual Systems

Support lots of different languagesSupport lots of different languages Press 1 for SpanishPress 1 for Spanish Press 2 for Gujarati …Press 2 for Gujarati …

Automatically detect languageAutomatically detect language Mixed languageMixed language

Multilingual (Menu)Multilingual (Menu)

Speak in your languageSpeak in your language Eki-mai no tsugi no bus no ha?Eki-mai no tsugi no bus no ha? When is the next bus to the stationWhen is the next bus to the station

Need multiple recognizersNeed multiple recognizers Run in parallel and take best resultRun in parallel and take best result

Or shared acoustic modelsOr shared acoustic models Recognizing both languages at once (mix)Recognizing both languages at once (mix)

Multilingual (in line)Multilingual (in line)

Code switchingCode switching European, India, Bilingual areasEuropean, India, Bilingual areas Hinglish, SpanglishHinglish, Spanglish

Borrowed words and phrasesBorrowed words and phrases Dad, time kyu hua haiDad, time kyu hua hai One lakhOne lakh Computer wallaComputer walla numbersnumbers

Can be inflectedCan be inflected Was updated -> up gedaten Was updated -> up gedaten

Speech Processing 11-492/18-492Speech Processing 11-492/18-492

MultilingualitySPICE: making it easier

Dealing with *all* LanguagesDealing with *all* Languages

Over 6000 LanguagesOver 6000 LanguagesMaybe not all commercially interesting … nowMaybe not all commercially interesting … now

Major languages (economic)Major languages (economic)Cell phone manufacturers list 46 languagesCell phone manufacturers list 46 languages

But even those not all coveredBut even those not all covered

ComputerizationComputerization: Speech is key technology: Speech is key technology Mobile Devices, Ubiquitous Information AccessMobile Devices, Ubiquitous Information Access

GlobalizationGlobalization: Multilinguality: Multilinguality More than 6000 Languages in the world More than 6000 Languages in the world

Multiple official languagesMultiple official languages

Europe has 20+ official languagesEurope has 20+ official languages

South Africa has 11 official languagesSouth Africa has 11 official languages

Speech Processing in multiple LanguagesSpeech Processing in multiple Languages Cross-cultural Human-Human InteractionCross-cultural Human-Human Interaction

Human-Machine Interface in mother tongueHuman-Machine Interface in mother tongue

MotivationMotivation

ChallengesChallenges Algorithms language independent but require dataAlgorithms language independent but require data

• Dozens of hours audio recordings and corresponding transcriptionsDozens of hours audio recordings and corresponding transcriptions

• Pronunciation dictionaries for large vocabularies (>100.000 words)Pronunciation dictionaries for large vocabularies (>100.000 words)

• Millions of words written text corpora in various domains in questionMillions of words written text corpora in various domains in question

• Bilingual aligned text corporaBilingual aligned text corpora

BUT: Such data only available in very few languagesBUT: Such data only available in very few languages• Audio data Audio data 40 languages, Transcriptions take up to 40x real time 40 languages, Transcriptions take up to 40x real time

• Large vocabulary pronunciation dictionaries Large vocabulary pronunciation dictionaries 20 languages 20 languages

• Small text corpora Small text corpora 100 languages, large corpora 100 languages, large corpora 30 languages 30 languages

• Bilingual corpora in very few language pairs, pivot mostly EnglishBilingual corpora in very few language pairs, pivot mostly English

Additional complications:Additional complications:• Combinatorical explosion (domain, speaking style, accent, dialect, ...)Combinatorical explosion (domain, speaking style, accent, dialect, ...)

• Few native speakers at hand for minority (endangered) languagesFew native speakers at hand for minority (endangered) languages

• Languages without writing systemsLanguages without writing systems

Solution: Learning SystemsSolution: Learning Systems Systems that learn a language from the userSystems that learn a language from the user

Efficient learning algorithms for speech processingEfficient learning algorithms for speech processing Learning:Learning:

• Interactive learning with user in the loopInteractive learning with user in the loop

• Statistical modeling approachesStatistical modeling approaches

Efficiency:Efficiency:

• Reduce amount of data (save time and costs): by a factor of 10Reduce amount of data (save time and costs): by a factor of 10

• Speed up development cycles: days rather than monthsSpeed up development cycles: days rather than months

Rapid Language Adaptation from universal modelsRapid Language Adaptation from universal models

Bridge the gap: language and technology expertsBridge the gap: language and technology experts• Technology experts do not speak all languages in questionTechnology experts do not speak all languages in question

• Native users are not in control of the technologyNative users are not in control of the technology

Sharing data between modulesSharing data between modules

Lexst LMt

Word s

Word t N-grams

AMtDictt

Word phone sequence

LMt

N-grams

AMs Dicts

Word phone sequence

Lexts

Word s

Word t

LMs

N-grams

AMs Dicts LMs

Word phone sequence

N-grams

AMtDictt

Word phone sequence

Input Ls

Input Lt

Output Ls

Speech-to-Speech Translation

Lsource Ltarget

Lsource Ltarget

SPICESPICESpeech Processing: Interactive Creation and Evaluation toolkit

• National Science Foundation, Grant 10/2004, 3 years

• Principle Investigators Tanja Schultz and Alan Black

• Bridge the gap between technology experts language experts

• Automatic Speech Recognition (ASR),

• Machine Translation (MT),

• Text-to-Speech (TTS)

• Develop web-based intelligent systems

• Interactive Learning with user in the loop

• Rapid Adaptation of universal models to unseen languages

• SPICE webpage http://cmuspice.org

Spice Project PageSpice Project Page

Input: Speech

Speech Processing SystemsSpeech Processing Systems

Pronunciation rules

hi /h//ai/you /j/u/we /w//i/

hi youyou areI am

AM Lex LMOutput: Speech & Text

Hello NLP /

MT

TTS

Text data

Phone set & Speech data

Input: Speech


hi youyou areI am


NLP /

MT

TTS


+

Hello

Rapid Portability: DataRapid Portability: Data

Finding “Nice” PromptsFinding “Nice” Prompts

From very large text databasesFrom very large text databasesFind “nice” sentences:Find “nice” sentences:Containing only high frequency wordsContaining only high frequency words5-15 words5-15 words

Find grapheme/phoneme balanced setFind grapheme/phoneme balanced setSelect sentences with best triphone/graphSelect sentences with best triphone/graph

500-1000 sentences500-1000 sentencesCollect for ASR and TTS acoustic modelingCollect for ASR and TTS acoustic modeling

Prompt Selection IssuesPrompt Selection Issues

Need good textNeed good textDe-htmlify, well-written, no misspellingDe-htmlify, well-written, no misspelling

Need word segmentationNeed word segmentationJapanese, Chinese ThaiJapanese, Chinese Thai

Natural text is often mixed languageNatural text is often mixed languageHindi Newspaper Text has lots of English wordsHindi Newspaper Text has lots of English words

Automatic selection has errorsAutomatic selection has errorsNeed Speaker to do further selectionNeed Speaker to do further selection

E.g. lots of telephone numbers, formating commandsE.g. lots of telephone numbers, formating commands

CMU Arctic used similar methodsCMU Arctic used similar methods

Recording PromptsRecording Prompts

GlobalPhoneGlobalPhoneMultilingual Database Widespread languages Native Speakers Uniform Data Broad Domain Large Text Resources

Internet, Newspaper

Corpus 19 Languages … counting 1800 native speakers 400 hrs Audio data Read Speech Filled pauses annotated

Arabic

Ch-Mandarin

Ch-Shanghai

German

French

Japanese

Korean

Croatian

Portuguese

Russian

Spanish

Swedish

Tamil

Czech

Turkish

+ Thai

+ Creole

+ Polish

+ Bulgarian

+ ... ???Now available from ELRA !!

Speech Recognition in 17 LanguagesSpeech Recognition in 17 Languages

0

10

20

30

40

1011.8

14 14 14.514.516.9 18 19 20 20

29

33.5

2021.7

23.4

29

Wo

rd E

rro

r R

ate

[%

]

Input: Speech


hi youyou areI am


NLP /

MT

TTS


+

Hello

Rapid Portability: Acoustic ModelsRapid Portability: Acoustic Models

Speech Production is independent from Language IPA1) IPA-based Universal Sound Inventory

2) Each sound class is trained by data sharing

Reduction from 485 to 162 sound classes m,n,s,l appear in all 12 languages p,b,t,d,k,g,f and i,u,e,a,o in almost all

Problem:

Context of sounds are language specific

Context dependent models for new languages?

Solution:

1) Multilingual Decision Context Trees

2) Specialize decision tree by Adaptation

Universal Sound InventoryUniversal Sound Inventory

-1=Plosiv?

N Jk (0)

klau k raut k leot k orin k ar

+2=Vokal?N J

k (1) k (2)lau k rain k ar

ut k leot k or

BlaukrautBrautkleidBrotkorbWeinkarte

Choosing PhonemesChoosing Phonemes

Rapid Portability: Acoustic ModelRapid Portability: Acoustic Model

69,1

57,149,9

40,632,8

28,9

19,6 19

0

20

40

60

80

100

Wor

d E

rror

rat

e [%

]

0 0:15 0:15 0:25 0:25 0:25 1:30 16:30

Ø Tree ML-Tree Po-Tree PDTS

+

Input: Speech

Rapid Portability: Pronunciation DictionaryRapid Portability: Pronunciation Dictionary

Pronunciation rules


hi youyou areI am


NLP /

MT

TTS

Textdaten„adios“ /a/ /d/ /i/ /o/ /s/„Hallo“ /h/ /a/ /l/ /o/ „Phydough“ ???

Hello

Phoneme Grapheme (FTT)Grapheme

English Spanish German Russian Thai

Phoneme- vs Grapheme based ASRPhoneme- vs Grapheme based ASR

Problem: • 1 Grapheme 1 Phoneme

Flexible Tree Tying (FTT): One decision tree• Improved parameter tying• Less over specification• Fewer inconsistencies

0=vowel?

0=obstruent? 0=begin-state?

-1=syllabic?0=mid-state?-1=obstruent?0=end-state?

AX-m

IX-m

AX-b

Dictionary: Interactive LearningDictionary: Interactive Learning

* Follow the work of Davel & Barnard

* Word list: extract from text

User

Word list W

i:= best select

Word wi

Generate pronunciation P(wi)

TTS

P(wi) okay?Yes

Delete wi

No

Update G-2-P

Improve P(wi)

G-2-P

Delete wi

* Update after each wi more effective training

* Kominek & Black

* G-2-P- explicit mapping rules - neural networks - decision trees - instance learning (grapheme context)

LexSkip

Spice: Lex LearnerSpice: Lex Learner

Spice: Lex LearnerSpice: Lex Learner

Issues and ChallengesIssues and Challenges How to make best use of the human?How to make best use of the human?

Definition of successful completion Definition of successful completion Which words to present in what order Which words to present in what order How to be robust against mistakes How to be robust against mistakes Feedback that keeps users motivated to continue Feedback that keeps users motivated to continue

How many words? How many words? G2P complexity language dependent G2P complexity language dependent 80% coverage80% coverage

hundred (SP) to thousands (EN)hundred (SP) to thousands (EN) G2P rule system perplexity G2P rule system perplexity

LanguageLanguage PerplexityPerplexity

EnglishEnglish 50.1150.11

DutchDutch 16.8016.80

GermanGerman 16.7016.70

AfrikaansAfrikaans 11.4811.48

ItalianItalian 3.523.52

SpanishSpanish 1.211.21

Input: Speech

Rapid Portability: LMRapid Portability: LM


hi youyou areI am


NLP /

MT

TTS

Text data

Internet / TV

Hello

Inquiry

AutomaticExtraction

LM

Bridge Languages

+

Resource rich languages Resource low languages:

Parametric TTSParametric TTS Text-to-speech for G2P Learning: Text-to-speech for G2P Learning:

Technique: phoneme-by-phoneme concatenation, Technique: phoneme-by-phoneme concatenation, speech not natural but understandable (Marelie Davel)speech not natural but understandable (Marelie Davel)

Units are based on IPA phoneme examplesUnits are based on IPA phoneme examples • PRO: covers languages through simple adaptationPRO: covers languages through simple adaptation

• CONS: not good enough for speech applications CONS: not good enough for speech applications

Text-to-speech for Applications: Text-to-speech for Applications: Statistical Parametric Systems: Statistical Parametric Systems: clustergenclustergen Clusters representing context-dependent allophones Clusters representing context-dependent allophones

• PRO: can work with little speech (10 minutes) PRO: can work with little speech (10 minutes)

• PRO: robust to erroneous data.PRO: robust to erroneous data.

• CONS: speech sounds buzzy, lacks natural prosodyCONS: speech sounds buzzy, lacks natural prosody

SPICE: Afrikaans - EnglishSPICE: Afrikaans - English Goal: Build Afrikaans – English S2S using SPICEGoal: Build Afrikaans – English S2S using SPICE

• Cooperation with UniversitCooperation with Universityy Stellenbosch and ARMSCOR Stellenbosch and ARMSCOR

• Bilingual PhD visited CMU fBilingual PhD visited CMU for 3 month (Herman Engelbrecht)or 3 month (Herman Engelbrecht)

• Afrikaans: Related to Dutch and EnglishAfrikaans: Related to Dutch and English, , g-2-p very close, regular grammar, simple morphologyg-2-p very close, regular grammar, simple morphology

SPICE, all components apply statistical modeling paradigmSPICE, all components apply statistical modeling paradigm• ASR: HMMs, N-gram LM (JRTk-ISL)ASR: HMMs, N-gram LM (JRTk-ISL)

• MT: Statistical MTMT: Statistical MT (SMT-ISL) (SMT-ISL)

• TTS: Unit-Selection (Festival)TTS: Unit-Selection (Festival)

• DictionaryDictionary: : G-2-P rules using CART decision treesG-2-P rules using CART decision trees

Text: 39 hansard; 680k words; Text: 39 hansard; 680k words; 43k bilingual aligned sentence pairs;43k bilingual aligned sentence pairs;

Audio: 6 hours read speech; 10k utterances, Audio: 6 hours read speech; 10k utterances,

SPICE: Time effortSPICE: Time effort Results: ASR 20% WER; MT A-E (E-A) Bleu 34.1 (34.7), Nist 7.6 (7.9)Results: ASR 20% WER; MT A-E (E-A) Bleu 34.1 (34.7), Nist 7.6 (7.9) Shared pronunciation dictionaries (fShared pronunciation dictionaries (for ASR+TTS) and LM or ASR+TTS) and LM (f(for ASR+MT)or ASR+MT) Most time consuming process: data preparation Most time consuming process: data preparation reduce amount of data! reduce amount of data! Still too much expert knowledge required (e.g. ASR parameter tuning!) Still too much expert knowledge required (e.g. ASR parameter tuning!)

Current TestsCurrent Tests

11 students is CMU class11 students is CMU classHindi (2), Vietnamese (2), French, German (2), Hindi (2), Vietnamese (2), French, German (2), Bulgarian, Telugu, Cantonese, Mandarin.Bulgarian, Telugu, Cantonese, Mandarin.Build complete S2S systemBuild complete S2S systemTeams of 2 for translation on small domainTeams of 2 for translation on small domainTranslation is simple phrase-based Translation is simple phrase-based

Purpose:Purpose:Have students get full experienceHave students get full experienceFind bugs/limitation in the systemFind bugs/limitation in the systemEvaluation resulting systems for development time and Evaluation resulting systems for development time and accuracyaccuracy