Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
Speech Processing 15-492/18-492
Multilinguality
Dealing with *all* LanguagesDealing with *all* Languages
Over 6000 LanguagesOver 6000 Languages Maybe not all commercially interesting … nowMaybe not all commercially interesting … now
Major languages (economic)Major languages (economic) Cell phone manufacturers list 46 languagesCell phone manufacturers list 46 languages But even those not all coveredBut even those not all covered
What you needWhat you need
ASRASR Acoustic model (lots of speakers)Acoustic model (lots of speakers) Pronunciation LexiconPronunciation Lexicon Language modelLanguage model
TTSTTS Acoustic model (one speaker)Acoustic model (one speaker) Pronunciation LexiconPronunciation Lexicon Text analysisText analysis
Writing SystemsWriting Systems
Romanized writing systemsRomanized writing systems Latin-1 (iso-8599-1)Latin-1 (iso-8599-1) Covers many Western Europeans languagesCovers many Western Europeans languages
Cyrillic Cyrillic Covers many Eastern European LanguagesCovers many Eastern European Languages
Arabic ScriptsArabic Scripts Arabic(s), Farsi, Urdu, etcArabic(s), Farsi, Urdu, etc
DevenagariDevenagari Covers many Northern India LanguagesCovers many Northern India Languages
Chinese HanziChinese Hanzi Covers some Chinese dialects but different versionsCovers some Chinese dialects but different versions
Many other scripts some non-standardMany other scripts some non-standard
Writing SystemsWriting Systems
Letter based Letter based Latin, CyrillicLatin, Cyrillic
Consonant basedConsonant based Arabic, HebrewArabic, Hebrew
Mora basedMora based Half syllable or syllableHalf syllable or syllable Indian scripts, Japanese native scriptsIndian scripts, Japanese native scripts
Syllable based Syllable based Hangul, ChineseHangul, Chinese
StandardsStandards
Writing standardsWriting standards Taught at schools, newspapers, computer Taught at schools, newspapers, computer
supportsupport Typically standardized spellingTypically standardized spelling
May be mostly spokenMay be mostly spoken Occasionally writtenOccasionally written
Language Specific IssuesLanguage Specific Issues
No explicit markingsNo explicit markings Stress, accent, tonesStress, accent, tones
No word boundariesNo word boundaries Chinese, ThaiChinese, Thai
No (short) vowelsNo (short) vowels Arabic, HebrewArabic, Hebrew
Rich morphologyRich morphology Many different words in the languagesMany different words in the languages Finnish, Turkish, GreenlandicFinnish, Turkish, Greenlandic
Genre Specific IssuesGenre Specific Issues
No capitals, punctuationsNo capitals, punctuations UnpunctuatedUnpunctuated Plain vs polite formPlain vs polite form Speech vs text formSpeech vs text form Many foreign phrasesMany foreign phrases
(technology directed genre’s)(technology directed genre’s) Many new abbreviationsMany new abbreviations
E.g. SMS messagesE.g. SMS messages
Character EncodingCharacter Encoding
Unicode vs utf8 vs latinUnicode vs utf8 vs latin Documents mix themDocuments mix them
Sometime accent omittedSometime accent omitted For ease of typingFor ease of typing
Lots of standardsLots of standards Unicode, EUC, BIG5, TIS42, …Unicode, EUC, BIG5, TIS42, … Everyone has their own standardEveryone has their own standard
Some create their own standardsSome create their own standards Mixed character setsMixed character sets
Phoneme SetsPhoneme Sets
Hard to find consensus for new languagesHard to find consensus for new languages Typically lots of different dialectsTypically lots of different dialects
What level of distinction?What level of distinction? Some good for speech but not really phoneticSome good for speech but not really phonetic /t/ vs /dx/ in “water”/t/ vs /dx/ in “water”
Often doesn’t include foreign phonesOften doesn’t include foreign phones /w/ in German is common for younger people/w/ in German is common for younger people
WordsWords
May be hard to defineMay be hard to define No word boundariesNo word boundaries
Rich morphologyRich morphology Words have many variations of compoundsWords have many variations of compounds Yomenakatta -> could not readYomenakatta -> could not read Yomemasendeshita -> could not read (polite)Yomemasendeshita -> could not read (polite)
Gender specific speechGender specific speech Boku vs atashiBoku vs atashi
Language mixturesLanguage mixtures
Pronunciation lexiconsPronunciation lexicons
““proper” speech vs “actual” speechproper” speech vs “actual” speech Hard to generalizeHard to generalize
ChineseChinese Cross lingual pronunciationsCross lingual pronunciations
““Human” (English/German)Human” (English/German)
““Industry” wayIndustry” way
Collect at least 300 hours of spoken speechCollect at least 300 hours of spoken speech At least 20 different speakersAt least 20 different speakers Mixture of gender, age, etcMixture of gender, age, etc Through desired channel (phone/desktop)Through desired channel (phone/desktop)
Collect at least 5 hours from one speakerCollect at least 5 hours from one speaker High quality recording studioHigh quality recording studio
Data should be targeted to applicationData should be targeted to application Build pronunciation lexiconBuild pronunciation lexicon
Expert phonologistExpert phonologist
Industry wayIndustry way
Probably 3-6 monthsProbably 3-6 months Lead developerLead developer Local language expertLocal language expert Lots of human transcribersLots of human transcribers
Costs?Costs? Many hundreds of thousandsMany hundreds of thousands
Or cheaper (?) …Or cheaper (?) …
Find existing dataFind existing data Linguistic Data Consortium (UPenn)Linguistic Data Consortium (UPenn) ELRA (European equivalent)ELRA (European equivalent) Appen, AustraliaAppen, Australia Find local people who have collected dataFind local people who have collected data
Found data might be in wrong formatFound data might be in wrong format Data cleaning is often the most expensiveData cleaning is often the most expensive
Standardized DatasetsStandardized Datasets
Global PhoneGlobal Phone– 20+ languages, for ASR/TTS20+ languages, for ASR/TTS
LDC/DARPA/IARPA setsLDC/DARPA/IARPA sets– Mostly English, Arabic and ChineseMostly English, Arabic and Chinese
BABEL datasetBABEL dataset– 35 low resource languages (telephone conversations)35 low resource languages (telephone conversations)
LibrivoxLibrivox– Audio booksAudio books
VoxforgeVoxforge– Open source collected languagesOpen source collected languages
MozillaMozilla– Open source multilingual setsOpen source multilingual sets
CMU Wilderness DatasetCMU Wilderness Dataset 500+ Languages500+ Languages
– 20 hours aligned for each language 20 hours aligned for each language – Single speakerSingle speaker– Mined from read audio books (Bible)Mined from read audio books (Bible)– 20+ languages, for ASR/TTS20+ languages, for ASR/TTS
Actual wayActual way
Often mixtureOften mixture Found data for initial modelFound data for initial model Collect data with actual/initial applicationCollect data with actual/initial application
Multilingual SystemsMultilingual Systems
Support lots of different languagesSupport lots of different languages Press 1 for SpanishPress 1 for Spanish Press 2 for Gujarati …Press 2 for Gujarati …
Automatically detect languageAutomatically detect language Mixed languageMixed language
Multilingual (Menu)Multilingual (Menu)
Speak in your languageSpeak in your language Eki-mai no tsugi no bus no ha?Eki-mai no tsugi no bus no ha? When is the next bus to the stationWhen is the next bus to the station
Need multiple recognizersNeed multiple recognizers Run in parallel and take best resultRun in parallel and take best result
Or shared acoustic modelsOr shared acoustic models Recognizing both languages at once (mix)Recognizing both languages at once (mix)
Multilingual (in line)Multilingual (in line)
Code switchingCode switching European, India, Bilingual areasEuropean, India, Bilingual areas Hinglish, SpanglishHinglish, Spanglish
Borrowed words and phrasesBorrowed words and phrases Dad, time kyu hua haiDad, time kyu hua hai One lakhOne lakh Computer wallaComputer walla numbersnumbers
Can be inflectedCan be inflected Was updated -> up gedaten Was updated -> up gedaten
Speech Processing 11-492/18-492Speech Processing 11-492/18-492
MultilingualitySPICE: making it easier
Dealing with *all* LanguagesDealing with *all* Languages
Over 6000 LanguagesOver 6000 LanguagesMaybe not all commercially interesting … nowMaybe not all commercially interesting … now
Major languages (economic)Major languages (economic)Cell phone manufacturers list 46 languagesCell phone manufacturers list 46 languages
But even those not all coveredBut even those not all covered
ComputerizationComputerization: Speech is key technology: Speech is key technology Mobile Devices, Ubiquitous Information AccessMobile Devices, Ubiquitous Information Access
GlobalizationGlobalization: Multilinguality: Multilinguality More than 6000 Languages in the world More than 6000 Languages in the world
Multiple official languagesMultiple official languages
Europe has 20+ official languagesEurope has 20+ official languages
South Africa has 11 official languagesSouth Africa has 11 official languages
Speech Processing in multiple LanguagesSpeech Processing in multiple Languages Cross-cultural Human-Human InteractionCross-cultural Human-Human Interaction
Human-Machine Interface in mother tongueHuman-Machine Interface in mother tongue
MotivationMotivation
ChallengesChallenges Algorithms language independent but require dataAlgorithms language independent but require data
• Dozens of hours audio recordings and corresponding transcriptionsDozens of hours audio recordings and corresponding transcriptions
• Pronunciation dictionaries for large vocabularies (>100.000 words)Pronunciation dictionaries for large vocabularies (>100.000 words)
• Millions of words written text corpora in various domains in questionMillions of words written text corpora in various domains in question
• Bilingual aligned text corporaBilingual aligned text corpora
BUT: Such data only available in very few languagesBUT: Such data only available in very few languages• Audio data Audio data 40 languages, Transcriptions take up to 40x real time 40 languages, Transcriptions take up to 40x real time
• Large vocabulary pronunciation dictionaries Large vocabulary pronunciation dictionaries 20 languages 20 languages
• Small text corpora Small text corpora 100 languages, large corpora 100 languages, large corpora 30 languages 30 languages
• Bilingual corpora in very few language pairs, pivot mostly EnglishBilingual corpora in very few language pairs, pivot mostly English
Additional complications:Additional complications:• Combinatorical explosion (domain, speaking style, accent, dialect, ...)Combinatorical explosion (domain, speaking style, accent, dialect, ...)
• Few native speakers at hand for minority (endangered) languagesFew native speakers at hand for minority (endangered) languages
• Languages without writing systemsLanguages without writing systems
Solution: Learning SystemsSolution: Learning Systems Systems that learn a language from the userSystems that learn a language from the user
Efficient learning algorithms for speech processingEfficient learning algorithms for speech processing Learning:Learning:
• Interactive learning with user in the loopInteractive learning with user in the loop
• Statistical modeling approachesStatistical modeling approaches
Efficiency:Efficiency:
• Reduce amount of data (save time and costs): by a factor of 10Reduce amount of data (save time and costs): by a factor of 10
• Speed up development cycles: days rather than monthsSpeed up development cycles: days rather than months
Rapid Language Adaptation from universal modelsRapid Language Adaptation from universal models
Bridge the gap: language and technology expertsBridge the gap: language and technology experts• Technology experts do not speak all languages in questionTechnology experts do not speak all languages in question
• Native users are not in control of the technologyNative users are not in control of the technology
Sharing data between modulesSharing data between modules
Lexst LMt
Word s
Word t N-grams
AMtDictt
Word phone sequence
LMt
N-grams
AMs Dicts
Word phone sequence
Lexts
Word s
Word t
LMs
N-grams
AMs Dicts LMs
Word phone sequence
N-grams
AMtDictt
Word phone sequence
Input Ls
Input Lt
Output Ls
Speech-to-Speech Translation
Lsource Ltarget
Lsource Ltarget
SPICESPICESpeech Processing: Interactive Creation and Evaluation toolkit
• National Science Foundation, Grant 10/2004, 3 years
• Principle Investigators Tanja Schultz and Alan Black
• Bridge the gap between technology experts language experts
• Automatic Speech Recognition (ASR),
• Machine Translation (MT),
• Text-to-Speech (TTS)
• Develop web-based intelligent systems
• Interactive Learning with user in the loop
• Rapid Adaptation of universal models to unseen languages
• SPICE webpage http://cmuspice.org
Spice Project PageSpice Project Page
Input: Speech
Speech Processing SystemsSpeech Processing Systems
Pronunciation rules
hi /h//ai/you /j/u/we /w//i/
hi youyou areI am
AM Lex LMOutput: Speech & Text
Hello NLP /
MT
TTS
Text data
Phone set & Speech data
Input: Speech
hi /h//ai/you /j/u/we /w//i/
hi youyou areI am
AM Lex LMOutput: Speech & Text
NLP /
MT
TTS
Phone set & Speech data
+
Hello
Rapid Portability: DataRapid Portability: Data
Finding “Nice” PromptsFinding “Nice” Prompts
From very large text databasesFrom very large text databasesFind “nice” sentences:Find “nice” sentences:Containing only high frequency wordsContaining only high frequency words5-15 words5-15 words
Find grapheme/phoneme balanced setFind grapheme/phoneme balanced setSelect sentences with best triphone/graphSelect sentences with best triphone/graph
500-1000 sentences500-1000 sentencesCollect for ASR and TTS acoustic modelingCollect for ASR and TTS acoustic modeling
Prompt Selection IssuesPrompt Selection Issues
Need good textNeed good textDe-htmlify, well-written, no misspellingDe-htmlify, well-written, no misspelling
Need word segmentationNeed word segmentationJapanese, Chinese ThaiJapanese, Chinese Thai
Natural text is often mixed languageNatural text is often mixed languageHindi Newspaper Text has lots of English wordsHindi Newspaper Text has lots of English words
Automatic selection has errorsAutomatic selection has errorsNeed Speaker to do further selectionNeed Speaker to do further selection
E.g. lots of telephone numbers, formating commandsE.g. lots of telephone numbers, formating commands
CMU Arctic used similar methodsCMU Arctic used similar methods
Recording PromptsRecording Prompts
GlobalPhoneGlobalPhoneMultilingual Database Widespread languages Native Speakers Uniform Data Broad Domain Large Text Resources
Internet, Newspaper
Corpus 19 Languages … counting 1800 native speakers 400 hrs Audio data Read Speech Filled pauses annotated
Arabic
Ch-Mandarin
Ch-Shanghai
German
French
Japanese
Korean
Croatian
Portuguese
Russian
Spanish
Swedish
Tamil
Czech
Turkish
+ Thai
+ Creole
+ Polish
+ Bulgarian
+ ... ???Now available from ELRA !!
Speech Recognition in 17 LanguagesSpeech Recognition in 17 Languages
0
10
20
30
40
1011.8
14 14 14.514.516.9 18 19 20 20
29
33.5
2021.7
23.4
29
Wo
rd E
rro
r R
ate
[%
]
Input: Speech
hi /h//ai/you /j/u/we /w//i/
hi youyou areI am
AM Lex LMOutput: Speech & Text
NLP /
MT
TTS
Phone set & Speech data
+
Hello
Rapid Portability: Acoustic ModelsRapid Portability: Acoustic Models
Speech Production is independent from Language IPA1) IPA-based Universal Sound Inventory
2) Each sound class is trained by data sharing
Reduction from 485 to 162 sound classes m,n,s,l appear in all 12 languages p,b,t,d,k,g,f and i,u,e,a,o in almost all
Problem:
Context of sounds are language specific
Context dependent models for new languages?
Solution:
1) Multilingual Decision Context Trees
2) Specialize decision tree by Adaptation
Universal Sound InventoryUniversal Sound Inventory
-1=Plosiv?
N Jk (0)
klau k raut k leot k orin k ar
+2=Vokal?N J
k (1) k (2)lau k rain k ar
ut k leot k or
BlaukrautBrautkleidBrotkorbWeinkarte
Choosing PhonemesChoosing Phonemes
Rapid Portability: Acoustic ModelRapid Portability: Acoustic Model
69,1
57,149,9
40,632,8
28,9
19,6 19
0
20
40
60
80
100
Wor
d E
rror
rat
e [%
]
0 0:15 0:15 0:25 0:25 0:25 1:30 16:30
Ø Tree ML-Tree Po-Tree PDTS
+
Input: Speech
Rapid Portability: Pronunciation DictionaryRapid Portability: Pronunciation Dictionary
Pronunciation rules
hi /h//ai/you /j/u/we /w//i/
hi youyou areI am
AM Lex LMOutput: Speech & Text
NLP /
MT
TTS
Textdaten„adios“ /a/ /d/ /i/ /o/ /s/„Hallo“ /h/ /a/ /l/ /o/ „Phydough“ ???
Hello
Phoneme Grapheme (FTT)Grapheme
English Spanish German Russian Thai
Phoneme- vs Grapheme based ASRPhoneme- vs Grapheme based ASR
Problem: • 1 Grapheme 1 Phoneme
Flexible Tree Tying (FTT): One decision tree• Improved parameter tying• Less over specification• Fewer inconsistencies
0=vowel?
0=obstruent? 0=begin-state?
-1=syllabic?0=mid-state?-1=obstruent?0=end-state?
AX-m
IX-m
AX-b
Dictionary: Interactive LearningDictionary: Interactive Learning
* Follow the work of Davel & Barnard
* Word list: extract from text
User
Word list W
i:= best select
Word wi
Generate pronunciation P(wi)
TTS
P(wi) okay?Yes
Delete wi
No
Update G-2-P
Improve P(wi)
G-2-P
Delete wi
* Update after each wi more effective training
* Kominek & Black
* G-2-P- explicit mapping rules - neural networks - decision trees - instance learning (grapheme context)
LexSkip
Spice: Lex LearnerSpice: Lex Learner
Spice: Lex LearnerSpice: Lex Learner
Issues and ChallengesIssues and Challenges How to make best use of the human?How to make best use of the human?
Definition of successful completion Definition of successful completion Which words to present in what order Which words to present in what order How to be robust against mistakes How to be robust against mistakes Feedback that keeps users motivated to continue Feedback that keeps users motivated to continue
How many words? How many words? G2P complexity language dependent G2P complexity language dependent 80% coverage80% coverage
hundred (SP) to thousands (EN)hundred (SP) to thousands (EN) G2P rule system perplexity G2P rule system perplexity
LanguageLanguage PerplexityPerplexity
EnglishEnglish 50.1150.11
DutchDutch 16.8016.80
GermanGerman 16.7016.70
AfrikaansAfrikaans 11.4811.48
ItalianItalian 3.523.52
SpanishSpanish 1.211.21
Input: Speech
Rapid Portability: LMRapid Portability: LM
hi /h//ai/you /j/u/we /w//i/
hi youyou areI am
AM Lex LMOutput: Speech & Text
NLP /
MT
TTS
Text data
Internet / TV
Hello
Inquiry
AutomaticExtraction
LM
Bridge Languages
+
Resource rich languages Resource low languages:
Parametric TTSParametric TTS Text-to-speech for G2P Learning: Text-to-speech for G2P Learning:
Technique: phoneme-by-phoneme concatenation, Technique: phoneme-by-phoneme concatenation, speech not natural but understandable (Marelie Davel)speech not natural but understandable (Marelie Davel)
Units are based on IPA phoneme examplesUnits are based on IPA phoneme examples • PRO: covers languages through simple adaptationPRO: covers languages through simple adaptation
• CONS: not good enough for speech applications CONS: not good enough for speech applications
Text-to-speech for Applications: Text-to-speech for Applications: Statistical Parametric Systems: Statistical Parametric Systems: clustergenclustergen Clusters representing context-dependent allophones Clusters representing context-dependent allophones
• PRO: can work with little speech (10 minutes) PRO: can work with little speech (10 minutes)
• PRO: robust to erroneous data.PRO: robust to erroneous data.
• CONS: speech sounds buzzy, lacks natural prosodyCONS: speech sounds buzzy, lacks natural prosody
SPICE: Afrikaans - EnglishSPICE: Afrikaans - English Goal: Build Afrikaans – English S2S using SPICEGoal: Build Afrikaans – English S2S using SPICE
• Cooperation with UniversitCooperation with Universityy Stellenbosch and ARMSCOR Stellenbosch and ARMSCOR
• Bilingual PhD visited CMU fBilingual PhD visited CMU for 3 month (Herman Engelbrecht)or 3 month (Herman Engelbrecht)
• Afrikaans: Related to Dutch and EnglishAfrikaans: Related to Dutch and English, , g-2-p very close, regular grammar, simple morphologyg-2-p very close, regular grammar, simple morphology
SPICE, all components apply statistical modeling paradigmSPICE, all components apply statistical modeling paradigm• ASR: HMMs, N-gram LM (JRTk-ISL)ASR: HMMs, N-gram LM (JRTk-ISL)
• MT: Statistical MTMT: Statistical MT (SMT-ISL) (SMT-ISL)
• TTS: Unit-Selection (Festival)TTS: Unit-Selection (Festival)
• DictionaryDictionary: : G-2-P rules using CART decision treesG-2-P rules using CART decision trees
Text: 39 hansard; 680k words; Text: 39 hansard; 680k words; 43k bilingual aligned sentence pairs;43k bilingual aligned sentence pairs;
Audio: 6 hours read speech; 10k utterances, Audio: 6 hours read speech; 10k utterances,
SPICE: Time effortSPICE: Time effort Results: ASR 20% WER; MT A-E (E-A) Bleu 34.1 (34.7), Nist 7.6 (7.9)Results: ASR 20% WER; MT A-E (E-A) Bleu 34.1 (34.7), Nist 7.6 (7.9) Shared pronunciation dictionaries (fShared pronunciation dictionaries (for ASR+TTS) and LM or ASR+TTS) and LM (f(for ASR+MT)or ASR+MT) Most time consuming process: data preparation Most time consuming process: data preparation reduce amount of data! reduce amount of data! Still too much expert knowledge required (e.g. ASR parameter tuning!) Still too much expert knowledge required (e.g. ASR parameter tuning!)
Current TestsCurrent Tests
11 students is CMU class11 students is CMU classHindi (2), Vietnamese (2), French, German (2), Hindi (2), Vietnamese (2), French, German (2), Bulgarian, Telugu, Cantonese, Mandarin.Bulgarian, Telugu, Cantonese, Mandarin.Build complete S2S systemBuild complete S2S systemTeams of 2 for translation on small domainTeams of 2 for translation on small domainTranslation is simple phrase-based Translation is simple phrase-based
Purpose:Purpose:Have students get full experienceHave students get full experienceFind bugs/limitation in the systemFind bugs/limitation in the systemEvaluation resulting systems for development time and Evaluation resulting systems for development time and accuracyaccuracy