Speech Processing 11-492/18-492
Speech Translation
Speech Translation
- Three-part systems
  - ASR -> Translation -> TTS
- System configurations
  - One way – phrasal
  - One way – broadcast/lecture
  - 1.5 way – phrasal with limited answers
  - Two way – full two way
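The three-part structure can be sketched as plain function composition. This is a minimal illustration only: `toy_asr`, `toy_translate`, `toy_tts` and `make_s2s` are hypothetical stand-ins, not components of any system described in these slides, and the Spanish phrase is an invented example.

```python
from typing import Callable

# Hypothetical stand-ins for real components; each is just a function here.
def toy_asr(audio: bytes) -> str:
    # A real recognizer would decode the audio; this stub pretends it heard a greeting.
    return "hello"

def toy_translate(text: str) -> str:
    # Phrase-table lookup; unknown input is passed through unchanged.
    table = {"hello": "hola"}
    return table.get(text, text)

def toy_tts(text: str) -> bytes:
    # A real synthesizer would return a waveform; this stub returns encoded text.
    return text.encode("utf-8")

def make_s2s(asr: Callable[[bytes], str],
             mt: Callable[[str], str],
             tts: Callable[[str], bytes]) -> Callable[[bytes], bytes]:
    """Chain ASR -> Translation -> TTS into one speech-to-speech function."""
    def s2s(audio: bytes) -> bytes:
        return tts(mt(asr(audio)))
    return s2s

one_way = make_s2s(toy_asr, toy_translate, toy_tts)
```

A one-way configuration needs one such chain; a two-way system would need a second chain for the opposite direction.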
Machine Translation Technologies
- Phrasal
  - Phrase-to-phrase lookup
- Template
  - Template fillers, fixed translation
- Interlingua
  - Translation into a meaning representation
- Statistical machine translation
  - Learned from a large collection of parallel text
- Classification-based translation
  - Identify classes and deal directly with them
Choices in Translation
- Choose any two …
  - High accuracy
  - Large vocabulary
  - Fully automatic
- Speech vs. text
  - Speech is less clear than text
  - Less speech to train from
  - Needs to be real time (probably)
Simple Translation
- Phrase to phrase
  - Greetings
  - Do you need medical attention?
  - Relatively easy to build, but of limited use
- Template translations
  - The next train leaves at TIME from gate GATE from PLACE
  - Limited but still useful
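A template translation can be sketched as a pattern with slots on the source side and a fixed target sentence that reuses the slot values. The sketch below is an assumption-laden illustration: only the TIME and GATE slots are handled, the slot formats (a clock time, a single word) are guesses, and the Spanish target sentence is an invented example.

```python
import re
from typing import Optional

# Hypothetical template pair: source pattern with slots, fixed target text.
SOURCE = "The next train leaves at TIME from gate GATE"
TARGET = "El próximo tren sale a las TIME desde la puerta GATE"  # illustrative only

def compile_template(source: str) -> "re.Pattern[str]":
    """Turn the slot names TIME and GATE into named capture groups."""
    pattern = re.escape(source)
    pattern = pattern.replace("TIME", r"(?P<TIME>\d{1,2}:\d{2})")  # assumed format
    pattern = pattern.replace("GATE", r"(?P<GATE>\w+)")            # assumed format
    return re.compile(pattern, re.IGNORECASE)

PATTERN = compile_template(SOURCE)

def translate(sentence: str) -> Optional[str]:
    """Fill the target template, or return None when the input is not covered."""
    match = PATTERN.fullmatch(sentence.strip())
    if match is None:
        return None
    out = TARGET
    for slot, value in match.groupdict().items():
        out = out.replace(slot, value)
    return out
```

Returning `None` for uncovered input makes the limitation explicit: anything outside the template set is simply not translated.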
Interlingua
- Translate sentences into a standard form
- Generate sentences from the standard form
- PROS
  - Can do multiple languages easily
  - Can be very accurate
- CONS
  - Designing a universal interlingua is very hard
  - Doesn’t do well when out of domain
Statistical Machine Translation
- Build probabilistic models from parallel text
- Parallel text is often available from
  - Bilingual organizations
  - Governments, UN
- Relatively easy to collect
  - Requires translators rather than MT experts
Learning from Parallel Text
1. Ofi'at kowii'ã lhiyohli      The dog chases the cat
2. Kowii'at ofi'ã lhiyohli      The cat chases the dog
3. Ofi'at shoha                 The dog stinks
4. Ihooat hattakã hollo         The woman loves the man
5. Lhiyohlili                   I chase her/him
6. Salhiyohli                   She/he chases me
7. Hilha                        She/he dances
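To give a feel for how word correspondences can be read off parallel text, here is a minimal sketch that scores word pairs from the first four sentence pairs by sentence-level co-occurrence (the Dice coefficient). This is a deliberately simple stand-in chosen for illustration, not the alignment model used in real SMT systems; the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# First four sentence pairs from the example, lowercased.
PAIRS = [
    ("ofi'at kowii'ã lhiyohli", "the dog chases the cat"),
    ("kowii'at ofi'ã lhiyohli", "the cat chases the dog"),
    ("ofi'at shoha", "the dog stinks"),
    ("ihooat hattakã hollo", "the woman loves the man"),
]

def dice_table(pairs):
    """Score each (source, target) word pair by sentence-level co-occurrence."""
    src_count, tgt_count, both = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        both.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in both.items()
    }

def best_match(table, src_word):
    """Return the highest-scoring target word(s); ties are returned together."""
    scores = {t: v for (s, t), v in table.items() if s == src_word}
    top = max(scores.values())
    return sorted(t for t, v in scores.items() if v == top)

table = dice_table(PAIRS)
print(best_match(table, "shoha"))     # ['stinks']
print(best_match(table, "ofi'at"))    # ['dog']
print(best_match(table, "lhiyohli"))  # ['cat', 'chases'] (a tie)
```

With so little data some words stay ambiguous: "lhiyohli" co-occurs equally often with "chases" and "cat". More sentence pairs are what resolve such ties.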
Statistical Machine Translation
- PROS
  - Data collection doesn’t require MT experts
  - Data driven
  - Degrades gracefully when out of domain
- CONS
  - Needs all language pairs
  - Needs good/lots of data
  - Hard to fix specific errors
SPEECH Translation
- Speech isn’t text
  - Different style, hard to find lots of examples
- Speech isn’t fluent
  - False starts, hesitations, ungrammatical
- ASR output is never error-free
One Way: Broadcast
- One speaker
  - Lecturer: can modify the language model
- Multiple speakers
  - May be repeat speakers (news anchor)
  - May have other noises: music etc. (TV programs)
- Doesn’t need to be real time (maybe)
Two Way: Dialog
- Users can detect their own errors and correct them
- Needs to be real time
- One user may be much more familiar with the system
  - How do you teach the other user?
- Typically domain directed
Speech Technology Issues
- ASR
  - Disfluencies, dialects, speaking style
  - Unfamiliarity with system
- TTS
  - MT output isn’t always fluent
  - TTS says it anyway
  - Can be hard to understand
Speech Technology Issues
- Spoken not written languages
  - Arabic vs. Arabic dialects
  - Mixture of languages
  - Politeness levels
  - Gender in speech
Phraselator: One Way Translation
- Commercial system
  - VoxTec
- Rapid deployment
  - Modules of about 500 utterances
Transtac: Two-Way S2S System
- DARPA developed for
  - Check points, medical and civil defense
- Requirements
  - Two way
  - Eyes-free (no screen)
  - Portable
  - Usable by real users
Transtac System
- Laptop secured in backpack
- Optional speech control
- Push-to-talk buttons
- Close-talking microphone
- Small powerful speakers
Transtac System Details
- Two-way system
  - 2 ASR systems: English and Iraqi
  - 2-way statistical translation
  - 2 synthesizers
- Push-to-talk system
  - (Users don’t like “translate everything mode”)
- Echo back ASR result
  - And then translation
Iraqi Language
- Iraqi Arabic is a dialect
  - Most Iraqis write Modern Standard Arabic
  - Most Iraqis do not write their own dialect
- No standardized spelling
  - Transtac project invented one
  - But Iraqis may not be used to it
- Arabic (MSA and dialects)
  - Short vowels are not written in words
Data for Training
- Collected human-mediated dialogs
  - Human acts as a machine
  - Passed a microphone back and forth
  - Try to get people not to talk at the same time
- Large number of collections (over 4 years)
  - 650 thousand sentence pairs
  - Many different speakers
  - Hand transcribed by experts (in Iraqi spelling)
  - Hand translated (source sentences and interpreter’s)
Iraqi ASR
- Acoustic model from Iraqi data
  - Based on MSA phoneset
  - Needs to be small, fast models
  - Discriminative training
  - Speaker-specific adaptation
- Lexicon
  - Based on LDC-provided lexicon
  - Multiple pronunciations/typos still a problem
  - Statistically trained LTS rules
- Language model
  - Trained on Iraqi input (and translated output)
English ASR
- Acoustic model
  - Originally using other models
  - Then trained from collected data
  - (Mostly military personnel)
- Lexicon
  - Existing lexicon, but needed to add military speak: MRAP, IED
- Language model
  - Trained from data provided
  - Trained from “similar” data found on the web
  - Trained from hand-created “typical” examples
TTS
- Standard English TTS
  - Appropriate “command” voice
  - Unit selection
  - Added lots of military vocabulary
- Iraqi TTS
  - Recorded from Iraqi radio announcer
  - Based on example sentences in the domain
  - LDC lexicon and LTS rules (same as ASR)
  - Hand tuned
S2S Interface Issues
- How do you teach people to use the system?
  - “Transtac say instructions”
  - Not really sufficient
- How can you tell it translated correctly?
  - Give (speech) feedback
    - Backtranslation
    - ASR echo back
S2S Interface Issues
- How do you translate names?
  - A correct translation/transliteration is hard to understand
  - Mark names in translations
    - “My name is … Abdullah”
    - “He lives on … al-Aqar … street”
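One way to picture "marking names" is to insert a pause marker around name tokens before the text goes to the synthesizer. The sketch below assumes the name positions are already known (how names are detected is not covered here); `mark_names` and its parameters are hypothetical.

```python
def mark_names(tokens, name_positions, pause="…"):
    """Insert a pause marker before and after each run of name tokens.

    No pause is added at the very start or very end of the sentence.
    """
    out = []
    last = len(tokens) - 1
    for i, token in enumerate(tokens):
        is_name = i in name_positions
        # Pause before the first token of a name, unless it opens the sentence.
        if is_name and i > 0 and (i - 1) not in name_positions:
            out.append(pause)
        out.append(token)
        # Pause after the last token of a name, unless it closes the sentence.
        if is_name and i < last and (i + 1) not in name_positions:
            out.append(pause)
    return " ".join(out)

print(mark_names(["My", "name", "is", "Abdullah"], {3}))
print(mark_names(["He", "lives", "on", "al-Aqar", "street"], {3}))
```

Treating adjacent name tokens as one run keeps a multi-word name together instead of pausing inside it.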
S2S Evaluation (Transtac)
- Offline tests
  - ASR->Text and Text->Text
  - Compare to translation references
  - WER and “BLEU” score
- Online tests
  - Concept transfer (through defined scenarios)
  - Speed (number of concepts per minute)
  - (English speech masking)
- Utility tests
  - Does it really work?
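Word error rate (WER), one of the offline measures named above, is the word-level edit distance between the reference and the ASR hypothesis divided by the reference length. A minimal sketch follows, assuming whitespace tokenization and no text normalization; the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("do you need medical attention",
                      "do you need attention"))  # 0.2
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.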
Transtac Participants
- Developer groups
  - IBM
  - SRI
  - BBN
  - CMU
  - USC
- Evaluations
  - Twice a year in Iraqi (somewhere in DC)
  - One surprise language
    - Farsi, Bahasa Malay, Dari, Pashto
  - Other evaluations with military groups
Does it work??
- Yes, mostly
  - 27 concepts out of 30-ish turns
- Systems are mostly similar
  - But some are better than others
- Other techniques
  - Belt/holster-based PC with handheld speaker
  - Small PC in pouch
  - Chest-mounted array microphone
S2S ASR Advanced Issues
- Tight coupling
  - ASR should output N-best
  - Translate all (lattice)
  - Choose best translation
  - (MT as an LM for ASR)
- Remove disfluencies/hesitations
- Add more relevant data
  - Automatically convert past tense/third person data to present tense/first+second person …
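The tight-coupling idea can be sketched as rescoring: translate every ASR hypothesis and keep the one whose combined ASR and MT score is best. The scores, the `mt_weight` parameter, the toy translator and the Spanish output are all assumptions made for illustration, and the sketch uses a flat N-best list, not a lattice.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (text, log-probability)

def pick_translation(nbest: List[Hypothesis],
                     translate: Callable[[str], Hypothesis],
                     mt_weight: float = 1.0) -> Tuple[float, str, str]:
    """Return (combined score, chosen ASR text, its translation)."""
    if not nbest:
        raise ValueError("nbest must not be empty")
    best = None
    for text, asr_logprob in nbest:
        translation, mt_logprob = translate(text)
        total = asr_logprob + mt_weight * mt_logprob
        if best is None or total > best[0]:
            best = (total, text, translation)
    return best

def toy_translate(text: str) -> Hypothesis:
    # Hypothetical MT: confident on a known phrase, very unsure otherwise.
    table = {"do you need medical attention": ("¿Necesita atención médica?", -0.5)}
    return table.get(text, (text, -5.0))

# The top ASR hypothesis contains a misrecognition; MT scores expose it.
nbest = [
    ("do you need medical a tension", -1.0),
    ("do you need medical attention", -1.2),
]
score, asr_text, translation = pick_translation(nbest, toy_translate)
print(asr_text, "->", translation)
```

Here the second-ranked ASR hypothesis wins because the translation model strongly prefers it, which is the sense in which MT acts as a language model for ASR.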
S2S TTS Advanced Issues
- MT output isn’t grammatical
  - TTS doesn’t care and just says it
  - TTS should try to say MT output with more breaks
- TTS (unit selection)
  - As an LM on MT output
  - Choose the best translation based on what is said best
S2S MT Advanced Issues
- Train on ASR output
  - Do ASR on training data
  - Build SMT model ASR-TEXT to TEXT
- Session adaptation
  - Improve coverage from daily usage
S2S In-line Translation
- CMU-INESC (Portugal) project
  - Translation of TED videos
  - Align audio to give “dubbing” not “voiceover”
  - Align: timing, breaks, focus across languages