Speech Processing 11-492/18-492
tts.speech.cs.cmu.edu/courses/11492/slides/s2s_all.pdf

Speech Translation

Speech Translation

Three part systems
  ASR -> Translation -> TTS
System configurations
  One way – phrasal
  One way – broadcast/lecture
  1.5 way – phrasal with limited answers
  Two way – full two way
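
The three-part structure (ASR -> Translation -> TTS) can be sketched as plain function composition. This is only an illustrative sketch: `make_pipeline` and the three toy stages are hypothetical stand-ins, not components of any system described in these slides.

```python
from typing import Callable

Audio = bytes  # stand-in type for a waveform (assumption for this sketch)


def make_pipeline(
    asr: Callable[[Audio], str],
    translate: Callable[[str], str],
    tts: Callable[[str], Audio],
) -> Callable[[Audio], Audio]:
    """Chain the three stages: speech in, translated speech out."""

    def run(audio: Audio) -> Audio:
        source_text = asr(audio)              # speech -> source-language text
        target_text = translate(source_text)  # source text -> target text
        return tts(target_text)               # target text -> speech

    return run


# Toy stages so the sketch runs end to end (all hypothetical).
def toy_asr(audio: Audio) -> str:
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    return {"hello": "hola"}.get(text, text)


def toy_tts(text: str) -> Audio:
    return text.encode("utf-8")


pipeline = make_pipeline(toy_asr, toy_translate, toy_tts)
```

A two way system would simply hold two such pipelines, one per direction.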

Machine Translation Technologies

Phrasal
  Phrase-to-phrase lookup
Template
  Template fillers, fixed translation
Interlingua
  Translation into a meaning representation
Statistical Machine Translation
  From large collections of parallel text
Classification-based translation
  Identify classes and deal directly with them

Choices in Translation

Choose any two …
  High accuracy
  Large vocabulary
  Fully automatic
Speech vs Text
  Speech is less clear than text
  Less speech to train from
  Needs to be real-time (probably)

Simple Translation

Phrase to phrase
  Greetings
  "Do you need medical attention?"
  Relatively easy to build, but limited use
Template translations
  "The next train leaves at TIME from gate GATE"
  "from PLACE"
  Limited but still useful
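
Phrase lookup and template translation can be sketched together in a few lines. This is a minimal sketch, not a real system: the Spanish target strings, the `PHRASES` and `TEMPLATES` tables, and the `translate_simple` function are assumptions made for illustration. Slot fillers (TIME, GATE) are copied through unchanged.

```python
import re

# Fixed phrase-to-phrase table (toy entry; the target text is an assumption).
PHRASES = {
    "do you need medical attention?": "¿Necesita atención médica?",
}

# Each template pairs a source pattern with a fixed target string.
# Named groups are the slots that get copied into the translation.
TEMPLATES = [
    (
        re.compile(
            r"the next train leaves at (?P<TIME>\S+) from gate (?P<GATE>\S+)",
            re.IGNORECASE,
        ),
        "El próximo tren sale a las {TIME} de la puerta {GATE}",
    ),
]


def translate_simple(sentence):
    """Return a translation, or None when the input is out of coverage."""
    text = sentence.strip()
    phrase = PHRASES.get(text.lower())
    if phrase is not None:
        return phrase
    for pattern, target in TEMPLATES:
        match = pattern.fullmatch(text.rstrip("."))
        if match:
            return target.format(**match.groupdict())
    return None
```

Returning None for anything outside the tables makes the "limited use" trade-off explicit: the system is accurate inside its coverage and silent outside it.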

Interlingua

Translate sentences into a standard form
Generate sentences from the standard form
PROS:
  Can do multiple languages easily
  Can be very accurate
CONS:
  Designing a universal interlingua is very hard
  Doesn’t do well when out of domain

Statistical Machine Translation

Build probabilistic models from parallel text
Parallel text is often available from
  Bilingual organizations
  Governments, UN
Relatively easy to collect
Requires translators rather than MT experts

Learning from Parallel Text

1. Ofi'at kowii'ã lhiyohli
2. Kowii'at ofi'ã lhiyohli
3. Ofi'at shoha
4. Ihooat hattakã hollo
5. Lhiyohlili
6. Salhiyohli
7. Hilha

1. The dog chases the cat
2. The cat chases the dog
3. The dog stinks
4. The woman loves the man
5. I chase her/him
6. She/he chases me
7. She/he dances
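
One standard way to learn word correspondences from sentence pairs like these is IBM Model 1, trained with EM. The slides do not name a specific model, so treat this as an illustrative sketch under that assumption. The corpus below uses four of the pairs above, lowercased; the source-language spellings are a best-effort reading of the slide text, and `train_model1` and `candidates` are names invented for this example.

```python
from collections import defaultdict

# Toy parallel corpus: (source sentence, English sentence).
PAIRS = [
    ("ofi'at kowii'ã lhiyohli", "the dog chases the cat"),
    ("kowii'at ofi'ã lhiyohli", "the cat chases the dog"),
    ("ofi'at shoha", "the dog stinks"),
    ("ihooat hattakã hollo", "the woman loves the man"),
]


def train_model1(pairs, iterations=10):
    """Estimate t[(s, e)] = P(source word s | English word e) with EM."""
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    src_vocab = {s for src, _ in corpus for s in src}
    # Start from a uniform table over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each source word among the English words in its pair.
        for src, tgt in corpus:
            for s in src:
                norm = sum(t[(s, e)] for e in tgt)
                for e in tgt:
                    frac = t[(s, e)] / norm
                    count[(s, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per English word.
        for (s, e), c in count.items():
            t[(s, e)] = c / total[e]
    return t


def candidates(t, english_word):
    """Source words ranked by P(s | english_word), most probable first."""
    scored = [(p, s) for (s, e), p in t.items() if e == english_word]
    return [s for p, s in sorted(scored, reverse=True)]


table = train_model1(PAIRS)
print(candidates(table, "stinks"))
```

A word-level model like this cannot handle pairs 5 to 7, where a single source word corresponds to a whole English sentence; those would need morphological segmentation first.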

Statistical Machine Translation

PROS:
  Data collection doesn’t require MT experts
  Data driven
  Degrades gracefully when out of domain
CONS:
  Needs all language pairs
  Needs good/lots of data
  Hard to fix specific errors

SPEECH Translation

Speech isn’t text
  Different style, hard to find lots of examples
Speech isn’t fluent
  False starts, hesitations, ungrammatical
ASR is never error-free

One Way: Broadcast

One speaker
  Lecturer: can modify language model
Multiple speakers
  May be repeat speakers (news anchor)
  May have other noises: music etc. (TV programs)
Doesn’t need to be real time (maybe)

Two Way: Dialog

Users can detect their own errors and correct them
Needs to be real time
One user may be much more familiar with the system
How do you teach the other user?
Typically domain directed

Speech Technology Issues

ASR:
  Disfluencies, dialects, speaking style
  Unfamiliarity with system
TTS:
  MT output isn’t always fluent
  TTS says it anyway
  Can be hard to understand

Speech Technology Issues

Spoken not written languages
  Arabic vs Arabic dialects
Mixture of languages
Politeness levels
Gender in speech

Phraselator: One Way Translation

Commercial system
  VoxTec
Rapid deployment
  Modules of about 500 utterances

Transtac: Two Way S2S System

DARPA developed for
  Checkpoints, medical and civil defense
Requirements
  Two way
  Eyes-free (no screen)
  Portable
  Usable by real users

Transtac System

Laptop secured in backpack
Optional speech control
Push-to-talk buttons
Close-talking microphone
Small powerful speakers

Transtac System Details

Two way system
  2 ASR systems: English and Iraqi
  2 way statistical translation
  2 synthesizers
Push-to-talk system
  (Users don’t like “translate everything mode”)
Echo back ASR result
  And then translation

Iraqi Language

Iraqi Arabic is a dialect
  Most Iraqis write Modern Standard Arabic
  Most Iraqis do not write their own dialect
No standardized spelling
  Transtac project invented one
  But Iraqis may not be used to it
Arabic (MSA and dialects)
  Do not write short vowels in words

Data for Training

Collected human-mediated dialogs
  Human acts as a machine
  Passed a microphone back and forth
  Try to get people not to talk at the same time
Large number of collections (over 4 years)
  650 thousand sentence pairs
  Many different speakers
  Hand transcribed by experts (in Iraqi spelling)
  Hand translated (source sentences and interpreter’s)

Iraqi ASR

Acoustic model from Iraqi data
  Based on MSA phoneset
  Needs to be small, fast models
  Discriminative training
  Speaker-specific adaptation
Lexicon
  Based on LDC-provided lexicon
  Multiple pronunciations/typos still a problem
  Statistically trained LTS rules
Language model
  Trained on Iraqi input (and translated output)

English ASR

Acoustic model
  Originally using other models
  Then trained from collected data
  (Mostly military personnel)
Lexicon
  Existing lexicon, but needed to add military speak: MRAP, IED
Language model
  Trained from data provided
  Trained from “similar” data found on the web
  Trained from hand-created “typical” examples

TTS

Standard English TTS
  Appropriate “command” voice
  Unit selection
  Added lots of military vocabulary
Iraqi TTS
  Recorded from Iraqi radio announcer
  Based on example sentences in the domain
  LDC lexicon and LTS rules (same as ASR)
  Hand tuned

S2S Interface Issues

How do you teach people to use the system?
  “Transtac say instructions”
  Not really sufficient
How can you tell it translated correctly?
  Give (speech) feedback
  Backtranslation
  ASR echo back

S2S Interface Issues

How do you translate names?
  A correct translation/transliteration is hard to understand
  Mark names in translations
    “My name is … Abdullah”
    “He lives on … al-Aqar … street”

S2S Evaluation (Transtac)

Offline tests
  ASR->Text and Text->Text
  Compare to translation references
  WER and “BLEU” score
Online tests
  Concept transfer (through defined scenarios)
  Speed (number of concepts per minute)
  (English speech masking)
Utility tests
  Does it really work?
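
For the offline tests, word error rate (WER) is the word-level edit distance between a reference and a hypothesis, divided by the reference length. The following is a minimal sketch of that standard definition; the function name `wer` and the whitespace tokenisation are choices made for this example, and real scoring tools normalise text more carefully.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference word
                cur[j - 1] + 1,          # insert a hypothesis word
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substitution (chases -> chased) and one deletion (the) over 5 words.
print(wer("the dog chases the cat", "the dog chased cat"))
```

Because insertions are counted too, WER can exceed 1.0 when the hypothesis is much longer than the reference.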

Transtac Participants

Developer groups
  IBM
  SRI
  BBN
  CMU
  USC
Evaluations
  Twice a year in Iraqi (somewhere in DC)
  One surprise language
    Farsi, Bahasa Malay, Dari, Pashto
  Other evaluations with military groups

Does it work??

Yes, mostly
  27 concepts out of 30-ish turns
Systems are mostly similar
  But some better than others
Other techniques
  Belt/holster-based PC with handheld speaker
  Small PC in pouch
  Chest-mounted array microphone

S2S ASR Advanced Issues

Tight coupling
  ASR should output N-best
  Translate all (lattice)
  Choose best translation
  (MT as an LM for ASR)
Remove disfluencies/hesitations
Add more relevant data
  Automatically convert past tense/third person data to present tense/first+second person …
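
Tight coupling can be sketched as rescoring: translate every ASR hypothesis in the N-best list and keep the one whose combined ASR and MT score is highest. This is a sketch under simple assumptions: scores are log-probabilities that can be added, `pick_best` and `toy_translate` are hypothetical names, and the numbers are made up for illustration.

```python
def pick_best(nbest, translate, mt_weight=1.0):
    """Choose the (source, target) pair with the best combined score.

    nbest: list of (source_text, asr_logprob) hypotheses from ASR.
    translate: function mapping source_text -> (target_text, mt_logprob).
    """
    best = None
    for source, asr_logprob in nbest:
        target, mt_logprob = translate(source)
        score = asr_logprob + mt_weight * mt_logprob
        if best is None or score > best[0]:
            best = (score, source, target)
    return best


# Toy MT stage: fluent input translates confidently, odd input does not.
TOY_MT = {
    "recognize speech": ("reconocer el habla", -0.5),
    "wreck a nice beach": ("destrozar una playa bonita", -3.0),
}


def toy_translate(source):
    return TOY_MT.get(source, (source, -10.0))


# ASR slightly prefers the wrong hypothesis; the MT score overturns it.
nbest = [("wreck a nice beach", -1.0), ("recognize speech", -1.2)]
score, source, target = pick_best(nbest, toy_translate)
```

Here the MT score plays the role of an extra language model over the ASR output, which is the sense of "MT as an LM for ASR".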

S2S TTS Advanced Issues

MT output isn’t grammatical
  TTS doesn’t care and just says it
  TTS should try to say MT output with more breaks
TTS (unit selection)
  As an LM on MT output
  Choose the best translation based on what is said best

S2S MT Advanced Issues

Train on ASR output
  Do ASR on training data
  Build SMT model ASR-TEXT to TEXT
Session adaptation
  Improve coverage from daily usage

S2S In-line Translation

CMU-INESC (Portugal) project
  Translation of TED videos
  Align audio to give “dubbing” not “voiceover”
  Align: timing, breaks, focus across languages
