Automatic Extraction of Presented by: Simon Mille Parallel ... · Automatic Extraction of Parallel...

Automatic Extraction of Parallel Speech Corpora from Dubbed MoviesAlp Oumlktem Mireia Farruacutes Leo WannerPresented by Simon Mille

alpoktem|mireiafarrus|leowanner|simonmilleupfedu

Contents

1 Parallel Speech Corpora2 Movies as a resource for parallel corpora3 Proposed Method4 Methodology5 Applying the methodology6 Discussions7 Conclusions8 Future work

Parallel Speech Corpora

Spoken parallel corpora are useful in building speech-to-speech applications

Costly Laborious with respect to translation and interpretation

Available corpora

Contain unexpressive speech (eg interpreted) Do not capture spontaneous spoken language traits

(eg read) Lack one-to-one alignment between wordssentences

(eg constrained conversations)

Dubbed movies as a resource

Popular movies documentaries TV shows are dubbed in many countries

A good resource for obtaining bilingual data

(1) Available parallel audio data in dubbed movies(2) Transcripts available with time information in subtitles

Expressive

Speech

Movie still from The Man Who Knew Too Much (1956) copy Universal Pictures

Example Dubbing in European Countries

Image source Wikipedia - Dubbing (filmmaking)

Proposed Method

Automatic extraction of segmented parallel sentences with prosodic parameters

Input Bilingual audio and subtitles pair Output Aligned bilingual sentences annotated with prosodic features

Key points

1 Supports any language pair 2 Contains expressive speech3 Aligned at sentence level

Methodology Overview

Stage 1 Sentence Segmentation

Use subtitle time-information to find script location in audio

SRT subtitle file

1VoxSigmareg Scribe

Stage 2 Prosodic Parameter Extraction

ProsodyPro1 library used for prosodic feature extraction

1Yi Xu 2013

Stage 3 Parallel Sentence Alignment

Goal Given sentence s1 in lang 1 find corresponding sentence s2 in lang2

1 Yandex Translate2 Meteor library (Denkowski and Lavie 2014)

Applying the Methodology

Three movies processed

The Man Who Knew Too Much (1956) Slow West (2015) The Perfect Guy (2015)

Films originally in English dubbed to Spanish

Audio extracted from DVD using Libav1

English and Spanish subtitles obtained from opensubtitles2

1httpslibavorg2httpswwwopensubtitlesorg

Currently obtained corpus

Shortcomings

Copyright restrictions for distributing the corpus

Main bottlenecks in capturing data

1 Audio-text alignment performance 15 sentences lost in original language 49 sentences lost in dubbed language

2 Translation difference in dubbed audio and subtitles

Hinders audio-text alignment

3 Background noise

Processing The Man Who Knew Too Much

Sub-dub differencesExtracted English (Sub + audio) Spanish Sub Spanish Dub

yesDaddy youre sure Ive never been to Africa before

Papaacute iquestestaacutes seguro de que nunca estuve antes en Aacutefrica

Papa estaacutes seguro que no habiacuteamos estado ya en Aacutefrica

no It looks familiar Me parece conocido Todo esto ya lo conozco

yesYou saw the same scenery last summer driving to Las Vegas

Viste el mismo panorama el verano pasado cuando manejamos a Las Vegas

Vimos un paisaje muy parecido cuando fuimos a Las Vegas

yesWhere Daddy lost all that money at the crap

Claro donde papaacute perdioacute todo ese dinero en la mesa

Ah claro donde papaacute perdioacute toda el dinero en la mesa de juego

no Hey look iexclMiren Hey mirad

no A Camel iexclUn camello Un camello

yes Of course this isnt really Africa honey Y esto no es realmente Aacutefrica Realmente esto no es Aacutefrica carintildeo

yes Its the French Morocco Es el Marruecos franceacutes Es el Marruecos Franceacutes

yes Well its northern Africa Es Aacutefrica del Norte Bueno es Aacutefrica del Norte

yes Still seems like Las Vegas Auacuten se parece a Las Vegas Pues sigue parecieacutendose a Las Vegas

Conclusions

Automatic building of multimodal bilingual corpora from dubbed media

Speech text prosody

Conversational speech rarr Useful for speech-to-speech translation applications

Works on any language pair (with trained acoustic model)

No further training needed

Code available at httpwwwgithubcomTalnUPFmovie2parallelDB

Future Work

1 Switch from proprietary audio-text aligner software to open source

Eg p2fa (based on CMU Sphinx ASR system)

2 XML based structure as corpus metadata

Instead of directory structure only

3 Speaker diarization

Identifying the speaker of each sentence

4 Extend and publish the corpus

Depending on agreement with Copyright holders

Questions Suggestions

Directly to author alpoktemupfedu

Appendix A State of the Art Corpora

Appendix B ProsodyPro Files

Some of the files generated by ProsodyPro

Contents

1 Parallel Speech Corpora2 Movies as a resource for parallel corpora3 Proposed Method4 Methodology5 Applying the methodology6 Discussions7 Conclusions8 Future work

Available corpora

Expressive

Speech

Proposed Method

Key points

SRT subtitle file

1VoxSigmareg Scribe

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Available corpora

Expressive

Speech

Proposed Method

Key points

SRT subtitle file

1VoxSigmareg Scribe

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Expressive

Speech

Proposed Method

Key points

SRT subtitle file

1VoxSigmareg Scribe

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Proposed Method

Key points

SRT subtitle file

1VoxSigmareg Scribe

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Proposed Method

Key points

SRT subtitle file

1VoxSigmareg Scribe

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

SRT subtitle file

1VoxSigmareg Scribe

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

SRT subtitle file

1VoxSigmareg Scribe

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

1VoxSigmareg Scribe

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

1Yi Xu 2013

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Shortcomings

3 Background noise

Conclusions

Speech text prosody

Future Work

Conclusions

Speech text prosody

Future Work

Conclusions

Speech text prosody

Future Work

Automatic Extraction of Presented by: Simon Mille Parallel ... · Automatic Extraction of Parallel...

Documents

1 Automatic Foreground Extraction from Imperfect

Automatic Extraction of Physiographic Features and

Automatic extraction of translations from web-based bilingual …villasen/bib/Automatic extraction... · 2009. 6. 25. · Automatic extraction of translations 141 such translation

Lexical Knowledge Patterns for Semi-automatic Extraction …olst.ling.umontreal.ca/pdf/Marshman_thesis.pdf · · 2007-03-23Lexical Knowledge Patterns for Semi-automatic Extraction

Automatic Parallel Parking System Report

Automatic Video Object Extraction

LNCS 4035 - Automatic Foreground Extraction of Head ... · Automatic Foreground Extraction of Head ... Early research focused on fast and precise interactive segmentation ... Automatic

Extraction of Bilingual Information from Parallel Texts

Automatic Keyword Extraction

AUTOMATIC EXTRACTION SYSTEM OF MORPHOLOGICAL …

Automatic Extraction of Session-Based Workload

ACE (Automatic Content Extraction) Arabic Annotation

Automatic Keyword Extraction - L3S Forschungszentrum | Home

Extraction Based automatic summarization

Automatic Information Extraction from Large Websites · Automatic Information Extraction from Large Websites ... fully automatic—wrapper generation algorithms. ... Automatic Information

Automatic extraction of translations from web …ccc.inaoep.mx/~villasen/bib/Automatic extraction of...Automatic extraction of translations 141 such translation extraction system:

Automatic Extraction of Hierarchical Relations from Text

Automatic Code Features Extraction Using Bio-inspired ... · Automatic Code Features Extraction Using ... Microsoft Portable Executable format ... Automatic Code Features Extraction

Towards Automatic Extraction of Event and Place Semantics ...infolab.stanford.edu/~mor/research/sigir2007rattenburyTagSemantics.pdfTowards Automatic Extraction of Event and Place Semantics

Scalable Parallel Feature Extraction and Tracking for