Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation. Preslav Nakov, Qatar Computing Research Institute (collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng). Yandex seminar, August 13, 2014, Moscow, Russia

Dr. Preslav Nakov — Combining, Adapting and Reusing Bi-texts between Related Languages — Application to Statistical Machine Translation — part 3


DESCRIPTION

Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, and sentiment analysis; they are also a critical resource for some real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the world's 6,500+ languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, and this offers opportunities for reusing the bi-texts of the resource-rich language. We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii). We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder. Finally, we observe that for closely related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.
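To make the character-level idea concrete, here is a minimal sketch of one common way to prepare text for it: each word is split into characters and an explicit marker keeps the word boundaries, so that a standard phrase-based system can be trained over character sequences instead of word sequences. The underscore marker and the function names `to_char_level` and `to_word_level` are assumptions made for illustration; the talk does not specify its exact preprocessing.

```python
BOUNDARY = "_"  # assumed word-boundary symbol; it must not occur in the text itself


def to_char_level(sentence: str) -> str:
    """Turn a word-level sentence into a space-separated character sequence."""
    words = sentence.split()
    # Characters inside a word are separated by spaces,
    # and words are separated by the boundary marker.
    return f" {BOUNDARY} ".join(" ".join(word) for word in words)


def to_word_level(char_sequence: str) -> str:
    """Invert to_char_level: glue the characters back into words."""
    words = char_sequence.split(f" {BOUNDARY} ")
    return " ".join(word.replace(" ", "") for word in words)


if __name__ == "__main__":
    source = "првата мировна мисија"
    chars = to_char_level(source)
    print(chars)
    # The round trip should give back the original sentence.
    assert to_word_level(chars) == source
```

Both sides of a bi-text would be converted this way before training, and the system output would be converted back with `to_word_level`.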


Combining, Adapting and Reusing Bi-texts between Related Languages:

Application to Statistical Machine Translation

Preslav Nakov, Qatar Computing Research Institute (collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng)

Yandex seminar, August 13, 2014, Moscow, Russia


Plan

• Part I - Introduction to Statistical Machine Translation

• Part II - Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation

• Part III - Further Discussion on SMT


Machine Translation: Hard or Easy?


Why is Machine Translation Hard?

• Word order
  - En: I want beer.
  - Tr: Ben bira istiyorum.

• Lexical ambiguity
  - Ru: Штирлиц топил печку. Через час печка утонула. (the verb "топил" can mean either "stoked" or "drowned")

• Pronouns, coreference
  - Ru: Если ребенок не любит холодное молоко, сварите его.
  - En: If the baby does not like cold milk, boil it/him.

• Idioms
  - Ru: Петр приказал долго жить.
  - En: Peter kicked the bucket.


Why is Machine Translation Hard?


Natural Language is Ambiguous


Ambiguity in Russian: Idioms


Ambiguity in Russian: Names


Ambiguity in Russian: Stress


Russian Jokes about Stierlitz

• Штирлиц топил печку. Через час печка утонула.
• Встретив гестаповцев, Штирлиц выхватил шашку и закричал: "Порублю!" Гестаповцы скинулись по рублю и убежали.
• Штирлиц шёл по лесу и увидел голубые ели. Штирлиц присмотрелся и увидел, что голубые не только ели, но и пили.
• Штирлиц подошёл к окну. Из окна дуло. Штирлиц закрыл окно. Дуло исчезло.

http://olgakagan.blog.com/2012/01/28/homonymy-in-russian-jokes-about-stierlitz/


Russian Jokes about Stierlitz

http://olgakagan.blog.com/2012/01/28/homonymy-in-russian-jokes-about-stierlitz/

• Штирлиц выстрелил в Мюллера в упор. Мюллер не упал. “Броневой,” - подумал Штирлиц.
• Лампа светила, но света не давала. Штирлиц погасил лампу и Света дала.
• Штирлиц шёл по лесу и наткнулся на сук. “Шли бы вы домой, девушки. Война всё-таки!”
• Штирлиц лёг на гальку. Галька вскрикнула и убежала.
• Штирлиц сел в машину. "Всё, можно трогать!" - сказал он. "Ого-го!" - потрогала Кэт.


When is Machine Translation Easy?

• Very closely related languages
  - similar word order, grammar

• Legal texts
  - many repetitions

• Caterpillar English
  - simplified to make MT easy


Translating a European Convention: English to Bulgarian

English (orig.):

European Convention on Mutual Assistance in Criminal Matters
Preamble
The governments signatory hereto, being members of the Council of Europe,
considering that the aim of the Council of Europe is to achieve greater unity among its members;
believing that the adoption of common rules in the field of mutual assistance in criminal matters will contribute to the attainment of this aim;
considering that such mutual assistance is related to the question of extradition, which has already formed the subject of a convention signed on 13th December 1957,
have agreed as follows:

Human Translation:

Европейска конвенция за взаимопомощ по наказателно-правни въпроси
Преамбюл
Правителствата, подписали тази конвенция, в качеството си на членове на Съвета на Европа,
считайки, че целта на Съвета на Европа е да се постигне по-голямо единство между неговите членове,
убедени, че приемането на общи правила в областта на правната помощ по наказателни дела ще допринесе за постигането на тази цел,
считайки, че правната помощ е свързана с въпроса за екстрадицията, която вече бе предмет на конвенцията, подписана на 13 декември 1957 година,
се споразумяха за следното:

Computer Translation:

Европейска конвенция за взаимопомощ по наказателно-правни въпроси
Преамбюл
Правителствата, подписали този протокол, членове на Съвета на Европа,
считайки, че целта на Съвета на Европа е постигането на по-голямо единство между своите членове,
убедени, че приемане на общи правила в областта на правна помощ по наказателни дела ще допринесе за постигането на тази цел,
считайки, че тази взаимна помощ е свързана с въпроса за екстрадиция, който вече е образувано предмет на конвенция, подписана в 13th декември 1957 година,
се споразумяха за следното:


Adapting „Macedonian“ to Bulgarian

„Macedonian“ (original):

СКОПЈЕ, Македонија -- Според дипломатски извори, првата мировна мисија на ЕУ, која ќе биде распоредена во Македонија на 31-ви март, ќе го носи името Конкордија.
Околу 27 земји навестија подготвеност да учествуваат во шестмесечната мисија.
Се очекува македонскиот Парламент в среда (26-ти март) да го одобри нацрт договорот за статусот на силите на ЕУ, со кој на трупите на Унијата им се гарантира дипломатски статус и имунитет.
Во вторникот, во меѓувреме, во Скопје започна меѓународна конференција за децентрализација, при што заменик-премиерот Муса Џафери го одржа воведниот говор.
Во другите вести, претседателот Борис Трајковски за Утрински весник потврди дека експертски тим на министерствата за внатрешни работи и за одбрана ја завршил својата работа на првата национална стратегија за безбедност и одбрана.

Human Translation:

СКОПИЕ, Македония -- Първата мироопазваща мисия на ЕС, която ще бъде разположена в Македония на 31 март, ще се нарича "Конкордия", съобщиха дипломатически източници.
Тя ще продължи шест месеца. Около 27 страни са заявили желание да участват нея.
Очаква се македонският парламент да одобри проекта на Споразумението за статута на силите в сряда (26 март), гарантирайки по този начин дипломатически статут и имунитет на съюзническия контингент.
Междувременно, във вторник в Скопие започна международна конференция по въпросите на децентрализацията. Тя бе открита с реч на вицепремиера Муса Джафери.
В други новини, президентът Борис Трайковски потвърди за Утрински весник, че екипът от експерти към министерствата на вътрешните работи и отбраната е завършил работата си по проекта за националната стратегия за сигурност и отбрана.

Computer Translation:

СКОПИЕ, Македония - Според дипломатически източници, първата мироопазваща мисия на ЕС, която ще бъде разположена в Македония на 31 март, ще носи името на Конкордия.
Около 27 страни намекнаха готовност да участват в шестмесечната мисия.
Очаква се македонският парламент в сряда (26 март) да одобри проект на споразумението за статута на силите на ЕС, с който на войниците на съюза да се гарантира дипломатически статут и имунитет.
Междувременно, във вторник в Скопие започна международна конференция, посветена на децентрализацията и вицепремиерът Муса Джафери изнесена изказването.
Други новини, президентът Борис Трайковски за Утрински вестник, че експертен екип на министерствата на вътрешните работи и отбраната е завършил работата си на първата национална стратегия за сигурност и отбрана.


Summary: Machine Translation Today

• Usable technology
  - Translation memories
  - Web translation
  - “Caterpillar” English

• New profession
  - post-editor

• MT will never replace human translators
  - Computers cannot be held legally responsible


Big Companies Care About SMT


SDL – First To Invest In SMT


Lionbridge Partners with IBM


Facebook Buys SMT Company


Old-Timer Systran Finds New Home


eBay Considers SMT To Open New Markets


Adobe Supports Open Source SMT


Intel Investigates MT


The Big Dream of NLP

Dave Bowman: “Open the pod bay doors, HAL”

HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”


Future Directions


Two Important Directions

•Semantics

•Machine Translation

Critical for the overall advancement of the field

Practical, within the reach of current technology


Semantics: Revolution is Needed?

• If we want the dream to come true, we should
  - not rely on superficial statistics alone
  - get to the meaning of text

• A revolution in semantics is needed
  - looking at words is not enough
  - we need better models for
    o multi-word expressions (~70% of terminology)
    o semantic relations (meaning is in the links!)

• The revolution will be supported by
  - Web-scale corpora
  - linguistic knowledge

“Moving Lexical Semantics from Alchemy to Science”: discussion on [Corpora-List]

• This is what Chomsky has done with syntax.
• Should we expect the same for lexical semantics?


Machine Translation

•The task that started the whole NLP field

•The hottest research topic today

•High practical and economic expectations


Machine Translation: Evolution?

• Evolution?
  - Resource-poor language pairs
  - Morphologically rich languages
  - Smarter Web-scale translation models
  - Noisy input
    o spoken language
    o emails, chats, forums, Twitter
    o poetry


Machine Translation: Revolution?

• Revolution?
  - Two great revolutions so far
    o 1993: statistical word-based translation
    o 2003: statistical phrase-based translation
  - Overdue for the next revolution?
    o 2013: ???
      • Syntactic translation?
      • Semantic translation?

[Diagram: levels of transfer between SOURCE and TARGET: words, phrases, syntax, semantics, interlingua]

Or maybe a return of deep neural networks?

• Already started a little revolution in speech recognition
• Very strong results for SMT, best paper award at ACL 2014 (Devlin et al., ACL 2014)
• Very strong results for semantics too (word embeddings)


The Future?

Three words: Web, semantics, linguistics

and deep neural networks?


QCRI


Qatar


EMNLP 2014


Vision and Mission

Become a world-leader in Arabic language technologies

Conduct innovative and strategic research and development with local and global impact.


The ALT Research Areas

• Build strong foundation: Arabic NLP
• 2 flagship projects
• 3 supplementary focus areas motivated by local needs

Research areas (from the slide diagram):

• Multi-lingual Language Processing: Transcription and Translation
• Search and Information Extraction
• Interactive Question Answering
• Educational Applications
• Arabic Optical Character Recognition

Foundation: Arabic NLP Stack (Tools and Resources)


The ALT Team

Dr. Preslav Nakov, Senior Scientist
Ahmed Ali, Senior Software Eng.
Abdulrahman Ghanem, Software Engineer
Dr. Kareem Darwish, Senior Scientist
Dr. Stephan Vogel, Principal Scientist
Dr. Francisco Guzman, Scientist
Dr. Walid Magdy, Scientist
Dr. Hassan Sajjad, Scientist
Dr. Shafiq Joty, Scientist
Dr. Alessandro Moschitti, Senior Scientist
Yifan Zhang, Senior Software Eng.
Dr. Ahmed Abdelali, Senior Software Eng.
Hamdy Mubarak, Senior Software Eng.
Dr. Lluis Marquez, Principal Scientist
Dr. Ferda Ofli, Scientist
Fahad Al-Obaidli, Research Assistant

plus interns …


Machine Translation at QCRI


Speech Translation

News

Lectures

Meetings


Application: News Translation

• Objective: high-quality speech recognition and translation
• Enable video search
• Collaboration with Aljazeera

http://alt.qcri.org/QCRI_Demo/Speech_Recognition.html


Application: Lecture Translation

• Objective: enable wider reach of educational material
• Primarily English -> Arabic


Application: Meeting Translation

• Objective: real-time, low-latency translation
• Flexible architecture based on cutting-edge technology


The Future: „Ubiquitous“ Translation

Google glass

Smart glass

صحح؟


Acknowledgments

• Used some slides by George Doddington, John Hutchins, Kevin Knight, Jonas Kuhn, Dan Klein, Philipp Koehn, Daniel Marcu, Drago Radev, Arturo Trujillo, Stephan Vogel, C. Wayne, Kenji Yamada, etc.