Searching for the Best Translation Combination. Matīss Rikters. Supervisor: Dr. sc. comp., Prof. Inguna Skadiņa. Doctoral seminar, Riga, Latvia, 12 October 2016

Doctoral seminar, presentation 3


Page 1

Searching for the Best Translation Combination

Matīss Rikters
Supervisor: Dr. sc. comp., Prof. Inguna Skadiņa

Doctoral seminar
Riga, Latvia
12 October 2016

Page 2

Contents

Hybrid Machine Translation
Multi-System Hybrid MT
Simple combining of translations
  – Combining whole translations
  – Combining translations of sentence chunks
Combining translations of linguistically motivated chunks
Searching for the best translation combination
Other work
Future plans

Page 3

Hybrid Machine Translation

Statistical rule generation
  – Rules for RBMT systems are generated from training corpora
Multi-pass
  – Process data through RBMT first, and then through SMT
Multi-System hybrid MT
  – Multiple MT systems run in parallel

Page 4

Multi-System Hybrid MT

Related work:
SMT + RBMT (Ahsan and Kolachina, 2010)
Confusion Networks (Barrault, 2010)
  – + Neural Network Model (Freitag et al., 2015)
SMT + EBMT + TM + NE (Santanu et al., 2014)
Recursive sentence decomposition (Mellebeek et al., 2006)

Page 5

Combining Translations

Combining whole translations
  – Translate the full input sentence with multiple MT systems
  – Choose the best translation as the output

Page 6

Combining Translations

Combining translations of sentence chunks
  – Split the sentence into smaller chunks
    • The chunks are the top-level subtrees of the syntax tree of the sentence
  – Translate each chunk with multiple MT systems
  – Choose the best translated chunks and combine them
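The idea of taking the top-level subtrees of the syntax tree as chunks can be sketched in a few lines. This is only an illustration under stated assumptions: the tree is a hand-written, simplified parse stored as nested tuples of the form (label, child, ...), the helper names `leaves` and `top_level_chunks` are hypothetical, and a real parser may wrap the sentence in an extra root node that would have to be skipped first.

```python
def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [word for child in children for word in leaves(child)]


def top_level_chunks(tree):
    """One chunk per direct child of the sentence node."""
    _label, *children = tree
    return [" ".join(leaves(child)) for child in children]


# Hand-written, simplified parse used only for illustration.
tree = (
    "S",
    ("ADVP", "Recently"),
    ("NP", "there"),
    ("VP", "has", "been", ("NP", "an", "increased", "interest")),
    (".", "."),
)

chunks = top_level_chunks(tree)
print(chunks)
```

Each resulting chunk would then be sent to the MT systems separately.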

Page 7

Whole translations

[Workflow diagram: Sentence tokenization → Translation with online MT APIs (Google Translate, Bing Translator, LetsMT) → Selection of the best translation → Output]
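The whole-translation workflow can be outlined as follows. This is a minimal sketch, not the actual implementation: the translator callables stand in for the online MT APIs, `fake_perplexity` stands in for a language model score, and the assumption that the candidate with the lowest perplexity wins is the usual convention rather than something the slide states.

```python
from typing import Callable, Dict, Tuple


def select_best_translation(
    sentence: str,
    systems: Dict[str, Callable[[str], str]],
    perplexity: Callable[[str], float],
) -> Tuple[str, str]:
    """Translate with every system and keep the lowest-perplexity output."""
    candidates = {name: translate(sentence) for name, translate in systems.items()}
    best = min(candidates, key=lambda name: perplexity(candidates[name]))
    return best, candidates[best]


# Hypothetical stand-ins for the MT systems and the language model.
fake_systems = {
    "system_a": lambda s: "translation one",
    "system_b": lambda s: "a much longer translation two",
}


def fake_perplexity(text: str) -> float:
    # Toy score: shorter text counts as "better". Not a real LM.
    return float(len(text))


winner = select_best_translation("source sentence", fake_systems, fake_perplexity)
print(winner)
```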

Page 8

Chunks

[Workflow diagram: Sentence tokenization → Syntactic analysis → Sentence chunking → Translation with online MT APIs (Google Translate, Bing Translator, LetsMT) → Selection of the best chunks → Sentence recomposition → Output]
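For the chunk-based workflow, selection happens per chunk and the winners are joined back into a sentence. The sketch below assumes the chunks are already available and uses hypothetical stand-ins for the MT systems and the scoring function; joining with a single space is a simplification of sentence recomposition.

```python
from typing import Callable, Dict, List


def translate_by_chunks(
    chunks: List[str],
    systems: Dict[str, Callable[[str], str]],
    perplexity: Callable[[str], float],
) -> str:
    """Pick the best-scoring translation of each chunk, then recompose."""
    chosen = []
    for chunk in chunks:
        candidates = [translate(chunk) for translate in systems.values()]
        chosen.append(min(candidates, key=perplexity))
    return " ".join(chosen)


# Hypothetical "systems": one upper-cases, one lower-cases its input.
toy_systems = {"upper": str.upper, "lower": str.lower}


def toy_score(text: str) -> float:
    # Toy score that penalises upper-case letters. Not a real LM.
    return float(sum(ch.isupper() for ch in text))


result = translate_by_chunks(["Recently there", "has been"], toy_systems, toy_score)
print(result)
```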

Page 9

Choosing the best

Choosing the best translation:
KenLM (Heafield, 2011) calculates probabilities based on the observed entry with the longest matching history $w_f^{n-1}$:

$$p(w_n \mid w_1^{n-1}) = p(w_n \mid w_f^{n-1}) \prod_{i=1}^{f-1} b(w_i^{n-1})$$

where the probability $p(w_n \mid w_f^{n-1})$ and the backoff penalties $b(w_i^{n-1})$ are given by an already-estimated language model. Perplexity is then calculated using this probability:

$$2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)}$$

where, given an unknown probability distribution $p$ and a proposed probability model $q$, the model is evaluated by determining how well it predicts a separate test sample $x_1, x_2, \ldots, x_N$ drawn from $p$.
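The backoff lookup and the perplexity calculation can be illustrated with a toy model. Everything here is an assumption made for the example: the n-gram table and its numbers are invented, the function names are hypothetical, sentence boundary markers are ignored, and log10 is used because that is how ARPA-format models store values (perplexity does not depend on the base of the logarithm).

```python
# Toy ARPA-style table: n-gram -> (log10 probability, log10 backoff weight).
TOY_LM = {
    ("<unk>",): (-2.0, 0.0),
    ("the",): (-1.0, -0.3),
    ("cat",): (-1.5, -0.2),
    ("sat",): (-1.5, -0.2),
    ("the", "cat"): (-0.4, -0.1),
    ("the", "cat", "sat"): (-0.2, 0.0),
}


def log10_prob(lm, history, word, order=3):
    """log10 p(word | history) using the longest matching n-gram."""
    history = tuple(history)[-(order - 1):] if order > 1 else ()
    penalty = 0.0
    for start in range(len(history) + 1):
        context = history[start:]
        ngram = context + (word,)
        if ngram in lm:
            return lm[ngram][0] + penalty
        # N-gram not observed: charge the backoff weight of this context.
        if context in lm:
            penalty += lm[context][1]
    return lm[("<unk>",)][0] + penalty


def perplexity(lm, tokens, order=3):
    """Perplexity of a token sequence under the toy model."""
    total = sum(log10_prob(lm, tokens[:i], w, order) for i, w in enumerate(tokens))
    return 10 ** (-total / len(tokens))


candidates = [["the", "cat", "sat"], ["cat", "the", "sat"]]
best = min(candidates, key=lambda toks: perplexity(TOY_LM, toks))
print(best)
```

In this toy setting the candidate whose n-grams are all observed gets the lower perplexity and is selected.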

Page 10

Linguistically motivated chunks

An advanced approach to chunking
  – Traverse the syntax tree bottom-up, from right to left
  – Add a word to the current chunk if
    • The current chunk is not too long (sentence word count / 4)
    • The word is non-alphabetic or only one symbol long
    • The word begins with a genitive phrase («of »)
  – Otherwise, initialize a new chunk with the word
  – If chunking results in too many chunks, repeat the process, allowing more (than sentence word count / 4) words in a chunk

CICLing 2016
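A much simplified version of these rules is sketched below. It is not the tree-based algorithm itself: it walks a flat token list from right to left instead of traversing a syntax tree, treats the three conditions as alternatives, reads the genitive rule as "the word is of", and uses a hypothetical `max_chunks` threshold for "too many chunks", raising the length limit by one word per retry.

```python
def chunk_tokens(tokens, max_chunks=4, divisor=4):
    """Greedy right-to-left chunking over a flat token list (simplified)."""
    limit = max(1, len(tokens) // divisor)
    while True:
        chunks, current = [], []
        for word in reversed(tokens):
            add_to_current = (
                not current
                or len(current) < limit      # chunk is not too long yet
                or not word.isalpha()        # punctuation, numbers, ...
                or len(word) == 1            # one-symbol word
                or word.lower() == "of"      # assumed reading of the genitive rule
            )
            if add_to_current:
                current.insert(0, word)
            else:
                chunks.insert(0, current)
                current = [word]
        if current:
            chunks.insert(0, current)
        # Accept the result, or retry with longer chunks allowed.
        if len(chunks) <= max_chunks or limit >= len(tokens):
            return chunks
        limit += 1


sentence = (
    "Recently there has been an increased interest in the automated "
    "discovery of equivalent expressions in different languages ."
)
chunks = chunk_tokens(sentence.split())
for chunk in chunks:
    print(" ".join(chunk))
```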

Page 11

Linguistically motivated chunks

Selection of the best translation:
12-gram LM trained with
  – KenLM
  – DGT-Translation Memory corpus (Steinberger et al., 2013), 3.1 million Latvian legal domain sentences
  – Sentences scored with the query program from KenLM

Page 12

Linguistically motivated chunks

Test data
  – 1581 random sentences from the JRC-Acquis corpus
  – ACCURAT balanced evaluation corpus

Page 13

Linguistically motivated chunks

Page 14

Linguistically motivated chunks

Simple chunks
• Recently
• there
• has been an increased interest in the automated discovery of equivalent expressions in different languages
• .

Linguistically motivated chunks
• Recently there has been an increased interest
• in the automated discovery of equivalent expressions
• in different languages .

Page 15

Linguistically motivated chunks

Page 16

Searching for the best

The main differences:
• the manner of scoring chunks with the LM and selecting the best translation
• use of multi-threaded computing, which allows the process to run on all available CPU cores in parallel
• still very slow
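A search over all combinations of chunk translations, scored in parallel, might look like the sketch below. The scoring function is a toy stand-in for the language model, the function names are hypothetical, and a thread pool is used only to keep the example short; for CPU-bound scoring in Python a process pool would be the more realistic choice. The number of combinations grows exponentially with the number of chunks, which is consistent with the remark that the approach is slow.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def best_combination(chunk_candidates, score, workers=4):
    """Score every combination of chunk translations and return the best one."""
    combos = [" ".join(parts) for parts in product(*chunk_candidates)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, combos))
    best = min(range(len(combos)), key=scores.__getitem__)
    return combos[best], scores[best]


KNOWN_WORDS = {"the", "cat", "sat", "on", "mat"}


def toy_score(sentence):
    # Toy stand-in for perplexity: count of words outside a tiny vocabulary.
    return sum(word not in KNOWN_WORDS for word in sentence.split())


# One list of candidate translations per chunk (invented examples).
chunk_candidates = [
    ["the cat", "a cat"],
    ["sat on the mat", "sits at mat"],
]

best, best_score = best_combination(chunk_candidates, toy_score)
print(best, best_score)
```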

Page 17

Searching for the best

[Figure: two panels, Legal domain and General domain]

Page 18

Whole translations

| System | BLEU | Google | Bing | LetsMT | Equal |
|---|---|---|---|---|---|
| Google Translate | 16.92 | 100% | - | - | - |
| Bing Translator | 17.16 | - | 100% | - | - |
| LetsMT | 28.27 | - | - | 100% | - |
| Hybrid Google + Bing | 17.28 | 50.09% | 45.03% | - | 4.88% |
| Hybrid Google + LetsMT | 22.89 | 46.17% | - | 48.39% | 5.44% |
| Hybrid LetsMT + Bing | 22.83 | - | 45.35% | 49.84% | 4.81% |
| Hybrid Google + Bing + LetsMT | 21.08 | 28.93% | 34.31% | 33.98% | 2.78% |

The Google, Bing, LetsMT and Equal columns show the hybrid selection.

May 2015 results (Rikters 2015)

Page 19

Simple chunks

| System | BLEU (whole translations) | BLEU (simple chunks) | Google | Bing | LetsMT |
|---|---|---|---|---|---|
| Google Translate | 18.09 | | 100% | - | - |
| Bing Translator | 18.87 | | - | 100% | - |
| LetsMT | 30.28 | | - | - | 100% |
| Hybrid Google + Bing | 18.73 | 21.27 | 74% | 26% | - |
| Hybrid Google + LetsMT | 24.50 | 26.24 | 25% | - | 75% |
| Hybrid LetsMT + Bing | 24.66 | 26.63 | - | 24% | 76% |
| Hybrid Google + Bing + LetsMT | 22.69 | 24.72 | 17% | 18% | 65% |

The Google, Bing and LetsMT columns show the hybrid selection. Single-system rows report a single BLEU score.

September 2015 (Rikters and Skadiņa 2016(1))

Page 20

Linguistically motivated chunks

| System | BLEU | Equal | Bing | Google | Hugo | Yandex |
|---|---|---|---|---|---|---|
| BLEU (single system) | - | - | 17.43 | 17.73 | 17.14 | 16.04 |
| Whole translations – G+B | 17.70 | 7.25% | 43.85% | 48.90% | - | - |
| Whole translations – G+B+H | 17.63 | 3.55% | 33.71% | 30.76% | 31.98% | - |
| Simple Chunks – G+B | 17.95 | 4.11% | 19.46% | 76.43% | - | - |
| Simple Chunks – G+B+H | 17.30 | 3.88% | 15.23% | 19.48% | 61.41% | - |
| Linguistic Chunks – G+B | 18.29 | 22.75% | 39.10% | 38.15% | - | - |
| Linguistic Chunks – G+B+H+Y | 19.21 | 7.36% | 30.01% | 19.47% | 32.25% | 10.91% |

January 2016 (Rikters and Skadiņa 2016(2))

Page 21

Searching for the best

| System | BLEU (Legal) | BLEU (General) |
|---|---|---|
| Full-search | 23.61 | 14.40 |
| Linguistic chunks | 20.00 | 17.27 |
| Bing | 16.99 | 17.43 |
| Google | 16.19 | 17.72 |
| Hugo | 20.27 | 17.13 |
| Yandex | 19.75 | 16.03 |

May 2016 (Rikters 2016 (2))

Page 22

Interactive MS MT

[Figure: workflow of the interactive tool. Labels: Start page; Translate with online systems (Input source sentence); Input translations to combine (Input source sentence); Input translated chunks; Settings; Translation results]

Page 23

Publications

• Matīss Rikters. "Multi-system machine translation using online APIs for English-Latvian". ACL-IJCNLP 2015
• Matīss Rikters and Inguna Skadiņa. "Syntax-based multi-system machine translation". LREC 2016
• Matīss Rikters and Inguna Skadiņa. "Combining machine translated sentence chunks from multiple MT systems". CICLing 2016
• Matīss Rikters. "K-translate – interactive multi-system machine translation". Baltic DB&IS 2016
• Matīss Rikters. "Searching for the Best Translation Combination Across All Possible Variants". Baltic HLT 2016

Page 24

Publications in progress

• Matīss Rikters. "Interactive Multi-system Machine Translation With Neural Language Models". IOS Press
• Matīss Rikters. "Neural Network Language Models for Candidate Scoring in Hybrid Multi-System Machine Translation". CoLing 2016

Page 25

Neural Language Models

[Figure: two charts plotting perplexity and BLEU against training epoch (0.11 to 1.77). First chart series: Perplexity, BLEU-HY, Linear (BLEU-HY). Second chart series: Perplexity, BLEU, Linear (BLEU).]

Page 26

Code on GitHub

http://ej.uz/ChunkMT

http://ej.uz/SyMHyT

http://ej.uz/MSMT

http://ej.uz/chunker

http://ej.uz/NeuralLM

Page 27

Future work

More enhancements for the chunking step
Add special processing of multi-word expressions (MWEs)
Try out other types of LMs
  – POS tag + lemma
  – Recurrent Neural Network Language Model (Mikolov et al., 2010)
  – Continuous Space Language Model (Schwenk et al., 2006)
  – Character-Aware Neural Language Model (Kim et al., 2015)
Choose the best translation candidate with MT quality estimation
  – QuEst++ (Specia et al., 2015)
  – SHEF-NN (Shah et al., 2015)
Handling MWEs in neural machine translation systems
Experiments on the English-Estonian language pair

Page 28

Other work

• Teaching
  • Supervised several course papers and qualification theses
  • Average grade 8.67
  • Student curator
• Summer / winter schools
  • Deep Learning For Machine Translation
  • ParseME 2nd Training School
  • Neural Machine Translation Marathon

Page 29

References

• Ahsan, A., and P. Kolachina. "Coupling Statistical Machine Translation with Rule-based Transfer and Generation." AMTA: The Ninth Conference of the Association for Machine Translation in the Americas. Denver, Colorado (2010).
• Barrault, Loïc. "MANY: Open source machine translation system combination." The Prague Bulletin of Mathematical Linguistics 93 (2010): 147-155.
• Heafield, Kenneth. "KenLM: Faster and smaller language model queries." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.
• Kim, Yoon, et al. "Character-aware neural language models." arXiv preprint arXiv:1508.06615 (2015).
• Mellebeek, Bart, et al. "Multi-engine machine translation by recursive sentence decomposition." (2006).
• Mikolov, Tomas, et al. "Recurrent neural network based language model." INTERSPEECH. Vol. 2. 2010.
• Petrov, Slav, et al. "Learning accurate, compact, and interpretable tree annotation." Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
• Skadiņš, Raivis, Kārlis Goba, and Valters Šics. "Improving SMT for Baltic Languages with Factored Models." Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications, Vol. 2192, 125-132. 2010.
• Rikters, M., and I. Skadiņa. "Syntax-based multi-system machine translation." LREC 2016. (2016).
• Rikters, M., and I. Skadiņa. "Combining machine translated sentence chunks from multiple MT systems." CICLing 2016. (2016).
• Santanu, Pal, et al. "USAAR-DCU Hybrid Machine Translation System for ICON 2014." The Eleventh International Conference on Natural Language Processing. 2014.
• Schwenk, Holger, Daniel Déchelotte, and Jean-Luc Gauvain. "Continuous space language models for statistical machine translation." Proceedings of the COLING/ACL on Main conference poster sessions. Association for Computational Linguistics, 2006.
• Shah, Kashif, et al. "SHEF-NN: Translation Quality Estimation with Neural Networks." Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015.
• Specia, Lucia, G. Paetzold, and Carolina Scarton. "Multi-level Translation Quality Prediction with QuEst++." 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: System Demonstrations. 2015.
• Steinberger, Ralf, et al. "DGT-TM: A freely available translation memory in 22 languages." arXiv preprint arXiv:1309.5226 (2013).
• Steinberger, Ralf, et al. "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages." arXiv preprint cs/0609058 (2006).

Page 30

Thank you!