Manuel Herranz Pangeanic PangeaMT System, Cleaning, Automation on Retraining, Data Management, Hybridization

Presentation at CEF-EU-Luxembourg


What is PangeaMT?

The first commercial application of open-source Moses (AMTA 2010, http://euromatrixplus.net/moses). Trados compatibility. Automated (re)training modules by folder.

A development overcoming Moses limitations, reported to the Association for Machine Translation in the Americas: "PangeaMT: putting open standards to work... well" (AMTA 2010), http://bit.ly/uM8x6V

2011 PangeaMT launches the DIY Solution to Machine Translate independently and flexibly like never before http://bit.ly/kSd3wC

2011 MT experiences Sony Europe http://slidesha.re/oxZmBS

2011 A harness that eases re-training and updating DIY SMT as presented at TAUS Barcelona 2011 http://slidesha.re/nEe5mU


2PangeaMT

What is PangeaMT?

2012: Collaboration with Toshiba on Japanese hybridization: 2 articles at the Asia-Pacific Association for Machine Translation (2012-2013). Automated (re)training modules by domain tag.

2012: Compatibility with SDLXLIFF, MemoQ and all Tikal formats.

2013: API for hosted solutions.

2014: Compatibility with MemSource.

2015: Pangea v3


Partners in Research

ITI: 8 FT researchers, 85 staff
PRHLT: 15 researchers led by Prof. F. Casacuberta


T/products/services/processes that can be offered to CEF.AT; copyright status/IPR. How can the EC do business with you?

- The Pangea platform is the property of Pangeanic S.L., Valencia, Spain.
- Built on Open Source (some GNL). Copyright/IPR: 80% Pangeanic, 20% ITI (Univ. underlying code).
- Pangeanic is free to commercialize, hire or install a full PangeaMT platform, customize it to specific user needs, and design and implement new features together with its technological partners ITI (Computer Science Institute) and PRHLT.
- Full ownership of the platform includes engine creation by domain, data cleaning processes, engine retraining/update, plus modular pre- and post-processing per language pair (rules, re-ordering, etc.), BLEU scores and translation statistics.
- API for CAT-tool integration: Trados ttx, Studio sdlxliff, MemoQ, Memsource, etc.
- Training on language/field customization.
- Data services (including cleaning).
- Training, consultancy and support.
- Pangeanic is free to provide full system services to any interested party, via RFP or direct purchasing request (PO).


Can the technology be licensed to be integrated in the CEF platform/services?

Yes, to CEF and any other services requiring fast MT + retrainings. PangeaMT is not just an "I sell MT engines" business. It is a full machine translation environment with plug-ins and API calls. It is Moses-based, but it can also change paradigm to Apertium, Thot or other systems whilst keeping powerful language-dependent pre-processing and post-processing modules, which are key to hybridization. The technology is designed more for large MT users than for small LSPs.

What are the licensing conditions? Consultancy services

1-3 year license (renewable), which provides:
- unlimited use for translation purposes
- domain customization
- [practically] unlimited engine creation within the contracted language pairs (typically limited only by the client's data availability)
- engine retrainings.

The use of the platform requires 1 week of training. Consultancy services available.


Languages/language pairs and the achieved level of quality/evaluation results

BLEU results following typical academic standards: 2,000 sentences held out from the main corpus. These results are for general-purpose baseline engines after data cleaning and little customization, typically based on EU, TAUS and our own data. Translators' typical output without MT: from 2,300 words/day upwards. Productivity with MT: around 4-5k words/day/translator with customized engines.

Peaked at 9k/day/person. Automotive client: 8M words/year with 1.5 translators. This means 23,188 words/translator/day over 230 working days (including repetitions).
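
The per-translator figure follows directly from the numbers quoted for the automotive client. A quick check, with illustrative variable names that are not part of the original slides:

```python
# Figures quoted on the slide for the automotive client.
words_per_year = 8_000_000   # 8M words per year
translators = 1.5            # 1.5 translators
working_days = 230           # working days per year

# Words handled per translator per working day (repetitions included).
words_per_translator_per_day = words_per_year / translators / working_days
print(round(words_per_translator_per_day))  # 23188
```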


Languages/language pairs and the achieved level of quality/evaluation results

Evaluation results: clients

Use case Sybase 2011-2012 (MosesCore): http://es.slideshare.net/TAUS/4-june-2012-taus-moses-open-source-mt-showcase-paris-kerstin-bier-sybase
PE productivity >70%, cost savings 20%. 5M words, 49% BLEU EN-DE. Deliveries 50% faster. BLEU was not considered a good metric; METEOR was preferred.

Use case Sony: F/I/G/S, marketing and technical tests. Reported at TAUS / Localization World Barcelona -> +50% productivity increase & time to market. Language Project Managers updated and created engines.
http://www.slideshare.net/manuelherranz/mtexperiences-sony-europe-pangeamt-fprastarosonyeyustepangeamt

From EN and into EN: FR, IT, ES, ES-MX, DE, PT, PT-BR, DA, SV, NO, NL, GK, PL, BG, RO, ZH, JP, KR, RU (19).
From ES and into ES: FR, IT, PT. Under development: DE, SV, RU, ZH. 2016: ES AR


Resources needed to make each tool/service/process work and/or to adapt it to a specific domain (LR & of human manual/preparatory work)

PangeaMT is a machine translation environment and, as such, it can work with any language pair. A sufficiently big and representative training corpus is required for statistical modelling (FastAlign, KenLM, Moses) and model learning.

Basic or baseline models can be trained by default by linguists with sufficient knowledge of TM cleaning. However, the platform allows big improvements on these models in several ways:

Language-specific and domain-specific modular pre-processing and post-processing: linguistic information, re-ordering, etc., help to normalize the text internally for better SMT model learning. The platform includes some general pre-processing modules and others that are language-dependent. New pre-processing modules can be easily added.

Table combination. This is a Moses functionality that combines several phrase (segment) tables. In domain adaptation, the general table is used as a fallback when a segment is out of vocabulary in the domain table or its score does not reach a good enough threshold.

Monolingual data for language model enhancement.

Hybridization. Language rules can be included in pre- or post-processing and users can opt for hierarchical models (slower results). WIP: Morphology-rich languages and Japanese with POS.
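
The table-combination fallback described above can be pictured with a toy lookup. This is only a sketch under stated assumptions: the dictionary-based tables, the `lookup_with_fallback` function and the `min_score` threshold are hypothetical illustrations and do not reproduce Moses internals or the PangeaMT implementation.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy phrase tables: source phrase -> (translation, score).
PhraseTable = Dict[str, Tuple[str, float]]


def lookup_with_fallback(
    phrase: str,
    domain: PhraseTable,
    general: PhraseTable,
    min_score: float = 0.5,  # assumed threshold, not a PangeaMT setting
) -> Optional[Tuple[str, str]]:
    """Return (translation, table_used), preferring the domain table."""
    entry = domain.get(phrase)
    if entry is not None and entry[1] >= min_score:
        return entry[0], "domain"
    # Out of vocabulary in the domain table, or score below threshold:
    # fall back to the general table.
    entry = general.get(phrase)
    if entry is not None:
        return entry[0], "general"
    return None  # unknown to both tables


domain_table = {"brake pad": ("pastilla de freno", 0.9), "manual": ("manual", 0.2)}
general_table = {"manual": ("manual de usuario", 0.6), "thank you": ("gracias", 0.95)}

print(lookup_with_fallback("brake pad", domain_table, general_table))
print(lookup_with_fallback("manual", domain_table, general_table))
print(lookup_with_fallback("thank you", domain_table, general_table))
```

The domain table wins whenever it has a sufficiently confident entry; the general table only widens coverage.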


Resources needed to make each tool/service/process work and/or to adapt it to a specific domain (LR & of human manual/preparatory work)

Minimum 5M words for meaningful domains (as reported by Sybase). Customization time: 1 person, 1-2 weeks, including data cleaning.

General models can be trained with minimum human intervention. All users need to do is to upload the cleanest possible TMX bitexts. The system will automatically train the model applying general or language-specific pre-processing routines.

Pre-processing routines can be adapted with use (by a computational linguist/programmer).

Specialized models (or domain-specific) can be trained and set with specific language features: language / domain. These specific models can be combined with general models to obtain a wider coverage.

Retrainings are always automated: Sybase, Sony, Honda, Subaru, Hioki.


Data? best clean, thank you

A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.

Système permettant de r prer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.

On 22nd May we decided not to join the group.

El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.

It is a journalistic point of view and strengths of the English-language newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today's translation industry.

Data? best clean, thank you


Engine training with clean data
Having approved, terminologically sound, clean data improves engine accuracy and performance, even with small sets of data.

Data cleaning modules
Remove any suspects:
- Sentences that are too long
- Mismatches (of many kinds!)
- Terminological inaccuracies
- Non-useful segments, etc.

Parallel text extraction / Translation input / Post-edited material
This often comes from CAT tools, document alignments or crawling.

Data Cleaning (in-lines)
Remove all non-translation data.

TMX Human approval
Some of this material may actually be OK for training. It is then input in the training set.
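
The cleaning steps listed above can be sketched as a simple filter over segment pairs. This is a minimal illustration, not the PangeaMT cleaning modules: the function names, the regular expression for in-line tags and the thresholds `max_tokens` and `max_ratio` are all assumptions chosen for the example.

```python
import re
from typing import Iterable, List, Tuple

# Assumed pattern for in-line markup (e.g. <b>, <ph .../>) left over from CAT tools.
INLINE_TAG = re.compile(r"<[^>]+>")


def strip_inlines(text: str) -> str:
    """Remove in-line tags and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", INLINE_TAG.sub(" ", text)).strip()


def clean_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_tokens: int = 80,    # assumed "too long" limit
    max_ratio: float = 3.0,  # assumed length-mismatch limit
) -> List[Tuple[str, str]]:
    """Keep only pairs that pass basic length and mismatch checks."""
    kept = []
    for src, tgt in pairs:
        src, tgt = strip_inlines(src), strip_inlines(tgt)
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # empty side: non-useful segment
        if n_src > max_tokens or n_tgt > max_tokens:
            continue  # sentence too long
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # length mismatch between source and target
        kept.append((src, tgt))
    return kept


sample = [
    ("Press the <b>red</b> button.", "Pulse el botón rojo."),
    ("On 22nd May we decided not to join the group.", "OK"),
    ("", "Hola"),
]
cleaned = clean_pairs(sample)
print(cleaned)
```

Pairs rejected by such a filter would go to human approval rather than being discarded outright, since some of them may still be usable for training.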

Data? best clean, thank you


Programme Committee comment
Concentrate on the processes and the automation of workflows, and on how well these have been validated and tested in real production of MT systems.

We are interested in the "factory" concept, automating training data domain selection, tuning and data cleaning.


System features: Cleaning


System features for EXPERT: Domain


System features for EXPERT: Engine Creation


System features for EXPERT: Engine Training


Hybridization Experiences at Pangeanic: Rationale

- Moses reordering is insufficient when the syntactic distance between languages is very large (unrelated languages). Patterns are lost (or not found), giving monotone translation.

- BLEU is not a good indicator.

- Improve PERCEIVED quality.


Hybridization Experiences at Pangeanic

SYNTAX-BASED (TREE) FOR HYBRID SMT

Syntax-based analysis & re-ordering rules
Tree depth: 10
Calc time: +59%!!


Hybridization Experiences at Pangeanic: Steps Toward EN-JP MT Hybridization

TWO OPTIONS

TOSHIBA vs MECAB: LESSONS LEARNT
MeCab re-ordering produced higher BLEU than Toshiba's.


Hybridization Experiences at Pangeanic

TOSHIBA vs MECAB: LESSONS LEARNT
MeCab re-ordering produced higher BLEU than Toshiba's.

Paper published December 2011 at AAMT: "Going Hybrid: Pangeanic's and Toshiba's First Steps Toward EN-JP MT Hybridation"


Installing PangeaMT

- MySQL Server: mysql-server, version 5.5
- Open JDK: openjdk-8-jdk, version 8
- Apache Tomcat: tomcat8, version 8
- Moses
- Giza++
- Fast_align
- KenLM
- Tikal (Okapi)
- IRSTLM
- mkcls
- mgiza
- tercpp
- multi-bleu.perl
- mecab
- timeout
- python
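
Before installation, it can help to check which of the listed tools are already on the machine. The sketch below uses Python's standard `shutil.which`; the executable names in `ASSUMED_BINARIES` are guesses at how each component is typically exposed on the `PATH` (for example `lmplz` for KenLM) and may differ on a given system.

```python
import shutil
from typing import Dict, Iterable, Optional

# Assumed executable names for some of the listed components;
# adjust them to match the local installation.
ASSUMED_BINARIES = [
    "mysql", "java", "moses", "fast_align", "lmplz",
    "mkcls", "mgiza", "mecab", "timeout", "python",
]


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools(ASSUMED_BINARIES).items():
    print(f"{tool:12s} {path if path else 'NOT FOUND'}")
```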


[email protected]
#manuelhrrnz #pangeanic

pangeanic
