PangeaMT – putting open standards to work… well

Manuel Herranz

Panacea presentation - Pangeanic - Budapest

DESCRIPTION

Presentation on the history of MT and how language resources have helped to develop MT (particularly statistical MT), with an emphasis on Pangeanic's experience.

Page 1: Panacea presentation - Pangeanic - Budapest

PangeaMT – putting open standards to work… well

Manuel Herranz

Page 2: Panacea presentation - Pangeanic - Budapest

Chomsky: Imagine that if in the future large enough amounts of data existed, they could be processed by computers with enough computing power

rule-based systems, IBM licenses, many linked to patent EN/RU & Intel

First statistical papers

1st Open source SMT

Translation industry appropriating Moses: http://euromatrixplus.net/moses

DIY SMT

http://t.co/HDTboxQ

Page 3: Panacea presentation - Pangeanic - Budapest

PEAK of Cold War and information control. Products & information directed to consumers / users / citizens

BEGINNING of data resources. Internet. Accessibility to information

Content generated BY USERS / CONSUMERS / CITIZENS, multilingual, free information exchange across the world

Page 4: Panacea presentation - Pangeanic - Budapest

2007/08

2009/10

2011/12

• DIY SMT

• Empower Users

• Glossary

• Automated re-training

• Transfer architecture and know-how to users

• Compatibility with commercial formats (ttx, sdlxliff, itd)

2007 and before

• RB tests with commercial software

• Insufficiently good output

• Only internal production

• EU Post-Editing Award

• V1: Small data sets (2-5M words), automotive & electronics

• (ES), then Fr/It/De in other fields

• Division born

• 00's of engine trials and language combinations

• Open-Source to commercial

• TMX / XLIFF workflows

As of May 2009: 487 billion gigabytes, or 1,000,000,000 * 487,000,000,000 = 4.87 x 10^20 bytes

Estimates: up 50% a year (Oracle); doubles every 11 hours (IBM)
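The arithmetic behind the figure above can be checked in a few lines. This is a minimal sketch: it assumes decimal gigabytes (10^9 bytes each), and `projected_bytes` is a hypothetical helper that simply compounds the "up 50% a year" estimate; it is not a forecast taken from the slides.

```python
# Figures quoted on the slide; the decimal gigabyte (10**9 bytes) is an assumption.
BYTES_PER_GIGABYTE = 10**9
gigabytes_may_2009 = 487 * 10**9  # "487 billion gigabytes"

total_bytes = gigabytes_may_2009 * BYTES_PER_GIGABYTE


def projected_bytes(start_bytes: float, yearly_growth: float, years: int) -> float:
    """Compound a starting volume by a fixed yearly growth rate (hypothetical helper)."""
    return start_bytes * (1 + yearly_growth) ** years


print(f"{total_bytes:.2e}")  # 4.87e+20
# Volume after five years if the 50%-a-year estimate held constant.
print(f"{projected_bytes(total_bytes, 0.50, 5):.2e}")
```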

Page 5: Panacea presentation - Pangeanic - Budapest

OBJECTIVES = CHALLENGES

Turn academic development (Moses) into a commercial application.

To provide high-quality MT for post-editing and save time and cost. Not Google-type broad translation, but domain-specific and user-centric.

Lower the entry level for MT. Bring democracy and affordability to MT. Bring it to the user, take it away from the programmer.

How? By fostering open-standard geared translation automation strategies

To use only community-based open standards (OASIS / ISO: XLIFF, TMX, XML). NO proprietary formats (technology independence), so USERS are not “locked in” to buying and updating expensive software.

Page 6: Panacea presentation - Pangeanic - Budapest

LR… The plight

• TAUS TDA, millions of words

• Own data

• Sony (client donation)

• (Manual) data gathering & alignment

• Manual cleaning until some tools were developed – limited SME resources

• Resources getting smaller soon – need to build more

• HELP!!!!

Page 7: Panacea presentation - Pangeanic - Budapest

The rush for data

We soon realised that there was a rush to gather data, but that other resources around the data were also necessary

cleaning

More cleaning

Page 8: Panacea presentation - Pangeanic - Budapest

The rush for data (clean)

cleaning

More cleaning

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>A system for recovering the methane that is emitted from the manure so that

it does not leak into the atmosphere.</seg>

</tuv>

<tuv xml:lang="FR-FR">

<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel

d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>

</tuv>

<tu creationdate="20090817T114430Z" creationid="APIACCESS"

changedate="20110617T141159Z" changeid="pat">

<tuv xml:lang="EN-US">

<seg>Overall heigtht –<bpt i="1">{\f43 </bpt> <ept i="1">}</ept>25&quot;; width –

<bpt i="2">{\f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg>

</tuv>

<tuv xml:lang="ES-EM">

<seg><bpt i="1">{\f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>–

<bpt i="2">{\f43 </bpt> <ept i="2">}</ept><bpt i="3">{\f2 </bpt>20,1&quot;.<ept

i="3">}</ept></seg>

</tuv>

</tu>

<tuv xml:lang="EN-US">

<seg>On 22nd May we decided not to join the group.</seg>

<tuv xml:lang="DE-DE">

<seg>Am 22. </seg>
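The defects visible in the units above (a corrupted character, RTF formatting left inside `bpt`/`ept` tags, a truncated target) are the kind of thing a cleaning pass can flag automatically. The following is a minimal sketch, assuming the segments have already been extracted from the TMX as plain strings. The function names, the regular expressions, and the 3.0 length-ratio threshold are illustrative assumptions, not PangeaMT's actual cleaning tools.

```python
import re

# Hypothetical heuristics; patterns and thresholds are illustrative assumptions.
INLINE_TAG = re.compile(r"</?(?:bpt|ept|ph|it|hi)\b[^>]*>")
RTF_DEBRIS = re.compile(r"\{\\[a-z]+\d*\s?|\}")
# A currency sign wedged between letters, or the Unicode replacement character.
ENCODING_DEBRIS = re.compile(r"\ufffd|[^\W\d][€¤]\s?[^\W\d]")


def strip_markup(segment: str) -> str:
    """Remove TMX inline tags and the RTF control codes they wrap."""
    without_tags = INLINE_TAG.sub("", segment)
    without_rtf = RTF_DEBRIS.sub("", without_tags)
    return re.sub(r"\s+", " ", without_rtf).strip()


def has_markup(segment: str) -> bool:
    """True if the segment carries inline tags or RTF residue."""
    return bool(INLINE_TAG.search(segment) or RTF_DEBRIS.search(segment))


def check_pair(source: str, target: str, max_ratio: float = 3.0) -> list[str]:
    """Return the problems found in one source/target segment pair."""
    problems = []
    if has_markup(source) or has_markup(target):
        problems.append("inline markup")
    clean_source, clean_target = strip_markup(source), strip_markup(target)
    if ENCODING_DEBRIS.search(clean_source) or ENCODING_DEBRIS.search(clean_target):
        problems.append("encoding debris")
    longer = max(len(clean_source), len(clean_target))
    shorter = max(1, min(len(clean_source), len(clean_target)))
    if longer / shorter > max_ratio:
        problems.append("length mismatch")
    return problems


# The truncated German unit from the slide is flagged by the length check.
print(check_pair("On 22nd May we decided not to join the group.", "Am 22. "))
```

Pairs that come back with an empty list would go on to training; flagged pairs could be repaired (for markup, by keeping the `strip_markup` output) or dropped, depending on how much data can be spared.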

Page 9: Panacea presentation - Pangeanic - Budapest

LR… The plight

• MORE DATA!!!

• Domain specific

• Monolingual (for LM) and parallel crawling

• Corpus normalization

• (Semi) automated PoS tagging

• Self-generation of similar texts for morphologically-rich languages
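As an illustration of the corpus normalization item in the list above, here is a minimal sketch for a monolingual corpus, under stated assumptions: Unicode NFC composition, typographic quotes mapped to ASCII, whitespace collapsed, and exact duplicate lines dropped. `normalize_line` and `normalize_corpus` are hypothetical names, and the choice of steps is an assumption rather than a description of the PangeaMT pipeline.

```python
import unicodedata

# Typographic quotes and non-breaking spaces mapped to plain equivalents (assumed policy).
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'", "\u00a0": " "})


def normalize_line(line: str, lowercase: bool = False) -> str:
    """Compose characters (NFC), unify quotes, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", line)
    text = text.translate(QUOTE_MAP)
    text = " ".join(text.split())
    return text.lower() if lowercase else text


def normalize_corpus(lines):
    """Yield normalized lines, skipping empty lines and exact duplicates, in order."""
    seen = set()
    for line in lines:
        cleaned = normalize_line(line)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            yield cleaned


print(list(normalize_corpus(["a  b", "a b", "", "c"])))  # ['a b', 'c']
```

For parallel data the deduplication would have to work on source/target pairs rather than single lines, so that the two sides stay aligned.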

Page 10: Panacea presentation - Pangeanic - Budapest

- “on the fly” SMT training (minutes / hours, not manually) – April 2011!!

- pick and match sets of data: “extreme customization” – April 2011!!

- online, user-customizable glossaries, DNTs, expressions → “predictive SMT” – May 2011!!

- objective stats for post-editors (calculate effort)

- confidence scores for users (→ translators or readers) with CAT integration (web-based / desktop)…
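One common way to put a number on post-editing effort is the word-level edit distance between the raw MT output and its post-edited version, divided by the length of the post-edited text. The sketch below assumes whitespace tokenization and plain Levenshtein distance; it omits the block shifts that TER-style metrics add. The slides do not say how PangeaMT calculates effort, so `post_edit_effort` is a hypothetical illustration only.

```python
def word_edit_distance(hypothesis: list[str], reference: list[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def post_edit_effort(mt_output: str, post_edited: str) -> float:
    """Edits per post-edited word; 0.0 means the MT output was left untouched."""
    mt_words, pe_words = mt_output.split(), post_edited.split()
    if not pe_words:
        # Arbitrary convention for an empty post-edit: full effort unless MT was empty too.
        return 0.0 if not mt_words else 1.0
    return word_edit_distance(mt_words, pe_words) / len(pe_words)


# One inserted word out of six post-edited words.
print(round(post_edit_effort("the cat sat on mat", "the cat sat on the mat"), 3))  # 0.167
```

Averaged over a job, a score like this gives post-editors and project managers a comparable figure across engines and domains, although it measures textual change and not time spent.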

Page 11: Panacea presentation - Pangeanic - Budapest

- API integration + user domain building

- Audiovisual integration

- Release the code to users → create a community and flavours for each situation; hybridize and create rules

- Have more and more companies and institutions use PangeaMT as their platform and make it grow

Page 12: Panacea presentation - Pangeanic - Budapest

[Chart: timeline from 2009 to 2018; labels: User empowerment, 000's of customized MT systems, YEAR 2016]

Predictions: Tech. not the realm of a few providers

Page 13: Panacea presentation - Pangeanic - Budapest

[Chart: timeline from 2009 to 2018]

Page 14: Panacea presentation - Pangeanic - Budapest

[Chart: timeline from 2009 to 2018; labels: MT acceptance, User empowerment, 000's of customized MT systems, YEAR 2016]

• MT acceptance growth.

• Translator engagement challenge

• Need for data has been addressed – still more work to be done.

• Users and practitioners now can build their own systems.

Until 2011

In 5 years... after 2016

Predictions

• Combinations??

• Supra-engines??

• World-knowledge??

…suggestions…???

Tech. not the realm of a few providers

Page 15: Panacea presentation - Pangeanic - Budapest

QUESTIONS ?

[email protected]