Bridging the Gap between Iberian Languages MT to the rescue Juan Alberto Alonso 04.05.2012

Bridging the Gap between Iberian Languages - MT to the rescue


DESCRIPTION

A presentation given by Juan Alberto Alonso about the co-official languages in Spain and the special role Machine Translation (MT) plays in it.


Page 1: Bridging the Gap between Iberian Languages - MT to the rescue

Bridging the Gap between Iberian Languages - MT to the rescue

Juan Alberto Alonso

04.05.2012

Page 2: Bridging the Gap between Iberian Languages - MT to the rescue

Agenda

The Use of MT

MT and the Iberian languages

A success case with Catalan

Basque and Portuguese: two special cases

© Lucy Software Ibérica SL / 2

Page 3: Bridging the Gap between Iberian Languages - MT to the rescue

The Use of MT

When is MT useful?


Page 4: Bridging the Gap between Iberian Languages - MT to the rescue

When is MT Useful?

When it is adapted to the user’s specific needs:

Terminology

Document format

Linguistic peculiarities


Page 5: Bridging the Gap between Iberian Languages - MT to the rescue

When is MT useful?

When it is properly used according to:

The translation quality delivered by the language pair in question

The type of documents to be translated

The user environment where it has to be integrated


Page 6: Bridging the Gap between Iberian Languages - MT to the rescue

When is MT Useful?

When it is well integrated into the user’s document flow:

CMS, proxies, etc.

Press agencies and newspapers

Translation agencies


Page 7: Bridging the Gap between Iberian Languages - MT to the rescue

The Uses of MT

Assimilation (Information)

Translation quality does not need to be very high

For languages linguistically more distant

Useful to break language barriers (from 0% to X%)

Dissemination (Production)

Very high MT quality

For closely related languages

Can be integrated into very complex user environments


Page 8: Bridging the Gap between Iberian Languages - MT to the rescue

Languages: Multilingualism, projection and role in the world

Page 9: Bridging the Gap between Iberian Languages - MT to the rescue

Official policies: Language policy

Spanish is the official language in Spain, alongside four co-official languages: Basque, Catalan/Valencian and Galician

Portuguese: a new linguistic norm aimed at the international unification of the language.


Page 10: Bridging the Gap between Iberian Languages - MT to the rescue

MT with the Iberian Languages: A Unique Case

Assimilation (Information)

Translation quality does not need to be very high

For languages linguistically more distant

Useful to break language barriers (from 0% to X%)

Dissemination (Production)

Very high MT quality

For closely related languages

Can be integrated into very complex user environments


Page 11: Bridging the Gap between Iberian Languages - MT to the rescue

MT with the Iberian Languages: A Unique Case

Assimilation

Translation quality does not need to be very high

For languages linguistically more distant

Useful to break language barriers (from 0% to X%)

Dissemination

Very high MT quality

For closely related languages

Can be integrated into very complex user environments

Political Factors

The promotion of minority languages is a political issue and is supported by local Governments

Need for huge translation volumes


Page 12: Bridging the Gap between Iberian Languages - MT to the rescue

Castilian, Catalan and Galician: An Ideal Scenario for MT

The translation quality yielded by MT among Castilian, Catalan and Galician is very high (above 95%)

Through a ramp-up phase where the MT system is adapted to the user’s needs, this quality can become even better.

The daily “normal” use of Catalan and Galician is officially encouraged and supported by the corresponding local Governments


Page 13: Bridging the Gap between Iberian Languages - MT to the rescue

Castilian, Catalan and Galician: An Ideal Scenario for MT

There is a real and constant need to translate huge volumes of documentation between Castilian and Catalan (less so for Galician).

MT has been used for years in complex production environments for Castilian-Catalan (newspapers, translation agencies, Public Administrations, etc.), with millions of words MT-translated and post-edited on a daily basis and therefore...

There is a years-long culture of productive MT use, with users and post-editors trained to use these systems. This is probably a unique case in the world.


Page 14: Bridging the Gap between Iberian Languages - MT to the rescue

Castilian, Catalan and Galician: An Ideal Scenario for MT

Dissemination

Very high MT quality

For closely related languages

Can be integrated into very complex user environments

Political Factors

The promotion of minority languages is a political issue and is supported by local Governments

Need for huge translation volumes


Page 15: Bridging the Gap between Iberian Languages - MT to the rescue

A Success Case for Spanish-Catalan: La Vanguardia

La Vanguardia is the leading newspaper in Catalonia, and one of the main newspapers in the rest of Spain, with an average daily circulation of over 200,000 copies. It is widely recognized as a quality newspaper.

Since May 3rd, 2011, La Vanguardia has had two parallel editions, one in Spanish and one in Catalan.


Page 16: Bridging the Gap between Iberian Languages - MT to the rescue

La Vanguardia: The Challenge - 3 Options

Given the task of making bilingual daily editions of a newspaper, three possible options could be considered:

The “MT-less” option: Using no MT at all

The “full-MT” option: “Only” using MT

The “sensible-MT” option: Using MT + customization + human post-editors


Page 17: Bridging the Gap between Iberian Languages - MT to the rescue

La Vanguardia: The Challenge - The “MT-less” Option

Duplicate the whole editorial team and/or hire a team of N human translators to translate the entire newspaper content on time, in order to keep both editions synchronized for publishing.

Duplicate most of the IT infrastructure

Given all these factors, the question arises of whether it would be feasible to produce bilingual editions of a newspaper this way, because of:

Dramatic increase in costs

Very tight time constraints


Page 18: Bridging the Gap between Iberian Languages - MT to the rescue

La Vanguardia: The Challenge - The “full-MT” Option

Run all the contents of the base edition through an MT system.

Publish the raw MT output of the original contents in the other-language edition.

Obviously, this is not an option because, even for language pairs for which the quality of MT is very high (as is the case for Spanish-Catalan, > 95%), the output mistakes would be unacceptable for publishing (proper nouns being translated, homographs, etc.) and the resulting Catalan style would not always sound “natural” to Catalan speakers.


Page 19: Bridging the Gap between Iberian Languages - MT to the rescue

La Vanguardia: The Challenge - The “sensible-MT” Option

Customize the MT-system to the specific linguistic needs of the newspaper (style guide, corporate terminology, proper nouns, etc.)

Integrate the MT-flow within the newspaper editorial flow (document and character formats, connection to a post-editing environment, feedback processing, etc.)

Incorporate into the editorial flow a post-editing environment to be used by a team of human post-editors.

This is a compromise between the time and effort saved by using MT and the translation quality.
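One of the raw-MT mistakes named in this deck is proper nouns being translated. A common way to guard against it is to mask protected terms before translation and restore them afterwards. The following is a minimal sketch of that idea, not a description of how the Lucy system works: `toy_translate` is a hypothetical word-for-word stand-in for a real MT engine, and the placeholder format and function names are assumptions made for illustration.

```python
def mask_terms(text, protected_terms):
    """Replace each protected term with a placeholder token."""
    mapping = {}
    # Longest terms first, so multi-word names are masked before their parts.
    for i, term in enumerate(sorted(protected_terms, key=len, reverse=True)):
        placeholder = f"__TERM{i}__"
        if term in text:
            text = text.replace(term, placeholder)
            mapping[placeholder] = term
    return text, mapping


def unmask_terms(text, mapping):
    """Put the original protected terms back in place of the placeholders."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


def toy_translate(text):
    """Hypothetical stand-in for an MT engine: naive word-for-word lookup."""
    lexicon = {"Vanguardia": "Avantguarda", "dos": "dues", "ediciones": "edicions"}
    return " ".join(lexicon.get(word, word) for word in text.split())


def translate_protected(text, protected_terms, translate=toy_translate):
    """Translate text while leaving the protected terms untouched."""
    masked, mapping = mask_terms(text, protected_terms)
    return unmask_terms(translate(masked), mapping)


source = "La Vanguardia publica dos ediciones"
print(toy_translate(source))                           # newspaper name gets translated
print(translate_protected(source, ["La Vanguardia"]))  # newspaper name is preserved
```

In a production setting the list of protected terms would come from the customized lexicons and style guide rather than from a hard-coded list.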


Page 20: Bridging the Gap between Iberian Languages - MT to the rescue

Requirements from La Vanguardia

One daily copy of La Vanguardia includes over 60,000 words, all of them to be translated, revised and post-edited.

The Catalan edition should comply with the linguistic requirements stated in the Style Guide of La Vanguardia.

Both editions should be ready for printing every day by 23:30 at the latest.

Currently, most journalists at La Vanguardia write in Spanish, which is now the base edition from which the Catalan edition is created.

In the short to mid term, however, every journalist will be free to write in the language of his/her choice (Catalan or Spanish), so that there will actually be no base edition.

Both the MT-system and the post-editing environment should be completely integrated into their editorial flow (both IT integration and human team integration).


Page 21: Bridging the Gap between Iberian Languages - MT to the rescue

How the MT-System was Customized for La Vanguardia

Computational linguists, post-editing experts and the La Vanguardia editorial team worked together for six months in order to:

Customize the MT-system to their linguistic requirements (as far as possible):

Over 20,000 lexical entries added/changed in the MT-system lexicons

Around 440 rules adapted in the MT-system grammars

Integrate the MT-system into their IT editorial environment:

Integration with the HERMES CMS

Handling of La Vanguardia-specific character formats and XML tags

Inclusion of markups specifically designed for post-editors

Translation performance that meets the translation load and peak requirements
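XML tag handling, mentioned in the integration list above, generally means translating only the text between tags while leaving the markup itself intact. The sketch below illustrates the principle with Python's standard `xml.etree.ElementTree`. The element names in the example are invented, the actual HERMES CMS format is not described in these slides, and `translate` is a hypothetical callable standing in for the MT engine.

```python
import xml.etree.ElementTree as ET


def translate_xml(xml_string, translate):
    """Apply `translate` to text content only, leaving tags and attributes as they are."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        # Text directly inside the element, before its first child.
        if element.text and element.text.strip():
            element.text = translate(element.text)
        # Text following the element's closing tag, inside its parent.
        if element.tail and element.tail.strip():
            element.tail = translate(element.tail)
    return ET.tostring(root, encoding="unicode")


# Upper-casing stands in for translation so the effect is easy to see.
print(translate_xml("<p>dos <b>ediciones</b> hoy</p>", str.upper))
```

Translating each text fragment in isolation is a simplification: a real integration would need to give the engine whole sentences even when inline tags split them.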

A team of around 15 people has been trained to post-edit the MT output before publishing.
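As a rough back-of-the-envelope check, the figures quoted in this deck (over 60,000 words per daily copy, a team of around 15 post-editors) give an idea of the post-editing load per person. The sketch assumes, purely for illustration, that the work is split evenly across the whole team; the 8-hour working window is a hypothetical value, not a figure from the presentation.

```python
WORDS_PER_DAILY_COPY = 60_000  # "over 60,000 words", so a lower bound
POST_EDITORS = 15              # "around 15", an approximate team size
ASSUMED_HOURS = 8              # hypothetical working window, not from the slides


def words_per_post_editor(total_words, team_size):
    """Average number of words each post-editor handles per issue."""
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    return total_words / team_size


def words_per_hour(words, hours):
    """Average hourly post-editing rate needed to finish within `hours`."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


per_person = words_per_post_editor(WORDS_PER_DAILY_COPY, POST_EDITORS)
print(per_person)                                 # 4000.0 words per post-editor
print(words_per_hour(per_person, ASSUMED_HOURS))  # 500.0 words per hour
```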


Page 22: Bridging the Gap between Iberian Languages - MT to the rescue

La Vanguardia: Conclusions

Producing two parallel editions of a daily newspaper, one per language, only seems to be feasible if:

MT is used

MT is properly customized, adapted and integrated to meet the newspaper’s linguistic and IT requirements.

There is a team of trained specialized human post-editors who correct MT mistakes and “give the human flavor” to the output.


Page 23: Bridging the Gap between Iberian Languages - MT to the rescue

Portuguese: A Different Scenario

Portuguese is one of the Iberian languages with high business potential (both in Portugal and Brazil/South America)

The translation quality given by MT-systems between Portuguese and Spanish is very high (similar to that among Castilian, Catalan and Galician).

However, in the case of Portuguese, the key factor is business needs and opportunities, not the political drive.


Page 24: Bridging the Gap between Iberian Languages - MT to the rescue

Portuguese: A Different Scenario

Dissemination

Very high MT quality

For closely related languages

Can be integrated into very complex user environments

Market Needs

There is a wide market asking for quality translation between Portuguese and Spanish

Need for huge translation volumes


Page 25: Bridging the Gap between Iberian Languages - MT to the rescue

Basque: yet Another Different Scenario

CA: El basc és un cas particular entre les llengües de la Península Ibèrica

ES: El vasco es un caso particular entre las lenguas de la Península Ibérica

GL: O vasco é un caso particular entre as linguas da Península Ibérica

EU: Iberiar Penintsulako hizkuntzen artean euskara kasu berezia da

PT: O basco é um caso particular entre as línguas da Península Ibérica

EN: Basque is a special case among the languages of the Iberian Peninsula


Page 26: Bridging the Gap between Iberian Languages - MT to the rescue

Basque: yet Another Different Scenario

Political Factors

The promotion of minority languages is a political issue and is supported by local Governments

Need for huge translation volumes

Assimilation

Translation quality does not need to be very high

For languages linguistically more distant

Useful to break language barriers (from 0% to X%)

Dissemination

Enough MT quality for restricted domains

For closely related languages

Can be integrated into very complex user environments


Page 27: Bridging the Gap between Iberian Languages - MT to the rescue

Basque: yet Another Different Scenario

Basque is a special case among the Iberian languages: it is not an Indo-European language. It is linguistically very different from the rest of the Iberian languages (and, incidentally, also from any other human language).

The MT quality between Basque and Castilian, Portuguese, Galician or Catalan will be lower than that obtained among the latter four.

Adapted for restricted domains, the MT quality can be sufficient for productive use.

Its daily “normal” use is being encouraged and supported by the Basque Government.

The use of MT to translate from Basque into Castilian, Catalan, Portuguese, Galician or English is a good example of assimilation use (breaking language barriers).

Page 28: Bridging the Gap between Iberian Languages - MT to the rescue

Lucy Basque MT Portal

The first MT-systems with Basque already exist and new ones will be developed in the short to mid term


Page 29: Bridging the Gap between Iberian Languages - MT to the rescue

Questions?


Page 30: Bridging the Gap between Iberian Languages - MT to the rescue

Thank you for your attention!

Juan A. Alonso

Lucy Software Ibérica

[email protected]
