Computing Science, University of Aberdeen

CS4025: Machine Translation

Background, how languages differ
MT Techniques
Controlled languages

For more info: J&M, chap 21 in 1st ed, 25 in 2nd ed. Also extra notes.

Machine Translation

Automatically translate texts between languages (e.g., English to Japanese)
  » Or assist human translators?

One of the oldest dreams of NLP, AI, and CS (first system in 1954).

Varieties of Machine Translation

Translating from a source language to a target language.

(FA)MT – (Fully Automatic) Machine Translation
HAMT – Human-Aided MT (aid before or after)
MAHT – Machine-Aided Human Translation

Brief History of MT

Serious but naïve work in the 1950s
1966 ALPAC report (speed, cost, accuracy) terminated most research funding
“Underground” MT systems developed into products (e.g. SYSTRAN) in the 1970s
More MT products emerged in the 1980s and 1990s, though still relatively simple
MT now in everyday widespread use (e.g. for web pages), in spite of its problems

Translation is Hard: Language differences

Lexical

Meanings assigned to a word
  » to know a person
  » to know a fact
Boundaries on a scale
  » friend vs acquaintance
Preferences
  » sibling vs brother vs elder brother
Gaps
  » Japanese has no word for privacy

Overlaps between word senses (Eng/Fr)

Syntactic differences

Morphology vs word-order
  » English: John saw Jane
  » Russian: John[+subject] saw Jane[+object]
Which word orders
  » English: a cheap car
  » French: a car cheap
Argument order (e.g. VSO/SVO/SOV languages)
  » English: John likes apples
  » Spanish: apples gustar John

Pragmatic differences

Zero pronouns
  » Bake [] for 20 minutes
Extra distinctions
  » Relative-status markers in Japanese
Cultural knowledge
  » mu -> curtains of her bed, not just curtains

Translating from Chinese to English…

dai yu zi zai chuang shang gan nian bao chai you ting jian chuang wai zhu shao xiang ye zhe shang, yu sheng xi li, qing han tou mu, bu jue you di xia lei lai.

Dai-yu alone on bed top think-of-with-gratitude Bao-chai again listen to window outside bamboo tip plantain leaf of on-top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come

As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.

Perfect Translation needs World Knowledge

Example: Translating “it” into a language which associates grammatical gender with nouns requires identifying the antecedent:
  » A hollow cylinder … rests on a surface … and an object is suspended so that it …

English    German     Gender      Pronoun
Surface    Flaeche    Feminine    sie
Cylinder   Zylinder   Masculine   er
Object     Objekt     Neuter      es
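
The table can be read as a lookup, sketched below in Python. This is only an illustration: it assumes the antecedent of “it” has already been identified by some other component, which is the hard part that needs world knowledge and is not shown. The names GERMAN_NOUNS, PRONOUNS and translate_it are made up for this sketch.

```python
# Noun data copied from the table on the slide.
GERMAN_NOUNS = {
    "surface": ("Flaeche", "feminine"),
    "cylinder": ("Zylinder", "masculine"),
    "object": ("Objekt", "neuter"),
}

# German pronoun for each grammatical gender (from the same table).
PRONOUNS = {"feminine": "sie", "masculine": "er", "neuter": "es"}


def translate_it(antecedent):
    """Return the German pronoun for "it", given its English antecedent.

    Assumption: the caller already knows the antecedent. Working that out
    from the sentence is the step that requires world knowledge.
    """
    _german_noun, gender = GERMAN_NOUNS[antecedent]
    return PRONOUNS[gender]


if __name__ == "__main__":
    for noun in ("surface", "cylinder", "object"):
        print(noun, "->", translate_it(noun))
```

The same English word “it” yields three different German pronouns, so a wrong guess about the antecedent gives a wrong translation.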

Approaches to MT

Direct Translation

No intermediate representation. Possibly morphological analysis and simple reordering principles
Input: [Japanese text]
After word-by-word translation
  » I give PAST pen on desk John to
After word-order, det rewrite rules
  » I give PAST the pen on the desk to John
After morphology
  » I gave the pen on the desk to John
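
The last two stages can be sketched as a toy pipeline in Python. It starts from the word-by-word gloss on the slide, since the Japanese input is not reproduced here. The word lists and rules are invented for this one sentence and are assumptions, not the rules of any real system.

```python
# Toy direct-translation pipeline, starting from the slide's word-by-word gloss.
# All three tables are hypothetical and cover only this example.
NOUNS = {"pen", "desk"}          # common nouns that need a determiner
POSTPOSITIONS = {"to"}           # words that follow their noun in the gloss
PAST_FORMS = {"give": "gave"}    # irregular past-tense forms


def reorder(tokens):
    """Move each postposition in front of the word that precedes it."""
    out = []
    for tok in tokens:
        if tok in POSTPOSITIONS and out:
            out.insert(len(out) - 1, tok)
        else:
            out.append(tok)
    return out


def add_determiners(tokens):
    """Insert "the" before every known common noun."""
    out = []
    for tok in tokens:
        if tok in NOUNS:
            out.append("the")
        out.append(tok)
    return out


def apply_morphology(tokens):
    """Fold a PAST marker into the verb that precedes it."""
    out = []
    for tok in tokens:
        if tok == "PAST" and out:
            verb = out[-1]
            out[-1] = PAST_FORMS.get(verb, verb + "ed")
        else:
            out.append(tok)
    return out


def direct_translate(gloss):
    tokens = gloss.split()
    tokens = add_determiners(reorder(tokens))
    return " ".join(apply_morphology(tokens))


if __name__ == "__main__":
    print(direct_translate("I give PAST pen on desk John to"))
```

Each stage is a shallow rewrite over the token list with no parse tree, which is why the approach is cheap and also why it is tied to one language pair.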

Direct Translation - Issues

Completely tied to a language pair
  » Complete new system for each pair
Problems dealing with ambiguity: Example (Russian-English)
  » My trebuem mira
  » We require world (direct translation)
  » We want peace (correct translation)
Don’t need complex NLP
  » used in cheap translators
Useful as a “default translation” if more complex techniques fail

Structural Transfer

Three steps
  » parse input text (reusable)
  » rewrite parse tree into parse tree of new language (specific to language pair)
      – English NP -> Det Adj N becomes
      – French NP -> Det N Adj
  » generate output text (reusable)
More in next lecture
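
The middle step can be sketched in Python for the single NP rule above. The tree encoding (nested tuples) and the function names transfer and leaves are assumptions made for this sketch. Parsing and lexical translation are left out, so the output is the reordered English gloss rather than real French.

```python
# A phrase is (label, [children]); a leaf is (label, "word").
def transfer(tree):
    """Rewrite English NP -> Det Adj N as French-style NP -> Det N Adj."""
    label, body = tree
    if isinstance(body, str):          # leaf: nothing to reorder
        return tree
    children = [transfer(child) for child in body]
    child_labels = [child[0] for child in children]
    if label == "NP" and child_labels == ["Det", "Adj", "N"]:
        det, adj, noun = children
        children = [det, noun, adj]
    return (label, children)


def leaves(tree):
    """Read the words off a tree from left to right."""
    _label, body = tree
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]


if __name__ == "__main__":
    english_np = ("NP", [("Det", "a"), ("Adj", "cheap"), ("N", "car")])
    print(" ".join(leaves(transfer(english_np))))
```

Only transfer is specific to the language pair. The parser that would build english_np and the generator that reads off the words could be reused with other languages.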

Structural Transfer - Issues

Most popular approach (?)
  » Used in Systran (Altavista translator)
n*(n-1) transfer components needed for translation between n languages
Good for syntax, less good for words, pragmatics
  » supplement with other techniques, such as statistical translation of individual words?

Interlingua Approach

Two steps
  » full analysis of input text, into a meaning (interlingua)
      – e.g., know into KnowFact or KnowPerson
  » full generation of output text, from meaning
Can’t be done except in a small domain
Preserving ambiguity
  » if target language uses same word for KnowFact or KnowPerson, no need to disambiguate know

Interlingua Approach - Issues

Interlingua must contain all aspects of meaning needed for all the languages (e.g. gender for Spanish cats)

Interlingua must reflect all the different views on how the world is made up (e.g. Japanese “yasai” refers mostly to vegetables, but also mint but not carrots)

For this to work, the domain must be restricted and the languages similar

Translation between n languages only needs n analysis components and n generation components
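
The component counts for the two approaches can be compared with a few lines of Python. The function names are made up for this sketch. The formulas are the ones stated on the slides: n*(n-1) transfer components for structural transfer, and n analysis plus n generation components for an interlingua.

```python
def transfer_components(n):
    """One transfer component per ordered language pair: n*(n-1)."""
    return n * (n - 1)


def interlingua_components(n):
    """One analysis and one generation component per language: n + n."""
    return 2 * n


if __name__ == "__main__":
    for n in (2, 3, 5, 10, 20):
        print(n, transfer_components(n), interlingua_components(n))
```

With 10 languages this gives 90 transfer components against 20 interlingua components, so the interlingua count grows linearly while the transfer count grows quadratically.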

Statistical Approach

Noisy channel model for speech rec: look for Sentence that maximises P(Sig|Sent)*P(Sent)

MT: look for translation Sent that maximises P(Input|Sent)*P(Sent)
  » faithfulness*fluency??
  » P(Sent) - estimated using bigrams/trigrams
  » P(Input|Sent) - estimated by analysing a corpus of human-translated texts
      – e.g., how often is know translated as savoir (know fact) and how often as connaitre (know person)
      – Also model reordering, insertions, deletions
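
The search for the best Sent can be sketched in Python as follows. Every probability in the tables is invented for illustration and not estimated from any corpus. The translation model assumes a word-for-word alignment with no reordering, insertions or deletions, which is a strong simplification, and the candidate list is supplied by hand.

```python
import math

# P(english_word | french_word): made-up "faithfulness" table.
TRANSLATION = {
    ("I", "je"): 0.9,
    ("know", "sais"): 0.9,
    ("know", "connais"): 0.9,
    ("Mary", "Marie"): 0.9,
}

# P(word | previous_word): made-up bigram "fluency" table for French.
BIGRAM = {
    ("<s>", "je"): 0.5,
    ("je", "sais"): 0.1,
    ("je", "connais"): 0.1,
    ("sais", "Marie"): 0.001,
    ("connais", "Marie"): 0.05,
}

UNSEEN = 1e-6  # crude floor for events missing from the tables


def log_p_input_given_sent(input_words, sent_words):
    """log P(Input|Sent), assuming word i aligns with word i."""
    return sum(math.log(TRANSLATION.get((e, f), UNSEEN))
               for e, f in zip(input_words, sent_words))


def log_p_sent(sent_words):
    """log P(Sent) under the bigram model."""
    total, prev = 0.0, "<s>"
    for word in sent_words:
        total += math.log(BIGRAM.get((prev, word), UNSEEN))
        prev = word
    return total


def best_translation(input_sentence, candidates):
    """Pick the candidate maximising P(Input|Sent) * P(Sent)."""
    input_words = input_sentence.split()

    def score(candidate):
        words = candidate.split()
        return log_p_input_given_sent(input_words, words) + log_p_sent(words)

    return max(candidates, key=score)


if __name__ == "__main__":
    print(best_translation("I know Mary",
                           ["je sais Marie", "je connais Marie"]))
```

In this toy setup both candidates are equally faithful, so the language model decides: the bigram table makes “connais Marie” more likely than “sais Marie”, which is how fluency can resolve the savoir/connaitre choice.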

Statistical Approach - Issues

P(Input|Sent)
  » Very hard to model situations where translation reorders material, even if this has a simple syntactic description
  » How “faithful” is a proposed output sentence to the original input text?
  » Less clear what this means once we go beyond translating individual words
  » Combine with direct techniques?

MT Performance

Translating 100 sentences is trivial; the problems are all in the scaling-up.
  » Good dictionaries are key.
Three uses
  » Fully automatic rough translation
      – like Altavista/Systran Babelfish
  » Draft translations which a human post-edits (humans can post-edit quickly as long as less than 20% of words need to be changed)
  » Tools for translators (MAHT)
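
One way to make the 20% post-editing figure concrete is to count word-level edits between the MT draft and the post-edited text. The Python sketch below uses word-level edit distance divided by the length of the final text. That particular measure, and the names word_edit_distance and post_edit_rate, are assumptions for this sketch, since the slide does not say how the changed words are counted.

```python
def word_edit_distance(a, b):
    """Minimum number of word insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        cur = [i]
        for j, word_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # delete word_a
                           cur[j - 1] + 1,                     # insert word_b
                           prev[j - 1] + (word_a != word_b)))  # substitute
        prev = cur
    return prev[-1]


def post_edit_rate(draft, final):
    """Fraction of words changed, relative to the post-edited text."""
    draft_words, final_words = draft.split(), final.split()
    return word_edit_distance(draft_words, final_words) / max(len(final_words), 1)


if __name__ == "__main__":
    rate = post_edit_rate("We require world", "We want peace")
    print(f"{rate:.0%} of words changed; quick to post-edit: {rate < 0.20}")
```

For the earlier Russian example, two of the three words change, so that draft would be far above the 20% level.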

Another approach to HAMT: Controlled Languages

A controlled (simplified, basic) English is a subset of full English.
  » Limited vocabulary: repair but not fix
  » Limited syntax: I ate but not I have eaten
Mainly used for technical documents
Originally intended to make manuals easier for non-native speakers
MT works much better if input is Controlled English

AECMA Simplified English

(Emerging) standard for commercial aerospace industry.
Designed by academic linguists as well as practitioners (technical authors).

AECMA: vocabulary

Fixed vocabulary (2000 words?) with additions limited to specific areas (eg, company names).

Goal is “each word means only one thing”, and “each concept is expressed by only one word”. No ambiguity, no synonyms.
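
A fixed vocabulary is easy to check mechanically, as the Python sketch below suggests. The word lists are tiny hypothetical stand-ins and not the real AECMA dictionary. The repair/fix pair is borrowed from the controlled-English example elsewhere in these slides and is used here only as an illustration.

```python
import re

# Hypothetical miniature vocabulary; the real list is far larger.
APPROVED = {"repair", "the", "pump", "do", "a", "test", "on", "circuit"}

# Unapproved word -> approved word expressing the same concept.
PREFERRED = {"fix": "repair"}


def check_vocabulary(sentence):
    """Return one message per word that is outside the approved vocabulary."""
    issues = []
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in APPROVED:
            continue
        if word in PREFERRED:
            issues.append(f"'{word}' is not approved; use '{PREFERRED[word]}'")
        else:
            issues.append(f"'{word}' is not in the approved vocabulary")
    return issues


if __name__ == "__main__":
    print(check_vocabulary("Repair the pump"))
    print(check_vocabulary("Fix the pump"))
```

A word-list check like this covers “one word per concept” but not “one meaning per word”. Restrictions on how an approved word may be used need at least part-of-speech information.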

Example words

Above: only use to indicate physical position
  » Legal: The wing is above the wheel
  » Illegal: The engine temperature is above normal
  » Legal: The engine temperature is more than normal
Test: use as noun only
  » Legal: the system test
  » Illegal: Test the circuit.
  » Legal: Do a test on the circuit.

AECMA: Syntax

Rule: Forbid “unusual” English syntax
Ex: only simple past, present, future tenses
  » Illegal: Any other information is to be ignored
  » Legal: Ignore any other information
Ex: No gerunds
  » Illegal: Changing the light is dangerous.
  » Legal: It is dangerous to change the light.

AECMA: Syntax Examples (2)

Only two noun-noun modifiers
  » Illegal: The aircraft door attachment bolt
  » Legal: The attachment bolt of the aircraft door
Verbs and det. must be included
  » Illegal: Rotary switch to INPUT
  » Legal: Set the rotary switch to INPUT

AECMA: Stylistic Rules

Sentences should be 20 words or fewer
Paragraphs should be 6 sentences or fewer
Start warnings with a command
  » Illegal: The oil used in the engine contains toxic additives which may be absorbed through the skin.
  » Legal: Do not get the oil on your skin. It is poisonous.
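
The two length limits lend themselves to an automatic check, sketched below in Python. The sentence splitter is deliberately naive (it splits after ., ! or ? followed by whitespace), and the function names are invented for this sketch. The “start warnings with a command” rule is not covered, because it needs syntactic analysis rather than counting.

```python
import re

MAX_WORDS_PER_SENTENCE = 20
MAX_SENTENCES_PER_PARAGRAPH = 6


def split_sentences(paragraph):
    """Naive sentence splitter; abbreviations and numbers will confuse it."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def check_paragraph(paragraph):
    """Return a list of violations of the two length rules."""
    problems = []
    sentences = split_sentences(paragraph)
    if len(sentences) > MAX_SENTENCES_PER_PARAGRAPH:
        problems.append(f"paragraph has {len(sentences)} sentences")
    for sentence in sentences:
        word_count = len(sentence.split())
        if word_count > MAX_WORDS_PER_SENTENCE:
            problems.append(f"sentence has {word_count} words: {sentence}")
    return problems


if __name__ == "__main__":
    print(check_paragraph("Do not get the oil on your skin. It is poisonous."))
```

The illegal warning above has only 16 words, so it passes both length checks. It is illegal because it does not start with a command, which this counting-based sketch cannot detect.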

Controlled-Language MT

Much easier
  » No problems disambiguating words
  » Hard syntax is forbidden
  » May also prohibit/restrict pronouns
Authors must write in CE
  » CE conformance checkers
Lot of commercial interest