Development of a German- English Translator Felix Zhang TJHSST Computer Systems Research Lab 2007-2008 Period 5

Development of a German-English Translator

Felix Zhang

TJHSST Computer Systems Research Lab 2007-2008

Period 5

Summary of previous quarters

• Developed a functional translator for simple German sentences to simple English sentences– All rule-based, no statistical methods, yet– Very specifically geared towards German –

language dependent

Scope for 4th quarter

• Statistical methods– Part-of-speech tagging– Morphological analysis– Information available in TIGER corpus

• Accuracy testing– Reliability of statistical methods– Vs. Rule-based

Finishing Rule-Based Translation

• Convert translation into user-readable format• Punctuation and capitalizationfzhang@ltsp1 ~/research $ python proj.pyPart of speech tags: [['den', 'art'], ['kurzen', 'adj'], ['Mann', 'nou'], ['machen', 'ver'], ['die', 'art'], ['kleinen', 'adj'], ['Kinder', 'nou']]Morphological analysis: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['1', 'pl'], ['3',

'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl'], ['akk', 'pl']]]]Disambiguated after noun-verb agreement: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen',

'ver'], [['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl']]]]Lemmatized: [['kurzen', ['kurz']], ['Mann', ['Mann', 'Man']], ['machen', ['machen']], ['kleinen', ['klein']], ['Kinder', ['Kind']]]Root translated: [['den', 'the'], ['kurzen', 'short'], ['Mann', 'man'], ['machen', 'make'], ['die', 'the'], ['kleinen', 'small'], ['Kinder', 'child']]NP Chunked English: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]]], ['make', 'ver', [['3', 'pl'], 'pres']], [['the', 'art'], ['small',

'adj'], ['child', 'nou', [['nom', 'pl']]]]]Assigned an element type:[[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], [['the', 'art'], ['small', 'adj'],

['child', 'nou', [['nom', 'pl']]], 'sub']]Assigned priority:[['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['1', ['the', 'art'], ['small',

'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']]Rearranged to English structure:[['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['5', ['the', 'art'], ['short', 'adj'], ['man',

'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj']]Inflected:The small childs make the short man.

Statistical Methods

• Trivial methods– Part of speech tagging, morphological analysis– Tags based on most frequently occurring tag

with the word in corpus– Theoretically, should still achieve reasonable

levels of accuracy

Testing

• Check all 746,660 words in TIGER corpus– Problem: Running time

• Match: Actual part of speech / linguistic properties of the word in the corpus == part of speech, etc. that program assigns based on maximum likelihood

• Predictions from research: 90% accuracy for part of speech tags

Accuracy – Part of speech

• Reaches about 90% percent accuracy, as predictedMatch NE NE Nato656882Match ADV ADV allein656883Match VAFIN VAFIN sein656884Match PIAT PIAT kein656885Match NN NN Zukunftskonzept656886Match APPR APPR für656887Match ART ART der656888Match NN NN Sicherheit656889Match APPR APPR in656890Match NE NE Europa656891Total matches: 656891Total words: 746660Accuracy: 87.9772587255 %

Accuracy – Morphological Analysis

• Lower accuracy – Properties are more context-dependent

Match -- -- allein550403Match 3.Sg.Pres.Ind 3.Sg.Pres.Ind ist550404Match Nom.Sg.Neut Nom.Sg.Neut Zukunftskonzept550405Match -- -- für550406Match Acc.Sg.Fem Acc.Sg.Fem die550407Match Acc.Sg.Fem Acc.Sg.Fem Sicherheit550408Match -- -- in550409Match Dat.Sg.Neut Dat.Sg.Neut Europa550410Total matches: 550410Total words: 746660Accuracy: 73.7162831811 %

Problems

• No good way to compare effectiveness of rule-based vs. statistical translations

• Accuracy not high enough– 10% error too large in a 700,000+ word corpus– 27% even worse

• Slow run times

Future Research

• Combine statistical and rule-based methods together in hybrid program

• Find way to compare accuracy of two techniques

• More advanced (and more accurate) statistical methods – Context-based

Documents

Development of a German- English Translator Felix Zhang TJHSST Computer Systems Research Lab 2007-2008 Period 5