
  • A Trainable Transfer-based MT Approach for Languages with Limited Resources

    Alon Lavie
    Language Technologies Institute, Carnegie Mellon University

    Joint work with: Lori Levin, Jaime Carbonell, Katharina Probst, Erik Peterson, Stephan Vogel and Ariadna Font-Llitjós

    EAMT Meeting/ Malta

  • Why Machine Translation for Languages with Limited Resources?

    - Commercial MT is economically feasible for only a handful of major languages with large resources (corpora, human developers)
    - Statistical MT looks promising, but requires very large volumes of parallel text
    - Is there hope for MT for languages with limited electronic data resources?
    - Benefits include:
      - Better government information access for indigenous communities
      - Better participation of indigenous communities in information-rich activities (health care, education, government) without giving up their languages
      - Civilian and military applications (disaster relief)
      - Language preservation


  • MT for Languages with Limited Resources: Challenges

    - Minimal amounts of parallel text
    - Possible lack of standards for orthography/spelling
    - Often relatively few trained linguists
    - Access to native bilingual informants is possible
    - No real economic incentive; limited financial resources for developing MT
    - Need to minimize development time and cost


  • AVENUE Partners


  • AVENUE: Two Technical Approaches

    Generalized EBMT:
    - Parallel text of 50K-2MB (uncontrolled corpus)
    - Rapid implementation
    - Proven for major languages with reduced data

    Transfer-rule learning:
    - Elicitation (controlled) corpus to extract grammatical properties
    - Seeded version-space learning


  • Transfer with Strong Decoding (architecture diagram: SL input → TL output)


  • Learning Transfer Rules for Languages with Limited Resources

    Rationale:
    - Large bilingual corpora are not available
    - Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
    - The elicitation corpus is designed to be typologically comprehensive and compositional
    - The transfer-rule engine and a new learning approach support acquisition of generalized transfer rules from the data


  • English-Hindi Example


  • Spanish-Mapudungun Example


  • English-Arabic Example


  • The Elicitation Corpus

    - Translated and aligned by a bilingual informant
    - Rich information about the sentences elicited
    - Corpus consists of linguistically diverse constructions
    - Based on the elicitation and documentation work of field linguists (e.g. Comrie 1977, Bouquiaux 1992)
    - Organized compositionally: elicit simple structures first, then use them as building blocks
    - Goal: minimize size, maximize linguistic coverage
    - The typological elicitation corpus currently contains about 1000 sentences

    Work in progress:
    - Feature detection
    - Navigation control through the corpus during elicitation
    - Extensions to phenomena not currently covered
    - Experimenting with alternative types of elicited data
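A translated and word-aligned elicitation entry of the kind described above can be sketched as a small data structure; the class and field names here are illustrative, not the AVENUE tool's actual format:

```python
# Hypothetical sketch of one word-aligned elicitation-corpus entry;
# field names are illustrative stand-ins, not AVENUE's real schema.
from dataclasses import dataclass

@dataclass
class ElicitationExample:
    source: list          # source-language tokens
    target: list          # informant's translation, tokenized
    alignment: set        # (source_idx, target_idx) word links, 0-based

# SL: the old man, TL: ha-ish ha-zaqen
ex = ElicitationExample(
    source=["the", "old", "man"],
    target=["ha-ish", "ha-zaqen"],
    alignment={(0, 0), (0, 1), (1, 1), (2, 0)},
)

# Sanity check: every alignment link points at real tokens.
assert all(i < len(ex.source) and j < len(ex.target)
           for i, j in ex.alignment)
```

Because the corpus is organized compositionally, later (more complex) entries can reuse the structures elicited by earlier, simpler ones.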


  • Transfer Rule Formalism

    A rule specifies:
    - Type information
    - Part-of-speech/constituent information
    - Alignments
    - x-side constraints
    - y-side constraints
    - xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

    Example (SL: the old man, TL: ha-ish ha-zaqen):

    NP::NP [DET ADJ N] -> [DET N DET ADJ]
    ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
     ((X1 AGR) = *3-SING)
     ((X1 DEF) = *DEF)
     ((X3 AGR) = *3-SING)
     ((X3 COUNT) = +)
     ((Y1 DEF) = *DEF)
     ((Y3 DEF) = *DEF)
     ((Y2 AGR) = *3-SING)
     ((Y2 GENDER) = (Y4 GENDER)))
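The constraint-checking idea behind this formalism can be sketched as a small checker; the dictionary encoding and function name below are illustrative, not the actual transfer-engine API:

```python
# Illustrative sketch (not the AVENUE engine): a transfer rule as
# constituent alignments plus value constraints and agreement
# constraints over feature structures.
rule = {
    "lhs": ["DET", "ADJ", "N"],          # x-side: the old man
    "rhs": ["DET", "N", "DET", "ADJ"],   # y-side: ha-ish ha-zaqen
    "align": [(1, 1), (1, 3), (2, 4), (3, 2)],   # Xi::Yj, 1-based
    "value": {("y", 1, "def"): "+", ("y", 3, "def"): "+",
              ("y", 2, "agr"): "3-sing"},
    "agree": [(("y", 2, "gender"), ("y", 4, "gender"))],
}

def satisfies(rule, x_feats, y_feats):
    """Check value and agreement constraints.
    x_feats/y_feats map (1-based position, feature) -> value."""
    feats = {"x": x_feats, "y": y_feats}
    for (side, pos, f), want in rule["value"].items():
        if feats[side].get((pos, f)) != want:
            return False
    for (s1, p1, f1), (s2, p2, f2) in rule["agree"]:
        if feats[s1].get((p1, f1)) != feats[s2].get((p2, f2)):
            return False
    return True

y = {(1, "def"): "+", (3, "def"): "+", (2, "agr"): "3-sing",
     (2, "gender"): "masc", (4, "gender"): "masc"}
assert satisfies(rule, {}, y)
```

A y-side with mismatched genders on positions 2 and 4 would fail the agreement constraint, which is exactly how the formalism enforces noun-adjective agreement in the Hebrew output.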


  • Transfer Rule Formalism (II)

    The same rule, distinguishing:
    - Value constraints, e.g. ((X1 AGR) = *3-SING)
    - Agreement constraints, e.g. ((Y2 GENDER) = (Y4 GENDER))

    Example (SL: the old man, TL: ha-ish ha-zaqen):

    NP::NP [DET ADJ N] -> [DET N DET ADJ]
    ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
     ((X1 AGR) = *3-SING)
     ((X1 DEF) = *DEF)
     ((X3 AGR) = *3-SING)
     ((X3 COUNT) = +)
     ((Y1 DEF) = *DEF)
     ((Y3 DEF) = *DEF)
     ((Y2 AGR) = *3-SING)
     ((Y2 GENDER) = (Y4 GENDER)))


  • The Transfer Engine

    Analysis: source text is parsed into its grammatical structure, which determines transfer application ordering.
    Example: "he read book" parses as [S [NP [N he]] [VP [V read] [NP [N book]]]]

    Transfer: a target-language tree is created by reordering, insertion, and deletion. The article "a" is inserted into the object NP, and source words are translated with the transfer lexicon, yielding [S [NP he] [VP [V read] [NP [DET a] [N book]]]]

    Generation: target-language constraints are checked and the final translation is produced. E.g. "reads" is chosen over "read" to agree with "he".

    Final translation: "He reads a book"
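The three engine stages above can be illustrated with a toy end-to-end function; the grammar and lexicon behavior here are hard-coded stand-ins, not the real engine:

```python
# Toy sketch of the analysis / transfer / generation stages described
# above; the "grammar" is a hard-coded illustration, not the real engine.
def transfer(tokens):
    # Analysis: treat token 0 as subject NP, token 1 as V, rest as object NP.
    subj, verb, obj = tokens[0], tokens[1], tokens[2:]
    # Transfer: insert an article into a bare object NP.
    if obj and obj[0] not in ("a", "the"):
        obj = ["a"] + obj
    # Generation: enforce 3rd-person-singular subject-verb agreement.
    if subj in ("he", "she", "it") and not verb.endswith("s"):
        verb += "s"
    out = [subj, verb] + obj
    out[0] = out[0].capitalize()
    return " ".join(out)

assert transfer(["he", "read", "book"]) == "He reads a book"
```

The point of the sketch is the division of labor: structure is fixed in transfer, while agreement is resolved only at generation time.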


  • Rule Learning: Overview

    Goal: acquire syntactic transfer rules
    Use available knowledge from the source side (grammatical structure)
    Three steps:
    - Flat seed generation: first guesses at transfer rules; flat syntactic structure
    - Compositionality: use previously learned rules to add hierarchical structure
    - Seeded version-space learning: refine rules by learning appropriate feature constraints


  • Flat Seed Rule Generation

    Learning Example: NP

    Eng: the big apple

    Heb: ha-tapuax ha-gadol

    Generated Seed Rule:
    NP::NP [ART ADJ N] -> [ART N ART ADJ]
    ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
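Flat seed generation can be sketched as a direct mapping from POS-tagged, word-aligned example pairs to a rule string; the function below is an illustrative reconstruction, not the actual learner:

```python
# Sketch of flat seed rule generation: given POS-tagged source/target
# sequences and word alignments, emit a flat rule with constituent
# alignments (illustrative, not the actual AVENUE learner).
def flat_seed_rule(src_pos, tgt_pos, word_align, label="NP"):
    aligns = sorted((i + 1, j + 1) for i, j in word_align)  # 1-based Xi::Yj
    lhs = " ".join(src_pos)
    rhs = " ".join(tgt_pos)
    body = " ".join(f"(X{i}::Y{j})" for i, j in aligns)
    return f"{label}::{label} [{lhs}] -> [{rhs}] ({body})"

# Eng: the big apple / Heb: ha-tapuax ha-gadol
rule = flat_seed_rule(["ART", "ADJ", "N"], ["ART", "N", "ART", "ADJ"],
                      {(0, 0), (0, 2), (1, 3), (2, 1)})
assert rule == ("NP::NP [ART ADJ N] -> [ART N ART ADJ] "
                "((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))")
```

Note how the English ART aligns to both Hebrew definite articles (X1::Y1 and X1::Y3), exactly as in the seed rule on the slide.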


  • Compositionality

    Initial Flat Rules:

    S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
    ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))

    NP::NP [ART ADJ N] -> [ART N ART ADJ]
    ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

    NP::NP [ART N] -> [ART N]
    ((X1::Y1) (X2::Y2))

    Generated Compositional Rule:

    S::S [NP V NP] -> [NP V P NP]
    ((X1::Y1) (X2::Y2) (X3::Y4))
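The structural part of this step can be sketched as substituting previously learned constituent sequences into the flat rule; this toy version handles only the constituent sequence (the real step must also remap the alignments), and is an illustration rather than the actual algorithm:

```python
# Sketch: compositionality replaces spans of a flat rule's constituent
# sequence that match a previously learned rule with that rule's label.
# The real learner must also adjust the Xi::Yj alignments accordingly.
def compose(flat_seq, learned):
    """Greedily substitute learned constituent sequences, longest first."""
    out, i = [], 0
    rules = sorted(learned, key=lambda r: -len(r[1]))
    while i < len(flat_seq):
        for label, seq in rules:
            if flat_seq[i:i + len(seq)] == seq:
                out.append(label)
                i += len(seq)
                break
        else:
            out.append(flat_seq[i])
            i += 1
    return out

np_rules = [("NP", ["ART", "ADJ", "N"]), ("NP", ["ART", "N"])]
assert compose(["ART", "ADJ", "N", "V", "ART", "N"], np_rules) == \
       ["NP", "V", "NP"]
```

This reproduces the slide's result: the flat [ART ADJ N V ART N] sequence collapses to the compositional [NP V NP].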


  • Compositionality: Overview

    - Traverse the c-structure of the English sentence and add compositional structure for translatable chunks
    - Adjust the constituent sequences and alignments in the transfer rule


  • Seeded Version Space Learning

    Input: rules and their example sets

    S::S [NP V NP] -> [NP V P NP]  {ex1,ex12,ex17,ex26}
    ((X1::Y1) (X2::Y2) (X3::Y4))

    NP::NP [ART ADJ N] -> [ART N ART ADJ]  {ex2,ex3,ex13}
    ((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

    NP::NP [ART N] -> [ART N]  {ex4,ex5,ex6,ex8,ex10,ex11}
    ((X1::Y1) (X2::Y2))

    Output: rules with feature constraints

    S::S [NP V NP] -> [NP V P NP]
    ((X1::Y1) (X2::Y2) (X3::Y4)
     ((X1 NUM) = (X2 NUM))
     ((Y1 NUM) = (Y2 NUM))
     ((X1 NUM) = (Y1 NUM)))
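One way to picture the move from the specific boundary toward the general boundary is constraint intersection over a cluster's examples; the sketch below keeps only feature constraints that hold in every example. It is a simplification (the actual learner also generalizes matching values into agreement constraints like ((X1 NUM) = (Y1 NUM))):

```python
# Sketch of one generalization step in seeded version-space learning:
# keep only the (feature, value) constraints that hold in every example
# of a cluster. The seed rules form the specific boundary; the
# constraint-free rule is the general boundary. Simplified: the real
# learner also turns co-varying values into agreement constraints.
def generalize(example_feats):
    """Intersect (feature, value) constraints across all examples."""
    kept = set(example_feats[0].items())
    for feats in example_feats[1:]:
        kept &= set(feats.items())
    return dict(kept)

cluster = [
    {("X1", "NUM"): "sg", ("Y1", "NUM"): "sg", ("X1", "DEF"): "+"},
    {("X1", "NUM"): "sg", ("Y1", "NUM"): "sg", ("X1", "DEF"): "-"},
]
# DEF varies across examples, so it is dropped; NUM survives.
assert generalize(cluster) == {("X1", "NUM"): "sg", ("Y1", "NUM"): "sg"}
```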


  • Seeded Version Space Learning: Overview

    Goal: add appropriate feature constraints to the acquired rules
    Methodology:
    - Preserve the general structural transfer
    - Learn specific feature constraints from the example set
    - Seed rules are grouped into clusters of similar transfer structure (type, constituent sequences, alignments)
    - Each cluster forms a version space: a partially ordered hypothesis space with a specific and a general boundary
    - The seed rules in a group form the specific boundary of the version space
    - The general boundary is the (implicit) transfer rule with the same type, constituent sequences, and alignments, but no feature constraints


  • Examples of Automatically Learned Rules (Hindi-to-English)

    {NP,14244} ;;Score: 0.0429
    NP::NP [N] -> [DET N]
    ((X1::Y2))

    {NP,14434} ;;Score: 0.0040
    NP::NP [ADJ CONJ ADJ N] -> [ADJ CONJ ADJ N]
    ((X1::Y1) (X2::Y2) (X3::Y3) (X4::Y4))

    {PP,4894} ;;Score: 0.0470
    PP::PP [NP POSTP] -> [PREP NP]
    ((X2::Y1) (X1::Y2))


  • Manual Transfer Rules: Hindi Example

    ;; PASSIVE OF SIMPLE PAST (NO AUX) WITH LIGHT VERB
    ;; passive of 43 (7b)
    {VP,28}
    VP::VP : [V V V] -> [Aux V]
    (
     (X1::Y2)
     ((x1 form) = root)
     ((x2 type) =c light)
     ((x2 form) = part)
     ((x2 aspect) = perf)
     ((x3 lexwx) = 'jAnA')
     ((x3 form) = part)
     ((x3 aspect) = perf)
     (x0 = x1)
     ((y1 lex) = be)
     ((y1 tense) = past)
     ((y1 agr num) = (x3 agr num))
     ((y1 agr pers) = (x3 agr pers))
     ((y2 form) = part)
    )


  • Manual Transfer Rules: Example

    ; NP1 ke NP2 -> NP2 of NP1
    ; Ex: jIvana ke eka aXyAya
    ;     life of (one) chapter
    ;     ==> a chapter of life
    {NP,12}
    NP::NP : [PP NP1] -> [NP1 PP]
    (
     (X1::Y2)
     (X2::Y1)
     ; ((x2 lexwx) = 'kA')
    )

    {NP,13}
    NP::NP : [NP1] -> [NP1]
    ( (X1::Y1) )

    {PP,12}
    PP::PP : [NP Postp] -> [Prep NP]
    (
     (X1::Y2)
     (X2::Y1)
    )

    (Slide also shows the source and target parse trees for "jIvana ke eka aXyAya" → "one chapter of life")


  • A Limited Data Scenario for Hindi-to-English

    Conducted during a DARPA Surprise Language Exercise (SLE) in June 2003
    We put together a scenario with miserly data resources:
    - Elicited data corpus: 17,589 phrases
    - Cleaned portion (top 12%) of the LDC dictionary: ~2,725 Hindi words (23,612 translation pairs)
    - Manually acquired resources during the SLE:
      - 500 manual bigram translations
      - 72 manually written phrase transfer rules
      - 105 manually written postposition rules
      - 48 manually written time expression rules
    - No additional parallel text!


  • Manual Grammar Development

    - Covers mostly NPs, PPs and VPs (verb complexes)
    - ~70 grammar rules, covering basic and recursive NPs and PPs, and verb complexes of the main tenses in Hindi (developed in two weeks)


  • Adding a Strong Decoder

    - The XFER system produces a full lattice of translation fragments, ranging from single words to long phrases or sentences
    - Edges are scored using word-to-word translation probabilities, trained from the limited bilingual data
    - The decoder uses an English LM (70M words)
    - The decoder can also reorder words or phrases (up to 4 positions ahead)
    - For XFER (strong), ONLY edges from the basic XFER system are used
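The core of decoding over such a lattice can be sketched as a best-path search over scored edges spanning source positions; this toy version uses only the edge scores (the real decoder adds an LM score and limited reordering), and the function and data are illustrative:

```python
# Toy sketch of decoding over a translation lattice: edges span source
# positions [start, end) and carry a log-probability; we pick the
# best-scoring path covering the whole input. The actual strong decoder
# additionally applies an English LM and limited reordering.
def best_path(n, edges):
    """edges: list of (start, end, phrase, logprob) tuples."""
    best = {0: (0.0, [])}     # best[i] = (score, phrases) covering [0, i)
    for i in range(n):
        if i not in best:
            continue
        score, phrases = best[i]
        for start, end, phrase, lp in edges:
            if start == i:
                cand = (score + lp, phrases + [phrase])
                if end not in best or cand[0] > best[end][0]:
                    best[end] = cand
    return best[n]

edges = [
    (0, 1, "he", -0.1), (1, 2, "read", -0.5), (1, 2, "reads", -0.3),
    (2, 3, "book", -0.4), (0, 3, "he reads a book", -0.6),
]
score, words = best_path(3, edges)
assert words == ["he reads a book"]
```

Here a single long XFER fragment (-0.6) beats the best word-by-word path (-0.8), illustrating why longer transfer fragments help.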


  • Testing Conditions

    Tested on a section of the JHU-provided data: 258 sentences with four reference translations
    - SMT system (stand-alone)
    - EBMT system (stand-alone)
    - XFER system (naïve decoding)
    - XFER system with strong decoder:
      - No grammar rules (baseline)
      - Manually developed grammar rules
      - Automatically learned grammar rules
    - XFER+SMT with strong decoder (MEMT)


  • Automatic MT Evaluation Metrics

    - Intended to replace or complement human assessment of the quality of MT-produced translations
    - Principal idea: compare how similar the MT-produced translation is to human reference translation(s) of the same input
    - Main metric in use today: IBM's BLEU
      - Count n-gram (unigram, bigram, trigram, etc.) overlap between the MT output and several reference translations
      - Calculate a combined n-gram precision score
    - The NIST variant of BLEU is used for official DARPA evaluations
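The n-gram overlap idea can be sketched in a few lines; this is only the modified n-gram precision core, without BLEU's brevity penalty, geometric combination across n, or clipping against multiple references:

```python
# Minimal sketch of the clipped n-gram precision behind BLEU (single
# reference; no brevity penalty or combination across n-gram orders).
from collections import Counter

def ngram_precision(cand, ref, n):
    c = Counter(zip(*[cand[i:] for i in range(n)]))
    r = Counter(zip(*[ref[i:] for i in range(n)]))
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(cnt, r[g]) for g, cnt in c.items())
    return overlap / max(sum(c.values()), 1)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
assert ngram_precision(cand, ref, 1) == 5 / 6   # only "sat" is unmatched
```

BLEU then combines such precisions for n = 1..4 and applies a brevity penalty; the NIST variant reweights n-grams by informativeness.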


  • Results on JHU Test Set

    System                          BLEU    M-BLEU   NIST
    EBMT                            0.058   0.165    4.22
    SMT                             0.093   0.191    4.64
    XFER (naïve), man. grammar      0.055   0.177    4.46
    XFER (strong), no grammar       0.109   0.224    5.29
    XFER (strong), learned grammar  0.116   0.231    5.37
    XFER (strong), man. grammar     0.135   0.243    5.59
    XFER+SMT                        0.136   0.243    5.65


  • Effect of Reordering in the Decoder


  • Observations and Lessons (I)

    - XFER with strong decoder outperformed SMT even without any grammar rules in the miserly data scenario
      - SMT was trained on elicited phrases that are very short
      - SMT had insufficient data to train more discriminative translation probabilities
    - XFER takes advantage of morphology
      - Token coverage without morphology: 0.6989
      - Token coverage with morphology: 0.7892
    - The manual grammar is currently somewhat better than the automatically learned grammar
      - The learned rules did not yet use version-space learning
      - Large room for improvement in rule learning
      - Importance of effective, well-founded scoring of learned rules
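The token-coverage figures above measure the fraction of test tokens for which the lexicon offers at least one translation; a sketch of that measure, with a hypothetical stand-in for the morphological analyzer:

```python
# Sketch of the token-coverage measure: fraction of tokens covered by
# the translation lexicon, optionally backing off through a
# morphological analyzer. The analyzer here is a hypothetical stand-in.
def token_coverage(tokens, lexicon, analyze=None):
    covered = 0
    for t in tokens:
        if t in lexicon or (analyze and analyze(t) in lexicon):
            covered += 1
    return covered / len(tokens)

lex = {"kitab"}                      # toy lexicon: "book"
toks = ["kitab", "kitabein", "ghar"]
assert token_coverage(toks, lex) == 1 / 3
# A crude suffix-stripping "analyzer" recovers the stem of "kitabein".
assert token_coverage(toks, lex, lambda w: w.rstrip("ein")) == 2 / 3
```

Morphological analysis raises coverage in the same way here as in the Hindi system: inflected forms map back to lexicon stems.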


  • Observations and Lessons (II)

    - MEMT (XFER and SMT) based on the strong decoder produced the best results in the miserly scenario
    - Reordering within the decoder provided very significant score improvements
      - Much room for more sophisticated grammar rules
      - The strong decoder can carry some of the reordering burden


  • XFER MT for Hebrew-to-English

    A two-month intensive effort to apply our XFER approach to the development of a Hebrew-to-English MT system
    Challenges:
    - No large parallel corpus
    - Limited-coverage translation lexicon
    - Rich morphology: only an incomplete analyzer available
    Plan:
    - Collect available resources, establish a methodology for processing Hebrew input
    - Translate and align the elicitation corpus
    - Learn XFER rules
    - Develop a (small) manual XFER grammar as a point of comparison
    - System debugging and development
    - Evaluate performance on unseen test data using automatic evaluation metrics


  • Hebrew-to-English XFER System

    Accomplished:
    - Baseline system in place
    - Good lexical coverage: 24,634 translation pairs
    - Reasonable morphological coverage
    - Small manual grammar: 29 rules, mostly NPs
    - Translated and aligned elicitation corpora
    - Learning of an automatic grammar
    - Testing and development on the dev-test set in progress
    - Results on unseen data expected within a couple of weeks

    Translation example (raw system output):
    "in agreement with the interior ministry that copy fund will come to Haaretz agreed hotel homes to do all efforts to remove the african employees from Israel within days from the arrival of the new workers and to let people activities immigration police"


  • Conclusions

    - Transfer rules (both manual and learned) offer significant contributions that can outperform existing data-driven approaches
      - Also in medium and large data settings?
    - Initial steps toward the development of a well-grounded transfer-based MT system with:
      - Translation segments scored based on a well-founded probability model
      - Strong and effective decoding that incorporates the most advanced techniques used in SMT decoding
    - We work from the opposite end from research on incorporating models of syntax into standard SMT systems [Knight et al.]
    - Our direction makes sense in the limited-data scenario


  • Future Directions

    - Continued work on automatic rule learning (especially seeded version-space learning)
    - Use the Hebrew and Hindi systems as test platforms for experimenting with advanced learning research
    - Correcting and refining transfer rules by interaction with native bilingual speakers
    - Developing a well-founded model for assigning scores (probabilities) to transfer rules
    - Improving the strong decoder to better fit the specific characteristics of the XFER model
    - Further improved MEMT with:
      - Combination of output from different translation engines with different scorings
      - Strong decoding capabilities
