CS460/626: Natural Language Processing/Speech, NLP and the Web

(Lecture 18: Alignment in SMT and Tutorial on Giza++ and Moses)

Pushpak Bhattacharyya
CSE Dept., IIT Bombay

15th Feb, 2011

Going forward from word alignment

Word alignment -> Phrase alignment (going to bigger units of correspondence) -> Decoding (best possible translation)

Abstract Problem

Given: $e_0 e_1 e_2 e_3 \ldots e_n e_{n+1}$ (Entities)

Goal: $l_0 l_1 l_2 l_3 \ldots l_n l_{n+1}$ (Labels)

The goal is to find the best possible label sequence:

$$L^* = \arg\max_{L} P(L \mid E)$$

Generative Model

$$\arg\max_{L} P(L \mid E) = \arg\max_{L} P(L)\, P(E \mid L)$$

Simplification

Using the Markov assumption, the language model can be represented using bigrams:

$$P(L) = \prod_{i=0}^{n} P(l_{i+1} \mid l_i)$$

Similarly, the translation model can also be represented in the following way:

$$P(E \mid L) = \prod_{i=0}^{n} P(e_i \mid l_i)$$
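The two products above can be evaluated for one candidate label sequence as in the following minimal sketch. The function name, the dictionary layout of the probability tables, and the toy numbers are assumptions made for illustration; they do not come from the lecture. Sums of logs are used in place of raw products to avoid underflow on long sequences.

```python
import math

def sequence_log_prob(labels, entities, bigram_p, emit_p):
    """Return log P(L) + log P(E|L) under the bigram simplification.

    labels   : [l_0, ..., l_{n+1}]
    entities : [e_0, ..., e_{n+1}]
    bigram_p : dict mapping (l_i, l_{i+1}) -> P(l_{i+1} | l_i)   (assumed layout)
    emit_p   : dict mapping (e_i, l_i)     -> P(e_i | l_i)       (assumed layout)
    """
    n = len(labels) - 2
    # Language model: product over i = 0..n of P(l_{i+1} | l_i)
    lm = sum(math.log(bigram_p[(labels[i], labels[i + 1])]) for i in range(n + 1))
    # Translation model: product over i = 0..n of P(e_i | l_i)
    tm = sum(math.log(emit_p[(entities[i], labels[i])]) for i in range(n + 1))
    return lm + tm

# Toy tables (made-up values)
labels = ["^", "a", "."]
entities = ["^", "x", "."]
bigram_p = {("^", "a"): 0.5, ("a", "."): 0.5}
emit_p = {("^", "^"): 1.0, ("x", "a"): 0.5}
score = sequence_log_prob(labels, entities, bigram_p, emit_p)
```

Here the score is the log of 0.5 x 0.5 x 1.0 x 0.5; the best label sequence is the one that maximises this quantity.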

Statistical Machine Translation

Finding the best possible English sentence given the foreign sentence:

$$E^* = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(E)\, P(F \mid E)$$

P(E) = Language Model
P(F|E) = Translation Model
E: English, F: Foreign Language

Problems in the framework

- Labels are words of the target language
- Very large in number

Example (preposition stranding):
Who do you want to_go with ?
With whom do you want to go ?
आप किस के_साथ जाना चाहते_हो (Aap kis ke_sath jaana chahate_ho)

Column of words of target language on the source language words

[Trellis: ^ Aap kis ke_sath jaana chahate_ho . with the same column of candidate target words (who, do, you, want, to_go, with, and so on) stacked under each source word]

Find the best possible path from '^' to '.' using transition and observation probabilities.

Viterbi can be used
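The path search over the trellis can be sketched as below. This is a generic Viterbi implementation under stated assumptions: the three-word state set, the table layout, and every probability value are invented for illustration, and a small floor replaces zero probabilities so the logs stay finite.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-9):
    """Return the most probable state (target word) sequence for obs."""
    def lp(p):
        return math.log(max(p, floor))

    # best[t][s] = (log-prob of the best path ending in state s at time t, back-pointer)
    best = [{s: (lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(obs[0], 0.0)), None)
             for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            column[s] = max(
                (best[t - 1][p][0] + lp(trans_p[p].get(s, 0.0))
                 + lp(emit_p[s].get(obs[t], 0.0)), p)
                for p in states
            )
        best.append(column)

    # Backtrack from the best final state
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

# Toy model (all numbers are made up)
states = ["you", "who", "with"]
obs = ["aap", "kis", "ke_sath"]
start_p = {"you": 0.2, "who": 0.6, "with": 0.2}
trans_p = {
    "you": {"you": 0.1, "who": 0.5, "with": 0.4},
    "who": {"you": 0.3, "who": 0.1, "with": 0.6},
    "with": {"you": 0.4, "who": 0.4, "with": 0.2},
}
emit_p = {
    "you": {"aap": 0.9, "kis": 0.05, "ke_sath": 0.05},
    "who": {"aap": 0.05, "kis": 0.9, "ke_sath": 0.05},
    "with": {"aap": 0.05, "kis": 0.05, "ke_sath": 0.9},
}
best_path = viterbi(obs, states, start_p, trans_p, emit_p)
```

The cost is quadratic in the number of states per column, which is why a label set as large as the target vocabulary is a practical problem for this framework.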

TUTORIAL ON Giza++ and Moses tools
(delivered by Kushal Ladha)

Word-based alignment

- For each word in source language, align words from target language that this word possibly produces
- Based on IBM models 1-5
- Model 1: simplest
- As we go from models 1 to 5, models get more complex but more realistic
- This is all that Giza++ does

Alignment

- A function from target position to source position
- The alignment sequence is: 2,3,4,5,6,6,6
- Alignment function A: A(1) = 2, A(2) = 3, ...
- A different alignment function will give the sequence 1,2,1,2,3,4,3,4 for A(1), A(2), ...
- To allow spurious insertion, allow alignment with word 0 (NULL)
- No. of possible alignments: $(I+1)^J$
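The count of possible alignments can be checked by brute force on small sentences. In this sketch it is assumed, reading off the slide's definition, that J is the number of target positions and I the number of source positions, with source position 0 standing for NULL; the function names are hypothetical.

```python
from itertools import product

def num_alignments(I, J):
    """Each of the J target positions picks one of I source positions or NULL (0)."""
    return (I + 1) ** J

def all_alignments(I, J):
    """Yield every alignment as a tuple (A(1), ..., A(J)) with values in 0..I."""
    return product(range(I + 1), repeat=J)

# A 2-word source sentence and a 3-word target sentence
alignments = list(all_alignments(2, 3))
count = num_alignments(2, 3)
```

The exponential growth in J is the reason training relies on expected counts instead of enumerating alignments for real sentences.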

IBM Model 1: Generative Process

Training Alignment Models

- Given a parallel corpus, for each (F,E) learn the best alignment A and the component probabilities:
  - t(f|e) for Model 1
  - lexicon probability P(f|e) and alignment probability $P(a_i \mid a_{i-1}, I)$
- How to compute these probabilities if all you have is a parallel corpus?

Intuition: Interdependence of Probabilities

- If you knew which words are probable translations of each other, then you could guess which alignment is probable and which one is improbable
- If you were given alignments with probabilities, then you could compute translation probabilities
- Looks like a chicken and egg problem
- EM algorithm comes to the rescue
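The chicken-and-egg loop can be made concrete with a minimal sketch of EM for IBM Model 1 lexical probabilities t(f|e). This is not Giza++'s implementation: the function name, the "NULL" token string, the iteration count, and the three-sentence toy corpus are all assumptions chosen for illustration.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """pairs: list of (foreign_tokens, english_tokens). Returns t[(f, e)] = t(f|e)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Start from uniform translation probabilities
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e being linked at all
        # E-step: distribute each foreign word over the English words (plus NULL)
        for fs, es in pairs:
            es = ["NULL"] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise expected counts into probabilities
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy parallel corpus (made up)
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
t = train_ibm1(pairs)
```

After a few iterations t("haus" | "house") overtakes t("das" | "house"), because "das" is increasingly explained by "the" in the other sentence pair.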

Limitation: Only 1->Many alignments allowed

Phrase-based alignment

More natural

Many-to-one mappings allowed

Giza++ and Moses Package

- http://cl.naist.jp/~eric-n/ubuntu-nlp/
- Select your Ubuntu version
- Browse the nlp folder
- Download the debian packages of giza++, moses, mkcls, srilm
- Resolve all the dependencies and they get installed
- For alternate installation, refer to http://www.statmt.org/moses_steps.html

Steps

- Input: sentence-aligned parallel corpus
- Output: target side tagged data
- Training
- Tuning
- Generate output on test corpus (decoding)

Training

- Create a folder named corpus containing the test, train and tuning files
- Giza++ is used to generate the alignment
- The phrase table is generated after training
- Before training, a language model needs to be built on the target side

    mkdir lm
    /usr/bin/ngram-count -order 3 -interpolate -kndiscount \
        -text $PWD/corpus/train_surface.hi -lm lm/train.lm
    /usr/share/moses/scripts/training/train-factored-phrase-model.perl \
        -scripts-root-dir /usr/share/moses/scripts -root-dir . \
        -corpus train.clean -e hi -f en -lm 0:3:$PWD/lm/train.lm:0

Example

train.en                      train.pr
h e l l o                     hh eh l ow
h e l l o                     hh ah l ow
w o r l d                     w er l d
c o m p o u n d w o r d       k aa m p aw n d w er d
h y p h e n a t e d           hh ay f ah n ey t ih d
o n e                         ow eh n iy
b o o m                       b uw m
k w e e z l e b o t t e r     k w iy z l ah b aa t ah r

Sample from Phrase-table

b o ||| b aa ||| (0) (1) ||| (0) (1) ||| 1 0.666667 1 0.181818 2.718
b ||| b ||| (0) ||| (0) ||| 1 1 1 1 2.718
c o m p o ||| aa m p ||| (2) (0,1) (1) (0) (1) ||| (1,3) (1,2,4) (0) ||| 1 0.0486111 1 0.154959 2.718
c ||| p ||| (0) ||| (0) ||| 1 1 1 1 2.718
d w ||| d w ||| (0) (1) ||| (0) (1) ||| 1 0.75 1 1 2.718
d ||| d ||| (0) ||| (0) ||| 1 1 1 1 2.718
e b ||| ah b ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
e l l ||| ah l ||| (0) (1) (1) ||| (0) (1,2) ||| 1 1 0.5 0.5 2.718
e l l ||| eh l ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.111111 0.5 0.111111 2.718
e l ||| eh ||| (0) (0) ||| (0,1) ||| 1 0.111111 1 0.133333 2.718
e ||| ah ||| (0) ||| (0) ||| 1 1 0.666667 0.6 2.718
h e ||| hh ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
h ||| hh ||| (0) ||| (0) ||| 1 1 1 1 2.718
l e b ||| l ah b ||| (0) (1) (2) ||| (0) (1) (2) ||| 1 1 1 0.5 2.718
l e ||| l ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.5 2.718
l l o ||| l ow ||| (0) (0) (1) ||| (0,1) (2) ||| 0.5 1 1 0.227273 2.718
l l ||| l ||| (0) (0) ||| (0,1) ||| 0.25 1 1 0.833333 2.718
l o ||| l ow ||| (0) (1) ||| (0) (1) ||| 0.5 1 1 0.227273 2.718
l ||| l ||| (0) ||| (0) ||| 0.75 1 1 0.833333 2.718
m ||| m ||| (0) ||| (0) ||| 1 0.5 1 1 2.718
n d ||| n d ||| (0) (1) ||| (0) (1) ||| 1 1 1 1 2.718
n e ||| eh n iy ||| (1) (2) ||| () (0) (1) ||| 1 1 0.5 0.3 2.718
n e ||| n iy ||| (0) (1) ||| (0) (1) ||| 1 1 0.5 0.3 2.718
n ||| eh n ||| (1) ||| () (0) ||| 1 1 0.25 1 2.718
o o m ||| uw m ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.5 1 0.181818 2.718
o o ||| uw ||| (0) (0) ||| (0,1) ||| 1 1 1 0.181818 2.718
o ||| aa ||| (0) ||| (0) ||| 1 0.666667 0.2 0.181818 2.718
o ||| ow eh ||| (0) ||| (0) () ||| 1 1 0.2 0.272727 2.718
o ||| ow ||| (0) ||| (0) ||| 1 1 0.6 0.272727 2.718
w o r ||| w er ||| (0) (1) (1) ||| (0) (1,2) ||| 1 0.1875 1 0.424242 2.718
w ||| w ||| (0) ||| (0) ||| 1 0.75 1 1 2.718
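Each phrase-table entry is five fields separated by "|||": source phrase, target phrase, two alignment fields, and a list of scores. A minimal sketch of reading one entry follows; the function name and the dictionary keys are labels chosen here for illustration, not names used by Moses, and the scores are parsed as plain floats without interpreting them.

```python
import re

def parse_alignment(field):
    """Turn '(0) (0,1) ()' into [[0], [0, 1], []]."""
    groups = re.findall(r"\(([^)]*)\)", field)
    return [[int(i) for i in g.split(",") if i] for g in groups]

def parse_phrase_table_line(line):
    """Split one phrase-table entry into its five '|||'-separated fields."""
    source, target, align_a, align_b, scores = [f.strip() for f in line.split("|||")]
    return {
        "source": source.split(),
        "target": target.split(),
        "align_a": parse_alignment(align_a),   # first alignment field, one group per source token
        "align_b": parse_alignment(align_b),   # second alignment field, one group per target token
        "scores": [float(x) for x in scores.split()],
    }

entry = parse_phrase_table_line(
    "l l o ||| l ow ||| (0) (0) (1) ||| (0,1) (2) ||| 0.5 1 1 0.227273 2.718"
)
```

For this entry the two source tokens "l l" both point at target token 0 ("l") and "o" points at target token 1 ("ow"), which is how the doubled letter collapses to a single phoneme.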

Tuning

- Not a compulsory step, but it will improve the decoding by a small percentage

    mkdir tuning
    cp $WDIR/corpus/tun.en tuning/input
    cp $WDIR/corpus/tun.hi tuning/reference
    /usr/share/moses/scripts/training/mert-moses.pl \
        $PWD/tuning/input $PWD/tuning/reference \
        /usr/bin/moses $PWD/model/moses.ini \
        --working-dir $PWD/tuning --rootdir /usr/share/moses/scripts

- It will take around 1 hour on a server with 32GB RAM

Testing

    mkdir evaluation
    /usr/bin/moses -config $WDIR/tuning/moses.ini \
        -input-file $WDIR/corpus/test.en > evaluation/test.output

- The output will be in the evaluation/test.output file

Sample Output

Input                 Output
h o t                 hh aa t
p h o n e             p|UNK hh ow eh n iy
b o o k               b uw k
