A Lightweight and High Performance Monolingual Word Aligner

Xuchen Yao and Benjamin Van Durme (Johns Hopkins), Chris Callison-Burch (UPenn), and Peter Clark (Vulcan)

2013-8-6, ACL 2013, Sofia

monolingual word alignment

• Aligning one sentence pair from RTE2

• Premise: Linda Johnson, who lives with her husband, Charles, and two cats in ... , said Katrina has ...

• Hypothesis: Linda Johnson is married to Charles

• alignment contributed by Brockett (2007)

monolingual vs. bilingual alignment

• less training data (labeled or unlabeled), but more lexical resources

• semantic relatedness: cued by distributional word similarities

• the same grammar shared by source/target sentences

a discriminative model

• first proposed by Blunsom and Cohn (2006), with the following notation:

• s, t: source sentence (the observation) and target sentence

• a: target word indices (0 to target length); state 0 is the NULL state, used for deletion

• f(): feature functions
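
The equation shown on this slide did not survive extraction. With the notation listed here, a linear-chain conditional random field of the kind Blunsom and Cohn (2006) describe would take roughly the following form. The feature weights \lambda_k and the normalizer Z(s, t) are symbols assumed for this sketch and do not appear in the extracted text:

```latex
p(\mathbf{a} \mid \mathbf{s}, \mathbf{t}) =
  \frac{\exp\left( \sum_{i=1}^{|\mathbf{s}|} \sum_{k}
        \lambda_k \, f_k(a_{i-1}, a_i, \mathbf{s}, \mathbf{t}) \right)}
       {Z(\mathbf{s}, \mathbf{t})},
\qquad a_i \in \{0, 1, \dots, |\mathbf{t}|\}
```

Each source word i receives one state a_i, so a_i = 0 marks source word i as deleted and a_i = j aligns it to target word j.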

desired Viterbi decoding path
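
The figure for this slide is not preserved. As a rough illustration of Viterbi decoding over alignment states, the sketch below finds the best state sequence for a toy sentence pair. The scoring functions `node_score` and `edge_score` and the toy sentences are hypothetical stand-ins for the weighted feature sums of a trained model; they are not taken from jacana-align.

```python
from typing import Callable, List


def viterbi_align(
    n_source: int,
    n_target: int,
    node_score: Callable[[int, int], float],
    edge_score: Callable[[int, int], float],
) -> List[int]:
    """Return the highest-scoring state for each source position.

    State 0 is NULL (the source word is deleted); state j >= 1 means
    "aligned to target word j". Scores are unnormalized log-potentials.
    """
    if n_source == 0:
        return []
    states = range(n_target + 1)
    # best[i][a]: best score of any path ending in state a at position i
    best = [[node_score(0, a) for a in states]]
    back: List[List[int]] = []
    for i in range(1, n_source):
        row, ptr = [], []
        for a in states:
            prev = max(states, key=lambda p: best[-1][p] + edge_score(p, a))
            row.append(best[-1][prev] + edge_score(prev, a) + node_score(i, a))
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda a: best[-1][a])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy scores (hypothetical): reward exact word matches, mildly penalize jumps.
source = "the cat sat".split()
target = "a cat sat down".split()


def node_score(i: int, a: int) -> float:
    if a == 0:
        return 0.0  # NULL state is neutral
    return 1.0 if source[i] == target[a - 1] else -1.0


def edge_score(prev: int, a: int) -> float:
    if prev == 0 or a == 0:
        return 0.0
    return -0.1 * abs(a - prev - 1)  # prefer monotone, adjacent alignments


path = viterbi_align(len(source), len(target), node_score, edge_score)
print(path)
```

In this toy case "the" goes to the NULL state and "cat" and "sat" go to target words 2 and 3.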

features

• string similarity
  – Jaro-Winkler, Dice-Sorensen, Hamming, Jaccard, Levenshtein, n-gram overlap, and common prefix matching

• POS tag matching

• WordNet
  – hypernym, hyponym, synonym, derived form, entailing, causing, members of, have member, substances of, have substances, parts of, have part
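
A few of the string similarity measures named above can be sketched in plain Python. This is a minimal illustration only: computing Jaccard and Dice over character bigrams, and normalizing by the longer word, are assumptions made here, and the function names are hypothetical rather than taken from jacana-align.

```python
from typing import Dict, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Set of character n-grams; a word shorter than n yields itself."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a: str, b: str) -> float:
    x, y = char_ngrams(a), char_ngrams(b)
    return len(x & y) / len(x | y)


def dice(a: str, b: str) -> float:
    x, y = char_ngrams(a), char_ngrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def string_features(src: str, tgt: str) -> Dict[str, float]:
    """Feature values in [0, 1] for one source/target word pair."""
    longest = max(len(src), len(tgt)) or 1
    return {
        "jaccard": jaccard(src, tgt),
        "dice": dice(src, tgt),
        "levenshtein_sim": 1 - levenshtein(src, tgt) / longest,
        "common_prefix": common_prefix_len(src, tgt) / longest,
    }


print(string_features("married", "marry"))
```

Each value would then be one feature f() for a candidate source/target word pair.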

features

• positional
  – offset difference between source/target word

• context
  – whether neighboring words are similar
  – helps to align function words

• distortion (Markov feature)
  – how far apart two aligned target words are
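
The three feature families on this slide might look roughly as follows. The exact definitions used by jacana-align are not given in the slides, so the relative-position formula, the one-word context window, and the treatment of the NULL state are all assumptions of this sketch, and the function names are hypothetical.

```python
from typing import Callable, Sequence


def positional_feature(i: int, j: int, n_source: int, n_target: int) -> float:
    """Absolute difference between relative positions (0-based i and j)."""
    return abs(i / n_source - j / n_target)


def context_feature(
    source: Sequence[str],
    target: Sequence[str],
    i: int,
    j: int,
    similar: Callable[[str, str], bool] = lambda a, b: a.lower() == b.lower(),
) -> float:
    """Fraction of the left/right neighbor pairs that are similar."""
    hits = 0
    for d in (-1, 1):
        si, tj = i + d, j + d
        if 0 <= si < len(source) and 0 <= tj < len(target):
            hits += similar(source[si], target[tj])
    return hits / 2


def distortion_feature(prev_state: int, state: int) -> int:
    """Distance between two consecutively aligned target words.

    States are 1-based target indices; 0 is NULL and contributes no distance.
    """
    if prev_state == 0 or state == 0:
        return 0
    return abs(state - prev_state)


source = "he lives in the city".split()
target = "she lived in the town".split()
# "the" (index 3) has a matching left neighbor "in" but not a matching right one.
print(context_feature(source, target, 3, 3))
```

The context feature is what lets a function word like "the" pick the right counterpart when the same word occurs several times in the target.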

Implementation: jacana-align

source code at http://code.google.com/p/jacana

• lightweight: only used a POS tagger and WordNet

• written in Scala, optimized with L-BFGS

• platform independent, compiles to a .jar file, fully interoperable with Java

• high performance? -> evaluation

Baselines

• GIZA++

• Tree Edit Distance (with stem/WordNet matching)

• MANLI
  – MacCartney, B., Galley, M. & Manning, C. D. A Phrase-Based Alignment Model for Natural Language Inference. EMNLP 2008

• MANLI-constraint (decoding with ILP)
  – Thadani, K. & McKeown, K. Optimal and syntactically-informed decoding for monolingual phrase-based alignment. ACL 2011

performance in F1

[Chart not preserved in the extracted text; the only surviving annotations are 10.3%, 0.8% and 3.3%.]
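
For reference, F1 for word alignment is commonly computed over alignment links. The sketch below assumes predicted and gold alignments are sets of (source index, target index) pairs; whether the evaluation behind this slide scores links in exactly this way is an assumption, and the example links are made up.

```python
from typing import Iterable, Tuple

Link = Tuple[int, int]  # (source word index, target word index)


def alignment_f1(predicted: Iterable[Link], gold: Iterable[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over alignment links."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical links: 2 of 4 predicted links are among the 3 gold links.
gold_links = {(0, 0), (1, 1), (2, 3)}
predicted_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
precision, recall, f1 = alignment_f1(predicted_links, gold_links)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```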

performance in speed (seconds per sentence)

• when sentences are more balanced, jacana-align is about 20x faster

• the speed of jacana-align is not as sensitive to sentence length increase

corpus                    sentence pair length   MANLI-approx.   MANLI-exact   jacana-align
RTE2                      29/11                  1.67s           0.08s         0.025s
FUSION                    27/27                  61.96s          2.45s         0.096s
slowdown, RTE2 to FUSION                         30x             30x           4x
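
The ratios behind these two speed slides can be recomputed from the table. The sketch below uses only the timings shown; reading the slide annotations as loosely rounded versions of these ratios is an assumption.

```python
# Seconds per sentence pair, copied from the table.
seconds = {
    "RTE2": {"MANLI-approx.": 1.67, "MANLI-exact": 0.08, "jacana-align": 0.025},
    "FUSION": {"MANLI-approx.": 61.96, "MANLI-exact": 2.45, "jacana-align": 0.096},
}

# How much slower each system gets when moving from RTE2 to FUSION.
slowdown = {
    system: seconds["FUSION"][system] / seconds["RTE2"][system]
    for system in seconds["RTE2"]
}

# How much faster jacana-align is than MANLI-exact on the balanced corpus.
speedup_on_fusion = seconds["FUSION"]["MANLI-exact"] / seconds["FUSION"]["jacana-align"]

for system, ratio in slowdown.items():
    print(f"{system}: {ratio:.1f}x slower on FUSION than on RTE2")
print(f"jacana-align vs. MANLI-exact on FUSION: {speedup_on_fusion:.1f}x faster")
```

The computed slowdowns are about 37x, 31x and 4x, and the FUSION speedup over MANLI-exact is about 26x, so the figures on the slides are conservative roundings.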

Conclusion

• state-of-the-art monolingual word aligner
  – in accuracy
  – in speed

• open source, use it and hack it!

thank you

with a demo