ACL 2005 WORKSHOP ON BUILDING AND USING PARALLEL TEXTS (WPT-05), Ann Arbor, MI. June 2005

Competitive Grouping in Integrated Segmentation and Alignment Model

Ying Zhang, Stephan Vogel

Language Technologies Institute

School of Computer Science

Carnegie Mellon University

Integrated Segmentation and Alignment Model

• Phrase alignment models (Och et al., 1999; Marcu and Wong, 2002; Koehn et al., 2003)
  – Many of these models rely on a pre-calculated word alignment.
  – They use different heuristics to extract phrase pairs from the Viterbi word alignment path.

• Integrated Segmentation and Alignment (ISA) model (Zhang 2003)
  – No such word alignments are needed.
  – Segments source and target sentences into phrases and aligns them simultaneously.
  – Uses chi-square(f, e) instead of the conditional probability P(f|e) for word pair associations.
  – Greedy search for phrase pairs.
  – Key idea: the competitive grouping algorithm.
  – Inspired by the competitive linking algorithm (Melamed 1997) for word alignment.

Competitive Linking Algorithm

• A greedy word alignment algorithm.

• The word pair with the highest likelihood L(f, e) “wins” the competition.

• One-to-one assumption: once a pair {f, e} is “linked”, neither f nor e can be aligned with any other word.

• Example: [figure]
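
The greedy procedure above can be sketched in a few lines of Python. This is a minimal illustration, not Melamed's implementation and not the slide's own example: the function name `competitive_linking`, the matrix layout (rows are source words f, columns are target words e), and the toy scores are assumptions made here, and no stopping threshold on the score is applied.

```python
def competitive_linking(likelihood):
    """Greedy one-to-one word alignment (illustrative sketch).

    likelihood[i][j] is the association score L(f_i, e_j).
    Returns the chosen (i, j) links, best-scoring first.
    """
    # Rank every candidate word pair by its score, highest first.
    cells = sorted(
        ((score, i, j)
         for i, row in enumerate(likelihood)
         for j, score in enumerate(row)),
        key=lambda cell: -cell[0],
    )
    used_f, used_e, links = set(), set(), []
    for _score, i, j in cells:
        # One-to-one assumption: a linked word cannot be linked again.
        if i in used_f or j in used_e:
            continue
        links.append((i, j))
        used_f.add(i)
        used_e.add(j)
    return links


# Toy scores, made up for illustration only.
toy = [
    [5.0, 4.0],
    [4.5, 1.0],
]
links = competitive_linking(toy)  # [(0, 0), (1, 1)]
```

In the toy matrix the pair (1, 0) scores 4.5, but source word 1 ends up linked to target word 1 with score 1.0 because target word 0 was already taken by the winner. This is the kind of greediness that the one-to-one assumption produces.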

Competitive Grouping Algorithm

• Discard the one-to-one assumption of competitive linking to make it less greedy.

• When a pair {e, f} wins the competition, invite the neighboring pairs to join the “winner’s club”.

• Introduce the locality assumption: a source phrase of adjacent words can only be aligned to a target phrase of adjacent words.
  – Words inside the aligned phrase pairs cannot be aligned to other words.

Expanding the Aligned Phrase Pair

• Two criteria have to be satisfied to expand the seed word pair into a phrase pair:
  1. If a new source word f is to be grouped, the best e that f is associated with should not be “blocked” by this expansion; similarly for grouping a new target word.
  2. The highest word pair likelihood value in the expanded area needs to be “similar” to the seed value.

• According to the locality assumption, words in the aligned phrase pairs cannot be aligned with other words again.
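
One possible reading of the two criteria is sketched below. The slides do not define “blocked” or “similar” precisely, so the sketch makes two assumptions: a new word is not blocked if its best still-available partner lies inside the current phrase span, and “similar” means the best score among the newly added cells is at least `ratio` times the seed score. The function name `expand_seed`, the parameter `ratio`, and the toy matrix are all hypothetical.

```python
def expand_seed(L, seed, ratio=0.5, used_f=frozenset(), used_e=frozenset()):
    """Grow a seed word pair into a rectangular phrase pair (sketch).

    L[i][j] is the association score of source word i and target word j.
    used_f / used_e hold words already covered by earlier phrase pairs.
    Returns (i1, i2, j1, j2): inclusive source span and target span.
    """
    n_f, n_e = len(L), len(L[0])
    si, sj = seed
    seed_value = L[si][sj]
    i1 = i2 = si
    j1 = j2 = sj

    def best_e(i):
        # Best still-available target word for source word i.
        return max((j for j in range(n_e) if j not in used_e),
                   key=lambda j: L[i][j])

    def best_f(j):
        # Best still-available source word for target word j.
        return max((i for i in range(n_f) if i not in used_f),
                   key=lambda i: L[i][j])

    grew = True
    while grew:
        grew = False
        # Try the source words adjacent to the block (locality assumption).
        for i in (i1 - 1, i2 + 1):
            if not 0 <= i < n_f or i in used_f:
                continue
            # Criterion 1 (assumed reading): best partner must stay reachable.
            if not j1 <= best_e(i) <= j2:
                continue
            # Criterion 2 (assumed reading): new cells score close to the seed.
            if max(L[i][j1:j2 + 1]) < ratio * seed_value:
                continue
            i1, i2 = min(i1, i), max(i2, i)
            grew = True
        # Same two checks for the adjacent target words.
        for j in (j1 - 1, j2 + 1):
            if not 0 <= j < n_e or j in used_e:
                continue
            if not i1 <= best_f(j) <= i2:
                continue
            if max(L[i][j] for i in range(i1, i2 + 1)) < ratio * seed_value:
                continue
            j1, j2 = min(j1, j), max(j2, j)
            grew = True
    return i1, i2, j1, j2


# Toy scores, made up for illustration only.
toy = [
    [9.0, 1.0, 0.5, 0.2],
    [1.0, 8.0, 7.0, 0.3],
    [0.5, 5.0, 6.0, 0.4],
    [0.2, 0.3, 0.4, 6.0],
]
block = expand_seed(toy, (1, 1))  # (1, 2, 1, 2)
```

With the default `ratio` the seed (1, 1) grows into a two-by-two phrase pair. Raising `ratio` to 0.9 leaves the seed as a single word pair, which is how criterion 2 acts as a granularity control in this sketch.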

Exploring All Possible Phrase Pairs

• Criterion 2 controls the granularity of the aligned phrase pairs:
  – two short phrase pairs,
  – or one long phrase pair.

• Short phrases give better coverage of unseen test data.

• Long phrases encapsulate more context, e.g. local reordering, word sense, etc.

• It is hard to decide on the optimal granularity without knowing the test data.

• Solution: for each grouping, try all possible granularities.

Exploring All Possible Phrase Pairs

French: Je déclare reprise la session

English: I declare resumed the session

The Likelihood of Word Associations

• The chi-square statistic is used to measure the likelihood of word associations for a pair {e, f}.

• For each word pair {e, f}, the null hypothesis is that e and f are independent of each other.

• Calculate χ² to measure how well this hypothesis holds.

• Construct the contingency table using the counts from the corpus given the current alignment, e.g. a uniform alignment:
  – O11: number of times e and f are aligned
  – O12: number of times e is aligned with other f
  – O21: number of times f is aligned with other e
  – O22: number of times other f are aligned with other e

        f      ~f
  e     O11    O12
  ~e    O21    O22
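
For a 2x2 table like this one, the Pearson chi-square statistic has a closed form, χ² = N (O11·O22 − O12·O21)² / ((O11+O12)(O21+O22)(O11+O21)(O12+O22)) with N the sum of the four cells. The sketch below computes it directly; the function name `chi_square`, the handling of empty marginals, and the example counts are assumptions for illustration, not values from the talk.

```python
def chi_square(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    # Product of the two row totals and the two column totals.
    denom = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    if denom == 0:
        # An empty row or column: treat as no evidence of association.
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


# Made-up counts: e and f always co-occur vs. co-occur at chance level.
strong = chi_square(10, 0, 0, 10)  # 20.0
none = chi_square(5, 5, 5, 5)      # 0.0
```

A larger value means the counts deviate more from what independence would predict, so the null hypothesis is less plausible and the pair {e, f} is a stronger candidate for alignment.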

In WPT-05

• Submitted results for all four languages

• Training data as provided

• Language model as provided

• Decoder (Pharaoh) as provided

BLEU       German   Spanish   Finnish   French
Dev-test   18.63    26.20     12.88     26.20
Test       18.93    26.14     12.66     26.71

Conclusion

• The competitive grouping algorithm is at the core of the ISA model.

• Simple and efficient model.

• Results comparable to other phrase alignment models.

The Evolution of ISA

Matrix of the Likelihood

Expanding the Phrase Pairs