You Can’t Beat Frequency (Unless You Use Linguistic Knowledge) – A Qualitative Evaluation of Association Measures for Collocation and Term Extraction Joachim Wermter and Udo Hahn Jena University ACL 2006 Regular Conference Paper

You Can’t Beat Frequency (Unless You Use Linguistic Knowledge) – A Qualitative

Evaluation of Association Measures for Collocation and Term Extraction

Joachim Wermter and Udo Hahn

Jena University

ACL 2006 Regular Conference Paper

Objective

• Compare the performance of frequency, t-test, LSM and LPM methods on collocation extraction and domain-specific automatic term recognition

Collocation Extraction

• Extract idioms

• “kick the bucket”

Domain-Specific Term Extraction

• Extract domain-specific phrases

• “mitochondrial inheritance”

Corpus

LSM

• A “linguistic knowledge-based” method for collocation extraction proposed by the same authors in another paper

• Assumes that idioms are less modifiable by supplements

– e.g. “kick the beautiful bucket” is an unlikely variant of “kick the bucket”

• Probability of a PNVtriple having supplement Supp_k:

• f(x) : frequency of x

LSM

• Modifiability of a PNVtriple

• Probability of a PNVtriple

• Collocation Score
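The formulas on these two LSM slides did not survive extraction, so the following is only a minimal sketch of how the listed quantities could fit together. Everything in it is an assumption made for illustration: that the probability of a triple with Supp_k is f(triple, Supp_k) divided by the triple's total frequency, that modifiability is the probability of the most frequent supplement, that the triple probability is a relative frequency, and that the collocation score is the product of the two. The function name `lsm_scores` and the toy data are hypothetical.

```python
from collections import Counter

def lsm_scores(observations):
    """Score triples from (triple, supplement) pairs, one pair per occurrence.

    A supplement of "" means the triple occurred without any modifier.
    All formulas here are assumptions, not the authors' definitions.
    """
    observations = list(observations)
    pair_freq = Counter(observations)                  # f(triple, Supp_k)
    triple_freq = Counter(t for t, _ in observations)  # f(triple)
    total = sum(triple_freq.values())

    scores = {}
    for triple, f_triple in triple_freq.items():
        # Assumed: p(triple, Supp_k) = f(triple, Supp_k) / f(triple),
        # and modifiability = probability of the most frequent supplement.
        top = max(f for (t, _), f in pair_freq.items() if t == triple)
        modifiability = top / f_triple
        # Assumed: probability of a triple = its relative frequency.
        p_triple = f_triple / total
        # Assumed: collocation score = modifiability * probability.
        scores[triple] = modifiability * p_triple
    return scores

# Hypothetical toy corpus: the idiom is never modified, the free phrase often is.
idiom = ("kick", "the", "bucket")
free = ("kick", "the", "ball")
toy = [(idiom, "")] * 4 + [(free, ""), (free, "red"), (free, "new"), (free, "old")]
scores = lsm_scores(toy)
```

Under these assumptions, a triple that almost always appears with the same supplement (typically none) gets a modifiability close to 1, so at equal frequency the idiom outranks the freely modifiable phrase.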

LPM

• A “linguistic knowledge-based” method for automatic term recognition proposed by the same authors in another paper

• Assumes that words in a phrase are less interchangeable

– e.g. “mitochondrion inheritance” is unlikely to become “money inheritance”

• Modifiability of a phrase:

• mod_k(n-gram): replace k words

• sel_i: particular replacement

LPM

• Phrase Score:
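The LPM formulas are also missing from the extracted slides, so the sketch below only illustrates the idea of "replace k words" under stated assumptions. It assumes that, for each choice sel_i of k word slots, the relevant probability is the frequency of the n-gram divided by the frequency of all n-grams that match it outside those slots, and that the phrase score is the product of these ratios over all k and all selections. The function name `lpm_score` and the toy counts are hypothetical.

```python
from collections import Counter
from itertools import combinations

def lpm_score(ngram, ngram_freq):
    """Assumed phrase score: product of f(ngram) / f(pattern) over all slot choices.

    ngram: tuple of words; ngram_freq: Counter mapping word tuples to counts.
    These formulas are illustrative assumptions, not the authors' definitions.
    """
    n = len(ngram)
    f_ngram = ngram_freq[ngram]
    if f_ngram == 0:
        return 0.0  # unseen n-gram: nothing to score
    score = 1.0
    for k in range(1, n + 1):                    # mod_k: replace k words
        for slots in combinations(range(n), k):  # sel_i: one choice of k slots
            # Count every n-gram that agrees with `ngram` outside the slots.
            matching = sum(
                f for other, f in ngram_freq.items()
                if len(other) == n
                and all(other[j] == ngram[j] for j in range(n) if j not in slots)
            )
            score *= f_ngram / matching
    return score

# Hypothetical toy counts.
freq = Counter({
    ("mitochondrial", "inheritance"): 4,
    ("large", "inheritance"): 1,
    ("large", "house"): 3,
})
term = lpm_score(("mitochondrial", "inheritance"), freq)
non_term = lpm_score(("large", "inheritance"), freq)
```

With these counts the candidate whose slots are rarely filled by other words scores higher, which is the behaviour the interchangeability assumption calls for.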

Evaluation Criteria

• Compared to the baseline frequency ranking, a good ranking function should have the following four characteristics:

1. Keep the true positives in the upper portion of the list

2. Keep the true negatives in the lower portion of the list

3. Demote true negatives from the upper portion

4. Promote true positives from the lower portion
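The four criteria can be counted by comparing a candidate ranking against the frequency baseline. The sketch below is one possible reading, not the authors' evaluation code: it assumes both rankings contain the same items, best first, and that the "upper portion" is a fixed fraction of the list (half by default). The function name `criteria_counts` and its parameters are hypothetical.

```python
def criteria_counts(baseline, candidate, positives, upper_fraction=0.5):
    """Count items satisfying each of the four criteria.

    baseline, candidate: the same items ranked best-first by two methods.
    positives: items that are true collocations or terms; the rest are negatives.
    upper_fraction: assumed size of the "upper portion" of each list.
    """
    cut = int(len(baseline) * upper_fraction)
    items = set(baseline)
    base_upper, cand_upper = set(baseline[:cut]), set(candidate[:cut])
    base_lower, cand_lower = items - base_upper, items - cand_upper
    pos = items & set(positives)
    neg = items - pos
    return {
        "kept_tp_upper": len(pos & base_upper & cand_upper),  # criterion 1
        "kept_tn_lower": len(neg & base_lower & cand_lower),  # criterion 2
        "demoted_tn": len(neg & base_upper & cand_lower),     # criterion 3
        "promoted_tp": len(pos & base_lower & cand_upper),    # criterion 4
    }

# Hypothetical example: "d" is a positive the baseline ranked low.
counts = criteria_counts(
    baseline=["a", "b", "c", "d"],
    candidate=["a", "d", "c", "b"],
    positives={"a", "d"},
)
```

A ranking that improves on the baseline should show high counts for all four entries relative to the number of positives and negatives in each portion.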

Collocation Extraction Results

Automatic Term Recognition Results

Observations

• CE Criterion 1

– t-test and frequency methods have similar performance

– LSM promotes some TPs to the top 1/6

• ATR Criterion 1

– t-test and frequency methods have similar performance

– LPM promotes a few TPs to the top 1/6

Observations

• CE Criterion 2

– LSM promotes a lot more TNs to the upper portion than the t-test method (bad…)

• ATR Criterion 2

– Same as above (with LPM in place of LSM)

Observations

• CE Criterion 3

– LSM demotes a lot more TNs to the lower portion than the t-test

• ATR Criterion 3

– Same as above (with LPM in place of LSM)

Observations

• CE Criterion 4

– LSM promotes more TPs to the upper portion than the t-test

• ATR Criterion 4

– Same as above (with LPM in place of LSM)

Conclusion

• LSM and LPM methods are better than t-test and frequency methods

• Purely statistical methods perform worse than linguistic knowledge-based methods