INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
FF & FER

Comparative Analysis of Automatic Term and Collocation Extraction

Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec
Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing



Page 1: Comparative Analysis of Automatic Term and Collocation Extraction

INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

FF & FER

Comparative Analysis of Automatic Term and Collocation Extraction

Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec

Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing

Page 2: Comparative Analysis of Automatic Term and Collocation Extraction

Overview

I. Introduction
  – Reasons for extraction

II. Research
  – Resources & tools
  – Extracted lists

III. Evaluation
  – Precision, recall, F-measure

IV. Conclusion

Page 3: Comparative Analysis of Automatic Term and Collocation Extraction

I. Introduction

• Monolingual and multilingual resources
  – Helpful
  – Integrated
  – Require human intervention

• EU pre-accession activities
  – Speed up + consistency

• Used in further research and practice

Page 4: Comparative Analysis of Automatic Term and Collocation Extraction

• List:
  – Terms (Member State, European Union)
  – Collocations (adopt a/the resolution, decided as follows)
  – Multi-word units (depend on, well-being)

• Term extraction process:
  – Term extraction (term acquisition): identification
  – Term recognition: verification

Page 5: Comparative Analysis of Automatic Term and Collocation Extraction

II. Research

• Resources
  – 10 documents – legislation, Cro-Eng

• Tools
  – TermeX tool (FER) – list A
  – SDL Multi Term Extract + NooJ (FF) – list B

• Reference list
  – Evaluation – reference list

Page 6: Comparative Analysis of Automatic Term and Collocation Extraction

Reference list

• 470 terms and collocations
• Exclude unigrams
• Balance between lexical coverage, adequacy, practicality
  – terms (NPs: 346/470)
  – collocations (VPs)

Page 7: Comparative Analysis of Automatic Term and Collocation Extraction

Reference list

• Contains:
  – Terms (acquiring company, applicant country)
  – Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
  – Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
  – Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)

Page 8: Comparative Analysis of Automatic Term and Collocation Extraction

List B

• Language-independent, statistically based SDL Multi Term Extract tool
  – Frequency threshold set to 4
  – Filtered by the list of stop-words -> 369 candidates

• Language-dependent NooJ tool
  – 36 local grammars -> 512 candidates
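The statistical step above (count n-grams, apply a frequency threshold, filter with a stop-word list) can be sketched roughly as follows. This is only an illustration, not the tool's actual algorithm: the function name `candidate_ngrams`, the n-gram lengths, and the rule of dropping n-grams that begin or end with a stop word are all assumptions made for the example.

```python
from collections import Counter


def candidate_ngrams(tokens, n_values=(2, 3), min_freq=4, stop_words=frozenset()):
    """Return {n-gram: frequency} for n-grams that pass both filters.

    Assumed filtering rule (hypothetical): an n-gram is dropped when its
    first or last word is a stop word, or when it occurs fewer than
    min_freq times.
    """
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {
        " ".join(gram): freq
        for gram, freq in counts.items()
        if freq >= min_freq
        and gram[0] not in stop_words
        and gram[-1] not in stop_words
    }


# Toy input; a real run would tokenize the legislative documents.
tokens = ("the member state of the member state and "
          "the member state in the member state").split()
stop_words = frozenset({"the", "of", "and", "in"})
candidates = candidate_ngrams(tokens, min_freq=4, stop_words=stop_words)
print(candidates)
```

A purely frequency-based filter like this needs no linguistic knowledge, which is what makes the approach language-independent; the price is that it cannot distinguish noun phrases from other frequent word sequences.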

Page 9: Comparative Analysis of Automatic Term and Collocation Extraction

List A

• TermeX
  – Lexical association measures (AMs)
  – 14 AMs (PMI, Dice, Chi-square, …)
  – Lemmatization
  – POS filtering
  – Frequency threshold set to ?
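For orientation, the three association measures named above have the following textbook definitions for a word pair (x, y). The notation is an assumption introduced here, not taken from the slides: f(·) is a corpus frequency, P(·) the corresponding relative frequency, N the number of bigram tokens, and O_ij the cells of the 2×2 contingency table for the pair. The exact variants implemented in TermeX may differ.

```latex
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
\qquad
\mathrm{Dice}(x, y) = \frac{2\, f(x, y)}{f(x) + f(y)}
\qquad
\chi^2 = \frac{N\,(O_{11} O_{22} - O_{12} O_{21})^2}
              {(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}
```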

Page 10: Comparative Analysis of Automatic Term and Collocation Extraction

List A

• Extracted terms ranked by AM value
  – 1816 candidates

• AMs used:
  – 2-grams – PMI
  – 3-grams, 4-grams – heuristic extensions

• Noun phrases only
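Ranking 2-gram candidates by PMI can be illustrated with a minimal sketch. It assumes maximum-likelihood probability estimates from raw token counts; the function name `rank_bigrams_by_pmi` is hypothetical, and the lemmatization, POS filtering, and 3-/4-gram heuristic extensions mentioned on the slides are left out.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_freq=1):
    """Return (bigram, PMI) pairs sorted by descending PMI.

    Probabilities are simple relative frequencies (an assumption);
    ties are broken alphabetically so the order is deterministic.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), f12 in bigrams.items():
        if f12 < min_freq:
            continue
        p12 = f12 / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        scored.append((f"{w1} {w2}", math.log2(p12 / (p1 * p2))))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


tokens = ("european union law and european union policy "
          "and national law").split()
for term, score in rank_bigrams_by_pmi(tokens):
    print(f"{score:6.3f}  {term}")
```

PMI is known to favour rare pairs, so a frequency threshold (the `min_freq` parameter here) is usually applied before ranking.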

Page 11: Comparative Analysis of Automatic Term and Collocation Extraction

Results

• Evaluation
  – F1-measure (precision, recall)
  – True positives calculated by taking into account inflection (suffix stripping)

                 List A    List B
No. of terms       1816       508
Valid terms         202       234
Precision (%)     11.56     47.37
Recall (%)        42.98     49.79
F1 (%)            18.22     48.55
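The evaluation scheme can be sketched as follows: normalize both lists with a crude suffix stripper, count the overlap as true positives, and derive precision, recall, and F1. The suffix list, the minimum stem length, and all function names are assumptions chosen for the example; the slides do not specify the stripping rules that were actually used.

```python
SUFFIXES = ("ies", "es", "y", "e", "s")  # assumed, English-only toy list


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def normalize(term):
    """Lower-case a term and strip a suffix from each of its words."""
    return " ".join(strip_suffix(w) for w in term.lower().split())


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (any common scale)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(extracted, reference):
    """Return (true positives, precision, recall, F1) as fractions."""
    ext = {normalize(t) for t in extracted}
    ref = {normalize(t) for t in reference}
    tp = len(ext & ref)
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(ref) if ref else 0.0
    return tp, precision, recall, f1_score(precision, recall)


extracted = ["Member States", "applicant countries", "of the"]
reference = ["member state", "applicant country",
             "entry into force", "crime prevention"]
print(evaluate(extracted, reference))
```

Applying `f1_score` to the precision and recall values in the table reproduces the reported F1 figures (18.22 for List A and 48.55 for List B).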

Page 12: Comparative Analysis of Automatic Term and Collocation Extraction

Results

• List A unsatisfactory
  – Low recall: misses verb phrases and terms consisting of more than 4 words
  – Low precision: the list is ranked, so precision can be improved with a cut-off (true positives tend to rank higher)

• List B modest
  – can be improved with lemmatization, definition of upper/lower cases, more detailed local grammars
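The effect of a cut-off on a ranked candidate list can be illustrated with a small sketch: keep only the top k candidates and recompute precision and recall. The function name `precision_recall_at_k` and the toy data are hypothetical; the slides do not report results for any particular cut-off.

```python
def precision_recall_at_k(ranked_terms, reference, k):
    """Precision and recall when only the top-k ranked candidates are kept."""
    top = ranked_terms[:k]
    ref = set(reference)
    tp = sum(1 for term in top if term in ref)
    precision = tp / len(top) if top else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall


# Toy ranked list in which valid terms cluster near the top.
ranked_terms = ["european union", "member state", "of the",
                "crime prevention", "in the", "to the"]
reference = ["european union", "member state",
             "crime prevention", "entry into force"]

for k in (2, 4, 6):
    p, r = precision_recall_at_k(ranked_terms, reference, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

When valid terms concentrate at the top of the ranking, a smaller k raises precision at the cost of recall, which is the trade-off the cut-off is meant to exploit.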

Page 13: Comparative Analysis of Automatic Term and Collocation Extraction

Conclusion

• Comparison of two hybrid approaches to term extraction

• Human-created lists differ from extracted lists
  – human knowledge, experience and intuition

• Room for improvement: automatic extraction combined with human intervention

Page 14: Comparative Analysis of Automatic Term and Collocation Extraction


Thank you!
