Collocations and Terminology Vasileios Hatzivassiloglou University of Texas at Dallas

Page 1: Collocations and Terminology

Collocations and Terminology

Vasileios Hatzivassiloglou

University of Texas at Dallas

Page 2: Collocations and Terminology

Collocations

• Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993

• Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning

• Technical and non-technical

Page 3: Collocations and Terminology

Examples of collocations

• The Dow Jones average of industrials

• The Dow average

• The Dow industrials

• *The Jones industrials

• The Dow Jones industrial

• *The industrial Dow

• *The Dow industrial

Page 4: Collocations and Terminology

Collocation properties

• Arbitrary (dialect dependent)
  – ride a bike, set the table

• Domain dependent
  – dry suit, wet suit

• Recurrent

• Cohesive
  – Part of a collocation primes for the rest

Page 5: Collocations and Terminology

Applications

• Lexicography

• Grammatical restrictions (compare with/to but associate with)

• Generation

• Translation

Page 6: Collocations and Terminology

Types of collocations

• Predicative relations
  – make a decision, hostile takeover
  – flexible (syntactic variability, intervening words)

• Rigid word groups
  – over the counter market

• Phrases with open slots
  – fluency in a domain

Page 7: Collocations and Terminology

Issues in finding collocations

• Possibly more than two words
  – Need measure that extends beyond the binary case

• Possibly intervening words

• Possibly morphological and syntactic variation

• Semantic constraints (cf. doctors-dentists and doctors-hospitals)

Page 8: Collocations and Terminology

Xtract stage one

• For a given word, find all collocates at positions -5 to +5

• Three criteria:
  – strength (normalized frequency); 95% rejection vs. expected 68% under normal distribution
  – position histogram must not be flat
  – select peak from histogram
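The three stage-one criteria can be sketched as follows. This is a minimal illustration, not Smadja's implementation: strength is taken to be the z-score of a collocate's total frequency, flatness is measured by the variance ("spread") of its position histogram, and a peak is any position well above the histogram mean. The function names and the default thresholds `k0`, `u0`, and `k1` are assumptions chosen for the example.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def position_histograms(tokens, target, window=5):
    """For each collocate of `target`, count occurrences at offsets -window..+window."""
    hist = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                hist[tokens[j]][offset] += 1
    return hist


def stage_one(tokens, target, window=5, k0=1.0, u0=10.0, k1=1.0):
    """Return (collocate, strength, peak_offsets) for collocates passing all three tests.

    k0, u0 and k1 are illustrative thresholds (assumptions, not from the slides).
    """
    hist = position_histograms(tokens, target, window)
    freqs = {w: sum(c.values()) for w, c in hist.items()}
    if not freqs:
        return []
    f_bar = mean(freqs.values())
    sigma = pstdev(freqs.values())
    offsets = [p for p in range(-window, window + 1) if p != 0]

    kept = []
    for w, c in hist.items():
        # Criterion 1: strength = z-score of the collocate's total frequency.
        strength = (freqs[w] - f_bar) / sigma if sigma > 0 else 0.0
        # Criterion 2: spread = variance of the position histogram (flat => low).
        counts = [c[p] for p in offsets]
        p_bar = mean(counts)
        spread = sum((x - p_bar) ** 2 for x in counts) / len(counts)
        if strength < k0 or spread < u0:
            continue
        # Criterion 3: keep only positions that stand out from the histogram mean.
        cutoff = p_bar + k1 * spread ** 0.5
        peaks = [p for p, x in zip(offsets, counts) if x >= cutoff]
        kept.append((w, strength, peaks))
    return kept
```

On a corpus where "market" almost always directly follows "stock", `stage_one(tokens, "stock")` would be expected to keep "market" with a single peak at offset +1, while words scattered evenly around "stock" fail the strength or spread test.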

Page 9: Collocations and Terminology

Xtract stage two

• Start from word pairs

• Look at each position in between, to the left, and to the right

• Keep words that appear very often

• If that fails, keep parts of speech that satisfy this criterion
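A minimal sketch of this expansion step, assuming the contexts around a word pair have already been aligned and part-of-speech tagged. At each position it keeps the dominant word if it is frequent enough, falls back to the dominant tag otherwise, and emits a wildcard when neither qualifies. The function name, the input layout, the `"*"` wildcard, and the 0.75 threshold are assumptions made for illustration.

```python
from collections import Counter


def stage_two(contexts, threshold=0.75):
    """Generalize aligned contexts into a pattern of words, POS tags, or "*".

    `contexts` is a list of equal-length lists of (word, pos_tag) tuples,
    one list per corpus occurrence of the word pair (hypothetical layout).
    """
    n = len(contexts)
    pattern = []
    for slot in zip(*contexts):  # iterate over aligned positions
        word, word_count = Counter(w for w, _ in slot).most_common(1)[0]
        tag, tag_count = Counter(t for _, t in slot).most_common(1)[0]
        if word_count / n >= threshold:
            pattern.append(word)   # a single word dominates this position
        elif tag_count / n >= threshold:
            pattern.append(tag)    # no dominant word, but a dominant tag
        else:
            pattern.append("*")    # nothing consistent at this position
    return pattern


contexts = [
    [("the", "DT"), ("stock", "NN"), ("market", "NN"), ("rose", "VBD")],
    [("the", "DT"), ("stock", "NN"), ("market", "NN"), ("fell", "VBD")],
    [("the", "DT"), ("stock", "NN"), ("market", "NN"), ("closed", "VBD")],
    [("a", "DT"), ("stock", "NN"), ("market", "NN"), ("is", "VBZ")],
]
pattern = stage_two(contexts)
```

Here "the" fills three of four first positions and is kept as a word, while the final position has no dominant word and is generalized to its dominant tag.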

Page 10: Collocations and Terminology

Xtract stage three

• Applied to pairs of words

• Requires (partial) parsing

• Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)

Page 11: Collocations and Terminology

Evaluation

• Ask lexicographer to evaluate output

• 40% precision after stages one and two

• 80% precision after stage three

• 94% conditional recall

Page 12: Collocations and Terminology

Terminology

• Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994

• Terms refer to concepts

• Terms key for populating a domain ontology

• Terms are typically nominal compounds of certain structure, e.g., NN, N of N

Page 13: Collocations and Terminology

Defining terms

• Unique reference

• Unique translation

• Term extension by
  – modification (e.g., addition of an adjective)
  – substitution
  – extension of structure
  – coordination

Page 14: Collocations and Terminology

Algorithm

• Apply syntactic constraints to match pairs of words in a candidate term

• Filter by application of an association measure

• Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
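The three measures can all be computed from a 2×2 contingency table for a candidate pair (w1, w2). The sketch below uses the standard textbook definitions, which may differ by constant factors from the exact variants in Daille's paper. The cell names are assumptions: `a` counts pairs containing both words, `b` counts w1 without w2, `c` counts w2 without w1, and `d` counts pairs with neither.

```python
import math


def pmi(a, b, c, d):
    """Pointwise mutual information (base 2) of the pair."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + b) * (a + c)))


def phi_squared(a, b, c, d):
    """Phi-squared coefficient; equals the chi-square statistic divided by N."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denom


def log_likelihood_ratio(a, b, c, d):
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) over the four cells."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:  # a cell with O = 0 contributes nothing (0 * ln 0 := 0)
                total += o * math.log(o / e)
    return 2.0 * total
```

For an independent table such as `(1, 9, 9, 81)` all three measures are zero; for a perfectly associated pair such as `(10, 0, 0, 90)`, Φ² reaches its maximum of 1.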

Page 15: Collocations and Terminology

Observations

• Compare with reference list

• Frequency a strong predictor

• Log-likelihood ratio works best

• Additional criteria:
  – diversity of the distribution of each word
  – distance between the two words (determines flexibility but not term status)

Page 16: Collocations and Terminology

Justeson and Katz

• Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995.

Page 17: Collocations and Terminology

Analysis

• Examined association measures

• Well-known problems:
  – eliminating general-language constructs (e.g., collocations)
  – what to do with single word terms?

Page 18: Collocations and Terminology

Observations

• Frequency works well

• But a stronger predictor is P(k>1) compared to P(k≥1) in the same document, where k is the number of times the candidate occurs

• Use syntactic patterns to propose terms, then check if they reappear in the same document

• Require this across multiple documents
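The propose-then-check-repetition idea can be sketched as below. This is a simplified illustration rather than Justeson and Katz's algorithm: the syntactic pattern is reduced to "a maximal run of adjectives and nouns that ends in a noun and has at least two words", the input is assumed to be already tagged with coarse tags (`"A"`, `"N"`, anything else), and all function names and thresholds are hypothetical.

```python
from collections import Counter


def candidate_terms(tagged):
    """Yield maximal adjective/noun runs that end in a noun and have >= 2 words."""
    run = []
    # A sentinel non-A/N token flushes a run that reaches the end of the text.
    for word, tag in list(tagged) + [("", "O")]:
        if tag in ("A", "N"):
            run.append((word.lower(), tag))
            continue
        while run and run[-1][1] != "N":  # trim trailing adjectives
            run.pop()
        if len(run) >= 2:
            yield " ".join(w for w, _ in run)
        run = []


def repeated_terms(tagged, min_count=2):
    """Candidates that reappear within one document (k > 1 for min_count=2)."""
    counts = Counter(candidate_terms(tagged))
    return {term: k for term, k in counts.items() if k >= min_count}


def terms_across_documents(docs, min_docs=2):
    """Candidates that are repeated within at least `min_docs` documents."""
    doc_freq = Counter()
    for tagged in docs:
        doc_freq.update(repeated_terms(tagged).keys())
    return {term for term, n in doc_freq.items() if n >= min_docs}


doc = [
    ("The", "O"), ("central", "A"), ("processing", "N"), ("unit", "N"),
    ("is", "O"), ("fast", "A"), (".", "O"),
    ("A", "O"), ("central", "A"), ("processing", "N"), ("unit", "N"),
    ("runs", "O"), ("hot", "A"), ("code", "N"), (".", "O"),
]
repeated = repeated_terms(doc)
```

In this toy document "central processing unit" occurs twice and survives, whereas "hot code" matches the pattern but occurs only once and is dropped.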

Page 19: Collocations and Terminology

Term Expansion

• Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997.

• Need to expand a given list of terms, especially for scientific domains

Page 20: Collocations and Terminology

Term variation

• Syntactic (same words, different structure)

• Morphosyntactic (derivational forms of words)

• Semantic (synonyms are used)

• In IR, normalization through stemming and removal of stop words

Page 21: Collocations and Terminology

Approach

• Process corpus matching new candidate terms to old ones via unification

• Matching based on
  – inflectional morphology (transducer)
  – derivational morphology (rule-based)
  – syntactic transformations
  – additions of words

Page 22: Collocations and Terminology

Results

• Manual inspection of several thousand proposed terms

• Precision of 89%

• Effectiveness in indexing increases by a factor of three when using the variants (P/R from 99.7/72 to 97/93)