Collocations and Terminology

Vasileios Hatzivassiloglou

University of Texas at Dallas

Collocations

• Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993

• Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning

• Technical and non-technical

Examples of collocations

• The Dow Jones average of industrials

• The Dow average

• The Dow industrials

• *The Jones industrials

• The Dow Jones industrial

• *The industrial Dow

• *The Dow industrial

Collocation properties

• Arbitrary (dialect dependent)

  – ride a bike, set the table

• Domain dependent

  – dry suit, wet suit

• Recurrent

• Cohesive

  – part of a collocation primes the rest

Applications

• Lexicography

• Grammatical restrictions (compare with/to but associate with)

• Generation

• Translation

Types of collocations

• Predicative relations

  – make a decision, hostile takeover

  – flexible (syntactic variability, intervening words)

• Rigid word groups

  – over the counter market

• Phrases with open slots

  – fluency in a domain

Issues in finding collocations

• Possibly more than two words

  – need a measure that extends beyond the binary case

• Possibly intervening words

• Possibly morphological and syntactic variation

• Semantic constraints (cf. doctors-dentists and doctors-hospitals)

Xtract stage one

• For a given word, find all collocates at positions -5 to +5

• Three criteria:

  – strength (normalized frequency): 95% of candidates are rejected, vs. the 68% expected under a normal distribution

  – the position histogram must not be flat

  – select peaks from the histogram
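
The three stage-one criteria can be sketched as below. This is a minimal illustration, not Xtract itself: the function names, the toy tokenized input, and the thresholds `k0`, `u0`, and `k1` are assumptions made for the example. Strength is computed as a z-score of collocate frequency, flatness as the variance of the ten position counts, and peaks as positions well above the collocate's average count.

```python
from collections import defaultdict
from statistics import mean, pstdev

WINDOW = 5  # collocates are collected at offsets -5..-1 and +1..+5


def position_histograms(sentences, target):
    """Map each collocate of `target` to a dict {offset: count}."""
    hist = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            for off in range(-WINDOW, WINDOW + 1):
                j = i + off
                if off != 0 and 0 <= j < len(tokens):
                    hist[tokens[j]][off] += 1
    return hist


def stage_one(sentences, target, k0=1.0, u0=10.0, k1=1.0):
    """Return [(collocate, [peak offsets])]; thresholds are illustrative."""
    hist = position_histograms(sentences, target)
    freqs = {w: sum(h.values()) for w, h in hist.items()}
    if len(freqs) < 2:
        return []
    mu, sigma = mean(freqs.values()), pstdev(freqs.values())
    if sigma == 0:
        return []
    offsets = [o for o in range(-WINDOW, WINDOW + 1) if o != 0]
    results = []
    for w, h in hist.items():
        strength = (freqs[w] - mu) / sigma           # normalized frequency
        counts = [h.get(o, 0) for o in offsets]
        avg = mean(counts)
        spread = sum((c - avg) ** 2 for c in counts) / len(counts)
        if strength < k0 or spread < u0:             # too weak or too flat
            continue
        cutoff = avg + k1 * spread ** 0.5
        peaks = [o for o, c in zip(offsets, counts) if c >= cutoff]
        if peaks:
            results.append((w, peaks))
    return results
```

On a toy corpus where "hostile" almost always sits immediately before "takeover", the sketch keeps only that collocate, with its peak at offset -1.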

Xtract stage two

• Start from word pairs

• Look at each position in between, to the left, and to the right

• Keep words that appear very often

• If that fails, keep parts of speech that satisfy this criterion
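
A rough sketch of this expansion step follows. It assumes the input is already POS-tagged as lists of `(word, tag)` pairs and that "very often" means a relative frequency of at least `threshold`; both the 0.75 default and the `<TAG>` notation for part-of-speech slots are choices made for this example.

```python
from collections import Counter


def stage_two(tagged_sentences, w1, w2, offset, span=3, threshold=0.75):
    """Build a template around the pair (w1, w2), where w2 sits at w1 + offset.

    Each slot holds a word if one word dominates that position, a tag
    written as "<TAG>" if only a part of speech dominates, else "*".
    """
    contexts = []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        for i, w in enumerate(words):
            j = i + offset
            if w == w1 and 0 <= j < len(words) and words[j] == w2:
                contexts.append((sent, i))
    if not contexts:
        return []

    lo = min(0, offset) - span   # positions to the left of the pair
    hi = max(0, offset) + span   # positions to the right of the pair
    template = []
    for rel in range(lo, hi + 1):
        word_counts, pos_counts = Counter(), Counter()
        for sent, i in contexts:
            k = i + rel
            if 0 <= k < len(sent):
                word_counts[sent[k][0]] += 1
                pos_counts[sent[k][1]] += 1
        slot = None
        # Try the word first; fall back to the part of speech.
        for counts, fmt in ((word_counts, "{}"), (pos_counts, "<{}>")):
            if counts:
                item, n = counts.most_common(1)[0]
                if n / len(contexts) >= threshold:
                    slot = fmt.format(item)
                    break
        template.append(slot)

    # Drop unfilled slots at either edge; mark inner gaps with "*".
    while template and template[0] is None:
        template.pop(0)
    while template and template[-1] is None:
        template.pop()
    return [s if s is not None else "*" for s in template]
```

For the pair Dow / average at distance 3, sentences such as "the Dow Jones industrial average rose" and "the Dow Jones industrial average fell" yield a template in which the verb position is kept only as a part-of-speech slot.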

Xtract stage three

• Applied to pairs of words

• Requires (partial) parsing

• Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)
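
The consistency filter can be illustrated as follows, assuming a parser has already labeled each occurrence of a pair with a syntactic relation. The input layout, the relation names, and the `min_share` threshold are hypothetical and chosen only for this sketch.

```python
from collections import Counter


def stage_three(pair_relations, min_share=0.75):
    """Keep pairs whose occurrences mostly share one syntactic relation.

    pair_relations maps (word1, word2) to a list of relation labels,
    one per parsed occurrence of the pair.
    """
    kept = {}
    for pair, labels in pair_relations.items():
        if not labels:
            continue
        label, n = Counter(labels).most_common(1)[0]
        if n / len(labels) >= min_share:
            kept[pair] = label
    return kept
```

A pair such as make / decision, parsed as verb-object in nearly every occurrence, survives; a pair whose relation changes from sentence to sentence is dropped.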

Evaluation

• Ask lexicographer to evaluate output

• 40% precision after stages one and two

• 80% precision after stage three

• 94% conditional recall

Terminology

• Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994

• Terms refer to concepts

• Terms are key for populating a domain ontology

• Terms are typically nominal compounds of certain structure, e.g., NN, N of N

Defining terms

• Unique reference

• Unique translation

• Term extension by

  – modification (e.g., addition of an adjective)

  – substitution

  – extension of structure

  – coordination

Algorithm

• Apply syntactic constraints to match pairs of words in a candidate term

• Filter by application of an association measure

• Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
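
The three measures can be computed from a 2×2 contingency table for a candidate pair (w1, w2). The sketch below uses the standard textbook definitions, which may differ in detail from the formulas in the paper; the cell names `a`, `b`, `c`, `d` and the function name are assumptions of this example. Here `a` counts pairs with both words, `b` pairs with w1 but not w2, `c` pairs with w2 but not w1, and `d` pairs with neither.

```python
import math


def association_scores(a, b, c, d):
    """Return PMI, phi-squared, and the log-likelihood ratio (G^2)."""
    n = a + b + c + d
    r1, r2 = a + b, c + d        # row totals: w1 present / absent
    c1, c2 = a + c, b + d        # column totals: w2 present / absent
    if r1 * r2 * c1 * c2 == 0:
        raise ValueError("every row and column total must be non-zero")

    # Pointwise mutual information: log of observed over expected joint count.
    pmi = math.log2(a * n / (r1 * c1)) if a > 0 else float("-inf")

    # Phi-squared: chi-square statistic divided by the sample size.
    phi2 = (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

    # Log-likelihood ratio: 2 * sum(observed * ln(observed / expected)).
    g = 0.0
    for obs, exp in ((a, r1 * c1 / n), (b, r1 * c2 / n),
                     (c, r2 * c1 / n), (d, r2 * c2 / n)):
        if obs > 0:
            g += obs * math.log(obs / exp)
    return {"pmi": pmi, "phi2": phi2, "llr": 2 * g}
```

All three scores are zero when the two words are independent. They rank candidates differently because PMI ignores how much evidence supports the estimate, whereas the log-likelihood ratio grows with the counts.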

Observations

• Compare with reference list

• Frequency a strong predictor

• Log-likelihood ratio works best

• Additional criteria:

  – diversity of the distribution of each word

  – distance between the two words (determines flexibility but not term status)

Justeson and Katz

• Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995.

Analysis

• Examined association measures

• Well-known problems:

  – eliminating general-language constructs (e.g., collocations)

  – what to do with single-word terms?

Observations

• Frequency works well

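• But a stronger predictor is repetition: the probability that a candidate occurs more than once, P(k>1), compared to the probability that it occurs at all, P(k≥1), in the same document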

• Use syntactic patterns to propose terms, then check if they reappear in the same document

• Require this across multiple documents
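
A small sketch of the propose-then-check-repetition idea follows. It makes several assumptions: documents arrive as lists of `(word, tag)` pairs with Penn Treebank style tags, the noun-phrase pattern is the adjective/noun/preposition filter usually associated with the paper, and the `min_repeats` and `min_docs` thresholds are illustrative.

```python
import re
from collections import Counter

# A = adjective, N = noun, P = preposition; a candidate must end in a noun.
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def coarse(tag):
    """Collapse a Penn Treebank style tag to A, N, P, or X (other)."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "A"
    if tag == "IN":
        return "P"
    return "X"


def candidates(tagged_doc, max_len=4):
    """Count multi-word strings in one document that match the pattern."""
    words = [w.lower() for w, _ in tagged_doc]
    tags = "".join(coarse(t) for _, t in tagged_doc)
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            if TERM_PATTERN.fullmatch(tags[i:i + n]):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def repeated_terms(tagged_docs, min_repeats=2, min_docs=2):
    """Keep candidates that repeat within a document, in several documents."""
    doc_support = Counter()
    for doc in tagged_docs:
        for term, n in candidates(doc).items():
            if n >= min_repeats:
                doc_support[term] += 1
    return {t for t, k in doc_support.items() if k >= min_docs}
```

Note that sub-strings of a longer term (for example "processing unit" inside "central processing unit") also pass this filter; a fuller implementation would need a policy for nested candidates.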

Term Expansion

• Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997.

• Need to expand a given list of terms, especially for scientific domains

Term variation

• Syntactic (same words, different structure)

• Morphosyntactic (derivational forms of words)

• Semantic (synonyms are used)

• In IR, normalization through stemming and removal of stop words

Approach

• Process corpus matching new candidate terms to old ones via unification

• Matching based on

  – inflectional morphology (transducer)

  – derivational morphology (rule-based)

  – syntactic transformations

  – additions of words
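
To give a feel for what variant matching has to handle, here is a toy matcher. It is a deliberate simplification and not the unification-based method of the paper: it uses a crude plural stripper in place of real morphology, a small hypothetical stop list, and a word window that tolerates reordering and a few inserted words. All names and thresholds are assumptions.

```python
STOP = {"of", "the", "a", "an", "in", "for", "and"}


def stem(word):
    """Very crude plural stripper; a placeholder for real morphology."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def find_variants(term, tokens, max_extra=2):
    """Find spans of `tokens` containing every word of `term`, in any
    order, with at most `max_extra` additional content words."""
    term_stems = {stem(w) for w in term.split()}
    n = len(term_stems)
    # Positions and stems of content (non-stop) words only.
    content = [(i, stem(t)) for i, t in enumerate(tokens)
               if t.lower() not in STOP]
    found = []
    for start in range(len(content)):
        last_end = min(start + n + max_extra, len(content))
        for end in range(start + n, last_end + 1):
            window = content[start:end]
            stems = [s for _, s in window]
            # The span must start and end on a word of the term.
            if (term_stems <= set(stems)
                    and stems[0] in term_stems
                    and stems[-1] in term_stems):
                first, last = window[0][0], window[-1][0]
                found.append(" ".join(tokens[first:last + 1]))
                break
    return found
```

With the term "blood cell", this matches both "cells of blood" (a syntactic variant) and "blood mononuclear cells" (an added word). Derivational and semantic variants are out of reach for such a simple window.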

Results

• Manual inspection of several thousand proposed terms

• Precision of 89%

• Effectiveness in indexing increases by a factor of three when using the variants (P/R from 99.7/72 to 97/93)
