Collocations and Terminology
Vasileios Hatzivassiloglou
University of Texas at Dallas
Collocations
• Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993
• Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning
• Technical and non-technical
Examples of collocations
• The Dow Jones average of industrials
• The Dow average
• The Dow industrials
• *The Jones industrials
• The Dow Jones industrial
• *The industrial Dow
• *The Dow industrial
Collocation properties
• Arbitrary (dialect dependent)
  – ride a bike, set the table
• Domain dependent
  – dry suit, wet suit
• Recurrent
• Cohesive
  – One part of a collocation primes the rest
Applications
• Lexicography
• Grammatical restrictions (compare with/to but associate with)
• Generation
• Translation
Types of collocations
• Predicative relations
  – make a decision, hostile takeover
  – flexible (syntactic variability, intervening words)
• Rigid word groups
  – over the counter market
• Phrases with open slots
  – fluency in a domain
Issues in finding collocations
• Possibly more than two words
  – Need measure that extends beyond the binary case
• Possibly intervening words
• Possibly morphological and syntactic variation
• Semantic constraints (cf. doctors-dentists and doctors-hospitals)
Xtract stage one
• For a given word, find all collocates at positions -5 to +5
• Three criteria:
  – strength (normalized frequency); 95% rejection vs. expected 68% under normal distribution
  – position histogram must not be flat
  – select peak from histogram
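The three criteria can be illustrated with a small Python sketch. It is not Smadja's implementation: the function name `stage_one`, the input format (tokenized sentences), and the default thresholds `k0`, `u0`, and `k1` are assumptions chosen for illustration. Strength is computed as a z-score over collocate frequencies, flatness as the variance of the 10-position histogram, and peaks as positions whose count is at least `k1` standard deviations above the histogram mean.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def stage_one(sentences, target, window=5, k0=1.0, u0=10.0, k1=1.0):
    """Return (collocate, strength, peak_positions) for one target word.

    sentences: list of token lists. Thresholds k0, u0, k1 are illustrative.
    """
    # collocate -> histogram of relative positions (-window..+window, no 0)
    hist = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            for offset in range(-window, window + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    hist[tokens[j]][offset] += 1

    freqs = {w: sum(c.values()) for w, c in hist.items()}
    if len(freqs) < 2:
        return []
    mu, sigma = mean(freqs.values()), pstdev(freqs.values())
    if sigma == 0:
        return []

    positions = [o for o in range(-window, window + 1) if o != 0]
    results = []
    for w, counts in hist.items():
        # Criterion 1: strength, a z-score of the collocate's frequency.
        strength = (freqs[w] - mu) / sigma
        # Criterion 2: spread, the variance of the position histogram;
        # a flat histogram has low variance.
        per_pos = [counts.get(o, 0) for o in positions]
        avg = mean(per_pos)
        spread = sum((c - avg) ** 2 for c in per_pos) / len(per_pos)
        if strength < k0 or spread < u0:
            continue
        # Criterion 3: keep only the peak positions of the histogram.
        cutoff = avg + k1 * spread ** 0.5
        peaks = [o for o, c in zip(positions, per_pos) if c >= cutoff]
        results.append((w, round(strength, 2), peaks))
    return results
```

A collocate that always appears two words to the left of the target would come back with the single peak position `-2`, while a word scattered evenly across the window would be dropped by the spread test.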
Xtract stage two
• Start from word pairs
• Look at each position in between, to the left, and to the right
• Keep words that appear very often
• If that fails, keep parts of speech that satisfy this criterion
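A minimal sketch of this position-by-position filtering, assuming the contexts of a word pair have already been aligned into equal-length rows of (word, part-of-speech) tuples. The function name `stage_two`, the 0.75 threshold, and the `*` placeholder for positions where neither test passes are illustrative assumptions, not details taken from the slides.

```python
from collections import Counter


def stage_two(contexts, threshold=0.75):
    """Build a template from aligned contexts of a word pair.

    contexts: list of equal-length lists of (word, pos_tag) tuples.
    """
    template = []
    for column in zip(*contexts):  # one column per aligned position
        words = Counter(w for w, _ in column)
        tags = Counter(t for _, t in column)
        top_word, word_count = words.most_common(1)[0]
        top_tag, tag_count = tags.most_common(1)[0]
        if word_count / len(column) >= threshold:
            template.append(top_word)   # the word itself is frequent enough
        elif tag_count / len(column) >= threshold:
            template.append(top_tag)    # fall back to the part of speech
        else:
            template.append("*")        # nothing consistent at this position
    return template
```

With four contexts such as "the Dow average rose / fell / rose / gained", the verb position has no dominant word but a consistent tag, so the template keeps the tag there and the literal words elsewhere.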
Xtract stage three
• Applied to pairs of words
• Requires (partial) parsing
• Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)
Evaluation
• Ask lexicographer to evaluate output
• 40% precision after stages one and two
• 80% precision after stage three
• 94% conditional recall
Terminology
• Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994
• Terms refer to concepts
• Terms are key for populating a domain ontology
• Terms are typically nominal compounds of certain structure, e.g., NN, N of N
Defining terms
• Unique reference
• Unique translation
• Term extension by
  – modification (e.g., addition of an adjective)
  – substitution
  – extension of structure
  – coordination
Algorithm
• Apply syntactic constraints to match pairs of words in a candidate term
• Filter by application of an association measure
• Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
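The three measures can all be computed from a 2×2 contingency table for a candidate pair (w1, w2). The sketch below uses the standard textbook definitions, which may differ in constant factors from the exact formulas in the paper; the function name `association_scores` and the cell names `a`, `b`, `c`, `d` are assumptions. Here `a` counts pairs with both words, `b` pairs with w1 but not w2, `c` pairs with w2 but not w1, and `d` pairs with neither. All four marginal totals are assumed to be non-zero.

```python
import math


def association_scores(a, b, c, d):
    """Association measures from a 2x2 contingency table of pair counts."""
    n = a + b + c + d
    row1, row2 = a + b, c + d   # w1 present / absent
    col1, col2 = a + c, b + d   # w2 present / absent

    # Pointwise mutual information: log2 of observed vs. expected co-occurrence.
    pmi = math.log2(a * n / (row1 * col1)) if a > 0 else float("-inf")

    # Phi-squared: chi-square for the 2x2 table divided by n.
    phi2 = (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

    # Log-likelihood ratio (G^2): 2 * sum of O * ln(O / E) over non-empty cells.
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    llr = 2 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected) if o > 0)

    return {"pmi": pmi, "phi2": phi2, "llr": llr}
```

For a table where the two words are statistically independent (for example `a=1, b=9, c=9, d=81`), all three scores are zero; candidates would then be ranked by how far each score rises above that baseline.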
Observations
• Compare with reference list
• Frequency a strong predictor
• Log-likelihood ratio works best
• Additional criteria:
  – diversity of the distribution of each word
  – distance between the two words (determines flexibility but not term status)
Justeson and Katz
• Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995.
Analysis
• Examined association measures
• Well-known problems:
  – eliminating general-language constructs (e.g., collocations)
  – what to do with single-word terms?
Observations
• Frequency works well
• But a stronger predictor is repetition within a document: P(k>1) compared to P(k≥1), where k is the number of times a candidate occurs in that document
• Use syntactic patterns to propose terms, then check if they reappear in the same document
• Require this across multiple documents
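The propose-then-check procedure can be sketched as follows. This is an illustration under several assumptions rather than the published algorithm: the input is already part-of-speech tagged with single-letter tags (`A` adjective, `N` noun, `P` preposition, `O` other), the pattern `[AN]+N` is a simplified stand-in for the paper's syntactic filter, only maximal matches are counted, and the names `candidates`, `repeated_terms`, `min_repeats`, and `min_docs` are hypothetical.

```python
import re
from collections import Counter

# Simplified pattern (assumption): one or more adjectives/nouns, then a noun.
PATTERN = re.compile(r"[AN]+N")


def candidates(tagged):
    """Return candidate terms from a list of (word, tag) pairs.

    Tags must be single characters so that regex offsets over the tag
    string line up with token indices.
    """
    tags = "".join(tag for _, tag in tagged)
    return [
        " ".join(word.lower() for word, _ in tagged[m.start():m.end()])
        for m in PATTERN.finditer(tags)
    ]


def repeated_terms(docs, min_repeats=2, min_docs=2):
    """Keep candidates that recur within a document, in enough documents."""
    doc_hits = Counter()
    for tagged in docs:
        counts = Counter(candidates(tagged))
        for term, k in counts.items():
            if k >= min_repeats:        # the term reappears in this document
                doc_hits[term] += 1
    return sorted(t for t, n in doc_hits.items() if n >= min_docs)
```

A candidate that matches the pattern but occurs only once per document is discarded, which is the point of the repetition test: frequency inside a single document, not just across the corpus, separates terms from incidental noun phrases.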
Term Expansion
• Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997.
• Need to expand a given list of terms, especially for scientific domains
Term variation
• Syntactic (same words, different structure)
• Morphosyntactic (derivational forms of words)
• Semantic (synonyms are used)
• In IR, normalization through stemming and removal of stop words
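As a rough illustration of the IR-style normalization in the last bullet, the sketch below maps simple syntactic variants of a term to the same key. Everything in it is an assumption made for the example: the tiny stop-word list, the toy plural-stripping rule in `crude_stem` (not a real stemmer), and the choice to ignore word order by sorting the remaining stems.

```python
# Tiny illustrative stop-word list (assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the"}


def crude_stem(word):
    """Toy stemming rule: strip a plural 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(term):
    """Lowercase, drop stop words, stem, and ignore word order."""
    tokens = term.lower().split()
    return tuple(sorted(crude_stem(t) for t in tokens if t not in STOP_WORDS))
```

Under these rules `normalize("Expansion of terms")` and `normalize("term expansion")` produce the same key, so the two variants would be conflated at indexing time. Morphosyntactic variants (derivational forms) and semantic variants (synonyms) are not handled by this kind of normalization.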
Approach
• Process corpus matching new candidate terms to old ones via unification
• Matching based on
  – inflectional morphology (transducer)
  – derivational morphology (rule-based)
  – syntactic transformations
  – additions of words
Results
• Manual inspection of several thousand proposed terms
• Precision of 89%
• Effectiveness in indexing increases by a factor of three when using the variants (P/R from 99.7/72 to 97/93)