KWIC corpora as a source of specialized definitional information: a pilot study
Antonio San Martín
University of Granada, Spain
Motivation: definition writing
http://ecolexicon.ugr.es
• Definitions in other resources
• Corpus analysis
What should I include in my definitions?
Assumption
The lexical units that normally co-occur with a given lexical unit are potentially important for defining it.
Hypothesis
Corpus of KWIC (Key Word In Context)
concordances of the concept to define
Term list: potentially definitional terms for the concept to define
2.1. Reference list
- Term list generated with TermoStat Web 3.0 (Drouin 2003): most frequent nouns, noun phrases and adjectives (+4 occurrences)
- Source: English corpus of 133 specialized definitions of MAGMA.
- To minimize interference from terminological variation, terms in the reference list were categorized according to the conceptual proposition established with MAGMA.
- Any categorization has a certain degree of subjectivity. The configuration of our reference list is the result of certain choices.
Conceptual proposition | Instances from the list generated by TermoStat
magma is a rock | rock (163), molten rock (79), rock material (17), molten rock material (10), liquid rock (4)
magma is a material | material (37), rock material (17), molten rock material (10), molten material (8)
magma is (a) liquid / magma is a fluid | liquid (13), fluid (6), liquid rock (4)
magma is a mixture / magma is made of a mixture | mixture (6)
magma is molten | molten (105), molten rock (79), molten rock material (10), molten material (8), molten state (4)
magma is hot | hot (18), temperature (6)
magma is mobile | mobile (6)
magma contains gas/bubbles | gas (25), bubble (4)
magma contains crystals | crystal (24)
magma contains silicate | silicate (9)
magma contains volatiles | volatile (4)
magma contains minerals | mineral (4)
magma undergoes solidification | solidification (6), solid (5)
magma undergoes (partial) melting | melting (7), partial melting (6)
magma causes intrusion | intrusion (7)
magma causes extrusion | extrusion (6)
magma becomes igneous rock / magma is the raw material of igneous rocks | igneous (40), igneous rock (37), raw material (4)
magma becomes lava | lava (38)
magma is found under the Earth’s or a planet’s surface | earth (98), surface (63), planet (5), deep (6), depth (4), underground (5)
magma is found deep in the Earth / at depth | deep (6), depth (4)
magma is found in the (Earth’s) crust | crust (33)
magma is found in the upper part of the (Earth’s) mantle | mantle (20), upper (5)
magma is erupted from a volcano | volcano (7), volcanic (7)
2.2. Analysis lists
- An English corpus of environmental texts (PANACEA corpus + LexiCon corpus), containing 359 occurrences of MAGMA.
- Wordsmith Tools (Scott 2008) to generate KWIC concordance lines:
100c MAGMA 100c
250c MAGMA 250c
500c MAGMA 500c
750c MAGMA 750c
Sentences
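The character-window idea behind these corpora can be sketched in a few lines of Python. This is only an illustration, not the WordSmith Tools procedure used in the study: the function name `kwic_windows`, the sample sentence, and the case-insensitive whole-word matching are all assumptions.

```python
import re

def kwic_windows(text, keyword, width):
    """Return one context string per occurrence of `keyword`,
    keeping up to `width` characters on each side of the match."""
    # Assumption: whole-word, case-insensitive matching.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    windows = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)      # clip at start of text
        end = min(len(text), match.end() + width)  # clip at end of text
        windows.append(text[start:end])
    return windows

# Hypothetical sample text; the study used 100, 250, 500 and 750 characters.
sample = ("Molten rock below the surface is called magma. "
          "When magma erupts it becomes lava.")
contexts = kwic_windows(sample, "magma", 20)
for line in contexts:
    print(line)
```

Joining the windows for a given width gives one small corpus per context size, which can then be passed to a term extractor.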
- Each corpus was fed into TermoStat in order to obtain the most frequent nouns, noun phrases, and adjectives.
- The 50 and 100 terms with the highest raw frequency were retained for comparison with the reference list.
- Analysis lists:
50-term 100c, 50-term 250c, 50-term 500c, 50-term 750c, 50-term sentence
100-term 100c, 100-term 250c, 100-term 500c, 100-term 750c, 100-term sentence
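The cut-off step (keeping the N terms with the highest raw frequency) can be sketched as follows. The extraction itself was done with TermoStat and is not reproduced here. The helper `top_terms` and its alphabetical tie-breaking rule are assumptions, and the frequencies are borrowed from the reference-list table purely as sample input.

```python
def top_terms(term_frequencies, n):
    """Return the n terms with the highest raw frequency.

    Ties are broken alphabetically (an assumption; the source does not
    say how ties at the cut-off were handled).
    """
    ranked = sorted(term_frequencies.items(),
                    key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]

# Illustrative frequencies only.
frequencies = {
    "rock": 163,
    "molten": 105,
    "earth": 98,
    "molten rock": 79,
    "surface": 63,
}
shortlist = top_terms(frequencies, 3)
```

In the study, N was 50 and 100, applied to the output for each of the five corpora.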
2.3. Precision and recall
P = TP / (TP + FP)
R = TP / (TP + FN)
- TP (true positive): a term in the analysis list that matches any of the categories in the reference list. The result is expressed as a percentage.
- FP (false positive): a term in the analysis list that matches no category in the reference list. The result is expressed as a percentage.
- FN (false negative): a category in the reference list that is not matched by any of the terms in the analysis list. The result is expressed as a percentage.
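A minimal sketch of how these three quantities could be computed is given below. It follows one reading of the definitions above: TP and FP are percentages of the analysis list, and FN is a percentage of the reference categories. Matching by exact string membership is an assumption, since the classification in the study was manual. The function name and the toy data are hypothetical.

```python
def precision_recall(analysis_terms, reference_categories):
    """Compute (precision, recall) from percentage-based TP, FP and FN.

    reference_categories maps each conceptual proposition to the set of
    terms accepted as instances of it.
    """
    if not analysis_terms or not reference_categories:
        return 0.0, 0.0
    accepted = set().union(*reference_categories.values())
    matched_terms = [t for t in analysis_terms if t in accepted]
    matched_categories = [
        category for category, terms in reference_categories.items()
        if any(t in terms for t in analysis_terms)
    ]
    # TP and FP: share of the analysis list (in percent).
    tp = 100.0 * len(matched_terms) / len(analysis_terms)
    fp = 100.0 - tp
    # FN: share of reference categories left unmatched (in percent).
    fn = 100.0 * (len(reference_categories) - len(matched_categories)) \
        / len(reference_categories)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy subset of the reference list and a hypothetical analysis list.
reference = {
    "magma is molten": {"molten", "molten rock"},
    "magma is hot": {"hot", "temperature"},
    "magma becomes lava": {"lava"},
    "magma contains crystals": {"crystal"},
}
analysis = ["molten", "molten rock", "lava", "chamber", "eruption"]
p, r = precision_recall(analysis, reference)
```

Because several terms can fall under the same category, a list can score well on precision while still leaving categories uncovered, which is what FN captures.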
The lists were compared using the F2-measurement (Chinchor, 1992, 25), which gives recall twice the importance of precision. The formula used was the following:
F2 = (5 · P · R) / (4 · P + R)
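For reference, the general F-measure is F = ((1 + β²) · P · R) / (β² · P + R), and setting β = 2 gives the recall-weighted variant. The sketch below implements that general form. The function name `f_beta` and the sample values are illustrative and are not taken from the study's results.

```python
def f_beta(precision, recall, beta=2.0):
    """General F-measure; beta > 1 weights recall more than precision."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta_sq) * precision * recall / denominator

# Hypothetical precision and recall values.
score = f_beta(0.6, 6 / 11)
```

With β = 2 a list is penalised more for missing reference categories than for including terms that match none.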
3. Results
- The 100-term 250c list performed best (F2-M: 69.08 %). Its recall ratio was also the highest (78.28 %).
- The highest precision ratio corresponded to the 50-term 100c list, but its recall ratio was 12 points below that of the 100-term 250c list.
- The SC (sentence-based) list obtained a lower F2 score than any of the KWIC lists.
- Once the threshold of the 250-character context was exceeded, longer contexts caused both precision and recall to decrease.
Conclusions and future work
‣Although the scope of this pilot study was limited, results indicate that a 250-character KWIC corpus coupled with a 100-term list generated from it could be a useful tool for definition writing.
‣The inevitable bias caused by the use of a reference list based on a manual classification does not invalidate the results.
‣This initial pilot study will subsequently be expanded to include new variables:
‣ other kinds of definienda
‣ verbs and adverbs in the term lists
‣ corpora of different levels of specialization
‣ more KWIC corpora with different character counts
‣ comparison of the output of TermoStat with other term extractors as well as a simple keyword generator
‣Our ultimate objective is to combine our approach with the application of knowledge-pattern-based techniques (Pearson, 1998; Meyer, 2001; Malaisé et al., 2005; Marshman and L’Homme 2006; Auger and Barrière, 2008, inter alia) to create a system of semi-automatic definitional information extraction.
Thank you
http://lexicon.ugr.es/sanmartin