Finding Domain Terms using Wikipedia
Jorge Vivaldi Palatresi, Applied Linguistics Institute, Universitat Pompeu Fabra
Horacio Rodríguez Hontoria, TALP Research Center, Universitat Politécnica de Catalunya, [email protected]
Outline
• Introduction
• Related approaches
• Methodology
• Evaluation
• Conclusions and future work
Introduction
• Problem: to automatically extract terminological units from specialized texts
• Result: a list of all the Wikipedia (WP) categories and page titles that our system considers to belong to the domain of interest.
Related approaches
• Magnini et al., 2000
• Montoyo et al., 2001
• Missikoff et al., 2002
• Vivaldi, Rodríguez, 2002
• Vivaldi, Rodríguez, 2004
• Bernardini et al., 2006
• Cui et al., 2008
Graph structure of Wikipedia
[Figure: WP categories (A to G) form a graph linked to WP pages (P1 to P3). Other resources shown: redirection table, disambiguation pages, interwiki links, external links, InfoBox.]
Methodology: overview
[Figure: pipeline overview. Labels: domain, WP (Categories and Pages), top categories, domain categories, domain pages, filtering (twice), bootstrapping, final domain term set.]
Main steps:
1) Find the domain name as a category in WP.
2) Look for all the subcategories and pages related to the domain.
3) Extract all descendants of the domain name, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
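Steps 1 to 3 can be sketched as a breadth-first walk over the category graph. This is a minimal illustration, not the authors' implementation: the function name, the dictionary-based graph and the toy data are all assumptions, and loops are avoided with a plain visited set.

```python
from collections import deque

def collect_domain_descendants(top_category, subcategories, pages_of):
    """Walk the category graph breadth-first from the top category.

    `subcategories` maps a category to its direct subcategories and
    `pages_of` maps a category to its pages (both hypothetical inputs).
    The `seen` set keeps the walk from revisiting a category on a loop.
    """
    seen = {top_category}
    queue = deque([top_category])
    pages = set()
    while queue:
        cat = queue.popleft()
        pages.update(pages_of.get(cat, ()))
        for child in subcategories.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen, pages

# Toy graph containing a loop: A -> C -> F -> A
subcategories = {"A": ["C", "D"], "C": ["F"], "F": ["A", "G"]}
pages_of = {"C": ["P1"], "F": ["P2"], "G": ["P3"]}
cats, pages = collect_domain_descendants("A", subcategories, pages_of)
```

The walk terminates even though F points back to A, because A is already in the visited set when F is expanded.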
Methodology: filtering
• Category level
• Page level
Methodology: filtering
• Category level
[Figure: a category C under the top category of the domain (CatSet1). Its direct super-categories fall into three groups: those in CatSet1, those not in CatSet1, and neutral ones. The counts yield the category score.]
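Methodology: filtering
• Page level
[Figure: a page P among the pages of a category C under the top category of the domain (CatSet2). The categories of P fall into those in CatSet2, those not in CatSet2, and neutral categories. The counts yield the page score.]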
Methodology: category filtering
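CatSet1: set of descendant categories
CatSet2: set of filtered descendant categories
For each c in CatSet1, let S(c) be the set of direct super-categories of c, and
\[
n_1 = \#\{\, a \in S(c) : a \in \mathrm{CatSet1} \,\}, \qquad
n_2 = \#\{\, a \in S(c) : a \notin \mathrm{CatSet1} \,\}
\]
\[
\mathrm{CatSet2} = \{\, c \in \mathrm{CatSet1} : \text{accept if } (n_1 > n_2) \,\}
\]
(The comparison sign between n_1 and n_2 is lost in the source; > is assumed.)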
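A minimal sketch of category-level filtering, under stated assumptions: a category is kept when more of its direct super-categories lie inside the domain set than outside it, neutral super-categories are ignored, and the top category counts as inside the domain. The function name, the strict majority test and the toy data are hypothetical.

```python
def filter_categories(top, cat_set1, supercategories, neutral=frozenset()):
    """Return the categories of cat_set1 that pass the majority test.

    Assumptions (not confirmed by the slides): neutral super-categories
    are skipped, the top category counts as in-domain, and the test is
    a strict n1 > n2.
    """
    domain = set(cat_set1) | {top}
    cat_set2 = set()
    for c in cat_set1:
        supers = [a for a in supercategories.get(c, ()) if a not in neutral]
        n1 = sum(1 for a in supers if a in domain)  # in-domain parents
        n2 = len(supers) - n1                       # out-of-domain parents
        if n1 > n2:
            cat_set2.add(c)
    return cat_set2

# Toy data, for illustration only
supercategories = {
    "electrochemistry": ["chemistry", "1900s introductions"],
    "cheeses": ["chemistry", "dairy products", "foods"],
}
cat_set2 = filter_categories(
    top="chemistry",
    cat_set1={"electrochemistry", "cheeses"},
    supercategories=supercategories,
    neutral={"1900s introductions"},
)
```

Here "cheeses" is dropped because two of its three parents fall outside the domain.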
Methodology: page filtering
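termCats(dtc): set of categories assigned to a dtc
\[
\mathrm{PathToDomainCat}(c) =
\begin{cases}
1 & \text{if } c \in \mathrm{CatSet2} \\
0 & \text{if } c \notin \mathrm{CatSet2}
\end{cases}
\]
\[
\mathrm{WPDC}(dtc) =
\frac{\sum_{c \in \mathrm{termCats}(dtc)} \mathrm{PathToDomainCat}(c)}
     {\sum_{c \in \mathrm{termCats}(dtc)} \bigl(1 - \mathrm{PathToDomainCat}(c)\bigr)}
\]
(The source shows two sums over termCats; the complement in the denominator is inferred from the "semantics" example, where 1 in-domain category against 4 others gives 0.25.)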
Additional category filtering using pages scores:
catTerm: set of pages associated with a category
- MicroStrict: accept a category if the number of elements of catTerm with a positive score is greater than the number with a negative score.
- MicroLoose: same as MicroStrict, with a greater-or-equal test.
- Macro: instead of counting the pages with positive/negative scores, use the components of those scores.
Page filtering example: “semantics” (in Computing domain)
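[Figure: fragment of the WP category graph for Computing, showing theoretical computer science, software, software engineering and formal methods, with the page "semantics" attached to theoretical computer science.]
Categories of "semantics": {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}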
WPCD(semantics) = 0.25
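The value 0.25 can be reproduced with a short sketch. It assumes the score is the ratio of in-domain categories to the remaining categories of the page (1 against 4 here); a plain proportion would give 1/5 = 0.2 instead. The function name and the contents of the filtered category set are illustrative assumptions.

```python
def page_score(term_cats, cat_set2):
    """Return (inside, outside, ratio) for the categories of one page.

    Assumption: the score is inside / outside, which matches the
    "semantics" example; the slides do not spell the formula out legibly.
    """
    inside = sum(1 for c in term_cats if c in cat_set2)
    outside = len(term_cats) - inside
    ratio = inside / outside if outside else float("inf")
    return inside, outside, ratio

# Hypothetical filtered category set for the Computing domain
cat_set2 = {
    "theoretical computer science",
    "software",
    "software engineering",
    "formal methods",
}
semantics_cats = [
    "linguistics",
    "philosophy of language",
    "semiotics",
    "theoretical computer science",
    "philosophical logic",
]
score = page_score(semantics_cats, cat_set2)
```

Keeping the two components (inside, outside) alongside the ratio is convenient for a macro-style aggregation over the pages of a category.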
Category filtering example using pages score: “chemistry”
#  DTC                                MicroStrict  MicroLoose  Macro      Vote  Result
                                      ok    ko     ok    ko    ok    ko
1  electroquímica (electrochemistry)  13    5      16    2     36    12   +3    Accept
2  quesos (cheeses)                   0     8      6     2     8     12   -1    Reject
3  óxidos de carbono (carbon oxides)  1     1      2     0     4     3    +2    Accept
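One reading of the Vote column, sketched below, is that each scheme contributes +1 when ok exceeds ko, -1 when ko exceeds ok, and 0 on a tie, and that a category is accepted when the total is positive. This rule is an assumption; the slides do not state it, but it reproduces the three rows above.

```python
def sign(x):
    """Return 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def vote(micro_strict, micro_loose, macro):
    """Sum one vote per scheme from its (ok, ko) counts.

    Assumed rule: +1 if ok > ko, -1 if ok < ko, 0 on a tie.
    """
    return sum(sign(ok - ko) for ok, ko in (micro_strict, micro_loose, macro))

# (ok, ko) counts copied from the table, in the order
# MicroStrict, MicroLoose, Macro
rows = {
    "electrochemistry": ((13, 5), (16, 2), (36, 12)),
    "cheeses": ((0, 8), (6, 2), (8, 12)),
    "carbon oxides": ((1, 1), (2, 0), (4, 3)),
}
votes = {name: vote(*counts) for name, counts in rows.items()}
decisions = {name: "Accept" if v > 0 else "Reject" for name, v in votes.items()}
```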
Evaluation
• Partial evaluation: "chemistry" and "astronomy"
– Test against Magnini et al., 2000 (WordNet 1.6)
– Low coverage: 25% for Chemistry and 15% for Astronomy
• Full evaluation: "Medicine"
– Test against SNOMED-CT Spanish Edition (2009)
– Wide coverage of the clinical domain: 800K terms
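Testing against a reference resource amounts to measuring precision: the share of extracted terms that the reference also contains. The sketch below assumes exact matching after case folding, which is a simplification; the function name and the toy term lists are hypothetical and do not reflect the content of WordNet or SNOMED-CT.

```python
def precision(extracted, reference):
    """Fraction of extracted terms found in the reference term set.

    Matching is exact after case folding (an assumed simplification).
    """
    extracted = {t.casefold() for t in extracted}
    reference = {t.casefold() for t in reference}
    if not extracted:
        return 0.0
    return len(extracted & reference) / len(extracted)

# Toy data, for illustration only
extracted_terms = ["Aspirin", "fever", "Madrid", "hospital"]
reference_terms = {"aspirin", "fever"}
p = precision(extracted_terms, reference_terms)
```

A low-coverage reference lowers this figure even when the extracted terms are valid, which is why coverage is reported next to each resource.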
Partial evaluation
Domain                       Chemistry          Astronomy
Language                     EN       ES        EN       ES
Initial categories           188374   2070      188816   44631
#Categories after pruning    1334     557       790      143
Iteration #1
  Categories                 49       43        5        6
  Precision [%]              93.9     62.8      0        16.7
  Pages found (loose)        833      1038      284      119
  Pages found (strict)       580      700       284      81
  Prec. [%] (loose)          61.3     52.6      34.8     31.9
  Prec. [%] (strict)         62.7     56.6      37.2     27.2
[Figure: Chemistry. Precision (%, axis 50 to 70) by iteration (1 to 6). Series: EN-loose, EN-strict, ES-loose, ES-strict.]
[Figure: Astronomy. Precision (%, axis 20 to 50) by iteration (1 to 6). Series: EN-loose, EN-strict, ES-loose, ES-strict.]
Full evaluation
Evaluation using             WN       SNOMED-CT
Initial categories           2431
Categories after pruning     839
Iteration #1
  Categories                 174      394
  Precision [%]              27.6     54
  Pages (loose)              2091     4182
  Pages (strict)             1724     3492
  Prec. [%] (loose)          21.0     58
  Prec. [%] (strict)         23.2     62
[Figure: Medicina (Medicine). Precision (%, axis 10 to 70) by iteration (1 to 6). Series as listed in the legend: ES-loose-WN (listed twice), ES-loose-SNOMED, ES-strict-SNOMED.]
Validation issues
Accepts     Reject
whisky      oral cancer
cigar       renal colic
udder       phoniatrics
fire        surgical instruments
Conclusions
• Good results when evaluated against a specialised resource
• Term list filtering must be improved (e.g., eliminate proper names)
Future work
• Apply this method to other languages/domains
• Improve filtering using in/out links of selected pages
• Improve filtering by also using the page content
• Use this WP knowledge to improve a term extractor