Finding Domain Terms using Wikipedia
Jorge Vivaldi Palatresi, Applied Linguistics Institute, Universitat Pompeu Fabra
Horacio Rodríguez Hontoria, TALP Research Center, Universitat Politècnica de Catalunya

Page 1: Finding Domain Terms using Wikipedia

Finding Domain Terms using Wikipedia

Jorge Vivaldi Palatresi
Applied Linguistics Institute
Universitat Pompeu Fabra

Horacio Rodríguez Hontoria
TALP Research Center
Universitat Politècnica de Catalunya

Page 2: Finding Domain Terms using Wikipedia


Outline

• Introduction
• Related approaches
• Methodology
• Evaluation
• Conclusions and future work

Page 3: Finding Domain Terms using Wikipedia

Introduction

• Problem: to automatically extract terminological units from specialized texts

• Result: a list of all the WP categories and page titles that our system considers to belong to the domain of interest.

Page 4: Finding Domain Terms using Wikipedia


Related approaches

• Magnini et al., 2000
• Montoyo et al., 2001
• Missikoff et al., 2002
• Vivaldi, Rodríguez, 2002
• Vivaldi, Rodríguez, 2004
• Bernardini et al., 2006
• Cui et al., 2008

Page 5: Finding Domain Terms using Wikipedia

Graph structure of Wikipedia

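[Figure: Wikipedia as a graph. WP categories (nodes A to G) form a category graph; WP pages (P1 to P3) are attached to categories. Additional structures shown: redirection table, disambiguation pages, interwiki links, external links, InfoBox.]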

Page 6: Finding Domain Terms using Wikipedia

Methodology: overview

[Diagram: the domain name selects the top categories in WP; bootstrapping expands them into domain categories and domain pages; both sets are filtered to produce the final domain term set.]

Main steps:

1) Find the domain name as a category in WP.
2) Look for all the subcategories and pages related to the domain.
3) Extract all descendants of the domain name, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
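Step 3 can be sketched as a breadth-first traversal that remembers which categories it has already visited. This is a minimal sketch, not the authors' implementation: the mapping `subcats` (category to direct subcategories), the function name `collect_descendants`, and the toy category names are assumptions made for illustration.

```python
from collections import deque


def collect_descendants(subcats, top):
    """Return every category reachable from `top`, visiting each one once.

    subcats: dict mapping a category name to its direct subcategories
             (hypothetical in-memory view of the WP category graph).
    """
    seen = {top}
    queue = deque([top])
    while queue:
        cat = queue.popleft()
        for child in subcats.get(cat, ()):
            # Skipping already-seen categories is what breaks loops.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy graph (invented data) containing a loop:
# Chemistry -> Electrochemistry -> Chemistry
toy_graph = {
    "Chemistry": ["Electrochemistry", "Oxides"],
    "Electrochemistry": ["Chemistry", "Batteries"],
    "Oxides": ["Carbon oxides"],
}
descendants = collect_descendants(toy_graph, "Chemistry")
print(sorted(descendants))
```

The traversal terminates on cyclic category graphs because each category enters the queue at most once.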

Page 7: Finding Domain Terms using Wikipedia

Methodology: filtering

• Category level

• Page level

Page 8: Finding Domain Terms using Wikipedia

Methodology: filtering

• Category level

[Diagram: the top category of the domain spans CatSet1. For a category C in CatSet1, the category score is computed from its direct super-categories ∈ CatSet1, its direct super-categories ∉ CatSet1, and its direct neutral super-categories.]

Page 9: Finding Domain Terms using Wikipedia

Methodology: filtering

• Page level

[Diagram: the top category of the domain spans CatSet2. For a page P among the pages of a category C, the page score is computed from P's categories ∈ CatSet2, its categories ∉ CatSet2, and its neutral categories.]

Page 10: Finding Domain Terms using Wikipedia

Methodology: category filtering

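CatSet1: set of descendant categories

$\mathit{CatSet}^{a}_{c}$: set of direct supercategories of $c$

\[ n_1 = \#\{\, a \in \mathit{CatSet}^{a}_{c} : a \in \mathit{CatSet1} \,\} \]

\[ n_2 = \#\{\, a \in \mathit{CatSet}^{a}_{c} : a \notin \mathit{CatSet1} \,\} \]

\[ \mathit{CatSet2} = \{\, c \in \mathit{CatSet1} \mid \text{accept if } (n_1 \circ n_2) \,\} \]

Here $\circ$ stands for the comparison between $n_1$ and $n_2$; the operator itself is not legible in the source, and a majority test such as $n_1 > n_2$ is the natural reading.

CatSet2: set of filtered descendant categories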
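The category filter can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the comparison between n1 and n2 is passed in as a parameter and defaults to a strict majority (`operator.gt`), which is an assumption; neutral supercategories are simply ignored; the top category is always kept; and all names (`filter_categories`, `supercats`, the toy categories) are hypothetical.

```python
import operator


def filter_categories(cat_set1, supercats, top, neutral=frozenset(),
                      accept=operator.gt):
    """Build CatSet2 from CatSet1.

    For each category c:
      n1 = number of direct supercategories of c inside cat_set1
      n2 = number of direct supercategories of c outside cat_set1
    Neutral supercategories are left out of both counts (assumption).
    c is kept when accept(n1, n2) is true; the default n1 > n2 is an assumption.
    """
    kept = {top}  # assumption: the domain's top category is never filtered out
    for c in cat_set1 - {top}:
        parents = set(supercats.get(c, ())) - set(neutral)
        n1 = len(parents & cat_set1)
        n2 = len(parents - cat_set1)
        if accept(n1, n2):
            kept.add(c)
    return kept


# Toy data (invented for illustration).
cat_set1 = {"Chemistry", "Food chemistry", "Electrochemistry", "Cheeses"}
supercats = {
    "Food chemistry": ["Chemistry"],
    "Electrochemistry": ["Chemistry", "Hidden categories"],
    "Cheeses": ["Food chemistry", "Dairy products", "Foods"],
}
cat_set2 = filter_categories(cat_set1, supercats, top="Chemistry",
                             neutral={"Hidden categories"})
print(sorted(cat_set2))
```

With the default test, "Cheeses" is dropped because two of its three direct supercategories lie outside CatSet1; passing `accept=operator.ge` gives the looser variant.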

Page 11: Finding Domain Terms using Wikipedia

Methodology: page filtering

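$\mathit{termCats}_{dtc}$: set of categories assigned to a dtc

\[ \mathit{PathToDomainCat}(c) = \begin{cases} 1 & \text{if } c \in \mathit{CatSet2} \\ 0 & \text{if } c \notin \mathit{CatSet2} \end{cases} \]

\[ \mathit{WPDC}(dtc) = \frac{\sum_{c \in \mathit{termCats}_{dtc}} \mathit{PathToDomainCat}(c)}{\sum_{c \in \mathit{termCats}_{dtc}} \bigl(1 - \mathit{PathToDomainCat}(c)\bigr)} \]

(The denominator is reconstructed as the count of categories with no path to the domain; this reading reproduces the value 0.25 given for “semantics” in the example slide.)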

Additional category filtering using page scores:

catTerm: set of pages associated with a category

- MicroStrict: accept the category if the number of elements of catTerm with a positive score is greater than the number with a negative score.

- MicroLoose: same, with a greater-or-equal test.

- Macro: instead of counting the pages with positive/negative scores, use the components of those scores.

Page 12: Finding Domain Terms using Wikipedia

Page filtering example: “semantics” (in Computing domain)

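[Diagram: category paths linking the page “semantics” to the domain category Computing; nodes shown: theoretical computer science, software, software engineering, formal methods.]

Categories of “semantics”: {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}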

WPCD(semantics) = 0.25
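The value above can be reproduced with a short sketch. It assumes that the score is the ratio of categories with a path to the domain to categories without one, and that "theoretical computer science" is the only category of “semantics” inside CatSet2; both are assumptions inferred from this slide, and the function name `wpdc` and the handling of a zero denominator are illustrative choices.

```python
def wpdc(term_cats, cat_set2):
    """Ratio of in-domain to out-of-domain categories of a page (assumed reading)."""
    n_in = sum(1 for c in term_cats if c in cat_set2)
    n_out = len(term_cats) - n_in
    if n_out == 0:
        # Not covered by the slides: treat "all categories in domain" as unbounded.
        return float("inf") if n_in else 0.0
    return n_in / n_out


semantics_cats = [
    "linguistics",
    "philosophy of language",
    "semiotics",
    "theoretical computer science",
    "philosophical logic",
]
# Assumption: only this category has a path to the Computing top category.
computing_cat_set2 = {"theoretical computer science"}

score = wpdc(semantics_cats, computing_cat_set2)
print(score)  # 0.25 (one in-domain category against four out-of-domain ones)
```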

Page 13: Finding Domain Terms using Wikipedia

Category filtering example using pages score: “chemistry”

#  DTC                                  MicroStrict   MicroLoose   Macro       Vote   Result
                                        ok    ko      ok    ko     ok    ko
1  electroquímica (electrochemistry)    13    5       16    2      36    12    +3     Accept
2  quesos (cheeses)                     0     8       6     2      8     12    -1     Reject
3  óxidos de carbono (carbon oxides)    1     1       2     0      4     3     +2     Accept
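The three rows above are consistent with a simple voting scheme, sketched below. This reading is inferred from the table, not stated in the slides: each method contributes the sign of (ok - ko), the votes are summed, and a positive total means Accept. The `tally` helper likewise assumes that a page with equal in-domain and out-of-domain category counts is "ko" under MicroStrict and "ok" under MicroLoose, which matches the fact that both methods cover the same number of pages in every row. All function names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or +1."""
    return (x > 0) - (x < 0)


def tally(pages):
    """pages: list of (n_in, n_out) category counts, one pair per page of catTerm."""
    n = len(pages)
    strict_ok = sum(1 for n_in, n_out in pages if n_in > n_out)
    loose_ok = sum(1 for n_in, n_out in pages if n_in >= n_out)
    return {
        "MicroStrict": (strict_ok, n - strict_ok),
        "MicroLoose": (loose_ok, n - loose_ok),
        # Macro sums the score components instead of counting pages.
        "Macro": (sum(p[0] for p in pages), sum(p[1] for p in pages)),
    }


def vote(counts):
    """Sum of per-method signs; counts maps method name to (ok, ko)."""
    return sum(sign(ok - ko) for ok, ko in counts.values())


def decide(counts):
    return "Accept" if vote(counts) > 0 else "Reject"  # a zero vote rejects (assumption)


# (ok, ko) pairs copied from the table above.
table = {
    "electroquímica": {"MicroStrict": (13, 5), "MicroLoose": (16, 2), "Macro": (36, 12)},
    "quesos": {"MicroStrict": (0, 8), "MicroLoose": (6, 2), "Macro": (8, 12)},
    "óxidos de carbono": {"MicroStrict": (1, 1), "MicroLoose": (2, 0), "Macro": (4, 3)},
}
for name, counts in table.items():
    print(name, vote(counts), decide(counts))
```

Under this reading the tie in the MicroStrict column of row 3 contributes 0, which is why its vote is +2 rather than +3.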

Page 14: Finding Domain Terms using Wikipedia

Evaluation

• Partial evaluation: “chemistry” and “astronomy”
  – Test against Magnini et al., 2000 (WordNet 1.6)
  – Low coverage: 25% for Chemistry and 15% for Astronomy

• Full evaluation: “Medicine”
  – Test against SNOMED-CT Spanish Edition (2009)
  – Wide coverage of the clinical domain: 800K terms

Page 15: Finding Domain Terms using Wikipedia

Partial evaluation

Domain                            Chemistry           Astronomy
Language                          EN        ES        EN        ES
Initial categories                188374    2070      188816    44631
Categories after pruning          1334      557       790       143
Iteration #1
  Categories                      49        43        5         6
  Category precision [%]          93.9      62.8      0         16.7
  Pages found (loose)             833       1038      284       119
  Pages found (strict)            580       700       284       81
  Page precision [%] (loose)      61.3      52.6      34.8      31.9
  Page precision [%] (strict)     62.7      56.6      37.2      27.2

[Figure: precision vs. iteration (1 to 6), one panel per domain. Chemistry: precision axis 50 to 70. Astronomy: precision axis 20 to 50. Series in both panels: EN-loose, EN-strict, ES-loose, ES-strict.]

Page 16: Finding Domain Terms using Wikipedia

Full evaluation

Evaluation using                  WN        SNOMED-CT
Initial categories                2431
Categories after pruning          839
Iteration #1
  Categories                      174       394
  Category precision [%]          27.6      54
  Pages (loose)                   2091      4182
  Pages (strict)                  1724      3492
  Page precision [%] (loose)      21.0      58
  Page precision [%] (strict)     23.2      62

[Figure: precision vs. iteration (1 to 6) for Medicina (Medicine); precision axis 10 to 70. Legend entries: ES-loose-WN, ES-loose-WN [sic], ES-loose-SNOMED, ES-strict-SNOMED.]

Validation issues

Accepts        Reject
whisky         oral cancer
cigar          renal colic
udder          phoniatrics
fire           surgical instruments

Page 17: Finding Domain Terms using Wikipedia


Conclusions

• Good results when evaluated against a specialised resource

• Term list filtering must be improved (e.g., by eliminating proper names)

Page 18: Finding Domain Terms using Wikipedia


Future work

• Apply this method to other languages/domains

• Improve filtering using in/out links of selected pages

• Improve filtering by also using the page content

• Use this WP knowledge to improve a term extractor
