Darja Fišer, Senja Pollak, Špela Vintar University of Ljubljana, Dept. of Translation Studies {darja.fiser, spela.vintar}@guest.arnes.si, [email protected]

Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources


Page 1: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

Darja Fišer, Senja Pollak, Špela Vintar
University of Ljubljana, Dept. of Translation Studies
{darja.fiser, spela.vintar}@guest.arnes.si, [email protected]

Page 2: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

• Extract definitions of specialised concepts from texts (journals, textbooks etc.).
• Use Wikipedia to learn rules that help distinguish between proper definitions and non-definitions.
• Extract candidate sentences from texts using 3 approaches:
  ▪ patterns (A cell is the smallest living unit in an organism)
  ▪ automatic term recognition
  ▪ wordnet
• Apply rules to select good definitions and discard non-definitions.

LREC2010 Malta

Page 3: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

[Figure labels: title, definition, non-definition]


Page 4: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

• Slovene Wikipedia (December 2009): 162,500 articles
• only well-formed pages retained
• morphosyntactic annotation and lemmatization with ToTaLe (Erjavec et al. 2005)
• structural parsing: 19,964 instances
• building a classification model in Weka (Witten and Frank 2005)
• features: most frequent PoS and lemmata
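As a rough illustration of "most frequent PoS and lemmata" as features, the sketch below counts lemma and PoS occurrences in tagged sentences and turns one sentence into a feature vector, either as absolute counts or as binary flags. The function names, the (lemma, PoS) tuple layout and the one-letter tags are assumptions made for this example; they are not ToTaLe's output format or the authors' feature extraction code.

```python
from collections import Counter

def count_features(sentence):
    """Count lemma and PoS features in one sentence of (lemma, pos) pairs."""
    counts = Counter()
    for lemma, pos in sentence:
        counts[("lemma", lemma)] += 1
        counts[("pos", pos)] += 1
    return counts

def top_features(sentences, n):
    """Return the n most frequent features over a whole collection."""
    total = Counter()
    for sentence in sentences:
        total.update(count_features(sentence))
    return [feature for feature, _ in total.most_common(n)]

def vectorize(sentence, features, binary=False):
    """Map a sentence to one value per feature: a count, or 0/1 if binary."""
    counts = count_features(sentence)
    if binary:
        return [1 if counts[f] > 0 else 0 for f in features]
    return [counts[f] for f in features]

# Toy input: hypothetical one-letter tags, not real morphosyntactic descriptions.
sentence = [("celica", "N"), ("biti", "V"), ("enota", "N")]
features = [("pos", "N"), ("lemma", "biti"), ("lemma", "zemlja")]
print(vectorize(sentence, features))               # absolute frequencies
print(vectorize(sentence, features, binary=True))  # binary values
```

Vectors of this kind, together with a definition / non-definition label, are the sort of input a classifier such as a decision tree would be trained on.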


Page 5: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

• best: J48 decision tree classifier
• experimenting with full and merged PoS, absolute frequency (AF) and binary values
• 10-fold cross-validation

SETS        Instances  Attributes  NaiveBayes  J48     JRIP    PART
ORIG        19964      260         66.91%      82.13%  80.91%  82.56%
ORIG_bin    19964      260         73.85%      82.2%   80.6%   81.88%
MERGED      19964      188         62.64%      82.72%  81.68%  82.72%
MERGED_bin  19964      188         72.39%      82.44%  80.5%   81.79%
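For readers unfamiliar with 10-fold cross-validation, the following minimal sketch shows how 19,964 instances could be split into ten test folds, with each fold held out once while the remaining nine are used for training. It is a plain-Python illustration, not the Weka procedure used in the experiments; the function name is hypothetical, and a real run would normally also shuffle (and often stratify) the instances first.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_instances))
    base, extra = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(19964, k=10))
print([len(test) for _, test in folds])
```

Every instance is used for testing exactly once, and the reported figure is the accuracy over all ten held-out folds.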


Page 6: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

• “unstructured texts”: subset of the FidaPlus corpus (http://www.fidaplus.net)
• knowledge-rich: textbooks, popular science volumes (e.g. “All about mushrooms”)
• various domains: astronomy, physics, geography, botany ...
• sloWNet – Slovene wordnet (Fišer 2007, http://lojze.lugos.si/~darja/slownet.html)
• Automatic term recognition system for Slovene (Vintar 2004, http://lojze.lugos.si/cgitest/extract.cgi)


Page 7: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

The sentence is a definition candidate if:

the sentence starts with a sloWNet literal and contains at least one more literal from the same hyperonymy chain (i.e. its hyponym or its hypernym)

<term id=ENG20-13313485-n>Diabetes</term> je <term id=ENG20-13268088-n>bolezen</term>, ki je posledica pomanjkanja inzulina, hormona, ki skrbi, da celice v telesu dobivajo glukozo (sladkor).

[Diabetes is a disease resulting from insulin deficiency, the hormone providing glucose (sugar) for body cells.]
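The wordnet-based rule can be sketched as follows, under strong simplifying assumptions: the sentence is already lemmatized and lowercased, literals are single words, and the hypernym links are a tiny invented dictionary (including the "stanje" link) rather than the real sloWNet. All names here are hypothetical.

```python
# Toy hypernym links (child -> parent), invented for illustration only.
HYPERNYM = {"diabetes": "bolezen", "gripa": "bolezen", "bolezen": "stanje"}
LITERALS = set(HYPERNYM) | set(HYPERNYM.values())

def hypernym_chain(literal):
    """Return all ancestors of a literal by following the hypernym links."""
    chain = []
    while literal in HYPERNYM:
        literal = HYPERNYM[literal]
        chain.append(literal)
    return chain

def on_same_chain(a, b):
    """True if one literal is a (direct or indirect) hypernym of the other."""
    return b in hypernym_chain(a) or a in hypernym_chain(b)

def is_wordnet_candidate(lemmas):
    """Sentence must start with a literal and contain a second, related one."""
    if not lemmas or lemmas[0] not in LITERALS:
        return False
    first = lemmas[0]
    return any(
        other in LITERALS and on_same_chain(first, other)
        for other in lemmas[1:]
    )

print(is_wordnet_candidate(["diabetes", "biti", "bolezen", "ki", "biti", "posledica"]))
print(is_wordnet_candidate(["gripa", "biti", "diabetes"]))
```

In the second call both words are literals, but they are siblings rather than members of one hypernym chain, so the sentence is not accepted.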


Page 8: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

The sentence is a definition candidate if:

the sentence contains at least two domain-specific terms and the first term is in the nominative case

<term score="80.45">Ekvator</term> je najdaljši vzporednik, ki deli Zemljo na severno in <term score="43.21">južno poloblo</term>.

[The Equator is the longest circle of latitude, dividing the Earth into the Northern and the Southern Hemispheres.]
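A minimal sketch of the term-based rule, assuming each token already carries a case label from the morphosyntactic tagger and, where the term recognizer marked it as a term, a termhood score. The `Token` class, its field names and the case labels are hypothetical and do not reflect the actual output of the Slovene term extraction system.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Token:
    form: str                           # surface form (a term may span words)
    case: Optional[str] = None          # e.g. "nom", "acc"; None if caseless
    term_score: Optional[float] = None  # set only for recognized terms

def is_atr_candidate(tokens: List[Token], min_terms: int = 2) -> bool:
    """At least min_terms terms, and the first term is in the nominative."""
    terms = [t for t in tokens if t.term_score is not None]
    return len(terms) >= min_terms and terms[0].case == "nom"

sentence = [
    Token("Ekvator", case="nom", term_score=80.45),
    Token("je"),
    Token("najdaljši", case="nom"),
    Token("vzporednik", case="nom"),
    Token("južno poloblo", case="acc", term_score=43.21),
]
print(is_atr_candidate(sentence))
```

Because the rule checks only term count and case, it accepts many sentences that merely mention two terms, which is consistent with this approach producing the largest number of candidates.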


Page 9: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

The sentence is a definition candidate if:

the sentence contains a defining morphosyntactic pattern (NP[nominative] is_a NP[nominative]).

NP is_a NP

Celica je strukturna in funkcionalna enota vseh živih organizmov.

[A cell is a structural and functional unit of all living organisms.]
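The pattern rule can be approximated with a very small matcher over tagged tokens. The sketch assumes tokens of the form (lemma, PoS, case), hypothetical one-letter PoS tags (N noun, A adjective, C conjunction, V verb, P pronoun), and a deliberately crude notion of noun phrase; it only inspects the first occurrence of the copula "biti" and is not the pattern grammar used in the study.

```python
def np_prefix(tokens):
    """Longest prefix of adjectives, conjunctions and nouns that are
    nominative or carry no case (a crude stand-in for an NP chunker)."""
    prefix = []
    for token in tokens:
        _, pos, case = token
        if pos in ("A", "C", "N") and case in ("nom", None):
            prefix.append(token)
        else:
            break
    return prefix

def ends_in_nominative_noun(tokens):
    """True if the token list is non-empty and ends in a nominative noun."""
    return bool(tokens) and tokens[-1][1] == "N" and tokens[-1][2] == "nom"

def matches_is_a(tokens):
    """Check NP[nominative] + 'biti' + NP[nominative] at the first copula."""
    for i, (lemma, _, _) in enumerate(tokens):
        if lemma == "biti":
            left = tokens[:i]
            right = np_prefix(tokens[i + 1:])
            return (
                np_prefix(left) == left
                and ends_in_nominative_noun(left)
                and ends_in_nominative_noun(right)
            )
    return False

# "Celica je strukturna in funkcionalna enota vseh živih organizmov."
sentence = [
    ("celica", "N", "nom"), ("biti", "V", None),
    ("strukturen", "A", "nom"), ("in", "C", None),
    ("funkcionalen", "A", "nom"), ("enota", "N", "nom"),
    ("ves", "P", "gen"), ("živ", "A", "gen"), ("organizem", "N", "gen"),
]
print(matches_is_a(sentence))
```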


Page 10: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

                 Def. candidates  True definitions  Precision
sloWNet          104              41                0.39
ATR              629              118               0.19
Patterns         311              98                0.31
Total / Average  1044             257               0.29

• manual evaluation of all definition candidates

• sloWNet: best precision, ATR: best recall

• what is a definition??
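As a quick check of how the precision column relates to the counts, the sketch below recomputes true definitions divided by candidates for each method. The variable names are illustrative. The ratios come out at roughly 0.394, 0.188 and 0.315, which agrees with the reported values to within about 0.01. The bottom-row 0.29 appears to be close to the unweighted mean of the three precisions (about 0.30) rather than to the pooled ratio 257/1044 (about 0.25); this reading is an inference from the numbers, not something the slide states.

```python
# (definition candidates, true definitions) per method, from the table above
counts = {"sloWNet": (104, 41), "ATR": (629, 118), "Patterns": (311, 98)}

def precision(candidates, true_definitions):
    """Share of extracted candidates that were judged true definitions."""
    return true_definitions / candidates

per_method = {name: precision(c, t) for name, (c, t) in counts.items()}

# Unweighted mean of the three precisions.
macro_average = sum(per_method.values()) / len(per_method)

# Pooled precision over all candidates.
total_candidates = sum(c for c, _ in counts.values())
total_true = sum(t for _, t in counts.values())
pooled = precision(total_candidates, total_true)

for name, value in per_method.items():
    print(f"{name}: {value:.3f}")
print(f"macro average: {macro_average:.3f}, pooled: {pooled:.3f}")
```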


Page 11: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

                  sloWNet  ATR     Patterns
MERGED + J48      61.76%   69.79%  69.45%
MERGED_bin + J48  66.67%   71.06%  63.9%
ORIG_bin + J48    63.72%   65.98%  62.7%

For definitions only:

           sloWNet  ATR    Patterns
Precision  0.63     0.46   0.514
Recall     0.415    0.441  0.551
F-measure  0.5      0.452  0.532
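The F-measure row is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes it from the rounded precision and recall values shown above; the small difference for ATR (about 0.450 computed against 0.452 reported) is presumably due to rounding of the inputs, which is an assumption rather than something stated on the slide.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per method, as reported for the definition class
scores = {
    "sloWNet": (0.63, 0.415),
    "ATR": (0.46, 0.441),
    "Patterns": (0.514, 0.551),
}

f_scores = {name: f_measure(p, r) for name, (p, r) in scores.items()}
for name, value in f_scores.items():
    print(f"{name}: {value:.3f}")
```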


Page 12: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

The Equator is an imaginary line on the Earth's surface equidistant from the North Pole and South Pole that divides the Earth into a Northern Hemisphere and a Southern Hemisphere.

An equator is the intersection of a sphere's surface with the plane perpendicular to the sphere's axis of rotation and containing the sphere's center of mass.

The longest of the five main circles of latitude on Earth (the others being the Arctic and Antarctic Circles and the Tropics of Cancer and Capricorn) is called the Equator.


Page 13: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

Head lice are parasites that live in the hair and scalp of humans.

HEAD LICE, also called Pediculus Humanus Capitis are small blood-sucking, wingless insects found on the human scalp. They are approximately the size of a sesame seed and cannot jump or fly. They are six-legged creatures with claws, which help them cling to and crawl through human hair.

Head lice are an emerging social problem, not only in economically poor countries but also in practically all other societies.


Page 14: Learning  to  Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources

• Wikipedia can help us learn the properties of definitions.
• Knowledge-rich texts are a good source of definitions.
• A semantically-rich approach (using wordnet and ATR) yields many definitions and defining contexts.
• Defining a definition is hard...
• Encyclopaedic definitions differ from those found in running texts.
• Future work: use other features in learning, use active learning, redefine definitions and possibly re-evaluate definition candidates.
