30
stlab.istc.cnr.it Type inference through the analysis of Wikipedia links Andrea Giovanni Nuzzolese [email protected] Aldo Gangemi [email protected] Valentina Presutti [email protected] Paolo Ciancarini [email protected] 16 April 2012 - Lyon, France - LDOW 2012

Type inference through the analysis of Wikipedia links

Embed Size (px)

Citation preview

Page 2: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it

• Motivations

• Materials

• Applied methods

• Results

• Conclusions

Outline

2

Page 3: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it3

Motivations

✦ Only a subset of the DBpedia resources is typed with the DBpedia ontology (DBPO)

✦ The typing procedure is top-down.

✦ Is the DBPO complete with respect to the DBpedia domain?

✦ How good and homogeneous is the granularity of DBPO types?

Resources used in wikilinksrelations:

15,944,381

Resources having a DBPO type:

1,518,697

Page 4: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it4

Dataset # of triples

wikilink triples 107,892,317

infobox mapping-based “data” triples 9,357,273

rdfs:label triples 7,972,225

rdf:type triples 6,173,940

infobox mapping-based “object” triples 4,251,239

Wikilinkstriples:

107,892,317

Wikilink triples with typed subject/object:

16,745,830

Materials

DBpedia ontology:272 classes

DBpedia 3.6

Page 5: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it5

What we did• Wikilinks of a DBpedia resource convey knowledge that

can be used for classifying it.

• Classification methods ✦ Inductive learning: k-Nearest Neighbor algorithm✦ Abductive classification based on EKPs [1] and homotypes used as

background knowledge

• The methods were performed on

Sample of untyped resources:

1,000

Resources used in wikilinksrelations:

15,944,381

Resources having a DBPO type:

1,518,697

[1] A. G. Nuzzolese, A. Gangemi, V. Presutti, and P. Ciancarini. Encyclopedic Knowledge Patterns from Wikipedia Links. In L. Aroyo, N. Noy, and C. Welty, editors, Proceedings of the 10th International Semantic Web Conference (ISWC2011), pages 520-536. Springer, 2011.

Page 6: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it65

Inductive classification

• We designed two inductive classification experiments based on the k-NN algorithm

✦ on 272 features, i.e., all the classes in the DBPO✦ on 27 features, i.e., the top-level classes in the DBPO

hierarchy

• For each experiment we built a labeled feature space model as training set by using a randomly sampled 20% of typed resources

✦ the algorithms were tested on the remaining 80% of typed resources

Page 7: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it7

dbpedia:Steve_Jobs

dbpedia:Apple_Inc. dbpedia:NeXT

dbpedia:Cupertino,_Californiadbpedia:Forbes

dbpo:wikiPageWikiLink

Mammal Scientist Company Drug City Magazine Class

dbpedia:Steve_Jobs

...

Building the training set for K-Nearest Neighbor algorithm

Page 8: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it7

dbpedia:Steve_Jobs

dbpedia:Apple_Inc. dbpedia:NeXT

dbpedia:Cupertino,_Californiadbpedia:Forbes

dbpo:Organisation

dbpo:Magazine dbpo:City

dbpo:wikiPageWikiLink

rdf:type

Mammal Scientist Company Drug City Magazine Class

dbpedia:Steve_Jobs

dbpo:Person

dbpo:Person

...

Building the training set for K-Nearest Neighbor algorithm

Page 9: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it7

dbpedia:Steve_Jobs

dbpo:Organisation

dbpo:Magazine dbpo:Citydbpo:wikiPageWikiLink

rdf:type

kp:linksTo

Mammal Scientist Company Drug City Magazine Class

dbpedia:Steve_Jobs dbpo:Person0 0 1 0 1 1

...

Building the training set for K-Nearest Neighbor algorithm

Page 10: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it7

dbpedia:Steve_Jobs

dbpo:Organisation

dbpo:Magazine dbpo:Citydbpo:wikiPageWikiLink

rdf:type

kp:linksTo

Mammal Scientist Company Drug City Magazine Class

dbpedia:Steve_Jobs dbpo:Person0 0 1 0 1 1

... ... ... ... ... ... ... ...

Building the training set for K-Nearest Neighbor algorithm

Page 11: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it7

dbpedia:Steve_Jobs

dbpo:Organisation

dbpo:Magazine dbpo:Citydbpo:wikiPageWikiLink

rdf:type

kp:linksTo

Building the training set for K-Nearest Neighbor algorithm

✦ Precision using all DBPO types as features: 31.65%✦ Precision using the top-level of DBPO as features: 40.27%

Page 12: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it8

Abductive classification with EKPs

• EKPs✦ A EKP of a certain entity

type is a small vocabulary that captures the core types used for describing such entity type as it emerges from the Wikipedia crowds

visit aemoo.org for an exploratory tool based on EKPs

Page 14: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it9

How can we infer the type of

“Galileo Galilei”?

We know its path types

http://www.aemoo.org

Page 15: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it9

We have 231 EKPs We compare the path types involving

“Galileo Galilei” as subject with EKPs in order to identify the most similar, which

is the "Scientist" EKP.

http://www.aemoo.org

Page 16: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it9

The inferred type for the resource

“Galileo Galiei” is the class “Scientist”

http://www.aemoo.org

Page 17: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it10

Distinctive weakness of some EKPs

✦ The distinctive weakness seems due to wide overlaps among some EKPs

✦ Systematic ambiguity of the 4 largest classes

✦ Precision and recall on all DBPO types both 44.4%✦ Precision and recall on the top-level of DBPO hierarchy: 36.5% and 79.5%

Page 18: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it11

Homotype-based abductive classification

• Homotypes are wikilinks that have the same type on both the subject and the object of the triple

• We have observed how the homotype is usually the most frequent (or in the top 3) wikilink type

• Given an untyped entity, we hypothesize that the most frequent type involved in its ingoing/outgoing wikilinks detects its homotype, hence it indicates its type

dbpedia:Immanuel_Kant dbpedia:Plato

dbpo:wikiPageWikiLink

dbpo:Philosopher dbpo:Philosopherrdf:type rdf:type

Page 19: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it12

Homotype-based abductive classification

s

Page 20: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it12

Homotype-based abductive classification

s

Page 21: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it13

Results on classifying already typed resources

Page 22: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it14

Results on untyped resources• Results on a sample of 1,000 untyped resources

are much less satisfactory

With EKPs

With Homotypes

Page 23: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it15

Why? [1]

• Typed entities: 2:3 typed wikilinks ratio

• Untyped entities: 1:3 typed wikilinks ratio

• Link structure for untyped entities is not rich enough

Page 24: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it16

Why? [2]

• DBPO does not provide a complete set of classes for correctly typing DBpedia resources

dbpedia:Counterattack Plandbpedia:Computer_Science ScientificDiscipline

dbpedia:List_of_FIFA_World_Cup_finals Collection

dbpedia:Eros(concept) Conceptdbpedia:Gentlemen’s_agreement Agreement

Page 25: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it17

Conclusions• We have investigated different approaches for

typing DBpedia resources based on the data set of wikilinks

• Results are acceptable in the test set, but extensive untypedness in output links, and poor DBPO coverage severely compromise automatic typing for untyped resources

• We have analyzed possible causes deriving from some bias in DBpedia

Page 26: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it18

Future work

• Yago could be helpful but✦ there is a lack of mapping between YAGO and DBPO✦ it has larger coverage and only an overlap with DBPO✦ the granularity of its categories is finer, and not easily

reusable, because the top level is very large

Page 27: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it19

Thank you

Andrea Nuzzolese-

STLab, ISTC-CNR&

Dipartimento di Scienze dell’InformazioneUniversity of Bologna

Italy

Page 28: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it

nRes(dbpo:MusicalArtist)

Chad_Smith

John_Lennon

Michael_Jackson

PREFIX dbpo : http://dbpedia.org/ontology/

Paul_McCartney

Jackie_Jackson rdf:type

Anthony_Kiedis

dbpo:MusicalArtist

dbpo:MusicalArtist

rdf:type

rdf:type

dbpo:MusicalArtist

rdf:type

rdf:type

rdf:type

Number of resources having type dbpo:MusicalArtist

dbpo:MusicalArtist

Path popularity

Page 29: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it

nSubjectRes(Pi,j)

Dave_Grohl

Foo Fighters

John_Lennon

Michael_Jackson

PREFIX dbpo : http://dbpedia.org/ontology/

Paul_McCartney

Jackson_5

Beatles

Jackie_Jackson

dbpo:MusicalArtist dbpo:Band dbpo:wikiPageWikiLink Si Oj

Nirvana

Number of distinct resources that participate in a path as subjects

Path popularity

Page 30: Type inference through the analysis of Wikipedia links

stlab.istc.cnr.it

pathPopularity(Pi,j,Si)

Dave_Grohl

Foo Fighters

John_Lennon

Michael_Jackson

Paul_McCartney

Jackson_5

Beatles

Jackie_Jackson

Nirvana

Madonna

Prince Charlie_Parker Keith_Jarrett

nSubjectRes(Pi,j)/nRes(Si)

Path popularity