What‘s all the data about – profiling and interlinking Web datasets Stefan Dietze L3S Research Center 27/03/14 1 Stefan Dietze

What's all the data about? - Linking and Profiling of Linked Datasets


DESCRIPTION

Talk at LIRMM (http://www.lirmm.fr/) on 27 March 2014 about profiling of Linked Data.


Page 1: What's all the data about? - Linking and Profiling of Linked Datasets

What's all the data about: profiling and interlinking Web datasets

Stefan Dietze

L3S Research Center

27/03/14 1 Stefan Dietze

Page 2: What's all the data about? - Linking and Profiling of Linked Datasets

Recent work on Linked Data exploration/discovery/search

Entity interlinking & dataset interlinking recommendation

Dataset profiling

Data consistency & conflicts

Research areas

Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)

Application domains: education/TEL, Web archiving, …

Some projects

Introduction

http://www.l3s.de/


See also: http://purl.org/dietze

Page 3: What's all the data about? - Linking and Profiling of Linked Datasets

…why are there so few datasets actually used?

Data reuse and in-links are focused on trusted „reference graphs“ such as DBpedia, Freebase, etc.

Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone 300+ datasets, 50 bn triples)

Explanations?

Linked Data is awesome, but...


„HTTP-accessibility“ (SPARQL, URI-dereferencing)

„Structure“ & „Semantics“ (=> shared/linked vocabularies)

„Interlinked“

„Persistent“

Hm,

really?


Page 4: What's all the data about? - Linking and Profiling of Linked Datasets

Linked data is more diverse than we think

SPARQL Web-Querying Infrastructure: Ready for Action?, Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche, International Semantic Web Conference 2013 (ISWC2013).

SPARQL endpoint availability over time [Buil-Aranda et al 2013]

Accessibility of datasets?

Fewer than 50% of all SPARQL endpoints are actually responsive at any given point in time

“THE” SPARQL protocol? No, but many variants & subsets
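A minimal sketch of how endpoint responsiveness could be probed and summarised. It is not the method of [Buil-Aranda et al 2013]; the function names, the trivial ASK query, and the "HTTP 200 means responsive" rule are assumptions made for illustration, and it assumes the endpoint URL carries no query string of its own.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def probe_url(endpoint: str) -> str:
    """Build a GET URL that sends a trivial ASK query to a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": "ASK { ?s ?p ?o }"})


def is_responsive(endpoint: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint answers the probe with HTTP 200.

    Assumption: any network error, HTTP error, or malformed URL counts
    as "not responsive".
    """
    try:
        request = Request(
            probe_url(endpoint),
            headers={"Accept": "application/sparql-results+json"},
        )
        with urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        return False


def availability(probe_log: list[bool]) -> float:
    """Share of successful probes in a log of repeated checks."""
    return sum(probe_log) / len(probe_log) if probe_log else 0.0
```

Running `is_responsive` on a schedule and feeding the results into `availability` gives one availability figure per endpoint over time.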

Shared vocabularies & schemas, but:

…still very heterogeneous [d’Aquin, WebSci13]

…data partially messy and not conformant (RDFS, schemas) [HoganJWS2012]

…even widely used reference datasets such as DBpedia noisy [Paulheim2013]

Co-occurrence graph of data types in 146 datasets: 144 Vocabularies, 588 highly overlapping types, 719 Properties

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science, Volume 8218, 2013, pp. 510-525.

An Empirical Survey of Linked Data Conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, pp. 14–44, 2012.


Page 5: What's all the data about? - Linking and Profiling of Linked Datasets

What about data consistency?

Inconsistency and Incompleteness of Linked Datasets – a Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., Web Science 2014 (WebSci14), under review.


Page 6: What's all the data about? - Linking and Profiling of Linked Datasets

Too many/diverse datasets, too little information


Which datasets are useful & trustworthy for case XY (e.g. „learning about the solar system“)? Which topics are covered?

Types: which datasets describe statistics, videos, slides, publications, etc.?

Currentness, dynamics, accessibility/reliability, data quantity & quality?

Page 7: What's all the data about? - Linking and Profiling of Linked Datasets

Data curation and dataset profiling

Dataset

Catalog/Registry


Catalog of data: classification of datasets according to resource types, disciplines/topics, data quality, accessibility, etc.

Infrastructure for distributed/federated querying

describes

Which datasets are useful & trustworthy for case XY (e.g. „learning about the solar system“)? Which topics are covered?

Types: which datasets describe statistics, videos, slides, publications, etc.?

Currentness, dynamics, accessibility/reliability, data quantity & quality?

Page 8: What's all the data about? - Linking and Profiling of Linked Datasets

db:Astro. Objects

Dataset profiling: what’s all the data about

Dataset Metadata


BIBO

AIISO

FOAF

contains

Entity disambiguation & linking [ESWC13]

Topic profile extraction [WWW13, ESWC14]

db:Astronomy

db:Astro. Objects

Dataset

Catalog/Registry

yov:Video

po:Programme

BBC Programme

<po:Programme …>

<po:Series>Wonders of the Solar System</.>

<po:Actor>Brian Cox</…>

</po:Programme…>

<yo:Video …>

<dc:title>Pluto & the Dwarf Planets</dc:title>

</yo:Video…>

Yovisto Video


bibo:Film

Schema mappings [WebSci13]

Page 9: What's all the data about? - Linking and Profiling of Linked Datasets

Schemas/vocabularies on the Web: XKCD 927


https://xkcd.com/927/

Page 10: What's all the data about? - Linking and Profiling of Linked Datasets

Schema assessment and mapping

Co-occurrence of data types (in 146 datasets: 144 Vocabularies, 588 highly overlapping types, 719 Properties)

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

<po:Programme …>
<po:title>Secret Universe – The Life of the Cell</po:title>
</po:Programme…>

BBC Programme

<sioc:Item …>
<title>Viral diseases & bacteria</title>
</sioc:Item ….>

SlideShare Set

po:Programme

sioc:Item

?

http://datahub.io/group/linked-education


Page 11: What's all the data about? - Linking and Profiling of Linked Datasets

Schema assessment and mapping

Co-occurrence of data types (in 146 datasets: 144 Vocabularies, 588 highly overlapping types, 719 Properties)

Co-occurrence after mapping into most frequent schemas (201 frequent types mapped into 79 classes)
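A minimal sketch of the type co-occurrence idea: count how often two types appear in the same dataset, before and after mapping dataset-specific types onto a shared schema. The mapping table and the toy datasets are assumptions for illustration (the slides suggest po:Programme and yov:Video map to bibo:Film; sioc:Item to bibo:Slideshow is a guess), and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Optional

# Hypothetical mapping of dataset-specific types onto a shared schema.
TYPE_MAPPING = {
    "po:Programme": "bibo:Film",
    "yov:Video": "bibo:Film",
    "sioc:Item": "bibo:Slideshow",
}


def cooccurrence(
    datasets: dict[str, set[str]],
    mapping: Optional[dict[str, str]] = None,
) -> Counter:
    """Count, for every pair of types, the number of datasets using both."""
    mapping = mapping or {}
    counts: Counter = Counter()
    for types in datasets.values():
        # Replace each type by its mapped class, if a mapping exists.
        mapped = {mapping.get(t, t) for t in types}
        for pair in combinations(sorted(mapped), 2):
            counts[pair] += 1
    return counts


# Toy input: which types each dataset uses (illustrative only).
DATASETS = {
    "bbc": {"po:Programme", "foaf:Person"},
    "yovisto": {"yov:Video", "foaf:Person"},
    "slideshare": {"sioc:Item", "foaf:Person"},
}
```

Without the mapping the three toy datasets share no type pair; with it, the BBC and Yovisto datasets both contribute to the (bibo:Film, foaf:Person) edge, which is the densification effect the mapped graph shows.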

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

bibo:Slideshow

bibo:Film

bibo:Document

<po:Programme …>
<po:title>Secret Universe – The Life of the Cell</po:title>
</po:Programme…>

BBC Programme

<sioc:Item …>
<title>Viral diseases & bacteria</title>
</sioc:Item ….>

SlideShare Set

po:Programme

sioc:Item


Page 12: What's all the data about? - Linking and Profiling of Linked Datasets

LinkedUp Data Catalog in a nutshell

http://datahub.io/group/linked-education

http://data.linkededucation.org/linkedup/catalog/

RDF (VoID) dataset catalog: browse & query distributed datasets

Live information about endpoint accessibility

Federated queries using type mappings
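A minimal sketch of how a federated query could use type mappings: a query phrased against a shared type is rewritten into one query per endpoint, using that dataset's local type. This is not the LinkedUp catalog's implementation; the endpoint URLs, the catalog structure, and the function name are hypothetical. The two namespace IRIs are the ones commonly published for the BBC Programmes Ontology and SIOC.

```python
# Namespace IRIs for the prefixes used in the generated queries.
PREFIXES = {
    "po": "http://purl.org/ontology/po/",
    "sioc": "http://rdfs.org/sioc/ns#",
}

# Hypothetical catalog: endpoint URL -> {shared type: local type}.
# Neither the URLs nor the mappings are taken from the real catalog.
CATALOG = {
    "http://example.org/bbc/sparql": {"bibo:Film": "po:Programme"},
    "http://example.org/slideshare/sparql": {"bibo:Slideshow": "sioc:Item"},
}


def rewrite_queries(shared_type, catalog=CATALOG, limit=10):
    """Return one SPARQL query per endpoint that can answer shared_type."""
    queries = {}
    for endpoint, local_types in catalog.items():
        local = local_types.get(shared_type)
        if local is None:
            continue  # this dataset has no type mapped to the shared one
        prefix = local.split(":", 1)[0]
        queries[endpoint] = (
            f"PREFIX {prefix}: <{PREFIXES[prefix]}>\n"
            f"SELECT ?resource WHERE {{ ?resource a {local} }} LIMIT {limit}"
        )
    return queries
```

Endpoints without a mapping for the requested type are skipped, so only datasets that can contribute results are queried.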


Page 13: What's all the data about? - Linking and Profiling of Linked Datasets

<yo:Video 8748720>
<dc:title>Pluto & the Dwarf Planets</dc:title>
</yo:Video 8748720>

Video

<sioc:Item 2139393292>
<title>Planetary motion & gravity</title>
</sioc:Item 2139393292>

Slideset

Topics/categories addressed? Relatedness of resources/entities? (types, semantics)

<po:Programme519215>
<po:Series>Wonders of the Solar System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215>

Programme

Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).

Challenge: semantics of resources/datasets?


Page 14: What's all the data about? - Linking and Profiling of Linked Datasets

<yo:Video 8748720>
<dc:title>Pluto & the Dwarf Planets</dc:title>
</yo:Video 8748720>

Video

<po:Programme519215>
<po:Series>Wonders of the Solar System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215>

Programme

Data disambiguation (for linking & profiling)

Brian Cox?

Sun?

Pluto?


Page 15: What's all the data about? - Linking and Profiling of Linked Datasets

db:Pluto (Dwarf Planet)

db:Astronomical Objects

db:Sun

Data disambiguation using background knowledge

„Semantic relatedness“ of resources?

db:Astronomy


<po:Programme519215>
<po:Series>Wonders of the Solar System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215>

Programme

<sioc:Item 2139393292>
<title>Planetary motion & gravity</title>
</sioc:Item 2139393292>

Slideset

<yo:Video 8748720>
<dc:title>Pluto & the Dwarf Planets</dc:title>
</yo:Video 8748720>

Video


Page 16: What's all the data about? - Linking and Profiling of Linked Datasets

db:Pluto (Dwarf Planet)

db:Astronomical Objects

<yov:Lecture8748720>
<title>Pluto & the Dwarf Planets</title>
</yov:Lecture8748720>

Online Lecture

db:Astronomy

Computation of connectivity scores between resources/entities

Method: combination of a

(i) semantic (graph-based) connectivity score (SCS) with

(ii) a Web co-occurrence-based measure (CBM) (similar to NGD)

For (i): adaptation of Katz-Index from SNA for (linked) data graphs (considering path number and path lengths of transversal properties)
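A minimal sketch of the two ingredients named above, under stated assumptions. `katz_score` is a Katz-style score that sums damped counts of simple paths up to a maximum length; it is not the exact SCS formula from [ESWC13], and `beta`, `max_length`, and the toy graph are illustrative choices. `normalized_google_distance` is the standard NGD formula, shown only because the slide describes CBM as similar to NGD.

```python
import math


def count_paths_by_length(graph, source, target, max_length):
    """Count simple paths from source to target, indexed by path length."""
    counts = [0] * (max_length + 1)

    def walk(node, visited, length):
        if node == target:
            counts[length] += 1
            return
        if length == max_length:
            return
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                walk(neighbour, visited | {neighbour}, length + 1)

    walk(source, {source}, 0)
    return counts


def katz_score(graph, source, target, beta=0.5, max_length=4):
    """Katz-style connectivity: shorter and more numerous paths score higher."""
    counts = count_paths_by_length(graph, source, target, max_length)
    return sum(beta ** length * counts[length] for length in range(1, max_length + 1))


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts fx, fy, joint count fxy, and index size n.

    Assumes fx, fy > 0 and n larger than both counts.
    """
    if fxy == 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Toy undirected graph loosely based on the slide's example (hypothetical edges).
GRAPH = {
    "db:Pluto": {"db:Astronomical_Objects"},
    "db:Sun": {"db:Astronomical_Objects", "db:Astronomy"},
    "db:Astronomical_Objects": {"db:Pluto", "db:Sun", "db:Astronomy"},
    "db:Astronomy": {"db:Astronomical_Objects", "db:Sun"},
}
```

In the toy graph, db:Pluto reaches db:Sun through one path of length 2 and one of length 3, so with `beta=0.5` the score is 0.25 + 0.125.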

db:Sun

SCS = 0.32

CBM = 0.24

http://purl.org/vol/doc/

http://purl.org/vol/ns/


Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).

Entity linking: semantic relatedness

<sioc:Item 2139393292>
<title>Planetary motion & gravity</title>
</sioc:Item 2139393292>

Slideset

<po:Programme519215>
<po:Series>Wonders of the Solar System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215>

Programme

Page 17: What's all the data about? - Linking and Profiling of Linked Datasets

Entity linking: evaluation


Evaluation based on USA Today News items (80,000 entity pairs)

Manually created gold standard (1000 entity pairs)

Baseline: Explicit Semantic Analysis (ESA)

=> CBM/SCS: „relatedness“; ESA: „similarity“

Precision/Recall/F1 for SCS, CBM, ESA.

Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).

Page 18: What's all the data about? - Linking and Profiling of Linked Datasets

db:Astronomical Objects

db:Astronomy

db:Sun

Extracting representative metadata („topic profile“) for each dataset

Ranking of most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets

Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance

DBpedia category graph


Dataset profiling: what's the data about?

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).

<po:Programme519215>
<po:Series>Wonders of the Solar System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215>

Programme

Page 19: What's all the data about? - Linking and Profiling of Linked Datasets

Dataset profiling: approach


1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling)

2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion)

3. Normalisation and ranking (using graphical models such as PageRank with Priors, HITS with Priors and K-Step Markov)

=> Result: weighted dataset-topic profile graph
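A minimal sketch of the ranking step, using PageRank with priors (personalised PageRank) by power iteration. It covers only one of the three models listed and is not the implementation from the ESWC2014 paper; the toy entity-to-category graph, the damping factor, and the iteration count are assumptions. Every link target must also appear as a key of the graph, and at least one prior must be positive.

```python
def pagerank_with_priors(graph, priors, damping=0.85, iterations=50):
    """Rank nodes of a directed graph, biased towards nodes with a prior.

    graph:  node -> list of nodes it links to
    priors: node -> non-negative weight (e.g. how often an entity was
            extracted from the sampled resources); missing nodes get 0.
    """
    nodes = list(graph)
    total = sum(priors.get(node, 0.0) for node in nodes)
    prior = {node: priors.get(node, 0.0) / total for node in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        # Teleport step: jump back to the prior distribution.
        new_rank = {node: (1 - damping) * prior[node] for node in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: hand its mass back to the prior distribution.
                for other in nodes:
                    new_rank[other] += damping * rank[node] * prior[other]
        rank = new_rank
    return rank


# Toy graph: extracted entities point to categories, categories to broader ones.
GRAPH = {
    "db:Pluto": ["cat:Dwarf_planets"],
    "db:Sun": ["cat:Stars"],
    "cat:Dwarf_planets": ["cat:Astronomical_objects"],
    "cat:Stars": ["cat:Astronomical_objects"],
    "cat:Astronomical_objects": [],
}
PRIORS = {"db:Pluto": 1.0, "db:Sun": 1.0}
```

Sorting the category nodes by their score yields a weighted topic list for the dataset; in the toy graph the shared broader category outranks either specific one.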

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).

Page 20: What's all the data about? - Linking and Profiling of Linked Datasets

Dataset profiling: exploring LOD datasets/topics in a nutshell

http://data-observatory.org/lod-profiles/


Automatic extraction of dataset “topics” [ESWC2014]

Visualisation & exploration of dataset-topic graph (datasets, topics, relationships)

Includes all (responsive) datasets of LOD Cloud

Page 21: What's all the data about? - Linking and Profiling of Linked Datasets

Dataset profiling: results evaluation


[Figure: NDCG (averaged over all datasets)]

Datasets & Ground Truth

Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood

Crowd-sourced topic indicators from datasets (keywords, tags)

Manual mapping to entities & category extraction (ranking according to frequency)

Baselines

1) LDA, 2) tf/idf (applied to entire datasets)

Topic extraction according to our approach, weighting/ranking based on term weight

Measure

NDCG @ rank l

Performance (time/NDCG) for different sampling strategies/sizes etc
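A minimal sketch of the NDCG measure used in the evaluation above, in its common linear-gain form with a log2 discount. Which NDCG variant the evaluation used is not stated on the slide, so this choice is an assumption. The ideal ordering is derived from the same list, which assumes the list contains every relevant topic.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k graded relevance values."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the ground-truth relevance of each returned topic in ranked order; a perfectly ordered list scores 1.0.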

Page 22: What's all the data about? - Linking and Profiling of Linked Datasets


dbp:Category:Royal_Medal_winners

dbp:Category:1955_births dbp:Category:People_from_London

dbp:Category:Buzzwords

dbp:Category:Web_Services

dbp:Category:HTTP

dbp:Category:Unitarian_Universalists

dbp:Category:World_Wide_Web

What have these categories in common?

Page 23: What's all the data about? - Linking and Profiling of Linked Datasets


Diversity of category profile for a single paper

Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web". Scientific American Magazine.

person

document

dbp:Tim_Berners-Lee

dbp:Category:1955_births dbp:Category:People_from_London

dbp:Category:Buzzwords

dbp:Semantic_Web

dbp:Category:Semantic_Web

dbp:Category:Web_Services

dbp:Category:HTTP

dbp:Category:Unitarian_Universalists

first-level categories (dcterms:subject)

dbp:Category:World_Wide_Web

dbp:Category:Royal_Medal_winners

Page 24: What's all the data about? - Linking and Profiling of Linked Datasets

DBpedia category graph not an ideal “topic” vocabulary:

Broad and noisy

“Categories” vs “topics” (for capturing disciplines, thesauri like UMBEL or UNESCO Thesaurus seem better suited)

Hierarchy ?

Filtering of certain partitions of category graph (too generic categories etc)

Mixing categories across resource types (document, person) creates “perceived noise”

But: broadness is useful as general vocabulary for categorisation of all sorts of resource types


Dataset profiling: some lessons learned

Page 25: What's all the data about? - Linking and Profiling of Linked Datasets


http://data-observatory.org/led-explorer/

Type specific views on datasets/ categories

“Document” (foaf:Document)

“Person” (foaf:Person)

“Course” (aiiso:Course)

Currently applied to datasets in LinkedUp Catalog only (as schema mappings already available here)

Type-specific exploration of dataset categories

Page 26: What's all the data about? - Linking and Profiling of Linked Datasets

Dataset interlinking recommendation

Candidate datasets for interlinking?

t

Linkset1

Linkset2

Problem

Given a dataset t, rank the datasets di in D by the probability score (di, t) that they contain linking candidates (entities) for t

Features:

Vocabulary overlap

Existing links (SNA)

Datasets more likely to contain linking candidates if they (a) share common schema elements, or (b) already link to t or datasets t links to (friend of a friend)
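A minimal sketch of ranking candidate datasets with the two features named above. It is not the method of either cited paper: combining the features with a single weight, the Jaccard measure, and all dataset names in the example are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target, vocabularies, links, weight=0.5):
    """Rank all datasets except target as interlinking candidates.

    vocabularies: dataset -> set of vocabularies it uses
    links:        dataset -> set of datasets it already links to
    weight:       share given to vocabulary overlap (assumption)
    """
    target_links = links.get(target, set())
    scores = {}
    for candidate in vocabularies:
        if candidate == target:
            continue
        overlap = jaccard(vocabularies[target], vocabularies[candidate])
        candidate_links = links.get(candidate, set())
        if target in candidate_links:
            social = 1.0  # the candidate already links to the target
        elif target_links:
            # "Friend of a friend": share of the target's link targets
            # that the candidate links to as well.
            social = len(candidate_links & target_links) / len(target_links)
        else:
            social = 0.0
        scores[candidate] = weight * overlap + (1 - weight) * social
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical input data.
VOCABULARIES = {
    "t": {"foaf", "dcterms", "bibo"},
    "DBLP": {"foaf", "dcterms", "swrc"},
    "Roma": {"geo"},
}
LINKS = {"t": {"DBpedia"}, "DBLP": {"DBpedia", "ACM"}, "Roma": set()}
```

With this toy input, DBLP shares two vocabularies with t and links to the same dataset as t, so it ranks ahead of Roma.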

Conclusions

Roughly 60% MAP for both approaches

Future work: quantity of links, more remote links, extraction of dataset links rather than data from DataHub
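For readers unfamiliar with the MAP figure quoted in the conclusions, a minimal sketch of mean average precision over ranked recommendation lists. The function names and the example data are illustrative; the exact evaluation protocol of the cited papers is not shown on the slide.

```python
def average_precision(ranking, relevant):
    """Average of precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Each run pairs the ranked candidate list for one target dataset with the set of datasets it is actually linked to.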

Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A., Dietze, S., Recommending Tripleset Interlinking through a Social Network Approach, The 14th International Conference on Web Information Systems Engineering (WISE 2013), Nanjing, China, 2013.

Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova, M.A., Dietze, S., Identifying candidate datasets for data interlinking, in Proceedings of the 13th International Conference on Web Engineering (2013).

Rank  Dataset
1     DBLP
2     ACM
3     OAI
4     CiteSeer
5     IBM
6     Roma
7     IEEE
8     Ulm
9     Pisa

?

?

Page 27: What's all the data about? - Linking and Profiling of Linked Datasets


Success models: data & applications

LinkedUp Challenge to identify innovative tools & applications

Evaluation methods and approaches

“LinkedUp”: Linking Web Data (for Education)

Data linking & curation

Technology transfer & community-building

Collecting & exposing open data => LinkedUp Data Catalog

Profiling and linking of Web Data for education => educational data graph [ESWC2013], [ISWC2013]

Disseminating knowledge & building communities (educators, computer scientists, data engineers)

Gathering stakeholder feedback: use cases, and requirements

http://linkedup-challenge.org/#usecases

http://linkedup-project.eu/events

http://www.linkedup-challenge.org/

http://data.linkededucation.org

European support action to advance take-up of open data & related technologies

http://www.linkedup-project.eu

Page 28: What's all the data about? - Linking and Profiling of Linked Datasets


Who we are

LinkedUp Network

LinkedUp Consortium

LinkedUp Advisory Board

Page 29: What's all the data about? - Linking and Profiling of Linked Datasets

LinkedUp Challenge: using open data (for learning)

Open Data Competition to promote tools and applications that analyse / integrate (Linked) Web data

Organised by LinkedUp project over 2 years (“Veni”, “Vidi”, “Vici”) with 40,000 EUR awards

Veni Competition - 22 submissions, 8 shortlisted for presentation at Open Knowledge Conference (17 September, Geneva, Switzerland)

http://linkedup-challenge.org


Page 30: What's all the data about? - Linking and Profiling of Linked Datasets

May to September 2013: Open track only; final events at OKCon 2013 (September 2013, Geneva)

October 2013 to May 2014: Open & focused track(s); final events at ESWC2014 (May, Crete)

May 2014 to October 2014: Open track & focused tracks; submission details and calls to be released soon; final events at ISWC2014 (October, Riva del Garda, Italy)

Page 31: What's all the data about? - Linking and Profiling of Linked Datasets

The Veni shortlist & winners

DataConf.

KnowNodes

Mismuseos

ReCredible

YourHistory


PoliMedia: 1st prize (http://www.polimedia.nl/)

GlobeTown: 2nd prize (http://www.globe-town.org/)

WeShare: 3rd prize / people's choice (http://seek.cloud.gsic.tel.uva.es/weshare/)

Page 32: What's all the data about? - Linking and Profiling of Linked Datasets

data.l3s.de – a DataHub for the L3S

Page 33: What's all the data about? - Linking and Profiling of Linked Datasets

Learning Analytics & Knowledge Dataset & Challenge

Facilitating Research on Learning Analytics and EDM

http://lak.linkededucation.org/

LAK Dataset (450 publications in RDF/R)

ACM International Conference on Learning Analytics and Knowledge (LAK) (2011-13)

International Conference on Educational Data Mining (2008-13)

Journal of Educational Data Mining (2008-12)

LAK Data Challenge

Analyse, explore, and correlate the LAK Dataset

At ACM LAK 2014 (April 2014, Indianapolis)

Page 34: What's all the data about? - Linking and Profiling of Linked Datasets

KEYSTONE COST ACTION


http://www.keystone-cost.eu/

Research network focused on distributed search and dataset profiling, related to Semantic Web, databases, etc.

Running 2013-2017

WG1: Representation of structured data sources

WG2: Keyword search

WG3: User interaction and query interpretation

WG4: Research integration, showcases, benchmarks, and evaluations

Open to new members (even beyond Europe)

Joint workshops (eg PROFILES2014 @ ESWC2014)

Page 35: What's all the data about? - Linking and Profiling of Linked Datasets

Ongoing/future work … and some upcoming events

Linked Data evolution, preservation, consistency

In RDF graphs (eg LOD Cloud), „all“ nodes are connected

LD preservation: which datasets to preserve (direct links or even more distant neighbours)? => semantic relatedness as guidance for scalable preservation strategies/data enrichment

Link correctness in evolving LD

Investigating impact of changes on link correctness (weekly LOD crawls over 1 year time span)

Application: informed preservation strategies

Conflict detection and LD quality (link quality, impact of conflicts in distant nodes)

PROFILES workshop @ ESWC2014 (http://keystone-cost.eu/profiles2014)

26 May 2014, Crete, Greece

Linking User Data 2014 at UMAP2014 (http://liud.linkededucation.org)

Deadline: 1 April

Online Learning & LD Tutorial at WWW2014 (http://www2014.kr/)

07 April, Seoul

Page 36: What's all the data about? - Linking and Profiling of Linked Datasets

Thank you!

See also (general)

http://linkedup-project.eu

http://linkededucation.org

http://data.l3s.de

http://purl.org/dietze

See also (data)

http://data.linkededucation.org

http://data.linkededucation.org/linkedup/catalog/

http://lak.linkededucation.org


Acknowledgements

Besnik Fetahu (L3S)

Bernardo Pereira Nunes (PUC Rio)

Marco Casanova (PUC Rio)

Luiz Andre Paes Leme (PUC Rio)

Giseli Lopes (PUC Rio)

Davide Taibi (CNR, IT)

Mathieu d’Aquin (Open University, UK)

and many more…