Turning Data into Knowledge –
profiling and interlinking Web datasets
Stefan Dietze
L3S Research Center
- KESW2014 -
30/09/14 1 Stefan Dietze KESW2014
Recent work on Linked Data exploration/discovery/search
Entity interlinking & dataset interlinking recommendation
Dataset profiling
Data consistency & conflicts
Research areas
Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)
Application domains: education/TEL, Web archiving, …
Some projects
Introduction
http://www.l3s.de/
See also: http://purl.org/dietze
…why are there so few datasets actually used?
Data reuse and in-links focused on trusted „reference graphs“ such as DBpedia, Freebase, etc.
Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone 300+ datasets, 50 bn triples)
Explanations?
Linked Data is awesome, but...
„HTTP-accessibility“ (SPARQL, URI-dereferencing)
„Structure“ & „Semantics“ (=> shared/linked vocabularies)
„Interlinked“
„Persistent“
Hm,
really?
Linked data is more diverse (and messy) than we think
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of datasets?
Less than 50% of all SPARQL endpoints are actually responsive at any given point in time [Buil-Aranda2013]
“THE” SPARQL protocol? No, but many variants & subsets
“Semantics”, links, quality?
…data accuracy (e.g. DBpedia)? [Paulheim2013]
…vocabulary reuse? [D’AquinWebSci13]
…schema compliance (RDFS, schemas) [HoganJWS2012]
SPARQL Web-Querying Infrastructure: Ready for Action?, Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., International Semantic Web Conference 2013 (ISWC2013).
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science, Volume 8218, 2013, pp. 510-525.
An Empirical Survey of Linked Data Conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012.
What about data consistency?
Analyzing Relative Incompleteness of Movie Descriptions
in the Web of Data: A Case Study, Yuan, W., Demidova, E.,
Dietze, S., Zhu, X., International Semantic Web Conference
2014 (ISWC2014)
Too many/diverse datasets, too little knowledge
Topics? Which datasets are useful & trustworthy for case XY (e.g. „learning about the solar system“)? Which topics are covered?
Types? Which datasets describe statistics, videos, slides, publications etc?
Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?
db:Astro. Objects
Dataset Metadata
BIBO
AIISO
FOAF
contains
Entity & dataset disambiguation & linking [ESWC13]
Topic profile extraction [WWW13, ESWC14]
db:Astronomy
db:Astro. Objects
Dataset
Catalog/Registry
yov:Video
po:Programme
BBC Programme
<po:Programme …>
<po:Series>Wonders of the Solar System</.>
<po:Actor>Brian Cox</…>
</po:Programme…>
<yo:Video …>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video…>
Yovisto Video
bibo:Film
Schema mappings [WebSci13]
Data mapping, linking and profiling
Schemas/vocabularies on the Web: XKCD 927
https://xkcd.com/927/ schemas & vocabularies
Schema assessment and mapping
Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
po:Programme sioc:Item
yov:Video
?
Schema assessment and mapping
Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
Co-occurrence after mapping into most frequent schemas (201 frequent types mapped into 79 classes)
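The effect of mapping types into shared classes can be illustrated with a minimal Python sketch. The dataset contents and the `TYPE_MAPPING` table below are hypothetical, since the slides do not list the actual mappings; the sketch only shows how co-occurrence and type frequency can be counted before and after mapping.

```python
from collections import Counter
from itertools import combinations

# Hypothetical type usage per dataset (illustrative only, not measured data).
DATASET_TYPES = {
    "yovisto": {"yov:Video"},
    "bbc": {"po:Programme", "foaf:Document"},
    "forum": {"sioc:Item", "foaf:Document"},
}

# Hypothetical mapping of source types into more frequent target classes.
TYPE_MAPPING = {
    "yov:Video": "bibo:Film",
    "po:Programme": "bibo:Film",
    "sioc:Item": "foaf:Document",
}


def map_types(types, mapping):
    """Replace each type by its mapped class; unmapped types stay as they are."""
    return {mapping.get(t, t) for t in types}


def type_cooccurrence(dataset_types):
    """Count in how many datasets each pair of types occurs together."""
    counts = Counter()
    for types in dataset_types.values():
        for pair in combinations(sorted(types), 2):
            counts[pair] += 1
    return counts


def datasets_per_type(dataset_types):
    """Count how many datasets use each type."""
    counts = Counter()
    for types in dataset_types.values():
        counts.update(types)
    return counts


mapped = {name: map_types(t, TYPE_MAPPING) for name, t in DATASET_TYPES.items()}
```

In this toy setting, `yov:Video` and `po:Programme` each occur in a single dataset before mapping, while the shared class `bibo:Film` covers two datasets afterwards, which is what makes type-based queries across datasets possible.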
bibo:Film
bibo:Document
po:Programme sioc:Item
foaf:Document
yov:Video
Application: LinkedUp Data Catalog in a nutshell
http://datahub.io/group/linked-education
RDF (VoID) dataset catalog: browse & query distributed datasets
Federated queries using type mappings
Live information about endpoint accessibility
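A responsiveness check of the kind implied by the list above could look like the following Python sketch. It is an assumption about how such a probe might work, not the catalog's actual implementation: the `ASK` probe query, the 10-second timeout and the "HTTP 200 means responsive" rule are all illustrative choices, and the endpoint URL is assumed to carry no query string of its own.

```python
import urllib.parse
import urllib.request


def build_probe_url(endpoint, query="ASK { ?s ?p ?o }"):
    """Encode a minimal SPARQL query as an HTTP GET request URL."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def http_status(url, timeout=10.0):
    """Fetch the URL and return the HTTP status code (needs network access)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def is_responsive(endpoint, fetch=http_status):
    """Treat an endpoint as responsive if the probe returns HTTP 200.

    `fetch` is injectable so the logic can be exercised without a network.
    """
    try:
        return fetch(build_probe_url(endpoint)) == 200
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Running `is_responsive` periodically over all catalogued endpoints and storing the results would give an availability-over-time record per dataset.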
http://data.linkededucation.org/linkedup/catalog/
DBpedia categories
contains yov:Video
po:Programme
BBC Programme
<po:Programme …>
<po:Series>Wonders of the Solar System</.>
<po:Actor>Brian Cox</…>
</po:Programme…>
<yo:Video …>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video…>
Yovisto Video
Towards profiling: dataset disambiguation/linking
Relatedness of entities, meaningfulness of paths? [ESWC13]
Extraction of “topics” & relatedness of datasets [ESWC14]
db:Astro. Objects
db:CartoonCharacters
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
db:Pluto (Dwarf Planet)
db:Astronomical Objects
db:Sun
db:Astronomy
Computation of connectivity scores between entities
Combination of (i) a semantic (graph-based) connectivity score (SCS) with (ii) a Web co-occurrence-based measure (CBM), similar to the Normalized Google Distance (NGD)
For (i): adaptation of the Katz index from social network analysis (SNA) to (linked) data graphs, considering the number and lengths of paths over transversal properties
SCS = 0.32
CBM = 0.24
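The two ingredients named above can be sketched in Python under stated assumptions. `katz_score` is the textbook truncated Katz index (damped count of walks up to a maximum length), not the adapted SCS from [ESWC13]; `ngd` is the standard Normalized Google Distance, which the slide only describes as similar to CBM. The toy graph, `beta` and `max_len` are hypothetical, and the sketch does not reproduce the SCS and CBM values shown on the slide.

```python
import math

# Hypothetical undirected toy graph as an adjacency list (illustrative only).
GRAPH = {
    "db:Pluto": ["db:Astronomical_objects"],
    "db:Sun": ["db:Astronomical_objects", "db:Astronomy"],
    "db:Astronomical_objects": ["db:Pluto", "db:Sun", "db:Astronomy"],
    "db:Astronomy": ["db:Astronomical_objects", "db:Sun"],
}


def katz_score(graph, source, target, beta=0.5, max_len=3):
    """Sum beta**l * (number of walks of length l) for l = 1..max_len."""
    counts = {source: 1}  # number of walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        next_counts = {}
        for node, n_walks in counts.items():
            for neighbour in graph.get(node, ()):
                next_counts[neighbour] = next_counts.get(neighbour, 0) + n_walks
        counts = next_counts
        score += (beta ** length) * counts.get(target, 0)
    return score


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: pages containing each term; fxy: pages containing both;
    n: assumed index size (must exceed both fx and fy).
    """
    if fxy == 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

In the toy graph, `db:Pluto` and `db:Sun` are joined by one walk of length 2 and one of length 3, so shorter connections dominate the score because of the damping factor. A lower `ngd` value means the two terms co-occur more often relative to their individual frequencies.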
Dataset disambiguation/linking
Entity linking: evaluation
Evaluation based on USA Today news items (80,000 entity pairs)
Manually created gold standard (1,000 entity pairs)
Baseline: Explicit Semantic Analysis (ESA)
=> CBM/SCS: „relatedness“; ESA: „similarity“
Precision/Recall/F1 for SCS, CBM, ESA.
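For reference, precision, recall and F1 over entity pairs can be computed as in the following Python sketch. The entity names are placeholders and the pairs are made up; how the original evaluation thresholded the SCS, CBM and ESA scores into predicted pairs is not stated on the slide.

```python
def pair(a, b):
    """Represent an unordered entity pair."""
    return frozenset((a, b))


def precision_recall_f1(predicted, gold):
    """Compare a set of predicted related pairs against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two predicted pairs is in the gold standard.
gold_pairs = {pair("A", "B"), pair("A", "C")}
predicted_pairs = {pair("B", "A"), pair("B", "C")}
```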
„SCS Connector“ demo
http://lod2.inf.puc-rio.br/scs/SemConnectivities
SCS Connector – Quantifying and Visualising Semantic
Paths between Entity Pairs, Nunes, B. P., Herrera, J. E. T.,
Taibi, D., Lopes, G. R., Casanova, M. A., Dietze, S., Demo
Paper at 11th Extended Semantic Web Conference
(ESWC2014), Heraklion, Crete, Greece (2014).
*BEST ESWC2014 DEMO AWARD*
Dataset Metadata
db:Astronomy
db:Astro. Objects
Dataset
Catalog/Registry
yov:Video
<yo:Video …>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video…>
Yovisto Video
Extracting representative (DBpedia) categories („topic profile“) & entities for arbitrary datasets
Sounds easy? But how to do that for 300+ datasets with < 50 bn triples?
Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance [ESWC2014] (applied to all responsive LOD datasets)
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
Dataset profiling: what‘s the data about?
db:Pluto (Dwarf Planet)
Efficient dataset profiling: method
1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling)
2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion)
3. Normalisation and ranking (using graphical models such as PageRank with Priors, HITS with Priors and K-Step Markov)
Result: weighted dataset-topic profile graph
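Steps 1 and 3 above can be sketched in Python as follows. This is a generic illustration, not the implementation from [ESWC2014]: the sampling weights, the toy entity-to-category edges, the priors and the damping factor are all hypothetical, and `pagerank_with_priors` is a plain personalised PageRank rather than the exact variant used in the paper.

```python
import random


def weighted_sample(weights, k, seed=0):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys).

    `weights` maps a resource to a positive weight, e.g. how richly it is
    described (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / w), item) for item, w in weights.items() if w > 0]
    keyed.sort(reverse=True)
    return [item for _, item in keyed[:k]]


def pagerank_with_priors(edges, priors, damping=0.85, iterations=50):
    """Personalised PageRank: random jumps follow `priors` instead of a uniform vector."""
    nodes = set(priors)
    for src, dst in edges:
        nodes.update((src, dst))
    total = sum(priors.values())  # assumed to be positive
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * prior[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for dst in out[n]:
                    nxt[dst] += share
            else:
                # Dangling node: hand its mass back according to the priors.
                for m in nodes:
                    nxt[m] += damping * rank[n] * prior[m]
        rank = nxt
    return rank


# Hypothetical entity -> category edges and entity priors (e.g. sample frequency).
EDGES = [
    ("db:Pluto", "cat:Dwarf_planets"),
    ("db:Pluto", "cat:Astronomical_objects"),
    ("db:Sun", "cat:Astronomical_objects"),
    ("db:Sun", "cat:Stars"),
]
PRIORS = {"db:Pluto": 2.0, "db:Sun": 1.0}
ranks = pagerank_with_priors(EDGES, PRIORS)
```

Categories reached from several, or from more heavily weighted, sampled entities end up with a higher rank, which gives the weights for a dataset-topic profile graph.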
Dataset profiling: exploring LOD datasets/topics in a nutshell
http://data-observatory.org/lod-profiles/
Automatic extraction of dataset “topics” [ESWC2014] => RDF/VoID dataset profiles
Visualisation & exploration of dataset-topic graph (datasets, topics, relationships)
Includes all (responsive) datasets of LOD Cloud
Dataset profiling: evaluation
NDCG (averaged over all datasets).
Datasets & Ground Truth
Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood
Crowd-sourced topic indicators from datasets (keywords, tags)
Manual mapping to entities & category extraction (ranking according to frequency)
Baselines
1) LDA, 2) tf-idf (applied to entire datasets)
Topic extraction according to our approach, weighting/ranking based on term weight
Measure
NDCG @ rank l
Performance (time/NDCG) for different sampling strategies/sizes etc
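The measure above can be sketched in Python as follows, using the common formulation with graded relevance and a log2 rank discount. Whether the evaluation used exactly this gain and discount is not stated on the slide, and the sketch derives the ideal ranking from the scored list itself, which assumes all relevant topics appear in that list.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores given in ranked order."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1)
    )


def ndcg_at(relevances, l):
    """NDCG at rank l: DCG of the top-l results divided by the ideal top-l DCG."""
    ideal = dcg(sorted(relevances, reverse=True)[:l])
    return dcg(relevances[:l]) / ideal if ideal > 0 else 0.0
```

A profile that ranks its topics in the same order as the ground truth scores 1.0; placing relevant topics lower in the ranking pushes the value towards 0.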
What (dataset) do these categories have in common?
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Category:Semantic_Web
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
dbp:Category:World_Wide_Web
dbp:Category:Royal_Medal_winners
Diversity of category profile for a single publication
Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web". Scientific American Magazine.
foaf:Person foaf:Document
dbp:Tim_Berners-Lee
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Semantic_Web
dbp:Category:Semantic_Web
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
first-level categories (dcterms:subject)
dbp:Category:World_Wide_Web
dbp:Category:Royal_Medal_winners
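First-level categories like the ones listed above can be retrieved with a SPARQL query over `dcterms:subject`. The Python sketch below only builds the query string; the optional expansion via `skos:broader` (a SPARQL 1.1 property path) is an assumption about how category expansion might be expressed, not the method used in the talk.

```python
def category_query(entity_uri, expand=False):
    """Build a SPARQL query for the categories of a DBpedia entity.

    With expand=True, the query returns the parent categories of the
    first-level categories instead (one skos:broader step).
    """
    path = "dcterms:subject/skos:broader" if expand else "dcterms:subject"
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        f"SELECT DISTINCT ?category WHERE {{ <{entity_uri}> {path} ?category }}"
    )


query = category_query("http://dbpedia.org/resource/Tim_Berners-Lee")
```

Sending this query to a DBpedia SPARQL endpoint for each extracted entity yields heterogeneous category sets of the kind shown on this slide, which is why a ranking step is needed before categories are used as a dataset profile.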
DBLP
http://data-observatory.org/led-explorer/
Type-specific views on datasets/categories
“Document” (foaf:Document)
“Person” (foaf:Person)
“Course” (aiiso:Course)
Currently applied to datasets in LinkedUp Catalog only (as schema mappings already available here)
Type-specific exploration of dataset categories
Exploring type-specific topic profiles of datasets:
a demo for educational linked data, Taibi, D.,
Dietze, S., Fetahu, B., Fulantelli, G., Demo at
International Semantic Web Conference 2014
(ISWC2014)
data.l3s.de – the L3S DataHub
KEYSTONE & PROFILES 2014
http://www.keystone-cost.eu/
KEYSTONE: semantic keyword-based search on structured data sources (2013-2017)
Research network focused on distributed search and dataset profiling across Semantic Web, databases, etc.
Open to new members (beyond Europe)
http://www.keystone-cost.eu/profiles
http://www.ijswis.org/?q=node/51/
PROFILES2014 - Dataset PROFIling & fEderated Search for Linked Data
Workshop collocated with ESWC2014
IJSWIS Special Issue on … LD search & profiling
Deadline 8 December 2014
Summing up
Summary
Increasing amounts of data => require knowledge about nature and relationships of datasets
Profiling: scalable methods for extracting dataset metadata
Interlinking: connectivity of entities or datasets
What about LD evolution?
In RDF graphs (e.g. LOD Cloud), „all“ nodes are connected
Impact of evolution on preservation, linking and enrichment?
Which parts of datasets to preserve (entity „neighbourhood“)? => semantic relatedness/relevance/entity retrieval
Link correctness in evolving LD?
….
Thank you!
See also (general)
http://purl.org/dietze
http://linkedup-project.eu
http://duraark.eu
http://data.l3s.de
See also (data)
http://data.l3s.de
http://data.linkededucation.org
http://lak.linkededucation.org
Besnik Fetahu (L3S)
Elena Demidova (L3S)
Bernardo Pereira Nunes (PUC Rio)
Marco Casanova (PUC Rio)
Luiz Andre Paes Leme (PUC Rio)
Giseli Lopes (PUC Rio)
Davide Taibi (CNR, IT)
Mathieu d’Aquin (Open University, UK)
and many more…
Acknowledgements