Slide 1

How to Juggle with more than a Billion Triples?

Ansgar Scherp
Research Group on Data and Web Science
Universität Mannheim
October 2012

Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/

Linked open data - how to juggle with more than a billion triples


DESCRIPTION

Slides of my inauguration talk at the University of Mannheim in Germany in October 2012. Download this slide set to enjoy all animations.



Slide 2

My thanks go to …

• Marianna
• Simon Schenk
• Carsten Saathoff
• Thomas Franz
• Thomas Gottron
• Steffen Staab
• Arne Peters
• Bastian Krayer
• Daniel Eißing
• Mathias Konrath
• Daniel Schmeiß
• Anton Baumesberger
• Frederik Jochum
• Alexander Kleinen

And many more …

Slide 3

Scenario

• Tim plans to travel
  – from London
  – to a customer in Cologne

Slide 4

Website of the German Railway

It works, why bother…?

Eurostar

DB

KVB

Slide 5

Let's Try Different Queries

• Bottlenecks in public transportation?
• Compare the connections with flights?
• Visualize on a map?
• …

All these queries cannot be answered, because the data …

Slide 6

… is locked in silos!

– High integration effort
– Lack of data reuse

B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY

Slide 7

Linked Data

• Publishing and interlinking of data
• Different quality and purpose
• From different sources in the Web

World Wide Web      Linked Data
Documents           Data
Hyperlinks          Typed Links
HTML                RDF
Addresses (URIs)    Addresses (URIs)

Example: http://www.uni-mannheim.de/

Slide 8

Relevance of Linked Data?

Slide 9

Linked Data: May '07

[Figure: LOD cloud diagram, Sept. '11, < 31 billion triples. Domains: Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, Life Sciences]

Source: http://lod-cloud.net

Slide 10

Linked Data Principles

1. Identification
2. Interlinkage
3. Dereferencing
4. Description

Slide 11

Example: Big Lynx

Big Lynx Company

Matt Briggs

Scott Miller

Source: http://lod-cloud.net

< 31 billion triples

?

Slide 12

1. Use URIs for Identification

B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

http://biglynx.co.uk/people/matt-briggs

http://biglynx.co.uk/people/scott-miller

Matt Briggs

Scott Miller

Slide 13

Example: Big Lynx

Big Lynx Company

Matt Briggs

Scott Miller

How to model relationships like knows?

Slide 14

• Description of resources with RDF triples

Matt Briggs is a Person

Resource Description Framework (RDF)

Subject Predicate Object

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
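The triple on this slide can also be written down programmatically. The following is a minimal sketch, assuming the standard RDF and FOAF namespace URIs; the helper `expand` and the table `PREFIXES` are hypothetical names used only for illustration and do not come from the talk.

```python
# Prefix table (assumption: the standard RDF and FOAF namespaces).
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(term: str) -> str:
    """Expand a prefixed name such as 'foaf:Person' to a full URI.

    Hypothetical helper; full http(s) URIs are returned unchanged.
    """
    if term.startswith("http://") or term.startswith("https://"):
        return term
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local


# A triple is simply (subject, predicate, object).
triple = (
    expand("http://biglynx.co.uk/people/matt-briggs"),
    expand("rdf:type"),
    expand("foaf:Person"),
)
print(triple)
```

Storing triples as plain tuples keeps the sketch free of dependencies; an RDF library would normally handle prefix expansion and parsing.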

Slide 15

1. Use URIs also for Relations

B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

http://biglynx.co.uk/people/matt-briggs

http://biglynx.co.uk/people/scott-miller

foaf:knows

foaf:knows

Slide 16

Example: Big Lynx

Big Lynx Company

DBpedia

Matt's private website

"same person"

Matt Briggs

"lives here"

Dave Smith

London

Matt Briggs

Scott Miller

Slide 17

2. Establishing Interlinkage

• Relation links between resources

<http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .

• Identity links between resources

<http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .
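Identity links let a consumer treat several URIs as names for the same thing. The sketch below shows one way this could be done, assuming triples are held as plain tuples with prefixed predicate names; the function `same_as_groups` is a hypothetical helper and uses a union-find structure that the talk itself does not mention.

```python
def same_as_groups(triples):
    """Group URIs connected by owl:sameAs into identity sets (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)  # merge the two identity sets

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


triples = [
    ("http://biglynx.co.uk/people/dave-smith", "foaf:based_near",
     "http://dbpedia.org/resource/London"),
    ("http://biglynx.co.uk/people/matt-briggs", "owl:sameAs",
     "http://www.matt-briggs.eg.uk#me"),
]
groups = same_as_groups(triples)
print(groups)
```

Relation links such as foaf:based_near are ignored here; only the identity link merges the two URIs for Matt Briggs into one set.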

Slide 18

Example: Big Lynx

DBpedia

Big Lynx Company

Matt's private website

foaf:based_near

owl:sameAs

Matt Briggs

Dave Smith

London

Matt Briggs

"same person"

"lives here"

Slide 19

3. Dereferencing of URIs

• Looking up web documents
• How can we "look up" things of the real world?

http://biglynx.co.uk/people/matt-briggs

Slide 20

Two Approaches

1. Hash URIs
– URI contains a part separated by #, e.g., http://biglynx.co.uk/vocab/sme#Team

2. Negotiation via an HTTP "303 See Other" response
– Request: http://biglynx.co.uk/people/matt-briggs
– Response: "Look here:" http://biglynx.co.uk/people/matt-briggs.rdf
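The two approaches can be contrasted in a few lines. This is a sketch under stated assumptions: the redirect is simulated with a dictionary (`SEE_OTHER`, a hypothetical stand-in for a server answering with status 303 and a Location header), so no network access is needed, and `document_uri` is an illustrative name.

```python
from urllib.parse import urldefrag

# Simulated server: maps a "thing" URI to the document describing it.
# Assumption: mirrors the slide's example; a real server would reply
# with HTTP status 303 and a Location header instead.
SEE_OTHER = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def document_uri(uri: str) -> str:
    """Return the URI of the document to fetch for a given resource URI."""
    base, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the client strips the fragment before the HTTP request.
        return base
    # 303 URI: follow the (simulated) redirect, if there is one.
    return SEE_OTHER.get(uri, uri)


print(document_uri("http://biglynx.co.uk/vocab/sme#Team"))
print(document_uri("http://biglynx.co.uk/people/matt-briggs"))
```

Hash URIs need no extra round trip because the fragment never reaches the server, whereas the 303 approach costs one redirect per lookup.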

Slide 21

Example: Big Lynx

DBpedia

Big Lynx Company

Matt's private website

foaf:based_near

owl:sameAs

Matt Briggs

Dave Smith

London

Matt Briggs

Description of Matt?

Slide 22

4. Description of URIs

[Figure: RDF graph describing biglynx:matt-briggs. Nodes: biglynx:matt-briggs, foaf:Person, biglynx:dave-smith, dp:Birmingham, dp:London, _:point, "-0.118", "51.509". Edge labels: rdf:type, foaf:knows, foaf:based_near (twice), ex:loc, wgs84:lat, wgs84:long.]

Slide 23

Formalization of Description

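Given an RDF graph $G = (V, P, E)$ with

\[ V = R \cup B \cup L \quad \text{and} \quad E \subseteq (R \cup B) \times P \times V \]

\[ \mathrm{SimpleCBD}(n) = \bigcup_{j \ge 0} I_j \quad \text{with} \]

\[ I_0 = \{ (s, p, o) \mid (s, p, o) \in E \wedge s = n \} \]

\[ I_{j+1} = \{ (o, p', o') \in E \mid \exists (s, p, o) \in I_j : o \in B \wedge (o, p', o') \notin \bigcup_{k=0}^{j} I_k \} \]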
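The definition can be read as an iterative expansion: start with all triples whose subject is n, then keep following objects that are blank nodes. The sketch below is one possible reading, assuming edges are stored as a set of tuples and blank nodes are given as a separate set; the function name `simple_cbd` and the example graph layout are assumptions for illustration, loosely based on the labels of the preceding slide.

```python
def simple_cbd(edges, blank_nodes, n):
    """Collect all triples with subject n, then follow blank-node objects."""
    result = set()
    frontier = {t for t in edges if t[0] == n}  # I_0
    while frontier:
        result |= frontier
        blanks = {o for (_, _, o) in frontier if o in blank_nodes}
        # I_{j+1}: triples whose subject is a blank object of I_j, not yet seen
        frontier = {t for t in edges if t[0] in blanks} - result
    return result


# Hypothetical graph; which node carries which edge is an assumption.
edges = {
    ("biglynx:matt-briggs", "rdf:type", "foaf:Person"),
    ("biglynx:matt-briggs", "foaf:knows", "biglynx:dave-smith"),
    ("biglynx:matt-briggs", "ex:loc", "_:point"),
    ("_:point", "wgs84:lat", "51.509"),
    ("_:point", "wgs84:long", "-0.118"),
    ("biglynx:dave-smith", "foaf:based_near", "dp:London"),
}
cbd = simple_cbd(edges, {"_:point"}, "biglynx:matt-briggs")
print(sorted(cbd))
```

The description stops at named resources such as biglynx:dave-smith, so his triples are not pulled in, while the blank node _:point is expanded.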

Slide 24

W3C RDF / RDF Schema Vocabulary

• rdf:type • rdf:Property • rdf:XMLLiteral • rdf:List • rdf:first • rdf:rest • rdf:Seq • rdf:Bag • rdf:Alt • ... • rdf:value

• rdfs:domain • rdfs:range • rdfs:Resource • rdfs:Literal • rdfs:Datatype • rdfs:Class • rdfs:subClassOf • rdfs:subPropertyOf • rdfs:comment • … • rdfs:label

• Set of URIs defined in the rdf: and rdfs: namespaces

Slide 25

Semantic Web Layer Cake (Simplified)

Slide 26

Exploration of Linked Data

Source: http://lod-cloud.net

< 31 Billion Triples

WordNet

Swoogle

GeoNames

Slide 27

Naive Approach

• Download all data
• Store in really big database
• Programming of queries
• Design of user interface

Swoogle

WordNet

GeoNames

Inflexible

Monolithic

Not scalable

RDFS

Rules Geo

Slide 28

SemaPlorer Approach

GeoQueries

GeoNames

Flexible

Scalable

Extensible

RDFS Rules

WordNet Swoogle

Fulltext

12 months in 2005/06, 700 million triples

+ + +

> 1 Billion Triples

placeOfBirth

birthplace

birthplace

+

Slide 29

SemaPlorer – Semantic Social Media

Watch video online: http://vimeo.com/2057249

Slide 30

Billion Triple Challenge 2008

2008

[JWS 2009]

Slide 31

Searching for Linked Data Sources

Source: http://lod-cloud.net

< 31 billion triples

Persons that are
- Politicians and
- Actors?

Slide 32

Idea: Index of Data Sources

“Politician and Actor”

?

Query

SELECT ?x
FROM …
WHERE {
  ?x rdf:type ex:Actor .
  ?x rdf:type ex:Politician .
}

Index

Slide 33

The Naive Approach

1. Download the entire LOD cloud
2. Put it into a (really) large triple store
3. Process the data and extract schema
4. Provide lookup

- Big machinery
- Late in processing the data
- High effort to scale with LOD cloud

Can we do it smarter?

Slide 34

Idea: Schema-level index
• Define families of graph patterns
• Assign instances to graph patterns
• Map graph patterns to context (source URI)

Construction
• Stream-based for scalability
• Little loss of accuracy

Note
• Index defined over instances
• But stores the context

Slide 35

Input Data: N-Quads

<subject> <predicate> <object> <context>

Example:
<http://www.w3.org/People/Connolly/#me>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>
<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>

w3p:#me

foaf:Person

http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf
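Splitting such a line into its four parts might look as follows. This is a deliberately simplified sketch: the pattern `NQUAD` and the function `parse_nquad` are hypothetical, handle only IRIs in angle brackets, and ignore literals and blank nodes, which real N-Quads data also contains.

```python
import re

# Simplified pattern: four IRIs in angle brackets, optional trailing dot.
# Assumption: literals and blank nodes are not handled in this sketch.
NQUAD = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$"
)


def parse_nquad(line: str):
    """Split one n-quad line into (subject, predicate, object, context)."""
    match = NQUAD.match(line)
    if match is None:
        raise ValueError(f"not a simple IRI-only n-quad: {line!r}")
    return match.groups()


line = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>"
)
subject, predicate, obj, context = parse_nquad(line)
print(subject, predicate, obj, context, sep="\n")
```

The fourth element is the context, that is, the data source in which the triple was found; this is the part a schema-level index ultimately points to.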

Slide 36

SchemEX Approach
• Stream-based schema extraction
• While crawling the data

[Figure: SchemEX processing pipeline. Labels: LOD crawler, RDF dump, triple store, parser, NxParser, N-Quad stream, instance cache, FIFO, schema extractor, schema-level index, RDF, RDBMS.]

Slide 37

Building the Index from a Stream

• Stream of n-quads (coming from an LD crawler)

… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1

[Figure: n-quads pass through a FIFO cache; node labels 1 to 6 and class labels C1, C2, C3.]

• Linear runtime complexity with respect to the number of input triples
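The stream-based idea can be illustrated with a bounded FIFO cache of instances. The following sketch is not the authors' implementation: `build_index`, its parameters, and the choice to key the index by the set of rdf:type values are assumptions made for illustration. Each quad is processed once, which is where the linear runtime comes from; an instance evicted too early is indexed with incomplete information.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_index(quads, cache_size):
    """Map each observed type set to the contexts where it was seen."""
    cache = OrderedDict()     # subject -> (set of types, set of contexts)
    index = defaultdict(set)  # frozenset of types -> set of contexts

    def flush(types, contexts):
        if types:
            index[frozenset(types)] |= contexts

    for s, p, o, c in quads:
        if s not in cache:
            if len(cache) >= cache_size:
                # Evict the oldest instance and write it to the index.
                _, (types, contexts) = cache.popitem(last=False)
                flush(types, contexts)
            cache[s] = (set(), set())
        types, contexts = cache[s]
        contexts.add(c)
        if p == RDF_TYPE:
            types.add(o)

    for types, contexts in cache.values():  # flush the rest at end of stream
        flush(types, contexts)
    return dict(index)


quads = [
    ("a", RDF_TYPE, "foaf:Person", "ds1"),
    ("b", RDF_TYPE, "foaf:Person", "ds2"),
    ("a", RDF_TYPE, "pim:Male", "ds1"),
]
big = build_index(quads, cache_size=10)
small = build_index(quads, cache_size=1)
print(big)
print(small)
```

With a cache of size 10, instance "a" is indexed under both of its types. With a cache of size 1, "a" is evicted before its second type arrives and ends up split across two entries, which shows the kind of accuracy loss a small window can cause.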

Slide 38

Building the Schema and Index

[Figure: layered schema. Data sources: DS1, DS2, DS3, DS4, DS5, DSx. Equivalence classes: EQC1, EQC2, EQCn. Type clusters: TC1, TC2, TCm, …. RDF classes: C1, C2, C3, Ck, …. Edge labels: consistsOf, hasDataSource, hasEQClass, p1, p2.]

Slide 39

Layer 1: RDF Classes

timbl:card#i

foaf:Person

foaf:Person

http://www.w3.org/People/Berners-Lee/card

http://dig.csail.mit.edu/2008/...

C1

DS 3   DS 2   DS 1

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
}

All instances of a particular type

Slide 40

Layer 2: Type Clusters

timbl:card#i

foaf:Person

foaf:Person

http://www.w3.org/People/Berners-Lee/card

DS 3   DS 2   DS 1

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
}

C1

C2

pim:Male

tc4711

pim:Male

All instances belonging to exactly the same set of types

TC1
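A lookup over type clusters could be sketched as follows. The index contents, the assignment of data sources to clusters, and the rule that a query matches every type cluster containing all requested types are assumptions for illustration; `sources_for` is a hypothetical name.

```python
def sources_for(index, query_types):
    """Return data sources of every type cluster containing all query types."""
    wanted = frozenset(query_types)
    hits = set()
    for type_cluster, sources in index.items():
        if wanted <= type_cluster:  # cluster has at least the queried types
            hits |= sources
    return hits


# Hypothetical index: type cluster -> data sources.
index = {
    frozenset({"foaf:Person", "pim:Male"}): {"DS 1", "DS 2"},
    frozenset({"foaf:Person"}): {"DS 3"},
}
both = sources_for(index, {"foaf:Person", "pim:Male"})
person = sources_for(index, {"foaf:Person"})
print(both, person)
```

The query for both types only needs the sources of the first cluster, so a client can skip DS 3 entirely.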

Slide 41

Layer 3: Equivalence Classes

Two instances are equivalent iff:
• They are in the same TC
• They have the same properties
• The property targets are in the same TC

Similar to 1-Bisimulation

EQC1

DS 3   DS 2   DS 1

C1

C2

TC1

C3

TC2

p
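One reading of the three conditions is that two instances are equivalent when they share the same key, built from their own type cluster and the set of (property, target type cluster) pairs. The sketch below follows that reading; the function names and the instances "alice", "bob", "card1", and "card2" are hypothetical.

```python
def type_cluster(instance, types):
    """The set of all types of an instance (empty if none are known)."""
    return frozenset(types.get(instance, ()))


def equivalence_key(instance, types, edges):
    """Key that is equal for two instances iff the three conditions hold."""
    links = frozenset(
        (p, type_cluster(o, types)) for (s, p, o) in edges if s == instance
    )
    return (type_cluster(instance, types), links)


# Hypothetical data for illustration.
types = {
    "timbl": {"foaf:Person", "pim:Male"},
    "alice": {"foaf:Person", "pim:Male"},
    "bob": {"foaf:Person", "pim:Male"},
    "card1": {"foaf:PersonalProfileDocument"},
    "card2": {"foaf:PersonalProfileDocument"},
}
edges = [
    ("timbl", "foaf:maker", "card1"),
    ("alice", "foaf:maker", "card2"),
    ("bob", "foaf:knows", "alice"),
]
key_timbl = equivalence_key("timbl", types, edges)
key_alice = equivalence_key("alice", types, edges)
key_bob = equivalence_key("bob", types, edges)
print(key_timbl == key_alice, key_timbl == key_bob)
```

All three instances share a type cluster, but only "timbl" and "alice" link via the same property to targets in the same type cluster, so only they fall into one equivalence class.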

Slide 42

Layer 3: Equivalence Classes

timbl:card#i

foaf:Person

foaf:Person

http://www.w3.org/People/Berners-Lee/card

pim:Male

tc4711

pim:Male

foaf:PPD

timbl:card

eqc0815

foaf:PPD

tc1234

eqc0815-maker-tc1234

foaf:maker

SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}

Slide 43

Computing SchemEX: TimBL Data Set

• Analysis of a smaller data set
• 11 M triples, TimBL's FOAF profile
• LDspider with ~ 2k triples / sec
• Different cache sizes: 100, 1k, 10k, 50k, 100k
• Compared SchemEX with reference schema
• Index queries on all Types, TCs, EQCs
• Good precision/recall ratio at 50k+

• Commodity hardware (4GB RAM, single CPU)

Slide 44

Quality of Stream-based Index Construction

+ Runtime hardly increases with window size
+ Memory consumption scales with window size

Slide 45

Computing SchemEX: Full BTC 2011 Data

Cache size: 50 k

Slide 46

Billion Triple Challenge 2011

[JWS 2012]

Slide 47

And 2012? Get the Google Feeling!

Slide 48

Semantic Data Management Chain
• Research topics in a greater context

Use   Aggregate   Publish   Collect

OntoMDE

Core Ontologies

Kreuzverweis.com

SchemEX*

Mobile Facets

SemaPlorer*

* Winner of Billion Triple Challenge 2011/2008

See at: dws.informatik.uni-mannheim.de

Slide 49

Slide 50

Recommended Readings

• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x

• Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp: SemaPlorer - Interactive semantic exploration of data and media based on a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009). URL: http://dx.doi.org/10.1016/j.websem.2009.09.006

• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716

• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001