Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot University of Fribourg, June 19, 2015 Doctoral advisor Prof. Dr Philippe Cudré-Mauroux




Quiz


➢ Linked Data

➢ Big Data (3Vs)

➢ Semantic Web

➢ RDF

➢ Data Provenance


Life is about Links


Web of Documents

[Figure: a web of HTML and XML documents and an API]

➢ links between documents

➢ no structure, no semantics

➢ for human consumption

➢ cannot be processed by computers


silos gonna silo


What’s the Problem?


Web of Data


Web of Linked Data

[Figure: things connected to one another by links]

links define relationships between things

➢ a global database

➢ designed for machines

➢ links between things

➢ explicit semantics


How Does it Change our Life?


➢ a company is moving to another city
  ○ aggregated information on taxes, prices, salaries, unemployment, climate

➢ I’m moving to another country
  ○ mobile providers, bank accounts

➢ you’re buying a house
  ○ crime statistics, weather, house prices, neighbourhood, traffic information

➢ media annotation
  ○ video annotated with Linked Data that always retrieves an up-to-date bio of the speaker


Linked Geo Data


Clean Air Status and Trends - Ozone


LSM: Live Linked Data


swissbib


Linked Open Data


9960 data collections from independent contributors


Linked Data


● Pushed by Sir Tim Berners-Lee

● Provide interoperable, machine-processable data on the Web
  ○ Machines cannot meaningfully parse HTML pages
  ○ Relational DBMSs were designed to be centralized (implicit schema, local data, outside of the Web)
  ○ XML is document-oriented, localized

● Now promoted by major vendors & actors
  ○ Google, Yahoo!, Facebook, IBM
  ○ Pharmaceuticals, Governments, News industry, etc.

➢ describes a method of publishing semi-structured data on the Web

➢ data can be interlinked

➢ builds upon standard Web technologies

➢ extends the standards to share information in a way that can be read automatically by computers


Big Data

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.


Linked Data is Big Data


➢ Volume: data size growing exponentially

➢ Velocity: streams of data from the Internet of Things Cloud

➢ Variety
  ○ semi-structured data
  ○ heterogeneous linked collections of data


Data Integration

➢ Integrated and summarized data

➢ Necessity to establish transparency


Data Provenance

“Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.”

Which pieces of data were used, and how were they combined, to produce the results?


New Challenges

unstructuredness

AND

heterogeneity


How to efficiently store and query vast amounts of Linked Data in the cloud?


➢ a new physiological data partitioning algorithm to efficiently and effectively partition the graph and co-locate related instances in partitions

➢ a new system architecture for handling fine-grained partitions at scale

➢ novel data placement techniques to co-locate semantically related pieces of data

➢ new data loading and query execution strategies taking advantage of our system’s data partitions and indices


How to store and track provenance in Linked Data processing?


➢ a new way to express the provenance of query results at two different granularity levels by leveraging the concept of provenance polynomials

➢ two new storage models to represent provenance data in a data store compactly

➢ query execution strategies to derive the provenance polynomials while executing the queries


How can we efficiently support queries tailored with provenance information?


➢ a characterization of provenance-enabled queries, that is, queries tailored with provenance data

➢ five provenance-oriented query execution strategies

➢ storage model and indexing techniques to handle provenance-aware query execution strategies


Contributions

➢ new advanced data co-location and partitioning techniques for efficient and scalable query processing in the Cloud

➢ first efficient provenance-aware database for Linked Data


Linked Data Technology Stack

➢ URI

➢ RDF

➢ SPARQL


URI: A Uniform Resource Identifier


“A Uniform Resource Identifier (URI) provides a simple and extensible means for identifying a resource.” -- RFC 3986

Some URIs for “real world” things:
https://www.linkedin.com/in/mwylot
http://dbpedia.org/page/Fribourg
http://www.geonames.org/2657895
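As a small illustration of what "identifying a resource" looks like to a program, the sketch below splits one of the example URIs into the generic components defined by RFC 3986, using Python's standard `urllib.parse` module. The variable names are only illustrative.

```python
from urllib.parse import urlsplit

# Example URI taken from the slide; everything else here is illustrative.
uri = "http://dbpedia.org/page/Fribourg"
parts = urlsplit(uri)

# RFC 3986 generic syntax: scheme, authority, path, optional query and fragment.
print(parts.scheme)  # scheme component
print(parts.netloc)  # authority component
print(parts.path)    # path component
```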


RDF Data Model


➢ Standard model for data interchange on the Web

➢ Statements about resources/things (triples)
  ○ Subject(URI) Predicate(URI) Object(URI) .
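A minimal sketch of the triple model, assuming statements are held as plain (subject, predicate, object) tuples. The `Triple` type, the `to_statement` helper and the identifiers such as `<article1>` are hypothetical placeholders, not real URIs or a real RDF library.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

# Hypothetical statements; the identifiers are placeholders, not real URIs.
graph = {
    Triple("<article1>", "<type>", "<article>"),
    Triple("<article1>", "<tag>", "<Obama>"),
    Triple("<article1>", "<title>", '"Some title"'),
}

def to_statement(t: Triple) -> str:
    """Render a triple in the 'Subject Predicate Object .' form shown above."""
    return f"{t.subject} {t.predicate} {t.object} ."

for line in sorted(to_statement(t) for t in graph):
    print(line)
```

Because the graph is just a set of statements, merging two data sets amounts to a set union, which is part of what makes the model convenient for data interchange.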


SPARQL Query Language


➢ query and manipulate RDF graph content

➢ required and optional graph patterns along with their conjunctions and disjunctions

➢ aggregation, subqueries, negation, creating values by expressions, extensible value testing, and constraining queries by source graph

➢ results can be result sets or graphs

SELECT ?t WHERE {
  ?a <type> <article> .
  ?a <tag> <Obama> .
  ?a <title> ?t .
}
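To make the semantics of the query above concrete, here is a rough sketch of how a conjunction of triple patterns can be evaluated by naive nested loops. This is not how a real SPARQL engine works internally; the `match` and `evaluate` helpers and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                       # variable term
            if extended.setdefault(term, value) != value:
                return None                            # clashes with an earlier binding
        elif term != value:                            # constant term must be equal
            return None
    return extended

def evaluate(patterns: List[Triple], data: List[Triple]) -> List[Binding]:
    """Naive nested-loop evaluation of a conjunction of triple patterns."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        next_bindings: List[Binding] = []
        for binding in bindings:
            for triple in data:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

# Hypothetical data set with two articles, only one of them tagged Obama.
data = [
    ("a1", "type", "article"), ("a1", "tag", "Obama"), ("a1", "title", "T1"),
    ("a2", "type", "article"), ("a2", "tag", "Sports"), ("a2", "title", "T2"),
]
query = [("?a", "type", "article"), ("?a", "tag", "Obama"), ("?a", "title", "?t")]
titles = [b["?t"] for b in evaluate(query, data)]
print(titles)
```

The shared variable `?a` is what turns three independent patterns into a join: a binding survives only if the same article satisfies all three.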


Outline

➢ Linked Data Management System

➢ Storing and Tracing Provenance

➢ Querying Provenance Information


Diplodocus


A new distributed Linked Data management system implementing a novel hybrid storage model based on flexible RDF templates.


System Architecture

[Figure: system architecture. Components: Hardcoded Query Processor and Query Optimizer; Lexicographic Hash Tree (URI -> KEY, KEY -> URI); Literal Values Manager; Cluster List Manager; Cluster Manager. Labels on the connections: templates, clusters, string, sid, oid.]
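The architecture lists a component that maps URI -> KEY and KEY -> URI. A minimal sketch of such a two-way dictionary encoding follows, with an ordinary Python dict and list standing in for the lexicographic hash tree. The class and method names are hypothetical and do not reflect the system's actual interfaces.

```python
from typing import Dict, List

class UriDictionary:
    """Two-way URI <-> integer key mapping (plain dict and list as a stand-in)."""

    def __init__(self) -> None:
        self._key_of: Dict[str, int] = {}
        self._uri_of: List[str] = []

    def encode(self, uri: str) -> int:
        """Return the key for `uri`, assigning the next free key if it is new."""
        key = self._key_of.get(uri)
        if key is None:
            key = len(self._uri_of)
            self._key_of[uri] = key
            self._uri_of.append(uri)
        return key

    def decode(self, key: int) -> str:
        """Return the URI that was assigned `key`."""
        return self._uri_of[key]

d = UriDictionary()
k = d.encode("http://dbpedia.org/page/Fribourg")
print(k, d.decode(k))
```

Replacing long URI strings with small integer keys lets the rest of a store compare and index fixed-size values, translating back to strings only when results are returned.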


Hybrid Storage

[Figure: three student molecules (Student787, Student67, Student28), each with is_a Student, takes Course, FirstName, LastName and StudentID edges; template T01; list of types.]

➢ Hybrid model

➢ Two perspectives (horizontal and vertical)

➢ Horizontal -> graph

➢ Vertical -> analytic

➢ Co-location
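The two perspectives can be loosely illustrated as follows. This is only a sketch under simplifying assumptions, not the system's storage layout: the horizontal view groups all triples of one subject into a co-located cluster, and the vertical view keeps the values of one attribute of one type together. The sample triples are modelled on the student example; the variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples modelled on the student example in the figure.
triples = [
    ("Student787", "is_a", "Student"), ("Student787", "takes", "Course18"),
    ("Student787", "StudentID", "4826"),
    ("Student67", "is_a", "Student"), ("Student67", "takes", "Course5"),
    ("Student67", "takes", "Course6"), ("Student67", "StudentID", "37347"),
]

# Horizontal view: co-locate everything about one subject in one cluster.
molecules = defaultdict(list)
for s, p, o in triples:
    molecules[s].append((p, o))

# Vertical view: for each (type, predicate) pair, keep the values together
# so that an analytic scan over one attribute touches a single list.
type_of = {s: o for s, p, o in triples if p == "is_a"}
columns = defaultdict(list)
for s, p, o in triples:
    if p != "is_a":
        columns[(type_of[s], p)].append(o)

print(molecules["Student67"])
print(columns[("Student", "StudentID")])
```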


Outline

➢ Linked Data Management System

➢ Storing and Tracing Provenance

➢ Querying Provenance Information


Physical Storage Models

Differences:

➢ ease of implementation

➢ memory consumption

➢ query execution

➢ interference with the original concept of molecule

1) SPOL
2) LSPO
3) SLPO
4) SPLO

S - Subject
P - Predicate
O - Object
L - graph label, context value
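One way to picture the four orderings is as nested indexes over (subject, predicate, object, label) quads, where the letters give the nesting order. The sketch below is an assumption-laden toy, not the physical layout used in the system; `build_index` and the sample quads are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, label)
POSITION = {"S": 0, "P": 1, "O": 2, "L": 3}

def build_index(quads: Iterable[Quad], order: str) -> Dict:
    """Nest the quad components in the given order, e.g. 'SPOL' or 'LSPO'."""
    root: Dict = {}
    for quad in quads:
        node = root
        for letter in order:
            node = node.setdefault(quad[POSITION[letter]], {})
    return root

# Hypothetical quads; the same triple appears under two context values.
quads = [
    ("a1", "title", "T1", "ctx1"),
    ("a1", "title", "T1", "ctx2"),
    ("a2", "title", "T2", "ctx1"),
]
spol = build_index(quads, "SPOL")
lspo = build_index(quads, "LSPO")
print(sorted(spol["a1"]["title"]["T1"]))  # labels attached to one triple
print(sorted(lspo["ctx1"]))               # subjects grouped under one label
```

With the label last (SPOL), provenance is an annotation on each triple. With the label first (LSPO), all data from one source sits together, which changes both memory use and which queries are cheap.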


Provenance Polynomials

➢ Ability to characterize the ways in which each source contributed

➢ Pinpoint the exact source of each result

➢ Trace back the list of sources and the way they were combined to deliver a result

Geerts, Floris, et al. "Algebraic structures for capturing the provenance of SPARQL queries." Proceedings of the 16th International Conference on Database Theory. ACM, 2013.


Example Polynomial

select ?lat ?long where {
  ?a [] "Eiffel Tower" .
  ?a inCountry FR .
  ?a lat ?lat .
  ?a long ?long .
}

(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
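The shape of the polynomial can be reproduced with a few lines of code, assuming that ⊕ combines the alternative sources that matched one triple pattern and ⊗ combines the patterns that were joined. The `polynomial` helper is hypothetical; the labels l1 to l9 are those of the example.

```python
from typing import List

def polynomial(sources_per_pattern: List[List[str]]) -> str:
    """Combine alternative sources with ⊕ and joined patterns with ⊗."""
    factors = ["(" + " ⊕ ".join(sources) + ")" for sources in sources_per_pattern]
    return " ⊗ ".join(factors)

# One list of source labels per triple pattern of the example query.
lineage = [["l1", "l2", "l3"], ["l4", "l5"], ["l6", "l7"], ["l8", "l9"]]
print(polynomial(lineage))
```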


Findings

➢ the overhead of tracing provenance is considerable but acceptable, on average about 60-70%

➢ the most suitable storage model depends on data and workload characteristics


Outline

➢ Linked Data Management System

➢ Storing and Tracing Provenance

➢ Querying Provenance Information


Provenance-Enabled Query

A Workload Query is a query producing results a user is interested in. These results are referred to as workload query results.

A Provenance Query is a query that selects a set of data from which the workload query results should originate.

A Provenance-Enabled Query is a pair consisting of a Workload Query and a Provenance Query, producing results a user is interested in (as specified by the Workload Query) and originating only from data pre-selected by the Provenance Query.


Provenance-Enabled Query: Example

SELECT ?t WHERE {
  ?a <type> <article> .
  ?a <tag> <Obama> .
  ?a <title> ?t .
}

➢ ensure that the articles come from sources attributed to the government

SELECT ?ctx WHERE {
  ?ctx prov:wasAttributedTo <government> .
}

➢ ensure that the data used to produce the answer was associated with a “SeniorEditor” and a “Manager”

SELECT ?ctx WHERE {
  ?ctx prov:wasGeneratedBy <articleProd> .
  <articleProd> prov:wasAssociatedWith ?ed .
  ?ed rdf:type <SeniorEditor> .
  <articleProd> prov:wasAssociatedWith ?m .
  ?m rdf:type <Manager> .
}


Executing Provenance-Enabled Queries

A workload query and a provenance query are given as input to a triplestore, which produces results for both queries and then combines them to obtain the final results.


TripleProv: Query Execution Pipeline

input: provenance-enabled query

➢ execute the provenance query

➢ optionally pre-materialize or co-locate data

➢ optionally rewrite the workload queries

➢ execute the workload queries

output: the workload query results, restricted to those which were derived from data specified by the provenance query


Five query execution strategies for provenance-enabled queries.
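The pipeline can be sketched end to end on toy data. The code below illustrates only one conceivable strategy (restrict the data to the selected contexts first, then run the workload query) and is not a description of the five strategies themselves. The function names, the hard-coded workload query and the sample quads are all assumptions.

```python
from typing import List, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)

# Hypothetical data: article quads plus provenance statements about contexts.
quads: List[Quad] = [
    ("a1", "type", "article", "ctx1"), ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "T1", "ctx1"),
    ("a2", "type", "article", "ctx2"), ("a2", "tag", "Obama", "ctx2"),
    ("a2", "title", "T2", "ctx2"),
]
provenance = [("ctx1", "prov:wasAttributedTo", "government"),
              ("ctx2", "prov:wasAttributedTo", "blog")]

def provenance_query(agent: str) -> Set[str]:
    """Step 1: select the contexts attributed to `agent`."""
    return {c for c, p, a in provenance
            if p == "prov:wasAttributedTo" and a == agent}

def workload_query(data: List[Quad]) -> List[str]:
    """Titles of articles tagged Obama (hard-coded for brevity)."""
    triples = {(s, p, o) for s, p, o, _ in data}
    subjects = {s for s, p, o in triples if (p, o) == ("type", "article")}
    subjects &= {s for s, p, o in triples if (p, o) == ("tag", "Obama")}
    return sorted(o for s, p, o in triples if p == "title" and s in subjects)

allowed = provenance_query("government")
restricted = [q for q in quads if q[3] in allowed]  # keep only selected contexts
print(workload_query(quads))       # plain workload query
print(workload_query(restricted))  # provenance-enabled query
```

Because the workload query then runs over fewer quads, a selective provenance query can reduce the work done in the final step.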


Results


Queries tailored with provenance information can be executed faster due to the selectivity of provenance information.


Lessons Learnt

➢ there is room for further improvement in Linked Data management

➢ co-location of related entities is the right way

➢ provenance overhead does not have to be high

➢ we can leverage provenance information to improve performance


The Future

Integrated and machine processable data providing transparent results.


Summary

➢ new advanced data co-location and partitioning techniques for efficient query processing

➢ cloud support for scalable query processing

➢ first efficient provenance-aware Linked Data management system

➢ source code available online