On Provenance of Queries on Linked Web Data

Yannis Theoharis (1,2), Irini Fundulaki (2), Grigoris Karvounarakis (3,2) and Vassilis Christophides (1,2)

(1) Institute of Computer Science, FORTH

(2) Computer Science Department, University of Crete

(3) LogicBox, USA

What is “Linked Data”?

W3C Linking Open Data:

publish various open datasets as RDF on the Web

set RDF typed links between data items from different data sources

Motivation: Linked Data Processing

Data is:

fetched from heterogeneous sources

integrated

materialized in RDF

made available via SPARQL

Range of computations

SPARQL queries

Complex programs (logic or procedural)

Provenance Aware Applications

Trust assessment: trustworthiness

Access control: confidentiality level

Data cleaning: validity

Curated databases: source data origin

All these applications need to represent and store the relation of the input with the output of data processes

gain efficiency

impossible without provenance

Data Provenance Models

[Figure: relations R1 (X, Y, Annot.) and R2 (Y, Z, Annot.), and their join R1 ⋈ R2 (X, Y, Z, Annot.)]

Annotation Models: annotation computation coupled with a particular application and a particular assignment of source data annotations

Abstract Provenance Models: abstract provenance tokens and operators are substituted by appropriate concrete tokens for a particular application and assignment

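R1:
X  Y  Annot.
a  b  c1
c  d  c2

R2:
Y  Z  Annot.
b  e  c3

R1 ⋈ R2:
X  Y  Z  Annot.
a  b  e  c1 * c3

Boolean trust (t: trusted, f: untrusted): c1 * c3 evaluates to t Λ t, or to t Λ f when a source annotation changes. With an annotation model, such a change requires query recomputation!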
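The contrast above can be sketched in Python. The abstract annotation c1 * c3 of the joined tuple (a, b, e) is stored once and then evaluated under different trust assignments, so no query is re-run when a source annotation changes. The nested-tuple encoding of expressions and the name `evaluate` are assumptions made for illustration only; they do not come from the talk.

```python
# Abstract provenance expression of (a, b, e): c1 * c3,
# encoded (hypothetically) as a nested tuple (operator, left, right).
PROV_ABE = ("*", "c1", "c3")

def evaluate(expr, assignment):
    """Evaluate a provenance expression under Boolean trust.

    '*' (joint use) is read as logical AND, '+' (alternative use) as OR.
    """
    if isinstance(expr, str):  # a provenance token such as "c1"
        return assignment[expr]
    op, left, right = expr
    l, r = evaluate(left, assignment), evaluate(right, assignment)
    return (l and r) if op == "*" else (l or r)

all_trusted = {"c1": True, "c2": True, "c3": True}
c3_untrusted = {**all_trusted, "c3": False}

print(evaluate(PROV_ABE, all_trusted))   # t AND t
print(evaluate(PROV_ABE, c3_untrusted))  # t AND f
```

Only the assignment changes between the two calls; the stored expression is reused as is.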

This Talk

“Can previous work on abstract provenance models be leveraged for SPARQL?”

NO: due to the OPTIONAL operator (similar to the SQL left outer join)

YES: for the positive fragment of SPARQL (without OPTIONAL)

We present our ongoing work on a SPARQL abstract provenance model.

Challenge: to capture the form of negation that OPTIONAL introduces

Outline

SPARQL algebra

Abstract Provenance Models for Positive SPARQL

Limitations of Previous Models

Towards a SPARQL Provenance Model

SPARQL (1/2)

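SPARQL: W3C Recommendation language to query RDF data.

[Figure: SPARQL evaluation pipeline. Triple patterns such as (?x, ?y, e), made of constants and variables, are matched against a triple set and yield mappings, e.g. {(?x,d),(?y,b)} and {(?x,f),(?y,g)}. Compose and Filter turn mappings into mappings. Select, Construct or Describe produce the result.]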

SPARQL (2/2)

SPARQL algebra defines 5 operators on bags of mappings

Unary ops: π (projection), σ (selection, also called filtering)

Binary ops: ∪ (union), ⋈ (join), ⟕ (optional)

π?x (Ω)

card(μ1) = 2, card(μ2) = 1

Positive SPARQL (SPARQL+)

σ?x=a (Ω)

Ω1 ∪ Ω2 (over ?x, ?y, ?z): ?z is unbound in μ1

μ and μ’ are compatible (μ ~ μ’) if they agree on their common variables

Ω1 ⋈ Ω2 (over ?x, ?y, ?z): μ1 ~ μ4 and μ3 ~ μ4, while μ2 ≁ μ4, so the result contains μ5 = μ1 ∪ μ4 and μ6 = μ3 ∪ μ4

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2) (over ?x, ?y, ?z): μ4 = μ1 ∪ μ3 comes from Ω1 ⋈ Ω2, and μ2 comes from Ω1 \ Ω2
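A minimal Python sketch of compatibility and of the binary operators built on it. Mappings are modelled as dictionaries from variables to values and bags as lists; the function names and the sample mappings are assumptions chosen for illustration, not definitions taken from the talk.

```python
def compatible(m1, m2):
    """mu ~ mu': the mappings agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def join(omega1, omega2):
    """Merge every compatible pair of mappings."""
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]

def difference(omega1, omega2):
    """Keep the mappings of omega1 that are compatible with no mapping of omega2."""
    return [m1 for m1 in omega1 if not any(compatible(m1, m2) for m2 in omega2)]

def optional(omega1, omega2):
    """Optional = join, plus the left mappings that found no partner."""
    return join(omega1, omega2) + difference(omega1, omega2)

# Hypothetical sample data (bags are modelled as Python lists).
omega1 = [{"?x": "d", "?y": "b"}, {"?x": "f", "?y": "g"}]
omega2 = [{"?y": "b", "?z": "c"}, {"?y": "e", "?z": "h"}]

print(join(omega1, omega2))
print(difference(omega1, omega2))
print(optional(omega1, omega2))
```

In this sample only (d, b) finds a compatible partner, so (f, g) survives in the optional result with ?z left unbound.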

Abstract Provenance Models

Abstract provenance models encode the query operators at different levels of detail

Expressiveness vs efficiency (annotation storage and computation time)

[Figure: the SPARQL evaluation pipeline (triple patterns, mappings, Compose/Filter, Select), annotated with the labels “Provenance”, “Lineage” and “informative”]

Abstract Provenance Models for SPARQL+

Previous models are defined for positive relational algebra

Positive relational operators are monotonic

The addition (removal) of a tuple can only result in additional (removed) tuples in the output

This also holds for SPARQL+ (projection, union, join)

Previous models suffice for SPARQL+
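The monotonicity claim can be checked on a small instance. The Python sketch below removes one input mapping and verifies that the join and union outputs only shrink. The sample data and helper names are hypothetical, and a single instance illustrates the property rather than proving it.

```python
def compatible(m1, m2):
    """The mappings agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def join(omega1, omega2):
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]

def as_set(omega):
    """Hashable view of a collection of mappings, for set-inclusion tests."""
    return {frozenset(m.items()) for m in omega}

# Hypothetical sample data.
omega1 = [{"?x": "d", "?y": "b"}, {"?x": "f", "?y": "g"}]
omega2 = [{"?y": "b", "?z": "c"}, {"?y": "e", "?z": "h"}]
smaller_omega2 = omega2[1:]  # remove the mapping (b, c)

full_join = as_set(join(omega1, omega2))
reduced_join = as_set(join(omega1, smaller_omega2))
full_union = as_set(omega1 + omega2)
reduced_union = as_set(omega1 + smaller_omega2)

# Removing an input mapping only removes output mappings.
removed_only = reduced_join <= full_join and reduced_union <= full_union
print(removed_only)
```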

Boolean trust assessment (SPARQL)

Assumptions: Boolean trust semantics, set semantics on trusted mappings

⟕ and \ are not monotonic. When μ3 becomes untrusted, μ5 becomes untrusted and μ1 becomes trusted in Ω1 ⟕ Ω2

Trusted before: μ1, μ2, μ3, μ4

Trusted after: μ1, μ2, μ4
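The effect can be reproduced with a short Python sketch that evaluates the optional operator over the trusted mappings only, before and after μ3 loses its trust. The mapping values follow the running example of the talk, with the assumption that Ω1 binds ?x, ?y and Ω2 binds ?y, ?z; the function names are hypothetical.

```python
def compatible(m1, m2):
    """The mappings agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def optional(omega1, omega2):
    """Joined mappings, plus left mappings that have no compatible partner."""
    out = []
    for m1 in omega1:
        partners = [m2 for m2 in omega2 if compatible(m1, m2)]
        out.extend({**m1, **m2} for m2 in partners)
        if not partners:
            out.append(dict(m1))
    return out

MU = {
    "mu1": {"?x": "d", "?y": "b"},  # in Omega1
    "mu2": {"?x": "f", "?y": "g"},  # in Omega1
    "mu3": {"?y": "b", "?z": "c"},  # in Omega2
    "mu4": {"?y": "e", "?z": "h"},  # in Omega2
}

def trusted_result(trusted):
    """Evaluate Omega1 OPTIONAL Omega2 over the trusted mappings only."""
    omega1 = [MU[n] for n in ("mu1", "mu2") if n in trusted]
    omega2 = [MU[n] for n in ("mu3", "mu4") if n in trusted]
    return optional(omega1, omega2)

before = trusted_result({"mu1", "mu2", "mu3", "mu4"})
after = trusted_result({"mu1", "mu2", "mu4"})
print(before)
print(after)
```

Removing μ3 from the trusted input removes (d, b, c) from the output but adds (d, b, -), which a monotonic operator could never do.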

Ω1 \ Ω2:
?x  ?y  ?y2  ?z2
f   g   b    c
f   g   e    h

Intuitively, (f, g) is in Ω1 \ Ω2 because it is compatible with neither μ3 nor μ4

Ω1 ⟕ Ω2:
?x  ?y  ?z  ?x1  ?y1  ?y2  ?z2
d   b   c   d    b    b    c
f   g   -   f    g    b    c
f   g   -   f    g    e    h

(d, b, c) is in Ω1 ⟕ Ω2 due to the join between μ1 and μ3

If μ3 becomes untrusted, Perm infers that (d, b, c) becomes untrusted, but cannot infer that (d, b, -) should become trusted

RDF Meta Knowledge & M-semirings

Ω1:
?x  ?y  Annot.
d   b   c1
f   g   c2

Ω2:
?y  ?z  Annot.
b   c   c3
e   h   c4

Ω1 \ Ω2:
?x  ?y  RDF MK             M-semirings
f   g   c2 Λ ¬(c3 V c4)    c2 ⊖ 0 = c2

Ω1 ⟕ Ω2:
?x  ?y  ?z  RDF MK             M-semirings
d   b   c   c1 Λ c3            c1 * c3
f   g   -   c2 Λ ¬(c3 V c4)

Like Perm, RDF Meta Knowledge and M-semirings infer that μ5 is untrusted but cannot infer that μ1: (d, b, -) is trusted.

A Third Operation for Compatibility (1/2)

Take care of compatible mappings

Only one of μ1 and μ5 can appear in the result

Keep provenance information for both of them!

Ω1:
?x  ?y  Annot.
d   b   c1
f   g   c2

Ω2:
?y  ?z  Annot.
b   c   c3
e   h   c4

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2):
?x  ?y  ?z  How      SPARQL Prov.
d   b   c   c1*c3    c1*c3
d   b   -   No Info  c1*A(μ1, μ3)
f   g   -   c2       c2

Boolean trust: (t Λ t) = t, (t Λ f) = f

A(μ1, μ3) = f, if μ1 ~ μ3 and c3 = t
A(μ1, μ3) = t, otherwise

A Third Operation for Compatibility (2/2)

A is a binary operator on mappings

It determines whether the mapping exists in the result or not

If yes, its provenance equals the positive provenance part, e.g. c1 for c1*A(μ1, μ3)

In general,

A(μ1, μ3) = 0, if μ1 ~ μ3 and c3 ≠ 0
A(μ1, μ3) = 1, otherwise

where 0 is the neutral element for + and 1 is the neutral element for *
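A minimal Python sketch of how the operator A decides which mappings of Ω1 ⟕ Ω2 are in the result. Tokens are interpreted here as integers, with 0 for untrusted and 1 for trusted; that interpretation, the function names and the tuple encoding of result mappings are assumptions made for illustration.

```python
def compatible(m1, m2):
    """The mappings agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def A(m1, m2, c2):
    """Compatibility operator: 0 if m1 ~ m2 and m2's annotation c2 is non-zero, else 1."""
    return 0 if compatible(m1, m2) and c2 != 0 else 1

mu1 = {"?x": "d", "?y": "b"}  # annotated with c1
mu3 = {"?y": "b", "?z": "c"}  # annotated with c3

def provenance(tokens):
    """SPARQL provenance of each candidate mapping of Omega1 OPTIONAL Omega2."""
    c1, c2, c3 = tokens["c1"], tokens["c2"], tokens["c3"]
    return {
        ("d", "b", "c"): c1 * c3,
        ("d", "b", "-"): c1 * A(mu1, mu3, c3),
        ("f", "g", "-"): c2,
    }

def result(tokens):
    """A mapping is in the result when its provenance is non-zero."""
    return [m for m, p in provenance(tokens).items() if p != 0]

all_trusted = {"c1": 1, "c2": 1, "c3": 1, "c4": 1}
mu3_untrusted = {**all_trusted, "c3": 0}

print(result(all_trusted))
print(result(mu3_untrusted))
```

Setting c3 to 0 switches the result from (d, b, c) to (d, b, -) by re-evaluating the stored expressions only, without re-running the query.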

SPARQL Provenance Operators

Two types of operators

on provenance tokens, i.e. + and * (for SPARQL+)

on mappings, i.e. A (for ⟕ and \)

Good news:

Every triple of the dataset is uniquely annotated.

Why not use annotations as mapping identifiers in A?

Due to the projection operator…

Enrich Tokens with Schema Information

Use tokens (c1, c2…) as mapping ids in A expressions

A(c1, c2) = 0, if μ1 ~ μ2 and c2 ≠ 0
A(c1, c2) = 1, otherwise

But μ1 ≁ μ2 might hold, while π?y,?z (μ1) ~ π?y,?z (μ2)

Tokens don’t suffice, keep token-schema pairs

Ω:
?x  ?y  ?z  Prov.
a   b   c   (c1, {?x, ?y, ?z})
d   b   -   (c2, {?x, ?y, ?z})

π?y,?z (Ω):
?y  ?z  Prov.
b   c   (c1, {?y, ?z})
b   -   (c2, {?y, ?z})

A( (c1, S1), (c2, S2) ) = 0, if πS1 (μ1) ~ πS2 (μ2) and c2 ≠ 0
A( (c1, S1), (c2, S2) ) = 1, otherwise
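A minimal Python sketch of A over token-schema pairs, using the two mappings of the example. The lookup tables `MAPPINGS` and `VALUES`, which resolve a token to its source mapping and to its current value, are hypothetical helpers introduced only for this illustration.

```python
def compatible(m1, m2):
    """The mappings agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def project(mapping, schema):
    """Restrict a mapping to the variables of a schema (unbound ones stay absent)."""
    return {v: val for v, val in mapping.items() if v in schema}

# Hypothetical lookup tables: token -> source mapping, token -> current value.
MAPPINGS = {
    "c1": {"?x": "a", "?y": "b", "?z": "c"},
    "c2": {"?x": "d", "?y": "b"},  # ?z is unbound
}
VALUES = {"c1": 1, "c2": 1}

def A(pair1, pair2):
    """0 if the projected mappings are compatible and the second token is non-zero, else 1."""
    (t1, s1), (t2, s2) = pair1, pair2
    m1, m2 = project(MAPPINGS[t1], s1), project(MAPPINGS[t2], s2)
    return 0 if compatible(m1, m2) and VALUES[t2] != 0 else 1

FULL = frozenset({"?x", "?y", "?z"})
PROJECTED = frozenset({"?y", "?z"})

print(A(("c1", FULL), ("c2", FULL)))            # mappings disagree on ?x
print(A(("c1", PROJECTED), ("c2", PROJECTED)))  # after projection they agree on ?y
```

The same two tokens give different answers depending on the schema, which is why the token alone cannot identify the mapping once projection is involved.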

Define an algebra on token-schema pairs

3 operations

2 for SPARQL operators

1 for compatibility

What if there is no projection (or projection is not allowed to be pushed down)?

annotations suffice (no need for schema information), but the compatibility operator is still needed

What if there is no OPTIONAL?

previous models suffice, e.g. How

Future Work

SPARQL Provenance Model

Extend model expressiveness to capture other computations on Linked Data

Logic explanations

Implementation

Questions ?
