Upload
exascale-infolab
View
481
Download
2
Embed Size (px)
Citation preview
Efficient, Scalable, and Provenance-Aware
Management of Linked Data
Marcin Wylot
University of Fribourg, June 19, 2015
Doctoral advisor Prof. Dr Philippe Cudré-Mauroux
Quiz
3
➢ Linked Data
➢ Big Data (3Vs)
➢ Semantic Web
➢ RDF
➢ Data Provenance
Life is about Links
4
Web of Documents
5
HTMLHTMLHTMLHTML
HTMLHTMLHTMLHTML
HTMLHTMLHTMLHTML
HTMLHTMLHTML
API
HTMLHTMLHTMLXML
➢ links between documents➢ no structure, no semantics➢ for human consumption➢ cannot be processed by computers
6
silos gonna silo
What’s the Problem?
7
Web of Data
8
Web of Linked Data
9
thing
thing
thing
thing
thing
thingthing
links define relationships between things
➢ a global database
➢ design for machines
➢ links between things
➢ explicit semantics
How Does it Change our Life?
10
➢ a company is moving to another city ○ aggregated information on taxes, prices, salaries,
unemployment, climat
➢ I’m moving to another country○ mobile providers, bank accounts
➢ you’re buying a house○ crime statistics, weather, house prices, neighbourhood,
traffic information
➢ media annotation ○ video annotated with Linked Data retrieving always an
actual bio of the speaker
Linked Geo Data
11
Clean Air Status and Trends - Ozone
12
LSM: Live Linked Data
13
swissbib
14
15
16
Linked Open Data
17
9960 data collections from independent contributors
Linked Data
12-05-30 18
● Pushed by Sir Tim Berners-Lee● Provide interoperable, machine-processable
data on the Web○ Machines cannot meaningfully parse HTML pages○ Relational DBMSs were designed to be centralized
(implicit schema, local data, outside of the Web)○ XML is document-oriented, localized
● Now promoted by major vendors & actors○ Google, Yahoo!, Facebook, IBM○ Pharmaceuticals, Governments, News industry, etc.
➢ describes a method of publishing semi-structured data in the Web
➢ data can be interlinked
➢ builds upon standard Web technologies
➢ extends the standards to share information in a way that it can be read automatically by computers
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big Data
19
Linked Data is Big Data
12-05-30 20
➢ Volume: data size growing exponentially
➢ Velocity: streams of data from the Internet of Things Cloud
➢ Variety
○ semi-structured data○ heterogenous linked collections of data
Data Integration
➢ Integrated and summarized data➢ Necessity to established transparency
21
Data Provenance
“Provenance is information about
entities, activities, and people involved
in producing a piece of data or thing, which can be used to form
assessments about its quality, reliability or trustworthiness.”
Which pieces of data and how they were
combined to produce the results?
22
New Challenges
unstructuredness
AND
heterogeneity
23
How to efficiently store and query vast amounts of Linked Data in the cloud?
24
➢ a new physiological data partitioning algorithm to efficiently and effectively partition the graph and co-locate related instances in partitions
➢ a new system architecture for handling fine-grained partitions at scale
➢ novel data placement techniques to co-locate semantically related pieces of data
➢ new data loading and query execution strategies taking advantage of our system’s data partitions and indices
How to store and track provenance in Linked Data processing?
25
➢ a new way to express the provenance of query results at two different granularity levels by leveraging the concept of provenance polynomials
➢ two new storage models to represent provenance data in a data store compactly
➢ query execution strategies to derive the provenance polynomials while executing the queries
How can we efficiently support queries tailored with provenance information?
26
➢ a characterization of provenance-enabled queries, that is, queries tailored with provenance data
➢ five provenance-oriented query execution strategies
➢ storage model and indexing techniques to handle provenance-aware query execution strategies
Contributions
➢ new advance data co-location and partitioning techniques for efficient and scalable query processing in the Cloud
➢ first efficient provenance-aware database for Linked Data
27
Linked Data Technology Stack
➢ URI
➢ RDF
➢ SPARQL
28
URI: A Uniform Resource Identifier
29
“A Uniform Resource Identifier (URI) provides a simple and extensible means for identifying a resource.” -- RFC 3986
Some URIs for “real world” things: https://www.linkedin.com/in/mwylothttp://dbpedia.org/page/Fribourghttp://www.geonames.org/2657895
RDF Data Model
30
➢ Standard model for data interchange on the Web
➢ Statements about resources/things (triples)○ Subject(URI) Predicate(URI) Object(URI) .
SPARQL Query Language
31
➢ query and manipulate RDF graph content required and optional graph patterns along with their conjunctions and disjunctions
➢ aggregation, subqueries, negation, creating values by expressions, extensible value testing, and constraining queries by source graph
➢ results can be result sets or graphs
SELECT ?t WHERE {?a <type> <article> .?a <tag> <Obama> .?a <title> ?t . }
Outline
➢ Linked Data Management System
➢ Storing and Tracing Provenance
➢ Querying Provenance Information
32
Diplodocus
33
A new distributed Linked Data
management system implementing
a novel hybrid storage model
based on flexible RDF templates.
System Architecture
Hardcoded Query Processor and Query Optimizer
Lexicographic Hash TreeURI -> KEYKEY - > URI
Literal Values Manager
Cluster List Manager
Cluster Manager
templates
string
sid
sid
string
oid
clusters
templates
34
Student787
Course18Student
is_atakes
John Doe
LastName
FirstName
4826 StudentID
Student67
Course5
Course6
Student
is_atakes
takes
John Doe
LastName
FirstName
37347StudentID
Hybrid Storage
Student28
Course3
Course8
Student
is_atakes
takes
John Doe
LastName
FirstName
28821StudentID
list oftypes
T01➢ Hybrid model
➢ Two perspective
(horizontal and
vertical)
➢ Horizontal -> graph
➢ Vertical -> analytic
➢ Co-location
35
Outline
➢ Linked Data Management System
➢ Storing and Tracing Provenance
➢ Querying Provenance Information
36
Physical Storage Models
Differences:➢ ease of implementation➢ memory consumption➢ query execution➢ interference with the original concept of molecule
1) SPOL 2) LSPO 3) SLPO 4) SPLO
37
S - Subject P - PredicateO - Object L - graph label, context value
Provenance Polynomials
➢ Ability to characterize ways each source contributed
➢ Pinpoint the exact source to each result
➢ Trace back the list of sources the way they were combined
to deliver a result
"Algebraic structures for capturing the provenance of sparql queries."Geerts, Floris, et al.Proceedings of the 16th International Conference on Database Theory. ACM, 2013.
38
Example Polynomial
select ?lat ?long where { ?a [] ``Eiffel Tower''.?a inCountry FR .?a lat ?lat .?a long ?long .
}
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ ( l6 ⊕ l7) ⊗ (l8 ⊕ l9) 39
Findings
➢ tracing provenance overhead is considerable
but acceptable, on average about 60-70%
➢ most suitable storage model depends upon
data and workloads characteristics
40
Outline
➢ Linked Data Management System
➢ Storing and Tracing Provenance
➢ Querying Provenance Information
41
Provenance-Enabled Query
A Workload Query is a query producing results a user is interested in. These results are referred to as workload query results.
A Provenance Query is a query that selects a set of data from which the workload query results should originate.
A Provenance-Enabled Query is a pair consisting of a Workload Query and a Provenance Query, producing results a user is interested in (as specified by the Workload Query) and originating only from data pre-selected by the Provenance Query.
42
Provenance-Enabled Query: Example
SELECT ?t WHERE {?a <type> <article> .?a <tag> <Obama> .?a <title> ?t . }
➢ ensure that the articles come from sources attributed to the governmentSELECT ?ctx WHERE { ?ctx prov:wasAttributedTo <government> . }
➢ ensure that the data used to produce the answer was associated a “SeniorEditor” and a “Manager”
SELECT ?ctx WHERE {?ctx prov:wasGeneratedBy <articleProd>.<articleProd> prov:wasAssociatedWith ?ed .?ed rdf:type <SeniorEdior> .<articleProd> prov:wasAssociatedWith ?m .?m rdf:type <Manager> . }
43
Executing Provenance-Enabled Queries
A workload and a provenance query are given as input to a triplestore, which produces results for both queries and then combine them to obtain the final results.
44
TripleProv: Query Execution Pipeline
input: provenance-enable query
➢ execute the provenance query ➢ optionally pre-materializing or co-locating data➢ optionally rewrite the workload queries➢ execute the workload queries
output: the workload query results, restricted to those which were derived from data specified by the provenance query
45
Five query execution strategies for provenance-enabled queries.
Results
46
Queries tailored with provenance information can be executed faster due to the selectivity of provenance information.
Lessons Learnt
➢ there is room for further improvement in Linked Data management
➢ co-location of related entities is the right way
➢ provenance overhead does not have to be high
➢ we can leverage provenance information to improve performance
47
The Future
Integrated and machine processable data providing transparent results.
48
Summary
➢ new advance data co-location and partitioning techniques for efficient query processing
➢ cloud support for scalable query processing
➢ first efficient provenance-aware Linked Data management system
➢ source code available online49