Data Intensive Techniques to Boost the Real-time Performance of Global Agricultural Data Infrastructures SEMAGROW USING A POWDER TRIPLE STORE FOR BOOSTING THE REAL-TIME PERFORMANCE OF GLOBAL AGRICULTURAL DATA INFRASTRUCTURES KREAM 2013 5 June 2013 Pythagoras Karampiperis National Centre for Scientific Research “Demokritos”


SEMAGROW
USING A POWDER TRIPLE STORE FOR BOOSTING THE REAL-TIME PERFORMANCE OF GLOBAL AGRICULTURAL DATA INFRASTRUCTURES

KREAM 2013, 5 June 2013

Pythagoras Karampiperis
National Centre for Scientific Research “Demokritos”

Outline

Introduction / Problem Statement

The SemaGrow Solution

The POWDER W3C Recommendation

SemaGrow Architecture

The SemaGrow Stack

SemaGrow Maintenance Components

Moving Forward with “Old” Technologies

[Figure: two architecture diagrams.
NOW (2012) case of agricultural infrastructures: Harvester; OAI-PMH Service Providers #1 ... #n, each with its own schema (Schema #1 ... #n: AGRIS AP Schema, IEEE LOM Schema, DC Schema, ...); Aggregated XML Repository; Indexer; Web Portals (Open AGRIS (FAO), AgLR/GLN (ARIADNE), Organic.Edunet (UAH), VOA3R (UAH), ...).
2015 (AgINFRA) case of agricultural infrastructures: SPARQL endpoints (Data Source #1 ... #n); RDF Triple Store with a Common Schema; Indexer; SPARQL endpoint; Web Portals.
Callouts: How Many? BigData Problem! Is it feasible?]

What Semantic Web can bring into the picture

[Figure: SemaGrow architecture diagram. Labels: Client, SemaGrow SPARQL endpoint, Query Decomposition, Resource Discovery, Data Summaries, POWDER Inference Layer, P-Store, Federated endpoint Wrapper, SPARQL endpoint (Data Source #1 ... #n).]

Going beyond existing Distributed Triple Store Implementations
· Link Heterogeneous but Semantically Connected Data
· Index Extremely Large Information Volumes (Peta Sizes)
· Improve Information Retrieval response

Data (+Metadata) physically stored in Data Provider
· No need for harvesting

Vocabularies / Thesauri / Ontologies of Data Provider choice
· No need for aligning according to common schemas

One Data Access Point for the entire Data Cloud
· Enabling Service-Data level agreements with Data providers

Application-level Vocabularies / Thesauri / Ontologies
· Enabling different application facets for different communities of users over the SAME data pool

The SemaGrow Solution

Use POWDER to mass-annotate large subspaces
· Exploit naming convention regularities to compress the indexes used by the system

Partition triple patterns in the original query

Annotate each fragment with an ordered list of data sources most likely to contain relevant data

Distribute and transform the query fragments

Collect and align the results
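
The steps above can be sketched as a toy pipeline. This Python sketch is not SemaGrow code: the in-memory `SOURCES`, the one-pattern-per-fragment partition and the ranking by match count are all assumptions made for illustration, and the query transformation and result alignment steps are left out.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
# A triple pattern uses None as a wildcard (a stand-in for a SPARQL variable).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical in-memory "data sources"; real sources would be remote SPARQL endpoints.
SOURCES: Dict[str, List[Triple]] = {
    "source_a": [
        ("ex:doc1", "dc:title", "Rice yields"),
        ("ex:doc1", "dc:subject", "ex:rice"),
    ],
    "source_b": [
        ("ex:doc2", "dc:subject", "ex:rice"),
        ("ex:doc2", "dc:subject", "ex:soil"),
        ("ex:doc3", "dc:subject", "ex:rice"),
    ],
}


def matches(pattern: Pattern, triple: Triple) -> bool:
    """True when every non-wildcard position of the pattern equals the triple's."""
    return all(p is None or p == t for p, t in zip(pattern, triple))


def partition(query: List[Pattern]) -> List[List[Pattern]]:
    """Simplest possible partition (an assumption): one fragment per triple pattern."""
    return [[pattern] for pattern in query]


def annotate(fragment: List[Pattern]) -> List[str]:
    """Order sources by number of matching triples, dropping sources with none."""
    counts = {
        name: sum(matches(p, t) for p in fragment for t in triples)
        for name, triples in SOURCES.items()
    }
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return [name for name in ranked if counts[name] > 0]


def dispatch(fragment: List[Pattern], sources: List[str]) -> List[Triple]:
    """Evaluate the fragment on each selected source and collect the matches."""
    return [t for name in sources for t in SOURCES[name]
            for p in fragment if matches(p, t)]


query: List[Pattern] = [(None, "dc:subject", "ex:rice"), (None, "dc:title", None)]
plan = [(fragment, annotate(fragment)) for fragment in partition(query)]
results = [dispatch(fragment, sources) for fragment, sources in plan]
```

Here `plan` pairs each fragment with its ordered source list, so a fragment is only sent to sources expected to hold matching triples.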

The POWDER W3C Recommendation

Exploits natural groupings of URIs to annotate all resources in a subset of the URI space

Regular-expression-based grouping

Allows properties and their values to be associated with an arbitrary number of subjects within a fully-defined semantic framework

POWDER Description Resources: http://www.w3.org/TR/powder-dr/
POWDER Formal Semantics: http://www.w3.org/TR/powder-formal/
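
The idea of annotating a whole URI subspace with one statement can be illustrated in a few lines of Python. This is only a sketch of the principle and does not use POWDER syntax or any POWDER library: the `DescriptionRule` class, the `example.org` URIs and the `ex:` properties are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DescriptionRule:
    """A regex-defined group of URIs plus the properties every member receives."""
    iri_pattern: str
    properties: Dict[str, str]


# Hypothetical rules and hosts, for illustration only.
RULES: List[DescriptionRule] = [
    DescriptionRule(
        r"^http://data\.example\.org/crops/",
        {"ex:topic": "ex:Crops", "ex:storedAt": "ex:sourceA"},
    ),
    DescriptionRule(
        r"^http://data\.example\.org/soil/\d+$",
        {"ex:topic": "ex:Soil"},
    ),
]


def describe(iri: str) -> Dict[str, str]:
    """Merge the properties of every rule whose pattern matches the URI."""
    description: Dict[str, str] = {}
    for rule in RULES:
        if re.search(rule.iri_pattern, iri):
            description.update(rule.properties)
    return description
```

Two rules describe every resource under `/crops/` and every numeric `/soil/` resource, however many there are. This is why such a rule set can stay far smaller than a per-resource index.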

The SemaGrow Stack

Integrates the components in order to offer a single SPARQL endpoint that federates a number of heterogeneous data sources

Targets the federation of independently provided data sources

SemaGrow Architecture

[Figure: SemaGrow architecture diagram. Component groups: Query Decomposition, Resource Discovery, Data Summaries Endpoint, Federated Endpoint Wrapper. Other labels: Client, SemaGrow SPARQL endpoint, Reactivity parameters, Query Decomposer, Data Source(s) Selector, Candidate Source(s) List (Instance Statistics, Load Info, Semantic Proximity), Query Pattern Discovery Service, Resource Selector, POWDER Inference Layer, P-Store, Instance Statistics, Data Summaries, Schema Mappings, Query Manager, Query Transformation Service, Query Results Merger, SPARQL endpoint (Data Source #1 ... #n).]

Query Decomposition

Analyses SPARQL queries

Decides on the optimal way to create query fragments to be dispatched to sources’ endpoints

Components
· Query Decomposition: Suggestions of possible decompositions
· Selector: Evaluates these suggestions based on information and predictions from the Resource Discovery Component
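
The split between suggesting decompositions and selecting one can be sketched as follows. This is a minimal illustration, not the SemaGrow algorithm: patterns are reduced to their predicates, `PREDICTED_ROWS` stands in for predictions from Resource Discovery, and the cost model (predicted rows plus a fixed `REQUEST_OVERHEAD` per fragment) is a toy assumption.

```python
from typing import Dict, Iterator, List

# Hypothetical predictions: rows each source is expected to return per predicate.
PREDICTED_ROWS: Dict[str, Dict[str, int]] = {
    "source_a": {"dc:title": 100, "dc:subject": 400},
    "source_b": {"dc:subject": 50, "dc:creator": 80},
}
REQUEST_OVERHEAD = 30  # hypothetical fixed cost per dispatched fragment


def partitions(items: List[str]) -> Iterator[List[List[str]]]:
    """Suggest every way to split the patterns into non-empty fragments."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing fragment in turn...
        for i, group in enumerate(smaller):
            yield smaller[:i] + [[first] + group] + smaller[i + 1:]
        # ...or into a new fragment of its own.
        yield [[first]] + smaller


def fragment_cost(fragment: List[str]) -> float:
    """Cost at the cheapest source that covers every pattern, else infinity."""
    costs = [
        sum(stats[p] for p in fragment)
        for stats in PREDICTED_ROWS.values()
        if all(p in stats for p in fragment)
    ]
    return REQUEST_OVERHEAD + min(costs) if costs else float("inf")


def select(patterns: List[str]) -> List[List[str]]:
    """Pick the suggested decomposition with the lowest total estimated cost."""
    return min(partitions(patterns),
               key=lambda d: sum(fragment_cost(f) for f in d))


best = select(["dc:title", "dc:subject", "dc:creator"])
```

With these numbers the selector groups `dc:subject` and `dc:creator` into one fragment, because a single source covers both and one request overhead is saved. Enumerating all partitions grows very quickly, so a practical decomposer would need to prune its suggestions.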

Resource Discovery

Provides an annotated list of candidate data sources that (possibly) hold triples matching a query pattern

Sources are annotated with additional information
· Schema-level metadata
· Instance-level metadata
· Predicted Response Volume
· Run-time information about current source load
· Semantic proximity of source and query schemas
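
One possible shape for such an annotated candidate list is sketched below. The field names, the value ranges and especially the `score` formula are assumptions chosen for illustration; the slides do not say how the annotations are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateSource:
    endpoint: str
    has_schema_match: bool     # schema-level metadata
    has_instance_match: bool   # instance-level metadata
    predicted_volume: int      # predicted number of results
    current_load: float        # assumed range: 0.0 (idle) to 1.0 (saturated)
    semantic_proximity: float  # assumed range: 0.0 (unrelated) to 1.0 (same schema)


def score(candidate: CandidateSource) -> float:
    """Hypothetical ranking score: favour close schemas on lightly loaded sources."""
    return candidate.semantic_proximity * (1.0 - candidate.current_load)


def rank(candidates: List[CandidateSource]) -> List[CandidateSource]:
    """Keep sources that may hold matching triples, best score first."""
    viable = [c for c in candidates if c.has_schema_match or c.has_instance_match]
    return sorted(viable, key=score, reverse=True)


candidates = [
    CandidateSource("http://a.example.org/sparql", True, False, 400, 0.8, 0.9),
    CandidateSource("http://b.example.org/sparql", True, True, 50, 0.1, 0.6),
    CandidateSource("http://c.example.org/sparql", False, False, 0, 0.0, 0.2),
]
ranked = rank(candidates)
```

The third source is dropped because neither its schema-level nor its instance-level metadata indicates a match. The busy source ranks second despite its closer schema.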

Data Summaries Endpoint

Serves metadata about the schema and instances of the various federated data stores

Receives entity URIs

Returns the repositories where these entities are located (either at the schema or instance level)

Returns ontology alignment knowledge regarding entity equivalence between different sources
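
The lookups described on this slide can be sketched with three in-memory indexes. This is an assumption-laden illustration: the index contents, the source names and the `dc:subject` / `lom:keyword` equivalence are hypothetical, and the real endpoint is exposed over SPARQL rather than as Python functions.

```python
from typing import Dict, List, Set

# Hypothetical summaries; a real deployment would derive these from the sources.
SCHEMA_INDEX: Dict[str, Set[str]] = {
    "dc:subject": {"source_a", "source_b"},
    "lom:keyword": {"source_c"},
}
INSTANCE_INDEX: Dict[str, Set[str]] = {"ex:rice": {"source_a"}}
EQUIVALENCES: Dict[str, Set[str]] = {
    "dc:subject": {"lom:keyword"},
    "lom:keyword": {"dc:subject"},
}


def locate(entity: str) -> Dict[str, List[str]]:
    """Repositories where the entity occurs, at the schema or instance level."""
    return {
        "schema": sorted(SCHEMA_INDEX.get(entity, set())),
        "instance": sorted(INSTANCE_INDEX.get(entity, set())),
    }


def equivalents(entity: str) -> List[str]:
    """Entities that alignment knowledge declares equivalent in other sources."""
    return sorted(EQUIVALENCES.get(entity, set()))


def locate_with_equivalents(entity: str) -> List[str]:
    """All repositories holding the entity or one of its equivalents."""
    found: Set[str] = set()
    for name in [entity, *equivalents(entity)]:
        location = locate(name)
        found.update(location["schema"], location["instance"])
    return sorted(found)
```

Following the equivalence brings `source_c` into the answer for `dc:subject`, even though that source uses a different vocabulary.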

Federated Endpoint Wrapper

Manages the communication with external data sources federated by the SemaGrow Stack

Query Manager
· Calls the Query Transformation Service when necessary
· Forwards query fragments to the Query Results Merger
· Collects and forwards run-time statistics to the Resource Discovery Component

Query Results Merger
· Pay-as-you-go behaviour
· Provides first approximations and iteratively refines them if more computational resources are warranted by the reactivity parameters

Query Transformation Service
· Accesses the Schema Mappings Repository
· Rewrites query fragments from the original query schema to that of the data source that will be used for the fragment
· Rewrites query results from the source schema to the query schema
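
The two rewriting directions of the Query Transformation Service can be sketched as a term-by-term substitution. This is a simplification under stated assumptions: `SCHEMA_MAPPINGS` is a hypothetical one-to-one mapping table, whereas real schema mappings can be more complex than renaming terms.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical Schema Mappings Repository: query-schema term -> source-schema term.
SCHEMA_MAPPINGS: Dict[str, Dict[str, str]] = {
    "source_c": {"dc:subject": "lom:keyword", "dc:title": "lom:title"},
}


def rewrite_fragment(fragment: List[Triple], source: str) -> List[Triple]:
    """Rewrite a query fragment from the query schema to the source schema."""
    mapping = SCHEMA_MAPPINGS.get(source, {})
    return [tuple(mapping.get(term, term) for term in pattern)
            for pattern in fragment]


def rewrite_results(results: List[Triple], source: str) -> List[Triple]:
    """Rewrite results from the source schema back to the query schema."""
    inverse = {v: k for k, v in SCHEMA_MAPPINGS.get(source, {}).items()}
    return [tuple(inverse.get(term, term) for term in triple)
            for triple in results]


fragment = [("?doc", "dc:subject", "ex:rice")]
sent = rewrite_fragment(fragment, "source_c")
returned = rewrite_results([("ex:doc9", "lom:keyword", "ex:rice")], "source_c")
```

A source with no entry in the mapping table receives the fragment unchanged, which matches the "when necessary" wording for the Query Manager.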

Maintenance Components

Authoring Tool
· Visual tool for assisting data providers
· Construction of POWDER statements
· Provenance and cataloguing metadata

Ontology Alignment Tool
· Semi-automatic (human intervention) alignment of Semantic Vocabularies used by data providers and consumers

Content Classification and Ontology Evolution
· Refine coarsely annotated data to a level of detail where they can be more accurately aligned with other schemas within the federation

Project info

SemaGrow: Data intensive techniques to boost the real-time performance of global agricultural data infrastructures

FP7-ICT-2011.4.4 (Intelligent Information Management)

No.  Name                                          Country
1    Universidad de Alcala
2    NCSR “Demokritos”
3    Universita Degli Studi di Roma Tor Vergata
4    Semantic Web Company
5    Institut Za Fiziku
6    Stichting Dienst Landbouwkundig Onderzoek
7    Food and Agriculture Organization of the UN
8    Agroknow Technologies

Thank You!

Dr. Pythagoras P. Karampiperis

([email protected])

Institute of Informatics & Telecommunications (IIT),

NCSR “Demokritos” (NCSR)

www.semagrow.eu