www.d4science.org D4Science technical features and opportunities of the grid infrastructure for large scale data management Pasquale Pagano D4Science Technical Director National Research Council, ISTI-CNR www.d4science.eu

www.d4science.org

D4Science technical features and opportunities of the grid infrastructure for large scale data management

Pasquale Pagano D4Science Technical Director National Research Council, ISTI-CNR

www.d4science.eu

A closer look at gCube technology

Core Services

Information Organisation Services

Information Retrieval Services

Presentation Services

gCube architectural overview

Services, layers & specifications

The talk is not about this

Outline

• D4Science Mission

• D4Science World: e-Infrastructure, Technology, Service

• D4Science Exploitation: Data Management, VREs

• Summing Up

D4Science mission

D4Science mission: to provide a scientific e-Infrastructure that

  removes all heterogeneity, sustainability, scalability, and other technical concerns from the minds of scientists,
  hides all related complexities from their perception, and
  enables them to focus on their science and collaborate on common research challenges

gCube is a framework to manage e-Infrastructures in which it is possible to define, host, and maintain dynamic Virtual Research Environments (VREs) capable of satisfying the collaboration needs of distributed Virtual Organizations (VOs)

From a testbed to a production ecosystem

[Figure: timeline with markers Oct. ’04, Nov. ’07, Jan. ’08, Dec. ’09, Oct. ’09, Sept. ’11]

From a testbed to a production ecosystem

[Figure: functionality over time for gLite and gCube; timeline markers Oct. ’04, Nov. ’07, Jan. ’08, Dec. ’09, Oct. ’09, Sept. ’11]

D4Science: a Threefold World

Infrastructure vs. e-Infrastructure

• An infrastructure is the set of basic physical and organizational structures and facilities (roads, power supplies, ...) needed for the operation of a society or enterprise

• The D4Science e-Infrastructure provides support for effective consumption of shared resources:

hardware-bound resources (i.e. networks, storage, instruments, and computational resources),

system-level software resources (i.e. basic middleware services),

and application-level software resources (i.e. data sources and services).

Infrastructure vs. e-Infrastructure

• An infrastructure
  connects remote places by providing facilities to assist supported resources and consumers
  has policies

• The D4Science e-Infrastructure
  enables scientific communities to cooperate within a coherent model, regardless of the location of their research facilities
  enforces policies

D4Science e-Infrastructure (1/2)

• Facilitate the life of scientists by hiding the complexity

[Figure labels: D4Science e-Infrastructure; data analysis; 50 Gb]

D4Science e-Infrastructure (2/2)

• Facilitate the life of scientists by supporting collaboration

[Figure labels: D4Science e-Infrastructure; share; access; 50 Gb]

D4Science as e-Infrastructure: Key Features

• D4Science e-Infrastructure provides scientists with

Easy-to-use tools for infrastructural resources registration and management

Cost-effective tools for data resource registration, metadata generation, and curation

Seamless access to shared, distributed and heterogeneous resources organized in dynamically created Virtual Research Environments

e-Infrastructure Resources

The D4Science managed resources are:

e-Infrastructure Resources [cont.]

e-Infrastructure

Site A

Site B

Site C

Virtual Organization

• A Virtual Organization (VO) specifies how a set of users can access a set of resources
  by defining what is shared, who is allowed to share, and the conditions under which sharing occurs
  and by enforcing the authentication and authorization policies.
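The VO definition above can be read as a small access-control model. The following Python sketch is only an illustration under stated assumptions: the class `VirtualOrganization`, its fields, the `can_access` method, and the sample users and resources are all hypothetical and are not part of gCube. Authentication is left out; the sketch covers only the authorization decision.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class VirtualOrganization:
    """Hypothetical model of a VO: what is shared, with whom, under which conditions."""
    name: str
    members: Set[str] = field(default_factory=set)           # who is allowed to share
    shared_resources: Set[str] = field(default_factory=set)  # what is shared
    # Sharing condition per resource: a predicate over the requesting user (assumed shape).
    conditions: Dict[str, Callable[[str], bool]] = field(default_factory=dict)

    def can_access(self, user: str, resource: str) -> bool:
        # The user must be a VO member and the resource must be shared in the VO.
        if user not in self.members or resource not in self.shared_resources:
            return False
        # Apply the sharing condition, if one is defined for this resource.
        condition = self.conditions.get(resource, lambda _user: True)
        return condition(user)


# Made-up sample data for illustration only.
vo = VirtualOrganization(
    name="FisheriesVO",
    members={"alice", "bob"},
    shared_resources={"catch-statistics", "species-maps"},
    conditions={"catch-statistics": lambda user: user == "alice"},
)
```

Keeping the condition as a per-resource predicate is one possible design; a real policy engine would typically evaluate roles and credentials instead of user names.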

VO

Virtual Research Environment (1/3)

• VRE scenarios
  • Data needs to be assessed before it is made publicly exploitable by the VO members.
  • A restricted set of users has to collaborate to refine processes and implement showcases.
  • Products generated through elaboration of data or simulation have to be validated by expert users.

Is the VO adequate to represent a growing aggregation of resources tailored to satisfy the evolving needs of the user community?

NO, it is not!

Virtual Research Environment (2/3)

VRE resources can be published in the VO at any time by the VRE data managers.

A Virtual Research Environment (VRE) is a distributed and dynamically created environment in which a subset of resources can be assigned to a subset of users for a limited timeframe.
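Read as a data structure, a VRE is a time-limited pair of subsets taken from its VO. The sketch below illustrates that reading in Python; the `VRE` class, the `create_vre` helper, the subset rule it enforces, and the sample dates are assumptions made for the example, not gCube code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet


@dataclass(frozen=True)
class VRE:
    """Hypothetical VRE: a time-limited subset of a VO's users and resources."""
    name: str
    users: FrozenSet[str]
    resources: FrozenSet[str]
    start: datetime
    end: datetime

    def is_active(self, now: datetime) -> bool:
        # The VRE exists only within its timeframe (start inclusive, end exclusive).
        return self.start <= now < self.end


def create_vre(name, vo_users, vo_resources, users, resources, start, end):
    """Create a VRE only if it is a subset of the enclosing VO (assumed rule)."""
    users, resources = frozenset(users), frozenset(resources)
    if not users <= frozenset(vo_users):
        raise ValueError("VRE users must be VO members")
    if not resources <= frozenset(vo_resources):
        raise ValueError("VRE resources must belong to the VO")
    if end <= start:
        raise ValueError("VRE timeframe must be non-empty")
    return VRE(name, users, resources, start, end)


# Made-up sample data for illustration only.
vre = create_vre(
    "VRE 1",
    vo_users={"alice", "bob", "carol"},
    vo_resources={"r1", "r2", "r3"},
    users={"alice", "bob"},
    resources={"r1"},
    start=datetime(2010, 1, 1),
    end=datetime(2010, 7, 1),
)
```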

VRE 2

VRE 1

VO

Virtual Research Environment (3/3)

• A Virtual Research Environment (VRE) supports cooperative activities like

data analysis and processing; data generation, integration, enrichment, and curation; production of new knowledge using specialized tools

Infrastructure, Virtual Organisation and VRE

Infrastructure

VRE

VO

D4Science as Technology: Key Features

• gCube Core (gCore)
  simplifies and standardizes all systemic aspects of service development;
  promotes the adoption of best practices in multiprogramming and distributed programming

• gCube Enabling Services
  lift the Grid approach for batch job execution and resource sharing to Web Services deployment and invocation in a SOA-empowered e-Infrastructure

gCore: innovation in developing

• An initiative to reduce complexity in the design and implementation of gCube services

an application framework for the consolidation / development of existing/new services

the gCube Core Framework (gCF)

• An initiative to meet the needs of system administrators, infrastructure managers, and resource providers

an easy-to-install, self-contained sandbox to participate in the D4Science-empowered e-Infrastructure

the gCube Core Distribution (gHN)

gCube Enabling Services – IS

gCube provides an Information and Monitoring System in which a rich set of resources, including computing, storage, service, data, metadata, and applications, can be, independently of their type:

• registered, discovered, and accessed

• monitored, shared in a controlled way, accounted

Is a simple Registry sufficient to manage a growing set of heterogeneous resources?

NO, it is not!

gCube Enabling Services - IS [cont]

• gCube Information System: collects information about the capabilities and status of all resources:

Glue schema for computational and storage resources

profiles for gCube services and their running instances

profiles for content and metadata collections

• Currently it manages more than 100 M operations per year, serving more than 300 web services
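To make the idea of type-independent registration, discovery, and monitoring concrete, here is a minimal in-memory sketch in Python. It is not the gCube Information System API: the `InformationSystem` class, its methods, the profile layout, and all sample values are hypothetical and chosen only for illustration.

```python
import time
from typing import Dict, List, Optional


class InformationSystem:
    """Hypothetical in-memory registry of resource profiles (not the gCube IS API)."""

    def __init__(self) -> None:
        self._profiles: Dict[str, dict] = {}

    def register(self, resource_id: str, kind: str, **properties) -> None:
        # A profile records the resource type, its capabilities, and its last known status.
        self._profiles[resource_id] = {
            "kind": kind,
            "properties": dict(properties),
            "status": "unknown",
            "last_seen": None,
        }

    def report_status(self, resource_id: str, status: str, now: Optional[float] = None) -> None:
        # Monitoring: a resource (or its hosting node) pushes a status update.
        profile = self._profiles[resource_id]
        profile["status"] = status
        profile["last_seen"] = time.time() if now is None else now

    def discover(self, kind: Optional[str] = None, **required) -> List[str]:
        # Discovery uses the same query mechanism for every resource type.
        matches = []
        for resource_id, profile in self._profiles.items():
            if kind is not None and profile["kind"] != kind:
                continue
            if all(profile["properties"].get(k) == v for k, v in required.items()):
                matches.append(resource_id)
        return sorted(matches)


# Made-up sample resources for illustration only.
registry = InformationSystem()
registry.register("se-01", "storage", site="Site A", capacity_gb=500)
registry.register("search-svc-1", "running-instance", service="Search", site="Site B")
registry.register("coll-42", "collection", schema="DC")
registry.report_status("search-svc-1", "ready", now=0.0)
```

A plain registry stops at `register` and `discover`; the status reporting is what the slides add when they argue that a simple registry is not sufficient.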

gCube Information System

gHN embedded

Mandatory

gCube Enabling Services – dynamic application building

• gCube VRE Management System:
  manages services and applications
  reduces deployment costs
  reduces operational costs and application porting timeframes
  grants execution only to certified software

It reduces the costs related to e-Infrastructure ownership, maintenance, and upgrade without compromising the essence of secure sharing

VRE 2

VRE 1

VO

gCube VRE Management System

gHN embedded

Mandatory

D4Science as a Services Provider: Key Features

• gCube Service Frameworks: a tailored set of services to effectively manage all resources by providing seamless discovery, access, and retrieval of data, metadata, and annotations through a variety of tools and protocols

• gCube Documentation: a tailored set of manuals to maximise the exploitation of the functionality by users, developers, and system administrators.

gCube Services – powerful information model

• gCube Data Management System
  Persistently stores compound objects
  Manages heterogeneous metadata
  Supports metadata cleaning, enrichment, and transformation by exploiting mapping schemas, controlled vocabularies, thesauri, and ontologies
  Supports programmatic/manual annotation of content, e.g. data provenance
  Supports content linking
  Provides support for collections
  Supports collection sharing across VREs

[Figure labels: describe; similar to; aggregate; collections C1, C2, C3; VRE 1, VRE 2]
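The metadata cleaning, transformation, and provenance annotation listed above can be pictured with a small Python sketch. Everything in it is assumed for the example: the mapping table, the controlled vocabulary, the field names, and the `transform` and `annotate_provenance` helpers are hypothetical and do not reproduce the gCube services.

```python
from typing import Dict

# Hypothetical mapping schema: source field name -> target (Dublin Core-like) field name.
MAPPING = {"name": "title", "author": "creator", "topic": "subject"}

# Hypothetical controlled vocabulary used to clean "subject" values.
VOCABULARY = {"land-use": "Land use", "landuse": "Land use", "agri": "Agriculture"}


def transform(record: Dict[str, str]) -> Dict[str, str]:
    """Clean and transform one metadata record; unmapped fields are dropped."""
    result = {}
    for source_field, value in record.items():
        target_field = MAPPING.get(source_field)
        if target_field is None:
            continue
        value = value.strip()
        if target_field == "subject":
            # Normalise free-text subjects against the controlled vocabulary.
            value = VOCABULARY.get(value.lower(), value)
        result[target_field] = value
    return result


def annotate_provenance(record: Dict[str, str], source: str) -> Dict[str, str]:
    """Return a copy enriched with a provenance annotation (assumed field name)."""
    enriched = dict(record)
    enriched["provenance"] = source
    return enriched


# Made-up sample record for illustration only.
raw = {"name": "  Report 2009 ", "topic": "LandUse", "internal": "x"}
clean = annotate_provenance(transform(raw), source="EEA web pages")
```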

gCube Data Management System

gCube Services – flexible IR

• gCube Search Management provides an XML-based query language over full text, geospatial, and temporal information

• Maximizes the usefulness of resources available to VRE users by
  promoting resource sharing
  avoiding suboptimal usage

• Combines information retrieval and data processing capabilities

gCube Search Management

• Search types
  Structured data (fielded search / XML search)
  Semi-structured data (XML search)
  Geospatial / temporal data (R-Tree)
  Content-based search
  Full text search
  Image similarity search

• Access
  XML-based Query Language
  Web user interface (portal / search portlets)
  Command line UI

• Retrieval
  Incremental result delivery
  Automatic caching
  Result persistence

gCube Services – collaboration

Collaborative Environment: a workspace where users can

• share
  • Private data
  • Data process results
  • Annotation
  • Process definition
  • Derived data

• collaborate
  • to define new document templates, new documents
  • to tune applications and processes
  • to compare execution results

… opens unique opportunities for virtual collaborations

A workspace contains both objects owned by the workspace owner and objects the workspace owner has been allowed to see, e.g. group objects.

Exploiting D4Science: Data Management

Data Resources Staging

Data Resources Staging I

Example              | Protocol     | Metadata        | Data                 | Restriction
EEA                  | HTTP         | to be generated | web pages scavenging | N/A
AATSR                | FTP          | to be generated | download             | N/A
AquaMaps             | Database     | to be generated | grid jobs            | N/A
NASO                 | RSS Feed     | download        | download             | N/A
Landsat7             | GridFTP (SE) | to be generated | grid jobs            | ESA Site
MERIS L3 Chlorophyll | GridFTP (SE) | to be generated | download             | N/A
Specific Reports     | File System  | to be generated | download             | N/A
……

Data Resources Staging II

AquaMaps IO: alternative views (a 2D map and 13 Global 3D views); global, Asia, Indian Ocean

EEA Report IO: multimedia & multi-part; report / part, talk, data, cover, URI

Data Resources Staging III

• AquaMaps IO
  Descriptive metadata in proprietary format
  Data and metadata generated by filtering and rendering relational DB data
  Standard classification (Phylum – Class – Order – Family – Species)
  Data provenance injected

• EEA Report IO
  Descriptive metadata in DC
  Data and metadata generated by web pages scavenging
  Data provider classification (e.g. Agriculture, Land use)
  Data provenance injected

Data Resources Staging IV

• based on a scripting language abstracting over the powerful gCube data model

  3 object types: {collection, resource, relationship}
  Each object has a set of properties
  Each object has a unique “external identifier”

  equipped with common data manipulation constructs, e.g., XSLT, XPath

  provided with predefined data and metadata importers hiding infrastructure complexities

• Very compact workflow specifications
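The three object types and their properties can be sketched in Python as follows. This is not the staging scripting language itself, whose syntax the slides do not show: the `StagedObject` class, the `import_reports` importer, the relationship naming, and the sample XML are hypothetical, and the XPath support used is the limited subset offered by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StagedObject:
    kind: str         # one of: "collection", "resource", "relationship"
    external_id: str  # unique "external identifier"
    properties: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if self.kind not in {"collection", "resource", "relationship"}:
            raise ValueError(f"unknown object type: {self.kind}")


# Made-up source document for illustration only.
SOURCE = """
<reports>
  <report id="eea-1"><title>Land use</title></report>
  <report id="eea-2"><title>Agriculture</title></report>
</reports>
"""


def import_reports(xml_text: str, collection_id: str) -> List[StagedObject]:
    """Hypothetical importer: one collection, one resource per report, one link each."""
    objects = [StagedObject("collection", collection_id)]
    root = ET.fromstring(xml_text)
    # "./report" is an XPath expression selecting the direct <report> children.
    for report in root.findall("./report"):
        rid = report.get("id")
        objects.append(StagedObject("resource", rid, {"title": report.findtext("title")}))
        objects.append(StagedObject(
            "relationship", f"{collection_id}->{rid}",
            {"type": "is-member-of", "source": rid, "target": collection_id},
        ))
    return objects


staged = import_reports(SOURCE, "eea-reports")
```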

Exploiting D4Science: The VREs

AquaMaps

Grid implementation of the current AquaMaps.org approach
Takes benefit from the computing capabilities
Adds advanced filtering
Manages integration of different data sources
Generates provenance data

• 5 seconds to generate an AquaMaps object

• Up to hundreds of concurrent generations

• Bulk support

• Still to come: a facility to compare maps

FCPPS VRE

Provides support for the generation of fisheries and aquaculture country reports

Uses annotations as a means for the editors to communicate on specific topics and sections

Supports aggregation of evolving data
Enriched with a rich set of metadata
Generates provenance data

• HTML publishing with a variety of XSLT

• OpenXML export

• Text, Images, TimeSeries

ICIS VRE

Offers a set of tools to manage capture statistics
Supports the complete time series (TS) lifecycle
Supports validation, curation, and analysis
Provides support for data reallocation
Produces uniform data sets
Generates provenance data

• Multiple key families support

• Filtering, grouping, and aggregation

• Union

• Still to come: facilities to perform complex reallocation rules

• Still to come: facilities to compare large TSs
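The time series operations named above (union, filtering, grouping, and aggregation) can be sketched on a toy data set. The row layout, the function names, and all catch values below are made up for illustration and do not come from ICIS.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A time series row: (year, country, species, catch in tonnes). Values are made up.
Row = Tuple[int, str, str, float]

ts_a: List[Row] = [(2007, "IT", "tuna", 10.0), (2008, "IT", "tuna", 12.0)]
ts_b: List[Row] = [(2008, "IT", "tuna", 12.0), (2008, "FR", "tuna", 7.0), (2008, "FR", "cod", 3.0)]


def union(*series: Iterable[Row]) -> List[Row]:
    """Union of time series, dropping exact duplicate rows and keeping a stable order."""
    seen, merged = set(), []
    for rows in series:
        for row in rows:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


def filter_rows(rows: Iterable[Row], species: str) -> List[Row]:
    """Keep only the rows for one species."""
    return [row for row in rows if row[2] == species]


def aggregate_by_year(rows: Iterable[Row]) -> Dict[int, float]:
    """Group rows by year and sum the catch values."""
    totals: Dict[int, float] = defaultdict(float)
    for year, _country, _species, catch in rows:
        totals[year] += catch
    return dict(totals)


tuna_totals = aggregate_by_year(filter_rows(union(ts_a, ts_b), "tuna"))
```

Taking the union before aggregating matters here: the 2008 Italian tuna row appears in both inputs and would be counted twice if the two series were simply concatenated.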

SUMMING UP

Exploitation Models

A new user community can exploit gCube / D4Science

• By creating a new infrastructure
  Different communities can run their own infrastructure
  The new community provides all resources

• By joining the D4Science infrastructure
  The production infrastructure currently serves two user communities (Earth Monitoring and Fisheries Management)
  The new community provides part of the resources

VOs & VREs building

• A VRE brings together different types of resources through a well defined cost-effective process by offering a rich variety of functionality to access and exploit them.

• The creation of the community environment is simple and easy:

  A new VO can join one infrastructure in less than 1 day
  A new VRE can be deployed in less than 1 hour
  Many automatic deployment & configuration operations managed via the gCube Portal

D4Science & the Grid

• Grid is controlled sharing of computing and storage facilities

• D4Science provides controlled sharing of

  Computing and storage facilities
  Services and applications
  Data, metadata and related resources

To offer control-oriented and cross-domain content-oriented applications to store, describe, curate, annotate, search, select, merge, and transform heterogeneous information

In the landscape of an on-demand created collaborative environment (VRE)

gCube Specifications, Standards & Technologies

Specifications and standards: WS-*, WSRF, X-*, WS-BPEL, JSR, Glue Schema, GSI-Security

Technologies: Java, Globus Toolkit, gLite

More coming: OAI-PMH & OAI-ORE, WS-DAI, OpenSearch, OpenGIS-related

More Exploited: DC, ISO19*

https://quality.wiki.d4science.research-infrastructures.eu/quality/index.php/Standards

QUESTIONS?

The gCube Technology is open source.

gCube Main Links

• gCube software
  http://software.d4science.research-infrastructures.eu/

• gCube Administrator Guide
  https://wiki.gcore.research-infrastructures.eu/gCube/index.php/Administrator_Guide

• gCube User Guide
  https://technical.wiki.d4science.research-infrastructures.eu/documentation/index.php/User%27s_Guide

• gCube Developer Guide
  https://technical.wiki.d4science.research-infrastructures.eu/documentation/index.php/Developer%27s_Guide

gCube License I

gCube License II