25
e-Infrastructure Integration with gCube Andrea Manzi ( CERN ) Pasquale Pagano ( ISTI-CNR ) EGI User Forum 13 April 2011 Vilnius (Lithuania) www.d4science.eu

e-Infrastructure Integration-with gCube

  • Upload
    fao

  • View
    543

  • Download
    2

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: e-Infrastructure Integration-with gCube

e-Infrastructure Integration with gCube

Andrea Manzi ( CERN )Pasquale Pagano ( ISTI-CNR )

EGI User Forum

13 April 2011Vilnius (Lithuania)

www.d4science.eu

Page 2: e-Infrastructure Integration-with gCube

2

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Outline

• D4Science II Ecosystem• gCube architecture• Interoperability approaches

• Resource Discovery• Data Storage & Access• Data Discovery • Data Process• Security

• Applications• AquaMaps• Time Series

Page 3: e-Infrastructure Integration-with gCube

3

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

D4SCIENCE INFRASTRUCTURE

Hadoop EGEE/EGI

INSPIRE

DRIVER

GENESI-DR

AquaMaps

D4Science

II

Ecosyste

m

• Heterogeneous resources

• Heterogeneous computational platforms

• Rich set of legacy applications

• Multiple administrative domains

• Evolving communities

FAO Geonetwork

FAO FIGIS

Community A

Community B

Page 4: e-Infrastructure Integration-with gCube

4

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

gCube architecture

gCube run-time environment

gCube Definition and Management Services

gCube Application Services

Presentation Services

Portlets

Application Support Layer

Information Organization Services

Storage Management

Collection - Content - Metadata -

Annotation -… Management

Information Access Services

Search Framework

Ontology Management

Personalization Service

Index Management Framework

DIR Support Framework

Process ExecutionManagement

VREManagement

Information System Security

gCube Container gCore Framework

Use

r Se

rvices

Page 5: e-Infrastructure Integration-with gCube

5

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Virtual Organization

A Virtual Organization (VO) specifies how a set of users can access a set of resources

what is shared who is allowed to share the conditions under which

sharing can occur

The concept of VO Is not adequate to cover some common scenarios

• Data needs to be assessed before to make it publically exploitable by the VO members.

• Restricted set of users have to collaborate to refine processes and implement show cases.

• Products generated through elaboration of data or simulation have to be validated by expert users.

Page 6: e-Infrastructure Integration-with gCube

6

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Virtual Research Environment

Virtual Research Environment (VRE) is a distributed and dynamically created

environment where subset of resources can be

assigned to a subset of users via interfaces

for a limited timeframe at little or no cost for the providers of

the infrastructure Integrated with cloud systems

( OpenNebula )

VRE 2

VRE 1

VO

gCube is a first example of a VRE management system

Page 7: e-Infrastructure Integration-with gCube

7

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Interoperability: Assumptions

Very rich applications and data collections are currently maintained by a multitude of authoritative providers

Different problems require different execution paradigms: batch, map-reduce, synchronous call, message-queue, …

Key distributed computation technologies exist: grid (gLite and Globus), distributed resource management (Condor), clusters (Hadoop), …

Several standards are adopted in the same domain

Page 8: e-Infrastructure Integration-with gCube

8

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Interoperability: Landscape

Unstructured Data: blob (binary), and textual filesStructured Data: tabular, statistical, geospatial, temporal, and textual dataCompound Data: data composed by unstructured and structured data entities

secu

rity

Page 9: e-Infrastructure Integration-with gCube

9

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Interoperability: gCube Vision

gCube objectives:

hide heterogeneity, i.e. abstract over differences in location, protocol, and model;

embrace heterogeneity, i.e. allow for multiple locations, protocols, and models;

Technical goals:

no bottlenecks: scale no less than the interfaced resources no outages: keep failures partial and temporary autonomicity: system reacts and recovers

Page 10: e-Infrastructure Integration-with gCube

10

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Hiding Heterogeneity

• Heterogeneous resources are virtually accessible in a common ecosystem of resources

• despite their locations, technologies, and protocol

• Different communities have access to different views• according to the conditions under which the sharing can occur

• Each community can define its own VRE

• for a limited timeframe and at no cost for the providers of the resource

• Several VRE can coexist• without interfering each other

even by competing for the same resources

Page 11: e-Infrastructure Integration-with gCube

11

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Embracing Heterogeneity

Approaches and solutions to achieve interoperability :

Blackboard-based

asynchronous communication between components in a system one protocol to R/W and one language to specify messages

Wrapper/ Mediator-based

translates one interface for a component into a compatible interface

Adaptor-based

provides a unified interface to a set of other components interfaces and encapsulates how this set of objects interact

Page 12: e-Infrastructure Integration-with gCube

12

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Each resource is represented by a profile (metadata) characterising:

the interface the state the list of dependencies the run-time status the policies the configuration the pending tasks to execute

A Resource profile

is published by the resource owner is discovered by the resource consumers asynchronously through a

common resource-independent protocol

gCube offers a distributed and scalable Information System (blackboard) to store, discover, and access resource profiles

Interoperability Approaches: Resource Discovery

gCub

e in

tero

pera

bilit

y fr

amew

ork:

the

sol

utio

n

Page 13: e-Infrastructure Integration-with gCube

13

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Interoperability Approaches: Content Interoperability[1/2]

gCube Open Content Management Architecture (OCMA)

Assumption data stored in different storage back-ends diverse locations, models, access types few common primitives: documents, collections, repositories

gCube allows to reach content that lies outside system expose content (reachable from) inside system perform coarse-grained as well as fine-grained retrieval, update, and

addressing

Runtime scalability autonomic read-only state replication, maximize throughput, minimize response time: discovery-time load balancing

(through IS) reduce latencies

Software plugin-based architecture to reduce development costs (plugins over Storage

systems)

Page 14: e-Infrastructure Integration-with gCube

14

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

T1 T2

adapts adapts

factory

Interoperability Approaches: Content Interoperability[2/2]

gDoc

…gDocRead

gDocWrite

Content Manager Service ( OCMA Service)•Adapts gCube doc model ( gDoc ) to an unbounded number of back-end types

Page 15: e-Infrastructure Integration-with gCube

15

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Interoperability Approaches :Data Discovery

gCube offers

Several index types Forward indexing, which supports ultra fast lookups on tabular

typed metadata; XML indexing, that supports semistructured lookups on content

metadata; Textual field indexing, that supports full text and qualified lookups

on textual (mainly) metadata; Metadata full text indexing, that enables full text lookups on

metadata; Content full text indexing, that enables full text lookups on text

extracted by content; Geospatial/temporal indexing, that enables geospatial proximity and

coverage queries to be executed over geospatial/temporal metadata;

Feature indexing, that enables high-dimension vector indexing, for feature lookup (currently the feature is inactive);

Page 16: e-Infrastructure Integration-with gCube

16

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

gCube offers solutions to:

Decouple the business domain and infrastructure specific logic from the core “execution” functionality

Invocate a wide range of logic components: SOAP and REST WebServices, Shell Scripts, Executable Binaries, POJOs, …

Support most of the execution paradigms: batch, map-reduce, synchronous call

Bridges key distributed computation technologies: grid (gLite and Globus), Condor, Hadoop

Control and monitor the execution of a processing flow

Staging of data among different storage providers

Streaming data among computation elements

Interoperability Approaches : Process Execution [1/2]

Page 17: e-Infrastructure Integration-with gCube

17

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Interoperability Approaches : Process Execution [2/2]

By using adaptors that

operate on a specific third party language and translate them into native constructs,

allow for the creation of complex workflows that exploit several diverse technologies deployed on different infrastructures

Page 18: e-Infrastructure Integration-with gCube

18

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Interoperability Approaches : Security [1/2]

gCube offers solutions :

To secure access to gCube resources for interoperable external systems (incoming security) To ensure Interoperability of gCube security mechanisms with standards compliant security systems (reuse) To facilitate secure access to external resources for gCube services (outgoing)

Page 19: e-Infrastructure Integration-with gCube

19

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Authz: •XACML for authz request/response protocol and policy definition •SAML assertions to transport user/service authN information•Argus-based approach (EMI Authz framework) having pluggable design to integrate additional PIPs•SAML Profile for XACML 2.0 following the OASIS Authorization Interoperability Profile SpecificationAuthN:•Production level SSL/HTTPS support•Key- and Trust-Manager

Interoperability Approaches : Security [2/2]

Page 20: e-Infrastructure Integration-with gCube

20

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

AquaMaps is an application*

tailored to predict global distributions of marine species initially designed for marine mammals and subsequently generalised to marine species,

that generates color-coded species range maps using a half-degree latitude and longitude blocks

by interfacing several databases and repository providers

Species Distribution Maps Generation

* Algorithm by Kashner et al. 2006

Page 21: e-Infrastructure Integration-with gCube

21

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

AquaMaps execution is based on the gCube Ecological Niche Modelling Suite which allows the extrapolation of known species occurrences

Species Distribution Maps Generation

◦ to determine environmental envelopes (species tolerances)

◦ to predict future distributions by matching species tolerances against local environmental conditions (e.g. climate change and sea pollution)

Very large volume of input and output data: HSPEC native range 56,468,301 - HSPEC suitable range 114,989,360Very large number of computation: One multispecies map computed on 6,188 half degree cells (over 170k) and 2,540 species requires 125 millions computations (Eli E. Agbayani, FishBase Project/INCOFISH WP1, WorlFish Center)

Page 22: e-Infrastructure Integration-with gCube

22

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Time Series Management

Offers a set of tools to manage capture statistics

Supports the complete TS lifecycle Supports validation, curation, and analysis Provides support for data reallocation Produces uniform data-set

Page 23: e-Infrastructure Integration-with gCube

23

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Time Series and R statistical software integration

The main aims are to:

•provide a complete, fully working, environment for R language

•give user methods to automatically extract data from the time series he was working on

•give user the possibility to perform queries on the time series database

•provide a service distributed on the infrastructure. Multiple instances can be managed on the infrastructure VREs, the distribution being transparent to the users (SaaS model)

Page 24: e-Infrastructure Integration-with gCube

24

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Conclusions

gCube System:

Stable software being improved over the last 5 years ( end of DILIGENT -> D4Science -> D4ScienceII)

gCube offers a variety of patterns, tools, and solutions

to delivery interoperability solutions and interconnect Heterogeneous digital content Heterogeneous repository systems Heterogeneous computation platforms

to decrease the cost of adoption to deal with several standards

Page 25: e-Infrastructure Integration-with gCube

25

www.d4science.eue-Infrastructure Integration with gCubeVilnius, 13 April 2011

Questions Time