www.d4science.org
D4Science technical features and opportunities of the grid infrastructure for large scale data management
Pasquale Pagano, D4Science Technical Director, National Research Council, ISTI-CNR
www.d4science.eu
A closer look at gCube technology
Core Services
Information Organisation Services
Information Retrieval Services
Presentation Services
gCube architectural overview
Services, layers & specifications
The talk is not about this
Outline
• D4Science Mission
• D4Science World: E-Infrastructure, Technology, Service
• D4Science Exploitation: Data Management, VREs
• Summing Up
D4Science technical features
D4Science mission
The D4Science mission is to provide a scientific e-Infrastructure that
• removes all heterogeneity, sustainability, scalability, and other technical concerns from the minds of scientists,
• hides all related complexities from their perception, and
• enables them to focus on their science and collaborate on common research challenges
gCube is a framework to manage e-infrastructures where it is possible to define, host, and maintain dynamic Virtual Research Environments (VREs) capable of satisfying the collaboration needs of distributed Virtual Organizations (VOs)
From a testbed to a production ecosystem
[Timeline figure; marks: Oct. ’04, Nov. ’07, Jan. ’08, Dec. ’09, Oct. ’09, Sept. ’11]
From a testbed to a production ecosystem
[Figure: axis label "functionality"; series gLite and gCube; timeline marks: Oct. ’04, Nov. ’07, Jan. ’08, Dec. ’09, Oct. ’09, Sept. ’11]
D4Science: a Threefold World
Infrastructure vs. e-Infrastructure
• An infrastructure comprises the basic physical and organizational structures and facilities (roads, power supplies, ...) needed for the operation of a society or enterprise
• The D4Science e-Infrastructure provides support for effective consumption of shared resources:
  • hardware-bound resources (i.e. networks, storage, instruments, and computational resources),
  • system-level software resources (i.e. basic middleware services),
  • and application-level software resources (i.e. data sources and services).
Infrastructure vs. e-Infrastructure
• An infrastructure
  • Connects remote places by providing facilities to assist supported resources and consumers.
  • Has policies
• The D4Science e-infrastructure
  • Enables scientific communities to cooperate within a coherent model, regardless of the location of their research facilities
  • Enforces policies
D4Science e-Infrastructure (1/2)
• Facilitate the life of scientists by hiding the complexity
[Figure: D4Science e-Infrastructure; labels: Data analysis, 50 Gb]
D4Science e-Infrastructure (2/2)
• Facilitate the life of scientists by supporting collaboration
[Figure: D4Science e-Infrastructure; labels: share, access, 50 Gb, 50 Gb]
D4Science as e-Infrastructure: Key Features
• The D4Science e-Infrastructure provides scientists with
  • Easy-to-use tools for infrastructural resources registration and management
  • Cost-effective tools for data resource registration, metadata generation, and curation
  • Seamless access to shared, distributed and heterogeneous resources organized in dynamically created Virtual Research Environments
e-Infrastructure Resources
The D4Science managed resources are:
e-Infrastructure Resources [cont.]
e-Infrastructure
[Figure: Site A, Site B, Site C]
Virtual Organization
• A Virtual Organization (VO) specifies how a set of users can access a set of resources
  • by defining what is shared, who is allowed to share, and the conditions under which sharing occurs
  • and by enforcing the authentication and authorization policies.
[Figure: VO]
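The VO definition above can be pictured with a minimal Python sketch. It is an illustration only, not the gCube API: the class `VirtualOrganization`, its methods, and the user and resource names are all hypothetical. It records what is shared, who is allowed to share, and checks both authentication and authorization before granting access.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class VirtualOrganization:
    """Hypothetical VO model (illustration only, not the gCube API)."""
    name: str
    # Who is allowed to share: the VO members.
    members: Set[str] = field(default_factory=set)
    # What is shared, and under which conditions: resource id -> allowed actions.
    shared: Dict[str, Set[str]] = field(default_factory=dict)

    def authenticate(self, user: str) -> bool:
        # Stand-in for a real credential check: membership only.
        return user in self.members

    def authorize(self, user: str, resource: str, action: str) -> bool:
        # Access requires a known member AND an action allowed on that resource.
        if not self.authenticate(user):
            return False
        return action in self.shared.get(resource, set())


# Hypothetical example data.
vo = VirtualOrganization(
    name="example-vo",
    members={"alice", "bob"},
    shared={"catch-statistics": {"read"}, "storage-site-a": {"read", "write"}},
)
print(vo.authorize("alice", "catch-statistics", "read"))
print(vo.authorize("mallory", "storage-site-a", "read"))
```

Keeping the policy in one object mirrors the slide's point that the VO both defines and enforces the sharing rules.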
Virtual Research Environment (1/3)
• VRE scenarios
  • Data needs to be assessed before it is made publicly exploitable by the VO members.
  • A restricted set of users has to collaborate to refine processes and implement showcases.
  • Products generated through elaboration of data or simulation have to be validated by expert users.
Is the VO adequate to represent a growing aggregation of resources tailored to satisfy the evolving needs of the user community?
NO, it is not!
Virtual Research Environment (2/3)
VRE resources can be published in the VO at any time by the VRE data managers.
A Virtual Research Environment (VRE) is a distributed and dynamically created environment where a subset of resources can be assigned to a subset of users for a limited timeframe.
[Figure: VRE 1, VRE 2, VO]
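As a rough illustration of this definition, the following Python sketch models a VRE as a subset of a VO's users and resources that is valid only within a time window. All names (`VRE`, `create_vre`, the users, resources, and dates) are assumptions made for the example and do not come from gCube.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class VRE:
    """Hypothetical VRE: a subset of users and resources with a limited lifetime."""
    name: str
    users: FrozenSet[str]
    resources: FrozenSet[str]
    start: datetime
    end: datetime

    def can_use(self, user: str, resource: str, when: datetime) -> bool:
        # Access needs membership, an assigned resource, and an active timeframe.
        return (user in self.users
                and resource in self.resources
                and self.start <= when < self.end)


def create_vre(name: str, vo_users: Iterable[str], vo_resources: Iterable[str],
               users: Iterable[str], resources: Iterable[str],
               start: datetime, end: datetime) -> VRE:
    """Create a VRE, checking that it only draws on what the VO already holds."""
    users, resources = frozenset(users), frozenset(resources)
    if not users <= frozenset(vo_users):
        raise ValueError("VRE users must be a subset of the VO users")
    if not resources <= frozenset(vo_resources):
        raise ValueError("VRE resources must be a subset of the VO resources")
    if end <= start:
        raise ValueError("the VRE timeframe must be non-empty")
    return VRE(name, users, resources, start, end)


# Hypothetical example data.
vre = create_vre(
    "vre-1",
    vo_users={"alice", "bob", "carol"},
    vo_resources={"collection-1", "service-1", "storage-1"},
    users={"alice"},
    resources={"collection-1"},
    start=datetime(2010, 1, 1),
    end=datetime(2010, 2, 1),
)
print(vre.can_use("alice", "collection-1", datetime(2010, 1, 15)))
```

The subset checks at creation time are one possible way to express that a VRE never exposes more than its VO allows.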
Virtual Research Environment (3/3)
• A Virtual Research Environment (VRE) supports cooperative activities like
data analysis and processing; data generation, integration, enrichment, and curation; production of new knowledge using specialized tools
Infrastructure, Virtual Organisation and VRE
[Figure: Infrastructure, VO, VRE]
D4Science as Technology: Key Features
• gCube Core (gCore)
  • simplifies and standardizes all systemic aspects of service development;
  • promotes the adoption of best practices in multiprogramming and distributed programming
• gCube Enabling Services
  • lift the Grid approach for batch job execution and resource sharing to Web Services deployment and invocation in a SOA-empowered e-Infrastructure
gCore: innovation in developing
• An initiative to reduce complexity in the design and implementation of gCube services
  • an application framework for the consolidation of existing services and the development of new ones
  • the gCube Core Framework (gCF)
• An initiative to meet the needs of system administrators, infrastructure managers, and resource providers
  • an easy-to-install, self-contained sandbox to participate in the D4Science-empowered e-Infrastructure
  • the gCube Core Distribution (gHN)
gCube Enabling Services – IS
gCube provides an Information and Monitoring System where a rich set of resources, including computing, storage, service, data, metadata, and applications, can be, independently of their type:
• registered, discovered, and accessed
• monitored, shared in a controlled way, and accounted
Is a simple Registry sufficient to manage a growing set of heterogeneous resources?
NO, it is not!
gCube Enabling Services - IS [cont]
• gCube Information System: collects information about the capabilities and status of all resources:
  • Glue schema for computational and storage resources
  • profiles for gCube services and their running instances
  • profiles for content and metadata collections
• Currently it manages more than 100 M operations per year, serving more than 300 web services
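To make the difference from a plain registry concrete, here is a small Python sketch of an information system that stores typed profiles, tracks their status, and answers discovery queries over both type and properties. It is a sketch under stated assumptions: `Profile`, `InformationSystem`, and every field and value below are hypothetical and do not reproduce the gCube Information System interfaces or the Glue schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Profile:
    """Hypothetical resource profile: one record per registered resource."""
    resource_id: str
    kind: str                                   # e.g. "computing", "storage", "service"
    properties: Dict[str, str] = field(default_factory=dict)
    status: str = "unknown"                     # refreshed by monitoring


class InformationSystem:
    """Hypothetical registry plus monitoring (illustration only)."""

    def __init__(self) -> None:
        self._profiles: Dict[str, Profile] = {}

    def register(self, profile: Profile) -> None:
        self._profiles[profile.resource_id] = profile

    def update_status(self, resource_id: str, status: str) -> None:
        # Monitoring side: record the latest observed status.
        self._profiles[resource_id].status = status

    def discover(self, kind: Optional[str] = None, status: Optional[str] = None,
                 **properties: str) -> List[Profile]:
        # Discovery works the same way whatever the resource type is.
        return [
            p for p in self._profiles.values()
            if (kind is None or p.kind == kind)
            and (status is None or p.status == status)
            and all(p.properties.get(k) == v for k, v in properties.items())
        ]


# Hypothetical example data.
info_system = InformationSystem()
info_system.register(Profile("node-1", "computing", {"site": "A"}))
info_system.register(Profile("store-1", "storage", {"site": "A"}))
info_system.register(Profile("search-1", "service", {"site": "B"}))
info_system.update_status("node-1", "ready")
print([p.resource_id for p in info_system.discover(site="A")])
```

The status field is what separates this from a static registry: queries can combine what a resource is with how it is currently behaving.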
gCube Information System
[Figure legend: gHN embedded; Mandatory]
gCube Enabling Services – dynamic application building
• gCube VRE Management System:
  • manages services and applications
  • reduces deployment costs
  • reduces operational costs and application porting timeframes
  • grants execution only to certified software
It reduces the costs related to e-Infrastructure ownership, maintenance, and upgrade without compromising the essence of secure sharing
[Figure: VRE 1, VRE 2, VO]
gCube VRE Management System
[Figure legend: gHN embedded; Mandatory]
D4Science as a Services Provider: Key Features
• gCube Service Frameworks: a tailored set of services to effectively manage all resources by providing seamless discovery, access, and retrieval of data, metadata, and annotations through a variety of tools and protocols
• gCube Documentation: a tailored set of manuals to maximise the exploitation of the functionality by users, developers, and system administrators.
gCube Services – powerful information model
• gCube Data Management System
  • Persistently stores compound objects
  • Manages heterogeneous metadata
  • Supports metadata cleaning, enrichment, and transformation by exploiting mapping schemas, controlled vocabularies, thesauri, and ontologies
  • Supports programmatic/manual annotation of content, e.g. data provenance
  • Supports content linking
  • Provides support for collections
  • Supports collections sharing across VREs
[Figure: collections C 1, C 2, C 3 and VRE 1, VRE 2; relationship labels: describe, similar to, aggregate]
gCube Data Management System
gCube Services – flexible IR
• gCube Search Management provides an XML-based query language over full text, geospatial, and temporal information
• Maximizes the usefulness of resources available to VRE users by
  • promoting resource sharing
  • avoiding suboptimal usage
• Combines information retrieval and data processing capabilities
gCube Search Management
• Search types
  • Structured data (fielded search / XML search)
  • Semi-structured data (XML search)
  • Geospatial / temporal data (R-Tree)
  • Content-based search
  • Full text search
  • Image similarity search
• Access
  • XML-based Query Language
  • Web user interface (portal / search portlets)
  • Command line UI
• Retrieval
  • Incremental result delivery
  • Automatic caching
  • Result persistence
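The retrieval features listed above (incremental result delivery and automatic caching) can be sketched in Python with a generator that hands results to the caller page by page and caches a result set once it has been fully delivered. This is an illustrative sketch only; `CachingSearch`, `fake_backend`, and the page size are hypothetical and unrelated to the actual gCube Search implementation.

```python
from typing import Callable, Dict, Iterable, Iterator, List


class CachingSearch:
    """Hypothetical search front end: pages results out and caches complete sets."""

    def __init__(self, backend: Callable[[str], Iterable[str]], page_size: int = 2):
        self._backend = backend
        self._page_size = page_size
        self._cache: Dict[str, List[str]] = {}
        self.backend_calls = 0  # exposed only to make the caching visible

    def search(self, query: str) -> Iterator[List[str]]:
        if query in self._cache:
            # Cached: serve the stored hits in pages without touching the backend.
            hits = self._cache[query]
            for i in range(0, len(hits), self._page_size):
                yield hits[i:i + self._page_size]
            return
        self.backend_calls += 1
        seen: List[str] = []
        page: List[str] = []
        for hit in self._backend(query):
            seen.append(hit)
            page.append(hit)
            if len(page) == self._page_size:
                yield page          # incremental delivery: the caller gets this page now
                page = []
        if page:
            yield page
        self._cache[query] = seen   # cache only fully delivered result sets


def fake_backend(query: str) -> Iterator[str]:
    # Hypothetical data source producing five hits lazily.
    for n in range(5):
        yield f"{query}-{n}"


searcher = CachingSearch(fake_backend, page_size=2)
for result_page in searcher.search("tuna"):
    print(result_page)
```

Caching only after the last page is a deliberate simplification: a consumer that stops early leaves no partial entry behind.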
gCube Services – collaboration
Collaborative Environment: a workspace where users can
• share
  • Private data
  • Data process results
  • Annotation
  • Process definition
  • Derived data
• collaborate
  • to define new document templates, new documents
  • to tune applications and processes to compare execution results
… opens unique opportunities for virtual collaborations
The workspace contains both objects owned by the workspace owner and objects the workspace owner has been allowed to see, e.g. group objects.
Exploiting D4Science: Data Management
Data Resources Staging
Data Resources Staging I
Example               Protocol      Metadata         Data                  Restriction
EEA                   HTTP          to be generated  web pages scavenging  N/A
AATSR                 FTP           to be generated  download              N/A
AquaMaps              Database      to be generated  grid jobs             N/A
NASO                  RSS Feed      download         download              N/A
Landsat7              GridFTP (SE)  to be generated  grid jobs             ESA Site
MERIS L3 Chlorophyll  GridFTP (SE)  to be generated  download              N/A
Specific Reports      File System   to be generated  download              N/A
……
Data Resources Staging II
[Figure: AquaMaps IO; alternative views (a 2D map and 13 Global 3D views); labels: global, Asia, Indian Ocean, …]
[Figure: EEA Report IO; multimedia & multipart; labels: report / part, talk, data, cover, URI, …]
Data Resources Staging III
• AquaMaps IO
  • Descriptive metadata in proprietary format
  • Data and metadata generated by filtering and rendering relational DB data
  • Standard classification (Phylum – Class – Order – Family – Species)
  • Data provenance injected
• EEA Report IO
  • Descriptive metadata in DC
  • Data and metadata generated by web pages scavenging
  • Data provider classification (e.g. Agriculture, Land use)
  • Data provenance injected
Data Resources Staging IV
• based on a scripting language abstracting over the gCube powerful data model
  • 3 object types: {collection, resource, relationship}
  • Each object has a set of properties
  • Each object has a unique “external identifier”
  • equipped with common data manipulation constructs, e.g., XSLT, XPath
  • provided with predefined data and metadata importers hiding infrastructure complexities
• Very compact workflow specifications
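The three object types named above can be pictured with a short Python sketch. It is not the staging scripting language itself, whose syntax the slides do not show; the classes, the `Staging` helper, the identifiers, and the "kind" property are all hypothetical. The sketch keeps the two stated invariants: every object carries a set of properties and a unique external identifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StagedObject:
    """Common shape of every object: unique external identifier plus properties."""
    external_id: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Collection(StagedObject):
    pass


@dataclass
class Resource(StagedObject):
    pass


@dataclass
class Relationship(StagedObject):
    source: str = ""   # external identifier of the source object
    target: str = ""   # external identifier of the target object


class Staging:
    """Hypothetical container enforcing identifier uniqueness."""

    def __init__(self) -> None:
        self._objects: Dict[str, StagedObject] = {}

    def add(self, obj: StagedObject) -> StagedObject:
        if obj.external_id in self._objects:
            raise ValueError(f"duplicate external identifier: {obj.external_id}")
        self._objects[obj.external_id] = obj
        return obj

    def related(self, source_id: str, kind: str) -> List[str]:
        # Follow relationships of a given kind starting from one object.
        return [
            o.target for o in self._objects.values()
            if isinstance(o, Relationship)
            and o.source == source_id
            and o.properties.get("kind") == kind
        ]


# Hypothetical example data.
staging = Staging()
staging.add(Collection("col:maps", {"title": "Example maps"}))
staging.add(Resource("res:map-1", {"format": "image"}))
staging.add(Relationship("rel:1", {"kind": "aggregate"},
                         source="col:maps", target="res:map-1"))
print(staging.related("col:maps", "aggregate"))
```

Treating relationships as first-class objects with their own identifiers and properties is what lets provenance or annotations be attached to links as well as to data.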
Exploiting D4Science: The VREs
AquaMaps
Grid implementation of the current AquaMaps.org approach
Benefits from the computing capabilities
Adds advanced filtering
Manages integration of different data sources
Generates provenance data
• 5 seconds to generate an AquaMaps object
• Up to hundreds of concurrent generations
• Bulk support
• Still to come: a facility to compare maps
FCPPS VRE
Provides support for the generation of fisheries and aquaculture country reports
Uses annotations as a means for the editors to communicate on specific topics and sections
Supports aggregation of evolving data
Enriched with a rich set of metadata
Generates provenance data
• HTML publishing with a variety of XSLT
• OpenXML export
• Text, Images, TimeSeries
ICIS VRE
Offers a set of tools to manage capture statistics
Supports the complete time series (TS) lifecycle
Supports validation, curation, and analysis
Provides support for data reallocation
Produces uniform data sets
Generates provenance data
• Multiple key families support
• Filtering, grouping, and aggregation
• Union
• Still to come: facilities to perform complex reallocation rules
• Still to come: facilities to compare large TSs
SUMMING UP
Exploitation Models
A new user community can exploit gCube / D4Science
• By creating a new infrastructure
  • Different communities can run their own infrastructure
  • The new community provides all resources
• By joining the D4Science infrastructure
  • The production infrastructure currently serves two user communities (Earth Monitoring and Fisheries Management)
  • The new community provides part of the resources
VOs & VREs building
• A VRE brings together different types of resources through a well-defined, cost-effective process by offering a rich variety of functionality to access and exploit them.
• The creation of the community environment is simple and easy:
  • A new VO can join one infrastructure in less than 1 day
  • A new VRE can be deployed in less than 1 hour
  • Many automatic deployment & configuration operations managed via the gCube Portal
D4Science & the Grid
• Grid is controlled sharing of computing and storage facilities
• D4Science provides controlled sharing of
  • Computing and storage facilities
  • Services and applications
  • Data, metadata and related resources
To offer control-oriented and cross-domain content-oriented applications to store, describe, curate, annotate, search, select, merge, and transform heterogeneous information
In the landscape of an on-demand created collaborative environment (VRE)
gCube Specifications, Standards & Technologies
WS-*, WSRF, X-*, WS-BPEL, JSR, Glue Schema, GSI-Security
Java, Globus Toolkit, gLite
More coming: OAI-PMH & OAI-ORE, WS-DAI, OpenSearch, OpenGIS-related
More Exploited: DC, ISO19*
https://quality.wiki.d4science.research-infrastructures.eu/quality/index.php/Standards
QUESTIONS?
The gCube Technology is open source.
gCube Main Links
• gCube software
  • http://software.d4science.research-infrastructures.eu/
• gCube Administrator Guide
  • https://wiki.gcore.research-infrastructures.eu/gCube/index.php/Administrator_Guide
• gCube User Guide
  • https://technical.wiki.d4science.research-infrastructures.eu/documentation/index.php/User%27s_Guide
• gCube Developer Guide
  • https://technical.wiki.d4science.research-infrastructures.eu/documentation/index.php/Developer%27s_Guide
gCube License I
gCube License II