December 2009 Data Integration in Grid Environments Alex Poulovassilis, Birkbeck, U. of London


Lecture Overview

Part I:
• Grid Computing
• Grid Architectures
• Grid Standards

Part II:
• The ISPIDER project – a grid application in Bioinformatics
• The AutoMed project
• OGSA-DAI and OGSA-DQP middleware
• DAI/DQP/AutoMed interoperability in ISPIDER

Part I

• Grid Computing
• Grid Architectures
• Grid Standards

What is Grid Computing?

the term first arose in the mid 1990s and it is also known as utility computing

the world is full of computing resources connected by networks, but their distribution, heterogeneity and autonomy make it hard for such resources to be shared

the development of grid computing has been motivated by the need for flexible, secure and coordinated resource sharing to solve large-scale computing problems

this resource sharing is between dynamic collections of individuals, institutions and resources, which collectively form a Virtual Organisation (VO)

What is Grid Computing?

resource sharing includes shared usage of hardware, software, data resources, sensor networks, etc.

this sharing is necessary in order to solve in a collaborative fashion large scale computing problems arising in science, engineering and business

the sharing is controlled, with providers of resources defining what may be shared and under what conditions

How did grid computing come about?

arose in academia (e-science), with Ian Foster and Carl Kesselman leading the development of the Globus toolkit

was then picked up by industry e.g. SUN, IBM, HP, Oracle

driving forces were:
(a) computationally-intensive scientific problems e.g. simulations
(b) scientific problems involving huge quantities of data e.g. analysis of large data sets

leading to so-called Computational Grids and Data Grids

grid computing may be viewed as an extension of the WWW (information sharing) to sharing of general computing resources

The international grid community

the Global Grid Forum (GGF) was formed in 2000 as a community of researchers, users and vendors aiming to exchange ideas on grid development and deployment, and to develop specifications for grid standards

the Enterprise Grid Alliance (EGA) was formed in 2004 as a non-profit vendor organisation to develop grid computing in industry

GGF and EGA merged in 2006 into the new Open Grid Forum

there has been much funding of grid research and development in the EU, regionally, and nationally over the past decade

How is grid different from distributed computing?

in grid computing, the focus is on large-scale resource sharing, requiring authentication, authorisation, resource discovery, resource scheduling and costing

in grid applications, presentation services access the functionality provided by service-oriented grid middleware, which virtualises the dynamic deployment of the actual computing resources

by contrast, in client-server or multi-tier applications, the presentation, application and back-end services and resources are separated, but are fixed and their deployment is known

What is a Grid Architecture?

was originally envisaged as being organised into layers:
• Application
• Collective: see next slide
• Resource: protocols for initiation, monitoring and control of the computing and data resources
• Connectivity: communication and authentication protocols
• Fabric: the physical computing and data resources

the components in each layer share common characteristics and build on the services provided by lower layers

the Resource and Connectivity services can be implemented over lower-level resources at the Fabric layer

and can in turn support a range of higher-level services at the Collective and Application layers

Collective Layer

comprises global protocols and services that capture interactions across collections of resources e.g.
• directory services to search for resources by name or attributes such as type, availability, load
• allocation, scheduling and brokering services to request allocation of one or more resources for a specific task, and scheduling of tasks on resources
• monitoring and diagnostic services to monitor the execution of tasks
• workload management systems for specifying and executing workflows consisting of multiple tasks
• accounting and payment services gathering resource usage information
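As a rough illustration of the directory and brokering services listed above, the following Python sketch searches a toy directory by attribute and picks the least-loaded available resource. The class names, the attribute names and the "least loaded" policy are all assumptions made for the example; real grid middleware exposes these capabilities through its own protocols.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str         # e.g. "cpu" or "storage" (hypothetical attribute)
    available: bool
    load: float       # fraction of capacity in use, 0.0 to 1.0


class Directory:
    """Toy directory service: look up resources by attribute values."""

    def __init__(self, resources):
        self._resources = list(resources)

    def search(self, kind=None, available=None, max_load=None):
        hits = self._resources
        if kind is not None:
            hits = [r for r in hits if r.kind == kind]
        if available is not None:
            hits = [r for r in hits if r.available == available]
        if max_load is not None:
            hits = [r for r in hits if r.load <= max_load]
        return hits


def broker(directory, kind):
    """Toy broker: choose the least-loaded available resource of a kind."""
    candidates = directory.search(kind=kind, available=True)
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.load)


directory = Directory([
    Resource("node-a", "cpu", True, 0.80),
    Resource("node-b", "cpu", True, 0.25),
    Resource("node-c", "cpu", False, 0.00),
    Resource("store-1", "storage", True, 0.50),
])
chosen = broker(directory, "cpu")
print(chosen.name)
```

Here node-c is skipped because it is unavailable, even though its load is lowest.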

Subsequent developments in Grid Architectures

no longer a layered architecture but a service oriented architecture (SOA) e.g. the Open Grid Services Architecture

services are loosely coupled peers that can interact with each other to achieve a given capability e.g.
• a service may extend the capabilities of another service in order to provide its own functionality
• a service may compose the capabilities of other services to provide higher-level functionality
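The two patterns above, extension and composition, can be sketched in Python as follows. This is only an in-process analogy: the service classes, the `invoke` method and the sample data are hypothetical, and real grid services would interact through messages rather than direct method calls.

```python
class LookupService:
    """Base service: returns a stored value for a key."""

    def __init__(self, records):
        self._records = records

    def invoke(self, key):
        return self._records.get(key)


class CachingLookupService:
    """Extends another service's capability by adding a cache in front of it."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}
        self.misses = 0

    def invoke(self, key):
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._inner.invoke(key)
        return self._cache[key]


class ReportService:
    """Composes two peer services to provide a higher-level capability."""

    def __init__(self, names, sizes):
        self._names = names
        self._sizes = sizes

    def invoke(self, key):
        return f"{self._names.invoke(key)}: {self._sizes.invoke(key)} tasks"


names = CachingLookupService(LookupService({"job-1": "alignment"}))
sizes = LookupService({"job-1": 3})
report = ReportService(names, sizes)
first = report.invoke("job-1")
second = report.invoke("job-1")
print(first, names.misses)
```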

This leads to a 3-tier Architecture

[Figure: three tiers, from top to bottom: Applications, Service pool, Resources]

• the service pool is the grid middleware
• below it are the actual physical resources
• above it are the applications which access the service pool
• the location and nature of the actual resources is transparent to the applications
• thus the grid middleware allows resource virtualisation
• applications use services as and when needed
• and pay only for this usage, hence also the term utility computing

Different types of grids

departmental grids:
• built on clusters or groups of clusters owned by one department of an enterprise

enterprise grids:
• sharing of common resources by many departments of an enterprise

partner grids:
• involving several partner institutions, known to each other and with common goals

open grids:
• anyone can join and become a resource provider and/or resource user

Status of grid computing today

numerous commercial departmental and enterprise grids are in operation today
• e.g. for drug discovery, stock market trading, integrated circuit design, enterprise resource planning

also many international partner grids
• e.g. in high energy physics, life sciences, design and engineering, computational chemistry, astrophysics, earth sciences

software to support open grids is also emerging

Grid Standards

Numerous Grid products are available from various organisations and vendors

these need to be able to interoperate, and this is made possible by the development of standards e.g. the Open Grid Services Architecture (OGSA)

OGSA is based on Web Services

web standards of relevance to the grid include: HTTP (transport), XML (data format), SOAP (message syntax), WSDL (web service definition), UDDI (web service registry), WS-Security, BPEL (workflow definition)

Grid Standards

a key area in which Grid requirements have motivated new WS standards is in the representation and manipulation of state

standard WSs are stateless from the point of view of the requester of the service

OGSA assumes that service interfaces and service behaviours are defined as in WSRF (Web Service Resource Framework)

WSRF defines how state should be modelled, accessed and managed; how services should be grouped; and how faults should be modelled

Also, WS-Notification defines notification mechanisms that support subscription to, and notification of, changes to services and to their state

OGSA vs other Web Service environments

apart from state, the other major difference of OGSA compared with other WS environments is that grid environments are not static:

in contrast to standard WSs, grid services can be created and deployed dynamically

the set of available resources and their load at any time may be highly variable, while still requiring application requirements and SLAs (service-level agreements) to be met

failures to meet SLAs or occurrences of faults may require dynamic restart of executions on other alternative resources

there is thus a need for monitoring of grid applications and for responding dynamically to their needs until completion
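The monitoring and dynamic restart behaviour described above might look like the following Python sketch, which tries candidate resources in turn until one completes the task within a deadline. Everything here is hypothetical: the resource functions return simulated elapsed times rather than being timed, and the deadline stands in for an SLA term.

```python
class ResourceFailure(Exception):
    """Raised by a (simulated) resource that faults during execution."""


def run_with_failover(task, resources, deadline_s):
    """Try each candidate resource in turn until one meets the deadline."""
    attempts = []
    for resource in resources:
        try:
            result, elapsed_s = resource(task)
        except ResourceFailure:
            attempts.append((resource.__name__, "fault"))
            continue
        if elapsed_s > deadline_s:
            attempts.append((resource.__name__, "sla_missed"))
            continue
        attempts.append((resource.__name__, "ok"))
        return result, attempts
    raise RuntimeError(f"no resource completed the task: {attempts}")


# Simulated resources: each returns (result, elapsed seconds) or faults.
def flaky_node(task):
    raise ResourceFailure("node offline")


def slow_node(task):
    return task.upper(), 12.0


def fast_node(task):
    return task.upper(), 1.5


result, attempts = run_with_failover(
    "align", [flaky_node, slow_node, fast_node], deadline_s=5.0
)
print(result, attempts)
```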

Implementing Grid services

associated with a Grid Service are a set of Service Data Elements (SDEs)

these are XML documents and represent information about grid service instances, allowing their discovery and management

each Grid Service “port type” has an associated set of SDEs

different types of Grid Service are realised by providing different sets of port types

Background Reading for Part I

• Grid Cafe, http://www.gridcafe.org/
• The EGEE project, www.eu-egee.org/
• Worldwide LHC (Large Hadron Collider) Computing Grid, http://lcg.web.cern.ch/LCG/public/
• Open Grid Forum, www.ogf.org
• Globus toolkit, www.globus.org/alliance/publications/


Part II

The ISPIDER project – a grid application in Bioinformatics

The AutoMed project

OGSA-DAI and OGSA-DQP middleware

DAI/DQP/AutoMed interoperability in ISPIDER

The ISPIDER Project

Partners: Birkbeck, EBI, Manchester, UCL

Requirements:

• There are vast amounts of heterogeneous proteomics data being produced via a variety of new techniques

• Proteomics is the study of the protein complement of the genome

• It is targeted at the elucidation of biological function from genomic data

• There is a need for interoperability between autonomous proteomics data resources

• And for complex analyses over integrated virtual resources

[Figure: Biological data: Genes -> Proteins -> Biological Function. Genome (permanent copy): DNA sequences of 4 bases (A, C, G, T), with a gene marked. RNA (temporary copy): copy of DNA sequence. Protein (product): sequence of 20 amino acids; each triple of RNA bases encodes an amino acid. Function: job, biological processes.]

This slide is adapted from Nigel Martin’s Lecture Notes on Bioinformatics
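The gene to RNA to protein pipeline on this slide can be made concrete with a short Python sketch. It uses a deliberately partial codon table (five entries of the standard genetic code, which has 64) and a made-up DNA string, so it illustrates the triple-to-amino-acid mapping rather than serving as a usable translator.

```python
# Partial codon table from the standard genetic code; a full table has 64 entries.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
    "UAA": "*",  # stop
}


def transcribe(dna_coding_strand):
    """RNA copy of the coding strand: thymine (T) is replaced by uracil (U)."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGTTTGGCGCTTAA")  # illustrative sequence, not a real gene
protein = translate(rna)
print(rna, protein)
```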

Aims of ISPIDER

Hence, the development of a Proteomics Grid Infrastructure, using existing proteomics resources and developing new ones; also developing new proteomics clients for querying, visualisation, workflow etc.

The development of such a system is beneficial for a number of reasons:
• Access to more data sources yields more reliable analyses
• Integrating resources increases the breadth of information available for the biologist
• Enables new analyses to be undertaken which would have been prohibitively difficult or impossible with just the individual resources

ISPIDER Architecture

Some ISPIDER data resources

gpmDB (see http://gpmdb.thegpm.org)
• a publicly available database with more than 2 million proteins and almost 470,000 unique peptide identifications
• provides access to a wealth of peptide identifications from a range of different laboratories and instruments

PEDRo (see http://pedrodb.man.ac.uk:8080/pedrodb)
• provides access to a collection of descriptions of experimental data sets in proteomics

PepSeeker (see http://nwsr.smith.man.ac.uk/pepseeker)
• developed as part of the ISPIDER project and targeted at the identification stage of the proteomics pipeline
• currently holds over 50,000 proteins and 50,000 unique peptide identifications


myGrid / DQP / AutoMed Middleware

myGrid: provides a workflow environment over web/grid services, allowing high-level integration of data and applications for in-silico experiments in biology

OGSA-DQP: provides distributed query processing over Grid enabled data resources

AutoMed: provides heterogeneous data integration functionality over distributed data sources (the AutoMed project partners are Birkbeck and Imperial College)

ISPIDER research: integration of AutoMed and DAI/DQP (topic of this lecture); also integration of AutoMed and myGrid workflows

Motivation for AutoMed

Data Integration (DI) is the process of creating an integrated resource which
• combines data from a variety of autonomous data sources
• in order to support new queries and analyses

the data sources may be heterogeneous in terms of their:
• data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, nomenclature adopted

this poses several challenges, leading to several methodologies, architectures and systems being developed to support DI

these aim to abstract out data transformation and aggregation logic from application programs into generic data integration software

AutoMed

Supports a metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages can be defined – so extensible with new modelling languages

After a modelling language has been specified in terms of the HDM, a set of primitive schema transformations become available for schemas expressed in that language

Schemas can be incrementally transformed and integrated by applying to them a sequence of primitive transformations

Schemas may or may not have data associated with them: so virtual, materialised (data warehousing) or hybrid integration can be supported

Transformations are accompanied by queries, allowing data and query translation between source and target schemas
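A much simplified Python sketch of incremental schema transformation follows. The step names `add`, `delete` and `rename`, the representation of a schema as a set of construct names, and the reversal rule are assumptions made for illustration; the slides do not list AutoMed's primitives, and the query strings are carried along but not evaluated.

```python
def add(construct, query):
    return ("add", construct, query)


def delete(construct, query):
    return ("delete", construct, query)


def rename(old, new):
    return ("rename", old, new)


def apply_pathway(schema, pathway):
    """Apply each primitive step in order to a set of construct names."""
    result = set(schema)
    for step in pathway:
        op = step[0]
        if op == "add":
            result.add(step[1])
        elif op == "delete":
            result.remove(step[1])
        elif op == "rename":
            result.remove(step[1])
            result.add(step[2])
        else:
            raise ValueError(f"unknown step: {op}")
    return result


def reverse_pathway(pathway):
    """Assumed reversal rule: swap add/delete, swap rename arguments, reverse order."""
    reversed_steps = []
    for op, a, b in reversed(pathway):
        if op == "add":
            reversed_steps.append(delete(a, b))
        elif op == "delete":
            reversed_steps.append(add(a, b))
        else:
            reversed_steps.append(rename(b, a))
    return reversed_steps


source = {"<<proteinhit>>", "<<proteinhit,ProteinID>>"}
pathway = [
    rename("<<proteinhit>>", "<<Protein>>"),
    add("<<Protein,accession_number>>", "[{k,x} | {k,x} <- <<proteinhit,ProteinID>>]"),
    delete("<<proteinhit,ProteinID>>", "[{k,x} | {k,x} <- <<Protein,accession_number>>]"),
]
target = apply_pathway(source, pathway)
print(sorted(target))
```

The query attached to each step is what would let data and queries be translated along the pathway; here it is only recorded.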

AutoMed Architecture

Global Query Processor

Global Query Optimiser

Schema Evolution Tool

Schema Transformationand Integration Tools

Model Definition Tool

Schema and Transformation

Repository

Model Definitions Repository

Wrapper

Distributed Data Sources

Global Query Processing in AutoMed

We handle query language heterogeneity by translation into/from an intermediate query language – IQL

A query Q expressed in a high-level query language such as SQL on a global schema S would first be translated into IQL

For example, the following IQL query on a global schema retrieves all identifications for the protein with accession number ENSP00000339074:

[id | {id,an} <- <<Protein,accession_number>>; an='ENSP00000339074']

View definitions are then derived from the transformation pathways between S and the data source schemas (in this case gpmDB, PEDRo and PepSeeker)

These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs

Global Query Processing in AutoMed (cont’d)

E.g. for Q as above the reformulated query is:

  [id | {id,an} <-
      [{id2lsid ['pepseeker.proteinhit:', toString d], x} | {d,x} <- distinct [{k,x} | {k,x} <- <<proteinhit,ProteinID>>]]
   ++ [{id2lsid ['pedro.protein:', toString d], x} | {d,x} <- <<protein,accession_num>>]
   ++ [{id2lsid ['gpmdb.proseq:', toString d], x} | {d,x} <- <<proseq,label>>];
   an='ENSP00000339074']
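For readers more familiar with list comprehensions than IQL, the reformulated query can be mirrored in Python roughly as below. The source extents are invented sample data, and `id2lsid` is assumed to concatenate its arguments into one identifier string; only the construct names, the prefixes and the accession number ENSP00000339074 come from the query itself.

```python
def id2lsid(parts):
    """Assumed behaviour: build an identifier by concatenating the parts."""
    return "".join(parts)


def distinct(pairs):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(pairs))


# Hypothetical extents of the source constructs, as (key, value) pairs.
proteinhit_ProteinID = [          # PepSeeker
    (1, "ENSP00000339074"),
    (1, "ENSP00000339074"),       # duplicate, removed by distinct
    (2, "ENSP00000000001"),
]
protein_accession_num = [(7, "ENSP00000339074")]   # PEDRo
proseq_label = [(42, "ENSP00000999999")]           # gpmDB

# The view of <<Protein,accession_number>>: the three source views appended (++).
protein_accession_number = (
    [(id2lsid(["pepseeker.proteinhit:", str(d)]), x)
     for d, x in distinct(proteinhit_ProteinID)]
    + [(id2lsid(["pedro.protein:", str(d)]), x)
       for d, x in protein_accession_num]
    + [(id2lsid(["gpmdb.proseq:", str(d)]), x)
       for d, x in proseq_label]
)

# The original query over the global schema, now running over the view.
result = [ident for ident, an in protein_accession_number
          if an == "ENSP00000339074"]
print(result)
```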

Global Query Processing (cont’d)

Query optimisation then occurs

One goal of this is to generate the largest possible sub-queries that can be submitted to data source Wrappers for translation into the data source query languages and evaluation by the data sources

Query evaluation then follows, during which the AutoMed Evaluator submits to Wrappers sub-queries that they are able to translate into the data source query language (currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data resources)

The Wrappers submit sub-queries to data sources, and translate sub-query results back into the IQL type system

The Evaluator then undertakes any further necessary query evaluation to combine sub-query results
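The division of labour between evaluator and wrappers can be sketched in Python as follows. The `ToyWrapper` class, its two operations (`scan` and `select_eq`) and the sample rows are hypothetical; the point is only that a selection is pushed to a wrapper that can translate it, and applied by the evaluator otherwise.

```python
class ToyWrapper:
    """Stand-in for a wrapper over one data source."""

    def __init__(self, prefix, rows, supported):
        self.prefix = prefix
        self.rows = rows
        self.supported = set(supported)   # operations this wrapper can translate
        self.received = []                # sub-queries actually submitted to it

    def evaluate(self, op, value=None):
        if op not in self.supported:
            raise ValueError(f"{self.prefix} cannot translate {op}")
        self.received.append(op)
        rows = self.rows if op == "scan" else [r for r in self.rows if r[1] == value]
        # Translate source rows back into a common (identifier, value) form.
        return [(f"{self.prefix}:{key}", val) for key, val in rows]


def evaluate_global(wrappers, value):
    """Push the selection down where possible; otherwise filter centrally."""
    combined = []
    for wrapper in wrappers:
        if "select_eq" in wrapper.supported:
            rows = wrapper.evaluate("select_eq", value)
        else:
            rows = [r for r in wrapper.evaluate("scan") if r[1] == value]
        combined.extend(rows)
    return [ident for ident, _ in combined]


sql_source = ToyWrapper("pedro.protein", [(7, "A1"), (8, "B2")], {"scan", "select_eq"})
flat_file = ToyWrapper("flatfile", [(1, "A1"), (2, "C3")], {"scan"})
answer = evaluate_global([sql_source, flat_file], "A1")
print(answer)
```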

OGSA-DAI and OGSA-DQP

OGSA-DAI (Data Access and Integration)
• delivers data access, transport and metadata services for the grid
• there are other OGSA services that focus on data derivation, consistency and replication services

OGSA-DQP (Distributed Query Processing)
• provides services for the compilation, optimisation and distributed evaluation of queries over grid data resources accessed via OGSA-DAI

OGSA-DAI functionality

provides a consistent interface to data resources regardless of the underlying technology e.g. relational (Oracle, DB2, MySQL) or XML (Xindice; eXist)

OGSA-DAI extends standard Grid Services with several new port types, including • Grid Data Service (GDS)

OGSA-DAI functionality

Grid Data Service (GDS):
• accepts requests, in the form of XML documents, instructing the Grid Service instance to interact with a database in order to create, retrieve, update or delete data
• its primary operation is "perform", through which such requests are passed to the GS
• a request may consist of a collection of linked activities e.g. a data access, followed by a data translation, followed by a data delivery
• all of these can be bundled into one request in order to reduce the number of round trips required between the client and the service
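The benefit of bundling linked activities into a single request can be illustrated with a Python sketch. This is not the OGSA-DAI API or its XML request format: `ToyDataService`, the activity functions and the sample rows are hypothetical, and the request is modelled as a plain list of callables.

```python
class ToyDataService:
    """Stand-in for a data service; counts client round trips."""

    def __init__(self, table):
        self.table = table
        self.round_trips = 0

    def perform(self, activities):
        """Run a chain of activities in one request, piping output to input."""
        self.round_trips += 1
        data = self.table
        for activity in activities:
            data = activity(data)
        return data


def select_accession(accession):
    """Data access activity: keep rows with the given accession."""
    return lambda rows: [r for r in rows if r["accession"] == accession]


def to_csv(rows):
    """Data translation activity: rows to comma-separated lines."""
    return "\n".join(f'{r["id"]},{r["accession"]}' for r in rows)


def deliver(inbox):
    """Data delivery activity: append the payload to a client-side list."""
    def _deliver(payload):
        inbox.append(payload)
        return len(payload)
    return _deliver


service = ToyDataService([
    {"id": 1, "accession": "X1"},
    {"id": 2, "accession": "Y2"},
    {"id": 3, "accession": "X1"},
])
inbox = []
service.perform([select_accession("X1"), to_csv, deliver(inbox)])
print(inbox, service.round_trips)
```

Sending the three activities as separate requests would cost three round trips instead of one.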

OGSA-DQP functionality

This implements the GDS and GDT port types from OGSA-DAI and also adds two new port types: GDQS and GQES.

Grid Distributed Query Service (GDQS):

can interact with known registries to obtain the schemas of data resources and also information about computational resources

this set-up phase occurs once in the lifetime of a GDQS instance

clients can then submit a query to the GDQS via the GDS port-type, using a perform call

this is compiled, optimised and partitioned into a distributed query execution plan each of whose partitions will be scheduled for execution at different GQESs (see below)

the GDQS uses this information to create the necessary GQES instances on their designated execution nodes, and hands over to each GQES the partition assigned to it

OGSA-DQP functionality

Grid Query Evaluation Service (GQES):

each GQES instance is an execution node in a distributed query execution plan

it is responsible for that part of the execution plan allocated to it by the GDQS

it implements a physical algebra over other Grid Data Services

encapsulated within these other GDSs are the data resources whose schemas were imported during the GDQS set-up phase
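The hand-over from GDQS to GQES instances can be pictured with a small Python sketch: a plan's operators are grouped by the node chosen for each, and one evaluator object is created per partition. The operator names, node names and the `ToyEvaluator` class are hypothetical, and no compilation or optimisation is modelled.

```python
from collections import defaultdict


def partition_plan(operators, placement):
    """Group plan operators by the execution node chosen for each."""
    partitions = defaultdict(list)
    for op in operators:
        partitions[placement[op]].append(op)
    return dict(partitions)


class ToyEvaluator:
    """Stand-in for a query evaluation service instance on one node."""

    def __init__(self, node, partition):
        self.node = node
        self.partition = partition


def schedule(operators, placement):
    """Create one evaluator per partition and hand it its part of the plan."""
    partitions = partition_plan(operators, placement)
    return [ToyEvaluator(node, part) for node, part in sorted(partitions.items())]


operators = ["scan(proteinhit)", "scan(protein)", "join", "project"]
placement = {
    "scan(proteinhit)": "node-1",
    "scan(protein)": "node-2",
    "join": "node-3",
    "project": "node-3",
}
evaluators = schedule(operators, placement)
for evaluator in evaluators:
    print(evaluator.node, evaluator.partition)
```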

DAI/DQP/AutoMed Interoperability

Data sources wrapped with OGSA-DAI

AutoMed-DAI wrappers extract data sources’ metadata

Semantic integration of data sources using transformation pathways

IQL queries submitted to an integrated schema are reformulated to IQL queries on the data sources, using the transformation pathways

The reformulated queries are submitted to OGSA-DQP for evaluation (rather than being evaluated by AutoMed)

[Figure: DAI/DQP/AutoMed interoperability. Labelled components: DBs; three OGSA-DAI Services and three OGSA-DAI Activities; three AutoMed wrappers and three AutoMed Schemas; the AutoMed Repository; the Integrated AutoMed Schema; the AutoMed Query Processor; the AutoMed DQP wrapper; the Distributed Query Processor. Labelled flows: IQL query, OQL query, OQL result, IQL result.]

The AutoMed-DAI Wrapper

The AutoMed-DAI wrapper requests the schema of the data source using an OGSA-DAI service

The service replies with the source schema encoded in an XML response document

The AutoMed-DAI wrapper creates the corresponding schema in the AutoMed repository
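A minimal Python sketch of the last two steps follows: parsing a schema response and deriving schema constructs from it. The XML layout shown is invented for the example and a real OGSA-DAI response document will differ; the `<<table>>` and `<<table,column>>` naming follows the construct notation used in the IQL queries earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical response layout, used only for illustration.
RESPONSE = """
<schemaResponse>
  <table name="protein">
    <column name="id" type="INTEGER"/>
    <column name="accession_num" type="VARCHAR"/>
  </table>
</schemaResponse>
"""


def constructs_from_response(xml_text):
    """Turn table/column metadata into <<table>> and <<table,column>> names."""
    root = ET.fromstring(xml_text.strip())
    constructs = []
    for table in root.findall("table"):
        table_name = table.get("name")
        constructs.append(f"<<{table_name}>>")
        for column in table.findall("column"):
            constructs.append(f"<<{table_name},{column.get('name')}>>")
    return constructs


constructs = constructs_from_response(RESPONSE)
print(constructs)
```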

[Figure: the AutoMed wrapper sends a schema request to the OGSA-DAI Service in front of the DB, receives an XML response, and produces an AutoMed Schema.]

The AutoMed-DQP Wrapper

The AutoMed-DQP wrapper undertakes two tasks:
• it informs AutoMed of the subset of IQL that it is capable of translating into OQL
• it makes interactions with OGSA-DQP transparent to the remainder of the AutoMed infrastructure

On receiving an IQL query, the AutoMed-DQP wrapper first translates it into the equivalent OQL query

The OQL query is then sent to OGSA-DQP for evaluation

The reply from OGSA-DQP is in the form of an XML response document containing the query results

The AutoMed-DQP wrapper translates these results into the IQL type system, and returns the result to AutoMed's evaluator for any further necessary evaluation
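The wrapper's round trip can be sketched in Python as below. This is a toy under several assumptions: the query is a small dictionary rather than real IQL, the generated text is only OQL-style, the XML response layout is invented, and `fake_dqp` stands in for OGSA-DQP so that the example runs on its own.

```python
import xml.etree.ElementTree as ET


class ToyDQPWrapper:
    # The query forms this wrapper declares it can translate.
    supported_forms = {"select_eq"}

    def __init__(self, send):
        self._send = send   # callable: query text -> XML response text

    def to_oql(self, query):
        """Translate a supported query form into an OQL-style string."""
        if query["form"] not in self.supported_forms:
            raise ValueError(f"cannot translate form: {query['form']}")
        return (
            f"select r.{query['project']} from r in {query['collection']} "
            f"where r.{query['attribute']} = '{query['value']}'"
        )

    def evaluate(self, query):
        """Send the translated query and convert the XML rows to a Python list."""
        response = self._send(self.to_oql(query))
        root = ET.fromstring(response)
        return [row.text for row in root.findall("row")]


sent = []


def fake_dqp(oql):
    """Stand-in for the distributed query processor; returns canned rows."""
    sent.append(oql)
    return "<result><row>7</row><row>9</row></result>"


wrapper = ToyDQPWrapper(fake_dqp)
rows = wrapper.evaluate({
    "form": "select_eq",
    "project": "id",
    "collection": "protein",
    "attribute": "accession_num",
    "value": "ENSP00000339074",
})
print(sent[0], rows)
```

Because callers see only `evaluate`, the rest of the system does not need to know that a distributed query processor is involved.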

Background Reading for Part II

• OGSA Version 1.0 document, January 2005
• “Service-Based Distributed Querying on the Grid” by Alpdemir et al., Proc. of the 1st International Conference on Service Oriented Computing, 2003, pp 467-482
• “The design and implementation of grid database services in OGSA-DAI” by Antonioletti et al., Concurrency - Practice and Experience, Vol 17, No 2-4, 2005, pp 357-376
