Transcript
Page 1: Data Management in a Grid Environment - theory and practical examples

N+N meeting Australia 2003, e-Science Centre, Kerstin Kleese van Dam

Data Management in a Grid Environment - theory and practical examples

Kerstin Kleese van Dam et al.

CCLRC e-Science Centre

[email protected]

http://www.e-science.clrc.ac.uk

Page 2: Data Management in a Grid Environment - theory and practical examples

Council for the Central Laboratory of the Research Councils

One of Europe’s largest Research Support Organisations, providing large scale experimental, data and computing facilities primarily to the UK research community both in academia and industry. Annually supporting around 12000 scientists from all major scientific domains. 1800 members of staff over three sites:

•Rutherford Appleton Laboratory in Oxfordshire

•Daresbury Laboratory in Cheshire

•Chilbolton Observatory in Hampshire

Large quantities of data associated with the various facilities. Houses 1 World Data Centre, 3 National Data Centres and a range of community based data services.

http://www.cclrc.ac.uk

Page 3: Data Management in a Grid Environment - theory and practical examples

CCLRC e-Science Centre

Early involvement in e-Science (from 1999 Data Grid / WOS onwards).

Centre established in 2000, since 2001 with direct governmental funding, additional funding through participation in other projects.

Currently housing UK Grid Support Centre (together with Manchester + Edinburgh) and BBSRC Grid Support Centre.

Involved in DataGrid, GridPP, AstroGrid and NERC DataGrid

Currently 40 permanent members of staff, 10 in the data management group.

http://www.escience.clrc.ac.uk

Page 4: Data Management in a Grid Environment - theory and practical examples

Data Management Group

Page 5: Data Management in a Grid Environment - theory and practical examples

Current e-Science Projects of the Data Management Group

Working on collaborations with partners inside CCLRC, the UK and internationally

CLRC DataPortal

Integration of ISIS and BADC operational Data Catalogues

Environment from the Molecular Level

NERC DataGrid

e-Science Technologies for the Simulation of Complex Materials

Extensions of the Storage Resource Broker (SRB) together with SDSC

Earth Science Portal Project

Database service for CCLRC and related e-Science projects

Page 6: Data Management in a Grid Environment - theory and practical examples

Data Management

Page 7: Data Management in a Grid Environment - theory and practical examples

Currently the scientist has to take care of his data, providing the binding link between different areas of work.

In the future we hope that e-Science technologies provide scientists with a more helpful environment …

Your personal e-Science Interface wherever you are.

Page 8: Data Management in a Grid Environment - theory and practical examples

Issues

Data capture from instruments and computers

Data Storage

Annotating data

Data Discovery

Association of data with appropriate applications

Conversion of data from one application to the other

Merging of data from different sources

Page 9: Data Management in a Grid Environment - theory and practical examples

Data capture from instruments and computers

In a Grid environment scientists will ultimately have little control over where an experiment or calculation is carried out, and therefore over where the resulting data will be.

Capture Data

Capture Information about the environment

Direct where output goes

Page 10: Data Management in a Grid Environment - theory and practical examples

Data Capture from Experimental Facilities (1)

Instruments produce varying amounts of data, ranging from small (e.g. temperature readings at a station) to large (e.g. LHC with several Tbytes per second).

Each instrument will produce data in its own format, often incompatible with anything else.

Most facilities provide their own short term storage, but will neither annotate nor manage the data.

The collection of environmental information is often limited; much of the information is still recorded in lab notebooks.

Correction values or error margins related to the instrument are not linked to the collected data.

Page 11: Data Management in a Grid Environment - theory and practical examples

Data Capture from Experimental Facilities (2) - Requirements

Generalised description of data format (possible standardisation for instruments of the same type).

Automatic capture of environment information, including input from instrument scientists if necessary.

Automatic linking of data about the environment and the raw data produced by the instrument.

Automatic insertion of both types of data into interim or final data repository.

Automatic linking of the donated data to existing related information e.g. proposal, other experiments of the same project.

Page 12: Data Management in a Grid Environment - theory and practical examples

Data Capture from Experimental Facilities (3) - Examples

ICAT - CLRC ISIS Catalogue http://www.isis.rl.ac.uk/dataanalysis

See also:

Comb-e-Chem - http://www.combechem.org

Collection of Raw data from the Instrument, Detector specific Information for this experiment etc.

Integrate Raw Data with original Proposal Information and Log files of the Instrument Scientists

Finally Integrated with other Facility Data within and outside CCLRC via Instances of the CCLRC DataPortal software.

Page 13: Data Management in a Grid Environment - theory and practical examples

Data Storage

The Grid environment provides access to a multitude of storage systems, often hiding the type of system behind service interfaces.

Where is the data?

How can I manage it?

On which media is my data (access time)?

How can it be accessed?

Where are replicas of my data?

Page 14: Data Management in a Grid Environment - theory and practical examples

Data Storage (2) - Requirements

Easy overview of where your data is on the Grid

Support to manage your data (transfers/replicas)

Access and access control to your data wherever it is

Support to share your data

Two possible solutions:

Globus Data Management tools - example ESG http://www.earthsystemsgrid.org

Storage Resource Broker (SRB) from the San Diego Supercomputer Center (SDSC)

http://www.npaci.edu/DICE/SRB

Page 15: Data Management in a Grid Environment - theory and practical examples

Typical Analysis Scenario and the use of Storage Resource Managers (SRM)

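[Figure: analysis scenario. Clients at the client's site send a logical query to a Request Interpreter, which consults the Metadata catalog, the Replica catalog and the Network Weather Service for request planning and turns logical files into site-specific files. A Request Executer then sends site-specific file requests (pinning and file transfer requests) over the network to Disk Resource Managers (DRM) with their disk caches and to a Hierarchical Resource Manager (HRM) with its disk cache in front of a tape system.]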

Metadata Catalogue for Data Discovery within one Virtual Organisation

Replica Catalogue keeps track of all replicas of specific datasets within one Virtual Organisation

The Network Weather Service helps to plan the fastest access routes to the data

Request goes out to Disk and Hierarchical Storage Resource Managers
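
The interplay of the replica catalogue and the Network Weather Service described above can be sketched in a few lines. This is a minimal illustration only: the catalogue contents, site names, bandwidth forecasts and the fixed tape staging cost are all hypothetical, and none of the names come from the SRM or Network Weather Service software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str        # hypothetical site name
    size_mb: float   # file size in megabytes
    on_tape: bool    # True if the copy sits behind a hierarchical (tape) manager


# Hypothetical replica catalogue: logical file name -> known replicas.
REPLICA_CATALOGUE = {
    "lfn:climate/run42.nc": [
        Replica("site-a", 800.0, on_tape=True),
        Replica("site-b", 800.0, on_tape=False),
    ],
}

# Hypothetical bandwidth forecasts in MB/s, standing in for a network weather service.
BANDWIDTH_FORECAST = {"site-a": 40.0, "site-b": 10.0}

# Assumed fixed cost of staging a file from tape to disk cache.
TAPE_STAGING_SECONDS = 120.0


def estimated_seconds(replica):
    """Estimate the time to fetch one replica: transfer time plus tape staging."""
    seconds = replica.size_mb / BANDWIDTH_FORECAST[replica.site]
    if replica.on_tape:
        seconds += TAPE_STAGING_SECONDS
    return seconds


def choose_replica(logical_name):
    """Resolve a logical file name to the replica with the lowest estimated cost."""
    return min(REPLICA_CATALOGUE[logical_name], key=estimated_seconds)
```

With these made-up numbers the disk copy at `site-b` is preferred even though `site-a` has the faster link, because the tape staging cost dominates.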

Page 16: Data Management in a Grid Environment - theory and practical examples

Storage Resource Broker (1)

Professional data storage management system, initially developed in the mid-1990s by the San Diego Supercomputer Center (SDSC). http://www.npaci.edu/DICE/SRB/. The current version supports many platforms and authentication methods, and offers Web services interfaces.

Page 17: Data Management in a Grid Environment - theory and practical examples

Storage Resource Broker

Integrated access to data on PC, UNIX, LINUX, DB and Tape Store http://www.npaci.edu/dice/srb/mySRB/mySRB.html

also used in the BIRN project http://www.nbirn.net/

SRB External Interface Modules: MySRB (web based), command line interface, C and Fortran APIs, with password and certificate authorisation

Device Interface Modules for a wide range of platforms, easy to extend to new systems

MCAT provides the links from logical to physical data location and tracks replicas and versions. MCAT can be run on a variety of relational databases.

Page 18: Data Management in a Grid Environment - theory and practical examples

Replica or Original Data

Version of Data

Type of Data

Physical Data Location and Type of Resource

Functions including ingestion, movement and replication of data. Providing access to data for others

Page 19: Data Management in a Grid Environment - theory and practical examples

Page 20: Data Management in a Grid Environment - theory and practical examples

Page 21: Data Management in a Grid Environment - theory and practical examples

Biomedical Informatics Research Network

Page 22: Data Management in a Grid Environment - theory and practical examples

Annotating Data

Data without further information is of only short-lived and very limited use.

Information about the data itself

Information about the where, why, who and when

Information about the environment in which the data was captured

Related Information

Example: CLRC Scientific Metadata Schema http://www.e-science.clrc.ac.uk/Activity/ACTIVITY=DataPortal;SECTION=5;

Page 23: Data Management in a Grid Environment - theory and practical examples

Diversity: Users & Searches

[Figure: spectrum of searches from Discovery to Excavation, with the user groups shown along it: general community, wider science community, specialist user, experimenter, data curator.]

Page 24: Data Management in a Grid Environment - theory and practical examples

General Scientific Metadata

[Figure: a common Science Metadata Model with domain specialisations: ISIS, SRS, HEP, Space Science, Social Science, Earth Science.]

A generic metadata model for all scientific applications with Specialisation for each domain

Can answer questions across domains

Can answer questions about specific domains

Page 25: Data Management in a Grid Environment - theory and practical examples

CLRC DataPortal - Scientific Metadata Model

Metadata Object

Topic: Keywords providing an index on what the study is about.

Study Description: Provenance about what the study is, who did it and when.

Access Conditions: Conditions of use providing information on who can access the data and how.

Data Description: Detailed description of the organisation of the data into datasets and files.

Data Location: Locations providing navigation to where the data on the study can be found.

Related Material: References into the literature and community providing context about the study.
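
The six parts of the metadata object can be pictured as a simple record type. The sketch below is an illustration only, not the CLRC Scientific Metadata Schema itself: the field names, the nested `StudyDescription` type and the `matches_keyword` helper are assumptions chosen to mirror the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StudyDescription:
    """Provenance: what the study is, who did it and when (field names assumed)."""
    title: str
    investigators: List[str]
    date: str  # assumed to be an ISO 8601 date string


@dataclass
class MetadataObject:
    """One study record with the six parts named on the slide."""
    topic: List[str]                 # keywords indexing what the study is about
    study: StudyDescription          # provenance
    access_conditions: str           # who may access the data and how
    data_description: List[str]      # organisation into datasets and files
    data_location: List[str]         # where the data can be found
    related_material: List[str] = field(default_factory=list)  # literature, context

    def matches_keyword(self, keyword: str) -> bool:
        """Case-insensitive keyword match against the topic index."""
        return keyword.lower() in (k.lower() for k in self.topic)


# Hypothetical example record; all values are invented for illustration.
record = MetadataObject(
    topic=["zeolite", "neutron scattering"],
    study=StudyDescription("Example study", ["A. Scientist"], "2003-01-15"),
    access_conditions="registered users only",
    data_description=["dataset-1/run001.raw"],
    data_location=["srb://example-zone/dataset-1"],
)
```

A discovery search over many such records then reduces to filtering on `matches_keyword`, while access and retrieval use the remaining fields.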

Page 26: Data Management in a Grid Environment - theory and practical examples

Data Discovery

Most data is currently ‘discovered’ by word of mouth from friends and colleagues or sheer luck.

Discovery

Browsing

Selection

Comparison

Access

Example: CLRC DataPortal http://esc.dl.ac.uk:9000/index.html

Page 27: Data Management in a Grid Environment - theory and practical examples

Different Levels of Metadata supporting Discovery and Selection

[Figure: four linked XML metadata documents A, B, C and D with their relationships, plus a schema Q.]

A: Usage metadata generated from (or about) the data. It could be aggregated metadata, e.g. CDML from cdscan.

B: Complete metadata from A + user provided info to conform with (at least) GEO profile. Application + template needed.

C: Metadata generated to describe both documentation and annotations (as opposed to binary data).

D: Discovery metadata suitable for harvesting to a portal. Probably based on Dublin Core & GEO. Subset of B and C.

Q: Schema which defines supported queries upon A, B, C, D.

Definitions

A-Metadata: can be derived from the data itself

D-Metadata: User provided information on what, who, what and when

C-Metadata: All related metadata, papers, pictures, related studies

B-Metadata: A summary of all other types of metadata

Page 28: Data Management in a Grid Environment - theory and practical examples

CLRC DataPortal

The DataPortal currently allows access to selected metadata and data from four facilities. The first three housed by CLRC:

The Synchrotron Radiation Department (SRD)

The Neutron Spallation Source (ISIS)

The British Atmospheric Data Centre (BADC)

Max-Planck Institute for Meteorology (MPIM)

You will be able to access the available data via the basic search.

If you are not one of our partners but would like to try the system, you can log in with one of our test accounts, using 'dpuser' as both username and password.

http://esc.dl.ac.uk:9000/index.html

Page 29: Data Management in a Grid Environment - theory and practical examples

DataPortal Architecture

The major functions of the DataPortal (DP) are grouped into modules. Each module has a grid services interface to communicate with the other DP services and, in some cases, with outside services such as the Visualisation or HPC Portal. SOAP is used for communication and WSDL to describe the various services. We do not change any local metadata system, but use our own wrappers to translate our general query format into the local syntax. Replies from the resources are XML files compliant with the CLRC Scientific Metadata Format:

(http://www-dienst.rl.ac.uk/library/2002/tr/dltr-2002001.pdf)
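
The wrapper idea can be sketched as follows: a general query is mapped onto one facility's local field names, and local result rows are mapped back into a common XML reply. This is a minimal sketch under stated assumptions. The class `FacilityWrapper`, the table and column names, and the XML element names `MetadataReply` and `Study` are all hypothetical; they are not taken from the DataPortal code or from the CLRC Scientific Metadata Format.

```python
import xml.etree.ElementTree as ET


class FacilityWrapper:
    """Translates between a general query format and one facility's local syntax."""

    def __init__(self, facility, table, field_map):
        self.facility = facility
        self.table = table
        self.field_map = field_map  # general field name -> local column name
        self.reverse_map = {local: general for general, local in field_map.items()}

    def to_local_query(self, general_query):
        """Build a parameterised SQL string in the facility's own vocabulary."""
        clauses, params = [], []
        for name in sorted(general_query):
            clauses.append(f"{self.field_map[name]} = ?")
            params.append(general_query[name])
        sql = f"SELECT * FROM {self.table} WHERE " + " AND ".join(clauses)
        return sql, params

    def to_reply_xml(self, rows):
        """Wrap local result rows in a common XML reply using general field names."""
        root = ET.Element("MetadataReply", {"facility": self.facility})
        for row in rows:
            study = ET.SubElement(root, "Study")
            for local_name, value in row.items():
                ET.SubElement(study, self.reverse_map[local_name]).text = str(value)
        return ET.tostring(root, encoding="unicode")


# Hypothetical facility with its own column names.
wrapper = FacilityWrapper(
    "facility-1", "experiments", {"topic": "subject", "investigator": "pi_name"}
)
sql, params = wrapper.to_local_query({"topic": "zeolite", "investigator": "Smith"})
reply = wrapper.to_reply_xml([{"subject": "zeolite", "pi_name": "Smith"}])
```

Because only the wrapper knows the local names, the facility's own metadata system stays unchanged, which is the design choice the slide describes.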

The UK e-Science Grid CA provides Globus X.509 certificates for the UK e-Science community. The CA is located at RAL and is run as part of the Grid Support Centre funded by the Research Councils' Core e-Science programme.

(http://www.grid-support.ac.uk/)

The implementation of the core modules as grid services allows the DataPortal to be a truly distributed application and allows several instances of the DataPortal to be logically combined, thus extending any user query.

Page 30: Data Management in a Grid Environment - theory and practical examples

General CLRC DataPortal Architecture

[Figure: a CLRC DataPortal Server, linked to other instances of the CLRC DataPortal Server. Each server connects to Facility 1 ... Facility N, and each facility exposes its local data and local metadata through an XML wrapper.]

Page 31: Data Management in a Grid Environment - theory and practical examples

DataPortal Architecture (2)

As well as interacting with the DataPortal via the Web Interface users can also run queries by directly calling the Query & Reply service assuming that they are properly authenticated. Other services are also externally visible, for example the Shopping Cart.

The Shopping Cart allows registered users to permanently store and annotate pointers to the external data files and data sets.

[Figure: DataPortal modules. DataPortal Web Interface; Authentication & Authorisation; Session Management; Query & Reply; Shopping Cart; Data Transfer; Service Look Up; Facility Administration; DataPortal Permanent Repository. External components: Certification Authority; Facilities Access Control; Facilities XML Wrappers; External Data File Store(s).]

Facility Administration allows external facilities to advertise their grid services to the DataPortal.

Accessing DataPortal either via Web Interface or Web Services Interfaces e.g. Query and Reply

Authenticate and Authorise user by checking certificate validity and check with associated facilities for general access rights

Query Generation, Selection of Suitable Facilities to Query. Farm out query to selected Facilities in parallel and collect and collate results

Put interesting Data in your personal, permanent Shopping Cart, which you can share with others as required.

Use the Data Transfer Service to send your data on to a chosen application or service
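
The Query & Reply step described above (select suitable facilities, farm the query out in parallel, then collect and collate the results) can be sketched as follows. This is an illustration under assumptions: the facility names, disciplines and study titles are invented, and `query_facility` stands in for a call to a facility's XML wrapper rather than any real DataPortal service.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical facilities with the disciplines they cover and the studies they hold.
FACILITIES = {
    "facility-1": {
        "disciplines": {"chemistry", "physics"},
        "studies": ["zeolite structure", "polymer dynamics"],
    },
    "facility-2": {
        "disciplines": {"atmospheric science"},
        "studies": ["ozone profile"],
    },
    "facility-3": {
        "disciplines": {"chemistry"},
        "studies": ["zeolite acidity"],
    },
}


def select_facilities(discipline):
    """Pick the facilities worth querying for a given discipline."""
    return sorted(
        name for name, info in FACILITIES.items() if discipline in info["disciplines"]
    )


def query_facility(name, keyword):
    """Stand-in for a remote call to one facility's wrapper."""
    return [(name, study) for study in FACILITIES[name]["studies"] if keyword in study]


def run_query(discipline, keyword):
    """Farm the query out to the selected facilities in parallel and collate replies."""
    selected = select_facilities(discipline)
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        partial_results = pool.map(lambda name: query_facility(name, keyword), selected)
        collated = [hit for hits in partial_results for hit in hits]
    return sorted(collated)
```

A thread pool is used here because real facility calls would spend most of their time waiting on the network; any other concurrency mechanism would serve equally well.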

Page 32: Data Management in a Grid Environment - theory and practical examples

Choose Facilities of Interest

Select Discipline and reduce Search Field

Page 33: Data Management in a Grid Environment - theory and practical examples

Page 34: Data Management in a Grid Environment - theory and practical examples

Page 35: Data Management in a Grid Environment - theory and practical examples

Page 36: Data Management in a Grid Environment - theory and practical examples

Annotate your Search Results

Forgotten where your data came from?

Specific Services associated with this data

Page 37: Data Management in a Grid Environment - theory and practical examples

Association of data with appropriate applications

Scientists need to be able to link to all their favourite applications for analysis, simulation and visualisation, but they also need to be informed about other suitable programs.

Suitable applications

Correct Format

Suitable for your environment

Availability

Page 38: Data Management in a Grid Environment - theory and practical examples

HPC Grid Services Portal

This is a pilot project funded by the CLRC e-Science Centre to develop a Web portal to search for resources and submit HPC applications to a computational Grid in the UK. It will form the basis of application portals for the UK e-Science Grid and "thematic Grids" for e.g. NERC DataGrid and HPCI Consortia.

This project is a collaboration with the San Diego Supercomputer Center, who developed the GridPort portal and HotPage software for the NPACI HPC Grid, and with the University of Lecce, Italy, who developed the Grid Resource Broker.

http://esc.dl.ac.uk/HPCPortal/

Page 39: Data Management in a Grid Environment - theory and practical examples

HPC Grid Services Portal

Provides a portal for HPC resources which can be customised for domain-specific applications.

Original collaboration with San Diego Supercomputer Center, now University of Texas (Mary Thomas).

Similar functionality to HotPage and GridPort (SDSC):

Single sign-on using a digital certificate (GSI)

Resource monitoring and Discovery (Globus)

Application Discovery (search engine)

Personal "desktop" workspace

File transfer (Globus) and Job Submission (Globus)

Page 40: Data Management in a Grid Environment - theory and practical examples

InfoPortal

HPCPortal

DataPortal

Searching for Applications on the UK Level 2 Grid

Page 41: Data Management in a Grid Environment - theory and practical examples

Choose Application: DLPOLY

Resulting Findings for DLPOLY

Page 42: Data Management in a Grid Environment - theory and practical examples

Summary Description

Web Service Address for DLPOLY code

Information about the systems on which the code is installed and available for use

Link to job submission

Page 43: Data Management in a Grid Environment - theory and practical examples

Page 44: Data Management in a Grid Environment - theory and practical examples

All machines on the UK level 2 Grid and their availability

Page 45: Data Management in a Grid Environment - theory and practical examples

Conversion of data from one application to the other

The scientists will need to be able to pass data from one application to the next seamlessly and with minimum interference on their part.

Determining Data Formats

Data Schema

Interchange/Conversion

Example: e-Materials Project

Page 46: Data Management in a Grid Environment - theory and practical examples

The CLRC DataPortal Related Projects

E-SCIENCE TECHNOLOGIES IN THE SIMULATION OF COMPLEX MATERIALS

A combination of novel computational and computer science methodologies and teams will be used to develop GRID e-Science technologies to deliver new simulation solutions to problems and fields relating to combinatorial materials science and polymorph prediction. The project will exploit the latest developments in scientific simulation methodologies (both electronic structure and force field based) and hardware ranging from desktop to HPC. It will establish a field-tested, integrated data and computing e-Science infrastructure customised for these key areas of current materials science. This infrastructure will, among other things, enable the automatic submission of simulations, triggered by the identification of knowledge gaps in the database in response to user queries. Furthermore, the automatic integration of experimental and computational results for screening applications will be supported.

Page 47: Data Management in a Grid Environment - theory and practical examples

The Science: Filtering

Purely SiO4 zeolite

Metal substitution with addition of proton

Calculation of Vibrational Freqs

Add probe

Increase quality of calculation for best candidates

Information of Interest: Structure, Total energy, Binding Energy, HOMO/LUMO, Population Analysis, Vibrational Freqs

Two point displacement method used to build up dynamical matrix. Single point energy calculation at each displacement +ve and -ve in x, y, and z.
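
The two point displacement idea can be illustrated with a central finite difference. The sketch below is an assumption-laden toy, not the project's code: it assumes each single point calculation returns gradients, uses a two-coordinate model energy in place of a quantum chemistry calculation, and omits mass weighting, so strictly it builds the force-constant matrix from which a dynamical matrix would be derived.

```python
# Toy model energy E = 0.5*K1*x^2 + 0.5*K2*y^2 + C*x*y (constants are arbitrary).
K1, K2, C = 2.0, 3.0, 0.5


def gradient(coords):
    """Analytic gradient of the toy energy, standing in for a single point calculation."""
    x, y = coords
    return [K1 * x + C * y, K2 * y + C * x]


def force_constant_matrix(grad, coords, step=0.01):
    """Two point (central difference) estimate of the second-derivative matrix.

    Each coordinate is displaced by +step and -step; the gradient difference
    divided by 2*step gives one column of the matrix.
    """
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus = list(coords)
        minus = list(coords)
        plus[j] += step
        minus[j] -= step
        g_plus = grad(plus)
        g_minus = grad(minus)
        for i in range(n):
            matrix[i][j] = (g_plus[i] - g_minus[i]) / (2.0 * step)
    # Symmetrise to remove small numerical asymmetries.
    return [[0.5 * (matrix[i][j] + matrix[j][i]) for j in range(n)] for i in range(n)]


H = force_constant_matrix(gradient, [0.3, -0.2])
```

For N atoms this needs 6N displaced calculations (+ve and -ve in x, y and z for every atom), which is why the calculations are natural candidates for farming out to a Grid.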

Page 48: Data Management in a Grid Environment - theory and practical examples

The Computation

[Figure: ChemShell optimiser loop calling GAMESS-UK and GULP. Tests shown: RMS = x for the shell relaxation; maxg and maxs < 0.01 for the geometry optimisation.]

1. Micro iterations to relax shells with respect to forces from the QM region. RMS criterion (x) tested for further movement of shells.

2. Energy and gradients passed from GAMESS-UK to GULP, and then final forces passed back to ChemShell (newopt module), which performs geometry optimisation.

3. Optimisation is considered complete when both max gradient and max step are below set criteria.
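
The stopping rule in step 3 can be shown with a toy optimiser. This is a plain steepest descent on an invented two-variable energy, used only to illustrate the "max gradient and max step below threshold" test; it is not the ChemShell newopt algorithm, and the micro iterations for shell relaxation are not modelled. The step length `rate` and the model gradient are assumptions.

```python
def toy_gradient(coords):
    """Gradient of the invented energy (x - 1)^2 + 2*(y + 2)^2, minimum at (1, -2)."""
    x, y = coords
    return [2.0 * (x - 1.0), 4.0 * (y + 2.0)]


def optimise(grad, coords, rate=0.2, tol=0.01, max_cycles=1000):
    """Steepest descent that stops when max gradient and max step are both below tol."""
    coords = list(coords)
    for cycle in range(1, max_cycles + 1):
        g = grad(coords)
        step = [-rate * gi for gi in g]
        coords = [c + s for c, s in zip(coords, step)]
        max_gradient = max(abs(gi) for gi in grad(coords))
        max_step = max(abs(s) for s in step)
        if max_gradient < tol and max_step < tol:
            return coords, cycle
    raise RuntimeError("optimisation did not converge")


final_coords, cycles = optimise(toy_gradient, [0.0, 0.0])
```

Testing both quantities guards against stopping on a flat region where steps are small but the gradient is not, or the reverse.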

Page 49: Data Management in a Grid Environment - theory and practical examples

CML – Chemical Markup Language

CML is a new approach to managing molecular information. It has a large scope as it covers disciplines from macromolecular sequences to inorganic molecules and quantum chemistry. CML is new in bringing the power of XML to the management of chemical information. CML and its associated tools allow current files to be converted without semantic loss into structured documents, including chemical publications, and provide for the precise location of information within files.

Developed by Peter Murray-Rust and Henry S. Rzepa.

http://www.xml-cml.org

As an addition they are also looking at:

CCML – a Computational Chemical Markup Language

Page 50: Data Management in a Grid Environment - theory and practical examples

<document>
<!-- CML document - caffeine - karne - 7/8/00 -->
<!-- file converted from: MDL .mol -->
<cml title="caffeine" id="cml_caffeine_karne" xmlns="x-schema:cml_schema_ie_02.xml">
  <molecule title="caffeine" id="mol_caffeine_karne" convention="mol">
    <formula>C8 H10 N4 O2</formula>
    <string title="CAS">58-08-2</string>
    <string title="ACX">I1001269</string>
    <string title="DOT">UN 1544</string>
    <string title="RTECS">EV6475000</string>
    <float title="molecule weight">194.19</float>
    <float title="melting point" units="degC">238</float>
    <float title="specific gravity">1.23</float>
    <string title="water solubility" units="g/100 mL" convention="g per 100 mL at 23 degC">1-5</string>
    <string title="comments">White powder or white glistening needles usually melted together. LIGHT SENSITIVE</string>
    <list title="alternate names">
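
To show why such markup helps with "the precise location of information within files", here is a small sketch that reads properties out of a CML-style fragment with Python's standard library. The snippet is a shortened, well-formed excerpt of the caffeine example with the namespace declaration left out for simplicity; that omission and the `read_molecule` helper are assumptions made for illustration, not part of the CML tools.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt of the caffeine example (namespace omitted).
CML_SNIPPET = """
<molecule title="caffeine" id="mol_caffeine_karne" convention="mol">
  <formula>C8 H10 N4 O2</formula>
  <string title="CAS">58-08-2</string>
  <float title="molecule weight">194.19</float>
  <float title="melting point" units="degC">238</float>
</molecule>
"""


def read_molecule(xml_text):
    """Collect the formula and the titled string/float properties of one molecule."""
    molecule = ET.fromstring(xml_text.strip())
    properties = {
        "title": molecule.get("title"),
        "formula": molecule.findtext("formula"),
    }
    for element in molecule.findall("float"):
        properties[element.get("title")] = float(element.text)
    for element in molecule.findall("string"):
        properties[element.get("title")] = element.text
    return properties


caffeine = read_molecule(CML_SNIPPET)
```

Because every value is labelled by element type and title, a converter between applications can pick out exactly the fields it needs instead of parsing fixed column positions.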

Page 51: Data Management in a Grid Environment - theory and practical examples

The CLRC DataPortal Related Projects

ENVIRONMENT FROM THE MOLECULAR LEVEL: AN E-SCIENCE PROPOSAL FOR MODELLING THE ATOMISTIC PROCESSES INVOLVED IN ENVIRONMENTAL ISSUES

Many environmental problems, such as transport of pollutants, development of remediation strategies, weathering, and containment of high-level radioactive waste, require an understanding of fundamental mechanisms and processes at a molecular level. Computer simulations at a molecular level can give considerable progress in our understanding of these processes. Developments in atomistic simulation tools must now be linked with GRID technologies in order to facilitate simulation studies that can be performed with realistic conditions, and which can scan across a wide range of physical and chemical parameters. This proposal brings together simulation scientists, applications developers and computer scientists to develop UK e-science/GRID capabilities for molecular simulations of environmental issues. A common set of simulation tools will be developed for a wide range of applications, and the GRID environment will be established which will result in a giant leap in the capabilities of these powerful scientific tools. See http://eminerals.org/

Page 52: Data Management in a Grid Environment - theory and practical examples

The CLRC DataPortal Related Projects

THE NERC DATAGRID

Data discovery and delivery are inherent components of many aspects of science. They can be considered part of a processing chain that starts with raw data from a variety of sources, and ends with the graphical production of information that is directly used in scientific research. This proposal is to build a grid which makes data discovery, delivery and use much easier than it is now, facilitating better use of the existing investment in the curation and maintenance of quality data archives. Further we intend to make the connection between data held in managed archives and data held by individual research groups seamless in such a way that the same tools can be used to compare and manipulate data from both sources. What will be completely new will be the ability to compare and contrast data from an extensive range of (US, European, UK, NERC) datasets from within one specific context. The presence of the NERC DataGrid will allow grid based visualisation services to access a wide variety of data held at the British Atmospheric and Oceanographic Data Centres (BADC and BODC respectively) as well as on individual storage systems belonging to groups which register their data with the NERC DataGrid. The structures put in place will also allow NERC data to become part of the putative future semantic grid. See http://ndg.badc.rl.ac.uk/

Page 53: Data Management in a Grid Environment - theory and practical examples

CLRC DataPortal Related Projects

EARTH SCIENCE PORTAL

The Earth Science Portal (ESP) is a collaboration designed to build the infrastructure needed to create web portals to provide access to observed and simulated data within the climate and weather communities. The infrastructure created within ESP will provide a flexible framework that will allow interoperability between the front-end and back-end software components.

The initial ESP community workshop was held on Thursday, January 23rd and Friday, January 24th, 2003 at the National Center for Atmospheric Research, Boulder, Colorado. Based on the discussions at the workshop we created a draft document that describes the software framework within ESP. The development activities in ESP are intended to support this framework. The document will be updated based on these activities and on comments and suggestions from the community.

Partners are: BADC, CCLRC, CDC and GFDL NOAA, NASA, LLNL, NCAR and PMEL

http://nomads.gfdl.noaa.gov/~ck/esp/webpages

Page 54: Data Management in a Grid Environment - theory and practical examples

The CLRC DataPortal Related Projects

EUROPEAN SPATIO-TEMPORAL DATA INFRASTRUCTURE FOR HIGH-PERFORMANCE COMPUTING

ESTEDI, an initiative of European software vendors and supercomputing centres, will establish a European standard for the storage and retrieval of multidimensional high-performance computing (HPC) data. It addresses a main technical obstacle, the delivery bottleneck of large HPC results to the users, by augmenting high-volume data generators with a flexible data management and extraction tool for spatio-temporal raster data. To this end, the multidimensional database system RasDaMan will be enhanced with intelligent mass storage handling and optimised towards HPC. See http://www.estedi.org/

Page 55: Data Management in a Grid Environment - theory and practical examples

The CLRC DataPortal Related Projects

MSC PROJECT ON AUTOMATED DATA MANAGEMENT FOR CLIMATE SIMULATIONS

These days data is no longer only produced by experiments, measurements and observations. Many of the more complex phenomena are studied in computer simulations. These simulations can produce large quantities of data. However, in contrast to much experimental or observational data, these results are often not accessible to the wider research communities. Simulation data could be more widely exploited if better information was available concerning the simulation itself. This project aims to investigate the possibility of automatically capturing as much metadata concerning the simulation as possible and storing it in a suitable database. The database will be accessible via the CLRC DataPortal. It is expected that, besides investigating the issue in general, the students will provide a prototype installation.

Page 56: Data Management in a Grid Environment - theory and practical examples

The CLRC DataPortal Related Projects

CLRC e-Science Database Service

Looking for the most flexible operating system, in terms of both available software and price/performance, ultimately led to the choice of a Linux based system (enterprise editions). For running the widest choice of databases, the Red Hat Advanced Server and SuSE Linux Enterprise Server are available. Oracle has been selected for the initial database service as it offers a clustering technology. Oracle Real Application Clusters are the multi-node extension to the Oracle database server. A cluster is a group of independent servers (nodes) that cooperate as a single system. The primary cluster components are processor nodes, a cluster interconnect, and a shared storage subsystem. The Oracle cluster database combines the memory in the individual nodes to provide a single view of the distributed cache memory for the entire database system. Oracle are the only vendor to offer this capability.

PostgreSQL

We chose IBM x440 series nodes as the building blocks for the data clusters. The IBM Enterprise X-Architecture consists of Intel processor-based servers with features such as support for up to 16-way SMP and remote I/O. The clusters connect to 1TB RAID 5 storage arrays via fibre channel switches.

Page 57: Data Management in a Grid Environment - theory and practical examples

For Information see:

Integrated e-Science Environment Portal

http://esc.dl.ac.uk/IeSE/

HPC Grid Services Portal

http://esc.dl.ac.uk/HPCPortal/

DataPortal

http://esc.dl.ac.uk:9000/index.html

CLRC e-Science Centre

http://www.e-science.clrc.ac.uk