Workflows over Grid-based Web services General framework and a practical case in structural biology gLite 3.0 Data Management David García Aristegui Grid Applications Developer CNB/CSIC

Workflows over Grid-based Web services

General framework and a practical case in structural biology

gLite 3.0 Data Management

David García Aristegui

Grid Applications Developer

CNB/CSIC

Before all of you go crazy...

• Grid Acronym Soup (GAS)
  – http://www.gridpp.ac.uk/gas/

• EGEE Glossary
  – http://public.eu-egee.org/faq/acronyms.html

• EGEE II Glossary
  – http://egee-technical.web.cern.ch/egee-technical/documents/glossary.htm

Introduction

• EGEE middleware: called gLite, it builds on experience and existing components from Condor, Globus, EDG, LCG and others. gLite is a distribution that combines components from many different providers!

• gLite 3.0: convergence of LCG 2.7.0 and gLite 1.5.0 in spring 2006. Continuity on the production infrastructure ensured usability by applications.

• Data Management System (DMS): provides file manipulation for users and other Grid services. DMS enables the location, access and transfer of data.

Data Management

• What does “Data Management” mean?
  – Users and applications produce and require data
  – Data may be stored in Grid files
  – Granularity is at the “file” level (no data “structures”)
  – Users and applications need to handle files on the Grid

• Files are stored in appropriate permanent resources called “Storage Elements” (SE)
  – Present at almost every site together with computing resources
  – Described in detail in the next presentations
  – We will treat a storage element as a “black box” where we can store data
    • Appropriate data management utilities/services hide the internal structure of the SE
    • Appropriate data management utilities/services hide the details of transfer protocols

Data Management System - DMS

• EGEE DMS FUNCTIONALITY:
  – User does not need to know data location, just the logical name
  – Data is accessed through standard interfaces (POSIX)
  – Data can be replicated or transferred to several locations as needed
  – Data is shared within a VO

• KNOWN EGEE DMS LIMITATIONS:
  – Files cannot be changed unless removed or replaced
  – No intention of providing a global file management system
  – File replication does not always improve performance
  – Local file system interfaces (ELFI) are still in beta stage

Data Issues and Grid Solutions - I

• Resource centers need to meet growing demand for storage
  – “Classic” Storage Elements
  – Storage Elements capable of managing multiple disk pools
    • Disk Pool Manager (DPM, disk)
  – Massive storage systems
    • dCache (disk/tape), CASTOR (tape)

• Data is stored on different storage system technologies
  – Common interface required to hide underlying complexity
    • Storage Resource Manager (SRM): storage management protocol

Data Issues and Grid Solutions - II

• Data is stored at different locations
  – File catalogue to provide a uniform view of Grid data
    • LCG File Catalog (LFC)

• Applications need to access data management services
  – Data management API
    • Grid File Access Layer (GFAL)

• Biomedical applications need data security
  – Encrypted Data Storage (EDS) and access control lists (ACLs)
    • Hydra

Concepts

• What concepts/services do we need to know to understand EGEE Data Management?
  – Storage Resource Manager – SRM
  – Storage Element – SE
  – File Transfer Service – FTS
  – File Catalogs

• What APIs and application-level services are available for developers? (not for this course)
  – Data Management APIs (GFAL)
  – Encryption (EDS, Hydra)
  – Metadata Catalog (AMGA)

Storage Resource Manager

• Storage Elements (SE) can use a wide variety of technologies

• Grid jobs need to see these SEs with a uniform interface
  – SRM is a protocol to manage storage resources (“classic” storage elements, DPM, dCache, Castor...)
  – It is NOT a file access protocol

• Files are accessed using different file access protocols
  – gridFTP (GSI + FTP) for file transfers
  – rfio, dcap, GFAL... for file access by applications

Storage Element

• Storage Element
  – Provides storage space for Grid files
  – SRM interface (not in the classic storage element)
  – Transfer protocol (gsiFTP) ~ GSI-based FTP server
  – We have several implementations
    • disk: classic (GridFTP server), Disk Pool Manager (DPM), dCache
    • tape: Castor, dCache
  – Security: ACLs now available in DPM; Castor and dCache to follow
  – POSIX file access: Grid File Access Layer (GFAL) library
    • Exposes only LCG-needed features
    • Single client implementation for all service implementations
    • Uses the Grid Information System to discover services

File Transfer Service

• FTS is a low-level data movement service

• Why is it needed?
  – Improves reliability of transfers
  – Provides asynchronous file transfer
    • Schedules transfers when resources are available
  – Provides control of transfer properties (channel concept)

• No catalogue interactions yet: users have to handle SURLs themselves
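
The ideas on this slide (submit now, transfer later, limit concurrency per channel, retry on failure) can be pictured with a toy model. This is only a sketch under stated assumptions, not the FTS API: `ToyChannel`, `Transfer`, `run_round`, the state names and the `example.org` hostnames are all hypothetical, and the "copy" is a stand-in function rather than a real gridFTP transfer.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass
class Transfer:
    job_id: int
    source_surl: str
    dest_surl: str
    state: str = "Submitted"  # hypothetical state names
    attempts: int = 0


class ToyChannel:
    """Toy model of one transfer channel with a cap on transfers per round."""

    def __init__(self, max_active, max_attempts=3):
        self.max_active = max_active
        self.max_attempts = max_attempts
        self.queue = deque()
        self.jobs = {}
        self._ids = count(1)

    def submit(self, source_surl, dest_surl):
        """Queue a transfer and return at once with a job id (asynchronous)."""
        job = Transfer(next(self._ids), source_surl, dest_surl)
        self.jobs[job.job_id] = job
        self.queue.append(job)
        return job.job_id

    def run_round(self, copy):
        """Run up to max_active queued transfers; requeue failures for a retry."""
        size = min(self.max_active, len(self.queue))
        batch = [self.queue.popleft() for _ in range(size)]
        for job in batch:
            job.attempts += 1
            if copy(job.source_surl, job.dest_surl):
                job.state = "Done"
            elif job.attempts < self.max_attempts:
                job.state = "Submitted"
                self.queue.append(job)
            else:
                job.state = "Failed"


failed_once = set()


def flaky_copy(source, dest):
    """Stand-in for a real transfer: each source fails once, then succeeds."""
    if source not in failed_once:
        failed_once.add(source)
        return False
    return True


channel = ToyChannel(max_active=2)
ids = [
    channel.submit(f"srm://se1.example.org/data/f{i}",
                   f"srm://se2.example.org/data/f{i}")
    for i in range(3)
]
while channel.queue:
    channel.run_round(flaky_copy)
print({job_id: channel.jobs[job_id].state for job_id in ids})
```

Note that the caller passes SURLs directly, which mirrors the last bullet: nothing here consults a catalogue.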

File Catalog: LFC

• LFC Catalog => LFC - LCG File Catalogue
  – LCG = LHC Computing Grid
  – LHC = Large Hadron Collider (CERN)

• Provides
  – Mapping between LFN, GUID and SURL
  – Transactions, sessions, bulk queries
  – Hierarchical namespace, symbolic links
  – System metadata
  – Single-string user metadata

• All members of a given VO have read-write permissions in their directory

• Commands look like UNIX commands with “lfc-” in front (often)
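
The mapping the catalogue maintains can be pictured with a small in-memory model. This is a toy sketch, not the LFC API: the class `ToyCatalog` and its method names are hypothetical, and the second storage host (`se2.example.org`) is made up for illustration. It only shows the relationship: several LFNs may point to one GUID, and one GUID may have several SURLs (replicas).

```python
import uuid


class ToyCatalog:
    """Toy in-memory model of the LFN -> GUID -> SURL mapping (not the LFC API)."""

    def __init__(self):
        self.lfn_to_guid = {}    # alias chosen by the user -> GUID
        self.guid_to_surls = {}  # GUID -> list of replica locations

    def register(self, lfn, surl):
        """Register a new file: create a GUID, attach one LFN and one replica."""
        if lfn in self.lfn_to_guid:
            raise ValueError(f"LFN already registered: {lfn}")
        guid = f"guid:{uuid.uuid4()}"
        self.lfn_to_guid[lfn] = guid
        self.guid_to_surls[guid] = [surl]
        return guid

    def add_replica(self, lfn, surl):
        """Record another physical copy of an existing file."""
        self.guid_to_surls[self.lfn_to_guid[lfn]].append(surl)

    def add_alias(self, lfn, new_lfn):
        """Give an existing file a second logical name."""
        self.lfn_to_guid[new_lfn] = self.lfn_to_guid[lfn]

    def replicas(self, lfn):
        """Return every SURL known for the file behind this LFN."""
        return list(self.guid_to_surls[self.lfn_to_guid[lfn]])


catalog = ToyCatalog()
lfn = "lfn:/grid/biomed/David20030203/run2/track1"
catalog.register(lfn, "sfn://lxshare0209.cern.ch/data/biomed/ntuples.dat")
catalog.add_replica(lfn, "sfn://se2.example.org/data/biomed/ntuples.dat")
print(catalog.replicas(lfn))
```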

File and replica naming conventions

• Globally Unique Identifier (GUID)
  – A non-human-readable unique identifier for a file, e.g. “guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6”

• Site URL (SURL) (or Physical/Site File Name (PFN/SFN))
  – The location of the actual file on a storage system, e.g. “sfn://lxshare0209.cern.ch/data/biomed/ntuples.dat”

• Logical File Name (LFN)
  – An alias created by a user to refer to some file, e.g. “lfn:/grid/biomed/David20030203/run2/track1”

• Transport URL (TURL)
  – Temporary locator of a replica + access protocol: understood by an SE, e.g. “gsiftp://lxshare0209.cern.ch//data/biomed/ntuples.dat”
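
The four kinds of names can be told apart by their prefix alone. The helper below is a minimal sketch of that idea; `classify_grid_name` is a hypothetical function, not part of any gLite tool. Treating `srm://` as a SURL prefix and `rfio`/`dcap` as TURL schemes goes beyond the examples on this slide and is an assumption based on the protocols listed earlier.

```python
# Assumption: access protocols named on the SRM slide, plus gsiftp from the example.
TRANSFER_SCHEMES = {"gsiftp", "rfio", "dcap"}
# Assumption: "srm" is accepted alongside the "sfn" scheme shown on the slide.
STORAGE_SCHEMES = {"sfn", "srm"}


def classify_grid_name(name):
    """Return 'GUID', 'LFN', 'SURL' or 'TURL' for a Grid file name, by prefix."""
    if name.startswith("guid:"):
        return "GUID"
    if name.startswith("lfn:"):
        return "LFN"
    scheme, separator, _ = name.partition("://")
    if separator and scheme in STORAGE_SCHEMES:
        return "SURL"
    if separator and scheme in TRANSFER_SCHEMES:
        return "TURL"
    raise ValueError(f"not a recognised Grid file name: {name}")


examples = [
    "guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
    "sfn://lxshare0209.cern.ch/data/biomed/ntuples.dat",
    "lfn:/grid/biomed/David20030203/run2/track1",
    "gsiftp://lxshare0209.cern.ch//data/biomed/ntuples.dat",
]
for example in examples:
    print(classify_grid_name(example), example)
```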

LFC name space

Two sets of commands

• lfc commands
  – Use LFC commands to interact with the catalogue only
    • Create a catalogue directory
    • List files
  – Used by you and by lcg-utils

• lcg-utils
  – The LCG Data Management tools (usually called lcg-utils) allow users to copy files between UI, CE, WN and an SE, to register entries in the File Catalogs and to replicate files between SEs.

LFC Catalog commands table

Command         Description
lfc-chmod       Change access mode of the LFC file/directory
lfc-chown       Change owner and group of the LFC file/directory
lfc-delcomment  Delete the comment associated with the file/directory
lfc-getacl      Get file/directory access control lists
lfc-ln          Make a symbolic link to a file/directory
lfc-ls          List file/directory entries in a directory
lfc-mkdir       Create a directory
lfc-rename      Rename a file/directory
lfc-rm          Remove a file/directory
lfc-setacl      Set file/directory access control lists
lfc-setcomment  Add/replace a comment
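
A short session with these commands might look like the sketch below. It builds the command lines in Python and only executes them if the LFC client tools are actually installed; otherwise it prints what it would run. The helper names (`lfc_commands`, `run_if_available`), the `run2` and `latest` entries and the comment text are illustrative assumptions. The LFC clients also expect the catalogue host in the `LFC_HOST` environment variable, which is not set here.

```python
import shlex
import shutil
import subprocess


def lfc_commands(base_dir, comment):
    """Build a small catalogue session: mkdir, comment, symlink, listing."""
    run_dir = f"{base_dir}/run2"
    return [
        ["lfc-mkdir", run_dir],                           # create a directory
        ["lfc-setcomment", run_dir, comment],             # add/replace a comment
        ["lfc-ln", "-s", run_dir, f"{base_dir}/latest"],  # symbolic link
        ["lfc-ls", "-l", base_dir],                       # long listing
    ]


def run_if_available(commands):
    """Run each command if its executable exists, otherwise just print it."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            print("would run:", shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


commands = lfc_commands("/grid/biomed/David20030203", "second test run")
run_if_available(commands)
```

Note that the lfc- commands take plain catalogue paths (`/grid/...`), without the `lfn:` prefix used by lcg-utils.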

lcg-utils table

Command   Description
lcg-cp    Copies a Grid file to a local destination
lcg-cr    Copies a file to an SE and registers the file in the LRC
lcg-del   Deletes one file (either one replica or all replicas)
lcg-rep   Copies a file from SE to SE and registers it in the LRC
lcg-sd    Sets file status to “Done” in a specified request
lcg-aa    Adds an alias in RMC for a given GUID
lcg-gt    Gets the TURL for a given SURL and transfer protocol
lcg-la    Lists the aliases for a given LFN, GUID or SURL
lcg-lg    Gets the GUID for a given LFN or SURL
lcg-lr    Lists the replicas for a given LFN, GUID or SURL
lcg-ra    Removes an alias in RMC for a given GUID
lcg-rf    Registers an SE file in the LRC (optionally in the RMC)
lcg-uf    Unregisters a file residing on an SE from the LRC
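
A typical upload, replicate, list and download sequence can be sketched as follows. The code only assembles and prints the command lines; it does not contact any Grid service. The function name `upload_and_replicate`, the local path `/home/david/ntuples.dat` and the host `se2.example.org` are assumptions for illustration. The options shown (`--vo`, `-d` for the destination SE, `-l` for the LFN) follow common lcg-utils usage and should be checked against the man pages of the installed version.

```python
import shlex


def upload_and_replicate(vo, local_path, lfn, first_se, second_se):
    """Return the lcg-utils command lines for a simple file life cycle."""
    return [
        # Copy the local file to an SE and register it under the given LFN.
        ["lcg-cr", "--vo", vo, "-d", first_se, "-l", lfn, f"file:{local_path}"],
        # Create a second replica on another SE.
        ["lcg-rep", "--vo", vo, "-d", second_se, lfn],
        # List all replicas now known for the LFN.
        ["lcg-lr", "--vo", vo, lfn],
        # Copy the Grid file back to a local destination.
        ["lcg-cp", "--vo", vo, lfn, f"file:{local_path}.copy"],
    ]


steps = upload_and_replicate(
    vo="biomed",
    local_path="/home/david/ntuples.dat",
    lfn="lfn:/grid/biomed/David20030203/run2/track1",
    first_se="lxshare0209.cern.ch",
    second_se="se2.example.org",
)
for argv in steps:
    print(shlex.join(argv))
```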
