Z. Z. Vilakazi, iThemba LABS / UCT-CERN Research Centre: Curation in Natural Sciences


Page 1

Z. Z. Vilakazi, iThemba LABS / UCT-CERN Research Centre

Curation in Natural Sciences

Page 2

Acknowledgments

● Common effort of the ALICE and LCG Collaborations.

● Thanks to my colleagues of the ALICE-MUON Collaboration.

Special thanks to Jean Cleymans, Bruce Becker, Artur Szostak, Gareth de Vaux, Sukalyan Chattopadyay, Corrado Cicalo, Timm Steinbeck, Volker Lindenstruth, Heinz Tilsner, Florent Staley and others.

Page 3

Topics for discussion

Management of large data sets

Inter-operability

Standards and protocols

Security and certification

Page 4

Digital Curation

Maintenance of digital research data and other digital materials over their entire life-cycle and over time for current and future generations of users.

Processes of digital archiving and preservation

Also includes all the processes needed for good data creation and management, and the capacity to add value to data to generate new sources of information and knowledge.

", and services in this field."

Centre for Digital Curation

Page 5

Digital Curation (2)

Curation and long-term preservation of digital resources will be of increasing importance for a wide range of activities within research and education.

Through sensors, experiments, digitisation and computer simulation, digital resources and data are growing in volume and complexity at a staggering rate.

The cost of producing these resources is very high: satellites, particle accelerators, genome sequencing, and large scale digitisation and electronic publishing collectively represent a cumulative investment of billions of pounds in digital research and learning.

Long-term curation and preservation of digital resources are seen as a challenge that is difficult, if not impossible, for individual institutions to resolve on their own, owing to the complexity and scale involved.

Page 6

Curation in Physical Sciences

Data are being generated in large volumes.

In laboratories, old archival material (design specifications, codes, etc.) can serve as reference resources.

Remote information access through online publications.

Data management and real-time remote analysis: heavily dependent on bandwidth.

New middleware is being developed for access to data across geographically disparate centres.

Data sharing in astro, nuclear and particle physics: usually characterised by large collaborations (in excess of 100 people).

Metadata are essential for the selection of events.

The Grid file catalogue can be used for one part of the metadata.

During the Data Challenge we used the file catalogue for storing part of the metadata.
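To make the metadata point concrete, here is a minimal sketch of selecting files by metadata stored next to each catalogue entry. It is not the catalogue used in the Data Challenge: `CatalogueEntry`, `select_files`, the file names and the metadata keys (`run`, `beam`, `trigger`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """One file registered in a hypothetical Grid file catalogue."""
    lfn: str                                  # logical file name (made up)
    metadata: dict = field(default_factory=dict)


def select_files(catalogue, **criteria):
    """Return the logical file names whose metadata match every criterion."""
    return [
        entry.lfn
        for entry in catalogue
        if all(entry.metadata.get(key) == value
               for key, value in criteria.items())
    ]


# Illustrative entries only; run numbers and trigger tags are invented.
catalogue = [
    CatalogueEntry("/demo/run001/events_a.root",
                   {"run": 1, "beam": "Pb-Pb", "trigger": "central"}),
    CatalogueEntry("/demo/run001/events_b.root",
                   {"run": 1, "beam": "Pb-Pb", "trigger": "dimuon"}),
    CatalogueEntry("/demo/run002/events_c.root",
                   {"run": 2, "beam": "p-p", "trigger": "minimum_bias"}),
]

# Select only central Pb-Pb files without opening any data file.
central_pbpb = select_files(catalogue, beam="Pb-Pb", trigger="central")
print(central_pbpb)
```

The selection touches only the catalogue, which is why keeping even part of the metadata there helps when the data files themselves sit at remote centres.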

Page 7

Data Handling and Computation for Physics Analysis

[Data-flow diagram (les.robertson@cern.ch, CERN). Labels: detector; event filter (selection & reconstruction); raw data; processed data; reconstruction; event reprocessing; event summary data; simulation; event simulation; analysis; batch physics analysis; analysis objects (extracted by physics topic); interactive physics analysis.]

Page 8

Experimental conditions in heavy-ion colliders

Beam: Pb-Pb, Ca-Ca, p-p, p-A

Rates:
8000 events/s minimum bias
50-100/s central events (2-5% tot)
Acquisition rate: 100 Hz (central), 1000 Hz (dimuons)
1 month/year (10^6 s) = 10^7 central events

Multiplicity: dn/dy from 2000 to 8000, so a total of about 60000

Page 9

Consequences

More than 60 GBytes produced per second in ALICE:
• High Level Trigger (HLT) + compression to reduce raw data to 1.2 GB/s: 2 to 3 PB/year in 1 month of data taking
• Very fast acquisition and network

ALICE will be one of the largest databases in history.

Need a GRID to distribute and analyse data.
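A back-of-envelope check of these figures can be sketched as follows. The two rates are the ones quoted on the slide; the live time of 10^6 s is taken from the "1 month/year" figure on the experimental-conditions slide and is an assumption of this sketch, as are the function names and the decimal unit convention. The result lands at the petabyte scale; it is not meant to reproduce the quoted 2 to 3 PB/year exactly.

```python
# Figures quoted on the slides.
RAW_RATE_GB_PER_S = 60.0       # "more than 60 GBytes produced per second"
REDUCED_RATE_GB_PER_S = 1.2    # after High Level Trigger + compression
LIVE_TIME_S = 1e6              # assumed live time: "1 month/year (10^6 s)"

GB_PER_PB = 1e6                # decimal units: 1 PB = 10^6 GB


def reduction_factor(raw_rate, reduced_rate):
    """How much the trigger and compression shrink the data stream."""
    return raw_rate / reduced_rate


def stored_volume_pb(rate_gb_per_s, live_time_s):
    """Volume written to storage over the live time, in petabytes."""
    return rate_gb_per_s * live_time_s / GB_PER_PB


factor = reduction_factor(RAW_RATE_GB_PER_S, REDUCED_RATE_GB_PER_S)
volume = stored_volume_pb(REDUCED_RATE_GB_PER_S, LIVE_TIME_S)

print(f"reduction factor: about {factor:.0f}x")
print(f"stored volume over the assumed live time: about {volume:.1f} PB")
```

Even after a reduction of roughly a factor of fifty, the stored volume is measured in petabytes per year, which is the motivation for distributing the data over a Grid.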

Page 10

The Grid Vision

The GRID: networked data processing centres and “middleware” software as the “glue” of resources.

Researchers perform their activities regardless of geographical location, interact with colleagues, share and access data.

Scientific instruments and experiments provide huge amounts of data.

Page 11

Classification of Grids

Computational Grids (including CPU-scavenging Grids), which focus primarily on computationally intensive operations.

Data Grids, for the controlled sharing and management of large amounts of distributed data.

Equipment Grids, which have a primary piece of equipment, e.g. a telescope, and where the surrounding Grid is used to control the equipment remotely and to analyse the data produced.

Page 12

Grid beyond high energy physics

Owing to the computational power of EGEE, new communities are requesting services for different research fields.

Normally these communities do not need the complex structure that is required by the HEP communities.

In many cases, their productions are shorter and well defined within the year.

The amount of CPU required is much lower, as are the storage requirements.

20 applications from 7 domains: High Energy Physics, Biomedicine, Earth Sciences, Computational Chemistry, Astronomy, Geo-physics and financial simulation.

Page 13

LCG services – built on two major science grid infrastructures

EGEE - Enabling Grids for E-Science

OSG - US Open Science Grid

Page 14

LCG Service Hierarchy

Tier-0 – the accelerator centre
Data acquisition & initial processing
Long-term data curation
Distribution of data to Tier-1 centres

Tier-1 centres:
Canada – Triumf (Vancouver)
France – IN2P3 (Lyon)
Germany – Forschungszentrum Karlsruhe
Italy – CNAF (Bologna)
Netherlands Tier-1 (Amsterdam)
Nordic countries – distributed Tier-1
Spain – PIC (Barcelona)
Taiwan – Academia Sinica (Taipei)
UK – CLRC (Oxford)
US – FermiLab (Illinois), Brookhaven (NY)

Tier-1 – “online” to the data acquisition process, high availability
Managed Mass Storage – grid-enabled data service
Data-heavy analysis
National, regional support

Tier-2 – ~100 centres in 20 countries
Simulation
End-user analysis – batch and interactive

Page 15

Tier0 / Tier1 / Tier2 Networks

Tier-2s and Tier-1s are inter-connected by the general purpose research networks.

Any Tier-2 may access data at any Tier-1.

[Network diagram. Labels: Tier-1 centres IN2P3, TRIUMF, ASCC, FNAL, BNL, Nordic, CNAF, SARA, PIC, RAL, GridKa; multiple Tier-2 centres; "Cape Town ?"]

Page 16

Summary of Tier0/1/2 Roles

Tier0 (CERN): safe keeping of RAW data (first copy); first pass reconstruction, distribution of RAW data and reconstruction output to Tier1; reprocessing of data during LHC down-times;

Tier1: safe keeping of a proportional share of RAW and reconstructed data; large scale reprocessing and safe keeping of corresponding output; distribution of data products to Tier2s and safe keeping of a share of simulated data produced at these Tier2s;

Tier2: Handling analysis requirements and proportional share of simulated event production and reconstruction.

Very difficult to estimate Network requirements!

N.B. There are differences in roles by experiment. It is essential to test using the complete production chain of each!
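The slide does not say how a "proportional share" of RAW data is computed. As a minimal sketch, assume the share is proportional to each Tier1's storage capacity; the function `proportional_shares`, the site names and the capacity figures below are hypothetical and stand in for whatever rule each experiment actually applies.

```python
def proportional_shares(total_tb, capacities):
    """Split total_tb across sites in proportion to their capacities."""
    capacity_sum = sum(capacities.values())
    if capacity_sum <= 0:
        raise ValueError("at least one site must have positive capacity")
    return {
        site: total_tb * capacity / capacity_sum
        for site, capacity in capacities.items()
    }


# Hypothetical relative capacities, not real figures for any centre.
tier1_capacity = {"site_a": 2.0, "site_b": 1.0, "site_c": 1.0}

# Distribute 1000 TB of RAW data from Tier0 across the three Tier1 sites.
shares = proportional_shares(1000.0, tier1_capacity)
for site, share_tb in shares.items():
    print(f"{site}: {share_tb:.0f} TB")
```

Any such rule fixes the outgoing traffic from Tier0 to each Tier1, which is one reason the network requirements depend on the experiment's computing model.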

Page 17

Physics Data Challenge(s)

[Diagram (F. Carminatti, CERN). Labels: Tier1; Tier2; Production of RAW; Shipment of RAW to CERN; Reconstruction of RAW in all T1’s; Analysis; AliEn job control; Data transfer.]

Page 18

ALICE Network in the World

[Map of active sites: Yerevan, CERN, Saclay, Lyon, Dubna, Cape Town (ZA), Birmingham, Cagliari, NIKHEF, GSI, Catania, Bologna, Torino, Padova, IRB, Kolkata (India), OSU/OSC, LBL/NERSC, Merida, Bari]

http://www.to.infn.it/activities/experiments/alice-grid

37 people, 21 institutions

Page 19

Undersea Cable Capacity

Page 20

Asymmetric Inter-regional Bandwidth

Page 21

Result: Sample Bandwidth Costs for African Universities

Source: IEEAF

Page 22

Topics for discussion

Management of large data sets
$$ and R
Database management
Skills

Digital divide: cyber-infrastructure: network/HR/libraries/data sets/LAN etc.

Inter-operability: e.g. Astro-Grid, mammo Grid, etc.

Standards and protocols
Preservation and quality
Access (meaning of numbers)/terminology and use of unfamiliar data
Configuration management
Ex: Particle data book

Security and certification
Certification authorities

Dialogue between researchers & librarians
Role of libraries and curators
Guidelines
Academic training programme / schools outreach
Schools: New curriculum development (lost data)
Research students: access to previous theses

Resource management

Page 23

Challenges

Strategy for Natural sciences across different domains