33
PIC the Spanish LHC Tier-1 ready for data taking EGEE09, Barcelona 21-Sep-2008 Gonzalo Merino, [email protected] http://lhcatpic.blogspot.com

PIC the Spanish LHC Tier-1 ready for data taking

  • Upload
    lexine

  • View
    40

  • Download
    0

Embed Size (px)

DESCRIPTION

PIC the Spanish LHC Tier-1 ready for data taking. EGEE09, Barcelona 21-Sep-2008 Gonzalo Merino, [email protected] http://lhcatpic.blogspot.com. The LHC. World’s largest scientific machine 27 km ring proton accelerators 100 m underground superconducting magnets -271 o C. Basic Science - PowerPoint PPT Presentation

Citation preview

Page 1: PIC the Spanish LHC Tier-1 ready for data taking

PICthe Spanish LHC Tier-1ready for data taking

EGEE09, Barcelona 21-Sep-2008

Gonzalo Merino, [email protected]

http://lhcatpic.blogspot.com

Page 2: PIC the Spanish LHC Tier-1 ready for data taking

The LHC

Basic Science

Try to answer fundamental questions:

What are things made of?

How did it look like 1 nanosecond after Big Bang?

World’s largest scientific machine

27 km ring proton accelerators

100 m underground

superconducting magnets -271 oC

Page 3: PIC the Spanish LHC Tier-1 ready for data taking

Four detectors located in the p-p collision points

Extremely complex devices

O(1000) people collaborations

>100 million sensors

Generate Petabytes/s data

Page 4: PIC the Spanish LHC Tier-1 ready for data taking

4

WLCG

The LHC Computing Grid project started in 2002– Phase 1 (2002 – 2005): Tests and dev, build service prototype– Phase 2 (2006 – 2008): Deploy initial LHC computing service

Purpose: “to provide the computing resources needed to process and analyse the data gathered by the LHC Experiments”

Enormous data volumes and computing capacity

– 15 PB/yr RAW >50 PB/year overall – Lifetime 10-15 years: Exabyte scale

Scalability challenge

Clear that a distributed infrastructure was needed

Page 5: PIC the Spanish LHC Tier-1 ready for data taking

5

Grid Infrastructure

Since early 2000s large international projects funded to deploy production Grid Infrastructures for scientific research

Luckily for WLCG, these took a big load building the infrastructure

WLCG built on big multi-science production Grid Infrastructures:

EGEE, OSG

Page 6: PIC the Spanish LHC Tier-1 ready for data taking

6

LHC a big EGEE user

Monthly CPU walltime usage per scientific discipline from EGEE Accounting Portal.

The LHC: the biggest user of EGEE.

There is many ways one can use EGEE: How is the LHC using the Grid?

Page 7: PIC the Spanish LHC Tier-1 ready for data taking

7

Tiered Structure

It comes from the early days (1998, MONARC). Then mainly motivated for limited network connectivity among sites.

Today, the network is not the issue but the Tiered model is still used to organise work and data flows.

Tier-0 at CERN: – DAQ and prompt reconstruction.– Long term data curation.

Tier-1 (11 centres): Online to DAQ 24x7– Massive data reconstruction.– Long term storage of RAW data copy.

Tier-2 (>150 centres): – End-user analysis and simulation.

Computing Models: not all 4 LHC experiments use the Tiered structure the same way

Page 8: PIC the Spanish LHC Tier-1 ready for data taking

8

Distribution of resources

Experiment computing requirements for the 2009-2010 run at the different WLCG Tiers

CPU required 2009+2010

21%

34%

45%

Disk required 2009+2010

41%

45%

14%CERN

Tier-1

Tier-2

More than 80% of the resources are outside CERN

The Grid MUST work

Page 9: PIC the Spanish LHC Tier-1 ready for data taking

The operations challenge

Scalability

Reliability

Performance

Page 10: PIC the Spanish LHC Tier-1 ready for data taking

10

Scalability

The computing and storage capacity needs for WLCG are enormous.

Once LHC starts, the growth rate will be impressive.

0

500

1000

1500

2000

2500

3000

2008 2009 2010 2011 2012 2013

Sites pledges

Experiment requirements

LHC CPU requirements (T1+T2)

0

500

1000

1500

2000

2500

3000

2008 2009 2010 2011 2012 2013

CP

U p

ow

er

(kH

S0

6)

LHC Disk requirements (T1+T2)

0

50

100

150

200

250

300

2008 2009 2010 2011 2012 2013

Dis

k (P

B)

~100.000 today cores

Page 11: PIC the Spanish LHC Tier-1 ready for data taking

11

Reliability

Setting up and deploying a robust Operational model is crucial for building reliable services on the Grid.

One of the key tools for WLCG comes from EGEE:

The Service Availability Monitor

Page 12: PIC the Spanish LHC Tier-1 ready for data taking

12

Improving reliability

Track site status with time…

Page 13: PIC the Spanish LHC Tier-1 ready for data taking

13

Improving reliability

... publish rankings

Page 14: PIC the Spanish LHC Tier-1 ready for data taking

14

Improving reliability

See how sites reliability goes up!

60%

65%

70%

75%

80%

85%

90%

95%

100%

105%

may-06sep-06

ene-07may-07

sep-07ene-08

may-08sep-08

ene-09may-09

Rel

iabi

lity

T0/T1 Average

Target

An increasing number of more realistic sensors, plus a powerful monitoring framework that ensures peer pressure, guarantees that reliability of WLCG service will keep improving.

Page 15: PIC the Spanish LHC Tier-1 ready for data taking

15

Performance: data volumes

CMS has been transferring 100 – 200 TB per day on the Grid since more than 2 years

20082009

150 TB/day

Last June ATLAS added 4 PB in 11 days to their

total of 12 PB on the Grid

Page 16: PIC the Spanish LHC Tier-1 ready for data taking

16

http://www.pic.es

Port d’Informació Científica (PIC) created in June 2003

4 partners collaboration agreement: DEiU, CIEMAT, UAB, IFAE

Data centre supporting scientific research involving analysis of massive sets of distributed data.

Page 17: PIC the Spanish LHC Tier-1 ready for data taking

17

http://www.pic.es

WLCG Tier-1 since Dec 2003 supporting ATLAS, CMS and LHCb– Targeting to provide 5% of the Tier-1s capacity

The Tier-1 represents >80% of current resources

Goal: Technology transfer from the LHC Tier-1 to other scientific areas facing similar data challenges.

Computing services for other applications besides LHC– Astroparticles (MAGIC datacenter)– Cosmology (DES, PAU)– Medical Imaging (Neuroradiology)

Page 18: PIC the Spanish LHC Tier-1 ready for data taking

is PIC deliveringas WLCG Tier-1?

Scalability

Reliability

Performance

Page 19: PIC the Spanish LHC Tier-1 ready for data taking

19

0200400600800

100012001400160018002000

en

e-0

6

ab

r-0

6

jul-0

6

oct

-06

en

e-0

7

ab

r-0

7

jul-0

7

oct

-07

en

e-0

8

ab

r-0

8

jul-0

8

oct

-08

en

e-0

9

ab

r-0

9

jul-0

9

oct

-09

Dis

k (

TB

)Scalability

In the last 3 years PIC has followed the capacity ramp-ups as pledged in the WLCG MoU

5 fold in CPU

>10 fold in Disk and Tape

0

2000

4000

6000

8000

10000

12000

en

e-0

6

ab

r-0

6

jul-0

6

oct

-06

en

e-0

7

ab

r-0

7

jul-0

7

oct

-07

en

e-0

8

ab

r-0

8

jul-0

8

oct

-08

en

e-0

9

ab

r-0

9

jul-0

9

oct

-09

CP

U (

HS

06

)

PIC Installed Capacity "MoU Pledge"

0

200

400

600

800

1000

1200

1400

ab

r-0

6

jul-0

6

oct

-06

en

e-0

7

ab

r-0

7

jul-0

7

oct

-07

en

e-0

8

ab

r-0

8

jul-0

8

oct

-08

en

e-0

9

ab

r-0

9

jul-0

9

Ta

pe

(T

B)

Planned

CPU cluster: Torque/Maui

Disk: dCache Tape

CastorEnstore migration

Sun 4500 servers (DAS)

DDN S2A (SAN)

Public tenders purchase model different technologies integration challenge

Page 20: PIC the Spanish LHC Tier-1 ready for data taking

20

PIC Tier-1 Reliability

60%

65%

70%

75%

80%

85%

90%

95%

100%

105%

may-06sep-06

ene-07may-07

sep-07ene-08

may-08sep-08

ene-09may-09

Rel

iabi

lity

T0/T1 Average

Target

60%

65%

70%

75%

80%

85%

90%

95%

100%

105%

may-06sep-06

ene-07may-07

sep-07ene-08

may-08sep-08

ene-09may-09

Rel

iabi

lity

PIC

T0/T1 Average

Target

Tier-1 reliability targets have been met for most of the months

Page 21: PIC the Spanish LHC Tier-1 ready for data taking

21

Performance: T0/T1 transfers

Target ATLAS+CMS+LHCb ~ 210 MB/s

CMS data imported from T1s CMS data exported to T1s

Target ATLAS+CMS+LHCb ~ 100 MB/s

ATLAS daily rate CERNPIC June 2009

Target: 76 MB/s

CMS daily rate CERNPIC June 2009

Target: 60 MB/s

Data import from CERN and transfers with other Tier-1s successfully tested above targets

Page 22: PIC the Spanish LHC Tier-1 ready for data taking

22

Challenges ahead

Tape is difficult (specially reading …)– Basic use cases have been tested in June, but still tricky

• Organisation of files in tape groups – big impact in performance

• Sharing of drives/libraries between experiments

Access to data from jobs– Support very different jobs: reco (1GB/2h) user (5GB/10min)– Optimisation for all of them is not possible – compromise

• Remote open of files vs. copy to the local disk

• Read_ahead buffer tunning, disk contention in the WNs

Real users (chasing real data)– Simulations of “user analysis” load have been done – Ok – The system has never been tested with 1000s of real users

simultaneously accessing data with realistic (random?) patterns

Page 23: PIC the Spanish LHC Tier-1 ready for data taking

23

Challenges ahead

The Cloud– A new buzzword? – Several demos of running scientific apps on commercial clouds. – Sure we can learn a lot from the associated technologies.

Encapsulating jobs as VMs – decouple hardware from applications

Multi-science support environments– Many of the WLCG sites also support other EGEE VOs– Requirements can be very different (access control to data, …)– Maintaining high QoS in these heterogeneous environments will

be a challenge… but that’s precisely the point.

Page 24: PIC the Spanish LHC Tier-1 ready for data taking

Thank you

Gonzalo Merino, [email protected]

Page 25: PIC the Spanish LHC Tier-1 ready for data taking

Backup Slides

Page 26: PIC the Spanish LHC Tier-1 ready for data taking

26

Data Analysis

The original vision: Application thin layer interacting with a powerful m/w layer (super-WMS to which the user throws input dataset queries plus algorithms and spits the result out)

Reality today: LHC experiments have build increasingly sophisticated s/w stacks to interact with the Grid. On top of basic services: CE, SE, FTS, LFC

Workload Management: Pilot jobs, late scheduling, VO-steered prioritisation (DIRAC, Alien, Panda …)

Data Management: Topology aware higher level tools, capable of managing complex data flows (Phedex, DDM …)

User Analysis: Single interface for the whole analysis cycle, hide the complexity of the Grid (Ganga, CRAB, DIRAC, Alien …)

To use the Grid at such large scale is not an easy business!

Page 27: PIC the Spanish LHC Tier-1 ready for data taking

27

Performance: CPU usage

100.000 ksi2k·month

~ 50.000 simult. busy cores

Page 28: PIC the Spanish LHC Tier-1 ready for data taking

PIC data centre

150 m2 machine room

200 KVAIT UPS (+diesel generator)

~ 1500 CPU cores batch farm

~ 1 Petabyte of disk

~ 2 Petabytes of tape

STK-5500 and IBM-3584 tape libraries

Page 29: PIC the Spanish LHC Tier-1 ready for data taking

29

End-to-end Throughput

Besides growing capacity, one of the challenges for sites is to stand high throughput data rates between components

1500 cores cluster 1 Petabyte disk1 Petabyte tape

Page 30: PIC the Spanish LHC Tier-1 ready for data taking

30

End-to-end Throughput

Besides growing capacity, one of the challenges for sites is to stand high throughput data rates between components

1500 cores cluster 1 Petabyte disk1 Petabyte tape

2,5 GBytes/s peak rates WNs-disk during June-09

10 Gbps

Page 31: PIC the Spanish LHC Tier-1 ready for data taking

31

End-to-end Throughput

Besides growing capacity, one of the challenges for sites is to stand high throughput data rates between components

1500 cores cluster 1 Petabyte disk1 Petabyte tape

2,5 GBytes/s peak rates WNs-disk during June-09

>250 MB/s r+w tape bandwidth demonstrated in June-09 250 MB/s

Page 32: PIC the Spanish LHC Tier-1 ready for data taking

32

Data Transfers to Tier-2s

• Reconstructed data sent to the T2s for analysis. Bursty nature.

• Experiment requirements very fuzzy for this dataflow (as fast as possible)– Links to all SP/PT Tier-2s certified with 20-100 MB/s sustained– CMS Computing Model: sustained transfers to > 43 T2s

worldwide

Page 33: PIC the Spanish LHC Tier-1 ready for data taking

33

Reliability: experiments view

• Seeing sites reliability improving, experiments were motivated to making their sensors to measure it more realistic.

• An increasing number of more realistic sensors, plus a powerful monitoring framework that ensures peer pressure, guarantees that reliability of WLCG service will keep improving.