PIC, the Spanish LHC Tier-1, ready for data taking
EGEE09, Barcelona, 21-Sep-2009
Gonzalo Merino, [email protected]
http://lhcatpic.blogspot.com
The LHC
Basic Science
Try to answer fundamental questions:
What are things made of?
What did it look like 1 nanosecond after the Big Bang?
World’s largest scientific machine
27 km ring proton accelerator
100 m underground
superconducting magnets at -271 °C
Four detectors located at the p-p collision points
Extremely complex devices
Collaborations of O(1000) people
>100 million sensors
Generate Petabytes/s of data
WLCG
The LHC Computing Grid project started in 2002
– Phase 1 (2002 – 2005): Tests and dev, build service prototype
– Phase 2 (2006 – 2008): Deploy initial LHC computing service
Purpose: “to provide the computing resources needed to process and analyse the data gathered by the LHC Experiments”
Enormous data volumes and computing capacity
– 15 PB/yr RAW, >50 PB/year overall
– Lifetime 10-15 years: Exabyte scale (rough check below)
Scalability challenge
Clear that a distributed infrastructure was needed
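A back-of-the-envelope check of the "Exabyte scale" figure, using the >50 PB/year rate quoted above and decimal units; the arithmetic is mine, not from the slides:

```python
# Rough check: >50 PB/year over a 10-15 year lifetime (decimal units).
pb_per_year = 50
for years in (10, 15):
    total_pb = pb_per_year * years
    print(f"{years} years -> {total_pb} PB = {total_pb / 1000:.2f} EB")
```

This gives 0.5-0.75 EB, i.e. approaching the Exabyte scale.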
Grid Infrastructure
Since early 2000s large international projects funded to deploy production Grid Infrastructures for scientific research
Fortunately for WLCG, these projects took on much of the load of building the infrastructure
WLCG built on big multi-science production Grid Infrastructures:
EGEE, OSG
LHC a big EGEE user
Monthly CPU wall-time usage per scientific discipline, from the EGEE Accounting Portal.
The LHC: the biggest user of EGEE.
There are many ways one can use EGEE: how is the LHC using the Grid?
Tiered Structure
It comes from the early days (1998, MONARC), when it was mainly motivated by the limited network connectivity among sites.
Today, the network is not the issue but the Tiered model is still used to organise work and data flows.
Tier-0 at CERN:
– DAQ and prompt reconstruction
– Long-term data curation
Tier-1 (11 centres): online to DAQ 24x7
– Massive data reconstruction
– Long-term storage of a RAW data copy
Tier-2 (>150 centres):
– End-user analysis and simulation
Computing Models: not all four LHC experiments use the Tiered structure in the same way
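A minimal sketch of the tier roles just described, as a plain Python data structure plus a toy placement function. The site names and the round-robin policy are invented for illustration; real experiments use their own data-placement systems.

```python
# Tier roles in the MONARC-style model (illustrative only).
TIER_ROLES = {
    "Tier-0 (CERN)":         ["DAQ and prompt reconstruction", "long-term data curation"],
    "Tier-1 (11 centres)":   ["massive data reconstruction", "long-term storage of a RAW copy"],
    "Tier-2 (>150 centres)": ["end-user analysis", "simulation"],
}

def place_raw_copies(datasets, tier1_sites):
    """Toy policy: give each RAW dataset a custodial Tier-1, round-robin.

    Illustrates only the idea that every RAW dataset gets a second,
    long-term copy outside CERN; real placement is experiment-specific."""
    return {ds: tier1_sites[i % len(tier1_sites)] for i, ds in enumerate(datasets)}

print(place_raw_copies(["run_001_RAW", "run_002_RAW", "run_003_RAW"],
                       ["PIC", "CNAF", "IN2P3-CC"]))
```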
Distribution of resources
Experiment computing requirements for the 2009-2010 run at the different WLCG Tiers
CPU required 2009+2010: CERN 21%, Tier-1 34%, Tier-2 45%
Disk required 2009+2010: CERN 14%, Tier-1 45%, Tier-2 41%
More than 80% of the resources are outside CERN
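A quick check of that statement, using the CERN shares from the charts above (the arithmetic is mine): roughly 79-80% of the CPU and about 86% of the disk sit outside CERN.

```python
# Fraction of WLCG resources outside CERN, from the 2009+2010 shares above.
cpu_cern, disk_cern = 0.21, 0.14
print(f"CPU outside CERN:  {1 - cpu_cern:.0%}")   # ~79%
print(f"Disk outside CERN: {1 - disk_cern:.0%}")  # ~86%
```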
The Grid MUST work
The operations challenge
Scalability
Reliability
Performance
Scalability
The computing and storage capacity needs for WLCG are enormous.
Once the LHC starts, the growth rate will be impressive.
[Charts: LHC CPU requirements (T1+T2) in kHS06 and LHC Disk requirements (T1+T2) in PB, 2008-2013, comparing site pledges with experiment requirements]
~100,000 cores today
Reliability
Setting up and deploying a robust Operational model is crucial for building reliable services on the Grid.
One of the key tools for WLCG comes from EGEE:
The Service Availability Monitor
Improving reliability
Track site status over time…
Improving reliability
... publish rankings
Improving reliability
See how site reliability goes up!
[Chart: monthly reliability, T0/T1 average vs. target, May 2006 - May 2009]
An increasing number of more realistic sensors, plus a powerful monitoring framework that ensures peer pressure, guarantees that the reliability of the WLCG service will keep improving.
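A minimal sketch of how a SAM-style ranking could be computed from availability-test results. The reliability definition used here (fraction of passed tests, ignoring scheduled downtime) and all site names and numbers are simplifications of mine, not the actual WLCG algorithm.

```python
from typing import Dict, List

def reliability(test_results: List[bool]) -> float:
    """Fraction of successful service-availability tests in the period."""
    return sum(test_results) / len(test_results) if test_results else 0.0

def rank_sites(results: Dict[str, List[bool]], target: float = 0.97) -> None:
    """Print sites ordered by reliability, flagging those below the target."""
    for site in sorted(results, key=lambda s: reliability(results[s]), reverse=True):
        r = reliability(results[site])
        print(f"{site:10s} {r:6.1%}  {'OK' if r >= target else 'below target'}")

rank_sites({
    "SITE-A": [True] * 98 + [False] * 2,    # 98% of tests passed
    "SITE-B": [True] * 90 + [False] * 10,   # 90% of tests passed
})
```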
Performance: data volumes
CMS has been transferring 100 – 200 TB per day on the Grid for more than 2 years
[Chart: CMS daily transfer volumes, 2008-2009, around the 150 TB/day level]
Last June ATLAS added 4 PB in 11 days to their total of 12 PB on the Grid
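Converted into sustained rates (decimal units; the arithmetic is mine, not from the slides), those volumes correspond to:

```python
SECONDS_PER_DAY = 86_400

def tb_per_day_to_mb_per_s(tb_per_day: float) -> float:
    return tb_per_day * 1e6 / SECONDS_PER_DAY   # 1 TB = 1e6 MB

print(f"150 TB/day      ~ {tb_per_day_to_mb_per_s(150):,.0f} MB/s")        # ~1,700 MB/s
print(f"4 PB in 11 days ~ {tb_per_day_to_mb_per_s(4000 / 11):,.0f} MB/s")  # ~4,200 MB/s
```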
http://www.pic.es
Port d’Informació Científica (PIC) created in June 2003
Collaboration agreement between 4 partners: DEiU, CIEMAT, UAB, IFAE
Data centre supporting scientific research involving analysis of massive sets of distributed data.
http://www.pic.es
WLCG Tier-1 since Dec 2003, supporting ATLAS, CMS and LHCb
– Target: provide 5% of the total Tier-1 capacity
The Tier-1 represents >80% of current resources
Goal: Technology transfer from the LHC Tier-1 to other scientific areas facing similar data challenges.
Computing services for other applications besides LHC
– Astroparticles (MAGIC datacenter)
– Cosmology (DES, PAU)
– Medical Imaging (Neuroradiology)
Is PIC delivering as a WLCG Tier-1?
Scalability
Reliability
Performance
Scalability
[Chart: PIC installed disk capacity (TB), Jan 2006 - Oct 2009]
In the last 3 years PIC has followed the capacity ramp-ups as pledged in the WLCG MoU
5-fold in CPU
>10-fold in Disk and Tape
[Chart: PIC installed CPU capacity (HS06) vs. MoU pledge, Jan 2006 - Oct 2009]
[Chart: PIC installed tape capacity (TB) vs. planned, Apr 2006 - Jul 2009]
CPU cluster: Torque/Maui
Disk: dCache
Tape: Castor → Enstore migration
Sun 4500 servers (DAS)
DDN S2A (SAN)
Public-tender purchasing model → different technologies → an integration challenge
PIC Tier-1 Reliability
[Chart: monthly reliability of PIC vs. the T0/T1 average and the target, May 2006 - May 2009]
Tier-1 reliability targets have been met in most months
Performance: T0/T1 transfers
Target ATLAS+CMS+LHCb ~ 210 MB/s
[Charts: CMS data imported from T1s / CMS data exported to T1s]
Target ATLAS+CMS+LHCb ~ 100 MB/s
ATLAS daily rate CERN → PIC, June 2009
Target: 76 MB/s
CMS daily rate CERN → PIC, June 2009
Target: 60 MB/s
Data import from CERN and transfers with other Tier-1s successfully tested above targets
Challenges ahead
Tape is difficult (especially reading …)
– Basic use cases have been tested in June, but it is still tricky
• Organisation of files in tape groups – big impact on performance
• Sharing of drives/libraries between experiments
Access to data from jobs (see the sketch below)
– Support very different jobs: reco (1 GB / 2 h) vs. user (5 GB / 10 min)
– Optimisation for all of them is not possible – a compromise
• Remote open of files vs. copy to the local disk
• Read-ahead buffer tuning, disk contention in the WNs
Real users (chasing real data)
– Simulations of “user analysis” load have been done – OK
– The system has never been tested with 1000s of real users simultaneously accessing data with realistic (random?) patterns
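An illustrative sketch, not experiment or dCache code, of the access-mode trade-off mentioned in the "access to data from jobs" item above: whether a job streams a file from the storage element or copies it to worker-node scratch first. The threshold and all parameter values are invented.

```python
def access_strategy(file_size_gb: float, expected_read_fraction: float,
                    scratch_free_gb: float) -> str:
    """Pick a data-access mode for a job (toy heuristic).

    Sparse readers (e.g. user analysis touching a few percent of a file)
    usually win by streaming; jobs that read most of the file (e.g.
    reconstruction) usually win by copying it to local disk once."""
    if expected_read_fraction < 0.2:                 # invented threshold
        return "remote open (stream, rely on the read-ahead buffer)"
    if file_size_gb <= scratch_free_gb:
        return "copy to local disk, then read locally"
    return "remote open (not enough local scratch space)"

print(access_strategy(file_size_gb=1.0, expected_read_fraction=0.9, scratch_free_gb=20))   # reco-like job
print(access_strategy(file_size_gb=5.0, expected_read_fraction=0.05, scratch_free_gb=20))  # user-like job
```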
Challenges ahead
The Cloud
– A new buzzword?
– Several demos of running scientific apps on commercial clouds
– Surely we can learn a lot from the associated technologies
Encapsulating jobs as VMs – decouple hardware from applications
Multi-science support environments
– Many of the WLCG sites also support other EGEE VOs
– Requirements can be very different (access control to data, …)
– Maintaining high QoS in these heterogeneous environments will be a challenge… but that’s precisely the point.
Thank you
Gonzalo Merino, [email protected]
Backup Slides
Data Analysis
The original vision: a thin application layer interacting with a powerful middleware layer (a super-WMS to which the user submits input dataset queries plus algorithms, and which spits out the result)
Reality today: LHC experiments have built increasingly sophisticated s/w stacks to interact with the Grid, on top of the basic services: CE, SE, FTS, LFC
Workload Management: pilot jobs, late scheduling, VO-steered prioritisation (DIRAC, Alien, Panda …) – sketched below
Data Management: Topology aware higher level tools, capable of managing complex data flows (Phedex, DDM …)
User Analysis: Single interface for the whole analysis cycle, hide the complexity of the Grid (Ganga, CRAB, DIRAC, Alien …)
Using the Grid at such a large scale is not an easy business!
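A minimal sketch of the pilot-job / late-binding pattern behind the DIRAC, Alien and Panda systems mentioned above. Everything here (queue contents, function and host names) is invented for illustration; it is not the API of any of those systems.

```python
import queue

# A central VO task queue holding the real payloads (contents invented).
central_task_queue: "queue.Queue[str]" = queue.Queue()
for task in ["reco_run_1234", "analysis_user_42", "mc_prod_789"]:
    central_task_queue.put(task)

def pilot(worker_node: str) -> None:
    """A pilot job: land on the node first, then pull payloads until none remain."""
    # 1. The pilot would first validate the node (software area, scratch, ...) -- elided here.
    # 2. Late scheduling: the payload is chosen only now, not at grid-submission time.
    while True:
        try:
            payload = central_task_queue.get_nowait()
        except queue.Empty:
            return
        print(f"{worker_node}: running {payload}")

pilot("wn001.example.org")
```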
Performance: CPU usage
100,000 kSI2K·months
~ 50,000 simultaneously busy cores
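The two figures above are consistent if one assumes roughly 2 kSI2K per core, a typical value for hardware of that era; the per-core figure is my assumption, not from the slide.

```python
ksi2k_months = 100_000
ksi2k_per_core = 2.0                        # assumed per-core power (kSI2K)
core_months = ksi2k_months / ksi2k_per_core
print(f"{core_months:,.0f} core-months -> ~{core_months:,.0f} cores busy for one month")
```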
PIC data centre
150 m² machine room
200 kVA IT UPS (+ diesel generator)
~1500 CPU cores in the batch farm
~ 1 Petabyte of disk
~ 2 Petabytes of tape
STK-5500 and IBM-3584 tape libraries
End-to-end Throughput
Besides growing capacity, one of the challenges for sites is to sustain high-throughput data rates between components
1500-core cluster, 1 Petabyte of disk, 1 Petabyte of tape
2.5 GByte/s peak rates between WNs and disk during June 2009
10 Gbps
>250 MB/s read+write tape bandwidth demonstrated in June 2009
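Put per job slot (my arithmetic, decimal units), the 2.5 GB/s peak corresponds to:

```python
aggregate_gb_per_s = 2.5
cores = 1500
print(f"~{aggregate_gb_per_s * 1000 / cores:.1f} MB/s per core if all ~1500 cores read at once")
```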
Data Transfers to Tier-2s
• Reconstructed data sent to the T2s for analysis. Bursty nature.
• Experiment requirements are very fuzzy for this dataflow (as fast as possible)
– Links to all SP/PT Tier-2s certified with 20-100 MB/s sustained
– CMS Computing Model: sustained transfers to >43 T2s worldwide
Reliability: experiments view
• Seeing site reliability improving, experiments were motivated to make their sensors measure it more realistically.
• An increasing number of more realistic sensors, plus a powerful monitoring framework that ensures peer pressure, guarantees that the reliability of the WLCG service will keep improving.