[email protected] GridPP19, 29/8/07
CMS and the Grid: 2004 – 7
Or
Peta-meta-computing in the proto-Grid era: a sociotechnological retrospective
Or
Whose paradigm is it anyway?
“Experiments for the Grid”
“The Grid for Experiments”
History
2004: The end of innocence
  DC04 data challenge - a learning experience (also start playing with a new idea: “PhEDEx”)
2005: The year of (re-)design
  Computing TDR
  A new ground-up software & computing framework
2006: Making it work
  CSA06 - the acid test
  CASTOR in the UK
    • “…we all thought you were crazy” - I. Fisk
2007: Making it work without losing our sanity
  Learning about storage + data transfers
  CSA07: the real fun is yet to come
DC04
25% (of startup) data challenge
  Set the traditional formula for subsequent challenges
  Exercised T0, T1, T2 centres for a full month
  50M events, ~15 centres, ~30 people
Tools & technologies
  Ad hoc scripts for workflow mgmt + CASTOR at CERN
  Variety of storage at T1 (SRB + ADS at RAL), plain disk at T2
  First-generation EDG workload + data mgmt tools (incl. RLS)
Results
  Technically: largely a disaster
    • Managed to break essentially every component in the system
  Organisationally: a major step forward
    • Learned several key lessons that informed the computing model
    • Established new projects to solve the technical problems
GridPP2 contributed substantially to the analysis / solution of problems
CSA06 - in Numbers
7 Tier-1 centres
35 Tier-2 sites
100M fully simulated events
1.4PB of data
400MB/s rate from RAL CASTOR
1200MB/s peak dataflow
70 people
180 meetings
500 dodgy disks (but only one tape eaten!)
£300k of electricity
40l of Champagne at end-of-CSA party
“Shambolic” CSA Forced to Abandon Targets
- Headline, The Independent, 5th Jan
Lesson the 1st: Data
It’s all about the data, stupid!
Remember the ‘DataGrid’?
  We’ve failed to build a uniform approach to WADM
We deal with data processing centres, not CPU centres
  The (remaining) hard problems are in efficient data access
  I.e. storage, IO, data transfer at Tier-2, not just Tier-1
Need to nail the ‘local IO problem’ very soon
  Still have a lot to do on reliable data transfer as well
BTW, the network still isn’t the bottleneck (but keep trying…)
Lesson the 2nd: Locality
Keep local stuff local
Aim of the Grid: avoid central points of congestion
  Present coherent local interfaces, but reduce global state
  Actually: aim of all coherent system-building strategies
Examples from current CMS system:
  Use local catalogues whenever possible; update asynchronously
  Don’t use off-site ‘Grid services’ for local workflows (e.g. reco).
This also applies to contacts with sites
  ‘Users’ / ‘experiments’ must work with sites directly
  ‘Up the chain of command and down again’ does not work
  NB: also applies to ‘strategic’ discussions and resource planning
Lesson the 3rd: Reliability
Reliability trumps performance & scalability
Unreliable systems are extremely inefficient
  $N_{\mathrm{tries}}$ goes as $[\log(1-p)]^{-1}$, bookkeeping at least as $N_{\mathrm{tries}}^{2}$
Unreliable systems are not trusted by users
If one can’t make a small system work, larger systems will be progressively worse
We are getting there; but not fast enough
Reliability can be achieved iff robustness is built in
Without reliability, what is the point of the Grid?
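A rough worked illustration of the scaling quoted in this lesson, as a minimal Python sketch. It rests on assumptions the slide does not spell out: that p is the probability a single attempt succeeds, that attempts are independent, and that one retries until the residual failure probability drops below a chosen target. The function names and the default target are hypothetical, chosen only for this example.

```python
import math


def tries_needed(p_success: float, target_failure: float = 1e-3) -> int:
    """Smallest n such that (1 - p_success)**n <= target_failure.

    Assumption: independent attempts, each succeeding with probability
    p_success. Solving (1 - p)^n = target gives
    n = log(target) / log(1 - p), i.e. n scales as 1 / log(1 - p).
    """
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must lie strictly between 0 and 1")
    if not 0.0 < target_failure < 1.0:
        raise ValueError("target_failure must lie strictly between 0 and 1")
    return math.ceil(math.log(target_failure) / math.log(1.0 - p_success))


def bookkeeping_lower_bound(n_tries: int) -> int:
    """Relative bookkeeping cost, taken as n_tries squared.

    Assumption: this is the 'at least N_tries^2' lower bound from the
    slide, in arbitrary units, not a measured cost.
    """
    return n_tries ** 2


if __name__ == "__main__":
    # Compare a fairly reliable component with progressively worse ones.
    for p in (0.9, 0.5, 0.1):
        n = tries_needed(p, target_failure=0.01)
        print(f"p={p}: tries={n}, bookkeeping>={bookkeeping_lower_bound(n)}")
```

The point of the sketch is only the shape of the curve: as the per-attempt success probability falls, the number of attempts grows quickly, and the bookkeeping bound grows as its square.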
Lesson the 4th: Exceptions
Sticking plasters won’t cure a broken leg
We use the ‘network stack model’ of fault tolerance
  Higher layer functions compensate for unreliability of lower layers
Alas, does not work for intrinsically unreliable systems
  Example: wireless network in CERN building 40
  802.11b fine with 1% error rate; collapses with 10% (CMS week!)
Fault-finding is impossible without fault reporting
  And intelligible logging, recorded and accessible at all levels
‘Exception handling’ is clearly hard
  A key property of a mature system
  Remember: exceptions should be exceptional
Lesson the 5th: People
“Generic Grid sites” do not really exist
A site is precisely as “good” as the people running it
  Objectively: throughput (transfers) tracks national holidays!
  We are still in a highly labour-intensive mode; the labour is at sites
What does CMS need from site operators, today?
  Close contact with ‘the project’ and ‘the users’; sharing of experience
  Proactive deployment & testing of new services, software
  Active participation in resource planning and data operations
Will ‘generic sites’ ever exist?
  Not until central support and problem-tracking are much improved
Lesson the 6th: Focus
No more neat ideas - for now
In 2007/8, that is!
  Focus on (dull, tedious, hard) integration, testing, documentation
  The excitement will come from the physics!
But many ‘big unsolved problems’ for later:
  How can we store data more efficiently? Can we?
  How can we compute more efficiently? Can we?
  How should we use multi-threading & virtualisation?
  How do we use really high-speed networks?
  Will anyone ever make videoconferencing work properly?
Someone should start targeting these problems…
Whither the Grid?
Is CMS using the Grid?
  PKI-based, uniform(ish) web services interfaces? Yes
    • But also a lot of remote DB access for many purposes
  Resource discovery / Info service? Not really.
    • >90% of CMS jobs are whitelisted at RB level (many even at user level)
  Replica management? Partially, through our own mechanisms
    • No real attempt at optimisation of data access - yet
  Support, authentication, ROC services? Partially
    • Augmented with CMS-specific and national support mechanisms
Has it all been worth it (so far)?
  Yes! If it didn’t exist, we’d have had to invent it anyway
Will we become more Grid-like?
  Undoubtedly (though not sure ‘utility computing’ will ever be a goer)
  For now, efficiency appears to require simplicity - no surprise
The real value of ‘The Grid’ is yet to come
(Near) Future
The hard work starts here!
  I say this every six months
  So far it’s always been true
CSA07
  Really the last big test of our organisation, readiness for data
  Already reviewing some aspects of model after discussion…
2008: The crunch year
  Focus should be on basic reliable services at centres
  Need to reinforce communications between expts and sites
GridPP3
  Clearly a major role to play in CMS computing - at many levels
Roll on LHC startup!