
Analysis support issues, virtual analysis facilities, and Tier 3s


Page 1: Analysis support issues, virtual analysis facilities, and Tier 3s

Analysis support issues, virtual analysis facilities, and Tier 3s

Doug Benjamin (Duke University)
With inputs from Hiro Ito, Val Hendrix

Page 2: Analysis support issues, virtual analysis facilities, and Tier 3s

How do US ATLAS physicists work today?

• As Rik mentioned, surveys are being conducted (US ATLAS: 128 responses and increasing; ATLAS: 266)

[Charts: survey responses, ATLAS-wide and US ATLAS]

Page 3: Analysis support issues, virtual analysis facilities, and Tier 3s

US ATLAS analysis grid frequency

Page 4: Analysis support issues, virtual analysis facilities, and Tier 3s

US ATLAS local usage

Page 5: Analysis support issues, virtual analysis facilities, and Tier 3s

Difference in cultures: ATLAS vs US ATLAS (local usage)

US analyzers rely more heavily on local resources than their colleagues

Page 6: Analysis support issues, virtual analysis facilities, and Tier 3s

What data do people now use?

US and ATLAS analyzers are not that different

Page 7: Analysis support issues, virtual analysis facilities, and Tier 3s

What types of D3PDs are used?

Provides an indication of what types of D3PDs we should have in the US

Page 8: Analysis support issues, virtual analysis facilities, and Tier 3s

How do people get their data?
• dq2-get is very popular among analyzers
  o Seen in surveys and dq2 traces
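
Part of dq2-get's popularity is how little it asks of the user; a minimal sketch of the typical one-step fetch is below, with a hypothetical dataset name.

```python
# Minimal sketch: the one-step fetch that makes dq2-get so popular.
# The dataset name is a hypothetical example.
import subprocess

dataset = "user.someone.SMWZ_ntuple.v1/"  # hypothetical dataset name

# dq2-get pulls the dataset's files into the current directory: easy to
# use and instant, but just as easy to run against very large datasets
# and cause trouble upstream.
if subprocess.call(["dq2-get", dataset]) != 0:
    raise RuntimeError("dq2-get failed for %s" % dataset)
```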

Page 9: Analysis support issues, virtual analysis facilities, and Tier 3s

Dq2-get frequency and importance

Page 10: Analysis support issues, virtual analysis facilities, and Tier 3s

Interim Conclusions
• US ATLAS physicists use a variety of resources at their disposal
  o Most use the grid; a significant fraction (~25%) do not (similar number for ATLAS)
• ARRA funding was a success: local resources are used
  o US physicists make frequent and heavy use of local resources
  o US physicists share with others in ATLAS (47%)
• dq2-get is an extremely popular way to fetch data
  o Easy to use, instant feedback
  o Easy to misuse and cause trouble upstream
  o Need a solution that provides “instant” gratification, yet protects the system as a whole. Will Rucio provide this?

Page 11: Analysis support issues, virtual analysis facilities, and Tier 3s

Helping the users

[Charts: US ATLAS survey and ATLAS survey responses]

Better communication is needed: we need to reach the remaining ~25% of analyzers. All ATLAS analyzers should know about DAST; it is a resource for them.

Page 12: Analysis support issues, virtual analysis facilities, and Tier 3s

Moving toward the future
• 2014/2015 will be an important time for ATLAS and US ATLAS
  o Machine will be at full energy
  o ATLAS wants a trigger rate to tape of ~1 kHz (vs ~400 Hz now)
  o We need to produce results as fast as before (maybe faster)
  o ARRA-funded computing will be 5 years old and beginning to age out
  o US funding of LHC computing is likely flat at best
• Need to evolve the ATLAS computing model
  o Need at least a factor of 5 improvement (more data, and the need to go faster); see the back-of-envelope sketch after this list
• Need to evolve how people do their work
  o Fewer local resources will be available in the future
  o US ATLAS users need to move toward central resources
  o But… need to avoid a loss of functionality and performance
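
A back-of-envelope sketch of where the factor of 5 can come from, using the trigger rates quoted above; the turnaround speed-up factor is an illustrative assumption, not a number from the slides.

```python
# Back-of-envelope: why roughly a factor of 5 improvement is needed.
# The 400 Hz and ~1 kHz trigger rates are from the slide; the
# turnaround speed-up factor is an illustrative assumption.
current_rate_hz = 400.0    # trigger rate to tape today
future_rate_hz = 1000.0    # planned trigger rate to tape (~1 kHz)

data_growth = future_rate_hz / current_rate_hz   # 2.5x more events on tape
turnaround_speedup = 2.0   # assumption: results needed ~2x faster

required_improvement = data_growth * turnaround_speedup
print("Required throughput improvement: ~%.1fx" % required_improvement)  # ~5.0x
```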

Page 13: Analysis support issues, virtual analysis facilities, and Tier 3s

Additional functionalities
• What about giving users space on the grid?

• What about EOS?

Page 14: Analysis support issues, virtual analysis facilities, and Tier 3s

EOS (or another similar technology)

• If US ATLAS were to provide you with EOS space in the US, how useful would it be to you?

Page 15: Analysis support issues, virtual analysis facilities, and Tier 3s

US ATLAS Local group disk: 3 primary uses

1. Output of common datasets for US ATLAS analyzers
  o Should try to have sufficient space in the ATLAS group space tokens
  o For example, large amounts of space in the Top group space tokens (NET2, SWT2) allow the Top group responsible (me) to direct all of the current data ntuples to the US
2. Output of datasets used by a few individuals
  o Some users are currently using a lot of space
  o The number of users who know that they can write into this space is limited (need to better educate users)
  o Lack of quotas is an issue
3. “Cache space”
  o Used effectively, this should reduce the need for post-processing DATRI requests
  o Having user jobs send output there directly from analysis jobs makes it easier for users to fetch the output
  o Can we separate this space out? Should it be separate?

• Rucio (next-gen data management) will provide “cloud”-wide quotas for users.

Page 16: Analysis support issues, virtual analysis facilities, and Tier 3s

Future of Tier 3 sites
• Why do research groups like local computing?
  o Groups like the clusters that they have because they are now easy to maintain (We succeeded!!!)
  o They give them instant access to their resources
  o They give unlimited quota to use what they want for whatever they want
• We need to have a multi-faceted approach to the Tier 3 evolution
• Need to allow and encourage sites with the funding to continue to refresh their existing clusters
  o We do not want to give the impression in any way that we would not welcome additional DOE/NSF funding to be used in this way
  o What about cast-off compute nodes from Tier 1/Tier 2 sites?
• Need to use beyond-pledged resources for the benefit of all US ATLAS users and groups
  o If we cannot get sufficient funding to refresh the existing Tier 3s
  o Beyond-pledged resources are a key component of the solution

Page 17: Analysis support issues, virtual analysis facilities, and Tier 3s

New technologies on the horizon
• Federated storage
  o With caching, Tier 3s can really reduce data management
• WAN data access
  o User code must be improved to reduce latencies (see the sketch after this list)
  o Decouples storage and CPU
• Analysis queues
  o Can be used to make use of beyond-pledged resources for data analysis, not just MC production
• Cloud computing
  o Users are becoming more comfortable with cloud computing (gmail, iCloud)
  o Virtual Tier 3s
• All are coupled
• Done right, they should allow us to do more with the resources that we will have
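
Since WAN data access depends heavily on how user code reads, a minimal PyROOT sketch of reading an ntuple through a federated xrootd redirector may help; the redirector host, file path, and tree name below are placeholders, not real endpoints.

```python
# Minimal sketch: WAN read of an ntuple via federated xrootd storage.
# The redirector host, file path, and tree name are placeholders.
import ROOT

# root:// URLs go through the federation redirector, which points the
# client at a site that actually holds a copy of the file.
url = "root://federation-redirector.example//atlas/path/to/ntuple.root"
f = ROOT.TFile.Open(url)
if not f or f.IsZombie():
    raise IOError("could not open %s" % url)

tree = f.Get("physics")  # placeholder tree name
# A TTree read cache prefetches baskets in bulk, hiding much of the WAN
# latency; this is the kind of user-code improvement mentioned above.
tree.SetCacheSize(50 * 1024 * 1024)  # 50 MB read-ahead cache
print("entries: %d" % tree.GetEntries())
```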

Page 18: Analysis support issues, virtual analysis facilities, and Tier 3s

Analysis queues
• Test at BNL of a “beta” unofficial analysis queue
• As a test of MAPR storage, the BNL team set up a special queue
• Doug B ran several instances of his standard 2011 test job: SMWZ ntuple, ~250 files, 10 subjobs of 25 files each, simple cut and count
• Each subjob runs in less than 10 minutes (1300 Hz)
• Initial results show that additional tuning is needed
  o Some subjobs waited 45 minutes to start
  o System started from a cold start, with plenty of job slots
• Will repeat the same tests at Duke
  o First need to rename files in xrootd (get rid of _sub* etc.)

Page 19: Analysis support issues, virtual analysis facilities, and Tier 3s

Latency analysis (by Hiro)
• “autopyfactory has the cycle period of 360s. I am guessing this is what happened (I copied the times from various logs).”
Details for one subjob:
1. job created at 2012-10-24 16:17
2. pyfactory sleep cycle also starts at 2012-10-24 16:17 (nothing happens for the next 6 minutes)
3. pyfactory submits to condor at 2012-10-24 16:23:58
4. pilot wrapper on the worker node starts at Wed Oct 24 16:26:21 UTC 2012
5. pilot starts on 24 Oct 16:27:07
6. prun starts at 24 Oct 16:27:26
7. prun ends at 24 Oct 16:28:24
8. data output written to BNL dCache at 24 Oct 16:28:36
9. data output registered to LFC at 24 Oct 16:28:40
10. job pilot ends at 24 Oct 16:29:25
11. log output written to BNL dCache at 24 Oct 16:31:00
12. log output registered to BNL LFC at 24 Oct 16:31:02
13. job registered as success at 24 Oct 16:31:03

Need to tune the various sources of latency; the process is beginning. A small latency-breakdown sketch follows.
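
To make the timeline concrete, here is a small sketch that computes the per-stage latencies from the timestamps above; the 16:17 entries have no seconds in the log, so :00 is assumed.

```python
# Per-stage latencies for the subjob timeline above (timestamps from the
# logs; the 16:17 entries lack seconds, so :00 is assumed).
from datetime import datetime

stages = [
    ("job created",          "2012-10-24 16:17:00"),
    ("condor submission",    "2012-10-24 16:23:58"),
    ("pilot wrapper starts", "2012-10-24 16:26:21"),
    ("pilot starts",         "2012-10-24 16:27:07"),
    ("prun starts",          "2012-10-24 16:27:26"),
    ("prun ends",            "2012-10-24 16:28:24"),
    ("output in dCache",     "2012-10-24 16:28:36"),
    ("output in LFC",        "2012-10-24 16:28:40"),
    ("pilot ends",           "2012-10-24 16:29:25"),
    ("log in dCache",        "2012-10-24 16:31:00"),
    ("job marked success",   "2012-10-24 16:31:03"),
]

fmt = "%Y-%m-%d %H:%M:%S"
times = [(name, datetime.strptime(ts, fmt)) for name, ts in stages]
for (prev, t0), (cur, t1) in zip(times, times[1:]):
    print("%-20s -> %-20s %5d s" % (prev, cur, (t1 - t0).total_seconds()))

# The ~6 minute gap before condor submission is the autopyfactory 360 s
# sleep cycle; the actual prun work takes under a minute.
```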

Page 20: Analysis support issues, virtual analysis facilities, and Tier 3s

Virtual Analysis clusters (Tier 3s)

• R&D effort by Val Hendrix, Sergey Panitkin, DB, plus a qualification task for student Henrik Ohman
  o Part-time effort for all of us, which is an issue for progress
• We have been working on a variety of clouds (by necessity and not necessarily by choice)
  o Public clouds: EC2 (micro spot instances), Google (while still free); see the sketch after this list
  o Research clouds: FutureGrid (we have an approved project)
  o We are scrounging resources where we can (not always the most efficient)
• Val has made progress on configuration and contextualization
• Sergey has done a lot of work with Proof
• Sergey and I have both been working independently on data access/handling using Xrootd and the federation
• We could use a stable private cloud testbed.
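
For the EC2 micro spot instances mentioned above, a minimal boto sketch of placing a spot bid is below; the AMI id, bid price, key pair, and security group are all placeholders, and the exact call options should be checked against the boto documentation.

```python
# Minimal sketch: bidding for an EC2 micro spot instance with boto.
# AMI id, bid price, key pair, and security group are placeholders.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")  # credentials from env/config

requests = conn.request_spot_instances(
    price="0.01",              # maximum bid in USD/hour (placeholder)
    image_id="ami-00000000",   # placeholder AMI with the analysis image
    count=1,
    instance_type="t1.micro",  # the micro spot instances from the slide
    key_name="my-key",         # placeholder key pair
    security_groups=["default"],
)
print("spot request id: %s" % requests[0].id)
```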

Page 21: Analysis support issues, virtual analysis facilities, and Tier 3s

Virtual analysis clusters (cont.)
• Panda is used for workload management in most cases (Proof is the only exception)
  o Perhaps we can use Proof on Demand
• Open issues
  o Configuration and contextualization: we are collaborating with CERN and BNL on Puppet; the outlook looks promising
  o What is the scale of the virtual clusters?
    • Personal analysis cluster
    • Research group analysis cluster (institution level)
    • Multi-institution analysis cluster (many people, several groups, up to country-wide)
  o Proxies for the Panda pilots (personal or robotic)
  o What are the latencies incurred by this dynamic system? Spin-up and tear-down time?
  o What level of data reuse do we want (i.e. caching)?
  o How is the data handling done in and out of the clusters?
    • Initial activity used federated storage as the source
  o What is the performance penalty of virtualization?

Page 22: Analysis support issues, virtual analysis facilities, and Tier 3s

Virtual analysis clusters (future)
• Present a snapshot of results at the upcoming T1/T2/T3 meeting at CERN
• Got an extension of the Google Compute Engine trial period; Sergey will continue to work on analysis clusters
• Get Henrik to ramp up (he owes us 80 work days, full time)
  o I have been asked to follow his work, by Fernando
• Henrik will focus on the Google cloud and FutureGrid (OpenStack)
• LBL (Beate et al.) received a grant from Amazon to run on EC2
  o Using Val's scripts, I will work with them to get them started right away
  o Data will come from US ATLAS federated storage; might have to transfer files to local group disk. Need a good solution for output in the long term
  o Need to help them migrate to private clouds once the grant ends
• Work with John and Jose on APF

Page 23: Analysis support issues, virtual analysis facilities, and Tier 3s

Conclusions
• The US ATLAS Tier 3 program has been an unqualified success
• We must start the process for the future
• New technologies are maturing that can be used and have the potential to be transformative
• So additional resources from the facilities would go a long way toward helping
• Virtual Tier 3s are making slow progress
  o New labor has been added