David Cameron Claire Adam Bourdarios Andrej Filipcic Eric Lancon Wenjing Wu ATLAS Computing Jamboree, 3 December 2014 Volunteer Computing



What is volunteer computing?

Ordinary people voluntarily running scientific tasks on their PCs

Berkeley Open Infrastructure for Network Computing (BOINC)

Volunteer Computing @ CERN

• 2004: LHC@Home Sixtrack
• 2011: LHC@Home Test4Theory
• 2014: ATLAS@Home, CMS@Home, Beauty@Home (LHCb)

ATLAS@Home

• Why use volunteer computing for ATLAS?
  – It’s free! (almost)
  – Public outreach
• Considerations
  – Low priority jobs with high CPU-I/O ratio
    • Non-urgent Monte Carlo simulation
  – Need virtualisation for ATLAS software environment
    • CERNVM image and CVMFS
  – No grid credentials or access on volunteer hosts
    • ARC middleware for data staging
  – The resources should look like a regular Panda queue
    • ARC Control Tower

Initial ATLAS@Home Architecture

[Architecture diagram. Labelled components: ARC Control Tower, Panda Server, ARC CE, Session Directory, BOINC LRMS Plugin, BOINC server, Volunteer PC, BOINC Client, VM, Shared Directory, Grid Catalogs and Storage, DB, proxy cert, BOINC PQ, CERN]

Current ATLAS@Home Setup

[Architecture diagram. Labelled components: ARC Control Tower, Panda Server, ARC CE, BOINC server (vLHC@Home), Volunteer PC, BOINC Client, VM, Shared Directory, Grid Catalogs and Storage, DB on demand, BOINC PQ, Shared NFS]

ATLAS@Home History

• Test server with ARC CE and BOINC server with ATLAS@Home app ran in Beijing from January
  – http://gilda117.ihep.ac.cn
  – Volunteers found it somehow…
• In July volunteers were moved to CERN server with ARC CE + BOINC
  – http://arc-boinc-01.cern.ch (alias atlasathome.cern.ch)
  – CERN IT provided 1TB NFS space for job input/output
• At the same time ATLAS@Home became an official BOINC project
• In early October the BOINC server was changed to a vLHC@Home server run by CERN IT
  – Volunteers + credit moved too
• A parallel test setup with separate ARC CE and BOINC server exists for testing

BOINC jobs

• Real simulation tasks
  – mc12_8TeV.117079.PowhegPythia_P2011C_ttbar_nonallhad_mtt_2000p.simul.e2940_s1773
  – Full athena jobs
  – 50 events/job
• Runs in CERNVM with pre-cached software
• But some data still needs to be downloaded at runtime
  – Conditions data from squid/frontier
• Image is 1.1GB (500MB compressed) and downloaded only once
• Input files (data file + small scripts) are 1-100MB
• Output is ~100MB
• VM memory is now 2GB (was 1GB initially, but jobs are now more complex)
• Jobs take from a few hours up to a few days on a fast (single) core
• Validation
  – Per work unit: check that the output is produced (only that the file exists; the content is not checked)
  – Physics validation comparing results to regular Grid task
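
The per-work-unit validation above only tests that an output file exists. A minimal Python sketch of such a check follows; the function name and the idea of passing a single output path are assumptions for illustration, not taken from the ATLAS@Home code.

```python
from pathlib import Path


def work_unit_is_valid(output_path: str) -> bool:
    """Existence-only validation of one work unit (hypothetical helper).

    Returns True if the expected output file exists. The content of
    the file is deliberately not inspected, matching the slide.
    """
    return Path(output_path).is_file()
```

A check like this is cheap but cannot catch corrupt output, which is why a separate physics validation against a regular Grid task is listed as well.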

How does it work for volunteers?

• Install BOINC client and VirtualBox
  – Linux, Mac and Windows supported
  – Currently 80% of hosts have Windows

• In BOINC client choose ATLAS@Home and create an account

• That’s it!

Issues with jobs

• The majority of volunteers (~80%) never complete a single job
  – Not powerful enough resources, entry barrier is too high
    • Requires 64-bit, at least 4GB, decent bandwidth, installing VirtualBox
    • ATLAS@home is the hardest BOINC project to run (quote from volunteer)
  – Unreliable system/failing jobs also push people away
    • The worst thing for volunteers is to use CPU and not give credit
  – BUT the normal retention rate of a project is 10%
• More problems
  – Virtualisation/VMwrapper causes a lot of problems (memory, jobs not finishing, unstable)
  – Firewall issues accessing conditions data through squids
    • We are working on ways to cache this data in the image to avoid network access from the job
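
The entry barrier listed above (64-bit, at least 4GB, VirtualBox installed) can be written down as a simple pre-check. This is only a sketch: the `Host` class and `unmet_requirements` function are hypothetical, "4GB" is assumed to mean host memory, and "decent bandwidth" is left out because the slide gives no number for it.

```python
from dataclasses import dataclass
from typing import List

MIN_RAM_GB = 4  # "at least 4GB" on the slide, assumed to mean host memory


@dataclass
class Host:
    """Hypothetical description of a volunteer machine."""
    is_64bit: bool
    ram_gb: float
    has_virtualbox: bool


def unmet_requirements(host: Host) -> List[str]:
    """Return the list of requirements the host fails (empty if none)."""
    problems = []
    if not host.is_64bit:
        problems.append("64-bit system required")
    if host.ram_gb < MIN_RAM_GB:
        problems.append(f"at least {MIN_RAM_GB}GB of memory required")
    if not host.has_virtualbox:
        problems.append("VirtualBox must be installed")
    return problems
```

Returning the full list of failures, rather than a single yes/no, would let a volunteer see everything that needs fixing before any CPU time is spent on a job that cannot finish.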

Volunteer growth

Currently >12000 volunteers, 1000 active
300 new volunteers/week

Einstein@Home: 300k volunteers, 47k active
Seti@Home: 5 million volunteers, 150k active
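
To compare the projects on equal footing, the active fraction can be computed from the numbers above. A small Python sketch, assuming the rounded slide figures are taken at face value (the ATLAS@Home total is quoted as ">12000", so its fraction is an upper bound):

```python
# (total volunteers, active volunteers), as quoted on the slide
projects = {
    "ATLAS@Home": (12_000, 1_000),      # total is ">12000": fraction is an upper bound
    "Einstein@Home": (300_000, 47_000),
    "Seti@Home": (5_000_000, 150_000),
}


def active_fraction(total: int, active: int) -> float:
    """Share of registered volunteers that are currently active."""
    return active / total


for name, (total, active) in projects.items():
    print(f"{name}: {active_fraction(total, active):.1%} active")
```

This gives roughly 8% for ATLAS@Home, 16% for Einstein@Home and 3% for Seti@Home.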

Job statistics

• Continuous 2000-3000 running jobs
• Almost 300k completed jobs
• 500k CPU hours
• 14M events
• 50% CPU efficiency
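
A quick cross-check of these totals, as a Python sketch. The inputs are the rounded slide figures ("almost 300k" is treated as 300k), so the results are approximate.

```python
completed_jobs = 300_000   # "almost 300k", treated as approximate
cpu_hours = 500_000
events = 14_000_000

events_per_job = events / completed_jobs             # about 47
cpu_hours_per_job = cpu_hours / completed_jobs       # about 1.7
cpu_seconds_per_event = cpu_hours * 3600 / events    # about 129

print(f"{events_per_job:.1f} events/job")
print(f"{cpu_hours_per_job:.2f} CPU hours/job")
print(f"{cpu_seconds_per_event:.0f} CPU seconds/event")
```

The average of about 47 events per completed job is consistent with the 50 events/job configured for the simulation tasks.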

ATLAS@Home in PANDA

Scale of ATLAS@Home

28th largest ATLAS simulation site

Very roughly 3 credits/event

Very active message boards

Standard Boinc webpage

• http://atlasathome.cern.ch
• Technical info on how to join
• Message boards
• Jobs/results
• Job statistics

ATLAS@Home public outreach page

• https://atlasphysathome.cern.ch

• Designed by Claire using Drupal

• Entry point for the public to find out what they are contributing to

• Many links to existing outreach pages

Screensaver

• Many BOINC projects run as “screensavers”
• Working with Riccardo-Maria Bianchi from ATLAS event display VP1 to make ATLAS@Home screensaver
  – Show pre-configured event displays as events are produced to show people what they are running

• This can help motivate people to look more into the physics details


Lessons Learned and Future

• It takes a lot of effort to run ATLAS@Home
  – In the interaction with volunteers
    • Some volunteers are extremely competent and knowledgeable and help others
  – Maintaining and improving the system workflow
• The number of running jobs has reached a plateau
  – We are exploring scaling options with CERN IT (Ceph, multiple apache servers etc)
  – Not enough people joining
    • But we deliberately haven’t advertised too much to ramp up slowly
• The major problems are caused by vboxwrapper
• BOINC developers very enthusiastic to help us
  – They give us fixes/new features in days
• We have a few more things to fix before ATLAS@Home can move out of beta
  – New manpower starting now will help greatly
• We want to push ATLAS@home internally inside ATLAS
  – eg now available as part of NICE, to put on CERN administrative PCs

ATLAS@Home potential

• It is not possible to run just any ATLAS job on ATLAS@home
  – See earlier considerations about I/O, unreliability etc
• But ~50% of jobs could feasibly run on this platform
• The high entry barrier may limit general public participation
• Can it replace small Grid sites?
  – For example a CPU-only T3 site or small university cluster
  – Instead of setting up all the Grid infrastructure just install BOINC on the worker nodes
  – Standard Grid accounting in APEL is provided by ARC CE

Thanks

• Thanks to our CERN IT colleagues in LHC@Home for providing the Boinc infrastructure and storage space

• .. and please join us! http://atlasathome.cern.ch