A Year of HTCondor at the RAL Tier-1
Ian Collier, Andrew Lahiff
STFC Rutherford Appleton Laboratory
HEPiX Spring 2014 Workshop
Outline
• Overview of HTCondor at RAL
• Computing elements
• Multi-core jobs
• Monitoring
Introduction
• RAL is a Tier-1 for all 4 LHC experiments
  – In terms of Tier-1 computing requirements, RAL provides
    • 2% ALICE
    • 13% ATLAS
    • 8% CMS
    • 32% LHCb
  – Also support ~12 non-LHC experiments, including non-HEP
• Computing resources
  – 784 worker nodes, over 14K cores
  – Generally have 40-60K jobs submitted per day
• Torque / Maui had been used for many years
  – Many issues
  – Severity & number of problems increased as the size of the farm increased
  – In 2012 decided it was time to start investigating moving to a new batch system
Choosing a new batch system
• Considered, tested & eventually rejected the following
  – LSF, Univa Grid Engine*
    • Requirement: avoid commercial products unless absolutely necessary
  – Open source Grid Engines
    • Competing products; not clear which has a long-term future
    • Communities appear less active than HTCondor & SLURM
    • Existing Tier-1s running Grid Engine use the commercial version
  – Torque 4 / Maui
    • Maui problematic
    • Torque 4 seems less scalable than alternatives (but better than Torque 2)
  – SLURM
    • Carried out extensive testing & comparison with HTCondor
    • Found that for our use case
      – Very fragile, easy to break
      – Unable to get it working reliably above 6000 running jobs

* Only tested open source Grid Engine, not Univa Grid Engine
Choosing a new batch system
• HTCondor chosen as replacement for Torque / Maui
  – Has the features we require
  – Seems very stable
  – Easily able to run 16,000 simultaneous jobs
    • Didn't do any tuning, it "just worked"
    • Have since tested > 30,000 running jobs
  – More customizable than the other batch systems we evaluated
Migration to HTCondor
• Strategy
  – Start with a small test pool
  – Gain experience & slowly move resources from Torque / Maui
• Migration timeline
  – 2012 Aug: Started evaluating alternatives to Torque / Maui (LSF, Grid Engine, Torque 4, HTCondor, SLURM)
  – 2013 Jun: Began testing HTCondor with ATLAS & CMS (~1000 cores from old WNs, beyond MoU commitments)
  – 2013 Aug: Choice of HTCondor approved by management
  – 2013 Sep: HTCondor declared a production service; moved 50% of pledged CPU resources to HTCondor
  – 2013 Nov: Migrated remaining resources to HTCondor
Experience so far
• Experience
  – Very stable operation
    • Generally just ignore the batch system & everything works fine
    • Staff don't need to spend all their time fire-fighting problems
  – No changes needed as the HTCondor pool increased in size from ~1,000 to ~14,000 cores
  – Job start rate much higher than Torque / Maui, even when throttled
    • Farm utilization much better
  – Very good support
Problems
• A few issues found, but fixed quickly by the developers
  – Job submission hung when one of an HA pair of central managers was down
    • Fixed & released in 8.0.2
  – Problem affecting HTCondor-G job submission to ARC with HTCondor as the LRMS
    • Fixed & released in 8.0.5
  – Jobs died 2 hours after a network break between CEs and WNs
    • Fixed & released in 8.1.4
Computing elements

• All job submission to RAL is via the Grid
  – No local users
• Currently have 5 CEs
  – 2 CREAM CEs
  – 3 ARC CEs
• CREAM doesn't currently support HTCondor
  – We developed the missing functionality ourselves
  – Will feed this back so that it can be included in an official release
• ARC is better
  – But didn't originally handle partitionable slots, passing CPU/memory requirements to HTCondor, …
  – We wrote lots of patches, all included in the recent 4.1.0 release
    • Will make it easier for more European sites to move to HTCondor
Computing elements

• ARC CE experience
  – Have run almost 9 million jobs so far across our 3 ARC CEs
  – Generally ignore them and they "just work"
  – VOs
    • ATLAS & CMS: fine from the beginning
    • LHCb: added the ability to submit to ARC CEs to DIRAC
      – Seem ready to move entirely to ARC
    • ALICE: not yet able to submit to ARC
      – They have said they will work on this
    • Non-LHC VOs
      – Some use DIRAC, which can now submit to ARC
      – Others use EMI WMS, which can submit to ARC
• CREAM CE status
  – Plan to phase out the CREAM CEs this year
HTCondor & ARC in the UK
• Since the RAL Tier-1 migrated, other sites in the UK have started moving to HTCondor and/or ARC
  – RAL T2: HTCondor + ARC (in production)
  – Bristol: HTCondor + ARC (in production)
  – Oxford: HTCondor + ARC (small pool in production, migration in progress)
  – Durham: SLURM + ARC
  – Glasgow: testing HTCondor + ARC
  – Liverpool: testing HTCondor
• 7 more sites are considering moving to HTCondor or SLURM
• Configuration management: community effort
  – The Tier-2s using HTCondor and ARC have been sharing Puppet modules
Multi-core jobs
• Current situation
  – ATLAS have been running multi-core jobs at RAL since November
  – CMS started submitting multi-core jobs in early May
  – Interest so far is only in multi-core jobs, not whole-node jobs
    • Only 8-core jobs
• Our aims
  – Fully dynamic
    • No manual partitioning of resources
  – Number of running multi-core jobs determined by fairshares
Getting multi-core jobs to work
• Job submission
  – Haven't set up dedicated multi-core queues
  – The VO requests how many cores it wants in its JDL, e.g. (count=8)
• Worker nodes configured to use partitionable slots
  – Resources of each WN (CPUs, memory, …) are divided up as necessary amongst jobs (see the sketch below)
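A minimal sketch of the two pieces involved, using the standard HTCondor partitionable-slot knobs and a toy xRSL fragment; the executable name and values are illustrative assumptions, not RAL's exact configuration:

    # condor_config on the worker node: one partitionable slot owning all
    # resources, carved up dynamically as jobs arrive
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = 100%
    SLOT_TYPE_1_PARTITIONABLE = True

    # xRSL fragment submitted to the ARC CE: an 8-core job
    &(executable="run_multicore.sh")
     (count=8)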
• Set up multi-core groups & associated fairshares (illustrative configuration below)
  – HTCondor configured to assign multi-core jobs to the appropriate groups
  – Adjusted the order in which the negotiator considers groups
    • Consider multi-core groups before single-core groups
    • 8 free cores are "expensive" to obtain, so try not to lose them to single-core jobs too quickly
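One illustrative way to express this with hierarchical group quotas on the central manager; the group names and shares below are made up, and jobs are assumed to arrive with an AccountingGroup already set (e.g. by the CE submit scripts, based on RequestCpus):

    GROUP_NAMES = group_ATLAS, group_ATLAS_multicore, group_CMS, group_CMS_multicore
    # dynamic quotas, i.e. fairshares as fractions of the pool (illustrative values)
    GROUP_QUOTA_DYNAMIC_group_ATLAS = 0.30
    GROUP_QUOTA_DYNAMIC_group_ATLAS_multicore = 0.10
    GROUP_QUOTA_DYNAMIC_group_CMS = 0.20
    GROUP_QUOTA_DYNAMIC_group_CMS_multicore = 0.05
    GROUP_ACCEPT_SURPLUS = True
    # negotiate multi-core groups first, then single-core groups,
    # with the catch-all "<none>" group last
    GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3, ifThenElse(regexp("_multicore", AccountingGroup), 1, 2))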
Getting multi-core jobs to work
• If lots of single-core jobs are idle & running, how does a multi-core job ever start?
  – By default it probably won't
• condor_defrag daemon
  – Finds WNs to drain, triggers draining & cancels draining as required
  – Configuration changes from the defaults:
    • Drain down to 8 free cores only, not whole WNs
    • Pick WNs to drain based on how many cores they have that can be freed up
      – E.g. getting 8 free CPUs by draining a full 32-core WN is generally faster than draining a full 8-core WN
  – Demand for multi-core jobs is not known to condor_defrag
    • Set up a simple cron to adjust the number of concurrently draining WNs based on demand (sketched below)
      – If many multi-core jobs are idle but few are running, drain aggressively
      – Otherwise, very little draining
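A hedged sketch of the corresponding condor_defrag settings and the demand-driven cron; the knobs are standard condor_defrag configuration, but the values, thresholds and script are illustrative assumptions rather than RAL's exact setup:

    # condor_config on the machine running the defrag daemon
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    # a WN counts as "defragmented enough" once 8 cores are free,
    # rather than waiting for the whole machine to drain
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
    # prefer draining WNs with more cores, since they free up 8 cores sooner
    DEFRAG_RANK = TotalCpus
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
    DEFRAG_MAX_CONCURRENT_DRAINING = 4

    #!/bin/bash
    # illustrative cron script run where the defrag daemon lives:
    # drain aggressively only when there is real multi-core demand
    idle=$(condor_q -global -constraint 'RequestCpus == 8 && JobStatus == 1' -af ClusterId | wc -l)
    running=$(condor_q -global -constraint 'RequestCpus == 8 && JobStatus == 2' -af ClusterId | wc -l)
    if [ "$idle" -gt 100 ] && [ "$running" -lt 50 ]; then
        condor_config_val -rset "DEFRAG_MAX_CONCURRENT_DRAINING = 60"
    else
        condor_config_val -rset "DEFRAG_MAX_CONCURRENT_DRAINING = 4"
    fi
    condor_reconfig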
Results
• Effect of changing the way WNs to drain are selected
  – No change in the number of concurrently draining machines
  – Rate of increase in the number of running multi-core jobs much higher
Results

• Recent ATLAS activity (plots)
  – Running multi-core jobs
  – Running & idle multi-core jobs: gaps in submission by ATLAS result in loss of multi-core slots
  – Number of WNs running multi-core jobs & draining WNs: CPU wastage significantly reduced by the cron
    • Aggressive draining: 3% waste
    • Less-aggressive draining: < 1% waste
Worker node health check
• Startd cron (configuration sketch below)
  – Checks for problems on worker nodes
    • Disk full or read-only
    • CVMFS
    • Swap
    • …
  – Prevents jobs from starting in the event of problems
    • If there is a problem with ATLAS CVMFS, only ATLAS jobs are prevented from starting
  – Information about problems made available in the machine ClassAds
    • Can easily identify WNs with problems, e.g.

      # condor_status -constraint 'NODE_STATUS =!= "All_OK"' -autoformat Machine NODE_STATUS
      lcg0980.gridpp.rl.ac.uk  Problem: CVMFS for alice.cern.ch
      lcg0981.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch Problem: CVMFS for lhcb.cern.ch
      lcg1069.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
      lcg1070.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
      lcg1197.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
      lcg1675.gridpp.rl.ac.uk  Problem: Swap in use, less than 25% free
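A minimal sketch of how such a check can be wired up as a startd cron; the script path, attribute names and START expression are illustrative assumptions, not RAL's actual configuration:

    # condor_config on the worker node
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WNHEALTH
    # hypothetical health-check script, run every 10 minutes
    STARTD_CRON_WNHEALTH_EXECUTABLE = /usr/local/bin/healthcheck_wn
    STARTD_CRON_WNHEALTH_PERIOD = 10m
    # the script prints machine ClassAd attributes on stdout, e.g.
    #   NODE_STATUS = "Problem: CVMFS for cms.cern.ch"
    #   NODE_IS_HEALTHY = False
    # and jobs are only matched when the node looks healthy
    START = $(START) && (NODE_IS_HEALTHY =?= True || NODE_IS_HEALTHY =?= UNDEFINED)
    # a per-VO variant would publish attributes such as NODE_IS_HEALTHY_atlas
    # and test the one matching the job's VO in the START expression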
Worker node health check
• Can also put this data into ganglia
  – RAL tests new CVMFS releases
    • Therefore it's important for us to detect increases in CVMFS problems
  – Generally have only small numbers of WNs with issues
  – Example: a user's "problematic" jobs affected CVMFS on many WNs
Jobs monitoring
• The CASTOR team at RAL have been testing Elasticsearch
  – Why not try using it with HTCondor?
• ELK stack
  – Logstash: parses log files
  – Elasticsearch: search & analyze data in real time
  – Kibana: data visualization
• Hardware setup
  – Test cluster of 13 servers (old disk servers & worker nodes)
    • But 3 servers could handle 16 GB of CASTOR logs per day
• Adding HTCondor
  – Wrote a config file for Logstash so that the HTCondor history files can be parsed (illustrative sketch below)
  – Added Logstash to the machines running schedds
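A simplified sketch of what such a Logstash configuration might look like; the history file path, Elasticsearch host and filter details are illustrative assumptions (the real location depends on the schedd's HISTORY setting, and option names vary between Logstash versions):

    input {
      file {
        # hypothetical history file location on the schedd machine
        path => "/var/lib/condor/spool/history"
        start_position => "beginning"
        # each job record is a multi-line ClassAd terminated by a "*** Offset = ..."
        # line, so records need to be grouped into single events
        # (e.g. with the multiline codec; details omitted here)
      }
    }
    filter {
      # turn the "Attribute = value" lines of the ClassAd into fields
      kv {
        field_split => "\n"
        value_split => "="
      }
    }
    output {
      elasticsearch {
        host => "elasticsearch.example.org"
      }
    }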
Data flow: HTCondor history files → Logstash → Elasticsearch → Kibana
Jobs monitoring
• Can see full job ClassAds
Jobs monitoring

• Custom plots
  – E.g. completed jobs by schedd
• Custom dashboards

Jobs monitoring
• Benefits
  – Easy to set up
    • Took less than a day to set up the initial cluster
  – Seems to be able to handle the load from HTCondor
    • For us (so far): < 1 GB, < 100K documents per day
  – Arbitrary queries
    • Seem faster than using the native HTCondor commands (condor_history)
  – Horizontal scalability
    • Need more capacity? Just add more nodes
Summary
• Due to scalability problems with Torque/Maui, migrated to HTCondor last year
• We are happy with the choice we made based on our requirements
  – Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future
• Multi-core jobs working well
  – Looking forward to ATLAS and CMS running multi-core jobs at the same time
Future plans
• HTCondor
  – Phase in cgroups on the WNs (e.g. the sketch below)
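A minimal sketch of the relevant worker-node settings, using the standard HTCondor cgroup knobs with illustrative values:

    # condor_config on the worker node
    # place each job's processes in a dedicated cgroup under this base
    BASE_CGROUP = htcondor
    # treat the job's requested memory as a soft limit rather than a hard one
    CGROUP_MEMORY_LIMIT_POLICY = soft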
• Integration with the private cloud
  – When the production cloud is ready, want to be able to expand the batch system into the cloud
  – Using condor_rooster for provisioning resources (see the sketch below)
    • HEPiX Fall 2013: http://indico.cern.ch/event/214784/session/9/contribution/205
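For background, condor_rooster normally "wakes" hibernating machines advertised via offline ClassAds; one way to use it for cloud provisioning is to advertise potential cloud worker nodes the same way and have rooster instantiate VMs instead. A hedged sketch, where the wake-up script and the CloudResource attribute are hypothetical:

    # condor_config on the machine running the rooster daemon
    DAEMON_LIST = $(DAEMON_LIST) ROOSTER
    ROOSTER_INTERVAL = 300
    # which offline ClassAds are candidates for "waking up"
    # (CloudResource is a hypothetical attribute added to those ads)
    ROOSTER_UNHIBERNATE = Offline =?= True && CloudResource =?= True
    # hypothetical script that starts a virtual worker node in the cloud
    ROOSTER_WAKEUP_CMD = "/usr/local/bin/provision-cloud-wn"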
• Monitoring
  – Move Elasticsearch into production
  – Try sending all HTCondor & ARC CE log files to Elasticsearch
    • E.g. could easily find information about a particular job from any log file
Future plans
• Ceph
  – Have set up a 1.8 PB test Ceph storage system
  – Accessible from some WNs using CephFS
  – Setting up an ARC CE with a shared filesystem (Ceph)
• ATLAS testing with arcControlTower
  – Pulls jobs from PanDA, pushes jobs to ARC CEs
  – Unlike the normal pilot concept, jobs can have more precise resource requirements specified
  – Input files pre-staged & cached by ARC on Ceph
Thank you!