
Maria Grazia Pia

Simulation in a Distributed Computing Environment

S. Guatelli (1), A. Mantero (1), P. Mendez Lorenzo (2), J. Moscicki (2), M.G. Pia (1)

(1) INFN Genova, Italy
(2) CERN, Geneva, Switzerland

CHEP 2006, Mumbai, 13-17 February 2006


Speed of Monte Carlo simulation

Speed of execution is often a concern in Monte Carlo simulation

Often a trade-off between precision of the simulation and speed of execution

Methods for faster simulation response
– Fast simulation
– Variance reduction techniques (event biasing)
– Inverse Monte Carlo methods
– Parallelisation

Typical use cases

Semi-interactive response
– Detector design
– Optimisation
– Oncological radiotherapy

Very long execution time
– High statistics simulation
– High precision simulation


Features of this study

Geant4 application in a distributed computing environment
– Architecture
– Implications on simulation applications

Environments
– PC farm
– GRID

Two use cases: Geant4 Advanced Examples
– semi-interactive response (brachytherapy)
– high statistics (medical_linac)

By-product: results for Geant4 medical applications (technology transfer)

Quantitative study
– results to be submitted for publication


Requirements

Architectural requirements
– Transparent execution in sequential/parallel mode
– Transparent execution on a PC farm and on the Grid

Semi-interactive simulation: Geant4 brachytherapy
– Execution time for 20 M events: 5 hours
– Goal: execution time ~ few minutes

High statistics simulation: Geant4 medical_linac
– Execution time for 10^9 events: ~10 days
– Goal: execution time ~ few hours

Reference: sequential mode on a Pentium IV, 3 GHz


Parallel mode: local cluster / GRID

Both applications have the same computing model
– a job consists of a number of independent tasks which may be executed in parallel
– the result of each task is a small data packet (few kb), which is merged as the job runs

In a cluster
– computing resources are used for parallel execution
– user connects to a possibly remote cluster
– input data for the job must be available on the site
– typically there is a shared file system and a queuing system
– network is fast

GRID computing uses resources from multiple computing centres
– typically there is no shared file system
– (parts of) input data must be replicated in remote sites
– network connection is slower than within a cluster


Overview

Architectural issues
– DIANE
– How to dianize a Geant4 application

Performance tests
– On a single CPU
– On clusters
– On the GRID

Conclusions
– Lessons learned
– Outlook

Quantitative, documented results

Publicly distributed:
– DIANE
– Geant4 application code


DIANE

R&D project
– started in 2001 in CERN/IT with very limited resources
– collaboration with Geant4 groups at CERN, INFN, ESA
– successful prototypes running on LSF and EDG

Parallel cluster processing
– make fine tuning and customisation easy
– transparently using GRID technology
– application independent

Developed by J. Moscicki, CERN/IT

http://cern.ch/DIANE

Master-Worker architectural pattern

Prototype for an intermediate layer between applications and the GRID

Hide complex details of underlying technology


Practical example: Geant4 simulation with analysis

Each task produces a file with histograms

The job result is the sum of histograms produced by tasks

Master-worker model
– client starts a job
– workers perform tasks and produce histograms
– master integrates the results

Distributed Processing for Geant4 Applications
– task = N events
– job = M tasks
– tasks may be executed in parallel
– tasks produce histograms/ntuples
– task output is automatically combined (add histograms, append ntuples)

Master-Worker Model
– Master steers the execution of the job, automatically splits the job and merges the results
– Worker initializes the Geant4 application and executes macros
– Client gets the results
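The master-worker model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DIANE code: a thread pool stands in for the workers, a random number stands in for a simulated event, and the names `run_task`, `merge` and `run_job` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
import random

N_BINS = 10  # illustrative histogram size


def run_task(task_id, n_events, base_seed=12345):
    """Worker side: process n_events and return a small histogram (bin counts)."""
    rng = random.Random(base_seed + task_id)  # distinct, reproducible seed per task
    histogram = [0] * N_BINS
    for _ in range(n_events):
        # Stand-in for one simulated event: a value in [0, 1) filled into a bin.
        histogram[int(rng.random() * N_BINS)] += 1
    return histogram


def merge(total, partial):
    """Master side: add a task histogram to the job histogram, bin by bin."""
    return [a + b for a, b in zip(total, partial)]


def run_job(n_tasks, events_per_task, n_workers):
    """Master: split the job into tasks, dispatch them, merge results in task order."""
    total = [0] * N_BINS
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(run_task, range(n_tasks), [events_per_task] * n_tasks)
        for partial in partials:
            total = merge(total, partial)
    return total


if __name__ == "__main__":
    result = run_job(n_tasks=18, events_per_task=1000, n_workers=4)
    print(result, sum(result))
```

Because each task carries its own seed, the merged histogram does not depend on how many workers execute the tasks, which is what makes sequential and parallel runs comparable.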


UML Deployment Diagram for Geant4 applications

Completely transparent to the user: same Geant4 application code

G4Simulation class is responsible for managing the simulation
– manage random number seeds
– Geant4 initialisation
– macros to be executed in batch mode
– termination

Simulation with DIANE
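The four responsibilities listed for the G4Simulation class suggest an interface of roughly the following shape. The sketch below is hypothetical: the class `SimulationSketch`, its method names and the macro name `run.mac` are assumptions made for illustration and are not the actual DIANE or Geant4 API.

```python
class SimulationSketch:
    """Hypothetical stand-in for an interface class; not the real G4Simulation."""

    def __init__(self):
        self.seed = None
        self.initialised = False
        self.log = []

    def set_seed(self, seed):
        # Each task gets its own seed so that parallel tasks are independent.
        self.seed = seed
        self.log.append(f"seed {seed}")

    def initialise(self):
        # Placeholder for the Geant4 initialisation step.
        self.initialised = True
        self.log.append("initialise")

    def execute_macro(self, macro_name):
        # Placeholder for running a macro in batch mode.
        if not self.initialised:
            raise RuntimeError("initialise() must be called before execute_macro()")
        self.log.append(f"macro {macro_name}")

    def finish(self):
        # Placeholder for termination (closing output, releasing resources).
        self.initialised = False
        self.log.append("finish")


def run_task(sim, task_id, macro_name, base_seed=1000):
    """One task expressed through the four responsibilities, in order."""
    sim.set_seed(base_seed + task_id)
    sim.initialise()
    sim.execute_macro(macro_name)
    sim.finish()
    return sim.log


if __name__ == "__main__":
    print(run_task(SimulationSketch(), task_id=3, macro_name="run.mac"))
```

Keeping these steps behind one small class is one way the original application code could stay unchanged while a framework drives it.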


Development costs

Strategy to minimise the cost of migrating a Geant4 simulation to a distributed environment

DIANE Active Workflow framework
– provides automatic communication/synchronization mechanisms
– application is “glued” to the framework using a small Python module
– in most cases no code changes to the original application are required
– load balancing and error recovery policies may be plugged in, in the form of simple Python functions

Transparent adaptation for Clusters/GRIDs, shared/local file systems, shared/private queues

Development/modification of application code
– original source code unmodified
– addition of an interface class which binds together application and M-W framework

The application developer is shielded from the complexity of underlying technology via DIANE


Test results

Performance of the execution of the dianized Brachytherapy example
– Test on a single CPU
– Test on a dedicated farm (60 CPUs)
– Test on a farm shared with other users (LSF, CERN)
– Test on the GRID (LCG)

Tools and libraries
– Simulation toolkit: Geant4 7.0.p01
– Analysis tools: AIDA 3.2.1 and PI 1.3.3
– DIANE: DIANE 1.4.2
– CLHEP: 1.9.1.2
– G4EMLOW 2.3


Overhead at initialisation/termination

Test on a single dedicated CPU (Intel® Pentium IV, 3.00 GHz)

Study execution via DIANE w.r.t. sequential execution
– run 1 event

Standalone application: 4.6 ± 0.2 s

Application via DIANE, simulation only: 8.8 ± 0.8 s

Application via DIANE, with analysis integration: 9.5 ± 0.5 s

Overhead: ~ 5 s, negligible in a high statistics job
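The quoted overhead of about 5 s follows from the differences between the timings above. The short sketch below redoes that arithmetic; it assumes the second number in each measurement is an uncertainty and that the uncertainties are independent, so they are combined in quadrature. Neither assumption is stated on the slide.

```python
import math

# (value, uncertainty) in seconds, as quoted for the 1-event runs
standalone = (4.6, 0.2)
diane_sim_only = (8.8, 0.8)
diane_with_analysis = (9.5, 0.5)


def overhead(with_diane, reference):
    """Difference of two timings; uncertainties added in quadrature (assumption)."""
    value = with_diane[0] - reference[0]
    error = math.sqrt(with_diane[1] ** 2 + reference[1] ** 2)
    return value, error


for label, timing in [
    ("simulation only", diane_sim_only),
    ("with analysis integration", diane_with_analysis),
]:
    value, error = overhead(timing, standalone)
    print(f"{label}: overhead = {value:.1f} +/- {error:.1f} s")
```

Under these assumptions the overhead comes out at roughly 4.2 ± 0.8 s for simulation only and 4.9 ± 0.5 s with analysis integration, consistent with the "~ 5 s" quoted.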


Overhead due to DIANE with respect to the number of events

Test on a single dedicated CPU (Intel® Pentium IV, 3.00 GHz)

Study execution via DIANE w.r.t. sequential execution

$$\mathrm{Ratio} = \frac{\mathrm{ExecutionTime}_{\mathrm{DIANE}}}{\mathrm{ExecutionTime}_{\mathrm{SEQUENTIAL\ MODE}}}$$

[Figure: execution time vs. number of events in the job]

The overhead of DIANE is negligible in high statistics jobs


Farm: execution time and efficiency

Dedicated farm: 30 identical bi-processors (Pentium IV, 3 GHz)
– Thanks to Regional Operation Centre (ROC) Team, Taiwan
– Thanks to Hurng-Chun Lee (Academia Sinica Grid Computing Center, Taiwan)

Load balancing: optimisation of the number of tasks and workers

$$\mathrm{Efficiency} = \frac{\mathrm{ExecutionTime}_{\mathrm{sequential}}}{n \cdot \mathrm{ExecutionTime}_{\mathrm{parallel}}}$$


Optimizing the number of tasks

The job ends when all the tasks are executed in the workers

If the job is split into a higher number of tasks, the chance that the workers finish their tasks at the same time is higher

Note: the overall time of the job is determined by the last worker to finish the last task

[Figures: example of a good job balancing; example of a job that can be improved from a performance point of view. Axes: worker number, time (seconds)]
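The effect of the number of tasks on the overall job time can be explored with a toy model. The sketch below is not DIANE's scheduler: it assumes that each free worker simply takes the next task from a common queue, that task durations vary randomly around a mean, and that an optional fixed cost is paid per task. All names and numbers are illustrative except 16646 s, the sequential time quoted elsewhere in this talk.

```python
import heapq
import random


def makespan(task_times, n_workers, per_task_overhead=0.0):
    """Job time when each free worker pulls the next task from a common queue."""
    finish_times = [0.0] * n_workers  # min-heap: when each worker becomes free
    heapq.heapify(finish_times)
    for t in task_times:
        free_at = heapq.heappop(finish_times)  # earliest available worker
        heapq.heappush(finish_times, free_at + t + per_task_overhead)
    return max(finish_times)  # the last worker to finish ends the job


def split_job(total_work, n_tasks, spread, rng):
    """Split total_work into n_tasks of uneven size that still sum to total_work."""
    weights = [rng.uniform(1.0 - spread, 1.0 + spread) for _ in range(n_tasks)]
    scale = total_work / sum(weights)
    return [w * scale for w in weights]


if __name__ == "__main__":
    rng = random.Random(1)
    total_work, n_workers = 16646.0, 30  # worker count chosen for illustration
    for n_tasks in (30, 60, 180, 360):
        times = split_job(total_work, n_tasks, spread=0.5, rng=rng)
        t = makespan(times, n_workers)
        eff = total_work / (n_workers * t)
        print(f"{n_tasks:4d} tasks: job time {t:8.1f} s, efficiency {eff:.3f}")
```

In this model a larger number of smaller tasks typically brings the job time closer to the ideal `total_work / n_workers`, while a non-zero `per_task_overhead` pulls in the opposite direction, which is the trade-off behind the choice of the number of tasks.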


Farm shared with other users

Preliminary!

Real-life case: farm shared with other users

Execution in parallel mode on 5 workers of CERN LSF

DIANE used as intermediate layer

The load of the cluster changes quickly in time

The conditions of the test are not reproducible

Highly variable performance


Parallel execution in a PC farm

Required production of Brachytherapy: 20 M events

20 M events in sequential mode: 16646 s (~ 4 h and 38 minutes) on an Intel® Pentium IV, 3.00 GHz

The same simulation runs in 5 minutes in parallel on 56 CPUs
– appropriate for clinical usage

Similar results for Geant4 medical_linac Advanced Example
– production can become compatible with usage for the verification of IMRT treatment planning
– sequential execution requires ~ 10 days to obtain significant results
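The Brachytherapy figures above can be turned into a speed-up and an efficiency, using the farm definition Efficiency = ExecutionTime_sequential / (n · ExecutionTime_parallel). The sketch below takes n as the number of CPUs and the parallel time as exactly 300 s; both are assumptions, since "5 minutes" is a rounded figure, so the result is indicative only.

```python
def speedup(t_sequential, t_parallel):
    """How many times faster the parallel run is than the sequential one."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_workers):
    """Speed-up divided by the number of workers (1.0 would be ideal scaling)."""
    return t_sequential / (n_workers * t_parallel)


t_seq = 16646.0   # s, 20 M events in sequential mode
t_par = 5 * 60.0  # s, "5 minutes" taken literally (assumption)
n_cpus = 56

print(f"speed-up   = {speedup(t_seq, t_par):.1f}")
print(f"efficiency = {efficiency(t_seq, t_par, n_cpus):.2f}")
```

With these inputs the speed-up is about 55 on 56 CPUs, that is an efficiency close to 1, subject to the rounding of the parallel time.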


Running on the Grid (LCG)

G4Brachy executed on the GRID (LCG)
– nodes located in Spain, Russia, Italy, Germany, Switzerland

Conditions of the test
– The load of the GRID changes quickly in time
– The conditions of the test are not reproducible

Efficiency
– The evaluation of the efficiency with the same criterion as in a dedicated farm does not make much sense in this context
– Study the “efficiency” of DIANE as automated job management w.r.t. manual submission through simple scripts


Test results

[Figures: execution on the GRID through DIANE (20 M events, 180 tasks, 30 workers); execution on the GRID without DIANE]

Without DIANE:
– 2 jobs not successfully executed due to set-up problems of the workers

Through DIANE:
– All the tasks are executed successfully on 22 workers
– Not all the workers are initialized and used: on-going investigation

[Equation: ratio of ExecutionTime in PARALLEL MODE with DIANE and without DIANE, quoted value 3.0; the orientation of the fraction is not recoverable from the extracted text]

[Plot axes: worker number, time (seconds)]


How the GRID load changes

Execution time of Brachytherapy in two different conditions of the GRID

DIANE used as intermediate layer

20 M events, 60 workers initialized, 360 tasks

[Figures: two plots, one per GRID condition. Axes: worker number, time (seconds)]

Very different result!


Farm/GRID execution

Brachy, 20 M events, 180 tasks

Taipei cluster: 29 machines, 734 s ~ 12 minutes

GRID: 27 machines, 1517 s ~ 25 minutes

Preliminary indication

The conditions are not reproducible


Lessons learned

DIANE as intermediate layer
– Transparency
– Good separation of the subsystems
– Good management of CPU resources
– Negligible overhead

Load balancing
– A relatively large number of tasks increases the efficiency of parallel execution in a farm
– Trade-off between optimisation of task splitting and overhead introduced

Controlled and real-life situations are quite different in a farm
– need dedicated farm for critical usage (e.g. hospital)

Grid
– highly variable environment
– not mature yet for critical usage
– automated management through a smart system is mandatory
– work in progress, details still to be understood quantitatively


Outlook

Work in progress
– A quantitative analysis of all the performance results is still on-going

Generalize job splitting optimization

Better characterize the performance on the Grid quantitatively

Improve DIANE

To be submitted for publication in IEEE Trans. Nucl. Sci.


Conclusions

General approach to the execution of Geant4 simulation in a distributed computing environment
– transparent sequential/parallel application
– transparent execution on a local farm or on the Grid
– user code is the same

Quantitative, documented results
– reference for users and for further improvement
– on-going work to understand details

Acknowledgments to:
– M. Lamanna (CERN), Hurng-Chun Lee (ASGC, Taiwan), L. Moneta (CERN), A. Pfeiffer (CERN)
– the LCG teams at CERN and the Regional Operation Centre Team of Taiwan
– no support from INFN GRID team


IEEE Transactions on Nuclear Science

http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?puNumber=23

Prime journal on technology in particle/nuclear physics

Review process reorganized about one year ago

Associate Editor dedicated to computing papers

Various papers associated to CHEP 2004 published in IEEE TNS

Papers associated to CHEP 2006 are welcome

Manuscript submission: http://tns-ieee.manuscriptcentral.com/

Papers submitted for publication will be subject to the regular review process

Publications in refereed journals are beneficial not only to authors, but to the whole community of computing-oriented physicists

Our “hardware colleagues” have better established publication habits…

Further info: [email protected]
