Scalable Molecular Dynamics for Large Biomolecular Systems


1

Scalable Molecular Dynamics for Large Biomolecular Systems

Robert Brunner

James C Phillips

Laxmikant Kale

2

Overview

• Context: approach and methodology
• Molecular dynamics for biomolecules
• Our program NAMD

– Basic Parallelization strategy

• NAMD performance optimizations
– Techniques

– Results

• Conclusions: summary, lessons and future work

3

The context

• Objective: enhance performance and productivity in parallel programming
– For complex, dynamic applications

– Scalable to thousands of processors

• Theme:
– Adaptive techniques for handling dynamic behavior

• Look for an optimal division of labor between the human programmer and the “system”
– Let the programmer specify what to do in parallel

– Let the system decide when and where to run the subcomputations

• Data driven objects as the substrate

4

[Figure: data-driven objects (numbered 1-13) mapped onto processors]

5
Data driven execution

[Figure: each processor runs a scheduler that picks messages from its message queue and invokes the corresponding object]

6

Charm++

• Parallel C++ with data driven objects
• Object arrays and collections
• Asynchronous method invocation
• Object groups:
– global object with a “representative” on each PE
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu
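The following plain C++ sketch is NOT the real Charm++ API; every name in it is hypothetical. It only illustrates the data-driven execution model behind these bullets: each processor keeps a message queue, and a scheduler repeatedly removes a message and invokes the corresponding method on the target object, in whatever order messages become available.

    // Conceptual sketch only -- not the Charm++ API; names are hypothetical.
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <utility>

    struct Message {
        int targetObject;              // which object this message is addressed to
        std::function<void()> method;  // the "entry method" to execute
    };

    class Scheduler {
        std::queue<Message> messageQ;  // one message queue per processor
    public:
        void enqueue(Message m) { messageQ.push(std::move(m)); }
        void run() {                   // data-driven loop: work is triggered by messages
            while (!messageQ.empty()) {
                Message m = std::move(messageQ.front());
                messageQ.pop();
                std::cout << "delivering to object " << m.targetObject << ": ";
                m.method();            // invoke the target object's method
            }
        }
    };

    int main() {
        Scheduler sched;
        // Two "objects" post asynchronous work; the scheduler decides when it runs.
        sched.enqueue({0, [] { std::cout << "compute non-bonded forces\n"; }});
        sched.enqueue({1, [] { std::cout << "integrate positions\n"; }});
        sched.run();
    }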

7

Multi-partition decomposition

8

Load balancing

• Based on migratable objects
• Collect timing data for several cycles
• Run heuristic load balancer

– Several alternative ones

• Re-map and migrate objects accordingly
– Registration mechanisms facilitate migration

9

Measurement based load balancing

• Application-induced imbalances:
– Abrupt, but infrequent, or

– Slow, cumulative

– rarely: frequent, large changes

• Principle of persistence
– Extension of the principle of locality

– The behavior of objects, including computational load and communication patterns, tends to persist over time

• We have implemented strategies that exploit this automatically
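As a rough illustration of the measurement-based idea (the type and the window size below are illustrative, not the actual Charm++ load balancing framework): each migratable object's future load can be predicted from its recently measured timings, relying on the principle of persistence.

    // Hypothetical sketch of measurement-based load prediction.
    #include <initializer_list>
    #include <iostream>
    #include <numeric>
    #include <vector>

    struct ObjectTiming {
        std::vector<double> recentMs;            // measured wall time (ms) per recent step

        void record(double ms) {                 // called after each timestep
            recentMs.push_back(ms);
            if (recentMs.size() > 10)            // keep a sliding window of 10 steps
                recentMs.erase(recentMs.begin());
        }
        double predictedLoad() const {           // persistence: recent average predicts future load
            if (recentMs.empty()) return 0.0;
            return std::accumulate(recentMs.begin(), recentMs.end(), 0.0) / recentMs.size();
        }
    };

    int main() {
        ObjectTiming patch;
        for (double t : {4.1, 4.3, 3.9, 4.2}) patch.record(t);  // timings from a few cycles
        std::cout << "predicted load: " << patch.predictedLoad() << " ms\n";
    }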

10

Molecular Dynamics

11

Molecular dynamics and NAMD

• MD to understand the structure and function of biomolecules
– proteins, DNA, membranes

• NAMD is a production-quality MD program
– Active use by biophysicists (science publications)

– 50,000+ lines of C++ code

– 1000+ registered users

– Features and “accessories” such as

• VMD: visualization

• Biocore: collaboratory

• Steered and Interactive Molecular Dynamics

12

NAMD Contributors

• PIs: Laxmikant Kale, Klaus Schulten, Robert Skeel
• NAMD 1: Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
• NAMD2: M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, …

13

Molecular Dynamics

• Collection of (charged) atoms, with bonds
• Newtonian mechanics
• At each time-step:

– Calculate forces on each atom

• bonded terms
• non-bonded: electrostatic and van der Waals

– Calculate velocities and advance positions

• 1 femtosecond time-step, millions of steps needed!
• Thousands to hundreds of thousands of atoms (1,000 to 100,000)
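A minimal, self-contained C++ sketch of this per-timestep loop, using a generic pairwise electrostatic-like force and plain explicit integration; it is not NAMD's actual force field or integrator, and all constants are illustrative.

    #include <cmath>
    #include <vector>

    struct Vec3 { double x, y, z; };

    struct Atom {
        Vec3 pos{0, 0, 0}, vel{0, 0, 0}, force{0, 0, 0};
        double mass = 1.0;    // illustrative units
        double charge = 0.0;
    };

    // Pairwise Coulomb-like force; physical constants omitted for clarity.
    void computeForces(std::vector<Atom>& atoms) {
        for (auto& a : atoms) a.force = {0, 0, 0};
        for (size_t i = 0; i < atoms.size(); ++i) {
            for (size_t j = i + 1; j < atoms.size(); ++j) {
                Vec3 d{atoms[i].pos.x - atoms[j].pos.x,
                       atoms[i].pos.y - atoms[j].pos.y,
                       atoms[i].pos.z - atoms[j].pos.z};
                double r2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-12;
                double f = atoms[i].charge * atoms[j].charge / (r2 * std::sqrt(r2));
                atoms[i].force.x += f * d.x;  atoms[j].force.x -= f * d.x;
                atoms[i].force.y += f * d.y;  atoms[j].force.y -= f * d.y;
                atoms[i].force.z += f * d.z;  atoms[j].force.z -= f * d.z;
            }
        }
    }

    void step(std::vector<Atom>& atoms, double dt /* e.g. 1 fs */) {
        computeForces(atoms);                 // forces on each atom
        for (auto& a : atoms) {               // then velocities, then positions
            a.vel.x += dt * a.force.x / a.mass;
            a.vel.y += dt * a.force.y / a.mass;
            a.vel.z += dt * a.force.z / a.mass;
            a.pos.x += dt * a.vel.x;
            a.pos.y += dt * a.vel.y;
            a.pos.z += dt * a.vel.z;
        }
    }

    int main() {
        std::vector<Atom> atoms(2);
        atoms[0].charge = 1.0;
        atoms[1].charge = -1.0;
        atoms[1].pos = {3.0, 0.0, 0.0};
        for (int s = 0; s < 1000; ++s) step(atoms, 1e-3);  // many small steps
        return 0;
    }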

14

Cut-off radius

• Use of a cut-off radius to reduce work
– 8 to 14 Å

– Faraway charges ignored!

• 80-95% of the work is non-bonded force computation
• Some simulations need the faraway contributions

– Periodic systems: Ewald, Particle-Mesh Ewald (PME)

– Aperiodic systems: FMA

• Even so, cut-off based computations are important:
– near-atom calculations are part of the above

– multiple time-stepping is used: k cut-off steps, then 1 PME/FMA step (see the sketch below)
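The multiple time-stepping schedule in the last bullet can be sketched as a simple loop; the function names below are placeholders, not NAMD routines, and their bodies are stubs.

    // Multiple time-stepping sketch: cheap short-range (cutoff) forces every step,
    // expensive long-range (PME/FMA) forces only every k-th step.
    void computeCutoffForces()    { /* short-range forces within the 8-14 A cutoff */ }
    void computeLongRangeForces() { /* expensive PME / FMA contribution */ }
    void integrateOneStep()       { /* advance velocities and positions */ }

    void runSimulation(long numSteps, int k /* cutoff-only steps per long-range step */) {
        for (long step = 0; step < numSteps; ++step) {
            computeCutoffForces();
            if (step % k == 0)                // full long-range forces only every k steps
                computeLongRangeForces();
            integrateOneStep();
        }
    }

    int main() { runSimulation(100, 4); }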

15

Scalability

• The program should scale up to use a large number of processors
– But what does that mean?

• An individual simulation isn’t truly scalable
• Better definition of scalability:

– If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size

16

Isoefficiency

• Quantify scalability
– (Work of Vipin Kumar, U. Minnesota)

• How much increase in problem size is needed to retain the same efficiency on a larger machine?

• Efficiency: Sequential Time / (P · Parallel Time)
– Parallel time = computation + communication + idle
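For example, with hypothetical numbers: if the sequential step time is 57 s and 64 processors take 1.2 s per step, efficiency = 57 / (64 × 1.2) ≈ 0.74. Isoefficiency asks how much the problem size must grow so that this ratio is preserved as P increases.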

17

Atom decomposition

• Partition the Atoms array across processors
– Nearby atoms may not be on the same processor

– Communication: O(N) per processor

– Communication/Computation: O(N)/(N/P): O(P)

– Again, not scalable by our definition

18

Force Decomposition

• Distribute the force matrix across processors
– Matrix is sparse, non-uniform

– Each processor has one block

– Communication: O(N/√P) per processor

– Communication/Computation ratio: O(√P)

• Better scalability in practice
– (can use 100+ processors)

– Plimpton:

– Hwang, Saltz, et al.:

• 6% on 32 PEs, 36% on 128 processors

– Yet not scalable in the sense defined here!

19

Spatial Decomposition

• Allocate close-by atoms to the same processor
• Three variations possible:

– Partitioning into P boxes, 1 per processor

• Good scalability, but hard to implement

– Partitioning into fixed size boxes, each a little larger than the cutoff distance

– Partitioning into smaller boxes

• Communication: O(N/P)
– so, scalable in principle
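A small sketch (assumed code, not NAMD's) of the second variation above: with patches slightly larger than the cutoff, an atom's patch is found by dividing its coordinates by the patch edge length, so all of its interaction partners lie in the same or a neighboring patch.

    #include <cmath>

    struct PatchIndex { int x, y, z; };

    // Assign an atom at (px, py, pz) to a patch; the margin is illustrative.
    PatchIndex patchOf(double px, double py, double pz,
                       double cutoff /* Angstrom */, double margin = 0.5) {
        const double side = cutoff + margin;            // patch edge a little larger than cutoff
        return { static_cast<int>(std::floor(px / side)),
                 static_cast<int>(std::floor(py / side)),
                 static_cast<int>(std::floor(pz / side)) };
    }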

20

Spatial Decomposition in NAMD

• NAMD 1 used spatial decomposition
• Good theoretical isoefficiency, but load balancing problems for a fixed-size system
• For midsize systems, got good speedups up to 16 processors…
• Use the symmetry of Newton’s 3rd law to facilitate load balancing

21

Spatial Decomposition

But the load balancing problems are still severe:

22

23

FD + SD

• Now, we have many more objects to load balance:
– Each diamond can be assigned to any processor

– Number of diamonds (3D):

• 14·Number of Patches

24

Bond Forces

• Multiple types of forces:
– Bonds (2), Angles (3), Dihedrals (4), …

– Luckily, each involves atoms in neighboring patches only

• Straightforward implementation:
– Send message to all neighbors,

– receive forces from them

– 26 × 2 messages per patch!

25

Bonded Forces

• Assume one patch per processor:

– an angle force involving atoms in patches:

• (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)

• is calculated in patch: (max{xi}, max{yi}, max{zi})

[Figure: an angle force spanning atoms in patches A, B, and C]
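A small sketch of the ownership rule stated above (the PatchIndex type is illustrative): taking the coordinate-wise maximum of the three patch indices lets every patch decide locally, and unambiguously, which one computes the angle force.

    #include <algorithm>

    struct PatchIndex { int x, y, z; };

    // The bonded term is computed on the patch with the coordinate-wise maximum index.
    PatchIndex owningPatch(const PatchIndex& a, const PatchIndex& b, const PatchIndex& c) {
        return { std::max({a.x, b.x, c.x}),
                 std::max({a.y, b.y, c.y}),
                 std::max({a.z, b.z, c.z}) };
    }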

26

Implementation

• Multiple objects per processor
– Different types: patches, pairwise forces, bonded forces, …

– Each may have its data ready at different times

– Need ability to map and remap them

– Need prioritized scheduling

• Charm++ supports all of these

27

Load Balancing

• A major challenge for this application
– especially for a large number of processors

• Unpredictable workloads
– Each diamond (force object) and patch encapsulates a variable amount of work

– Static estimates are inaccurate

• Measurement-based Load Balancing Framework
– Robert Brunner’s recent Ph.D. thesis

– Very slow variations across timesteps

28

Bipartite graph balancing

• Background load:
– Patches (integration, …) and bond-related forces

• Migratable load:
– Non-bonded forces

• Bipartite communication graph
– between migratable and non-migratable objects

• Challenge:
– Balance load while minimizing communication

29

Load balancing strategy

Greedy variant (simplified):

    Sort compute objects (diamonds) by load, heaviest first
    Repeat until all are assigned:
        S = set of all processors that
            -- are not overloaded, and
            -- would generate the least new communication
        P = least loaded processor in S
        Assign the heaviest unassigned compute to P

Refinement:

    Repeat:
        Pick a compute from the most overloaded PE
        Assign it to a suitable underloaded PE
    Until no movement

[Figure: bipartite graph between cells (patches) and compute objects]
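Below is a compilable C++ sketch of the greedy phase; the data structures are hypothetical and far simpler than the real Charm++ load balancing framework. Computes are placed heaviest-first on the least-loaded processor among those that are not overloaded and would add the least new communication.

    #include <algorithm>
    #include <initializer_list>
    #include <limits>
    #include <vector>

    struct Compute  { double load; int patchA, patchB; };       // a "diamond"
    struct Processor { double load = 0; std::vector<int> patches; };

    // How many of the compute's patches this processor does not already talk to.
    static double newCommunication(const Processor& p, const Compute& c) {
        int missing = 0;
        for (int patch : {c.patchA, c.patchB})
            if (std::find(p.patches.begin(), p.patches.end(), patch) == p.patches.end())
                ++missing;
        return missing;
    }

    void greedyAssign(std::vector<Compute> computes, std::vector<Processor>& procs,
                      double overloadThreshold) {
        std::sort(computes.begin(), computes.end(),             // heaviest first
                  [](const Compute& a, const Compute& b) { return a.load > b.load; });
        for (const Compute& c : computes) {
            // S = non-overloaded processors generating the least new communication.
            double bestComm = std::numeric_limits<double>::max();
            for (const Processor& p : procs)
                if (p.load < overloadThreshold)
                    bestComm = std::min(bestComm, newCommunication(p, c));
            // Among S, pick the least-loaded processor.
            Processor* best = nullptr;
            for (Processor& p : procs)
                if (p.load < overloadThreshold && newCommunication(p, c) == bestComm)
                    if (!best || p.load < best->load) best = &p;
            if (!best)  // everyone overloaded: fall back to the globally least-loaded PE
                best = &*std::min_element(procs.begin(), procs.end(),
                       [](const Processor& a, const Processor& b) { return a.load < b.load; });
            best->load += c.load;                               // assign compute to chosen PE
            best->patches.push_back(c.patchA);
            best->patches.push_back(c.patchB);
        }
    }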

30

[Charts (two slides): per-processor time split into migratable and non-migratable work, before and after load balancing]

32

Initial Speedup Results: ASCI Red

[Chart: speedup on ASCI Red for the Apo-A1 benchmark vs. number of processors (axis up to 1800)]

33

BC1 complex: 200k atoms

34

Optimizations

• Series of optimizations
• Examples to be covered here:

– Grainsize distributions (bimodal)

– Integration: message sending overheads

35

Grainsize and Amdahl’s law

• A variant of Amdahl’s law, for objects, would be:
– The fastest time can be no shorter than the time for the biggest single object!

• How did it apply to us?
– Sequential step time was 57 seconds

– To run on 2k processors, no object should take more than 28 msecs

• Should be even shorter

– Grainsize analysis via Projections showed that this was not the case…

36

Grainsize analysis

[Chart: grainsize distribution before splitting; number of objects vs. grainsize in milliseconds, with a problematic tail of objects above 40 ms]

Problem: the distribution has a long tail of large compute objects.

Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms (see the sketch below).
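A hedged sketch of such a splitting heuristic; the threshold and the ComputeObject type are illustrative, not NAMD's actual data structures. An object covering too many atoms is divided into several smaller ones so that no single object exceeds the target grainsize.

    #include <algorithm>
    #include <vector>

    struct ComputeObject { int patchA, patchB, firstAtom, numAtoms; };

    std::vector<ComputeObject> maybeSplit(const ComputeObject& c, int maxAtomsPerObject) {
        std::vector<ComputeObject> parts;
        if (c.numAtoms <= maxAtomsPerObject) { parts.push_back(c); return parts; }
        // Split the atom range into roughly equal chunks below the threshold.
        int pieces = (c.numAtoms + maxAtomsPerObject - 1) / maxAtomsPerObject;
        int chunk  = (c.numAtoms + pieces - 1) / pieces;
        for (int start = 0; start < c.numAtoms; start += chunk) {
            int n = std::min(chunk, c.numAtoms - start);
            parts.push_back({c.patchA, c.patchB, c.firstAtom + start, n});
        }
        return parts;
    }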

37

Grainsize reduced

[Chart: grainsize distribution after splitting; number of objects vs. grainsize in milliseconds, with the largest objects now around 25 ms or less]

38

Performance audit

39

Performance audit

• Throughout the optimization process, an audit was kept to decide where to look for further improvements

    Component      Ideal    Actual
    Total          57.04    86
    nonBonded      52.44    49.77
    Bonds           3.16     3.9
    Integration     1.44     3.05
    Overhead        0        7.97
    Imbalance       0       10.45
    Idle            0        9.25
    Receives        0        1.61

Integration time doubled

40

Integration overhead analysis

[Projections timeline highlighting the integration phase]

Problem: integration time had doubled compared with the sequential run

41

Integration overhead example:

• The Projections views showed that the overhead was associated with sending messages
• Many cells were sending 30-40 messages
– The overhead was too large compared with the actual cost of the messages

– Code analysis: memory allocations!

– An identical message was being sent to 30+ processors

• Simple multicast support was added to Charm++
– It mainly eliminates memory allocations (and some copying); see the sketch below
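Conceptually (this is not the actual Charm++ multicast interface, and the types below are hypothetical), the change replaces one allocation and copy per destination with a single message handed to the runtime together with the destination list.

    #include <memory>
    #include <vector>

    struct CoordinateMessage { std::vector<double> coords; };

    // Before: one allocation + copy per destination (30-40 per patch).
    void sendToEach(const CoordinateMessage& m, const std::vector<int>& destinations) {
        for (int pe : destinations) {
            auto copy = std::make_unique<CoordinateMessage>(m);  // per-destination copy
            // ... hand 'copy' to the messaging layer for processor 'pe' ...
            (void)pe; (void)copy;
        }
    }

    // After: build the message once and multicast it; the runtime handles the fan-out.
    void multicast(std::shared_ptr<const CoordinateMessage> m,
                   const std::vector<int>& destinations) {
        // ... hand the single shared message plus the destination list to the runtime ...
        (void)m; (void)destinations;
    }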

42

Integration overhead: After multicast

43

Improved Performance Data

[Chart: speedup on ASCI Red vs. number of processors (axis up to 2500)]

45

Results on Linux Cluster

[Chart: speedup on the Linux cluster vs. number of processors (axis up to 120)]

46

Performance of Apo-A1 on Asci Red

[Chart: speedup vs. number of processors (axis up to 2500)]

47

Performance of Apo-A1 on O2k and T3E

[Chart: speedup on the Origin 2000 and Cray T3E vs. number of processors (axis up to 300)]

48

Lessons learned

• Need to downsize objects!
– Choose the smallest possible grainsize that amortizes overhead

• One of the biggest challenges was getting time for performance-tuning runs on parallel machines

49

Future and Planned work

• Speedup on small molecules!
– Interactive molecular dynamics

• Increased speedups on 2k-10k processors
– Smaller grainsizes

– New algorithms for reducing communication impact

– New load balancing strategies

• Further performance improvements for PME/FMA
– With multiple timestepping

– Needs multi-phase load balancing

50

Steered MD: example picture

Image and simulation by the Theoretical Biophysics Group, Beckman Institute, UIUC

51

More information

• Charm++ and associated framework:
– http://charm.cs.uiuc.edu

• NAMD and associated biophysics tools:
– http://www.ks.uiuc.edu

• Both include downloadable software

52

Performance: size of system

    Procs                     1      2      4      8      16     32     64     128    160
    bR (3,762 atoms)
      Time                   1.14   0.58   0.315  0.158  0.086  0.048
      Speedup                1.0    1.97   3.61   7.20   13.2   23.7
    ER-ERE (36,573 atoms)
      Time                          6.115  3.099  1.598  0.810  0.397  0.212  0.123  0.098
      Speedup                       (1.97) 3.89   7.54   14.9   30.3   56.8   97.9   123
    ApoA-I (92,224 atoms)
      Time                                 10.76  5.46   2.85   1.47   0.729  0.382  0.321
      Speedup                              (3.88) 7.64   14.7   28.4   57.3   109    130

Performance data on Cray T3E

53

Performance: various machines

    Procs                 1      2      4      8      16     32     64     128    160    192
    T3E
      Time                       6.12   3.10   1.60   0.810  0.397  0.212  0.123  0.098
      Speedup                    (1.97) 3.89   7.54   14.9   30.3   56.8   97.9   123
    Origin 2000
      Time               8.28   4.20   2.17   1.07   0.542  0.271  0.152
      Speedup            1.0    1.96   3.80   7.74   15.3   30.5   54.3
    ASCI Red
      Time               28.0   13.9   7.24   3.76   1.91   1.01   0.500  0.279  0.227  0.196
      Speedup            1.0    2.01   3.87   7.45   14.7   27.9   56.0   100    123    143
    NOWs (HP 735/125)
      Time               24.1   12.4   6.39   3.69
      Speedup            1.0    1.94   3.77   6.54
