Developing a computational infrastructure
for parallel high performance FE/FVM simulations
Dr. Stan Tomov
Brookhaven National Laboratory
August 11, 2003
Outline
• Motivation and overview
• Mesh generation
• Mesh partitioning and load balancing
• Code optimization
• Parallel FE/FVM using pthreads/OpenMP/MPI
• Code organization and data structures
• Applications
• Visualization
• Extensions and future work
• Conclusions
Motivation
• Technological advances facilitate research requiring very large scale computations
• High computing power is needed in many FE/FVM simulations (fluid flow & transport in porous media, heat & mass transfer, elasticity, etc.)
  – higher demand for simulation accuracy ⇒ higher demand for computing power
• To meet the demand for high computational power:
  – the use of sequential machines is often insufficient (physical limitations of both system memory and processing speed) ⇒ use parallel machines
  – develop better algorithms
    • accuracy and reliability of the computational method
    • efficient use of the available computing resources
Closely related to:
  – error control and adaptive mesh refinement
  – optimization
Parallel HP FE/FVM simulations (issues)
• Choose the solver: direct or iterative
  – sparse matrices, storage considerations, parallelization, preconditioners
• How to parallelize:
  – extract parallelism from the sequential algorithm, or
  – develop algorithms with enhanced parallelism
    • domain decomposition data distribution
• Mesh generation
  – importance of finding a "good" mesh
  – in parallel, adaptive!
• Data structures to maintain
  – preconditioners
Overview
[Overview diagram: components built on MPI, OpenMP, pthreads, and OpenGL]
Mesh generation
• Importance and requirements
• Sequential generators
  – Triangle (2D triangular meshes)
  – Netgen (3D tetrahedral meshes)
• ParaGrid
  – based on sequential generators
  – adaptively refines a starting mesh in parallel
  – provides data structures suitable for domain decomposition and multilevel type preconditioners
Mesh refinement
Mesh partitioning
• Mesh partitioners
  – Metis (University of Minnesota)
  – Chaco (Sandia National Laboratories)
  – Jostle (University of Greenwich, London)
• Requirements
  – balance of elements and minimum interface (a Metis call is sketched below)
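The slide names the partitioners without showing their use; as a hedged illustration, the sketch below partitions a mesh through its element dual graph (CSR arrays) so that elements are balanced and the cut, i.e. the subdomain interface, is minimized. It assumes the METIS 5.x C API (the 2003-era Metis 4 interface differs), and the function and array names are illustrative, not ParaGrid's.

  #include <metis.h>
  #include <vector>
  #include <cstdio>

  // Split the element dual graph of a mesh (CSR form: xadj/adjncy) into
  // nparts pieces, balancing the elements and minimizing the edge cut.
  std::vector<idx_t> partition_mesh(std::vector<idx_t>& xadj,
                                    std::vector<idx_t>& adjncy,
                                    idx_t nparts)
  {
    idx_t nvtxs  = static_cast<idx_t>(xadj.size()) - 1;  // number of elements
    idx_t ncon   = 1;                                    // one balance constraint
    idx_t objval = 0;                                    // resulting edge cut
    std::vector<idx_t> part(nvtxs);

    int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                                     nullptr, nullptr, nullptr,   // no weights
                                     &nparts, nullptr, nullptr,   // default targets
                                     nullptr,                     // default options
                                     &objval, part.data());
    if (status != METIS_OK)
      std::fprintf(stderr, "METIS partitioning failed\n");
    return part;                      // part[e] = subdomain of element e
  }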
Load balancing (in AMR)
• For steady state problems
  – Algorithm 1: locally adapt the mesh (sequentially); split using Metis; refine uniformly in parallel
  – Algorithm 2: use error estimates as weights in splitting the mesh; do parallel AMR
• For transient problems
  – Algorithm 3: ParMetis is used to check the load balance and, if needed, "transfer" elements between sub-domains (a load-balance check is sketched below)
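The slide does not show how the load balance is checked; below is a minimal sketch, not ParaGrid's code, of the kind of test Algorithm 3 could apply before deciding to repartition (e.g. with ParMetis). The function name, tolerance, and criterion (maximum vs. average element count per subdomain) are assumptions.

  #include <mpi.h>

  // Returns true if some subdomain holds more than tol times the average
  // number of elements, i.e. the mesh should be repartitioned/rebalanced.
  bool needs_rebalancing(int my_num_elements, double tol = 1.05)
  {
    int max_elems = 0, sum_elems = 0, nprocs = 1;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Allreduce(&my_num_elements, &max_elems, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    MPI_Allreduce(&my_num_elements, &sum_elems, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    double avg = double(sum_elems) / nprocs;
    return max_elems > tol * avg;     // imbalance exceeds the tolerance
  }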
Code optimization
• Main concepts:
  – locality of reference (to improve memory performance)
  – software pipelining (to improve CPU performance)

Locality of reference (or "keep things used together close together"):
• Due to memory hierarchies
  – disk, network → RAM (200 CP) → cache levels (L2: 6 CP, L1: 3 CP) → registers (0 CP)
  – data for an SGI Origin 2000, MIPS R10000 at 250 MHz (CP = clock periods)
• Techniques (for cache-friendly algorithms in numerical analysis); see the loop-interchange sketch below
  – Loop interchange: for i, j, k = 0..100 do A[i][j][k] += B[i][j][k]*C[i][j][k] is 10x faster than the k, j, i loop order
  – Vertex reordering: for example the Cuthill-McKee algorithm (CG example 1.16x faster)
  – Blocking: related to the domain decomposition data distribution
  – Fusion: merge multiple loops into one, for example the vector operations in CG, GMRES, etc., to improve reuse
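A minimal sketch of the loop-interchange idea from the slide: C/C++ arrays are row-major, so making k the innermost index walks memory with unit stride. The 10x figure is the slide's measurement on the R10000, not something this sketch claims to reproduce.

  const int N = 100;

  // Cache-friendly order: k innermost, so A, B, C are traversed contiguously.
  void update_fast(double (*A)[N][N], double (*B)[N][N], double (*C)[N][N])
  {
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
          A[i][j][k] += B[i][j][k] * C[i][j][k];
  }

  // Cache-unfriendly order: i innermost, so consecutive iterations touch
  // elements N*N doubles apart and almost every access misses the cache.
  void update_slow(double (*A)[N][N], double (*B)[N][N], double (*C)[N][N])
  {
    for (int k = 0; k < N; ++k)
      for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
          A[i][j][k] += B[i][j][k] * C[i][j][k];
  }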
Code optimization
• Performance monitoring & benchmarking:
  – importance (in code optimization)
  – on SGI we use ssrun, prof, and perfex
  – SGI's pmchart to monitor cluster network traffic

Software pipelining:
• Machine dependent: applicable if the CPU functional units are pipelined
• Can be turned on with compiler options
  – computing A[i][j][k] += B[i][j][k]*C[i][j][k], i, j, k = 0..100 with SWP increased performance 100x
• Techniques to improve SWP: inlining, splitting/fusing, loop unrolling (a manual unrolling sketch follows)
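The slide names the techniques without showing them; below is a hedged sketch of manual loop unrolling on a dot product, the kind of vector kernel used in CG. Unrolling by four exposes independent multiply-adds that a software-pipelining compiler can keep in flight; the function name and unroll factor are illustrative, not taken from ParaGrid.

  // Dot product unrolled by 4: the four partial sums are independent of one
  // another, so several multiply-add operations can overlap in the pipeline.
  double dot_unrolled(const double* x, const double* y, int n)
  {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
      s0 += x[i]     * y[i];
      s1 += x[i + 1] * y[i + 1];
      s2 += x[i + 2] * y[i + 2];
      s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)                // remainder loop
      s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
  }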
Parallel FE/FVM with pthreads
• Pthreads are portable and simple
• Used on shared memory parallel systems
• Low level parallel programming
• The user has to create the more complicated parallel constructs
  – not widely used in parallel FE/FVM simulations
• We use them on HP systems that are both distributed memory parallel and shared memory parallel
extern pthread_mutex_t mlock;
extern pthread_cond_t  sync_wait;
extern int barrier_counter;        /* initialized to number_of_threads - 1 */
extern int number_of_threads;

/* Condition-variable barrier used to synchronize the peer threads. */
void pthread_barrier()
{
  pthread_mutex_lock(&mlock);
  if (barrier_counter) {
    barrier_counter--;
    pthread_cond_wait(&sync_wait, &mlock);
  } else {
    /* last thread to arrive: reset the counter and release the waiters
       (broadcast so that every waiting thread wakes up) */
    barrier_counter = number_of_threads - 1;
    pthread_cond_broadcast(&sync_wait);
  }
  pthread_mutex_unlock(&mlock);
}
• We use:
  (1) "peer model" parallelism (threads working concurrently)
  (2) the "main thread" deals with MPI communications
  (a thread-creation sketch is given below)
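A minimal sketch of the peer model described above, not ParaGrid's actual driver: the main thread spawns one peer per CPU, each peer computes on its subdomain and meets the others at the barrier shown above. NUM_THREADS and subdomain_solve are hypothetical names introduced for the illustration.

  #include <pthread.h>
  #include <cstdio>

  #define NUM_THREADS 2               // e.g. one peer thread per CPU of a dual node

  // Globals used by the barrier listed above (declared there as extern).
  pthread_mutex_t mlock     = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t  sync_wait = PTHREAD_COND_INITIALIZER;
  int number_of_threads     = NUM_THREADS;
  int barrier_counter       = NUM_THREADS - 1;

  void pthread_barrier();             // the barrier function listed above

  // Hypothetical per-subdomain computation (a stub for illustration).
  void subdomain_solve(int rank) { std::printf("thread %d working\n", rank); }

  static void* peer(void* arg)
  {
    int rank = *(int*)arg;
    subdomain_solve(rank);            // peers compute concurrently
    pthread_barrier();                // synchronize before the next phase
    return nullptr;
  }

  int main()
  {
    pthread_t tid[NUM_THREADS];
    int rank[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i) {
      rank[i] = i;
      pthread_create(&tid[i], nullptr, peer, &rank[i]);
    }
    for (int i = 0; i < NUM_THREADS; ++i)
      pthread_join(tid[i], nullptr);  // the main thread would also drive the MPI
    return 0;                         // communication in the mixed MPI+pthreads model
  }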
Parallel FE/FVM with OpenMP
• OpenMP is a portable and simple set of compiler directives and functions for parallel shared memory programming
• Higher level parallel programming
• Implementations are often based on pthreads
• Iterative solvers scale well
• Used, as pthreads, on mixed distributed and shared memory parallel systems
• On NUMA architectures arrays need to be properly distributed among the processors:
  – #pragma distribute, #pragma redistribute
  – #pragma distribute_reshape
• We use
  – the domain decomposition data distribution
  – a programming model similar to MPI
  – model: one parallel region (see the code fragment and sketch below)
Table 3. Parallel CG on a problem of size 1024x1024 (speedup for # threads).

  Machine              |   2  |   4  |   8
  ---------------------+------+------+------
  SGI Power Challenge  | 2.04 | 4.00 | 7.64
  SGI Origin 2000      | 1.76 | 3.48 | 6.11
… // sequential initialization
#pragma omp parallel
{
  int myrank = omp_get_thread_num();
  // distribution using the "first touch" rule
  S[myrank] = new Subdomain(myrank, …);
  …
}
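A minimal, self-contained sketch of the "one parallel region" model above, not ParaGrid's code: each thread initializes ("first touches") and then works on its own slice of two vectors, so on a NUMA machine the pages end up near the thread that uses them. The array size and the dot-product kernel are made up for illustration.

  #include <omp.h>
  #include <cstdio>

  int main()
  {
    const int n = 1 << 20;
    // Allocate without initializing, so the pages are first touched inside
    // the parallel region by the thread that will later use them.
    double* x = new double[n];
    double* y = new double[n];
    double dot = 0.0;

    #pragma omp parallel reduction(+ : dot)       // one parallel region
    {
      int myrank   = omp_get_thread_num();
      int nthreads = omp_get_num_threads();
      int lo = (int)((long long)n * myrank       / nthreads);
      int hi = (int)((long long)n * (myrank + 1) / nthreads);

      for (int i = lo; i < hi; ++i) {             // first touch: pages placed
        x[i] = 1.0;                               // near the touching thread
        y[i] = 2.0;
      }

      #pragma omp barrier                         // all slices initialized

      for (int i = lo; i < hi; ++i)               // local part of a dot product,
        dot += x[i] * y[i];                       // combined by the reduction
    }

    std::printf("dot = %f\n", dot);
    delete[] x;
    delete[] y;
    return 0;
  }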
Parallel FE/FVM with MPI
• MPI is a system of functions for parallel distributed memory programming
• Parallel processes communicate by sending and receiving messages
• Domain decomposition data distribution approach
• Usually 6 or 7 functions are used (a dot-product sketch is given below):
  – MPI_Allreduce: in computing dot products
  – MPI_Isend and MPI_Recv: in computing matrix-vector products
  – MPI_Barrier: many uses
  – MPI_Bcast: to broadcast sequential input
  – MPI_Comm_rank, MPI_Comm_size
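A hedged illustration of the MPI_Allreduce bullet above, not ParaGrid's code: with the domain decomposition data distribution each process owns only its local entries of a vector, so a global dot product is the local dot product followed by a sum reduction. The ownership convention (no duplicated entries in the local sums) is an assumption of the sketch; MPI is presumed initialized by the caller.

  #include <mpi.h>

  // Global dot product under domain decomposition: each process owns n_local
  // entries of x and y; the partial sums are combined with MPI_Allreduce so
  // that every process gets the same global value (needed inside CG/GMRES).
  double parallel_dot(const double* x, const double* y, int n_local)
  {
    double local = 0.0;
    for (int i = 0; i < n_local; ++i)
      local += x[i] * y[i];

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
  }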
Mixed implementations
• MPI & pthreads/OpenMP in a cluster environment
  – Example: parallel CG on (1) a problem of size 314,163, on (2) a commodity-based cluster (4 nodes, each with 2 Pentium III CPUs running at 1 GHz, 100Mbit or 1Gbit network)

Table 1. MPI implementation scalability over the two networks.

  # cluster nodes x #CPUs | Time (s), 100Mbit / 1Gbit | Speedup, 100Mbit / 1Gbit
  ------------------------+---------------------------+-------------------------
  1 x 1                   | 427.89                    | ---
  2 x 1                   | 223.36 / 224.49           | 1.92 / 1.90
  4 x 1                   | 115.13 / 112.77           | 3.72 / 3.79
  4 x 2                   |  82.67 /  77.11           | 5.17 / 5.55

Table 2. MPI implementation scalability vs. mixed (pthreads on the dual processors).

  # nodes x #CPUs | Pure MPI: time (s) / speedup | MPI & pthreads: time (s) / speedup
  ----------------+------------------------------+-----------------------------------
  1 x 2           | 290.43 / ---                 | 295.36 / ---
  2 x 2           | 168.46 / 1.72                | 148.15 / 1.99
  4 x 2           |  82.67 / 3.51                |  75.14 / 3.93
ParaGrid code organization
ParaGrid data structures
• Connections between the different subdomains
  – in terms of packets
  – a vertex packet is the set of all vertices shared by the same subdomains
  – the subdomains sharing a packet have:
    • their own copy of the packet
    • "pointers" to the packet copies in the other subdomains
    • only one subdomain is the owner of the packet
  – similarly for edges and faces, used in:
    • refinement
    • problems with degrees of freedom on edges or faces
(a hypothetical packet structure is sketched below)
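To make the packet idea concrete, here is a hypothetical C++ sketch of what a vertex packet might hold; the type and field names are invented for illustration and are not ParaGrid's actual data structures.

  #include <vector>

  // One vertex packet: the vertices shared by exactly the same set of
  // subdomains, stored once per sharing subdomain.
  struct VertexPacket {
    std::vector<int> vertices;      // local indices of the shared vertices
    std::vector<int> subdomains;    // ranks of the subdomains sharing them
    std::vector<int> remote_ids;    // "pointers": packet ids on the other subdomains
    int owner;                      // rank of the unique owning subdomain
  };

  // Each subdomain keeps copies of the packets it participates in; during a
  // matrix-vector product the interface values are exchanged packet by packet
  // with the neighboring subdomains.
  struct Subdomain {
    int rank;
    std::vector<VertexPacket> packets;
  };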
Applications
• Generation of large, sparse linear systems of equations on massively parallel computers
  – generated on the fly, no need to store large meshes or linear systems
  – distributed among the processing nodes
  – used at LLNL to generate test problems for their HYPRE project (scalable software for solving such problems)
• Various FE/FVM discretizations (used at TAMU and LLNL) with applications to:
  – heat and mass transfer
  – linear elasticity
  – flow and transport in porous media
Applications
• A posteriori error control and AMR (at TAMU and BNL)
  – accuracy and reliability of the computational method
  – efficient use of the available computational resources
• Studies in domain decomposition and multigrid preconditioners (at LLNL, TAMU)
• Studies in domain decomposition on non-matching grids (at LLNL and TAMU)
  – interior penalty discontinuous approximations
  – mortar finite element approximations
• Visualization (at LLNL, TAMU, and BNL)
• Benchmarking hardware (at BNL)
  – CPU performance
  – network traffic, etc.
Visualization
• Importance
• Integration of ParaGrid with visualization (not compiled together):
  – save mesh & solution in files for later visualization
  – send mesh & solution directly through sockets for visualization
• GLVis
  – portable, based on OpenGL (also compiled with Mesa)
  – visualizes simple geometric primitives (vertices, lines, and polygons)
  – can be used as a "server": waits for data to be visualized and forks after every data set received (a sketch of this loop is given below)
  – combines parallel input (from ParaGrid) into a sequential visualization
• VTK based visualization
  – added to support volume visualization
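As a hedged illustration of the server behavior described above, not GLVis's actual code: a POSIX socket loop that accepts a connection carrying one data set and forks a child to handle it, so the parent can immediately wait for the next data set. The port number and the visualize handler are invented for the sketch.

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <csignal>

  // Hypothetical handler: would read one mesh + solution from the socket and
  // render it; here it just drains the connection.
  void visualize(int conn)
  {
    char buf[4096];
    while (read(conn, buf, sizeof buf) > 0) { }
  }

  int main()
  {
    std::signal(SIGCHLD, SIG_IGN);              // finished children are reaped automatically

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port        = htons(19916);        // illustrative port number
    bind(listener, (sockaddr*)&addr, sizeof(addr));
    listen(listener, 8);

    for (;;) {                                  // "server": wait for data to be visualized
      int conn = accept(listener, nullptr, nullptr);
      if (conn < 0) continue;
      if (fork() == 0) {                        // fork after every data set received
        close(listener);
        visualize(conn);                        // the child renders this data set
        close(conn);
        _exit(0);
      }
      close(conn);                              // the parent keeps listening
    }
  }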
Visualization: GLVis code structure and features
• Abstract classes
• 2D scalar data visualization
• 2D vector data visualization
• 3D scalar data visualization
• 3D vector data visualization
Extensions and future work
• Extend and use the developed technology with other already existing HPC tools
  – legacy FE/FVM (or just user specific) software
  – interfaces to external solvers (including direct ones) and preconditioners, etc.
• Extend the use to various applications
  – electromagnetics
  – elasticity, etc.
• Tune the code to particular architectures
  – benchmarking and optimization
  – commodity-based clusters
Extensions and future work
• Further develop methods and tools for adaptive error control and mesh refinement
  – time dependent and non-linear problems
  – better study of the constants involved in the estimates
• Visualization
  – user specific
  – GPU as coprocessor?
• Create user-friendly interfaces
Conclusions
A step toward developing a computational infrastructure for parallel HPC
• Domain decomposition framework
  – a fundamental concept/technique for parallel computing with a wide area of applications
  – needed for parallel HPC research in numerical PDEs
• Benefit to computational researchers
  – who require efficient techniques to solve linear systems with millions of unknowns
  – finding a "good" mesh is essential for developing an efficient computational methodology based on FE/FVM