Developing a computational infrastructure
for parallel high performance FE/FVM simulations
Dr. Stan Tomov
Brookhaven National Laboratory
August 11, 2003
Outline
• Motivation and overview
• Mesh generation
• Mesh partitioning and load balancing
• Code optimization
• Parallel FE/FVM using pthreads/OpenMP/MPI
• Code organization and data structures
• Applications
• Visualization
• Extensions and future work
• Conclusions
Motivation
• Technological advances facilitate research requiring very large scale computations
• High computing power is needed in many FE/FVM simulations (fluid flow & transport in porous media, heat & mass transfer, elasticity, etc.)
  – higher demand for simulation accuracy ⇒ higher demand for computing power
• To meet the demand for high computational power:
  – the use of sequential machines is often insufficient (physical limitations of both system memory and processing speed) ⇒ use parallel machines
  – develop better algorithms
    • accuracy and reliability of the computational method
    • efficient use of the available computing resources
Closely related to:
  – error control and adaptive mesh refinement
  – optimization
Parallel HP FE/FVM simulations (issues)
• Choose the solver: direct or iterative
  – sparse matrices, storage considerations, parallelization, preconditioners
• How to parallelize:
  – extract parallelism from the sequential algorithm, or
  – develop algorithms with enhanced parallelism
    • domain decomposition data distribution
• Mesh generation
  – importance of finding a "good" mesh
  – in parallel, adaptive!
• Data structures to maintain
  – preconditioners
Overview
[Overview diagram: components built on MPI, OpenMP, pthreads, and OpenGL]
Mesh generation
• Importance and requirements
• Sequential generators
  – Triangle (2D triangular meshes)
  – Netgen (3D tetrahedral meshes)
• ParaGrid
  – based on sequential generators
  – adaptively refines a starting mesh in parallel
  – provides data structures suitable for domain decomposition and multilevel type preconditioners
Mesh refinement
Mesh partitioning
• Mesh partitioners
  – Metis (University of Minnesota)
  – Chaco (Sandia National Laboratories)
  – Jostle (University of Greenwich, London)
• Requirements
  – balance of elements and minimum interface (a Metis call is sketched below)
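The slide names the partitioners without showing their use; as a hedged illustration, the sketch below partitions a mesh through its element dual graph (CSR arrays) so that elements are balanced and the cut, i.e. the subdomain interface, is minimized. It assumes the METIS 5.x C API (the 2003-era Metis 4 interface differs), and the function and array names are illustrative, not ParaGrid's.

  #include <metis.h>
  #include <vector>
  #include <cstdio>

  // Split the element dual graph of a mesh (CSR form: xadj/adjncy) into
  // nparts pieces, balancing the elements and minimizing the edge cut.
  std::vector<idx_t> partition_mesh(std::vector<idx_t>& xadj,
                                    std::vector<idx_t>& adjncy,
                                    idx_t nparts)
  {
    idx_t nvtxs  = static_cast<idx_t>(xadj.size()) - 1;  // number of elements
    idx_t ncon   = 1;                                    // one balance constraint
    idx_t objval = 0;                                    // resulting edge cut
    std::vector<idx_t> part(nvtxs);

    int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                                     nullptr, nullptr, nullptr,   // no weights
                                     &nparts, nullptr, nullptr,   // default targets
                                     nullptr,                     // default options
                                     &objval, part.data());
    if (status != METIS_OK)
      std::fprintf(stderr, "METIS partitioning failed\n");
    return part;                      // part[e] = subdomain of element e
  }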
Load balancing (in AMR)
• For steady state problems
  – Algorithm 1: locally adapt the mesh (sequentially); split using Metis; refine uniformly in parallel
  – Algorithm 2: use error estimates as weights in splitting the mesh; do parallel AMR
• For transient problems
  – Algorithm 3: ParMetis is used to check the load balance and, if needed, "transfer" elements between sub-domains (a load-balance check is sketched below)
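The slide does not show how the load balance is checked; below is a minimal sketch, not ParaGrid's code, of the kind of test Algorithm 3 could apply before deciding to repartition (e.g. with ParMetis). The function name, tolerance, and criterion (maximum vs. average element count per subdomain) are assumptions.

  #include <mpi.h>

  // Returns true if some subdomain holds more than tol times the average
  // number of elements, i.e. the mesh should be repartitioned/rebalanced.
  bool needs_rebalancing(int my_num_elements, double tol = 1.05)
  {
    int max_elems = 0, sum_elems = 0, nprocs = 1;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Allreduce(&my_num_elements, &max_elems, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    MPI_Allreduce(&my_num_elements, &sum_elems, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    double avg = double(sum_elems) / nprocs;
    return max_elems > tol * avg;     // imbalance exceeds the tolerance
  }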
Code optimization
• Main concepts:
  – locality of reference (to improve memory performance)
  – software pipelining (to improve CPU performance)

Locality of reference (or "keep things used together close together"):
• Due to memory hierarchies
  – disk, network → RAM (200 CP) → cache levels (L2: 6 CP, L1: 3 CP) → registers (0 CP)
  – data for an SGI Origin 2000, MIPS R10000 at 250 MHz (CP = clock periods)
• Techniques (for cache-friendly algorithms in numerical analysis); see the loop-interchange sketch below
  – Loop interchange: for i, j, k = 0..100 do A[i][j][k] += B[i][j][k]*C[i][j][k] is 10x faster than the k, j, i loop order
  – Vertex reordering: for example the Cuthill-McKee algorithm (CG example 1.16x faster)
  – Blocking: related to the domain decomposition data distribution
  – Fusion: merge multiple loops into one, for example the vector operations in CG, GMRES, etc., to improve reuse
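A minimal sketch of the loop-interchange idea from the slide: C/C++ arrays are row-major, so making k the innermost index walks memory with unit stride. The 10x figure is the slide's measurement on the R10000, not something this sketch claims to reproduce.

  const int N = 100;

  // Cache-friendly order: k innermost, so A, B, C are traversed contiguously.
  void update_fast(double (*A)[N][N], double (*B)[N][N], double (*C)[N][N])
  {
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
          A[i][j][k] += B[i][j][k] * C[i][j][k];
  }

  // Cache-unfriendly order: i innermost, so consecutive iterations touch
  // elements N*N doubles apart and almost every access misses the cache.
  void update_slow(double (*A)[N][N], double (*B)[N][N], double (*C)[N][N])
  {
    for (int k = 0; k < N; ++k)
      for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
          A[i][j][k] += B[i][j][k] * C[i][j][k];
  }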
Code optimization
• Performance monitoring & benchmarking:
  – importance (in code optimization)
  – on SGI we use ssrun, prof, and perfex
  – SGI's pmchart to monitor cluster network traffic

Software pipelining:
• Machine dependent: applicable if the CPU functional units are pipelined
• Can be turned on with compiler options
  – computing A[i][j][k] += B[i][j][k]*C[i][j][k], i, j, k = 0..100 with SWP increased performance 100x
• Techniques to improve SWP: inlining, splitting/fusing, loop unrolling (a manual unrolling sketch follows)
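The slide names the techniques without showing them; below is a hedged sketch of manual loop unrolling on a dot product, the kind of vector kernel used in CG. Unrolling by four exposes independent multiply-adds that a software-pipelining compiler can keep in flight; the function name and unroll factor are illustrative, not taken from ParaGrid.

  // Dot product unrolled by 4: the four partial sums are independent of one
  // another, so several multiply-add operations can overlap in the pipeline.
  double dot_unrolled(const double* x, const double* y, int n)
  {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
      s0 += x[i]     * y[i];
      s1 += x[i + 1] * y[i + 1];
      s2 += x[i + 2] * y[i + 2];
      s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)                // remainder loop
      s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
  }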
Parallel FE/FVM with pthreads
• Pthreads are portable and simple
• Used on shared memory parallel systems
• Low level parallel programming
• The user has to create the more complicated parallel constructs
  – not widely used in parallel FE/FVM simulations
• We use them on HP systems that are both distributed memory parallel and shared memory parallel
extern pthread_mutex_t mlock;
extern pthread_cond_t  sync_wait;
extern int barrier_counter;        /* initialized to number_of_threads - 1 */
extern int number_of_threads;

/* Condition-variable barrier used to synchronize the peer threads. */
void pthread_barrier()
{
  pthread_mutex_lock(&mlock);
  if (barrier_counter) {
    barrier_counter--;
    pthread_cond_wait(&sync_wait, &mlock);
  } else {
    /* last thread to arrive: reset the counter and release the waiters
       (broadcast so that every waiting thread wakes up) */
    barrier_counter = number_of_threads - 1;
    pthread_cond_broadcast(&sync_wait);
  }
  pthread_mutex_unlock(&mlock);
}
• We use:
  (1) "peer model" parallelism (threads working concurrently)
  (2) the "main thread" deals with MPI communications
  (a thread-creation sketch is given below)
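A minimal sketch of the peer model described above, not ParaGrid's actual driver: the main thread spawns one peer per CPU, each peer computes on its subdomain and meets the others at the barrier shown above. NUM_THREADS and subdomain_solve are hypothetical names introduced for the illustration.

  #include <pthread.h>
  #include <cstdio>

  #define NUM_THREADS 2               // e.g. one peer thread per CPU of a dual node

  // Globals used by the barrier listed above (declared there as extern).
  pthread_mutex_t mlock     = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t  sync_wait = PTHREAD_COND_INITIALIZER;
  int number_of_threads     = NUM_THREADS;
  int barrier_counter       = NUM_THREADS - 1;

  void pthread_barrier();             // the barrier function listed above

  // Hypothetical per-subdomain computation (a stub for illustration).
  void subdomain_solve(int rank) { std::printf("thread %d working\n", rank); }

  static void* peer(void* arg)
  {
    int rank = *(int*)arg;
    subdomain_solve(rank);            // peers compute concurrently
    pthread_barrier();                // synchronize before the next phase
    return nullptr;
  }

  int main()
  {
    pthread_t tid[NUM_THREADS];
    int rank[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i) {
      rank[i] = i;
      pthread_create(&tid[i], nullptr, peer, &rank[i]);
    }
    for (int i = 0; i < NUM_THREADS; ++i)
      pthread_join(tid[i], nullptr);  // the main thread would also drive the MPI
    return 0;                         // communication in the mixed MPI+pthreads model
  }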
Parallel FE/FVM with OpenMP
• OpenMP is a portable and simple set of compiler directives and functions for parallel shared memory programming
• Higher level parallel programming
• Implementations are often based on pthreads
• Iterative solvers scale well
• Used, as pthreads, on mixed distributed and shared memory parallel systems
• On NUMA architectures arrays need to be properly distributed among the processors:
  – #pragma distribute, #pragma redistribute
  – #pragma distribute_reshape
• We use
  – the domain decomposition data distribution
  – a programming model similar to MPI
  – model: one parallel region (see the code fragment and sketch below)
Table 3. Parallel CG on a problem of size 1024x1024 (speedup for # threads).

  Machine              |   2  |   4  |   8
  ---------------------+------+------+------
  SGI Power Challenge  | 2.04 | 4.00 | 7.64
  SGI Origin 2000      | 1.76 | 3.48 | 6.11
… // sequential initialization
#pragma omp parallel
{
  int myrank = omp_get_thread_num();
  // distribution using the "first touch" rule
  S[myrank] = new Subdomain(myrank, …);
  …
}
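A minimal, self-contained sketch of the "one parallel region" model above, not ParaGrid's code: each thread initializes ("first touches") and then works on its own slice of two vectors, so on a NUMA machine the pages end up near the thread that uses them. The array size and the dot-product kernel are made up for illustration.

  #include <omp.h>
  #include <cstdio>

  int main()
  {
    const int n = 1 << 20;
    // Allocate without initializing, so the pages are first touched inside
    // the parallel region by the thread that will later use them.
    double* x = new double[n];
    double* y = new double[n];
    double dot = 0.0;

    #pragma omp parallel reduction(+ : dot)       // one parallel region
    {
      int myrank   = omp_get_thread_num();
      int nthreads = omp_get_num_threads();
      int lo = (int)((long long)n * myrank       / nthreads);
      int hi = (int)((long long)n * (myrank + 1) / nthreads);

      for (int i = lo; i < hi; ++i) {             // first touch: pages placed
        x[i] = 1.0;                               // near the touching thread
        y[i] = 2.0;
      }

      #pragma omp barrier                         // all slices initialized

      for (int i = lo; i < hi; ++i)               // local part of a dot product,
        dot += x[i] * y[i];                       // combined by the reduction
    }

    std::printf("dot = %f\n", dot);
    delete[] x;
    delete[] y;
    return 0;
  }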
Parallel FE/FVM with MPI
• MPI is a system of functions for parallel distributed memory programming
• Parallel processes communicate by sending and receiving messages
• Domain decomposition data distribution approach
• Usually 6 or 7 functions are used (a dot-product sketch is given below):
  – MPI_Allreduce: in computing dot products
  – MPI_Isend and MPI_Recv: in computing matrix-vector products
  – MPI_Barrier: many uses
  – MPI_Bcast: to broadcast sequential input
  – MPI_Comm_rank, MPI_Comm_size
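A hedged illustration of the MPI_Allreduce bullet above, not ParaGrid's code: with the domain decomposition data distribution each process owns only its local entries of a vector, so a global dot product is the local dot product followed by a sum reduction. The ownership convention (no duplicated entries in the local sums) is an assumption of the sketch; MPI is presumed initialized by the caller.

  #include <mpi.h>

  // Global dot product under domain decomposition: each process owns n_local
  // entries of x and y; the partial sums are combined with MPI_Allreduce so
  // that every process gets the same global value (needed inside CG/GMRES).
  double parallel_dot(const double* x, const double* y, int n_local)
  {
    double local = 0.0;
    for (int i = 0; i < n_local; ++i)
      local += x[i] * y[i];

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
  }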
Mixed implementations
• MPI & pthreads/OpenMP in a cluster environment
  – Example: parallel CG on (1) a problem of size 314,163, on (2) a commodity-based cluster (4 nodes, each with 2 Pentium III CPUs running at 1 GHz, 100Mbit or 1Gbit network)

Table 1. MPI implementation scalability over the two networks.

  # cluster nodes x #CPUs | Time (s), 100Mbit / 1Gbit | Speedup, 100Mbit / 1Gbit
  ------------------------+---------------------------+-------------------------
  1 x 1                   | 427.89                    | ---
  2 x 1                   | 223.36 / 224.49           | 1.92 / 1.90
  4 x 1                   | 115.13 / 112.77           | 3.72 / 3.79
  4 x 2                   |  82.67 /  77.11           | 5.17 / 5.55

Table 2. MPI implementation scalability vs. mixed (pthreads on the dual processors).

  # nodes x #CPUs | Pure MPI: time (s) / speedup | MPI & pthreads: time (s) / speedup
  ----------------+------------------------------+-----------------------------------
  1 x 2           | 290.43 / ---                 | 295.36 / ---
  2 x 2           | 168.46 / 1.72                | 148.15 / 1.99
  4 x 2           |  82.67 / 3.51                |  75.14 / 3.93
ParaGrid code organization
ParaGrid data structures
• Connections between the different subdomains
  – in terms of packets
  – a vertex packet is the set of all vertices shared by the same subdomains
  – the subdomains sharing a packet have:
    • their own copy of the packet
    • "pointers" to the packet copies in the other subdomains
    • only one subdomain is the owner of the packet
  – similarly for edges and faces, used in:
    • refinement
    • problems with degrees of freedom on edges or faces
(a hypothetical packet structure is sketched below)
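To make the packet idea concrete, here is a hypothetical C++ sketch of what a vertex packet might hold; the type and field names are invented for illustration and are not ParaGrid's actual data structures.

  #include <vector>

  // One vertex packet: the vertices shared by exactly the same set of
  // subdomains, stored once per sharing subdomain.
  struct VertexPacket {
    std::vector<int> vertices;      // local indices of the shared vertices
    std::vector<int> subdomains;    // ranks of the subdomains sharing them
    std::vector<int> remote_ids;    // "pointers": packet ids on the other subdomains
    int owner;                      // rank of the unique owning subdomain
  };

  // Each subdomain keeps copies of the packets it participates in; during a
  // matrix-vector product the interface values are exchanged packet by packet
  // with the neighboring subdomains.
  struct Subdomain {
    int rank;
    std::vector<VertexPacket> packets;
  };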
Applications
• Generation of large, sparse linear systems of equations on massively parallel computers
  – generated on the fly, no need to store large meshes or linear systems
  – distributed among the processing nodes
  – used at LLNL to generate test problems for their HYPRE project (scalable software for solving such problems)
• Various FE/FVM discretizations (used at TAMU and LLNL) with applications to:
  – heat and mass transfer
  – linear elasticity
  – flow and transport in porous media
Applications
• A posteriori error control and AMR (at TAMU and BNL)
  – accuracy and reliability of the computational method
  – efficient use of the available computational resources
• Studies in domain decomposition and multigrid preconditioners (at LLNL, TAMU)
• Studies in domain decomposition on non-matching grids (at LLNL and TAMU)
  – interior penalty discontinuous approximations
  – mortar finite element approximations
• Visualization (at LLNL, TAMU, and BNL)
• Benchmarking hardware (at BNL)
  – CPU performance
  – network traffic, etc.
Visualization
• Importance
• Integration of ParaGrid with visualization (not compiled together):
  – save mesh & solution in files for later visualization
  – send mesh & solution directly through sockets for visualization
• GLVis
  – portable, based on OpenGL (also compiled with Mesa)
  – visualizes simple geometric primitives (vertices, lines, and polygons)
  – can be used as a "server": waits for data to be visualized and forks after every data set received (a sketch of this loop is given below)
  – combines parallel input (from ParaGrid) into a sequential visualization
• VTK based visualization
  – added to support volume visualization
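As a hedged illustration of the server behavior described above, not GLVis's actual code: a POSIX socket loop that accepts a connection carrying one data set and forks a child to handle it, so the parent can immediately wait for the next data set. The port number and the visualize handler are invented for the sketch.

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <csignal>

  // Hypothetical handler: would read one mesh + solution from the socket and
  // render it; here it just drains the connection.
  void visualize(int conn)
  {
    char buf[4096];
    while (read(conn, buf, sizeof buf) > 0) { }
  }

  int main()
  {
    std::signal(SIGCHLD, SIG_IGN);              // finished children are reaped automatically

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port        = htons(19916);        // illustrative port number
    bind(listener, (sockaddr*)&addr, sizeof(addr));
    listen(listener, 8);

    for (;;) {                                  // "server": wait for data to be visualized
      int conn = accept(listener, nullptr, nullptr);
      if (conn < 0) continue;
      if (fork() == 0) {                        // fork after every data set received
        close(listener);
        visualize(conn);                        // the child renders this data set
        close(conn);
        _exit(0);
      }
      close(conn);                              // the parent keeps listening
    }
  }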
Visualization: GLVis code structure and features
• Abstract classes
• 2D scalar data visualization
• 2D vector data visualization
• 3D scalar data visualization
• 3D vector data visualization
Extensions and future work
• Extend and use the developed technology with other already existing HPC tools
  – legacy FE/FVM (or just user specific) software
  – interfaces to external solvers (including direct ones) and preconditioners, etc.
• Extend the use to various applications
  – electromagnetics
  – elasticity, etc.
• Tune the code to particular architectures
  – benchmarking and optimization
  – commodity-based clusters
Extensions and future work
• Further develop methods and tools for adaptive error control and mesh refinement
  – time dependent and non-linear problems
  – better study of the constants involved in the estimates
• Visualization
  – user specific
  – GPU as coprocessor?
• Create user-friendly interfaces
Conclusions
A step toward developing a computational infrastructure for parallel HPC
• Domain decomposition framework
  – a fundamental concept/technique for parallel computing with a wide area of applications
  – needed for parallel HPC research in numerical PDEs
• Benefit to computational researchers
  – who require efficient techniques to solve linear systems with millions of unknowns
  – finding a "good" mesh is essential for developing an efficient computational methodology based on FE/FVM