OPAL: Open Source Parallel Algorithm Library
Designing High-Performance Algorithms for SMP Clusters
David A. Bader
Electrical & Computer Engineering Department
Albuquerque High Performance Computing Center
University of New Mexico
http://hpc.eece.unm.edu/
15 August 2000, High Performance Algorithms for SMP Clusters, Prof. David A. Bader
High-Performance Applications using SMP Clusters
• Long-term Earth science studies using terascale remotely-sensed global satellite imagery (4 km AVHRR GAC)
• Computational Ecological Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles
• Computational Bioinformatics: Large Scale Phylogeny Reconstruction
Research Collaborators
• Joseph JáJá, University of Maryland
• Bernard Moret, CS (Experimental Algorithmics), University of New Mexico
• Bruce Milne, Biology (Landscape Ecology), University of New Mexico
• Tandy Warnow, CS, University of Texas-Austin
• IBM ACTC Group (David Klepacki, John Levesque, and others)
• Current Graduate Students:
  • Mi Yan, Niranjan Prabhu, Vinila Yarlagadda
• Laboratory Alumni:
  • Kavita Balakavi (Intel), Ajith Illendula (Intel)
Acknowledgment of Support
• NSF CISE Postdoctoral Research Associate in Experimental Computer Science No. 96-25668
• NSF BIO Division of Environmental Biology DEB 99-10123
• Department of Energy Sandia-University New Assistant Professorship Program (SUNAPP) Award AX-3006
• IBM SUR Grant (UNM Vista-Azul Project)
• NPACI/SDSC and NCSA/Alliance
• NSF 00-* Algorithms for Irregular Discrete Computations on SMPs
Outline
• Motivation
• SMP Cluster Programming (SIMPLE)
  • Complexity model
  • Message-Passing
  • Shared-Memory
• OPAL Facets (parallel libraries)
• OPAL Setting (programming framework)
• Example SMP Algorithms
Motivation
• High performance computing has been leveraging COTS workstation technologies
  • Commodity microprocessors
  • High-performance networks
  • Operating system and compiler technology
• Symmetric multiprocessor (SMP)
  • Hardware support for hierarchical memory management
  • Multithreaded operating system kernels
  • Optimizing compilers and runtime systems
SMP Cluster Architectures
• IBM SP (NPACI Blue Horizon 144x8)
• Linux Clusters
• Compaq AlphaServers (PSC/NSF Terascale 682x4)
• Sun Ultra HPC (4x64)
[Figures: LLNL ASCI White, IBM SP (512x16); UNM/Alliance LosLobos, IBM Netfinity (256x2); UNM/Alliance Roadrunner Linux SuperCluster (64x2)]
Message-Passing Performance
[Figure: MPI Bandwidth, IBM Netfinity 4500R (733 MHz) Cluster with Myrinet. x-axis: MPI Message Length (Bytes), 0 to 1.2e+6; y-axis: Bandwidth (MB/s), 0 to 120]
[Figure: MPI Time, IBM Netfinity 4500R (733 MHz) Cluster with Myrinet. x-axis: MPI Message Length (Bytes), 0 to 1.2e+6; y-axis: Time (us), 0 to 16000]
Shared-Memory Performance
• One Sun HPC E10K processor
• Contiguous array; each element read exactly once
• C, X = cyclic read (stride X) of contiguous array
• R = random access of array
[Figure: time per memory read (ns, log scale from 1 ns to 1000 ns) versus log2(Problem Size) from 5 to 26, one curve per access pattern: C,1; C,2; C,4; C,8; C,16; C,32; C,64; R]
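The access patterns in this benchmark can be sketched as index orderings. The slides do not show the benchmark loop, so the following is only one plausible reading of "cyclic read (stride X)" in which every element is still read exactly once; the function names `cyclic_order` and `random_order` are hypothetical.

```python
import random

def cyclic_order(n, stride):
    """Index order for a cyclic read with the given stride (assumed reading).

    Indices 0, stride, 2*stride, ... are visited first, then 1, 1+stride, ...
    so each index in range(n) appears exactly once.
    """
    return [i for start in range(stride) for i in range(start, n, stride)]

def random_order(n, seed=0):
    """Index order for the random-access case: a shuffled permutation."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

if __name__ == "__main__":
    print(cyclic_order(8, 1))   # contiguous sweep
    print(cyclic_order(8, 4))   # stride-4 sweep
    print(random_order(8))
```

With stride 1 the order is a plain contiguous sweep; larger strides touch a new cache line more often, which is the effect the plot measures.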
High Performance Algorithms for SMP Clusters
• “SIMPLE” Model
  • Use a hybrid, natural combination of message-passing and shared-memory
  • Message passing interface between nodes
  • Shared-memory programming (OpenMP, POSIX Threads) on each SMP node
• Methodology for adapting message-passing algorithms for SMP Clusters
• Freely available open source implementation of parallel algorithms, libraries, and programming environment for C/C++/Fortran under the GNU General Public License (GPL)
Optimizing from MPI to SIMPLE (Regular or Irregular Algorithms)
• Similar Single-Program Multiple-Data (SPMD) paradigm
• Replace multiple MPI tasks per node with a single task and multiple shared-memory threads
• Parallelize sequential work into equivalent shared-memory algorithms
• Replace MPI communication primitives with corresponding “SIMPLE” primitives
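A minimal sketch of the second and third steps, assuming a single node whose threads share memory: each thread reduces its own block, a barrier makes the partial results visible, and one thread combines them. The function name `node_sum` is hypothetical, and Python threads stand in for the OpenMP or POSIX threads a real implementation would use; the inter-node message is only indicated by a comment.

```python
import threading

def node_sum(data, threads=4):
    """Sum `data` on one simulated SMP node using shared-memory threads."""
    partial = [0] * threads            # shared by all threads on the node
    result = [None]
    barrier = threading.Barrier(threads)

    def worker(my_thread):
        # Block-partition the node's data among its threads.
        lo = my_thread * len(data) // threads
        hi = (my_thread + 1) * len(data) // threads
        partial[my_thread] = sum(data[lo:hi])
        barrier.wait()                 # all partial sums are written after this
        if my_thread == 0:
            # One thread per node combines the partial results. In a hybrid
            # program, this is where the node's single message would be sent.
            result[0] = sum(partial)

    pool = [threading.Thread(target=worker, args=(t,)) for t in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return result[0]

if __name__ == "__main__":
    print(node_sum(list(range(100)), threads=4))
```

The point of the design is that only one task per node takes part in message passing, while the work inside the node is shared through memory.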
Portability: Access from User Space
Parallel Complexity Models
SIMPLE Complexity Model
Message Passing Primitives
Primitive        Cost
Send/Receive     $m/\beta + O(1)$
Barrier          $(p-1)/\beta + O(1)$
Scan             $(p-1)/\beta + O(p)$
Reduce           $(p-1)/\beta + O(p)$
Allreduce        $(p-1)/\beta + O(p)$
Alltoall         $(m - m/p)/\beta + O(m)$
Broadcast        $2((m - m/p)/\beta) + O(p)$
Gather/Scatter   $(m - m/p)/\beta + O(m)$
Shift            $m/\beta + O(1)$
Comparison of PRAM to SMP
• PRAM (theory)
  • O(n) processors
  • Global clock
  • Synchronous shared-memory
  • Unit cost for computation or memory access
  • Ideal Read/Write models (EREW, CREW, CRCW)
• SMP (practice)
  • “P” processors (2 to 64)
  • Asynchronous lock-step operation
  • Uniform memory access to main memory (< 600 ns), faster access to local cache (10-40 ns)
  • Cache-coherency at external caches
  • Contention for shared memory
OPAL Complexity Model
• SMP Complexity model motivated by Helman and JáJá, Ramachandran
• Complexity given by the triplet (MA, ME, TC)
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged between any processor and memory,
  • TC is the computational complexity.
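To make the triplet concrete, here is a small bookkeeping sketch. The class `SMPCost`, the method `then`, and the accounting in `block_sum_cost` are all hypothetical: the slides define the three quantities but not how to count them for a given algorithm, so the numbers below rest on stated assumptions (in particular, one contiguous block read is counted as a single memory access).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SMPCost:
    ma: int  # MA: number of memory accesses
    me: int  # ME: max volume of data exchanged between any processor and memory
    tc: int  # TC: computational complexity (operation count)

    def then(self, other):
        """Upper bound for two phases run back to back (component-wise sum)."""
        return SMPCost(self.ma + other.ma, self.me + other.me, self.tc + other.tc)

def block_sum_cost(n, p):
    """Assumed accounting for summing n values on p processors.

    Phase 1: each processor reads one contiguous block of ceil(n/p) values.
    Phase 2: one processor reads the p partial sums and adds them.
    """
    block = -(-n // p)  # ceil(n / p)
    local = SMPCost(ma=1, me=block, tc=block)
    combine = SMPCost(ma=1, me=p, tc=p)
    return local.then(combine)

if __name__ == "__main__":
    print(block_sum_cost(1024, 8))
```

Keeping the three components separate, rather than folding them into one number, lets memory traffic and computation be compared independently across algorithms.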
OPAL Facets
• Common Primitives
  • Read/Write
  • Replicate
  • Barrier
  • Scan
  • Reduce
  • Broadcast
  • Allreduce
• Techniques
  • Pointer-jumping
  • Balanced Trees (Prefix-Sums)
  • Symmetry Breaking (3-Coloring)
  • Parallel Prefix (List Ranking)
• Graph Algorithms
  • Spanning Tree
  • Euler Tour
  • Tree Functions
  • Ear Decomposition
• Combinatorics
  • Sorting
  • Selection
• Bioinformatics
  • (Minimum Evolution) Phylogeny Trees
  • Computational Genomics: Breakpoints, Inversions, Translocations
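As an illustration of the pointer-jumping technique applied to list ranking, the sketch below simulates the synchronous rounds sequentially. It is the textbook formulation rather than OPAL's implementation, and the name `list_rank` and the successor-array input format are assumptions.

```python
def list_rank(succ):
    """Rank each node of a linked list by pointer jumping.

    succ[i] is the successor of node i; the tail points to itself.
    Returns rank[i] = number of links from node i to the tail.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    done = False
    while not done:
        # One synchronous round: every node reads the old arrays, then all
        # nodes update at once. On a parallel machine each i is independent.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        done = new_nxt == nxt          # every pointer already reaches the tail
        rank, nxt = new_rank, new_nxt
    return rank

if __name__ == "__main__":
    # List order: 3 -> 1 -> 4 -> 0 -> 2 (tail)
    print(list_rank([2, 4, 2, 1, 0]))
```

Each round doubles the distance a pointer spans, so a list of n nodes needs about log2(n) rounds.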
SMP Complexity Model
SMP Node Primitives
• Read/Write
• Replicate
• Barrier
• Scan
• Reduce
• Broadcast
• Allreduce
• Etc.
• SMP Complexity model motivated by Helman and JáJá
• Complexity given by the triplet (MA, ME, TC)
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged between any processor and memory,
  • TC is the computational complexity.
OPAL Setting:Programming Environment
Local Context Parameters for Each Thread
NODES Total number of nodes in the cluster
MYNODE My node rank
THREADS Total number of threads on my node
MYTHREAD Rank of my thread on this node
TID Total number of threads in the cluster
ID My thread rank, with respect to the cluster
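The relationship between these parameters can be sketched as follows. The slides list the names but not how the cluster-wide values are derived, so the formulas here assume that every node runs the same number of threads and that threads are numbered node by node; the `Context` class itself is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Context:
    NODES: int     # total number of nodes in the cluster
    MYNODE: int    # my node rank
    THREADS: int   # total number of threads on my node
    MYTHREAD: int  # rank of my thread on this node

    @property
    def TID(self):
        # Assumption: every node has the same thread count.
        return self.NODES * self.THREADS

    @property
    def ID(self):
        # Assumption: cluster-wide ranks are assigned node by node.
        return self.MYNODE * self.THREADS + self.MYTHREAD

if __name__ == "__main__":
    ctx = Context(NODES=4, MYNODE=2, THREADS=8, MYTHREAD=3)
    print(ctx.TID, ctx.ID)
```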
Control Primitives
on_one_thread   Only one thread per node
on_one_node     All threads on a single node
on_one          Only one thread on a single node
on_thread(i)    On one thread (i) per node
on_node(j)      All threads on node j
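One way to read the control primitives is as predicates over a thread's (node rank, thread rank) pair. The sketch below follows that reading; the Python signatures are hypothetical, and choosing rank 0 as the "one" thread or node is an assumption the slides do not confirm.

```python
def on_one_thread(mynode, mythread):
    return mythread == 0                   # assumed: thread 0 on every node

def on_one_node(mynode, mythread):
    return mynode == 0                     # assumed: all threads on node 0

def on_one(mynode, mythread):
    return mynode == 0 and mythread == 0   # assumed: thread 0 of node 0

def on_thread(i, mynode, mythread):
    return mythread == i                   # thread i on every node

def on_node(j, mynode, mythread):
    return mynode == j                     # all threads on node j

def selected(predicate, nodes, threads):
    """List the (node, thread) pairs for which the predicate holds."""
    return [(n, t) for n in range(nodes) for t in range(threads)
            if predicate(n, t)]

if __name__ == "__main__":
    print(selected(on_one_thread, nodes=2, threads=3))
    print(selected(lambda n, t: on_node(1, n, t), nodes=2, threads=3))
```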
Memory Management Primitives
node_malloc Dynamically allocate a shared structure
node_free Release memory
Example Application: Radixsort
• Stable sort of n integers spread evenly across a cluster of p shared-memory r-way nodes
• Decompose b-bit keys into ρ-bit digits
• Perform b/ρ passes of counting sort on digits (LSD → MSD)
• Counting Sort
  • Compute histogram of local keys
  • Communicate: Alltoall primitive of histograms
  • Locally compute prefix-sums of histograms
  • Communicate: (Inverse) Alltoall of prefix-sums
  • Rank each local element
  • Perform a personalized communication (1-relation) rearranging elements into sorted order
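The counting-sort steps above can be sketched as a sequential simulation in which each inner list plays the role of one node's local keys. The communication steps are replaced by direct array access, so this shows the data flow only, not the SIMPLE or MPI calls. The names `counting_sort_pass` and `radix_sort` and the parameter `rho` (digit width in bits) are illustrative.

```python
def counting_sort_pass(blocks, shift, rho):
    """One stable counting-sort pass on the rho-bit digit at bit offset `shift`.

    blocks[i] holds the keys of simulated node i.
    """
    p = len(blocks)
    radix = 1 << rho
    mask = radix - 1
    # Each node builds a histogram of its local digits.
    hist = [[0] * radix for _ in range(p)]
    for i, keys in enumerate(blocks):
        for k in keys:
            hist[i][(k >> shift) & mask] += 1
    # Stand-in for Alltoall, prefix-sums, and inverse Alltoall:
    # offset[i][d] = global position of node i's first key with digit d.
    offset = [[0] * radix for _ in range(p)]
    running = 0
    for d in range(radix):
        for i in range(p):
            offset[i][d] = running
            running += hist[i][d]
    # Rank each local element and route it to its destination slot.
    n = running
    out = [None] * n
    for i, keys in enumerate(blocks):
        for k in keys:
            d = (k >> shift) & mask
            out[offset[i][d]] = k
            offset[i][d] += 1
    # Re-split so each node again holds a contiguous block of the result.
    size = -(-n // p)  # ceil(n / p)
    return [out[j * size:(j + 1) * size] for j in range(p)]

def radix_sort(blocks, b, rho):
    """Sort b-bit keys with passes over rho-bit digits, least significant first."""
    for shift in range(0, b, rho):
        blocks = counting_sort_pass(blocks, shift, rho)
    return blocks

if __name__ == "__main__":
    print(radix_sort([[170, 45, 75], [90, 802, 24], [2, 66, 1]], b=10, rho=4))
```

Stability matters here: within each digit value, keys keep their earlier relative order, which is what lets later passes build on earlier ones.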
Execution Time of Radix Sort on an SMP Cluster
SMP Example: Ear Decomposition
• Ear decomposition
  • Partitions the edges of a graph, useful in parallel processing
  • “Like peeling the layers of an onion”
• Applied to scientific computing problems
  • Computational mechanics (structural rigidity)
  • Computational biology (molecular structure, atoms in DNA chains)
  • Computational fluid dynamics
• Similar to other parallel algorithms for combinatorial problems
  • Trivial and fast sequential algorithm
  • Efficient PRAM algorithm
  • But no known practical, parallel algorithm
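The slides mention a fast sequential algorithm but do not show it. As a stand-in, the sketch below uses DFS-based chain decomposition (Schmidt's method), which produces an ear decomposition when the input is a connected, bridgeless, simple graph. The function name `ear_decomposition` and the adjacency-list input are illustrative choices, not OPAL's API.

```python
def ear_decomposition(adj, root=0):
    """Return a list of ears, each a list of vertices along a path or cycle.

    adj[v] lists the neighbours of vertex v. Assumes a connected, bridgeless,
    simple undirected graph; the first ear is a cycle through `root`.
    """
    n = len(adj)
    parent = [-1] * n
    disc = [-1] * n                      # DFS discovery index
    order = [root]
    disc[root] = 0
    stack = [(root, iter(adj[root]))]
    while stack:                         # iterative depth-first search
        v, it = stack[-1]
        for w in it:
            if disc[w] == -1:
                disc[w] = len(order)
                parent[w] = v
                order.append(w)
                stack.append((w, iter(adj[w])))
                break
        else:
            stack.pop()
    visited = [False] * n
    ears = []
    for v in order:                      # process vertices in discovery order
        for w in adj[v]:
            # Non-tree edge whose upper (ancestor) endpoint is v.
            if disc[w] > disc[v] and parent[w] != v:
                visited[v] = True
                ear = [v, w]
                while not visited[w]:    # climb tree edges to a visited vertex
                    visited[w] = True
                    w = parent[w]
                    ear.append(w)
                ears.append(ear)
    return ears

if __name__ == "__main__":
    # A 4-cycle 0-1-2-3-0 with the chord 1-3.
    graph = [[1, 3], [0, 2, 3], [1, 3], [0, 1, 2]]
    print(ear_decomposition(graph))
```

Every edge of the graph lands in exactly one ear, which gives a quick sanity check: the ear lengths should sum to the number of edges.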
Ear Decomposition Example
[Figure: an input graph, its spanning tree, and the output ears; n = number of vertices, m = number of edges]
Ear Decomposition Complexities
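• Message Passing:
  • Spanning Tree: T(n, p) = [expression lost in extraction; terms in n, p, and log]
  • Ear Decomposition: T(n, p) = [expression lost in extraction; terms in n, m, p, and log]
• Shared Memory:
  • Spanning Tree: T(n, p) = [triplet lost in extraction; terms in n, m, p, and log]
  • Ear Decomposition: T(n, p) = [triplet lost in extraction; terms in n, m, p, and log]
• Sequential Complexity: [expression lost in extraction; terms in n, m, and log]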
Comparison of Ear Decomposition Algorithms
Performance of SMP Ear Decomposition on a Variety of Input Graphs
n = 8192
SMP Ear Decomposition Algorithms
Conclusions
• New hybrid model for SMP Clusters
• Open Source Parallel Algorithm Library (OPAL)
• High-Performance methodology
• Fastest known algorithms on SMPs and SMP clusters
• Preliminary experimental results
Future Work
• Algorithms for SMP Clusters
  • Validate complexity model
  • Identify classes of efficient algorithms
  • Library of SMP algorithms
  • Methodology for algorithm-engineering
• Clusters of Heterogeneous SMP Nodes
  • Varying node sizes
  • Nodes from different vendors & architectures
  • Hierarchical clusters of SMPs
• Scientific Applications
  • Bioinformatics and Genomics
  • Landscape Ecology and Remote Sensing
  • Computational Fluid Dynamics