OPAL: Open Source Parallel Algorithm Library
Designing High-Performance Algorithms for SMP Clusters

David A. Bader
Electrical & Computer Engineering Department
Albuquerque High Performance Computing Center
University of New Mexico
[email protected]
http://hpc.eece.unm.edu/



15 August 2000, High Performance Algorithms for SMP Clusters, Prof. David A. Bader

High-Performance Applications using SMP Clusters

• Long-term Earth science studies using terascale remotely-sensed global satellite imagery (4 km AVHRR GAC)

• Computational Ecological Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles

• Computational Bioinformatics: Large Scale Phylogeny Reconstruction


Research Collaborators
• Joseph JáJá, University of Maryland
• Bernard Moret, CS (Experimental Algorithmics), University of New Mexico
• Bruce Milne, Biology (Landscape Ecology), University of New Mexico
• Tandy Warnow, CS, University of Texas-Austin
• IBM ACTC Group (David Klepacki, John Levesque, and others)
• Current Graduate Students:
  • Mi Yan, Niranjan Prabhu, Vinila Yarlagadda
• Laboratory Alumni:
  • Kavita Balakavi (Intel), Ajith Illendula (Intel)


Acknowledgment of Support

• NSF CISE Postdoctoral Research Associate in Experimental Computer Science No. 96-25668

• NSF BIO Division of Environmental Biology DEB 99-10123

• Department of Energy Sandia-University New Assistant Professorship Program (SUNAPP) Award AX-3006

• IBM SUR Grant (UNM Vista-Azul Project)

• NPACI/SDSC and NCSA/Alliance

• NSF 00-* Algorithms for Irregular Discrete Computations on SMPs


Outline
• Motivation
• SMP Cluster Programming (SIMPLE)
  • Complexity model
  • Message-Passing
  • Shared-Memory
• OPAL Facets (parallel libraries)
• OPAL Setting (programming framework)
• Example SMP Algorithms


Motivation
• High-performance computing has been leveraging COTS (commercial off-the-shelf) workstation technologies:
  • Commodity microprocessors
  • High-performance networks
  • Operating system and compiler technology
• Symmetric multiprocessors (SMPs):
  • Hardware support for hierarchical memory management
  • Multithreaded operating system kernels
  • Optimizing compilers and runtime systems


SMP Cluster Architectures (nodes x processors per node)
• IBM SP (NPACI Blue Horizon, 144x8)
• Linux clusters
• Compaq AlphaServers (PSC/NSF Terascale, 682x4)
• Sun Ultra HPC (4x64)

Pictured: LLNL ASCI White, IBM SP (512x16); UNM/Alliance LosLobos, IBM Netfinity (256x2); UNM/Alliance Roadrunner Linux SuperCluster (64x2)


Message-Passing Performance

[Figure: MPI bandwidth (MB/s, 0 to 120) and MPI time (µs, 0 to 16000) versus MPI message length (0 to 1.2e+6 bytes), IBM Netfinity 4500R (733 MHz) cluster with Myrinet]


Shared-Memory Performance

• One Sun HPC E10K processor
• Contiguous array; each element read exactly once
• C, X = cyclic read (stride X) of contiguous array
• R = random access of array

[Figure: time per memory read (1 ns to 1000 ns, log scale) versus log2(problem size) from 5 to 26, for access patterns C,1; C,2; C,4; C,8; C,16; C,32; C,64; and R]
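The access patterns in this plot can be reproduced with a small harness. The sketch below is our own and is not the code behind the plot. It assumes that "C, X" means visiting indices 0, X, 2X, ..., then 1, 1+X, ..., so each element is still read exactly once. For "R" it uses a Fisher-Yates shuffle driven by a xorshift64 generator. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* "C, s" pattern (our reading of the legend): visit 0, s, 2s, ..., then
   1, 1+s, 2+s, ..., so every index in [0, n) appears exactly once.
   Requires s >= 1; s == 1 is a plain sequential scan. */
void cyclic_order(size_t *idx, size_t n, size_t s) {
    size_t k = 0;
    for (size_t start = 0; start < s && start < n; start++)
        for (size_t i = start; i < n; i += s)
            idx[k++] = i;
}

/* "R" pattern: Fisher-Yates shuffle of 0..n-1 using a xorshift64 generator,
   so the order is random but each index still appears exactly once. */
void random_order(size_t *idx, size_t n, uint64_t seed) {
    uint64_t x = seed ? seed : 88172645463325252ULL;  /* state must be nonzero */
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
    for (size_t i = n; i > 1; i--) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        size_t j = (size_t)(x % i);                   /* j in [0, i-1] */
        size_t t = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = t;
    }
}

/* Read a[idx[0]], a[idx[1]], ... once each and report mean CPU time per read
   in nanoseconds. The sum is returned so the loads are not optimized away. */
long read_in_order(const int *a, const size_t *idx, size_t n, double *ns_per_read) {
    long sum = 0;
    clock_t t0 = clock();
    for (size_t k = 0; k < n; k++)
        sum += a[idx[k]];
    clock_t t1 = clock();
    *ns_per_read = n ? 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)n : 0.0;
    return sum;
}
```

To trace one curve per pattern, call `read_in_order` for each pattern while doubling n from 2^5 to 2^26. `clock()` measures processor time at coarse resolution, so small problem sizes need repeated runs. The sequential read of `idx` also adds a small constant cost to every pattern.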


High Performance Algorithms for SMP Clusters

• “SIMPLE” Model
  • Use a hybrid, natural combination of message passing and shared memory
  • Message-passing interface between nodes
  • Shared-memory programming (OpenMP, POSIX Threads) on each SMP node
• Methodology for adapting message-passing algorithms for SMP clusters
• Freely available, open-source implementation of parallel algorithms, libraries, and programming environment for C/C++/Fortran, under the GNU General Public License (GPL)


Optimizing from MPI to SIMPLE (Regular or Irregular Algorithms)

• Similar Single-Program Multiple-Data (SPMD) paradigm

• Replace multiple MPI tasks per node with a single task and multiple shared-memory threads

• Parallelize sequential work into equivalent shared-memory algorithms

• Replace MPI communication primitives with corresponding “SIMPLE” primitives


Portability: Access from User Space


Parallel Complexity Models


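SIMPLE Complexity Model: Message-Passing Primitives

Primitive         Cost
Send/Receive      $m/\beta + \Theta(1)$
Barrier           $(p-1)/\beta + \Theta(1)$
Scan              $(p-1)/\beta + \Theta(p)$
Reduce            $(p-1)/\beta + \Theta(p)$
Allreduce         $(p-1)/\beta + \Theta(p)$
Alltoall          $(m - m/p)/\beta + \Theta(m)$
Broadcast         $2\left((m - m/p)/\beta\right) + \Theta(p)$
Gather/Scatter    $(m - m/p)/\beta + \Theta(m)$
Shift             $m/\beta + \Theta(1)$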


Comparison of PRAM to SMP

• PRAM (theory)
  • O(n) processors
  • Global clock
  • Synchronous shared memory
  • Unit cost for computation or memory access
  • Ideal read/write models (EREW, CREW, CRCW)
• SMP (practice)
  • “P” processors (2 to 64)
  • Asynchronous (not lock-step) operation
  • Uniform memory access to main memory (< 600 ns), faster access to local cache (10-40 ns)
  • Cache coherency at external caches
  • Contention for shared memory


OPAL Complexity Model

• SMP complexity model motivated by Helman and JáJá, and by Ramachandran
• Complexity given by the triplet (MA, ME, TC), where
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged between any processor and memory,
  • TC is the computational complexity.


OPAL Facets

• Common Primitives
  • Read/Write
  • Replicate
  • Barrier
  • Scan
  • Reduce
  • Broadcast
  • Allreduce
• Techniques
  • Pointer-jumping
  • Balanced Trees (Prefix-Sums)
  • Symmetry Breaking (3-Coloring)
  • Parallel Prefix (List Ranking)
• Graph Algorithms
  • Spanning Tree
  • Euler Tour
  • Tree Functions
  • Ear Decomposition
• Combinatorics
  • Sorting
  • Selection
• Bioinformatics
  • (Minimum Evolution) Phylogeny Trees
  • Computational Genomics: Breakpoints, Inversions, Translocations
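Pointer jumping and list ranking are among the PRAM techniques listed above. The following minimal sketch assumes a linked list stored as a successor array, with `next[i] = -1` at the tail. It simulates Wyllie-style pointer jumping one synchronous round at a time, using double buffers in place of the PRAM's global clock. It is an illustration, not OPAL's implementation, and `list_rank` is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/* List ranking by pointer jumping. next[i] is the successor of node i, or -1
   at the tail. On return rank[i] is the number of links from i to the tail.
   Returns 0 on success, -1 on allocation failure. */
int list_rank(const int *next, int *rank, int n) {
    if (n <= 0) return 0;
    int *nx = malloc(n * sizeof *nx);
    int *nx_new = malloc(n * sizeof *nx_new);
    int *rank_new = malloc(n * sizeof *rank_new);
    int rc = -1;
    if (!nx || !nx_new || !rank_new) goto done;
    for (int i = 0; i < n; i++) {
        nx[i] = next[i];
        rank[i] = next[i] < 0 ? 0 : 1;      /* distance to current successor */
    }
    for (int active = 1; active; ) {
        active = 0;
        /* One synchronous round: every node reads the OLD rank/next of its
           successor; writing into separate buffers keeps the round synchronous. */
        for (int i = 0; i < n; i++) {
            if (nx[i] >= 0) {
                rank_new[i] = rank[i] + rank[nx[i]];
                nx_new[i] = nx[nx[i]];      /* jump over the successor */
                active = 1;
            } else {
                rank_new[i] = rank[i];
                nx_new[i] = -1;
            }
        }
        memcpy(rank, rank_new, n * sizeof *rank);
        memcpy(nx, nx_new, n * sizeof *nx);
    }
    rc = 0;
done:
    free(nx); free(nx_new); free(rank_new);
    return rc;
}
```

Each round halves the remaining distance to the tail, so the loop runs about log2 n rounds. The total work is O(n log n), compared with O(n) for a sequential walk. On an SMP, the inner loop over `i` could be split among threads, with a barrier between rounds.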


SMP Complexity Model: SMP Node Primitives

• Read/Write, Replicate, Barrier, Scan, Reduce, Broadcast, Allreduce, etc.
• SMP complexity model motivated by Helman and JáJá
• Complexity given by the triplet (MA, ME, TC), where
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged between any processor and memory,
  • TC is the computational complexity.
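On an SMP node, a Scan primitive is commonly built in three phases:

1. Each thread scans its own contiguous block.
2. One thread scans the p block totals.
3. Each thread adds its block's offset.

The sketch below simulates the p threads with loops. Each outer iteration over `t` would be one thread, with a barrier between phases. This block-decomposed approach is assumed here rather than taken from OPAL, and `block_prefix_sums` is a hypothetical name.

```c
#include <stddef.h>

/* Inclusive prefix sums of x[0..n-1], split into p >= 1 contiguous blocks.
   block_sum must have room for p entries; it ends up holding each block's
   offset (the exclusive scan of the block totals). */
void block_prefix_sums(long *x, size_t n, size_t p, long *block_sum) {
    /* Phase 1 (all threads): scan each block locally, record its total. */
    for (size_t t = 0; t < p; t++) {
        size_t lo = t * n / p, hi = (t + 1) * n / p;
        for (size_t i = lo + 1; i < hi; i++)
            x[i] += x[i - 1];
        block_sum[t] = hi > lo ? x[hi - 1] : 0;
    }
    /* Phase 2 (one thread, after a barrier): exclusive scan of the totals. */
    long run = 0;
    for (size_t t = 0; t < p; t++) {
        long s = block_sum[t];
        block_sum[t] = run;
        run += s;
    }
    /* Phase 3 (all threads, after a barrier): add each block's offset. */
    for (size_t t = 0; t < p; t++) {
        size_t lo = t * n / p, hi = (t + 1) * n / p;
        for (size_t i = lo; i < hi; i++)
            x[i] += block_sum[t];
    }
}
```

Phases 1 and 3 touch each element in contiguous order, and only phase 2 is serial, at a cost of O(p).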


OPAL Setting: Programming Environment


Local Context Parameters for Each Thread

NODES Total number of nodes in the cluster

MYNODE My node rank

THREADS Total number of threads on my node

MYTHREAD Rank of my thread on this node

TID Total number of threads in the cluster

ID My thread rank, with respect to the cluster
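The two cluster-wide parameters follow from the per-node ones if two conditions hold: every node runs the same number of threads, and threads are numbered node by node. Both are assumptions for this sketch. `context_t` and `make_context` are hypothetical names.

```c
/* Hypothetical bundle of the local context parameters for one thread. */
typedef struct {
    int NODES, MYNODE, THREADS, MYTHREAD;  /* given per thread */
    int TID, ID;                            /* derived below */
} context_t;

context_t make_context(int nodes, int mynode, int threads, int mythread) {
    context_t c = { nodes, mynode, threads, mythread, 0, 0 };
    c.TID = nodes * threads;              /* assumes equal threads per node */
    c.ID = mynode * threads + mythread;   /* node-major global thread rank */
    return c;
}
```

With this numbering, ID runs from 0 to TID - 1 and is unique across the cluster.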


Control Primitives

on_one_thread Only one thread per node

on_one_node All threads on a single node

on_one Only one thread on a single node

on_thread(i) On one thread (i) per node

on_node(j) All threads on node j
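One plausible way to realize the control primitives is as guard macros over MYNODE and MYTHREAD. The definitions below are hypothetical. The slide does not say which node or thread plays the "one" role, so we assume node 0 and thread 0. `count_executions` and its enum exist only to check how many of the NODES x THREADS contexts run each guarded statement.

```c
/* Hypothetical guard macros; MYNODE and MYTHREAD must be in scope. */
#define on_one_thread  if (MYTHREAD == 0)                  /* one thread per node */
#define on_one_node    if (MYNODE == 0)                    /* all threads, one node */
#define on_one         if (MYNODE == 0 && MYTHREAD == 0)   /* one thread overall */
#define on_thread(i)   if (MYTHREAD == (i))                /* thread i on every node */
#define on_node(j)     if (MYNODE == (j))                  /* all threads on node j */

enum guard { ONE_THREAD, ONE_NODE, ONE, THREAD_I, NODE_J };

/* Counts the (node, thread) contexts that would execute a guarded statement. */
int count_executions(enum guard g, int nodes, int threads, int arg) {
    int count = 0;
    for (int MYNODE = 0; MYNODE < nodes; MYNODE++)
        for (int MYTHREAD = 0; MYTHREAD < threads; MYTHREAD++)
            switch (g) {
            case ONE_THREAD: on_one_thread  count++; break;
            case ONE_NODE:   on_one_node    count++; break;
            case ONE:        on_one         count++; break;
            case THREAD_I:   on_thread(arg) count++; break;
            case NODE_J:     on_node(arg)   count++; break;
            }
    return count;
}
```

Each macro expands to a bare `if`, so a following `else` would bind to it. A production version would more likely take the guarded block as a macro argument, or pair the guard with an explicit barrier.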


Memory Management Primitives

node_malloc Dynamically allocate a shared structure

node_free Release memory
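A shared allocation on a node requires all threads to agree on one pointer. The sketch below assumes POSIX barriers and a shared slot:

1. Thread 0 calls `malloc`.
2. A barrier publishes the pointer to all threads.
3. A second barrier keeps the slot from being reused before every thread has read it.

`node_ctx`, its `slot` field, and these signatures are hypothetical. The slide names only `node_malloc` and `node_free`.

```c
#define _POSIX_C_SOURCE 200809L  /* expose pthread_barrier_t under strict -std= modes */
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-node state shared by all threads on one SMP node. */
typedef struct {
    pthread_barrier_t barrier;  /* initialized for THREADS participants */
    void *slot;                 /* pointer published by thread 0 */
} node_ctx;

/* Every thread on the node calls this; all receive the same pointer. */
void *node_malloc(node_ctx *ctx, int mythread, size_t bytes) {
    if (mythread == 0)
        ctx->slot = malloc(bytes);
    pthread_barrier_wait(&ctx->barrier);  /* slot is now set for everyone */
    void *p = ctx->slot;
    pthread_barrier_wait(&ctx->barrier);  /* all have read slot before it can be reused */
    return p;
}

/* Every thread calls this; memory is released only after all threads arrive. */
void node_free(node_ctx *ctx, int mythread, void *p) {
    pthread_barrier_wait(&ctx->barrier);
    if (mythread == 0)
        free(p);
}
```

The barrier in `node_free` ensures that no thread is still using the memory when thread 0 releases it. `pthread_barrier_t` is an optional POSIX feature, and some platforms, such as macOS, do not provide it.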


Example Application: Radix Sort

• Stable sort of n integers spread evenly across a cluster of p shared-memory r-way nodes
• Decompose $b$-bit keys into $\rho$-bit digits
• Perform $b/\rho$ passes of counting sort on digits (LSD → MSD)
• Counting Sort
  • Compute histogram of local keys
  • Communicate: Alltoall primitive of histograms
  • Locally compute prefix-sums of histograms
  • Communicate: (Inverse) Alltoall of prefix-sums
  • Rank each local element
  • Perform a personalized communication (1-relation) rearranging elements into sorted order
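The counting-sort pass above can be simulated on one machine to check the ranking logic. This sketch is assumed rather than taken from the talk:

- The n keys are split into p equal blocks, one per simulated processor.
- The digit width is fixed at 8 bits for 32-bit keys.
- The Alltoall, prefix-sum, and inverse-Alltoall steps collapse into a single global prefix sum over a histogram ordered first by digit value, then by processor.

`counting_pass`, `radix_sort`, and `RHO` are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RHO 8                    /* digit width in bits (assumed) */
#define BUCKETS (1u << RHO)

/* One stable counting-sort pass on the digit at bit offset `shift`,
   with the n keys split into p >= 1 blocks (one per simulated processor). */
static int counting_pass(const uint32_t *in, uint32_t *out, size_t n, size_t p,
                         unsigned shift) {
    size_t *hist = calloc(p * BUCKETS, sizeof *hist);
    if (!hist) return -1;
    /* Each processor builds the histogram of its local keys. */
    for (size_t j = 0; j < p; j++)
        for (size_t i = j * n / p; i < (j + 1) * n / p; i++)
            hist[j * BUCKETS + ((in[i] >> shift) & (BUCKETS - 1))]++;
    /* Global prefix sums, digit-major then processor-major: these play the
       role of the Alltoall / prefix-sum / inverse-Alltoall steps. */
    size_t run = 0;
    for (size_t v = 0; v < BUCKETS; v++)
        for (size_t j = 0; j < p; j++) {
            size_t c = hist[j * BUCKETS + v];
            hist[j * BUCKETS + v] = run;     /* first destination rank */
            run += c;
        }
    /* Rank each local element and move it (the 1-relation). */
    for (size_t j = 0; j < p; j++)
        for (size_t i = j * n / p; i < (j + 1) * n / p; i++)
            out[hist[j * BUCKETS + ((in[i] >> shift) & (BUCKETS - 1))]++] = in[i];
    free(hist);
    return 0;
}

/* LSD radix sort of 32-bit keys: 32/RHO passes, least significant digit first. */
int radix_sort(uint32_t *keys, size_t n, size_t p) {
    if (p == 0) return -1;
    uint32_t *tmp = malloc((n ? n : 1) * sizeof *tmp);
    if (!tmp) return -1;
    uint32_t *src = keys, *dst = tmp;
    for (unsigned shift = 0; shift < 32; shift += RHO) {
        if (counting_pass(src, dst, n, p, shift) != 0) { free(tmp); return -1; }
        uint32_t *t = src; src = dst; dst = t;   /* output feeds the next pass */
    }
    if (src != keys)                             /* odd number of passes */
        memcpy(keys, src, n * sizeof *keys);
    free(tmp);
    return 0;
}
```

Ordering the prefix sum by digit value first and processor rank second is what makes each pass stable across processors. Equal digits on lower-ranked processors land first, so the 32/RHO passes from least to most significant digit produce a sorted result.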


Execution Time of Radix Sort on an SMP Cluster


SMP Example: Ear Decomposition

• Ear decomposition
  • Partitions the edges of a graph; useful in parallel processing
  • “Like peeling the layers of an onion”
• Applied to scientific computing problems
  • Computational mechanics (structural rigidity)
  • Computational biology (molecular structure, atoms in DNA chains)
  • Computational fluid dynamics
• Similar to other parallel algorithms for combinatorial problems
  • Trivial and fast sequential algorithm
  • Efficient PRAM algorithm
  • But no known practical parallel algorithm


Ear Decomposition Example

[Figure: input graph, its spanning tree, and the output ears; n = number of vertices, m = number of edges]
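The example (input graph, spanning tree, output ears) suggests a spanning-tree-based method. Below is a sequential illustration of a classical PRAM-style labeling scheme, which we assume here rather than take from the slides:

1. Root a spanning tree.
2. Label each non-tree edge by the depth of the lowest common ancestor (LCA) of its endpoints, breaking ties by edge index.
3. Give each tree edge the smallest label among the non-tree edges whose tree cycle passes through it.

Edges that share a label form one ear. The sketch assumes a connected graph with no bridges, and `ear_decomposition` is a hypothetical name. It grows the tree by repeated sweeps over the edge list, which is simple but O(nm) in the worst case.

```c
#include <limits.h>
#include <stdlib.h>

/* Sketch: ear decomposition of a connected, bridgeless undirected graph.
   Edge k joins eu[k] and ev[k]. On success, ear[k] is the index of the
   non-tree edge defining the ear that contains edge k, and 0 is returned;
   -1 means allocation failure, a disconnected graph, or a bridge. */
int ear_decomposition(int n, int m, const int *eu, const int *ev, int *ear) {
    int *parent = malloc((n > 0 ? n : 1) * sizeof *parent);
    int *pedge = malloc((n > 0 ? n : 1) * sizeof *pedge);   /* tree edge to parent */
    int *depth = malloc((n > 0 ? n : 1) * sizeof *depth);
    long *key = malloc((m > 0 ? m : 1) * sizeof *key);       /* -1 marks non-tree */
    int rc = -1;
    if (n <= 0 || !parent || !pedge || !depth || !key) goto done;

    /* 1. Grow a spanning tree rooted at vertex 0 by sweeping the edges. */
    for (int v = 0; v < n; v++) { parent[v] = -1; pedge[v] = -1; depth[v] = -1; }
    depth[0] = 0;
    for (int grew = 1; grew; ) {
        grew = 0;
        for (int k = 0; k < m; k++) {
            int a = eu[k], b = ev[k];
            if (depth[a] < 0 && depth[b] >= 0) { int t = a; a = b; b = t; }
            if (depth[a] >= 0 && depth[b] < 0) {            /* attach b below a */
                parent[b] = a; pedge[b] = k; depth[b] = depth[a] + 1; grew = 1;
            }
        }
    }
    for (int v = 0; v < n; v++)
        if (depth[v] < 0) goto done;                         /* disconnected */

    /* 2. Tree edges start uncovered (LONG_MAX); the rest are non-tree edges. */
    for (int k = 0; k < m; k++) { key[k] = -1; ear[k] = -1; }
    for (int v = 1; v < n; v++) key[pedge[v]] = LONG_MAX;

    /* 3. Label each non-tree edge and push the label onto the tree edges of
          its cycle; each tree edge keeps the smallest label it receives. */
    for (int k = 0; k < m; k++) {
        if (key[k] != -1) continue;                          /* tree edge */
        int x = eu[k], y = ev[k];
        while (depth[x] > depth[y]) x = parent[x];
        while (depth[y] > depth[x]) y = parent[y];
        while (x != y) { x = parent[x]; y = parent[y]; }     /* x is the LCA */
        long label = (long)depth[x] * m + k;
        key[k] = label;
        ear[k] = k;
        for (int side = 0; side < 2; side++)
            for (int z = side ? ev[k] : eu[k]; z != x; z = parent[z])
                if (label < key[pedge[z]]) { key[pedge[z]] = label; ear[pedge[z]] = k; }
    }

    /* 4. A tree edge on no cycle is a bridge: no ear decomposition exists. */
    for (int v = 1; v < n; v++)
        if (ear[pedge[v]] < 0) goto done;
    rc = 0;
done:
    free(parent); free(pedge); free(depth); free(key);
    return rc;
}
```

Sorting the non-tree edges by the key depth(LCA) * m + index gives the order in which the ears are peeled. An SMP version could plausibly replace the sweeps with a parallel spanning-tree algorithm, and the path walks with tree computations such as the Euler tour listed among the OPAL facets.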


Ear Decomposition Complexities

• Message Passing:
  • Spanning Tree
  • Ear Decomposition
• Shared Memory:
  • Spanning Tree
  • Ear Decomposition
• Sequential Complexity

[Equations: complexity T(n, p) for each case above and sequential complexity T(n), in terms of n, m, and p]


Comparison of Ear Decomposition Algorithms


Performance of SMP Ear Decomposition on a Variety of Input Graphs (n = 8192)


SMP Ear Decomposition Algorithms


Conclusions

• New hybrid model for SMP clusters
• Open Source Parallel Algorithm Library (OPAL)
• High-performance methodology
• Fastest known algorithms on SMPs and SMP clusters
• Preliminary experimental results


Future Work

• Algorithms for SMP Clusters
  • Validate complexity model
  • Identify classes of efficient algorithms
  • Library of SMP algorithms
  • Methodology for algorithm engineering
• Clusters of Heterogeneous SMP Nodes
  • Varying node sizes
  • Nodes from different vendors & architectures
  • Hierarchical clusters of SMPs
• Scientific Applications
  • Bioinformatics and Genomics
  • Landscape Ecology and Remote Sensing
  • Computational Fluid Dynamics