Unified Parallel C
Kathy Yelick, EECS, U.C. Berkeley and NERSC/LBNL
NERSC Team: Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell
Outline
•Global Address Space Languages in General
– Programming models
•Overview of Unified Parallel C (UPC)
– Programmability advantages
– Performance opportunities
•Status
– Next steps
•Related projects
Programming Model 1: Shared Memory
•Program is a collection of threads of control.
– Many languages allow threads to be created dynamically.
•Each thread has a set of private variables, e.g., local variables on the stack.
•Threads also collectively share a set of shared variables, e.g., static variables, shared common blocks, the global heap.
– Threads communicate implicitly by writing/reading shared variables.
– Threads coordinate using synchronization operations on shared variables.
[Figure: threads P0 … Pn each execute code such as x = ... and y = ..x ..., reading and writing a common shared region above their private regions]
Programming Model 2: Message Passing
•Program consists of a collection of named processes.
– Usually fixed at program startup time
– Each has a thread of control plus a local address space; NO shared data.
– Logically shared data is partitioned over the local processes.
•Processes communicate by explicit send/receive pairs.
– Coordination is implicit in every communication event.
– MPI is the most common example.
[Figure: processes P0 … Pn with disjoint private memories; one process executes send P0,X while another executes recv Pn,Y]
Tradeoffs Between the Models
•Shared memory
+ Programming is easier
• Can build large shared data structures
– Machines don't scale
• SMPs typically < 16 processors (Sun, DEC, Intel, IBM)
• Distributed shared memory < 128 (SGI)
– Performance is hard to predict and control
•Message passing
+ Machines are easier to build from commodity parts
+ Can scale (given a sufficient network)
– Programming is harder
• Distributed data structures exist only in the programmer's mind
• Tedious packing/unpacking of irregular data structures
Global Address Space Programming
• An intermediate point between message passing and shared memory
• Program consists of a collection of processes
– Fixed at program startup time, like MPI
• Local and shared data, as in the shared memory model
– But shared data is partitioned over the local processes
– Remote data stays remote on distributed memory machines
– Processes communicate by reads/writes to shared variables
• Examples are UPC, Titanium, CAF, and Split-C
• Note: these are not data-parallel languages
– Heroic compilers are not required
GAS Languages on Clusters of SMPs
•Clusters of SMPs (CLUMPs)
– IBM SP: 16-way SMP nodes
– Berkeley Millennium: 2-way and 4-way nodes
•What is an appropriate programming model?
– Use message passing throughout
• Most common model
• Unnecessary packing/unpacking overhead
– Hybrid models
• Write 2 parallel programs (MPI + OpenMP or threads)
– Global address space
• Only adds a test (on/off node) before each local read/write
Support for GAS Languages
•Unified Parallel C (UPC)
– Funded by the NSA
– Compaq compiler for Alpha/Quadrics
– HP, Sun, and Cray compilers under development
– Gcc-based compiler for SGI (Intrepid)
– Gcc-based compiler (SRC) for Cray T3E
– MTU and Compaq effort for an MPI-based compiler
– LBNL compiler based on Open64
•Co-Array Fortran (CAF)
– Cray compiler
– Rice and UMN effort based on Open64
•SPMD Java (Titanium)
– UCB compiler available for most machines
Parallelism Model in UPC
• UPC uses an SPMD model of parallelism
– A set of THREADS threads working independently
• Two compilation models
– THREADS may be fixed at compile time, or
– Dynamically set at program startup time
• MYTHREAD specifies the thread index (0..THREADS-1)
• Basic synchronization mechanisms
– Barriers (normal and split-phase), locks
• What UPC does not do automatically:
– Determine data layout
– Load balance (move computations)
– Caching (move data)
• These are intentionally left to the programmer
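The SPMD model can be illustrated with a minimal UPC program. This is a sketch, not code from the deck: it needs a UPC compiler to build, and the interleaving of output lines across threads is unspecified.

```c
/* Sketch of UPC's SPMD model: every thread runs main() independently. */
#include <upc.h>
#include <stdio.h>

int main(void)
{
    /* MYTHREAD and THREADS are built-in: this thread's index and the
     * total thread count (fixed at compile time or at startup). */
    printf("hello from thread %d of %d\n", MYTHREAD, THREADS);

    upc_barrier;            /* all threads synchronize here */

    if (MYTHREAD == 0)
        printf("all %d threads passed the barrier\n", THREADS);
    return 0;
}
```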
UPC Pointers
•Pointers may point to shared or private variables
•Same syntax for use; just add a qualifier:
shared int *sp;
int *lp;
• sp is a pointer to an integer residing in the shared memory space.
• sp is called a shared pointer (somewhat sloppy terminology).
[Figure: global address space with x: 3 in the shared region; each thread holds its own private copies of sp and lp, with every sp referring to the shared x]
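Putting the two pointer flavors together might look like the following sketch (UPC syntax; the variable names are illustrative, and it needs a UPC compiler):

```c
#include <upc.h>

shared int x;              /* one shared int, with affinity to thread 0 */

int main(void)
{
    shared int *sp = &x;   /* shared pointer: may reference remote memory */
    int y = 0;
    int *lp = &y;          /* private pointer: local address space only   */

    *sp = 3;               /* may become a remote write on a cluster      */
    y   = *sp;             /* remote read into a private variable         */
    (void)lp;
    return 0;
}
```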
Shared Arrays in UPC
• Shared array elements are spread across the threads:
shared int x[THREADS];     /* One element per thread */
shared int y[3][THREADS];  /* 3 elements per thread */
shared int z[3*THREADS];   /* 3 elements per thread, cyclic */
• In the pictures below– Assume THREADS = 4– Elements with affinity to processor 0 are red
[Figure: element layouts of x (one element per thread), y (blocked), and z (cyclic); of course, y is really a 2D array]
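UPC's default layout rule places element i of a shared array on thread (i / blocksize) % THREADS, with blocksize 1 (cyclic) unless a layout qualifier gives a block size. A plain-C sketch of that rule, runnable without a UPC compiler (the function name is mine, not part of UPC):

```c
#include <assert.h>

/* Thread that owns element i of a shared array declared with the given
 * block size, following UPC's round-robin-by-block layout rule:
 * owner(i) = (i / blocksize) % threads. */
int upc_affinity(int i, int blocksize, int threads)
{
    return (i / blocksize) % threads;
}
```

With THREADS = 4, `shared int z[3*THREADS]` (block size 1) places z[0], z[4], z[8] on thread 0, while a qualifier such as `shared [3] int z[3*THREADS]` would instead give thread 0 the contiguous block z[0..2].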
Overlapping Communication in UPC
•Programs with fine-grained communication require overlap for performance
•The UPC compiler does this automatically for "relaxed" accesses.
– Accesses may be designated as strict, relaxed, or unqualified (the default).
– There are several ways of designating the ordering type:
• A type qualifier, strict or relaxed, can be used to affect all variables of that type.
• Labels strict or relaxed can be used to control the accesses within a statement:
strict : { x = y; z = y + 1; }
• A strict or relaxed cast can be used to override the current label or type qualifier.
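As a sketch, the type-qualifier mechanism might be used like this (UPC syntax; needs a UPC compiler, and the producer/consumer names are illustrative):

```c
#include <upc_relaxed.h>    /* unqualified accesses default to relaxed */

relaxed shared int data;    /* compiler may overlap/reorder accesses   */
strict  shared int flag;    /* every access is a sequentially          */
                            /* consistent synchronization point        */
void producer(void)
{
    data = 42;              /* relaxed write: may be issued asynchronously */
    flag = 1;               /* strict write: ordered after the data write, */
                            /* so a consumer that sees flag==1 sees data   */
}
```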
Performance of UPC
•Reasons why UPC may be slower than MPI
– Shared array indexing is expensive
– Small messages are encouraged by the model
•Reasons why UPC may be faster than MPI
– MPI encourages synchrony
– Buffering is required for many MPI calls
– A remote read/write of a single word may require very little overhead
• Cray T3E, Quadrics interconnect (next version)
•Assuming overlapped communication, the real issue is overhead: how much time does it take to issue a remote read/write?
UPC vs. MPI: Sparse MatVec Multiply
•Short-term goal:
– Evaluate the language and compilers using small applications
•Longer term: identify large applications
Sparse Matrix-Vector Multiply (T3E)
[Figure: Mflops vs. processors (1-32) for UPC + Prefetch, MPI (Aztec), UPC Bulk, and UPC Small]
•Shows the advantage of the T3E network model and UPC
•Performance on the Compaq machine is worse:
– Serial code
– Communication performance
– New compiler just released
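For reference, the kernel being benchmarked, independent of the UPC vs. MPI question, is the usual compressed-sparse-row product y = A*x. A minimal serial C version (my sketch, not the deck's benchmark code):

```c
#include <assert.h>

/* y = A*x with A stored in compressed sparse row (CSR) form:
 * rowptr[i] .. rowptr[i+1]-1 index the nonzeros of row i,
 * colind[k] is the column of nonzero k, val[k] its value. */
void csr_matvec(int nrows, const int *rowptr, const int *colind,
                const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];
        y[i] = sum;
    }
}
```

In the parallel versions, rows are partitioned across threads and the gathers of x[colind[k]] are what become fine-grained remote reads (UPC Small), prefetched reads, or bulk transfers (UPC Bulk).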
UPC versus MPI for Edge detection
[Figure: (a) execution time and (b) scalability, N=512, up to 20 processors, comparing UPC O1+O2, MPI, and linear speedup]
•Performance from the Cray T3E
•Benchmark developed by El Ghazawi's group at GWU
UPC versus MPI for Matrix Multiplication
[Figure: (a) execution time and (b) scalability, up to 20 processors, comparing UPC O1+O2, MPI, and linear speedup]
•Performance from the Cray T3E
•Benchmark developed by El Ghazawi's group at GWU
Implementing UPC
•UPC extensions to C are small
– < 1 person-year to implement in an existing compiler
•Simplest approach
– Reads and writes of shared pointers become small-message puts/gets
– UPC has a "relaxed" keyword for nonblocking communication
– Small-message performance is key
•Advanced optimizations include conversion to bulk communication by either
– The application programmer
– The compiler
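The conversion to bulk communication might look like the following sketch (UPC syntax; upc_memget is the standard UPC bulk-get routine, while the array names and sizes are mine):

```c
#include <upc.h>

#define N 1024
shared double remote[N];         /* cyclic layout across threads */

void fetch(double *local)
{
    /* Fine-grained: each iteration may become a separate
     * small-message remote get. */
    for (int i = 0; i < N; i++)
        local[i] = remote[i];

    /* Bulk: one upc_memget replaces many element-wise reads.
     * A upc_memget source must live on a single thread, so for this
     * cyclic array it copies the N/THREADS elements with affinity
     * to thread 0 (remote[0], remote[THREADS], ...). */
    upc_memget(local, remote, (N / THREADS) * sizeof(double));
}
```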
Overview of NERSC Compiler
1) Compiler
– Portable compiler infrastructure (UPC -> C)
– Explore optimizations: communication, shared pointers
– Based on Open64; plan to release sources
2) Runtime systems for multiple compilers
– Allow use by other languages (Titanium and CAF)
– And in other UPC compilers, e.g., Intrepid
– Performance of small-message put/get is key
– Designed to be easily ported, then tuned
– Also designed for low overhead (macros, inline functions)
Compiler and Runtime Status
•Basic parsing and type-checking are complete
•Generates code for small serial kernels
– Still testing and debugging
– Needs the runtime for complete testing
•UPC runtime layer
– Initial implementation should be done this month
– Based on processes (not threads) on GASNet
•GASNet
– Initial specification complete
– Reference implementation done on MPI
– Working on Quadrics and IBM (LAPI…)
Benchmarks for GAS Languages
• EEL – end-to-end latency, or time spent sending a short message between two processes
• BW – large-message network bandwidth
• Parameters of the LogP model:
– L – "latency," or time spent on the network
• During this time, the processor can be doing other work
– o – "overhead," or processor busy time on the sending or receiving side
• During this time, the processor cannot be doing other work
• We distinguish between "send" and "recv" overhead
– g – "gap," the rate at which messages can be pushed onto the network
– P – the number of processors
LogP Parameters: Overhead & Latency
•Non-overlapping overhead: EEL = osend + L + orecv
•When send and recv overhead can overlap: EEL = f(osend, L, orecv)
[Figure: timelines for P0 and P1 showing osend, network latency L, and orecv, in the non-overlapping and overlapping cases]
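The two cases can be written down directly. A small C sketch of the model (the overlapping case is shown as a lower bound, since f depends on how much of osend and orecv the hardware can actually hide):

```c
#include <assert.h>

/* End-to-end latency when send and receive overhead cannot overlap
 * with the network time: the message pays all three in sequence. */
double eel_nonoverlapping(double o_send, double latency, double o_recv)
{
    return o_send + latency + o_recv;
}

/* Lower bound on EEL when overheads can fully overlap with the
 * network time: at best only the longest component is visible. */
double eel_overlapping_bound(double o_send, double latency, double o_recv)
{
    double m = o_send;
    if (latency > m) m = latency;
    if (o_recv > m) m = o_recv;
    return m;
}
```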
Benchmarks
• Designed to measure the network parameters
– Also provide: gap as a function of queue depth
– Measured for the "best case" in general
• Implemented once in MPI
– For portability and comparison to the target-specific layer
• Implemented again in the target-specific communication layer:
– LAPI
– ELAN
– GM
– SHMEM
– VIPL
Results: EEL and Overhead
[Figure: EEL breakdown in usec (send overhead alone, send & recv overhead, recv overhead alone, added latency) across T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL]
Results: Gap and Overhead
[Figure: gap, send overhead, and receive overhead in usec across the same platforms; gap values range from 0.2 to 17.8 usec]
Send Overhead Over Time
• Overhead has not improved significantly; the T3D was best
– Lack of integration; lack of attention in software
[Figure: send overhead (usec) vs. year (1990-2002) for machines including CM5, Meiko, Paragon, NCube/2, T3D, T3E, Cenju4, SCI, Dolphin, Myrinet, Myrinet2K, SP3, and Compaq]
Summary
•Global address space languages offer an alternative to MPI for large machines
– Easier to use: shared data structures
– Recover users left behind on shared memory?
– Performance tuning is still possible
•Implementation
– Small compiler effort given lightweight communication
– Portable communication layer: GASNet
– Difficulty with small-message performance on the IBM SP platform