Getting Multicore Performance with UPC
Yili Zheng
Lawrence Berkeley National Lab
Berkeley UPC Group
• PI: Katherine Yelick
• Group members: Filip Blagojevic, Dan Bonachea, Paul Hargrove, Costin Iancu, Seung-Jai Min, Yili Zheng
• Former members: Christian Bell, Wei Chen, Jason Duell, Parry Husbands, Rajesh Nishtala, Mike Welcome
• A joint project of LBNL and UC Berkeley
2/26/2010, SIAM PP 10 -- Getting Multicore Performance with UPC
Outline
• Introduction of PGAS and UPC
• UPC examples
• UPC on shared-memory machines
• Auto-tuned multi-threaded Collectives
• Scheduling and load balancing
• Performance tuning
Features in Computer Architectures
• Many cores on a chip, multi-sockets in a node
– Global address space
– May not be cache coherent
• Non-Uniform Memory Access
– Multi-level memory hierarchies
– Private vs. shared
– Local vs. remote
• Hybrid systems
– Heterogeneous processors
– Separate memory systems
Partitioned Global Address Space
• Global data view abstraction for productivity
• Vertical partitions among threads for locality control
• Horizontal partitions between shared and private segments for data placement optimizations
• Friendly to non-coherent cache architectures
[Figure: Threads 1 to 4, each with a Private Segment and a Shared Segment]
UPC Programming Models
[Figure: SPMD and Fork-Join execution models, with synchronizations marked]
UPC Pointers
int *p1; /* private pointer to local memory */
shared int *p2; /* private pointer to shared space */
int *shared p3; /* shared pointer to local memory */
shared int *shared p4; /* shared pointer to shared space */
[Figure: global address space for Thread0, Thread1, ..., Threadn. Each thread holds its own p1 and p2 in the Private region; p3 and p4 sit in the Shared region.]
Multi-Dimensional Arrays
Static 2-D array: shared [*] double A[M][N];
Dynamic 2-D array: shared [] double **A;
A and pointers are private and replicated on all threads.
[Figure: A refers to an array of pointers; each A[i] refers to a row of elements A[i][j] held in the shared memory of Thread 1 or Thread 2]
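The layout qualifiers in the declarations above decide which thread has affinity to each element. As a minimal sketch in plain C (not UPC), the rule from the UPC language specification is that linearized element i of an array with block size B belongs to thread (i / B) mod THREADS, and `[*]` picks B as the element count divided by THREADS, rounded up. The helper names `star_block_size` and `owner_of` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: block size chosen by the [*] layout qualifier,
 * i.e. the element count divided by the thread count, rounded up. */
size_t star_block_size(size_t total_elems, size_t threads)
{
    assert(threads > 0);
    return (total_elems + threads - 1) / threads;
}

/* Hypothetical helper: thread with affinity to linearized element `index`
 * of a shared array with block size `block`. A block size of 1 models the
 * default cyclic layout. */
size_t owner_of(size_t index, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (index / block) % threads;
}
```

For example, with M = N = 4 and 4 threads, `star_block_size(16, 4)` is 4, so each row of A lands on one thread and `owner_of(1 * 4 + 2, 4, 4)` places A[1][2] on thread 1.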
UPC Example of Jacobi
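• Good spatial locality
• Mostly local memory accesses
• No explicit communication or ghost-zone management
shared [ngrid*ngrid/THREADS] double u[ngrid][ngrid];
shared [ngrid*ngrid/THREADS] double unew[ngrid][ngrid];
shared [ngrid*ngrid/THREADS] double f[ngrid][ngrid];
double utmp; /* private scratch value, one copy per thread */
/* n is not defined on the slide; it is assumed to bound the interior
   points, so that i+1 and j+1 stay inside the grid */
/* owner computes: each thread runs the rows with affinity &unew[i][0] */
upc_forall (int i = 1; i < n; i++; &unew[i][0]) {
  for (int j = 1; j < n; j++) {
    utmp = 0.25 * (u[i+1][j] + u[i-1][j] + u[i][j+1] + u[i][j-1] -
                   h*h*f[i][j]); /* 5-point stencil */
    unew[i][j] = omega * utmp + (1.0-omega)*u[i][j];
  }
}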
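To make the stencil arithmetic easy to check in isolation, here is a serial sketch in plain C of one sweep of the loop above. It is not the UPC version: the grids are flat row-major arrays, the function name `jacobi_sweep` is hypothetical, and every interior row is updated by the caller rather than by the owning thread.

```c
#include <assert.h>
#include <stddef.h>

/* One weighted sweep over the interior of an n-by-n grid stored row-major
 * in flat arrays. Serial stand-in for the upc_forall loop: in UPC each
 * thread would execute only the rows it has affinity to. */
void jacobi_sweep(size_t n, const double *u, double *unew,
                  const double *f, double h, double omega)
{
    assert(n >= 3); /* need at least one interior point */
    for (size_t i = 1; i + 1 < n; i++) {
        for (size_t j = 1; j + 1 < n; j++) {
            /* 5-point stencil around (i, j) */
            double utmp = 0.25 * (u[(i + 1) * n + j] + u[(i - 1) * n + j] +
                                  u[i * n + j + 1] + u[i * n + j - 1] -
                                  h * h * f[i * n + j]);
            unew[i * n + j] = omega * utmp + (1.0 - omega) * u[i * n + j];
        }
    }
}
```

Boundary entries of `unew` are left untouched, which matches the interior-only loop bounds on the slide. A constant grid with a zero right-hand side is a fixed point of the sweep, which makes a convenient sanity check.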
UPC Example of Random Access
shared uint64 Table[TableSize]; /* cyclic distribution */
uint64 i, ran;
/* owner computes, iteration matches data distribution */
upc_forall (i = 0; i < TableSize; i++; i) Table[i] = i;
upc_barrier; /* synchronization */
ran = starts(NUPDATE / THREADS * MYTHREAD); /* random seed */
for (i = MYTHREAD; i < NUPDATE; i+=THREADS) /* SPMD */
{
ran = (ran << 1) ^ (((int64_t) ran < 0) ? POLY : 0);
Table[ran & (TableSize-1)] = Table[ran & (TableSize-1)] ^ ran;
}
upc_barrier; /* synchronization */
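The update loop above combines a shift-register random stream with an XOR into a power-of-two table. A minimal plain C sketch of those two steps follows. The value of `POLY` is an assumption (0x7, as in the HPCC RandomAccess benchmark); the slide does not state it. The function names `next_random` and `update_table` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: feedback polynomial of the HPCC RandomAccess benchmark.
 * The slide does not give the value of POLY. */
#define POLY UINT64_C(0x7)

/* Advance the pseudo-random stream by one step: shift left and fold POLY
 * back in when the top bit was set. Testing the top bit is equivalent to
 * the signed "< 0" comparison on the slide. */
uint64_t next_random(uint64_t ran)
{
    uint64_t feedback = (ran >> 63) ? POLY : 0;
    return (ran << 1) ^ feedback;
}

/* Apply one update to a table whose size is a power of two, so that the
 * mask (table_size - 1) selects a valid index. */
void update_table(uint64_t *table, uint64_t table_size, uint64_t ran)
{
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    table[ran & (table_size - 1)] ^= ran;
}
```

In the UPC version the table is shared and cyclically distributed, so most updates touch another thread's memory; the sketch only models the arithmetic, not the communication.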
UPC Parallel DGEMM
• Transfer data in large blocks (use upc_memcpy)
• Use optimized BLAS dgemm (e.g., Intel MKL)
• Use non-blocking collective communication if available (e.g., row and column broadcasts)
[Figure: three 4 x 4 block matrices, each with blocks numbered 1 to 16, arranged as one matrix = the product of the other two]
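The blocked structure of the parallel DGEMM can be sketched serially in plain C. This is an illustration under stated assumptions, not the implementation measured in the talk: the inner three loops stand in for the optimized BLAS dgemm call, and in a UPC version the tiles would first be fetched in bulk (for example with upc_memcpy) before that call. The function name `blocked_matmul` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n-by-n row-major matrices, processed in bs-by-bs tiles.
 * The caller zeroes C beforehand if a plain product is wanted. */
void blocked_matmul(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0 && n % bs == 0); /* tiles must cover the matrix exactly */
    for (size_t ib = 0; ib < n; ib += bs)
        for (size_t jb = 0; jb < n; jb += bs)
            for (size_t kb = 0; kb < n; kb += bs)
                /* tile update: stand-in for a BLAS dgemm on one tile */
                for (size_t i = ib; i < ib + bs; i++)
                    for (size_t k = kb; k < kb + bs; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jb; j < jb + bs; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

Working tile by tile is what makes the large transfers in the first bullet possible: each tile of A and B is moved once and then reused for a full tile of multiply-adds.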
Matrix-Multiplication on Cray XT4
[Figure: GFlops (0 to 4000) versus Cores (0 to 400). Series: DGEMM Peak, UPC (nonblocking collectives), UPC (flat point-to-point), UPC (blocking collectives), MPI / PBLAS. Matrix size: 8K x 8K doubles per node.]
Berkeley UPC Software Stack
[Figure: software stack layers: UPC Applications; UPC-to-C Translator; Translated C code with Runtime Calls; UPC Runtime; GASNet Communication Library; Network Driver and OS Libraries. Side labels: Language Dependent, Hardware Dependent.]
Berkeley UPC Features
• Data transfer for complex data types (vector, indexed, strided)
• Non-blocking memory copy
• Point-to-point synchronization
• Remote atomic operations
• Active Messages
• Extension to UPC collectives
• Portable timers
Process vs. Threads
[Figure: four CPUs shown for two mappings: "Map UPC threads to Processes" and "Map UPC threads to Pthreads". Labels: Physical Shared-memory, Virtual Address Space.]
Processes with Shared Memory
[Figure: Aggregate Bandwidth (MB/s, 0 to 3000) versus Data Chunk Size (8 bytes to 128K bytes) for (#UPC Threads, #Pthreads) = (32,1), (32,2), (32,4), (32,8), (32,16). Figure from Filip Blagojevic.]
Process vs. Threads
[Figure: GUPS Speedup over (32,1), axis 0% to 70%, for (#UPC Threads, #Pthreads) = (32,2), (32,4), (32,8), (32,16)]
[Figure: PSEARCH Speedup over (32,1), axis 0 to 0.3, for (#UPC Threads, #Pthreads) = (32,2), (32,4), (32,8), (32,16)]
Figures from Filip Blagojevic
Multi-threaded Collective Communication
• Enhance both productivity and performance
• Performance auto-tuning
– Offline tuning: optimize for common platform characteristics
– Online tuning: optimize for application runtime characteristics
• Multi-threaded implementation
– Lower context switching overhead
– Faster shared data access
Barrier on AMD Opteron (32 cores)
[Figure: Barrier Execution Time (ns, 0 to 3000) by Thread Layout (packed, spread, rand). Series: Thread 0, max, min. Annotations: best root: 4, best root: 24, best root: 18. Figure from Rajesh Nishtala.]
Broadcast on Sun Niagara2 (128 threads)
Figure from Rajesh Nishtala
Scheduling and Load Balancing
• Over-subscription
– Run more logical threads than physical cores
– Moderate performance improvement if synchronization intervals are not too small
• Speed balancing
– User-level thread scheduling based on thread progress
– Better system throughput in shared environments
• Cooperative thread scheduling
– Good for event-driven applications
– Accelerators, e.g., CELL processor
NPB UPC on Intel (16 cores)
Oversubscription on Multicore Processors, Costin Iancu, Steven Hofmeyr, Yili Zheng and Filip Blagojevic. IPDPS 2010.
NPB UPC on AMD (16 cores)
Tips for UPC Programming
• Coarsen synchronization intervals
• Hierarchically map UPC threads to OS processes and Pthreads
• Pin processes and threads to cores to minimize migration cost
• Take advantage of data locality at the application level
• Overlap communication and computation
• Use tuned math libraries, e.g., AMD ACML, IBM ESSL, Intel MKL
Tools for Debugging and Tuning UPC Applications on Multi-core Systems
• Same as other multi-process and multi-thread programs
– Open Source tools: PAPI, TAU, Valgrind
– Commercial tools, e.g. Intel VTUNE, TotalView
• BUPC tracing tool for analyzing the communication behavior of UPC programs
• Parallel Performance Wizard (PPW)
– http://ppw.hcs.ufl.edu/
Summary
• Global address space improves productivity
• Data partitioning enables performance optimizations
• Interoperable with other programming models and languages including MPI, FORTRAN, C++
• Growing UPC community with actively developed and maintained software implementations
– Berkeley UPC and GASNet: http://upc.lbl.gov
– Other UPC compilers: Cray UPC, GCC UPC, HP UPC, IBM UPC, MTU UPC