26
Getting Multicore Performance with UPC Yili Zheng Lawrence Berkeley National Lab

Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Getting Multicore Performance with UPC

Yili Zheng

Lawrence Berkeley National Lab

Page 2: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Berkeley UPC Group

• PI: Katherine Yelick

• Group members: Filip Blagojevic, Dan Bonachea, Paul Hargrove, Costin Iancu, Seung-Jai Min, Yili Zheng

• Former members: Christian Bell, Wei Chen, Jason Duell, Parry Husbands, Rajesh Nishtala , Mike Welcome

• A joint project of LBNL and UC Berkeley

2/26/2010 2SIAM PP 10 -- Getting Multicore Performance with UPC

Page 3: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Outline

• Introduction of PGAS and UPC

• UPC examples

• UPC on shared-memory machines

• Auto-tuned multi-threaded Collectives

• Scheduling and load balancing

• Performance tuning

2/26/2010 3SIAM PP 10 -- Getting Multicore Performance with UPC

Page 4: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Features in Computer Architectures

• Many cores on a chip, multi-sockets in a node– Global address space– May not be cache coherent

• Non-Uniform Memory Access– Multi-level memory hierarchies– Private vs. shared– Local vs. remote

• Hybrid systems– Heterogeneous processors– Separate memory systems

2/26/2010 4SIAM PP 10 -- Getting Multicore Performance with UPC

Page 5: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Partitioned Global Address Space

Thread 1 Thread 2 Thread 3 Thread 4

Global data view abstraction for productivity Vertical partitions among threads for locality control Horizontal partitions between shared and private

segments for data placement optimizations Friendly to non-coherent cache architecture

Private Segment

Shared Segment

Private Segment

Shared Segment

Private Segment

Shared Segment

Private Segment

Shared Segment

2/26/2010 5SIAM PP 10 -- Getting Multicore Performance with UPC

Page 6: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

UPC Programming Models

SPMD Fork-Join

Synchronizations2/26/2010 6SIAM PP 10 -- Getting Multicore Performance with UPC

Page 7: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

UPC Pointers

int *p1; /* private pointer to local memory */

shared int *p2; /* private pointer to shared space */

int *shared p3; /* shared pointer to local memory */

shared int *shared p4; /* shared pointer to shared space */

Shared

Glo

bal

ad

dre

ss

spac

e

Private

p1:

Thread0 Thread1 Threadn

p2:

p1:

p2:

p1:

p2:

p3:

p4:

p3:

p4:

p3:

p4:

2/26/2010 7SIAM PP 10 -- Getting Multicore Performance with UPC

Page 8: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Multi-Dimensional Arrays

Shared-memory of Thread 1

Shared-memory of Thread 1

Shared-memory of Thread 2

Shared-memory of Thread 2

pointersA

Static 2-D array: shared [*] double A[M][N];

Dynamic 2-D array: shared [] double **A;

A[i]

A and pointers are private and replicated on all threads.

A[i][j]

2/26/2010 8SIAM PP 10 -- Getting Multicore Performance with UPC

Page 9: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

UPC Example of Jacobi

• Good spatial locality• Mostly local memory accesses• No explicit communication ghost-zone management

shared [ngrid*ngrid/THREADS] double u[ngrid][ngrid];

shared [ngrid*ngrid/THREADS] double unew[ngrid][ngrid];

shared [ngrid*ngrid/THREADS] double f[ngrid][ngrid];

upc_forall( int i=1; i<n; i++; &unew[i][0] ) {

for(int j=1; j<n; j++) {

utmp = 0.25 * (u[i+1][j] + u[i-1][j] + u[i][j+1] + u[i][j-1] -

h*h*f[i][j]); /* 5-point stencil */

unew[i][j] = omega * utmp + (1.0-omega)*u[i][j];

}

}

2/26/2010 9SIAM PP 10 -- Getting Multicore Performance with UPC

Page 10: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

UPC Example of Random Access

shared uint64 Table[TableSize]; /* cyclic distribution */

uint64 i, ran;

/* owner computes, iteration matches data distribution */

upc_forall (i = 0; i < TableSize; i++; i) Table[i] = i;

upc_barrier; /* synchronization */

ran = starts(NUPDATE / THREADS * MYTHREAD); /* ran. seed */

for (i = MYTHREAD; i < NUPDATE; i+=THREADS) /* SPMD */

{

ran = (ran << 1) ^ (((int64_1) ran < 0) ? POLY : 0);

Table[ran & (TableSize-1)] = Table[ran & (TableSize-1)] ^ ran;

}

upc_barrier; /* synchronization */

2/26/2010 10SIAM PP 10 -- Getting Multicore Performance with UPC

Page 11: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

UPC Parallel DGEMM

• Transfer data in large blocks (use upc_memcpy)

• Use optimized BLAS dgemm (e.g., Intel MKL)

• Use non-blocking collective communication if available (e.g., row and column broadcasts)

1

3

2

4

9

11

10

12

5

7

6

8

13

15

14

16

1

3

2

4

9

11

10

12

5

7

6

8

13

15

14

16

1

3

2

4

9

11

10

12

5

7

6

8

13

15

14

16

= X

2/26/2010 11SIAM PP 10 -- Getting Multicore Performance with UPC

Page 12: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Matrix-Multiplication on Cray XT4

0

1000

2000

3000

4000

0 50 100 150 200 250 300 350 400

GFl

op

s

Cores

DGEMM Peak

UPC (nonblocking collectives)

UPC (flat point-to-point)

UPC (blocking collectivs)

MPI / PBLAS

Matrix size: (8K X 8K doubles) per node

2/26/2010 12SIAM PP 10 -- Getting Multicore Performance with UPC

Page 13: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Berkeley UPC Software Stack

UPC-to-C Translator

UPC Applications

UPC Runtime

GASNet Communication Library

Network Driver and OS Libraries

Translated C code with Runtime Calls

Har

dw

are

Dep

en

dan

t Langu

age Dep

en

dan

t

2/26/2010 13SIAM PP 10 -- Getting Multicore Performance with UPC

Page 14: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Berkeley UPC Features

• Data transfer for complex data types (vector, indexed, stride)

• Non-blocking memory copy

• Point-to-point synchronization

• Remote atomic operations

• Active Messages

• Extension to UPC collectives

• Portable timers

2/26/2010 14SIAM PP 10 -- Getting Multicore Performance with UPC

Page 15: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Process vs. Threads

CPU CPU CPU CPU

Physical Shared-memory Virtual Address Space

Map UPC threads to Processes Map UPC threads to Pthreads

2/26/2010 15SIAM PP 10 -- Getting Multicore Performance with UPC

Page 16: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Processes with Shared Memory

0

500

1000

1500

2000

2500

3000

8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K

Ban

dw

idth

(M

B/s

)

Data Chunk Size (in Bytes)

Aggregate Bandwidth

(32,1) (32,2) (32,4)

(32,8) (32,16)

Figure from Filip Blagojevic

(#UPC Threads, #Pthreads)

2/26/2010 16SIAM PP 10 -- Getting Multicore Performance with UPC

Page 17: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Process vs. Threads

0%

10%

20%

30%

40%

50%

60%

70%

(32,2) (32,4) (32,8) (32,16)

(#UPC Threads, #Pthreads)

GUPS Speedup over (32,1)

0

0.05

0.1

0.15

0.2

0.25

0.3

(32,2) (32,4) (32,8) (32,16)

(#UPC Threads, #Pthreads)

PSEARCH Speedup over (32,1)

Figures from Filip Blagojevic 2/26/2010 17SIAM PP 10 -- Getting Multicore Performance with UPC

Page 18: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Multi-threaded Collective Communication

• Enhance both productivity and performance

• Performance Auto-tuning– Offline tuning for platform common characteristics

– Online tuning Optimize for application runtime characteristics

• Multi-threaded implementation

– Lower context switching overhead

– Faster shared data access

2/26/2010 18SIAM PP 10 -- Getting Multicore Performance with UPC

Page 19: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

best root: 4

Barrier on AMD Opteron (32 cores)

0

500

1000

1500

2000

2500

3000

packed spread rand

Bar

rie

r Ex

ecu

tio

n T

ime

(n

s)

Thread Layout

Thread 0 max minbest root: 24

best root: 18

Figure from Rajesh Nishtala

2/26/2010 19SIAM PP 10 -- Getting Multicore Performance with UPC

Page 20: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Broadcast on Sun Niagara2 (128 threads)

Figure from Rajesh Nishtala

2/26/2010 20SIAM PP 10 -- Getting Multicore Performance with UPC

Page 21: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Scheduling and Load Balancing

• Over-subscription– Run more logical threads than physical cores

– Moderate performance improvement if synchronization intervals are not too small

• Speed balancing– User-level thread scheduling based on thread progress

– Better system throughput in shared environments

• Cooperative thread scheduling– Good for event-driven type of applications

– Accelerators, e.g., CELL processor

2/26/2010 21SIAM PP 10 -- Getting Multicore Performance with UPC

Page 22: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

NPB UPC on Intel (16 cores)

Oversubscription on Multicore Processors, Costin Iancu, Steven Hofmeyr, Yili Zheng and Filip Blagojevic. IPDPS 2010.

2/26/2010 22SIAM PP 10 -- Getting Multicore Performance with UPC

Page 23: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

NPB UPC on AMD (16 cores)

2/26/2010 23SIAM PP 10 -- Getting Multicore Performance with UPC

Page 24: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Tips for UPC Programming

• Coarsen synchronization intervals• Hierarchically map UPC threads to OS processes

and Pthreads• Pin processes and threads to cores to minimize

migration cost• Take advantage of data locality in the application

level• Overlapping communication and computation• Use tuned math libraries, e.g., AMD ACML, IBM

ESSL, Intel MKL

2/26/2010 24SIAM PP 10 -- Getting Multicore Performance with UPC

Page 25: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Tools for Debugging and Tuning UPC Applications on Multi-core Systems

• Same as other multi-process and multi-thread programs

– Open Source tools: PAPI, TAU, Valgrind

– Commercial tools, e.g. Intel VTUNE, TotalView

• BUPC tracing tool for analyzing the communication behavior of UPC programs

• Parallel Performance Wizard (PPW)

– http://ppw.hcs.ufl.edu/

2/26/2010 25SIAM PP 10 -- Getting Multicore Performance with UPC

Page 26: Getting Multicore Performance with UPC...GUPS Speedup over (32,1) 0 0.05 0.1 0.15 0.2 0.25 0.3 (32,2) (32,4) (32,8) (32,16) (#UPC Threads, #Pthreads) PSEARCH Speedup over (32,1) Figures

Summary

• Global address space improves productivity

• Data partitioning enables performance optimizations

• Interoperable with other programming models and languages including MPI, FORTRAN, C++

• Growing UPC community with actively developed and maintained software implementations

– Berkeley UPC and GASNet: http://upc.lbl.gov

– Other UPC compilers: Cray UPC, GCC UPC, HP UPC, IBM UPC, MTU UPC

2/26/2010 26SIAM PP 10 -- Getting Multicore Performance with UPC