TORSTEN HOEFLER
Progress in automatic GPU compilation and why you want to run MPI on your GPU
with Tobias Grosser and Tobias Gysi @ SPCL
presented at CCDSC, Lyon, France, 2016


Page 1

TORSTEN HOEFLER

Progress in automatic GPU compilation and why you want to run MPI on your GPU

with Tobias Grosser and Tobias Gysi @ SPCL

presented at CCDSC, Lyon, France, 2016

Page 2

#pragma ivdep
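For readers who have not seen it: #pragma ivdep (supported by icc and several other compilers) asserts that the loop it precedes has none of the loop-carried dependences the compiler would otherwise have to assume. A minimal, hypothetical C sketch — function and parameter names are illustrative, not from the talk:

/* The compiler cannot know the sign of k, so it must assume that the read
 * of a[i + k] may depend on a store from an earlier iteration and refuses
 * to vectorize.  #pragma ivdep tells it to ignore the assumed (unproven)
 * dependence -- the programmer guarantees that k >= 0 here. */
void ignore_vec_dep(double *a, const double *b, int n, int k)
{
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        a[i] = a[i + k] + b[i];
}

The point of the slide is exactly this burden: the programmer, not the compiler, has to certify independence.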

Page 3

!$ACC DATA &
!$ACC PRESENT(density1,energy1) &
!$ACC PRESENT(vol_flux_x,vol_flux_y,volume,mass_flux_x,mass_flux_y,vertexdx,vertexdy) &
!$ACC PRESENT(pre_vol,post_vol,ener_flux)
!$ACC KERNELS
  IF(dir.EQ.g_xdir) THEN
    IF(sweep_number.EQ.1) THEN
!$ACC LOOP INDEPENDENT
      DO k=y_min-2,y_max+2
!$ACC LOOP INDEPENDENT
        DO j=x_min-2,x_max+2
          pre_vol(j,k)=volume(j,k)+(vol_flux_x(j+1,k)-vol_flux_x(j,k)+vol_flux_y(j,k+1)-vol_flux_y(j,k))
          post_vol(j,k)=pre_vol(j,k)-(vol_flux_x(j+1,k)-vol_flux_x(j,k))
        ENDDO
      ENDDO
    ELSE
!$ACC LOOP INDEPENDENT
      DO k=y_min-2,y_max+2
!$ACC LOOP INDEPENDENT
        DO j=x_min-2,x_max+2
          pre_vol(j,k)=volume(j,k)+vol_flux_x(j+1,k)-vol_flux_x(j,k)
          post_vol(j,k)=volume(j,k)
        ENDDO
      ENDDO
    ENDIF

Page 4

Heitlager et al.: A Practical Model for Measuring Maintainability

Page 5

!$ACC DATA &
!$ACC COPY(chunk%tiles(1)%field%density0) &
!$ACC COPY(chunk%tiles(1)%field%density1) &
!$ACC COPY(chunk%tiles(1)%field%energy0) &
!$ACC COPY(chunk%tiles(1)%field%energy1) &
!$ACC COPY(chunk%tiles(1)%field%pressure) &
!$ACC COPY(chunk%tiles(1)%field%soundspeed) &
!$ACC COPY(chunk%tiles(1)%field%viscosity) &
!$ACC COPY(chunk%tiles(1)%field%xvel0) &
!$ACC COPY(chunk%tiles(1)%field%yvel0) &
!$ACC COPY(chunk%tiles(1)%field%xvel1) &
!$ACC COPY(chunk%tiles(1)%field%yvel1) &
!$ACC COPY(chunk%tiles(1)%field%vol_flux_x) &
!$ACC COPY(chunk%tiles(1)%field%vol_flux_y) &
!$ACC COPY(chunk%tiles(1)%field%mass_flux_x) &
!$ACC COPY(chunk%tiles(1)%field%mass_flux_y) &
!$ACC COPY(chunk%tiles(1)%field%volume) &
!$ACC COPY(chunk%tiles(1)%field%work_array1) &
!$ACC COPY(chunk%tiles(1)%field%work_array2) &
!$ACC COPY(chunk%tiles(1)%field%work_array3) &
!$ACC COPY(chunk%tiles(1)%field%work_array4) &
!$ACC COPY(chunk%tiles(1)%field%work_array5) &
!$ACC COPY(chunk%tiles(1)%field%work_array6) &
!$ACC COPY(chunk%tiles(1)%field%work_array7) &
!$ACC COPY(chunk%tiles(1)%field%cellx) &
!$ACC COPY(chunk%tiles(1)%field%celly) &
!$ACC COPY(chunk%tiles(1)%field%celldx) &
!$ACC COPY(chunk%tiles(1)%field%celldy) &
!$ACC COPY(chunk%tiles(1)%field%vertexx) &
!$ACC COPY(chunk%tiles(1)%field%vertexdx) &
!$ACC COPY(chunk%tiles(1)%field%vertexy) &
!$ACC COPY(chunk%tiles(1)%field%vertexdy) &
!$ACC COPY(chunk%tiles(1)%field%xarea) &
!$ACC COPY(chunk%tiles(1)%field%yarea) &
!$ACC COPY(chunk%left_snd_buffer) &
!$ACC COPY(chunk%left_rcv_buffer) &
!$ACC COPY(chunk%right_snd_buffer) &
!$ACC COPY(chunk%right_rcv_buffer) &
!$ACC COPY(chunk%bottom_snd_buffer) &
!$ACC COPY(chunk%bottom_rcv_buffer) &
!$ACC COPY(chunk%top_snd_buffer) &
!$ACC COPY(chunk%top_rcv_buffer)

SLOCCount (*.f90): 6,440 lines
!$ACC directive lines: 833 (13%)

Page 6

Page 7

do i = 0, N
  do j = 0, i
    y(i,j) = ( y(i,j) + y(i,j+1) )/2
  end do
end do
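For context (my annotation, not on the slide): a polyhedral optimizer such as Polly represents this loop nest by its iteration domain and access functions, roughly:

\mathcal{D}_S = \{\, (i,j) \in \mathbb{Z}^2 \mid 0 \le i \le N,\ 0 \le j \le i \,\}
% statement S: y(i,j) = ( y(i,j) + y(i,j+1) ) / 2
\text{write:}\quad S(i,j) \rightarrow y(i,j)
\text{reads:}\quad S(i,j) \rightarrow y(i,j),\quad S(i,j) \rightarrow y(i,j+1)

From such a description the dependences, parallelism, and GPU mapping can be derived automatically, which is the premise of the Polly-ACC results that follow.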

Page 8

Some results: Polybench 3.2

[Chart: speedup over icc -O3; arithmetic mean ~30x, geometric mean ~6x]
Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16

Page 9

Compiles all of SPEC CPU 2006 – Example: LBM

[Chart: LBM runtime (m:s) on the Mobile and Workstation systems; series: icc, icc -openmp, clang, Polly ACC; annotated speedups: ~20% and ~4x]

Workstation: Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
Mobile: essentially my 4-core x86 laptop with the (free) GPU that's in there

T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16

Page 10

Page 11

GPU latency hiding vs. MPI

[Diagram: per-thread sequences of load (ld) and store (st) instructions on the device compute cores; legend: active thread, instruction latency]

CUDA:
• over-subscribe hardware
• use spare parallel slack for latency hiding

MPI:
• host controlled
• full device synchronization

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
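To make the MPI bullets above concrete, here is a minimal hypothetical sketch of the traditional MPI-CUDA pattern (the kernel, buffer offsets, and neighbor ranks are placeholders, and a CUDA-aware MPI that accepts device pointers is assumed; this is not code from the talk): every halo exchange forces a full device synchronization, so no block can compute while the host communicates.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void stencil_kernel(double *out, const double *in, int nx, int ny);  /* hypothetical */

void mpi_cuda_time_loop(double *d_in, double *d_out, int nx, int ny,
                        int halo_len, int send_off, int recv_off,
                        int left, int right, int steps)
{
    for (int step = 0; step < steps; ++step) {
        stencil_kernel<<<(nx * ny + 255) / 256, 256>>>(d_out, d_in, nx, ny);
        cudaDeviceSynchronize();                           /* full device synchronization */
        MPI_Sendrecv(d_out + send_off, halo_len, MPI_DOUBLE, right, 0,
                     d_out + recv_off, halo_len, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);   /* host-controlled halo exchange */
        double *tmp = d_in; d_in = d_out; d_out = tmp;     /* swap buffers for the next step */
    }
}

dCUDA, introduced on the next slides, instead issues put/notify operations from inside the kernel so that other blocks keep computing while the transfer is in flight.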

Page 12

Hardware latency hiding at the cluster level?

[Diagram: per-thread sequences of load (ld), store (st), and remote put instructions on the device compute cores; legend: device, compute core, active thread, instruction latency]

dCUDA (distributed CUDA):
• unified programming model for GPU clusters
• avoid unnecessary device synchronization to enable system-wide latency hiding

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 13

dCUDA: MPI-3 RMA extensions

for (int i = 0; i < steps; ++i) {
  // computation
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1] +
               in[idx + jstride] + in[idx - jstride];

  // communication
  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out);
  swap(win, wout);
}

• iterative stencil kernel
• thread-specific idx

• map ranks to blocks
• device-side put/get operations
• notifications for synchronization
• shared and distributed memory

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 14

Hardware supported communication overlap

[Diagram: traditional MPI-CUDA vs. dCUDA — timelines of active blocks 1–8 on the device compute cores]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 15

The dCUDA runtime system

[Diagram: dCUDA runtime system — host-side: MPI, context, block manager, event handler; device-side: device library; connected through logging, command, ack, and notification queues (repeated for more blocks)]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 16

Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

(Very) simple stencil benchmark

[Chart: execution time (ms) vs. number of copy iterations per exchange (30, 60, 90); series: compute & exchange, compute only, halo exchange; annotation: no overlap]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 17

Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Real stencil (COSMO weather/climate code)

[Chart: execution time (ms) vs. number of nodes (2, 4, 6, 8); series: dCUDA, MPI-CUDA, halo exchange]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 18

Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Particle simulation code (Barnes Hut)

[Chart: execution time (ms) vs. number of nodes (2, 4, 6, 8); series: dCUDA, MPI-CUDA, halo exchange]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 19

Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node

Sparse matrix-vector multiplication

[Chart: execution time (ms) vs. number of nodes (1, 4, 9); series: dCUDA, MPI-CUDA, communication]

T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

Page 20

for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1] +
               in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out);
  swap(win, wout);
}

http://spcl.inf.ethz.ch/Polly-ACC
• Automatic
• "Regression Free"
• High Performance

dCUDA – distributed memory
• Automatic Overlap
• High Performance

Page 21

LLVM Nightly Test Suite

[Chart (log scale, 1 to 10,000): counts of SCoPs and of 0-dim, 1-dim, 2-dim, and 3-dim regions, with and without heuristics]

Page 22

Cactus ADM (SPEC 2006)

[Chart: runtime on the Workstation and Mobile systems]

Page 23

Evading various “ends” – the hardware view