Lecture 10: Floating point (continued); Stencil methods

Page 1:

Lecture 10

Floating point (continued) Stencil methods

Page 2:

Announcements
•  Mac Mini lab (APM 2402)
   –  Tuesday, 4pm to 6pm
•  A3 will be posted on Friday
•  Sign on to bang and run the Basic example interactively and with batch

Page 3:

Projects!
•  Stencil method in 3 dimensions
•  Multigrid
•  Communication-avoiding matrix multiplication (MPI)
•  Algorithm-based fault tolerance (MPI)
•  3D Fast Fourier Transform (MPI or CUDA)
•  Particle simulation (MPI)
•  Groups of 3 will do a more ambitious project
   –  MPI projects can add communication overlap
   –  MPI + CUDA
•  Propose your own
•  Make your choice by 11/9

www-cse.ucsd.edu/classes/fa12/cse260-b/Projects/ProjectList.html

Page 4:

Today’s lecture
•  More about floating point
•  Stencil methods on the GPU
   –  2D
   –  3D

Page 5:

IEEE Floating point standard P754
•  Normalized representation: ±1.d…d × 2^exp
   –  Macheps = machine epsilon = ε = 2^(−#significand bits) = relative error in each operation
   –  OV = overflow threshold = largest number
   –  UN = underflow threshold = smallest number
•  ±Zero: sign bit ±, significand and exponent all 0

Format      # bits   # significand bits   macheps               # exponent bits   exponent range
Single      32       23+1                 2^-24  (~10^-7)       8                 2^-126 … 2^127     (~10^±38)
Double      64       52+1                 2^-53  (~10^-16)      11                2^-1022 … 2^1023   (~10^±308)
Extended    ≥80      ≥64                  ≤2^-64 (~10^-19)      ≥15               2^-16382 … 2^16383 (~10^±4932)

Jim Demmel
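A quick sanity check of macheps on any machine (a minimal C sketch added for illustration; float.h's DBL_EPSILON is the ulp of 1.0, one power of two larger than the 2^-53 rounding bound above):

#include <stdio.h>
#include <float.h>

int main(void) {
    // Halve eps until 1 + eps rounds back to 1; volatile keeps an
    // extended-precision register from hiding the rounding.
    double eps = 1.0;
    volatile double sum = 2.0;
    while (sum > 1.0) {
        eps /= 2.0;
        sum = 1.0 + eps;
    }
    // Loop exits at eps = 2^-53, the relative-error bound (macheps above);
    // DBL_EPSILON = 2^-52 is the spacing between 1.0 and the next double.
    printf("macheps = %g, DBL_EPSILON = %g\n", eps, DBL_EPSILON);
    return 0;
}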

Page 6:

Denormalized numbers
•  Compute: if (a ≠ b) then x = a/(a−b)
•  We should never divide by 0, even if a−b is tiny
•  An underflow exception occurs when the exact result a−b is below the underflow threshold UN
•  Return a denormalized number for a−b
   –  Relax the restriction that the leading digit is 1: ±0.d…d × 2^min_exp
   –  Reserve the smallest exponent value; lose a set of small normalized numbers
   –  Fill in the gap between 0 and UN with a uniform distribution of values
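A small C illustration of gradual underflow (my sketch, not from the slides): a − b falls below UN yet remains nonzero as a denormalized value, so the guarded division stays safe:

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    // a - b is below the underflow threshold UN (DBL_MIN) but not zero
    double a = DBL_MIN, b = DBL_MIN / 2.0;   // b itself is already subnormal
    double diff = a - b;
    printf("diff = %g, subnormal? %d\n", diff, fpclassify(diff) == FP_SUBNORMAL);
    if (a != b)   // the test from the slide: safe because diff != 0
        printf("a/(a-b) = %g\n", a / (a - b));   // finite answer: 2
    return 0;
}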

Page 7:

Anomalous behavior
•  Floating point arithmetic is not associative:
   (x + y) + z ≠ x + (y + z)
•  The distributive law doesn’t always hold; these expressions have different values when y ≈ z:
   x*y − x*z ≠ x*(y − z)
•  Optimizers can’t reason about floating point
•  If we compute a quantity in extended precision (80 bits), we lose digits when we store to memory, so afterwards y ≠ x:

   float x, y = …, z = …;
   x = y + z;   // may be kept in an 80-bit register
   y = x;       // stored to 32-bit memory, losing digits
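A concrete instance of non-associativity in C (values of my choosing): 1.0 is only half an ulp at 10^16, so it survives only if the two big terms cancel first:

#include <stdio.h>

int main(void) {
    double x = 1.0e16, y = -1.0e16, z = 1.0;
    printf("(x + y) + z = %g\n", (x + y) + z);   // prints 1
    printf("x + (y + z) = %g\n", x + (y + z));   // prints 0: z is absorbed,
                                                 // 1e16 + 1 rounds to 1e16
    return 0;
}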

Page 8:

NaN (Not a Number)
•  Invalid exception
   –  The exact result is not a well-defined real number: 0/0, √−1
•  NaN op number = NaN
•  We can have a quiet NaN or a signaling NaN (sNaN)
   –  Quiet: does not raise an exception, but propagates a distinguished value
      •  E.g. missing data: max(3, NaN) = 3
   –  Signaling: generates an exception when accessed
      •  Detects uninitialized data

Page 9:

Exception handling
•  Each of the 5 exceptions manipulates 2 flags
•  Sticky flag: set by an exception; can be read and cleared by the user
•  Exception flag: should a trap occur?
   –  If so, we can enter a trap handler
   –  But this requires precise interrupts, which causes problems on a parallel computer
•  We can use exception handling to build faster algorithms
   –  Try the faster but “riskier” algorithm
   –  Rapidly test for accuracy (possibly with the aid of exception handling)
   –  Substitute the slower, more stable algorithm as needed
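In C the sticky flags are visible through fenv.h (a minimal sketch of the read-and-clear protocol; no trap handler is installed here):

#include <stdio.h>
#include <fenv.h>

int main(void) {
    #pragma STDC FENV_ACCESS ON
    feclearexcept(FE_ALL_EXCEPT);        // start with all sticky flags clear
    volatile double zero = 0.0;
    volatile double r = 1.0 / zero;      // sets FE_DIVBYZERO, r = +inf
    if (fetestexcept(FE_DIVBYZERO))
        printf("divide-by-zero flag is sticky: r = %g\n", r);
    feclearexcept(FE_DIVBYZERO);         // the user clears it
    return 0;
}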

Page 10:

P754 on the GPU
•  CUDA Programming Guide (4.1): “All compute devices follow the IEEE 754-2008 standard for binary floating-point arithmetic with the following deviations”
   –  There is no mechanism for detecting that a floating-point exception has occurred, and all operations behave as if the exceptions are always masked… sNaNs … are handled as quiet
•  Compute capability 2.x: FFMA … is an IEEE 754-2008 compliant fused multiply-add instruction … the full-width product … is used in the addition and a single rounding occurs during generation of the final result
   –  rnd(A × A + B) with FFMA on 2.x vs. rnd(rnd(A × A) + B) with FMAD on 1.x
•  FFMA can avoid loss of precision during subtractive cancellation, i.e. when adding quantities of similar magnitude but opposite signs
•  Also see “Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs,” by N. Whitehead and A. Fit-Florea
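The fused vs. unfused difference can be seen directly with CUDA math intrinsics (a sketch; the inputs are my choice, picked so that a×a + b cancels almost completely):

#include <cstdio>

__global__ void fma_demo(float a, float b, float *out) {
    out[0] = fmaf(a, a, b);                  // rnd(a*a + b): single rounding
    out[1] = __fadd_rn(__fmul_rn(a, a), b);  // rnd(rnd(a*a) + b): two roundings
}

int main() {
    float h[2], *d;
    cudaMalloc(&d, 2 * sizeof(float));
    fma_demo<<<1, 1>>>(1.0f + 1e-4f, -1.0f, d);   // a*a + b ~ 2e-4
    cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("fused = %g, unfused = %g\n", h[0], h[1]);  // low bits differ
    cudaFree(d);
    return 0;
}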

Page 11:

Today’s lecture
•  More about floating point
•  Stencil methods on the GPU
   –  2D
   –  3D

Page 12:

The Aliev-Panfilov Method
•  Models signal propagation in cardiac tissue
   –  Demonstrates complex behavior of spiral waves that are known to cause life-threatening situations
•  Reaction-diffusion system
   –  Reactions are the cellular exchanges of certain ions across the cell membrane during the cellular electrical impulse
•  Our simulation has two state variables
   –  Transmembrane potential: e
   –  Recovery of the tissue: r

Page 13:

The Aliev-Panfilov Model
•  Two parts
   –  2 ordinary differential equations: kinetics of reactions occurring at every point in space
   –  A partial differential equation: spatial diffusion of reactants
•  First-order explicit numerical scheme
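For reference, the model written out (reconstructed from the solver code on the next slide, with $k, \mu_1, \mu_2$ standing for the code's kk, M1, M2; the code's α absorbs the diffusion constant, dt, and the mesh spacing):

\begin{align*}
\frac{\partial e}{\partial t} &= \delta\,\nabla^2 e \;-\; k\,e\,(e-a)(e-1) \;-\; e\,r \\
\frac{\partial r}{\partial t} &= \left(\varepsilon + \frac{\mu_1\, r}{e + \mu_2}\right)\bigl(-r - k\,e\,(e-b-1)\bigr)
\end{align*}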

Page 14:

Data Dependencies
•  ODE solver:
   –  No data dependencies; trivially parallelizable
   –  Requires a lot of registers to hold temporary variables
•  PDE solver:
   –  Jacobi update for the 5-point Laplacian operator
   –  Sweeps over a uniformly spaced mesh
   –  Updates voltage with weighted contributions from the 4 nearest neighbors

for (j = 1; j <= m+1; j++) {
    _DOUBLE_ *RR = &R[j][1], *EE = &E[j][1];
    for (i = 1; i <= n+1; i++, EE++, RR++) {
        // PDE solver: 5-point Laplacian
        EE[0] = E_p[j][i] + α*(E_p[j][i+1] + E_p[j][i-1] - 4*E_p[j][i]
                             + E_p[j+1][i] + E_p[j-1][i]);
        // ODE solver
        EE[0] += -dt*(kk*EE[0]*(EE[0]-a)*(EE[0]-1) + EE[0]*RR[0]);
        RR[0] += dt*(ε + M1*RR[0]/(EE[0]+M2))*(-RR[0] - kk*EE[0]*(EE[0]-b-1));
    }
}

Page 15:

Naïve CUDA Implementation
•  All array references go through device memory
•  ./apf -n 6144 -t 0.04, 16×16 thread blocks
   –  Lilliput, cseclass01, cseclass05
   –  SP: 22, 73, 34 GFlops [Triton, 32 cores, MPI: 85 GF]
   –  DP: 13, 45, 20 GFlops (19 GF at n=8192) [Triton: 48 GF]

Serial loop:

for (j = 1; j <= m+1; j++)
    for (i = 1; i <= n+1; i++)
        E[j][i] = E′[j][i] + α*(E′[j][i-1] + E′[j][i+1]
                              + E′[j-1][i] + E′[j+1][i] - 4*E′[j][i]);

CUDA kernel body, with E′(i,j) flattening the 2D index:

#define E′(i,j) E_prev[(j+1)*(m+3) + (i+1)]

I = blockIdx.y*blockDim.y + threadIdx.y;
J = blockIdx.x*blockDim.x + threadIdx.x;
if ((I <= n) && (J <= m))
    E[I] = E′(I,J) + α*(E′(I-1,J) + E′(I+1,J) + E′(I,J-1) + E′(I,J+1) - 4*E′(I,J));

Page 16:

Using Shared Memory
•  Create a 1D thread block to process a 2D data block
•  Iterate over rows in the y dimension
•  While the first and last threads read ghost cells, the others are idle
•  Compared to 2D thread blocking, 1D thread blocks provide a 12% improvement in double precision and a 64% improvement in single precision

[Figure: an Nx × Ny mesh divided into dx × dy data blocks, one CUDA thread block per block]

Didem Unat

Page 17:

Sliding row algorithm
•  Top row in registers; current row in shared memory; bottom row in registers
•  Sliding rows with 1D thread blocks reduces global memory accesses:
   top row ← current row, current row ← bottom row,
   bottom row ← read new row from global memory
   (see the sketch below)
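A sketch of the rotation step (illustrative names, not the course code; `row` is the shared-memory buffer for the current row, `N` the row stride):

float top  = E_prev[index - N];   // row above, in a register
float curr = E_prev[index];       // current row
float bot  = E_prev[index + N];   // row below, in a register
for (int j = 0; j < rows_per_block; j++) {
    row[idx] = curr;              // stage current row in shared memory
    __syncthreads();
    // ... 5-point update from top, bot, row[idx-1], row[idx+1] ...
    __syncthreads();              // finish this row before overwriting
    top  = curr;                  // slide the window down one row
    curr = bot;
    bot  = E_prev[index + (j + 2) * N];  // the only new global read
}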

Page 18:

CUDA Code

__shared__ float block[DIM_Y + 2][DIM_X + 2];
int idx = threadIdx.x, idy = threadIdx.y;   // local indices
// global indices
int x = blockIdx.x * DIM_X + idx;
int y = blockIdx.y * DIM_Y + idy;
idy++; idx++;
unsigned int index = y * N + x;
// interior points
float center = E_prev[index];
block[idy][idx] = center;
__syncthreads();

Page 19:

Copying the ghost cells

if (idy == 1 && y > 0)
    block[0][idx] = E_prev[index - N];
else if (idy == DIM_Y && y < N-1)
    block[DIM_Y+1][idx] = E_prev[index + N];

if (idx == 1 && x > 0)
    block[idy][0] = E_prev[index - 1];
else if (idx == DIM_X && x < N-1)
    block[idy][DIM_X+1] = E_prev[index + 1];

__syncthreads();

Didem Unat

Page 20:

Thread Mapping for Ghost Cells
•  Branches cause thread divergence:

if (threadIdx.y < 4) {
    // read a ghost cell into shared memory
    block[borderIdy][borderIdx] = Uold[index];
}

•  When loading ghost cells, only some of the threads are active; the rest are idle

Page 21:

Ghost Cells (cont.)
•  Divide the work between threads so each thread is responsible for one ghost cell load
•  For a 16 × 16 tile size
   –  There are 16×4 = 64 ghost cells
   –  Create 64 threads
   –  Each thread computes 4 elements in the y dimension
•  Each thread loads one ghost cell, using a lookup (“hashmap”) to find its ghost cell assignment (see the sketch below)
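One plausible realization of that assignment (a kernel fragment; ghostMap and globalIndexOf are hypothetical helpers, the table playing the role of the “hashmap”):

// One entry per ghost cell of a 16x16 tile: 64 (col, row) slots in the
// (DIM_Y+2) x (DIM_X+2) shared block, precomputed on the host.
__constant__ int2 ghostMap[64];

int tid = threadIdx.y * blockDim.x + threadIdx.x;
if (tid < 64) {
    int2 g = ghostMap[tid];                       // this thread's ghost cell
    block[g.y][g.x] = E_prev[globalIndexOf(g)];   // hypothetical index helper
}
__syncthreads();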

Page 22:

The stencil computation and ODE

float r = R[index];
float e = center + α*(block[idy][idx-1] + block[idy][idx+1]
                    + block[idy-1][idx] + block[idy+1][idx] - 4*center);
e = e - dt*(kk*e*(e - a)*(e - 1) + e*r);
E[index] = e;
R[index] = r + dt*(ε + M1*r/(e + M2))*(-r - kk*e*(e - b - 1));

Page 23:

Results on Lilliput
GFlop/s rates for Nehalem and C1060 implementations
•  Single precision
   –  Nearly saturates the off-chip memory bandwidth, utilizing 98% of the sustainable bandwidth for the Tesla C1060
   –  Achieves 13.3% of the single precision peak performance
   –  Single precision performance is bandwidth limited
•  Double precision
   –  41.5% of the sustainable bandwidth
   –  1/3 of the peak double precision performance
   –  Performance is hurt by the division operation that appears in the ODE

Page 24:

Limits to performance


Page 25:

Instruction Throughput
•  Not all the operations are multiply-and-add instructions
   –  An add or multiply alone runs at half the speed of MADD
•  Register-to-register instructions achieve the highest throughput
•  Shared memory instructions reach only a fraction of the peak (66% in single, 84% in double precision)

Page 26:

Memory Accesses
•  Count the number of tiles and the total memory accesses they generate; then
   Estimated Kernel Time = Total Memory Accesses (bytes) / Empirical Device Bandwidth
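Plugging in numbers (a back-of-the-envelope C sketch; the access count and the 75 GB/s sustained bandwidth are assumptions, not measured course figures):

#include <stdio.h>

int main(void) {
    double n = 6144.0;               // mesh edge from the earlier experiment
    double accesses_per_point = 4.0; // assumed: read E_prev, R; write E, R
    double bytes_per_access = 8.0;   // double precision
    double bandwidth = 75e9;         // assumed empirical bandwidth, bytes/s
    double total_bytes = n * n * accesses_per_point * bytes_per_access;
    printf("estimated kernel time = %g s\n", total_bytes / bandwidth);
    return 0;
}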

Page 27:

Today’s lecture
•  More about floating point
•  Stencil methods on the GPU
   –  2D
   –  3D

Page 28:

3D Stencils
•  More demanding than 2D
   –  Large strides
   –  Curse of dimensionality

Page 29:

Memory Strides

[Figure: the 7-point 3D stencil on array U — point (i,j,k) and its six neighbors (i±1,j,k), (i,j±1,k), (i,j,k±1) along the i, j, k axes]

H. Das, S. Pan, L. Chen; Sam Williams et al.

Page 30:

CUDA Thread Blocks
•  Split the 3D grid (Nx, Ny, Nz) into 3D tiles of size (tx, ty, tz)
•  Divide the elements in a tile over a thread block of (tx/cx, ty/cy, tz/cz) threads, each CUDA thread covering a chunk of size (cx, cy, cz) (a sketch of the launch configuration follows)

[Figure: a 3D grid partitioned into tiles; each tile is covered by one thread block, each thread handling one chunk]
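A launch configuration consistent with this decomposition (a sketch with assumed names; on the compute capability 1.x devices used here grids are 2D, so the tiles along z are walked inside the kernel):

// tile (tx, ty, tz), chunk (cx, cy, cz); one thread per chunk in x and y
dim3 threads(tx / cx, ty / cy);
dim3 grid(Nx / tx, Ny / ty);     // z tiles handled by an in-kernel loop
stencil3d<<<grid, threads>>>(d_Unew, d_U, Nx, Ny, Nz);  // hypothetical kernel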

Page 31:

On-chip memory optimization
•  Copy the center plane into shared memory
•  Store the others in registers
•  Move planes in and out of registers

Page 32:

Rotating planes
•  Copy the center plane into shared memory
•  Store the others in registers
•  Move planes in and out of registers (a sketch of the rotation follows)

[Figure: a 3D tile swept along z; at each step the top and bottom planes live in registers, the center plane in registers + shared memory, and the roles rotate]
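A sketch of the rotation loop (a kernel fragment with illustrative names; idx3d is a hypothetical flattening helper, and z0/tz bound this block's tile):

__shared__ float center[TY + 2][TX + 2];
float bottom = U[idx3d(x, y, z0 - 1)];    // plane below, in a register
float curr   = U[idx3d(x, y, z0)];
float top    = U[idx3d(x, y, z0 + 1)];    // plane above, in a register
for (int k = z0; k < z0 + tz; k++) {
    center[ty][tx] = curr;                // center plane in shared memory
    __syncthreads();
    // ... 7-point update from top, bottom, and center-plane neighbors ...
    __syncthreads();
    bottom = curr;                        // rotate: registers shift one plane
    curr   = top;
    top    = U[idx3d(x, y, k + 2)];       // one new plane read per step
}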

Page 33:

Performance
•  N³ = 256³, double precision

GFLOPS           GTX 280   Tesla C1060
Naïve            12.3      8.9
Shared Memory    15.8      15.5
Sliding Planes   22.3      20.7
Registers        25.6      23.6

Page 34:

Multiple Elements in the Y-dimension
•  If we let a thread compute more than one plane, we can assign it more than one row in the slowest-varying dimension
•  Reduces index calculations
   –  But requires more registers
•  May be advantageous in handling ghost cells (a sketch follows)
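A sketch of the per-thread y loop (a kernel fragment; EY rows per thread is the tunable, and update/idx3d are hypothetical helpers standing in for the stencil body and index arithmetic):

// The x/z coordinates and most of the index arithmetic are computed once,
// then reused for EY consecutive rows in the slowest-varying dimension.
for (int e = 0; e < EY; e++) {            // EY = 2 or 4 in the results below
    int yy = y0 + e;
    Unew[idx3d(x, yy, k)] = update(x, yy, k);
}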

Page 35:

Contributions to Performance
•  N³ = 256³, double precision

GFLOPS           GTX 280   Tesla C1060
Naïve            12.3      8.9
Shared Memory    15.8      15.5
Sliding Planes   22.3      20.7
Registers        25.6      23.6
MultipleY2       31.4      26.3
MultipleY4       33.9      26.2

Page 36:

Influence of memory traffic on performance

Page 37:

Performance and optimizations
•  GFlop/s rates for stencil kernels for each optimization on a GPU device; the first column gives the optimization performed, and maximum values are highlighted
•  Comparison with other recently published CPU results:

                 Intel Nehalem   AMD Barcelona
7-point          11              4.2
Divergence       6               2.2
Gradient         3.9             1.4

Reference: S. Kamil, C. Chan, L. Oliker, J. Shalf, S. Williams, “An Auto-tuning Framework for Parallel Multi-core Stencil Computations,” IPDPS, 2010.