Lecture 10
Floating point (continued) Stencil methods
Announcements
• Mac Mini lab (APM 2402)
  – Tuesday, 4pm to 6pm
• A3 will be posted on Friday
• Sign on to bang and run the Basic example, both interactively and with batch
Projects!
• Stencil method in 3 dimensions
• Multigrid
• Communication-avoiding matrix multiplication (MPI)
• Algorithm-based fault tolerance (MPI)
• 3D Fast Fourier Transform (MPI or CUDA)
• Particle simulation (MPI)
• Groups of 3 will do a more ambitious project
  – MPI projects can add communication overlap
  – MPI + CUDA
• Propose your own
• Make your choice by 11/9
www-cse.ucsd.edu/classes/fa12/cse260-b/Projects/ProjectList.html
Today’s lecture
• More about floating point
• Stencil methods on the GPU
  – 2D
  – 3D
IEEE Floating Point Standard P754
• Normalized representation: ±1.d…d × 2^exp
  – Macheps = machine epsilon = ε = 2^(−#significand bits) = relative error in each operation
  – OV = overflow threshold = largest number
  – UN = underflow threshold = smallest number
• ±Zero: significand and exponent = 0; the sign bit distinguishes +0 from −0
Format            #bits   #significand bits   macheps              #exponent bits   exponent range
Single            32      23+1                2^-24 (~10^-7)       8                2^-126 … 2^127 (~10^±38)
Double            64      52+1                2^-53 (~10^-16)      11               2^-1022 … 2^1023 (~10^±308)
Double Extended   ≥80     ≥64                 ≤2^-64 (~10^-19)     ≥15              2^-16382 … 2^16383 (~10^±4932)
Source: Jim Demmel
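A quick host-side check of the macheps column (an illustrative sketch, not course code). Note that C's FLT_EPSILON and DBL_EPSILON are one ulp, i.e. twice the relative rounding error listed above:

#include <stdio.h>
#include <float.h>

int main(void) {
    // With round-to-nearest, 1 + 2^-24 rounds back to exactly 1 in single
    // precision, while 1 + 2^-23 does not; likewise 2^-53 vs 2^-52 in double.
    printf("float : 1 + 2^-24 == 1 ? %d\n", (1.0f + 0x1p-24f) == 1.0f);
    printf("float : 1 + 2^-23 == 1 ? %d\n", (1.0f + 0x1p-23f) == 1.0f);
    printf("double: 1 + 2^-53 == 1 ? %d\n", (1.0  + 0x1p-53)  == 1.0);
    printf("double: 1 + 2^-52 == 1 ? %d\n", (1.0  + 0x1p-52)  == 1.0);
    printf("FLT_EPSILON = %g, DBL_EPSILON = %g\n", FLT_EPSILON, DBL_EPSILON);
    return 0;
}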
Denormalized numbers
• Consider: if (a ≠ b) then x = a/(a−b)
• We should never divide by 0, even if a−b is tiny
• An underflow exception occurs when the exact result a−b < underflow threshold UN
• Return a denormalized number for a−b
  – Relax the restriction that the leading digit is 1: ±0.d…d × 2^min_exp
  – Reserve the smallest exponent value, losing a set of small normalized numbers
  – Fill in the gap between 0 and UN with a uniform distribution of values
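A host-side illustration of why this matters (a sketch, not course code): with gradual underflow, a ≠ b guarantees a − b ≠ 0, so the division is safe:

#include <stdio.h>

int main(void) {
    // a and b differ, and their difference (~5e-309) falls below the
    // underflow threshold UN (~2.2e-308): it becomes a denormalized number
    double a = 3.0e-308, b = 2.5e-308;
    double d = a - b;
    if (a != b)
        printf("a-b = %g, a/(a-b) = %g\n", d, a / d);  // no division by zero
    return 0;
}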
Anomalous behavior
• Floating point arithmetic is not associative: (x + y) + z ≠ x + (y + z)
• The distributive law doesn’t always hold: x*y − x*z ≠ x*(y − z)
  – These expressions have different values when y ≈ z
• Optimizers can’t reason about floating point
• If we compute a quantity in extended precision (80 bits), we lose digits when we store it to memory, so y ≠ x below:

    float x, y = …, z = …;
    x = y + z;   // may be held in an 80-bit register
    y = x;       // rounded to 32 bits on the store to memory
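A concrete instance of the associativity failure (an illustrative sketch, not course code):

#include <stdio.h>

int main(void) {
    double x = 1e16, y = -1e16, z = 1.0;
    // the spacing between doubles near 1e16 is 2, so y + z rounds back to y
    printf("(x + y) + z = %g\n", (x + y) + z);  // prints 1: exact cancellation first
    printf("x + (y + z) = %g\n", x + (y + z));  // prints 0: z is absorbed by y
    return 0;
}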
NaN (Not a Number)
• Invalid exception
  – The exact result is not a well-defined real number: 0/0, √−1
• NaN op number = NaN
• We can have a quiet NaN or a signaling NaN (sNaN)
  – Quiet: does not raise an exception, but propagates a distinguished value
    • E.g. missing data: max(3, NaN) = 3
  – Signaling: generates an exception when accessed
    • Detects uninitialized data
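The quiet-NaN rules in a few lines of host code (an illustrative sketch, not course code):

#include <stdio.h>
#include <math.h>

int main(void) {
    double qnan = 0.0 / 0.0;                          // invalid op -> quiet NaN
    printf("NaN + 1      = %g\n", qnan + 1.0);        // NaN propagates through ops
    printf("NaN == NaN   ? %d\n", qnan == qnan);      // 0: NaN compares unordered
    printf("fmax(3, NaN) = %g\n", fmax(3.0, qnan));   // 3: the "missing data" case
    return 0;
}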
Exception handling
• Each of the 5 exceptions manipulates 2 flags
• Sticky flag: set by an exception; can be read and cleared by the user
• Exception flag: should a trap occur?
  – If so, we can enter a trap handler
  – But this requires precise interrupts, which causes problems on a parallel computer
• We can use exception handling to build faster algorithms
  – Try the faster but “riskier” algorithm
  – Rapidly test for accuracy (possibly with the aid of exception handling)
  – Substitute a slower, more stable algorithm as needed
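A host-side sketch of the sticky-flag interface from C99 <fenv.h> (illustrative, not course code):

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON        // we touch the fp environment (some compilers ignore this)

int main(void) {
    feclearexcept(FE_ALL_EXCEPT);      // clear all sticky flags
    volatile double tiny = 1e-308;
    volatile double r = tiny * tiny;   // exact result < UN: raises underflow
    if (fetestexcept(FE_UNDERFLOW))    // the sticky flag records it
        printf("underflow occurred, r = %g\n", r);
    // fast/risky pattern: run the quick algorithm, test FE_OVERFLOW | FE_INVALID,
    // and fall back to the slower, more stable algorithm only if a flag is set
    return 0;
}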
P754 on the GPU
• CUDA Programming Guide (4.1):
  “All compute devices follow the IEEE 754-2008 standard for binary floating-point arithmetic with the following deviations”
  – There is no mechanism for detecting that a floating-point exception has occurred, and all operations behave as if the exceptions are always masked… sNaNs… are handled as quiet
• Compute capability 2.x: FFMA… is an IEEE 754-2008 compliant fused multiply-add instruction… the full-width product… is used in the addition and a single rounding occurs during generation of the final result
  – rnd(A × A + B) with FFMA (2.x) vs. rnd(rnd(A × A) + B) with FMAD (1.x)
• FFMA can avoid loss of precision during subtractive cancellation, when adding quantities of similar magnitude but opposite signs
• Also see “Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs,” by N. Whitehead and A. Fit-Florea
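A device-side sketch contrasting the two evaluations via CUDA's rounding-mode intrinsics (the kernel and variable names are illustrative, not from the course code):

__global__ void fma_demo(const float *A, const float *B,
                         float *fused, float *unfused)
{
    float a = A[0], b = B[0];
    fused[0]   = __fmaf_rn(a, a, b);             // rnd(a*a + b): one rounding (FFMA)
    unfused[0] = __fadd_rn(__fmul_rn(a, a), b);  // rnd(rnd(a*a) + b): two roundings
    // With b ≈ -a*a, the fused form keeps low-order product bits that the
    // unfused form has already rounded away, so the results can differ.
}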
Today’s lecture
• More about floating point
• Stencil methods on the GPU
  – 2D
  – 3D
The Aliev-Panfilov Method
• Models signal propagation in cardiac tissue
  – Demonstrates the complex behavior of spiral waves, which are known to cause life-threatening situations
• A reaction-diffusion system
  – The reactions are the cellular exchanges of certain ions across the cell membrane during the cellular electrical impulse
• Our simulation has two state variables
  – Transmembrane potential: e
  – Recovery of the tissue: r
The Aliev-Panfilov Model
• Two parts
  – 2 ordinary differential equations
    • Kinetics of the reactions occurring at every point in space
  – A partial differential equation
    • Spatial diffusion of reactants
• First-order explicit numerical scheme (the governing equations are sketched below)
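The equations appeared as an image in the original slide; they are reconstructed here to be consistent with the solver code later in the lecture (δ is the diffusion coefficient, and k, ε, μ₁, μ₂ correspond to kk, ε, M1, M2 in the code):

\frac{\partial e}{\partial t} = \delta\,\nabla^2 e \;-\; k\,e\,(e-a)(e-1) \;-\; e\,r

\frac{\partial r}{\partial t} = \left(\varepsilon + \frac{\mu_1 r}{e+\mu_2}\right)\left(-r - k\,e\,(e-b-1)\right)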
Data Dependencies
• ODE solver:
  – No data dependencies; trivially parallelizable
  – Requires a lot of registers to hold temporary variables
• PDE solver:
  – Jacobi update for the 5-point Laplacian operator
  – Sweeps over a uniformly spaced mesh
  – Updates the voltage with weighted contributions from the 4 nearest neighbors
for (j = 1; j <= m+1; j++) {
    _DOUBLE_ *RR = &R[j][1], *EE = &E[j][1];
    for (i = 1; i <= n+1; i++, EE++, RR++) {
        // PDE solver: 5-point Laplacian of the previous voltage field
        EE[0] = E_p[j][i] + α*(E_p[j][i+1] + E_p[j][i-1] - 4*E_p[j][i]
                             + E_p[j+1][i] + E_p[j-1][i]);
        // ODE solver: reaction kinetics for voltage and recovery
        EE[0] += -dt*(kk*EE[0]*(EE[0]-a)*(EE[0]-1) + EE[0]*RR[0]);
        RR[0] += dt*(ε + M1*RR[0]/(EE[0]+M2))*(-RR[0] - kk*EE[0]*(EE[0]-b-1));
    }
}
Naïve CUDA Implementation
• All array references go through device memory
• ./apf -n 6144 -t 0.04, with 16×16 thread blocks, on Lilliput, cseclass01, and cseclass05:
  – SP: 22, 73, and 34 GFlops [Triton, 32 cores, MPI: 85 GFlops]
  – DP: 13, 45, and 20 GFlops (19 GFlops at n = 8192) [Triton: 48 GFlops]
// Reference CPU loop:
for (j = 1; j <= m+1; j++)
    for (i = 1; i <= n+1; i++)
        E[j][i] = E_prev[j][i] + α*(E_prev[j][i-1] + E_prev[j][i+1]
                                  + E_prev[j-1][i] + E_prev[j+1][i] - 4*E_prev[j][i]);

// On the GPU, E_prev is flat; index it through a macro (reconstructed from the slide):
#define E_p(i,j) E_prev[((j)+1)*(m+3) + ((i)+1)]
// Output macro using the same padded layout (an assumption):
#define E_n(i,j) E[((j)+1)*(m+3) + ((i)+1)]

// Naïve kernel body: one thread per mesh point
I = blockIdx.y*blockDim.y + threadIdx.y;
J = blockIdx.x*blockDim.x + threadIdx.x;
if ((I <= n) && (J <= m))
    E_n(I,J) = E_p(I,J) + α*(E_p(I-1,J) + E_p(I+1,J)
                           + E_p(I,J-1) + E_p(I,J+1) - 4*E_p(I,J));
Using Shared Memory

[Figure: a 1D CUDA thread block (dx × dy) sweeping a 2D data block within the Nx × Ny mesh]

• Create a 1D thread block to process a 2D data block
• Iterate over the rows in the y dimension
• While the first and last threads read ghost cells, the others are idle
• Compared to 2D thread blocking, 1D thread blocks provide a 12% improvement in double precision and a 64% improvement in single precision
Slides: Didem Unat
Sliding row algorithm
• Sliding rows with 1D thread blocks reduce global memory accesses
• Keep the top row and bottom row in registers and the current row in shared memory
• At each step, slide: top row ← current row; current row ← bottom row; bottom row ← a new row read from global memory
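A minimal sketch of the sliding-row kernel, under stated assumptions: the kernel name, the fixed block width TX, and the padded (n+2)-wide row layout are illustrative, and n is taken to be a multiple of TX. This is not the course code:

#define TX 64                                    // 1D thread block width (assumed)

__global__ void slide_rows(float *E, const float *E_prev,
                           int n, int m, float alpha)
{
    __shared__ float curr_s[TX + 2];             // current row plus two ghost columns
    int N  = n + 2;                              // padded row stride (assumed layout)
    int i  = blockIdx.x * TX + threadIdx.x + 1;  // global column, skipping the boundary
    int tx = threadIdx.x + 1;                    // local column, shifted past the ghost

    float top  = E_prev[0 * N + i];              // row j-1, held in a register
    float curr = E_prev[1 * N + i];              // row j
    float bot  = E_prev[2 * N + i];              // row j+1

    for (int j = 1; j <= m; j++) {
        curr_s[tx] = curr;                       // stage the current row in shared memory
        if (threadIdx.x == 0)      curr_s[0]      = E_prev[j * N + i - 1];
        if (threadIdx.x == TX - 1) curr_s[TX + 1] = E_prev[j * N + i + 1];
        __syncthreads();

        E[j * N + i] = curr + alpha * (curr_s[tx - 1] + curr_s[tx + 1]
                                       + top + bot - 4.0f * curr);
        __syncthreads();                         // finish reads before curr_s changes

        top  = curr;                             // top row <- current row
        curr = bot;                              // current row <- bottom row
        if (j < m)                               // bottom row <- new row from global memory
            bot = E_prev[(j + 2) * N + i];
    }
}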
CUDA Code
__shared__ float block[DIM_Y + 2][DIM_X + 2];
int idx = threadIdx.x, idy = threadIdx.y;   // local indices
// global indices
int x = blockIdx.x * DIM_X + idx;
int y = blockIdx.y * DIM_Y + idy;
idy++; idx++;                               // shift past the ghost-cell border
unsigned int index = y * N + x;
// interior points
float center = E_prev[index];
block[idy][idx] = center;
__syncthreads();
Copying the ghost cells
// north and south ghost rows
if (idy == 1 && y > 0)
    block[0][idx] = E_prev[index - N];
else if (idy == DIM_Y && y < N-1)
    block[DIM_Y+1][idx] = E_prev[index + N];
// west and east ghost columns
if (idx == 1 && x > 0)
    block[idy][0] = E_prev[index - 1];
else if (idx == DIM_X && x < N-1)
    block[idy][DIM_X+1] = E_prev[index + 1];
__syncthreads();
Thread Mapping for Ghost Cells
• Branches cause thread divergence:

    if (threadIdx.y < 4) {   // read a ghost cell into shared memory
        block[borderIdy][borderIdx] = Uold[index];
    }

• When loading the ghost cells, only some of the threads are active; the rest are idle
Ghost Cells (cont.)
• Divide the work between threads so that each thread is responsible for one ghost-cell load
• For a 16 × 16 tile
  – There are 16 × 4 = 64 ghost cells
  – Create 64 threads
  – Each thread computes 4 elements in the y-dimension
• Each thread loads one ghost cell, using a lookup table (hashmap) to find its ghost-cell assignment; one possible mapping is sketched below
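An arithmetic version of that assignment (a hedged sketch — the course code uses a precomputed lookup table, and these names are assumptions):

#define DIM_X 16
#define DIM_Y 16

// Map a linear thread id t in 0..63 to the border cell it should load.
__device__ void ghost_assignment(int t, int *bIdy, int *bIdx)
{
    int side = t / DIM_X;              // 0: top, 1: bottom, 2: left, 3: right
    int off  = t % DIM_X + 1;          // 1..16: position along that side
    switch (side) {
        case 0:  *bIdy = 0;          *bIdx = off;        break;  // top ghost row
        case 1:  *bIdy = DIM_Y + 1;  *bIdx = off;        break;  // bottom ghost row
        case 2:  *bIdy = off;        *bIdx = 0;          break;  // left ghost column
        default: *bIdy = off;        *bIdx = DIM_X + 1;  break;  // right ghost column
    }
}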
The stencil computation and the ODEs

float r = R[index];
// PDE: 5-point Laplacian read from shared memory
float e = center + α*(block[idy][idx-1] + block[idy][idx+1]
                    + block[idy-1][idx] + block[idy+1][idx] - 4*center);
// ODEs: voltage and recovery kinetics
e = e - dt*(kk * e * (e - a) * (e - 1) + e * r);
E[index] = e;
R[index] = r + dt*(ε + M1 * r / (e + M2)) * (-r - kk * e * (e - b - 1));
Results on Lilliput
GFlop/s rates for the Nehalem and C1060 implementations
• Single precision
  – Nearly saturates the off-chip memory bandwidth, utilizing 98% of the sustainable bandwidth of the Tesla C1060
  – Achieves 13.3% of single-precision peak performance
  – Single-precision performance is bandwidth limited
• Double precision
  – 41.5% of the sustainable bandwidth
  – 1/3 of double-precision peak performance
  – Performance is hurt by the division operation that appears in the ODE
Limits to performance
Instruction Throughput
• Not all of the operations are multiply-and-add instructions
  – An add or multiply alone runs at half the speed of a MADD
• Register-to-register instructions achieve the highest throughput
• Shared memory instructions reach only a fraction of the peak (66% in single precision, 84% in double precision)
Memory Accesses
A simple performance model: count the tiles, total the bytes each tile moves, and divide by the measured bandwidth:
  Total memory accesses = number of tiles × bytes moved per tile
  Estimated kernel time = total memory accesses (bytes) / empirical device bandwidth
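A back-of-envelope instance of this model (every number below is an illustrative assumption, not a course measurement):

#include <stdio.h>

int main(void) {
    double n     = 6144;              // mesh points per dimension (assumed)
    double bytes = n * n * 4 * 8;     // ~4 double accesses per point: read E_prev,
                                      // write E, read and write R (assumed)
    double bw    = 78e9;              // empirical device bandwidth in bytes/s (assumed)
    printf("estimated kernel time = %.1f ms\n", 1e3 * bytes / bw);
    return 0;
}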
Today’s lecture
• More about floating point
• Stencil methods on the GPU
  – 2D
  – 3D
3D Stencils
• More demanding
  – Large strides
  – The curse of dimensionality
Memory Strides

[Figure: the 7-point stencil on a 3D array U — the six neighbors of point (i, j, k) are (i±1, j, k), (i, j±1, k), and (i, j, k±1); compared with the 2D case, the j and k neighbors are a full row and a full plane away in memory]

Credits: H. Das, S. Pan, L. Chen; Sam Williams et al.
CUDA Thread Blocks
• Split the mesh into 3D tiles
• Divide the elements in a tile over a thread block
[Figure: a 3D grid (Nx, Ny, Nz) split into tiles of size (tx, ty, tz); each tile maps to a thread block of (tx/cx, ty/cy, tz/cz) threads, with each CUDA thread responsible for a chunk of size (cx, cy, cz)]
Rotating planes (on-chip memory optimization)
• Copy the center plane into shared memory
• Store the other planes in registers
• Move planes in and out of registers as the sweep advances (a kernel sketch follows the figure below)
[Figure: a 3D tile swept along z; at each step the top, center, and bottom planes rotate — the center plane lives in registers + shared memory, while the top and bottom planes stay in registers]
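A minimal sketch of the rotating-planes kernel, under stated assumptions: the names, signature, and padded layout are illustrative, and nx, ny are taken to be multiples of the block dimensions. This is not the course code:

#define BX 16
#define BY 16

__global__ void rotate_planes(float *U, const float *Uold,
                              int nx, int ny, int nz, float c)
{
    __shared__ float cur_s[BY + 2][BX + 2];          // center plane tile + ghosts
    int i  = blockIdx.x * BX + threadIdx.x + 1;      // global x, interior
    int j  = blockIdx.y * BY + threadIdx.y + 1;      // global y, interior
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;  // local indices past the ghosts
    long plane = (long)(nx + 2) * (ny + 2);          // padded plane stride (assumed)
    long idx   = (long)j * (nx + 2) + i;             // offset within a plane

    float bot = Uold[0 * plane + idx];               // plane k-1 in a register
    float cur = Uold[1 * plane + idx];               // plane k: register + shared memory
    float top = Uold[2 * plane + idx];               // plane k+1 in a register

    for (int k = 1; k <= nz; k++) {
        cur_s[ty][tx] = cur;                         // stage the center plane
        if (threadIdx.x == 0)      cur_s[ty][0]      = Uold[k * plane + idx - 1];
        if (threadIdx.x == BX - 1) cur_s[ty][BX + 1] = Uold[k * plane + idx + 1];
        if (threadIdx.y == 0)      cur_s[0][tx]      = Uold[k * plane + idx - (nx + 2)];
        if (threadIdx.y == BY - 1) cur_s[BY + 1][tx] = Uold[k * plane + idx + (nx + 2)];
        __syncthreads();

        U[k * plane + idx] = cur + c * (cur_s[ty][tx - 1] + cur_s[ty][tx + 1]
                                      + cur_s[ty - 1][tx] + cur_s[ty + 1][tx]
                                      + top + bot - 6.0f * cur);
        __syncthreads();                             // before cur_s is overwritten

        bot = cur;                                   // rotate the planes through registers
        cur = top;
        if (k < nz)
            top = Uold[(k + 2) * plane + idx];       // read the new plane from global memory
    }
}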
Performance
• N³ = 256³ points, double precision

GFLOPS            GTX 280   Tesla C1060
Naïve             12.3      8.9
Shared memory     15.8      15.5
Sliding planes    22.3      20.7
Registers         25.6      23.6
Multiple Elements in the Y-dim
• If we let a thread compute more than one plane, we can also assign it more than one row in the slowest-varying dimension
• Reduces index calculations
  – But requires more registers
• May be advantageous in handling ghost cells
Contributions to Performance
• N³ = 256³ points, double precision

GFLOPS            GTX 280   Tesla C1060
Naïve             12.3      8.9
Shared memory     15.8      15.5
Sliding planes    22.3      20.7
Registers         25.6      23.6
Multiple-Y (2)    31.4      26.3
Multiple-Y (4)    33.9      26.2
Influence of memory traffic on performance
Performance and optimizations
• GFlop/s rates for the stencil kernels under each optimization on a GPU device; the first column lists the optimization performed, and the maximum values are highlighted [table not reproduced]
• Comparison with other recently published CPU results:

                Intel Nehalem   AMD Barcelona
  7-point       11              4.2
  Divergence    6               2.2
  Gradient      3.9             1.4

Reference: S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams, "An Auto-tuning Framework for Parallel Multicore Stencil Computations," IPDPS 2010.