
Introduction to CUDA Programming
Histograms and Sparse Array Multiplication

Andreas Moshovos, Winter 2009

Based on documents from: NVIDIA & Appendix A of the New P&H Book


Histogram

• E.g., given an image, calculate the distribution of its pixel values


Sequential Algorithm

for (int i = 0; i < BIN_COUNT; i++)
    result[i] = 0;

for (int i = 0; i < dataN; i++)
    result[data[i]]++;
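For reference, here is a minimal self-contained version of the sequential algorithm. The unsigned char input type, the array sizes, and the main() driver are assumptions for this sketch, not part of the original slide:

#include <stdio.h>

#define BIN_COUNT 256

// Sequential histogram: one pass to clear the bins, one pass over the data.
void histogram_serial (const unsigned char *data, int dataN,
                       unsigned int *result)
{
    for (int i = 0; i < BIN_COUNT; i++)
        result[i] = 0;
    for (int i = 0; i < dataN; i++)
        result[data[i]]++;
}

int main (void)
{
    unsigned char data[8] = { 0, 1, 1, 2, 255, 255, 255, 7 };
    unsigned int result[BIN_COUNT];
    histogram_serial (data, 8, result);
    printf ("bin 255 = %u\n", result[255]);   // prints: bin 255 = 3
    return 0;
}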


Parallel Strategy

• Distribute work across multiple blocks
  – Divide the input data among the blocks
• Each block processes its own portion
  – Multiple threads, one per image pixel?
  – Or #pixels / #threads pixels per thread
  – Produces a partial histogram
  – Could produce multiple partial histograms
• Merge all partial histograms
  – Produces the final histogram


Data Structures and Access Patterns

result[data[i]]++;

• data[]:
  – We control the access pattern: it can be read sequentially
  – Each element is accessed only once
• result[]:
  – Access is data-dependent
  – Each element may be accessed multiple times
• So: data[] lives in global memory
• result[] lives in shared memory


Sub-Histograms

• How many sub-histograms can we fit in shared memory?
  – Input value range: 0–255, i.e., 1 byte per pixel
  – Each histogram needs 256 entries
  – How many bytes per entry? That is data-dependent; let's assume 32 bits (4 bytes)
• 16 KB (shared memory) / (256 × 4 bytes per histogram) = 16,384 / 1,024 = 16
  – So 16 sub-histograms fit at any given point in time
  – If we used one per thread, we would have fewer threads than a warp
• Let's try one histogram per block, with many threads per block


Partial Histogram Data Structure

• An array in shared memory
• One per block
• One entry (bin) per possible pixel value


Algorithm Overview

• Step 1: Initialize the partial histogram. Each thread:
  – s_Hist[index] = 0
  – index += threads per block
• Step 2: Update the partial histogram. Each thread:
  – reads data[index]
  – updates s_Hist[]
  – index += total number of threads in the grid
• Step 3: Update the global histogram. Each thread:
  – reads s_Hist[index]
  – updates the global histogram
  – index += threads per block


Simultaneous Updates?

• Threads within a block simultaneously update s_Hist[]
• Threads across all blocks simultaneously update the global histogram
• Without special support, each update A++ becomes:
  – register X = value of A
  – X++
  – A = register X
• This is a read-modify-write sequence


The Problem with Simultaneous Updates

• What if two threads each do the steps individually? Starting with mem[100] = 10:

  Thread 1                     Thread 2
  r10 = mem[100]    // 10      r100 = mem[100]    // 10
  r10++             // 11      r100++             // 11
  mem[100] = r10    // 11      mem[100] = r100    // 11

• The final value is 11, but we really wanted 12
• What if we had 32 threads running in parallel?
  – Starting with 10, we would want 10 + 32 = 42
  – We may still get just 11
• We need special support: atomic operations (see the sketch below)
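To make the race concrete, here is a minimal sketch; the kernel names and the <<<1, 32>>> launch are invented for illustration:

__global__ void racy_count (unsigned int *counter)
{
    // Read-modify-write without protection: updates from different
    // threads can interleave and overwrite each other.
    (*counter)++;
}

__global__ void atomic_count (unsigned int *counter)
{
    // The hardware serializes these read-modify-write updates,
    // so every increment is observed.
    atomicAdd (counter, 1);
}

Launched with <<<1, 32>>> on a counter initialized to 10, racy_count may leave a value as low as 11, while atomic_count always yields 42.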


Atomic Operations

• Read-modify-write operations that are guaranteed to happen "atomically"
  – They produce the same result as if the sequence executed in isolation in time
  – Think of it as "serializing the execution" of all atomics
  – That is not what actually happens, but it is how you should think about them


Atomic Operations

• Supported in both shared and global memory
• Example: atomicAdd (pointer, value) does *pointer += value
• Available operations:
  – Add, Sub, Inc, Dec
  – Exch, Min, Max, CAS
  – Bitwise: And, Or, Xor
• They work with (unsigned) integers
• Exch also works with single-precision FP


atomicExch, atomicMin, atomicMax, atomicCAS

• atomicExch (pointer, value)
  – tmp = *pointer
  – *pointer = value
  – return tmp
• atomicMin (pointer, value) (Max is similar)
  – tmp = *pointer
  – if (value < *pointer) *pointer = value
  – return tmp
• atomicCAS (pointer, compare, value)
  – tmp = *pointer
  – if (*pointer == compare) *pointer = value
  – return tmp
  – (a usage sketch follows)
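Because CAS returns the old value, it can emulate atomics the hardware lacks. As one classic example (a sketch, not from the slides), a single-precision atomic add can be built from atomicCAS by reinterpreting the float's bits as an integer:

__device__ float atomicAddFP (float *address, float value)
{
    // Reinterpret the float's bits as an int so atomicCAS can compare them.
    int *address_as_int = (int *) address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + value); this succeeds only if nobody
        // changed the location since we read it.
        old = atomicCAS (address_as_int, assumed,
                         __float_as_int (__int_as_float (assumed) + value));
    } while (assumed != old);
    return __int_as_float (old);
}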


atomicInc, atomicDec

• atomicInc (pointer, value)
  – tmp = *pointer
  – if (*pointer < value) (*pointer)++
  – else *pointer = 0
  – return tmp
• atomicDec (pointer, value)
  – tmp = *pointer
  – if (*pointer == 0 || *pointer > value) *pointer = value
  – else (*pointer)--
  – return tmp
  – (see the ring-buffer sketch below)
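The wrap-around behaviour makes atomicInc convenient for cycling indices. A minimal sketch; the kernel, buffer, and parameter names are assumptions for illustration:

__global__ void ring_write (float *buf, unsigned int *head,
                            unsigned int ring_size, float item)
{
    // atomicInc returns the old head and wraps it back to 0 after
    // ring_size - 1, so slot always lies in [0, ring_size).
    unsigned int slot = atomicInc (head, ring_size - 1);
    buf[slot] = item;
}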


atomicAnd, atomicOr, atomicXor

• atomicAnd (pointer, value)
  – tmp = *pointer
  – *pointer = *pointer & value
  – return tmp
• The others are similar


CUDA Implementation - Declarations

__global__ void histogram256Kernel (uint *d_Result, uint *d_Data, int dataN)
{
    // Current global thread index
    const int globalTid = blockIdx.x * blockDim.x + threadIdx.x;

    // Total number of threads in the compute grid
    const int numThreads = blockDim.x * gridDim.x;

    // Per-block partial histogram
    __shared__ uint s_Hist[BIN_COUNT];


Clear Partial Histogram Buffer

    // Clear shared memory buffer for the current thread block before processing
    for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x)
        s_Hist[pos] = 0;

    __syncthreads ();


Generate Partial Histogram

    for (int pos = globalTid; pos < dataN; pos += numThreads) {
        // Each uint packs four 8-bit pixel values
        uint data4 = d_Data[pos];
        atomicAdd (s_Hist + ((data4 >>  0) & 0xFFU), 1);
        atomicAdd (s_Hist + ((data4 >>  8) & 0xFFU), 1);
        atomicAdd (s_Hist + ((data4 >> 16) & 0xFFU), 1);
        atomicAdd (s_Hist + ((data4 >> 24) & 0xFFU), 1);
    }

    __syncthreads ();

(Note: the shift-and-mask must be parenthesized; in C, + binds tighter than &, so s_Hist + (data4 >> 0) & 0xFFU would mask the pointer, not the byte.)


Merge Partial Histogram with Global Histogram

    for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x) {
        atomicAdd (d_Result + pos, s_Hist[pos]);
    }


Code Overview

__global__ void histogram256Kernel (uint *d_Result, uint *d_Data, int dataN)
{
    const int globalTid = blockIdx.x * blockDim.x + threadIdx.x;
    const int numThreads = blockDim.x * gridDim.x;
    __shared__ uint s_Hist[BIN_COUNT];

    for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x)
        s_Hist[pos] = 0;
    __syncthreads ();

    for (int pos = globalTid; pos < dataN; pos += numThreads) {
        uint data4 = d_Data[pos];
        atomicAdd (s_Hist + ((data4 >>  0) & 0xFFU), 1);
        atomicAdd (s_Hist + ((data4 >>  8) & 0xFFU), 1);
        atomicAdd (s_Hist + ((data4 >> 16) & 0xFFU), 1);
        atomicAdd (s_Hist + ((data4 >> 24) & 0xFFU), 1);
    }
    __syncthreads ();

    for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x)
        atomicAdd (d_Result + pos, s_Hist[pos]);
}
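A possible host-side launch, sketched under assumptions: the grid and block sizes, h_Data/h_Result names, and omitted error handling are illustrative, not from the slides. d_Result must be zeroed first, since every block atomically adds its partial histogram into it:

// Host-side sketch: dataN is the number of uints (4 packed pixels each).
uint *d_Data, *d_Result;
cudaMalloc (&d_Data,   dataN * sizeof (uint));
cudaMalloc (&d_Result, BIN_COUNT * sizeof (uint));
cudaMemcpy (d_Data, h_Data, dataN * sizeof (uint), cudaMemcpyHostToDevice);
cudaMemset (d_Result, 0, BIN_COUNT * sizeof (uint));

// e.g., 64 blocks of 192 threads; each thread strides through the input
histogram256Kernel<<<64, 192>>> (d_Result, d_Data, dataN);

cudaMemcpy (h_Result, d_Result, BIN_COUNT * sizeof (uint),
            cudaMemcpyDeviceToHost);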


Discussion

• s_Hist updates
  – Cause bank conflicts in shared memory
  – The conflict pattern is data-dependent
  – 16-way conflicts are possible, and likely
• Is there an alternative?
  – One histogram per thread?
  – Load the data into shared memory
  – Have each thread produce the portion of s_Hist that maps onto the same bank?


Warp Vote Functions

• int __all (int predicate);
  – Evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for all of them
• int __any (int predicate);
  – Evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them


Warp Vote Functions Example

• Original code:

  for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x) {
      atomicAdd (d_Result + pos, s_Hist[pos]);
  }

• Modified with __any(): skip the atomics when every bin the warp sees is zero

  for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x) {
      if (__any (s_Hist[pos] != 0))
          atomicAdd (d_Result + pos, s_Hist[pos]);
  }

• Modified with __all(): an equivalent formulation

  for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x) {
      if (!__all (s_Hist[pos] == 0))
          atomicAdd (d_Result + pos, s_Hist[pos]);
  }


Sparse Matrix Multiplication

• Sparse N × N matrix:
  – The number of non-zero entries, m, is only a small fraction of the total
• Representation goal: store only the non-zero entries
• Typically: m = O(N)


Compressed Sparse Row Representation

• Av[]: the non-zero values, in row-major order
• Aj[]: the column of the corresponding Av[] entry
• Ap[]: row i extends from index Ap[i] to Ap[i+1] - 1 in Av[] and Aj[]


Matrix x Vector: y = Ax

Av: 3 1 2 4 1 1 1
Aj: 0 2 1 2 3 0 3
Ap: 0 2 2 5 7

These arrays encode the 4 × 4 matrix

    A = | 3 0 1 0 |
        | 0 0 0 0 |
        | 0 2 4 1 |
        | 1 0 0 1 |

With x = (10, 20, 30, 40):

    y[0] = 3 x 10 + 1 x 30          =  60
    y[1] =                              0
    y[2] = 2 x 20 + 4 x 30 + 1 x 40 = 200
    y[3] = 1 x 10 + 1 x 40          =  50


Single Row Multiply

• Produces one entry of the result vector

float multiply_row (uint rowsize,
                    uint  *Aj,   // column indices for the row
                    float *Av,   // non-zero entries for the row
                    float *x)    // the RHS vector
{
    float sum = 0;
    for (uint column = 0; column < rowsize; column++)
        sum += Av[column] * x[ Aj[column] ];
    return sum;
}

Av: 3 1 2 4 1 1 1
Aj: 0 2 1 2 3 0 3
Ap: 0 2 2 5 7
x:  10 20 30 40
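For row 2 of the running example, the call would be:

// Row 2 spans Av[2..4] and Aj[2..4] (Ap[2] = 2, Ap[3] = 5)
float y2 = multiply_row (5 - 2, Aj + 2, Av + 2, x);
// = 2 * x[1] + 4 * x[2] + 1 * x[3] = 2*20 + 4*30 + 1*40 = 200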


Serial Code

void csrmul_serial (uint *Ap, uint *Aj, float *Av,
                    uint num_rows, float *x, float *y)
{
    for (uint row = 0; row < num_rows; row++) {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row + 1];
        y[row] = multiply_row (row_end - row_begin,
                               Aj + row_begin, Av + row_begin, x);
    }
}


CUDA Strategy

• Assume that there are many rows

• One thread per row


CUDA Kernel

__global__ void csrmul_kernel (uint *Ap, uint *Aj, float *Av,
                               uint num_rows, float *x, float *y)
{
    uint row = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < num_rows) {   // the grid may launch more threads than rows
        uint row_begin = Ap[row];
        uint row_end   = Ap[row + 1];
        y[row] = multiply_row (row_end - row_begin,
                               Aj + row_begin, Av + row_begin, x);
    }
}
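A matching host-side launch might look like this; the block size and the ceiling division are illustrative assumptions, following the one-thread-per-row strategy above:

// One thread per row; round the grid size up so every row is covered.
unsigned int threads_per_block = 128;   // assumed block size
unsigned int nblocks = (num_rows + threads_per_block - 1) / threads_per_block;
csrmul_kernel<<<nblocks, threads_per_block>>> (Ap, Aj, Av, num_rows, x, y);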


Discussion

• We are not using shared memory; all accesses go to global memory
• Claim: rows that use x[i] tend to be rows near row i
• A block processing rows i through j can therefore:
  – cache x[i] through x[j] in shared memory
  – load from shared memory when possible (see the kernel below)
• Further optimizations: unroll multiply_row(); fetch Ap[row+1] from the adjacent thread


CSR Multiplication Using Shared Memory

__global__ void csrmul_cached (uint *Ap, uint *Aj, float *Av,
                               uint num_rows, float *x, float *y)
{
    __shared__ float cache[blocksize];

    uint block_begin = blockIdx.x * blockDim.x;
    uint block_end   = block_begin + blockDim.x;
    uint row         = block_begin + threadIdx.x;

    // Each thread caches the x[] entry for its own row
    if (row < num_rows)
        cache[threadIdx.x] = x[row];
    __syncthreads ();

    if (row < num_rows) {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row + 1];
        float sum = 0, x_j;

        for (uint col = row_begin; col < row_end; col++) {
            uint j = Aj[col];
            // Fetch x_j from our cache when possible
            if (j >= block_begin && j < block_end)
                x_j = cache[j - block_begin];
            else
                x_j = x[j];
            sum += Av[col] * x_j;
        }
        y[row] = sum;
    }
}
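Note that cache[blocksize] requires blocksize to be a compile-time constant. A minimal way to satisfy that, with the constant's value an assumption for this sketch:

#define blocksize 128   // must match the launch configuration

// Launch with exactly 'blocksize' threads per block so the cache
// holds one x[] entry per thread of the block.
unsigned int nblocks = (num_rows + blocksize - 1) / blocksize;
csrmul_cached<<<nblocks, blocksize>>> (Ap, Aj, Av, num_rows, x, y);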