Lecture 4: Introduction to Parallel Computing Using CUDA Ken Domino, Domem Technologies May 23, 2011 IEEE Boston Continuing Education Program


Page 1: Lecture 4: Introduction to Parallel Computing Using CUDA

Lecture 4: Introduction to Parallel Computing Using CUDA

Ken Domino, Domem Technologies
May 23, 2011

IEEE Boston Continuing Education Program

Page 2: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

• Very similar to Bubble Sort
• Easily parallelizable

Page 3: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

for K = 1 to ⌈n/2⌉ do
    for I = 1, 3, 5, ..., n-2 do
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
    for I = 0, 2, 4, 6, ..., n-2 do
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
end for

http://en.wikipedia.org/wiki/Odd-even_sort => http://www.eli.sdsu.edu/courses/spring96/cs662/notes/assRelated/assRelated.html

1 7 4 0 9 4 8 8 2 4 5 5 1 7 1 1
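The pass structure above can be written as a short sequential C++ function (a minimal sketch of the algorithm, not code from the slides; the GPU version appears on later slides):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sequential odd-even ("brick") sort: alternate odd-indexed and
// even-indexed compare-exchange passes until a full sweep swaps nothing.
void oddEvenSort(std::vector<int>& x) {
    bool sorted = false;
    while (!sorted) {
        sorted = true;
        for (std::size_t i = 1; i + 1 < x.size(); i += 2)   // odd phase
            if (x[i] > x[i + 1]) { std::swap(x[i], x[i + 1]); sorted = false; }
        for (std::size_t i = 0; i + 1 < x.size(); i += 2)   // even phase
            if (x[i] > x[i + 1]) { std::swap(x[i], x[i + 1]); sorted = false; }
    }
}
```

Each phase touches disjoint pairs of elements, which is exactly what makes the inner loops parallelizable.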

Page 4: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

for K = 1 to ⌈n/2⌉ do
    for I = 1, 3, 5, ..., n-2 do
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
    for I = 0, 2, 4, 6, ..., n-2 do
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
end for

http://en.wikipedia.org/wiki/Odd-even_sort => http://www.eli.sdsu.edu/courses/spring96/cs662/notes/assRelated/assRelated.html

1 7 4 0 9 4 8 8 2 4 5 5 1 7 1 1
1 4 7 0 9 4 8 2 8 4 5 1 5 1 7 1

Page 5: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

for K = 1 to ⌈n/2⌉ do
    for I = 1, 3, 5, ..., n-2 do
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
    for I = 0, 2, 4, 6, ..., n-2 do
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
end for

http://en.wikipedia.org/wiki/Odd-even_sort => http://www.eli.sdsu.edu/courses/spring96/cs662/notes/assRelated/assRelated.html

1 7 4 0 9 4 8 8 2 4 5 5 1 7 1 1
1 4 7 0 9 4 8 2 8 4 5 1 5 1 7 1
1 4 0 7 4 9 2 8 4 8 1 5 1 5 1 7

Page 6: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

for K = 1 to ⌈n/2⌉ do
    for I = 1, 3, 5, ..., n-2 do
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
    for I = 0, 2, 4, 6, ..., n-2 do
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
end for

http://en.wikipedia.org/wiki/Odd-even_sort => http://www.eli.sdsu.edu/courses/spring96/cs662/notes/assRelated/assRelated.html

1 7 4 0 9 4 8 8 2 4 5 5 1 7 1 1
1 4 7 0 9 4 8 2 8 4 5 1 5 1 7 1
1 4 0 7 4 9 2 8 4 8 1 5 1 5 1 7
1 4 0 7 4 9 2 8 4 8 1 5 1 5 1 7
1 0 4 4 7 2 9 4 8 1 8 1 5 1 5 7
0 1 4 4 2 7 4 9 1 8 1 8 1 5 5 7
0 1 4 4 2 7 4 9 1 8 1 8 1 5 5 7
0 1 4 2 4 4 7 1 9 1 8 1 8 5 5 7
0 1 2 4 4 4 1 7 1 9 1 8 5 8 5 7
0 1 2 4 4 4 1 7 1 9 1 8 5 8 5 7
0 1 2 4 4 1 4 1 7 1 9 5 8 5 8 7
0 1 2 4 1 4 1 4 1 7 5 9 5 8 7 8
0 1 2 4 1 4 1 4 1 7 5 9 5 8 7 8
0 1 2 1 4 1 4 1 4 5 7 5 9 7 8 8
0 1 1 2 1 4 1 4 4 5 5 7 7 9 8 8
0 1 1 2 1 4 1 4 4 5 5 7 7 9 8 8
0 1 1 1 2 1 4 4 4 5 5 7 7 8 9 8
0 1 1 1 1 2 4 4 4 5 5 7 7 8 8 9
0 1 1 1 1 2 4 4 4 5 5 7 7 8 8 9
0 1 1 1 1 2 4 4 4 5 5 7 7 8 8 9
0 1 1 1 1 2 4 4 4 5 5 7 7 8 8 9

Page 7: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

for K = 1 to ⌈n/2⌉ do
    for I = 1, 3, 5, ..., n-2 do in parallel
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
    for I = 0, 2, 4, 6, ..., n-2 do in parallel
        if x[I] > x[I+1] then swap(x[I], x[I+1])
    end for
end for

http://en.wikipedia.org/wiki/Odd-even_sort => http://www.eli.sdsu.edu/courses/spring96/cs662/notes/assRelated/assRelated.html

Page 8: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

void EvenOddSorter::gpuSort(int * _a, int _length)
{
    a = _a;
    length = _length;
    int * d_a;
    cudaMalloc((void**)&d_a, sizeof(int) * length);
    cudaMemcpy(d_a, a, sizeof(int) * length, cudaMemcpyHostToDevice);

    int sorted = 0;
    int * d_sorted;
    cudaMalloc((void**)&d_sorted, sizeof(int));
    while (sorted == 0)
    {
        sorted = 1;
        cudaMemcpy(d_sorted, &sorted, sizeof(int), cudaMemcpyHostToDevice);
        kernel1<<<length / 32 + 1, 32, 33 * sizeof(int)>>>(d_a, length, d_sorted);
        kernel2<<<length / 32 + 1, 32, 33 * sizeof(int)>>>(d_a, length, d_sorted);
        cudaMemcpy(a, d_a, length * sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(&sorted, d_sorted, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_a);
}

void cpuSort(int * _a, int _length)
{
    a = _a;
    length = _length;
    bool sorted = false;
    while (!sorted)
    {
        sorted = true;
        for (int i = 1; i < length-1; i += 2)
            if (EvenOddSorter::compare(a, i, i+1, ASCENDING)) sorted = false;
        for (int i = 0; i < length-1; i += 2)
            if (EvenOddSorter::compare(a, i, i+1, ASCENDING)) sorted = false;
    }
}

Page 9: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

void EvenOddSorter::gpuSort(int * _a, int _length)
{
    a = _a;
    length = _length;
    int * d_a;
    cudaMalloc((void**)&d_a, sizeof(int) * length);
    cudaMemcpy(d_a, a, sizeof(int) * length, cudaMemcpyHostToDevice);

    int sorted = 0;
    int * d_sorted;
    cudaMalloc((void**)&d_sorted, sizeof(int));
    while (sorted == 0)
    {
        sorted = 1;
        cudaMemcpy(d_sorted, &sorted, sizeof(int), cudaMemcpyHostToDevice);
        kernel1<<<length / 32 + 1, 32, 65 * sizeof(int)>>>(d_a, length, d_sorted);
        kernel2<<<length / 32 + 1, 32, 65 * sizeof(int)>>>(d_a, length, d_sorted);
        cudaMemcpy(a, d_a, length * sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(&sorted, d_sorted, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_a);
}

Work in blocks of 32 threads.

Thread 0 works with a[0..1] (or a[1..2])

Thread 1 works with a[2..3] (or a[3..4])

Thread 31 works with a[62..63] (or a[63..64])

=> No gaps in thread IDs

=> Allocate shared memory to hold 65 contiguous elements of the input

Page 10: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

__global__ void kernel1(int * a, int length, int * sorted)
{
    const unsigned int i = 1 + 2 * (threadIdx.x + blockIdx.x * blockDim.x);
    if (i+1 >= length) return;

    // Copy input to shared mem.
    extern __shared__ int shared[];
    int * sa = &shared[-(1 + 2 * blockIdx.x * blockDim.x)];
    sa[i] = a[i];
    sa[i+1] = a[i+1];
    __syncthreads();

    if (EvenOddSorter::compare(sa, i, i+1, EvenOddSorter::ASCENDING))
    {
        *sorted = 0;
        // Write result.
        a[i] = sa[i];
        a[i+1] = sa[i+1];
    }
    __syncthreads();
}


Page 11: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

__global__ void kernel2(int * a, int length, int * sorted)
{
    const unsigned int i = 2 * (threadIdx.x + blockIdx.x * blockDim.x);
    if (i+1 >= length) return;

    // Copy input to shared mem.
    extern __shared__ int shared[];
    int * sa = &shared[-(2 * blockIdx.x * blockDim.x)];
    sa[i] = a[i];
    sa[i+1] = a[i+1];
    __syncthreads();

    if (EvenOddSorter::compare(sa, i, i+1, EvenOddSorter::ASCENDING))
    {
        *sorted = 0;
        // Write result.
        a[i] = sa[i];
        a[i+1] = sa[i+1];
    }
    __syncthreads();
}


Page 12: Lecture 4: Introduction to Parallel Computing Using CUDA

Even/Odd sort

• Easy to implement
• But VERY slow
• Annoying CUDA notes:
  – void EvenOddSorter::gpuSort(int * _a, int _length) is in a class;
  – __global__ void kernel1(int * a, int length, int * sorted) (i.e., "kernel") routines cannot be contained in a class!

Page 13: Lecture 4: Introduction to Parallel Computing Using CUDA

Bitonic Sort

• Fast and easy to implement in CUDA• Based on merging of sorted lists

Page 14: Lecture 4: Introduction to Parallel Computing Using CUDA

Bitonic Sort

• A sorted sequence is a monotonically non-decreasing (or non-increasing) sequence.

• A bitonic sequence is a sequence with x[0] ≤ x[1] ≤ x[2] … ≤ x[k] ≥ x[k+1] … ≥ x[n-1] for some k, where 0 ≤ k < n, or a circular shift of such a sequence.

• [1, 2, 5, 7, 4, 3], [3, 1, 2, 5, 7, 4] are bitonic sequences.

• Bitonic sequences are graphically represented with lines.
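The definition can be checked mechanically: a sequence is bitonic exactly when, going around it circularly, the direction (rising or falling) changes at most twice. A small checker (a sketch for illustration, not from the slides):

```cpp
#include <cstddef>
#include <vector>

// True if `s` is bitonic: some circular shift of it is non-decreasing
// up to a peak and then non-increasing. Equivalently, the circular
// sequence of rise/fall directions changes sign at most twice.
bool isBitonic(const std::vector<int>& s) {
    std::size_t n = s.size();
    if (n < 3) return true;
    std::vector<int> signs;                       // +1 rise, -1 fall
    for (std::size_t i = 0; i < n; ++i) {
        int d = s[(i + 1) % n] - s[i];            // circular difference
        if (d != 0) signs.push_back(d > 0 ? 1 : -1);
    }
    if (signs.empty()) return true;               // all elements equal
    int changes = 0;
    for (std::size_t i = 0; i < signs.size(); ++i)
        if (signs[i] != signs[(i + 1) % signs.size()]) ++changes;
    return changes <= 2;
}
```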

Page 15: Lecture 4: Introduction to Parallel Computing Using CUDA

Bitonic Split

• Consider a special bitonic sequence: x[0] ≤ x[1] ≤ x[2] ≤ … ≤ x[n/2 - 1], and x[n/2] ≥ x[n/2 + 1] ≥ … ≥ x[n - 1].

• A bitonic split produces two subsequences:
s1 = {min(x[0], x[n/2]), min(x[1], x[n/2 + 1]), …, min(x[n/2 - 1], x[n - 1])}
s2 = {max(x[0], x[n/2]), max(x[1], x[n/2 + 1]), …, max(x[n/2 - 1], x[n - 1])}

• s = [1, 2, 5, 7, 4, 3] => s1 = [1, 2, 3], s2 = [7, 4, 5]
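The split can be written directly from the definition (a minimal sketch, assuming an even-length input):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Bitonic split: pair element i with element i + n/2, taking the mins
// into s1 and the maxes into s2. Both halves are again bitonic, and
// every element of s1 is <= every element of s2.
std::pair<std::vector<int>, std::vector<int>>
bitonicSplit(const std::vector<int>& x) {
    std::size_t h = x.size() / 2;
    std::vector<int> s1(h), s2(h);
    for (std::size_t i = 0; i < h; ++i) {
        s1[i] = std::min(x[i], x[i + h]);
        s2[i] = std::max(x[i], x[i + h]);
    }
    return {s1, s2};
}
```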

Page 16: Lecture 4: Introduction to Parallel Computing Using CUDA

Bitonic Merge

• With a bitonic sequence, a merge results in a sorted sequence. We reorder the two halves so that each half is bitonic, then sort each half by divide and conquer. (Note: direction!)

• Append s1 and s2, each of length n/2. Then:

Merge(int lo, int n)
    for (int i = lo; i < lo + n/2; ++i)
        if (s[i] > s[i + n/2])
            exchange(s[i], s[i + n/2])
    Merge(lo, n/2)
    Merge(lo + n/2, n/2)

• s1 = [1, 3], s2 = [7, 0] => [1, 3, 7, 0] => [1, 0, 7, 3] => [0, 1, 7, 3] => [0, 1, 3, 7]
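The merge pseudocode translates to plain C++ as follows (a minimal sketch, assuming the segment length n is a power of two):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Recursively merge a bitonic sequence of length n starting at `lo`
// into ascending order: one compare-exchange sweep between the two
// halves, then merge each half.
void bitonicMerge(std::vector<int>& s, std::size_t lo, std::size_t n) {
    if (n <= 1) return;
    std::size_t m = n / 2;
    for (std::size_t i = lo; i < lo + m; ++i)
        if (s[i] > s[i + m]) std::swap(s[i], s[i + m]);
    bitonicMerge(s, lo, m);
    bitonicMerge(s, lo + m, m);
}
```

Running it on the slide's example reproduces the steps shown: [1, 3, 7, 0] becomes [1, 0, 7, 3] after the sweep, and [0, 1, 3, 7] after the two half-merges.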

Page 17: Lecture 4: Introduction to Parallel Computing Using CUDA

Bitonic Sort

void sort(int lo, int n, bool dir)
{
    if (n > 1)
    {
        int m = n/2;
        sort(lo, m, ASCENDING);
        sort(lo+m, m, DESCENDING);
        merge(lo, n, dir);
    }
}

void merge(int lo, int n, bool dir)
{
    if (n > 1)
    {
        int m = n/2;
        for (int i = lo; i < lo+m; i++)
            compare(i, i+m, dir);
        merge(lo, m, dir);
        merge(lo+m, m, dir);
    }
}

1 7 4 0 9 4 8 8 2 4 5 5 1 7 1 1
sort lo = 0 n = 16 dir = 1
  sort lo = 0 n = 8 dir = 1
    sort lo = 0 n = 4 dir = 1
      sort lo = 0 n = 2 dir = 1
        merge lo = 0 n = 2 dir = 1
          compare i = 0, i+m = 1  a[0] = 1, a[1] = 7
      sort lo = 2 n = 2 dir = 0
        merge lo = 2 n = 2 dir = 0
          compare i = 2, i+m = 3  a[2] = 4, a[3] = 0
      merge lo = 0 n = 4 dir = 1
        compare i = 0, i+m = 2  a[0] = 1, a[2] = 4
        compare i = 1, i+m = 3  a[1] = 7, a[3] = 0  exchanged
1 0 4 7 9 4 8 8 2 4 5 5 1 7 1 1
        merge lo = 0 n = 2 dir = 1
          compare i = 0, i+m = 1  a[0] = 1, a[1] = 0  exchanged
0 1 4 7 9 4 8 8 2 4 5 5 1 7 1 1
        merge lo = 2 n = 2 dir = 1
          compare i = 2, i+m = 3  a[2] = 4, a[3] = 7
    sort lo = 4 n = 4 dir = 0
...
0 1 4 7 4 9 8 8 2 4 5 5 1 7 1 1
0 1 4 7 8 9 4 8 2 4 5 5 1 7 1 1
0 1 4 7 9 8 4 8 2 4 5 5 1 7 1 1
0 1 4 7 9 8 8 4 2 4 5 5 1 7 1 1

Page 18: Lecture 4: Introduction to Parallel Computing Using CUDA

Bitonic Sort

• Known as a sorting network.

http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm

Page 19: Lecture 4: Introduction to Parallel Computing Using CUDA

Bitonic Sort in CUDA

void cpuSortNonrecursive(int * _a, int _length)
{
    a = _a;
    length = _length;
    int i, j, k;
    // Parallel bitonic sort
    for (k = 2; k <= length; k = 2*k)
    {
        // Bitonic merge
        for (j = k >> 1; j > 0; j = j >> 1)
        {
            for (i = 0; i < length; i++)
            {
                int ixj = i ^ j;
                if (ixj > i)
                {
                    if ((i & k) == 0 && a[i] > a[ixj]) exchange(i, ixj);
                    if ((i & k) != 0 && a[i] < a[ixj]) exchange(i, ixj);
                }
            }
        }
    }
}

__global__ void bitonicSort(int * a, int length)
{
    const unsigned int i = threadIdx.x;
    // Copy input to shared mem.
    extern __shared__ int sa[];
    sa[i] = a[i];
    __syncthreads();
    for (unsigned int k = 2; k <= length; k *= 2)
    {
        for (unsigned int j = k / 2; j > 0; j /= 2)
        {
            unsigned int ixj = i ^ j;
            if (ixj > i)
            {
                if ((i & k) == 0 && sa[i] > sa[ixj]) swap(sa[i], sa[ixj]);
                if ((i & k) != 0 && sa[i] < sa[ixj]) swap(sa[i], sa[ixj]);
            }
            __syncthreads();
        }
    }
    // Write result.
    a[i] = sa[i];
}

Page 20: Lecture 4: Introduction to Parallel Computing Using CUDA

Bitonic Sort in CUDA

• Driver not scalable for length > 1024.
• (Scaling is an exercise. Answer in the CUDA SDK examples. Hint: need to factor the sort and merge steps.)

void gpuSort(int * _a, int _length)
{
    a = _a;
    length = _length;
    int * d_a;
    cudaMalloc((void**)&d_a, sizeof(int) * length);
    cudaMemcpy(d_a, a, sizeof(int) * length, cudaMemcpyHostToDevice);
    bitonicSort<<<1, length, sizeof(int) * length>>>(d_a, length);
    cudaMemcpy(a, d_a, sizeof(int) * length, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
}

Page 21: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering

• Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):

D = Σ_{i=1..k} Σ_{xj ∈ Si} ‖xj − μi‖²

where μi is the mean of the points in Si.

Page 22: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering

• Example
  – Xi = [1, 2, 3, 7, 8, 9]
  – K = 2
  – S = {S1, S2} where S1 = [1, 2, 3], S2 = [7, 8, 9]
  – μ1 = 2, μ2 = 8
  – D = (1-2)² + (2-2)² + (3-2)² + (7-8)² + (8-8)² + (9-8)² = 4
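The arithmetic above can be checked with a few lines of C++ (a sketch for 1-D points only; the slides' Point type is more general):

```cpp
#include <cstddef>
#include <vector>

// Within-cluster sum of squares (WCSS) for 1-D observations, given a
// cluster assignment per point and the per-cluster means.
double wcss(const std::vector<double>& x,
            const std::vector<int>& cluster,
            const std::vector<double>& mu) {
    double d = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double diff = x[i] - mu[cluster[i]];
        d += diff * diff;   // squared distance to own cluster's mean
    }
    return d;
}
```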

Page 23: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering sequential

void cpu_kmeans(int k, int length, Point * points, int * clusters)
{
    Point * means = (Point*)malloc(k * sizeof(Point));
    for (;;)
    {
        // Compute means of all clusters.
        for (int i = 0; i < k; ++i)
        {
            means[i].ZeroOut();
            int count = 0;
            for (int j = 0; j < length; ++j)
            {
                if (i == clusters[j])
                {
                    count++;
                    means[i] = means[i] + points[j];
                }
            }
            if (count != 0) means[i] = means[i] / count;
        }

        // For all points, get the minimum distance from the point to all
        // cluster means, and reassign the point to a cluster if the
        // minimum distance has changed.
        bool changed = false;
        for (int i = 0; i < length; ++i)
        {
            float min_delta = FLT_MAX;
            int min_cluster = clusters[i];
            for (int j = 0; j < k; ++j)
            {
                float delta = Point::distance(points[i], means[j]);
                if (min_delta > delta)
                {
                    min_cluster = j;
                    min_delta = delta;
                }
            }
            if (min_cluster != clusters[i])
            {
                changed = true;
                clusters[i] = min_cluster;
            }
        }

        if (!changed) break;
    }
}

Page 24: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering sequential (same code as on Page 23)

points = [1, 2, 3, 7, 8, 9]
clusters = [0, 1, 0, 1, 0, 1]
means = [4.0, 6.0]

Page 25: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering sequential (same code as on Page 23)

points = [1, 2, 3, 7, 8, 9]
clusters = [0, 0, 0, 1, 1, 1]
means = [4.0, 6.0]
changed = true

Page 26: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering sequential (same code as on Page 23)

points = [1, 2, 3, 7, 8, 9]
clusters = [0, 0, 0, 1, 1, 1]
means = [2.0, 8.0]
changed = false

Page 27: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering sequential (same code as on Page 23)

points = [1, 2, 3, 7, 8, 9]
clusters = [0, 0, 0, 1, 1, 1]
means = [2.0, 8.0]
changed = false => break

Page 28: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel? (the same sequential code as on Page 23)

How to implement nested for-loops?

Page 29: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel?

// Compute means of all clusters.
for (int i = 0; i < k; ++i)
{
    means[i].ZeroOut();
    int count = 0;
    for (int j = 0; j < length; ++j)
    {
        if (i == clusters[j])
        {
            count++;
            means[i] = means[i] + points[j];
        }
    }
    if (count != 0) means[i] = means[i] / count;
}

__global__ void compute_means_orig(Point * means, int length, Point * points, int * clusters)
{
    int i = threadIdx.x;
    means[i].ZeroOut();
    int count = 0;
    for (int j = 0; j < length; ++j)
    {
        if (i == clusters[j])
        {
            count++;
            means[i] = means[i] + points[j];
        }
    }
    if (count != 0) means[i] = means[i] / count;
}

Naïve implementation: one thread per cluster computes that cluster's mean.

It works, but it does not use shared memory, and it accounts for 98% of the overall runtime.

Page 30: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel

void gpu_kmeans(int k, int length, Point * points, int * clusters)
{
    Point * means = (Point*)malloc(k * sizeof(Point));
    Point * d_means;
    cudaMalloc(&d_means, k * sizeof(Point));

    Point * d_points;
    cudaMalloc(&d_points, length * sizeof(Point));
    cudaMemcpy(d_points, points, length * sizeof(Point), cudaMemcpyHostToDevice);

    int * d_clusters;
    cudaMalloc(&d_clusters, length * sizeof(int));
    cudaMemcpy(d_clusters, clusters, length * sizeof(int), cudaMemcpyHostToDevice);

    bool * d_changed;
    cudaMalloc(&d_changed, sizeof(bool));

    int * count = (int*)malloc(k * sizeof(int));
    for (int i = 0; i < k; ++i) count[i] = 0;
    int * d_count;
    cudaMalloc(&d_count, k * sizeof(int));

    for (;;)
    {
        // Compute means of all clusters.
        compute_means0<<<1, k>>>(d_means, d_count, k);
        compute_means1<<<length / 32 + 1, 32, k * sizeof(Point) + k * sizeof(int)>>>(
            d_means, d_count, k, length, d_points, d_clusters);
        compute_means2<<<1, k>>>(d_means, d_count, k);

        // For every point, compute the minimum distance from the point to all
        // cluster means, and reassign the point to a cluster if the minimum
        // distance has changed.
        bool changed = false;
        cudaMemcpy(d_changed, &changed, sizeof(bool), cudaMemcpyHostToDevice);
        compute_clusters<<<length / 32 + 1, 32, k * sizeof(Point)>>>(
            d_means, k, length, d_points, d_clusters, d_changed);
        cudaMemcpy(&changed, d_changed, sizeof(bool), cudaMemcpyDeviceToHost);

        if (!changed) break;
    }
    cudaMemcpy(clusters, d_clusters, length * sizeof(int), cudaMemcpyDeviceToHost);
}

There are three kernels for calculating the means: (1) compute_means0 initializes the means to 0; (2) compute_means1 computes the sums of points and counts; (3) compute_means2 divides each mean by its count.

Page 31: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel

• Kernel compute_means0:

– k threads per block, one block

[Diagram] means[] = [0.1, 2.0, -1.3, 4.3]

Page 32: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel

• Initialize counts and means (k threads for k means)

__global__ void compute_means0(Point * means, int * count, int k)
{
    int i = threadIdx.x;
    count[i] = 0;
    means[i].ZeroOut();
}

• Note: CUDA annoyance. You cannot define static device data in a class, e.g., for a singleton or constants:
  – static __device__ Point Point::Zero;
  – means[i] = Zero; => error: memory qualifier on a data member is not allowed

Page 33: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel

• Kernel compute_means1:

– 32 threads per block, length / 32 + 1 blocks

[Diagram] points[] — the input array of point values (0.1, 2.0, -1.3, 4.3, repeated), processed 32 per block

Page 34: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel

• Compute counts and means (32 threads per block, for all points)

__global__ void compute_means1(Point * means, int * count, int k, int length, Point * points, int * clusters)
{
    extern __shared__ int buffer[];
    Point * smeans = (Point*)buffer;
    int * scount = (int*)&smeans[k];

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= length) return;
    if (threadIdx.x == 0)
    {
        for (int j = 0; j < k; ++j)
        {
            smeans[j].ZeroOut();
            scount[j] = 0;
        }
    }
    __syncthreads();

    int c = clusters[i];
    for (int j = 0; j < k; ++j)
    {
        if (j == c)
        {
            atomicAdd(&scount[j], 1);
            for (int d = 0; d < NUM_COMPONENTS; ++d)
                atomicAdd(&smeans[j].x[d], points[i].x[d]);
        }
    }
    __syncthreads();

    if (threadIdx.x == 0)
    {
        for (int j = 0; j < k; ++j)
        {
            count[j] += scount[j];
            for (int d = 0; d < NUM_COMPONENTS; ++d)
                means[j].x[d] += smeans[j].x[d];
        }
    }
    __syncthreads();
}

Page 35: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel

• Kernel compute_means2:

– k threads per block, one block

[Diagram] means[] = [0.91, 20.0, -1.93, 4.63]

[Diagram] count[] = [11, 44, 19, 65]

Page 36: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel

• Compute means = sum divided by count for that cluster.

__global__ void compute_means2(Point * means, int * count, int k)
{
    int i = threadIdx.x;
    if (count[i] != 0) means[i] = means[i] / count[i];
}

Note: this updates the global array "means", which is then used by the final kernel.

Page 37: Lecture 4: Introduction to Parallel Computing Using CUDA

K-Means Clustering parallel

__global__ void compute_clusters(Point * means, int k, int length, Point * points, int * clusters, bool * changed)
{
    //extern __shared__ int buffer[];
    //Point * smeans = (Point*)buffer;
    Point * smeans = means;

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= length) return;

    // Preload means.
    if (threadIdx.x == 0)
    {
        for (int j = 0; j < k; ++j) smeans[j] = means[j];
    }
    __syncthreads();

    float min_delta = FLT_MAX;
    int min_cluster = clusters[i];
    for (int j = 0; j < k; ++j)
    {
        float delta = Point::distance(points[i], means[j]);
        if (min_delta > delta)
        {
            min_cluster = j;
            min_delta = delta;
        }
    }
    if (min_cluster != clusters[i])
    {
        *changed = true;
        clusters[i] = min_cluster;
    }
}

Page 38: Lecture 4: Introduction to Parallel Computing Using CUDA

Breadth-First Search

• A graph is a collection of nodes (V) and edges (E). Edges can be directed or undirected.

[Diagram] An example graph, with nodes labeled 0, 3, 5, 2, 20, 13, 15, 7, 9, 18, 11, 6

Page 39: Lecture 4: Introduction to Parallel Computing Using CUDA

Breadth-First Search

• A graph can be represented in a number of ways. If the graph is large, it is very important to choose the representation carefully, because the GPU has limited memory.

Page 40: Lecture 4: Introduction to Parallel Computing Using CUDA

Breadth-First Search

• Breadth-first search is a fundamental algorithm for graphs.

• The idea is to keep track of a set of frontier nodes, mark them as such, and navigate to the next set of nodes along edges from each node in the frontier.
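The frontier idea can be sketched as a level-synchronous CPU BFS over a CSR (compressed sparse row) graph; this is the same scheme the CUDA kernels on the later slides implement (a minimal illustration, not the lecture's code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Level-synchronous BFS on a CSR graph (rowPtr/colIdx).
// Returns the BFS level (depth) of each vertex, or -1 if unreachable.
std::vector<int> bfsLevels(const std::vector<int>& rowPtr,
                           const std::vector<int>& colIdx, int src) {
    int n = (int)rowPtr.size() - 1;
    std::vector<int> level(n, -1);
    std::vector<int> frontier{src};    // current frontier set
    level[src] = 0;
    int depth = 0;
    while (!frontier.empty()) {
        std::vector<int> next;         // update frontier
        ++depth;
        for (int u : frontier)
            for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
                int v = colIdx[e];
                if (level[v] == -1) {  // not yet visited
                    level[v] = depth;
                    next.push_back(v);
                }
            }
        frontier = std::move(next);
    }
    return level;
}
```

The inner loop over the frontier is the part that the GPU version assigns one thread per vertex.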

Page 41: Lecture 4: Introduction to Parallel Computing Using CUDA

Breadth-First Search

Cormen, T., Leiserson, C., Rivest, R. and Stein, C. Introduction to Algorithms. MIT Press / McGraw-Hill, 2001.

Page 42: Lecture 4: Introduction to Parallel Computing Using CUDA

Parallel Breadth-First Search

• Parallel BFS is performed by keeping a set of nodes for the frontier (F), the update frontier (Fu), visited (X), and level (C).

Harish, P., Vineet, V. and Narayanan, P. Large Graph Algorithms for Massively Multithreaded Architectures. IIIT Hyderabad, India, 2009.

Page 43: Lecture 4: Introduction to Parallel Computing Using CUDA

Parallel Breadth-First Search

• Kernel1 picks a node that is in the frontier (F) and marks it as no longer in the frontier.
• For all neighbor nodes that are not visited, it records the depth for the neighbor and marks it as frontier.

Page 44: Lecture 4: Introduction to Parallel Computing Using CUDA

Parallel Breadth-First Search

• Kernel2 updates the frontier and visited sets.
• If the frontier changed, then the algorithm must pass over the new frontier.

Page 45: Lecture 4: Introduction to Parallel Computing Using CUDA

Breadth-First Search in CUDA

signed char * BFS::gpuBFS(int block_side)
{
    // Copy graph into GPU memory.
    int * V_d;
    int * E_d;
    …
    memset(F_h, 0, Vs * sizeof(unsigned char));
    memset(C_h, -1, Vs * sizeof(signed char));
    memset(X_h, 0, Vs * sizeof(unsigned char));
    F_h[S] = 1;
    C_h[S] = 0;
    …
    int * any_change_d;
    if (cudaMalloc((void**)&any_change_d, sizeof(int))) return 0;
    …
    int N = graph->Vs;
    double side = sqrt((double)N);
    side = ceil(side);
    int nside = (int)side;
    int n_blocks = nside / block_side + (nside % block_side == 0 ? 0 : 1);
    int count = 0;
    for (;;)
    {
        *any_change_h = 0;
        cudaMemcpy(any_change_d, any_change_h, sizeof(int), cudaMemcpyHostToDevice);
        dim3 Dg(n_blocks, n_blocks, 1);
        dim3 Db(block_side, block_side, 1);
        count++;
        do_kernel_bfs1<<<Dg, Db>>>(any_change_d, Vs, V_d, Es, E_d, EIs, EI_d, F_d, newF_d, C_d, X_d);
        cudaMemcpy(any_change_h, any_change_d, sizeof(int), cudaMemcpyDeviceToHost);
        if (*any_change_h == 0) break;
        do_kernel_bfs2<<<Dg, Db>>>(any_change_d, Vs, V_d, Es, E_d, EIs, EI_d, F_d, newF_d, C_d, X_d);
        cudaMemcpy(any_change_h, any_change_d, sizeof(int), cudaMemcpyDeviceToHost);
        if (*any_change_h == 0) break;
    }
    cudaMemcpy(C_h, C_d, Vs * sizeof(signed char), cudaMemcpyDeviceToHost);
    …
    return C_h;
}

Two kernels with the same dimensions. Is there a way to do global synchronization so that only one kernel could be called? Not known.

Page 46: Lecture 4: Introduction to Parallel Computing Using CUDA

Matrix Multiplication

• Matrix multiplication (MM) is used everywhere!

Page 47: Lecture 4: Introduction to Parallel Computing Using CUDA

Matrix Multiplication

• A simple sequential CPU implementation:

static bool Multiply_Host(Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
{
    int hA = A->height;
    int wA = A->width;
    int wB = B->width;
    for (int i = 0; i < hA; ++i)
        for (int j = 0; j < wB; ++j)
        {
            T sum = 0;
            for (int k = 0; k < wA; ++k)
            {
                T a = A->data[i * wA + k];
                T b = B->data[k * wB + j];
                sum += a * b;
            }
            C->data[i * wB + j] = sum;
        }
    return true;
}
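The same triple loop on flat row-major arrays, without the slides' Matrix<T> wrapper (a self-contained sketch for reference):

```cpp
#include <cstddef>
#include <vector>

// Row-major triple-loop matrix multiply C = A * B, where A is hA x wA
// and B is wA x wB. Element (i, j) of a w-wide matrix lives at [i*w + j].
std::vector<float> matmulHost(const std::vector<float>& A,
                              const std::vector<float>& B,
                              int hA, int wA, int wB) {
    std::vector<float> C((std::size_t)hA * wB, 0.0f);
    for (int i = 0; i < hA; ++i)
        for (int j = 0; j < wB; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < wA; ++k)
                sum += A[i * wA + k] * B[k * wB + j];
            C[i * wB + j] = sum;
        }
    return C;
}
```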

Page 48: Lecture 4: Introduction to Parallel Computing Using CUDA

Simple Parallel Matrix Multiplication

• Idea: compute each C(i, j) in parallel, each in its own thread.

Page 49: Lecture 4: Introduction to Parallel Computing Using CUDA

Simple Parallel Matrix Multiplication

// setup execution parameters
dim3 threads(wC, hC);
dim3 grid(1, 1);
Kernel_Matrix_Multiply_Simple<T><<< grid, threads >>>(d_C, d_A, d_B);

template <class T>
__global__ void Kernel_Matrix_Multiply_Simple(Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
{
    int wA = A->width;
    int wB = B->width;
    int wC = C->width;
    // 2D thread ID
    int col = threadIdx.x;
    int row = threadIdx.y;
    // Pvalue stores the element of C that is computed by the thread
    T Pvalue = 0;
    for (int k = 0; k < wA; ++k)
    {
        T Aelement = A->data[row * wA + k];
        T Belement = B->data[k * wB + col];
        Pvalue += Aelement * Belement;
    }
    // Write the matrix to device memory; each thread writes one element
    C->data[row * wC + col] = Pvalue;
}

Page 50: Lecture 4: Introduction to Parallel Computing Using CUDA

Simple Parallel Matrix Multiplication

• Problems:
  – Slow
  – Not scalable, because the (r, c) maximum is (1024, 1024)

Page 51: Lecture 4: Introduction to Parallel Computing Using CUDA

Simple Scalable Parallel Matrix Multiplication

• Idea: compute C(i, j) in parallel in a block-wise fashion, each in its own thread, one tile at a time.

Page 52: Lecture 4: Introduction to Parallel Computing Using CUDA

Simple Scalable Parallel Matrix Multiplication

template <class T>
__global__ void Kernel_Matrix_Multiply_Simple_Tile(int wTile, int hTile, Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
{
    // get column number (x).
    int tx = threadIdx.x + blockIdx.x * wTile;

    // get row number (y).
    int ty = threadIdx.y + blockIdx.y * hTile;

    int wA = A->width;
    int wB = B->width;
    int wC = C->width;

    // Bounds checking...
    if (tx >= C->width || ty >= C->height) return;

    // Pvalue stores the element of C that is computed by the thread
    T Pvalue = 0;
    for (int k = 0; k < wA; ++k)
    {
        T Aelement = A->data[ty * wA + k];
        T Belement = B->data[k * wB + tx];
        Pvalue += Aelement * Belement;
    }
    // Write the matrix to device memory; each thread writes one element
    C->data[ty * wC + tx] = Pvalue;
}

dim3 threads(wTile, hTile);
dim3 grid(wC / wTile, hC / hTile);
Kernel_Matrix_Multiply_Simple_Tile<T><<< grid, threads >>>(wTile, hTile, d_C, d_A, d_B);

Page 53: Lecture 4: Introduction to Parallel Computing Using CUDA

Simple Scalable Parallel Matrix Multiplication

• Problem:
  – Still not that fast, because shared memory is not used.

Page 54: Lecture 4: Introduction to Parallel Computing Using CUDA

Fast Scalable Parallel Matrix Multiplication

• Idea:
  – Use shared memory to load sub-matrices (tiles) of A and B of size blockDim.x by blockDim.y
  – E.g., for block (0,1):
    • t(0,0) loads A(0,0) and B(0,2)
    • t(0,1) loads A(0,1) and B(0,3)
    • t(1,0) loads A(1,0) and B(1,2)
    • t(1,1) loads A(1,1) and B(1,3)

Page 55: Lecture 4: Introduction to Parallel Computing Using CUDA

Fast Scalable Parallel Matrix Multiplication

template <class T>
__global__ void Kernel_Matrix_Multiply_Fancy(int wTile, int hTile, Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
{
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]
#define MAX_BLOCK_SIZE 30
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int wA = A->width;
    int wB = B->width;
    int aBegin = wA * hTile * by;
    int aEnd = aBegin + wA - 1;
    int aStep = wTile;
    int bBegin = wTile * bx;
    int bStep = wTile * wB;
    T Csub = 0;
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        __shared__ T As[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];
        __shared__ T Bs[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];
        AS(ty, tx) = A->data[a + wA * ty + tx];
        BS(ty, tx) = B->data[b + wB * ty + tx];
        __syncthreads();
        for (int k = 0; k < wTile; ++k)
            Csub += AS(ty, k) * BS(k, tx);
        __syncthreads();
    }
    int c = wB * hTile * by + wTile * bx;
    C->data[c + wB * ty + tx] = Csub;
}

Page 56: Lecture 4: Introduction to Parallel Computing Using CUDA

Wrap up

• What is CUDA?
• Why is it useful?
• How hard is it to write CUDA?
• Where is it going?