
Page 1: CUDA Lecture 6 Embarrassingly Parallel Computations

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 6: Embarrassingly Parallel Computations

Page 2: CUDA Lecture 6 Embarrassingly Parallel Computations

A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process(or).

No communication or very little communication between processes.

Each process can do its tasks without any interaction with other processes.


Embarrassingly Parallel Computations

Page 3: CUDA Lecture 6 Embarrassingly Parallel Computations

• Low-level image processing
• Mandelbrot set
• Monte Carlo calculations


Embarrassingly Parallel Computation Examples

Page 4: CUDA Lecture 6 Embarrassingly Parallel Computations

Many low-level image processing operations involve only local data, with very limited (if any) communication between areas of interest.


Topic 1: Low-level image processing

Page 5: CUDA Lecture 6 Embarrassingly Parallel Computations

Square region for each process; strips can also be used (a small helper for strip bounds follows this slide).


Partitioning into regions for individual processes
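
As a concrete illustration (not from the slides), a process could compute its strip of rows with a small C helper like the one below, assuming an N-row image divided as evenly as possible among P processes; the names are hypothetical.

/* Rows [*first, *last) handled by process p of P when an N-row
   image is divided into horizontal strips. Integer arithmetic
   spreads any remainder rows across the strips. */
void strip_bounds(int p, int P, int N, int *first, int *last)
{
    *first = (p * N) / P;
    *last  = ((p + 1) * N) / P;
}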

Page 6: CUDA Lecture 6 Embarrassingly Parallel Computations

Shifting: an object is shifted by Δx in the x-dimension and Δy in the y-dimension:

x′ = x + Δx
y′ = y + Δy

where x and y are the original coordinates and x′ and y′ are the new coordinates.

Scaling: an object is scaled by a factor of Sx in the x-direction and Sy in the y-direction:

x′ = x·Sx
y′ = y·Sy


Some geometrical operations

Page 7: CUDA Lecture 6 Embarrassingly Parallel Computations

Rotation: an object is rotated through an angle θ about the origin of the coordinate system:

x′ = x cos θ + y sin θ
y′ = −x sin θ + y cos θ

Clipping: applies defined rectangular boundaries to a figure and deletes points outside the defined area. Display the point (x, y) only if

xlow ≤ x ≤ xhigh and ylow ≤ y ≤ yhigh

otherwise discard it. (A kernel sketch combining shifting and clipping follows this slide.)


Some geometrical operations (cont.)
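
To tie these operations back to CUDA: below is a minimal kernel sketch, not part of the original lecture, that shifts a grayscale image by (dx, dy) and clips anything landing outside the boundaries. One thread handles one pixel; since every pixel is independent, the computation is embarrassingly parallel. The names and buffer layout are assumptions for illustration.

__global__ void shift_kernel(const unsigned char *in, unsigned char *out,
                             int width, int height, int dx, int dy)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int xs = x + dx;   // x' = x + delta-x
    int ys = y + dy;   // y' = y + delta-y

    // clip: only write points that remain inside the defined area
    // (assumes out[] was zeroed beforehand)
    if (xs >= 0 && xs < width && ys >= 0 && ys < height)
        out[ys * width + xs] = in[y * width + x];
}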

Page 8: CUDA Lecture 6 Embarrassingly Parallel Computations

• GPUs are sent starting row numbers. Since these are related to block and thread IDs, each GPU thread should be able to figure its row out for itself (a sketch follows this slide).
• Return results are single values. This is not good programming practice, but it is done to simplify a first exposure to CUDA programs.
• No terminating code is shown.


Notes
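
For instance, a thread might derive its own starting row as in the following minimal sketch, which assumes a 1-D launch in which each thread processes rows_per_thread consecutive rows; the helper name is hypothetical.

__device__ int starting_row(int rows_per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    return tid * rows_per_thread;                     // first row this thread owns
}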

Page 9: CUDA Lecture 6 Embarrassingly Parallel Computations

The set of points in the complex plane that are quasi-stable (they increase and decrease but do not exceed some limit) when computed by iterating the function

zₖ₊₁ = zₖ² + c

where zₖ₊₁ is the (k+1)th iteration of the complex number z = a + bi and c is a complex number giving the position of the point in the complex plane.

The initial value for z is zero.

Topic 2: Mandelbrot Set

Page 10: CUDA Lecture 6 Embarrassingly Parallel Computations

Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit.

The magnitude of z is the length of the vector, given by

length = √(a² + b²)


Mandelbrot Set (cont.)


Page 11: CUDA Lecture 6 Embarrassingly Parallel Computations

typedef struct {
    float real;
    float imag;
} complex;

int cal_pixel(complex c)
{
    int count = 0, max = 256;
    complex z;
    float temp, lengthsq;

    z.real = 0; z.imag = 0;
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;  /* Re(z*z + c) */
        z.imag = 2 * z.real * z.imag + c.imag;              /* Im(z*z + c) */
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;       /* |z|^2 */
        count++;
    } while ((lengthsq < 4.0) && (count < max));
    return count;
}


Sequential routine computing value of one point returning number of iterations
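
A routine like this would typically be driven by a loop that maps each display pixel to a point c in the complex plane. The driver below is a sketch under assumed names (real_min, scale_real, and so on), not code from the slide.

/* Map each display pixel to a point in the complex plane and
   record its iteration count. */
void compute_image(int disp_width, int disp_height,
                   float real_min, float real_max,
                   float imag_min, float imag_max, int *image)
{
    float scale_real = (real_max - real_min) / disp_width;
    float scale_imag = (imag_max - imag_min) / disp_height;

    for (int y = 0; y < disp_height; y++)
        for (int x = 0; x < disp_width; x++) {
            complex c;
            c.real = real_min + x * scale_real;
            c.imag = imag_min + y * scale_imag;
            image[y * disp_width + x] = cal_pixel(c);
        }
}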

Page 12: CUDA Lecture 6 Embarrassingly Parallel Computations

The square of the length is compared to 4 (rather than comparing the length to 2) to avoid a square root computation.

Given the terminating conditions, all of the Mandelbrot points must lie within a circle of radius 2 centered at the origin.

The resolution can be expanded at will to obtain interesting images.


Notes

Page 13: CUDA Lecture 6 Embarrassingly Parallel Computations


Mandelbrot set

Page 14: CUDA Lecture 6 Embarrassingly Parallel Computations

Dynamic task assignment
• Have processors request regions after computing previous regions.
• Doesn't lend itself to a CUDA solution.

Static task assignment
• Simply divide the region into a fixed number of parts, each computed by a separate processor.
• Not very successful, because different regions require different numbers of iterations and times…
• …but it is what we'll do in CUDA.


Parallelizing Mandelbrot Set Computation

Page 15: CUDA Lecture 6 Embarrassingly Parallel Computations


Dynamic Task Assignment: Work Pool/Processor Farms

Page 16: CUDA Lecture 6 Embarrassingly Parallel Computations

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include "../common/cpu_bitmap.h"
#define DIM 1000

struct cuComplex {
    float r, i;
    // __device__ added to the constructor so cuComplex objects
    // can be built inside device code
    __device__ cuComplex(float a, float b) : r(a), i(b) {}
    __device__ float magnitude2(void) { return r*r + i*i; }
    __device__ cuComplex operator*(const cuComplex &a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    __device__ cuComplex operator+(const cuComplex &a) {
        return cuComplex(r + a.r, i + a.i);
    }
};


Simplified CUDA Solution

Page 17: CUDA Lecture 6 Embarrassingly Parallel Computations

__device__ int mandelbrot(int x, int y)
{
    // map the pixel position to the square [-2,2] x [-2,2] in the complex plane
    float jx = 2.0 * (x - DIM/2) / (DIM/2);
    float jy = 2.0 * (y - DIM/2) / (DIM/2);

    cuComplex c(jx, jy);
    cuComplex z(0.0, 0.0);

    int i = 0;
    do {
        z = z * z + c;
        i++;
    } while ((z.magnitude2() < 4.0) && (i < 256));

    return i % 8;   // fold the iteration count into eight color classes
}


Simplified CUDA Solution (cont.)

Page 18: CUDA Lecture 6 Embarrassingly Parallel Computations

__global__ void kernel(unsigned char *ptr)
{
    // map from blockIdx to pixel position
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    // now calculate the value at that position
    int mValue = mandelbrot(x, y);

    ptr[offset*4 + 0] = 0;
    ptr[offset*4 + 1] = 0;
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;

    switch (mValue) {
        case 0:  break;                              // black
        case 1:  ptr[offset*4 + 0] = 255; break;     // red
        case 2:  ptr[offset*4 + 1] = 255; break;     // green
        case 3:  ptr[offset*4 + 2] = 255; break;     // blue
        case 4:  ptr[offset*4 + 0] = 255;            // yellow
                 ptr[offset*4 + 1] = 255; break;
        case 5:  ptr[offset*4 + 1] = 255;            // cyan
                 ptr[offset*4 + 2] = 255; break;
        case 6:  ptr[offset*4 + 0] = 255;            // magenta
                 ptr[offset*4 + 2] = 255; break;
        default: ptr[offset*4 + 0] = 255;            // white
                 ptr[offset*4 + 1] = 255;
                 ptr[offset*4 + 2] = 255; break;
    }
}


Simplified CUDA Solution (cont.)
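
The launch on the next slide uses one block per pixel with a single thread per block, which keeps the indexing simple but leaves most of each block idle. A common variant, sketched below as an assumption rather than the lecture's code, uses a 2-D thread block and adjusts the index computation (a simple grayscale stands in for the color switch above):

__global__ void kernel_mt(unsigned char *ptr)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= DIM || y >= DIM) return;   // guard against partial blocks
    int offset = x + y * DIM;

    unsigned char g = (unsigned char)(mandelbrot(x, y) * 32);  // 0..224
    ptr[offset*4 + 0] = g;
    ptr[offset*4 + 1] = g;
    ptr[offset*4 + 2] = g;
    ptr[offset*4 + 3] = 255;
}

// Possible launch:
//   dim3 threads(16, 16);
//   dim3 blocks((DIM + 15) / 16, (DIM + 15) / 16);
//   kernel_mt<<<blocks, threads>>>(dev_bitmap);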

Page 19: CUDA Lecture 6 Embarrassingly Parallel Computations

int main(void)
{
    CPUBitmap bitmap(DIM, DIM);
    unsigned char *dev_bitmap;

    cudaMalloc((void**) &dev_bitmap, bitmap.image_size());

    dim3 grid(DIM, DIM);
    kernel<<<grid,1>>>(dev_bitmap);

    cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(),
               cudaMemcpyDeviceToHost);

    bitmap.display_and_exit();
    cudaFree(dev_bitmap);
}


Simplified CUDA Solution (cont.)

Page 20: CUDA Lecture 6 Embarrassingly Parallel Computations

An embarrassingly parallel computation named for Monaco's gambling resort city; the method's first important use was in the development of the atomic bomb during World War II.

Monte Carlo methods use random selections:
• Given a very large set and a probability distribution over it,
• draw a set of samples that are independently and identically distributed;
• the distribution can then be approximated using these samples.


Topic 3: Monte Carlo methods

Page 21: CUDA Lecture 6 Embarrassingly Parallel Computations

• Evaluating integrals of arbitrary functions of six or more dimensions
• Predicting future values of stocks
• Solving partial differential equations
• Sharpening satellite images
• Modeling cell populations
• Finding approximate solutions to NP-hard problems


Applications of Monte Carlo Method

Page 22: CUDA Lecture 6 Embarrassingly Parallel Computations

Form a circle within a square, with unit radius so that the square has sides 2 × 2. The ratio of the area of the circle to the area of the square is given by

Area of circle / Area of square = π(1)² / (2 × 2) = π/4

• Randomly choose points within the square.
• Keep score of how many points happen to lie within the circle.
• The fraction of points within the circle will be π/4, given a sufficient number of randomly selected samples. (A sequential sketch follows this slide.)


Monte Carlo Example: Calculating π

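A minimal sequential sketch of this estimator (not from the slides): it samples one quadrant of the construction, so the square becomes [0,1) × [0,1) and the circle becomes the quarter disc x² + y² ≤ 1. rand() is used purely for brevity; a better generator is preferable in practice.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1000000;   /* number of random samples */
    int in_circle = 0;

    for (int i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)   /* inside the quarter circle? */
            in_circle++;
    }
    printf("pi is approximately %f\n", 4.0 * in_circle / n);
    return 0;
}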

Page 23: CUDA Lecture 6 Embarrassingly Parallel Computations


Calculating π (cont.)

Page 24: CUDA Lecture 6 Embarrassingly Parallel Computations


Calculating π (cont.)

[Figure: sample points scattered at random over the square; the points falling inside the circle are tallied to estimate π.]

Page 25: CUDA Lecture 6 Embarrassingly Parallel Computations

Relative error is a way to measure the quality of an estimate; the smaller the error, the better the estimate.

a: actual value
e: estimated value
Relative error = |e − a| / a

For example, the estimate e = 3.36 for a = π gives |3.36 − 3.14159| / 3.14159 ≈ 0.0695, the second row of the table on the next slide.


Relative Error

Page 26: CUDA Lecture 6 Embarrassingly Parallel Computations


Increasing Sample Size Reduces Error

n              Estimate   Relative Error   1/(2√n)
10             2.40000    0.23606          0.15811
100            3.36000    0.06952          0.05000
1,000          3.14400    0.00077          0.01581
10,000         3.13920    0.00076          0.00500
100,000        3.14132    0.00009          0.00158
1,000,000      3.14006    0.00049          0.00050
10,000,000     3.14136    0.00007          0.00016
100,000,000    3.14154    0.00002          0.00005
1,000,000,000  3.14155    0.00001          0.00002

Page 27: CUDA Lecture 6 Embarrassingly Parallel Computations

One quadrant of the construction can be described by the integral

∫_{0}^{1} √(1 − x²) dx = π/4

Random numbers (xr, yr) are generated, each between 0 and 1, and a point is counted as inside the circle if xr² + yr² ≤ 1.


Next Example: Computing an Integral


Page 28: CUDA Lecture 6 Embarrassingly Parallel Computations

Computing the integral

I = ∫_{x1}^{x2} (x² − 3x) dx

Sequential code is shown below; the routine rand_v(x1, x2) returns a pseudorandom number between x1 and x2.

The Monte Carlo method is very useful if the function cannot be integrated numerically (perhaps because it has a large number of variables).


Example

sum = 0;
for (i = 0; i < N; i++) {
    xr = rand_v(x1, x2);
    sum += xr * xr - 3 * xr;   /* sample f(x) = x^2 - 3x */
}
area = (sum / N) * (x2 - x1);


Page 29: CUDA Lecture 6 Embarrassingly Parallel Computations

The error in a Monte Carlo estimate decreases by the factor 1/√n.

• The rate of convergence is independent of the integrand's dimension; deterministic numerical integration methods do not share this property.
• Hence Monte Carlo is superior when the integrand has 6 or more dimensions.
• Furthermore, Monte Carlo methods are often amenable to parallelism: with p processors we can find an estimate about p times faster, or reduce the error of the estimate by a factor of √p (see the sketch after this slide).

Why Monte Carlo Methods Are Interesting
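
The parallel piece that needs care is random number generation: each thread must draw from an independent stream. Below is a sketch of how the π estimator might be parallelized with the cuRAND device API; this illustrates the idea and is not code from the lecture, and the kernel name and launch parameters are assumptions.

#include <curand_kernel.h>

// Each thread initializes its own cuRAND subsequence, draws
// samples_per_thread points, and records a private count; the host
// sums the counts and computes 4 * total_in_circle / total_samples.
__global__ void mc_pi(unsigned int seed, int samples_per_thread, int *counts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState state;
    curand_init(seed, tid, 0, &state);   // one independent stream per thread

    int in_circle = 0;
    for (int i = 0; i < samples_per_thread; i++) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f)
            in_circle++;
    }
    counts[tid] = in_circle;
}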

Page 30: CUDA Lecture 6 Embarrassingly Parallel Computations

Embarrassingly parallel computation examples:
• Low-level image processing
• Mandelbrot set
• Monte Carlo calculations

Applications: numerical integration. Related topics: Metropolis algorithm, simulated annealing.

For parallelizing such applications, we need the best way to generate random numbers in parallel.


Summary

Page 31: CUDA Lecture 6 Embarrassingly Parallel Computations

Based on original material from:
• The University of Akron: Tim O’Neil, Saranya Vinjarapu
• The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
• Oregon State University: Michael Quinn

Revision history: last updated 7/28/2011.


End Credits