
Page 1: CUDA Lecture 6 Embarrassingly Parallel Computations

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 6: Embarrassingly Parallel Computations

Page 2: CUDA Lecture 6 Embarrassingly Parallel Computations

A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process(or).

No communication or very little communication between processes.

Each process can do its tasks without any interaction with other processes.


Embarrassingly Parallel Computations

Page 3: CUDA Lecture 6 Embarrassingly Parallel Computations

• Low-level image processing
• Mandelbrot set
• Monte Carlo calculations


Embarrassingly Parallel Computation Examples

Page 4: CUDA Lecture 6 Embarrassingly Parallel Computations

Many low-level image processing operations involve only local data, with very limited (if any) communication between areas of interest.


Topic 1: Low-level image processing

Page 5: CUDA Lecture 6 Embarrassingly Parallel Computations

Square region for each process; strips can also be used (a small helper for strip bounds follows this slide).


Partitioning into regions for individual processes
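
As a concrete illustration (not from the slides), a process could compute its strip of rows with a small C helper like the one below, assuming an N-row image divided as evenly as possible among P processes; the names are hypothetical.

/* Rows [*first, *last) handled by process p of P when an N-row
   image is divided into horizontal strips. Integer arithmetic
   spreads any remainder rows across the strips. */
void strip_bounds(int p, int P, int N, int *first, int *last)
{
    *first = (p * N) / P;
    *last  = ((p + 1) * N) / P;
}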

Page 6: CUDA Lecture 6 Embarrassingly Parallel Computations

Shifting: an object is shifted by Δx in the x-dimension and Δy in the y-dimension:

x′ = x + Δx
y′ = y + Δy

where x and y are the original coordinates and x′ and y′ are the new coordinates.

Scaling: an object is scaled by a factor of Sx in the x-direction and Sy in the y-direction:

x′ = x·Sx
y′ = y·Sy


Some geometrical operations

Page 7: CUDA Lecture 6 Embarrassingly Parallel Computations

Rotation: an object is rotated through an angle θ about the origin of the coordinate system:

x′ = x cos θ + y sin θ
y′ = −x sin θ + y cos θ

Clipping: applies defined rectangular boundaries to a figure and deletes points outside the defined area. Display the point (x, y) only if

xlow ≤ x ≤ xhigh and ylow ≤ y ≤ yhigh

otherwise discard it. (A kernel sketch combining shifting and clipping follows this slide.)


Some geometrical operations (cont.)
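
To tie these operations back to CUDA: below is a minimal kernel sketch, not part of the original lecture, that shifts a grayscale image by (dx, dy) and clips anything landing outside the boundaries. One thread handles one pixel; since every pixel is independent, the computation is embarrassingly parallel. The names and buffer layout are assumptions for illustration.

__global__ void shift_kernel(const unsigned char *in, unsigned char *out,
                             int width, int height, int dx, int dy)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int xs = x + dx;   // x' = x + delta-x
    int ys = y + dy;   // y' = y + delta-y

    // clip: only write points that remain inside the defined area
    // (assumes out[] was zeroed beforehand)
    if (xs >= 0 && xs < width && ys >= 0 && ys < height)
        out[ys * width + xs] = in[y * width + x];
}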

Page 8: CUDA Lecture 6 Embarrassingly Parallel Computations

• GPUs are sent starting row numbers. Since these are related to block and thread IDs, each GPU thread should be able to figure its row out for itself (a sketch follows this slide).
• Return results are single values. This is not good programming practice, but it is done to simplify a first exposure to CUDA programs.
• No terminating code is shown.


Notes
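
For instance, a thread might derive its own starting row as in the following minimal sketch, which assumes a 1-D launch in which each thread processes rows_per_thread consecutive rows; the helper name is hypothetical.

__device__ int starting_row(int rows_per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    return tid * rows_per_thread;                     // first row this thread owns
}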

Page 9: CUDA Lecture 6 Embarrassingly Parallel Computations

The set of points in the complex plane that are quasi-stable (they increase and decrease but do not exceed some limit) when computed by iterating the function

zₖ₊₁ = zₖ² + c

where zₖ₊₁ is the (k+1)th iteration of the complex number z = a + bi and c is a complex number giving the position of the point in the complex plane.

The initial value for z is zero.

Topic 2: Mandelbrot Set

Page 10: CUDA Lecture 6 Embarrassingly Parallel Computations

Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit.

The magnitude of z is the length of the vector, given by

length = √(a² + b²)


Mandelbrot Set (cont.)


Page 11: CUDA Lecture 6 Embarrassingly Parallel Computations

typedef struct {
    float real;
    float imag;
} complex;

int cal_pixel(complex c)
{
    int count = 0, max = 256;
    complex z;
    float temp, lengthsq;

    z.real = 0; z.imag = 0;
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;  /* Re(z*z + c) */
        z.imag = 2 * z.real * z.imag + c.imag;              /* Im(z*z + c) */
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;       /* |z|^2 */
        count++;
    } while ((lengthsq < 4.0) && (count < max));
    return count;
}


Sequential routine computing value of one point returning number of iterations
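
A routine like this would typically be driven by a loop that maps each display pixel to a point c in the complex plane. The driver below is a sketch under assumed names (real_min, scale_real, and so on), not code from the slide.

/* Map each display pixel to a point in the complex plane and
   record its iteration count. */
void compute_image(int disp_width, int disp_height,
                   float real_min, float real_max,
                   float imag_min, float imag_max, int *image)
{
    float scale_real = (real_max - real_min) / disp_width;
    float scale_imag = (imag_max - imag_min) / disp_height;

    for (int y = 0; y < disp_height; y++)
        for (int x = 0; x < disp_width; x++) {
            complex c;
            c.real = real_min + x * scale_real;
            c.imag = imag_min + y * scale_imag;
            image[y * disp_width + x] = cal_pixel(c);
        }
}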

Page 12: CUDA Lecture 6 Embarrassingly Parallel Computations

The square of the length is compared to 4 (rather than comparing the length to 2) to avoid a square root computation.

Given the terminating conditions, all of the Mandelbrot points must lie within a circle of radius 2 centered at the origin.

The resolution can be expanded at will to obtain interesting images.


Notes

Page 13: CUDA Lecture 6 Embarrassingly Parallel Computations


Mandelbrot set

Page 14: CUDA Lecture 6 Embarrassingly Parallel Computations

Dynamic task assignment
• Have processors request regions after computing previous regions.
• Doesn't lend itself to a CUDA solution.

Static task assignment
• Simply divide the region into a fixed number of parts, each computed by a separate processor.
• Not very successful, because different regions require different numbers of iterations and times…
• …but it is what we'll do in CUDA.


Parallelizing Mandelbrot Set Computation

Page 15: CUDA Lecture 6 Embarrassingly Parallel Computations


Dynamic Task Assignment: Work Pool/Processor Farms

Page 16: CUDA Lecture 6 Embarrassingly Parallel Computations

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include "../common/cpu_bitmap.h"
#define DIM 1000

struct cuComplex {
    float r, i;
    // __device__ added to the constructor so cuComplex objects
    // can be built inside device code
    __device__ cuComplex(float a, float b) : r(a), i(b) {}
    __device__ float magnitude2(void) { return r*r + i*i; }
    __device__ cuComplex operator*(const cuComplex &a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    __device__ cuComplex operator+(const cuComplex &a) {
        return cuComplex(r + a.r, i + a.i);
    }
};


Simplified CUDA Solution

Page 17: CUDA Lecture 6 Embarrassingly Parallel Computations

__device__ int mandelbrot(int x, int y)
{
    // map the pixel position to the square [-2,2] x [-2,2] in the complex plane
    float jx = 2.0 * (x - DIM/2) / (DIM/2);
    float jy = 2.0 * (y - DIM/2) / (DIM/2);

    cuComplex c(jx, jy);
    cuComplex z(0.0, 0.0);

    int i = 0;
    do {
        z = z * z + c;
        i++;
    } while ((z.magnitude2() < 4.0) && (i < 256));

    return i % 8;   // fold the iteration count into eight color classes
}


Simplified CUDA Solution (cont.)

Page 18: CUDA Lecture 6 Embarrassingly Parallel Computations

__global__ void kernel(unsigned char *ptr)
{
    // map from blockIdx to pixel position
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    // now calculate the value at that position
    int mValue = mandelbrot(x, y);

    ptr[offset*4 + 0] = 0;
    ptr[offset*4 + 1] = 0;
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;

    switch (mValue) {
        case 0:  break;                              // black
        case 1:  ptr[offset*4 + 0] = 255; break;     // red
        case 2:  ptr[offset*4 + 1] = 255; break;     // green
        case 3:  ptr[offset*4 + 2] = 255; break;     // blue
        case 4:  ptr[offset*4 + 0] = 255;            // yellow
                 ptr[offset*4 + 1] = 255; break;
        case 5:  ptr[offset*4 + 1] = 255;            // cyan
                 ptr[offset*4 + 2] = 255; break;
        case 6:  ptr[offset*4 + 0] = 255;            // magenta
                 ptr[offset*4 + 2] = 255; break;
        default: ptr[offset*4 + 0] = 255;            // white
                 ptr[offset*4 + 1] = 255;
                 ptr[offset*4 + 2] = 255; break;
    }
}


Simplified CUDA Solution (cont.)
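
The launch on the next slide uses one block per pixel with a single thread per block, which keeps the indexing simple but leaves most of each block idle. A common variant, sketched below as an assumption rather than the lecture's code, uses a 2-D thread block and adjusts the index computation (a simple grayscale stands in for the color switch above):

__global__ void kernel_mt(unsigned char *ptr)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= DIM || y >= DIM) return;   // guard against partial blocks
    int offset = x + y * DIM;

    unsigned char g = (unsigned char)(mandelbrot(x, y) * 32);  // 0..224
    ptr[offset*4 + 0] = g;
    ptr[offset*4 + 1] = g;
    ptr[offset*4 + 2] = g;
    ptr[offset*4 + 3] = 255;
}

// Possible launch:
//   dim3 threads(16, 16);
//   dim3 blocks((DIM + 15) / 16, (DIM + 15) / 16);
//   kernel_mt<<<blocks, threads>>>(dev_bitmap);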

Page 19: CUDA Lecture 6 Embarrassingly Parallel Computations

int main(void)
{
    CPUBitmap bitmap(DIM, DIM);
    unsigned char *dev_bitmap;

    cudaMalloc((void**) &dev_bitmap, bitmap.image_size());

    dim3 grid(DIM, DIM);
    kernel<<<grid,1>>>(dev_bitmap);

    cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(),
               cudaMemcpyDeviceToHost);

    bitmap.display_and_exit();
    cudaFree(dev_bitmap);
}


Simplified CUDA Solution (cont.)

Page 20: CUDA Lecture 6 Embarrassingly Parallel Computations

An embarrassingly parallel computation named for Monaco's gambling resort city; the method's first important use was in the development of the atomic bomb during World War II.

Monte Carlo methods use random selections:
• Given a very large set and a probability distribution over it,
• draw a set of samples that are independently and identically distributed;
• the distribution can then be approximated using these samples.


Topic 3: Monte Carlo methods

Page 21: CUDA Lecture 6 Embarrassingly Parallel Computations

• Evaluating integrals of arbitrary functions of six or more dimensions
• Predicting future values of stocks
• Solving partial differential equations
• Sharpening satellite images
• Modeling cell populations
• Finding approximate solutions to NP-hard problems


Applications of Monte Carlo Method

Page 22: CUDA Lecture 6 Embarrassingly Parallel Computations

Form a circle within a square, with unit radius so that the square has sides 2 × 2. The ratio of the area of the circle to the area of the square is given by

Area of circle / Area of square = π(1)² / (2 × 2) = π/4

• Randomly choose points within the square.
• Keep score of how many points happen to lie within the circle.
• The fraction of points within the circle will be π/4, given a sufficient number of randomly selected samples. (A sequential sketch follows this slide.)


Monte Carlo Example: Calculating π

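A minimal sequential sketch of this estimator (not from the slides): it samples one quadrant of the construction, so the square becomes [0,1) × [0,1) and the circle becomes the quarter disc x² + y² ≤ 1. rand() is used purely for brevity; a better generator is preferable in practice.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1000000;   /* number of random samples */
    int in_circle = 0;

    for (int i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)   /* inside the quarter circle? */
            in_circle++;
    }
    printf("pi is approximately %f\n", 4.0 * in_circle / n);
    return 0;
}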

Page 23: CUDA Lecture 6 Embarrassingly Parallel Computations


Calculating π (cont.)

Page 24: CUDA Lecture 6 Embarrassingly Parallel Computations


Calculating π (cont.)

[Figure: sample points scattered at random over the square; the points falling inside the circle are tallied to estimate π.]

Page 25: CUDA Lecture 6 Embarrassingly Parallel Computations

Relative error is a way to measure the quality of an estimate; the smaller the error, the better the estimate.

a: actual value
e: estimated value
Relative error = |e − a| / a

For example, the estimate e = 3.36 for a = π gives |3.36 − 3.14159| / 3.14159 ≈ 0.0695, the second row of the table on the next slide.


Relative Error

Page 26: CUDA Lecture 6 Embarrassingly Parallel Computations


Increasing Sample Size Reduces Error

n              Estimate   Relative Error   1/(2√n)
10             2.40000    0.23606          0.15811
100            3.36000    0.06952          0.05000
1,000          3.14400    0.00077          0.01581
10,000         3.13920    0.00076          0.00500
100,000        3.14132    0.00009          0.00158
1,000,000      3.14006    0.00049          0.00050
10,000,000     3.14136    0.00007          0.00016
100,000,000    3.14154    0.00002          0.00005
1,000,000,000  3.14155    0.00001          0.00002

Page 27: CUDA Lecture 6 Embarrassingly Parallel Computations

One quadrant of the construction can be described by the integral

∫_{0}^{1} √(1 − x²) dx = π/4

Random numbers (xr, yr) are generated, each between 0 and 1, and a point is counted as inside the circle if xr² + yr² ≤ 1.


Next Example: Computing an Integral


Page 28: CUDA Lecture 6 Embarrassingly Parallel Computations

Computing the integral

I = ∫_{x1}^{x2} (x² − 3x) dx

Sequential code is shown below; the routine rand_v(x1, x2) returns a pseudorandom number between x1 and x2.

The Monte Carlo method is very useful if the function cannot be integrated numerically (perhaps because it has a large number of variables).


Example

sum = 0;
for (i = 0; i < N; i++) {
    xr = rand_v(x1, x2);
    sum += xr * xr - 3 * xr;   /* sample f(x) = x^2 - 3x */
}
area = (sum / N) * (x2 - x1);


Page 29: CUDA Lecture 6 Embarrassingly Parallel Computations

The error in a Monte Carlo estimate decreases by the factor 1/√n.

• The rate of convergence is independent of the integrand's dimension; deterministic numerical integration methods do not share this property.
• Hence Monte Carlo is superior when the integrand has 6 or more dimensions.
• Furthermore, Monte Carlo methods are often amenable to parallelism: with p processors we can find an estimate about p times faster, or reduce the error of the estimate by a factor of √p (see the sketch after this slide).

Why Monte Carlo Methods Are Interesting
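
The parallel piece that needs care is random number generation: each thread must draw from an independent stream. Below is a sketch of how the π estimator might be parallelized with the cuRAND device API; this illustrates the idea and is not code from the lecture, and the kernel name and launch parameters are assumptions.

#include <curand_kernel.h>

// Each thread initializes its own cuRAND subsequence, draws
// samples_per_thread points, and records a private count; the host
// sums the counts and computes 4 * total_in_circle / total_samples.
__global__ void mc_pi(unsigned int seed, int samples_per_thread, int *counts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState state;
    curand_init(seed, tid, 0, &state);   // one independent stream per thread

    int in_circle = 0;
    for (int i = 0; i < samples_per_thread; i++) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f)
            in_circle++;
    }
    counts[tid] = in_circle;
}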

Page 30: CUDA Lecture 6 Embarrassingly Parallel Computations

Embarrassingly parallel computation examples:
• Low-level image processing
• Mandelbrot set
• Monte Carlo calculations

Applications: numerical integration. Related topics: Metropolis algorithm, simulated annealing.

For parallelizing such applications, we need the best way to generate random numbers in parallel.


Summary

Page 31: CUDA Lecture 6 Embarrassingly Parallel Computations

Based on original material from:
• The University of Akron: Tim O’Neil, Saranya Vinjarapu
• The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
• Oregon State University: Michael Quinn

Revision history: last updated 7/28/2011.


End Credits