Starting Parallel Algorithm Design
David Monismith
Based on notes from Introduction to Parallel Computing, 2nd Edition by Grama, Gupta, Karypis, and Kumar
Decomposition
• Decomposition - dividing a computation into parts that may be executed in parallel
• Tasks - programmer defined units of computation into which the main computation is subdivided
• Task-dependency graphs - abstraction used to express dependencies between tasks and their relative order of execution
Granularity
• Granularity - number/size of tasks that a computation can be divided into
• Fine grained - computation divided into many small tasks
• Coarse grained - computation divided into few large tasks
• Degree of concurrency - maximum number of tasks that can be executed in parallel in a program at any time
• Average degree of concurrency can be more useful as it provides a better indication of performance
Example
• Matrix-Vector Multiplication
  – Figure will be drawn upon the board
  – Generally considered fine grained if parallelizing based upon each dot product
  – Could be considered coarse grained if using a dual core processor and each task computes half of the dot products
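As a rough sketch of the fine-grained case, the function below treats each row's dot product as one independent task. The name matvec, the row-major flat layout of A, and the parameter names are assumptions made for this illustration; they are not from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x, where A has rows x cols entries stored row-major in a flat
   array (an assumed layout). Each iteration of the outer loop is one dot
   product, i.e. one fine-grained task that any thread may execute. */
void matvec(const double *A, const double *x, double *y, int rows, int cols)
{
    int i;
    assert(rows >= 0 && cols >= 0);
    #pragma omp parallel for
    for (i = 0; i < rows; i++) {
        double dot = 0.0;   /* declared inside the loop body, so each iteration has its own copy */
        int j;
        for (j = 0; j < cols; j++)
            dot += A[(size_t)i * cols + j] * x[j];
        y[i] = dot;
    }
}
```

Adding num_threads(2) to the pragma would approximate the coarse-grained dual core case, since a static schedule would then typically give each thread about half of the rows. Compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.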
Task Graphs
• Critical path - longest directed path between a pair of start and finish nodes in the task graph
• Critical path length – sum of the weights of the nodes along a critical path
• Weight of a node is the size of the task or amount of work associated with the task
• Aside from these factors, the interaction between tasks running on different processors may cost additional runtime
• An example of a task dependency graph will be drawn in class to aid in the understanding of these concepts
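To make these definitions concrete, here is a minimal sketch that computes the critical path length of a small task graph, along with the average degree of concurrency taken as total work divided by critical path length. The TaskGraph struct, the adjacency-matrix representation, and the requirement that tasks be numbered in topological order are all assumptions chosen for this illustration.

```c
#include <assert.h>

#define MAX_TASKS 16

/* Hypothetical representation: dep[i][j] != 0 means task j depends on task i.
   Tasks are assumed to be numbered in topological order (i < j for every edge). */
typedef struct {
    int n;                          /* number of tasks */
    int weight[MAX_TASKS];          /* amount of work per task */
    int dep[MAX_TASKS][MAX_TASKS];  /* dependency edges */
} TaskGraph;

/* Sum of all node weights. */
int total_work(const TaskGraph *g)
{
    int i, sum = 0;
    for (i = 0; i < g->n; i++)
        sum += g->weight[i];
    return sum;
}

/* Largest sum of node weights along any directed path. */
int critical_path_length(const TaskGraph *g)
{
    int finish[MAX_TASKS];  /* heaviest path ending at each task */
    int i, j, longest = 0;
    for (j = 0; j < g->n; j++) {
        int start = 0;
        for (i = 0; i < j; i++)
            if (g->dep[i][j] && finish[i] > start)
                start = finish[i];
        finish[j] = start + g->weight[j];
        if (finish[j] > longest)
            longest = finish[j];
    }
    return longest;
}

/* Average degree of concurrency = total work / critical path length. */
double average_concurrency(const TaskGraph *g)
{
    assert(g->n > 0);
    return (double)total_work(g) / critical_path_length(g);
}
```

For example, with four independent tasks of weight 10 feeding two tasks of weight 6, which in turn feed one task of weight 8, the total work is 60, the critical path length is 24, and the average degree of concurrency is 2.5.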
Processes and Threads vs. Processors
• mapping - mechanism by which tasks are assigned to processes and/or threads for execution
• Threads and processes are logical units that perform tasks
• Processors physically perform the computations
• Important to realize this because we may have multiple stages of computation
• For example, internode communication vs. shared memory communication
• Drawing a task dependency or task interaction graph may help us to understand how tasks interact with one another and will aid in development of a parallel algorithm
Decomposition Techniques
• Embarrassingly Parallel
• Recursive decomposition
• Data decomposition
• Exploratory decomposition
• Speculative decomposition
Embarrassingly Parallel Tasks
• Some tasks lend themselves to direct parallelization
• Such tasks are said to be embarrassingly parallel and can be directly mapped to processes or threads
• A subset of these types of tasks represents the map pattern
• Note that the map pattern represents a function that can be “replicated and applied to all elements in a collection” – source https://software.intel.com/en-us/blogs/2009/06/10/parallel-patterns-3-map
• Map operations occur in independent loop iterations
Embarrassingly Parallel (Map)
• Performing array (or matrix) addition is a straightforward example that is easily parallelized
• The serial example of this follows:
for(i = 0; i < N; i++) C[i] = A[i] + B[i];
• Three OpenMP parallel versions follow on the next slides
OpenMP First Try
• We could parallelize the loop on the last slide directly as follows:
#pragma omp parallel private(i) shared(A,B,C)
{
    int nthreads = omp_get_num_threads();
    int tid = omp_get_thread_num();
    int start = tid * (N / nthreads);
    /* the last thread also takes any leftover iterations when N is not a multiple of nthreads */
    int end = (tid == nthreads - 1) ? N : start + (N / nthreads);
    for(i = start; i < end; i++)
        C[i] = A[i] + B[i];
}
• Notice that i is declared private because it is not shared between threads – each thread gets its own copy of i
• Arrays A, B, and C are declared shared because they are shared between threads
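When N is not a multiple of the thread count, a hand-partitioned loop has to decide who handles the remainder. One possible approach, sketched below, spreads the leftover iterations over the first few threads so that no thread gets more than one extra. The helper name block_range and its parameters are hypothetical and are not part of OpenMP.

```c
#include <assert.h>

/* Compute the half-open range [*start, *end) of loop iterations owned by
   thread tid out of nthreads, for a loop of n iterations. The first
   (n % nthreads) threads each take one extra iteration. */
void block_range(int tid, int nthreads, int n, int *start, int *end)
{
    int base, extra;
    assert(nthreads > 0 && tid >= 0 && tid < nthreads && n >= 0);
    base = n / nthreads;    /* iterations every thread receives */
    extra = n % nthreads;   /* leftover iterations to hand out */
    *start = tid * base + (tid < extra ? tid : extra);
    *end = *start + base + (tid < extra ? 1 : 0);
}
```

Inside a parallel region this could be called as block_range(omp_get_thread_num(), omp_get_num_threads(), N, &start, &end) before the loop. With N = 10 and 4 threads the ranges are [0,3), [3,6), [6,8), and [8,10).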
OpenMP for directive
• It is preferred to allow OpenMP to directly parallelize loops using the for directive as follows
#pragma omp parallel private(i) shared(A,B,C)
{
    #pragma omp for
    for(i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
• Notice that the loop can be written in a serial fashion and its iterations will be automatically partitioned among the threads
Shortened OpenMP for
• When the parallel region contains only a single for loop, the parallel and for directives may be combined
#pragma omp parallel for private(i) \
    shared(A,B,C)
for(i = 0; i < N; i++)
    C[i] = A[i] + B[i];
Recursive Decomposition
• Used to include concurrency in problems that can be solved with divide-and-conquer
• Such a problem is solved by dividing it into independent sub-problems
• A special type of this decomposition is the Reduction Pattern, wherein elements of a collection are combined with a binary associative operator (e.g. +, *, min, max, etc.) – source https://software.intel.com/en-us/blogs/2009/07/23/parallel-pattern-7-reduce
Example
• To find a minimum serially given an array A of size N use the following algorithm
min = A[0];
for(i = 1; i < N; i++)
    if(A[i] < min)
        min = A[i];
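Example
• One way to decompose this task for parallelism is a recursive solution, in which the two recursive calls are independent of one another
/* Returns the minimum of the n elements A[i] .. A[i+n-1], n >= 1. */
/* min(a,b) is assumed to be a helper that returns the smaller of its two arguments. */
int findMinRec(int A[], int i, int n)
{
    if(n == 1)
        return A[i];
    else {
        int lmin = findMinRec(A, i, n/2);
        int rmin = findMinRec(A, i+n/2, n-n/2);
        return min(lmin,rmin);
    }
}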
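The recursive version can be run in parallel in more than one way. The sketch below uses OpenMP tasks, one task per half, as one possible mapping. This is a different technique from the reduction clause, and the names find_min_tasks, parallel_min, min_int, and the CUTOFF value are assumptions made for this illustration.

```c
#include <assert.h>

#define CUTOFF 1024  /* assumed threshold below which no further tasks are spawned */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Minimum of the n elements A[i] .. A[i+n-1], n >= 1. */
int find_min_tasks(const int A[], int i, int n)
{
    int lmin, rmin;
    assert(n >= 1);
    if (n <= CUTOFF) {
        /* small sub-problem: solve serially to avoid task overhead */
        int k, m = A[i];
        for (k = i + 1; k < i + n; k++)
            m = min_int(m, A[k]);
        return m;
    }
    /* the two halves are independent, so each may run as its own task */
    #pragma omp task shared(lmin)
    lmin = find_min_tasks(A, i, n / 2);
    #pragma omp task shared(rmin)
    rmin = find_min_tasks(A, i + n / 2, n - n / 2);
    #pragma omp taskwait   /* wait for both halves before combining */
    return min_int(lmin, rmin);
}

/* Entry point: one thread starts the recursion, the others execute tasks. */
int parallel_min(const int A[], int n)
{
    int result;
    #pragma omp parallel
    #pragma omp single
    result = find_min_tasks(A, 0, n);
    return result;
}
```

The cutoff keeps tasks reasonably coarse, because creating a task for every single element would likely cost more than the comparison it performs. Without OpenMP support the pragmas are ignored and the function behaves like the serial recursive version.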
OpenMP Implementation
for(i = 0; i < N; i++) A[i] = rand() % 100;
small = A[0];
#pragma omp parallel for reduction(min:small)
for(i = 0; i < N; i++) {
    if(A[i] < small)
        small = A[i];
}
OpenMP Sum Reduction
for(i = 0; i < N; i++) A[i] = i+1;
sum = 0;
#pragma omp parallel for reduction(+:sum)
for(i = 0; i < N; i++)
    sum += A[i];
printf("The sum is %d\n", sum);
Data Decomposition
• Commonly used on algorithms that operate on large data structures
• Involves two steps
  – Data is partitioned
  – Data partitioning is used to cause partitioning of computations into tasks
• Operations on different data partitions are typically similar or are chosen from a small set of operations
Partitioning
• Partitioning output data – outputs computed independently of others as a function of input
  – Example – matrix multiplication can be partitioned into submatrices
• Partitioning input data – task is created for each partition of the input data
  – Example – finding a minimum or maximum
• Partitioning input and output – combination of the two cases above
• Partitioning intermediate data
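As an illustration of partitioning input data, the sketch below splits an array into a fixed number of contiguous partitions, creates one task per partition to find a local minimum, and then combines the partial results serially. The function name partitioned_min and the constant NPARTS are assumptions chosen for this example.

```c
#include <assert.h>

#define NPARTS 4   /* assumed number of input partitions */

/* Input-data partitioning: A[0..n-1] is split into NPARTS contiguous
   partitions, each loop iteration computes the minimum of one partition,
   and a final serial step combines the NPARTS partial results. */
int partitioned_min(const int A[], int n)
{
    int partial[NPARTS];
    int p, result;
    assert(n >= NPARTS);   /* guarantees every partition is non-empty */
    #pragma omp parallel for
    for (p = 0; p < NPARTS; p++) {
        int start = (int)((long long)p * n / NPARTS);
        int end = (int)((long long)(p + 1) * n / NPARTS);
        int k, m = A[start];
        for (k = start + 1; k < end; k++)
            if (A[k] < m)
                m = A[k];
        partial[p] = m;    /* each task writes only its own slot */
    }
    result = partial[0];
    for (p = 1; p < NPARTS; p++)
        if (partial[p] < result)
            result = partial[p];
    return result;
}
```

Because each task writes only its own element of partial, no synchronization is needed until the final combine step. This is essentially what the reduction clause arranges automatically.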