Parallel and Distributed Processing CSE 8380

Computer Science and Engineering


February 8, 2005

Session 8


Contents

Computing sum on EREW PRAM

Computing all partial sums on EREW PRAM

Matrix Multiplication on CREW

Other Algorithms


Recall (PRAM Model)

Synchronized Read Compute Write Cycle

EREW, ERCW, CREW, CRCW

Complexity: T(n), P(n), C(n)

[Figure: PRAM model. A control unit drives processors P1, P2, ..., Pp, each with its own private memory; all processors access a shared global memory.]


Sum on EREW PRAM

Compute the sum of an array A[1..n]

We use n/2 processors

Summation will end up in location A[n]

For simplicity, we assume n is an integral power of 2

Work is done in log n iterations. In the first iteration, all processors are active. In the second iteration, only half the processors will be active, and so on.


Example: Sum of an array of numbers on the EREW model

Example of algorithm Sum_EREW when n=8

               A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]   Active processors
Initial           5     2    10     1     8    12     7     3
Iteration 1       5     7    10    11     8    20     7    10   P1, P2, P3, P4
Iteration 2       5     7    10    18     8    20     7    30   P2, P4
Iteration 3       5     7    10    18     8    20     7    48   P4


Group Work

1- Discuss the algorithm with your neighbor

2- Design the main loops

3- Discuss the Complexity


Algorithm sum_EREW

for i = 1 to log n do
    forall Pj, where 1 <= j <= n/2 do in parallel
        if (2j mod 2^i) = 0 then
            A[2j] <- A[2j] + A[2j - 2^(i-1)]
        endif
    endfor
endfor
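The loop structure above can be checked with a small sequential simulation. This is only a sketch: the function name `sum_erew` is hypothetical, the "parallel" step is imitated by doing all reads before any write, and the input length is assumed to be a power of 2, as on the slides.

```python
def sum_erew(values):
    """Sequentially simulate Sum_EREW; returns the final contents of A[1..n]."""
    n = len(values)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    a = [None] + list(values)              # 1-based indexing; a[0] is unused
    for i in range(1, n.bit_length()):     # i = 1 .. log2(n)
        # Processor Pj is active when 2j is a multiple of 2^i.
        active = [j for j in range(1, n // 2 + 1) if (2 * j) % (2 ** i) == 0]
        # Read phase: every active processor fetches its operand first ...
        operands = {j: a[2 * j - 2 ** (i - 1)] for j in active}
        # ... write phase: then each one updates its own cell A[2j].
        for j in active:
            a[2 * j] += operands[j]
    return a[1:]                           # the total sits in the last cell, A[n]


final = sum_erew([5, 2, 10, 1, 8, 12, 7, 3])
print(final[-1])
```

Run on the n=8 example, the final array matches the last row of the example slide, with the sum in A[8].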


Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n/2

Cost: C(n) = O(n log n)

Is it cost optimal?
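One way to approach the question is to write the cost as the product of processor count and run time and compare it with the best sequential time. The following is a sketch of that argument, assuming the usual definition of cost optimality (parallel cost matching the sequential running time up to a constant factor):

```latex
C(n) = P(n) \cdot T(n) = \frac{n}{2} \cdot O(\log n) = O(n \log n),
\qquad
T_{\text{seq}}(n) = \Theta(n)
```

Since n log n grows faster than n, the cost exceeds the sequential time by a factor of log n, so under this definition the algorithm is not cost optimal.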


All partial sums - EREW PRAM

Compute all partial sums of an array A[1..n]

These are A[1], A[1]+A[2], A[1]+A[2]+A[3], …, A[1]+A[2]+… + A[n].

At first glance you might think it is inherently sequential because one must add up the first k elements before adding in element k+1

We’ll see that it can be parallelized

Let’s extend sum_EREW to do that


All partial sums (cont.)

We noticed that in sum_EREW most processors are idle most of the time

By exploiting these idle processors, we should be able to compute all partial sums in the same amount of time it takes to compute the single sum


All partial sums (cont.)

Compute all partial sums of A[1..n]

We use n-1 processors (P2, P3, …, Pn)

A[k] will be replaced by the sum of all elements preceding and including A[k]

In algorithm sum_EREW, at iteration i, only n/2^i processors were active, while in allsums_EREW, nearly all processors will be in use.


Example: All partial sums on EREW PRAM

Example of algorithm allsums_EREW when n=8

               A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]   Active processors
Initial           5     2    10     1     8    12     7     3
Iteration 1       5     7    12    11     9    20    19    10   P2, P3, …, P8
Iteration 2       5     7    17    18    21    31    28    30   P3, P4, …, P8
Iteration 3       5     7    17    18    26    38    45    48   P5, P6, P7, P8


Group Work

1- Discuss the algorithm with your neighbor

2- Design the main loops

3- Discuss the Complexity


Algorithm allsums_EREW

for i = 1 to log n do
    forall Pj, where 2^(i-1) + 1 <= j <= n do in parallel
        A[j] <- A[j] + A[j - 2^(i-1)]
    endfor
endfor
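A sequential simulation makes it easier to see why the synchronized read-then-write cycle matters here: each cell is both updated and used as an operand in the same iteration. The sketch below uses a hypothetical function name, `allsums_erew`, and assumes n is a power of 2; it buffers all reads before performing any write.

```python
from itertools import accumulate


def allsums_erew(values):
    """Sequentially simulate allsums_EREW; returns all prefix sums of values."""
    n = len(values)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    a = [None] + list(values)              # 1-based indexing; a[0] is unused
    for i in range(1, n.bit_length()):     # i = 1 .. log2(n)
        step = 2 ** (i - 1)
        active = range(step + 1, n + 1)    # processors P(step+1) .. Pn
        # Read phase: operands are the values from BEFORE this iteration.
        operands = {j: a[j - step] for j in active}
        # Write phase: each processor adds its operand into its own cell.
        for j in active:
            a[j] += operands[j]
    return a[1:]


data = [5, 2, 10, 1, 8, 12, 7, 3]
print(allsums_erew(data) == list(accumulate(data)))
```

Updating the cells in place from left to right without the read buffer would add already-updated values and give wrong results, which is the sequential analogue of breaking the synchronized cycle.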


Complexity


Run time: T(n) = O(log n)

Number of processors: P(n) = n-1

Cost: C(n) = O(n log n)


Matrix Multiplication

Two n x n matrices

For clarity, we assume n is a power of 2

We use CREW to allow concurrent reads

Two matrices in the shared memory: A[1..n,1..n], B[1..n,1..n]

We will use n^3 processors

We will also show how to reduce the number of processors


Matrix Multiplication (cont)

The n^3 processors are arranged in a three-dimensional array. Processor Pi,j,k is the one with index (i,j,k)

We will use the three-dimensional array C[1..n,1..n,1..n] in the shared memory as working space.

The resulting matrix will be stored in locations C[i,j,n], where 1 <= i,j <= n


Two steps

1. All n^3 processors operate in parallel to compute n^3 multiplications. (For each of the n^2 cells in the output matrix, n products are computed.)

2. The n products are summed to produce the final value of each cell


Matrix multiplication using n^3 processors

Two steps of the algorithm

1. Each processor Pi,j,k computes the product A[i,k] * B[k,j] and stores it in C[i,j,k].

2. The idea of Algorithm Sum_EREW is applied along the k dimension n^2 times in parallel to compute C[i,j,n], where 1 <= i,j <= n.


Algorithm MatMult_CREW

/* step 1 */
forall Pi,j,k, where 1 <= i, j, k <= n do in parallel
    C[i,j,k] <- A[i,k] * B[k,j]
endfor

/* step 2 */
for l = 1 to log n do
    forall Pi,j,k, where 1 <= i, j <= n & 1 <= k <= n/2 do in parallel
        if (2k mod 2^l) = 0 then
            C[i,j,2k] <- C[i,j,2k] + C[i,j,2k - 2^(l-1)]
        endif
    endfor
endfor

/* the output matrix is stored in locations C[i,j,n], where 1 <= i, j <= n */
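The two steps can be traced with a sequential simulation. This is a sketch under stated assumptions: the name `matmult_crew` is hypothetical, matrices are plain Python lists of lists, indices are 0-based (so cell C[i,j,2k] on the slide is `C[i][j][2*k - 1]` here), and n is a power of 2.

```python
def matmult_crew(A, B):
    """Sequentially simulate MatMult_CREW for two n x n matrices."""
    n = len(A)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    # Step 1: processor P(i,j,k) stores A[i,k] * B[k,j] in C[i,j,k].
    C = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]
    # Step 2: Sum_EREW along the k dimension for every (i, j) pair.
    for l in range(1, n.bit_length()):          # l = 1 .. log2(n)
        half = 2 ** (l - 1)
        for i in range(n):
            for j in range(n):
                for k in range(1, n // 2 + 1):  # 1-based k, as on the slide
                    if (2 * k) % (2 ** l) == 0:
                        # The cell read (2k - half) is never written in this
                        # iteration, so in-place sequential updates are safe.
                        C[i][j][2 * k - 1] += C[i][j][2 * k - half - 1]
    # The result sits in the last k-slot, C[i,j,n] on the slide.
    return [[C[i][j][n - 1] for j in range(n)] for i in range(n)]


print(matmult_crew([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

For the 2 x 2 input shown, the ordinary matrix product is [[19, 22], [43, 50]], which is what the simulation should reproduce.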


Complexity


Run time: T(n) = O(log n)

Number of processors: P(n) = n^3

Cost: C(n) = O(n^3 log n)

Is it cost optimal?


Example

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW

After step 1

k = 1:
P1,1,1: C[1,1,1] <- A[1,1] B[1,1]     P1,2,1: C[1,2,1] <- A[1,1] B[1,2]
P2,1,1: C[2,1,1] <- A[2,1] B[1,1]     P2,2,1: C[2,2,1] <- A[2,1] B[1,2]

k = 2:
P1,1,2: C[1,1,2] <- A[1,2] B[2,1]     P1,2,2: C[1,2,2] <- A[1,2] B[2,2]
P2,1,2: C[2,1,2] <- A[2,2] B[2,1]     P2,2,2: C[2,2,2] <- A[2,2] B[2,2]


Example (cont.)

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW

After step 2

k = 2:
P1,1,2: C[1,1,2] <- C[1,1,2] + C[1,1,1]     P1,2,2: C[1,2,2] <- C[1,2,2] + C[1,2,1]
P2,1,2: C[2,1,2] <- C[2,1,2] + C[2,1,1]     P2,2,2: C[2,2,2] <- C[2,2,2] + C[2,2,1]


Matrix multiplication: reducing the number of processors to n^3/log n

Processors are arranged in an n x n x (n/log n) three-dimensional array

1. Each processor Pi,j,k, where 1 <= k <= n/log n, computes the sum of log n products. This step produces n^3/log n partial sums.

2. The partial sums produced in step 1 are added to produce the resulting matrix, as discussed previously.

Complexity analysis

Run time: T(n) = O(log n)

Number of processors: P(n) = n^3/log n

Cost: C(n) = O(n^3)
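The two-phase scheme can be sketched as a sequential simulation. Several details here are assumptions rather than anything stated on the slides: the name `matmult_fewer_processors` is hypothetical, the group count is rounded up when log n does not divide n, the tree sum accumulates into the first slot instead of the last, and indices are 0-based.

```python
def matmult_fewer_processors(A, B):
    """Simulate the n^3/log n processor variant for two n x n matrices."""
    n = len(A)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    g = max(1, n.bit_length() - 1)     # products per processor: log2(n), at least 1
    m = -(-n // g)                     # processors along k: ceil(n / log n)
    result = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Phase 1: each P(i,j,k) sequentially sums its own g products.
            partial = [
                sum(A[i][t] * B[t][j] for t in range(k * g, min((k + 1) * g, n)))
                for k in range(m)
            ]
            # Phase 2: pairwise tree sum over the m partial sums, O(log m) rounds.
            stride = 1
            while stride < m:
                for k in range(0, m - stride, 2 * stride):
                    partial[k] += partial[k + stride]
                stride *= 2
            result[i][j] = partial[0]
    return result


print(matmult_fewer_processors([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Phase 1 takes O(log n) sequential additions per processor and phase 2 takes O(log(n/log n)) rounds, which is consistent with the O(log n) run time quoted above.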


Searching

Given a list A = a1, a2, …, ai, …, an and an element x

Determine whether x = ai for some i

Sequential binary search: O(log n)

Simple idea

Divide the list among the processors and let each processor conduct its own binary search

EREW PRAM: O(log(n/p)) + O(log p) = O(log n), where the O(log p) term covers broadcasting x to all p processors

CREW PRAM: O(log(n/p))


Parallel Binary Search

Split A into p+1 segments of almost equal length

Compare x with p elements at the boundary between successive segments

Either x = ai or search is restricted to only one of the p+1 segments

Repeat until x is found or length of the list is <= p
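The steps above can be sketched as a sequential simulation of the (p+1)-way search. The function name `parallel_search` and the exact choice of boundary positions are illustrative assumptions; the list is assumed to be sorted in nondecreasing order, and the p comparisons of one round are imitated by a loop.

```python
def parallel_search(a, x, p):
    """Return an index of x in sorted list a, or -1, simulating p processors."""
    assert p >= 1, "need at least one processor"
    lo, hi = 0, len(a)                     # current search range is a[lo:hi]
    while hi - lo > p:
        size = hi - lo
        # p boundary positions splitting a[lo:hi] into p+1 near-equal segments.
        bounds = [lo + (t * size) // (p + 1) for t in range(1, p + 1)]
        new_lo, new_hi = lo, hi
        # One parallel round: processor t compares x with its boundary element.
        for b in bounds:
            if a[b] == x:
                return b
            if a[b] < x:
                new_lo = max(new_lo, b + 1)   # x can only lie to the right of b
            else:
                new_hi = min(new_hi, b)       # x can only lie to the left of b
        lo, hi = new_lo, new_hi
    # At most p elements remain: one comparison per processor finishes the search.
    for idx in range(lo, hi):
        if a[idx] == x:
            return idx
    return -1


evens = list(range(0, 200, 2))
print(parallel_search(evens, 84, 4), parallel_search(evens, 85, 4))
```

Each round shrinks the range by roughly a factor of p+1, so about log n / log(p+1) rounds are needed instead of the log n rounds of ordinary binary search.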
