Parallel and Distributed Processing (CSE 8380)

Computer Science and Engineering

February 8, 2005, Session 8

Contents

Computing sum on EREW PRAM

Computing all partial sums on EREW PRAM

Matrix Multiplication on CREW

Other Algorithms

Recall (PRAM Model)

Synchronized read-compute-write cycle

Memory access variants: EREW, ERCW, CREW, CRCW

Complexity measures: T(n), P(n), C(n)

[Figure: PRAM model. Labels: Control; processors P1, P2, …, Pp, each with a Private Memory; Global Memory.]

Sum on EREW PRAM

Compute the sum of an array A[1..n]

We use n/2 processors

Summation will end up in location A[n]

For simplicity, we assume n is an integral power of 2

Work is done in log n iterations. In the first iteration, all processors are active. In the second iteration, only half the processors will be active, and so on.

Example: Sum of an array of numbers on the EREW model

Example of algorithm Sum_EREW when n = 8

Step          Active processors   A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]
Initial       -                      5     2    10     1     8    12     7     3
Iteration 1   P1, P2, P3, P4         5     7    10    11     8    20     7    10
Iteration 2   P2, P4                 5     7    10    18     8    20     7    30
Iteration 3   P4                     5     7    10    18     8    20     7    48

Group Work

1- Discuss the algorithm with your neighbor

2- Design the main loops

3- Discuss the Complexity

Algorithm sum_EREW

for i = 1 to log n do
    forall Pj, where 1 ≤ j ≤ n/2 do in parallel
        if (2j mod 2^i) = 0 then
            A[2j] ← A[2j] + A[2j − 2^(i−1)]
        endif
    endfor
endfor
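
The loop of sum_EREW can be simulated sequentially to check it against the n = 8 example. The sketch below is only an illustration, not part of the original slides: the function name `sum_erew` is hypothetical, the inner `for j` loop stands in for the processors Pj, and a copy of the array (`snapshot`) models the synchronized read phase. It uses the update A[2j] ← A[2j] + A[2j − 2^(i−1)] and assumes n is a power of 2.

```python
import math

def sum_erew(values):
    """Sequentially simulate sum_EREW; len(values) must be a power of 2."""
    n = len(values)
    a = [None] + list(values)              # 1-based indexing, as on the slides
    for i in range(1, int(math.log2(n)) + 1):
        snapshot = a[:]                    # read phase: every processor sees old values
        for j in range(1, n // 2 + 1):     # j plays the role of processor Pj
            if (2 * j) % (2 ** i) == 0:    # only these processors are active
                a[2 * j] = snapshot[2 * j] + snapshot[2 * j - 2 ** (i - 1)]
    return a[n], a[1:]                     # the sum ends up in A[n]

total, final = sum_erew([5, 2, 10, 1, 8, 12, 7, 3])
print(total)   # 48
print(final)   # [5, 7, 10, 18, 8, 20, 7, 48]
```

The final array matches the last row of the n = 8 example, with the total in A[8].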

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n/2

Cost: c(n) = O(n log n)

Is it cost optimal?
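
One way to approach this question is to compare the parallel cost P(n) × T(n) with the n − 1 additions that a plain sequential loop needs. The sketch below is an illustration only; `sum_erew_cost` is a hypothetical helper name, and it assumes n is a power of 2.

```python
import math

def sum_erew_cost(n):
    """Cost of sum_EREW as processors * steps, for n a power of 2."""
    processors = n // 2                # P(n) = n/2
    steps = int(math.log2(n))          # T(n) = log n iterations
    return processors * steps

for n in (8, 1024):
    # columns: n, parallel cost, sequential additions
    print(n, sum_erew_cost(n), n - 1)
# 8 12 7
# 1024 5120 1023
```

The parallel cost grows as n log n while the sequential work grows as n, so under the usual definition (cost matching the best sequential running time) the algorithm as stated is not cost optimal.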

All partial sums - EREW PRAM

Compute all partial sums of an array A[1..n]

These are A[1], A[1]+A[2], A[1]+A[2]+A[3], …, A[1]+A[2]+… + A[n].

At first glance you might think it is inherently sequential because one must add up the first k elements before adding in element k+1

We’ll see that it can be parallelized

Let’s extend sum_EREW to do that

All partial sums (cont.)

We noticed that in sum_EREW most processors are idle most of the time

By exploiting these idle processors, we should be able to compute all partial sums in the same amount of time it takes to compute the single sum

All partial sums (cont.)

Compute all partial sums of A[1..n]

We use n-1 processors (P2, P3, …, Pn)

A[k] will be replaced by the sum of all elements preceding and including A[k]

In algorithm sum_EREW, at iteration i, only n/2^i processors were active, while in allsums_EREW nearly all processors will be in use.

Example: All partial sums on EREW PRAM

Example of algorithm allsums_EREW when n = 8

Step          Active processors   A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]
Initial       -                      5     2    10     1     8    12     7     3
Iteration 1   P2, P3, …, P8          5     7    12    11     9    20    19    10
Iteration 2   P3, P4, …, P8          5     7    17    18    21    31    28    30
Iteration 3   P5, P6, P7, P8         5     7    17    18    26    38    45    48

Group Work

1- Discuss the algorithm with your neighbor

2- Design the main loops

3- Discuss the Complexity

Algorithm allsums_EREW

for i = 1 to log n do
    forall Pj, where 2^(i−1) + 1 ≤ j ≤ n do in parallel
        A[j] ← A[j] + A[j − 2^(i−1)]
    endfor
endfor
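
A sequential simulation of allsums_EREW makes it easy to check the n = 8 example. The sketch below is illustrative only: the name `allsums_erew` is hypothetical, the inner loop stands in for processors Pj, and the `snapshot` copy models the fact that all reads in an iteration happen before any write. It assumes n is a power of 2.

```python
import math

def allsums_erew(values):
    """Sequentially simulate allsums_EREW; len(values) must be a power of 2."""
    n = len(values)
    a = [None] + list(values)                  # 1-based indexing, as on the slides
    for i in range(1, int(math.log2(n)) + 1):
        offset = 2 ** (i - 1)
        snapshot = a[:]                        # all reads happen before any write
        for j in range(offset + 1, n + 1):     # processors P(offset+1) .. Pn are active
            a[j] = snapshot[j] + snapshot[j - offset]
    return a[1:]

print(allsums_erew([5, 2, 10, 1, 8, 12, 7, 3]))
# [5, 7, 17, 18, 26, 38, 45, 48]
```

Without the snapshot, a sequential loop over increasing j would read values already updated in the same iteration and produce wrong results, which is why the synchronized read-compute-write cycle matters here.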

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n-1

Cost: c(n) = O(n log n)

Matrix Multiplication

Two n × n matrices

For clarity, we assume n is a power of 2

We use CREW to allow concurrent reads

The two matrices are in the shared memory: A[1..n,1..n], B[1..n,1..n]

We will use n^3 processors

We will also show how to reduce the number of processors

Matrix Multiplication (cont)

The n^3 processors are arranged in a three-dimensional array. Processor P(i,j,k) is the one with index (i,j,k).

We will use the three-dimensional array C[1..n,1..n,1..n] in the shared memory as working space.

The resulting matrix will be stored in locations C[i,j,n], where 1 ≤ i, j ≤ n.

Two steps

1. All n^3 processors operate in parallel to compute n^3 multiplications. (For each of the n^2 cells in the output matrix, n products are computed.)

2. The n products are summed to produce the final value of each cell.

Matrix multiplication using n^3 processors

Two steps of the algorithm

1. Each processor P(i,j,k) computes the product A[i,k] * B[k,j] and stores it in C[i,j,k].

2. The idea of Algorithm Sum_EREW is applied along the k dimension n^2 times in parallel to compute C[i,j,n], where 1 ≤ i, j ≤ n.

Algorithm MatMult_CREW

/* step 1 */
forall P(i,j,k), where 1 ≤ i, j, k ≤ n do in parallel
    C[i,j,k] ← A[i,k] * B[k,j]
endfor

/* step 2 */
for l = 1 to log n do
    forall P(i,j,k), where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/2 do in parallel
        if (2k mod 2^l) = 0 then
            C[i,j,2k] ← C[i,j,2k] + C[i,j,2k − 2^(l−1)]
        endif
    endfor
endfor

/* the output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */
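
The two steps of MatMult_CREW can be simulated sequentially as a sanity check. This is a sketch under assumptions, not the original course code: `matmult_crew` is a hypothetical name, matrices are plain nested lists with 0-based indices, the nested loops stand in for the n^3 processors, and n is assumed to be a power of 2.

```python
import math

def matmult_crew(A, B):
    """Sequentially simulate MatMult_CREW for n x n lists, n a power of 2."""
    n = len(A)
    # Step 1: "processor" (i, j, k) stores A[i][k] * B[k][j] in C[i][j][k].
    C = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]
    # Step 2: sum_EREW along the k dimension for every (i, j) cell.
    for l in range(1, int(math.log2(n)) + 1):
        half = 2 ** (l - 1)
        for i in range(n):
            for j in range(n):
                for k in range(1, n // 2 + 1):       # k is 1-based, as on the slide
                    if (2 * k) % (2 ** l) == 0:
                        # slide positions 2k and 2k - half, shifted to 0-based;
                        # cells read in a round are never written in that round
                        C[i][j][2 * k - 1] += C[i][j][2 * k - half - 1]
    # The result sits in the last layer, C[i][j][n] in the slide's numbering.
    return [[C[i][j][n - 1] for j in range(n)] for i in range(n)]

print(matmult_crew([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

No snapshot copy is needed in step 2 because, in each round, the cells being read (offset 2^(l−1) below a written cell) are different from the cells being written.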

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n^3

Cost: c(n) = O(n^3 log n)

Is it cost optimal?

Example

Multiplying two 2 × 2 matrices using Algorithm MatMult_CREW

After step 1 (processor P(i,j,k) writes C[i,j,k]):

k = 1
P(1,1,1): C[1,1,1] ← A[1,1] * B[1,1]    P(1,2,1): C[1,2,1] ← A[1,1] * B[1,2]
P(2,1,1): C[2,1,1] ← A[2,1] * B[1,1]    P(2,2,1): C[2,2,1] ← A[2,1] * B[1,2]

k = 2
P(1,1,2): C[1,1,2] ← A[1,2] * B[2,1]    P(1,2,2): C[1,2,2] ← A[1,2] * B[2,2]
P(2,1,2): C[2,1,2] ← A[2,2] * B[2,1]    P(2,2,2): C[2,2,2] ← A[2,2] * B[2,2]

Example (cont.)

Multiplying two 2 × 2 matrices using Algorithm MatMult_CREW

After step 2 (the k = 2 layer, shown in the figure at positions P(1,1,2), P(1,2,2), P(2,1,2), P(2,2,2)):

C[1,1,2] ← C[1,1,2] + C[1,1,1]    C[1,2,2] ← C[1,2,2] + C[1,2,1]
C[2,1,2] ← C[2,1,2] + C[2,1,1]    C[2,2,2] ← C[2,2,2] + C[2,2,1]

Matrix multiplication: reducing the number of processors to n^3/log n

Processors are arranged in an n × n × (n/log n) three-dimensional array

1. Each processor P(i,j,k), where 1 ≤ k ≤ n/log n, computes the sum of log n products. This step produces n^3/log n partial sums.

2. The partial sums produced in step 1 are added to produce the resulting matrix, as discussed previously.

Complexity analysis

Run time: T(n) = O(log n)

Number of processors: P(n) = n^3/log n

Cost: c(n) = O(n^3)
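
The two-phase scheme above can be sketched for a single output cell. The code below is a sequential illustration under assumptions, not the original course code: the names `cell_product_reduced` and `matmult_reduced` are hypothetical, each list element of `partial` stands in for one processor along the k dimension, and the last block is simply allowed to be shorter when log n does not divide n.

```python
import math

def cell_product_reduced(row, col):
    """One output cell in two phases; returns (value, number of tree rounds)."""
    n = len(row)
    block = max(1, int(math.log2(n)))      # each processor handles about log n products
    # Phase 1: every (simulated) processor sums its own block sequentially.
    partial = [sum(row[k] * col[k] for k in range(s, min(s + block, n)))
               for s in range(0, n, block)]
    # Phase 2: pairwise tree reduction over the partial sums.
    rounds = 0
    while len(partial) > 1:
        partial = [sum(partial[t:t + 2]) for t in range(0, len(partial), 2)]
        rounds += 1
    return partial[0], rounds

def matmult_reduced(A, B):
    """Apply the two-phase cell computation to every (i, j) pair."""
    n = len(A)
    cols = [[B[k][j] for k in range(n)] for j in range(n)]
    return [[cell_product_reduced(A[i], cols[j])[0] for j in range(n)]
            for i in range(n)]

print(matmult_reduced([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

Phase 1 takes about log n sequential additions per processor and phase 2 takes about log(n/log n) rounds, so the total time stays O(log n) while the processor count drops by a factor of log n.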

Searching

Given A = a1, a2, …, ai, …, an and x

Determine whether x = ai for some i

Sequential binary search: O(log n)

Simple idea: divide the list among the p processors and let each processor conduct its own binary search

EREW PRAM: O(log(n/p)) + O(log p) = O(log n), where the O(log p) term covers broadcasting x to the p processors without concurrent reads

CREW: O(log(n/p))

Parallel Binary Search

Split A into p+1 segments of almost equal length

Compare x with p elements at the boundary between successive segments

Either x = ai or search is restricted to only one of the p+1 segments

Repeat until x is found or length of the list is <= p
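
The steps above can be sketched as a sequential simulation in which each loop over the boundary positions stands for one round of p simultaneous comparisons. This is an illustration under assumptions, not the original course code: the name `parallel_binary_search` is hypothetical, the list is assumed to be sorted in ascending order, and the final comparison of the remaining (at most p) elements is counted as one extra round.

```python
def parallel_binary_search(a, x, p):
    """Search sorted list a for x with p simulated processors.

    Returns (index of x or -1, number of parallel rounds used).
    """
    lo, hi = 0, len(a)                 # current search range is a[lo:hi]
    rounds = 0
    while hi - lo > p:
        rounds += 1
        length = hi - lo
        # p boundary positions that split a[lo:hi] into p+1 nearly equal segments
        bounds = [lo + (t * length) // (p + 1) for t in range(1, p + 1)]
        new_lo, new_hi = lo, hi
        for b in bounds:               # one comparison per simulated processor
            if a[b] == x:
                return b, rounds
            if a[b] < x:
                new_lo = max(new_lo, b + 1)
            else:
                new_hi = min(new_hi, b)
        lo, hi = new_lo, new_hi        # keep only the segment that can contain x
    rounds += 1                        # at most p elements left: one comparison each
    for idx in range(lo, hi):
        if a[idx] == x:
            return idx, rounds
    return -1, rounds

data = list(range(0, 200, 2))          # 100 sorted even numbers
print(parallel_binary_search(data, 42, 3)[0])   # 21
print(parallel_binary_search(data, 43, 3)[0])   # -1
```

Each round shrinks the range by a factor of roughly p + 1, so the number of rounds is about log n / log(p + 1) rather than log n.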