Matrix Multiplication
Instructor: Dr. Sushil K. Prasad
Presented By: R. Jayampathi Sampath
Outline

– Introduction
– Hypercube Interconnection Network
– The Parallel Algorithm
– Matrix Transposition
– Communication Efficient Matrix Multiplication on Hypercubes (the paper)
Introduction

Matrix multiplication is an important problem in parallel algorithm design.
Matrix multiplication on the hypercube:
– The diameter is small
– Degree = log(p)
The straightforward RAM algorithm for matrix multiplication requires O(n^3) time.
– Sequential algorithm:

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            t = 0;
            for (k = 0; k < n; k++) {
                t = t + a[i][k] * b[k][j];
            }
            c[i][j] = t;
        }
    }
Hypercube Interconnection Network
[Figure: a 4-dimensional hypercube with 16 nodes labeled 0000 through 1111]
Hypercube Interconnection Network (contd.)

The formal specification of a hypercube interconnection network:
– Let N = 2^g processors p_0, p_1, ..., p_{N-1} be available.
– Let i and i^(b) be two integers, 0 <= i, i^(b) <= N-1, whose binary representations differ only in position b, where 0 <= b < g.
– Specifically, if i_{g-1} i_{g-2} ... i_{b+1} i_b i_{b-1} ... i_1 i_0 is the binary representation of i, then i_{g-1} i_{g-2} ... i_{b+1} i'_b i_{b-1} ... i_1 i_0 is the binary representation of i^(b), where i'_b is the complement of bit i_b.
A g-dimensional hypercube interconnection network is formed by connecting each processor p_i, 0 <= i <= N-1, to p_{i^(b)} by a two-way link, for all 0 <= b < g.
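Equivalently, flipping bit b of a node label gives its neighbor across dimension b. A minimal C sketch of this (the function name is ours, not from the slides):

    #include <stdio.h>

    /* Neighbor of node i across dimension b: flip bit b of the label. */
    unsigned neighbor(unsigned i, unsigned b) {
        return i ^ (1u << b);
    }

    int main(void) {
        unsigned g = 4;    /* 4-dimensional hypercube, N = 16 nodes */
        unsigned i = 5;    /* node 0101 */
        for (unsigned b = 0; b < g; b++)
            printf("neighbor of %u across dimension %u: %u\n",
                   i, b, neighbor(i, b));
        return 0;
    }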
The Parallel Algorithm

Example (parallel algorithm)

    A = | 1  2 |     B = | -1  -2 |
        | 3  4 |         | -3  -4 |

Multiplication of two 2*2 matrices: n = 2 = 2^1, #processors N = n^3 = 2^3 = 8.
Each processor occupies a position (i,j,k), one bit per coordinate (written X,X,X).

Step 1:
– Initial step: A(0,j,k) and B(0,j,k) hold the input elements on the face i = 0.
– Step 1.1: A(0,j,k) and B(0,j,k) are sent to processors (i,j,k), where 1 <= i <= n-1.
– Step 1.2: A(i,j,i) is sent to processors (i,j,k), where 0 <= k <= n-1.
– Step 1.3: B(i,i,k) is sent to processors (i,j,k), where 0 <= j <= n-1.

[Figure: the 8 processors drawn as a 3-cube with nodes 000-111, showing the A and B values held at each node after each step of the distribution]
The Parallel Algorithm (contd.)

Step 2: Each processor in position (i,j,k) computes C(i,j,k) = A(i,j,k) * B(i,j,k):

    face i = 0: | -1  -2 |     face i = 1: |  -6   -8 |
                | -3  -6 |                 | -12  -16 |

Step 3: Summing along the i direction gives C(0,j,k):

    C = |  -7  -10 |
        | -15  -22 |
The Parallel Algorithm (contd.)

Implementation of the straightforward RAM algorithm on the hypercube:
The multiplication of two n x n matrices A, B, where n = 2^q.
Use a hypercube with N = n^3 = 2^{3q} processors.
Each processor P_r occupies position (i,j,k), where r = i*n^2 + j*n + k, for 0 <= i,j,k <= n-1.
If the binary representation of r is

    r_{3q-1} r_{3q-2} ... r_{2q} r_{2q-1} ... r_q r_{q-1} ... r_0

then the binary representations of i, j, k are

    r_{3q-1} r_{3q-2} ... r_{2q},   r_{2q-1} ... r_q,   r_{q-1} ... r_0

respectively.
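In other words, (i,j,k) is obtained by splitting the bits of r into three q-bit fields. A small C sketch (the function name is ours):

    #include <stdio.h>

    /* Split processor rank r into hypercube coordinates (i,j,k),
       where r = i*n*n + j*n + k and n = 2^q. */
    void rank_to_position(unsigned r, unsigned q,
                          unsigned *i, unsigned *j, unsigned *k) {
        unsigned mask = (1u << q) - 1;   /* low q bits */
        *k = r & mask;                   /* bits q-1 .. 0   */
        *j = (r >> q) & mask;            /* bits 2q-1 .. q  */
        *i = (r >> (2 * q)) & mask;      /* bits 3q-1 .. 2q */
    }

    int main(void) {
        unsigned i, j, k;
        rank_to_position(6, 1, &i, &j, &k);  /* n = 2: r = 110 -> (1,1,0) */
        printf("(i,j,k) = (%u,%u,%u)\n", i, j, k);
        return 0;
    }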
The Parallel Algorithm (contd.)

Example (positioning)
– The multiplication of two 2 x 2 matrices A, B, where n = 2 = 2^1, so q = 1.
– Use a hypercube with N = n^3 = 2^{3q} = 8 processors.
– Each processor P_r occupies position (i,j,k), where r = i*2^2 + j*2 + k, for 0 <= i,j,k <= 1.
– If the binary representation of r is r_2 r_1 r_0, then the binary representations of i, j, k are r_2, r_1, r_0 respectively.
The Parallel Algorithm (contd.)

All processors with the same index value in one of the i, j, k fields form a hypercube with n^2 processors.
All processors with the same index value in two of the coordinate fields form a hypercube with n processors.
Each processor has 3 registers A_r, B_r and C_r, also denoted A(i,j,k), B(i,j,k) and C(i,j,k).
[Figure: the 3-cube with nodes 000-111; node 101 is highlighted with its three registers A_r, B_r and C_r]
The Parallel Algorithm (contd.)

Step 1: The elements of A and B are distributed to the n^3 processors so that the processor in position (i,j,k) will contain a_ji and b_ik.
– 1.1 Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to processors in positions (i,j,k), where 1 <= i <= n-1, resulting in A(i,j,k) = a_jk and B(i,j,k) = b_jk for 0 <= i <= n-1.
– 1.2 Copies of the data in A(i,j,i) are sent to processors in positions (i,j,k), where 0 <= k <= n-1, resulting in A(i,j,k) = a_ji for 0 <= k <= n-1.
– 1.3 Copies of the data in B(i,i,k) are sent to processors in positions (i,j,k), where 0 <= j <= n-1, resulting in B(i,j,k) = b_ik for 0 <= j <= n-1.
Step 2: Each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) * B(i,j,k).
Step 3: The sum C(0,j,k) = ∑ C(i,j,k) over 0 <= i <= n-1 is computed for 0 <= j,k <= n-1.
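A minimal sequential simulation of these three steps (a sketch only: the hypercube communication is replaced by array copies, and the register arrays A3, B3, C3 are our own names):

    #include <stdio.h>
    #define N 2   /* n = 2; the cube has n^3 = 8 positions (i,j,k) */

    double A3[N][N][N], B3[N][N][N], C3[N][N][N];

    void multiply(double a[N][N], double b[N][N], double c[N][N]) {
        /* Step 1.1: replicate the face i = 0 along the i direction,
           so A(i,j,k) = a_jk and B(i,j,k) = b_jk. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    A3[i][j][k] = a[j][k];
                    B3[i][j][k] = b[j][k];
                }
        /* Steps 1.2 and 1.3: after the broadcasts of A(i,j,i) along k
           and B(i,i,k) along j, A(i,j,k) = a_ji and B(i,j,k) = b_ik. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    A3[i][j][k] = a[j][i];
                    B3[i][j][k] = b[i][k];
                }
        /* Step 2: local products; Step 3: reduce along i into C(0,j,k). */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                c[j][k] = 0;
                for (int i = 0; i < N; i++) {
                    C3[i][j][k] = A3[i][j][k] * B3[i][j][k];
                    c[j][k] += C3[i][j][k];
                }
            }
    }

    int main(void) {
        double a[N][N] = {{1, 2}, {3, 4}};
        double b[N][N] = {{-1, -2}, {-3, -4}};
        double c[N][N];
        multiply(a, b, c);
        printf("%g %g\n%g %g\n", c[0][0], c[0][1], c[1][0], c[1][1]);
        return 0;   /* prints -7 -10 / -15 -22, matching the example */
    }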
The Parallel Algorithm (contd.)

Analysis
– Steps 1.1, 1.2, 1.3 and 3 each consist of q constant-time iterations.
– Step 2 requires constant time.
– So T(n^3) = O(q) = O(log n).
– Cost = pT(p) = O(n^3 log n).
– Not cost optimal (the sequential algorithm takes O(n^3)).
[Figures: the same distribution illustrated for n = 4 = 2^2, #processors N = n^3 = 4^3 = 64, positions written XX,XX,XX for (i,j,k): (a) the data initially in A(0,j,k) and B(0,j,k); (b) A(0,j,k) and B(0,j,k) sent to processors (i,j,k) for 1 <= i <= n-1; (c) the senders of A, with copies of A(i,j,i) sent to processors (i,j,k) for 0 <= k <= n-1; (d) the senders of B, with copies sent to processors (i,j,k) for 0 <= j <= n-1]
Matrix Transposition

The number of processors used is N = n^2 = 2^{2q}, and processor P_r occupies position (i,j), where r = in + j and 0 <= i,j <= n-1.
Initially, processor P_r holds element a_ij of matrix A, where r = in + j.
Upon termination, processor P_s holds element a_ij, where s = jn + i.
Matrix Transposition (contd.)

A recursive interpretation of the algorithm (a code sketch follows):
– Divide the matrix into 4 sub-matrices of size n/2 x n/2.
– At the first level of recursion:
  The elements of the bottom-left sub-matrix are swapped with the corresponding elements of the top-right sub-matrix.
  The elements of the other two sub-matrices are untouched.
– The same step is now applied to each of the four (n/2) x (n/2) matrices.
– This continues until 2 x 2 matrices are transposed.

Analysis
– The algorithm consists of q constant-time iterations.
– T(n) = O(log n)
– Cost = O(n^2 log n) – not cost optimal (n(n-1)/2 operations suffice on an n x n matrix on the RAM, by swapping a_ij with a_ji for all i < j).
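A small in-place C sketch of this recursion (assuming n is a power of two; the element values match the example on the next slide):

    #include <stdio.h>
    #define N 4

    /* Recursively transpose the s x s block of m with top-left corner
       (r0,c0): swap the bottom-left quadrant with the top-right quadrant
       element by element, then recurse into all four quadrants. This
       sketches the data movement only; on the hypercube each recursion
       level is one parallel swap step. */
    void transpose(char m[N][N], int r0, int c0, int s) {
        if (s < 2) return;
        int h = s / 2;
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) {
                char t = m[r0 + h + i][c0 + j];          /* bottom-left */
                m[r0 + h + i][c0 + j] = m[r0 + i][c0 + h + j];
                m[r0 + i][c0 + h + j] = t;               /* top-right   */
            }
        transpose(m, r0, c0, h);       transpose(m, r0, c0 + h, h);
        transpose(m, r0 + h, c0, h);   transpose(m, r0 + h, c0 + h, h);
    }

    int main(void) {
        char m[N][N] = {{'1','e','c','f'},
                        {'b','2','d','g'},
                        {'h','x','3','z'},
                        {'v','y','w','4'}};
        transpose(m, 0, 0, N);
        for (int i = 0; i < N; i++)
            printf("%c %c %c %c\n", m[i][0], m[i][1], m[i][2], m[i][3]);
        return 0;   /* prints the transpose: 1 b h v / e 2 x y / ... */
    }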
Matrix Transposition (contd.)

Example (the four quadrants are labeled 1.0, 1.1, 1.2 and 1.3)

    A = | 1 e c f |
        | b 2 d g |
        | h x 3 z |
        | v y w 4 |

After the first level of recursion (bottom-left and top-right quadrants swapped):

    A = | 1 e h x |
        | b 2 v y |
        | c f 3 z |
        | d g w 4 |

After the second level (each 2 x 2 quadrant transposed in place):

    A = | 1 b h v |
        | e 2 x y |
        | c d 3 w |
        | f g z 4 |
Outline

– 2D Diagonal Algorithm
– The 3-D Diagonal Algorithm
2D Diagonal Algorithm

Step 1: The 4 x 4 matrix A is partitioned into column blocks A_*0, A_*1, A_*2, A_*3 and the 4 x 4 matrix B into row blocks B_0*, B_1*, B_2*, B_3*. On the 4 x 4 processor grid, the diagonal processor p_ii initially holds A_*i and B_i*.

[Figure: the 4 x 4 grid after Step 1, with A_*i and B_i* stored on the diagonal]
2D Diagonal Algorithm (Contd.)

Step 2: One-to-all broadcast of A_*i along grid row i, so that every processor in row i holds A_*i.

Step 3: One-to-all personalized broadcast of B_i* along row i, so that processor p_ij holds A_*i and the block B_ij.

Step 4: Each processor computes its local product A_*i * B_ij; the partial results are summed over i along the grid columns, and the blocks C_0j, ..., C_3j of column j of C end up distributed over grid column j.

[Figure: the grid contents after Steps 2, 3 and 4]
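A sequential sketch of the block arithmetic behind Steps 2-4 (our own illustration, assuming an R x R grid with blocks of width N/R; the broadcasts and the column reduction are folded into the loop structure):

    #include <stdio.h>
    #define N 4          /* matrix dimension */
    #define R 2          /* R x R processor grid */
    #define W (N / R)    /* block width */

    double A[N][N], B[N][N], C[N][N];

    /* Processor p_ij holds column block A_*i (after the row broadcast)
       and block B_ij (after the personalized broadcast); it computes
       A_*i * B_ij, and the partial column blocks are summed over i. */
    void diagonal_2d(void) {
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++) C[x][y] = 0;
        for (int i = 0; i < R; i++)        /* grid row = summand index */
            for (int j = 0; j < R; j++)    /* grid column = output block */
                for (int x = 0; x < N; x++)
                    for (int y = 0; y < W; y++)
                        for (int k = 0; k < W; k++)
                            C[x][j*W + y] += A[x][i*W + k] * B[i*W + k][j*W + y];
    }

    int main(void) {
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++) { A[x][y] = x + 1; B[x][y] = (x == y); }
        diagonal_2d();                     /* B = I, so C should equal A */
        printf("C[2][3] = %g (expect %g)\n", C[2][3], A[2][3]);
        return 0;
    }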
2D Diagonal Algorithm (Contd.)

The above algorithm can be extended to a 3-D mesh embedded in a hypercube, with A_*i and B_i* initially distributed along the third dimension z.
Processor p_iik holds the sub-blocks A_ki and B_ik.
The one-to-all personalized broadcast of B_i* is then replaced by a point-to-point communication of B_ik from p_iik to p_kik.
This is followed by a one-to-all broadcast of B_ik from p_kik along the z direction.
The 3-D Diagonal Algorithm

A hypercube consisting of p processors can be visualized as a 3-D mesh of size ∛p x ∛p x ∛p.
Matrices A and B are partitioned into p^(2/3) blocks, with ∛p blocks along each dimension.
Initially, it is assumed that A and B are mapped onto the 2-D plane x = y, with processor p_iik containing the blocks A_ki and B_ki.
The 3-D Diagonal Algorithm (contd.)

The algorithm consists of 3 phases:
– Point-to-point communication of B_ki by p_iik to p_ikk.
– One-to-all broadcast of the blocks of A along the x direction and of the newly acquired blocks of B along the z direction.
  Now processor p_ijk has the blocks A_kj and B_ji.
  Each processor calculates the product of its blocks of A and B.
– Reduction, by adding the result sub-matrices along the z direction.
The 3-D Diagonal Algorithm (contd.)

Analysis
– Phase 1: Passing messages of size n^2/p^(2/3) requires log(∛p)(t_s + t_w(n^2/p^(2/3))) time, where t_s is the start-up time for sending a message and t_w is the time it takes to send one word from a processor to its neighbor.
– Phase 2: Takes twice as much time as Phase 1.
– Phase 3: Can be completed in the same amount of time as Phase 1.
Overall, the algorithm takes (4/3)log p message start-ups and transmits (n^2/p^(2/3))(4/3)log p words.
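Assembling the total from the phase costs (a worked restatement in the t_s/t_w model above):

\[
T_1 = \log\!\left(\sqrt[3]{p}\right)\left(t_s + t_w\,\frac{n^2}{p^{2/3}}\right)
    = \tfrac{1}{3}\log p\,\left(t_s + t_w\,\frac{n^2}{p^{2/3}}\right)
\]
\[
T = T_1 + 2T_1 + T_1
  = \tfrac{4}{3}\,t_s \log p \;+\; \tfrac{4}{3}\,t_w\,\frac{n^2}{p^{2/3}}\log p
\]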
Bibliography

Akl, S.G., Parallel Computation: Models and Methods, Prentice Hall, 1997.

Gupta, H. & Sadayappan, P., "Communication Efficient Matrix Multiplication on Hypercubes", Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994, pp. 320-329.

Quinn, M.J., Parallel Computing: Theory and Practice, McGraw Hill, 1997.