Matrix Multiplication
Instructor: Dr. Sushil K. Prasad
Presented By: R. Jayampathi Sampath
Outline

– Introduction
– Hypercube Interconnection Network
– The Parallel Algorithm
– Matrix Transposition
– Communication Efficient Matrix Multiplication on Hypercubes (the paper)
Introduction

Matrix multiplication is an important problem in parallel algorithm design.
Matrix multiplication on the hypercube:
– The diameter is small
– Degree = log(p)
The straightforward RAM algorithm for matrix multiplication requires O(n^3) time.
– Sequential algorithm:

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            t = 0;
            for (k = 0; k < n; k++) {
                t = t + a[i][k] * b[k][j];
            }
            c[i][j] = t;
        }
    }
Hypercube Interconnection Network
[Figure: a 4-dimensional hypercube with 16 nodes labeled 0000 through 1111]
Hypercube Interconnection Network (contd.)

The formal specification of a hypercube interconnection network:
– Let N = 2^g processors p_0, p_1, ..., p_{N-1} be available.
– Let i and i^(b) be two integers, 0 <= i, i^(b) <= N-1, whose binary representations differ only in position b, where 0 <= b < g.
– Specifically, if i_{g-1} i_{g-2} ... i_{b+1} i_b i_{b-1} ... i_1 i_0 is the binary representation of i, then i_{g-1} i_{g-2} ... i_{b+1} i'_b i_{b-1} ... i_1 i_0 is the binary representation of i^(b), where i'_b is the complement of bit i_b.
A g-dimensional hypercube interconnection network is formed by connecting each processor p_i, 0 <= i <= N-1, to p_{i^(b)} by a two-way link, for all 0 <= b < g.
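Equivalently, flipping bit b of a node label gives its neighbor across dimension b. A minimal C sketch of this (the function name is ours, not from the slides):

    #include <stdio.h>

    /* Neighbor of node i across dimension b: flip bit b of the label. */
    unsigned neighbor(unsigned i, unsigned b) {
        return i ^ (1u << b);
    }

    int main(void) {
        unsigned g = 4;    /* 4-dimensional hypercube, N = 16 nodes */
        unsigned i = 5;    /* node 0101 */
        for (unsigned b = 0; b < g; b++)
            printf("neighbor of %u across dimension %u: %u\n",
                   i, b, neighbor(i, b));
        return 0;
    }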
The Parallel Algorithm

Example (parallel algorithm)

    A = | 1  2 |     B = | -1  -2 |
        | 3  4 |         | -3  -4 |

Multiplication of two 2*2 matrices: n = 2 = 2^1, #processors N = n^3 = 2^3 = 8.
Each processor occupies a position (i,j,k), one bit per coordinate (written X,X,X).

Step 1:
– Initial step: A(0,j,k) and B(0,j,k) hold the input elements on the face i = 0.
– Step 1.1: A(0,j,k) and B(0,j,k) are sent to processors (i,j,k), where 1 <= i <= n-1.
– Step 1.2: A(i,j,i) is sent to processors (i,j,k), where 0 <= k <= n-1.
– Step 1.3: B(i,i,k) is sent to processors (i,j,k), where 0 <= j <= n-1.

[Figure: the 8 processors drawn as a 3-cube with nodes 000-111, showing the A and B values held at each node after each step of the distribution]
The Parallel Algorithm (contd.)

Step 2: Each processor in position (i,j,k) computes C(i,j,k) = A(i,j,k) * B(i,j,k):

    face i = 0: | -1  -2 |     face i = 1: |  -6   -8 |
                | -3  -6 |                 | -12  -16 |

Step 3: Summing along the i direction gives C(0,j,k):

    C = |  -7  -10 |
        | -15  -22 |
The Parallel Algorithm (contd.)

Implementation of the straightforward RAM algorithm on the hypercube:
The multiplication of two n x n matrices A, B, where n = 2^q.
Use a hypercube with N = n^3 = 2^{3q} processors.
Each processor P_r occupies position (i,j,k), where r = i*n^2 + j*n + k, for 0 <= i,j,k <= n-1.
If the binary representation of r is

    r_{3q-1} r_{3q-2} ... r_{2q} r_{2q-1} ... r_q r_{q-1} ... r_0

then the binary representations of i, j, k are

    r_{3q-1} r_{3q-2} ... r_{2q},   r_{2q-1} ... r_q,   r_{q-1} ... r_0

respectively.
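In other words, (i,j,k) is obtained by splitting the bits of r into three q-bit fields. A small C sketch (the function name is ours):

    #include <stdio.h>

    /* Split processor rank r into hypercube coordinates (i,j,k),
       where r = i*n*n + j*n + k and n = 2^q. */
    void rank_to_position(unsigned r, unsigned q,
                          unsigned *i, unsigned *j, unsigned *k) {
        unsigned mask = (1u << q) - 1;   /* low q bits */
        *k = r & mask;                   /* bits q-1 .. 0   */
        *j = (r >> q) & mask;            /* bits 2q-1 .. q  */
        *i = (r >> (2 * q)) & mask;      /* bits 3q-1 .. 2q */
    }

    int main(void) {
        unsigned i, j, k;
        rank_to_position(6, 1, &i, &j, &k);  /* n = 2: r = 110 -> (1,1,0) */
        printf("(i,j,k) = (%u,%u,%u)\n", i, j, k);
        return 0;
    }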
The Parallel Algorithm (contd.)

Example (positioning)
– The multiplication of two 2 x 2 matrices A, B, where n = 2 = 2^1, so q = 1.
– Use a hypercube with N = n^3 = 2^{3q} = 8 processors.
– Each processor P_r occupies position (i,j,k), where r = i*2^2 + j*2 + k, for 0 <= i,j,k <= 1.
– If the binary representation of r is r_2 r_1 r_0, then the binary representations of i, j, k are r_2, r_1, r_0 respectively.
The Parallel Algorithm (contd.)

All processors with the same index value in one of the i, j, k fields form a hypercube with n^2 processors.
All processors with the same index value in two of the coordinate fields form a hypercube with n processors.
Each processor has 3 registers A_r, B_r and C_r, also denoted A(i,j,k), B(i,j,k) and C(i,j,k).
[Figure: the 3-cube with nodes 000-111; node 101 is highlighted with its three registers A_r, B_r and C_r]
The Parallel Algorithm (contd.)

Step 1: The elements of A and B are distributed to the n^3 processors so that the processor in position (i,j,k) will contain a_ji and b_ik.
– 1.1 Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to processors in positions (i,j,k), where 1 <= i <= n-1, resulting in A(i,j,k) = a_jk and B(i,j,k) = b_jk for 0 <= i <= n-1.
– 1.2 Copies of the data in A(i,j,i) are sent to processors in positions (i,j,k), where 0 <= k <= n-1, resulting in A(i,j,k) = a_ji for 0 <= k <= n-1.
– 1.3 Copies of the data in B(i,i,k) are sent to processors in positions (i,j,k), where 0 <= j <= n-1, resulting in B(i,j,k) = b_ik for 0 <= j <= n-1.
Step 2: Each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) * B(i,j,k).
Step 3: The sum C(0,j,k) = ∑ C(i,j,k) over 0 <= i <= n-1 is computed for 0 <= j,k <= n-1.
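A minimal sequential simulation of these three steps (a sketch only: the hypercube communication is replaced by array copies, and the register arrays A3, B3, C3 are our own names):

    #include <stdio.h>
    #define N 2   /* n = 2; the cube has n^3 = 8 positions (i,j,k) */

    double A3[N][N][N], B3[N][N][N], C3[N][N][N];

    void multiply(double a[N][N], double b[N][N], double c[N][N]) {
        /* Step 1.1: replicate the face i = 0 along the i direction,
           so A(i,j,k) = a_jk and B(i,j,k) = b_jk. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    A3[i][j][k] = a[j][k];
                    B3[i][j][k] = b[j][k];
                }
        /* Steps 1.2 and 1.3: after the broadcasts of A(i,j,i) along k
           and B(i,i,k) along j, A(i,j,k) = a_ji and B(i,j,k) = b_ik. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    A3[i][j][k] = a[j][i];
                    B3[i][j][k] = b[i][k];
                }
        /* Step 2: local products; Step 3: reduce along i into C(0,j,k). */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                c[j][k] = 0;
                for (int i = 0; i < N; i++) {
                    C3[i][j][k] = A3[i][j][k] * B3[i][j][k];
                    c[j][k] += C3[i][j][k];
                }
            }
    }

    int main(void) {
        double a[N][N] = {{1, 2}, {3, 4}};
        double b[N][N] = {{-1, -2}, {-3, -4}};
        double c[N][N];
        multiply(a, b, c);
        printf("%g %g\n%g %g\n", c[0][0], c[0][1], c[1][0], c[1][1]);
        return 0;   /* prints -7 -10 / -15 -22, matching the example */
    }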
The Parallel Algorithm (contd.)

Analysis
– Steps 1.1, 1.2, 1.3 and 3 each consist of q constant-time iterations.
– Step 2 requires constant time.
– So T(n^3) = O(q) = O(log n).
– Cost = pT(p) = O(n^3 log n).
– Not cost optimal (the sequential algorithm takes O(n^3)).
[Figures: the same distribution illustrated for n = 4 = 2^2, #processors N = n^3 = 4^3 = 64, positions written XX,XX,XX for (i,j,k): (a) the data initially in A(0,j,k) and B(0,j,k); (b) A(0,j,k) and B(0,j,k) sent to processors (i,j,k) for 1 <= i <= n-1; (c) the senders of A, with copies of A(i,j,i) sent to processors (i,j,k) for 0 <= k <= n-1; (d) the senders of B, with copies sent to processors (i,j,k) for 0 <= j <= n-1]
Matrix Transposition

The number of processors used is N = n^2 = 2^{2q}, and processor P_r occupies position (i,j), where r = in + j and 0 <= i,j <= n-1.
Initially, processor P_r holds element a_ij of matrix A, where r = in + j.
Upon termination, processor P_s holds element a_ij, where s = jn + i.
Matrix Transposition (contd.)

A recursive interpretation of the algorithm (a code sketch follows):
– Divide the matrix into 4 sub-matrices of size n/2 x n/2.
– At the first level of recursion:
  The elements of the bottom-left sub-matrix are swapped with the corresponding elements of the top-right sub-matrix.
  The elements of the other two sub-matrices are untouched.
– The same step is now applied to each of the four (n/2) x (n/2) matrices.
– This continues until 2 x 2 matrices are transposed.

Analysis
– The algorithm consists of q constant-time iterations.
– T(n) = O(log n)
– Cost = O(n^2 log n) – not cost optimal (n(n-1)/2 operations suffice on an n x n matrix on the RAM, by swapping a_ij with a_ji for all i < j).
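A small in-place C sketch of this recursion (assuming n is a power of two; the element values match the example on the next slide):

    #include <stdio.h>
    #define N 4

    /* Recursively transpose the s x s block of m with top-left corner
       (r0,c0): swap the bottom-left quadrant with the top-right quadrant
       element by element, then recurse into all four quadrants. This
       sketches the data movement only; on the hypercube each recursion
       level is one parallel swap step. */
    void transpose(char m[N][N], int r0, int c0, int s) {
        if (s < 2) return;
        int h = s / 2;
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) {
                char t = m[r0 + h + i][c0 + j];          /* bottom-left */
                m[r0 + h + i][c0 + j] = m[r0 + i][c0 + h + j];
                m[r0 + i][c0 + h + j] = t;               /* top-right   */
            }
        transpose(m, r0, c0, h);       transpose(m, r0, c0 + h, h);
        transpose(m, r0 + h, c0, h);   transpose(m, r0 + h, c0 + h, h);
    }

    int main(void) {
        char m[N][N] = {{'1','e','c','f'},
                        {'b','2','d','g'},
                        {'h','x','3','z'},
                        {'v','y','w','4'}};
        transpose(m, 0, 0, N);
        for (int i = 0; i < N; i++)
            printf("%c %c %c %c\n", m[i][0], m[i][1], m[i][2], m[i][3]);
        return 0;   /* prints the transpose: 1 b h v / e 2 x y / ... */
    }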
Matrix Transposition (contd.)

Example (the four quadrants are labeled 1.0, 1.1, 1.2 and 1.3)

    A = | 1 e c f |
        | b 2 d g |
        | h x 3 z |
        | v y w 4 |

After the first level of recursion (bottom-left and top-right quadrants swapped):

    A = | 1 e h x |
        | b 2 v y |
        | c f 3 z |
        | d g w 4 |

After the second level (each 2 x 2 quadrant transposed in place):

    A = | 1 b h v |
        | e 2 x y |
        | c d 3 w |
        | f g z 4 |
Outline

– 2D Diagonal Algorithm
– The 3-D Diagonal Algorithm
2D Diagonal Algorithm

Step 1: The 4 x 4 matrix A is partitioned into column blocks A_*0, A_*1, A_*2, A_*3 and the 4 x 4 matrix B into row blocks B_0*, B_1*, B_2*, B_3*. On the 4 x 4 processor grid, the diagonal processor p_ii initially holds A_*i and B_i*.

[Figure: the 4 x 4 grid after Step 1, with A_*i and B_i* stored on the diagonal]
2D Diagonal Algorithm (Contd.)

Step 2: One-to-all broadcast of A_*i along grid row i, so that every processor in row i holds A_*i.

Step 3: One-to-all personalized broadcast of B_i* along row i, so that processor p_ij holds A_*i and the block B_ij.

Step 4: Each processor computes its local product A_*i * B_ij; the partial results are summed over i along the grid columns, and the blocks C_0j, ..., C_3j of column j of C end up distributed over grid column j.

[Figure: the grid contents after Steps 2, 3 and 4]
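A sequential sketch of the block arithmetic behind Steps 2-4 (our own illustration, assuming an R x R grid with blocks of width N/R; the broadcasts and the column reduction are folded into the loop structure):

    #include <stdio.h>
    #define N 4          /* matrix dimension */
    #define R 2          /* R x R processor grid */
    #define W (N / R)    /* block width */

    double A[N][N], B[N][N], C[N][N];

    /* Processor p_ij holds column block A_*i (after the row broadcast)
       and block B_ij (after the personalized broadcast); it computes
       A_*i * B_ij, and the partial column blocks are summed over i. */
    void diagonal_2d(void) {
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++) C[x][y] = 0;
        for (int i = 0; i < R; i++)        /* grid row = summand index */
            for (int j = 0; j < R; j++)    /* grid column = output block */
                for (int x = 0; x < N; x++)
                    for (int y = 0; y < W; y++)
                        for (int k = 0; k < W; k++)
                            C[x][j*W + y] += A[x][i*W + k] * B[i*W + k][j*W + y];
    }

    int main(void) {
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++) { A[x][y] = x + 1; B[x][y] = (x == y); }
        diagonal_2d();                     /* B = I, so C should equal A */
        printf("C[2][3] = %g (expect %g)\n", C[2][3], A[2][3]);
        return 0;
    }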
2D Diagonal Algorithm (Contd.)

The above algorithm can be extended to a 3-D mesh embedded in a hypercube, with A_*i and B_i* initially distributed along the third dimension z.
Processor p_iik holds the sub-blocks A_ki and B_ik.
The one-to-all personalized broadcast of B_i* is then replaced by a point-to-point communication of B_ik from p_iik to p_kik.
This is followed by a one-to-all broadcast of B_ik from p_kik along the z direction.
The 3-D Diagonal Algorithm

A hypercube consisting of p processors can be visualized as a 3-D mesh of size ∛p x ∛p x ∛p.
Matrices A and B are partitioned into p^(2/3) blocks, with ∛p blocks along each dimension.
Initially, it is assumed that A and B are mapped onto the 2-D plane x = y, with processor p_iik containing the blocks A_ki and B_ki.
The 3-D Diagonal Algorithm (contd.)

The algorithm consists of 3 phases:
– Point-to-point communication of B_ki by p_iik to p_ikk.
– One-to-all broadcast of the blocks of A along the x direction and of the newly acquired blocks of B along the z direction.
  Now processor p_ijk has the blocks A_kj and B_ji.
  Each processor calculates the product of its blocks of A and B.
– Reduction, by adding the result sub-matrices along the z direction.
The 3-D Diagonal Algorithm (contd.)

Analysis
– Phase 1: Passing messages of size n^2/p^(2/3) requires log(∛p)(t_s + t_w(n^2/p^(2/3))) time, where t_s is the start-up time for sending a message and t_w is the time it takes to send one word from a processor to its neighbor.
– Phase 2: Takes twice as much time as Phase 1.
– Phase 3: Can be completed in the same amount of time as Phase 1.
Overall, the algorithm takes (4/3)log p message start-ups and transmits (n^2/p^(2/3))(4/3)log p words.
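Assembling the total from the phase costs (a worked restatement in the t_s/t_w model above):

\[
T_1 = \log\!\left(\sqrt[3]{p}\right)\left(t_s + t_w\,\frac{n^2}{p^{2/3}}\right)
    = \tfrac{1}{3}\log p\,\left(t_s + t_w\,\frac{n^2}{p^{2/3}}\right)
\]
\[
T = T_1 + 2T_1 + T_1
  = \tfrac{4}{3}\,t_s \log p \;+\; \tfrac{4}{3}\,t_w\,\frac{n^2}{p^{2/3}}\log p
\]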
Bibliography

Akl, S.G., Parallel Computation: Models and Methods, Prentice Hall, 1997.

Gupta, H. & Sadayappan, P., "Communication Efficient Matrix Multiplication on Hypercubes", Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994, pp. 320-329.

Quinn, M.J., Parallel Computing: Theory and Practice, McGraw Hill, 1997.