Weapons of Math Induction for the War on Parallel Programming Error
Robert A. van de Geijn
Department of Computer Science
Institute for Computational Engineering and Sciences
The University of Texas at Austin
ICES – Sept, 2010
http://www.cs.utexas.edu/users/flame/ 1
Outline
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
   Multithreaded Architectures
   Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
Introduction
The Team
UT-Austin
Faculty/Staff: Ernie Chan, Victor Eijkhout, Maggie Myers, Andy Terrel, Robert van de Geijn, Field Van Zee
Graduate Students: Bryan Marker, Kyungjoo Kim, Isaac Lee, Ardavan Pedram, Jack Poulson, Martin Schatz
Undergrads: Burns Healy, Eileen Martin, Jon Monette, Tyler Rhodes, Richard Veras, Nick Wiz
Univ. Jaume I, Spain
Faculty: Gregorio Quintana-Ortí, Enrique Quintana-Ortí, Mercedes Marques
Graduate Students: Manuel Fogue, Francisco D. Igual
RWTH Aachen
Faculty: Paolo Bientinesi
Sponsors
UT-Austin
Numerous NSF Grants, Microsoft, Intel
Univ. Jaume I
Ministerio de Ciencia e Innovación, Clearspeed, Microsoft, Nvidia
Who is this famous (former) Texan?
“I mean, if 10 years from now, when you are doing something quick and dirty, you suddenly visualize that I am looking over your shoulders and say to yourself ‘Dijkstra would not have liked this’, well, that would be enough immortality for me.”
– Dijkstra
“Literature professors read each other’s books. Why don’t computer science professors read each other’s programs?”
– Tim Mattson
Why dense linear algebra libraries?
Widely used in scientific computing
Well-defined domain
Thought to be well-understood
Interesting case study
LAPACK Cholesky factorization (dpotrf)
      DO 20 J = 1, N, NB
*
*        Update and factorize the current diagonal block and test
*        for non-positive-definiteness.
*
         JB = MIN( NB, N-J+1 )
         CALL DSYRK( 'Lower', 'No transpose', JB, J-1, -ONE,
     $               A( J, 1 ), LDA, ONE, A( J, J ), LDA )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 )
     $      GO TO 30
         IF( J+JB.LE.N ) THEN
*
*           Compute the current block column.
*
            CALL DGEMM( 'No transpose', 'Transpose', N-J-JB+1, JB,
     $                  J-1, -ONE, A( J+JB, 1 ), LDA, A( J, 1 ),
     $                  LDA, ONE, A( J+JB, J ), LDA )
            CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                  N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $                  A( J+JB, J ), LDA )
         END IF
   20 CONTINUE
      <deleted code>
      GO TO 40
*
   30 CONTINUE
      INFO = INFO + J - 1
*
   40 CONTINUE
      RETURN
LAPACK
Fortran-77 codes
One routine (algorithm) per operation in the library
Storage in column major order
Parallelism extracted from calls to multithreaded BLAS
Extracting parallelism this way increases synchronization and thus limits performance
Column major order hurts data locality
LAPACK does not use modern coding techniques
The sky is falling
Evolution vs intelligent design
Parallelism is thrust upon the masses
Popular libraries like LAPACK must be completely rewritten (end of an evolutionary path)
Great, let’s start over
Cheaper than trying to evolve?
FLAME
Notation for expressing algorithms
Systematic derivation procedure
Families of algorithms for each operation
APIs to transform algorithms into codes
Storage and algorithm are independent
Storage-by-blocks
Parallelism with data dependencies
High performance even on “exotic” architectures like multi-GPU systems
A new distributed memory library for massively parallel clusters and clusters-on-a-chip
General lesson:
This crisis is an opportunity to completely rethink your code.
To keep you interested ...
[Figure: Performance of the Cholesky factorization on GPU/CPU. y-axis: GFLOPS (0 to 700); x-axis: matrix size (0 to 20000). Curves: MKL 10.0 spotrf on two Intel Xeon QuadCore (2.2 GHz); algorithm-by-blocks on Tesla S870; algorithm-by-blocks on Tesla S1070.]
Notation
A Motivating Example: The Cholesky Factorization
Given an n × n symmetric positive definite matrix A, compute

A = L · L^T,

where L is an n × n lower triangular matrix
The Cholesky Factorization: On the Whiteboard
[Whiteboard picture: the top-left part of A is done; the rest is partially updated. Expose the current diagonal element α11, the subdiagonal column a21, and the trailing submatrix A22, update

α11 := √α11
a21 := a21/α11
A22 := A22 − a21 a21^T

then move the boundary forward and repeat.]
FLAME Notation
[Picture: the ATL quadrant of A is done; ABR is partially updated. The thick lines expose α11, a12^T, a21, and A22.]

Repartition

( ATL, ATR ; ABL, ABR ) → ( A00, a01, A02 ; a10^T, α11, a12^T ; A20, a21, A22 )

where α11 is a scalar
Algorithm: [A] := Chol_unb(A)

Partition A → ( ATL, ATR ; ABL, ABR ) where ATL is 0 × 0
while n(ABR) ≠ 0 do
  Repartition
    ( ATL, ATR ; ABL, ABR ) → ( A00, a01, A02 ; a10^T, α11, a12^T ; A20, a21, A22 )
    where α11 is a scalar

  α11 := √α11
  a21 := a21/α11
  A22 := A22 − a21 a21^T   (syr)

  Continue with
    ( ATL, ATR ; ABL, ABR ) ← ( A00, a01, A02 ; a10^T, α11, a12^T ; A20, a21, A22 )
endwhile
General lesson:
Algorithms should be represented in a way that captures how we reason about them.
Deriving Algorithms to be Correct
Family Values
“The only effective way to raise the confidence level of a program significantly is to give a convincing proof of its correctness. But one should not first make the program and then prove its correctness, because then the requirement of providing the proof would only increase the poor programmer’s burden. On the contrary: the programmer should let correctness proof and program grow hand in hand.”
– Dijkstra
The Worksheet: A Weapon of Math Induction
Step 1: Precondition and postcondition
Precondition: A = Â

Note: Â indicates the contents of A upon entry. We use this dummy variable to be able to reason about the contents of matrix A as it is being overwritten by its Cholesky factor.

Postcondition: A = L ∧ Â = L L^T

Note: this indicates that upon completion A must contain the Cholesky factor of the original matrix.
The worksheet skeleton (Step | Annotated Algorithm: A := Chol_unb_var3(A)):

1a   { A = Â }
4    (initialization)
2    { loop-invariant }
3    while m(ATL) < m(A) do
2,3    { loop-invariant ∧ loop-guard }
5a     (repartition)
6      { state before the update }
8      (the update)
5b     (continue with)
7      { state after the update }
2      { loop-invariant }
     endwhile
2,3  { loop-invariant ∧ ¬(loop-guard) }
1b   { A = L ∧ Â = L L^T }
Step 2: Finding Loop-Invariants
Partition the operands

A → ( ATL, ⋆ ; ABL, ABR )  and  L → ( LTL, 0 ; LBL, LBR )

Plug into the postcondition A = L ∧ Â = L L^T:

( ATL, ⋆ ; ABL, ABR ) = ( LTL, 0 ; LBL, LBR )
∧ ( ÂTL, ⋆ ; ÂBL, ÂBR ) = ( LTL, 0 ; LBL, LBR ) ( LTL, 0 ; LBL, LBR )^T
Determine the loop-invariants
( ATL, ⋆ ; ABL, ABR ) = ( LTL, 0 ; LBL, LBR )
∧ ( ÂTL, ⋆ ; ÂBL, ÂBR ) = ( LTL LTL^T, ⋆ ; LBL LTL^T, LBL LBL^T + LBR LBR^T )

Loop-invariant 1:
( ATL, ⋆ ; ABL, ABR ) = ( LTL, ⋆ ; ÂBL, ÂBR )  ∧  ÂTL = LTL LTL^T

Loop-invariant 2:
( ATL, ⋆ ; ABL, ABR ) = ( LTL, ⋆ ; LBL, ÂBR )  ∧  ( ÂTL ; ÂBL ) = ( LTL LTL^T ; LBL LTL^T )

Loop-invariant 3:
( ATL, ⋆ ; ABL, ABR ) = ( LTL, ⋆ ; LBL, ÂBR − LBL LBL^T )  ∧  ( ÂTL ; ÂBL ) = ( LTL LTL^T ; LBL LTL^T )
Step 2: Enter loop-invariant in worksheet
Loop-invariant 3 is entered at every Step 2 (and 2,3) position of the worksheet: before the loop, at the top of the loop body (conjoined with the loop-guard), at the bottom of the loop body, and after the loop (conjoined with the negation of the loop-guard):

{ ( ATL, ⋆ ; ABL, ABR ) = ( LTL, ⋆ ; LBL, ÂBR − LBL LBL^T )  ∧  ( ÂTL ; ÂBL ) = ( LTL LTL^T ; LBL LTL^T ) }
Why a Weapon of Math Induction?
Step 3: Finding the Loop-Guard
The loop-guard is chosen so that the loop-invariant together with the negation of the guard implies the postcondition. Here the guard is m(ATL) < m(A): when it becomes false, ATL is all of A, ABR is empty, and the invariant reduces to A = L ∧ Â = L L^T. In the worksheet:

3    while m(ATL) < m(A) do
2,3    { loop-invariant ∧ m(ATL) < m(A) }
       ...
     endwhile
2,3  { loop-invariant ∧ ¬( m(ATL) < m(A) ) }
1b   { A = L ∧ Â = L L^T }
Step 4: Finding the Initialization
The initialization

Partition A → ( ATL, ⋆ ; ABL, ABR ),  L → ( LTL, 0 ; LBL, LBR )  where ATL and LTL are 0 × 0

must establish the loop-invariant before the loop. With ATL of size 0 × 0, every condition in the invariant holds trivially.
Step 5: Marching through the Matrix

At the top of the loop, repartition

( ATL, ⋆ ; ABL, ABR ) → ( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 ),
( LTL, 0 ; LBL, LBR ) → ( L00, 0, 0 ; l10^T, λ11, 0 ; L20, l21, L22 )

where α11 and λ11 are scalars. At the bottom of the loop, continue with the exposed parts absorbed back into the top-left quadrants, so that ATL grows by one row and one column each iteration.
Step 6: State Before the Update
Substituting the repartitioned operands into the loop-invariant gives the state that holds before the update:

( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 )
  = ( L00, ⋆, ⋆ ; l10^T, α̂11 − l10^T l10, ⋆ ; L20, â21 − L20 l10, Â22 − L20 L20^T )
∧ ( Â00 ; â10^T ; Â20 ) = ( L00 L00^T ; l10^T L00^T ; L20 L00^T )
Step 7: State After the Update

Similarly, substituting the "Continue with" partitioning into the loop-invariant gives the state that must hold after the update:

( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 )
  = ( L00, ⋆, ⋆ ; l10^T, λ11, ⋆ ; L20, l21, Â22 − L20 L20^T − l21 l21^T )
∧ ( Â00, ⋆ ; â10^T, α̂11 ; Â20, â21 )
  = ( L00 L00^T, ⋆ ; l10^T L00^T, l10^T l10 + λ11^2 ; L20 L00^T, L20 l10 + l21 λ11 )
Step 8: The Update

Comparing the state before the update (Step 6) with the state that must hold after it (Step 7) shows that the loop body must compute

α11 := √α11
a21 := a21/α11
A22 := A22 − a21 a21^T

after which α11 holds λ11, a21 holds l21, and A22 holds Â22 − L20 L20^T − l21 l21^T.
The Algorithm
Algorithm: A := Chol_unb_var3(A)

Partition A → ( ATL, ⋆ ; ABL, ABR ) where ATL is 0 × 0
while m(ATL) < m(A) do
  Repartition
    ( ATL, ⋆ ; ABL, ABR ) → ( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 )
    where α11 is a scalar

  α11 := √α11
  a21 := a21/α11
  A22 := A22 − a21 a21^T

  Continue with
    ( ATL, ⋆ ; ABL, ABR ) ← ( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 )
endwhile
Having families of correct algorithms is good
Don’t necessarily start with the legacy implementation of the “usual” algorithm. It may not parallelize well.
Find all (most) algorithms and pick the best for the target architecture.
In our case, we can systematically generate all (loop-based) algorithms.
Is the methodology just a theoretical curiosity?
Broad applicability to all operations supported by LAPACK
Not yet: eigensolvers, SVD.
The methodology is sufficiently systematic that it has been automated (with Mathematica).
Paolo Bientinesi. “Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms.” Dissertation, UT-Austin, 2006.
Recently generalized to the derivation of Krylov subspace methods.
Victor Eijkhout, Paolo Bientinesi, and Robert van de Geijn. “Toward Mechanical Derivation of Krylov Solver Libraries.” ICCS, 2010.
Extended to systematic derivation of numerical stability analysis.
Paolo Bientinesi and Robert A. van de Geijn. “A Goal-Oriented and Modular Approach to Stability Analysis.” SIMAX. Conditionally accepted.
How does this apply to parallel programming?
Writing correct parallel code is difficult.
We derive our (for now sequential) algorithms to be correct.
It is important to choose from a family of algorithms.
Choose an algorithm that parallelizes well.
http://www.cs.utexas.edu/users/flame/ 43
From Correct Algorithm to Correct Code
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 44
From Correct Algorithm to Correct Code
FLAME/C Code
Repartition

    ( ATL  ATR )        ( A00    a01      A02   )
    (          )  -->   ( a10^T  alpha11  a12^T )
    ( ABL  ABR )        ( A20    a21      A22   )

where alpha11 is a scalar
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &a01, &A02,
/* ************** */ /* *************************** */
&a10t, /**/ &alpha11, &a12t,
ABL, /**/ ABR, &A20, /**/ &a21, &A22,
1, 1, FLA_BR );
http://www.cs.utexas.edu/users/flame/ 45
From Correct Algorithm to Correct Code
(Unblocked) FLAME/C Code
int FLA_Cholesky_unb( FLA_Obj A )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &a01, &A02,
/* ************* */ /* ************************** */
&a10t, /**/ &alpha11, &a12t,
ABL, /**/ ABR, &A20, /**/ &a21, &A22,
1, 1, FLA_BR );
/*------------------------------------------------------------*/
FLA_Sqrt ( alpha11 ); /* alpha11 := sqrt( alpha11 ) */
FLA_Inv_Scal( alpha11, a21 ); /* a21 := a21 / alpha11 */
FLA_Syr ( FLA_LOWER_TRIANGULAR,
FLA_MINUS_ONE,
a21, A22 ); /* A22 := A22 - a21 * a21t */
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 46
Achieving High Performance
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 47
Achieving High Performance
Who is this famous (former) Texan?
http://www.cs.utexas.edu/users/flame/ 48
Achieving High Performance
Who is this famous Texan? Kazushige Goto (TACC)
http://www.cs.utexas.edu/users/flame/ 49
Achieving High Performance
High-Performance Matrix-Matrix Multiplication
Why is matrix-matrix multiplication (gemm) so important?
O(n³) computation on O(n²) data.
Allows data movement between RAM and cache to be hidden.
Can achieve extremely high performance (up to 99% of peak on some architectures).

Required reading (shameless self-promotion):
Kazushige Goto and Robert A. van de Geijn. "Anatomy of High-Performance Matrix Multiplication," ACM Transactions on Mathematical Software, 34(3): Article 12, 25 pages, May 2008.

Use the method to derive blocked algorithms that cast more computation in terms of gemm.
http://www.cs.utexas.edu/users/flame/ 50
Achieving High Performance
(Unblocked) FLAME/C Code (Again)
int FLA_Cholesky_unb( FLA_Obj A )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &a01, &A02,
/* ************* */ /* ************************** */
&a10t, /**/ &alpha11, &a12t,
ABL, /**/ ABR, &A20, /**/ &a21, &A22,
1, 1, FLA_BR );
/*------------------------------------------------------------*/
FLA_Sqrt ( alpha11 ); /* alpha11 := sqrt( alpha11 ) */
FLA_Inv_Scal( alpha11, a21 ); /* a21 := a21 / alpha11 */
FLA_Syr ( FLA_LOWER_TRIANGULAR,
FLA_MINUS_ONE,
a21, A22 ); /* A22 := A22 - a21 * a21t */
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 51
Achieving High Performance
Blocked FLAME/C Code
int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
b = min( FLA_Obj_length( ABR ), nb_alg );
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &A01, &A02,
/* ************* */ /* ******************** */
&A10, /**/ &A11, &A12,
ABL, /**/ ABR, &A20, /**/ &A21, &A22,
b, b, FLA_BR );
/*------------------------------------------------------------*/
FLA_Cholesky_unb( A11 ); /* A11 := Cholesky( A11 ) */
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11,
A21 ); /* A21 := A21 * inv( A11 )’*/
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, A22 ); /* A22 := A22 - A21 * A21’ */
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 52
Fighting the War on Parallel Programming Error
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 53
Fighting the War on Parallel Programming Error
“When we had no computers, we had no programming problemeither. When we had a few computers, we had a mildprogramming problem. Confronted with machines a million timesas powerful, we are faced with a gigantic programming problem.”
– Dijkstra
http://www.cs.utexas.edu/users/flame/ 54
Fighting the War on Parallel Programming Error Multithreaded Architectures
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 55
Fighting the War on Parallel Programming Error Multithreaded Architectures
LAPACK parallelization: multithreaded BLAS
    ( A11   ?  )
    ( A21  A22 ),   where A11 is b × b
Pro?:
Evolve legacy code
Con:
Continue to code in the LINPACK style (1970s).
Each call to the BLAS (compute kernels) is a synchronization point for threads.
As the number of threads increases, serial operations with cost O(nb²) or O(b³) are no longer negligible compared with O(n²b).
http://www.cs.utexas.edu/users/flame/ 56
Fighting the War on Parallel Programming Error Multithreaded Architectures
Of algorithms-by-blocks and runtime systems
Improve parallelism and data locality: algorithms-by-blocks
Matrix of matrix blocks.
Matrix blocks as the unit of data.
Computation with matrix blocks as the unit of computation.
Execute sequential code to generate DAG of tasks
SuperMatrix
Runtime system for scheduling tasks to threads
Sequential kernels to be executed by the threads
Always be sure to make the machine-specific part someone else's problem.
SuperMatrix is part of Ernie Chan’s dissertation work.
http://www.cs.utexas.edu/users/flame/ 57
Fighting the War on Parallel Programming Error Multithreaded Architectures
A =

    ( A(0,0)      ?         ?       ...      ?       )
    ( A(1,0)    A(1,1)      ?       ...      ?       )
    ( A(2,0)    A(2,1)    A(2,2)    ...      ?       )
    (   ...       ...       ...     ...     ...      )
    ( A(M-1,0)  A(M-1,1)  A(M-1,2)  ...  A(M-1,N-1)  )
http://www.cs.utexas.edu/users/flame/ 58
Fighting the War on Parallel Programming Error Multithreaded Architectures
Algorithm-by-blocks implementation: (almost) no change
int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
b = min( FLA_Obj_length( ABR ), nb_alg );
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &A01, &A02,
/* ************* */ /* ******************** */
&A10, /**/ &A11, &A12,
ABL, /**/ ABR, &A20, /**/ &A21, &A22,
1, 1, FLA_BR );
/*------------------------------------------------------------*/
FLA_Chol( FLA_LOWER_TRIANGULAR,
*FLASH_OBJ_PTR_AT( A11 ) );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 59
Fighting the War on Parallel Programming Error Multithreaded Architectures
The FLAME runtime system “pre-executes” the code.
Whenever a routine is encountered, a pending task is annotated in a global task queue.
http://www.cs.utexas.edu/users/flame/ 60
Fighting the War on Parallel Programming Error Multithreaded Architectures
    ( A(0,0)      ?         ?       ...      ?       )
    ( A(1,0)    A(1,1)      ?       ...      ?       )
    ( A(2,0)    A(2,1)    A(2,2)    ...      ?       )
    (   ...       ...       ...     ...     ...      )
    ( A(M-1,0)  A(M-1,1)  A(M-1,2)  ...  A(M-1,N-1)  )
http://www.cs.utexas.edu/users/flame/ 61
Fighting the War on Parallel Programming Error Multithreaded Architectures
FLAME Parallelization: SuperMatrix
    ( A(0,0)    ?       ?     ... )                      FLA_Cholesky_unb( A(0,0) )
    ( A(1,0)  A(1,1)    ?     ... )    at runtime,       A(1,0) := A(1,0) tril( A(0,0) )^-T
    ( A(2,0)  A(2,1)  A(2,2)  ... )   --> build DAG:     A(2,0) := A(2,0) tril( A(0,0) )^-T
    (  ...      ...     ...   ... )                      ...
                                                         A(1,1) := A(1,1) - A(1,0) A(1,0)^T
                                                         ...
SuperMatrix
Once all tasks are entered in the DAG, the real execution begins!

Tasks with all input operands available are ready; other tasks must wait in the global queue.

Upon termination of a task, the corresponding thread updates the list of pending tasks.
http://www.cs.utexas.edu/users/flame/ 62
Fighting the War on Parallel Programming Error Multithreaded Architectures
Separation of concerns simplifies programming
Library code that can target many architectures.
Run-time system that can implement different schedulers fordifferent situations.
http://www.cs.utexas.edu/users/flame/ 63
Fighting the War on Parallel Programming Error Multithreaded Architectures
Who is this famous Texan?
UT-Texas must be the better, faster, more successful!
http://www.cs.utexas.edu/users/flame/ 64
Fighting the War on Parallel Programming Error Multithreaded Architectures
Target Architecture 1
4 socket 2.66 GHz Intel Dunnington - 24 cores
16MB shared L3 cache per socket
OpenMP Intel compiler 11.1
Intel MKL 11.1 (Windows), 10.2 (Linux)
http://www.cs.utexas.edu/users/flame/ 65
Fighting the War on Parallel Programming Error Multithreaded Architectures
Cholesky factorization (Linux)
http://www.cs.utexas.edu/users/flame/ 66
Fighting the War on Parallel Programming Error Multithreaded Architectures
Cholesky factorization (Windows)
http://www.cs.utexas.edu/users/flame/ 67
Fighting the War on Parallel Programming Error Multithreaded Architectures
LU factorization (Linux)
http://www.cs.utexas.edu/users/flame/ 68
Fighting the War on Parallel Programming Error Multithreaded Architectures
QR factorization (Linux)
http://www.cs.utexas.edu/users/flame/ 69
Fighting the War on Parallel Programming Error Multithreaded Architectures
Target Architecture 2
4 socket 2.3 GHz AMD Opteron Quad-Core
2MB shared L3 cache per socket
OpenMP Intel compiler 10.1
GotoBLAS2 1.00
http://www.cs.utexas.edu/users/flame/ 70
Fighting the War on Parallel Programming Error Multithreaded Architectures
LU factorization with pivoting
http://www.cs.utexas.edu/users/flame/ 71
Fighting the War on Parallel Programming Error Multithreaded Architectures
Related Approaches
Cilk (MIT), TBB (Intel) and SMPSs (Barcelona Supercomputing Center)

General-purpose parallel programming.
Cilk, TBB → irregular/recursive problems.
SMPSs → more general; also manages dependencies.

High-level language based on OpenMP-like pragmas + compiler + runtime system
Modest results for dense linear algebra
PLASMA Project
Next step in the LAPACK evolutionary path
Traditional style of implementing algorithms
Does not solve the programmability problem
Hierarchically Tiled Arrays
Abstraction for computing with matrices stored by blocks.

http://www.cs.utexas.edu/users/flame/ 72
Fighting the War on Parallel Programming Error Distributed Memory Parallel
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 73
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Didn’t we solve the problem in the 1990s?
ScaLAPACK (UTK/Berkeley)
Previous step in the LAPACK evolution.
Rooted in LAPACK, which is itself rooted in LINPACK (1970s).
PLAPACK (UT-Austin)
Object-based library.
Inspired the FLAME approach.
For very large problems on distributed memory clusters, theseshould suffice.
http://www.cs.utexas.edu/users/flame/ 74
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Renewed interest in distributed memory libraries
Intel’s SCC research processor
48 Pentium cores on one chip.
Connected via very fast on-chip communication buffers.
No cache-coherency protocol.
Purpose: to study the programmability problem for many-core architectures.
http://www.cs.utexas.edu/users/flame/ 75
Fighting the War on Parallel Programming Error Distributed Memory Parallel
A New Framework for Distributed Memory Dense Matrix Libraries
Elemental (Jack Poulson + Bryan Marker)
C++ coded in the style of FLAME/C
2D elemental cyclic matrix distribution.
Does NOT tie algorithmic block size to distribution block size.
ScaLAPACK
Fortran77 coded in the style of LAPACK.
2D block cyclic matrix distribution.
Ties algorithmic block size to distribution block size.
http://www.cs.utexas.edu/users/flame/ 76
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental: FLAME for distributed memory architectures
template<typename T>
void
Elemental::LAPACK::Internal::CholLVar3
( DistMatrix<T,MC,MR>& A )
{
const Grid& grid = A.GetGrid();
// Matrix views
DistMatrix<T,MC,MR>
ATL(grid), ATR(grid), A00(grid), A01(grid), A02(grid),
ABL(grid), ABR(grid), A10(grid), A11(grid), A12(grid),
A20(grid), A21(grid), A22(grid);
// Temporary matrix distributions
DistMatrix<T,Star,Star> A11_Star_Star(grid);
DistMatrix<T,VC, Star> A21_VC_Star(grid);
DistMatrix<T,MC, Star> A21_MC_Star(grid);
DistMatrix<T,MR, Star> A21_MR_Star(grid);
// Start the algorithm
http://www.cs.utexas.edu/users/flame/ 78
Fighting the War on Parallel Programming Error Distributed Memory Parallel
PartitionDownDiagonal( A, ATL, ATR,
ABL, ABR );
while( ABR.Height() > 0 )
{
RepartitionDownDiagonal( ATL, /**/ ATR, A00, /**/ A01, A02,
/*************/ /******************/
/**/ A10, /**/ A11, A12,
ABL, /**/ ABR, A20, /**/ A21, A22 );
A21_MC_Star.AlignWith( A22 );
A21_MR_Star.AlignWith( A22 );
//--------------------------------------------------------------------//
A11_Star_Star = A11;
LAPACK::Chol( Lower, A11_Star_Star.LocalMatrix() );
A11 = A11_Star_Star;
A21_VC_Star = A21;
BLAS::Trsm( Right, Lower, ConjugateTranspose, NonUnit,
(T)1, A11_Star_Star.LockedLocalMatrix(),
A21_VC_Star.LocalMatrix() );
A21_MC_Star = A21_VC_Star;
A21_MR_Star = A21_VC_Star;
BLAS::Internal::HerkLNUpdate( (T)-1, A21_MC_Star, A21_MR_Star, (T)1, A22 );
A21 = A21_MC_Star;
//--------------------------------------------------------------------//
A21_MC_Star.FreeConstraints();
A21_MR_Star.FreeConstraints();
SlidePartitionDownDiagonal( ATL, /**/ ATR, A00, A01, /**/ A02,
/**/ A10, A11, /**/ A12,
/*************/ /******************/
ABL, /**/ ABR, A20, A21, /**/ A22 );
}
}
http://www.cs.utexas.edu/users/flame/ 80
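Stripped of distribution and communication, the distributed routine above performs the familiar blocked right-looking Cholesky update sequence: A11 := Chol(A11), A21 := A21 inv(A11)^T (the Trsm), A22 := A22 - A21 A21^T (the Herk). A plain sequential sketch of that variant follows (a hypothetical helper, not Elemental's API; A is a symmetric positive-definite matrix stored as a full n x n array, overwritten by L):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

void CholLVar3( std::vector<std::vector<double>>& A, int nb )
{
    const int n = static_cast<int>( A.size() );
    for( int k = 0; k < n; k += nb )
    {
        const int e = ( k + nb < n ? k + nb : n );
        // A11 := Chol(A11): factor the diagonal block in place
        for( int j = k; j < e; ++j )
        {
            double s = A[j][j];
            for( int p = k; p < j; ++p ) s -= A[j][p] * A[j][p];
            A[j][j] = std::sqrt( s );
            for( int i = j + 1; i < e; ++i )
            {
                double t = A[i][j];
                for( int p = k; p < j; ++p ) t -= A[i][p] * A[j][p];
                A[i][j] = t / A[j][j];
            }
        }
        // A21 := A21 * inv(A11)^T  (the Trsm)
        for( int i = e; i < n; ++i )
            for( int j = k; j < e; ++j )
            {
                double t = A[i][j];
                for( int p = k; p < j; ++p ) t -= A[i][p] * A[j][p];
                A[i][j] = t / A[j][j];
            }
        // A22 := A22 - A21 * A21^T  (the Herk, lower triangle only)
        for( int i = e; i < n; ++i )
            for( int j = e; j <= i; ++j )
                for( int p = k; p < e; ++p )
                    A[i][j] -= A[i][p] * A[j][p];
    }
    // zero the strictly upper triangle so A now holds L
    for( int i = 0; i < n; ++i )
        for( int j = i + 1; j < n; ++j )
            A[i][j] = 0.0;
}
```

In the distributed version, each of these three updates becomes a redistribution (A11 to [*,*], A21 to [VC,*] then [MC,*] and [MR,*]) followed by purely local computation; the mathematics is unchanged.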
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Target Architecture 3
Total of 15 × 4 × 4 = 240 cores:
15 nodes (out of 3936 nodes)
4 socket 2.3 GHz AMD Opteron Quad-Core
2MB shared L3 cache per socket
Full-CLOS InfiniBand, 1Gb/sec
MVAPICH2 Release 1.2
GotoBLAS 1.30
http://www.cs.utexas.edu/users/flame/ 81
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental GEMM, 240 cores
http://www.cs.utexas.edu/users/flame/ 82
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental Cholesky, 240 cores
http://www.cs.utexas.edu/users/flame/ 83
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental LU with partial pivoting, 240 cores
http://www.cs.utexas.edu/users/flame/ 84
Fighting the War on Parallel Programming Error Distributed Memory Parallel
An Exercise in Portability: Elemental → SCC
Jan 11 - Tim Mattson: We will send the emulator
Jan 31 - Bryan Marker: I'm almost ready for you to test the stationary C Gemm (NN) implementation in [Elemental] on an actual [SCC] board.
Feb 3 - Bryan Marker: Alright gentlemen, I have two test programs for Gemm C NN (one I created and one by Jack).
Feb 4 - Bryan Marker: FYI, I have the Cholesky variant 2 ported [to the emulator]. Very easy fix to avoid SendRecv.
(Many weeks of no progress while everyone was busy with other things)
March 18 - Rob van der Wijngaart (Intel): Good news, [...] the app is running on SCC as we speak. Some of the tests inside the app are reporting failures, but these can now be debugged.
March 18 - Bryan Marker: Some tests are expected to fail because they require SendRecv, which isn't in the old code you have.
http://www.cs.utexas.edu/users/flame/ 85
Fighting the War on Parallel Programming Error Distributed Memory Parallel
What was required?
Replace MPI layer with Intel's experimental RCCE communication layer (Bryan Marker).
Write a few collective communication routines for RCCE(Ernie Chan).
Important: Great confidence in the implementation.
Note: SuperMatrix port to SCC is almost complete.
We are eagerly waiting for performance results.
http://www.cs.utexas.edu/users/flame/ 86
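The collectives written on top of RCCE's point-to-point layer are of the classic kind built from send/recv rounds. As a hedged illustration (not Ernie Chan's actual RCCE code), the following sketches a ring AllGather: in each of p-1 rounds every rank sends one block to its right neighbor and receives one from its left. The rounds are simulated in-process here; on SCC each round would be one RCCE send/recv pair per rank.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::vector<std::string>>
RingAllGather( const std::vector<std::string>& contrib )
{
    const int p = static_cast<int>( contrib.size() );
    // buf[r][k] will hold rank k's contribution once it reaches rank r
    std::vector<std::vector<std::string>> buf( p, std::vector<std::string>( p ) );
    for( int r = 0; r < p; ++r )
        buf[r][r] = contrib[r];
    for( int step = 0; step < p - 1; ++step )
    {
        // every rank forwards the block it received in the previous round
        std::vector<std::string> inFlight( p );
        for( int r = 0; r < p; ++r )
            inFlight[r] = buf[r][ ( ( r - step ) % p + p ) % p ];
        // "send" to the right neighbor; it stores the block under the
        // originating rank's index
        for( int r = 0; r < p; ++r )
            buf[ ( r + 1 ) % p ][ ( ( r - step ) % p + p ) % p ] = inFlight[r];
    }
    return buf;
}
```

After p-1 rounds every rank holds all p contributions, which is exactly the contract the higher-level Elemental code relies on, independent of whether MPI or RCCE sits underneath.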
Other Things I Could Talk About
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 87
Other Things I Could Talk About
FLAME/C + GPU
SuperMatrix + Out-of-Core
SuperMatrix + GPU
SuperMatrix + MultiGPU
SuperMatrix + Out-of-Core + MultiGPU
PLAPACK + GPU
New algorithms for algorithms-by-blocks
Weapons of Math Induction for the War on Numerical Error Analysis
Weapons of Math Induction for iterative methods
Mechanical derivation of algorithms
Mechanical translation of FLAME/C code to lower level code
libflame, the library
http://www.cs.utexas.edu/users/flame/ 88
How Do I Get to Use All This?
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 89
How Do I Get to Use All This?
Available as a Professionally Maintained Library
libflame Version 4.0 - Feb. 2010: http://www.cs.utexas.edu/users/flame/
Functionality that is a considerable subset of LAPACK
LAPACK compatibility layer
Linux and Windows OS
Field G. Van Zee. libflame: The Complete Reference. www.lulu.com, 2009.
Elemental: http://code.google.com/p/elemental/ (soon to be incorporated in libflame)
http://www.cs.utexas.edu/users/flame/ 90
Conclusion
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 91
Conclusion
“How do we convince people that in programming simplicity and clarity – in short: what mathematicians call ”elegance” – are not a dispensable luxury, but a crucial matter that decides between success and failure?”
– Dijkstra
http://www.cs.utexas.edu/users/flame/ 92
Conclusion
A Success Story
Practical application of “goal-oriented programming”
For the domain of dense linear algebra libraries, FLAME+SuperMatrix appears to solve the programmability problem for sequential and multicore
For the domain of distributed memory dense linear algebra libraries, Elemental appears to solve the programmability problem for clusters and many-core
http://www.cs.utexas.edu/users/flame/
http://www.cs.utexas.edu/users/flame/ 93
Conclusion
What is next?
My favorite definition of science:“Knowledge that has been reduced to a system”
How can one represent knowledge about linear algebra algorithms?
How can one systematically perform architecture-specific transformations with this knowledge?
Don’t code the library. Encode the expert knowledge.
http://www.cs.utexas.edu/users/flame/ 94
Conclusion
Want to learn more?
http://www.cs.utexas.edu/users/flame/publications
http://www.cs.utexas.edu/users/flame/ 95
Conclusion
Questions?
http://www.cs.utexas.edu/users/flame/ 96
Conclusion
Will this be embraced?
Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.
– Dijkstra
http://www.cs.utexas.edu/users/flame/ 97