Weapons of Math Induction for the War on Parallel Programming Error
Robert A. van de Geijn
Department of Computer Science
Institute for Computational Engineering and Sciences
The University of Texas at Austin
ICES – Sept, 2010
http://www.cs.utexas.edu/users/flame/ 1
Outline
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
   Multithreaded Architectures
   Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
Introduction
The Team
UT-Austin
Faculty/Staff: Ernie Chan, Victor Eijkhout, Maggie Myers, Andy Terrel, Robert van de Geijn, Field Van Zee
Graduate Students: Bryan Marker, Kyungjoo Kim, Isaac Lee, Ardavan Pedram, Jack Poulson, Martin Schatz
Undergrads: Burns Healy, Eileen Martin, Jon Monette, Tyler Rhodes, Richard Veras, Nick Wiz
Univ. Jaume I, Spain
Faculty: Gregorio Quintana-Ortí, Enrique Quintana-Ortí, Mercedes Marques
Graduate Students: Manuel Fogue, Francisco D. Igual
RWTH Aachen
Faculty: Paolo Bientinesi
Sponsors
UT-Austin
Numerous NSF Grants, Microsoft, Intel
Univ. Jaume I
Ministerio de Ciencia e Innovación, Clearspeed, Microsoft, Nvidia
Who is this famous (former) Texan?
“I mean, if 10 years from now, when you are doing something quick and dirty, you suddenly visualize that I am looking over your shoulders and say to yourself ‘Dijkstra would not have liked this’, well, that would be enough immortality for me.”
– Dijkstra
“Literature professors read each other’s books. Why don’t computer science professors read each other’s programs?”
– Tim Mattson
Why dense linear algebra libraries?
Widely used in scientific computing
Well-defined domain
Thought to be well-understood
Interesting case study
LAPACK Cholesky factorization (dpotrf)
      DO 20 J = 1, N, NB
*
*        Update and factorize the current diagonal block and test
*        for non-positive-definiteness.
*
         JB = MIN( NB, N-J+1 )
         CALL DSYRK( 'Lower', 'No transpose', JB, J-1, -ONE,
     $               A( J, 1 ), LDA, ONE, A( J, J ), LDA )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 )
     $      GO TO 30
         IF( J+JB.LE.N ) THEN
*
*           Compute the current block column.
*
            CALL DGEMM( 'No transpose', 'Transpose', N-J-JB+1, JB,
     $                  J-1, -ONE, A( J+JB, 1 ), LDA, A( J, 1 ),
     $                  LDA, ONE, A( J+JB, J ), LDA )
            CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                  N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $                  A( J+JB, J ), LDA )
         END IF
   20 CONTINUE
      <deleted code>
      GO TO 40
*
   30 CONTINUE
      INFO = INFO + J - 1
*
   40 CONTINUE
      RETURN
LAPACK
Fortran-77 codes
One routine (algorithm) per operation in the library
Storage in column major order
Parallelism extracted from calls to multithreaded BLAS
Extracting parallelism this way increases synchronization and thus limits performance
Column major order hurts data locality
LAPACK does not use modern coding techniques
The sky is falling
Evolution vs intelligent design
Parallelism is thrust upon the masses
Popular libraries like LAPACK must be completely rewritten (end of an evolutionary path)
Great, let’s start over
Cheaper than trying to evolve?
FLAME
Notation for expressing algorithms
Systematic derivation procedure
Families of algorithms for each operation
APIs to transform algorithms into codes
Storage and algorithm are independent
Storage-by-blocks
Parallelism with data dependencies
High performance even on “exotic” architectures like multi-GPU systems
A new distributed memory library for massively parallel clusters and clusters-on-a-chip
General lesson:
This crisis is an opportunity to completely rethink your code.
To keep you interested ...
[Figure: Performance of the Cholesky factorization on GPU/CPU. y-axis: GFLOPS (0 to 700); x-axis: matrix size (0 to 20000). Curves: MKL 10.0 spotrf on two Intel Xeon QuadCore (2.2 GHz); algorithm-by-blocks on Tesla S870; algorithm-by-blocks on Tesla S1070.]
Notation
A Motivating Example: The Cholesky Factorization
Given an n × n symmetric positive definite matrix A, compute

A = L · L^T,

where L is an n × n lower triangular matrix
The Cholesky Factorization: On the Whiteboard
[Whiteboard picture: the top-left part of A is done; the rest is partially updated. Expose the current diagonal element α11, the subdiagonal column a21, and the trailing submatrix A22, update

α11 := √α11
a21 := a21/α11
A22 := A22 − a21 a21^T

then move the boundary forward and repeat.]
FLAME Notation
[Picture: the ATL quadrant of A is done; ABR is partially updated. The thick lines expose α11, a12^T, a21, and A22.]

Repartition

( ATL, ATR ; ABL, ABR ) → ( A00, a01, A02 ; a10^T, α11, a12^T ; A20, a21, A22 )

where α11 is a scalar
Algorithm: [A] := Chol_unb(A)

Partition A → ( ATL, ATR ; ABL, ABR ) where ATL is 0 × 0
while n(ABR) ≠ 0 do
  Repartition
    ( ATL, ATR ; ABL, ABR ) → ( A00, a01, A02 ; a10^T, α11, a12^T ; A20, a21, A22 )
    where α11 is a scalar

  α11 := √α11
  a21 := a21/α11
  A22 := A22 − a21 a21^T   (syr)

  Continue with
    ( ATL, ATR ; ABL, ABR ) ← ( A00, a01, A02 ; a10^T, α11, a12^T ; A20, a21, A22 )
endwhile
General lesson:
Algorithms should be represented in a way that captures how we reason about them.
Deriving Algorithms to be Correct
Family Values
“The only effective way to raise the confidence level of a program significantly is to give a convincing proof of its correctness. But one should not first make the program and then prove its correctness, because then the requirement of providing the proof would only increase the poor programmer’s burden. On the contrary: the programmer should let correctness proof and program grow hand in hand.”
– Dijkstra
The Worksheet: A Weapon of Math Induction
Step 1: Precondition and postcondition
Precondition: A = Â

Note: Â indicates the contents of A upon entry. We use this dummy variable to be able to reason about the contents of matrix A as it is being overwritten by its Cholesky factor.

Postcondition: A = L ∧ Â = L L^T

Note: this indicates that upon completion A must contain the Cholesky factor of the original matrix.
The worksheet skeleton (Step | Annotated Algorithm: A := Chol_unb_var3(A)):

1a   { A = Â }
4    (initialization)
2    { loop-invariant }
3    while m(ATL) < m(A) do
2,3    { loop-invariant ∧ loop-guard }
5a     (repartition)
6      { state before the update }
8      (the update)
5b     (continue with)
7      { state after the update }
2      { loop-invariant }
     endwhile
2,3  { loop-invariant ∧ ¬(loop-guard) }
1b   { A = L ∧ Â = L L^T }
Step 2: Finding Loop-Invariants
Partition the operands

A → ( ATL, ⋆ ; ABL, ABR )  and  L → ( LTL, 0 ; LBL, LBR )

Plug into the postcondition A = L ∧ Â = L L^T:

( ATL, ⋆ ; ABL, ABR ) = ( LTL, 0 ; LBL, LBR )
∧ ( ÂTL, ⋆ ; ÂBL, ÂBR ) = ( LTL, 0 ; LBL, LBR ) ( LTL, 0 ; LBL, LBR )^T
Determine the loop-invariants
( ATL, ⋆ ; ABL, ABR ) = ( LTL, 0 ; LBL, LBR )
∧ ( ÂTL, ⋆ ; ÂBL, ÂBR ) = ( LTL LTL^T, ⋆ ; LBL LTL^T, LBL LBL^T + LBR LBR^T )

Loop-invariant 1:
( ATL, ⋆ ; ABL, ABR ) = ( LTL, ⋆ ; ÂBL, ÂBR )  ∧  ÂTL = LTL LTL^T

Loop-invariant 2:
( ATL, ⋆ ; ABL, ABR ) = ( LTL, ⋆ ; LBL, ÂBR )  ∧  ( ÂTL ; ÂBL ) = ( LTL LTL^T ; LBL LTL^T )

Loop-invariant 3:
( ATL, ⋆ ; ABL, ABR ) = ( LTL, ⋆ ; LBL, ÂBR − LBL LBL^T )  ∧  ( ÂTL ; ÂBL ) = ( LTL LTL^T ; LBL LTL^T )
Step 2: Enter loop-invariant in worksheet
Loop-invariant 3 is entered at every Step 2 (and 2,3) position of the worksheet: before the loop, at the top of the loop body (conjoined with the loop-guard), at the bottom of the loop body, and after the loop (conjoined with the negation of the loop-guard):

{ ( ATL, ⋆ ; ABL, ABR ) = ( LTL, ⋆ ; LBL, ÂBR − LBL LBL^T )  ∧  ( ÂTL ; ÂBL ) = ( LTL LTL^T ; LBL LTL^T ) }
Why a Weapon of Math Induction?
Step 3: Finding the Loop-Guard
The loop-guard is chosen so that the loop-invariant together with the negation of the guard implies the postcondition. Here the guard is m(ATL) < m(A): when it becomes false, ATL is all of A, ABR is empty, and the invariant reduces to A = L ∧ Â = L L^T. In the worksheet:

3    while m(ATL) < m(A) do
2,3    { loop-invariant ∧ m(ATL) < m(A) }
       ...
     endwhile
2,3  { loop-invariant ∧ ¬( m(ATL) < m(A) ) }
1b   { A = L ∧ Â = L L^T }
Step 4: Finding the Initialization
The initialization

Partition A → ( ATL, ⋆ ; ABL, ABR ),  L → ( LTL, 0 ; LBL, LBR )  where ATL and LTL are 0 × 0

must establish the loop-invariant before the loop. With ATL of size 0 × 0, every condition in the invariant holds trivially.
Step 5: Marching through the Matrix

At the top of the loop, repartition

( ATL, ⋆ ; ABL, ABR ) → ( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 ),
( LTL, 0 ; LBL, LBR ) → ( L00, 0, 0 ; l10^T, λ11, 0 ; L20, l21, L22 )

where α11 and λ11 are scalars. At the bottom of the loop, continue with the exposed parts absorbed back into the top-left quadrants, so that ATL grows by one row and one column each iteration.
Step 6: State Before the Update
Substituting the repartitioned operands into the loop-invariant gives the state that holds before the update:

( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 )
  = ( L00, ⋆, ⋆ ; l10^T, α̂11 − l10^T l10, ⋆ ; L20, â21 − L20 l10, Â22 − L20 L20^T )
∧ ( Â00 ; â10^T ; Â20 ) = ( L00 L00^T ; l10^T L00^T ; L20 L00^T )
Step 7: State After the Update

Similarly, substituting the "Continue with" partitioning into the loop-invariant gives the state that must hold after the update:

( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 )
  = ( L00, ⋆, ⋆ ; l10^T, λ11, ⋆ ; L20, l21, Â22 − L20 L20^T − l21 l21^T )
∧ ( Â00, ⋆ ; â10^T, α̂11 ; Â20, â21 )
  = ( L00 L00^T, ⋆ ; l10^T L00^T, l10^T l10 + λ11^2 ; L20 L00^T, L20 l10 + l21 λ11 )
Step 8: The Update

Comparing the state before the update (Step 6) with the state that must hold after it (Step 7) shows that the loop body must compute

α11 := √α11
a21 := a21/α11
A22 := A22 − a21 a21^T

after which α11 holds λ11, a21 holds l21, and A22 holds Â22 − L20 L20^T − l21 l21^T.
The Algorithm
Algorithm: A := Chol_unb_var3(A)

Partition A → ( ATL, ⋆ ; ABL, ABR ) where ATL is 0 × 0
while m(ATL) < m(A) do
  Repartition
    ( ATL, ⋆ ; ABL, ABR ) → ( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 )
    where α11 is a scalar

  α11 := √α11
  a21 := a21/α11
  A22 := A22 − a21 a21^T

  Continue with
    ( ATL, ⋆ ; ABL, ABR ) ← ( A00, ⋆, ⋆ ; a10^T, α11, ⋆ ; A20, a21, A22 )
endwhile
Having families of correct algorithms is good
Don’t necessarily start with the legacy implementation of the “usual” algorithm. It may not parallelize well.
Find all (most) algorithms and pick the best for the target architecture.
In our case, we can systematically generate all (loop-based) algorithms.
Is the methodology just a theoretical curiosity?
Broad applicability to all operations supported by LAPACK
Not yet: eigensolvers, SVD.
The methodology is sufficiently systematic that it has been automated (with Mathematica).
Paolo Bientinesi. “Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms.” Dissertation, UT-Austin, 2006.
Recently generalized to the derivation of Krylov subspace methods.
Victor Eijkhout, Paolo Bientinesi, and Robert van de Geijn. “Toward Mechanical Derivation of Krylov Solver Libraries.” ICCS, 2010.
Extended to systematic derivation of numerical stability analysis.
Paolo Bientinesi and Robert A. van de Geijn. “A Goal-Oriented and Modular Approach to Stability Analysis.” SIMAX. Conditionally accepted.
How does this apply to parallel programming?
Writing correct parallel code is difficult.
We derive our (for now sequential) algorithms to be correct.
It is important to choose from a family of algorithms.
Choose an algorithm that parallelizes well.
http://www.cs.utexas.edu/users/flame/ 43
From Correct Algorithm to Correct Code
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 44
From Correct Algorithm to Correct Code
FLAME/C Code
Repartition

    ( ATL  ATR )        ( A00    a01      A02   )
    (          )  -->   ( a10^T  alpha11  a12^T )
    ( ABL  ABR )        ( A20    a21      A22   )

where alpha11 is a scalar
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &a01, &A02,
/* ************** */ /* *************************** */
&a10t, /**/ &alpha11, &a12t,
ABL, /**/ ABR, &A20, /**/ &a21, &A22,
1, 1, FLA_BR );
http://www.cs.utexas.edu/users/flame/ 45
From Correct Algorithm to Correct Code
(Unblocked) FLAME/C Code
int FLA_Cholesky_unb( FLA_Obj A )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &a01, &A02,
/* ************* */ /* ************************** */
&a10t, /**/ &alpha11, &a12t,
ABL, /**/ ABR, &A20, /**/ &a21, &A22,
1, 1, FLA_BR );
/*------------------------------------------------------------*/
FLA_Sqrt ( alpha11 ); /* alpha11 := sqrt( alpha11 ) */
FLA_Inv_Scal( alpha11, a21 ); /* a21 := a21 / alpha11 */
FLA_Syr ( FLA_LOWER_TRIANGULAR,
FLA_MINUS_ONE,
a21, A22 ); /* A22 := A22 - a21 * a21t */
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 46
Achieving High Performance
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 47
Achieving High Performance
Who is this famous (former) Texan?
http://www.cs.utexas.edu/users/flame/ 48
Achieving High Performance
Who is this famous Texan? Kazushige Goto (TACC)
http://www.cs.utexas.edu/users/flame/ 49
Achieving High Performance
High-Performance Matrix-Matrix Multiplication
Why is matrix-matrix multiplication (gemm) so important?
O(n³) computation on O(n²) data.
Allows data movement between RAM and cache to be hidden.
Can achieve extremely high performance (up to 99% of peak on some architectures).

Required reading (shameless self-promotion):
Kazushige Goto and Robert A. van de Geijn. "Anatomy of High-Performance Matrix Multiplication," ACM Transactions on Mathematical Software, 34(3): Article 12, 25 pages, May 2008.

Use the method to derive blocked algorithms that cast more computation in terms of gemm.
http://www.cs.utexas.edu/users/flame/ 50
Achieving High Performance
(Unblocked) FLAME/C Code (Again)
int FLA_Cholesky_unb( FLA_Obj A )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &a01, &A02,
/* ************* */ /* ************************** */
&a10t, /**/ &alpha11, &a12t,
ABL, /**/ ABR, &A20, /**/ &a21, &A22,
1, 1, FLA_BR );
/*------------------------------------------------------------*/
FLA_Sqrt ( alpha11 ); /* alpha11 := sqrt( alpha11 ) */
FLA_Inv_Scal( alpha11, a21 ); /* a21 := a21 / alpha11 */
FLA_Syr ( FLA_LOWER_TRIANGULAR,
FLA_MINUS_ONE,
a21, A22 ); /* A22 := A22 - a21 * a21t */
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 51
Achieving High Performance
Blocked FLAME/C Code
int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
b = min( FLA_Obj_length( ABR ), nb_alg );
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &A01, &A02,
/* ************* */ /* ******************** */
&A10, /**/ &A11, &A12,
ABL, /**/ ABR, &A20, /**/ &A21, &A22,
b, b, FLA_BR );
/*------------------------------------------------------------*/
FLA_Cholesky_unb( A11 ); /* A11 := Cholesky( A11 ) */
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11,
A21 ); /* A21 := A21 * inv( A11 )’*/
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, A22 ); /* A22 := A22 - A21 * A21’ */
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 52
Fighting the War on Parallel Programming Error
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 53
Fighting the War on Parallel Programming Error
“When we had no computers, we had no programming problemeither. When we had a few computers, we had a mildprogramming problem. Confronted with machines a million timesas powerful, we are faced with a gigantic programming problem.”
– Dijkstra
http://www.cs.utexas.edu/users/flame/ 54
Fighting the War on Parallel Programming Error Multithreaded Architectures
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 55
Fighting the War on Parallel Programming Error Multithreaded Architectures
LAPACK parallelization: multithreaded BLAS
    ( A11   ?  )
    ( A21  A22 ),   where A11 is b × b
Pro?:
Evolve legacy code
Con:
Continue to code in the LINPACK style (1970s).
Each call to the BLAS (compute kernels) is a synchronization point for threads.
As the number of threads increases, serial operations with cost O(nb²) or O(b³) are no longer negligible compared with O(n²b).
http://www.cs.utexas.edu/users/flame/ 56
Fighting the War on Parallel Programming Error Multithreaded Architectures
Of algorithms-by-blocks and runtime systems
Improve parallelism and data locality: algorithms-by-blocks
Matrix of matrix blocks.
Matrix blocks as the unit of data.
Computation with matrix blocks as the unit of computation.
Execute sequential code to generate DAG of tasks
SuperMatrix
Runtime system for scheduling tasks to threads
Sequential kernels to be executed by the threads
Always be sure to make the machine-specific part someone else's problem.
SuperMatrix is part of Ernie Chan’s dissertation work.
http://www.cs.utexas.edu/users/flame/ 57
Fighting the War on Parallel Programming Error Multithreaded Architectures
A =

    ( A(0,0)      ?         ?       ...      ?       )
    ( A(1,0)    A(1,1)      ?       ...      ?       )
    ( A(2,0)    A(2,1)    A(2,2)    ...      ?       )
    (   ...       ...       ...     ...     ...      )
    ( A(M-1,0)  A(M-1,1)  A(M-1,2)  ...  A(M-1,N-1)  )
http://www.cs.utexas.edu/users/flame/ 58
Fighting the War on Parallel Programming Error Multithreaded Architectures
Algorithm-by-blocks implementation: (almost) no change
int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
b = min( FLA_Obj_length( ABR ), nb_alg );
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &A01, &A02,
/* ************* */ /* ******************** */
&A10, /**/ &A11, &A12,
ABL, /**/ ABR, &A20, /**/ &A21, &A22,
1, 1, FLA_BR );
/*------------------------------------------------------------*/
FLA_Chol( FLA_LOWER_TRIANGULAR,
*FLASH_OBJ_PTR_AT( A11 ) );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 59
Fighting the War on Parallel Programming Error Multithreaded Architectures
The FLAME runtime system “pre-executes” the code.
Whenever a routine is encountered, a pending task is annotated in a global task queue.
http://www.cs.utexas.edu/users/flame/ 60
Fighting the War on Parallel Programming Error Multithreaded Architectures
    ( A(0,0)      ?         ?       ...      ?       )
    ( A(1,0)    A(1,1)      ?       ...      ?       )
    ( A(2,0)    A(2,1)    A(2,2)    ...      ?       )
    (   ...       ...       ...     ...     ...      )
    ( A(M-1,0)  A(M-1,1)  A(M-1,2)  ...  A(M-1,N-1)  )
http://www.cs.utexas.edu/users/flame/ 61
Fighting the War on Parallel Programming Error Multithreaded Architectures
FLAME Parallelization: SuperMatrix
    ( A(0,0)    ?       ?     ... )                      FLA_Cholesky_unb( A(0,0) )
    ( A(1,0)  A(1,1)    ?     ... )    at runtime,       A(1,0) := A(1,0) tril( A(0,0) )^-T
    ( A(2,0)  A(2,1)  A(2,2)  ... )   --> build DAG:     A(2,0) := A(2,0) tril( A(0,0) )^-T
    (  ...      ...     ...   ... )                      ...
                                                         A(1,1) := A(1,1) - A(1,0) A(1,0)^T
                                                         ...
SuperMatrix
Once all tasks are entered in the DAG, the real execution begins!

Tasks with all input operands available are ready; other tasks must wait in the global queue.

Upon termination of a task, the corresponding thread updates the list of pending tasks.
http://www.cs.utexas.edu/users/flame/ 62
Fighting the War on Parallel Programming Error Multithreaded Architectures
Separation of concerns simplifies programming
Library code that can target many architectures.
Run-time system that can implement different schedulers fordifferent situations.
http://www.cs.utexas.edu/users/flame/ 63
Fighting the War on Parallel Programming Error Multithreaded Architectures
Who is this famous Texan?
UT-Texas must be the better, faster, more successful!
http://www.cs.utexas.edu/users/flame/ 64
Fighting the War on Parallel Programming Error Multithreaded Architectures
Target Architecture 1
4 socket 2.66 GHz Intel Dunnington - 24 cores
16MB shared L3 cache per socket
OpenMP Intel compiler 11.1
Intel MKL 11.1 (Windows), 10.2 (Linux)
http://www.cs.utexas.edu/users/flame/ 65
Fighting the War on Parallel Programming Error Multithreaded Architectures
Cholesky factorization (Linux)
http://www.cs.utexas.edu/users/flame/ 66
Fighting the War on Parallel Programming Error Multithreaded Architectures
Cholesky factorization (Windows)
http://www.cs.utexas.edu/users/flame/ 67
Fighting the War on Parallel Programming Error Multithreaded Architectures
LU factorization (Linux)
http://www.cs.utexas.edu/users/flame/ 68
Fighting the War on Parallel Programming Error Multithreaded Architectures
QR factorization (Linux)
http://www.cs.utexas.edu/users/flame/ 69
Fighting the War on Parallel Programming Error Multithreaded Architectures
Target Architecture 2
4 socket 2.3 GHz AMD Opteron Quad-Core
2MB shared L3 cache per socket
OpenMP Intel compiler 10.1
GotoBLAS2 1.00
http://www.cs.utexas.edu/users/flame/ 70
Fighting the War on Parallel Programming Error Multithreaded Architectures
LU factorization with pivoting
http://www.cs.utexas.edu/users/flame/ 71
Fighting the War on Parallel Programming Error Multithreaded Architectures
Related Approaches
Cilk (MIT), TBB (Intel) and SMPSs (Barcelona Supercomputing Center)

General-purpose parallel programming.
Cilk, TBB → irregular/recursive problems.
SMPSs → more general; also manages dependencies.

High-level language based on OpenMP-like pragmas + compiler + runtime system
Modest results for dense linear algebra
PLASMA Project
Next step in the LAPACK evolutionary path
Traditional style of implementing algorithms
Does not solve the programmability problem
Hierarchically Tiled Arrays
Abstraction for computing with matrices stored by blocks.

http://www.cs.utexas.edu/users/flame/ 72
Fighting the War on Parallel Programming Error Distributed Memory Parallel
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 73
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Didn’t we solve the problem in the 1990s?
ScaLAPACK (UTK/Berkeley)
Previous step in the LAPACK evolution.
Rooted in LAPACK, which is itself rooted in LINPACK (1970s).
PLAPACK (UT-Austin)
Object-based library.
Inspired the FLAME approach.
For very large problems on distributed memory clusters, theseshould suffice.
http://www.cs.utexas.edu/users/flame/ 74
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Renewed interest in distributed memory libraries
Intel’s SCC research processor
48 Pentium cores on one chip.
Connected via very fast on-chip communication buffers.
No cache-coherency protocol.
Purpose: to study the programmability problem for many-core architectures.
http://www.cs.utexas.edu/users/flame/ 75
Fighting the War on Parallel Programming Error Distributed Memory Parallel
A New Framework for Distributed Memory Dense Matrix Libraries
Elemental (Jack Poulson + Bryan Marker)
C++ coded in the style of FLAME/C
2D elemental cyclic matrix distribution.
Does NOT tie algorithmic block size to distribution block size.
ScaLAPACK
Fortran77 coded in the style of LAPACK.
2D block cyclic matrix distribution.
Ties algorithmic block size to distribution block size.
http://www.cs.utexas.edu/users/flame/ 76
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental: FLAME for distributed memory architectures
template<typename T>
void
Elemental::LAPACK::Internal::CholLVar3
( DistMatrix<T,MC,MR>& A )
{
const Grid& grid = A.GetGrid();
// Matrix views
DistMatrix<T,MC,MR>
ATL(grid), ATR(grid), A00(grid), A01(grid), A02(grid),
ABL(grid), ABR(grid), A10(grid), A11(grid), A12(grid),
A20(grid), A21(grid), A22(grid);
// Temporary matrix distributions
DistMatrix<T,Star,Star> A11_Star_Star(grid);
DistMatrix<T,VC, Star> A21_VC_Star(grid);
DistMatrix<T,MC, Star> A21_MC_Star(grid);
DistMatrix<T,MR, Star> A21_MR_Star(grid);
// Start the algorithm
http://www.cs.utexas.edu/users/flame/ 78
Fighting the War on Parallel Programming Error Distributed Memory Parallel
PartitionDownDiagonal( A, ATL, ATR,
ABL, ABR );
while( ABR.Height() > 0 )
{
RepartitionDownDiagonal( ATL, /**/ ATR, A00, /**/ A01, A02,
/*************/ /******************/
/**/ A10, /**/ A11, A12,
ABL, /**/ ABR, A20, /**/ A21, A22 );
A21_MC_Star.AlignWith( A22 );
A21_MR_Star.AlignWith( A22 );
//--------------------------------------------------------------------//
A11_Star_Star = A11;
LAPACK::Chol( Lower, A11_Star_Star.LocalMatrix() );
A11 = A11_Star_Star;
A21_VC_Star = A21;
BLAS::Trsm( Right, Lower, ConjugateTranspose, NonUnit,
(T)1, A11_Star_Star.LockedLocalMatrix(),
A21_VC_Star.LocalMatrix() );
A21_MC_Star = A21_VC_Star;
A21_MR_Star = A21_VC_Star;
BLAS::Internal::HerkLNUpdate( (T)-1, A21_MC_Star, A21_MR_Star, (T)1, A22 );
A21 = A21_MC_Star;
//--------------------------------------------------------------------//
A21_MC_Star.FreeConstraints();
A21_MR_Star.FreeConstraints();
SlidePartitionDownDiagonal( ATL, /**/ ATR, A00, A01, /**/ A02,
/**/ A10, A11, /**/ A12,
/*************/ /******************/
ABL, /**/ ABR, A20, A21, /**/ A22 );
}
}
http://www.cs.utexas.edu/users/flame/ 80
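Stripped of distribution and communication, the distributed routine above performs the familiar blocked right-looking Cholesky update sequence: A11 := Chol(A11), A21 := A21 inv(A11)^T (the Trsm), A22 := A22 - A21 A21^T (the Herk). A plain sequential sketch of that variant follows (a hypothetical helper, not Elemental's API; A is a symmetric positive-definite matrix stored as a full n x n array, overwritten by L):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

void CholLVar3( std::vector<std::vector<double>>& A, int nb )
{
    const int n = static_cast<int>( A.size() );
    for( int k = 0; k < n; k += nb )
    {
        const int e = ( k + nb < n ? k + nb : n );
        // A11 := Chol(A11): factor the diagonal block in place
        for( int j = k; j < e; ++j )
        {
            double s = A[j][j];
            for( int p = k; p < j; ++p ) s -= A[j][p] * A[j][p];
            A[j][j] = std::sqrt( s );
            for( int i = j + 1; i < e; ++i )
            {
                double t = A[i][j];
                for( int p = k; p < j; ++p ) t -= A[i][p] * A[j][p];
                A[i][j] = t / A[j][j];
            }
        }
        // A21 := A21 * inv(A11)^T  (the Trsm)
        for( int i = e; i < n; ++i )
            for( int j = k; j < e; ++j )
            {
                double t = A[i][j];
                for( int p = k; p < j; ++p ) t -= A[i][p] * A[j][p];
                A[i][j] = t / A[j][j];
            }
        // A22 := A22 - A21 * A21^T  (the Herk, lower triangle only)
        for( int i = e; i < n; ++i )
            for( int j = e; j <= i; ++j )
                for( int p = k; p < e; ++p )
                    A[i][j] -= A[i][p] * A[j][p];
    }
    // zero the strictly upper triangle so A now holds L
    for( int i = 0; i < n; ++i )
        for( int j = i + 1; j < n; ++j )
            A[i][j] = 0.0;
}
```

In the distributed version, each of these three updates becomes a redistribution (A11 to [*,*], A21 to [VC,*] then [MC,*] and [MR,*]) followed by purely local computation; the mathematics is unchanged.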
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Target Architecture 3
Total of 15 × 4 × 4 = 240 cores:
15 nodes (out of 3936 nodes)
4 socket 2.3 GHz AMD Opteron Quad-Core
2MB shared L3 cache per socket
Full-CLOS InfiniBand, 1Gb/sec
MVAPICH2 Release 1.2
GotoBLAS 1.30
http://www.cs.utexas.edu/users/flame/ 81
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental GEMM, 240 cores
http://www.cs.utexas.edu/users/flame/ 82
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental Cholesky, 240 cores
http://www.cs.utexas.edu/users/flame/ 83
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental LU with partial pivoting, 240 cores
http://www.cs.utexas.edu/users/flame/ 84
Fighting the War on Parallel Programming Error Distributed Memory Parallel
An Exercise in Portability: Elemental → SCC
Jan 11 - Tim Mattson: We will send the emulator
Jan 31 - Bryan Marker: I'm almost ready for you to test the stationary C Gemm (NN) implementation in [Elemental] on an actual [SCC] board.
Feb 3 - Bryan Marker: Alright gentlemen, I have two test programs for Gemm C NN (one I created and one by Jack).
Feb 4 - Bryan Marker: FYI, I have the Cholesky variant 2 ported [to the emulator]. Very easy fix to avoid SendRecv.
(Many weeks of no progress while everyone was busy with other things)
March 18 - Rob van der Wijngaart (Intel): Good news, [...] the app is running on SCC as we speak. Some of the tests inside the app are reporting failures, but these can now be debugged.
March 18 - Bryan Marker: Some tests are expected to fail because they require SendRecv, which isn't in the old code you have.
http://www.cs.utexas.edu/users/flame/ 85
Fighting the War on Parallel Programming Error Distributed Memory Parallel
What was required?
Replace MPI layer with Intel's experimental RCCE communication layer (Bryan Marker).
Write a few collective communication routines for RCCE(Ernie Chan).
Important: Great confidence in the implementation.
Note: SuperMatrix port to SCC is almost complete.
We are eagerly waiting for performance results.
http://www.cs.utexas.edu/users/flame/ 86
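The collectives written on top of RCCE's point-to-point layer are of the classic kind built from send/recv rounds. As a hedged illustration (not Ernie Chan's actual RCCE code), the following sketches a ring AllGather: in each of p-1 rounds every rank sends one block to its right neighbor and receives one from its left. The rounds are simulated in-process here; on SCC each round would be one RCCE send/recv pair per rank.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::vector<std::string>>
RingAllGather( const std::vector<std::string>& contrib )
{
    const int p = static_cast<int>( contrib.size() );
    // buf[r][k] will hold rank k's contribution once it reaches rank r
    std::vector<std::vector<std::string>> buf( p, std::vector<std::string>( p ) );
    for( int r = 0; r < p; ++r )
        buf[r][r] = contrib[r];
    for( int step = 0; step < p - 1; ++step )
    {
        // every rank forwards the block it received in the previous round
        std::vector<std::string> inFlight( p );
        for( int r = 0; r < p; ++r )
            inFlight[r] = buf[r][ ( ( r - step ) % p + p ) % p ];
        // "send" to the right neighbor; it stores the block under the
        // originating rank's index
        for( int r = 0; r < p; ++r )
            buf[ ( r + 1 ) % p ][ ( ( r - step ) % p + p ) % p ] = inFlight[r];
    }
    return buf;
}
```

After p-1 rounds every rank holds all p contributions, which is exactly the contract the higher-level Elemental code relies on, independent of whether MPI or RCCE sits underneath.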
Other Things I Could Talk About
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 87
Other Things I Could Talk About
FLAME/C + GPU
SuperMatrix + Out-of-Core
SuperMatrix + GPU
SuperMatrix + MultiGPU
SuperMatrix + Out-of-Core + MultiGPU
PLAPACK + GPU
New algorithms for algorithms-by-blocks
Weapons of Math Induction for the War on Numerical Error Analysis
Weapons of Math Induction for iterative methods
Mechanical derivation of algorithms
Mechanical translation of FLAME/C code to lower level code
libflame, the library
http://www.cs.utexas.edu/users/flame/ 88
How Do I Get to Use All This?
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 89
How Do I Get to Use All This?
Available as a Professionally Maintained Library
libflame Version 4.0 - Feb. 2010: http://www.cs.utexas.edu/users/flame/
Functionality that is a considerable subset of LAPACK
LAPACK compatibility layer
Linux and Windows OS
Field G. Van Zee. libflame: The Complete Reference. www.lulu.com, 2009.
Elemental: http://code.google.com/p/elemental/ (soon to be incorporated in libflame)
http://www.cs.utexas.edu/users/flame/ 90
Conclusion
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 91
Conclusion
“How do we convince people that in programming simplicity and clarity – in short: what mathematicians call ”elegance” – are not a dispensable luxury, but a crucial matter that decides between success and failure?”
– Dijkstra
http://www.cs.utexas.edu/users/flame/ 92
Conclusion
A Success Story
Practical application of “goal-oriented programming”
For the domain of dense linear algebra libraries, FLAME+SuperMatrix appears to solve the programmability problem for sequential and multicore
For the domain of distributed memory dense linear algebra libraries, Elemental appears to solve the programmability problem for clusters and many-core
http://www.cs.utexas.edu/users/flame/
http://www.cs.utexas.edu/users/flame/ 93
Conclusion
What is next?
My favorite definition of science:“Knowledge that has been reduced to a system”
How can one represent knowledge about linear algebra algorithms?
How can one systematically perform architecture-specific transformations with this knowledge?
Don’t code the library. Encode the expert knowledge.
http://www.cs.utexas.edu/users/flame/ 94
Conclusion
Want to learn more?
http://www.cs.utexas.edu/users/flame/publications
http://www.cs.utexas.edu/users/flame/ 95
Conclusion
Questions?
http://www.cs.utexas.edu/users/flame/ 96
Conclusion
Will this be embraced?
Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.
– Dijkstra
http://www.cs.utexas.edu/users/flame/ 97