
Weapons of Math Induction for the War on Parallel Programming Error

Robert A. van de Geijn

Department of Computer Science
Institute for Computational Engineering and Sciences

The University of Texas at Austin

ICES – Sept, 2010

http://www.cs.utexas.edu/users/flame/


Outline

1 Introduction

2 Notation

3 Deriving Algorithms to be Correct

4 From Correct Algorithm to Correct Code

5 Achieving High Performance

6 Fighting the War on Parallel Programming Error

    Multithreaded Architectures

    Distributed Memory Parallel

7 Other Things I Could Talk About

8 How Do I Get to Use All This?

9 Conclusion


Introduction


The Team

UT-Austin

Faculty/Staff: Ernie Chan, Victor Eijkhout, Maggie Myers, Andy Terrel, Robert van de Geijn, Field Van Zee

Graduate Students: Bryan Marker, Kyungjoo Kim, Isaac Lee, Ardavan Pedram, Jack Poulson, Martin Schatz

Undergrads: Burns Healy, Eileen Martin, Jon Monette, Tyler Rhodes, Richard Veras, Nick Wiz

Univ. Jaume I, Spain

Faculty: Gregorio Quintana-Ortí, Enrique Quintana-Ortí, Mercedes Marques

Graduate Students: Manuel Fogue, Francisco D. Igual

RWTH Aachen

Faculty: Paolo Bientinesi


Sponsors

UT-Austin

Numerous NSF Grants

Microsoft

Intel

Univ. Jaume I

Ministerio de Ciencia e Innovacion

Clearspeed

Microsoft

Nvidia


Who is this famous (former) Texan?


“I mean, if 10 years from now, when you are doing something quick and dirty, you suddenly visualize that I am looking over your shoulders and say to yourself ‘Dijkstra would not have liked this’, well, that would be enough immortality for me.”

– Dijkstra


“Literature professors read each other’s books. Why don’t computer science professors read each other’s programs?”

– Tim Mattson


Why dense linear algebra libraries

Widely used in scientific computing

Well-defined domain

Thought to be well-understood

Interesting case study


LAPACK Cholesky factorization (dpotrf)

      DO 20 J = 1, N, NB
*
*        Update and factorize the current diagonal block and test
*        for non-positive-definiteness.
*
         JB = MIN( NB, N-J+1 )
         CALL DSYRK( 'Lower', 'No transpose', JB, J-1, -ONE,
     $               A( J, 1 ), LDA, ONE, A( J, J ), LDA )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 )
     $      GO TO 30
         IF( J+JB.LE.N ) THEN
*
*           Compute the current block column.
*
            CALL DGEMM( 'No transpose', 'Transpose', N-J-JB+1, JB,
     $                  J-1, -ONE, A( J+JB, 1 ), LDA, A( J, 1 ),
     $                  LDA, ONE, A( J+JB, J ), LDA )
            CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                  N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $                  A( J+JB, J ), LDA )
         END IF
   20 CONTINUE
<deleted code>
      GO TO 40
*
   30 CONTINUE
      INFO = INFO + J - 1
*
   40 CONTINUE
      RETURN
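
To make the structure of the loop above easier to follow, here is a minimal plain-Python sketch of the same four steps per block column. It is an illustration only and not the LAPACK implementation: the function name `chol_blocked`, the list-of-lists storage, and the 0-based indexing are assumptions made for this example, and each BLAS/LAPACK call is replaced by explicit loops.

```python
import math


def chol_blocked(A, nb):
    """Overwrite the lower triangle of A with its Cholesky factor, nb columns at a time.

    Hypothetical helper for illustration; A is a list of lists of floats.
    """
    n = len(A)
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # DSYRK step: diagonal block -= (row panel) * (row panel)^T, lower part only
        for r in range(j, j + jb):
            for c in range(j, r + 1):
                A[r][c] -= sum(A[r][p] * A[c][p] for p in range(j))
        # DPOTF2 step: unblocked factorization of the jb x jb diagonal block
        for k in range(j, j + jb):
            if A[k][k] <= 0.0:
                raise ValueError("matrix is not positive definite")
            A[k][k] = math.sqrt(A[k][k])
            for r in range(k + 1, j + jb):
                A[r][k] /= A[k][k]
            for c in range(k + 1, j + jb):
                for r in range(c, j + jb):
                    A[r][c] -= A[r][k] * A[c][k]
        # DGEMM step: block below the diagonal -= (rows below) * (row panel)^T
        for r in range(j + jb, n):
            for c in range(j, j + jb):
                A[r][c] -= sum(A[r][p] * A[c][p] for p in range(j))
        # DTRSM step: solve X * L_jj^T = B for the block below the diagonal
        for r in range(j + jb, n):
            for c in range(j, j + jb):
                A[r][c] -= sum(A[r][p] * A[c][p] for p in range(j, c))
                A[r][c] /= A[c][c]
    return A


# Small example matrix chosen for this sketch (not from the talk).
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
chol_blocked(A, nb=2)
L = [[A[i][j] if j <= i else 0.0 for j in range(3)] for i in range(3)]
```

The strictly upper triangle of `A` is never read or written, which matches the 'Lower' arguments in the Fortran calls.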


LAPACK

Fortran-77 codes

One routine (algorithm) per operation in the library

Storage in column major order

Parallelism extracted from calls to multithreaded BLAS

Extracting parallelism increases synchronization and thus limits performance

Column major order hurts data locality

LAPACK does not use modern coding techniques


The sky is falling


Evolution vs intelligent design

Parallelism is thrust upon the masses

Popular libraries like LAPACK must be completely rewritten (end of an evolutionary path)

Great, let’s start over

Cheaper than trying to evolve?


FLAME

Notation for expressing algorithms

Systematic derivation procedure

Families of algorithms for each operation

APIs to transform algorithms into codes

Storage and algorithm are independent

Storage-by-blocks

Parallelism with data dependencies

High performance even on “exotic” architectures like multiGPUs

A new distributed memory library for massively parallel clusters and clusters-on-a-chip


General lesson:

This crisis is an opportunity to completely rethink your code.


To keep you interested ...

[Figure: Performance of the Cholesky factorization on GPU/CPU. Horizontal axis: matrix size (0 to 20000). Vertical axis: GFLOPS (0 to 700). Curves: MKL 10.0 spotrf on two Intel Xeon QuadCore (2.2 GHz); algorithm-by-blocks on Tesla S870; algorithm-by-blocks on Tesla S1070.]


Notation


A Motivating Example: The Cholesky Factorization

Given $A$, an $n \times n$ symmetric positive definite matrix, compute

$$A = L L^T,$$

where $L$ is an $n \times n$ lower triangular matrix
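
As a small worked instance of this definition (the numbers are chosen here for illustration and do not come from the talk), one can take a 2 × 2 symmetric positive definite matrix and confirm the factorization by multiplying out:

```latex
A = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}, \qquad
L = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}, \qquad
L L^T = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}
        \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}
      = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix} = A.
```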


The Cholesky Factorization: On the Whiteboard

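[Figure: one step of the factorization as drawn on the whiteboard. The leading rows and columns of the matrix are marked "done"; the trailing block is labeled "A (partially updated)".]

The next row and column of the partially updated block are exposed:

$$\begin{pmatrix} \alpha_{11} & \star \\ a_{21} & A_{22} \end{pmatrix}$$

and updated:

$$\alpha_{11} := \sqrt{\alpha_{11}}, \qquad a_{21} := a_{21}/\alpha_{11}, \qquad A_{22} := A_{22} - a_{21} a_{21}^T$$

[Figure: after the step, the "done" region has grown and a smaller "A (partially updated)" block remains.]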


FLAME Notation

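[Figure: the whiteboard picture again. The leading rows and columns are marked "done"; the trailing block "A (partially updated)" is split into $\alpha_{11}$, $a_{12}^T$ (shown as $\star$ where it is not referenced), $a_{21}$ and $A_{22}$.]

Repartition

$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$$

where $\alpha_{11}$ is a scalar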


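Algorithm: $[A] := \mathrm{Chol\_unb}(A)$

Partition $A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}$ where $A_{TL}$ is $0 \times 0$

while $n(A_{BR}) \neq 0$ do

    Repartition

    $$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$$

    where $\alpha_{11}$ is a scalar

    $\alpha_{11} := \sqrt{\alpha_{11}}$

    $a_{21} := a_{21}/\alpha_{11}$

    $A_{22} := A_{22} - a_{21} a_{21}^T$ (syr)

    Continue with

    $$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$$

endwhile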
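
The algorithm above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the FLAME API: the function name `chol_unb`, the list-of-lists storage, and the use of an integer `k` to stand for the size of $A_{TL}$ are choices made for this example.

```python
import math


def chol_unb(A):
    """Overwrite the lower triangle of A (a list of lists) with its Cholesky factor.

    At step k the k x k block A_TL is finished, alpha11 is A[k][k],
    a21 is the column below it, and A22 is the trailing block.
    The strictly upper triangle is never read or written.
    """
    n = len(A)
    for k in range(n):                     # loop runs while A_BR is not empty
        if A[k][k] <= 0.0:
            raise ValueError("matrix is not positive definite")
        # alpha11 := sqrt(alpha11)
        A[k][k] = math.sqrt(A[k][k])
        # a21 := a21 / alpha11
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        # A22 := A22 - a21 * a21^T (symmetric rank-1 update, lower triangle only)
        for j in range(k + 1, n):
            for i in range(j, n):
                A[i][j] -= A[i][k] * A[j][k]
    return A


# Small example matrix chosen for this sketch (not from the talk).
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
chol_unb(A)
L = [[A[i][j] if j <= i else 0.0 for j in range(3)] for i in range(3)]
```

Incrementing `k` plays the role of the "Continue with" step: it moves the exposed row and column into the finished part.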


General lesson:

Algorithms should be represented in a way that captures how we reason about them.


Deriving Algorithms to be Correct


Family Values


“The only effective way to raise the confidence level of a program significantly is to give a convincing proof of its correctness. But one should not first make the program and then prove its correctness, because then the requirement of providing the proof would only increase the poor programmer’s burden. On the contrary: the programmer should let correctness proof and program grow hand in hand.”

– Dijkstra


The Worksheet: A Weapon of Math Induction


Step 1: Precondition and postcondition

Precondition: $A = \hat A$

Note: $\hat A$ indicates the contents of $A$ upon entry. We use this dummy variable to be able to reason about the contents of matrix $A$ as it is being overwritten by its Cholesky factor.

Postcondition: $A = L \wedge \hat A = L L^T$

Note: Indicates that upon completion $A$ must contain the Cholesky factor of the original matrix.
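
The role of the dummy variable can be made concrete with a small check in plain Python. This is a sketch under stated assumptions: the helper names `chol_overwrite` and `postcondition_holds`, the tolerance, and the example matrix are all hypothetical and introduced only for illustration. A copy of the input plays the part of $\hat A$, and the postcondition is tested before and after the factorization overwrites `A`.

```python
import math


def chol_overwrite(A):
    """Overwrite the lower triangle of A with its Cholesky factor (unblocked)."""
    n = len(A)
    for k in range(n):
        A[k][k] = math.sqrt(A[k][k])
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        for j in range(k + 1, n):
            for i in range(j, n):
                A[i][j] -= A[i][k] * A[j][k]


def postcondition_holds(A, A_hat, tol=1e-12):
    """Check that the lower triangle L of A satisfies A_hat = L L^T.

    Only the lower triangle of A_hat is compared, since A_hat is symmetric.
    """
    n = len(A)
    for i in range(n):
        for j in range(i + 1):
            # (L L^T)[i][j] = sum over p <= j of L[i][p] * L[j][p]
            s = sum(A[i][p] * A[j][p] for p in range(j + 1))
            if abs(s - A_hat[i][j]) > tol:
                return False
    return True


A_hat = [[4.0, 2.0, 2.0],
         [2.0, 5.0, 3.0],
         [2.0, 3.0, 6.0]]
A = [row[:] for row in A_hat]            # precondition: A = A_hat
before = postcondition_holds(A, A_hat)   # A has not been factored yet
chol_overwrite(A)
after = postcondition_holds(A, A_hat)    # postcondition: A = L and A_hat = L L^T
```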


Step | Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$
1a | $\{ A = \hat A \}$
4 |
2 |
3 | while $m(A_{TL}) < m(A)$ do
2,3 |
5a |
6 |
8 |
5b |
7 |
2 |
| endwhile
2,3 |
1b | $\{ A = L \wedge \hat A = L L^T \}$


Step 2: Finding Loop-Invariants

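Partition the operands

$$A \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \quad \text{and} \quad L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}$$

Plug into postcondition $A = L \wedge \hat A = L L^T$:

$$\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \wedge \begin{pmatrix} \hat A_{TL} & \star \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}^T$$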


Determine the loop-invariants

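$$\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \wedge \begin{pmatrix} \hat A_{TL} & \star \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T & \star \\ L_{BL} L_{TL}^T & L_{BL} L_{BL}^T + L_{BR} L_{BR}^T \end{pmatrix}$$

Loop-invariant 1:

$$\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix} \wedge \hat A_{TL} = L_{TL} L_{TL}^T$$

Loop-invariant 2:

$$\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & \hat A_{BR} \end{pmatrix} \wedge \begin{pmatrix} \hat A_{TL} \\ \hat A_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$

Loop-invariant 3:

$$\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & \hat A_{BR} - L_{BL} L_{BL}^T \end{pmatrix} \wedge \begin{pmatrix} \hat A_{TL} \\ \hat A_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$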
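
A loop-invariant of this kind can be tested numerically at every iteration. The sketch below, in plain Python, checks loop-invariant 3 at the top of each pass through the loop and once more after the loop. It is an illustration under stated assumptions: the names `invariant3_holds` and `chol_unb_var3_checked`, the tolerance, and the example matrix are hypothetical, and the loop body uses the right-looking updates ($\alpha_{11} := \sqrt{\alpha_{11}}$, $a_{21} := a_{21}/\alpha_{11}$, $A_{22} := A_{22} - a_{21} a_{21}^T$) on the assumption that these are the updates that maintain this invariant.

```python
import math


def invariant3_holds(A, A_hat, k, tol=1e-12):
    """Check loop-invariant 3 when A_TL is k x k.

    L_TL and L_BL are read from the first k columns of A (lower part).
    """
    n = len(A)
    for i in range(n):
        for j in range(i + 1):             # lower triangle only
            if j < k:
                # entry lies in A_TL or A_BL: A_hat = L L^T on the finished columns
                s = sum(A[i][p] * A[j][p] for p in range(j + 1))
                ok = abs(s - A_hat[i][j]) <= tol
            else:
                # entry lies in A_BR: A_BR = A_hat_BR - L_BL L_BL^T
                s = sum(A[i][p] * A[j][p] for p in range(k))
                ok = abs(A[i][j] - (A_hat[i][j] - s)) <= tol
            if not ok:
                return False
    return True


def chol_unb_var3_checked(A):
    """Factor A in place and record whether the invariant held at each check."""
    A_hat = [row[:] for row in A]          # contents of A upon entry
    n = len(A)
    checks = []
    for k in range(n):
        checks.append(invariant3_holds(A, A_hat, k))   # top of the loop body
        A[k][k] = math.sqrt(A[k][k])
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        for j in range(k + 1, n):
            for i in range(j, n):
                A[i][j] -= A[i][k] * A[j][k]
    checks.append(invariant3_holds(A, A_hat, n))       # after the loop
    return checks


A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
checks = chol_unb_var3_checked(A)
```

With `k = n` the bottom-right block is empty, so the final check reduces to the postcondition $\hat A = L L^T$.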


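Step 2: Enter loop-invariant in worksheet

Here $P_{\mathrm{inv}}$ abbreviates the loop-invariant, which the slide writes out in full at each marked step, and $\widehat{A}$ denotes the original contents of $A$:
$$P_{\mathrm{inv}}:\quad \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & \star \\ L_{BL} & \widehat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{TL} \\ \widehat{A}_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$

Step   Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$
1a     $\{ A = \widehat{A} \}$
4
2      $\{ P_{\mathrm{inv}} \}$
3      while $\cdots$ do
2,3      $\{ P_{\mathrm{inv}} \wedge \cdots \}$
5a
6
8
5b
7
2        $\{ P_{\mathrm{inv}} \}$
       endwhile
2,3    $\{ P_{\mathrm{inv}} \wedge \cdots \}$
1b     $\{ A = L \wedge \widehat{A} = L L^T \}$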


Why a Weapon of Math Induction?


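Step 3: Finding the Loop-Guard

Here $P_{\mathrm{inv}}$ abbreviates the loop-invariant, which the slide writes out in full at each marked step, and $\widehat{A}$ denotes the original contents of $A$:
$$P_{\mathrm{inv}}:\quad \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & \star \\ L_{BL} & \widehat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{TL} \\ \widehat{A}_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$

Step   Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$
1a     $\{ A = \widehat{A} \}$
4
2      $\{ P_{\mathrm{inv}} \}$
3      while $m(A_{TL}) < m(A)$ do
2,3      $\{ P_{\mathrm{inv}} \wedge m(A_{TL}) < m(A) \}$
5a
6
8
5b
7
2        $\{ P_{\mathrm{inv}} \}$
       endwhile
2,3    $\{ P_{\mathrm{inv}} \wedge \neg ( m(A_{TL}) < m(A) ) \}$
1b     $\{ A = L \wedge \widehat{A} = L L^T \}$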
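Why this guard suffices can be spelled out in one line. The following is a sketch in the slide's notation, assuming that $m(\cdot)$ returns the row dimension and that $A$ is square, so that $A_{BL}$ and $A_{BR}$ are empty once $m(A_{TL}) = m(A)$:

$$P_{\mathrm{inv}} \wedge \neg\,( m(A_{TL}) < m(A) ) \;\Longrightarrow\; A = A_{TL} = L_{TL} = L \;\wedge\; \widehat{A} = \widehat{A}_{TL} = L_{TL} L_{TL}^T = L L^T$$

In other words, when the loop exits, the invariant restricted to the top-left quadrant is exactly the postcondition in step 1b.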


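Step 4: Finding the Initialization

Here $P_{\mathrm{inv}}$ abbreviates the loop-invariant, which the slide writes out in full at each marked step, and $\widehat{A}$ denotes the original contents of $A$:
$$P_{\mathrm{inv}}:\quad \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & \star \\ L_{BL} & \widehat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{TL} \\ \widehat{A}_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$

Step   Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$
1a     $\{ A = \widehat{A} \}$
4      Partition
       $$A \rightarrow \left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right), \quad L \rightarrow \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right)$$
       where $A_{TL}$ and $L_{TL}$ are $0 \times 0$
2      $\{ P_{\mathrm{inv}} \}$
3      while $m(A_{TL}) < m(A)$ do
2,3      $\{ P_{\mathrm{inv}} \wedge ( m(A_{TL}) < m(A) ) \}$
5a
6
8
5b
7
2        $\{ P_{\mathrm{inv}} \}$
       endwhile
2,3    $\{ P_{\mathrm{inv}} \wedge \neg ( m(A_{TL}) < m(A) ) \}$
1b     $\{ A = L \wedge \widehat{A} = L L^T \}$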
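That this initialization establishes the invariant without any computation can be checked directly. The following is a sketch in the slide's notation, using only the fact that a $0 \times 0$ top-left block leaves $A_{BR} = A$ and gives $L_{BL}$ zero columns:

$$A_{TL},\, L_{TL} \text{ are } 0 \times 0 \;\Longrightarrow\; L_{BL} L_{BL}^T = 0 \;\Longrightarrow\; \left( P_{\mathrm{inv}} \;\Leftrightarrow\; A_{BR} = \widehat{A}_{BR} \;\Leftrightarrow\; A = \widehat{A} \right)$$

So the precondition in step 1a already implies the invariant at step 2.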


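Step 5: Marching through the Matrix

Here $P_{\mathrm{inv}}$ abbreviates the loop-invariant, which the slide writes out in full at each marked step, and $\widehat{A}$ denotes the original contents of $A$:
$$P_{\mathrm{inv}}:\quad \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & \star \\ L_{BL} & \widehat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{TL} \\ \widehat{A}_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$

Step   Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$
1a     $\{ A = \widehat{A} \}$
4      Partition
       $$A \rightarrow \left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right), \quad L \rightarrow \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right)$$
       where $A_{TL}$ and $L_{TL}$ are $0 \times 0$
2      $\{ P_{\mathrm{inv}} \}$
3      while $m(A_{TL}) < m(A)$ do
2,3      $\{ P_{\mathrm{inv}} \wedge ( m(A_{TL}) < m(A) ) \}$
5a       Repartition
         $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} A_{00} & \star & \star \\ \hline a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{array}\right), \quad \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} L_{00} & 0 & 0 \\ \hline l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{array}\right)$$
         where $\alpha_{11}$ and $\lambda_{11}$ are scalars
6
8
5b       Continue with
         $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ \hline A_{20} & a_{21} & A_{22} \end{array}\right), \quad \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ \hline L_{20} & l_{21} & L_{22} \end{array}\right)$$
7
2        $\{ P_{\mathrm{inv}} \}$
       endwhile
2,3    $\{ P_{\mathrm{inv}} \wedge \neg ( m(A_{TL}) < m(A) ) \}$
1b     $\{ A = L \wedge \widehat{A} = L L^T \}$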


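Step 6: State Before the Update

Here $P_{\mathrm{inv}}$ abbreviates the loop-invariant, which the slide writes out in full at each marked step, and $\widehat{A}$ denotes the original contents of $A$:
$$P_{\mathrm{inv}}:\quad \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & \star \\ L_{BL} & \widehat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{TL} \\ \widehat{A}_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$

$\vdots$
3      while $m(A_{TL}) < m(A)$ do
2,3      $\{ P_{\mathrm{inv}} \}$
5a       Repartition
         $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} A_{00} & \star & \star \\ \hline a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{array}\right), \quad \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} L_{00} & 0 & 0 \\ \hline l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{array}\right)$$
         where $\alpha_{11}$ and $\lambda_{11}$ are scalars
6        $$\left\{ \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \widehat{\alpha}_{11} - l_{10}^T l_{10} & \star \\ L_{20} & \widehat{a}_{21} - L_{20} l_{10} & \widehat{A}_{22} - L_{20} L_{20}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{00} \\ \widehat{a}_{10}^T \\ \widehat{A}_{20} \end{pmatrix} = \begin{pmatrix} L_{00} L_{00}^T \\ l_{10}^T L_{00}^T \\ L_{20} L_{00}^T \end{pmatrix} \right\}$$
8
5b       Continue with
         $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ \hline A_{20} & a_{21} & A_{22} \end{array}\right), \quad \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ \hline L_{20} & l_{21} & L_{22} \end{array}\right)$$
7
2        $\{ P_{\mathrm{inv}} \}$
       endwhile
$\vdots$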


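Step 7: State After the Update

Here $P_{\mathrm{inv}}$ abbreviates the loop-invariant, which the slide writes out in full at each marked step, and $\widehat{A}$ denotes the original contents of $A$:
$$P_{\mathrm{inv}}:\quad \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & \star \\ L_{BL} & \widehat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{TL} \\ \widehat{A}_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$

$\vdots$
3      while $m(A_{TL}) < m(A)$ do
2,3      $\{ P_{\mathrm{inv}} \}$
5a       Repartition
         $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} A_{00} & \star & \star \\ \hline a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{array}\right), \quad \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} L_{00} & 0 & 0 \\ \hline l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{array}\right)$$
         where $\alpha_{11}$ and $\lambda_{11}$ are scalars
6        $$\left\{ \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \widehat{\alpha}_{11} - l_{10}^T l_{10} & \star \\ L_{20} & \widehat{a}_{21} - L_{20} l_{10} & \widehat{A}_{22} - L_{20} L_{20}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{00} \\ \widehat{a}_{10}^T \\ \widehat{A}_{20} \end{pmatrix} = \begin{pmatrix} L_{00} L_{00}^T \\ l_{10}^T L_{00}^T \\ L_{20} L_{00}^T \end{pmatrix} \right\}$$
8
5b       Continue with
         $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ \hline A_{20} & a_{21} & A_{22} \end{array}\right), \quad \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ \hline L_{20} & l_{21} & L_{22} \end{array}\right)$$
7        $$\left\{ \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \lambda_{11} & \star \\ L_{20} & l_{21} & \widehat{A}_{22} - L_{20} L_{20}^T - l_{21} l_{21}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{00} & \star \\ \widehat{a}_{10}^T & \widehat{\alpha}_{11} \\ \widehat{A}_{20} & \widehat{a}_{21} \end{pmatrix} = \begin{pmatrix} L_{00} L_{00}^T & \star \\ l_{10}^T L_{00}^T & l_{10}^T l_{10} + \lambda_{11}^2 \\ L_{20} L_{00}^T & L_{20} l_{10} + l_{21} \lambda_{11} \end{pmatrix} \right\}$$
2        $\{ P_{\mathrm{inv}} \}$
       endwhile
$\vdots$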


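Step 8: The Update

Here $P_{\mathrm{inv}}$ abbreviates the loop-invariant, which the slide writes out in full at each marked step, and $\widehat{A}$ denotes the original contents of $A$:
$$P_{\mathrm{inv}}:\quad \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & \star \\ L_{BL} & \widehat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{TL} \\ \widehat{A}_{BL} \end{pmatrix} = \begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}$$

$\vdots$
3      while $m(A_{TL}) < m(A)$ do
2,3      $\{ P_{\mathrm{inv}} \}$
5a       Repartition
         $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} A_{00} & \star & \star \\ \hline a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{array}\right), \quad \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} L_{00} & 0 & 0 \\ \hline l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{array}\right)$$
         where $\alpha_{11}$ and $\lambda_{11}$ are scalars
6        $$\left\{ \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \widehat{\alpha}_{11} - l_{10}^T l_{10} & \star \\ L_{20} & \widehat{a}_{21} - L_{20} l_{10} & \widehat{A}_{22} - L_{20} L_{20}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{00} \\ \widehat{a}_{10}^T \\ \widehat{A}_{20} \end{pmatrix} = \begin{pmatrix} L_{00} L_{00}^T \\ l_{10}^T L_{00}^T \\ L_{20} L_{00}^T \end{pmatrix} \right\}$$
8        $\alpha_{11} := \sqrt{\alpha_{11}}$
         $a_{21} := a_{21} / \alpha_{11}$
         $A_{22} := A_{22} - a_{21} a_{21}^T$
5b       Continue with
         $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ \hline A_{20} & a_{21} & A_{22} \end{array}\right), \quad \left(\begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ \hline L_{20} & l_{21} & L_{22} \end{array}\right)$$
7        $$\left\{ \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \lambda_{11} & \star \\ L_{20} & l_{21} & \widehat{A}_{22} - L_{20} L_{20}^T - l_{21} l_{21}^T \end{pmatrix} \wedge \begin{pmatrix} \widehat{A}_{00} & \star \\ \widehat{a}_{10}^T & \widehat{\alpha}_{11} \\ \widehat{A}_{20} & \widehat{a}_{21} \end{pmatrix} = \begin{pmatrix} L_{00} L_{00}^T & \star \\ l_{10}^T L_{00}^T & l_{10}^T l_{10} + \lambda_{11}^2 \\ L_{20} L_{00}^T & L_{20} l_{10} + l_{21} \lambda_{11} \end{pmatrix} \right\}$$
2        $\{ P_{\mathrm{inv}} \}$
       endwhile
$\vdots$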
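The three update statements follow from comparing the state before the update (step 6) with the state after it (step 7). A sketch of that comparison, using only the quantities on the slide, where the left-hand sides are the contents of $A$ before the update:

$$\begin{aligned}
\alpha_{11} &= \widehat{\alpha}_{11} - l_{10}^T l_{10} = \lambda_{11}^2 &&\Longrightarrow\quad \lambda_{11} = \sqrt{\alpha_{11}}, \\
a_{21} &= \widehat{a}_{21} - L_{20} l_{10} = l_{21} \lambda_{11} &&\Longrightarrow\quad l_{21} = a_{21} / \lambda_{11}, \\
A_{22} &= \widehat{A}_{22} - L_{20} L_{20}^T &&\Longrightarrow\quad \widehat{A}_{22} - L_{20} L_{20}^T - l_{21} l_{21}^T = A_{22} - l_{21} l_{21}^T.
\end{aligned}$$

Because $\lambda_{11}$ overwrites $\alpha_{11}$ and $l_{21}$ overwrites $a_{21}$, these are exactly the assignments in step 8.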


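The Algorithm

Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$

Partition
$$A \rightarrow \left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right)$$
where $A_{TL}$ is $0 \times 0$
while $m(A_{TL}) < m(A)$ do
  Repartition
  $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} A_{00} & \star & \star \\ \hline a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{array}\right)$$
  where $\alpha_{11}$ is a scalar

    $\alpha_{11} := \sqrt{\alpha_{11}}$
    $a_{21} := a_{21} / \alpha_{11}$
    $A_{22} := A_{22} - a_{21} a_{21}^T$

  Continue with
  $$\left(\begin{array}{c|c} A_{TL} & \star \\ \hline A_{BL} & A_{BR} \end{array}\right) \leftarrow \left(\begin{array}{cc|c} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ \hline A_{20} & a_{21} & A_{22} \end{array}\right)$$
endwhile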


Having families of correct algorithms is good

Don’t necessarily start with the legacy implementation of the “usual” algorithm. It may not parallelize well.

Find all (most) algorithms and pick the best for the target architecture.

In our case, we can systematically generate all (loop-based) algorithms.


Is the methodology just a theoretical curiosity?

Broad applicability to all operations supported by LAPACK
  Not yet: eigensolvers, SVD.

The methodology is sufficiently systematic that it has been automated (with Mathematica).
  Paolo Bientinesi. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." Dissertation, UT-Austin, 2006.

Recently generalized to the derivation of Krylov subspace methods.
  Victor Eijkhout, Paolo Bientinesi, and Robert van de Geijn. "Toward Mechanical Derivation of Krylov Solver Libraries." ICCS, 2010.

Extended to systematic derivation of numerical stability analysis.
  Paolo Bientinesi and Robert A. van de Geijn. "A Goal-Oriented and Modular Approach to Stability Analysis." SIMAX. Conditionally accepted.


How does this apply to parallel programming?

Coding correct parallel code is difficult.

We derive our (for now sequential) algorithms to be correct.

It is important to choose from a family of algorithms.

Choose an algorithm that parallelizes well.


From Correct Algorithm to Correct Code


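FLAME/C Code

Repartition
$$\left(\begin{array}{c|c} A_{TL} & A_{TR} \\ \hline A_{BL} & A_{BR} \end{array}\right) \rightarrow \left(\begin{array}{c|cc} A_{00} & a_{01} & A_{02} \\ \hline a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{array}\right)$$
where $\alpha_{11}$ is a scalar

    FLA_Repart_2x2_to_3x3(
        ATL, /**/ ATR,       &A00,  /**/ &a01,     &A02,
     /* ************** */ /* *************************** */
                             &a10t, /**/ &alpha11, &a12t,
        ABL, /**/ ABR,       &A20,  /**/ &a21,     &A22,
        1, 1, FLA_BR );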


(Unblocked) FLAME/C Code

    int FLA_Cholesky_unb( FLA_Obj A )
    {
      /* ... FLA_Part_2x2( ); ... */

      while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
        FLA_Repart_2x2_to_3x3(
            ATL, /**/ ATR,       &A00,  /**/ &a01,     &A02,
         /* ************* */ /* ************************** */
                                 &a10t, /**/ &alpha11, &a12t,
            ABL, /**/ ABR,       &A20,  /**/ &a21,     &A22,
            1, 1, FLA_BR );
        /*------------------------------------------------------------*/
        FLA_Sqrt    ( alpha11 );             /* alpha11 := sqrt( alpha11 ) */
        FLA_Inv_Scal( alpha11, a21 );        /* a21 := a21 / alpha11       */
        FLA_Syr     ( FLA_LOWER_TRIANGULAR,
                      FLA_MINUS_ONE,
                      a21, A22 );            /* A22 := A22 - a21 * a21t    */
        /*------------------------------------------------------------*/
        /* FLA_Cont_with_3x3_to_2x2( ); ... */
      }
    }


Achieving High Performance


Who is this famous (former) Texan?


Who is this famous Texan? Kazushige Goto (TACC)


High-Performance Matrix-Matrix Multiplication

Why is matrix-matrix multiplication (gemm) so important?

  $O(n^3)$ computation on $O(n^2)$ data.
  Allows data movement between RAM and cache to be hidden.
  Can achieve extremely high performance (up to 99% of peak on some architectures).

Required reading (shameless self-promotion):
  Kazushige Goto and Robert A. van de Geijn. "Anatomy of High-Performance Matrix Multiplication," ACM Transactions on Mathematical Software, 34(3): Article 12, 25 pages, May 2008.

Use the method to derive blocked algorithms that cast more computation in terms of gemm.
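The ratio of computation to data can be made concrete with a rough count. This is a back-of-the-envelope sketch, assuming square $n \times n$ operands, the update $C := AB + C$, and counting each matrix element once; the matrix-vector comparison (gemv, $y := Ax + y$) is added here for contrast and is not on the slide:

$$\text{gemm:}\quad \frac{2n^3 \text{ flops}}{3n^2 \text{ elements}} = \frac{2n}{3}, \qquad\qquad \text{gemv:}\quad \frac{2n^2 \text{ flops}}{n^2 + 2n \text{ elements}} \approx 2.$$

For $n = 1000$ this gives roughly 667 flops per matrix element for gemm versus about 2 for gemv, which is why the cost of moving data can be amortized for gemm but not for matrix-vector operations.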


(Unblocked) FLAME/C Code (Again)

    int FLA_Cholesky_unb( FLA_Obj A )
    {
      /* ... FLA_Part_2x2( ); ... */

      while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
        FLA_Repart_2x2_to_3x3(
            ATL, /**/ ATR,       &A00,  /**/ &a01,     &A02,
         /* ************* */ /* ************************** */
                                 &a10t, /**/ &alpha11, &a12t,
            ABL, /**/ ABR,       &A20,  /**/ &a21,     &A22,
            1, 1, FLA_BR );
        /*------------------------------------------------------------*/
        FLA_Sqrt    ( alpha11 );             /* alpha11 := sqrt( alpha11 ) */
        FLA_Inv_Scal( alpha11, a21 );        /* a21 := a21 / alpha11       */
        FLA_Syr     ( FLA_LOWER_TRIANGULAR,
                      FLA_MINUS_ONE,
                      a21, A22 );            /* A22 := A22 - a21 * a21t    */
        /*------------------------------------------------------------*/
        /* FLA_Cont_with_3x3_to_2x2( ); ... */
      }
    }


Blocked FLAME/C Code

    int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
    {
      /* ... FLA_Part_2x2( ); ... */

      while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
        b = min( FLA_Obj_length( ABR ), nb_alg );

        FLA_Repart_2x2_to_3x3(
            ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
         /* ************* */ /* ******************** */
                                 &A10, /**/ &A11, &A12,
            ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
            b, b, FLA_BR );
        /*------------------------------------------------------------*/
        FLA_Cholesky_unb( A11 );                   /* A11 := Cholesky( A11 )   */
        FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11,
                           A21 );                  /* A21 := A21 * inv( A11 )' */
        FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, A22 );       /* A22 := A22 - A21 * A21'  */
        /*------------------------------------------------------------*/
        /* FLA_Cont_with_3x3_to_2x2( ); ... */
      }
    }


Fighting the War on Parallel Programming Error


“When we had no computers, we had no programming problem either. When we had a few computers, we had a mild programming problem. Confronted with machines a million times as powerful, we are faced with a gigantic programming problem.”

– Dijkstra


Multithreaded Architectures


LAPACK parallelization: multithreaded BLAS

$$\begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}, \quad A_{11} \text{ is } b \times b$$

Pro?:

  Evolve legacy code

Con:

  Continue to code in the LINPACK style (1970s)
  Each call to BLAS (compute kernels) is a synchronization point for threads
  As the number of threads increases, serial operations with cost $O(nb^2)$ or $O(b^3)$ are no longer negligible compared with $O(n^2 b)$
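A rough model illustrates the last point. The following is only a sketch under stated assumptions, not a measurement: take one iteration of the blocked algorithm on an $m \times m$ trailing matrix, with about $b^3/3$ flops in the factorization of $A_{11}$, $m b^2$ in the triangular solve, and $m^2 b$ in the symmetric update, and suppose (hypothetically) that only the factorization of $A_{11}$ runs on a single thread while the rest scales perfectly over $p$ threads:

$$T_p \approx \frac{b^3}{3} + \frac{m b^2 + m^2 b}{p}.$$

With the illustrative values $m = 2000$ and $b = 200$, the serial term is about $2.7 \times 10^6$ flops against $8.8 \times 10^8$ for the rest, or 0.3% of the work when $p = 1$. With $p = 64$ the parallel part shrinks to about $1.4 \times 10^7$, so the same serial term accounts for roughly 16% of the modeled time, before any synchronization overhead is counted.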


Of algorithms-by-blocks and runtime systems

Improve parallelism and data locality: algorithms-by-blocks

  Matrix of matrix blocks
  Matrix blocks as unit of data
  Computation with matrix blocks as unit of computation

Execute sequential code to generate DAG of tasks

SuperMatrix

Runtime system for scheduling tasks to threads

Sequential kernels to be executed by the threads

Always be sure to make the machine-specific part someone else’s problem

SuperMatrix is part of Ernie Chan’s dissertation work.


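\[
A =
\begin{pmatrix}
A(0,0)   & \star    & \star    & \cdots & \star \\
A(1,0)   & A(1,1)   & \star    & \cdots & \star \\
A(2,0)   & A(2,1)   & A(2,2)   & \cdots & \star \\
\vdots   & \vdots   & \vdots   & \ddots & \vdots \\
A(M-1,0) & A(M-1,1) & A(M-1,2) & \cdots & A(M-1,N-1)
\end{pmatrix}
\]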
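One way to picture a matrix stored by blocks is the C++ sketch below. It is not the FLASH/libflame data structure: `Block`, `BlockedMatrix`, and the row-major layout inside each block are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One dense b x b block stored contiguously (row-major layout is an assumption).
struct Block {
    std::size_t b;
    std::vector<double> data;

    explicit Block(std::size_t size) : b(size), data(size * size, 0.0) {}

    double& at(std::size_t i, std::size_t j) {
        assert(i < b && j < b);
        return data[i * b + j];
    }
};

// An M x N matrix whose entries are blocks: the block is the unit of data.
class BlockedMatrix {
public:
    BlockedMatrix(std::size_t M, std::size_t N, std::size_t b)
        : M_(M), N_(N), b_(b), blocks_(M * N, Block(b)) {}

    // Block (I, J), the operand a task would work on.
    Block& block(std::size_t I, std::size_t J) {
        assert(I < M_ && J < N_);
        return blocks_[I * N_ + J];
    }

    // Scalar entry (i, j), routed through the block that holds it.
    double& operator()(std::size_t i, std::size_t j) {
        return block(i / b_, j / b_).at(i % b_, j % b_);
    }

    std::size_t rows() const { return M_ * b_; }
    std::size_t cols() const { return N_ * b_; }

private:
    std::size_t M_, N_, b_;
    std::vector<Block> blocks_;
};
```

With 4 x 4 blocks, for example, scalar entry (5, 9) lives at position (1, 1) of block (1, 2). Because each block is contiguous, a sequential kernel can be handed a whole block at a time.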


Algorithm-by-blocks implementation: (almost) no change

int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
  /* ... FLA_Part_2x2( ); ... */

  while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );

    /* Expose one block (1 x 1 in units of blocks) as A11 */
    FLA_Repart_2x2_to_3x3(
        ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
     /* ************* */   /* ******************** */
                             &A10, /**/ &A11, &A12,
        ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
        1, 1, FLA_BR );
    /*------------------------------------------------------------*/
    /* A11 := Chol( A11 ), applied to the single block A11 refers to */
    FLA_Chol( FLA_LOWER_TRIANGULAR,
              *FLASH_OBJ_PTR_AT( A11 ) );

    /* A21 := A21 * tril( A11 )^{-T} */
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );

    /* A22 := A22 - A21 * A21^T (lower triangle only) */
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*------------------------------------------------------------*/
    /* FLA_Cont_with_3x3_to_2x2( ); ... */
  }
}
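The three calls in the loop body follow from equating blocks in $A = L L^T$. The derivation below is the standard one, added here as a reminder; the names $L_{11}$, $L_{21}$, $L_{22}$ for the blocks of the Cholesky factor are our notation and do not appear on the slide.

```latex
\[
\begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
\;\Longrightarrow\;
\begin{aligned}
A_{11} &= L_{11} L_{11}^T, \\
A_{21} &= L_{21} L_{11}^T, \\
A_{22} &= L_{21} L_{21}^T + L_{22} L_{22}^T .
\end{aligned}
\]
```

Each iteration therefore overwrites $A_{11}$ with its Cholesky factor, $A_{21}$ with $A_{21} L_{11}^{-T}$, and $A_{22}$ with $A_{22} - A_{21} A_{21}^T$, after which the same steps are applied to the updated $A_{22}$.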


The FLAME runtime system “pre-executes” the code.

Whenever a routine is encountered, a pending task is annotated in a global task queue
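A toy version of this "pre-execution" can be sketched in C++ as follows. It is an illustration under assumptions, not the SuperMatrix implementation: `Task` and `record_cholesky_tasks` are hypothetical names, and splitting the trailing update into per-block SYRK and GEMM tasks is our assumption about how a by-blocks update decomposes.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical task record: operation name plus the block indices it touches.
struct Task {
    std::string op;                        // "CHOL", "TRSM", "SYRK" or "GEMM"
    std::vector<std::pair<int, int>> in;   // blocks that are only read
    std::pair<int, int> out;               // block that is read and overwritten
};

// Walks the loop structure of a right-looking Cholesky factorization on an
// M x M matrix of blocks. No arithmetic is performed: every "call" merely
// appends a task to the queue, in sequential program order.
std::vector<Task> record_cholesky_tasks(int M) {
    assert(M >= 0);
    std::vector<Task> queue;
    for (int k = 0; k < M; ++k) {
        // A(k,k) := Chol( A(k,k) )
        queue.push_back({"CHOL", {}, {k, k}});
        // A(i,k) := A(i,k) tril( A(k,k) )^{-T}
        for (int i = k + 1; i < M; ++i)
            queue.push_back({"TRSM", {{k, k}}, {i, k}});
        // Trailing update, one task per block of the lower triangle
        for (int i = k + 1; i < M; ++i) {
            for (int j = k + 1; j < i; ++j)
                queue.push_back({"GEMM", {{i, k}, {j, k}}, {i, j}});
            queue.push_back({"SYRK", {{i, k}}, {i, i}});
        }
    }
    return queue;
}
```

For a 3 x 3 matrix of blocks this records 10 tasks, beginning with the factorization of block (0,0) and the two triangular solves that depend on it.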


[Figure: the matrix of blocks A, with blocks A(0,0) through A(M−1,N−1) on and below the diagonal, shown twice.]


FLAME Parallelization: SuperMatrix

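\[
\begin{pmatrix}
A(0,0) & \star  & \star  & \cdots \\
A(1,0) & A(1,1) & \star  & \cdots \\
A(2,0) & A(2,1) & A(2,2) & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\;\xrightarrow{\;\text{at runtime, build DAG}\;}\;
\begin{array}{l}
\texttt{FLA\_Cholesky\_unb}\left( A(0,0) \right) \\
A(1,0) := A(1,0)\,\mathrm{tril}\left( A(0,0) \right)^{-T} \\
A(2,0) := A(2,0)\,\mathrm{tril}\left( A(0,0) \right)^{-T} \\
\quad\vdots \\
A(1,1) := A(1,1) - A(1,0)\,A(1,0)^{T} \\
\quad\vdots
\end{array}
\]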

SuperMatrix

Once all tasks are entered on DAG, the real execution begins!

Tasks with all input operands available are ready, other tasks must wait in the global queue

Upon termination of a task, the corresponding thread updates the list of pending tasks
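The two phases described above (build the DAG, then run whatever is ready) can be sketched in C++ as follows. This is a single-worker simulation written for illustration, not the SuperMatrix runtime: `Task`, `Dag`, `build_dag`, and `run` are hypothetical names, and a real runtime would have several threads taking tasks from the ready queue under a lock and calling a sequential kernel for each.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <utility>
#include <vector>

using BlockId = std::pair<int, int>;

// Hypothetical task descriptor: blocks read, and the one block overwritten.
struct Task {
    std::vector<BlockId> reads;
    BlockId writes;
};

struct Dag {
    std::vector<std::vector<int>> successors;  // tasks waiting on each task
    std::vector<int> unmet;                    // unfinished predecessors per task
};

// Dependency analysis over the tasks in program order: a task waits for the
// last writer of every block it touches, and a writer also waits for the
// readers that used the previous contents of its output block.
Dag build_dag(const std::vector<Task>& tasks) {
    Dag dag;
    dag.successors.resize(tasks.size());
    dag.unmet.assign(tasks.size(), 0);
    std::map<BlockId, int> last_writer;
    std::map<BlockId, std::vector<int>> readers;  // readers since the last write

    auto add_edge = [&dag](int from, int to) {
        dag.successors[from].push_back(to);
        ++dag.unmet[to];
    };

    for (int t = 0; t < static_cast<int>(tasks.size()); ++t) {
        for (const BlockId& in : tasks[t].reads) {
            auto w = last_writer.find(in);
            if (w != last_writer.end()) add_edge(w->second, t);
            readers[in].push_back(t);
        }
        const BlockId& out = tasks[t].writes;
        auto w = last_writer.find(out);
        if (w != last_writer.end()) add_edge(w->second, t);
        for (int r : readers[out])
            if (r != t) add_edge(r, t);
        readers[out].clear();
        last_writer[out] = t;
    }
    return dag;
}

// Execution phase with one simulated worker: take a ready task, "run" it,
// then release every task that was waiting only on it.
std::vector<int> run(Dag dag) {
    std::queue<int> ready;
    for (std::size_t t = 0; t < dag.unmet.size(); ++t)
        if (dag.unmet[t] == 0) ready.push(static_cast<int>(t));

    std::vector<int> order;
    while (!ready.empty()) {
        int t = ready.front();
        ready.pop();
        order.push_back(t);  // a real runtime would call the kernel here
        for (int s : dag.successors[t])
            if (--dag.unmet[s] == 0) ready.push(s);
    }
    assert(order.size() == dag.unmet.size());  // every task ran
    return order;
}
```

In the slide's example, the factorization of A(0,0) has no predecessors, and both triangular solves become ready as soon as it finishes, so they could run on different threads.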


Separation of concerns simplifies programming

Library code that can target many architectures.

Run-time system that can implement different schedulers for different situations.


Who is this famous Texan?

UT-Texas must be the better, faster, more successful!


Target Architecture 1

4 socket 2.66 GHz Intel Dunnington - 24 cores

16MB shared L3 cache per socket

OpenMP Intel compiler 11.1

Intel MKL 11.1 (Windows), 10.2 (Linux)


[Plot: Cholesky factorization (Linux)]

[Plot: Cholesky factorization (Windows)]

[Plot: LU factorization (Linux)]

[Plot: QR factorization (Linux)]

Target Architecture 2

4 socket 2.3 GHz AMD Opteron Quad-Core

2MB shared L3 cache per socket

OpenMP Intel compiler 10.1

GotoBLAS2 1.00

[Plot: LU factorization with pivoting]

Related Approaches

Cilk (MIT), TBB (Intel) and SMPSs (Barcelona Supercomputing Center)

General-purpose parallel programming

Cilk, TBB → irregular/recursive problems

SMPSs → more general, also manages dependencies

High-level language based on OpenMP-like pragmas + compiler + runtime system

Modest results for dense linear algebra

PLASMA Project

Next step in the LAPACK evolutionary path

Traditional style of implementing algorithms

Does not solve the programmability problem

Hierarchically Tiled Arrays

Abstraction for computing with matrices stored by blocks.


Fighting the War on Parallel Programming Error Distributed Memory Parallel


Didn’t we solve the problem in the 1990s?

ScaLAPACK (UTK/Berkeley)

Previous step in the LAPACK evolution.

Rooted in LAPACK, which is itself rooted in LINPACK (1970s)

PLAPACK (UT-Austin)

Object-based library

Inspired the FLAME approach

For very large problems on distributed memory clusters, these should suffice.


Renewed interest in distributed memory libraries

Intel’s SCC research processor

48 Pentium cores on one chip.

Connected via very fast on-chip communication buffers.

No cache-coherency protocol.

Purpose: to study the programmability problem for many-core architectures.


A New Framework for Distributed Memory Dense Matrix Libraries

Elemental (Jack Poulson + Bryan Marker)

C++ coded in the style of FLAME/C

2D elemental cyclic matrix distribution.

Does NOT tie algorithmic block size to distribution block size.

ScaLAPACK

Fortran77 coded in the style of LAPACK.

2D block cyclic matrix distribution.

Ties algorithmic block size to distribution block size.
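The difference between the two distributions can be illustrated with a small C++ sketch of which process owns entry (i, j) on an r x c process grid. The function names and the zero alignment are assumptions for illustration; neither library exposes the mapping in exactly this form.

```cpp
#include <cassert>
#include <utility>

// Process coordinates (row, column) in an r x c process grid.
using Owner = std::pair<int, int>;

// 2D block cyclic: consecutive nb x nb blocks are dealt out round-robin,
// so nb is fixed when the matrix is distributed.
Owner block_cyclic_owner(int i, int j, int nb, int r, int c) {
    assert(i >= 0 && j >= 0 && nb > 0 && r > 0 && c > 0);
    return {(i / nb) % r, (j / nb) % c};
}

// 2D elemental cyclic: individual entries are dealt out round-robin,
// which is the block cyclic rule with a distribution block size of 1.
Owner elemental_owner(int i, int j, int r, int c) {
    return block_cyclic_owner(i, j, 1, r, c);
}
```

With the elemental rule, neighbouring entries land on different processes whatever algorithmic block size is chosen later, which is one way to see why the algorithmic block size need not match the distribution block size.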


Elemental: FLAME for distributed memory architectures

template<typename T>
void
Elemental::LAPACK::Internal::CholLVar3
( DistMatrix<T,MC,MR>& A )
{
    const Grid& grid = A.GetGrid();

    // Matrix views
    DistMatrix<T,MC,MR>
        ATL(grid), ATR(grid),  A00(grid), A01(grid), A02(grid),
        ABL(grid), ABR(grid),  A10(grid), A11(grid), A12(grid),
                               A20(grid), A21(grid), A22(grid);

    // Temporary matrix distributions
    DistMatrix<T,Star,Star> A11_Star_Star(grid);
    DistMatrix<T,VC,  Star> A21_VC_Star(grid);
    DistMatrix<T,MC,  Star> A21_MC_Star(grid);
    DistMatrix<T,MR,  Star> A21_MR_Star(grid);

    // Start the algorithm
    PartitionDownDiagonal( A, ATL, ATR,
                              ABL, ABR );
    while( ABR.Height() > 0 )
    {
        RepartitionDownDiagonal( ATL, /**/ ATR,  A00, /**/ A01, A02,
                                /*************/ /******************/
                                      /**/       A10, /**/ A11, A12,
                                 ABL, /**/ ABR,  A20, /**/ A21, A22 );

        A21_MC_Star.AlignWith( A22 );
        A21_MR_Star.AlignWith( A22 );
        //--------------------------------------------------------------------//
        // A11 := Chol( A11 ), on a copy of A11 replicated on every process
        A11_Star_Star = A11;
        LAPACK::Chol( Lower, A11_Star_Star.LocalMatrix() );
        A11 = A11_Star_Star;

        // A21 := A21 tril(A11)^{-H}, as a local triangular solve
        A21_VC_Star = A21;
        BLAS::Trsm( Right, Lower, ConjugateTranspose, NonUnit,
                    (T)1, A11_Star_Star.LockedLocalMatrix(),
                          A21_VC_Star.LocalMatrix() );

        // A22 := A22 - A21 A21^H (lower triangle), after redistributing A21
        A21_MC_Star = A21_VC_Star;
        A21_MR_Star = A21_VC_Star;
        BLAS::Internal::HerkLNUpdate( (T)-1, A21_MC_Star, A21_MR_Star, (T)1, A22 );
        A21 = A21_MC_Star;
        //--------------------------------------------------------------------//
        A21_MC_Star.FreeConstraints();
        A21_MR_Star.FreeConstraints();

        SlidePartitionDownDiagonal( ATL, /**/ ATR,  A00, A01, /**/ A02,
                                         /**/       A10, A11, /**/ A12,
                                   /*************/ /******************/
                                    ABL, /**/ ABR,  A20, A21, /**/ A22 );
    }
}
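Stripped of the data redistributions, each iteration of the loop above performs three updates: factor A11, solve for A21, and update A22. A minimal sequential sketch of the same blocked algorithm follows. It is an illustration only, not Elemental code: it assumes real double-precision data (so the conjugate transpose is a plain transpose), row-major storage, and the hypothetical names `Matrix` and `BlockedCholLower`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Dense n x n matrix stored row-major; only the lower triangle is referenced.
struct Matrix {
    std::size_t n;
    std::vector<double> a;
    double& operator()(std::size_t i, std::size_t j) { return a[i * n + j]; }
};

// Right-looking blocked Cholesky: overwrites the lower triangle of A with L,
// where A = L L^T. Returns false if a non-positive pivot is encountered.
bool BlockedCholLower(Matrix& A, std::size_t nb)
{
    const std::size_t n = A.n;
    for (std::size_t k = 0; k < n; k += nb) {
        const std::size_t e = std::min(k + nb, n);  // A11 is A(k:e, k:e)

        // A11 := Chol(A11), unblocked
        for (std::size_t j = k; j < e; ++j) {
            double d = A(j, j);
            for (std::size_t p = k; p < j; ++p) d -= A(j, p) * A(j, p);
            if (d <= 0.0) return false;
            A(j, j) = std::sqrt(d);
            for (std::size_t i = j + 1; i < e; ++i) {
                double s = A(i, j);
                for (std::size_t p = k; p < j; ++p) s -= A(i, p) * A(j, p);
                A(i, j) = s / A(j, j);
            }
        }

        // A21 := A21 * inv(L11)^T (triangular solve from the right)
        for (std::size_t i = e; i < n; ++i)
            for (std::size_t j = k; j < e; ++j) {
                double s = A(i, j);
                for (std::size_t p = k; p < j; ++p) s -= A(i, p) * A(j, p);
                A(i, j) = s / A(j, j);
            }

        // A22 := A22 - A21 * A21^T (lower triangle only)
        for (std::size_t i = e; i < n; ++i)
            for (std::size_t j = e; j <= i; ++j) {
                double s = 0.0;
                for (std::size_t p = k; p < e; ++p) s += A(i, p) * A(j, p);
                A(i, j) -= s;
            }
    }
    return true;
}
```

The block size nb plays the role of the algorithmic block size: it changes how the work is grouped, not the result.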


Target Architecture 3

Total of 15 × 4 × 4 = 240 cores:

15 nodes (out of 3936 nodes)

4 sockets per node, each a 2.3 GHz quad-core AMD Opteron

2MB shared L3 cache per socket

full-CLOS InfiniBand 1Gb/sec

MVAPICH2 Release 1.2

GotoBLAS 1.30


Elemental GEMM, 240 cores


Elemental Cholesky, 240 cores


Elemental LU with partial pivoting, 240 cores


An Exercise in Portability: Elemental → SCC

Jan 11 - Tim Mattson: We will send the emulator

Jan 31 - Bryan Marker: I’m almost ready for you to test the stationary C Gemm (NN) implementation in [Elemental] on an actual [SCC] board.

Feb 3 - Bryan Marker: Alright gentlemen, I have two test programs for Gemm C NN (one I created and one by Jack).

Feb 4 - Bryan Marker: FYI, I have the Cholesky variant 2 ported [to the emulator]. Very easy fix to avoid SendRecv.

(Many weeks of no progress while everyone was busy with other things)

March 18 - Rob van der Wijngaart (Intel): Good news, [...] the app is running on SCC as we speak. Some of the tests inside the app are reporting failures, but these can now be debugged.

March 18 - Bryan Marker: Some tests are expected to fail because they require SendRecv, which isn’t in the old code you have.


What was required?

Replace MPI layer with Intel’s experimental RCCE communication layer (Bryan Marker).

Write a few collective communication routines for RCCE (Ernie Chan).

Important: Great confidence in the implementation.

Note: SuperMatrix port to SCC is almost complete.

We are eagerly waiting for performance results.
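The slides do not show the RCCE collectives themselves. As an illustration of how a collective can be layered on nothing more than point-to-point sends and receives, here is a minimal sketch that generates the transfer schedule of a binomial-tree broadcast from rank 0. The names `Transfer` and `BinomialBroadcastSchedule` are hypothetical, and this is not the code written for the port; a real routine would issue one send or receive per listed transfer.

```cpp
#include <cstddef>
#include <vector>

// One point-to-point transfer, performed in the given round.
struct Transfer {
    std::size_t round, src, dst;
};

// Schedule for broadcasting from rank 0 to p ranks with a binomial tree.
// At the start of a round, ranks 0 .. have-1 hold the data; each of them
// sends to rank + have if that rank exists, doubling the set of holders.
std::vector<Transfer> BinomialBroadcastSchedule(std::size_t p)
{
    std::vector<Transfer> schedule;
    std::size_t round = 0;
    for (std::size_t have = 1; have < p; have *= 2, ++round)
        for (std::size_t src = 0; src < have && src + have < p; ++src)
            schedule.push_back(Transfer{ round, src, src + have });
    return schedule;
}
```

Every rank other than the root receives exactly once, so the schedule contains p - 1 transfers spread over about log2(p) rounds.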


Other Things I Could Talk About


FLAME/C + GPU

SuperMatrix + Out-of-Core

SuperMatrix + GPU

SuperMatrix + MultiGPU

SuperMatrix + Out-of-Core + MultiGPU

PLAPACK + GPU

New algorithms for algorithms-by-blocks

Weapons of Math Induction for the War on Numerical Error Analysis

Weapons of Math Induction for iterative methods

Mechanical derivation of algorithms

Mechanical translation of FLAME/C code to lower level code

libflame, the library


How Do I Get to Use All This?


Available as a Professionally Maintained Library

libflame Version 4.0 - Feb. 2010: http://www.cs.utexas.edu/users/flame/

Functionality that is a considerable subset of LAPACK

LAPACK compatibility layer

Linux and Windows OS

Field G. Van Zee. libflame: The Complete Reference. www.lulu.com, 2009

Elemental: http://code.google.com/p/elemental/ (soon to be incorporated in libflame)


Conclusion


“How do we convince people that in programming simplicity and clarity – in short: what mathematicians call ‘elegance’ – are not a dispensable luxury, but a crucial matter that decides between success and failure?”

– Dijkstra


A Success Story

Practical application of “goal-oriented programming”

For the domain of dense linear algebra libraries, FLAME+SuperMatrix appears to solve the programmability problem for sequential and multicore

For the domain of distributed memory dense linear algebra libraries, Elemental appears to solve the programmability problem for clusters and many-core

http://www.cs.utexas.edu/users/flame/


What is next?

My favorite definition of science: “Knowledge that has been reduced to a system”

How can one represent knowledge about linear algebra algorithms?

How can one systematically perform architecture-specific transformations with this knowledge?

Don’t code the library. Encode the expert knowledge.


Want to learn more?

http://www.cs.utexas.edu/users/flame/publications


Questions?


Will this be embraced?

Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.

– Dijkstra
