32
Build To Order (BTO) BLAS Compiler Oliver Ney 11th July, 2016 Oliver Ney Build To Order (BTO) BLAS Compiler 1 / 22

Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Build To Order (BTO) BLAS Compiler

Oliver Ney

11th July, 2016

Oliver Ney Build To Order (BTO) BLAS Compiler 1 / 22

Page 2: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Overview

1 BLAS

2 Motivation

3 BTO BLAS Compiler

4 Architecture

5 Analysis

6 Results

7 Conclusions

Oliver Ney Build To Order (BTO) BLAS Compiler 2 / 22

Page 3: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

BLAS

Basic Linear Algebra Subprograms

Defined set of commonly used routines

Highly optimized implementations on various platforms

e.g. Intel Math Kernel Library (IMKL), Armadillo

Building blocks for higher level algorithms

Split into 3 levels

Oliver Ney Build To Order (BTO) BLAS Compiler 3 / 22

Page 4: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

BLAS - Levels

Level 1

Vector routines

e.g. dot product, norms

Level 2

Matrix-Vector routines

e.g. matrix-vector product, solve linear system with triangular

matrix

Level 3

Matrix routines

e.g. matrix product

Oliver Ney Build To Order (BTO) BLAS Compiler 4 / 22

Page 5: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Performance Constraints

Performance of BLAS execution is constrained by:

Processing Power (FLOPS)

well optimized for by compilers

Memory Size, Speed and Bandwidth

requires in-depth analysis

Oliver Ney Build To Order (BTO) BLAS Compiler 5 / 22

Page 6: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Performance Constraints

Performance of BLAS execution is constrained by:

Processing Power (FLOPS)

well optimized for by compilers

Memory Size, Speed and Bandwidth

requires in-depth analysis

Oliver Ney Build To Order (BTO) BLAS Compiler 5 / 22

Page 7: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Performance Constraints

Performance of BLAS execution is constrained by:

Processing Power (FLOPS)

well optimized for by compilers

Memory Size, Speed and Bandwidth

requires in-depth analysis

Oliver Ney Build To Order (BTO) BLAS Compiler 5 / 22

Page 8: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Memory Hierarchy

based on Intel i7 Nehalem architecture

Oliver Ney Build To Order (BTO) BLAS Compiler 6 / 22

Page 9: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Consequences

During math kernel execution operands...

... need to stay in cache.

... should be local.

Common BLAS libraries...

... are well optimized for this on level 3.

... cannot optimize across routines.

Oliver Ney Build To Order (BTO) BLAS Compiler 7 / 22

Page 10: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Build To Order BLAS Compiler

Developed at University of Colorado

Compiles math kernels to custom BLAS routines

Optimizes generated code for memory efficiency on host computer

Based on empirical results

Optional analytical model

Allows for implicit parallelization using pthreads or OpenMP

Oliver Ney Build To Order (BTO) BLAS Compiler 8 / 22

Page 11: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Input and Output Format

Input:

Name of kernel

Input and output arguments

Scalars, Vectors, Matrices

Column-/Row-Major, Symmetric, Triangular, ...

MATLAB-like body

Output:

Single C function named like the kernel

Arguments matching input declaration (plus dimensions)

Oliver Ney Build To Order (BTO) BLAS Compiler 9 / 22

Page 12: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Input and Output FormatInput Example

1 GEMVER

2 in

3 A : matrix(column), u1 : vector(column), u2 : vector

(column),

4 v1 : vector(column), v2 : vector(column),

5 a : scalar , b : scalar ,

6 y : vector(column), z : vector(column)

7 out

8 B : matrix(column), x : vector(column), w : vector(

column)

9 {

10 B = A + u1 * v1’ + u2 * v2 ’

11 x = b * (B’ * y) + z

12 w = a * (B * x)

13 }

Oliver Ney Build To Order (BTO) BLAS Compiler 10 / 22

Page 13: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Architecture Overview

Oliver Ney Build To Order (BTO) BLAS Compiler 11 / 22

Page 14: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Dataflow Graph

Input file parsed into dataflow graph

Arguments as roots/leaves, operators as knots

Every knot is looked up in a linear algebra database

Recursively applied until all knots work on scalars

generates iterating subgraphs

Linear Algebra Database Entry

rr-mult R < τl > ×R < τr > R < R < τl > ×τr >

Oliver Ney Build To Order (BTO) BLAS Compiler 12 / 22

Page 15: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Dataflow Graph

Oliver Ney Build To Order (BTO) BLAS Compiler 13 / 22

Page 16: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Dataflow Graph

Oliver Ney Build To Order (BTO) BLAS Compiler 14 / 22

Page 17: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop fusion

If two adjacent loops ...

have the same indices and conditions

and operate on the same values

... or a succeeding loop ...

uses the same indices and conditions

consumes the output of the previous loop

... they can be fusioned into a single loop.

Oliver Ney Build To Order (BTO) BLAS Compiler 15 / 22

Page 18: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop fusion

1 mat a;

2

3 for(i = 1..n) {

4 x = a(i,:);

5 ...

6 }

7

8 for(j = 1..n) {

9 y = a(j,:);

10 ...

11 }

Oliver Ney Build To Order (BTO) BLAS Compiler 15 / 22

Page 19: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop fusion

1 mat a;

2

3 for(i = 1..n) {

4 x = a(i,:);

5 ...

6 }

7

8 for(j = 1..n) {

9 y = a(j,:);

10 ...

11 }

1 mat a;

2

3 for(i = 1..n) {

4 x = a(i,:);

5 /* y == x */

6 ...

7 }

Oliver Ney Build To Order (BTO) BLAS Compiler 15 / 22

Page 20: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop fusion

If two adjacent loops ...

have the same indices and conditions

and operate on the same values

... or a succeeding loop ...

uses the same indices and conditions

consumes the output of the previous loop

... they can be fusioned into a single loop.

Oliver Ney Build To Order (BTO) BLAS Compiler 15 / 22

Page 21: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop fusion

1 mat a, b;

2

3 for(i = 1..n) {

4 b(i,:) = a(i,:) * 2;

5 ...

6 }

7

8 for(j = 1..n) {

9 y = b(j,:);

10 ...

11 }

⇒ Improves localization of operands

⇒ Reduces loop overhead

Oliver Ney Build To Order (BTO) BLAS Compiler 15 / 22

Page 22: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop fusion

1 mat a, b;

2

3 for(i = 1..n) {

4 b(i,:) = a(i,:) * 2;

5 ...

6 }

7

8 for(j = 1..n) {

9 y = b(j,:);

10 ...

11 }

1 mat a;

2

3 for(i = 1..n) {

4 y = a(i,:) * 2;

5 /* y == b(i,:) */

6 ...

7 }

⇒ Improves localization of operands

⇒ Reduces loop overhead

Oliver Ney Build To Order (BTO) BLAS Compiler 15 / 22

Page 23: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop tiling

If a succeeding loop ...

has a high dependency on the previous loop

uses very large shared operands

or if parallelism is wanted ...

... the loops can be tiled to improve performance.

Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22

Page 24: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop tiling

If a succeeding loop ...

has a high dependency on the previous loop

uses very large shared operands

or if parallelism is wanted ...

... the loops can be tiled to improve performance.

1 mat a, b;

2

3 for(i = 1..n) {

4 b(i,:) = a(i,:) * 2;

5 ..1..

6 b(i,:) *= 3;

7 ..2..

8 }

9 for(j = 1..n) {

10 y = a(j,:) * 2;

11 ..3..

12 }

Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22

Page 25: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Loop tiling

If a succeeding loop ...

has a high dependency on the previous loop

uses very large shared operands

or if parallelism is wanted ...

... the loops can be tiled to improve performance.

1 mat a, b;

2

3 for(i = 1..n) {

4 b(i,:) = a(i,:) * 2;

5 ..1..

6 b(i,:) *= 3;

7 ..2..

8 }

9 for(j = 1..n) {

10 y = a(j,:) * 2;

11 ..3..

12 }

1 mat a, b;

2

3 for(i = 1..n) {

4 b(i,:) = a(i,:) * 2;

5 ..1..

6 ..3..

7 }

8 for(j = 1..n) {

9 b(i,:) *= 3;

10 ..2..

11 }

Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22

Page 26: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Genetic Search

Enormous search space for best performing solution

Exhaustive search is inapplicable

Explores possible candidates by decision mutation

Every candidate evaluated by empirical analysis

No guarantee to find optimal solution

Hard limit on number of generations

Oliver Ney Build To Order (BTO) BLAS Compiler 17 / 22

Page 27: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Empirical Analysis

Every solution considered by the search is...

compiled

executed on a fixed input set

e.g. 3000x3000 matrices

weighted by execution time

Oliver Ney Build To Order (BTO) BLAS Compiler 18 / 22

Page 28: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Results

Oliver Ney Build To Order (BTO) BLAS Compiler 19 / 22

Page 29: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Results

Oliver Ney Build To Order (BTO) BLAS Compiler 20 / 22

Page 30: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

Conclusions

Pros:

Loop fusion/tiling strongly improves performance for chains of

vector-vector operations

matrix-vector operations

Giant search space is a big hurdle

provides various methods to handle it

average compilation time still acceptable

Automatic parallelization using pthreads

Cons:

Project appears dead

Errors on compilation

Oliver Ney Build To Order (BTO) BLAS Compiler 21 / 22

Page 31: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

References I

[0] Belter, Geoffrey et al. (2009). “Automating the Generation of

Composed Linear Algebra Kernels”. In: Proceedings of the

Conference on High Performance Computing Networking,

Storage and Analysis. SC ’09. Portland, Oregon: ACM,

59:1–59:12. isbn: 978-1-60558-744-8. doi:

10.1145/1654059.1654119. url:

http://doi.acm.org/10.1145/1654059.1654119.

[0] Nelson, Thomas (2015). “DSLs and Search for Linear Algebra

Perfromance Optimization”. PhD thesis. UNIVERSITY OF

COLORADO AT BOULDER.

[0] Siek, Jeremy G. (2009). Build To Order BLAS. CScADS Summer

Workshop - Libraries and Autotuning for Petascale Applications.

Lake Tahoe.

Oliver Ney Build To Order (BTO) BLAS Compiler 22 / 22

Page 32: Build To Order (BTO) BLAS Compilerhpac.cs.umu.se/teaching/sem-accg-16/slides/06.Ney-BTO.pdf · 10..2.. 11} Oliver Ney Build To Order (BTO) BLAS Compiler 16 / 22. Genetic Search Enormous

References II

[0] Siek, Jeremy G, Ian Karlin, and Elizabeth R Jessup (2008). “Build

to order linear algebra kernels”. In: Parallel and Distributed

Processing, 2008. IPDPS 2008. IEEE International Symposium

on. IEEE, pp. 1–8.

Oliver Ney Build To Order (BTO) BLAS Compiler 23 / 22