A framework for practical fast matrix multiplication (BLIS retreat)

Slides from my talk at the BLIS retreat in Austin, TX.

[Figure: Parallel performance of Strassen on <N,N,N>: effective GFLOPS per core vs. dimension N, for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]

A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson (arbenson@stanford.edu), ICME, Stanford

Grey Ballard, Sandia National Laboratories

BLIS Retreat, September 26, 2014

arXiv: 1409.2908


Fast matrix multiplication: bridging theory and practice

• There are a number of Strassen-like algorithms for matrix

multiplication that have only been “discovered” recently.

[Smirnov13], [Benson&Ballard14]

• We show that they can achieve higher performance than MKL (sequentially, and sometimes in parallel).

• We use code generation to do extensive prototyping. There are several practical issues and plenty of room for improvement (lots of expertise at UT to help here!).


Strassen’s algorithm

The exponent of matrix multiplication: classical is O(n^3); Strassen brought it to O(n^2.81) [Strassen69]; the best known bound is O(n^2.37) [Williams12].

[Figure: Strassen’s algorithm written out on 2 x 2 block matrices.]


Key ingredients of Strassen’s algorithm

1. Block partitioning of matrices (<2, 2, 2>)
2. Seven linear combinations of sub-blocks of A
3. Seven linear combinations of sub-blocks of B
4. Seven matrix multiplies to form the Mr (recursive)
5. Linear combinations of the Mr to form the Cij
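For concreteness, here is a minimal NumPy sketch of one recursive step (my own illustration, not the paper's generated code); the seven products and the output combinations are Strassen's classical formulas.

```python
import numpy as np

def strassen_step(A, B):
    """One step of Strassen's <2,2,2> algorithm. Assumes square
    matrices with even dimensions; recursion and cutoff handling
    are omitted for clarity."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Seven multiplies of linear combinations (recursive in real code)
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Linear combinations of the Mr form the blocks of C
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```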


Key ingredients of fast matmul algorithms

1. Block partitioning of matrices (<M, K, N>)
2. R linear combinations of sub-blocks of A
3. R linear combinations of sub-blocks of B
4. R matrix multiplies to form the Mr (recursive)
5. Linear combinations of the Mr to form the Cij

If R < MKN, the algorithm is asymptotically faster than the classical one (see the sketch below).
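To see why R < MKN wins: symmetrizing the <M, K, N> partitioning yields a square algorithm with exponent 3 log(R) / log(MKN), so fewer multiplies means a smaller exponent. A quick check (the helper is mine, not from the slides):

```python
from math import log

def exponent(M, K, N, R):
    """Asymptotic exponent of a fast algorithm built on a <M, K, N>
    partitioning with R multiplies: square N x N x N matmul then
    costs O(N ** exponent)."""
    return 3 * log(R) / log(M * K * N)

print(exponent(2, 2, 2, 7))   # Strassen: ~2.81
print(exponent(4, 2, 4, 26))  # <4,2,4> algorithm below: ~2.82
```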


“Outer product” fast algorithm

• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32): a 23% speedup per recursive step, if everything else were free (see the check below)
• Linear combinations of the Aij to form the Sr: 68 terms
• Linear combinations of the Bij to form the Tr: 52 terms
• Linear combinations of the Mr to form the Cij: 69 terms
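As referenced above, a back-of-the-envelope check of the 23% figure using the term counts from this slide; the assumption that each addition term costs one sub-block's worth of flops is my simplification.

```python
def one_step_flops(N):
    """Flop counts for one <4,2,4> recursive step on an N x N x N
    multiply, with the 26 sub-multiplies done classically."""
    classical = 2 * N**3
    fast_muls = 26 * 2 * (N // 4) * (N // 2) * (N // 4)
    # 68 + 52 + 69 addition terms, each touching one sub-block
    fast_adds = (68 * (N // 4) * (N // 2)     # Sr from sub-blocks of A
                 + 52 * (N // 2) * (N // 4)   # Tr from sub-blocks of B
                 + 69 * (N // 4) * (N // 4))  # Cij from the Mr
    return classical, fast_muls + fast_adds

classical, fast = one_step_flops(8000)
print(classical / fast)  # ~1.23: the additions are lower order in N
```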


Discovering fast algorithms is a numerical challenge

• Low-rank tensor decompositions lead to fast algorithms
• The tensors are small, but we need exact decompositions, and finding those is NP-hard
• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14] (sketched below)
• We have around 10 fast algorithms for <M, K, N> decompositions. We also have permutations, e.g., <K, M, N>.
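As referenced above, a heavily simplified sketch of the ALS idea on the <2, 2, 2> multiplication tensor. Plain ALS like this usually stalls short of an exact fit, which is exactly why the regularization and rounding tricks matter; all names here are mine.

```python
import numpy as np

def matmul_tensor(m, k, n):
    """The <m, k, n> matrix multiplication tensor."""
    T = np.zeros((m * k, k * n, m * n))
    for i in range(m):
        for q in range(k):
            for j in range(n):
                T[i * k + q, q * n + j, i * n + j] = 1.0
    return T

def khatri_rao(X, Y):
    """Column-wise Kronecker product."""
    return np.einsum('ir,jr->ijr', X, Y).reshape(-1, X.shape[1])

def als(T, R, iters=500, seed=0):
    """Plain CP-ALS: an exact rank-R fit of the tensor gives a fast
    algorithm with R multiplies."""
    rng = np.random.default_rng(seed)
    d0, d1, d2 = T.shape
    U, V, W = (rng.standard_normal((d, R)) for d in T.shape)
    for _ in range(iters):
        U = np.linalg.lstsq(khatri_rao(V, W),
                            T.reshape(d0, -1).T, rcond=None)[0].T
        V = np.linalg.lstsq(khatri_rao(U, W),
                            T.transpose(1, 0, 2).reshape(d1, -1).T,
                            rcond=None)[0].T
        W = np.linalg.lstsq(khatri_rao(U, V),
                            T.transpose(2, 0, 1).reshape(d2, -1).T,
                            rcond=None)[0].T
    return U, V, W, np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', U, V, W))

# a zero-residual rank-7 fit recovers a Strassen-like algorithm;
# plain ALS typically stalls above zero, hence the extra tricks
print(als(matmul_tensor(2, 2, 2), R=7)[-1])
```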

[Figure: examples of fast algorithm decompositions, from [Smirnov13] and [Strassen69].]

Code generation lets us prototype algorithms quickly

• We have a compact representation of many fast algorithms:
  1. dimensions of the block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (Sr, Tr)
  3. linear combinations of the Mr to form the Cij
• We use code generation to rapidly prototype fast algorithms (a toy generator is sketched below)
• Our approach: test all algorithms on a bunch of different problem sizes and look for patterns
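A toy version of the generator idea, assuming the compact representation is stored as coefficient matrices (one per ingredient); illustrated with Strassen's S-coefficients, consistent with the step sketched earlier. The function and variable names are mine.

```python
def gen_combination(name, blocks, coeffs):
    """Emit one 'write once' linear combination from a coefficient
    column, e.g. 'S6 = -1 * A11 + 1 * A21'."""
    terms = [f"{c} * {b}" for b, c in zip(blocks, coeffs) if c != 0]
    return f"{name} = " + " + ".join(terms)

# Strassen's S-coefficients as a 4 x 7 matrix; rows are A11, A12, A21, A22
U = [[1, 0, 1, 0, 1, -1, 0],
     [0, 0, 0, 0, 1, 0, 1],
     [0, 1, 0, 0, 0, 1, 0],
     [1, 1, 0, 1, 0, 0, -1]]
A_blocks = ["A11", "A12", "A21", "A22"]
for r in range(7):
    print(gen_combination(f"S{r + 1}", A_blocks, [row[r] for row in U]))
```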


Practical issues

• Best way to do matrix additions? (in paper)

• Can we eliminate redundant linear combinations? (in paper)

• Problem shapes other than square (this talk)

• When to stop recursion? (this talk)

• How to parallelize? (this talk)


Recursion cutoff: look at the gemm curve

Basic idea: take another recursive step only if the sub-problems will still operate at high performance (example partitioning: <M, K, N> = <4, 2, 3>). A sketch of the rule follows the figures.

[Figures: Sequential dgemm performance, GFLOPS vs. dimension N, for the shapes N x 800 x 800, N x 800 x N, and N x N x N, with peak marked; Parallel dgemm performance per core on 24 cores.]
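A sketch of the cutoff rule under these assumptions; the threshold is a hypothetical placeholder to be read off the measured gemm curve, not a value from the talk.

```python
import numpy as np

# Hypothetical cutoff: the smallest dimension at which dgemm still
# runs near peak on this machine, read off the measured gemm curve
CUTOFF = 1500

def fast_mm(A, B, step, M=2, K=2, N=2):
    """Take another recursive step only if the subproblems, a factor
    <M, K, N> smaller, will still operate at high performance."""
    m, k = A.shape
    n = B.shape[1]
    if min(m // M, k // K, n // N) < CUTOFF:
        return A @ B        # base case: hand off to the vendor dgemm
    return step(A, B)       # e.g., the strassen_step sketched earlier
```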

Sequential performance

Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / time in seconds. Because this uses the classical flop count, a fast algorithm can exceed the true peak.

[Figure: Sequential performance on N x N x N, effective GFLOPS vs. dimension N, for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>; the true peak is marked.]
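The metric in code form, a direct transcription of the formula above (the timing harness is illustrative):

```python
import time
import numpy as np

def effective_gflops(M, K, N, seconds):
    """1e-9 * 2 * M * K * N / seconds: the classical flop count over
    time, so a fast algorithm can score above the true peak."""
    return 1e-9 * 2 * M * K * N / seconds

A, B = np.random.rand(2000, 800), np.random.rand(800, 2000)
t0 = time.perf_counter()
C = A @ B
print(effective_gflops(2000, 800, 2000, time.perf_counter() - t0))
```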

Sequential performance

[Figure: Sequential performance on N x N x N, effective GFLOPS vs. dimension N, for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>.]

• All algorithms beat MKL on large problems
• Strassen’s algorithm is hard to beat


Sequential performance

[Figure: Sequential performance on N x 1600 x N, effective GFLOPS vs. dimension N, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]

• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best


Sequential performance

[Figure: Sequential performance on N x 2400 x 2400, effective GFLOPS vs. dimension N, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]

• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best

Parallelization

[Diagram: the recursion tree, C at the root, each node expanding into subproblems M1, M2, ..., M7 whose results are combined (+).]

DFS Parallelization

[Diagram: recursion tree; all threads cooperate on one subproblem at a time, using parallel MKL at the base cases.]

+ Easy to implement
+ Load balanced
+ Same memory footprint as sequential
- Need large base cases for high performance


BFS Parallelization

[Diagram: recursion tree; the subproblems at each level run as independent tasks, one thread each, synchronized with omp taskwait.]

+ High performance for smaller base cases
- Sometimes harder to load balance: 24 threads, 49 subproblems
- More memory


HYBRID Parallelization

[Diagram: recursion tree; most subproblems run as single-threaded tasks and the leftover subproblems use all threads, synchronized with omp taskwait.]

+ Better load balancing
- Needs explicit synchronization, or else we can over-subscribe threads
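The implementation uses OpenMP tasks; purely as an illustration of the BFS/DFS distinction, here is a thread-pool sketch in Python (NumPy's BLAS-backed matmul stands in for MKL and releases the GIL, so the tasks genuinely overlap):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def strassen_pairs(A, B):
    """The 7 (Sr, Tr) operand pairs of one Strassen step."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    return [(A11 + A22, B11 + B22), (A21 + A22, B11), (A11, B12 - B22),
            (A22, B21 - B11), (A11 + A12, B22), (A21 - A11, B11 + B12),
            (A12 - A22, B21 + B22)]

def dfs_products(A, B):
    # DFS: one subproblem at a time; the parallelism lives inside
    # the multithreaded BLAS call, which wants large base cases
    return [S @ T for S, T in strassen_pairs(A, B)]

def bfs_products(A, B):
    # BFS: all 7 subproblems in flight at once, one task each; the
    # pool shutdown plays the role of omp taskwait (a real version
    # would also cap each task's BLAS at one thread)
    with ThreadPoolExecutor(max_workers=7) as pool:
        return list(pool.map(lambda p: p[0] @ p[1], strassen_pairs(A, B)))
```

HYBRID gives most subproblems one thread and the leftovers all threads: with 24 threads and 49 second-level subproblems, pure BFS strands 23 threads on the last task.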

[Figure: Parallel performance of <4,2,4> on <N,2800,N>: effective GFLOPS per core vs. dimension N, for MKL, DFS, BFS, and HYBRID on 6 and 24 cores.]

Bandwidth problems

• We rely on matrix multiplications being much more expensive than matrix additions
• Parallel dgemm on 24 cores: easily get 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write performance on 24 cores
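A rough way to quantify the concern; the machine numbers below are hypothetical placeholders, to be replaced with measured dgemm and STREAM rates.

```python
def add_seconds(N, stream_gbs):
    """An N x N double-precision add C = A + B moves about
    3 * 8 * N^2 bytes through memory."""
    return 3 * 8 * N**2 / (stream_gbs * 1e9)

def mul_seconds(N, gflops):
    """A classical N x N x N multiply does 2 * N^3 flops."""
    return 2 * N**3 / (gflops * 1e9)

# hypothetical 24-core rates: dgemm at 60% of a 460 GFLOPS peak,
# 60 GB/s STREAM bandwidth
for N in (1000, 4000, 16000):
    print(N, add_seconds(N, 60) / mul_seconds(N, 0.6 * 460))
# the ratio scales like 1/N: additions, done on ever-smaller
# sub-blocks as the recursion deepens, hit the bandwidth wall first
```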


Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: can sometimes beat MKL, but barely

[Figures: Performance on N x N x N at 6 cores and at 24 cores, effective GFLOPS per core, for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>.]

Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: MKL best for large problems

[Figures: Performance on N x 2800 x N at 6 cores and at 24 cores, effective GFLOPS per core, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN; the 24-core panel shows a region of bad MKL performance.]


Parallel performance

• 6 cores: similar performance to sequential
• 24 cores: MKL usually the best

[Figures: Performance on N x 3000 x 3000 at 6 cores and at 24 cores, effective GFLOPS per core, for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]

High-level conclusions

• For square matrix multiplication, Strassen’s algorithm is

hard to beat

• For rectangular matrix multiplication, use a fast algorithm

that “matches the shape”

• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication (it should be less of an issue in distributed memory)

Future work:

• Numerical stability

• Using fast matmul as a kernel for other algorithms in

numerical linear algebra



Matrix additions (linear combinations)

[Diagram: sub-blocks A11, A12, A21, A22 feed the linear combinations S1, ..., S7.]

• “Pairwise”: accumulate each Sr with a sequence of DAXPY calls (e.g., two DAXPYs for a two-term combination)
• “Write once”: form each Sr in a single pass with a custom fused “DAXPY”
• “Streaming”: form all the Sr together with entry-wise updates
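A NumPy caricature of the first two strategies (the generated code is C; einsum stands in for the custom fused loop):

```python
import numpy as np

def pairwise(blocks, coeffs):
    """'Pairwise': one DAXPY-style pass per term, so S is written
    (and re-read) once per term."""
    S = np.zeros_like(blocks[0])
    for A, a in zip(blocks, coeffs):
        S += a * A
    return S

def write_once(blocks, coeffs):
    """'Write once': a fused loop writes each entry of S exactly
    once (einsum approximates the generated custom C loop)."""
    return np.einsum('i,ijk->jk', np.asarray(coeffs, dtype=float),
                     np.asarray(blocks, dtype=float))

# 'Streaming' goes further still: all seven Sr are produced in one
# sweep over the Aij, with entry-wise updates to every output
```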


Common subexpression elimination (CSE)

• Example in the <4, 2, 4> algorithm (R = 26 multiplies): T11 and T25 are both linear combinations of B12, B22, B23, and B24

Without CSE: four additions, six reads, two writes.
With CSE (computing the shared subexpression Y once): three additions, six reads, three writes. Net increase in communication!
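In code form, with the slide's read/write counts as comments (block names follow the slide; the actual coefficients in the <4, 2, 4> algorithm differ, so the expressions are schematic):

```python
import numpy as np
B12, B22, B23, B24 = (np.random.rand(500, 500) for _ in range(4))

# Without CSE: four additions, six reads, two writes
T11 = B12 + B22 + B24   # reads B12, B22, B24; writes T11
T25 = B12 + B22 + B23   # reads B12, B22, B23; writes T25

# With CSE: three additions, six reads, three writes
Y = B12 + B22           # reads B12, B22;  writes Y
T11 = Y + B24           # reads Y, B24;    writes T11
T25 = Y + B23           # reads Y, B23;    writes T25
# one addition saved, but one extra block write: a net increase
# in data movement
```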


CSE does not really help

[Figure: effective GFLOPS for M x K x N multiplies (= 1e-9 * 2 * MKN / time in seconds) for variants with and without CSE; the recommended variant does not use CSE.]