Optimized LU-decomposition with Full Pivot for Small Batched Matrices

S3069

Ian Wainwright

High Performance Consulting Sweden

ian.wainwright@hpcsweden.se

Background: based on work for GTC 2012, which achieved a 10x speed-up vs. multi-threaded Intel MKL on a 4-core (8 HT) CPU.

http://www.hpcsweden.se/files/CUDA-based_LU_Factorization.pdf

Aim

To investigate various techniques for batched matrix computations.

The motivation behind this work is to use our previous work on LU-decomposition as a use case to investigate optimization techniques for Kepler.

Outline

1. LU-decomposition
2. Implementations
3. Performance gains
4. Conclusions


LU-decomposition

The idea is to transform A, where Ax = b, to an equivalent triangular system such that A = LU:

$$
\underbrace{\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14}\\
a_{21} & a_{22} & a_{23} & a_{24}\\
a_{31} & a_{32} & a_{33} & a_{34}\\
a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}}_{A}
=
\underbrace{\begin{bmatrix}
1 & 0 & 0 & 0\\
l_{21} & 1 & 0 & 0\\
l_{31} & l_{32} & 1 & 0\\
l_{41} & l_{42} & l_{43} & 1
\end{bmatrix}}_{L}
\,
\underbrace{\begin{bmatrix}
u_{11} & u_{12} & u_{13} & u_{14}\\
0 & u_{22} & u_{23} & u_{24}\\
0 & 0 & u_{33} & u_{34}\\
0 & 0 & 0 & u_{44}
\end{bmatrix}}_{U}
$$

LU-decomposition: worked example without pivot

Start with

$$
A = \begin{bmatrix} 1 & 4 & 7\\ 2 & 5 & 8\\ 3 & 6 & 10 \end{bmatrix},
\qquad
L = \begin{bmatrix} 0 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix}.
$$

Step 1: the pivot element is $a_{11} = 1$. Eliminate the values below it by subtracting multiples of the pivot row: row 2 := row 2 - (2/1) * row 1 and row 3 := row 3 - (3/1) * row 1. The multipliers 2 and 3 are stored in the first column of L:

$$
A \to \begin{bmatrix} 1 & 4 & 7\\ 0 & -3 & -6\\ 0 & -6 & -11 \end{bmatrix},
\qquad
L \to \begin{bmatrix} 0 & 0 & 0\\ 2 & 0 & 0\\ 3 & 0 & 0 \end{bmatrix}.
$$

Step 2: the pivot element is now -3. Eliminate below it: row 3 := row 3 - ((-6)/(-3)) * row 2 = row 3 - 2 * row 2, and store the multiplier 2 in L:

$$
A \to \begin{bmatrix} 1 & 4 & 7\\ 0 & -3 & -6\\ 0 & 0 & 1 \end{bmatrix},
\qquad
L \to \begin{bmatrix} 0 & 0 & 0\\ 2 & 0 & 0\\ 3 & 2 & 0 \end{bmatrix}.
$$

The multiplier must be shared with all elements of a row: synchronization and data sharing in the inner loop (see the sketch below).

With the unit diagonal filled in, the result is

$$
U = \begin{bmatrix} 1 & 4 & 7\\ 0 & -3 & -6\\ 0 & 0 & 1 \end{bmatrix},
\qquad
L = \begin{bmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 3 & 2 & 1 \end{bmatrix}.
$$
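To make the data-sharing point concrete, here is a minimal, hypothetical CUDA sketch of the unpivoted elimination loop. It assumes one matrix per block, column-major storage, a launch with exactly n (≤ 32) threads, and thread c owning column c in registers; it is not the presenter's code, which maps one matrix per warp and adds full pivoting.

```cuda
// Hypothetical sketch (not the presenter's code): unpivoted LU for one n x n
// matrix, launched with exactly n (<= 32) threads, thread c owning column c.
// A is stored column-major; on exit it holds the packed L\U factors.
__global__ void lu_nopivot_sketch(float* A, int n)
{
    const int c = threadIdx.x;                // this thread's column
    float col[32];
    for (int r = 0; r < n; ++r)
        col[r] = A[c * n + r];                // load column c into registers

    __shared__ float mult[32];                // multipliers of the current step

    for (int k = 0; k < n - 1; ++k) {
        if (c == k)                           // thread k owns the pivot column,
            for (int r = k + 1; r < n; ++r)
                mult[r] = col[r] / col[k];    // so it computes l(r,k) = a(r,k)/a(k,k)
        __syncthreads();                      // share the multipliers with every column

        if (c > k) {                          // update the trailing submatrix
            for (int r = k + 1; r < n; ++r)
                col[r] -= mult[r] * col[k];   // a(r,c) -= l(r,k) * a(k,c)
        } else if (c == k) {
            for (int r = k + 1; r < n; ++r)
                col[r] = mult[r];             // keep column k of L (unit diagonal implicit)
        }
        __syncthreads();                      // mult[] is reused in the next step
    }

    for (int r = 0; r < n; ++r)
        A[c * n + r] = col[r];                // write back L\U packed in place
}
```

The __syncthreads() barrier after writing mult[] is exactly the synchronization cost that the warp-synchronous and warp-shuffle variants discussed later try to remove.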

LU-decomposition: why pivot?

If the pivot element is zero, e.g.

$$
A = \begin{bmatrix} 0 & 4 & 7\\ 2 & 5 & 8\\ 3 & 6 & 10 \end{bmatrix},
$$

the multipliers 2/0 and 3/0 are undefined (shown as $-\infty$ on the slide): elimination breaks down without pivoting.

LU-decomposition: full pivot

Solution: perform a pivot.
1. Find the largest value in the bottom submatrix.
2. Swap rows and columns to make that largest value the pivot element.
3. Keep track of the row and column pivot in each step.
4. Then perform operations over rows as usual.

For the example above, swapping brings 10 to the pivot position:

$$
A = \begin{bmatrix} 10 & 6 & 3\\ 8 & 5 & 2\\ 7 & 4 & 0 \end{bmatrix},
$$

with multipliers 8/10 and 7/10. LU-decomposition with full pivot is stable.

This necessitates a max-reduction in each column iteration: finding the largest value again means synchronization and data sharing (a warp-level sketch follows).
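As a sketch of step 1 under the one-matrix-per-warp mapping described next, the argmax over the trailing submatrix can be done as a warp-level butterfly reduction. The helper below is hypothetical, not the talk's code; the Kepler-era version would use __shfl_xor without the _sync suffix and mask. Each lane scans its own column, then five shuffle rounds combine (value, row, column) across the warp.

```cuda
// Hypothetical per-step argmax for full pivoting under the one-matrix-per-warp
// mapping: lane c scans its own column, then a butterfly warp reduction finds
// the largest |value| and its (row, column) in the trailing submatrix A(k:, k:).
__device__ void argmax_pivot(const float* col, int n, int k, int c,
                             float* pivVal, int* pivRow, int* pivCol)
{
    float v = -1.0f;                          // best |value| seen by this lane
    int   r = k, q = c;
    if (c >= k && c < n) {                    // only columns in the trailing submatrix
        for (int i = k; i < n; ++i) {
            float a = fabsf(col[i]);
            if (a > v) { v = a; r = i; }
        }
    }
    for (int off = 16; off > 0; off >>= 1) {  // 5 butterfly rounds over the warp
        float ov  = __shfl_xor_sync(0xffffffffu, v, off);
        int   orw = __shfl_xor_sync(0xffffffffu, r, off);
        int   oq  = __shfl_xor_sync(0xffffffffu, q, off);
        if (ov > v) { v = ov; r = orw; q = oq; }
    }
    *pivVal = v; *pivRow = r; *pivCol = q;    // same result in every lane
}
```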

Implementations

Problem size:
• Matrices of at most 32 rows or columns, of any shape, i.e. both rectangular and square.
• Batches of 10 000 matrices.
• Find L and U for matrix A such that PAQ = LU with full pivoting, where P and Q are the pivot vectors.

Implementations

Mapping of the problem:
• Map one matrix to a warp.
• Map a column to a thread.
• One or more matrices per block.

With one matrix per block, each 4x4 example matrix gets its own block (Block 0, Block 1, ...), handled by Warp 0 with threads t.x 0-3 each owning a column. With two matrices per block, Block 0 holds both matrices: Warp 0 (t.x 0-3) works on the first and Warp 1 (t.x 32-35) on the second.

The number of matrices per block affects the occupancy.
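A minimal indexing sketch of this mapping follows. The kernel name, the MATS_PER_BLOCK constant, and the contiguous column-major batch layout are placeholder assumptions, not the talk's actual code.

```cuda
// Hypothetical indexing sketch: one warp per matrix, one thread per column,
// MATS_PER_BLOCK matrices per block.
#define MATS_PER_BLOCK 2                           // 1, 2 or 4 in the talk

__global__ void batched_lu_kernel(float* batch, int n, int numMatrices)
{
    int warpInBlock = threadIdx.x / 32;            // which matrix within this block
    int lane        = threadIdx.x % 32;            // this thread's column
    int matrix      = blockIdx.x * MATS_PER_BLOCK + warpInBlock;
    if (matrix >= numMatrices || lane >= n) return;

    float* A = batch + (size_t)matrix * n * n;     // this warp's n x n matrix
    // ... factorize A with thread 'lane' owning column 'lane' ...
}

// Host-side launch for a batch of 10 000 matrices:
//   dim3 block(32 * MATS_PER_BLOCK);
//   dim3 grid((10000 + MATS_PER_BLOCK - 1) / MATS_PER_BLOCK);
//   batched_lu_kernel<<<grid, block>>>(d_batch, n, 10000);
```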

Implementations

Aim: to investigate various techniques for batched matrix computations: synchronization, occupancy, cache configuration, and their interaction.

Synchronization and data sharing:
1. Shared memory with __syncthreads().
2. Shared memory using warp-synchronous programming.
3. Warp shuffle.

Multiple matrices per block (occupancy):
1. 1 matrix per block.
2. 2 matrices per block.
3. 4 matrices per block.

Cache config:
1. Prefer shared.
2. Prefer L1.
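As a rough illustration of the three data-sharing variants, the hypothetical helpers below each broadcast one value (for example the pivot element held by lane srcLane) to the rest of a warp. Note that variant 2 relies on implicit warp lock-step execution, common practice on Kepler but unsafe on Volta and later without __syncwarp(), and that __shfl_sync replaces the Kepler-era __shfl.

```cuda
// Hypothetical helpers contrasting the three data-sharing variants.
// Names are placeholders, not the talk's code.

// 1. Shared memory with __syncthreads(): safe across warps, costs a block barrier.
__device__ float bcast_syncthreads(float v, int srcLane, int lane, float* smem)
{
    if (lane == srcLane) *smem = v;
    __syncthreads();
    return *smem;
}

// 2. Warp-synchronous shared memory: same traffic, no barrier; relies on the
//    implicit lock-step execution of a warp (volatile stops register caching).
__device__ float bcast_warpsync(float v, int srcLane, int lane, volatile float* smem)
{
    if (lane == srcLane) *smem = v;
    return *smem;
}

// 3. Warp shuffle: register-to-register broadcast, no shared memory at all.
//    (Kepler-era code: __shfl(v, srcLane) without the mask.)
__device__ float bcast_shuffle(float v, int srcLane)
{
    return __shfl_sync(0xffffffffu, v, srcLane);
}
```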


Performance gains

All benchmarks were performed on a standard GTX 680. All numbers are based on execution time for a batch of 10 000 matrices for various sizes. Execution time is in ms. To save time, we will only be looking at performance numbers for square matrices.


Performance gains: shared memory with __syncthreads()

[Charts: execution time (ms) vs. matrix size (0-32) for the __syncthreads() version, under Prefer L1 and Prefer shared, with curves for 1, 2 and 4 matrices per block.]


Performance gains: shared memory, warp-synchronous

[Charts: execution time (ms) vs. matrix size (0-32) for the warp-synchronous version, under Prefer L1 and Prefer shared, with curves for 1, 2 and 4 matrices per block.]


No surprise: we use shared memory for all data sharing and reduction, so preferring shared gives us more shared memory, higher occupancy, and better performance.
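The cache configuration being compared here is the per-kernel L1/shared split on Kepler (64 KB per SM), which can be requested from the host. A minimal sketch; the kernel symbol is a placeholder for whichever variant is launched:

```cuda
#include <cuda_runtime.h>

// Placeholder declaration for the kernel variant being configured.
extern __global__ void batched_lu_kernel(float*, int, int);

void configure_cache()
{
    // Shared-memory variants: ask for the 48 KB shared / 16 KB L1 split.
    cudaFuncSetCacheConfig(batched_lu_kernel, cudaFuncCachePreferShared);

    // Warp-shuffle variant barely touches shared memory, so prefer L1 instead:
    // cudaFuncSetCacheConfig(batched_lu_kernel, cudaFuncCachePreferL1);
}
```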


Performance gains: warp-synchronous vs. __syncthreads()

[Charts: execution time (ms) vs. matrix size for the warp-synchronous and __syncthreads() versions, plus the warp-synchronous speed-up over __syncthreads(); curves for 1, 2 and 4 matrices per block.]

Roughly a 1.7x speed-up when using warp-synchronous programming.


Performance gains: warp shuffle

[Charts: execution time (ms) vs. matrix size (0-32) for the warp-shuffle version, under Prefer L1 and Prefer shared, with curves for 1, 2 and 4 matrices per block.]


Performance gains: warp shuffle vs. warp-synchronous

[Charts: execution time (ms) vs. matrix size for the warp-shuffle and warp-synchronous versions, plus the warp-shuffle speed-up over warp-synchronous; curves for 1, 2 and 4 matrices per block.]

Roughly a 1.2x speed-up when using warp shuffle over warp-synchronous.


Performance gains: matrices per block (warp shuffle, Prefer L1)

[Charts: execution time (ms) vs. matrix size for 1, 2 and 4 matrices per block, plus the speed-up vs. 1 matrix per block.]

2 matrices per block is the most consistent. We can gain up to 2x by mapping 2 or 4 matrices to each block; on average we gain 1.5x.


Conclusions

1. Warp-synchronous is 1.7x faster than __syncthreads().
2. Warp shuffle is 1.2x faster than warp-synchronous.
3. 2 matrices per block is 1.5x faster than 1 matrix per block.
4. The optimal cache config varies with the implementation: prefer shared if you use shared memory; prefer L1 if you use warp shuffle.

Questions?

Ian Wainwright

High Performance Consulting Sweden

ian.wainwright@hpcsweden.se
