GPU Accelerated Gauss-Jordan Elimination on the OpenPOWER platform – A case study

Martin Köhler

March 8, 2017

GAMM Annual Meeting, Scientific Computing Section



Motivation

Matrix sign function: We consider the Newton iteration to compute the matrix sign function $X_\infty := \operatorname{sign}(A)$ of a matrix $A \in \mathbb{R}^{n \times n}$:

$$ X_{k+1} = \frac{1}{2}\left(\mu_k X_k + \mu_k^{-1} X_k^{-1}\right), \qquad X_0 = A, $$

where $\mu_k$ is a scaling factor, typically $\mu_k := |\det X_k|^{-\frac{1}{n}}$.

Applications:

- Computation of invariant subspaces of $A$
- Solution of the standard Lyapunov equation
- Solution of the standard Riccati equation

We need to compute $A^{-1}$ or to solve $AY = B$ with many right-hand sides.

Martin Köhler ([email protected]) – GPU Accelerated Gauss-Jordan Elimination – 2/17
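To make the iteration above concrete, here is a small NumPy sketch of the scaled Newton iteration. The function name, tolerance, and iteration cap are our own choices, and the explicit inverse is only for illustration; a production code would solve a linear system instead, which is exactly what the rest of the talk is about.

```python
import numpy as np

def sign_newton(A, tol=1e-12, maxit=50):
    """Newton iteration X_{k+1} = (mu_k X_k + mu_k^{-1} X_k^{-1}) / 2
    with determinantal scaling mu_k = |det X_k|^{-1/n}."""
    n = A.shape[0]
    X = A.astype(float).copy()
    for _ in range(maxit):
        Xinv = np.linalg.inv(X)          # illustration only; prefer a solve
        # compute |det X|^{-1/n} via slogdet to avoid over-/underflow
        _, logdet = np.linalg.slogdet(X)
        mu = np.exp(-logdet / n)
        Xnew = 0.5 * (mu * X + Xinv / mu)
        if np.linalg.norm(Xnew - X, 1) <= tol * np.linalg.norm(Xnew, 1):
            X = Xnew
            break
        X = Xnew
    return X
```

For a diagonalizable matrix with no purely imaginary eigenvalues the iterates converge quadratically to a matrix with eigenvalues $\pm 1$.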


Motivation

Matrix sign function of a pencil: We consider the Newton iteration to compute the matrix sign function $X_\infty := \operatorname{sign}(A, B)$ of a matrix pencil $(A, B) \in \mathbb{R}^{n \times n} \times \mathbb{R}^{n \times n}$:

$$ X_{k+1} = \frac{1}{2}\left(\mu_k X_k + \mu_k^{-1} B X_k^{-1} B\right), \qquad X_0 = A, $$

where $\mu_k$ is a scaling factor, typically $\mu_k := \left(\frac{|\det X_k|}{|\det B|}\right)^{-\frac{1}{n}}$.

Applications:

- Computation of deflating subspaces of $(A, B)$
- Solution of the generalized Lyapunov equation
- Solution of the generalized Riccati equation

We need to compute $A^{-1}$ or to solve $AY = B$ with many right-hand sides.


Motivation

Computation of $A^{-1}$ (as in LAPACK/MAGMA):

1. Compute the LU decomposition of $A$ – cost: $\frac{2}{3}n^3$
2. Invert $U$ – cost: $\frac{1}{3}n^3$
3. Solve $L A^{-1} = U^{-1}$ – cost: $n^3$

Solution of $AY = B$:

1. Compute the LU decomposition of $A$ – cost: $\frac{2}{3}n^3$
2. Solve $LZ = B$ – cost: $n^3$
3. Solve $UY = Z$ – cost: $n^3$

All three steps are of cubic complexity, and solving with triangular matrices is not well suited for (massively) parallel architectures. In both cases we want to combine all three steps into a single, easily parallelizable algorithm:

→ Gauss-Jordan Elimination


Motivation – Hardware

OpenPOWER: An industry alliance (Google, IBM, Canonical, RedHat, Mellanox, Nvidia, ...) to develop a customizable server platform for data centers, high-performance computing, ..., on top of an Open Source philosophy.

IBM Power System 822LC for HPC:

- 2x IBM POWER8 CPUs (each: 10 cores, 8-way SMT, 10x 512 KB L2 cache, 10x 8 MB L3 cache, 4.00 GHz)
- 256 GB DDR4 memory, bandwidth: 230 GB/s
- 2x Nvidia Tesla P100 SXM2 accelerators with 16 GB HBM2 memory
- NVLink CPU-GPU interconnect, bidirectional bandwidth: 80 GB/s
- Theoretical peak performance [TFlops/s]: 10.6 (DP), 21.2 (SP), 42.4 (HP)
- Staging system for the upcoming POWER9 + Nvidia Volta architecture

The CPU performance is comparable with a 2x 8-core Intel Haswell v3; the GPU performance is ≈ 5 times higher than that of Kepler-generation GPUs (a K20 or a single GPU of a K80).

→ Increase the GPU load on the OpenPOWER platform to obtain high performance.


Gauss-Jordan Elimination

The Gauss-Jordan Elimination algorithm is mostly known as a way to invert matrices with $2n^3$ flops. Solving a linear system with $n$ right-hand sides costs:

→ $2n^3 + 2n^3$ flops using classical Gauss-Jordan Elimination.
→ $\frac{2}{3}n^3 + 2n^3$ flops using LU decomposition.

We modify the Gauss-Jordan Elimination scheme such that:

- the multiplication with the inverse is done implicitly,
- and the inverse is not accumulated.

Goals:

- Reduce the number of necessary flops
- Increase memory access locality


Gauss-Jordan Elimination

We consider the augmented matrix $D$,

$$ D := \left[\, A \mid B \,\right] = \begin{bmatrix} a_{11} & \cdots & a_{1m} & b_{11} & \cdots & b_{1n} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{m1} & \cdots & a_{mm} & b_{m1} & \cdots & b_{mn} \end{bmatrix}, $$

and apply a set of transformations $\bar G_i$ from the left such that we obtain:

$$ \underbrace{\bar G_n \cdots \bar G_2 \bar G_1}_{A^{-1}}\, D = \begin{bmatrix} 1 & & & y_{11} & \cdots & y_{1n} \\ & \ddots & & \vdots & & \vdots \\ & & 1 & y_{m1} & \cdots & y_{mn} \end{bmatrix}. $$
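The elimination on the augmented matrix admits a compact unblocked sketch; `gauss_jordan_solve` is a hypothetical name, and partial pivoting is folded in as the row swaps:

```python
import numpy as np

def gauss_jordan_solve(A, B):
    """Solve A Y = B by Gauss-Jordan elimination on D = [A | B].
    After m steps the left block is the identity, the right block is Y."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    m = A.shape[0]
    D = np.hstack([A, B])
    for i in range(m):
        # partial pivoting: bring the largest entry of column i to the diagonal
        p = i + np.argmax(np.abs(D[i:, i]))
        D[[i, p]] = D[[p, i]]
        D[i] /= D[i, i]                        # row scaling: D_i := D_i / d_ii
        rows = np.arange(m) != i
        D[rows] -= np.outer(D[rows, i], D[i])  # eliminate column i everywhere else
    return D[:, m:]
```

Each loop iteration applies one transformation $\bar G_i$: a row swap, a row scaling, and a rank-1 update that zeroes column $i$ in all other rows.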

Gauss-Jordan Elimination

We define $\bar G_i = G_i P_i$ as the product of a row permutation $P_i$ and a Gauss transformation $G_i$:

$$
G_i = \begin{bmatrix}
1 &        & -\frac{d_{1i}}{d_{ii}}     &   &        &   \\
  & \ddots & \vdots                     &   &        &   \\
  &        & -\frac{d_{(i-1)i}}{d_{ii}} &   &        &   \\
  &        & \frac{1}{d_{ii}}           &   &        &   \\
  &        & -\frac{d_{(i+1)i}}{d_{ii}} & 1 &        &   \\
  &        & \vdots                     &   & \ddots &   \\
  &        & -\frac{d_{mi}}{d_{ii}}     &   &        & 1
\end{bmatrix},
$$

where the $d_{kl}$ are the entries of $D$ after applying $G_{i-1} \cdots G_1$.

Example – applying the $G_i$ (for $m = 4$):

$$
D = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} & b_{11} & \dots & b_{1n} \\
a_{21} & a_{22} & a_{23} & a_{24} & b_{21} & \dots & b_{2n} \\
a_{31} & a_{32} & a_{33} & a_{34} & b_{31} & \dots & b_{3n} \\
a_{41} & a_{42} & a_{43} & a_{44} & b_{41} & \dots & b_{4n}
\end{bmatrix}
$$

$$
G_1 D = \begin{bmatrix}
1 & a_{12}^{(1)} & a_{13}^{(1)} & a_{14}^{(1)} & b_{11}^{(1)} & \dots & b_{1n}^{(1)} \\
0 & a_{22}^{(1)} & a_{23}^{(1)} & a_{24}^{(1)} & b_{21}^{(1)} & \dots & b_{2n}^{(1)} \\
0 & a_{32}^{(1)} & a_{33}^{(1)} & a_{34}^{(1)} & b_{31}^{(1)} & \dots & b_{3n}^{(1)} \\
0 & a_{42}^{(1)} & a_{43}^{(1)} & a_{44}^{(1)} & b_{41}^{(1)} & \dots & b_{4n}^{(1)}
\end{bmatrix}
$$

$$
G_2 G_1 D = \begin{bmatrix}
1 & 0 & a_{13}^{(2)} & a_{14}^{(2)} & b_{11}^{(2)} & \dots & b_{1n}^{(2)} \\
0 & 1 & a_{23}^{(2)} & a_{24}^{(2)} & b_{21}^{(2)} & \dots & b_{2n}^{(2)} \\
0 & 0 & a_{33}^{(2)} & a_{34}^{(2)} & b_{31}^{(2)} & \dots & b_{3n}^{(2)} \\
0 & 0 & a_{43}^{(2)} & a_{44}^{(2)} & b_{41}^{(2)} & \dots & b_{4n}^{(2)}
\end{bmatrix}
$$

$$
G_3 G_2 G_1 D = \begin{bmatrix}
1 & 0 & 0 & a_{14}^{(3)} & b_{11}^{(3)} & \dots & b_{1n}^{(3)} \\
0 & 1 & 0 & a_{24}^{(3)} & b_{21}^{(3)} & \dots & b_{2n}^{(3)} \\
0 & 0 & 1 & a_{34}^{(3)} & b_{31}^{(3)} & \dots & b_{3n}^{(3)} \\
0 & 0 & 0 & a_{44}^{(3)} & b_{41}^{(3)} & \dots & b_{4n}^{(3)}
\end{bmatrix}
$$

$$
\underbrace{G_4 G_3 G_2 G_1}_{A^{-1}} D = \begin{bmatrix}
1 & 0 & 0 & 0 & y_{11} & \dots & y_{1n} \\
0 & 1 & 0 & 0 & y_{21} & \dots & y_{2n} \\
0 & 0 & 1 & 0 & y_{31} & \dots & y_{3n} \\
0 & 0 & 0 & 1 & y_{41} & \dots & y_{4n}
\end{bmatrix}
$$

Blocked Algorithm

The application of $G_i$ can be replaced by a rank-1 update and a row scaling in order to work in-place:

$$ D := D - \frac{1}{d_{ii}} \left( d_{1i}, \dots, d_{(i-1)i}, 0, d_{(i+1)i}, \dots, d_{mi} \right)^T D_{i,\cdot}, \qquad D_{i,\cdot} := \frac{1}{d_{ii}} D_{i,\cdot}. $$

By repartitioning $D$ into blocks,

$$ D := \begin{bmatrix} A_{11} & A_{12} & A_{13} & b_1 \\ A_{21} & A_{22} & A_{23} & b_2 \\ A_{31} & A_{32} & A_{33} & b_3 \end{bmatrix}, $$

where $A_{22} \in \mathbb{R}^{N_B \times N_B}$, we transform the rank-1 update into a rank-$N_B$ update:

$$ D := \begin{bmatrix} A_{11} & 0 & A_{13} & b_1 \\ 0 & 0 & 0 & 0 \\ A_{31} & 0 & A_{33} & b_3 \end{bmatrix} + \underbrace{\begin{bmatrix} -A_{12} A_{22}^{-1} \\ A_{22}^{-1} \\ -A_{32} A_{22}^{-1} \end{bmatrix}}_{H} \begin{bmatrix} A_{21} & I_{N_B} & A_{23} & b_2 \end{bmatrix}. $$

- The algorithm mainly relies on rank-$N_B$ updates.
- Pivoting is integrated via an LU decomposition of $\begin{bmatrix} A_{22}^T & A_{32}^T \end{bmatrix}^T$.

If only the solution of $AY = B$ is required, the update reduces to

$$ \begin{bmatrix} A_{13} & b_1 \\ 0 & 0 \\ A_{33} & b_3 \end{bmatrix} \leftarrow \begin{bmatrix} A_{13} & b_1 \\ 0 & 0 \\ A_{33} & b_3 \end{bmatrix} + \underbrace{\begin{bmatrix} -A_{12} A_{22}^{-1} \\ A_{22}^{-1} \\ -A_{32} A_{22}^{-1} \end{bmatrix}}_{H} \begin{bmatrix} A_{23} & b_2 \end{bmatrix}. $$

→ Reduces the flop count to $n^3 + 2n^3$.

If only the inverse $A^{-1}$ is necessary, we obtain

$$ \begin{bmatrix} A_{11} & 0 & A_{13} \\ 0 & 0 & 0 \\ A_{31} & 0 & A_{33} \end{bmatrix} \leftarrow \begin{bmatrix} A_{11} & 0 & A_{13} \\ 0 & 0 & 0 \\ A_{31} & 0 & A_{33} \end{bmatrix} + \underbrace{\begin{bmatrix} -A_{12} A_{22}^{-1} \\ A_{22}^{-1} \\ -A_{32} A_{22}^{-1} \end{bmatrix}}_{H} \begin{bmatrix} A_{21} & I_{N_B} & A_{23} \end{bmatrix}. $$

→ Same flop count $2n^3$ as with LU decomposition.
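A serial sketch of the solve-only blocked variant may help fix ideas. The name `gauss_jordan_blocked` is our own, and pivoting is omitted for brevity, so this sketch assumes the diagonal blocks stay safely invertible (e.g. a positive definite matrix):

```python
import numpy as np

def gauss_jordan_blocked(A, B, NB=64):
    """Blocked Gauss-Jordan solve of A Y = B: each step forms the panel
    H = [-A12 A22^{-1}; A22^{-1}; -A32 A22^{-1}] and applies one
    rank-NB GEMM update to the columns right of the panel."""
    m = A.shape[0]
    D = np.hstack([A.astype(float), B.astype(float)])
    for i in range(0, m, NB):
        j = min(i + NB, m)
        A22inv = np.linalg.inv(D[i:j, i:j])     # panel inverse (no pivoting)
        H = np.vstack([
            -D[:i, i:j] @ A22inv,
            A22inv,
            -D[j:, i:j] @ A22inv,
        ])
        row_block = D[i:j, j:].copy()           # [A23 b2]
        D[i:j, j:] = 0.0                        # middle rows start from zero
        D[:, j:] += H @ row_block               # the rank-NB GEMM update
        D[:, i:j] = 0.0                         # processed columns -> identity
        D[i:j, i:j] = np.eye(j - i)
    return D[:, m:]
```

The bulk of the work is the single `H @ row_block` GEMM per panel, which is the operation the talk offloads to the GPUs.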

Multi-GPU Implementation

The algorithm has the following properties:

- the computation of the panel matrix $H$ works inside a block column,
- the rank-$N_B$ update with the matrix $H$ is a GEMM operation.

→ The data layout must cover both the computation of the panel matrix $H$ and the GEMM operation.

Data Layout: Only $O(1)$ GPUs are available in one server → Column Block Cyclic (CBC) distribution of the matrix $D$:

[Figure: the block columns of $D$ are assigned cyclically to GPU 1 and GPU 2.]


Multi-GPU Implementation

Basic GPU Workflow

After separation into CPU-aware and GPU-aware operations we have to:

1. Compute the panel matrix $H$ on the host CPU,
2. Copy the panel matrix $H$ to all GPUs,
3. Perform the rank-$N_B$ update in parallel on the distributed matrix $D$.

Parallel rank-$N_B$ update:

$$ \begin{bmatrix} A_{13} & b_1 \\ 0 & 0 \\ A_{33} & b_3 \end{bmatrix} \leftarrow \begin{bmatrix} A_{13} & b_1 \\ 0 & 0 \\ A_{33} & b_3 \end{bmatrix} + \underbrace{\begin{bmatrix} -A_{12} A_{22}^{-1} \\ A_{22}^{-1} \\ -A_{32} A_{22}^{-1} \end{bmatrix}}_{H} \begin{bmatrix} A_{23} & b_2 \end{bmatrix}. $$


Multi-GPU Implementation

Look-Ahead and Asynchronous Operation

We split the right part of $D$ into

$$ \begin{bmatrix} A_{13} & b_1 \\ A_{23} & b_2 \\ A_{33} & b_3 \end{bmatrix} := \begin{bmatrix} \hat A_{13} & A_{13} & b_1 \\ \hat A_{23} & A_{23} & b_2 \\ \hat A_{33} & A_{33} & b_3 \end{bmatrix}, $$

where $\hat A_{23} \in \mathbb{R}^{N_B \times N_B}$ is the next panel block, and perform the update in two steps as

$$ \begin{bmatrix} \hat A_{13} \\ 0 \\ \hat A_{33} \end{bmatrix} \leftarrow \begin{bmatrix} \hat A_{13} \\ 0 \\ \hat A_{33} \end{bmatrix} + H \hat A_{23} \qquad \text{(Look-Ahead GEMM)} $$

and

$$ \begin{bmatrix} A_{13} & b_1 \\ 0 & 0 \\ A_{33} & b_3 \end{bmatrix} \leftarrow \begin{bmatrix} A_{13} & b_1 \\ 0 & 0 \\ A_{33} & b_3 \end{bmatrix} + H \begin{bmatrix} A_{23} & b_2 \end{bmatrix}. \qquad \text{(Remaining GEMM)} $$

Execution using one GPU:

[Figure: timeline over the computation time – GPU: Upload $D$, then per step a Look-Ahead GEMM followed by the Remaining GEMM, finally Download $Y$; CPU: Compute $H$, with the computation of the next $H$ overlapping the Remaining GEMM of the current step.]
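The ordering in the timeline can be mimicked by a tiny two-worker schedule. This is a toy model with hypothetical names (`compute_H`, `remaining_gemm`), not real CUDA streams; it only shows that the host prepares the panel of step $k+1$ while the device still applies the update of step $k$:

```python
from concurrent.futures import ThreadPoolExecutor

def compute_H(k):            # stands in for the host-side panel factorization
    return f"H{k}"

def remaining_gemm(k, H):    # stands in for the device-side rank-NB update
    return f"step{k}:{H}"

steps = 3
log = []
# one worker plays the "device"; the main thread plays the "host"
with ThreadPoolExecutor(max_workers=1) as device:
    H = compute_H(0)                              # initial panel, host only
    for k in range(steps):
        fut = device.submit(remaining_gemm, k, H)  # device busy with step k
        if k + 1 < steps:
            H = compute_H(k + 1)                   # host overlaps step k+1
        log.append(fut.result())
print(log)   # ['step0:H0', 'step1:H1', 'step2:H2']
```

In the real implementation the `submit` calls correspond to asynchronous kernel launches and transfers on CUDA streams, and `result()` to the synchronization point before the next panel column is touched.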


Multi-GPU Implementation

Further Caveats

General:

- We have to use row-major storage on the GPUs to reduce the number of cache misses during pivoting. → The algorithm works implicitly on the transpose of the matrix $D$.
- All memory locations on the host need to be page-aligned and page-locked.

Performance shift on OpenPOWER:

- Performance gap between CPUs and GPUs: ≈ 20×
- Higher bandwidths between main memory, CPU, GPU, and GPU memory

→ Panel preparation on the host is slow, even with multi-threaded BLAS.

Fine-grained computation of $H$:

Input: current panel $\begin{bmatrix} A_{12}^T & A_{22}^T & A_{32}^T \end{bmatrix}^T$
Output: update matrix $H$

1: Compute the LU decomposition $\begin{bmatrix} L_1 \\ L_2 \end{bmatrix} U = P \begin{bmatrix} A_{22} \\ A_{32} \end{bmatrix}$
2: Enqueue the permutation $P$ on the devices.
3: Enqueue the preparation of the rank-$N_B$ updates on the devices.
4: Compute $H := \begin{bmatrix} -A_{12} U^{-1} L_1^{-1} \\ U^{-1} L_1^{-1} \\ -L_2 L_1^{-1} \end{bmatrix}$
5: Upload $H$ to the devices.
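Steps 1 and 4 of this procedure can be sketched serially in NumPy. The helper `panel_H` and its hand-rolled tall-skinny LU are our own constructions for illustration; a real code would call a LAPACK `getrf` on the stacked panel. Note that $U^{-1}L_1^{-1} = (L_1 U)^{-1}$ is the inverse of the permuted $A_{22}$ block, so this reproduces $H = [-A_{12}A_{22}^{-1};\, A_{22}^{-1};\, -A_{32}A_{22}^{-1}]$ up to the row permutation $P$:

```python
import numpy as np

def panel_H(A12, A22, A32):
    """Step 1: LU-factor the stacked panel, P [A22; A32] = [L1; L2] U,
    with partial pivoting.  Step 4: assemble
    H = [-A12 U^{-1} L1^{-1};  U^{-1} L1^{-1};  -L2 L1^{-1}].
    Returns H and the row permutation piv to replay on the devices."""
    nb = A22.shape[1]
    M = np.vstack([A22, A32]).astype(float)
    piv = np.arange(M.shape[0])
    for k in range(nb):                     # tall-skinny LU with row pivoting
        p = k + np.argmax(np.abs(M[k:, k]))
        M[[k, p]] = M[[p, k]]
        piv[[k, p]] = piv[[p, k]]
        M[k + 1:, k] /= M[k, k]
        M[k + 1:, k + 1:] -= np.outer(M[k + 1:, k], M[k, k + 1:])
    L1 = np.tril(M[:nb], -1) + np.eye(nb)   # unit lower triangular top block
    U = np.triu(M[:nb])
    L2 = M[nb:].copy()                      # multipliers of the lower block
    L1inv = np.linalg.inv(L1)
    mid = np.linalg.inv(U) @ L1inv          # = (permuted A22)^{-1}
    H = np.vstack([-A12 @ mid, mid, -L2 @ L1inv])
    return H, piv
```

Splitting $H$ into these triangular pieces is what allows the host to overlap its small solves with the enqueued device work in steps 2 and 3.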




Numerical Results
Hardware and Software Environment

OpenPOWER 8 System

Hardware as given in the Motivation with 2x P100 accelerators

CentOS 7.3 for ppc64le with custom 4.8.6 Linux Kernel

IBM XLC 13.1.5 and IBM XLF 15.1.5 compilers

CUDA 8.0

IBM ESSL 5.5 as BLAS and LAPACK library on the host

“Old” Reference System

2x Intel Xeon E5-2640v3 (8 Cores, 8x 256kB L2 Cache, 20MB L3 Cache)

64 GB DDR3 memory

2x Nvidia Tesla K20m accelerators

CentOS 7.3 with Intel Parallel Studio 2017.1 including MKL 2017.1

CUDA 8.0




Numerical Results

Solution of the linear system:

Random matrix A ∈ Rn×n with n = 1024, 2048, . . .

Right hand side B ∈ Rn×n as B := A · ones(n, n)

Block size varying from 64 to 768 (P100) / 1,024 (K20)

IEEE Double Precision
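The setup is easy to reconstruct: with B := A · ones(n, n), the exact solution of A Y = B is the all-ones matrix, so the forward error of any solver is directly measurable. A small-scale NumPy illustration (the slides use n ≥ 1024):

```python
# Reconstruction of the experimental setup: random A, rhs B := A @ ones,
# so the exact solution of A Y = B is ones(n, n).
import numpy as np

n = 512                                   # small n for illustration
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = A @ np.ones((n, n))

Y = np.linalg.solve(A, B)                 # reference LAPACK-based solve
fwd_err = np.max(np.abs(Y - 1.0))         # error against exact solution
assert fwd_err < 1e-8
```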



Numerical Results
Optimal Blocksize, n = 5,120

[Figure: flop rate in GFlop/s vs. block size NB (64 to 1,024). Optimal NB = 64 for OpenPOWER 8 (2x P100); optimal NB = 192 for the old reference (2x K20).]



Numerical Results
Optimal Blocksize, n = 10,240

[Figure: flop rate in GFlop/s vs. block size NB (64 to 1,024). Optimal NB = 256 for OpenPOWER 8 (2x P100); optimal NB = 384 for the old reference (2x K20).]



Numerical Results
Optimal Blocksize, n = 15,360

[Figure: flop rate in GFlop/s vs. block size NB (64 to 1,024). Optimal NB = 192 for OpenPOWER 8 (2x P100); optimal NB = 704 for the old reference (2x K20).]



Numerical Results
Optimal Blocksize, n = 20,480

[Figure: flop rate in GFlop/s vs. block size NB (64 to 1,024). Optimal NB = 192 for OpenPOWER 8 (2x P100); optimal NB = 704 for the old reference (2x K20).]



Numerical Results
Optimal Blocksize, n = 30,720

[Figure: flop rate in GFlop/s vs. block size NB (64 to 1,024). Optimal NB = 384 for OpenPOWER 8 (2x P100); no K20 data at this size.]



Numerical Results
Optimal Blocksize, n = 40,960

[Figure: flop rate in GFlop/s vs. block size NB (64 to 1,024). Optimal NB = 640 for OpenPOWER 8 (2x P100); no K20 data at this size.]



Numerical Results
Performance with optimal blocksizes

[Figure: flop rate in GFlop/s vs. matrix size n (6,144 to 43,008). OpenPOWER 8 (2x P100): 8.89 TFlop/s = 83.9% of peak. Old reference (2x K20): 1.8 TFlop/s = 76.9% of peak. Speedup between the systems: ×4.15.]
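The quoted peak fractions are consistent with the nominal FP64 peaks of the accelerators (2 × 5.3 TFlop/s for the P100 pair and 2 × 1.17 TFlop/s for the K20 pair; these peak values are assumed here, not stated on the slide):

```python
# Sanity check of the quoted efficiencies against assumed nominal FP64
# peaks: P100 (NVLink) 5.3 TFlop/s each, K20 1.17 TFlop/s each.
peak_p100 = 2 * 5.3                              # TFlop/s, double precision
peak_k20 = 2 * 1.17

assert abs(8.89 / peak_p100 - 0.839) < 0.005     # matches "83.9% of peak"
assert abs(1.8 / peak_k20 - 0.769) < 0.005       # matches "76.9% of peak"
```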



Conclusions

Speedup of up to 5× between the K20 and the P100.

Hybrid CPU-GPU algorithms are getting complicated due to the large performance differences.

High bandwidth and smaller latencies yield smaller block sizes for optimal performance.

“The Power System 822LC is an HPC Cluster in one server.”

Outlook and Future Work

Implementation of an out-of-core solver

Develop a fully integrated GPU accelerated matrix-sign function

Thank you for your attention!


