RWTH Aachen :: High Performance Matrix Computation

1592653589793238462643383279502884197169399375491898018

4159265358979323846264338327950288419716939937549189801

1415926535897932384626433832795028841971693993754918980

926535897932384626433832795028841971693993754918980183

1592653589793238462643383279502884197169399375491898018

4159265358979323846264338327950288419716939937549189801

5926535897932384626433832795028841971693993754918980183

4159265358979323846264338327950288419716939937549189801

1592653589793238462643383279502884197169399375491898018

1592653589793238462643383279502884197169399375491898018

5926535897932384626433832795028841971693993754918980183

1415926535897932384626433832795028841971693993754918980

5926535897932384626433832795028841971693993754918980183

4159265358979323846264338327950288419716939937549189801

1592653589793238462643383279502884197169399375491898018

1415926535897932384626433832795028841971693993754918980

High Performance Linear Algebraon Data Parallel Co-Processors II

2

[email protected] University of Toronto / Technische Universität Berlin

SMEM

L1

SMEM

L1

SMEM

L1

SMEM

L1

SMEM

L1

SMEM

L1

SMEM

L1

SMEM

L1

L2

Global Memory

Host

Input Assembler

Thread Scheduler

Data parallel co-processors

3


Memory latency

!

!

! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! !

, , ,! ! ! ! ! ! ! !! ! ! ! ( ! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !!_Y(!&99L'(!8 ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !!

!

! ! ! ! !

! ! ! ! ! ! ! ! ! !

!

! !

! ! ! ! ! ! ! !

! ! ( ! ! ! ! ! !

! ! ! ! ! ! ! !

( ! ! ( ! ! ! ! !

! ! ! ! ! ! ! ! ! !!

! !

! ! ( ! ! ! ! ! !!

! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !!d ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !!

, , , ,! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! !

"#$! 0011"-.! 0311"-4! 2011"-.! "-./01!

&8!MF%9?!"IG9! 03! :/! C1! ;<;!

&=FE%(N!,*MJ! 02b! 0:b! 0Db! 02b!

'F9&=FE%(N! 2b! ;1b! 2b! D;b!

98+FN(U/! 2b! ;1b! 2b! <Db!

98+FN(U;1! ;1b! ;1b! 2b! ;1b!

98+FN(U;111! 1B2b! /B;b! ;B;b! ;B;b!

, , , , !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! Y(! ,*L=N%c8! &,PF(7(!

, ! ! ! ! ! ! ! ! !! ! ! !!

Y ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !!

, , , , , !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !

!

! ! ! ! !! !

! ! !

! ! !

! ! !!

! ! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! !

, , , , , ,! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

Vo

lko

v, V

., an

d J

. W

. D

emm

el.

Ben

chm

arki

ng

GP

Us

to T

un

e D

ense

Lin

ear

Alg

ebra

. 20

08.

4


Memory latency

!

!

! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! !

, , ,! ! ! ! ! ! ! !! ! ! ! ( ! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !!_Y(!&99L'(!8 ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !!

!

! ! ! ! !

! ! ! ! ! ! ! ! ! !

!

! !

! ! ! ! ! ! ! !

! ! ( ! ! ! ! ! !

! ! ! ! ! ! ! !

( ! ! ( ! ! ! ! !

! ! ! ! ! ! ! ! ! !!

! !

! ! ( ! ! ! ! ! !!

! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !!d ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !!

, , , ,! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! !

"#$! 0011"-.! 0311"-4! 2011"-.! "-./01!

&8!MF%9?!"IG9! 03! :/! C1! ;<;!

&=FE%(N!,*MJ! 02b! 0:b! 0Db! 02b!

'F9&=FE%(N! 2b! ;1b! 2b! D;b!

98+FN(U/! 2b! ;1b! 2b! <Db!

98+FN(U;1! ;1b! ;1b! 2b! ;1b!

98+FN(U;111! 1B2b! /B;b! ;B;b! ;B;b!

, , , , !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! Y(! ,*L=N%c8! &,PF(7(!

, ! ! ! ! ! ! ! ! !! ! ! !!

Y ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !!

, , , , , !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !

!

! ! ! ! !! !

! ! !

! ! !

! ! !!

! ! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! !

, , , , , ,! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

Vo

lko

v, V

., an

d J

. W

. D

emm

el.

Ben

chm

arki

ng

GP

Us

to T

un

e D

ense

Lin

ear

Alg

ebra

. 20

08.

5


Linear algebra libraries

• CUBLAS: BLAS implementation for CUDA.

• CUSparse: Sparse matrix operations.

• CUFFT: FFT implementation for CUDA.

• CULA: Lapack implementation for CUDA.

• CURand: (Quasi) random number generation.

• NAG GPU library: NAG for GPUs.

• Thrust: STL style library for commonfunctionality (sort, reduce, ...).

6


BLAS performance

htt

p:/

/g

pg

pu

.org

/w

p/

wp

-co

nte

nt/

up

load

s/20

09/

11/

SC

09_C

UD

A_T

oo

ls_C

oh

en.p

df

7


LU factorization using BLAS!

!

"

#

$"

$#

%"

%#

&"

'( $%) %#' #$% $"%( %"() ("*' )$*% $'&)( &%+')

,-./012

345678!/-!09:4.

;9:4. -9<8/=5>985/:?@;A!1!,;A2!1!4285B9842

) ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! !! ! ! ! ! !

! ! ! ! ! ! ! !! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! F9!LMN&8(NB!-P(!'&98(+!8P+(&N!9MF%9!

* ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !!- ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !

, , , , , , ,! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! !!! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! </?111B!-PL9?! &8!0! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! !! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!:! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! !

, , , , , , ,! ! ! ! ( ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !

( ( ( ( ( ( !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !!

, , ,! ! ! ! ! ! ! ( ( ! ! ! ! ! !! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !

Vo

lko

v, V

., an

d J

. W

. D

emm

el.

Ben

chm

arki

ng

GP

Us

to T

un

e D

ense

Lin

ear

Alg

ebra

. 20

08.

8


CUFFT performance

htt

p:/

/g

pg

pu

.org

/w

p/

wp

-co

nte

nt/

up

load

s/20

09/

11/

SC

09_C

UD

A_T

oo

ls_C

oh

en.p

df

9


CUSparse performance

htt

p:/

/w

ww

.nv

idia

.co

m/

con

ten

t/G

TC

-201

0/p

dfs

/20

70_G

TC

2010

.pd

f

10


CUSparse performance

htt

p:/

/w

ww

.nv

idia

.co

m/

con

ten

t/G

TC

-201

0/p

dfs

/20

70_G

TC

2010

.pd

f

11


GEMM

• General weighted matrix-matrix multiplication

12


GEMM


C = αAB+ βC

13


GEMM


• Algorithm has to be blocked to avoid beingmemory bandwidth limited.

C = αAB+ βC

14


GEMM


• Algorithm has to be blocked to avoid beingmemory bandwidth limited.

• Many implementation details have to be alignedwith the details of the hardware architecture ...

C = αAB+ βC

15


GEMM

70% l l d dd h d h d ( )

60%

70%

our implementation (60%)

multiply and add with an operand in shared memory (66%)

50%

k

30%

40%

onof

Pea CUBLAS 1.1 (37%)

20%Fracti

0%

10%

0%

64 128 256 512 1024 2048 4096N

htt

p:/

/w

ww

.cs.

ber

kel

ey.e

du

/~

vo

lko

v/

vo

lko

v08

-sc0

8tal

k.p

df

16


GEMM!

!

! ! ! !

( ! ! (! ! ! ! !

! ! ! ! ! ! ! !

! ! ! ! ! ! ! !

! ! ! !! !!

M ! ! ! ! !

! ! ! ! ! ! ! !

! ! ! ! ! ! ! !

! ! ! !

! ! ! !

! ! ! ! ! !

! ! ! !

! !! :/! 1!

` ! ! ! ! ! !

! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! !

"

#"

$""

$#"

%""

%#"

&""

&#"

(""

'( $%) %#' #$% $"%( %"() ("*'

,-./012

C

))"",DE!F$'!G!$H&#,3>I

*)"",DE!F$'!G!$H'+,3>I

,DE%)"!F&"!G!$H&",3>I

)'"",DJ!F(!G!$H(#,3>I@/=4%KL9MN!&H",3>

'!;

&);

'!;

.#;'!;

@/=4%!OL/N!%H'+,3> ).;

J,PQQNB98=5<42CGC

) ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !!

!

! !

!

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !!

= ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! !! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

! ! ! ! !

, , , , ,! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! !! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !

! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !

Vo

lko

v, V

., an

d J

. W

. D

emm

el.

Ben

chm

arki

ng

GP

Us

to T

un

e D

ense

Lin

ear

Alg

ebra

. 20

08.

17


Bisection for eigenvalue spectrum

• Sought: all eigenvalues of tridiagonal,symmetric input matrix T.

18




• Ingredients: Gerschgorin intervaland Sturm counts .

G(T) = [l, u]) [ ]

C(y) = k

19




• Ingredients: Gerschgorin intervaland Sturm counts .

• Recipe: Bracket eigenvalues starting from theGerschgorin interval using Sturm counts.

G(T) = [l, u]) [ ]

C(y) = k

20



l

0 n

u

21



l

0 C(m1) n

um1

22



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1 m1

m1

23



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

C(m21)

m21

C(m21)

m21m1

m1

24



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

0 C(m21)

C(m21)

l m21

m21

C(m21)

m21

C(m21) C(m22)

m21 m22

m1

m1

C(m23) n

um23

C(m22) C(m23)

m23m22

25



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

0 C(m21)

C(m21)

l m21

m21

C(m21)

m21

C(m21) C(m22)

m21 m22

m1

m1

C(m23) n

um23m31 m32 m32

C(m22) C(m23)

m23m22

26



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

0 C(m21)

C(m21)

l m21

m21

C(m21)

m21

C(m21) C(m22)

m21 m22

m1

m1

C(m23) n

um23m31 m32 m32

C(m22) C(m23)

m23m22

27



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

0 C(m21)

C(m21)

l m21

m21

C(m21)

m21

C(m21) C(m22)

m21 m22

m1

m1

C(m23) n

um23m31 m32 m32

C(m22) C(m23)

m23m22

C(mlk) C(mlk+1)

mlk+1mlk

28



Count = 0;

d = 1;

for i = 1 to n

d = ai - x - (bi-1*bi-1)/d;

if d < 0

Count += 1;

endif

endfor

29



l

0 n

u

30



l

0 C(m1) n

um1

1X

31



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1 m1

m1

1X

32



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

C(m21)

m21

C(m21)

m21m1

m1

1X

2X

33



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

0 C(m21)

C(m21)

l m21

m21

C(m21)

m21

C(m21) C(m22)

m21 m22

m1

m1

C(m23) n

um23

C(m22) C(m23)

m23m22

1X

2X

34



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

0 C(m21)

C(m21)

l m21

m21

C(m21)

m21m1

m1

C(m23) n

um23m31 m32 m32

C(m22) C(m23)

m23m22

1X

2X

3X

35



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

0 C(m21)

C(m21)

l m21

m21

C(m21)

m21m1

m1

C(m23) n

um23m31 m32 m32

C(m22) C(m23)

m23m22

1X

2X

3X

36



l

0

0 C(m1)

C(m1)

C(m1)

n

n

u

ul m1

0 C(m21)

C(m21)

l m21

m21

C(m21)

m21m1

m1

C(m23) n

um23m31 m32 m32

C(m22) C(m23)

m23m22

C(mlk) C(mlk+1)

mlk+1mlk

1X

2X

3X

≈ nX

37



• Strategy: one thread per tree node (or interval).

– Tree is processed level by level.

38



• Strategy: one thread per tree node (or interval).

– Tree is processed level by level.

• Implementation “details”?

– Memory management? Can and should we useshared memory?

– How can we handle large matrices?

39



1. Compute Gerschgorin interval.

40




2. Move data to co-processor.

– Matrix data (ai, bi).

– Gerschgorin interval.

41




2. Move data to co-processor.

– Matrix data (ai, bi).

– Gerschgorin interval.

3. Launch kernel.

42



// move matrix data to shared mem

__shared__ float a[MAX_MAT_SIZE];

__shared__ float b[MAX_MAT_SIZE];

for( i=0; i<(mat_size/num_threads); ++i) {

a[tid] = a_g[tid];

b[tid] = b_g[tid];

}

__synchthreads();

43


struct SharedMem {

float left[MAX_THREADS_BLOCK];

float right[MAX_THREADS_BLOCK];

short left_count[MAX_THREADS_BLOCK];

short right_count[MAX_THREADS_BLOCK];

short compact[MAX_THREADS_BLOCK+1];

};


44



// move Gerschgorin data to shared mem

__shared__ SharedMem smem;

if( 0 == tid) {

smem.left[0] = lg;

smem.right[0] = ug;

smem.left_count[0] = 0;

smem.right_count[0] = mat_size+1;

}

45



// compute midpoint of interval

// compute Sturm count for midpoint

// write intervals to shared memory

smem.left[tid] = left;

smem.right[tid] = mid;

smem.left[tid+1] = mid;

smem.right[tid+1] = right;

...

}

46



if( tid < num_active threads) {

float left = smem.left[tid];

float right = smem.right[tid];

...

// compute midpoint of interval

float mid = midpoint( left, right);


mid_count = sturm_count( left, right, mid);

47




mid_count = sturm_count( left, right, mid);


__syncthreads();

smem.left[2*tid] = left;

smem.right[2*tid] = mid;

smem.left[2*tid+1] = mid;

smem.right[2*tid+1] = right;

num_intervals *= 2;

}

48


Interlude: prefix sum (scan)

• Swiss army knife of data-parallel programming.

• Sort, max, min, reduce, ...

49


Interlude: prefix sum (scan)

• Swiss army knife of data-parallel programming.

• Sort, max, min, reduce, ...

• Divide-and-Conquer strategy.

– Traverse tree first bottom up and then top down.

– Bottom up: gather information.

– Top down: distribute information.

50




__syncthreads();

smem.left[2*tid] = left;

smem.right[2*tid] = mid;

smem.left[2*tid+1] = mid;

smem.right[2*tid+1] = right;

// compact intervals in shared memory

...

num_intervals = smem.compact[num_intervals];

}

51



1. Read data from global to shared memory.

2. Until all intervals are converged:

– Read interval into registers.

– Bisect and compute Sturm count.

– Write to shared memory.

– Compact intervals.

3. Write eigenvalues to global memory.

52



for( j=0; j<(mat_size/blockDim.x); ++j){

// read data with all threads in the block

// make sure the shared memory is not used

__syncthreads();

// read data for current block

if( addr < mat_size) {

d_data[tid] = g_d[addr];

lld_data[tid] = g_lld[addr];

}

__syncthreads();

53



// read data for current block

if( addr < mat_size) {

a[tid] = a_g[addr];

b[tid] = b_g[addr];

}

__syncthreads();

// for all active intervals

if(tid < num_intervals) {

// Sturm count computation for current block

...

54



0 500 1000 1500 2000 2500 3000 3500 40000

2000

4000

6000

8000

10000

12000

14000

16000

18000Execution Time

Matrix Size

Tim

e (

ms)

sstemr mean

sstemr min

sstemr max

mrrr_dp mean

mrrr_dp min

mrrr_dp max

55


How to get started?

• http://developer.nvidia.com/cuda/

– Cuda 3.0 has device emulation mode.

– SDK has many, many code examples.

– NVIDIA linear algebra libraries.

• http://www.cs.berkeley.edu/~volkov/

– Performance analysis of linear algebra on dataparallel co-processors.

56


Conclusion

• Linear algebra on data parallel co-processors canbe significantly more efficient than on CPUs.

• But requires architecture specific design andoptimizations.

• For many algorithms still unclear how toefficiently implement them on a data parallel co-processor.

Documents

RWTH Aachen :: High Performance Matrix Computation