High Performance Linear Algebra on Data Parallel Co-Processors II
[email protected] University of Toronto / Technische Universität Berlin
Data parallel co-processors
[Figure: block diagram of a data parallel co-processor showing the host, input assembler, thread scheduler, eight multiprocessors each with shared memory (SMEM) and an L1 cache, a single L2 cache, and global memory.]
Memory latency
[Figure: page excerpt from Volkov and Demmel (2008). The body text is not recoverable; the embedded table reads as follows.]
GPU              8800GTX   8600GTS   9800GTX   GTX280
at pins, GB/s    86        32        70        141
aligned copy     89%       83%       85%       89%
misaligned       9%        10%       9%        51%
stride-2         9%        10%       9%        45%
stride-10        10%       10%       9%        10%
stride-1000      0.9%      2.1%      1.1%      1.1%
Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.
Linear algebra libraries
• CUBLAS: BLAS implementation for CUDA.
• CUSparse: Sparse matrix operations.
• CUFFT: FFT implementation for CUDA.
• CULA: LAPACK implementation for CUDA.
• CURand: (Quasi) random number generation.
• NAG GPU library: NAG for GPUs.
• Thrust: STL style library for common functionality (sort, reduce, ...).
BLAS performance
[Figure: BLAS performance chart; not recoverable.]
Source: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_Tools_Cohen.pdf
LU factorization using BLAS
[Figure: panel factorization rates in Gflop/s (0 to 30) versus height of panel (64 to 32768), comparing CPU, GPUs, and estimates. The surrounding text is not recoverable.]
Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.
CUFFT performance
[Figure: CUFFT performance chart; not recoverable.]
Source: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_Tools_Cohen.pdf
CUSparse performance
[Figure: CUSparse performance chart; not recoverable.]
Source: http://www.nvidia.com/content/GTC-2010/pdfs/2070_GTC2010.pdf
GEMM
• General weighted matrix-matrix multiplication: C = αAB + βC
• Algorithm has to be blocked to avoid being memory bandwidth limited.
• Many implementation details have to be aligned with the details of the hardware architecture ...
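The blocking idea can be sketched on the host in plain C++. This is an illustrative sketch under stated assumptions, not the tuned GPU kernel the slides refer to: the function name `gemm_blocked`, the row-major square layout, and the tile size `BS` are all hypothetical. Once a tile of A and a tile of B are loaded, they are reused for a whole tile of updates to C, which is what reduces the pressure on memory bandwidth.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: C = alpha * A * B + beta * C for square n x n
// row-major matrices, computed tile by tile. BS is an assumed tile size.
void gemm_blocked(std::size_t n, float alpha, const std::vector<float>& A,
                  const std::vector<float>& B, float beta,
                  std::vector<float>& C, std::size_t BS = 16) {
    assert(BS > 0);
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);

    // Scale C once up front so the tile loops only accumulate.
    for (float& c : C) c *= beta;

    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS) {
                const std::size_t i_end = std::min(ii + BS, n);
                const std::size_t k_end = std::min(kk + BS, n);
                const std::size_t j_end = std::min(jj + BS, n);
                // Multiply one tile of A with one tile of B.
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a = alpha * A[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

On a co-processor the tiles would typically be staged in shared memory or registers; the best tile size depends on the hardware, which is the point of the last bullet.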
GEMM
[Figure: GEMM performance as fraction of peak (0% to 70%) versus matrix size N (64 to 4096). Curves: our implementation (60%); multiply and add with an operand in shared memory (66%); CUBLAS 1.1 (37%).]
Source: http://www.cs.berkeley.edu/~volkov/volkov08-sc08talk.pdf
GEMM
[Figure: SGEMM rates in Gflop/s (0 to 400) for N x N matrices, N = 64 to 4096, on 8800GTX (16 x 1.35GHz), 9800GTX (16 x 1.67GHz), GTX280 (30 x 1.30GHz), 8600GTS (4 x 1.45GHz), Core2 Quad 3.0GHz, and Core2 Duo 2.67GHz. The surrounding text is not recoverable.]
Source: Volkov, V., and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. 2008.
Bisection for eigenvalue spectrum
• Sought: all eigenvalues of tridiagonal, symmetric input matrix T.
• Ingredients: Gerschgorin interval G(T) = [l, u], which contains all eigenvalues of T, and Sturm counts C(y) = k, the number of eigenvalues smaller than y.
• Recipe: Bracket eigenvalues starting from the Gerschgorin interval using Sturm counts.
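As a minimal sketch of the first ingredient, the Gerschgorin interval of a symmetric tridiagonal matrix can be computed in one pass over the rows. The helper name `gerschgorin_interval` and the storage convention (diagonal `a` with n entries, off-diagonal `b` with n - 1 entries) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns {l, u} such that every eigenvalue of the
// symmetric tridiagonal matrix with diagonal a (n entries) and
// off-diagonal b (n - 1 entries) lies in [l, u].
std::pair<float, float> gerschgorin_interval(const std::vector<float>& a,
                                             const std::vector<float>& b) {
    const std::size_t n = a.size();
    assert(n > 0 && b.size() + 1 == n);
    float l = a[0];
    float u = a[0];
    for (std::size_t i = 0; i < n; ++i) {
        // Radius of disc i: absolute off-diagonal entries in row i.
        const float left = (i > 0) ? std::fabs(b[i - 1]) : 0.0f;
        const float right = (i + 1 < n) ? std::fabs(b[i]) : 0.0f;
        l = std::min(l, a[i] - left - right);
        u = std::max(u, a[i] + left + right);
    }
    return {l, u};
}
```

For the 3 x 3 matrix with diagonal (2, 2, 2) and off-diagonal (-1, -1), the middle row gives the widest disc, so the interval is [0, 4].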
Bisection for eigenvalue spectrum
[Figure: bisection tree, built up level by level. The root is the Gerschgorin interval [l, u] with Sturm counts 0 and n. It is split at the midpoint m1 with count C(m1); the next levels split at m21, m22, m23 and then m31, m32, down to leaf intervals [m_lk, m_lk+1] with counts C(m_lk) and C(m_lk+1).]
Bisection for eigenvalue spectrum
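// Sturm count C(x): number of eigenvalues of T smaller than x
// a_i: diagonal entries, b_i: off-diagonal entries, with b_0 = 0
Count = 0;
d = 1;
for i = 1 to n
    d = a_i - x - (b_{i-1} * b_{i-1}) / d;
    if d < 0
        Count += 1;
    endif
endfor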
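The recurrence above translates almost line by line into C++. The sketch below assumes 0-based storage (diagonal `a` with n entries, off-diagonal `b` with n - 1 entries) and, like the pseudocode, has no safeguard for a pivot `d` that becomes exactly zero; the name `sturm_count` mirrors the helper used in the kernel snippets, but the signature here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of eigenvalues smaller than x for the
// symmetric tridiagonal matrix with diagonal a and off-diagonal b.
int sturm_count(const std::vector<float>& a, const std::vector<float>& b,
                float x) {
    assert(!a.empty() && b.size() + 1 == a.size());
    int count = 0;
    float d = 1.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // The off-diagonal entry above row i; taken as 0 for the first row.
        const float off = (i > 0) ? b[i - 1] : 0.0f;
        d = a[i] - x - (off * off) / d;
        if (d < 0.0f) ++count;  // one negative pivot per eigenvalue below x
    }
    return count;
}
```

For the matrix with diagonal (2, 2, 2) and off-diagonal (-1, -1), whose eigenvalues are 2 - √2, 2, and 2 + √2, the count steps from 0 to 3 as x moves from left to right across the spectrum.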
Bisection for eigenvalue spectrum
[Figure: the same bisection tree, annotated per level with 1X, 2X, 3X, ..., ≈ nX.]
Bisection for eigenvalue spectrum
• Strategy: one thread per tree node (or interval).
– Tree is processed level by level.
• Implementation “details”?
– Memory management? Can and should we use shared memory?
– How can we handle large matrices?
Bisection for eigenvalue spectrum
1. Compute Gerschgorin interval.
2. Move data to co-processor.
– Matrix data (a_i, b_i).
– Gerschgorin interval.
3. Launch kernel.
Bisection for eigenvalue spectrum
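// move matrix data to shared mem
__shared__ float a[MAX_MAT_SIZE];
__shared__ float b[MAX_MAT_SIZE];
// each thread copies every num_threads-th element, starting at tid
for (unsigned int i = tid; i < mat_size; i += num_threads) {
    a[i] = a_g[i];
    b[i] = b_g[i];
}
// wait until the whole matrix is in shared memory
__syncthreads();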
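Bisection for eigenvalue spectrum
struct SharedMem {
    // lower and upper bounds of the active intervals
    float left[MAX_THREADS_BLOCK];
    float right[MAX_THREADS_BLOCK];
    // Sturm counts at the lower and upper bounds
    short left_count[MAX_THREADS_BLOCK];
    short right_count[MAX_THREADS_BLOCK];
    // workspace for compacting the interval list
    short compact[MAX_THREADS_BLOCK+1];
};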
Bisection for eigenvalue spectrum
// move Gerschgorin data to shared mem
__shared__ SharedMem smem;
if (0 == tid) {
    smem.left[0] = lg;
    smem.right[0] = ug;
    smem.left_count[0] = 0;
    // all mat_size eigenvalues lie below the upper bound ug
    smem.right_count[0] = mat_size;
}
Bisection for eigenvalue spectrum
float left, right, mid;
if (tid < num_active_threads) {
    left = smem.left[tid];
    right = smem.right[tid];
    ...
    // compute midpoint of interval
    mid = midpoint(left, right);
    // compute Sturm count for midpoint
    mid_count = sturm_count(left, right, mid);
}
// every thread of the block has to reach the barrier, so it sits outside
// the branch; it makes sure all old intervals are read before overwriting
__syncthreads();
// write intervals to shared memory: thread tid owns slots 2*tid and 2*tid+1
if (tid < num_active_threads) {
    smem.left[2*tid] = left;
    smem.right[2*tid] = mid;
    smem.left[2*tid+1] = mid;
    smem.right[2*tid+1] = right;
    ...
}
num_intervals *= 2;
Interlude: prefix sum (scan)
• Swiss army knife of data-parallel programming.
• Sort, max, min, reduce, ...
• Divide-and-Conquer strategy.
– Traverse tree first bottom up and then top down.
– Bottom up: gather information.
– Top down: distribute information.
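The bottom-up and top-down traversal can be sketched sequentially in C++. The version below is the common work-efficient exclusive scan, written under the assumption that the input length is a power of two; the helper names `exclusive_scan` and `compact` are hypothetical. Each iteration of an inner loop is independent of the others at the same level, so on a co-processor it would map to one thread. The second function shows one typical use, stream compaction, where the scan of a 0/1 flag array yields the output slot of every kept element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: in-place exclusive prefix sum; the length must be
// a power of two.
void exclusive_scan(std::vector<int>& v) {
    const std::size_t n = v.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    // Bottom up: gather partial sums towards the root.
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            v[i] += v[i - stride];
    // Top down: distribute prefixes back to the leaves.
    v[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = v[i - stride];
            v[i - stride] = v[i];
            v[i] += left;
        }
}

// Hypothetical helper: keep the elements whose flag is 1, preserving order.
std::vector<float> compact(const std::vector<float>& data,
                           const std::vector<int>& keep) {
    assert(data.size() == keep.size());
    std::vector<int> slot = keep;
    exclusive_scan(slot);  // slot[i] = number of kept elements before i
    const std::size_t n = keep.size();
    std::vector<float> out(static_cast<std::size_t>(slot[n - 1] + keep[n - 1]));
    for (std::size_t i = 0; i < n; ++i)
        if (keep[i]) out[static_cast<std::size_t>(slot[i])] = data[i];
    return out;
}
```

In the bisection setting, the flags would mark child intervals that still contain eigenvalues, so that empty intervals do not occupy threads at the next tree level.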
Bisection for eigenvalue spectrum
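    // write intervals to shared memory
    // (the barrier has to be reached by all threads of the block)
    __syncthreads();
    smem.left[2*tid] = left;
    smem.right[2*tid] = mid;
    smem.left[2*tid+1] = mid;
    smem.right[2*tid+1] = right;
    // compact intervals in shared memory
    ...
    // read the new number of intervals from the compaction workspace
    num_intervals = smem.compact[num_intervals];
}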
Bisection for eigenvalue spectrum
1. Read data from global to shared memory.
2. Until all intervals are converged:
– Read interval into registers.
– Bisect and compute Sturm count.
– Write to shared memory.
– Compact intervals.
3. Write eigenvalues to global memory.
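The steps above can be mirrored by a sequential host-side reference, which is useful for checking a kernel. This is a sketch under assumptions, not the implementation from the talk: `Interval`, `count_below`, `bisect_all`, and the absolute tolerance `tol` are hypothetical, the enclosing interval [l, u] is passed in by the caller, and zero pivots in the Sturm recurrence are not safeguarded. The loop over `active` corresponds to one tree level, with one thread per interval on the device; dropping empty children plays the role of the compaction step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Interval { float left, right; int left_count, right_count; };

// Number of eigenvalues smaller than x (Sturm count recurrence).
static int count_below(const std::vector<float>& a,
                       const std::vector<float>& b, float x) {
    int count = 0;
    float d = 1.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float off = (i > 0) ? b[i - 1] : 0.0f;
        d = a[i] - x - (off * off) / d;
        if (d < 0.0f) ++count;
    }
    return count;
}

// Hypothetical reference: all eigenvalues in [l, u], assuming [l, u]
// encloses the whole spectrum (e.g. the Gerschgorin interval).
std::vector<float> bisect_all(const std::vector<float>& a,
                              const std::vector<float>& b,
                              float l, float u, float tol = 1e-5f) {
    assert(!a.empty() && b.size() + 1 == a.size());
    std::vector<Interval> active = {{l, u, 0, static_cast<int>(a.size())}};
    std::vector<float> eigenvalues;
    while (!active.empty()) {          // until all intervals are converged
        std::vector<Interval> next;    // intervals of the next tree level
        for (const Interval& iv : active) {
            const float mid = 0.5f * (iv.left + iv.right);
            const bool stuck = (mid <= iv.left) || (mid >= iv.right);
            if (iv.right - iv.left < tol || stuck) {
                // Converged: emit one value per eigenvalue in the interval.
                eigenvalues.insert(
                    eigenvalues.end(),
                    static_cast<std::size_t>(iv.right_count - iv.left_count),
                    mid);
                continue;
            }
            // Clamp so rounding cannot push the count outside the bracket.
            const int mid_count = std::max(
                iv.left_count, std::min(count_below(a, b, mid), iv.right_count));
            // Keep only children that still contain eigenvalues.
            if (mid_count > iv.left_count)
                next.push_back({iv.left, mid, iv.left_count, mid_count});
            if (iv.right_count > mid_count)
                next.push_back({mid, iv.right, mid_count, iv.right_count});
        }
        active.swap(next);
    }
    std::sort(eigenvalues.begin(), eigenvalues.end());
    return eigenvalues;
}
```

Because intervals converge at different tree levels, the results are sorted at the end rather than relying on the order in which they are emitted.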
Bisection for eigenvalue spectrum
// process the matrix in chunks of blockDim.x elements
// (loop bound rounded up so a partial last chunk is not skipped)
for (j = 0; j < (mat_size + blockDim.x - 1) / blockDim.x; ++j) {
    // read data with all threads in the block
    // make sure the shared memory is not used
    __syncthreads();
    // read data for current block
    // (addr is assumed to be the global index of this thread's element)
    addr = j * blockDim.x + tid;
    if (addr < mat_size) {
        a[tid] = a_g[addr];
        b[tid] = b_g[addr];
    }
    __syncthreads();
    // for all active intervals
    if (tid < num_intervals) {
        // Sturm count computation for current block
        ...
Bisection for eigenvalue spectrum
[Figure: "Execution Time" chart, time in ms (0 to 18000) versus matrix size (0 to 4000). Curves: sstemr mean, min, max and mrrr_dp mean, min, max.]
How to get started?
• http://developer.nvidia.com/cuda/
– CUDA 3.0 has device emulation mode.
– SDK has many, many code examples.
– NVIDIA linear algebra libraries.
• http://www.cs.berkeley.edu/~volkov/
– Performance analysis of linear algebra on data parallel co-processors.
Conclusion
• Linear algebra on data parallel co-processors can be significantly more efficient than on CPUs.
• But it requires architecture-specific design and optimizations.
• For many algorithms it is still unclear how to implement them efficiently on a data parallel co-processor.