IAP09 CUDA@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA Lecture 07 CUDA Advanced #2 - Nicolas Pinto (MIT) Friday, January 23, 2009


DESCRIPTION

More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009 Note that some slides were borrowed from NVIDIA.


Page 1: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

IAP09 CUDA@MIT / 6.963

Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA

Lecture 07

CUDA Advanced #2

Nicolas Pinto (MIT)


Page 2: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

During this course, we'll try to “ ” and use existing material ;-)

adapted for 6.963


Page 3: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Today

yey!!


Page 4: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Wanna Play with The Big Guys?


Page 5: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Here are the keys to High-Performance in CUDA


Page 6: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

To optimize or not to optimize

Hoare said (and Knuth restated):

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

~3% of the time we really should worry about small efficiencies
(Every 33rd code line)

Warning!

Applied Mathematics 23/53
slide by Johan Seland


Page 8: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Strategy

Memory Optimizations

Execution Optimizations

IAP09 CUDA@MIT / 6.963


Page 9: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

CUDA Performance Strategies

IAP09 CUDA@MIT / 6.963


Page 10: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Optimization goals

We should strive to reach GPU performance

We must know the GPU performance: Vendor specifications, Synthetic benchmarks

Choose a performance metric: Memory bandwidth or GFLOPS?

Use clock() to measure

Experiment and profile!

Applied Mathematics 25/53

Strategy

slide by Johan Seland
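To make such measurements concrete, here is a minimal timing sketch using CUDA's event API (the kernel name, launch configuration, and buffer are illustrative assumptions, not from the slides):

// Hypothetical harness: time a kernel with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_data);   // kernel under test (assumed name)
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// Derive the chosen metric, e.g. GB/s = bytes moved / (ms * 1e6).

The same experiment can be repeated over many runs and averaged, as later slides do.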


Page 11: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2006

Programming Model

A kernel is executed as a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by:

Sharing data through shared memory

Synchronizing their execution

Threads from different blocks cannot cooperate

[Figure: the host launches Kernel 1 on Grid 1, a 3x2 array of thread blocks, and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5x3 array of threads.]

Threading


Page 12: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2008

Data Movement in a CUDA Program

Host Memory -> Device Memory -> [Shared Memory] -> COMPUTATION -> [Shared Memory] -> Device Memory -> Host Memory

Memory
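As a sketch, that data movement maps onto the runtime API like this (buffer names, N, and the kernel are illustrative assumptions):

float *h_buf, *d_buf;
size_t bytes = N * sizeof(float);            // N assumed defined
h_buf = (float*)malloc(bytes);
cudaMalloc((void**)&d_buf, bytes);

cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // host -> device
compute<<<grid, block>>>(d_buf);             // kernel stages data through SMEM
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // device -> host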


Page 13: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

39

Optimize Algorithms for the GPU

Maximize independent parallelism

Maximize arithmetic intensity (math/bandwidth)

Sometimes it's better to recompute than to cache: GPU spends its transistors on ALUs, not memory

Do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to host

Perf

Page 14: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

40

Optimize Memory Coherence

Coalesced vs. non-coalesced = order of magnitude: Global/Local device memory

Optimize for spatial locality in cached texture memory

In shared memory, avoid high-degree bank conflicts

Perf

Page 15: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

41

Take Advantage of Shared Memory

Hundreds of times faster than global memory

Threads can cooperate via shared memory

Use one / a few threads to load / compute data shared by all threads

Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing

Matrix transpose example later

Perf

Page 16: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

42

Use Parallelism Efficiently

Partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks

Keep resource usage low enough to support multiple active thread blocks per multiprocessor: registers, shared memory

Perf

Page 17: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


Page 18: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Memory Optimizations

IAP09 CUDA@MIT / 6.963


Page 19: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

44

Memory optimizations

Optimizing memory transfers

Coalescing global memory accesses

Using shared memory effectively

Memory

Page 20: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

45

Data Transfers

Device memory to host memory bandwidth much lower than device memory to device bandwidth: 4 GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600); 8 GB/s for PCI-e 2.0

Minimize transfers: intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory

Group transfers: one large transfer much better than many small ones

Memory
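A sketch of the “minimize transfers” point: an intermediate buffer that is allocated, used, and freed without ever visiting the host (the kernel and buffer names are assumptions):

float *d_tmp;
cudaMalloc((void**)&d_tmp, bytes);        // intermediate lives only on the device
stage1<<<grid, block>>>(d_in, d_tmp);     // produce the intermediate
stage2<<<grid, block>>>(d_tmp, d_out);    // consume it; no cudaMemcpy in between
cudaFree(d_tmp);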


Page 21: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

46

Page-Locked Memory Transfers

cudaMallocHost() allows allocation of page-locked host memory

Enables highest cudaMemcpy performance: 3.2 GB/s common on PCI-express (x16); ~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)

See the “bandwidthTest” CUDA SDK sample

Use with caution: allocating too much page-locked memory can reduce overall system performance

Test your systems and apps to learn their limits

Memory
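A minimal sketch of the page-locked path with cudaMallocHost() (sizes and the destination buffer are assumed):

float *h_pinned;
cudaMallocHost((void**)&h_pinned, bytes);  // page-locked host allocation
// ... fill h_pinned ...
cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);  // DMA-friendly copy
cudaFreeHost(h_pinned);                    // release the pinned memory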


Page 22: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

47

Global Memory Reads/Writes

Highest latency instructions: 400-600 clock cycles

Likely to be performance bottleneck

Optimizations can greatly increase performance:
Coalescing: up to 10x speedup
Latency hiding: up to 2.5x speedup

gmem


Page 23: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Accessing global memory

4 cycles to issue a memory fetch,

but 400-600 cycles of latency: the equivalent of 100 MADs

Likely to be a performance bottleneck

Order of magnitude speedups possible:
Coalesce memory access
Use shared memory to re-order non-coalesced addressing

Applied Mathematics 32/53
slide by Johan Seland

gmem


Page 24: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

48

Coalescing

A coordinated read by a half-warp (16 threads)

A contiguous region of global memory:
64 bytes - each thread reads a word: int, float, …
128 bytes - each thread reads a double-word: int2, float2, …
256 bytes - each thread reads a quad-word: int4, float4, …

Additional restrictions on G8X/G9X architecture:
Starting address for a region must be a multiple of region size
The kth thread in a half-warp must access the kth element in a block being read

Exception: not all threads must be participating
Predicated access, divergence within a half-warp

gmem
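A sketch of a kernel that satisfies these rules on G8X-class hardware: each of the 16 threads of a half-warp reads one consecutive 32-bit word from an aligned region (names are illustrative):

__global__ void copyCoalesced(float *out, const float *in)
{
    // Thread k of each half-warp reads word k of a contiguous,
    // properly aligned segment: a coalesced access.
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}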


Page 25: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

49

Coalesced Access: Reading floats

[Figure: threads t0-t15 reading consecutive 4-byte addresses 128-188. Top: all threads participate. Bottom: some threads do not participate. Both reads coalesce.]

gmem


Page 26: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

50

Uncoalesced Access: Reading floats

[Figure: permuted access by threads, and a misaligned starting address (not a multiple of 128); both patterns break coalescing.]

gmem


Page 27: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

51

Coalescing: Timing Results

Experiment on G80:
Kernel: read a float, increment, write back
3M floats (12MB)
Times averaged over 10K runs

12K blocks x 256 threads:
356µs - coalesced
357µs - coalesced, some threads don't participate
3,494µs - permuted/misaligned thread access

gmem


Page 28: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

58

Coalescing: Structures of Size ≠ 4, 8, or 16 Bytes

Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)

If SoA is not viable:
Force structure alignment: __align__(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing

zyx Point structure

zyx zyx zyx AoS

xxx yyy zzz SoA

gmem
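A sketch of the two layouts and the alignment workaround named above (the types are illustrative assumptions):

struct PointAoS { float x, y, z; };     // 12-byte elements: breaks coalescing

struct __align__(16) PointAligned {     // forced 16-byte alignment
    float x, y, z;
};

struct PointsSoA {                      // Structure of Arrays: each member is
    float *x, *y, *z;                   // a contiguous, coalescable array
};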


Page 29: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

59

Coalescing: Summary

Coalescing greatly improves throughput: critical to memory-bound kernels

Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
Prefer Structures of Arrays over AoS
If SoA is not viable, read/write through SMEM

Additional resources: Aligned Types SDK sample

gmem


Page 30: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

64

Parallel Memory Architecture

In a parallel machine, many threads access memory:
Therefore, memory is divided into banks
Essential to achieve high bandwidth

Each bank can service one address per cycle:
A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict:
Conflicting accesses are serialized

[Figure: shared memory shown as banks 0-15.]

smem


Page 31: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

65

Bank Addressing Examples

No Bank Conflicts: linear addressing, stride == 1

No Bank Conflicts: random 1:1 permutation

[Figure: threads 0-15 mapped one-to-one onto banks 0-15 in both patterns.]

smem


Page 32: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

66

Bank Addressing Examples

2-way Bank Conflicts: linear addressing, stride == 2

8-way Bank Conflicts: linear addressing, stride == 8

[Figure: with stride 2, pairs of threads collide in the even banks; with stride 8, the threads pile onto banks 0 and 8.]

smem


Page 33: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

67

How addresses map to banks on G80

Bandwidth of each bank is 32 bits per 2 clock cycles

Successive 32-bit words are assigned to successive banks

G80 has 16 banks: so bank = address % 16

Same as the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp

smem


Page 34: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

68

Shared memory bank conflicts

Shared memory is as fast as registers if there are no bank conflicts

The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank

smem
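A sketch of the three cases above (assuming a 16-bank part such as G80; the array is illustrative):

__shared__ float s[256];
int tid = threadIdx.x;

float a = s[tid];       // stride 1: each thread of a half-warp hits a
                        // different bank, so no conflict
float b = s[0];         // all threads read one address: broadcast, no conflict
float c = s[tid * 8];   // stride 8: the half-warp collides in banks 0 and 8,
                        // an 8-way conflict that gets serialized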


Page 35: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Use the right kind of memory

Constant memory: quite small (≈ 20K); as fast as register access if all threads in a warp access the same location

Texture memory: spatially cached; optimized for 2D locality; neighboring threads should read neighboring addresses; no need to think about coalescing

Constraint: these memories can only be updated from the CPU

Applied Mathematics 31/53
slide by Johan Seland

Strategy
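A sketch of the constant-memory case (the array size and names are assumptions):

__constant__ float coeffs[16];          // small, cached constant memory

__global__ void scale(float *data)
{
    // Every thread in the warp reads the same location,
    // so the access is as fast as a register read.
    data[threadIdx.x] *= coeffs[0];
}

// Host side; per the constraint above, constant memory is
// written from the CPU only:
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));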


Page 36: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Memory optimizations roundup

CUDA memory handling is complex, and I have not covered all topics...

Using memory correctly can lead to huge speedups; at least CUDA exposes the memory hierarchy, unlike CPUs

Get your algorithm up and running first, then optimize

Use shared memory to let threads cooperate

Be wary of “data ownership”: a thread does not have to read/write the data it calculates

Applied Mathematics 41/53

Strategy

slide by Johan Seland


Page 37: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Conflicts, Coalescing, Warps... I hate growing up.


Page 38: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Optimization Example: Matrix Transpose

Example


Page 39: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

70

Matrix transpose

SDK Sample (“transpose”)

Illustrates:
Coalescing
Avoiding SMEM bank conflicts
Speedups for even small matrices

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

1 5 9 13

2 6 10 14

3 7 11 15

4 8 12 16

Example


Page 40: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

71

Uncoalesced transpose

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

Example


Page 41: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

72

Uncoalesced transpose

Reads input from GMEM: stride = 1, coalesced

Writes output to GMEM: stride = 16, uncoalesced

[Figure: per-thread element coordinates for the GMEM read and write patterns.]

Example


Page 42: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

73

Coalesced transpose

Assumption: matrix is partitioned into square tiles

Threadblock (bx, by):
Read the (bx,by) input tile, store into SMEM
Write the SMEM data to (by,bx) output tile: transpose the indexing into SMEM

Thread (tx,ty):
Reads element (tx,ty) from input tile
Writes element (tx,ty) into output tile

Coalescing is achieved if:
Block/tile dimensions are multiples of 16

Example


Page 43: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

74

Coalesced transpose

Reads from GMEM / writes to SMEM, then reads from SMEM / writes to GMEM

[Figure: per-thread element coordinates for each access phase.]

Example


Page 44: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

75

SMEM Optimization

Threads read SMEM with stride = 16: bank conflicts

Solution:
Allocate an “extra” column
Read stride = 17
Threads read from consecutive banks

[Figure: the stride-16 and stride-17 read patterns.]

Example
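A minimal sketch of the padding trick for a 16x16 thread block (names and launch shape are assumptions):

__global__ void padDemo(const float *in, float *out)
{
    // One extra column: column-wise accesses then have stride 17,
    // which maps the 16 threads of a half-warp to 16 distinct banks.
    __shared__ float tile[16][17];
    unsigned int i = threadIdx.y * 16 + threadIdx.x;
    tile[threadIdx.y][threadIdx.x] = in[i];   // row-wise write
    __syncthreads();
    out[i] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
}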


Page 46: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

76

Coalesced transpose

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
        block[index_block] = idata[index_in];
        index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }

    __syncthreads();

    if (xIndex < width && yIndex < height)
        odata[index_out] = block[index_transpose];
}

Example


Page 47: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Coalesced transpose: Source code

__global__ void
transpose( float *out, float *in, int width, int height ) {

    // Allocate shared memory.
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    // Set up indexing.
    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    // Check that we are within domain, calculate more indices.
    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

        // Write to shared memory.
        block[index_block] = in[index_in];

        // Calculate output indices.
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }

    // Synchronize. NB: outside if-clause.
    __syncthreads();

    // Write to global mem. Different index.
    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}

Applied Mathematics 39/53

Example

slide by Johan Seland


Page 55: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Transpose timings

Was it worth the trouble?

Grid Size    Coalesced  Non-coalesced  Speedup
128 x 128    0.011 ms   0.022 ms       2.0x
512 x 512    0.07 ms    0.33 ms        4.5x
1024 x 1024  0.30 ms    1.92 ms        6.4x
1024 x 2048  0.79 ms    6.6 ms         8.4x

For me, this is a clear yes.

Applied Mathematics 40/53
slide by Johan Seland

Example


Page 56: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


Page 57: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Execution Optimizations

IAP09 CUDA@MIT / 6.963


Page 58: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Know the arithmetic cost of operations

4 clock cycles:
Floating point: add, multiply, fused multiply-add
Integer add, bitwise operations, compare, min, max

16 clock cycles:
reciprocal, reciprocal square root, log(x), 32-bit integer multiplication

32 clock cycles:
sin(x), cos(x) and exp(x)

36 clock cycles:
Floating point division (24-bit version in 20 cycles)

Particularly costly: integer division, modulo
Remedy: replace with shifting whenever possible (see the sketch below)

Double precision (when available) will perform at half the speed

Applied Mathematics 28/53
slide by Johan Seland

Exec
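A sketch of the shifting remedy for power-of-two divisors:

// i / 16 and i % 16 without the costly division/modulo:
int q = i >> 4;    // quotient:  i / 16
int r = i & 15;    // remainder: i % 16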


Page 59: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

79

Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently

Limited by resource usage:
Registers
Shared memory

Exec
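As a worked example (assuming a G80-class multiprocessor, which can hold at most 768 threads, i.e. 24 warps): blocks of 256 threads contribute 8 warps each, so if register and shared-memory usage allow two resident blocks, occupancy = 16 / 24 ≈ 67%; a third resident block would raise it to 24 / 24 = 100%.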


Page 60: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

80

Grid/Block Size Heuristics

# of blocks > # of multiprocessors: so all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2:
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren't waiting at a __syncthreads() keep the hardware busy
Subject to resource availability (registers, shared memory)

# of blocks > 100 to scale to future devices:
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations

Exec


Page 61: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

81

Register Dependency

Read-after-write register dependency: instruction's result can be read ~22 cycles later

Scenarios: CUDA / PTX:

x = y + 5;
z = x + 3;

add.f32   $f3, $f1, $f2
add.f32   $f5, $f3, $f4

s_data[0] += 3;

ld.shared.f32  $f3, [$r31+0]
add.f32        $f3, $f3, $f4

To completely hide the latency:
Run at least 192 threads (6 warps) per multiprocessor
At least 25% occupancy
Threads do not have to belong to the same thread block

Exec


Page 62: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

82

Register Pressure

Hide latency by using more threads per SM

Limiting factors:
Number of registers per kernel: 8192 per SM, partitioned among concurrent threads
Amount of shared memory: 16KB per SM, partitioned among concurrent threadblocks

Check .cubin file for # registers / kernel

Use -maxrregcount=N flag to NVCC: N = desired maximum registers / kernel

At some point “spilling” into LMEM may occur: reduces performance (LMEM is slow)

Check .cubin file for LMEM usage

Exec


Page 63: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

83

Determining resource usage

Use the --ptxas-options=-v option to nvcc, or compile the kernel code with the -cubin flag to determine register usage

Open the .cubin file with a text editor and look for the “code” section:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
  name = BlackScholesGPU
  lmem = 0      <- per thread local memory
  smem = 68     <- per thread block shared memory
  reg = 20      <- per thread registers
  bar = 0
  bincode {
    0xa0004205 0x04200780 0x40024c09 0x00200780

Exec


Page 64: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

84

CUDA Occupancy Calculator

Exec


Page 65: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

85

Optimizing threads per block

Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps

More threads per block == better memory latency hiding

But, more threads per block == fewer registers per thread: kernel invocations can fail if too many registers are used

Heuristics:
Minimum: 64 threads per block (only if multiple concurrent blocks)
192 or 256 threads a better choice (usually still enough regs to compile and invoke successfully)

This all depends on your computation, so experiment!

Exec


Page 66: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

86

Occupancy != Performance

Increasing occupancy does not necessarily increase performance

BUT…

Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels

(It all comes down to arithmetic intensity and available parallelism)

Exec


Page 67: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

87

Parameterize Your Application

Parameterization helps adaptation to different GPUs

GPUs vary in many ways:
# of multiprocessors
Memory bandwidth
Shared memory size
Register file size
Threads per block

You can even make apps self-tuning (like FFTW and ATLAS): “Experiment” mode discovers and saves optimal configuration

Exec


Page 68: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Loop unrolling

Sometimes we know some kernel parameters at compile time:
# of loop iterations
Degrees of polynomials
Number of data elements

If we could “tell” this to the compiler, it can unroll loops and optimize register usage

We need to be generic: avoid code duplication, sizes unknown at compile time

Templates to the rescue: the same trick can be used for regular C++ sources

Applied Mathematics 43/53
slide by Johan Seland

Exec


Page 69: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Example: de Casteljau algorithm

A standard algorithm for evaluating polynomials in Bernstein form

Recursively defined:

\( f(x) = b^d_{0,0} \)

\( b^k_{i,j} = x \, b^{k-1}_{i+1,j} + (1-x) \, b^{k-1}_{i,j+1} \)

\( b^0_{i,j} \) are coefficients

[Figure: the de Casteljau triangle, combining the \( b^k_{i,j} \) with weights \( x \) and \( 1-x \) level by level up to \( f(x) = b^d_{0,0} \).]

Applied Mathematics 44/53
slide by Johan Seland

Exec


Page 70: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Implementation

The de Casteljau algorithm is usually implemented as nested for-loops

Coefficients are overwritten for each iteration:

float deCasteljau( float* c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f-x)*c[j] + x*c[j+1];
    }
    return c[0];
}

[Figure: the same triangle, now in terms of the overwritten coefficients \( c^k_{i,j} \).]

Applied Mathematics 45/53
slide by Johan Seland

Exec


Page 71: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Template loop unrolling

We make d a template parameter:

template<int d>
float deCasteljau( float* c, float x ) {
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f-x)*c[j] + x*c[j+1];
    }
    return c[0];
}

The kernel is called as:

switch ( d ) {
case 1:
    deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
case 2:
    deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
...
case MAXD:
    deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}

Applied Mathematics 46/53
slide by Johan Seland

Exec


Page 72: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Results

For the de Casteljau algorithm we see a relatively small speedup: ≈ 1.2x (20%...)

Very easy to implement

Can lead to long compile times

Conclusion: probably worth it near the end of the development cycle

Applied Mathematics 47/53
slide by Johan Seland

Exec


Page 73: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

88

Conclusion

Understand CUDA performance characteristics:
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding

Use peak performance metrics to guide optimization

Understand parallel algorithm complexity theory

Know how to identify type of bottleneck: e.g. memory, core computation, or instruction overhead

Optimize your algorithm, then unroll loops

Use template parameters to generate optimal code

Exec


Page 74: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

61

The CUDA Visual Profiler

Helps measure and find potential performance problems:
GPU and CPU timing for all kernel invocations and memcpys
Time stamps

Access to hardware performance counters

Profiling


Page 75: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

62

Signals

Events are tracked with hardware counters on signals in the chip:

timestamp
gld_incoherent, gld_coherent: global memory loads are coalesced (coherent) or non-coalesced (incoherent)
gst_incoherent, gst_coherent: the same for global memory stores
local_load, local_store: local loads/stores
branch, divergent_branch: total branches and divergent branches taken by threads
instructions: instruction count
warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
cta_launched: executed thread blocks

Profiling


Page 76: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

63

Interpreting profiler counters

Values represent events within a thread warp

Only targets one multiprocessor: values will not correspond to the total number of warps launched for a particular kernel. Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.

Values are best used to identify relative performance differences between unoptimized and optimized code: in other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize

Profiling


Page 77: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

84

Performance for 4M element reduction

Kernel                                              Time (2^22 ints)  Bandwidth    Step speedup  Cumulative speedup
1: interleaved addressing with divergent branching  8.054 ms          2.083 GB/s   -             -
2: interleaved addressing with bank conflicts       3.456 ms          4.854 GB/s   2.33x         2.33x
3: sequential addressing                            1.722 ms          9.741 GB/s   2.01x         4.68x
4: first add during global load                     0.965 ms          17.377 GB/s  1.78x         8.34x
5: unroll last warp                                 0.536 ms          31.289 GB/s  1.8x          15.01x
6: completely unrolled                              0.381 ms          43.996 GB/s  1.41x         21.16x
7: multiple elements per thread                     0.268 ms          62.671 GB/s  1.42x         30.04x

Kernel 7 on 32M elements: 72 GB/s!

Example


Page 78: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Build your own!


Page 79: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


Page 80: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.
slide by David Kirk

Thank you!


Page 81: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Back Pocket Slides

slide by David Cox


Page 82: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


Page 83: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Misc

IAP09 CUDA@MIT / 6.963


Page 84: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

19
M02: High Performance Computing with CUDA

Tesla C1060 Computing Processor

Processor:        1x Tesla T10P
Core clock:       1.33 GHz
Form factor:      Full ATX: 4.736" (H) x 10.5" (L), dual slot wide
On-board memory:  4 GB
System I/O:       PCIe x16 gen2
Memory I/O:       512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
Display outputs:  None
Typical power:    160 W


Page 85: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

20
M02: High Performance Computing with CUDA

Tesla S1070 1U System

Processors:               4 x Tesla T10P
Core clock:               1.5 GHz
Form factor:              1U for an EIA 19" 4-post rack
Total 1U system memory:   16 GB (4.0 GB per GPU)
System I/O:               2 PCIe x16
Memory I/O per processor: 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
Display outputs:          None
Typical power:            700 W
Chassis dimensions:       1.73" H x 17.5" W x 28.5" D


Page 86: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

18
M02: High Performance Computing with CUDA

Double Precision Floating Point

                                   NVIDIA GPU                   SSE2                         Cell SPE
Precision                          IEEE 754                     IEEE 754                     IEEE 754
Rounding modes for FADD and FMUL   All 4 IEEE: round to         All 4 IEEE: round to         Round to zero/truncate only
                                   nearest, zero, inf, -inf     nearest, zero, inf, -inf
Denormal handling                  Full speed                   Supported, costs 1000's      Flush to zero
                                                                of cycles
NaN support                        Yes                          Yes                          No
Overflow and Infinity support      Yes                          Yes                          No infinity, clamps to max norm
Flags                              No                           Yes                          Some
FMA                                Yes                          No                           Yes
Square root                        Software with low-latency    Hardware                     Software only
                                   FMA-based convergence
Division                           Software with low-latency    Hardware                     Software only
                                   FMA-based convergence
Reciprocal estimate accuracy       24 bit                       12 bit                       12 bit
Reciprocal sqrt estimate accuracy  23 bit                       12 bit                       12 bit
log2(x) and 2^x estimates accuracy 23 bit                       No                           No