IAP09 CUDA@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA Lecture 07 CUDA Advanced #2 - Nicolas Pinto (MIT) Friday, January 23, 2009


DESCRIPTION

More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009 Note that some slides were borrowed from NVIDIA.


Page 1: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

IAP09 CUDA@MIT / 6.963

Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA

Lecture 07

CUDA Advanced #2

Nicolas Pinto (MIT)


Page 2: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

During this course, we'll try to “ ” and use existing material ;-)

adapted for 6.963


Page 3: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Today

yey!!


Page 4: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Wanna Play with The Big Guys?


Page 5: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Here are the keys to High-Performance in CUDA


Page 6: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

To optimize or not to optimize

Hoare said (and Knuth restated):

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

~3% of the time we really should worry about small efficiencies
(Every 33rd code line)

Warning!

Applied Mathematics 23/53
slide by Johan Seland


Page 8: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Strategy

Memory Optimizations

Execution Optimizations

IAP09 CUDA@MIT / 6.963


Page 9: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

CUDA Performance Strategies

IAP09 CUDA@MIT / 6.963


Page 10: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Optimization goals

We should strive to reach GPU performance

We must know the GPU performance: Vendor specifications, Synthetic benchmarks

Choose a performance metric: Memory bandwidth or GFLOPS?

Use clock() to measure

Experiment and profile!

Applied Mathematics 25/53

Strategy

slide by Johan Seland
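To make such measurements concrete, here is a minimal timing sketch using CUDA's event API (the kernel name, launch configuration, and buffer are illustrative assumptions, not from the slides):

// Hypothetical harness: time a kernel with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_data);   // kernel under test (assumed name)
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// Derive the chosen metric, e.g. GB/s = bytes moved / (ms * 1e6).

The same experiment can be repeated over many runs and averaged, as later slides do.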


Page 11: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2006

Programming Model

A kernel is executed as a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by:

Sharing data through shared memory

Synchronizing their execution

Threads from different blocks cannot cooperate

[Figure: the host launches Kernel 1 on Grid 1, a 3x2 array of thread blocks, and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5x3 array of threads.]

Threading


Page 12: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2008

Data Movement in a CUDA Program

Host Memory -> Device Memory -> [Shared Memory] -> COMPUTATION -> [Shared Memory] -> Device Memory -> Host Memory

Memory
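As a sketch, that data movement maps onto the runtime API like this (buffer names, N, and the kernel are illustrative assumptions):

float *h_buf, *d_buf;
size_t bytes = N * sizeof(float);            // N assumed defined
h_buf = (float*)malloc(bytes);
cudaMalloc((void**)&d_buf, bytes);

cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // host -> device
compute<<<grid, block>>>(d_buf);             // kernel stages data through SMEM
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // device -> host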


Page 13: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

39

Optimize Algorithms for the GPU

Maximize independent parallelism

Maximize arithmetic intensity (math/bandwidth)

Sometimes it's better to recompute than to cache: GPU spends its transistors on ALUs, not memory

Do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to host

Perf

Page 14: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

40

Optimize Memory Coherence

Coalesced vs. non-coalesced = order of magnitude: Global/Local device memory

Optimize for spatial locality in cached texture memory

In shared memory, avoid high-degree bank conflicts

Perf

Page 15: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

41

Take Advantage of Shared Memory

Hundreds of times faster than global memory

Threads can cooperate via shared memory

Use one / a few threads to load / compute data shared by all threads

Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing

Matrix transpose example later

Perf

Page 16: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

42

Use Parallelism Efficiently

Partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks

Keep resource usage low enough to support multiple active thread blocks per multiprocessor: registers, shared memory

Perf

Page 17: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


Page 18: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Memory Optimizations

IAP09 CUDA@MIT / 6.963


Page 19: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

44

Memory optimizations

Optimizing memory transfers

Coalescing global memory accesses

Using shared memory effectively

Memory

Page 20: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

45

Data Transfers

Device memory to host memory bandwidth much lower than device memory to device bandwidth: 4 GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600); 8 GB/s for PCI-e 2.0

Minimize transfers: intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory

Group transfers: one large transfer much better than many small ones

Memory
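A sketch of the “minimize transfers” point: an intermediate buffer that is allocated, used, and freed without ever visiting the host (the kernel and buffer names are assumptions):

float *d_tmp;
cudaMalloc((void**)&d_tmp, bytes);        // intermediate lives only on the device
stage1<<<grid, block>>>(d_in, d_tmp);     // produce the intermediate
stage2<<<grid, block>>>(d_tmp, d_out);    // consume it; no cudaMemcpy in between
cudaFree(d_tmp);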


Page 21: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

46

Page-Locked Memory Transfers

cudaMallocHost() allows allocation of page-locked host memory

Enables highest cudaMemcpy performance: 3.2 GB/s common on PCI-express (x16); ~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)

See the “bandwidthTest” CUDA SDK sample

Use with caution: allocating too much page-locked memory can reduce overall system performance

Test your systems and apps to learn their limits

Memory
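A minimal sketch of the page-locked path with cudaMallocHost() (sizes and the destination buffer are assumed):

float *h_pinned;
cudaMallocHost((void**)&h_pinned, bytes);  // page-locked host allocation
// ... fill h_pinned ...
cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);  // DMA-friendly copy
cudaFreeHost(h_pinned);                    // release the pinned memory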


Page 22: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

47

Global Memory Reads/Writes

Highest latency instructions: 400-600 clock cycles

Likely to be performance bottleneck

Optimizations can greatly increase performance:
Coalescing: up to 10x speedup
Latency hiding: up to 2.5x speedup

gmem


Page 23: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Accessing global memory

4 cycles to issue a memory fetch,

but 400-600 cycles of latency: the equivalent of 100 MADs

Likely to be a performance bottleneck

Order of magnitude speedups possible:
Coalesce memory access
Use shared memory to re-order non-coalesced addressing

Applied Mathematics 32/53
slide by Johan Seland

gmem


Page 24: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

48

Coalescing

A coordinated read by a half-warp (16 threads)

A contiguous region of global memory:
64 bytes - each thread reads a word: int, float, …
128 bytes - each thread reads a double-word: int2, float2, …
256 bytes - each thread reads a quad-word: int4, float4, …

Additional restrictions on G8X/G9X architecture:
Starting address for a region must be a multiple of region size
The kth thread in a half-warp must access the kth element in a block being read

Exception: not all threads must be participating
Predicated access, divergence within a half-warp

gmem
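A sketch of a kernel that satisfies these rules on G8X-class hardware: each of the 16 threads of a half-warp reads one consecutive 32-bit word from an aligned region (names are illustrative):

__global__ void copyCoalesced(float *out, const float *in)
{
    // Thread k of each half-warp reads word k of a contiguous,
    // properly aligned segment: a coalesced access.
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}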


Page 25: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

49

Coalesced Access: Reading floats

[Figure: threads t0-t15 reading consecutive 4-byte addresses 128-188. Top: all threads participate. Bottom: some threads do not participate. Both reads coalesce.]

gmem


Page 26: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

50

Uncoalesced Access: Reading floats

[Figure: permuted access by threads, and a misaligned starting address (not a multiple of 128); both patterns break coalescing.]

gmem


Page 27: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

51

Coalescing: Timing Results

Experiment on G80:
Kernel: read a float, increment, write back
3M floats (12MB)
Times averaged over 10K runs

12K blocks x 256 threads:
356µs - coalesced
357µs - coalesced, some threads don't participate
3,494µs - permuted/misaligned thread access

gmem


Page 28: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

58

Coalescing: Structures of Size ≠ 4, 8, or 16 Bytes

Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)

If SoA is not viable:
Force structure alignment: __align__(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing

zyx Point structure

zyx zyx zyx AoS

xxx yyy zzz SoA

gmem
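A sketch of the two layouts and the alignment workaround named above (the types are illustrative assumptions):

struct PointAoS { float x, y, z; };     // 12-byte elements: breaks coalescing

struct __align__(16) PointAligned {     // forced 16-byte alignment
    float x, y, z;
};

struct PointsSoA {                      // Structure of Arrays: each member is
    float *x, *y, *z;                   // a contiguous, coalescable array
};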


Page 29: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

59

Coalescing: Summary

Coalescing greatly improves throughput: critical to memory-bound kernels

Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
Prefer Structures of Arrays over AoS
If SoA is not viable, read/write through SMEM

Additional resources: Aligned Types SDK sample

gmem


Page 30: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

64

Parallel Memory Architecture

In a parallel machine, many threads access memory:
Therefore, memory is divided into banks
Essential to achieve high bandwidth

Each bank can service one address per cycle:
A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict:
Conflicting accesses are serialized

[Figure: shared memory shown as banks 0-15.]

smem


Page 31: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

65

Bank Addressing Examples

No Bank Conflicts: linear addressing, stride == 1

No Bank Conflicts: random 1:1 permutation

[Figure: threads 0-15 mapped one-to-one onto banks 0-15 in both patterns.]

smem


Page 32: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

66

Bank Addressing Examples

2-way Bank Conflicts: linear addressing, stride == 2

8-way Bank Conflicts: linear addressing, stride == 8

[Figure: with stride 2, pairs of threads collide in the even banks; with stride 8, the threads pile onto banks 0 and 8.]

smem


Page 33: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

67

How addresses map to banks on G80

Bandwidth of each bank is 32 bits per 2 clock cycles

Successive 32-bit words are assigned to successive banks

G80 has 16 banks: so bank = address % 16

Same as the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp

smem


Page 34: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

68

Shared memory bank conflicts

Shared memory is as fast as registers if there are no bank conflicts

The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank

smem
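A sketch of the three cases above (assuming a 16-bank part such as G80; the array is illustrative):

__shared__ float s[256];
int tid = threadIdx.x;

float a = s[tid];       // stride 1: each thread of a half-warp hits a
                        // different bank, so no conflict
float b = s[0];         // all threads read one address: broadcast, no conflict
float c = s[tid * 8];   // stride 8: the half-warp collides in banks 0 and 8,
                        // an 8-way conflict that gets serialized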


Page 35: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Use the right kind of memory

Constant memory: quite small (≈ 20K); as fast as register access if all threads in a warp access the same location

Texture memory: spatially cached; optimized for 2D locality; neighboring threads should read neighboring addresses; no need to think about coalescing

Constraint: these memories can only be updated from the CPU

Applied Mathematics 31/53
slide by Johan Seland

Strategy
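A sketch of the constant-memory case (the array size and names are assumptions):

__constant__ float coeffs[16];          // small, cached constant memory

__global__ void scale(float *data)
{
    // Every thread in the warp reads the same location,
    // so the access is as fast as a register read.
    data[threadIdx.x] *= coeffs[0];
}

// Host side; per the constraint above, constant memory is
// written from the CPU only:
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));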


Page 36: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Memory optimizations roundup

CUDA memory handling is complex, and I have not covered all topics...

Using memory correctly can lead to huge speedups; at least CUDA exposes the memory hierarchy, unlike CPUs

Get your algorithm up and running first, then optimize

Use shared memory to let threads cooperate

Be wary of “data ownership”: a thread does not have to read/write the data it calculates

Applied Mathematics 41/53

Strategy

slide by Johan Seland


Page 37: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Conflicts, Coalescing, Warps... I hate growing up.


Page 38: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Optimization Example: Matrix Transpose

Example


Page 39: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

70

Matrix transpose

SDK Sample (“transpose”)

Illustrates:
Coalescing
Avoiding SMEM bank conflicts
Speedups for even small matrices

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

1 5 9 13

2 6 10 14

3 7 11 15

4 8 12 16

Example


Page 40: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

71

Uncoalesced transpose

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

Example


Page 41: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

72

Uncoalesced transpose

Reads input from GMEM: stride = 1, coalesced

Writes output to GMEM: stride = 16, uncoalesced

[Figure: per-thread element coordinates for the GMEM read and write patterns.]

Example


Page 42: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

73

Coalesced transpose

Assumption: matrix is partitioned into square tiles

Threadblock (bx, by):
Read the (bx,by) input tile, store into SMEM
Write the SMEM data to (by,bx) output tile: transpose the indexing into SMEM

Thread (tx,ty):
Reads element (tx,ty) from input tile
Writes element (tx,ty) into output tile

Coalescing is achieved if:
Block/tile dimensions are multiples of 16

Example


Page 43: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

74

Coalesced transpose

Reads from GMEM / writes to SMEM, then reads from SMEM / writes to GMEM

[Figure: per-thread element coordinates for each access phase.]

Example


Page 44: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

75

SMEM Optimization

Threads read SMEM with stride = 16: bank conflicts

Solution:
Allocate an “extra” column
Read stride = 17
Threads read from consecutive banks

[Figure: the stride-16 and stride-17 read patterns.]

Example
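A minimal sketch of the padding trick for a 16x16 thread block (names and launch shape are assumptions):

__global__ void padDemo(const float *in, float *out)
{
    // One extra column: column-wise accesses then have stride 17,
    // which maps the 16 threads of a half-warp to 16 distinct banks.
    __shared__ float tile[16][17];
    unsigned int i = threadIdx.y * 16 + threadIdx.x;
    tile[threadIdx.y][threadIdx.x] = in[i];   // row-wise write
    __syncthreads();
    out[i] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
}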


Page 46: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

76

Coalesced transpose

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
        block[index_block] = idata[index_in];
        index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }

    __syncthreads();

    if (xIndex < width && yIndex < height)
        odata[index_out] = block[index_transpose];
}

Example


Page 47: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Coalesced transpose: Source code

__global__ void
transpose( float *out, float *in, int width, int height ) {

    // Allocate shared memory.
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    // Set up indexing.
    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    // Check that we are within domain, calculate more indices.
    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

        // Write to shared memory.
        block[index_block] = in[index_in];

        // Calculate output indices.
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }

    // Synchronize. NB: outside if-clause.
    __syncthreads();

    // Write to global mem. Different index.
    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];
    }
}

Applied Mathematics 39/53

Example

slide by Johan Seland


Page 55: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Transpose timings

Was it worth the trouble?

Grid Size    Coalesced  Non-coalesced  Speedup
128 x 128    0.011 ms   0.022 ms       2.0x
512 x 512    0.07 ms    0.33 ms        4.5x
1024 x 1024  0.30 ms    1.92 ms        6.4x
1024 x 2048  0.79 ms    6.6 ms         8.4x

For me, this is a clear yes.

Applied Mathematics 40/53
slide by Johan Seland

Example


Page 56: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


Page 57: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Execution Optimizations

IAP09 CUDA@MIT / 6.963


Page 58: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Know the arithmetic cost of operations

4 clock cycles:
Floating point: add, multiply, fused multiply-add
Integer add, bitwise operations, compare, min, max

16 clock cycles:
reciprocal, reciprocal square root, log(x), 32-bit integer multiplication

32 clock cycles:
sin(x), cos(x) and exp(x)

36 clock cycles:
Floating point division (24-bit version in 20 cycles)

Particularly costly: integer division, modulo
Remedy: replace with shifting whenever possible (see the sketch below)

Double precision (when available) will perform at half the speed

Applied Mathematics 28/53
slide by Johan Seland

Exec
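A sketch of the shifting remedy for power-of-two divisors:

// i / 16 and i % 16 without the costly division/modulo:
int q = i >> 4;    // quotient:  i / 16
int r = i & 15;    // remainder: i % 16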


Page 59: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

79

Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently

Limited by resource usage:
Registers
Shared memory

Exec
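As a worked example (assuming a G80-class multiprocessor, which can hold at most 768 threads, i.e. 24 warps): blocks of 256 threads contribute 8 warps each, so if register and shared-memory usage allow two resident blocks, occupancy = 16 / 24 ≈ 67%; a third resident block would raise it to 24 / 24 = 100%.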


Page 60: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

80

Grid/Block Size Heuristics

# of blocks > # of multiprocessors: so all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2:
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren't waiting at a __syncthreads() keep the hardware busy
Subject to resource availability (registers, shared memory)

# of blocks > 100 to scale to future devices:
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations

Exec


Page 61: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

81

Register Dependency

Read-after-write register dependency: instruction's result can be read ~22 cycles later

Scenarios: CUDA / PTX:

x = y + 5;
z = x + 3;

add.f32   $f3, $f1, $f2
add.f32   $f5, $f3, $f4

s_data[0] += 3;

ld.shared.f32  $f3, [$r31+0]
add.f32        $f3, $f3, $f4

To completely hide the latency:
Run at least 192 threads (6 warps) per multiprocessor
At least 25% occupancy
Threads do not have to belong to the same thread block

Exec


Page 62: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

82

Register Pressure

Hide latency by using more threads per SM

Limiting factors:
Number of registers per kernel: 8192 per SM, partitioned among concurrent threads
Amount of shared memory: 16KB per SM, partitioned among concurrent threadblocks

Check .cubin file for # registers / kernel

Use -maxrregcount=N flag to NVCC: N = desired maximum registers / kernel

At some point “spilling” into LMEM may occur: reduces performance (LMEM is slow)

Check .cubin file for LMEM usage

Exec


Page 63: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

83

Determining resource usage

Use the --ptxas-options=-v option to nvcc, or compile the kernel code with the -cubin flag to determine register usage

Open the .cubin file with a text editor and look for the “code” section:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
  name = BlackScholesGPU
  lmem = 0      <- per thread local memory
  smem = 68     <- per thread block shared memory
  reg = 20      <- per thread registers
  bar = 0
  bincode {
    0xa0004205 0x04200780 0x40024c09 0x00200780

Exec


Page 64: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

84

CUDA Occupancy Calculator

Exec


Page 65: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

85

Optimizing threads per block

Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps

More threads per block == better memory latency hiding

But, more threads per block == fewer registers per thread: kernel invocations can fail if too many registers are used

Heuristics:
Minimum: 64 threads per block (only if multiple concurrent blocks)
192 or 256 threads a better choice (usually still enough regs to compile and invoke successfully)

This all depends on your computation, so experiment!

Exec


Page 66: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

86

Occupancy != Performance

Increasing occupancy does not necessarily increase performance

BUT…

Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels

(It all comes down to arithmetic intensity and available parallelism)

Exec


Page 67: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

87

Parameterize Your Application

Parameterization helps adaptation to different GPUs

GPUs vary in many ways:
# of multiprocessors
Memory bandwidth
Shared memory size
Register file size
Threads per block

You can even make apps self-tuning (like FFTW and ATLAS): “Experiment” mode discovers and saves optimal configuration

Exec


Page 68: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Loop unrolling

Sometimes we know some kernel parameters at compile time:
# of loop iterations
Degrees of polynomials
Number of data elements

If we could “tell” this to the compiler, it can unroll loops and optimize register usage

We need to be generic: avoid code duplication, sizes unknown at compile time

Templates to the rescue: the same trick can be used for regular C++ sources

Applied Mathematics 43/53
slide by Johan Seland

Exec


Page 69: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Example: de Casteljau algorithm

A standard algorithm for evaluating polynomials in Bernstein form

Recursively defined:

\( f(x) = b^d_{0,0} \)

\( b^k_{i,j} = x \, b^{k-1}_{i+1,j} + (1-x) \, b^{k-1}_{i,j+1} \)

\( b^0_{i,j} \) are coefficients

[Figure: the de Casteljau triangle, combining the \( b^k_{i,j} \) with weights \( x \) and \( 1-x \) level by level up to \( f(x) = b^d_{0,0} \).]

Applied Mathematics 44/53
slide by Johan Seland

Exec


Page 70: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Implementation

The de Casteljau algorithm is usually implemented as nested for-loops

Coefficients are overwritten for each iteration:

float deCasteljau( float* c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f-x)*c[j] + x*c[j+1];
    }
    return c[0];
}

[Figure: the same triangle, now in terms of the overwritten coefficients \( c^k_{i,j} \).]

Applied Mathematics 45/53
slide by Johan Seland

Exec


Page 71: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Template loop unrolling

We make d a template parameter:

template<int d>
float deCasteljau( float* c, float x ) {
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d-i; ++j )
            c[j] = (1.0f-x)*c[j] + x*c[j+1];
    }
    return c[0];
}

The kernel is called as:

switch ( d ) {
case 1:
    deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
case 2:
    deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
...
case MAXD:
    deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}

Applied Mathematics 46/53
slide by Johan Seland

Exec


Page 72: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Results

For the de Casteljau algorithm we see a relatively small speedup: ≈ 1.2x (20%...)

Very easy to implement

Can lead to long compile times

Conclusion: probably worth it near the end of the development cycle

Applied Mathematics 47/53
slide by Johan Seland

Exec


Page 73: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

88

Conclusion

Understand CUDA performance characteristics:
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding

Use peak performance metrics to guide optimization

Understand parallel algorithm complexity theory

Know how to identify type of bottleneck: e.g. memory, core computation, or instruction overhead

Optimize your algorithm, then unroll loops

Use template parameters to generate optimal code

Exec


Page 74: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

61

The CUDA Visual Profiler

Helps measure and find potential performance problems:
GPU and CPU timing for all kernel invocations and memcpys
Time stamps

Access to hardware performance counters

Profiling


Page 75: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

62

Signals

Events are tracked with hardware counters on signals in the chip:

timestamp
gld_incoherent, gld_coherent: global memory loads are coalesced (coherent) or non-coalesced (incoherent)
gst_incoherent, gst_coherent: the same for global memory stores
local_load, local_store: local loads/stores
branch, divergent_branch: total branches and divergent branches taken by threads
instructions: instruction count
warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
cta_launched: executed thread blocks

Profiling


Page 76: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

63

Interpreting profiler counters

Values represent events within a thread warp

Only targets one multiprocessor: values will not correspond to the total number of warps launched for a particular kernel. Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.

Values are best used to identify relative performance differences between unoptimized and optimized code: in other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize

Profiling


Page 77: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

84

Performance for 4M element reduction

Kernel                                              Time (2^22 ints)  Bandwidth    Step speedup  Cumulative speedup
1: interleaved addressing with divergent branching  8.054 ms          2.083 GB/s   -             -
2: interleaved addressing with bank conflicts       3.456 ms          4.854 GB/s   2.33x         2.33x
3: sequential addressing                            1.722 ms          9.741 GB/s   2.01x         4.68x
4: first add during global load                     0.965 ms          17.377 GB/s  1.78x         8.34x
5: unroll last warp                                 0.536 ms          31.289 GB/s  1.8x          15.01x
6: completely unrolled                              0.381 ms          43.996 GB/s  1.41x         21.16x
7: multiple elements per thread                     0.268 ms          62.671 GB/s  1.42x         30.04x

Kernel 7 on 32M elements: 72 GB/s!

Example


Page 78: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Build your own!


Page 79: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


Page 80: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.
slide by David Kirk

Thank you!


Page 81: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Back Pocket Slides

slide by David Cox


Page 82: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


Page 83: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

Misc

IAP09 CUDA@MIT / 6.963


Page 84: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

19
M02: High Performance Computing with CUDA

Tesla C1060 Computing Processor

Processor:        1x Tesla T10P
Core clock:       1.33 GHz
Form factor:      Full ATX: 4.736" (H) x 10.5" (L), dual slot wide
On-board memory:  4 GB
System I/O:       PCIe x16 gen2
Memory I/O:       512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
Display outputs:  None
Typical power:    160 W


Page 85: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

20
M02: High Performance Computing with CUDA

Tesla S1070 1U System

Processors:               4 x Tesla T10P
Core clock:               1.5 GHz
Form factor:              1U for an EIA 19" 4-post rack
Total 1U system memory:   16 GB (4.0 GB per GPU)
System I/O:               2 PCIe x16
Memory I/O per processor: 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
Display outputs:          None
Typical power:            700 W
Chassis dimensions:       1.73" H x 17.5" W x 28.5" D


Page 86: IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

18
M02: High Performance Computing with CUDA

Double Precision Floating Point

                                   NVIDIA GPU                   SSE2                         Cell SPE
Precision                          IEEE 754                     IEEE 754                     IEEE 754
Rounding modes for FADD and FMUL   All 4 IEEE: round to         All 4 IEEE: round to         Round to zero/truncate only
                                   nearest, zero, inf, -inf     nearest, zero, inf, -inf
Denormal handling                  Full speed                   Supported, costs 1000's      Flush to zero
                                                                of cycles
NaN support                        Yes                          Yes                          No
Overflow and Infinity support      Yes                          Yes                          No infinity, clamps to max norm
Flags                              No                           Yes                          Some
FMA                                Yes                          No                           Yes
Square root                        Software with low-latency    Hardware                     Software only
                                   FMA-based convergence
Division                           Software with low-latency    Hardware                     Software only
                                   FMA-based convergence
Reciprocal estimate accuracy       24 bit                       12 bit                       12 bit
Reciprocal sqrt estimate accuracy  23 bit                       12 bit                       12 bit
log2(x) and 2^x estimates accuracy 23 bit                       No                           No