
GPU Compute


DESCRIPTION

Introduction to GPU Compute (think CUDA, OpenCL, DirectCompute...) for "non-GPU" programmers. Slight focus on the videogame business.


Page 1: GPU Compute


Page 2: GPU Compute

So I found this Batman image on the net. I had to put it somewhere in the presentation. Here it is.

Nvidia has a page on its «CUDA zone» showing applications and speedups; it's a bit silly and not so advertised anymore nowadays, but still useful as an introduction: http://www.nvidia.com/object/cuda-apps-flash-new.html#

Intel's Knights Corner news:
http://www.intel.com/pressroom/archive/releases/2010/20100531comp.htm
http://en.wikipedia.org/wiki/Intel_MIC#Knights_Corner


Page 3: GPU Compute

A «wild» and not even particularly creative example... More computation could enable entirely novel scenarios... The past couple of Unreal Engine demos were architected to show some of what could be possible…


Page 4: GPU Compute

I like the idea of motion graphs: http://cg.cis.upenn.edu/hms/research/wcMG/
Also, imperfect shadow maps are cool: http://cg.ibds.kit.edu/english/publications.php
The two images top right are for megameshes (Lionhead) and megatextures (id).
Fight Night Round 4 and Champion are the first games using EA's Real AI, which is trained from examples.


Page 5: GPU Compute

Naughty Dog famously hand-optimizes SPU kernels. How many games pack so much computation, even today, onto SPUs? And still, these are pretty «uniform» workloads, on a processor that can handle non-uniform loads better than GPU compute can (lower SIMD width).

Heart bokeh image taken from http://www.augustinefou.com/2009/08/bokeh-filter-turns-blurry-lights-into.html; I like how the woman is pushed out of the frame to show the bokeh, as a metaphor for how we did some effects "early" on 360 and PS3.


Page 6: GPU Compute

* Data-parallel is "harder" for more high-level tasks, but it's far from impossible. In fact, we still rely on lots of data; we just let this data be combed through with mostly handmade serial logic… Use learning, classification, probabilistic and optimization methods for games...

E.g. in animation, we have data: go from blend trees to animation clips all the way down to working with individual poses, and use the animation data itself and its metrics to drive animation, not logic! http://graphics.stanford.edu/projects/ccclde/ccclde.pdf http://www.stanford.edu/~svlevine/

Another avenue: simulation, modeling, physics…

The last image is from http://www.creativeapplications.net/theory/meibach-and-posavec-data-visualization-poetry-and-sculpture/; think about what you can do with all the data we already have…


Page 7: GPU Compute


Page 8: GPU Compute

Usually also superscalar, multiple pipes, SIMD… but for now we don't care.


Page 9: GPU Compute

Ok, ok, the last Batman…


Page 10: GPU Compute

...but often it is not; it's «surprising» (though consider that the compiler often does not know whether a given loop is executed often or not) how much manual unrolling (and prefetching, on platforms without strong cache predictors) in tight loops still pays, allowing the compiler to compute a good, dependency-free schedule for the generated assembly.
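As a minimal sketch of the kind of unrolling meant here (a hypothetical example of mine, not taken from the slides): splitting one accumulator into four breaks the serial dependency chain and gives the compiler independent instructions to schedule.

```cpp
// Summing with a single accumulator: every add depends on the previous
// one, so an in-order pipeline stalls on the dependency chain.
float sum_naive(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Manually unrolled 4x with independent accumulators: the compiler can
// now keep four adds in flight at once. (Note: reassociating float
// additions can change the result slightly.)
float sum_unrolled(const float* a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];  // remainder
    return (s0 + s1) + (s2 + s3);
}
```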


Page 11: GPU Compute


Page 12: GPU Compute


Page 13: GPU Compute

First  «insight»  we  learned  


Page 14: GPU Compute

Again,  same  reasoning  


Page 15: GPU Compute


Page 16: GPU Compute

Fibers and coroutines are used, for example, to hide I/O and network access in web servers (a popular example is node.js).


Page 17: GPU Compute

For a sequential CPU loop, all this work is done *in the specified order*. The compiler may reschedule instructions without breaking the logical order of operations, and an OOO CPU (e.g. on PC, but not PS3 or 360) will re-order instructions (well, micro-instructions) to get rid of dependencies, but the processing order is essentially preserved. By contrast, on the GPU, all you're saying is that you want the whole array processed; you don't care about the order.
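A minimal CUDA sketch of that contrast (my own example, with assumed names; not code from the slides): the CPU loop fixes the visiting order, while the kernel only states that each element must be processed, leaving the schedule to the hardware.

```cpp
#include <cuda_runtime.h>

// CPU version: elements are visited strictly in index order.
void scale_cpu(float* data, int n, float k) {
    for (int i = 0; i < n; ++i)
        data[i] *= k;
}

// GPU version: one thread per element, no ordering implied or needed,
// because no element depends on any other.
__global__ void scale_gpu(float* data, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}

// Launch example: scale_gpu<<<(n + 255) / 256, 256>>>(d_data, n, k);
```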


Page 18: GPU Compute

Of  course  there  is  other  stuff  too,  and  never  mistake  a  logical  model  for  a  physical  one,  but  this  is  enough  for  everything  we’ll  need  


Page 19: GPU Compute

Vectors are usually even larger than the number of ALUs: repeat the same instruction a number of times to process an entire vector (this allows the ALUs to be pipelined; e.g. a vector of 16 float4 elements might be processed by four float4 ALUs issuing 4 at a time, where each ALU pipeline stage can compute a single float per cycle, and with four stages we achieve float4 processing with a throughput of one per cycle and a latency of four).

We say "threads" here, but they are more similar to lightweight fibers. Also, confusingly, different GPU languages use different names for the same concepts… The more computational threads we can pack into the registers, the more independent work we'll have to hide latencies!

This lightweight threading mechanism is fundamental for performance with manycores; see e.g. http://www.linleygroup.com/newsletters/newsletter_detail.php?num=4747&year=2011&tag=3 which seems to be a derivative of the Tilera Tile64 http://www.tilera.com/products/processors/TILE64


Page 20: GPU Compute

Illustration stolen from Running Code at a Teraflop: Overview of GPU Architecture, Kayvon Fatahalian, SIGGRAPH 2009.

Some GPUs do not support variable-length loops; everything gets unrolled and inlined...


Page 21: GPU Compute

E.g. Pixar's Reyes rendering algorithm had interpreted, very parallel shaders. The cost of the interpreter is zero when you repeat the same instruction over thousands of elements. Reyes shaders are essentially the same as the hardware shaders we have on modern GPUs.

Note that GPU shaders have some explicit SIMD instructions in HLSL and similar languages, supporting vectors of elements up to 4-wide. In practice, many modern GPUs process only scalar code, turning vector instructions into scalars. This is because the SIMD parallelism that GPUs implement is much wider than the 4-wide instructions of the language, coupling many (e.g. 16 or more) execution elements together in a big execution vector. This holds regardless of whether each element is scalar or 4-way; some GPUs even have a 4+1 configuration (one 4-way and one scalar instruction in parallel per execution element). In general, when we talk about "GPU vectors" we don't mean the actual vector instructions in the shader code but the aggregation of many ALUs following the same code.


Page 22: GPU Compute


Page 23: GPU Compute


Page 24: GPU Compute

Skinning images from http://blog.wolfire.com/2009/11/volumetric-heat-diffusion-skinning/

If you've done rendering on the GPU, you might notice that the pseudo-code looks nothing like a vertex shader, where you don't fetch data; it's passed as the parameters of the «main» function of a kernel. The data is laid out in GPU memory and bound via «streams», «constants» and other means. I didn't include a real vertex shader code example, even if it would not have been harder to read, because it might even be misleading. The way DirectX or OpenGL exposes programmable functionality to the user might be close to the hardware, but most often nowadays it is not. E.g. shaders might indeed compile to include explicit fetches...

Pixel shaders can «kill» their output, so to be pedantic they can output one or zero elements. Also, pixel processing might be culled «early»; this is very important.

The «independence» that I assert is not entirely true. Pixel shaders need derivative information for all the variables; these are computed using finite differences from the neighboring pixels, which not only constrains them to always be processed in 2x2 pixel units (quads) but also establishes, through the derivatives (ddx, ddy), a means of communication (e.g. see Eric Penner's Shader Amortization using Pixel Quad Message Passing, GPU Pro 2)
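To make the «fetch your own data» point concrete, here is a hypothetical compute-style skinning kernel in CUDA; the struct layout, names, and fixed four influences per vertex are all assumptions for illustration, not the slides' actual pseudo-code:

```cpp
#include <cuda_runtime.h>

struct Influence { int bone; float weight; };  // assumed layout

// Unlike a vertex shader, whose inputs arrive as «main» parameters bound
// to streams and constants, this kernel explicitly fetches everything it
// needs from global memory by index.
__global__ void skin_verts(const float4* __restrict__ restPos,  // per vertex
                           const Influence* __restrict__ infl,  // 4 per vertex
                           const float* __restrict__ boneMats,  // 3x4, row-major
                           float4* __restrict__ outPos, int numVerts) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVerts) return;

    float4 p = restPos[v];                  // explicit fetch
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < 4; ++j) {
        Influence in = infl[v * 4 + j];     // explicit fetch
        const float* m = boneMats + in.bone * 12;
        acc.x += in.weight * (m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3]);
        acc.y += in.weight * (m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7]);
        acc.z += in.weight * (m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11]);
    }
    outPos[v] = make_float4(acc.x, acc.y, acc.z, 1.0f);
}
```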


Page 25: GPU Compute

We'll see some more accurate pseudo-code later… for now, let's still pretend to work with a generic "in-parallel" primitive.


Page 26: GPU Compute


Page 27: GPU Compute

* At least in our representation of the GPU... In reality, compilers can decide to spill some of the local state into the shared memory (we'll see later) and allow for a stack and so on, but that is quite a different thing...

E.g. Nvidia Fermi has a limit of 64 temporary registers in kernel code (the number of registers per core is higher, but it is used for multiple contexts).


Page 28: GPU Compute


Page 29: GPU Compute

ALL THE PSEUDO-CODE OMITS BOUNDS CHECKS :)
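In real kernel code the usual guard looks like this (a trivial CUDA sketch, assuming a 1D launch rounded up to cover n elements):

```cpp
// Round the grid up to cover n, then have each thread verify that its
// global index is actually in range before touching memory.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // the bounds check the pseudo-code leaves out
    data[i] += 1.0f;
}

// Launch example: add_one<<<(n + 255) / 256, 256>>>(d_data, n);
```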


Page 30: GPU Compute

http://en.wikipedia.org/wiki/Sorting_network
http://www.cs.brandeis.edu/~hugues/sorting_networks.html
http://www.sciencedirect.com/science/article/pii/0020019089901968
http://www.springerlink.com/content/g002019156467227/

Tangentially related, but interesting: http://www.pvk.ca/Blog/2012/08/27/tabasco-sort-super-optimal-merge-sort/
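Since the slide doesn't show code, here is a tiny illustration of why sorting networks suit SIMD hardware (my example, not from the deck): the sequence of compare-exchanges is fixed and data-independent, so every lane can run exactly the same instructions.

```cpp
#include <math.h>

// Branch-free compare-exchange: after the call, a <= b.
__host__ __device__ inline void cswap(float& a, float& b) {
    float lo = fminf(a, b), hi = fmaxf(a, b);
    a = lo; b = hi;
}

// Optimal 4-input sorting network: 5 comparators, depth 3.
// The comparator sequence never depends on the data.
__host__ __device__ inline void sort4(float v[4]) {
    cswap(v[0], v[1]); cswap(v[2], v[3]);  // layer 1
    cswap(v[0], v[2]); cswap(v[1], v[3]);  // layer 2
    cswap(v[1], v[2]);                     // layer 3
}
```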


Page 31: GPU Compute


Page 32: GPU Compute

Rendering is a non-uniform workload; fully automated scheduling has its benefits… e.g. early-z / early-stencil rejection, which performs a hardware-automated «stream compaction»! Fundamental for rendering...


Page 33: GPU Compute

At first it's «easy» to complain about bank rules and coalescing issues when trying to access shared and global memory... But we have to consider that the GPU offers the ability to access different memory locations per SIMD element, on many of them, on many cores... This is a rather impressive feat; on standard CPUs, to load data into SIMD vectors you have to do it serially, one element at a time. Memory has addressing limitations; it's a limit of the technology.

SIMD communication via voting: all, any, ballot. Atomics: Add, Sub, Inc, Dec, And, Or, Xor, Exch, Min, Max, CompareAndSwap. Append (or) consume structured buffer views (DX11)
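A small CUDA sketch of those two primitive families together (my example; assumes CUDA 9+ for __ballot_sync, a 32-wide warp, and a block size that is a multiple of 32): a warp vote aggregates a predicate, and a single atomic per warp accumulates the count.

```cpp
// Count the elements that pass a predicate, using a warp ballot to
// aggregate and one atomicAdd per warp instead of one per thread.
__global__ void count_positive(const float* data, int n, int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool pred = (i < n) && (data[i] > 0.0f);

    // Every lane gets a 32-bit mask of the whole warp's predicate.
    unsigned mask = __ballot_sync(0xffffffffu, pred);

    if ((threadIdx.x & 31) == 0)           // first lane of each warp
        atomicAdd(counter, __popc(mask));  // add the warp's vote count
}
```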


Page 34: GPU Compute

If we didn't have control over scheduling, we would not be able to access core-local memory, obviously, as we would not know which elements are scheduled where.


Page 35: GPU Compute

For convenience, in compute, indices can be generated on 1D, 2D or 3D grids.

ResetCoreMemory… memory is not *really* reset, but we can't rely on its contents; the initialization has to be done in the kernel, and any output has to go into GLOBAL_MEM, so I "created" this pseudo-function to illustrate this in a C-like way.
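In CUDA terms the same ideas might look like the following (a hypothetical histogram kernel of mine, not the slides' pseudo-code): 2D indices generated from the grid, an explicit in-kernel initialization standing in for ResetCoreMemory, and results flushed out to global memory.

```cpp
// Per-block histogram of an 8-bit image tile, illustrating:
// 1) 2D index generation, 2) shared memory must be initialized in the
// kernel (its contents are undefined on entry), 3) outputs must be
// written to global memory.
__global__ void histogram_tile(const unsigned char* image,
                               int width, int height, int* globalHist) {
    __shared__ int localHist[256];

    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 2D global index
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int t = threadIdx.y * blockDim.x + threadIdx.x;  // flat thread id

    // The "ResetCoreMemory" step: cooperative initialization.
    for (int b = t; b < 256; b += blockDim.x * blockDim.y)
        localHist[b] = 0;
    __syncthreads();

    if (x < width && y < height)
        atomicAdd(&localHist[image[y * width + x]], 1);
    __syncthreads();

    // Flush the per-core result into global memory.
    for (int b = t; b < 256; b += blockDim.x * blockDim.y)
        atomicAdd(&globalHist[b], localHist[b]);
}
```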


Page 36: GPU Compute


Page 37: GPU Compute


Page 38: GPU Compute

In the picture, the first Connection Machine, a 65536-processor supercomputer of the '80s.


Page 39: GPU Compute

Shared memory is not exposed to pixel and vertex shaders, as it requires manual scheduling of elements onto cores. It's still used internally by the graphics card to store rasterization results and shared information…
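What that manual scheduling buys in compute: threads of the same block are guaranteed to share a core, so they can cooperate through shared memory. A standard illustration (my sketch, assuming 256 threads per block and a power-of-two block size) is the tree reduction:

```cpp
// Each block cooperatively sums 256 elements through shared memory,
// something that can't be expressed in a pixel or vertex shader.
__global__ void block_sum(const float* data, int n, float* partialSums) {
    __shared__ float scratch[256];   // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    scratch[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = scratch[0];  // one partial sum per block
}
```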


Page 40: GPU Compute


Page 41: GPU Compute

Note how widening the number of outputs per element looks like «unrolling» the computation...

Such scheduling decisions have an impact on latency hiding (switching contexts), on the ability to use registers as a cache (e.g. in this computation, as we average columns, going wide on columns will enable reusing fetches from registers), on the ability to synchronize different tasks in a single kernel (persistent threads), and so on...
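A sketch of «going wide» on a horizontal 3-tap average (my example, not the slide's exact column-averaging kernel): producing four adjacent outputs per thread lets interior fetches be reused from registers.

```cpp
// One thread computes four adjacent outputs of a 3-tap box filter:
// six loads feed four outputs, versus twelve loads for the
// one-output-per-thread version. Borders are skipped for brevity.
__global__ void blur3_wide(const float* in, float* out, int width) {
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (x0 == 0 || x0 + 4 >= width) return;

    // Fetch once into registers; each interior sample feeds up to
    // three of the four outputs below.
    float a = in[x0 - 1], b = in[x0],     c = in[x0 + 1],
          d = in[x0 + 2], e = in[x0 + 3], f = in[x0 + 4];
    const float k = 1.0f / 3.0f;
    out[x0 + 0] = (a + b + c) * k;
    out[x0 + 1] = (b + c + d) * k;
    out[x0 + 2] = (c + d + e) * k;
    out[x0 + 3] = (d + e + f) * k;
}
```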


Page 42: GPU Compute

Also, there are many libraries worth checking out: Thrust, Nvidia Performance Primitives, CUDPP, libCL, etc… Even more for tasks that are inherently parallel: linear algebra, imaging and so on.


Page 43: GPU Compute


Page 44: GPU Compute


Page 45: GPU Compute


Page 46: GPU Compute

All the arrows here represent sums into the existing values; the "self" part of the addition is not shown, to avoid confusing things with too many arrows.

http://en.wikipedia.org/wiki/Prefix_sum
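The diagram corresponds to the naive O(n log n) scheme (often credited to Hillis and Steele): at step d, element i adds in element i - 2^d. A minimal single-block CUDA sketch of it (my code; assumes blockDim.x == 256 and n <= 256), double-buffered to avoid read/write races:

```cpp
// Naive (Hillis-Steele) inclusive scan of up to 256 elements in one block.
__global__ void scan_block(const float* in, float* out, int n) {
    __shared__ float buf[2][256];
    int t = threadIdx.x;
    int cur = 0;

    buf[cur][t] = (t < n) ? in[t] : 0.0f;  // pad with zeros
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int nxt = 1 - cur;
        // Element t adds in element t - offset, exactly like the arrows.
        buf[nxt][t] = (t >= offset) ? buf[cur][t] + buf[cur][t - offset]
                                    : buf[cur][t];
        cur = nxt;
        __syncthreads();
    }

    if (t < n) out[t] = buf[cur][t];
}
```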


Page 47: GPU Compute

All the black arrows here represent sums into the existing values; the "self" part of the addition is not shown, as in the previous slide. The red arrows mark element "moves".

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
www.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf


Page 48: GPU Compute

The moderngpu website has lots of code, and it takes a «scan-centric» approach to parallel computation:
http://www.moderngpu.com/intro/scan.html
http://www.moderngpu.com/scan/segscan.html


Page 49: GPU Compute

Blelloch's paper references many application domains: www.cs.cmu.edu/~guyb/papers/Ble93.pdf

http://back40computing.googlecode.com/svn-history/r225/wiki/documents/RadixSortTR.pdf


Page 50: GPU Compute


Page 51: GPU Compute

http://www.cse.chalmers.se/~olaolss/papers/Efficient%20Stream%20Compaction%20on%20Wide%20SIMD%20Many-Core%20Architectures.pdf
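In scan-based form (my sketch of the standard two-pass approach, assuming the exclusive scan of the keep-flags was produced elsewhere, e.g. by a kernel like the one a few slides back): the scan gives each surviving element its output slot, and a scatter packs the survivors contiguously.

```cpp
// Pass 2 of scan-based stream compaction: flags[i] is 1 if element i is
// kept; exclusiveScan is the exclusive prefix sum of flags, so it holds
// each kept element's destination index.
__global__ void scatter_kept(const float* in, const int* flags,
                             const int* exclusiveScan,
                             float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[exclusiveScan[i]] = in[i];
}

// Survivor count = exclusiveScan[n - 1] + flags[n - 1].
```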


Page 52: GPU Compute


Page 53: GPU Compute

Images from the original paper: On-the-fly Point Clouds through Histogram Pyramids. Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel.
http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/paper17_gpu_pointclouds.pdf
http://www.mpi-inf.mpg.de/%7Egziegler/gpu_pointlist/slides_vmv2006.pdf
http://www.astro.lu.se/compugpu2010/resources/bonus_histopyramid.pdf
Marching cubes using HP: http://diglib.eg.org/EG/DL/CGF/volume27/issue8/v27i8pp2028-2039.pdf.abstract.pdf


Page 54: GPU Compute

A good rendering overview: https://graphics.stanford.edu/wikis/cs448s-10/FrontPage?action=AttachFile&do=get&target=08-GPUArchII.pdf


Page 55: GPU Compute

Some of the compromises, many differences in the implementation… See Debunking the 100x GPU vs CPU Myth for more details: http://www.hwsw.hu/kepek/hirek/2010/06/p451-lee.pdf


Page 56: GPU Compute

There are many languages trying to be both GPU and CPU targets; OpenCL has CPU backends, without even considering CPUs and GPUs on the same chip like AMD Fusion (see e.g. synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf) and languages that target the CPU for wide SIMD computing like Intel's open-source SPMD compiler ISPC http://ispc.github.com/ … http://tbex.twbbs.org/~tbex/pad/CA_papers/Twin%20Peaks.pdf

Throughput computing is everywhere, not only for high-performance calculations but also servers, low power and so on.

Amdahl for multicores: research.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf


Page 57: GPU Compute


Page 58: GPU Compute

Some  “bonus”  reads…  

BASICS
• Nvidia CUDA programming guide: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
• N-Body Simulation with CUDA, from GPU Gems 3: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html

ADVANCED
• Efficient Primitives and Algorithms for Many-core Architectures: http://www.cs.ucdavis.edu/research/tech-reports/2011/CSE-2011-4.pdf
• Persistent threads programming: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf
• More on PT and task parallelism: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0138-GTC2012-Parallelism-Primitives-Applications.pdf
• Registers or shared memory? http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
• Optimization tips: http://www.cs.virginia.edu/~skadron/Papers/cuda_tuning_bof_sc09_final.pdf

DATA STRUCTURES… TONS of spatial ones, very few general-purpose ones… NOTE: attributes of parallel data structures:
