
The Parallel Revolution in Computational Science and Engineering

applications, education, tools, and impact

Wen-mei Hwu
University of Illinois, Urbana-Champaign


SIAM ANC 2009, Wen-mei W. Hwu, University of Illinois at Urbana-Champaign

The Energy Behind Parallel Revolution
• Calculation: 1 TFLOPS vs. 100 GFLOPS
• Memory Bandwidth: 100-150 GB/s vs. 32-64 GB/s
• Multi-core and GPU in every PC
  – massive volume and potential impact

[Figure annotation: 3 year shift. Courtesy: John Owens]


Applications Entry Timeframes

[Figure: performance over time in 24-month generations. Multi-core track: 2-core, 4-core, 8-core, 16-core, with performance labels 50 GF, 100 GF, 200 GF, 400 GF. Many-core track: 16 cores (500 GF), 32 cores (1 TF), 64 cores (2 TF), 128 cores (4 TF), with G80, G280, G380, and Larrabee marked. Apps entry points are marked at 2008 and 2011.]

App developers want at least 3X-5X for end-user perceived value-add.


What are these applications?

• 146X: Interactive visualization of volumetric white matter connectivity
• 36X: Ionic placement for molecular dynamics simulation on GPU
• 19X: Transcoding HD video stream to H.264
• 17X: Simulation in Matlab using .mex file CUDA function
• 100X: Astrophysics N-body simulation
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab: An M-script API for linear Algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences

Courtesy NVIDIA


GPU – Graphics Mode
• The future of GPUs is programmable processing
• So – build the architecture around the processor

[Figure: GPU block diagram in graphics mode. Host; Input Assembler; Setup / Rstr / ZCull; Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue; Thread Processor; eight clusters of SP pairs, each with L1 and TF units; six L2 and FB partitions.]


GPU CUDA Mode
• Processors execute computing threads
• New operating mode/HW interface for computing

[Figure: GPU block diagram in CUDA mode. Host; Input Assembler; Thread Execution Manager; Texture units; eight Parallel Data Caches; Load/store paths to Global Memory.]


UIUC/NCSA AC Cluster
• 32 nodes
  – 4-GPU (GTX280, Tesla), 1-FPGA, quad-core Opteron node at NCSA
  – GPUs donated by NVIDIA
  – FPGA donated by Xilinx
  – 128 TFLOPS single precision, 10 TFLOPS double precision
• Coulomb Summation:
  – 1.78 TFLOPS/node
  – 271x speedup vs. Intel QX6700 CPU core w/ SSE

UIUC/NCSA AC Cluster: http://www.ncsa.uiuc.edu/Projects/GPUcluster/

A partnership between NCSA and academic departments.


Multi-GPU MPI Implementation

Two Point Angular Correlation

[Figure, GPU cluster: 16 cluster compute nodes (HP xw9400 workstation with NVIDIA Quadro Plex Model IV modules); Lustre file system servers; cluster manager and login server; Netgear Prosafe 24-port Gigabit Ethernet switch; 24-port Topspin 120 Server InfiniBand switch.]

[Figure, Performance scaling: execution time (sec, log scale from 10 to 1000) vs. number of MPI threads (0 to 50).]


CUDA - No more shader functions.

• CUDA integrated CPU+GPU application C program
• Serial or modestly parallel C code executes on CPU
• Highly parallel SPMD kernel C code executes on GPU

CPU Serial Code
. . .
GPU Parallel Kernel (Grid 0)
    KernelA<<< nBlk, nTid >>>(args);
. . .
CPU Serial Code
. . .
GPU Parallel Kernel (Grid 1)
    KernelB<<< nBlk, nTid >>>(args);
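
The split between serial host code and an SPMD kernel can be mimicked on a CPU. The following is a minimal Python sketch, not CUDA: a hypothetical `launch` helper runs the same kernel body once per (block, thread) pair, standing in for `KernelA<<< nBlk, nTid >>>(args)`. The names `launch`, `vec_add_kernel`, and `vec_add`, and the choice of vector addition as the workload, are assumptions for illustration. A real GPU would run these threads in parallel rather than in a loop.

```python
def launch(kernel, n_blk, n_tid, *args):
    """Emulate Kernel<<<n_blk, n_tid>>>(args) by visiting every thread in turn."""
    for block_idx in range(n_blk):
        for thread_idx in range(n_tid):
            kernel(block_idx, n_tid, thread_idx, *args)


def vec_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """SPMD body: every thread runs the same code on its own element."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # guard the partially filled last block
        out[i] = a[i] + b[i]


def vec_add(a, b, n_tid=4):
    """Serial host code: allocate the output, launch the kernel, return the result."""
    out = [0] * len(a)
    n_blk = (len(a) + n_tid - 1) // n_tid   # enough blocks to cover all elements
    launch(vec_add_kernel, n_blk, n_tid, a, b, out)
    return out
```

Calling `vec_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])` launches two blocks of four threads; the bounds check keeps the three surplus threads in the second block from writing past the end.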


DRAM Bandwidth Trends Set the Programming Agenda

• Random access BW is 1.2% of peak for DDR3-1600, 0.8% for GDDR4-1600 (and falling)

• 3D stacking and optical interconnects are unlikely to help.
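
To put the 1.2% figure in absolute terms, here is a small worked calculation in Python. It assumes a single 64-bit DDR3-1600 channel (1600 million transfers per second), which is an assumption not stated on the slide; the function names are hypothetical.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, bus_width_bits):
    """Peak bandwidth in GB/s = transfer rate x bus width in bytes."""
    return mega_transfers_per_sec * 1e6 * (bus_width_bits / 8) / 1e9


def random_access_bandwidth_mb_s(peak_gb_s, fraction_of_peak):
    """Effective random-access bandwidth in MB/s for a given fraction of peak."""
    return peak_gb_s * fraction_of_peak * 1000


# Assumption: one 64-bit channel of DDR3-1600.
ddr3_peak = peak_bandwidth_gb_s(1600, 64)
# 1.2% of peak, the random-access figure quoted on the slide.
ddr3_random = random_access_bandwidth_mb_s(ddr3_peak, 0.012)
```

Under that assumption the peak is 12.8 GB/s per channel, so random access delivers roughly 150 MB/s, which is why access patterns dominate the programming agenda.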


It is all about applications!

Illinois CUDA Center of Excellence and the IACAT Community

parallel.illinois.edu


NAMD Overlapping Execution

[Figure: example configuration, with labels "847 objects", "100,000", "108", and "Offload to GPU". Phillips et al., SC2002.]

Objects are assigned to processors and queued as data arrives.

Klaus Schulten and team, over 29,000 registered users


Actual Timelines from NAMD
Generated using Charm++ tool “Projections”

[Figure: GPU and CPU timelines showing Remote Force and Local Force phases, with labels x and f.]


NAMD on QP Cluster – total application time

[Figure: seconds per step (0 to 2) at 4, 8, 16, 32, and 60 for three series: CPU only, with GPU, and GPU subset. Hardware: 2.4 GHz Opteron + Quadro FX 5600. Annotations: "faster", 6.76, 3.33.]

GPU subset: amount of time the GPU is actually doing work


VMD
• “Visual Molecular Dynamics” (120,000 registered users)
• Visualization of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry data, particle systems
• http://www.ks.uiuc.edu/Research/vmd/


Molecular Orbital Computation/Display - Molekel, MacMolPlt, and VMD

Units: 10^3 grid points/sec. Larger numbers indicate higher performance.

High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, and K. Schulten, 2nd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-2) (in press)


Whole-cell Diffusion Modeling
• Zan Luthey-Schulten group (Chemistry, UIUC) is using CUDA to simulate reaction diffusion in three dimensions on a lattice.
• GPUs and CUDA enable simulations at cellular length and time scales, enabling study of stochastic biochemical networks in vivo.
• Simulations sample the cell state’s probability distribution according to the reaction-diffusion master equation.

Using size and population distributions from proteomic data, an approximation of an in vivo cellular environment is constructed on a lattice.

[left] A cell model in which 30% of the total volume is occupied by obstacles.

Particles diffuse around the stationary obstacles, reacting according to kinetic rates.
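
The diffusion half of this picture can be sketched in a few lines of plain Python. This is a toy model under loud assumptions, not the group's CUDA code: a periodic cubic lattice, obstacles placed uniformly at random, one attempted nearest-neighbor hop per particle per step, no exclusion between particles, and no reactions. All function names and parameters are hypothetical.

```python
import random

# The six nearest-neighbor moves on a cubic lattice.
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def all_sites(size):
    """Every site of a size x size x size lattice."""
    return [(x, y, z) for x in range(size) for y in range(size) for z in range(size)]


def build_obstacles(size, obstacle_fraction, rng):
    """Mark a random fraction of the lattice sites as stationary obstacles."""
    sites = all_sites(size)
    return set(rng.sample(sites, round(obstacle_fraction * len(sites))))


def place_particles(size, obstacles, n_particles, rng):
    """Put each particle on a randomly chosen free site."""
    free = [s for s in all_sites(size) if s not in obstacles]
    return [rng.choice(free) for _ in range(n_particles)]


def diffuse_step(particles, obstacles, size, rng):
    """Each particle attempts one random hop; hops into obstacles are rejected."""
    moved = []
    for (x, y, z) in particles:
        dx, dy, dz = rng.choice(MOVES)
        target = ((x + dx) % size, (y + dy) % size, (z + dz) % size)  # periodic wrap
        moved.append((x, y, z) if target in obstacles else target)
    return moved


def simulate(size=10, obstacle_fraction=0.30, n_particles=50, steps=100, seed=0):
    """Run the toy diffusion model and return (obstacles, final particle positions)."""
    rng = random.Random(seed)
    obstacles = build_obstacles(size, obstacle_fraction, rng)
    particles = place_particles(size, obstacles, n_particles, rng)
    for _ in range(steps):
        particles = diffuse_step(particles, obstacles, size, rng)
    return obstacles, particles
```

Because every particle's hop depends only on the fixed obstacle set, each step is independent across particles, which is the property that makes this kind of model a natural fit for one-thread-per-site or one-thread-per-particle GPU kernels.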


Scalability on GPUs

Roberts, Stone, Sepulveda, Hwu, Luthey‐Schulten (2009) The Eighth IEEE International Workshop on High‐Performance Computational Biology, in press.


Large Eddy Simulations (LES) on GPUs using CUDA

• 3D incompressible Navier-Stokes equations

• Solved using fractional-step method.

• Used geometric multigrid for pressure-Poisson solver.

• Used Smagorinsky sub-grid scale model.

• Used Red-Black Gauss-Seidel for linear solver.

Aaron Shinn, Pratap Vanka, Wen-mei Hwu
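
Red-Black Gauss-Seidel suits GPUs because all points of one color depend only on points of the other color, so each half-sweep can update its points in parallel. The sketch below illustrates the ordering in serial Python on a 2D Poisson problem with fixed boundary values. The 2D setting, the 5-point stencil, and the function names are assumptions chosen for brevity; the slide describes a 3D solver inside a multigrid scheme.

```python
def red_black_gauss_seidel(u, f, h, sweeps):
    """Relax the 5-point discretization of laplacian(u) = f in place.

    u and f are square lists of lists with grid spacing h; the boundary
    values of u are held fixed.
    """
    n = len(u)
    for _ in range(sweeps):
        for color in (0, 1):  # 0 = red points, 1 = black points
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 == color:
                        # Neighbors all have the opposite color, so points of
                        # this color could be updated simultaneously.
                        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                          + u[i][j - 1] + u[i][j + 1]
                                          - h * h * f[i][j])
    return u


def residual_norm(u, f, h):
    """Maximum absolute residual of the discrete Poisson equation."""
    n = len(u)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                   - 4.0 * u[i][j]) / (h * h)
            worst = max(worst, abs(lap - f[i][j]))
    return worst
```

On a 9 x 9 grid with `f` equal to 1 everywhere and zero boundary values, repeated sweeps drive `residual_norm` toward zero; on larger grids convergence slows, which is the usual motivation for wrapping the smoother in multigrid.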


Turbulent flow in 3D square duct

[Figure panels: Instantaneous flow in cross-section; Mean flow in cross-section]


GPU (Tesla C1060) versus 3.0 GHz Intel Xeon Core

[Figure panels: laminar lid-driven cube simulation; turbulent flow in a square duct. Annotation: First 100 time steps]


Cosmological Data Analysis: Two Point Angular Correlation Function

[Figure: the observed data and random datasets 1 through nR are inputs to the two-point angular correlation function.]
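
The core of this computation is counting pairs of sky positions by angular separation, for the observed data and for random datasets. Below is a minimal serial Python sketch of that pair counting. The slide does not say which estimator was used; the Landy-Szalay form shown here is one common choice and is an assumption, as are all function names and the use of a single random set.

```python
import math


def to_unit_vector(ra_deg, dec_deg):
    """Convert right ascension and declination (degrees) to a unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def separation_deg(p, q):
    """Angular separation between two unit vectors, in degrees."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def bin_index(theta, bin_edges):
    """Index of the half-open bin [edge_k, edge_k+1) containing theta, or None."""
    for k in range(len(bin_edges) - 1):
        if bin_edges[k] <= theta < bin_edges[k + 1]:
            return k
    return None


def auto_pair_counts(points, bin_edges):
    """Histogram of separations over all unordered pairs within one set."""
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            k = bin_index(separation_deg(points[i], points[j]), bin_edges)
            if k is not None:
                counts[k] += 1
    return counts


def cross_pair_counts(a, b, bin_edges):
    """Histogram of separations over all pairs with one point from each set."""
    counts = [0] * (len(bin_edges) - 1)
    for p in a:
        for q in b:
            k = bin_index(separation_deg(p, q), bin_edges)
            if k is not None:
                counts[k] += 1
    return counts


def landy_szalay(data, randoms, bin_edges):
    """Assumed estimator: (DD - 2 DR + RR) / RR with normalized pair counts."""
    nd, nr = len(data), len(randoms)
    dd = auto_pair_counts(data, bin_edges)
    rr = auto_pair_counts(randoms, bin_edges)
    dr = cross_pair_counts(data, randoms, bin_edges)
    w = []
    for k in range(len(dd)):
        dd_n = dd[k] / (nd * (nd - 1) / 2)
        rr_n = rr[k] / (nr * (nr - 1) / 2)
        dr_n = dr[k] / (nd * nr)
        w.append((dd_n - 2 * dr_n + rr_n) / rr_n if rr_n > 0 else float("nan"))
    return w
```

The pair loops are quadratic in dataset size and every pair is independent, which is what makes the kernel a good candidate for GPU offload.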


Performance on a Single GPU

[Figure, Execution time: kernel execution time (sec, log scale from 1.E+00 to 1.E+05) vs. dataset size (1000 to 100000) for AMD Opteron, Quadro FX 5600, GeForce GTX 280 (SP), and GeForce GTX 280 (DP).]

[Figure, Speedup: speedup vs CPU (0 to 250) vs. dataset size (1000 to 100000) for Quadro FX 5600, GeForce GTX 280 (SP), and GeForce GTX 280 (DP).]

Dylan Roeh, Volodymyr V. Kindratenko, Robert J. Brunner


Tools are coming.

“There is always hope.”
– On the eve of the Battle of Helm's Deep


Auto-tuning Tools do heavy lifting!

• Pareto optimal curve based on analytical performance model of the source code allows automatic identification of optimal code arrangement

• Programmers are doing too much heavy lifting

• Too many memory organizational details are exposed to the programmers

[Figure: Sum of Absolute Differences]

S. Ryoo, et al., “Program Optimization Space Pruning for a Multithreaded GPU,” CGO, April 2008.
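
Pruning an optimization space to a Pareto-optimal set can be sketched as follows. This is an illustrative Python example under stated assumptions: each code configuration is scored by two model-derived metrics where higher is better, and the configuration names and scores are made up. It is not the procedure from the cited paper.

```python
def pareto_front(configs):
    """Return the configurations not dominated by any other.

    Each entry is (name, metric_a, metric_b), higher is better for both.
    A configuration is dominated if another is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, a, b in configs:
        dominated = any(
            (a2 >= a and b2 >= b) and (a2 > a or b2 > b)
            for _, a2, b2 in configs
        )
        if not dominated:
            front.append((name, a, b))
    return front


# Hypothetical configurations with made-up model scores.
candidates = [
    ("tile8", 0.5, 0.9),
    ("tile16", 0.8, 0.7),
    ("tile32", 0.9, 0.3),
    ("unroll2", 0.4, 0.6),
    ("unroll4", 0.8, 0.6),
]
survivors = pareto_front(candidates)
```

Only the surviving configurations would then need to be compiled and timed on hardware, which is where the saving in tuning effort comes from.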


[Figure: programming tool stack targeting IA multi-core & Larrabee and NVIDIA GPU]

• NVIDIA SDK 1.1: 1st generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers
• MCUDA/OpenMP
• CUDA-tune: Parameterized CUDA programming using auto-tuning and optimization space pruning
• CUDA-lite: Locality annotation programming to eliminate need for explicit management of memory types and data transfers
• CUDA-auto: Implicitly parallel programming with data structure and function property annotations to enable auto parallelization


Towards Systematic Code Reuse – MRI Example
• Two main approaches to MRI Reconstruction:

• Riemann approximation of the continuous inverse FT:

$$\hat{\rho}(\mathbf{x}) = \sum_{m=1}^{M} w_m \, d[m] \, e^{i 2\pi \mathbf{k}_m \cdot \mathbf{x}}, \qquad \hat{\boldsymbol{\rho}} = \mathbf{F}^{H} \operatorname{diag}(\mathbf{w}) \, \mathbf{d}$$

• Solve a regularized inverse problem, e.g.,

$$\hat{\boldsymbol{\rho}} = \arg\min_{\boldsymbol{\rho}} \; \| \mathbf{F}\boldsymbol{\rho} - \mathbf{d} \|_2^2 + R(\boldsymbol{\rho})$$

Solutions often derived by solving one or more matrix inversions, e.g.,

$$\hat{\boldsymbol{\rho}} = \left( \mathbf{F}^{H}\mathbf{F} + \mathbf{H} \right)^{-1} \mathbf{F}^{H} \mathbf{d}$$
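
As a concrete instance of the first approach, the sketch below evaluates the weighted sum over k-space samples in one dimension using plain Python. The 1D setting, the uniform weights, the point-object forward model, and the function names are all assumptions for illustration; real reconstructions use 2D or 3D trajectories and density-compensation weights.

```python
import cmath


def simulate_samples(k_points, positions, amplitudes):
    """Forward model: d[m] = sum_j rho_j * exp(-i 2 pi k_m x_j) for point objects."""
    return [
        sum(a * cmath.exp(-2j * cmath.pi * k * x) for x, a in zip(positions, amplitudes))
        for k in k_points
    ]


def riemann_reconstruct(x, k_points, weights, samples):
    """Riemann-sum estimate: rho_hat(x) = sum_m w_m d[m] exp(+i 2 pi k_m x)."""
    return sum(
        w * d * cmath.exp(2j * cmath.pi * k * x)
        for k, w, d in zip(k_points, weights, samples)
    )


# Assumed setup: 16 uniformly spaced k-space samples and equal weights.
k_points = list(range(-8, 8))
weights = [1.0 / len(k_points)] * len(k_points)
# A single unit-amplitude point object at x = 0.25.
samples = simulate_samples(k_points, [0.25], [1.0])
peak = riemann_reconstruct(0.25, k_points, weights, samples)
```

Each output location `x` is an independent sum over all samples, so the reconstruction parallelizes naturally with one thread per output point.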


Towards a Reusable Code Base
• Certain computations appear frequently in inverse problems, e.g.,
  • Matrix inversion
  • Matrix multiplication

• It is rare that two similar applications have exactly the same structure

• Better library interfaces and tools are needed to allow for easy code reuse across different applications.

$$\left( \mathbf{S}^{H}\mathbf{Q}\mathbf{S} + \mathbf{H} \right)^{-1} \mathbf{d} \qquad \left( \mathbf{I} + \mathbf{W}^{H}\mathbf{D}\mathbf{W} \right)^{-1} \mathbf{d}$$

[Matrix structure labels: Toeplitz, Sparse, Diagonal, Contourlet Tx]


Education is key.


To Learn More

• UIUC ECE498AL – Programming Massively Parallel Processors (http://courses.ece.uiuc.edu/ece498/al/)
  – David Kirk (NVIDIA) and Wen-mei Hwu (UIUC) co-instructors
  – CUDA programming, GPU computing, lab exercises, and projects
  – Lecture slides and voice recordings
• More than 500 students worldwide follow the course each semester.


Graduate Summer School
• Aimed at CSE grad students

• Full registration
• 180 applicants from 3 continents
• 44 accepted from 25 universities, 3 continents
• 50+ remote participants
• 9 lectures, 3 keynotes, 1 panel, hands-on lab

• Sponsored by UIUC, NCSA, Microsoft, NVIDIA, VSCSE

• Will be offered again, August 10-14, 2009

www.greatlakesconsortium.org/events/GPUMulticore
Wen-mei Hwu and David Kirk, co-instructors


Conclusion - A Great Opportunity for Many

• Many-core accelerates science discovery
• Current challenges are in parallelizing sparse and graph-based computation for very large models – memory bandwidth!
• Incorporating many-core technology into large scale systems such as Blue Waters presents major challenges in software tool chains

• Software engineering issues must be better addressed
• It is still hard to extract and integrate parallel kernels from real applications
• Effective code reuse requires a very large library code base and new tools and interfaces.

• Cross-application fertilization key to future success – vision for a global developer community


Thank you! Any questions?