3/12/2013 Computer Engg, IIT(BHU)
CUDA-3
GPGPU
● General Purpose computation using GPU in applications other than 3D graphics
  – GPU accelerates the critical path of the application
● Data parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
GPGPU Constraints
● Dealing with graphics API
  – Working with the corner cases of the graphics API
● Addressing modes
  – Limited texture size/dimension
● Shader capabilities
  – Limited outputs
● Instruction sets
  – Lack of integer & bit ops
● Communication limited
  – Between pixels
  – Scatter a[i] = p
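The scatter restriction in the last item is exactly what CUDA removes: a fragment shader can only write to its own output location, whereas a CUDA thread may write to any computed address. A minimal sketch of scatter expressed as a CUDA kernel (kernel and array names are illustrative, not from the slides):

// Each thread writes its value p[i] to an arbitrary index idx[i] --
// the scatter pattern a[i] = p that fragment shaders could not express.
__global__ void scatter(float *a, const int *idx, const float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[idx[i]] = p[i];   // write to a computed (scattered) location
}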
CUDA
● General purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data parallel co-processor
● Targeted software stack
  – Compute oriented drivers, language, and tools
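A minimal sketch of "kicking off a batch of threads": the host launches a kernel over a grid of thread blocks, and each thread processes one element. The kernel name, the scale factor, and the block size of 256 are assumptions for illustration only.

#include <cuda_runtime.h>

// Each thread computes its global index and handles one array element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)
        data[i] *= factor;                         // one element per thread
}

// Host side: one thread per element, 256 threads per block.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);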
CUDA
● Driver for loading computation programs into GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute: graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
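The explicit memory management mentioned above boils down to the allocate / download / compute / read back / free cycle. A hedged sketch using the CUDA runtime API, reusing the illustrative scale kernel from the previous sketch (buffer names are assumptions):

#include <cuda_runtime.h>

void runOnGpu(float *h_data, int n)
{
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&d_data, bytes);                       // allocate GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // download to GPU
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);          // launch the kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // read results back
    cudaFree(d_data);                                          // release GPU memory
}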
Parallel Computing on a GPU
● NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
● 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
Parallel Computing on a GPU
● GPU parallelism is doubling every year
● Programming model scales transparently
● Programmable in C with CUDA tools
● Multithreaded SPMD model uses application-data parallelism and thread parallelism
CPU vs GPU
[Chart: Speedup vs. Particle # (20,000 to 470,000). Series: Baseline 1 CPU, OpenMP 1 CPU, OpenMP 2 CPU. Speedup axis ranges from 0 to 3.]
CPU vs GPU
[Chart: Speedup vs. Particle # (20,000 to 470,000). Series: Baseline 1 CPU, OpenMP 1 CPU, OpenMP 2 CPU, GPU. Speedup axis ranges from 0 to 90.]
CPU vs GPU
[Chart: Speedup vs. Particle # (20,000 to 470,000) for GPU variants: GPU 128 v1, GPU 256 Baseline, GPU 128/256 v2 – Global memory, GPU 128/256 v3 – Shared memory, GPU 128/256 v4 – Loop unrolling. Speedup axis ranges from 1.0 to 1.3.]
CPU vs GPU
●GPU Baseline speedup is approximately 60x
●For 500,000 particles that is a reduction in calculation time from 33 minutes to 33 seconds!
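As a consistency check: 33 minutes is 1,980 seconds, and 1,980 s ÷ 60 ≈ 33 s, which matches the ~60x baseline speedup quoted above.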
Conclusion
● Without optimization we already got an amazing speedup on CUDA
● The N² algorithm is “made” for CUDA (see the sketch below)
● Optimization tradeoffs are hard to predict in advance
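The N² algorithm referred to here is the all-pairs particle interaction behind the benchmark charts above. A hedged sketch of such a kernel, assuming a 1D particle array; the pairwise term is a placeholder, not the actual force model from the slides:

// Each thread owns particle i and loops over all other particles j,
// accumulating a pairwise term -- O(N^2) work, but trivially data parallel.
__global__ void allPairs(const float *pos, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < n; ++j)
        if (j != i)
            acc += pos[j] - pos[i];   // placeholder for the real interaction
    out[i] = acc;
}

The inner loop re-reads the same positions N times, which is why the shared-memory and loop-unrolling variants in the charts above pay off.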
Conclusion
● There are ways to dynamically distribute workloads across a fixed number of blocks
● Biggest problem: how to handle dynamic results in global memory
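One possible approach to both points (an assumption for illustration, not necessarily the authors' method): a grid-stride loop lets a fixed number of blocks cover any workload size, and an atomically incremented global counter appends a variable number of results to global memory. All names are illustrative.

__global__ void compact(const float *in, float *out, int *count,
                        int n, float threshold)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)          // fixed grid walks the whole array
    {
        if (in[i] > threshold) {
            int slot = atomicAdd(count, 1);    // claim a unique output slot
            out[slot] = in[i];                 // dynamic result in global memory
        }
    }
}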
Uses
● CUDA has provided benefit for many applications. Some examples:
  – Seismic database: 66x to 100x speedup – http://www.headwave.com
  – Molecular dynamics: 21x to 100x speedup – http://www.ks.uiuc.edu/Research/vmd
  – MRI processing: 245x to 415x speedup – http://bic-test.beckman.uiuc.edu
  – Atmospheric cloud simulation: 50x speedup – http://www.cs.clemson.edu/~jesteel/clouds.html
References
● CUDA, Supercomputing for the Masses by Rob Farber – http://www.ddj.com/architect/207200659
● CUDA, Wikipedia – http://en.wikipedia.org/wiki/CUDA
● CUDA for developers, NVIDIA – http://www.nvidia.com/object/cuda_home.html#
● Download CUDA manual and binaries – http://www.nvidia.com/object/cuda_get.html