3/12/2013 Computer Engg, IIT(BHU)
CUDA-3
GPGPU
● General Purpose computation using GPU in applications other than 3D graphics
  – GPU accelerates the critical path of the application
● Data parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
GPGPU Constraints
● Dealing with graphics API
  – Working with the corner cases of the graphics API
● Addressing modes
  – Limited texture size/dimension
● Shader capabilities
  – Limited outputs
● Instruction sets
  – Lack of integer & bit ops
● Communication limited
  – Between pixels
  – Scatter a[i] = p
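The scatter restriction in the last item is exactly what CUDA removes: a fragment shader can only write to its own output location, whereas a CUDA thread may write to any computed address. A minimal sketch of scatter expressed as a CUDA kernel (kernel and array names are illustrative, not from the slides):

// Each thread writes its value p[i] to an arbitrary index idx[i] --
// the scatter pattern a[i] = p that fragment shaders could not express.
__global__ void scatter(float *a, const int *idx, const float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[idx[i]] = p[i];   // write to a computed (scattered) location
}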
CUDA
● General purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data parallel co-processor
● Targeted software stack
  – Compute oriented drivers, language, and tools
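A minimal sketch of "kicking off a batch of threads": the host launches a kernel over a grid of thread blocks, and each thread processes one element. The kernel name, the scale factor, and the block size of 256 are assumptions for illustration only.

#include <cuda_runtime.h>

// Each thread computes its global index and handles one array element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)
        data[i] *= factor;                         // one element per thread
}

// Host side: one thread per element, 256 threads per block.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);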
CUDA
● Driver for loading computation programs into GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute: graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
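The explicit memory management mentioned above boils down to the allocate / download / compute / read back / free cycle. A hedged sketch using the CUDA runtime API, reusing the illustrative scale kernel from the previous sketch (buffer names are assumptions):

#include <cuda_runtime.h>

void runOnGpu(float *h_data, int n)
{
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&d_data, bytes);                       // allocate GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // download to GPU
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);          // launch the kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // read results back
    cudaFree(d_data);                                          // release GPU memory
}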
Parallel Computing on a GPU
● NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
● 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
Parallel Computing on a GPU
● GPU parallelism is doubling every year
● Programming model scales transparently
● Programmable in C with CUDA tools
● Multithreaded SPMD model uses application-data parallelism and thread parallelism
CPU vs GPU
[Chart: Speedup vs. Particle # (20,000 to 470,000). Series: Baseline 1 CPU, OpenMP 1 CPU, OpenMP 2 CPU. Speedup axis ranges from 0 to 3.]
CPU vs GPU
[Chart: Speedup vs. Particle # (20,000 to 470,000). Series: Baseline 1 CPU, OpenMP 1 CPU, OpenMP 2 CPU, GPU. Speedup axis ranges from 0 to 90.]
CPU vs GPU
[Chart: Speedup vs. Particle # (20,000 to 470,000) for GPU variants: GPU 128 v1, GPU 256 Baseline, GPU 128/256 v2 – Global memory, GPU 128/256 v3 – Shared memory, GPU 128/256 v4 – Loop unrolling. Speedup axis ranges from 1.0 to 1.3.]
CPU vs GPU
●GPU Baseline speedup is approximately 60x
●For 500,000 particles that is a reduction in calculation time from 33 minutes to 33 seconds!
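As a consistency check: 33 minutes is 1,980 seconds, and 1,980 s ÷ 60 ≈ 33 s, which matches the ~60x baseline speedup quoted above.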
Conclusion
● Without optimization we already got an amazing speedup on CUDA
● The N² algorithm is “made” for CUDA (see the sketch below)
● Optimization tradeoffs are hard to predict in advance
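The N² algorithm referred to here is the all-pairs particle interaction behind the benchmark charts above. A hedged sketch of such a kernel, assuming a 1D particle array; the pairwise term is a placeholder, not the actual force model from the slides:

// Each thread owns particle i and loops over all other particles j,
// accumulating a pairwise term -- O(N^2) work, but trivially data parallel.
__global__ void allPairs(const float *pos, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < n; ++j)
        if (j != i)
            acc += pos[j] - pos[i];   // placeholder for the real interaction
    out[i] = acc;
}

The inner loop re-reads the same positions N times, which is why the shared-memory and loop-unrolling variants in the charts above pay off.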
Conclusion
● There are ways to dynamically distribute workloads across a fixed number of blocks
● Biggest problem: how to handle dynamic results in global memory
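One possible approach to both points (an assumption for illustration, not necessarily the authors' method): a grid-stride loop lets a fixed number of blocks cover any workload size, and an atomically incremented global counter appends a variable number of results to global memory. All names are illustrative.

__global__ void compact(const float *in, float *out, int *count,
                        int n, float threshold)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)          // fixed grid walks the whole array
    {
        if (in[i] > threshold) {
            int slot = atomicAdd(count, 1);    // claim a unique output slot
            out[slot] = in[i];                 // dynamic result in global memory
        }
    }
}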
Uses
● CUDA has provided benefit for many applications. Some examples:
  – Seismic database: 66x to 100x speedup – http://www.headwave.com
  – Molecular dynamics: 21x to 100x speedup – http://www.ks.uiuc.edu/Research/vmd
  – MRI processing: 245x to 415x speedup – http://bic-test.beckman.uiuc.edu
  – Atmospheric cloud simulation: 50x speedup – http://www.cs.clemson.edu/~jesteel/clouds.html
References
● CUDA, Supercomputing for the Masses by Rob Farber – http://www.ddj.com/architect/207200659
● CUDA, Wikipedia – http://en.wikipedia.org/wiki/CUDA
● CUDA for developers, NVIDIA – http://www.nvidia.com/object/cuda_home.html#
● Download CUDA manual and binaries – http://www.nvidia.com/object/cuda_get.html