GPU History
Term coined in 1999 by NVIDIA with the release of the GeForce 256
Real “first” GPU was the 1985 Commodore Amiga: graphics coprocessor with a primitive instruction set
Modern day: GPUs are commercially available at low cost and prolific in computing systems
Primarily used (obviously) for displaying and rapidly altering video data which is inherently data-parallel
Architecture has become increasingly parallel over the last decade
GPUs are structured as SIMD machines to exploit high levels of DLP
GPU Capabilities
Large matrix/vector operations
Protein folding (molecular dynamics) modeling
FFT (signal processing)
Physics simulations
Sequence matching
Speech recognition
Database manipulation
Sort/search algorithms
Medical imaging
GPGPU Origins
C. Thompson, S. Hahn, M. Oskin: “Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis.” International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002
Use the parallel architecture of GPUs to exploit data-level parallelism in common processes
Wrote framework for testing various data-heavy operations in C++, implemented through OpenGL programming interface
Ran tests on arithmetic, exponential, factorial, and multiplicative operations on large (10k to 10 million element) vectors
Uses NVIDIA's GeForce4 with 128 MB of VRAM (18 specialized cores); results compared to a 1.5 GHz Pentium IV with 1 GB of RAM
No modifications to GPU or CPU hardware, test programs compiled with Microsoft Visual C++ 6
GeForce4 Architecture
Provides ISA for vertex programming – registers hold and process quad-valued FP numbers
Input and output attribute registers hold various graphical data
No access to main memory – 96 constant registers used instead (video memory can be filled pre-runtime by CPU)
21 instructions available – mostly operate on all 4 input components
Vertex programs have an instruction limit of 128
[Thompson, MICRO 2002]
Programming Framework
C++ framework for general-purpose programs using vector operations implemented through OpenGL (GPU API)
Abstract the GPU functionality using C++ data types operated on by GPU assembly programs
DVector: vector class, allocated a buffer of video memory
DProgram: contains GPU assembly program written via array of strings
DFunction: contains a DProgram, input/output DVectors, and bindings for constant registers; executes the DProgram and converts vectors to quad-value format, reducing CPU usage in scalar computation by using quad-floats
DSemaphore: object to stall the CPU while waiting for GPU results
[Thompson, MICRO 2002]
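The DVector/DProgram/DFunction abstraction above can be sketched as plain CPU code. The following is a minimal, purely illustrative Python re-creation: the class names come from the slides, but all internals are hypothetical, since the real DProgram held GeForce4 vertex-program assembly issued through OpenGL.

```python
# CPU-only sketch of the Thompson et al. framework abstraction.
# Class names from the slides; everything else is hypothetical.

class DVector:
    """Stands in for a buffer of video memory holding vector data."""
    def __init__(self, data):
        self.data = list(data)

class DProgram:
    """Stands in for a GPU assembly program (here: a per-element callable)."""
    def __init__(self, op):
        self.op = op

class DFunction:
    """Binds a DProgram to input/output DVectors and executes it."""
    def __init__(self, program, inputs, output):
        self.program, self.inputs, self.output = program, inputs, output

    def execute(self):
        # The real framework packed scalars into quad-valued (4-wide)
        # registers; here we simply apply the op element-wise.
        self.output.data = [self.program.op(*args)
                            for args in zip(*(v.data for v in self.inputs))]

x = DVector([1.0, 2.0, 3.0])
y = DVector([10.0, 20.0, 30.0])
out = DVector([0.0] * 3)
DFunction(DProgram(lambda a, b: a + b), [x, y], out).execute()
print(out.data)  # [11.0, 22.0, 33.0]
```

A DSemaphore is omitted here because CPU-side execution is synchronous; in the real framework it stalls the CPU until GPU results are ready.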
2002 Results
[Thompson, MICRO 2002]
2002 Results
Arithmetic vector operation – GPU ~6.4 times faster for large vectors
CPU run time doubles when program complexity doubles; GPU run time only triples even at 12x the program size
Matrix multiplication: GPU ~3.2 times faster
Boolean SAT: GPU ~2 times faster at large input sizes
Proved that GPUs can be used for general-purpose computation and will result in significant speedup for applications with DLP
[Thompson, MICRO 2002]
Moving Forward – Stream Processing
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan: “Brook for GPUs: Stream Computing on Graphics Hardware.” Special Interest Group on Graphics and Interactive Techniques (SIGGRAPH), 2004.
In 2004, GPGPUs were becoming a legitimate tool, but there was no universal tool for programming a GPU to be used this way
Brook was an attempt by Stanford to create and share a stream programming model for GPGPU computing
Extension of C language to include data-parallel constructs: streams and kernels, allowing for SIMD-type operation
Stream: collection of data that can be operated on in parallel
Kernel: special function built to operate upon streams, called with input and output stream(s)
Brook compiler maps this language to existing GPU APIs (specifically DirectX and OpenGL)
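The stream/kernel model can be emulated in a few lines of plain Python. The `kernel` decorator and list-based streams below are an illustrative stand-in (not Brook syntax): real Brook kernels are C-like functions that the compiler maps onto fragment shaders, with non-stream arguments broadcast to every element.

```python
# Emulation of Brook's stream/kernel semantics: a kernel is a scalar
# function lifted to run once per stream element; non-stream (scalar)
# arguments are broadcast. Purely illustrative.

def kernel(fn):
    def run(*args):
        streams = [a for a in args if isinstance(a, list)]
        length = len(streams[0])
        assert all(len(s) == length for s in streams)  # lockstep streams
        pick = lambda a, i: a[i] if isinstance(a, list) else a
        return [fn(*(pick(a, i) for a in args)) for i in range(length)]
    return run

@kernel
def scale_add(a, x, y):   # hypothetical kernel: per-element a*x + y
    return a * x + y

print(scale_add(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```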
Brook Goals & Methods
Purpose: extend C to include data-parallel constructs for using a GPU as a stream processor
Uses streams (collection of data) and kernels (functions that operate on streams) to express DLP native to various applications
Improves arithmetic intensity by containing program computation within kernels
Implementation is free of explicit graphics constructs and thus usable across architectures and APIs (NVIDIA/ATI and DirectX/OpenGL)
Abstracts GPU computing to a higher level, removing the need for knowledge of DirectX or OpenGL
[Buck, SIGGRAPH, 2004]
Brook Implementation
Map kernels to Cg shaders; streams are represented as floating-point textures
Brook Runtime (BRT) library allows for input/output streams to be rendered to a display
Streams can be mapped to multiple textures, allowing sizes larger than the maximum texture dimensions of the GPU architecture (2048x2048 or 4096x4096)
Use fragment processor to execute kernels over the streams present in textures: non-stream arguments passed via constant registers, apply shader compiler to create GPU assembly, map process to fragment shaders
[Buck, SIGGRAPH, 2004]
Brook Performance Results (2004)
Compares optimized reference implementation to Brook DirectX and OpenGL variants
Normalized by CPU performance (black)
[Buck, SIGGRAPH, 2004]
Brook Results (cont)
SAXPY: vector scaling and addition (y = ax + y)
SGEMV: matrix-vector product, scaled vector addition (y = nAx + my)
Segment: nonlinear diffusion-based region-growing algorithm, primarily used in medical image processing
FFT: fast Fourier transform, used in graphical post-processing
GPGPU performance increases with limited data reuse (SAXPY vs FFT) and increased arithmetic intensity (Segment vs SGEMV)
Brook implementations within 80% of hand-coded (optimized) versions
Important factor: read/write bandwidth – NVIDIA performed worse than ATI due to this difference (1.2 Gfloats/s vs 4.5 Gfloats/s)
Made available as open source code for GPGPU software developers
[Buck, SIGGRAPH, 2004]
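As reference points for the benchmarks above, SAXPY and SGEMV can be written out directly. These are plain Python sketches following the slide's formulas (y = ax + y and y = nAx + my, with n and m scalars and A an MxN matrix), not the tuned GPU versions from the paper.

```python
# Reference definitions of two Brook benchmarks as plain Python.

def saxpy(a, x, y):
    """y = a*x + y, element-wise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def sgemv(n, A, x, m, y):
    """y = n*A*x + m*y: scaled matrix-vector product plus scaled vector."""
    return [n * sum(aij * xj for aij, xj in zip(row, x)) + m * yi
            for row, yi in zip(A, y)]

print(saxpy(2.0, [1.0, 2.0], [3.0, 4.0]))   # [5.0, 8.0]
print(sgemv(1.0, [[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0], 1.0, [1.0, 1.0]))
```

SAXPY touches each element once (little data reuse), while SGEMV reuses x across every row; this difference in arithmetic intensity is exactly what the results above measure.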
Moving Forward – CUDA (2007)
CUDA: Compute Unified Device Architecture
Parallel computing architecture from NVIDIA, released in 2007 and compatible with GeForce 8 series and beyond (2006+)
As GPGPUs became popular, there was a need for a universal tool to access the virtual instruction set and parallel architecture of commercial GPUs
CUDA provides an API for software developers to use, along with a public SDK
Modern GPU: GTX 690 has 3072 CUDA cores, 4096 MB of device memory, 6.9 billion transistors
Modern GPGPU Uses
Arithmetic: matrix and vector operations
Modeling molecular dynamics (protein folding, etc)
FFT (signal processing, graphical post-processing)
Physics simulations and engines (ex: modern games)
Speech recognition
Medical imaging
Instruction Set Simulator
Parity-Check Decoding
ISA Simulator
S. Raghav, M. Ruggiero, D. Atienza, C. Pinto, A. Marongiu and L. Benini: “Scalable instruction set simulator for thousand-core architectures running on GPGPUs”, Proceedings of High Performance Computing and Simulation (HPCS), pp. 459-466, June/July 2010.
Improve current standards of processor simulation by exploiting parallelism available in GPGPUs
Accurate sequential simulators already exist (Cotson, m5, mpi-sim); it is much harder to efficiently simulate more complex, many-core environments
Two fields: high-performance (x86) and embedded (ARM)
CUDA threads simulate one or more cores, global memory provides a context structure and control logic for each simulated CPU
Written in C++ and CUDA, simulates both instruction sets on NVIDIA GTX 295 with Intel i7 running Linux
GTX 295: 2 GT200 GPUs, each with 30 Streaming Multiprocessors (SMs) totaling 240 stream processors, 938 MB VRAM
Instruction Set Simulator – ARM
Supports all non-Thumb ARM instructions
Functional blocks for Fetch, Decode, and Execute placed on CUDA model
Texture memory used to hold LUTs of instructions (working like a cache) – 16KB available
SPMD simulation allows CUDA threads to run concurrently, MIMD task-based applications sometimes become serialized if branches are data-dependent
16 GP registers, status & auxiliary registers
Large matrix holds execution context for each processor
[Raghav, HPCS, 2010]
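The organization described above, one context record per simulated core, stepped in lockstep the way CUDA threads step the real simulator, can be sketched in miniature. The toy instruction set and encoding below are invented purely for illustration; the real simulator interprets actual ARM/x86 encodings.

```python
# Toy sketch of the ISS organization: per-core context records stepped
# in lockstep. The "MOV"/"ADD" instruction set here is invented.

def make_context(program):
    return {"pc": 0, "regs": [0] * 16, "program": program}  # 16 GP registers

def step(ctx):
    """One fetch/decode/execute cycle for a single simulated core."""
    if ctx["pc"] >= len(ctx["program"]):
        return False                        # core has halted
    op, rd, val = ctx["program"][ctx["pc"]]  # fetch + decode
    if op == "MOV":
        ctx["regs"][rd] = val               # execute
    elif op == "ADD":
        ctx["regs"][rd] += val
    ctx["pc"] += 1
    return True

# SPMD best case: every simulated core runs the same program on its own data.
cores = [make_context([("MOV", 0, cid), ("ADD", 0, 100)]) for cid in range(4)]
while sum(step(c) for c in cores):          # lockstep, like one CUDA warp
    pass
print([c["regs"][0] for c in cores])        # [100, 101, 102, 103]
```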
Instruction Set Simulator – x86
Simulator must support the Intel IA-32 ISA
Context is held in eight 32-bit GP registers, six segment registers, and various control registers
CISC architecture with complex decoding logic leads to some serialization: threads may branch to different functions/kernels depending on the parsed operation
Task-based parallel applications incur performance hit when branches are data dependent
CUDA concurrency is compromised by variable length instructions
[Raghav, HPCS, 2010]
ISS Testing & Results
Best Case (BC): application has SIMD DLP; the same kernel runs on different data subsets, and all cores fetch the same instructions
Worst Case (WC): application has task-level parallelism (MIMD), cores may operate on different data sets, cores diverge in instruction retrieval due to data dependent branches
Single Kernel (SK): entire ISS run in one CUDA kernel, components simulated in successive steps of one function
Multiple Kernels (MK): system components modeled in separate CUDA kernels, requiring many memory transfers of device state as kernels swap and launch
ARM performance dependent upon kernel swaps (SK vs MK)
X86 performance dependent upon application type (BC vs WC)
[Raghav, HPCS, 2010]
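The reason the Worst Case above hurts can be shown with a toy model: on a SIMD machine, threads in one group that take different branches execute serially, one branch path at a time. The function below simply counts the serialized passes; it is a conceptual illustration, not code from the paper.

```python
# Toy model of SIMD branch divergence: the number of serialized passes
# equals the number of distinct paths taken by threads in one group.

def passes_needed(branch_choices):
    return len(set(branch_choices))

# Best case: SPMD, every simulated core takes the same path.
print(passes_needed(["kernelA"] * 8))  # 1

# Worst case: data-dependent branches send cores down different paths,
# so the group runs each path in turn.
print(passes_needed(["kernelA", "kernelB", "kernelC", "kernelA"]))  # 3
```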
Simulation Results (MK vs SK, WC vs BC)
[Raghav, HPCS, 2010]
ISS Testing – Real Workloads
Test the ISS using real workloads to see if the theoretical speedup is achievable with real applications
Matrix Multiplication, IDCT, FFT
Use parallelization scheme like OpenMP to distribute workload
Static loop parallelization: an identical number of consecutive iterations is assigned to each parallel thread
Processor ID determines which dataset to use (HW2/3 scheme B)
Stack-allocated variables determine lower and upper bounds of functional loops
Simulation speedup – speedup relative to serial simulation of varying number of cores
Application speedup – speedup of parallel simulation over a single simulated core
[Raghav, HPCS, 2010]
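The static loop parallelization above can be sketched directly: each (simulated) processor derives its lower and upper loop bounds from its processor ID. The chunking arithmetic below is the standard OpenMP-style static schedule, a plausible reading of the slide rather than code from the paper.

```python
# Static loop parallelization: N consecutive iterations split into
# identical chunks, each processor picking its chunk by processor ID.

def static_bounds(total_iters, num_procs, proc_id):
    """Lower/upper loop bounds for one processor (upper is exclusive)."""
    chunk = (total_iters + num_procs - 1) // num_procs  # ceiling division
    lo = proc_id * chunk
    hi = min(lo + chunk, total_iters)
    return lo, hi

# Example: distribute 10 iterations over 4 processors.
bounds = [static_bounds(10, 4, p) for p in range(4)]
print(bounds)  # [(0, 3), (3, 6), (6, 9), (9, 10)]

# Every iteration is covered exactly once:
covered = [i for lo, hi in bounds for i in range(lo, hi)]
assert covered == list(range(10))
```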
Speedup Results
Takeaway: architecture is scalable to and beyond 1000 cores
~500-1000x speedup for best case scenarios (near ideal 1024)
[Raghav, HPCS, 2010]
Parallel Nonbinary LDPC Decoding
G. Wang, H. Shen, B. Yin, M. Wu, Y. Sun, J. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012.
Low-Density Parity-Check Codes (LDPC) are error-correcting codes over a Galois (or finite) field
Finite field: a commutative ring in abstract algebra in which every non-zero element has a multiplicative inverse
Current implementations of LDPC decoding algorithms have poor flexibility & scalability
Complexity of LDPC decoding algorithms increases greatly going from binary to nonbinary codes (with q>2 for GF(q))
Goal: create a highly parallel and flexible decoder supporting different code types and variable code lengths, with the ability to run on various devices
Use OpenCL to employ a SIMT model to exploit LDPC decoding's inherent DLP
LDPC Decoding – Nonbinary LDPC
Parity-check matrix H (sparse q-ary MxN matrix) with elements defined in a Galois field GF(q)
Can be represented by a Tanner graph: each row of H → check node, each column of H → variable node
M(n) is the set of check nodes for variable node n
N(m) is the set of variable nodes for check node m
Row weight of a check node = dc
Belief Propagation (BP) algorithm is one of the best decoding algorithms, this implementation uses the Min-Max approximation algorithm to exploit DLP
[Wang, ASILOMAR, 2012]
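The Tanner-graph sets defined above follow mechanically from H: M(n) is the set of rows with a nonzero in column n, and N(m) is the set of columns with a nonzero in row m. The 3x6 matrix below is a made-up toy example, not a code from the paper.

```python
# Building Tanner graph adjacency from a small parity-check matrix H.
# M(n): check nodes touching variable node n (nonzeros in column n).
# N(m): variable nodes touching check node m (nonzeros in row m).

H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]

N = {m: [n for n, h in enumerate(row) if h] for m, row in enumerate(H)}
M = {n: [m for m, row in enumerate(H) if row[n]] for n in range(len(H[0]))}

print(N[0])  # [0, 1, 3]   variable nodes connected to check node 0
print(M[1])  # [0, 1]      check nodes connected to variable node 1

# Row weight dc of each check node:
dc = [len(N[m]) for m in N]
print(dc)    # [3, 3, 3]
```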
Implementation – Complexity Analysis
Computation kernels of nonbinary (q>2) LDPC become more complex for check node processing (O(dc*q^2) vs O(dc))
CNP (check node processing) and VNP (variable node processing) take up 91.64% and 6.43% of serial runtime, respectively
[Wang, ASILOMAR, 2012]
Implementation – Algorithm Mapping
Develop work flow of decoding process
Computation is all done on GPU to keep intermediate messages in device memory
Use 5 OpenCL kernels to exploit DLP, distribute effectively
Work items (q) become CUDA threads, work groups (M) become CUDA thread blocks: all have the same computation path and memory access patterns
[Wang, ASILOMAR, 2012]
Implementation – Nonbinary Arithmetic & Efficient Data Structures
Addition and subtraction of nonbinary elements achieved through XOR operations
Use LUTs with expq & logq for multiplication & division:
a*b = expq[(logq[a]+logq[b])%(q-1)]
a/b = expq[(logq[a]-logq[b]+q-1)%(q-1)]
Expq and logq are used frequently → keep in local memory
Compress H horizontally & vertically to create more efficient structure
[Wang, ASILOMAR, 2012]
Implementation – Accelerating Forward-Backward Algorithm in CNP
The original algorithm shown is O(q^dc); the revised forward-backward version is O(dc*q^2)
Forwarded messages vector Fi(a) stored in local memory, updated by q work items in parallel for each stage (i)
Use a barrier function after each stage for synchronization
Requires 2*sizeof(cl_float)*q*dc of local memory – 1.5 KB for the (3,6)-regular GF(32) code used in this implementation
[Wang, ASILOMAR, 2012]
Implementation – Coalescing Global Memory Access
[Wang, ASILOMAR, 2012]
Rm,n(a) and Qm,n(a) are complex 3D structures located in global memory
Arrange in [N,q,M] format rather than [M,N,q] so that q work items always access data stored contiguously
Enables coalesced memory access → ~4-5x speedup
Nonbinary LDPC Decoding – Results
Run on 2 CPUs and 1 GPU:
Intel i7-640LM (dual core, 2.93 GHz)
AMD Phenom II X4-940 (quad core, 2.9 GHz)
NVIDIA GTX470 (448 stream processors, 1.215 GHz, 1280 MB device memory)
2.47x speedup for OpenCL over serial C on the Intel i7
6.67x speedup for OpenCL over serial C on the AMD Phenom II
GPU has a 69.92x speedup over the Intel i7 and 33.46x over the AMD Phenom II
Worst-case speedups: 38.48x and 18.41x
[Wang, ASILOMAR, 2012]
Nonbinary LDPC Decoding – Results
[Wang, ASILOMAR, 2012]
The GPU algorithm achieved 693.5 Kbps throughput, rising to 1260 Kbps with early termination
Nonbinary decoders have complexity 2q^2 to 3q^2 times higher than binary decoders
With q=32 in the samples run, this is a 2000-3000x increase in complexity
Due to massive parallelization in the decoding algorithm and the GPU, the gap between the binary and nonbinary implementations is reduced to ~50x
This type of LDPC decoding (short codewords, high GF(q) values) is the most common in LDPC research and application, although this implementation achieves better speedups and throughput with longer codewords and lower GF(q) values
Citations
Chris J. Thompson, Sahngyun Hahn, Mark Oskin: “Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis.” International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002.
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan: “Brook for GPUs: Stream Computing on Graphics Hardware.” Special Interest Group on Graphics and Interactive Techniques (SIGGRAPH), Los Angeles, Aug. 2004.
Shivani Raghav, Martino Ruggiero, David Atienza, Christian Pinto, Andrea Marongiu, Luca Benini: “Scalable Instruction Set Simulator for Thousand-core Architectures Running on GPGPUs.” Proceedings of High Performance Computing and Simulation (HPCS), pp. 459-466, France, June/July 2010.
Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, Joseph R. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU.” 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012.