GPU History
Term coined in 1999 by NVIDIA with the release of the GeForce 256
Real “first” GPU was the 1985 Commodore Amiga: graphics coprocessor with a primitive instruction set
Modern day: GPUs are commercially available at low cost and prolific in computing systems
Primarily used (obviously) for displaying and rapidly altering video data which is inherently data-parallel
Architecture has become increasingly parallel over the last decade
GPUs are structured as SIMD machines to exploit high levels of DLP
GPU Capabilities
Large matrix/vector operations
Protein folding (molecular dynamics) modeling
FFT (signal processing)
Physics simulations
Sequence matching
Speech recognition
Database manipulation
Sort/search algorithms
Medical imaging
GPGPU Origins
C. Thompson, S. Hahn, M. Oskin: “Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis.” International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002
Use the parallel architecture of GPUs to exploit data-level parallelism in common processes
Wrote framework for testing various data-heavy operations in C++, implemented through OpenGL programming interface
Ran tests on arithmetic, exponential, factorial, and multiplicative operations on large (10k to 10 million element) vectors
Uses NVIDIA's GeForce4 with 128 MB of VRAM (18 specialized cores); results compared to a 1.5 GHz Pentium IV with 1 GB of RAM
No modifications to GPU or CPU hardware, test programs compiled with Microsoft Visual C++ 6
GeForce4 Architecture
Provides ISA for vertex programming – registers hold and process quad-valued FP numbers
Input and output attribute registers hold various graphical data
No access to main memory – 96 constant registers used instead (video memory can be filled pre-runtime by CPU)
21 instructions available – mostly operate on all 4 input components
Vertex programs have an instruction limit of 128
[Thompson, MICRO 2002]
Programming Framework
C++ framework for general-purpose programs using vector operations implemented through OpenGL (GPU API)
Abstract the GPU functionality using C++ data types operated on by GPU assembly programs
DVector: vector class, allocated a buffer of video memory
DProgram: contains GPU assembly program written via array of strings
DFunction: contains a DProgram, input/output DVectors, and bindings for constant registers; executes the DProgram and converts vectors to quad-value format, reducing CPU usage in scalar computation by using quad-floats
DSemaphore: object to stall the CPU while waiting for GPU results
[Thompson, MICRO 2002]
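The DVector/DProgram/DFunction abstraction above can be sketched as plain CPU code. The following is a minimal, purely illustrative Python re-creation: the class names come from the slides, but all internals are hypothetical, since the real DProgram held GeForce4 vertex-program assembly issued through OpenGL.

```python
# CPU-only sketch of the Thompson et al. framework abstraction.
# Class names from the slides; everything else is hypothetical.

class DVector:
    """Stands in for a buffer of video memory holding vector data."""
    def __init__(self, data):
        self.data = list(data)

class DProgram:
    """Stands in for a GPU assembly program (here: a per-element callable)."""
    def __init__(self, op):
        self.op = op

class DFunction:
    """Binds a DProgram to input/output DVectors and executes it."""
    def __init__(self, program, inputs, output):
        self.program, self.inputs, self.output = program, inputs, output

    def execute(self):
        # The real framework packed scalars into quad-valued (4-wide)
        # registers; here we simply apply the op element-wise.
        self.output.data = [self.program.op(*args)
                            for args in zip(*(v.data for v in self.inputs))]

x = DVector([1.0, 2.0, 3.0])
y = DVector([10.0, 20.0, 30.0])
out = DVector([0.0] * 3)
DFunction(DProgram(lambda a, b: a + b), [x, y], out).execute()
print(out.data)  # [11.0, 22.0, 33.0]
```

A DSemaphore is omitted here because CPU-side execution is synchronous; in the real framework it stalls the CPU until GPU results are ready.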
2002 Results
[Thompson, MICRO 2002]
2002 Results
Arithmetic vector operation – GPU ~6.4 times faster for large vectors
CPU run time doubles when program complexity doubles; GPU run time only triples even at 12x the program size
Matrix multiplication: GPU ~3.2 times faster
Boolean SAT: GPU ~2 times faster at large input sizes
Proved that GPUs can be used for general-purpose computation and will result in significant speedup for applications with DLP
[Thompson, MICRO 2002]
Moving Forward – Stream Processing
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan: “Brook for GPUs: Stream Computing on Graphics Hardware.” Special Interest Group on Graphics and Interactive Techniques (SIGGRAPH), 2004.
In 2004, GPGPUs were becoming a legitimate tool, but there was no universal tool for programming a GPU to be used this way
Brook was an attempt by Stanford to create and share a stream programming model for GPGPU computing
Extension of C language to include data-parallel constructs: streams and kernels, allowing for SIMD-type operation
Stream: collection of data that can be operated on in parallel
Kernel: special function built to operate upon streams, called with input and output stream(s)
Brook compiler maps this language to existing GPU APIs (specifically DirectX and OpenGL)
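The stream/kernel model can be emulated in a few lines of plain Python. The `kernel` decorator and list-based streams below are an illustrative stand-in (not Brook syntax): real Brook kernels are C-like functions that the compiler maps onto fragment shaders, with non-stream arguments broadcast to every element.

```python
# Emulation of Brook's stream/kernel semantics: a kernel is a scalar
# function lifted to run once per stream element; non-stream (scalar)
# arguments are broadcast. Purely illustrative.

def kernel(fn):
    def run(*args):
        streams = [a for a in args if isinstance(a, list)]
        length = len(streams[0])
        assert all(len(s) == length for s in streams)  # lockstep streams
        pick = lambda a, i: a[i] if isinstance(a, list) else a
        return [fn(*(pick(a, i) for a in args)) for i in range(length)]
    return run

@kernel
def scale_add(a, x, y):   # hypothetical kernel: per-element a*x + y
    return a * x + y

print(scale_add(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```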
Brook Goals & Methods
Purpose: extend C to include data-parallel constructs for using a GPU as a stream processor
Uses streams (collection of data) and kernels (functions that operate on streams) to express DLP native to various applications
Improves arithmetic intensity by containing program computation within kernels
Implementation is free of explicit graphics constructs and thus usable across architectures and APIs (NVIDIA/ATI and DirectX/OpenGL)
Abstracts GPU computing to a higher level, removing the need for knowledge of DirectX or OpenGL
[Buck, SIGGRAPH, 2004]
Brook Implementation
Map kernels to Cg shaders; streams are represented as floating-point textures
Brook Runtime (BRT) library allows for input/output streams to be rendered to a display
Streams can be mapped to multiple textures, allowing sizes larger than the maximum texture dimensions of the GPU architecture (2048x2048 or 4096x4096)
Use fragment processor to execute kernels over the streams present in textures: non-stream arguments passed via constant registers, apply shader compiler to create GPU assembly, map process to fragment shaders
[Buck, SIGGRAPH, 2004]
Brook Performance Results (2004)
Compares optimized reference implementation to Brook DirectX and OpenGL variants
Normalized by CPU performance (black)
[Buck, SIGGRAPH, 2004]
Brook Results (cont)
SAXPY: vector scaling and addition (y = ax + y)
SGEMV: matrix-vector product, scaled vector addition (y = nAx + my)
Segment: nonlinear diffusion-based region-growing algorithm, primarily used in medical image processing
FFT: fast Fourier transform, used in graphical post-processing
GPGPU performance increases with limited data reuse (SAXPY vs FFT) and increased arithmetic intensity (Segment vs SGEMV)
Brook implementations within 80% of hand-coded (optimized) versions
Important factor: read/write bandwidth – NVIDIA performed worse than ATI due to this difference (1.2 Gfloats/s vs 4.5 Gfloats/s)
Made available as open source code for GPGPU software developers
[Buck, SIGGRAPH, 2004]
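As reference points for the benchmarks above, SAXPY and SGEMV can be written out directly. These are plain Python sketches following the slide's formulas (y = ax + y and y = nAx + my, with n and m scalars and A an MxN matrix), not the tuned GPU versions from the paper.

```python
# Reference definitions of two Brook benchmarks as plain Python.

def saxpy(a, x, y):
    """y = a*x + y, element-wise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def sgemv(n, A, x, m, y):
    """y = n*A*x + m*y: scaled matrix-vector product plus scaled vector."""
    return [n * sum(aij * xj for aij, xj in zip(row, x)) + m * yi
            for row, yi in zip(A, y)]

print(saxpy(2.0, [1.0, 2.0], [3.0, 4.0]))   # [5.0, 8.0]
print(sgemv(1.0, [[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0], 1.0, [1.0, 1.0]))
```

SAXPY touches each element once (little data reuse), while SGEMV reuses x across every row; this difference in arithmetic intensity is exactly what the results above measure.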
Moving Forward – CUDA (2007)
CUDA: Compute Unified Device Architecture
Parallel computing architecture from NVIDIA, released in 2007 and compatible with GeForce 8 series and beyond (2006+)
As GPGPUs became popular, there was a need for a universal tool to access the virtual instruction set and parallel architecture of commercial GPUs
CUDA provides an API for software developers to use, along with a public SDK
Modern GPU: GTX 690 has 3072 CUDA cores, 4096 MB of device memory, 6.9 billion transistors
Modern GPGPU Uses
Arithmetic: matrix and vector operations
Modeling molecular dynamics (protein folding, etc)
FFT (signal processing, graphical post-processing)
Physics simulations and engines (ex: modern games)
Speech recognition
Medical imaging
Instruction Set Simulator
Parity-Check Decoding
ISA Simulator
S. Raghav, M. Ruggiero, D. Atienza, C. Pinto, A. Marongiu and L. Benini: “Scalable instruction set simulator for thousand-core architectures running on GPGPUs”, Proceedings of High Performance Computing and Simulation (HPCS), pp. 459-466, June/July 2010.
Improve current standards of processor simulation by exploiting parallelism available in GPGPUs
Accurate sequential simulators already exist (Cotson, m5, mpi-sim); it is much harder to efficiently simulate more complex, many-core environments
Two fields: high-performance (x86) and embedded (ARM)
CUDA threads simulate one or more cores, global memory provides a context structure and control logic for each simulated CPU
Written in C++ and CUDA, simulates both instruction sets on NVIDIA GTX 295 with Intel i7 running Linux
GTX 295: 2 GT200 GPUs, each with 30 Streaming Multiprocessors (SMs) totaling 240 stream processors, 938 MB VRAM
Instruction Set Simulator – ARM
Supports all non-Thumb ARM instructions
Functional blocks for Fetch, Decode, and Execute placed on CUDA model
Texture memory used to hold LUTs of instructions (working like a cache) – 16KB available
SPMD simulation allows CUDA threads to run concurrently, MIMD task-based applications sometimes become serialized if branches are data-dependent
16 GP registers, status & auxiliary registers
Large matrix holds execution context for each processor
[Raghav, HPCS, 2010]
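The organization described above, one context record per simulated core, stepped in lockstep the way CUDA threads step the real simulator, can be sketched in miniature. The toy instruction set and encoding below are invented purely for illustration; the real simulator interprets actual ARM/x86 encodings.

```python
# Toy sketch of the ISS organization: per-core context records stepped
# in lockstep. The "MOV"/"ADD" instruction set here is invented.

def make_context(program):
    return {"pc": 0, "regs": [0] * 16, "program": program}  # 16 GP registers

def step(ctx):
    """One fetch/decode/execute cycle for a single simulated core."""
    if ctx["pc"] >= len(ctx["program"]):
        return False                        # core has halted
    op, rd, val = ctx["program"][ctx["pc"]]  # fetch + decode
    if op == "MOV":
        ctx["regs"][rd] = val               # execute
    elif op == "ADD":
        ctx["regs"][rd] += val
    ctx["pc"] += 1
    return True

# SPMD best case: every simulated core runs the same program on its own data.
cores = [make_context([("MOV", 0, cid), ("ADD", 0, 100)]) for cid in range(4)]
while sum(step(c) for c in cores):          # lockstep, like one CUDA warp
    pass
print([c["regs"][0] for c in cores])        # [100, 101, 102, 103]
```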
Instruction Set Simulator – x86
Simulator must support the Intel IA-32 ISA
Context is held in eight 32-bit GP registers, six segment registers, and various control registers
CISC architecture with complex decoding logic leads to some serialization: threads may branch to different functions/kernels depending on the parsed operation
Task-based parallel applications incur performance hit when branches are data dependent
CUDA concurrency is compromised by variable length instructions
[Raghav, HPCS, 2010]
ISS Testing & Results
Best Case (BC): application has SIMD DLP; the same kernel runs on different data subsets, and all cores fetch the same instructions
Worst Case (WC): application has task-level parallelism (MIMD), cores may operate on different data sets, cores diverge in instruction retrieval due to data dependent branches
Single Kernel (SK): entire ISS run in one CUDA kernel, components simulated in successive steps of one function
Multiple Kernels (MK): system components modeled in separate CUDA kernels, requiring many memory transfers of device state as kernels swap and launch
ARM performance dependent upon kernel swaps (SK vs MK)
X86 performance dependent upon application type (BC vs WC)
[Raghav, HPCS, 2010]
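The reason the Worst Case above hurts can be shown with a toy model: on a SIMD machine, threads in one group that take different branches execute serially, one branch path at a time. The function below simply counts the serialized passes; it is a conceptual illustration, not code from the paper.

```python
# Toy model of SIMD branch divergence: the number of serialized passes
# equals the number of distinct paths taken by threads in one group.

def passes_needed(branch_choices):
    return len(set(branch_choices))

# Best case: SPMD, every simulated core takes the same path.
print(passes_needed(["kernelA"] * 8))  # 1

# Worst case: data-dependent branches send cores down different paths,
# so the group runs each path in turn.
print(passes_needed(["kernelA", "kernelB", "kernelC", "kernelA"]))  # 3
```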
Simulation Results (MK vs SK, WC vs BC)
[Raghav, HPCS, 2010]
ISS Testing – Real Workloads
Test the ISS using real workloads to see if the theoretical speedup is achievable with real applications
Matrix Multiplication, IDCT, FFT
Use parallelization scheme like OpenMP to distribute workload
Static loop parallelization: an identical number of consecutive iterations is assigned to each parallel thread
Processor ID determines which dataset to use (HW2/3 scheme B)
Stack-allocated variables determine lower and upper bounds of functional loops
Simulation speedup – speedup relative to serial simulation of varying number of cores
Application speedup – speedup of parallel simulation over a single simulated core
[Raghav, HPCS, 2010]
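The static loop parallelization above can be sketched directly: each (simulated) processor derives its lower and upper loop bounds from its processor ID. The chunking arithmetic below is the standard OpenMP-style static schedule, a plausible reading of the slide rather than code from the paper.

```python
# Static loop parallelization: N consecutive iterations split into
# identical chunks, each processor picking its chunk by processor ID.

def static_bounds(total_iters, num_procs, proc_id):
    """Lower/upper loop bounds for one processor (upper is exclusive)."""
    chunk = (total_iters + num_procs - 1) // num_procs  # ceiling division
    lo = proc_id * chunk
    hi = min(lo + chunk, total_iters)
    return lo, hi

# Example: distribute 10 iterations over 4 processors.
bounds = [static_bounds(10, 4, p) for p in range(4)]
print(bounds)  # [(0, 3), (3, 6), (6, 9), (9, 10)]

# Every iteration is covered exactly once:
covered = [i for lo, hi in bounds for i in range(lo, hi)]
assert covered == list(range(10))
```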
Speedup Results
Takeaway: architecture is scalable to and beyond 1000 cores
~500-1000x speedup for best case scenarios (near ideal 1024)
[Raghav, HPCS, 2010]
Parallel Nonbinary LDPC Decoding
G. Wang, H. Shen, B. Yin, M. Wu, Y. Sun, J. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012.
Low-Density Parity-Check Codes (LDPC) are error-correcting codes over a Galois (or finite) field
Finite field: a commutative ring in abstract algebra in which every non-zero element has a multiplicative inverse
Current implementations of LDPC decoding algorithms have poor flexibility & scalability
Complexity of LDPC decoding algorithms increases greatly going from binary to nonbinary codes (with q>2 for GF(q))
Goal: create a highly parallel and flexible decoder supporting different code types and variable code lengths, with the ability to run on various devices
Use OpenCL to employ a SIMT model to exploit LDPC decoding's inherent DLP
LDPC Decoding – Nonbinary LDPC
Parity-check matrix H (sparse q-ary MxN matrix) with elements defined in a Galois field GF(q)
Can be represented by a Tanner graph: each row of H → check node, each column of H → variable node
M(n) is the set of check nodes for variable node n
N(m) is the set of variable nodes for check node m
Row weight of a check node = dc
Belief Propagation (BP) algorithm is one of the best decoding algorithms, this implementation uses the Min-Max approximation algorithm to exploit DLP
[Wang, ASILOMAR, 2012]
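The Tanner-graph sets defined above follow mechanically from H: M(n) is the set of rows with a nonzero in column n, and N(m) is the set of columns with a nonzero in row m. The 3x6 matrix below is a made-up toy example, not a code from the paper.

```python
# Building Tanner graph adjacency from a small parity-check matrix H.
# M(n): check nodes touching variable node n (nonzeros in column n).
# N(m): variable nodes touching check node m (nonzeros in row m).

H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]

N = {m: [n for n, h in enumerate(row) if h] for m, row in enumerate(H)}
M = {n: [m for m, row in enumerate(H) if row[n]] for n in range(len(H[0]))}

print(N[0])  # [0, 1, 3]   variable nodes connected to check node 0
print(M[1])  # [0, 1]      check nodes connected to variable node 1

# Row weight dc of each check node:
dc = [len(N[m]) for m in N]
print(dc)    # [3, 3, 3]
```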
Implementation – Complexity Analysis
Computation kernels of nonbinary (q>2) LDPC become more complex for check node processing (O(dc*q^2) vs O(dc))
CNP (check node processing) and VNP (variable node processing) take up 91.64% and 6.43% of serial runtime, respectively
[Wang, ASILOMAR, 2012]
Implementation – Algorithm Mapping
Develop work flow of decoding process
Computation is all done on GPU to keep intermediate messages in device memory
Use 5 OpenCL kernels to exploit DLP, distribute effectively
Work items (q) become CUDA threads, work groups (M) become CUDA thread blocks: all have the same computation path and memory access patterns
[Wang, ASILOMAR, 2012]
Implementation – Nonbinary Arithmetic & Efficient Data Structures
Addition and subtraction of nonbinary elements achieved through XOR operations
Use LUTs with expq & logq for multiplication & division:
a*b = expq[(logq[a]+logq[b])%(q-1)]
a/b = expq[(logq[a]-logq[b]+q-1)%(q-1)]
Expq and logq are used frequently → keep in local memory
Compress H horizontally & vertically to create more efficient structure
[Wang, ASILOMAR, 2012]
Implementation – Accelerating Forward-Backward Algorithm in CNP
The original algorithm shown is O(q^dc); the revised forward-backward version is O(dc*q^2)
Forwarded messages vector Fi(a) stored in local memory, updated by q work items in parallel for each stage (i)
Use a barrier function after each stage for synchronization
Requires 2*sizeof(cl_float)*q*dc of local memory – 1.5 KB for the (3,6)-regular GF(32) code used in this implementation
[Wang, ASILOMAR, 2012]
Implementation – Coalescing Global Memory Access
[Wang, ASILOMAR, 2012]
Rm,n(a) and Qm,n(a) are complex 3D structures located in global memory
Arrange in [N,q,M] format rather than [M,N,q] so that q work items always access data stored contiguously
Enables coalesced memory access → ~4-5x speedup
Nonbinary LDPC Decoding – Results
Run on 2 CPUs and 1 GPU:
Intel i7-640LM (dual core, 2.93 GHz)
AMD Phenom II X4-940 (quad core, 2.9 GHz)
NVIDIA GTX470 (448 stream processors, 1.215 GHz, 1280 MB device memory)
2.47x speedup for OpenCL over serial C on the Intel i7
6.67x speedup for OpenCL over serial C on the AMD Phenom II
GPU has a 69.92x speedup over the Intel i7 and 33.46x over the AMD Phenom II
Worst-case speedups: 38.48x and 18.41x
[Wang, ASILOMAR, 2012]
Nonbinary LDPC Decoding – Results
[Wang, ASILOMAR, 2012]
The GPU algorithm achieved 693.5 Kbps throughput, rising to 1260 Kbps with early termination
Nonbinary decoders have complexity 2q^2 to 3q^2 times higher than binary decoders
With q=32 in the samples run, this is a 2000-3000x increase in complexity
Due to massive parallelization in the decoding algorithm and the GPU, the gap between the binary and nonbinary implementations is reduced to ~50x
This type of LDPC decoding (short codewords, high GF(q) values) is the most common in LDPC research and application, although this implementation achieves better speedups and throughput with longer codewords and lower GF(q) values
Citations
Chris J. Thompson, Sahngyun Hahn, Mark Oskin: “Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis.” International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002.
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan: “Brook for GPUs: Stream Computing on Graphics Hardware.” Special Interest Group on Graphics and Interactive Techniques (SIGGRAPH), Los Angeles, Aug. 2004.
Shivani Raghav, Martino Ruggiero, David Atienza, Christian Pinto, Andrea Marongiu, Luca Benini: “Scalable Instruction Set Simulator for Thousand-core Architectures Running on GPGPUs.” Proceedings of High Performance Computing and Simulation (HPCS), pp. 459-466, France, June/July 2010.
Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, Joseph R. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU.” 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012.