Introduction to CUDA Programming: Programming Massively Parallel Graphics Processors. Andreas Moshovos, [email protected], ECE, Univ. of Toronto


  • Slide 1
  • Introduction to CUDA Programming: Programming Massively Parallel Graphics Processors. Andreas Moshovos, [email protected], ECE, Univ. of Toronto, Summer 2010. Some slides/material from: UIUC course by Wen-Mei Hwu and David Kirk; UCSB course by Andrea Di Blas; Universität Jena by Waqar Saleem; NVIDIA by Simon Green; and others as noted on slides.
  • Slide 2
  • How to Get High Performance: computation (calculations) and data communication/storage. Ideally: tons of compute engines, tons of storage, unlimited bandwidth, zero/low latency.
  • Slide 3
  • Calculation capabilities: how many calculation units can be built? Today's silicon chips have about 1B transistors. At ~30K transistors per 52-bit multiplier, that is ~30K multipliers. On a 260 mm^2 die (mid-range), at ~112 microns^2 per FP unit (overestimated), that is ~2K FP units. With frequencies of ~3 GHz common today, TFLOPs are possible. Disclaimer: back-of-the-envelope calculations, take with a grain of salt. Bottom line: we can build lots of calculation units (ALUs). Tons of compute engines?
  • Slide 4
  • How about Communication/Storage? We need to feed and store data, and the larger the storage, the slower it is. It takes time to get there and back: multiple cycles even on the same die. So in practice we get tons of compute engines paired with tons of slow storage, rather than unlimited bandwidth and zero/low latency.
  • Slide 5
  • Is there enough parallelism? Keeping all those compute engines busy needs lots of independent calculations (parallelism/concurrency). But much of what we do is sequential: first do 1, then do 2, then if X do 3 else do 4.
  • Slide 6
  • Today's high-end general-purpose processors localize communication and computation and try to automatically extract parallelism: they automatically extract instruction-level parallelism, and they use large on-die caches (a hierarchy of faster and slower caches in front of the slow off-chip storage) to tolerate off-chip memory latency.
  • Slide 7
  • Some things are naturally parallel
  • Slide 8
  • Sequential Execution Model: int a[N]; // N is large; for (i = 0; i < N; i++) a[i] = a[i] * fade; A single flow of control (thread) executes one instruction at a time; optimizations are possible at the machine level.
  • Slide 9
  • Data-Parallel Execution Model / SIMD: int a[N]; // N is large; for all elements, do in parallel: a[index] = a[index] * fade; This has been tried before: ILLIAC III, UIUC, 1966.
  • Slide 10
  • Single Program Multiple Data / SPMD: int a[N]; // N is large; for all elements, do in parallel: if (a[i] > threshold) a[i] *= fade; This is the model used in today's graphics processors.
  • Slide 11
  • CPU vs. GPU overview. CPU: handles sequential code well; can't take advantage of massively parallel code; lower off-chip bandwidth; lower peak computation capability. GPU: requires massively parallel computation; handles some control flow; higher off-chip bandwidth; higher peak computation capability.
  • Slide 12
  • Programmer's view: the GPU as a co-processor (2008). The CPU and its memory are connected to the GPU and its memory (1 GB on our systems). Approximate bandwidths from the original diagram: CPU-GPU link 3-8 GB/s; CPU to its memory 6.4-31.92 GB/s at 8 B per transfer; GPU to its memory 141 GB/s.
  • Slide 13
  • Target Applications: int a[N]; // N is large; for all elements of a, compute a[i] = a[i] * fade; Lots of independent computations. CUDA threads need not be independent, though.
  • Slide 14
  • Programmer's View of the GPU. The GPU is a compute device that: is a coprocessor to the CPU (host); has its own DRAM (device memory); runs many threads in parallel. Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.
  • Slide 15
  • Why are threads useful? Parallelism/concurrency: do multiple things in parallel. This uses more hardware to get higher performance, and therefore needs more functional units.
  • Slide 16
  • Why are threads useful #2: tolerating stalls. A thread often stalls, e.g., on a memory access; other threads can multiplex the same functional unit, getting more performance at a fraction of the cost.
  • Slide 17
  • GPU vs. CPU Threads: GPU threads are extremely lightweight, with very little creation overhead (on the order of microseconds) and everything done in hardware. A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few.
  • Slide 18
  • Execution Timeline (CPU/Host and GPU/Device lanes): 1. Copy to GPU memory; 2. Launch GPU kernel; 3. Synchronize with GPU; 4. Copy from GPU memory.
  • Slide 19
  • Programmer's view: first, create data in CPU memory. (Diagram: CPU and CPU memory, GPU and GPU memory.)
  • Slide 20
  • Programmer's view: then, copy the data to GPU memory.
  • Slide 21
  • Programmer's view: the GPU starts computation by running a kernel; the CPU can continue working in the meantime.
  • Slide 22
  • Programmer's view: the CPU and GPU synchronize.
  • Slide 23
  • Programmer's view: copy the results back to the CPU.
  • Slide 24
  • Computation partitioning: at the highest level, think of the computation as a series of loops: for (i = 0; i < big_number; i++) a[i] = some function; for (i = 0; i < big_number; i++) a[i] = some other function; for (i = 0; i < big_number; i++) a[i] = some other function. Each such loop becomes a kernel (see the sketch below).
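    A minimal sketch, not from the slides, of how those loops might map onto kernels, assuming one thread per loop iteration; f() and g() are hypothetical placeholders for "some function":

        // Each loop becomes a kernel; each iteration becomes one GPU thread.
        __device__ float f (float x) { return x * 0.5f; }   // placeholder
        __device__ float g (float x) { return x + 1.0f; }   // placeholder

        __global__ void step1 (float *a, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) a[i] = f (a[i]);                      // one loop iteration per thread
        }

        __global__ void step2 (float *a, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) a[i] = g (a[i]);
        }

        // Host side: one launch per original loop, e.g.:
        //   step1 <<<blocks, threads_per_block>>> (d_a, n);
        //   step2 <<<blocks, threads_per_block>>> (d_a, n);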
  • Slide 25
  • Computation Partitioning -- Kernel: CUDA exposes the hardware to the programmer, who must manually partition the work appropriately. The programmer's view is hierarchical: think of the data as an array.
  • Slide 26
  • Per-Kernel Computation Partitioning. Computation grid (2D case): threads within a block can communicate/synchronize and run on the same multiprocessor. Threads across blocks can't communicate and shouldn't touch each other's data; the behavior is undefined. (Diagram: a grid of blocks, each block made of threads.)
  • Slide 27
  • Thread Coordination Overview Race-free access to data
  • Slide 28
  • GBT: Grids of Blocks of Threads. Why? The realities of integrated circuits: computation and storage must be clustered to achieve high speeds. This is the programmer's view of data and computation partitioning.
  • Slide 29
  • Block and Thread IDs: threads and blocks have IDs, so each thread can decide what data to work on. Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data; it is a convenience, not a necessity. (Diagram: a device grid of blocks indexed (0,0)..(2,1), with one block expanded into threads indexed (0,0)..(4,2).) IDs and dimensions are accessible through predefined variables, e.g., blockDim.x and threadIdx.x. A small sketch follows.
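    A minimal sketch, not from the slides, of a thread using these predefined variables to pick its element of a 2D array; the kernel name and the width/height parameters are hypothetical:

        // Each thread computes its (x, y) coordinates from its block and thread IDs
        // and updates one element of a width x height array stored in row-major order.
        __global__ void scale2d (float *a, int width, int height, float factor)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
            int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
            if (x < width && y < height)
                a[y * width + x] *= factor;
        }

        // Possible launch: a 2D grid of 2D blocks, e.g.
        //   dim3 block (16, 16);
        //   dim3 grid ((width + 15) / 16, (height + 15) / 16);
        //   scale2d <<<grid, block>>> (d_a, width, height, 0.5f);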
  • Slide 30
  • Execution Model: Ordering. Execution order is undefined. Do not assume and rely on: block 0 executing before block 1; thread 10 executing before thread 20; or any other ordering, even if you can observe it. Future implementations may break such an ordering; it's not part of the CUDA definition. Why? It gives the hardware more flexible options.
  • Slide 31
  • Programmer's view: Memory Model. There are different memories with different uses and performance; some are managed by the compiler, some must be managed by the programmer. (In the original diagram, arrows show whether reads and/or writes are possible.)
  • Slide 32
  • Execution Model Summary (for your reference): a grid of blocks of threads; 1D/2D grid of blocks, 1D/2D/3D blocks of threads. All blocks are identical: same structure and number of threads. Block execution order is undefined. Threads in the same block can synchronize and share data fast (shared memory). Threads from different blocks cannot cooperate; they communicate through global memory. Threads and blocks have IDs, which simplifies data indexing; thread IDs can be 1D, 2D, or 3D. Blocks do not migrate: each executes on the same processor, and several blocks may run on the same processor.
  • Slide 33
  • CUDA Software Architecture: layered interfaces; the runtime API (cuda*() calls), the lower-level driver API (cu*() calls), and libraries on top, e.g., FFT.
  • Slide 34
  • Reasoning about CUDA call ordering: GPU communication happens via cuda*() calls and kernel invocations, e.g., cudaMalloc, cudaMemcpy. These are asynchronous from the CPU's perspective: the CPU places a request in a CUDA queue, and requests are handled in-order. Streams allow for multiple queues; order within each queue is honored, but there is no ordering across queues. More on this much later on.
  • Slide 35
  • My first CUDA Program:

        // GPU code
        __global__ void arradd (float *a, float f, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) a[i] = a[i] + f;
        }

        // CPU code (N and SIZE defined elsewhere)
        int main ()
        {
            float h_a[N];
            float *d_a;

            cudaMalloc ((void **) &d_a, SIZE);
            cudaThreadSynchronize ();
            cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

            // execution configuration: blocks and threads_block as in the
            // execution-configuration step later in these slides
            arradd <<<blocks, threads_block>>> (d_a, 10.0, N);

            cudaThreadSynchronize ();
            cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
            CUDA_SAFE_CALL (cudaFree (d_a));
        }
  • Slide 36
  • CUDA API: Example. int a[N]; for (i = 0; i < N; i++) a[i] = a[i] + x; Steps: 1. Allocate CPU data structure; 2. Initialize data on CPU; 3. Allocate GPU data structure; 4. Copy data from CPU to GPU; 5. Define execution configuration; 6. Run kernel; 7. CPU synchronizes with GPU; 8. Copy data from GPU to CPU; 9. De-allocate GPU and CPU memory.
  • Slide 37
  • 1. Allocate CPU Data:

        float *ha;
        main (int argc, char *argv[])
        {
            int N = atoi (argv[1]);
            ha = (float *) malloc (sizeof (float) * N);
            ...
        }

    No memory is allocated on the GPU side yet. Pinned memory allocation (cudaMallocHost ()) results in faster CPU to/from GPU copies, but pinned memory cannot be paged out. More on this later. A pinned-allocation sketch follows.
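    A minimal sketch, not from the slides, of the pinned alternative using cudaMallocHost (paired with cudaFreeHost); the error handling shown is an assumption:

        #include <stdio.h>
        #include <stdlib.h>
        #include <cuda_runtime.h>

        int main (int argc, char *argv[])
        {
            int N = (argc > 1) ? atoi (argv[1]) : 1024;
            float *ha = NULL;
            // Page-locked (pinned) host allocation instead of malloc ():
            if (cudaMallocHost ((void **) &ha, sizeof (float) * N) != cudaSuccess) {
                fprintf (stderr, "pinned allocation failed\n");
                return 1;
            }
            // ... use ha exactly like malloc'ed memory; GPU copies are faster ...
            cudaFreeHost (ha);   // released with cudaFreeHost (), not free ()
            return 0;
        }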
  • Slide 38
  • 2. Initialize CPU Data (dummy):

        float *ha;
        int i;
        for (i = 0; i < N; i++)
            ha[i] = i;
  • Slide 39
  • 3. Allocate GPU Data:

        float *da;
        cudaMalloc ((void **) &da, sizeof (float) * N);

    Notice: there is no assignment side, i.e., NOT da = cudaMalloc (). The assignment is done internally; that's why we pass &da. Space is allocated in global memory on the GPU.
  • Slide 40
  • GPU Memory Allocation: the host manages GPU memory allocation.

        cudaMalloc (void **ptr, size_t nbytes)   // must explicitly cast to (void **)
        cudaMalloc ((void **) &da, sizeof (float) * N);

        cudaFree (void *ptr);
        cudaFree (da);

        cudaMemset (void *ptr, int value, size_t nbytes);
        cudaMemset (da, 0, N * sizeof (int));

    Check the CUDA Reference Manual. An error-checking sketch follows.
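    These calls return a cudaError_t; a minimal error-checking sketch, not from the slides (the CHECK macro name is made up):

        #include <stdio.h>
        #include <stdlib.h>
        #include <cuda_runtime.h>

        // Hypothetical helper: abort with a readable message if a CUDA call fails.
        #define CHECK(call)                                                   \
            do {                                                              \
                cudaError_t err = (call);                                     \
                if (err != cudaSuccess) {                                     \
                    fprintf (stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                             cudaGetErrorString (err));                       \
                    exit (1);                                                 \
                }                                                             \
            } while (0)

        // Usage:
        //   CHECK (cudaMalloc ((void **) &da, sizeof (float) * N));
        //   CHECK (cudaMemset (da, 0, sizeof (float) * N));
        //   CHECK (cudaFree (da));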
  • Slide 41
  • 4. Copy Initialized CPU Data to the GPU:

        float *da;
        float *ha;
        cudaMemcpy ((void *) da,              // DESTINATION
                    (void *) ha,              // SOURCE
                    sizeof (float) * N,       // number of bytes
                    cudaMemcpyHostToDevice);  // DIRECTION
  • Slide 42
  • Host/Device Data Transfers: the host initiates all transfers.

        cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)

    Asynchronous from the CPU's perspective: the CPU thread continues, and the transfer is processed in-order with other CUDA requests. The enum cudaMemcpyKind values are cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice.
  • Slide 43
  • 5. Define Execution Configuration: how many blocks and threads per block.

        int threads_block = 64;
        int blocks = N / threads_block;
        if (N % threads_block != 0) blocks += 1;

    Alternatively:

        blocks = (N + threads_block - 1) / threads_block;

    A dim3 sketch follows.
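    For multidimensional grids and blocks, the configuration can also be expressed with dim3; a minimal sketch, not from this slide, assuming N, width, height, and the 2D kernel are defined elsewhere:

        // 1D case, equivalent to the integers above:
        int  threads_block = 64;
        dim3 block (threads_block);                           // 64 threads per block
        dim3 grid ((N + threads_block - 1) / threads_block);  // enough blocks to cover N

        // 2D case, e.g., for a width x height image:
        dim3 block2d (16, 16);                                // 256 threads per block
        dim3 grid2d ((width  + block2d.x - 1) / block2d.x,
                     (height + block2d.y - 1) / block2d.y);
        // some_kernel_2d <<<grid2d, block2d>>> (d_a, width, height);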
  • Slide 44
  • 6. Launch Kernel & 7. CPU/GPU Synchronization. Instructs the GPU to launch blocks x threads_block threads:

        darradd <<<blocks, threads_block>>> (da, 10.0f, N);
        cudaThreadSynchronize ();   // forces the CPU to wait

    darradd: kernel name. <<<blocks, threads_block>>>: execution configuration (more on this soon). (da, 10.0f, N): arguments; 256-byte limit, no variable arguments.
  • Slide 45
  • CPU/GPU Synchronization: the CPU does not block on cuda*() calls; kernels/requests are queued and processed in-order, and control returns to the CPU immediately. This is good if there is other work to be done, e.g., preparing for the next kernel invocation. Eventually the CPU must know when the GPU is done so that it can safely copy the GPU results: cudaThreadSynchronize () blocks the CPU until all preceding cuda*() and kernel requests have completed. A usage sketch follows.
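    A minimal sketch, under the queuing model described above, of overlapping CPU work with a running kernel before synchronizing; prepare_next_input () and ha2 are hypothetical placeholders:

        darradd <<<blocks, threads_block>>> (da, x, N);  // queued; returns immediately
        prepare_next_input (ha2);                        // CPU keeps working meanwhile
        cudaThreadSynchronize ();                        // wait until the kernel has finished
        cudaMemcpy (ha, da, sizeof (float) * N,
                    cudaMemcpyDeviceToHost);             // now it is safe to read the results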
  • Slide 46
  • 8. Copy Data from GPU to CPU & 9. De-allocate Memory:

        float *da;
        float *ha;
        cudaMemcpy ((void *) ha,              // DESTINATION
                    (void *) da,              // SOURCE
                    sizeof (float) * N,       // number of bytes
                    cudaMemcpyDeviceToHost);  // DIRECTION
        cudaFree (da);
        // display or process results here
        free (ha);
  • Slide 47
  • The GPU Kernel:

        __global__ void darradd (float *da, float x, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) da[i] = da[i] + x;
        }

    blockIdx: unique block ID, numerically ascending: 0, 1, ... blockDim: dimensions of the block = how many threads it has (blockDim.x, blockDim.y, blockDim.z; unused dimensions default to 1). threadIdx: unique per-block index: 0, 1, ... within each block.
  • Slide 48
  • Array Index Calculation Example (assuming blockDim.x = 64):

        int i = blockIdx.x * blockDim.x + threadIdx.x;

        blockIdx.x = 0, threadIdx.x = 0..63  ->  i = 0..63     (a[0]..a[63])
        blockIdx.x = 1, threadIdx.x = 0..63  ->  i = 64..127   (a[64]..a[127])
        blockIdx.x = 2, threadIdx.x = 0..63  ->  i = 128..191  (a[128]..a[191])
        blockIdx.x = 3, threadIdx.x = 0      ->  i = 192       (a[192]), and so on
  • Slide 49
  • CUDA Function Declarations: __global__ defines a kernel function; it must return void and can only call __device__ functions. __device__ and __host__ can be used together on the same function, in which case two different versions are generated.

        Declaration                          Executed on   Only callable from
        __device__ float DeviceFunc()        device        device
        __global__ void  KernelFunc()        device        host
        __host__   float HostFunc()          host          host

    A small example follows.
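    A minimal sketch, not from the slides, of a function compiled for both sides by combining __host__ and __device__; the names clampf and clamp_all are made up:

        // Compiled twice: once for the CPU and once for the GPU.
        __host__ __device__ float clampf (float v, float lo, float hi)
        {
            return v < lo ? lo : (v > hi ? hi : v);
        }

        __global__ void clamp_all (float *da, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) da[i] = clampf (da[i], 0.0f, 1.0f);   // device version used here
        }

        // Host code may also call clampf (x, 0.0f, 1.0f) directly (host version).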
  • Slide 50
  • __device__ Example: add x to a[i] multiple times.

        __device__ float addmany (float a, float b, int count)
        {
            while (count--) a += b;
            return a;
        }

        __global__ void darradd (float *da, float x, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) da[i] = addmany (da[i], x, 10);
        }
  • Slide 51
  • Kernel and Device Function Restrictions: __device__ functions cannot have their address taken, e.g., f = &addmany; (*f)(); For functions executed on the device: no recursion, e.g., darradd () { ... darradd (); ... }; no static variable declarations inside the function, e.g., darradd () { static int canthavethis; }; no variable number of arguments, e.g., nothing like printf ().
  • Slide 52
  • My first CUDA Program (revisited):

        // GPU code
        __global__ void arradd (float *a, float f, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) a[i] = a[i] + f;
        }

        // CPU code (N and SIZE defined elsewhere)
        int main ()
        {
            float h_a[N];
            float *d_a;

            cudaMalloc ((void **) &d_a, SIZE);
            cudaThreadSynchronize ();
            cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

            arradd <<<blocks, threads_block>>> (d_a, 10.0, N);

            cudaThreadSynchronize ();
            cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
            CUDA_SAFE_CALL (cudaFree (d_a));
        }
  • Slide 53
  • How to get high performance #1: programmer-managed scratchpad (shared) memory. Bring data in from global memory and reuse it; 16 KB, banked, accessed in parallel by 16 threads. The programmer needs to decide what to bring in and when, and which thread accesses what and when; coordination is paramount. A small sketch follows.
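    A minimal sketch, not from the slides, of staging data through the scratchpad (__shared__ memory in CUDA) with a barrier before reuse; the kernel and its access pattern are illustrative assumptions:

        // Each block stages one tile of a[] into shared memory, synchronizes,
        // then every thread reads a neighbor's element from the fast tile
        // instead of going back to global memory.
        __global__ void neighbor_sum (float *a, float *out, int N)
        {
            __shared__ float tile[64];                    // programmer-managed scratchpad
            int i = blockIdx.x * blockDim.x + threadIdx.x;

            tile[threadIdx.x] = (i < N) ? a[i] : 0.0f;    // bring data in from global memory
            __syncthreads ();                             // wait until the whole tile is loaded

            if (i < N) {
                int left = (threadIdx.x > 0) ? threadIdx.x - 1 : 0;
                out[i] = tile[threadIdx.x] + tile[left];  // reuse data from shared memory
            }
        }

        // Launch with 64 threads per block to match the tile size, e.g.:
        //   neighbor_sum <<<(N + 63) / 64, 64>>> (d_a, d_out, N);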
  • Slide 54
  • How to get high performance #2. Global memory accesses: 32 threads access memory together and can coalesce into a single reference; e.g., a[threadID] works well. Control flow: 32 threads run together; if they diverge there is a performance penalty. Texture cache: use it when you think there is locality. A coalescing sketch follows.
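    A minimal sketch, not from the slides, contrasting a coalescing-friendly access pattern with a strided one; both kernels are illustrative assumptions:

        // Coalescing-friendly: consecutive threads touch consecutive elements,
        // so the 32 threads that run together can be served by few wide references.
        __global__ void scale_coalesced (float *a, int N, float f)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) a[i] *= f;                    // a[threadID]-style access
        }

        // Strided: consecutive threads touch elements far apart,
        // so the same 32 accesses spread over many separate references.
        __global__ void scale_strided (float *a, int N, float f, int stride)
        {
            int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
            if (i < N) a[i] *= f;
        }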
  • Slide 55
  • Are GPUs really that much faster than CPUs? 50x to 200x speedups are typically reported. Recent work found that not enough effort goes into optimizing code for CPUs. But: the learning curve and expertise needed for CPUs is much larger.
  • Slide 56
  • ECE Overview: ECE research profile; personnel and budget; partnerships with industry. Our areas of expertise: Biomedical Engineering, Communications, Computer Engineering, Electromagnetics, Electronics, Energy Systems, Photonics, Systems Control. Slides from F. Najm (Chair) and T. Sargent (Research Vice Chair).
  • Slide 57
  • About our group: Computer Architecture, i.e., how to build the best possible system (best in performance, power, cost, etc.). Expertise in high-end systems, micro-architecture, and multi-processor and multi-core systems. Current research support: AMD, IBM, NSERC, Qualcomm (planned). Claims to fame: memory dependence prediction, commercially implemented and licensed; snoop filtering in IBM Blue Gene.
  • Slide 58
  • Slide 59
  • UofT-DRDC Partnership
  • Slide 60
  • Slide 61
  • Slide 62
  • Examples of industry research contracts with ECE in the past 8 years AMD Agile Systems Inc Altera ARISE Technologies Asahi Kasei Microsystems Bell Canada Bell Mobility Cellular Bioscrypt Inc Broadcom Corporation Ciclon Semiconductor Cybermation Inc Digital Predictive Systems Inc. DPL Science Eastman Kodak Electro Scientific Industries EMS Technologies Exar Corp FOX-TEK Firan Technology Group Fuji Electric Fujitsu Gennum H2Green Energy Corporation Honeywell ASCa, Inc. Hydro One Networks Inc. IBM Canada Ltd. IBM IMAX Corporation Intel Corporation Jazz Semiconductor KT Micro LG Electronics Maxim MPB Technologies Microsoft Motorola Northrop Grumman NXP Semiconductors ON Semiconductor Ontario Lottery and Gaming Corp Ontario Power Generation Inc. Panasonic Semiconductor Singapore Peraso Technologies Inc. Philips Electronics North America Redline Communications Inc. Research in Motion Ltd. Right Track CAD Robert Bosch Corporation Samsung Thales Co., Ltd Semiconductor Research Corporation Siemens Aktiengesellschaft Sipex Corporation STMicroelectronics Inc. Sun Microsystems of Canada Inc. Telus Mobility Texas Instruments Toronto Hydro-Electric System Toshiba Corporation Xilinx Inc.
  • Slide 63
  • Eight Research Groups: 1. Biomedical Engineering; 2. Communications; 3. Computer Engineering; 4. Electromagnetics; 5. Electronics; 6. Energy Systems; 7. Photonics; 8. Systems Control.
  • Slide 64
  • Computer Engineering Group: Human-Computer Interaction (Willy Wong, Steve Mann); Multi-sensor information systems (Parham Aarabi); Computer Hardware (Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson); Computer Architecture (Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger); Computer Security (David Lie, Ashvin Goel).
  • Slide 65
  • Biomedical Engineering: Neurosystems (Berj L. Bardakjian, Roman Genov, Willy Wong, Hans Kunov, Moshe Eizenman); Rehabilitation (Milos Popovic, Tom Chau); Medical Imaging (Michael Joy, Adrian Nachman, Richard Cobbold, Ofer Levi); Proteomics (Brendan Frey, Kevin Truong).
  • Slide 66
  • Communications Group: study of the principles, mathematics, and algorithms that underpin how information is encoded, exchanged, and processed. Three sub-groups: 1. Networks; 2. Signal Processing; 3. Information Theory.
  • Slide 67
  • Sequence Analysis
  • Slide 68
  • Image Analysis and Computer Vision: computer vision and graphics; embedded computer vision; pattern recognition and detection.
  • Slide 69
  • Networks
  • Slide 70
  • Quantum Cryptography and Computing
  • Slide 71
  • Computer Engineering: System Software (Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li); Computer-Aided Design of Circuits (Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose).
  • Slide 72
  • Electronics Group (UofT-IBM Partnership): 14 active professors; largest electronics group in Canada. Breadth of research topics: electronic device modelling, semiconductor technology, VLSI CAD and systems, FPGAs, DSP and mixed-mode ICs, biomedical microsystems, high-speed and mm-wave ICs and SoCs. Lab for (on-wafer) SoC and IC testing through 220 GHz.
  • Slide 73
  • Intelligent Sensory Microsystems: mixed-signal VLSI circuits (low-power, low-noise signal processing, computing, and ADCs); on-chip micro-sensors (electrical, chemical, optical). Project examples: brain-chip interfaces, on-chip biochemical sensors, CMOS imagers.
  • Slide 74
  • mm-Wave and 100+ GHz systems on chip: modelling mm-wave and noise performance of active and passive devices past 300 GHz; 60-120 GHz multi-gigabit data rate phased-array radios; single-chip 76-79 GHz automotive radar; 170 GHz transceiver with on-die antennas.
  • Slide 75
  • Electromagnetics Group. Metamaterials (from microwaves to optics): super-resolving lenses for imaging and sensing, small antennas, multiband RF components, CMOS phase shifters. Electromagnetics of high-speed circuits: signal integrity in high-speed digital systems; microwave integrated circuit design, modeling, and characterization. Computational electromagnetics. Interaction of electromagnetic fields with living tissue. Antennas for telecom and wireless systems: reflectarrays, wave electronics, integrated antennas, controlled-beam antennas, adaptive and diversity antennas.
  • Slide 76
  • METAMATERIALS (MTMs): super-lens capable of resolving details down to ...; small and broadband antennas; scanning antennas with CMOS MTM chips.
  • Slide 77
  • Computational Electromagnetics: fast CAD for RF/optical structures; modeling of metamaterials (plasmonic and left-handed media, leaky-wave antennas). Examples: microstrip spiral inductor, optical power splitter.
  • Slide 78
  • Energy Systems Group. Power electronics: high-power (> 1.2 MW) converters, their modeling, control, and digital control realization. Micro power grids: converters for distributed resources, dc distribution systems, and HVdc systems. Low-power electronics: integrated power supplies and power-management systems-on-chip for low-power electronics (computers, cell phones, PDAs, MP3 players, body implants); harvesting energy from humans.
  • Slide 79
  • Energy Systems Research examples: IC for cell phone power supplies (U of T); matrix converter for a micro-turbine generator; voltage control system for wind power generators.
  • Slide 80
  • Photonics Group
  • Slide 81
  • Slide 82
  • Slide 83
  • Photonics Group: Bio-Photonics
  • Slide 84
  • Systems Control Group: basic and applied research in control engineering; world-leading group in control theory. Examples: optical signal-to-noise ratio optimization with game theory; erbium-doped fibre amplifier design; analysis and design of digital watermarks for authentication; nonlinear control theory applied to magnetic levitation and micro-positioning systems; distributed control of mobile autonomous robots (formations, collision avoidance).