Introduction to CUDA Programming: Introduction to Programming
Massively Parallel Graphics Processors
Andreas Moshovos [email protected]
ECE, Univ. of Toronto, Summer 2010
Some slides/material from: the UIUC course by Wen-Mei Hwu and David
Kirk, the UCSB course by Andrea Di Blas, Universität Jena by Waqar
Saleem, and NVIDIA by Simon Green, and others as noted on slides.
Slide 2
How to Get High Performance: Computation (calculations) and data
(communication/storage). Ideally we want tons of compute engines,
tons of storage, unlimited bandwidth, and zero/low latency.
Slide 3
Calculation capabilities: How many calculation units can be built?
Today's silicon chips have about 1B transistors. At ~30K transistors
for a 52b multiplier, that is ~30K multipliers. In a 260mm^2 die
(mid-range), at 112 microns^2 per FP unit (overestimated), ~2K FP
units fit. At frequencies of ~3GHz, common today, TFLOPs are
possible. Disclaimer: back-of-the-envelope calculations; take with a
grain of salt. Conclusion: we can build lots of calculation units
(ALUs). Tons of Compute Engines?
Slide 4
How about communication/storage? We need to feed and store data, and
the larger the storage, the slower it is: it takes time to get there
and back, multiple cycles even on the same die. So: tons of compute
engines, but tons of *slow* storage, with neither unlimited bandwidth
nor zero/low latency.
Slide 5
Is there enough parallelism to keep all of this busy? It needs lots
of independent calculations: parallelism/concurrency. But much of
what we do is sequential: first do 1, then do 2, then if X do 3 else
do 4. So the wish list (tons of compute engines, tons of storage,
unlimited bandwidth, zero/low latency) also assumes the work can be
parallelized.
Slide 6
Today's high-end general-purpose processors localize communication
and computation and try to automatically extract parallelism: tons of
slow storage behind a hierarchy of faster and slower caches;
instruction-level parallelism extracted automatically; large on-die
caches to tolerate off-chip memory latency.
Slide 7
Some things are naturally parallel
Slide 8
Sequential Execution Model
   int a[N];  // N is large
   for (i = 0; i < N; i++)
      a[i] = a[i] * fade;
One flow of control / thread; one instruction at a time.
Optimizations are possible at the machine level.
Slide 9
Data Parallel Execution Model / SIMD
   int a[N];  // N is large
   for all elements do in parallel
      a[index] = a[index] * fade;
This has been tried before: ILLIAC III, UIUC, 1966.
Slide 10
Single Program Multiple Data / SPMD
   int a[N];  // N is large
   for all elements do in parallel
      if (a[i] > threshold) a[i] *= fade;
The model used in today's graphics processors.
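As a preview, here is a minimal sketch of the SPMD fragment above
written as a CUDA kernel (the name fade_kernel and the parameters are
illustrative; the host-side setup is covered later in these slides):
   __global__ void fade_kernel (float *a, float threshold, float fade, int N)
   {
      // Every thread runs this same program on its own element.
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      // SPMD: threads may take different paths through the branch.
      if (i < N && a[i] > threshold)
         a[i] = a[i] * fade;
   }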
Slide 11
CPU vs. GPU overview. CPU: handles sequential code well; can't take
advantage of massively parallel code; lower off-chip bandwidth; lower
peak computation capability. GPU: requires massively parallel
computation; handles some control flow; higher off-chip bandwidth;
higher peak computation capability.
Slide 12
Programmer's view: the GPU as a co-processor (2008). [Diagram: CPU
with its memory, connected to the GPU and its own memory (1GB on our
systems); links labeled 3GB/s, 8GB/s, 6.4GB/sec (8B per transfer),
31.92GB/sec, and 141GB/sec.]
Slide 13
Target Applications
   int a[N];  // N is large
   for all elements of a
      compute a[i] = a[i] * fade;
Lots of independent computations. CUDA threads need not be
independent, however.
Slide 14
Programmer's view of the GPU. The GPU is a compute device that: is a
coprocessor to the CPU, or host; has its own DRAM (device memory);
runs many threads in parallel. Data-parallel portions of an
application are executed on the device as kernels, which run in
parallel on many threads.
Slide 15
Why are threads useful? Parallelism/concurrency: do multiple things
in parallel. This uses more hardware (more functional units) to get
higher performance.
Slide 16
Why are threads useful? #2: Tolerating stalls. Often a thread stalls,
e.g., on a memory access. Multiplexing the same functional unit among
many threads gets more performance at a fraction of the cost.
Slide 17
GPU vs. CPU threads. GPU threads are extremely lightweight: very
little creation overhead, on the order of microseconds, all done in
hardware. The GPU needs 1000s of threads for full efficiency; a
multi-core CPU needs only a few.
Slide 18
Execution Timeline (CPU / host vs. GPU / device): 1. Copy to GPU
memory; 2. Launch GPU kernel; 3. Synchronize with GPU; 4. Copy from
GPU memory.
Slide 19
Programmer's view: first, create the data in CPU memory.
Slide 20
Programmer's view: then, copy it to GPU memory.
Slide 21
Programmer's view: the GPU starts computing, running a kernel; the
CPU can also continue working.
Slide 22
Programmer's view: the CPU and GPU synchronize.
Slide 23
Programmer's view: finally, copy the results back to the CPU.
Slide 24
Computation partitioning: at the highest level, think of the
computation as a series of loops, each of which becomes a kernel:
   for (i = 0; i < big_number; i++) a[i] = some function;
   for (i = 0; i < big_number; i++) a[i] = some other function;
   for (i = 0; i < big_number; i++) a[i] = some other function;
A sketch of this mapping follows below.
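A minimal sketch, assuming hypothetical kernels step1 and step2, one
per loop; kernels launched into the default queue run one after the
other on the GPU:
   __global__ void step1 (float *a, int n)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) a[i] = a[i] * 2.0f;      // "some function"
   }
   __global__ void step2 (float *a, int n)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) a[i] = a[i] + 1.0f;      // "some other function"
   }
   void run_all (float *d_a, int n)
   {
      int threads_block = 64;
      int blocks = (n + threads_block - 1) / threads_block;
      step1 <<<blocks, threads_block>>> (d_a, n);   // first loop
      step2 <<<blocks, threads_block>>> (d_a, n);   // second loop, after step1
   }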
Slide 25
Computation Partitioning -- Kernel. CUDA exposes the hardware to the
programmer, who must manually partition the work appropriately. The
programmer's view is hierarchical: think of the data as an array.
Slide 26
Per-Kernel Computation Partitioning. Computation grid: 2D case, a
grid of blocks of threads. Threads within a block can
communicate/synchronize and run on the same multiprocessor. Threads
across blocks can't communicate and shouldn't touch each other's
data; the behavior is undefined.
Slide 27
Thread Coordination Overview: race-free access to data.
Slide 28
GBT: Grids of Blocks of Threads. Why? Realities of integrated
circuits: we need to cluster computation and storage to achieve high
speeds. This is the programmer's view of data and computation
partitioning.
Slide 29
Block and Thread IDs. Threads and blocks have IDs, so each thread can
decide what data to work on. Block IDs are 1D or 2D; thread IDs are
1D, 2D, or 3D. This simplifies memory addressing when processing
multidimensional data; it is a convenience, not a necessity. [Figure:
a device running Grid 1, a 3x2 arrangement of blocks; Block (1, 1) is
expanded to show its 5x3 arrangement of threads.] IDs and dimensions
are accessible through predefined variables, e.g., blockDim.x and
threadIdx.x.
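A sketch of how the predefined variables combine for a 2D grid of 2D
blocks; the kernel name scale2d and the sizes are illustrative:
   __global__ void scale2d (float *img, int width, int height)
   {
      // Each thread derives its (x, y) coordinates from its IDs.
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x < width && y < height)
         img[y * width + x] *= 0.5f;
   }
   void launch (float *d_img, int width, int height)
   {
      dim3 block (16, 16);    // 16x16 = 256 threads per block
      dim3 grid ((width  + block.x - 1) / block.x,
                 (height + block.y - 1) / block.y);
      scale2d <<<grid, block>>> (d_img, width, height);
   }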
Slide 30
Execution Model: Ordering. Execution order is undefined. Do not
assume or rely on: block 0 executing before block 1, thread 10
executing before thread 20, or any other ordering, even if you can
observe it. Future implementations may break such orderings; they are
not part of the CUDA definition. Why? It gives the hardware more
flexible options.
Slide 31
Programmer's view: Memory Model. There are different memories with
different uses and performance characteristics. Some are managed by
the compiler; some must be managed by the programmer. [In the figure,
arrows show whether read and/or write is possible.]
Slide 32
Execution Model Summary (for your reference). A grid of blocks of
threads: a 1D/2D grid of blocks, each a 1D/2D/3D block of threads.
All blocks are identical: same structure and number of threads. Block
execution order is undefined. Threads in the same block can
synchronize and share data fast (shared memory). Threads from
different blocks cannot cooperate; they communicate through global
memory. Threads and blocks have IDs, which simplifies data indexing;
they can be 1D, 2D, or 3D (threads). Blocks do not migrate: they
execute on the same processor. Several blocks may run on the same
processor.
Slide 33
CUDA Software Architecture: the runtime API (cuda*() calls), the
driver API (cu*() calls), and libraries on top, e.g., cufft().
Slide 34
Reasoning about CUDA call ordering. The CPU communicates with the GPU
via cuda*() calls and kernel invocations, e.g., cudaMalloc and
cudaMemcpy. These are asynchronous from the CPU's perspective: the
CPU places a request in a CUDA queue, and requests are handled
in-order. Streams allow for multiple queues: order within each queue
is honored, but there is no order across queues. More on this much
later on; a small preview sketch follows below.
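A minimal sketch of two queues, assuming d_a and d_b were allocated
with cudaMalloc and h_a, h_b are pinned host buffers; all names are
illustrative:
   __global__ void inc (float *a, int n)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) a[i] += 1.0f;
   }
   void two_queues (float *d_a, float *h_a, float *d_b, float *h_b,
                    int n, int blocks, int threads_block)
   {
      cudaStream_t s0, s1;
      cudaStreamCreate (&s0);
      cudaStreamCreate (&s1);
      size_t bytes = sizeof (float) * n;
      // Within s0: the copy completes before the kernel starts.
      cudaMemcpyAsync (d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
      inc <<<blocks, threads_block, 0, s0>>> (d_a, n);
      // s1's requests are ordered only against other s1 requests.
      cudaMemcpyAsync (d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
      inc <<<blocks, threads_block, 0, s1>>> (d_b, n);
      cudaThreadSynchronize ();   // wait for both queues to drain
      cudaStreamDestroy (s0);
      cudaStreamDestroy (s1);
   }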
Slide 35
My first CUDA Program
   // GPU code:
   __global__ void arradd (float *a, float f, int N)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) a[i] = a[i] + f;
   }
   // CPU code:
   int main ()
   {
      float h_a[N];
      float *d_a;
      cudaMalloc ((void **) &d_a, SIZE);
      cudaThreadSynchronize ();
      cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
      arradd <<<blocks, threads_block>>> (d_a, 10.0f, N);
      cudaThreadSynchronize ();
      cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
      CUDA_SAFE_CALL (cudaFree (d_a));
   }
Slide 36
CUDA API: Example
   int a[N];
   for (i = 0; i < N; i++) a[i] = a[i] + x;
1. Allocate CPU data structure
2. Initialize data on CPU
3. Allocate GPU data structure
4. Copy data from CPU to GPU
5. Define execution configuration
6. Run kernel
7. CPU synchronizes with GPU
8. Copy data from GPU to CPU
9. De-allocate GPU and CPU memory
Slide 37
1. Allocate CPU Data
   float *ha;
   main (int argc, char *argv[])
   {
      int N = atoi (argv[1]);
      ha = (float *) malloc (sizeof (float) * N);
      ...
   }
No memory is allocated on the GPU side at this point. Pinned memory
allocation (cudaMallocHost ()) results in faster CPU to/from GPU
copies, but pinned memory cannot be paged out. More on this later; a
sketch of the pinned alternative follows below.
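A sketch mirroring the malloc() version above (error checking
omitted); note that pinned memory has its own free call:
   #include <stdlib.h>
   float *ha;
   int main (int argc, char *argv[])
   {
      int N = atoi (argv[1]);
      // Page-locked host memory: faster DMA transfers, but it cannot
      // be paged out, so allocate sparingly.
      cudaMallocHost ((void **) &ha, sizeof (float) * N);
      // ... use ha exactly like the malloc()'d version ...
      cudaFreeHost (ha);
      return 0;
   }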
Slide 38
2. Initialize CPU Data (dummy)
   float *ha;
   int i;
   for (i = 0; i < N; i++)
      ha[i] = i;
Slide 39
3. Allocate GPU Data
   float *da;
   cudaMalloc ((void **) &da, sizeof (float) * N);
Notice: there is no assignment of a return value. It is NOT
da = cudaMalloc (...). The assignment is done internally: that's why
we pass &da. Space is allocated in global memory on the GPU.
Slide 40
GPU Memory Allocation. The host manages GPU memory allocation:
   cudaMalloc (void **ptr, size_t nbytes)
Must explicitly cast to (void **):
   cudaMalloc ((void **) &da, sizeof (float) * N);
   cudaFree (void *ptr);
      cudaFree (da);
   cudaMemset (void *ptr, int value, size_t nbytes);
      cudaMemset (da, 0, N * sizeof (int));
Check the CUDA Reference Manual.
Slide 41
4. Copy Initialized CPU data to GPU
   float *da;
   float *ha;
   cudaMemcpy ((void *) da,        // DESTINATION
               (void *) ha,        // SOURCE
               sizeof (float) * N, // #bytes
               cudaMemcpyHostToDevice); // DIRECTION
Slide 42
Host/Device Data Transfers. The host initiates all transfers:
   cudaMemcpy (void *dst, void *src, size_t nbytes,
               enum cudaMemcpyKind direction)
Asynchronous from the CPU's perspective: the CPU thread continues.
In-order processing with other CUDA requests.
   enum cudaMemcpyKind: cudaMemcpyHostToDevice,
   cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
Slide 43
5. Define Execution Configuration: how many blocks and threads per
block?
   int threads_block = 64;
   int blocks = N / threads_block;
   if (N % threads_block != 0) blocks += 1;
Alternatively:
   blocks = (N + threads_block - 1) / threads_block;
Slide 44
6. Launch Kernel & 7. CPU/GPU Synchronization. Instructs the GPU to
launch blocks x threads_block threads:
   darradd <<<blocks, threads_block>>> (da, 10.0f, N);
   cudaThreadSynchronize ();  // forces CPU to wait
darradd: the kernel name. <<<...>>>: the execution configuration
(more on this soon). (da, 10.0f, N): the arguments; 256-byte limit on
arguments, and no variable arguments.
Slide 45
CPU/GPU Synchronization. The CPU does not block on cuda*() calls:
kernels/requests are queued and processed in-order, and control
returns to the CPU immediately. This is good if there is other work
to be done, e.g., preparing for the next kernel invocation.
Eventually the CPU must know when the GPU is done, so that it can
safely copy the GPU results: cudaThreadSynchronize () blocks the CPU
until all preceding cuda*() and kernel requests have completed. A
sketch follows below.
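A sketch of exploiting this asynchrony, reusing the darradd kernel
from these slides; do_cpu_work() is a placeholder for independent
host-side work:
   darradd <<<blocks, threads_block>>> (da, 10.0f, N); // returns immediately
   do_cpu_work ();                // overlaps with GPU execution
   cudaThreadSynchronize ();      // block until the kernel is done
   cudaMemcpy (ha, da, sizeof (float) * N,
               cudaMemcpyDeviceToHost); // now safe to copy results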
Slide 46
8. Copy data from GPU to CPU & 9. De-allocate Memory
   float *da;
   float *ha;
   cudaMemcpy ((void *) ha,        // DESTINATION
               (void *) da,        // SOURCE
               sizeof (float) * N, // #bytes
               cudaMemcpyDeviceToHost); // DIRECTION
   cudaFree (da);
   // display or process results here
   free (ha);
Slide 47
The GPU Kernel
   __global__ void darradd (float *da, float x, int N)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) da[i] = da[i] + x;
   }
blockIdx: unique block ID; numerically ascending: 0, 1, ...
blockDim: dimensions of the block = how many threads it has:
blockDim.x, blockDim.y, blockDim.z; unused dimensions default to 1.
threadIdx: index unique within its block: 0, 1, ..., per block.
Slide 48
Array Index Calculation Example
   int i = blockIdx.x * blockDim.x + threadIdx.x;
Assuming blockDim.x = 64:
   blockIdx.x = 0: threadIdx.x = 0..63 -> i = 0..63    -> a[0]..a[63]
   blockIdx.x = 1: threadIdx.x = 0..63 -> i = 64..127  -> a[64]..a[127]
   blockIdx.x = 2: threadIdx.x = 0..63 -> i = 128..191 -> a[128]..a[191]
   blockIdx.x = 3: threadIdx.x = 0     -> i = 192      -> a[192]
Slide 49
CUDA Function Declarations. __global__ defines a kernel function: it
must return void and can only call __device__ functions. __device__
and __host__ can be used together; two different versions are
generated.
                                   Executed on:   Only callable from:
   __device__ float DeviceFunc()   device         device
   __global__ void KernelFunc()    device         host
   __host__ float HostFunc()       host           host
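A sketch of __device__ and __host__ used together (clampf is an
illustrative name); the compiler emits one version for each side:
   __host__ __device__ float clampf (float v, float lo, float hi)
   {
      return v < lo ? lo : (v > hi ? hi : v);
   }
   __global__ void clamp_kernel (float *a, int N)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) a[i] = clampf (a[i], 0.0f, 1.0f);  // device version
   }
   // From host code: y = clampf (x, 0.0f, 1.0f);   // host version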
Slide 50
__device__ Example: add x to a[i] multiple times.
   __device__ float addmany (float a, float b, int count)
   {
      while (count--) a += b;
      return a;
   }
   __global__ void darradd (float *da, float x, int N)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) da[i] = addmany (da[i], x, 10);
   }
Slide 51
Kernel and Device Function Restrictions. __device__ functions cannot
have their address taken, e.g.: f = &addmany; (*f)(); For functions
executed on the device: no recursion (darradd () { darradd (); }), no
static variable declarations inside the function (darradd () { static
int canthavethis; }), and no variable number of arguments, e.g.,
something like printf ().
Slide 52
My first CUDA Program
   // GPU code:
   __global__ void arradd (float *a, float f, int N)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) a[i] = a[i] + f;
   }
   // CPU code:
   int main ()
   {
      float h_a[N];
      float *d_a;
      cudaMalloc ((void **) &d_a, SIZE);
      cudaThreadSynchronize ();
      cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
      arradd <<<blocks, threads_block>>> (d_a, 10.0f, N);
      cudaThreadSynchronize ();
      cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
      CUDA_SAFE_CALL (cudaFree (d_a));
   }
Slide 53
How to get high performance #1: programmer-managed scratchpad memory.
Bring data in from global memory and reuse it. It is 16KB and banked,
accessed in parallel by 16 threads. The programmer needs to decide
what to bring in and when, and which thread accesses what and when.
Coordination is paramount. A sketch follows below.
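A minimal sketch of the pattern, assuming 256-thread blocks: stage
data in the scratchpad, synchronize, then let neighboring threads
reuse each other's loads (the kernel name smooth is illustrative):
   __global__ void smooth (float *a, int N)
   {
      __shared__ float tile[256];  // programmer-managed scratchpad
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = (i < N) ? a[i] : 0.0f; // bring in from global memory
      __syncthreads ();            // all loads complete before any reuse
      if (i < N && threadIdx.x > 0)
         a[i] = 0.5f * (tile[threadIdx.x] + tile[threadIdx.x - 1]);
   }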
Slide 54
How to get high performance #2: global memory accesses. 32 threads
access memory together; the accesses can coalesce into a single
reference, e.g., a[threadID] works well. Control flow: 32 threads run
together; if they diverge, there is a performance penalty. Texture
cache: use it when you think there is locality. A sketch of the
access patterns follows below.
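A sketch of the two access patterns (kernel names are illustrative):
consecutive threads touching consecutive addresses coalesce; a large
stride does not:
   __global__ void coalesced (float *a, int N)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) a[i] += 1.0f;     // neighbors access neighboring words
   }
   __global__ void strided (float *a, int N, int stride)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i * stride < N)
         a[i * stride] += 1.0f;    // each thread a separate transaction
   }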
Slide 55
Are GPUs really that much faster than CPUs? Speedups of 50x to 200x
are typically reported. Recent work found that not enough effort goes
into optimizing code for CPUs. But the learning curve and expertise
needed for CPU optimization is much larger.
Slide 56
ECE Overview: ECE research profile, personnel and budget, and
partnerships with industry. Our areas of expertise: Biomedical
Engineering, Communications, Computer Engineering, Electromagnetics,
Electronics, Energy Systems, Photonics, Systems Control. Slides from
F. Najm (Chair) and T. Sargent (Research Vice Chair).
Slide 57
About our group: Computer Architecture. How to build the best
possible system; best in performance, power, cost, etc. Expertise in
high-end systems, micro-architecture, and multi-processor and
multi-core systems. Current research support: AMD, IBM, NSERC,
Qualcomm (planned). Claims to fame: memory dependence prediction,
commercially implemented and licensed; snoop filtering, in IBM Blue
Gene.
Slide 58
Slide 59
UofT-DRDC Partnership
Slide 60
Slide 61
Slide 62
Examples of industry research contracts with ECE in the past 8
years AMD Agile Systems Inc Altera ARISE Technologies Asahi Kasei
Microsystems Bell Canada Bell Mobility Cellular Bioscrypt Inc
Broadcom Corporation Ciclon Semiconductor Cybermation Inc Digital
Predictive Systems Inc. DPL Science Eastman Kodak Electro
Scientific Industries EMS Technologies Exar Corp FOX-TEK Firan
Technology Group Fuji Electric Fujitsu Gennum H2Green Energy
Corporation Honeywell ASCa, Inc. Hydro One Networks Inc. IBM Canada
Ltd. IBM IMAX Corporation Intel Corporation Jazz Semiconductor KT
Micro LG Electronics Maxim MPB Technologies Microsoft Motorola
Northrop Grumman NXP Semiconductors ON Semiconductor Ontario
Lottery and Gaming Corp Ontario Power Generation Inc. Panasonic
Semiconductor Singapore Peraso Technologies Inc. Philips
Electronics North America Redline Communications Inc. Research in
Motion Ltd. Right Track CAD Robert Bosch Corporation Samsung Thales
Co., Ltd Semiconductor Research Corporation Siemens
Aktiengesellschaft Sipex Corporation STMicroelectronics Inc. Sun
Microsystems of Canada Inc. Telus Mobility Texas Instruments
Toronto Hydro-Electric System Toshiba Corporation Xilinx Inc.
Slide 63
ECE: Eight Research Groups: 1. Biomedical Engineering; 2.
Communications; 3. Computer Engineering; 4. Electromagnetics; 5.
Electronics; 6. Energy Systems; 7. Photonics; 8. Systems Control.
Slide 64
Computer Engineering Group. Human-Computer Interaction: Willy Wong,
Steve Mann. Multi-sensor information systems: Parham Aarabi. Computer
Hardware: Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson.
Computer Architecture: Greg Steffan, Andreas Moshovos, Tarek
Abdelrahman, Natalie Enright Jerger. Computer Security: David Lie,
Ashvin Goel.
Slide 65
Biomedical Engineering. Neurosystems: Berj L. Bardakjian, Roman
Genov, Willy Wong, Hans Kunov, Moshe Eizenman. Rehabilitation: Milos
Popovic, Tom Chau. Medical Imaging: Michael Joy, Adrian Nachman,
Richard Cobbold, Ofer Levi. Proteomics: Brendan Frey, Kevin Truong
(Ca2+).
Slide 66
Communications Group. Study of the principles, mathematics, and
algorithms that underpin how information is encoded, exchanged, and
processed. Three sub-groups: 1. Networks; 2. Signal Processing; 3.
Information Theory.
Slide 67
Sequence Analysis
Slide 68
Image Analysis and Computer Vision: computer vision and graphics;
embedded computer vision; pattern recognition and detection.
Slide 69
Networks
Slide 70
Quantum Cryptography and Computing
Slide 71
Computer Engineering (continued). System Software: Michael Stumm,
H-A. Jacobsen, Cristiana Amza, Baochun Li. Computer-Aided Design of
Circuits: Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose.
Slide 72
Electronics Group; UofT-IBM Partnership. 14 active professors; the
largest electronics group in Canada. Breadth of research topics:
electronic device modelling; semiconductor technology; VLSI CAD and
systems; FPGAs; DSP and mixed-mode ICs; biomedical microsystems;
high-speed and mm-wave ICs and SoCs. A lab for (on-wafer) SoC and IC
testing through 220 GHz.
Slide 73
Intelligent Sensory Microsystems. Mixed-signal VLSI circuits:
low-power, low-noise signal processing, computing, and ADCs. On-chip
micro-sensors: electrical, chemical, optical. Project examples:
brain-chip interfaces, on-chip biochemical sensors, CMOS imagers.
Slide 74
mm-Wave and 100+GHz systems on chip. Modelling mm-wave and noise
performance of active and passive devices past 300 GHz. 60-120GHz
multi-gigabit data rate phased-array radios. Single-chip 76-79 GHz
automotive radar. 170 GHz transceiver with on-die antennas.
Slide 75
Electromagnetics Group. Metamaterials, from microwaves to optics:
super-resolving lenses for imaging and sensing, small antennas,
multiband RF components, CMOS phase shifters. Electromagnetics of
high-speed circuits: signal integrity in high-speed digital systems;
microwave integrated circuit design, modeling, and characterization.
Computational electromagnetics. Interaction of electromagnetic fields
with living tissue. Antennas for telecom and wireless systems:
reflectarrays, wave electronics, integrated antennas, controlled-beam
antennas, adaptive and diversity antennas.
Slide 76
METAMATERIALS (MTMs): a super-lens capable of resolving fine details;
small and broadband antennas; scanning antennas with CMOS MTM chips.
Slide 77
Computational Electromagnetics: fast CAD for RF/optical structures;
modeling of metamaterials (plasmonic left-handed media, leaky-wave
antennas). [Figures: microstrip spiral inductor; optical power
splitter.]
Slide 78
Energy Systems Group. Power Electronics: high-power (> 1.2 MW)
converters; modeling, control, and digital control realization.
Micro-Power Grids: converters for distributed resources, dc
distribution systems, and HVdc systems. Low-Power Electronics:
integrated power supplies and power management systems-on-chip for
low-power electronics (computers, cell phones, PDAs, MP3 players,
body implants); harvesting energy from humans.
Slide 79
Energy Systems Research: an IC for cell phone power supplies (U of
T); a matrix converter for micro-turbine generators; a voltage
control system for wind power generators.
Slide 80
Photonics Group
Slide 81
Slide 82
Slide 83
Photonics Group: Bio-Photonics
Slide 84
Systems Control Group. Basic & applied research in control
engineering; a world-leading group in control theory. Topics: optical
signal-to-noise ratio optimization with game theory; erbium-doped
fibre amplifier design; analysis and design of digital watermarks for
authentication; nonlinear control theory applied to magnetic
levitation and micro-positioning systems; distributed control of
mobile autonomous robots (formations, collision avoidance).