Introduction to CUDA Programming: Programming Massively Parallel Graphics Processors. Andreas Moshovos, [email protected], ECE, Univ. of Toronto


  • Slide 1
  • Introduction to CUDA Programming: Programming Massively Parallel Graphics Processors. Andreas Moshovos, [email protected], ECE, Univ. of Toronto, Summer 2010. Some slides/material from: UIUC course by Wen-Mei Hwu and David Kirk; UCSB course by Andrea Di Blas; Universität Jena by Waqar Saleem; NVIDIA by Simon Green; and others as noted on slides.
  • Slide 2
  • How to Get High Performance: computation (calculations) and data communication/storage. Ideally: tons of compute engines, tons of storage, unlimited bandwidth, zero/low latency.
  • Slide 3
  • Calculation capabilities: how many calculation units can be built? Today's silicon chips have about 1B transistors. At ~30K transistors per 52-bit multiplier, that is ~30K multipliers. On a 260 mm^2 die (mid-range), at ~112 microns^2 per FP unit (overestimated), that is ~2K FP units. With frequencies of ~3 GHz common today, TFLOPs are possible. Disclaimer: back-of-the-envelope calculations, take with a grain of salt. Bottom line: we can build lots of calculation units (ALUs). Tons of compute engines?
  • Slide 4
  • How about Communication/Storage? We need to feed and store data, and the larger the storage, the slower it is. It takes time to get there and back: multiple cycles even on the same die. So in practice we get tons of compute engines paired with tons of slow storage, rather than unlimited bandwidth and zero/low latency.
  • Slide 5
  • Is there enough parallelism? Keeping all those compute engines busy needs lots of independent calculations (parallelism/concurrency). But much of what we do is sequential: first do 1, then do 2, then if X do 3 else do 4.
  • Slide 6
  • Today's high-end general-purpose processors localize communication and computation and try to automatically extract parallelism: they automatically extract instruction-level parallelism, and they use large on-die caches (a hierarchy of faster and slower caches in front of the slow off-chip storage) to tolerate off-chip memory latency.
  • Slide 7
  • Some things are naturally parallel
  • Slide 8
  • Sequential Execution Model: int a[N]; // N is large; for (i = 0; i < N; i++) a[i] = a[i] * fade; A single flow of control (thread) executes one instruction at a time; optimizations are possible at the machine level.
  • Slide 9
  • Data-Parallel Execution Model / SIMD: int a[N]; // N is large; for all elements, do in parallel: a[index] = a[index] * fade; This has been tried before: ILLIAC III, UIUC, 1966.
  • Slide 10
  • Single Program Multiple Data / SPMD: int a[N]; // N is large; for all elements, do in parallel: if (a[i] > threshold) a[i] *= fade; This is the model used in today's graphics processors.
  • Slide 11
  • CPU vs. GPU overview. CPU: handles sequential code well; can't take advantage of massively parallel code; lower off-chip bandwidth; lower peak computation capability. GPU: requires massively parallel computation; handles some control flow; higher off-chip bandwidth; higher peak computation capability.
  • Slide 12
  • Programmer's view: the GPU as a co-processor (2008). The CPU and its memory are connected to the GPU and its memory (1 GB on our systems). Approximate bandwidths from the original diagram: CPU-GPU link 3-8 GB/s; CPU to its memory 6.4-31.92 GB/s at 8 B per transfer; GPU to its memory 141 GB/s.
  • Slide 13
  • Target Applications: int a[N]; // N is large; for all elements of a, compute a[i] = a[i] * fade; Lots of independent computations. CUDA threads need not be independent, though.
  • Slide 14
  • Programmer's View of the GPU. The GPU is a compute device that: is a coprocessor to the CPU (host); has its own DRAM (device memory); runs many threads in parallel. Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.
  • Slide 15
  • Why are threads useful? Parallelism/concurrency: do multiple things in parallel. This uses more hardware to get higher performance, and therefore needs more functional units.
  • Slide 16
  • Why are threads useful #2: tolerating stalls. A thread often stalls, e.g., on a memory access; other threads can multiplex the same functional unit, getting more performance at a fraction of the cost.
  • Slide 17
  • GPU vs. CPU Threads: GPU threads are extremely lightweight, with very little creation overhead (on the order of microseconds) and everything done in hardware. A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few.
  • Slide 18
  • Execution Timeline (CPU/Host and GPU/Device lanes): 1. Copy to GPU memory; 2. Launch GPU kernel; 3. Synchronize with GPU; 4. Copy from GPU memory.
  • Slide 19
  • Programmer's view: first, create data in CPU memory. (Diagram: CPU and CPU memory, GPU and GPU memory.)
  • Slide 20
  • Programmer's view: then, copy the data to GPU memory.
  • Slide 21
  • Programmer's view: the GPU starts computation by running a kernel; the CPU can continue working in the meantime.
  • Slide 22
  • Programmer's view: the CPU and GPU synchronize.
  • Slide 23
  • Programmer's view: copy the results back to the CPU.
  • Slide 24
  • Computation partitioning: at the highest level, think of the computation as a series of loops: for (i = 0; i < big_number; i++) a[i] = some function; for (i = 0; i < big_number; i++) a[i] = some other function; for (i = 0; i < big_number; i++) a[i] = some other function. Each such loop becomes a kernel (see the sketch below).
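    A minimal sketch, not from the slides, of how those loops might map onto kernels, assuming one thread per loop iteration; f() and g() are hypothetical placeholders for "some function":

        // Each loop becomes a kernel; each iteration becomes one GPU thread.
        __device__ float f (float x) { return x * 0.5f; }   // placeholder
        __device__ float g (float x) { return x + 1.0f; }   // placeholder

        __global__ void step1 (float *a, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) a[i] = f (a[i]);                      // one loop iteration per thread
        }

        __global__ void step2 (float *a, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) a[i] = g (a[i]);
        }

        // Host side: one launch per original loop, e.g.:
        //   step1 <<<blocks, threads_per_block>>> (d_a, n);
        //   step2 <<<blocks, threads_per_block>>> (d_a, n);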
  • Slide 25
  • Computation Partitioning -- Kernel: CUDA exposes the hardware to the programmer, who must manually partition the work appropriately. The programmer's view is hierarchical: think of the data as an array.
  • Slide 26
  • Per-Kernel Computation Partitioning. Computation grid (2D case): threads within a block can communicate/synchronize and run on the same multiprocessor. Threads across blocks can't communicate and shouldn't touch each other's data; the behavior is undefined. (Diagram: a grid of blocks, each block made of threads.)
  • Slide 27
  • Thread Coordination Overview Race-free access to data
  • Slide 28
  • GBT: Grids of Blocks of Threads. Why? The realities of integrated circuits: computation and storage must be clustered to achieve high speeds. This is the programmer's view of data and computation partitioning.
  • Slide 29
  • Block and Thread IDs: threads and blocks have IDs, so each thread can decide what data to work on. Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data; it is a convenience, not a necessity. (Diagram: a device grid of blocks indexed (0,0)..(2,1), with one block expanded into threads indexed (0,0)..(4,2).) IDs and dimensions are accessible through predefined variables, e.g., blockDim.x and threadIdx.x. A small sketch follows.
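    A minimal sketch, not from the slides, of a thread using these predefined variables to pick its element of a 2D array; the kernel name and the width/height parameters are hypothetical:

        // Each thread computes its (x, y) coordinates from its block and thread IDs
        // and updates one element of a width x height array stored in row-major order.
        __global__ void scale2d (float *a, int width, int height, float factor)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
            int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
            if (x < width && y < height)
                a[y * width + x] *= factor;
        }

        // Possible launch: a 2D grid of 2D blocks, e.g.
        //   dim3 block (16, 16);
        //   dim3 grid ((width + 15) / 16, (height + 15) / 16);
        //   scale2d <<<grid, block>>> (d_a, width, height, 0.5f);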
  • Slide 30
  • Execution Model: Ordering. Execution order is undefined. Do not assume and rely on: block 0 executing before block 1; thread 10 executing before thread 20; or any other ordering, even if you can observe it. Future implementations may break such an ordering; it's not part of the CUDA definition. Why? It gives the hardware more flexible options.
  • Slide 31
  • Programmer's view: Memory Model. There are different memories with different uses and performance; some are managed by the compiler, some must be managed by the programmer. (In the original diagram, arrows show whether reads and/or writes are possible.)
  • Slide 32
  • Execution Model Summary (for your reference): a grid of blocks of threads; 1D/2D grid of blocks, 1D/2D/3D blocks of threads. All blocks are identical: same structure and number of threads. Block execution order is undefined. Threads in the same block can synchronize and share data fast (shared memory). Threads from different blocks cannot cooperate; they communicate through global memory. Threads and blocks have IDs, which simplifies data indexing; thread IDs can be 1D, 2D, or 3D. Blocks do not migrate: each executes on the same processor, and several blocks may run on the same processor.
  • Slide 33
  • CUDA Software Architecture: layered interfaces; the runtime API (cuda*() calls), the lower-level driver API (cu*() calls), and libraries on top, e.g., FFT.
  • Slide 34
  • Reasoning about CUDA call ordering: GPU communication happens via cuda*() calls and kernel invocations, e.g., cudaMalloc, cudaMemcpy. These are asynchronous from the CPU's perspective: the CPU places a request in a CUDA queue, and requests are handled in-order. Streams allow for multiple queues; order within each queue is honored, but there is no ordering across queues. More on this much later on.
  • Slide 35
  • My first CUDA Program:

        // GPU code
        __global__ void arradd (float *a, float f, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) a[i] = a[i] + f;
        }

        // CPU code (N and SIZE defined elsewhere)
        int main ()
        {
            float h_a[N];
            float *d_a;

            cudaMalloc ((void **) &d_a, SIZE);
            cudaThreadSynchronize ();
            cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

            // execution configuration: blocks and threads_block as in the
            // execution-configuration step later in these slides
            arradd <<<blocks, threads_block>>> (d_a, 10.0, N);

            cudaThreadSynchronize ();
            cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
            CUDA_SAFE_CALL (cudaFree (d_a));
        }
  • Slide 36
  • CUDA API: Example. int a[N]; for (i = 0; i < N; i++) a[i] = a[i] + x; Steps: 1. Allocate CPU data structure; 2. Initialize data on CPU; 3. Allocate GPU data structure; 4. Copy data from CPU to GPU; 5. Define execution configuration; 6. Run kernel; 7. CPU synchronizes with GPU; 8. Copy data from GPU to CPU; 9. De-allocate GPU and CPU memory.
  • Slide 37
  • 1. Allocate CPU Data:

        float *ha;
        main (int argc, char *argv[])
        {
            int N = atoi (argv[1]);
            ha = (float *) malloc (sizeof (float) * N);
            ...
        }

    No memory is allocated on the GPU side yet. Pinned memory allocation (cudaMallocHost ()) results in faster CPU to/from GPU copies, but pinned memory cannot be paged out. More on this later. A pinned-allocation sketch follows.
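    A minimal sketch, not from the slides, of the pinned alternative using cudaMallocHost (paired with cudaFreeHost); the error handling shown is an assumption:

        #include <stdio.h>
        #include <stdlib.h>
        #include <cuda_runtime.h>

        int main (int argc, char *argv[])
        {
            int N = (argc > 1) ? atoi (argv[1]) : 1024;
            float *ha = NULL;
            // Page-locked (pinned) host allocation instead of malloc ():
            if (cudaMallocHost ((void **) &ha, sizeof (float) * N) != cudaSuccess) {
                fprintf (stderr, "pinned allocation failed\n");
                return 1;
            }
            // ... use ha exactly like malloc'ed memory; GPU copies are faster ...
            cudaFreeHost (ha);   // released with cudaFreeHost (), not free ()
            return 0;
        }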
  • Slide 38
  • 2. Initialize CPU Data (dummy):

        float *ha;
        int i;
        for (i = 0; i < N; i++)
            ha[i] = i;
  • Slide 39
  • 3. Allocate GPU Data:

        float *da;
        cudaMalloc ((void **) &da, sizeof (float) * N);

    Notice: there is no assignment side, i.e., NOT da = cudaMalloc (). The assignment is done internally; that's why we pass &da. Space is allocated in global memory on the GPU.
  • Slide 40
  • GPU Memory Allocation: the host manages GPU memory allocation.

        cudaMalloc (void **ptr, size_t nbytes)   // must explicitly cast to (void **)
        cudaMalloc ((void **) &da, sizeof (float) * N);

        cudaFree (void *ptr);
        cudaFree (da);

        cudaMemset (void *ptr, int value, size_t nbytes);
        cudaMemset (da, 0, N * sizeof (int));

    Check the CUDA Reference Manual. An error-checking sketch follows.
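    These calls return a cudaError_t; a minimal error-checking sketch, not from the slides (the CHECK macro name is made up):

        #include <stdio.h>
        #include <stdlib.h>
        #include <cuda_runtime.h>

        // Hypothetical helper: abort with a readable message if a CUDA call fails.
        #define CHECK(call)                                                   \
            do {                                                              \
                cudaError_t err = (call);                                     \
                if (err != cudaSuccess) {                                     \
                    fprintf (stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                             cudaGetErrorString (err));                       \
                    exit (1);                                                 \
                }                                                             \
            } while (0)

        // Usage:
        //   CHECK (cudaMalloc ((void **) &da, sizeof (float) * N));
        //   CHECK (cudaMemset (da, 0, sizeof (float) * N));
        //   CHECK (cudaFree (da));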
  • Slide 41
  • 4. Copy Initialized CPU Data to the GPU:

        float *da;
        float *ha;
        cudaMemcpy ((void *) da,              // DESTINATION
                    (void *) ha,              // SOURCE
                    sizeof (float) * N,       // number of bytes
                    cudaMemcpyHostToDevice);  // DIRECTION
  • Slide 42
  • Host/Device Data Transfers: the host initiates all transfers.

        cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)

    Asynchronous from the CPU's perspective: the CPU thread continues, and the transfer is processed in-order with other CUDA requests. The enum cudaMemcpyKind values are cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice.
  • Slide 43
  • 5. Define Execution Configuration: how many blocks and threads per block.

        int threads_block = 64;
        int blocks = N / threads_block;
        if (N % threads_block != 0) blocks += 1;

    Alternatively:

        blocks = (N + threads_block - 1) / threads_block;

    A dim3 sketch follows.
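    For multidimensional grids and blocks, the configuration can also be expressed with dim3; a minimal sketch, not from this slide, assuming N, width, height, and the 2D kernel are defined elsewhere:

        // 1D case, equivalent to the integers above:
        int  threads_block = 64;
        dim3 block (threads_block);                           // 64 threads per block
        dim3 grid ((N + threads_block - 1) / threads_block);  // enough blocks to cover N

        // 2D case, e.g., for a width x height image:
        dim3 block2d (16, 16);                                // 256 threads per block
        dim3 grid2d ((width  + block2d.x - 1) / block2d.x,
                     (height + block2d.y - 1) / block2d.y);
        // some_kernel_2d <<<grid2d, block2d>>> (d_a, width, height);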
  • Slide 44
  • 6. Launch Kernel & 7. CPU/GPU Synchronization. Instructs the GPU to launch blocks x threads_block threads:

        darradd <<<blocks, threads_block>>> (da, 10.0f, N);
        cudaThreadSynchronize ();   // forces the CPU to wait

    darradd: kernel name. <<<blocks, threads_block>>>: execution configuration (more on this soon). (da, 10.0f, N): arguments; 256-byte limit, no variable arguments.
  • Slide 45
  • CPU/GPU Synchronization: the CPU does not block on cuda*() calls; kernels/requests are queued and processed in-order, and control returns to the CPU immediately. This is good if there is other work to be done, e.g., preparing for the next kernel invocation. Eventually the CPU must know when the GPU is done so that it can safely copy the GPU results: cudaThreadSynchronize () blocks the CPU until all preceding cuda*() and kernel requests have completed. A usage sketch follows.
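    A minimal sketch, under the queuing model described above, of overlapping CPU work with a running kernel before synchronizing; prepare_next_input () and ha2 are hypothetical placeholders:

        darradd <<<blocks, threads_block>>> (da, x, N);  // queued; returns immediately
        prepare_next_input (ha2);                        // CPU keeps working meanwhile
        cudaThreadSynchronize ();                        // wait until the kernel has finished
        cudaMemcpy (ha, da, sizeof (float) * N,
                    cudaMemcpyDeviceToHost);             // now it is safe to read the results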
  • Slide 46
  • 8. Copy Data from GPU to CPU & 9. De-allocate Memory:

        float *da;
        float *ha;
        cudaMemcpy ((void *) ha,              // DESTINATION
                    (void *) da,              // SOURCE
                    sizeof (float) * N,       // number of bytes
                    cudaMemcpyDeviceToHost);  // DIRECTION
        cudaFree (da);
        // display or process results here
        free (ha);
  • Slide 47
  • The GPU Kernel:

        __global__ void darradd (float *da, float x, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) da[i] = da[i] + x;
        }

    blockIdx: unique block ID, numerically ascending: 0, 1, ... blockDim: dimensions of the block = how many threads it has (blockDim.x, blockDim.y, blockDim.z; unused dimensions default to 1). threadIdx: unique per-block index: 0, 1, ... within each block.
  • Slide 48
  • Array Index Calculation Example (assuming blockDim.x = 64):

        int i = blockIdx.x * blockDim.x + threadIdx.x;

        blockIdx.x = 0, threadIdx.x = 0..63  ->  i = 0..63     (a[0]..a[63])
        blockIdx.x = 1, threadIdx.x = 0..63  ->  i = 64..127   (a[64]..a[127])
        blockIdx.x = 2, threadIdx.x = 0..63  ->  i = 128..191  (a[128]..a[191])
        blockIdx.x = 3, threadIdx.x = 0      ->  i = 192       (a[192]), and so on
  • Slide 49
  • CUDA Function Declarations: __global__ defines a kernel function; it must return void and can only call __device__ functions. __device__ and __host__ can be used together on the same function, in which case two different versions are generated.

        Declaration                          Executed on   Only callable from
        __device__ float DeviceFunc()        device        device
        __global__ void  KernelFunc()        device        host
        __host__   float HostFunc()          host          host

    A small example follows.
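    A minimal sketch, not from the slides, of a function compiled for both sides by combining __host__ and __device__; the names clampf and clamp_all are made up:

        // Compiled twice: once for the CPU and once for the GPU.
        __host__ __device__ float clampf (float v, float lo, float hi)
        {
            return v < lo ? lo : (v > hi ? hi : v);
        }

        __global__ void clamp_all (float *da, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) da[i] = clampf (da[i], 0.0f, 1.0f);   // device version used here
        }

        // Host code may also call clampf (x, 0.0f, 1.0f) directly (host version).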
  • Slide 50
  • __device__ Example: add x to a[i] multiple times.

        __device__ float addmany (float a, float b, int count)
        {
            while (count--) a += b;
            return a;
        }

        __global__ void darradd (float *da, float x, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) da[i] = addmany (da[i], x, 10);
        }
  • Slide 51
  • Kernel and Device Function Restrictions: __device__ functions cannot have their address taken, e.g., f = &addmany; (*f)(); For functions executed on the device: no recursion, e.g., darradd () { ... darradd (); ... }; no static variable declarations inside the function, e.g., darradd () { static int canthavethis; }; no variable number of arguments, e.g., nothing like printf ().
  • Slide 52
  • My first CUDA Program (revisited):

        // GPU code
        __global__ void arradd (float *a, float f, int N)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) a[i] = a[i] + f;
        }

        // CPU code (N and SIZE defined elsewhere)
        int main ()
        {
            float h_a[N];
            float *d_a;

            cudaMalloc ((void **) &d_a, SIZE);
            cudaThreadSynchronize ();
            cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

            arradd <<<blocks, threads_block>>> (d_a, 10.0, N);

            cudaThreadSynchronize ();
            cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
            CUDA_SAFE_CALL (cudaFree (d_a));
        }
  • Slide 53
  • How to get high performance #1: programmer-managed scratchpad (shared) memory. Bring data in from global memory and reuse it; 16 KB, banked, accessed in parallel by 16 threads. The programmer needs to decide what to bring in and when, and which thread accesses what and when; coordination is paramount. A small sketch follows.
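    A minimal sketch, not from the slides, of staging data through the scratchpad (__shared__ memory in CUDA) with a barrier before reuse; the kernel and its access pattern are illustrative assumptions:

        // Each block stages one tile of a[] into shared memory, synchronizes,
        // then every thread reads a neighbor's element from the fast tile
        // instead of going back to global memory.
        __global__ void neighbor_sum (float *a, float *out, int N)
        {
            __shared__ float tile[64];                    // programmer-managed scratchpad
            int i = blockIdx.x * blockDim.x + threadIdx.x;

            tile[threadIdx.x] = (i < N) ? a[i] : 0.0f;    // bring data in from global memory
            __syncthreads ();                             // wait until the whole tile is loaded

            if (i < N) {
                int left = (threadIdx.x > 0) ? threadIdx.x - 1 : 0;
                out[i] = tile[threadIdx.x] + tile[left];  // reuse data from shared memory
            }
        }

        // Launch with 64 threads per block to match the tile size, e.g.:
        //   neighbor_sum <<<(N + 63) / 64, 64>>> (d_a, d_out, N);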
  • Slide 54
  • How to get high performance #2. Global memory accesses: 32 threads access memory together and can coalesce into a single reference; e.g., a[threadID] works well. Control flow: 32 threads run together; if they diverge there is a performance penalty. Texture cache: use it when you think there is locality. A coalescing sketch follows.
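    A minimal sketch, not from the slides, contrasting a coalescing-friendly access pattern with a strided one; both kernels are illustrative assumptions:

        // Coalescing-friendly: consecutive threads touch consecutive elements,
        // so the 32 threads that run together can be served by few wide references.
        __global__ void scale_coalesced (float *a, int N, float f)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) a[i] *= f;                    // a[threadID]-style access
        }

        // Strided: consecutive threads touch elements far apart,
        // so the same 32 accesses spread over many separate references.
        __global__ void scale_strided (float *a, int N, float f, int stride)
        {
            int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
            if (i < N) a[i] *= f;
        }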
  • Slide 55
  • Are GPUs really that much faster than CPUs? 50x to 200x speedups are typically reported. Recent work found that not enough effort goes into optimizing code for CPUs. But: the learning curve and expertise needed for CPUs is much larger.
  • Slide 56
  • ECE Overview: ECE research profile; personnel and budget; partnerships with industry. Our areas of expertise: Biomedical Engineering, Communications, Computer Engineering, Electromagnetics, Electronics, Energy Systems, Photonics, Systems Control. Slides from F. Najm (Chair) and T. Sargent (Research Vice Chair).
  • Slide 57
  • About our group: Computer Architecture, i.e., how to build the best possible system (best in performance, power, cost, etc.). Expertise in high-end systems, micro-architecture, and multi-processor and multi-core systems. Current research support: AMD, IBM, NSERC, Qualcomm (planned). Claims to fame: memory dependence prediction, commercially implemented and licensed; snoop filtering in IBM Blue Gene.
  • Slide 58
  • Slide 59
  • UofT-DRDC Partnership
  • Slide 60
  • Slide 61
  • Slide 62
  • Examples of industry research contracts with ECE in the past 8 years AMD Agile Systems Inc Altera ARISE Technologies Asahi Kasei Microsystems Bell Canada Bell Mobility Cellular Bioscrypt Inc Broadcom Corporation Ciclon Semiconductor Cybermation Inc Digital Predictive Systems Inc. DPL Science Eastman Kodak Electro Scientific Industries EMS Technologies Exar Corp FOX-TEK Firan Technology Group Fuji Electric Fujitsu Gennum H2Green Energy Corporation Honeywell ASCa, Inc. Hydro One Networks Inc. IBM Canada Ltd. IBM IMAX Corporation Intel Corporation Jazz Semiconductor KT Micro LG Electronics Maxim MPB Technologies Microsoft Motorola Northrop Grumman NXP Semiconductors ON Semiconductor Ontario Lottery and Gaming Corp Ontario Power Generation Inc. Panasonic Semiconductor Singapore Peraso Technologies Inc. Philips Electronics North America Redline Communications Inc. Research in Motion Ltd. Right Track CAD Robert Bosch Corporation Samsung Thales Co., Ltd Semiconductor Research Corporation Siemens Aktiengesellschaft Sipex Corporation STMicroelectronics Inc. Sun Microsystems of Canada Inc. Telus Mobility Texas Instruments Toronto Hydro-Electric System Toshiba Corporation Xilinx Inc.
  • Slide 63
  • Eight Research Groups: 1. Biomedical Engineering; 2. Communications; 3. Computer Engineering; 4. Electromagnetics; 5. Electronics; 6. Energy Systems; 7. Photonics; 8. Systems Control.
  • Slide 64
  • Computer Engineering Group: Human-Computer Interaction (Willy Wong, Steve Mann); Multi-sensor information systems (Parham Aarabi); Computer Hardware (Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson); Computer Architecture (Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger); Computer Security (David Lie, Ashvin Goel).
  • Slide 65
  • Biomedical Engineering: Neurosystems (Berj L. Bardakjian, Roman Genov, Willy Wong, Hans Kunov, Moshe Eizenman); Rehabilitation (Milos Popovic, Tom Chau); Medical Imaging (Michael Joy, Adrian Nachman, Richard Cobbold, Ofer Levi); Proteomics (Brendan Frey, Kevin Truong).
  • Slide 66
  • Communications Group: study of the principles, mathematics, and algorithms that underpin how information is encoded, exchanged, and processed. Three sub-groups: 1. Networks; 2. Signal Processing; 3. Information Theory.
  • Slide 67
  • Sequence Analysis
  • Slide 68
  • Image Analysis and Computer Vision: computer vision and graphics; embedded computer vision; pattern recognition and detection.
  • Slide 69
  • Networks
  • Slide 70
  • Quantum Cryptography and Computing
  • Slide 71
  • Computer Engineering: System Software (Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li); Computer-Aided Design of Circuits (Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose).
  • Slide 72
  • Electronics Group (UofT-IBM Partnership): 14 active professors; largest electronics group in Canada. Breadth of research topics: electronic device modelling, semiconductor technology, VLSI CAD and systems, FPGAs, DSP and mixed-mode ICs, biomedical microsystems, high-speed and mm-wave ICs and SoCs. Lab for (on-wafer) SoC and IC testing through 220 GHz.
  • Slide 73
  • Intelligent Sensory Microsystems: mixed-signal VLSI circuits (low-power, low-noise signal processing, computing, and ADCs); on-chip micro-sensors (electrical, chemical, optical). Project examples: brain-chip interfaces, on-chip biochemical sensors, CMOS imagers.
  • Slide 74
  • mm-Wave and 100+ GHz systems on chip: modelling mm-wave and noise performance of active and passive devices past 300 GHz; 60-120 GHz multi-gigabit data rate phased-array radios; single-chip 76-79 GHz automotive radar; 170 GHz transceiver with on-die antennas.
  • Slide 75
  • Electromagnetics Group. Metamaterials (from microwaves to optics): super-resolving lenses for imaging and sensing, small antennas, multiband RF components, CMOS phase shifters. Electromagnetics of high-speed circuits: signal integrity in high-speed digital systems; microwave integrated circuit design, modeling, and characterization. Computational electromagnetics. Interaction of electromagnetic fields with living tissue. Antennas for telecom and wireless systems: reflectarrays, wave electronics, integrated antennas, controlled-beam antennas, adaptive and diversity antennas.
  • Slide 76
  • METAMATERIALS (MTMs): super-lens capable of resolving details down to ...; small and broadband antennas; scanning antennas with CMOS MTM chips.
  • Slide 77
  • Computational Electromagnetics: fast CAD for RF/optical structures; modeling of metamaterials (plasmonic and left-handed media, leaky-wave antennas). Examples: microstrip spiral inductor, optical power splitter.
  • Slide 78
  • Energy Systems Group. Power electronics: high-power (> 1.2 MW) converters, their modeling, control, and digital control realization. Micro power grids: converters for distributed resources, dc distribution systems, and HVdc systems. Low-power electronics: integrated power supplies and power-management systems-on-chip for low-power electronics (computers, cell phones, PDAs, MP3 players, body implants); harvesting energy from humans.
  • Slide 79
  • Energy Systems Research examples: IC for cell phone power supplies (U of T); matrix converter for a micro-turbine generator; voltage control system for wind power generators.
  • Slide 80
  • Photonics Group
  • Slide 81
  • Slide 82
  • Slide 83
  • Photonics Group: Bio-Photonics
  • Slide 84
  • Systems Control Group: basic and applied research in control engineering; world-leading group in control theory. Examples: optical signal-to-noise ratio optimization with game theory; erbium-doped fibre amplifier design; analysis and design of digital watermarks for authentication; nonlinear control theory applied to magnetic levitation and micro-positioning systems; distributed control of mobile autonomous robots (formations, collision avoidance).