© 2010 Michael Boyer 1
Harnessing the Power of GPUs for Non-Graphics Applications
Michael Boyer
Department of Computer Science
University of Virginia
Advisor: Kevin Skadron
© 2010 Michael Boyer 2
Outline
• GPU architecture
• Programming GPUs using CUDA
• Case study: Leukocyte Tracking
• Current work: CPU-GPU Task Sharing
© 2010 Michael Boyer 3
Graphics Processors
• Graphics Processing Units (GPUs) are designed specifically for graphics rendering applications

[Image of a rendered game scene; courtesy of GameSpot]
© 2010 Michael Boyer 4
Graphics Applications
• Graphics applications involve applying the same operation to many pieces of data
• Application characteristics:
  – Massively parallel
  – Only aggregate performance matters
© 2010 Michael Boyer 5
CPU vs. GPU: Architectural Difference 1

[Diagram: a CPU core contains fetch/decode logic, execution units, a register file, out-of-order (OOO) logic, a branch predictor, a data cache, and a memory pre-fetcher; a GPU core contains only fetch/decode logic, execution units, and a register file]

Avoid structures that only improve single-thread performance
© 2010 Michael Boyer 6
CPU vs. GPU: Architectural Difference 2

[Diagram: CPU core as before; in the GPU core, a single fetch/decode unit drives a thread group running on several execution units, each with its own register file]

Amortize the overhead of control logic across multiple execution units (SIMD processing)
© 2010 Michael Boyer 7
CPU vs. GPU: Architectural Difference 3

[Diagram: CPU core as before; in the GPU core, multiple thread groups (1-4), each with its own register file, share the same fetch/decode and execution units]

Use multiple groups of threads to keep execution units busy and hide memory latency
© 2010 Michael Boyer 8
CPU vs. GPU: Architectural Difference 4

[Diagram: the CPU replicates its large core 4 times (cores 1-4); the GPU replicates its small core 30 times (cores 1-30)]

Replicate cores to leverage more parallelism
© 2010 Michael Boyer 9
CPU vs. GPU: Architectural Differences
• Summary: take advantage of abundant parallelism
  – Lots of threads, so focus on aggregate performance
  – Parallelism in space:
    • SIMD processing in each core
    • Many independent SIMD cores across the chip
  – Parallelism in time:
    • Multiple SIMD groups in each core
© 2010 Michael Boyer 10
CPU vs. GPU: Peak Performance

Processor Type            CPU                          GPU
Product                   Intel Xeon W5590 (Nehalem)   AMD Radeon HD 5870
Throughput (GFLOPs)       107                          2,720
Memory Bandwidth (GB/s)   32                           154
Cost                      $1,700                       $450

• Note that these are peak numbers
• What we really care about is performance on real-world applications
© 2010 Michael Boyer 11
General-Purpose Computing on GPUs
• Lots of recent interest in using GPUs to run non-graphics applications (GPGPU)
• Why GPUs? Why now?
  – Recent increases in performance via parallelism
  – Recent increases in programmability
  – Ubiquity in multiple market segments
• Old approach: graphics languages
• New approach: GPGPU languages
  – OpenCL, CUDA
© 2010 Michael Boyer 12
CUDA
• Programming model for running general-purpose applications on NVIDIA GPUs
• Extension to the C programming language
• GPU is a co-processor:
  – Main program runs on the CPU
  – Large computations (kernels) are offloaded to the GPU
  – CPU and GPU have separate memory, so data must be transferred back and forth
© 2010 Michael Boyer 13
CUDA: Typical Program Structure

void function(…) {
    Allocate memory on the GPU
    Transfer input data to the GPU
    Launch kernel on the GPU
    Transfer output data to CPU
}

__global__ void kernel(…) {
    Code executed on the GPU goes here…
}

[Diagram: CPU with CPU memory and GPU with GPU memory]
© 2010 Michael Boyer 14
CUDA: Typical Program Transformation

for (i = 0; i < N; i++) {
    Process array element i
}

Body of loop becomes body of kernel:

__global__ void kernel(…) {
    Determine this thread’s value of i
    Process array element i
}
© 2010 Michael Boyer 15
CUDA Kernel
• Scalar program invoked across many threads
  – Typically one thread per data element
• Overall computation decomposed into a grid of thread blocks
  – Thread blocks are independent and cannot communicate (with some exceptions)
  – Threads within the same block can communicate

[Diagram: a grid of thread blocks 1-5]
© 2010 Michael Boyer 16
Simple Example: Vector Addition

C = A + B

A:  1  2  3  4  5  6  7  8
B:  9 10 11 12 13 14 15 16
C: 10 12 14 16 18 20 22 24
© 2010 Michael Boyer 17
C Code

float *CPU_add_vectors(float *A, float *B, int N) {
    // Allocate memory for the result
    float *C = (float *) malloc(N * sizeof(float));

    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Return the result
    return C;
}
© 2010 Michael Boyer 18
CUDA Kernel
// GPU kernel that computes the vector sum C = A + B
// (each thread computes a single value of the result)
__global__ void kernel(float *A, float *B, float *C, int N) {
    // Determine which element this thread is computing
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Compute a single element of the result vector
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
© 2010 Michael Boyer 19
CUDA Host Code

float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
    // Allocate GPU memory for the inputs and the result
    int vector_size = N * sizeof(float);
    float *A_GPU, *B_GPU, *C_GPU;
    cudaMalloc((void **) &A_GPU, vector_size);
    cudaMalloc((void **) &B_GPU, vector_size);
    cudaMalloc((void **) &C_GPU, vector_size);

    // Transfer the input vectors to GPU memory
    cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);

    // Execute the kernel to compute the vector sum on the GPU
    int num_blocks = ceil((double) N / (double) THREADS_PER_BLOCK);
    kernel <<< num_blocks, THREADS_PER_BLOCK >>> (A_GPU, B_GPU, C_GPU, N);

    // Transfer the result vector from the GPU to the CPU
    float *C_CPU = (float *) malloc(vector_size);
    cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);
    return C_CPU;
}
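For brevity, the slide's host code omits cleanup and error checking. A minimal sketch of what could be added at the end of GPU_add_vectors, before returning C_CPU (an assumption about good practice, not part of the original code; it reuses the A_GPU, B_GPU, and C_GPU names from above and needs <stdio.h> for fprintf):

    // Check that the kernel launched and completed successfully
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
    }

    // Free the GPU buffers once the result has been copied back
    cudaFree(A_GPU);
    cudaFree(B_GPU);
    cudaFree(C_GPU);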
© 2010 Michael Boyer 20
Example Program Output

./vector_add 50,000,000

GPU: Transfer to GPU:   0.236 sec
     Kernel execution:  0.005 sec
     Transfer from GPU: 0.152 sec
     Total:             0.404 sec
CPU: 0.136 sec

Execution: GPU outperformed CPU by 27.2x
Overall:   CPU outperformed GPU by 2.97x

Vector addition does not do enough work per memory operation to justify offloading!
© 2010 Michael Boyer 21
Case Study: Leukocyte Tracking
© 2010 Michael Boyer 22
Leukocyte Tracking
• Important for evaluating inflammatory drugs
• Velocity measured by tracking leukocytes through multiple frames
• Current approaches:
  – Manual analysis: 1 minute of video takes tens of hours
  – MATLAB: 1 minute of video takes 5 hours
© 2010 Michael Boyer 23
Goal: Leverage CUDA and a GPU to accelerate leukocyte tracking to near real-time speeds
© 2010 Michael Boyer 24
Acceleration
1. Translation: convert MATLAB code to C
2. Parallelization:
   – OpenMP for multi-core CPU
   – CUDA for GPU
• Experimental setup:
  – CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770
  – GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)
© 2010 Michael Boyer 25
Tracking Algorithm

Inputs: Video frame
        Location of cells in previous frame
Output: Location of cells in current frame

For each cell:
  – Extract sub-image near cell’s old location
  – Compute MGVF matrix over sub-image (→ 99.8% of runtime)
  – Evolve active contour using MGVF matrix
© 2010 Michael Boyer 26
Computing the MGVF Matrix
• Motion Gradient Vector Flow
• MGVF matrix is approximated via an iterative solution procedure

[Images: sub-image near cell and the resulting MGVF matrix]
© 2010 Michael Boyer 27
MGVF Pseudo-code

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
© 2010 Michael Boyer 28
Naïve CUDA Implementation
[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x]

• Kernel is called ~50,000 times per frame
• Amount of work per call is small
• Runtime dominated by CUDA overheads:
  – Memory allocation, memory copying, kernel call overhead
© 2010 Michael Boyer 29
Kernel Overhead
• Kernel calls are not cheap!
  – Overhead of one kernel call: 9 microseconds
  – Overhead of one CPU function call: 3 nanoseconds
  – A kernel call is 3,000 times more expensive
• Heaviside kernel:
  – 27% of kernel runtime due to computation
  – 73% of kernel runtime due to kernel overhead
© 2010 Michael Boyer 30
Lesson 1: Reduce Kernel Overhead
• Increase amount of work per kernel call (see the sketch below)
  – Decrease total number of kernel calls
  – Amortize overhead of each kernel call across more computation
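A minimal sketch of the idea, using stand-in element-wise operations rather than the actual leukocyte-tracking kernels: several tiny kernels that each pay their own launch overhead are fused into one larger kernel, so the overhead is paid once.

    // Before: three tiny kernels launched one after another, each paying launch overhead:
    //     scale_kernel<<<blocks, threads>>>(data, N);
    //     offset_kernel<<<blocks, threads>>>(data, N);
    //     clamp_kernel<<<blocks, threads>>>(data, N);

    // After: one larger kernel performs all three steps per element,
    // so the launch overhead is paid once instead of three times.
    __global__ void fused_kernel(float *data, int N) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N) {
            float v = data[i];
            v = v * 2.0f;                        // scale
            v = v + 1.0f;                        // offset
            v = fminf(fmaxf(v, 0.0f), 255.0f);   // clamp
            data[i] = v;
        }
    }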
© 2010 Michael Boyer 31
Larger Kernel Implementation
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
© 2010 Michael Boyer 32
Larger Kernel Implementation

[Chart: runtime breakdown. Kernel execution: 9%, memory copying: 15%, memory allocation: 71%]

[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x, Larger Kernel: 6.3x]
© 2010 Michael Boyer 33
Memory Allocation Overhead
[Log-log chart: time per call (microseconds) vs. megabytes allocated per call, comparing malloc (CPU memory) and cudaMalloc (GPU memory)]
© 2010 Michael Boyer 34
Lesson 2: Reduce Memory Management Overhead
• Reduce the number of memory allocations (see the sketch below)
  – Allocate memory once and reuse it throughout the application
  – If memory size is not known a priori, estimate and only re-allocate if the estimate is too small
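A minimal sketch of that pattern; get_scratch_buffer is a hypothetical helper, not part of the original code. A single GPU buffer is grown only when a request exceeds the current capacity and is reused otherwise.

    #include <cuda_runtime.h>
    #include <stddef.h>

    // Hypothetical reusable GPU scratch buffer (illustration only).
    static void  *scratch     = NULL;
    static size_t scratch_cap = 0;

    // Return a GPU buffer of at least `bytes` bytes, re-allocating only
    // when the current buffer is too small.
    void *get_scratch_buffer(size_t bytes) {
        if (bytes > scratch_cap) {
            if (scratch != NULL) {
                cudaFree(scratch);
            }
            cudaMalloc(&scratch, bytes);
            scratch_cap = bytes;
        }
        return scratch;
    }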
© 2010 Michael Boyer 35
Reduced Allocation Implementation

[Chart: runtime breakdown. Kernel execution: 31%, memory copying: 56%, memory allocation: 3%]

[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x, Larger Kernel: 6.3x, Reduced Allocation: 25.4x]
© 2010 Michael Boyer 36
Memory Transfer Overhead
[Log-log chart: transfer time (milliseconds) vs. megabytes per transfer, for CPU-to-GPU and GPU-to-CPU transfers]
© 2010 Michael Boyer 37
Lesson 3: Reduce Memory Transfer Overhead
• If the CPU operates on values produced by the GPU:
  – Move the operation to the GPU (see the sketch below)
  – May improve performance even if the operation itself is slower on the GPU

[Diagram: timeline comparing (a) values produced by the GPU are transferred to the CPU, operated on, and transferred back to be consumed by the GPU, versus (b) the same operation performed directly on the GPU with no transfers]
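In the leukocyte application, the value in question is the convergence criterion, a reduction over the MGVF matrix. A minimal sketch of a block-level sum reduction that keeps the work on the GPU (not the talk's actual kernel; BLOCK_SIZE and the partial-sums buffer are illustrative assumptions):

    #define BLOCK_SIZE 256

    // Each block sums BLOCK_SIZE elements of `in` and writes one partial sum
    // to `out`; a second, much smaller step finishes the reduction, so the
    // full array never has to be transferred back to the CPU.
    // Assumes blockDim.x == BLOCK_SIZE and BLOCK_SIZE is a power of two.
    __global__ void block_sum(const float *in, float *out, int n) {
        __shared__ float sums[BLOCK_SIZE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        sums[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction within the block
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride) {
                sums[threadIdx.x] += sums[threadIdx.x + stride];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            out[blockIdx.x] = sums[0];
        }
    }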
© 2010 Michael Boyer 38
GPU Reduction Implementation

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
© 2010 Michael Boyer 39
GPU Reduction Implementation

[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x, Larger Kernel: 6.3x, Reduced Allocation: 25.4x, GPU Reduction: 60.7x]

[Chart: runtime breakdown. Kernel execution: 80%, memory copying: 1%, memory allocation: 7%]
© 2010 Michael Boyer 40
Persistent Thread Block

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
© 2010 Michael Boyer 41
Persistent Thread Block
• Problem: need a global memory fence
  – Multiple thread blocks compute the MGVF matrix
  – Thread blocks cannot communicate with each other
  – So each iteration requires a separate kernel call
• Solution: compute the entire matrix in one thread block (see the sketch below)
  – Arbitrary number of iterations can be computed in a single kernel call
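A minimal sketch of the pattern, not the actual MGVF kernel: a single thread block iterates to convergence inside one launch, using __syncthreads() where a multi-block version would need a global memory fence. The per-element update is a stand-in.

    // Launched as: persistent_solver<<<1, 256>>>(m, n, max_iters);
    __global__ void persistent_solver(float *m, int n, int max_iters) {
        __shared__ int converged;

        for (int iter = 0; iter < max_iters; iter++) {
            if (threadIdx.x == 0) converged = 1;
            __syncthreads();

            // Each thread updates its elements (stand-in for the MGVF update).
            for (int i = threadIdx.x; i < n; i += blockDim.x) {
                float old_val = m[i];
                float new_val = 0.5f * (old_val + 1.0f);
                m[i] = new_val;
                if (fabsf(new_val - old_val) > 1e-4f) converged = 0;
            }
            __syncthreads();   // block-wide barrier replaces the global memory fence

            if (converged) break;
        }
    }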
© 2010 Michael Boyer 42
Persistent Thread Block: Example

[Diagram: in the canonical CUDA approach (1-to-1 mapping between threads and data elements), the MGVF matrix is divided among thread blocks 1-9; in the persistent thread block approach, the entire MGVF matrix is assigned to thread block 1]
© 2010 Michael Boyer 43
Persistent Thread Block: Example

[Diagram: in the canonical CUDA approach, cell 1's computation is spread across every SM and cells are processed one at a time; in the persistent thread block approach, each cell (1-9) is assigned to its own SM, so multiple cells are processed concurrently]

SM = Streaming Multiprocessor (GPU core)
© 2010 Michael Boyer 44
Lesson 4: Avoid Global Memory Fences
• Confine dependent computations to a single thread block
  – Execute an iterative algorithm until convergence in a single kernel call
  – Only efficient if there are multiple independent computations
© 2010 Michael Boyer 45
Persistent Thread Block Implementation
[Chart: speedup over MATLAB. C: 2.0x, C + OpenMP: 7.7x, Naïve CUDA: 0.8x, Larger Kernel: 6.3x, Reduced Allocation: 25.4x, GPU Reduction: 60.7x, Persistent Thread Block: 211.3x (27x over OpenMP)]
© 2010 Michael Boyer 46
Absolute Performance
[Chart: frames per second (FPS). MATLAB: 0.11, C: 0.22, C + OpenMP: 0.83, CUDA: 21.6]
© 2010 Michael Boyer 47
Video Example
© 2010 Michael Boyer 48
Conclusions
• CUDA overheads can be significant bottlenecks
• CUDA provides enormous performance improvements for leukocyte tracking
  – 200x over MATLAB
  – 27x over OpenMP
• Processing time reduced from > 4.5 hours to < 1.5 minutes
• Real-time analysis feasible in the near future

M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), May 2009.
© 2010 Michael Boyer 49
Current Work: CPU-GPU Task Sharing
© 2010 Michael Boyer 50
CPU-GPU Task Sharing
• Offloading decision is generally considered to be binary

[Diagram: a task is sent either to the GPU or to the CPU]
© 2010 Michael Boyer 51
CPU-GPU Task Sharing
• Offload decision does not need to be binary!
  – Dividing a task between the CPU and GPU can provide improved performance over either device alone (see the sketch below)

[Diagram: the same task split between the GPU and the CPU]
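A minimal sketch of the idea applied to the earlier vector-addition example; the split fraction, function name, and OpenMP loop are illustrative assumptions, not the talk's actual system. The kernel, GPU buffers, and THREADS_PER_BLOCK are the ones from the vector-addition slides.

    // Add two vectors, giving the first `gpu_fraction` of the elements to the
    // GPU and the rest to the CPU, with both devices working concurrently.
    void shared_add(float *A_CPU, float *B_CPU, float *C_CPU,
                    float *A_GPU, float *B_GPU, float *C_GPU,
                    int N, float gpu_fraction) {
        int n_gpu = (int)(N * gpu_fraction);   // elements assigned to the GPU

        // Launch the GPU's share asynchronously...
        if (n_gpu > 0) {
            int num_blocks = (n_gpu + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
            kernel <<< num_blocks, THREADS_PER_BLOCK >>> (A_GPU, B_GPU, C_GPU, n_gpu);
        }

        // ...while the CPU works on the remaining elements in parallel.
        #pragma omp parallel for
        for (int i = n_gpu; i < N; i++) {
            C_CPU[i] = A_CPU[i] + B_CPU[i];
        }

        // Copy back only the GPU's portion of the result (also waits for the kernel).
        cudaMemcpy(C_CPU, C_GPU, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
    }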
© 2010 Michael Boyer 52
Theoretical Performance
[Chart: performance normalized to the best single device vs. the ratio of GPU to CPU performance (0.01 to 100), for four strategies: GPU only, CPU only, CPU+GPU (equal sharing), and CPU+GPU (optimal sharing)]
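A sketch of where these curves come from (my derivation under simple assumptions, not from the slides): normalize CPU throughput to 1 and let the GPU be r times faster. The best single device achieves max(1, r). Optimal sharing uses both devices fully, so its normalized performance is

    (1 + r) / max(1, r) = 1 + min(r, 1/r),

which peaks at 2 when r = 1 and approaches 1 when the devices are very unbalanced. Equal sharing gives half the work to each device and finishes when the slower one does, so its normalized performance is

    2 min(1, r) / max(1, r) = 2 min(r, 1/r),

which also peaks at 2 at r = 1 but falls below 1 (worse than no sharing) once one device is more than twice as fast as the other.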
© 2010 Michael Boyer 53
Research Goal
1. Given an input program written in CUDA or OpenCL, automatically generate a program that can execute on the CPU and GPU concurrently
2. Automatically determine the best division of work:
   – When beneficial, share work between CPU and GPU
   – Otherwise, execute on CPU or GPU exclusively
   – Optimal decision can change at runtime:
     • With different inputs
     • With contention
© 2010 Michael Boyer 54
Proposed System

OpenCL code → Source-to-source Translation Framework → Modified OpenCL code → OpenCL Compiler → CPU/GPU binary

The translation framework transforms all GPU memory allocations, memory transfers, and kernel launches into a form supporting concurrent CPU-GPU execution.
© 2010 Michael Boyer 55
Potential Problems
• One version of the kernel for multiple devices
  – Optimizations for the GPU may hurt performance on the CPU, and vice versa
• Possible (but rare) for thread blocks to communicate with each other
  – Do we try to support this?
• Statically predicting data access patterns can be hard (or even impossible for some applications)
© 2010 Michael Boyer 56
Data Sharing
• If we cannot predict data access patterns statically, then the CPU and the GPU must have a consistent view of memory

[Diagram: 1) computation split between the GPU and the CPU, 2) full data transfer between them]
© 2010 Michael Boyer 57
Data Sharing (2)
• If we can predict data access patterns statically, then we can minimize the data transfer overhead

[Diagram: 1) computation split between the GPU and the CPU, 2) only the needed data transferred]
© 2010 Michael Boyer 58
Preliminary Results (HotSpot)
[Chart: execution time (seconds) vs. percent of computation on GPU (0-100%), comparing static analysis, dynamic analysis, and no sharing]
© 2010 Michael Boyer 59
Conclusions
• GPUs are designed to provide good performance on graphics workloads
  – But they have evolved to support any workload with abundant parallelism
• GPUs can provide large performance improvements
  – But we need to account for the overheads involved in order to take full advantage
• Allowing the CPU and GPU to work together can provide an even larger performance improvement
© 2010 Michael Boyer 60
Acknowledgements
• Funding provided by:
  – NSF grant IIS-0612049
  – SRC grant 1607.001
  – NVIDIA research grant
  – GRC AMD/Mahboob Kahn Ph.D. fellowship
• Equipment donated by NVIDIA
© 2010 Michael Boyer 61
BACKUP
© 2010 Michael Boyer 62
3D Rendering APIs
• High-level abstractions for rendering geometry

Graphics Application → Vertex Program → Rasterization → Fragment Program → Display

Courtesy of D. Luebke, NVIDIA
© 2010 Michael Boyer 63
CUDA: Abstractions
1. Kernel function
   – Mapped onto a grid of thread blocks
2. Scratchpad memory
   – For sharing data within a thread block
3. Barrier synchronization
   – For synchronizing within a thread block
© 2010 Michael Boyer 64
Kernel Function
__global__ void kernel(int *in, int *out) {
    // Determine this thread’s index
    int i = threadIdx.x;

    // Add one to the input value
    out[i] = in[i] + 1;
}
© 2010 Michael Boyer 65
Grid of Thread Blocks
• Grid: 2-dimensional, ≤ 4.3 billion blocks
• Thread block: 3-dimensional, ≤ 512 threads
© 2010 Michael Boyer 66
Launching a Kernel
int num_threads = ...;
int threads_per_block = 256;

// Determine how many thread blocks are needed
// (using either of the two methods shown below)
int num_blocks = ceil((double) num_threads / (double) threads_per_block);
int num_blocks = (num_threads + threads_per_block - 1) / threads_per_block;

// Make structures for grid and thread block dimensions
dim3 grid(num_blocks, 1);
dim3 thread_block(threads_per_block, 1, 1);

// Launch the kernel
kernel <<< grid, thread_block >>> (in, out);
© 2010 Michael Boyer 67
Scratchpad Memory
• Each multiprocessor has 16 KB of software-controlled shared memory
• Variables declared “__shared__” get mapped into this memory
• Values can only be shared among threads within the same thread block
© 2010 Michael Boyer 68
Scratchpad Memory: Example
__global__ void kernel() {
    int i = threadIdx.x;

    // Compute some function
    int v = foo(i);

    // Write the value into shared memory
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = v;

    // Use the shared values
    ...
}
© 2010 Michael Boyer 69
Barrier Synchronization
• __syncthreads() function
• Each thread waits for all other threads in the thread block
• All values written by every thread are now visible to all other threads
© 2010 Michael Boyer 70
Barrier Synchronization: Example

__global__ void kernel(float *out, int N) {
    int i = threadIdx.x;

    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = foo(i);

    // Wait to ensure all values have been written
    __syncthreads();

    // Compute average of two values
    out[i] = (values[i] + values[(i + 1) % N]) / 2.0f;
}
© 2010 Michael Boyer 71
CUDA Overheads
• Driver initialization: 0.14 seconds
• Kernel launch: 13 μs (see the measurement sketch below)
• GPU memory allocation and deallocation: orders of magnitude slower than on the CPU
• Memory transfer: 15 μs + 1 ms/MB
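Overheads like these can be measured directly. A minimal sketch of one way to estimate kernel-launch overhead using an empty kernel and CUDA events; the loop count and names are illustrative assumptions, not from the slides.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void empty_kernel() {}

    int main() {
        const int launches = 10000;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Warm up so driver initialization is not included in the measurement.
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int i = 0; i < launches; i++) {
            empty_kernel<<<1, 1>>>();
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Average launch overhead: %f microseconds\n", 1000.0f * ms / launches);
        return 0;
    }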
© 2010 Michael Boyer 72
Acceleration using CUDA

[Diagram: CPU-side program flow (allocate GPU memory, transfer input data, launch kernel, transfer results, free GPU memory) with the CUDA kernel executing on the GPU]

Step 1: Determine which code to offload to the GPU as a CUDA kernel
Step 2: Write the CPU-side CUDA code
Step 3: Write and optimize the GPU kernel
© 2010 Michael Boyer 73
Performance Issues
• Branch divergence
• Memory coalescing
• Key concept: warp
  – Group of threads that execute concurrently
  – In current hardware, warp size is 32 threads
© 2010 Michael Boyer 74
Branch Divergence
• Remember: hardware is SIMD
• What if threads in the same warp follow two different paths?
• Solution: entire warp executes both paths (see the example below)
  – Unneeded values are simply ignored
  – Performance can suffer with many divergent branches
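A minimal illustration (hypothetical kernel, not from the slides): because threadIdx.x % 2 differs between neighboring threads in a warp, the warp executes both the if and else paths, masking off the threads that did not take each one.

    __global__ void divergent_kernel(float *data) {
        int i = threadIdx.x;

        // Threads in the same warp take different paths, so the warp
        // executes BOTH branches, one after the other.
        if (i % 2 == 0) {
            data[i] = data[i] * 2.0f;   // even threads
        } else {
            data[i] = data[i] + 1.0f;   // odd threads
        }
    }

    // A divergence-free alternative would branch on the warp index (i / 32),
    // so that all threads in a warp follow the same path.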
© 2010 Michael Boyer 75
Memory Coalescing
• Threads in the same half-warp access memory together (see the example below)
• If all threads access successive memory locations:
  – All of the accesses are combined (coalesced)
  – Result: significantly improved memory performance
• Otherwise:
  – Each thread accesses memory separately
  – Result: significantly reduced memory performance
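A minimal illustration (hypothetical kernels, not from the slides) contrasting an access pattern that coalesces with one that does not; stride is assumed to be greater than 1.

    // Coalesced: consecutive threads read consecutive addresses, so the
    // half-warp's loads are combined into a small number of transactions.
    __global__ void coalesced_copy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Non-coalesced: consecutive threads read addresses `stride` elements apart,
    // so each load becomes a separate memory transaction.
    __global__ void strided_copy(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];
    }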
© 2010 Michael Boyer 76
Memory Coalescing: Examples
[Diagram: coalesced accesses (threads reading consecutive addresses) vs. a non-coalesced access pattern]
© 2010 Michael Boyer 77
Parallelization Granularity

[Diagram: CPU with CPU memory and GPU with GPU memory, connected by memory transfers]
© 2010 Michael Boyer 78
Kernel Overhead Revisited
• Overhead depends on calling pattern:
  – One at a time (synchronous): 9 microseconds
  – Back-to-back (asynchronous): 3 microseconds

[Diagram: synchronous kernel calls separated by implicit synchronization vs. asynchronous kernel calls issued back-to-back]
© 2010 Michael Boyer 79
Lesson 1 Revisited: Reduce Kernel Overhead
• Increase amount of work per kernel call
  – Decrease total number of kernel calls
  – Amortize overhead of each kernel call across more computation
• Launch kernels back-to-back (see the sketch below)
  – Kernel calls are asynchronous: avoid explicit or implicit synchronization between kernel calls
  – Overlap kernel execution on the GPU with driver access on the CPU
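A minimal sketch of the back-to-back pattern; the kernels step_a and step_b and the variable names are illustrative assumptions, not from the slides. Launches are queued without any intervening synchronization, and the CPU waits only once, when the result is actually needed.

    // Queue many kernels back-to-back; each launch returns immediately,
    // so the CPU keeps feeding the driver while the GPU executes.
    for (int iter = 0; iter < num_iterations; iter++) {
        step_a <<< num_blocks, threads_per_block >>> (data, N);
        step_b <<< num_blocks, threads_per_block >>> (data, N);
        // No cudaMemcpy or cudaDeviceSynchronize here -- either one would
        // force the CPU to wait and serialize the launches.
    }

    // Synchronize only once, when the final result is needed on the CPU.
    cudaMemcpy(result_CPU, data, N * sizeof(float), cudaMemcpyDeviceToHost);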