SIGGRAPH 2013 Shaping the Future of Visual Computing
Multi-GPU Programming for Visual Computing Wil Braithwaite - NVIDIA Applied Engineering
Talk Outline
Compute and Graphics API Interoperability.
— The basics.
— Interoperability methodologies & driver optimizations.
Interoperability at a system level.
— Fine-grained control for managing scaling.
Demo.
Peer-to-peer & UVA, NUMA considerations.
Scaling beyond one-to-one (single-compute : single-render).
— Enumerating graphics & compute resources.
— Many-to-many (multiple-compute : multiple-render).
NVIDIA Maximus Initiative
Mixing Tesla and Quadro GPUs
— Tight integration with OEMs and system integrators.
— Optimized driver paths for GPU–GPU communication.
Diagram: a traditional workstation visualizes locally and simulates on a cluster; an NVIDIA® Maximus™ workstation pairs simulation and visualization GPUs (Simulate0–3 with Visualize0–2) in a single machine.
Multi-GPU Compute+Graphics Use Cases
Image processing
— Multiple compute GPUs and a low-end display GPU for blitting.
Mixing rasterization with compute
— Polygonal rendering done in OpenGL and input to compute for further processing.
Visualization for HPC simulation
— Numerical simulation distributed across multiple compute GPUs, possibly on a remote supercomputer.
Images: seismic interpretation (NVIDIA IndeX); real-time CFD visualization (Morpheus Medical); interactive fluid dynamics.
Compute/Graphics interoperability
Set up the objects in the graphics context.
Register the objects with the compute context.
Map / unmap the objects from the compute context.
Diagram: the application drives both contexts; through the interop API, a CUDA buffer pairs with a GL/DX buffer object and a CUDA array pairs with a texture object.
Code Sample – Simple buffer interop
Setup and Registration of Buffer Objects:
GLuint vboId;
cudaGraphicsResource_t vboRes;
// OpenGL buffer creation...
glGenBuffers(1, &vboId);
glBindBuffer(GL_ARRAY_BUFFER, vboId);
glBufferData(GL_ARRAY_BUFFER, vboSize, 0, GL_STREAM_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
// Registration with CUDA.
cudaGraphicsGLRegisterBuffer(&vboRes, vboId, cudaGraphicsRegisterFlagsNone);
Code Sample – Simple buffer interop
Mapping between contexts:
float* vboPtr;
while (!done)
{
cudaGraphicsMapResources(1, &vboRes, 0);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
runCUDA(vboPtr, 0);
cudaGraphicsUnmapResources(1, &vboRes, 0);
runGL(vboId);
}
Resource Behavior: Single-GPU
The resource is shared.
The context switch is fast and independent of data size.
Diagram: a single GPU holds the data; the CUDA and GL contexts both access it through the interop API.
Code Sample – Simple buffer interop
Mapping between contexts (same loop as above): the context switch happens when the map and unmap commands are processed by the driver.
Timeline: Single-GPU
Driver-interop
Timeline diagram: on one GPU, the compute work (CUDA kernel) and the render work (GL) for each frame simply interleave on the same device, separated by map/unmap.
Code Sample – Simple buffer interop
Adding synchronization for analysis:

float* vboPtr;
while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    runGL(vboId);
}
Timeline: Single-GPU
Driver-interop, synchronous*
— (we synchronize after the map and unmap calls)
Timeline diagram: the synchronize after “map” waits for GL to finish before the context switch; the synchronize after “unmap” waits for CUDA (and GL) to finish.
Resource Behavior: Multi-GPU
Each GPU has a copy of the resource.
The context switch is dependent on data size, because the driver must copy the data.
Diagram: the CUDA context on the compute-GPU and the GL context on the render-GPU each hold a copy of the data, kept in sync through the interop API.
On “map” the driver copies the resource to the compute-GPU; on “unmap” it copies it back to the render-GPU.
Timeline: Multi-GPU
Driver-interop, synchronous*
— SLOWER! (Tasks are serialized.)
Timeline diagram: resources are mirrored and synchronized across the GPUs (GLtoCU on map, CUtoGL on unmap); “map” has to wait for GL to complete before it synchronizes the resource.
Interoperability Methodologies
READ-ONLY
— GL produces... and CUDA consumes.
e.g. Post-process the GL render in CUDA.
WRITE-DISCARD
— CUDA produces... and GL consumes.
e.g. CUDA simulates fluid, and GL renders the result.
READ & WRITE
— Useful if you want to use the rasterization pipeline.
e.g. Feedback loop:
— runGL(texture) → framebuffer
— runCUDA(framebuffer) → texture
Code Sample – WRITE-DISCARD
CUDA produces... and OpenGL consumes. The map flag hints that we do not care about the previous contents of the buffer:

float* vboPtr;
cudaGraphicsResourceSetMapFlags(vboRes, cudaGraphicsMapFlagsWriteDiscard);
while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    runGL(vboId);
}
Timeline: Single-GPU
Driver-interop, synchronous*, WRITE-DISCARD
Timeline diagram: as before, the synchronize after “map” waits for GL to finish before the context switch, and the synchronize after “unmap” waits for CUDA (and GL) to finish; the context switch forces serialization.
Timeline: Multi-GPU
Driver-interop, synchronous*, WRITE-DISCARD
Timeline diagram: when multi-GPU, “map” does nothing; compute and render can overlap as they are on different GPUs. The synchronize after “unmap” waits for CUDA and GL to finish the CUtoGL transfer.
Timeline: Multi-GPU
Driver-interop, synchronous*, WRITE-DISCARD
— if render is long...
Timeline diagram: “unmap” will wait for GL, so a long render stalls the compute GPU.
Driver-Interop: System View
Diagrams: the single-GPU and multi-GPU data paths when the driver manages the interop.
Manual-Interop: System View
Diagram: the application itself moves the data between the compute-GPU and the render-GPU, via host memory.
Manual-Interop
Driver-interop hides all the complexity, but sometimes...
— We don’t want to transfer all the data.
The kernel may only compute subregions.
— We may have a complex system with multiple compute-GPUs and/or render-GPUs.
— We have application-specific pipelining and multi-buffering.
— We may have some CPU code in the algorithm between compute and graphics.
Code Sample – Manual-Interop
Create a temporary buffer in pinned host memory:

cudaMalloc((void**)&d_data, vboSize);
cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);
while (!done) {
    // Compute data in the temp buffer, and copy it to the host...
    runCUDA(d_data, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);
    // Map the render-GPU’s resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaSetDevice(computeGPU);
    runGL(vboId);
}
Timeline: Multi-GPU
Manual-interop, synchronous*, WRITE-DISCARD
Timeline diagram: each frame runs kernel → CUtoH download → HtoGL upload → GL render; the synchronizations serialize the two GPUs.
Code Sample – Manual Interop (Async)
The final synchronize is removed, so the upload can overlap the next frame:

cudaMalloc((void**)&d_data, vboSize);
cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);
while (!done) {
    // Compute data in the temp buffer, and copy it to the host...
    runCUDA(d_data, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);
    // Map the render-GPU’s resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaSetDevice(computeGPU);
    // cudaStreamSynchronize(0);
    runGL(vboId);
}
Timeline: Multi-GPU
Manual-interop, asynchronous, WRITE-DISCARD
Timeline diagram: without the final synchronize, the HtoGL upload overlaps the next frame’s kernel and the GL render.
Timeline: Multi-GPU
Manual-interop, asynchronous, WRITE-DISCARD
— if render is long...
Timeline diagram: we are downloading while uploading! Drifting out of sync!
Timeline: Multi-GPU (safe Async)
Manual-interop, asynchronous, WRITE-DISCARD
— if render is long...
Timeline diagram: the synchronization must also wait for the HtoGL upload to finish before the host buffer is reused.
Code Sample – Manual Interop (safe Async)
An event guards the host buffer against reuse before its upload completes:

while (!done) {
    // Compute the data in a temp buffer, and copy it to a host buffer...
    runCUDA(d_data, 0);
    cudaStreamWaitEvent(0, uploadFinished, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);
    // Map the render-GPU’s resource and upload the host buffer...
    // (all commands must be asynchronous.)
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaEventRecord(uploadFinished, 0);
    cudaSetDevice(computeGPU);
    runGL(vboId);
}
Timeline: Multi-GPU
Manual-interop, WRITE-DISCARD, flipping CUDA
Timeline diagram: with two buffers (A and B), kernel[B] overlaps CUtoH[A] and H[A]toGL, so downloads, uploads, and kernels pipeline across frames.
Code Sample – Manual Interop (flipping)

int read = 1, write = 0;
while (!done) {
    // Compute the data in a temp buffer, and copy it to a host buffer...
    cudaStreamWaitEvent(custream[write], kernelFinished[read], 0);
    runCUDA(d_data[write], custream[write]);
    cudaEventRecord(kernelFinished[write], custream[write]);
    cudaStreamWaitEvent(custream[write], uploadFinished[read], 0);
    cudaMemcpyAsync(h_data[write], d_data[write], vboSize, cudaMemcpyDeviceToHost, custream[write]);
    cudaEventRecord(downloadFinished[write], custream[write]);
    // Map the render-GPU’s resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaStreamWaitEvent(glstream, downloadFinished[read], 0);
    cudaMemcpyAsync(vboPtr, h_data[read], size, cudaMemcpyHostToDevice, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaEventRecord(uploadFinished[read], glstream);
    cudaStreamSynchronize(glstream); // Sync for easier analysis!
    cudaSetDevice(computeGPU);
    runGL(vboId);
    swap(&read, &write);
}
Timeline: Multi-GPU
Manual-interop, WRITE-DISCARD, flipping CUDA
— if render is long...
Timeline diagram: the double-buffered kernels and copies continue to pipeline around the longer GL render.
Timeline: Multi-GPU
Manual-interop, WRITE-DISCARD, flipping CUDA & GL
— if render is long... we have a problem.
Timeline diagram: the uploads H[A]toGL[A] / H[B]toGL[B] contend with the renders GL[A] / GL[B] — we can’t overlap contexts!
Timeline: Multi-GPU
Manual-interop, WRITE-DISCARD, flipping CUDA & GL
— Use the GL context for upload.
Timeline diagram: the upload becomes a gl-map / memcpy / gl-unmap performed from the GL context, so the CUDA pipeline and the GL rendering overlap fully.
Visual Studio Nsight
Demo
Demo - results
Workload: runCUDA (20 ms), runGL (10 ms), copy (10 ms).
Single-GPU
— Driver-interop = 30 ms
Multi-GPU
— Driver-interop = 36 ms (too large a data size makes multi-GPU interop worse)
— Async Manual-interop = 32 ms (overlapping the download helps us break even)
— Flipped Manual-interop = 22 ms (using streams and flipping is a significant win!)
Some final thoughts on interoperability.
Similar considerations apply when OpenGL is the producer and CUDA is the consumer.
— Use cudaGraphicsMapFlagsReadOnly.
Avoid synchronized GPUs for CUDA.
— Watch out for Windows WDDM’s implicit synchronization on unmap!
CUDA–OpenGL interoperability can perform slower if the OpenGL context spans multiple GPUs.
Context-switch performance varies with system configuration and OS.
Scaling Beyond Single Compute+Graphics
Scaling compute
— Divide tasks across multiple devices.
— When the data does not fit into a single GPU’s memory, distribute the data.
Scaling graphics
— Multi-display, stereo.
— More complex rendering, e.g. raytracing.
Higher compute density
— Amortize host or server costs (CPU, memory, RAID) shared across multiple GPUs.
Image: multi-GPU image processing – SagivTech.
Multiple Compute : Single Render GPU
Diagram: CUDA contexts on compute GPUs 1 and 2 feed the render device (GPU 0, OpenGL context) through an auxiliary CUDA context that maps/unmaps the OpenGL resource. The transfer can be done by:
— explicit copies via the host,
— cudaMemcpyPeer, or
— P2P memory access.
Unified Virtual Addressing (UVA)
Easier programming with a single address space:
Diagram: without UVA, the CPU’s system memory, GPU0 memory, and GPU1 memory are separate address spaces; with UVA, they occupy a single address range across PCIe.
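A minimal host-side sketch of what UVA buys you (assumes a CUDA 4.0+ runtime with UVA enabled; error checking omitted for brevity — this is illustrative, not from the talk):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    void* d_ptr = NULL;
    cudaMalloc(&d_ptr, 1 << 20);            // 1 MB on the current device

    // Under UVA the driver can tell us where any pointer lives.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_ptr);
    printf("pointer resides on device %d\n", attr.device);

    // Because the address space is unified, cudaMemcpyDefault lets the
    // driver infer the copy direction from the pointer values themselves.
    float h_data[16] = {0};
    cudaMemcpy(d_ptr, h_data, sizeof(h_data), cudaMemcpyDefault);

    cudaFree(d_ptr);
    return 0;
}
```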
Peer-to-Peer (P2P) Communication
NUMA/Topology Considerations
Memory access is non-uniform.
— Local GPU access is faster than remote (extra QPI hop).
— This affects PCIe transfer throughput: ~6 GB/s local vs. ~4 GB/s remote in the pictured dual-IOH system. (SandyBridge integrates the IOH.)
NUMA APIs
— Thread-affinity considerations.
Pitfalls of setting process affinity
— Does not work with graphics APIs.
P2P is disabled over QPI – copying is staged via the host (P2H2P).
Dale Southard. Designing and Managing GPU Clusters. Supercomputing 2011.
http://nvidia.fullviewmedia.com/fb/nv-sc11/tabscontent/archive/401-thu-southard.html
Peer-to-Peer Configurations
Diagram: four Tesla GPUs (GPU0–GPU3) sit behind two PCIe switches on IOH0; two Quadro GPUs (GPU4, GPU5) hang off IOH1; the IOHs are linked by QPI.
— Best P2P performance is between GPUs on the same PCIe switch, e.g. a K10 dual-GPU card (~11 GB/s).
— P2P communication is supported between GPUs on the same IOH (~5 GB/s).
— P2P communication requires Linux or the TCC driver; it is not available under WDDM.
Peer-to-Peer Initialization
cudaDeviceCanAccessPeer(&isAccessible, srcGPU, dstGPU)
— Returns in the 1st arg whether srcGPU can access the memory of dstGPU.
— Needs to be checked in both directions.
cudaDeviceEnablePeerAccess(peerDevice, 0)
— Enables the current GPU to access peerDevice.
— Note that this is asymmetric!
cudaDeviceDisablePeerAccess
— P2P can be limited to a specific phase in order to reduce overhead and free resources.
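The calls above compose into a symmetric setup. A minimal sketch (assumes at least two visible devices; error checking omitted — an illustration, not code from the talk):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Enable peer access in both directions between two GPUs.
// cudaDeviceEnablePeerAccess is asymmetric, so each side must opt in.
bool enablePeerPair(int gpuA, int gpuB)
{
    int aToB = 0, bToA = 0;
    cudaDeviceCanAccessPeer(&aToB, gpuA, gpuB);
    cudaDeviceCanAccessPeer(&bToA, gpuB, gpuA);
    if (!aToB || !bToA)
        return false;                      // e.g. GPUs on different IOHs

    cudaSetDevice(gpuA);
    cudaDeviceEnablePeerAccess(gpuB, 0);   // gpuA may now access gpuB
    cudaSetDevice(gpuB);
    cudaDeviceEnablePeerAccess(gpuA, 0);   // and vice versa
    return true;
}

int main()
{
    if (enablePeerPair(0, 1))
        printf("P2P enabled between GPU 0 and GPU 1\n");
    return 0;
}
```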
Peer-to-Peer Copy
cudaMemcpyPeerAsync
— Will fall back to staging copies via the host for unsupported configurations or if peer access is disabled.
cudaMemcpyAsync
— Using UVA, regular memory copies will also work.
NVIDIA CUDA Webinars – Multi-GPU Programming
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf
Code Sample – Manual Interop (P2P)

cudaMalloc((void**)&d_data, vboSize);
while (!done) {
    // Compute data in the temp buffer...
    runCUDA(d_data, 0);
    cudaStreamSynchronize(0);
    // Map the render-GPU’s resource and copy the compute buffer peer-to-peer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyPeerAsync(vboPtr, renderGPU, d_data, computeGPU, size, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaStreamSynchronize(glstream);
    cudaSetDevice(computeGPU);
    runGL(vboId);
}
Timeline: Multi-GPU
Manual-interop, synchronous*, P2P, WRITE-DISCARD
Timeline diagram: the host staging disappears, leaving kernel → P2P copy → GL render per frame. The copy is faster, but still serialized with the render.
Code Sample – Manual Interop (P2P flipping)

int read = 1, write = 0;
while (!done) {
    // Compute the data into the write buffer...
    cudaStreamWaitEvent(custream[write], peerFinished[read], 0);
    runCUDA(d_data[write], custream[write]);
    cudaEventRecord(kernelFinished[write], custream[write]);
    // Map the render-GPU’s resource and P2P-copy the compute buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaStreamWaitEvent(glstream, kernelFinished[read], 0);
    cudaMemcpyPeerAsync(vboPtr, renderGPU, d_data[read], computeGPU, size, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaEventRecord(peerFinished[read], glstream);
    cudaStreamSynchronize(glstream); // Sync for easier analysis!
    cudaSetDevice(computeGPU);
    runGL(vboId);
    swap(&read, &write);
}
Timeline: Multi-GPU
Manual-interop, synchronous*, P2P flipping, WRITE-DISCARD
Timeline diagram: kernel[B] overlaps the P2P copy of buffer A and the GL render, pipelining the whole frame.
Demo
Peer-to-Peer Memory Access
Sometimes we don’t want to explicitly copy, but want access to the entire space on all GPUs and the CPU.
Already possible with linear memory under UVA; the recent texture addition is useful for graphics.
Example – a large dataset that can’t fit into one GPU’s memory:
— Distribute the domain/data across GPUs.
— Each GPU now needs to access the other GPUs’ data for halo/boundary exchange.
Mapping algorithms to hardware topology: data reduction, multi-phase problems.
Won-Ki Jeong. S3308 – Fast Compressive Sensing MRI Reconstruction on a Multi-GPU System, GTC 2013.
Paulius M. S0515 – Implementing 3D Finite Difference Code on GPUs, GTC 2009.
Sharing texture across GPUs
A kernel running on device 0 accesses a texture resident on device 1. Peer access is unidirectional.

cudaSetDevice(0);
cudaMalloc3DArray(&d0_volume, ...);
cudaSetDevice(1);
cudaMalloc3DArray(&d1_volume, ...);
while (!done) {
    // Run on compute device 0.
    cudaSetDevice(0);
    cudaBindTextureToArray(tex0, d0_volume);
    cudaBindTextureToArray(tex1, d1_volume);
    myKernel<<<...>>>(...);
}

__global__ void myKernel(...)
{
    float voxel0 = tex3D(tex0, u, v, w); // accesses gpu0 locally
    float voxel1 = tex3D(tex1, u, v, w); // accesses gpu1 through p2p
}
Scaling Graphics - Multiple Quadro GPUs
Virtualized SLI case
— Multi-GPU Mosaic drives a large wall.
Access each GPU separately
— Parallel rendering.
— Multiple view-frustums, e.g. stereo, CAVEs.
Images: Hurricane Sandy simulation showing multiple computed tracks for different input params (image credits: NOAA); a parallel-rendering pipeline (data distribution + render on GPU-0…GPU-3, then sort + alpha composite); the Visible Human dataset, 14 GB of texture data.
CUDA With SLI Mosaic
Use driver interop with the current CUDA device.
— The driver handles the optimized data transfer (work in progress).
The OpenGL context spans 2 CUDA devices.
Diagram: CUDA enumerates [0] = K20, [1] = K5000, [2] = K5000, while OpenGL enumeration with SLI shows a single device [0] = K5000; cudaGLGetDevices maps the GL context back to its CUDA devices.

int glDeviceIndices[MAX_GPU];
int glDeviceCount;
cudaGLGetDevices(&glDeviceCount, glDeviceIndices, MAX_GPU, cudaGLDeviceListAll);
// glDeviceCount == 2
Specifying Render GPU Explicitly
Linux
— Specify separate X screens using XOpenDisplay:

Display* dpy = XOpenDisplay(":0." + gpu); // pseudo-code: the screen number selects the GPU
GLXContext ctx = glXCreateContextAttribsARB(dpy, ...);

Windows
— Use the WGL_NV_gpu_affinity extension:

BOOL wglEnumGpusNV(UINT iGpuIndex, HGPUNV* phGPU);
// For each GPU enumerated:
gpuMask[0] = hGPU[0];
gpuMask[1] = NULL;
// Get an affinity DC based on the GPU.
HDC affinityDC = wglCreateAffinityDCNV(gpuMask);
setPixelFormat(affinityDC);
HGLRC affinityGLRC = wglCreateContext(affinityDC);
Mapping CUDA and Affinity GL Devices
How to map an OpenGL device to CUDA:
cudaWGLGetDevice(int* cudaDevice, HGPUNV oglGPUHandle)
Don’t expect the same order
— GL order is Windows-specific, while CUDA order is device-specific.
Diagram: CUDA enumerates [0] = K20, [1] = K5000, [2] = K5000; GL affinity enumeration sees [0] = K5000, [1] = K5000 (or, with SLI, a single [0] = K5000); cudaGLGetDevices and cudaWGLGetDevice translate between the two orderings.
Compute & Multiple Render
NV_copy_image – copy between GL devices via wglCopyImageSubDataNV and glXCopyImageSubDataNV.
— Texture objects & renderbuffers only.
— Can copy texture subregions.
Diagram: the CUDA context on compute GPU 1 transfers via cudaMemcpyPeer into an auxiliary CUDA context (map/unmap) backing the OpenGL context on render GPU 0; wglCopyImageSubDataNV then copies between the OpenGL contexts on GPU 0 and GPU 2.
Multithreading with OpenGL Contexts
Multi-level CPU and GPU sync primitives coordinate a producer and a consumer thread:

GLsync consumedFence[MAX_BUFFERS];
GLsync producedFence[MAX_BUFFERS];
HANDLE consumedFenceValid, producedFenceValid;

Thread – Dest GPU1 (consumer):
// Wait for the signal to start consuming.
CPUWait(producedFenceValid);
glWaitSync(producedFence[0]);
// Bind the texture object.
glBindTexture(destTex[0]);
// Use as needed...
// Signal we are done with this texture.
consumedFence[0] = glFenceSync(…);
CPUSignal(consumedFenceValid);

Thread – Source GPU0 (producer):
// Wait for the consumer to release the buffer.
CPUWait(consumedFenceValid);
glWaitSync(consumedFence[1]);
// Bind the render target.
glFramebufferTexture2D(srcTex[1]);
// Draw here...
// Unbind.
glFramebufferTexture2D(0);
// Copy over to the consumer GPU.
wglCopyImageSubDataNV(srcCtx, srcTex[1], ..., destCtx, destTex[1]);
// Signal that the producer has completed.
producedFence[1] = glFenceSync(…);
CPUSignal(producedFenceValid);

Diagram: srcTex[0]/[1] on GPU0 are copied via glCopyImage to destTex[0]/[1] on GPU1.
Shalini V. S0353 – Programming Multi-GPUs for Scalable Rendering, GTC 2012.
Some final thoughts...
Hardware architecture influences transfer and system throughput, e.g. NUMA.
Peer-to-peer copies/access simplify multi-GPU programming,
— but context-switching between CUDA and GL will serialize.
When scaling graphics,
— remember the device mapping between CUDA and OpenGL.
Conclusions & Resources
The driver can do all the heavy lifting, but...
Scalability and final performance are up to the developer.
— For fine-grained control and optimization, you might want to move the data manually.
CUDA samples/documentation:
— http://developer.nvidia.com/cuda-downloads
OpenGL Insights, Patrick Cozzi, Christophe Riccio, 2012. ISBN 1439893764. www.openglinsights.com
Demo code
Code for the benchmarking demo (Win7 & Linux) is available at:
— http://bitbucket.org/wbraithwaite_nvidia/interopbenchmark