SIGGRAPH 2013 Shaping the Future of Visual Computing Multi-GPU Programming for Visual Computing Wil Braithwaite - NVIDIA Applied Engineering


Page 1: Siggraph2013 - Multi-GPU Programming for Visual Computing

SIGGRAPH 2013 Shaping the Future of Visual Computing

Multi-GPU Programming for Visual Computing Wil Braithwaite - NVIDIA Applied Engineering

Page 2: Siggraph2013 - Multi-GPU Programming for Visual Computing

Talk Outline

Compute and Graphics API Interoperability.

— The basics.

— Interoperability methodologies & driver optimizations.

Interoperability at a system level.

— Fine-grained control for managing scaling.

Demo.

Peer-to-peer & UVA, NUMA considerations.

Scaling beyond one-to-one (single-compute : single-render).

— Enumerating graphics & compute resources.

— Many-to-many (multiple-compute : multiple-render).

Page 3: Siggraph2013 - Multi-GPU Programming for Visual Computing

NVIDIA Maximus Initiative

Mixing Tesla and Quadro GPUs

— Tight integration with OEMs and System Integrators

— Optimized driver paths for GPU-GPU communication

NVIDIA® MAXIMUS™ Workstation

[Diagram: a traditional workstation visualizes locally while simulating on a cluster; a Maximus workstation pairs simulation GPUs (Simulate0-3) with visualization GPUs (Visualize0-2) in one machine.]

Page 4: Siggraph2013 - Multi-GPU Programming for Visual Computing

Multi-GPU Compute+Graphics Use Cases

Image processing

— Multiple compute GPUs and a low-end display GPU for blitting.

Mixing rasterization with compute

— Polygonal rendering done in OpenGL and used as input to compute for further processing.

Visualization for HPC Simulation

— Numerical simulation distributed across multiple compute GPUs, possibly on a remote supercomputer.

[Images: seismic interpretation (NVIDIA Index); real-time CFD visualization (Morpheus Medical); interactive fluid dynamics.]

Page 5: Siggraph2013 - Multi-GPU Programming for Visual Computing

Compute/Graphics interoperability

Set up the objects in the graphics context.

Register objects with the compute context.

Map / Unmap the objects from the compute context.

[Diagram: the application creates buffer and texture objects in the OpenGL/DX context; the interop API exposes them to CUDA as CUDA buffers and CUDA arrays.]

Page 6: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample – Simple buffer interop

Setup and Registration of Buffer Objects:

GLuint vboId;
cudaGraphicsResource_t vboRes;

// OpenGL buffer creation...
glGenBuffers(1, &vboId);
glBindBuffer(GL_ARRAY_BUFFER, vboId);
glBufferData(GL_ARRAY_BUFFER, vboSize, 0, GL_STREAM_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);

// Registration with CUDA.
cudaGraphicsGLRegisterBuffer(&vboRes, vboId, cudaGraphicsRegisterFlagsNone);

Page 7: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample – Simple buffer interop

Mapping between contexts:

float* vboPtr;
size_t size;

while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    runGL(vboId);
}

Page 8: Siggraph2013 - Multi-GPU Programming for Visual Computing

Resource Behavior: Single-GPU

The resource is shared.

Context switch is fast and independent of data size.

[Diagram: a single GPU holds one copy of the data, shared between the CUDA and GL contexts through the interop API.]

Page 9: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample – Simple buffer interop

Mapping between contexts. Context-switching happens when the map/unmap commands are processed:

float* vboPtr;
size_t size;

while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);      // context-switch when processed...
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);    // ...and again here.
    runGL(vboId);
}

Page 10: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Single-GPU

Driver-interop

[Timeline diagram: on a single GPU, each frame serializes map → kernel → unmap → GL render, alternating between the CUDA and GL contexts.]

Page 11: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample – Simple buffer interop

Adding synchronization for analysis:

float* vboPtr;
size_t size;

while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    runGL(vboId);
}

Page 12: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Single-GPU

Driver-interop, synchronous*

— (we synchronize after map and unmap calls)

[Timeline diagram: synchronize after "map" waits for GL to finish before the context-switch; synchronize after "unmap" waits for CUDA (& GL) to finish.]

Page 13: Siggraph2013 - Multi-GPU Programming for Visual Computing

Resource Behavior: Multi-GPU

Each GPU has a copy of the resource.

Context-switch time depends on data size, because the driver must copy the data between the GPUs.

[Diagram: the CUDA context on the compute-GPU and the GL context on the render-GPU each hold a copy of the data, kept in sync by the interop API.]

Page 14: Siggraph2013 - Multi-GPU Programming for Visual Computing

Resource Behavior: Multi-GPU

[Diagram: on MAP, the driver copies the data from the render-GPU's copy to the compute-GPU's copy (GLtoCU).]

Page 15: Siggraph2013 - Multi-GPU Programming for Visual Computing

Resource Behavior: Multi-GPU

[Diagram: on UNMAP, the driver copies the data back from the compute-GPU to the render-GPU (CUtoGL).]

Page 16: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Driver-interop, synchronous*

— SLOWER! (Tasks are serialized).

[Timeline diagram: resources are mirrored and synchronized across the GPUs with GLtoCU and CUtoGL copies; "map" has to wait for GL to complete before it synchronizes the resource, so kernel, copies, and render serialize.]

Page 17: Siggraph2013 - Multi-GPU Programming for Visual Computing

Interoperability Methodologies

READ-ONLY

— GL produces... and CUDA consumes.

e.g. Post-process the GL render in CUDA.

WRITE-DISCARD

— CUDA produces... and GL consumes.

e.g. CUDA simulates fluid, and GL renders the result.

READ & WRITE

— Useful if you want to use the rasterization pipeline.

e.g. Feedback loop:

— runGL(texture) → framebuffer

— runCUDA(framebuffer) → texture
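The deck's samples use buffer interop; as a minimal sketch of the READ-ONLY texture path (texId/texRes are illustrative names, error checking omitted), register the GL texture with a read-only hint and map it as a CUDA array:

cudaGraphicsResource_t texRes;
// Register an existing GL texture for read-only access from CUDA.
cudaGraphicsGLRegisterImage(&texRes, texId, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsReadOnly);
cudaGraphicsMapResources(1, &texRes, 0);
cudaArray_t texArray;
// Fetch the underlying CUDA array (array index 0, mip level 0).
cudaGraphicsSubResourceGetMappedArray(&texArray, texRes, 0, 0);
// ... bind texArray to a texture reference/object and run the post-process kernel ...
cudaGraphicsUnmapResources(1, &texRes, 0);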


Page 18: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample – WRITE-DISCARD

CUDA produces... and OpenGL consumes:

float* vboPtr;
size_t size;

// Hint that we do not care about the previous contents of the buffer.
cudaGraphicsResourceSetMapFlags(vboRes, cudaGraphicsMapFlagsWriteDiscard);

while (!done)
{
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    runCUDA(vboPtr, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    runGL(vboId);
}

Page 19: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Single-GPU

Driver-interop, synchronous*, WRITE-DISCARD

[Timeline diagram: the context switch forces serialization. Synchronize after "map" waits for GL to finish before the context-switch; synchronize after "unmap" waits for CUDA (& GL) to finish.]

Page 20: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Driver-interop, synchronous*, WRITE-DISCARD

[Timeline diagram: with WRITE-DISCARD on multi-GPU, "map" does nothing; compute & render can overlap as they are on different GPUs. Synchronize after "unmap" waits for CUDA & GL to finish the CUtoGL copy.]

Page 21: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Driver-interop, synchronous*, WRITE-DISCARD

— if render is long...

[Timeline diagram: with a long render, "unmap" will wait for GL, stalling the next frame's copy.]

Page 22: Siggraph2013 - Multi-GPU Programming for Visual Computing

Driver-Interop: System View

[Screenshots: system views of driver-interop in the single-GPU and multi-GPU cases.]

Page 23: Siggraph2013 - Multi-GPU Programming for Visual Computing

Manual-Interop: System View

[Screenshot: system view of manual-interop across the compute-GPU and render-GPU.]

Page 24: Siggraph2013 - Multi-GPU Programming for Visual Computing

Manual-Interop

Driver-interop hides all the complexity, but sometimes...

— We don't want to transfer all the data; the kernel may only compute subregions.

— We may have a complex system with multiple compute-GPUs and/or render-GPUs.

— We have application-specific pipelining and multi-buffering.

— We may have some CPU code in the algorithm between compute and graphics.

Page 25: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample – Manual-Interop

Create a temporary buffer in pinned host-memory and stage the transfer through the host:

float *d_data, *h_data, *vboPtr;
size_t size;

cudaMalloc((void**)&d_data, vboSize);
cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);

while (!done) {
    // Compute data in temp buffer, and copy to host...
    runCUDA(d_data, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);

    // Map the render-GPU's resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaStreamSynchronize(0);
    cudaSetDevice(computeGPU);

    runGL(vboId);
}

Page 26: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, synchronous*, WRITE-DISCARD

[Timeline diagram: manual-interop via the host; each frame runs kernel → CUtoH → HtoGL → GL, and the synchronized copies serialize compute and render.]

Page 27: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample - Manual Interop (Async)

Same as before, but the final synchronize is removed so the upload can overlap the render:

float *d_data, *h_data, *vboPtr;
size_t size;

cudaMalloc((void**)&d_data, vboSize);
cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);

while (!done) {
    // Compute data in temp buffer, and copy to host...
    runCUDA(d_data, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);

    // Map the render-GPU's resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaSetDevice(computeGPU);
    // cudaStreamSynchronize(0);   // removed so the upload overlaps the render

    runGL(vboId);
}

Page 28: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, asynchronous, WRITE-DISCARD

[Timeline diagram: with the synchronization removed, the HtoGL upload overlaps the next frame's kernel and download.]

Page 29: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, asynchronous, WRITE-DISCARD

— if render is long...

[Timeline diagram: with a long render we are downloading while uploading, drifting out of sync!]

Page 30: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU (safe Async)

Manual-interop, asynchronous, WRITE-DISCARD

— if render is long...

[Timeline diagram: safe version; synchronization must also wait for HtoGL to finish before the host buffer is reused.]

Page 31: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample - Manual Interop (safe Async)

while (!done) {
    // Compute the data in a temp buffer, and copy to a host buffer...
    runCUDA(d_data, 0);
    // Don't overwrite the host buffer until the previous upload has consumed it.
    cudaStreamWaitEvent(0, uploadFinished, 0);
    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);

    // Map the render-GPU's resource and upload the host buffer...
    // (all commands must be asynchronous.)
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    cudaEventRecord(uploadFinished, 0);
    cudaSetDevice(computeGPU);

    runGL(vboId);
}

Page 32: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, WRITE-DISCARD, flipping CUDA

[Timeline diagram: with double-buffered (flipped) CUDA buffers, the next frame's kernel and CUtoH overlap the current frame's HtoGL upload and GL render.]

Page 33: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample - Manual Interop (flipping)

int read = 1, write = 0;

while (!done) {
    // Compute the data in a temp buffer, and copy to a host buffer...
    cudaStreamWaitEvent(custream[write], kernelFinished[read], 0);
    runCUDA(d_data[write], custream[write]);
    cudaEventRecord(kernelFinished[write], custream[write]);
    cudaStreamWaitEvent(custream[write], uploadFinished[read], 0);
    cudaMemcpyAsync(h_data[write], d_data[write], vboSize,
                    cudaMemcpyDeviceToHost, custream[write]);
    cudaEventRecord(downloadFinished[write], custream[write]);

    // Map the renderGPU's resource and upload the host buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaStreamWaitEvent(glstream, downloadFinished[read], 0);
    cudaMemcpyAsync(vboPtr, h_data[read], size, cudaMemcpyHostToDevice, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaEventRecord(uploadFinished[read], glstream);
    cudaStreamSynchronize(glstream); // Sync for easier analysis!
    cudaSetDevice(computeGPU);

    runGL(vboId);

    swap(&read, &write);
}

Page 34: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, WRITE-DISCARD, flipping CUDA

— if render is long...

[Timeline diagram: flipping CUDA with a long render; the uploads queue behind GL and the render becomes the bottleneck.]

Page 35: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, WRITE-DISCARD, flipping CUDA & GL

— if render is long... we have a problem.

[Timeline diagram: flipping CUDA & GL with a long render. We can't overlap contexts: the upload and the render still serialize on the render-GPU.]

Page 36: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, WRITE-DISCARD, flipping CUDA & GL

— Use the GL context for upload.

[Timeline diagram: the upload is issued from the GL context (gl-map, memcpy, gl-unmap), leaving only a cuSync on the compute side, so the GL frames pipeline cleanly.]

Page 37: Siggraph2013 - Multi-GPU Programming for Visual Computing

Visual Studio Nsight


Page 38: Siggraph2013 - Multi-GPU Programming for Visual Computing

Demo


Page 39: Siggraph2013 - Multi-GPU Programming for Visual Computing

Demo - results

Timings: runCUDA (20ms), runGL (10ms), copy (10ms).

Single-GPU

— Driver-interop = 30ms

Multi-GPU

— Driver-interop = 36ms (too large a data size makes multi-GPU interop worse)

— Async Manual-interop = 32ms (overlapping the download helps us break even)

— Flipped Manual-interop = 22ms (using streams and flipping is a significant win!)

Page 40: Siggraph2013 - Multi-GPU Programming for Visual Computing

Some final thoughts on Interoperability.

Similar considerations apply when OpenGL is the producer and CUDA is the consumer: use cudaGraphicsMapFlagsReadOnly.

Avoid synchronized GPUs for CUDA.

— Watch out for Windows's WDDM implicit synchronization on unmap!

CUDA-OpenGL interoperability can perform slower if the OpenGL context spans multiple GPUs.

Context-switch performance varies with system configuration and OS.

Page 41: Siggraph2013 - Multi-GPU Programming for Visual Computing

Scaling Beyond Single Compute+Graphics

Scaling compute

— Divide tasks across multiple devices (see the sketch below).

— When data does not fit into a single GPU's memory, distribute the data.

Scaling graphics

— Multi-displays, stereo.

— More complex rendering, e.g. raytracing.

Higher compute density

— Amortize host or server costs, e.g. CPU, memory, RAID shared with multiple GPUs.
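As a minimal sketch of dividing a task across devices (d_slice and streams are illustrative per-device buffers and streams, allocated elsewhere; runCUDA is the deck's helper):

int deviceCount;
cudaGetDeviceCount(&deviceCount);
for (int d = 0; d < deviceCount; ++d) {
    cudaSetDevice(d);
    runCUDA(d_slice[d], streams[d]);   // each device computes its own slice
}
for (int d = 0; d < deviceCount; ++d) {
    cudaSetDevice(d);
    cudaStreamSynchronize(streams[d]); // wait for all slices to finish
}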

Page 42: Siggraph2013 - Multi-GPU Programming for Visual Computing

Multi-GPU Image Processing - SagivTech

[Diagram: multi-GPU image-processing pipeline feeding a renderer.]

Page 43: Siggraph2013 - Multi-GPU Programming for Visual Computing

Multiple Compute : Single Render GPU

[Diagram: CUDA contexts on compute GPUs 1 and 2, plus an auxiliary CUDA context on render GPU 0 alongside the OpenGL context; data reaches the mapped GL resource (map/unmap) via one of the transfer options below.]

Transfer options:

— Explicit copies via host.

— cudaMemcpyPeer.

— P2P memory access.

Page 44: Siggraph2013 - Multi-GPU Programming for Visual Computing

Unified Virtual Addressing (UVA)

Easier programming with a single address space:

[Diagram: without UVA, the CPU and each GPU have separate address spaces; with UVA, system memory and all GPU memories share a single virtual address range over PCI-e.]
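As a hedged sketch of what UVA enables, any pointer can be queried for its owner (field names follow the current runtime API; older toolkits expose memoryType instead of type):

cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, ptr);   // works on any UVA pointer
// attr.device gives the owning GPU for a device pointer;
// attr.type distinguishes host, device, and managed memory.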

Page 45: Siggraph2013 - Multi-GPU Programming for Visual Computing

Peer-to-Peer (P2P) Communication

Page 46: Siggraph2013 - Multi-GPU Programming for Visual Computing

NUMA/Topology Considerations

Memory access is non-uniform.

— Local GPU access is faster than remote (extra QPI hop).

— This affects PCIe transfer throughput.

NUMA APIs

— Thread affinity considerations (a sketch follows below).

Pitfalls of setting process affinity

— Does not work with graphics APIs.
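A minimal Linux sketch (pthreads, needs _GNU_SOURCE with <pthread.h>/<sched.h>; the core list is illustrative and should come from the real topology) of pinning the calling thread near GPU0's IOH:

cpu_set_t mask;
CPU_ZERO(&mask);
for (int c = 0; c < 8; ++c)   // assume cores 0-7 are local to GPU0's IOH
    CPU_SET(c, &mask);
pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);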

[Diagram: dual-socket topology; each CPU/IOH pair has local memory and a local GPU, linked by QPI. Local PCIe transfers reach ~6 GB/s, while remote transfers that take the extra QPI hop drop to ~4 GB/s. SandyBridge integrates the IOH.]

Dale Southard. Designing and Managing GPU Clusters. Supercomputing 2011

http://nvidia.fullviewmedia.com/fb/nv-sc11/tabscontent/archive/401-thu-southard.html

Page 47: Siggraph2013 - Multi-GPU Programming for Visual Computing

Peer-to-Peer Configurations

[Diagram: two IOHs linked by QPI; Tesla GPUs 0-3 sit behind PCIe switches on IOH0, Quadro GPUs 4-5 on IOH1.]

— Best P2P performance is between GPUs on the same PCIe switch, e.g. a K10 dual-GPU card (~11 GB/s).

— P2P communication is supported between GPUs on the same IOH (~5 GB/s).

— P2P is disabled over QPI; copying is staged via the host (P2H2P).

— P2P communication requires Linux or TCC; no WDDM.

Page 48: Siggraph2013 - Multi-GPU Programming for Visual Computing

Peer-to-Peer Initialization

cudaDeviceCanAccessPeer(&isAccessible, srcGPU, dstGPU)

— Returns in the 1st arg whether srcGPU can access the memory of dstGPU.

— Needs to be checked bidirectionally.

cudaDeviceEnablePeerAccess(peerDevice, 0)

— Enables the current GPU to access peerDevice.

— Note that this is asymmetric!

cudaDeviceDisablePeerAccess

— P2P can be limited to a specific phase in order to reduce overhead and free resources.

A sketch of the setup follows below.
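A minimal sketch of the bidirectional setup (gpuA/gpuB are illustrative device ordinals; error checking omitted):

int canAB = 0, canBA = 0;
cudaDeviceCanAccessPeer(&canAB, gpuA, gpuB);
cudaDeviceCanAccessPeer(&canBA, gpuB, gpuA);
if (canAB && canBA) {
    cudaSetDevice(gpuA);
    cudaDeviceEnablePeerAccess(gpuB, 0);   // gpuA may now access gpuB's memory
    cudaSetDevice(gpuB);
    cudaDeviceEnablePeerAccess(gpuA, 0);   // ...and vice versa
}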

Page 49: Siggraph2013 - Multi-GPU Programming for Visual Computing

Peer-to-Peer Copy

cudaMemcpyPeerAsync

— Will fall back to staging copies via the host for unsupported configurations, or if peer-access is disabled.

cudaMemcpyAsync

— Using UVA, regular memory copies will also work (see the sketch below).
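Both styles in one hedged sketch (pointers, device ordinals, byte count, and stream are illustrative):

// Explicit peer copy: source and destination devices are named.
cudaMemcpyPeerAsync(dstPtr, dstDev, srcPtr, srcDev, numBytes, stream);
// With UVA, the driver infers the devices from the pointers.
cudaMemcpyAsync(dstPtr, srcPtr, numBytes, cudaMemcpyDefault, stream);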

NVIDIA CUDA Webinars – Multi-GPU Programming

http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf

Page 50: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample - Manual Interop (P2P)

float *d_data, *vboPtr;
size_t size;

cudaMalloc((void**)&d_data, vboSize);

while (!done) {
    // Compute data in a temp buffer...
    runCUDA(d_data, 0);
    cudaStreamSynchronize(0);

    // Map the render-GPU's resource and peer-copy the compute buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaMemcpyPeerAsync(vboPtr, renderGPU, d_data, computeGPU, size, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaStreamSynchronize(glstream);
    cudaSetDevice(computeGPU);

    runGL(vboId);
}

Page 51: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, synchronous*, P2P, WRITE-DISCARD

[Timeline diagram: the P2P copy is faster than staging via the host, but it is still serialized with the render.]

Page 52: Siggraph2013 - Multi-GPU Programming for Visual Computing

Code Sample - Manual Interop (P2P flipping)

int read = 1, write = 0;

while (!done) {
    // Compute the data into the write buffer...
    cudaStreamWaitEvent(custream[write], peerFinished[read], 0);
    runCUDA(d_data[write], custream[write]);
    cudaEventRecord(kernelFinished[write], custream[write]);

    // Map the renderGPU's resource and p2p copy the compute buffer...
    cudaSetDevice(renderGPU);
    cudaGraphicsMapResources(1, &vboRes, glstream);
    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
    cudaStreamWaitEvent(glstream, kernelFinished[read], 0);
    cudaMemcpyPeerAsync(vboPtr, renderGPU, d_data[read], computeGPU, size, glstream);
    cudaGraphicsUnmapResources(1, &vboRes, glstream);
    cudaEventRecord(peerFinished[read], glstream);
    cudaStreamSynchronize(glstream); // Sync for easier analysis!
    cudaSetDevice(computeGPU);

    runGL(vboId);

    swap(&read, &write);
}

Page 53: Siggraph2013 - Multi-GPU Programming for Visual Computing

Timeline: Multi-GPU

Manual-interop, synchronous*, P2P flipping, WRITE-DISCARD

[Timeline diagram: with double-buffered P2P copies, the next frame's kernel overlaps the current frame's P2P copy and GL render.]

Page 54: Siggraph2013 - Multi-GPU Programming for Visual Computing

Demo


Page 55: Siggraph2013 - Multi-GPU Programming for Visual Computing

Peer-to-Peer Memory Access

Sometimes we don't want to explicitly copy, but want access to the entire space on all GPUs and the CPU.

This is already possible for linear memory with UVA; a recent texture addition is useful for graphics.

Example – a large dataset that can't fit into one GPU's memory:

— Distribute the domain/data across the GPUs.

— Each GPU now needs to access the other GPUs' data for halo/boundary exchange (see the sketch below).
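A minimal sketch of direct peer access through UVA (assumes peer access is enabled; a kernel launched on GPU0 reads a halo buffer allocated on GPU1 through an ordinary pointer):

__global__ void exchangeHalo(float* local, const float* peerHalo, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        local[i] = peerHalo[i];   // this read crosses PCIe via P2P
}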

Page 56: Siggraph2013 - Multi-GPU Programming for Visual Computing

Mapping Algorithms to Hardware Topology

[Diagram: mapping domain decomposition across GPU0/GPU1; data reduction and multi-phase problems.]

Won-Ki Jeong. S3308 - Fast Compressive Sensing MRI Reconstruction on a Multi-GPU System, GTC 2013.

Paulius M. S0515 - Implementing 3D Finite Difference Code on GPUs, GTC 2009.

Page 57: Siggraph2013 - Multi-GPU Programming for Visual Computing

Sharing texture across GPUs

A kernel running on device 0 accesses a texture from device 1. Peer access is unidirectional.

cudaSetDevice(0);
cudaMalloc3DArray(&d0_volume, ...);
cudaSetDevice(1);
cudaMalloc3DArray(&d1_volume, ...);

while (!done) {
    // Set CUDA device to COMPUTE DEVICE 0.
    cudaSetDevice(0);
    cudaBindTextureToArray(tex0, d0_volume);
    cudaBindTextureToArray(tex1, d1_volume);
    myKernel<<<...>>>(...);
}

__global__ void myKernel(...)
{
    float voxel0 = tex3D(tex0, u, v, w); // accesses gpu0
    float voxel1 = tex3D(tex1, u, v, w); // accesses gpu1 through p2p
}

Page 58: Siggraph2013 - Multi-GPU Programming for Visual Computing

Scaling Graphics - Multiple Quadro GPUs

Virtualized SLI case

— Multi-GPU mosaic to drive a large wall.

Access each GPU separately

— Parallel rendering.

— Multi view-frustum, e.g. stereo, CAVEs.

[Images: Hurricane Sandy simulation showing multiple computed tracks for different input params (image credits: NOAA); Visible Human 14GB texture data split across GPU-0..3 for data distribution + render, then sort + alpha composite.]

Page 59: Siggraph2013 - Multi-GPU Programming for Visual Computing

CUDA With SLI Mosaic

Use driver interop with the current CUDA device.

— The driver handles the optimized data transfer (work in progress).

The OpenGL context spans two CUDA devices:

CUDA enumeration: [0] = K20, [1] = K5000, [2] = K5000
OpenGL enumeration with SLI: [0] = K5000

cudaGLGetDevices maps the GL context to its CUDA devices:

int glDeviceIndices[MAX_GPU];
unsigned int glDeviceCount;
cudaGLGetDevices(&glDeviceCount, glDeviceIndices, MAX_GPU, cudaGLDeviceListAll);
// glDeviceCount == 2: both K5000s back the SLI-mosaic GL context.

Page 60: Siggraph2013 - Multi-GPU Programming for Visual Computing

Specifying Render GPU Explicitly

Linux

— Specify separate X screens using XOpenDisplay:

Display* dpy = XOpenDisplay(":0." + gpu);   // pseudocode: pick the screen for the GPU
GLXContext ctx = glXCreateContextAttribsARB(dpy, ...);

Windows

— Using the NV_gpu_affinity extension:

BOOL wglEnumGpusNV(UINT iGpuIndex, HGPUNV *phGPU);

// For each GPU enumerated:
GpuMask[0] = hGPU[0];
GpuMask[1] = NULL;
// Get an affinity DC based on the GPU.
HDC affinityDC = wglCreateAffinityDCNV(GpuMask);
SetPixelFormat(affinityDC, ...);   // choose a pixel format for the DC
HGLRC affinityGLRC = wglCreateContext(affinityDC);

Page 61: Siggraph2013 - Multi-GPU Programming for Visual Computing

Mapping Cuda Affinity GL Devices

How to map an OpenGL device to CUDA:

cudaWGLGetDevice(int *cudaDevice, HGPUNV oglGPUHandle)

Don't expect the same order (see the sketch below).

— GL order is Windows-specific, while CUDA order is device-specific.
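A hedged sketch of walking the affinity handles and mapping each to its CUDA ordinal (hGPU is an illustrative HGPUNV array):

int cudaDev;
for (UINT i = 0; wglEnumGpusNV(i, &hGPU[i]); ++i) {
    // Which CUDA device ordinal backs this GL affinity handle?
    cudaWGLGetDevice(&cudaDev, hGPU[i]);
}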

[Diagram: CUDA enumeration: [0] = K20, [1] = K5000, [2] = K5000. GL affinity enumeration: [0] = K5000, [1] = K5000; with SLI: [0] = K5000. cudaGLGetDevices and cudaWGLGetDevice translate between the two orders.]

Page 62: Siggraph2013 - Multi-GPU Programming for Visual Computing

Compute & Multiple Render

GL_NV_copy_image - copy between GL devices via wglCopyImageSubDataNV and glXCopyImageSubDataNV.

— Texture objects & renderbuffers only.

— Can copy texture subregions (see the sketch below).
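A hedged sketch of the WGL entry point (contexts, texture names, and sizes are illustrative):

wglCopyImageSubDataNV(srcCtx, srcTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                      dstCtx, dstTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                      width, height, 1);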

[Diagram: compute device GPU 1 feeds render devices GPU 2 and GPU 0; the result moves via cudaMemcpyPeer into an aux CUDA context (map/unmap into the GL resource) and via wglCopyImageSubData between the two OpenGL contexts.]

Page 63: Siggraph2013 - Multi-GPU Programming for Visual Computing

Multithreading with OpenGL Contexts

Multi-level CPU and GPU sync primitives coordinate a producer thread (source GPU0) and a consumer thread (dest GPU1) over double-buffered textures:

GLsync consumedFence[MAX_BUFFERS];
GLsync producedFence[MAX_BUFFERS];
HANDLE consumedFenceValid, producedFenceValid;

Thread - Source GPU0 (producer):

// Wait until the consumer has released this buffer.
CPUWait(consumedFenceValid);
glWaitSync(consumedFence[1], 0, GL_TIMEOUT_IGNORED);
// Bind render target.
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, srcTex[1], 0);
// Draw here...
// Unbind.
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, 0, 0);
// Copy over to the consumer GPU.
wglCopyImageSubDataNV(srcCtx, srcTex[1], ..destCtx, destTex[1]);
// Signal that the producer has completed.
producedFence[1] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
CPUSignal(producedFenceValid);

Thread - Dest GPU1 (consumer):

// Wait for the signal to start consuming.
CPUWait(producedFenceValid);
glWaitSync(producedFence[0], 0, GL_TIMEOUT_IGNORED);
// Bind texture object.
glBindTexture(GL_TEXTURE_2D, destTex[0]);
// Use as needed...
// Signal that we are done with this texture.
consumedFence[0] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
CPUSignal(consumedFenceValid);

Shalini V. S0353 - Programming Multi-GPUs for Scalable Rendering, GTC 2012.

Page 64: Siggraph2013 - Multi-GPU Programming for Visual Computing

Some final thoughts...

Hardware architecture influences transfer and system throughput, e.g. NUMA.

Peer-to-peer copies/access simplify multi-GPU programming...

— but context-switching between CUDA and GL will serialize.

When scaling graphics,

— remember the device mapping between CUDA and OpenGL.

Page 65: Siggraph2013 - Multi-GPU Programming for Visual Computing

Conclusions & Resources

The driver can do all the heavy lifting, but...

Scalability and final performance are up to the developer.

— For fine-grained control and optimization, you might want to move the data manually.

CUDA samples/documentation: http://developer.nvidia.com/cuda-downloads

OpenGL Insights, Patrick Cozzi, Christophe Riccio, 2012. ISBN 1439893764. www.openglinsights.com

Page 66: Siggraph2013 - Multi-GPU Programming for Visual Computing

Demo code

Code for the benchmarking demo (Win7 & Linux) is available at:

— http://bitbucket.org/wbraithwaite_nvidia/interopbenchmark