GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’...

GPU Direct for Video

Agenda

Video Processing Workflow

Video I/O with GPU

GPUDirect for Video

Why Not Peer to Peer

SDK Availability

Conclusions

Video In GPU

Video Processing Workflow

Video In Video Out

Video In

Video Out

Video I/O with GPU: transfer path

CPU Memory

SYSTEM

Video 3rd Party

Input/Output

Card GPU

Video I/O with GPU: inefficiencies

I/O <-> GPU through GPU

unshareable system

memory buffer

GPU synchronous transfers

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

F2 CPU Memcopy

CPU Memcopy F2

vsync vsync

Scan Out

Scan In F3

2 frames of latency

Video I/O with GPU: improvements

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

F2 CPU Memcopy

CPU Memcopy F4 Increasing latency

Scan Out

Scan In F5

vsync vsync

4 frames of latency!

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

F2 CPU Memcopy

CPU Memcopy F2 I/O <-> GPU through GPU

shareable system memory

buffer

Scan Out

Scan In F3

vsync vsync

2 frames of latency

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

Overlapping I/O and GPU

DMAs by using:

— sub-field chunks

vsync vsync

Scan Out

Scan In F3

2 frames of latency

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

Overlapping GPU DMAs and

GPU processing by using:

— sub-field chunks

— GPU asynchronous transfers

vsync vsync

Scan Out

Scan In F3

2 frames of latency

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

— Increasing latency

vsync vsync

Scan Out

Scan In F4

3 frames of latency

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

— Increasing latency even

vsync vsync

Scan Out

Scan In F5

4 frames of latency = processing can occupy the entire frame!

GPUDirect for Video

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA F2

Unified data transfer API

for all Graphics and

Compute APIs’ objects

— Video oriented

— Efficient synchronization

— GPU shareable system

memory

— Sub-field transfers

— GPU Asynchronous transfers

vsync vsync

Scan Out

Scan In F3

GPUDirect for Video: Application usage

Not this

Application

3rd Party Video I/O SDK

OpenGL CUDA DX GPU

Direct

3rd Party Video I/O

Device driver NVIDIA Driver

Application

OpenGL CUDA DX GPU

Direct

3rd Party Video I/O

Device driver NVIDIA Driver

But this:

GPUDirect for Video:Application Usage

Use the SDK Provided by Your Preferred Video I/O Vendor

GPUDirect for Video: Application Usage Video Capture to OpenGL Texture main()

GLuint glTex;

glGenTextures(1, &glTex); \\ Create OpenGL texture obect

glBindTexture(GL_TEXTURE_2D, glTex);

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, bufferWidth, bufferHeight, 0, 0, 0, 0);

glBindTexture(GL_TEXTURE_2D, 0);

EXTRegisterGPUTextureGL(glTexIn); \\ Register texture with 3rd party Video I/O SDK

while(!quit)

EXTBegin(glTexIn); \\ Release texture from Video I/O SDK

Render(glTexIn); \\ Use the texture

EXTEnd(glTexIn); \\ Release texture back to Video I/O SDK

EXTUnregisterGPUTextureGL(glTexIn); \\ Unregister texture with 3rd party Video I/O SDK

Results

Capture Latency Playout Latency

ExtDev to SysMem SysMem to GPU GPU to SysMem SysMem to ExtDev

1.4ms 1.5ms 1.5ms 1.5ms

2.9 ms 3.0ms

Optimal transfer time for 4-component 8-bit 1080p video:

* Although Gen 2 PCI Express bandwidth is specified at 8.0GB / sec, the maximum achievable is ~6.0 GB/sec

Direct to GPU video transfer time for 4-component 8-bit 1080p video:

transfer time ≅𝑓𝑟𝑎𝑚𝑒 𝑠𝑖𝑧𝑒

𝑃𝐶𝐼𝐸 𝑏𝑎𝑛𝑑𝑤𝑖𝑑𝑡ℎ ∗ 2 ≅

497664000 𝑏𝑦𝑡𝑒𝑠 𝑝𝑒𝑟 𝑠𝑒𝑐𝑜𝑛𝑑

6000000000∗ 𝑏𝑦𝑡𝑒𝑠 𝑝𝑒𝑟 𝑠𝑒𝑐𝑜𝑛𝑑

60 ∗ 2

𝑡𝑟𝑎𝑛𝑠𝑓𝑒𝑟 𝑡𝑖𝑚𝑒 ≅ 2.397 msec

Results

Without Direct to GPU Transfers

With Direct to GPU Transfers

Direct to GPU Transfers with Chunking

16.6 ms

vsync vsync

Capture

GPU Processing

Readback

3.8 ms 8.9 ms 3.9 ms

2.9 ms 10.7 ms 3.0 ms

1.5 ms

13.6 ms 1.6 ms 1.6 ms

1.7 ms

Video I/O with GPU: transfer path

CPU Memory

SYSTEM

Video 3rd Party

Input/Output

Card GPU

Why not

Peer-to-Peer?

NVIDIA Digital Video Pipeline: Peer-to-Peer Communication

CPU Memory

SYSTEM

SDI Video SDI Video

NVIDIA SDI capture and output cards only

Quadro®

SDI Output

Quadro® GPU

Compute + Graphics

Quadro®

SDI Capture

NVIDIA Digital Video Pipeline

GPU blit/

Chroma expansion

GPU Processing

Input DMA

Output DMA

The Input DMA is peer-to-

peer of 4 inputs

The output is a direct

scanout from the GPU

BUT…

Fixed 3 frames of latency

Each I/O board must be

tied to a single GPU

vsync vsync

Scan Out

Scan In

NVIDIA GPUDirect™ v2.0: Peer-to-Peer Communication

Direct Transfers between GPUs with CUDA

Memory

System

Memory

Only one copy required: 1. cudaMemcpy(GPU2, GPU1)

NVIDIA GPUDirect™ now supports RDMA

GPU1 GPU2

System

Memory GDDR5

Memory

Network

Server 1

GPU1 GPU2

Memory GDDR5

Memory

System

Memory

Network

Server 2

Network

Linux only, HPC centric

Why Not Peer-to-Peer?

• Supportability

• Same performance

• IOH-to-IOH communication issues

• Limited PCIE slot options due to lane allocations

• Support Graphics APIs as well as CUDA

• Multi-GPU Support!

• Multi-OS support!

Video I/O Card Vendors

Use the NVIDIA GPU Direct for Video SDK:

http://developer.nvidia.com/nvidia-

gpudirect%E2%84%A2-video

Samples (OpenGL, D3D9, D3D11, CUDA)

Programming Guide

Windows7, Linux

Static and Shared Libraries

Conclusions

Lowest latency video I/O in and out of the GPU

Optimal transfer times

Optimal GPU processing time

Supports OpenGL and DirectX as well as CUDA

Does not require sophisticated programming.

Scales to multiple GPUs

GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’...

Documents

HIGH-PERFORMANCE GPU VIDEO ENCODINGon-demand.gputechconf.com/gtc/2015/presentation/S5613... · 2015-03-19 · AGENDA GPU Video Encoding Overview NVIDIA Video Encoding Capabilities

Gpu digital lab video analysis

Making Future Metadata Shareable

161104 Isuma-IBC Channel-video desc-shareable-revs3.amazonaws.com/isuma.attachments/161104_isuma-ibc_channel-vid… · ! 2! Filmand!Video!Archives!linkbelow.!! VideoCopies& & If!you!wish!to!order!a!copy!of!a!video,!please!see!the!Inuit!Filmand!Video!Archives!web!

Shareable Australia Launch

web video: shareable vs shareworthy

Enable GPU virtualization in OpenStack · • Intel GPU Virtualization Overview • OpenStack vGPUenhancement • Future Work. Motivation Automatic Driving Video Streaming Cloud Gaming

Shareable Austin

Openappchallenge phase1submission Shareable Ink

Policies for Shareable Cities

CREATING SHAREABLE SECURITY MODULES

Review for Modern GPU Hardwareviplab.cs.nctu.edu.tw/course/VLSIDSP2020_Spring/VLSIDSP_CHAP10… · CPU GPU Application Vertex Processor Rasterize Fragment Processor Video Memory (Textures)

10. GPU - Video Card (Display, Graphic, VGA)

Shareable Trust

Best Practices GPU-Based Video Processing | GTC 2013on-demand.gputechconf.com/...Based-Video-Processing... · GPU-Based Video Processing Optimal GPU Upload GPU Processing GPU Readback

GRID VIRTUAL GPU - Nvidia · 2016-09-13 · All vGPUs resident on a physical GPU share access to the GPU’s engines including the graphics (3D), video decode, and video encode engines

How GPU Helps to Power Next Generation of ArcVideo Video ...on-demand.gputechconf.com/gtc/2016/presentation/s... · How GPU Helps to Power Next Generation of ArcVideo Video Products

Real-time 3D Video Processing Using Multi-stream GPU ... 3D Video Processing U… · Real-time 3D Video Processing Using Multi-stream GPU Parallel Computing Kenia Picos, Víctor H

VIDEO CAPTURE, ENCODING, AND STREAMING IN A MULTI- GPU … · Video Capture, Encoding, and Streaming in a Multi-GPU System Video Capture, Encoding, and Streaming in a Multi -GPU System

Creating Shareable Content