48
Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H. Anderson, F. Donelson Smith All image sources and references are provided at the end. 1

Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

  • Upload
    others

  • View
    4

  • Download
    1

Embed Size (px)

Citation preview

Page 1: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in

Autonomous SystemsMing Yang, Nathan Otterness, Tanya Amert, Joshua Bakita,

James H. Anderson, F. Donelson Smith

All image sources and references are provided at the end. 1

Page 2: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness 2

Page 3: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness 3

Page 4: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

4

Computer Vision & AI Expertise

GPU Behavior Expertise

Real-time Expertise

Page 5: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Pitfalls for Real-Time GPU Usage● Synchronization and blocking● GPU concurrency and performance● CUDA programming perils

5

Page 6: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

CUDA Programming Fundamentals(i) Allocate GPU memory cudaMalloc(&devicePtr,

bufferSize);

(ii) Copy data from CPU to GPU cudaMemcpy(devicePtr, hostPtr, bufferSize);

(iii) Launch the kernel (kernel = code that runs on GPU)

computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr);

(iv) Copy results from GPU to CPU cudaMemcpy(hostPtr, devicePtr, bufferSize);

(v) Free GPU memory cudaFree(devicePtr);

6

Page 7: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

CUDA Programming Fundamentals(i) Allocate GPU memory cudaMalloc(&devicePtr,

bufferSize);

(ii) Copy data from CPU to GPU cudaMemcpy(devicePtr, hostPtr, bufferSize);

(iii) Launch the kernel (kernel = code that runs on GPU)

computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr);

(iv) Copy results from GPU to CPU cudaMemcpy(hostPtr, devicePtr, bufferSize);

(v) Free GPU memory cudaFree(devicePtr);

7

Page 8: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

CUDA Programming Fundamentals(i) Allocate GPU memory cudaMalloc(&devicePtr,

bufferSize);

(ii) Copy data from CPU to GPU cudaMemcpy(devicePtr, hostPtr, bufferSize);

(iii) Launch the kernel (kernel = code that runs on GPU)

computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr);

(iv) Copy results from GPU to CPU cudaMemcpy(hostPtr, devicePtr, bufferSize);

(v) Free GPU memory cudaFree(devicePtr);

8

Page 9: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

(i) Allocate GPU memory cudaMalloc(&devicePtr, bufferSize);

(ii) Copy data from CPU to GPU cudaMemcpy(devicePtr, hostPtr, bufferSize);

(iii) Launch the kernel (kernel = code that runs on GPU)

computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr);

(iv) Copy results from GPU to CPU cudaMemcpy(hostPtr, devicePtr, bufferSize);

(v) Free GPU memory cudaFree(devicePtr);

CUDA Programming Fundamentals

9

Page 10: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

CUDA Programming Fundamentals(i) Allocate GPU memory cudaMalloc(&devicePtr,

bufferSize);

(ii) Copy data from CPU to GPU cudaMemcpy(devicePtr, hostPtr, bufferSize);

(iii) Launch the kernel (kernel = code that runs on GPU)

computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr);

(iv) Copy results from GPU to CPU cudaMemcpy(hostPtr, devicePtr, bufferSize);

(v) Free GPU memory cudaFree(devicePtr);

10

Page 11: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

CUDA Programming Fundamentals(i) Allocate GPU memory cudaMalloc(&devicePtr,

bufferSize);

(ii) Copy data from CPU to GPU cudaMemcpy(devicePtr, hostPtr, bufferSize);

(iii) Launch the kernel (kernel = code that runs on GPU)

computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr);

(iv) Copy results from GPU to CPU cudaMemcpy(hostPtr, devicePtr, bufferSize);

(v) Free GPU memory cudaFree(devicePtr);

11

Page 12: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Pitfalls for Real-Time GPU Usage● Synchronization and blocking● GPU concurrency and performance● CUDA programming perils

12

Page 13: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Explicit Synchronization

13

Page 14: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Explicit Synchronization

14

CPU threads ("tasks")

Page 15: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Explicit Synchronization

15K1 starts K1 completes

Page 16: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Explicit Synchronization

16

1024 threads

256 threads

Page 17: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Explicit Synchronization

17

Thread 3

1. Call cudaDeviceSynchronize (explicit synchronization).

2. Sleep for 0.2 seconds.

3. Launch kernel K3.

Page 18: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Explicit Synchronization

18

1. Thread 3 calls cudaDeviceSynchronize (explicit synchronization). (a)

2. Thread 3 sleeps for 0.2 seconds. (c)

3. Thread 3 launches kernel K3. (d)

Page 19: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Explicit Synchronization

19

1. Thread 3 calls cudaDeviceSynchronize (explicit synchronization). (a)

2. Thread 4 launches kernel K4. (b)

3. Thread 3 sleeps for 0.2 seconds. (c)

4. Thread 3 launches kernel K3. (d)

Page 20: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Explicit Synchronization

20

Pitfall 1. Explicit synchronization does not block future commands issued by other tasks.

Page 21: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Implicit Synchronization

21

Two commands from different streams cannot run concurrently [if separated by]:

1. A page-locked host memory allocation2. A device memory allocation3. A device memory set4. A memory copy between two addresses to the same device memory5. Any CUDA command to the NULL stream

CUDA toolkit 9.2.88 Programming Guide, Section 3.2.5.5.4, "Implicit Synchronization":

Page 22: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Implicit Synchronization

22

➔ Pitfall 2. Documented sources of implicit synchronization may not occur.

1. A page-locked host memory allocation2. A device memory allocation3. A device memory set4. A memory copy between two addresses to the same device memory5. Any CUDA command to the NULL stream

Page 23: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Implicit Synchronization

23

Page 24: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Implicit Synchronization

24

1. Thread 3 calls cudaFree. (a)

2. Thread 3 sleeps for 0.2 seconds. (c)

3. Thread 3 launches kernel K3. (d)

Page 25: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Implicit Synchronization

25

1. Thread 3 calls cudaFree. (a)

2. Thread 4 is blocked on the CPU when trying to launch kernel 4. (b)

3. Thread 4 finishes launching kernel K4, thread 3 sleeps for 0.2 seconds. (c)

4. Thread 3 launches kernel K3. (d)

Page 26: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Implicit Synchronization

26

➔ Pitfall 3. The CUDA documentation neglects to list some functions that cause implicit synchronization.

➔ Pitfall 4. Some CUDA API functions will block future, unrelated, CUDA tasks on the CPU.

Page 27: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Pitfalls for Real-Time GPU Usage● Synchronization and blocking

○ Suggestion: use CUDA Multi-Process Service (MPS).

● GPU concurrency and performance● CUDA programming perils

27

Page 28: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Pitfalls for Real-Time GPU Usage● Synchronization and blocking

○ Suggestion: use CUDA Multi-Process Service (MPS).

● GPU concurrency and performance

● CUDA programming perils28

Multiple Process-based Tasks Multiple Thread-based Tasks

Without MPS MP MT

With MPS MP(MPS) MT(MPS)

Page 29: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Pitfalls for Real-Time GPU Usage● Synchronization and blocking

○ Suggestion: use CUDA Multi-Process Service (MPS).

● GPU concurrency and performance

● CUDA programming perils29

Multiple Process-based Tasks Multiple Thread-based Tasks

Without MPS MP MT

With MPS MP(MPS) MT(MPS)

Page 30: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

GPU Concurrency and Performance

30

70% of the time, a single Hough transform iteration completed in 12 ms or less.

Page 31: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

GPU Concurrency and Performance

31

This occurred when four concurrent instances were running in separate CPU threads.

Page 32: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

GPU Concurrency and Performance

32

The observed WCET under MT was over 4x the WCET under MP.

Page 33: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

GPU Concurrency and Performance

33

Page 34: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

GPU Concurrency and Performance

34

Page 35: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

GPU Concurrency and Performance

35

➔ Pitfall 5. The suggestion from NVIDIA’s documentation to exploit concurrency through user-defined streams may be of limited use for improving performance in thread-based tasks.

Page 36: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Pitfalls for Real-Time GPU Usage● Synchronization and blocking

○ Suggestion: use CUDA Multi-Process Service (MPS).

● GPU concurrency and performance● CUDA programming perils

36

Page 37: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Pitfalls for Real-Time GPU Usage● Synchronization and blocking

○ Suggestion: use CUDA Multi-Process Service (MPS).

● GPU concurrency and performance● CUDA programming perils

37

Page 38: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Pitfalls for Real-Time GPU Usage● Synchronization and blocking

○ Suggestion: use CUDA Multi-Process Service (MPS).

● GPU concurrency and performance● CUDA programming perils

38

Page 39: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Synchronous Defaultsif (!CheckCUDAError(

cudaMemsetAsync(state->device_block_smids,0, data_size))) {return 0;

}

39

Page 40: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Synchronous Defaultsif (!CheckCUDAError(

cudaMemsetAsync(state->device_block_smids,0, data_size))) {return 0;

}

40

• What about the CUDA docs saying that memset causes implicit synchronization?

Page 41: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Synchronous Defaultsif (!CheckCUDAError(

cudaMemsetAsync(state->device_block_smids,0, data_size))) {return 0;

}

41

• What about the CUDA docs saying that memset causes implicit synchronization?

• Didn't slide 22 say memset doesn't cause implicit synchronization?

Page 42: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Synchronous Defaultsif (!CheckCUDAError(

cudaMemsetAsync(state->device_block_smids,0, data_size))) {return 0;

}

42

if (!CheckCUDAError( cudaMemsetAsync(state->device_block_smids,0, data_size, state->stream))) {return 0;

}

➔ Pitfall 6. Async CUDA functions use the GPU-synchronous NULL stream by default.

Page 43: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Other Perils

43

➔ Pitfall 7. Observed CUDA behavior often diverges from what the documentation states or implies.

Page 44: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Other Perils

44

➔ Pitfall 8. CUDA documentation can be contradictory.

Page 45: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Other Perils

45

➔ Pitfall 8. CUDA documentation can be contradictory.

CUDA Programming Guide, section 3.2.5.1:

The following device operations are asynchronous with respect to the host: [...] Memory copies performed by functions that are suffixed with Async

CUDA Runtime API Documentation, section 2:

For transfers from device memory to pageable host memory, [cudaMemcpyAsync] will return only once the copy has completed.

Page 46: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Other Perils

46

➔ Pitfall 9. What we learn about current black-box GPUs may not apply in the future.

Page 47: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Conclusion● The GPU ecosystem needs clarity and openness!

● Avoid pitfalls when using NVIDIA GPUs for real-time tasks in autonomous systems○ GPU synchronization, application performance, and

problems with documentation

47

Page 48: Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systemsyang/slides/ECRTS18.pdf · 2018-08-24 · Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time

Nathan Otterness

Thanks!Questions?Figure sources:https://electrek.co/guides/tesla-vision/https://www.quora.com/What-are-the-different-types-of-artificial-neural-networkhttps://www.researchgate.net/figure/Compute-unified-device-architecture-CUDA-threads-and-blocks-multidimensional_fig1_320806445?_sg=ziaY-gBKKiKX4pljRq4vJSWZvDvdOidZ2aCRYnD1QVFBJDxIx3MEO1I03cI31e1It6pUr53qaS1L1w4Bt5fd8w

48