NVIDIA DEVELOPER TOOLS: NEW CAPABILITIES IN CUDA 9 · 7 CUDA-MEMCHECK • Support for host API...

Jeff Larkin, CAAR Workshop March 2018

NVIDIA DEVELOPER TOOLS: NEW CAPABILITIES IN CUDA 9.X

CUDA TOOLS

CUDA-GDB

• Debug CUDA kernels with CLI

• Debug CPU and GPU code

• CPU and GPU core dump support

NVPROF

• Collect Performance events and metrics

VISUAL PROFILER

• Trace CUDA activities

• Profile CUDA kernels

• Correlate performance instrumentation

with source code

• Expert-guided performance analysis

CUDA-MEMCHECK

• Detect out-of-bounds and misaligned memory

accesses

• Detect race condition in memory accesses

• Detect uninitialized global memory accesses

• Detect incorrect GPU thread synchronization

NVDISASM, CUOBJDUMP

GPU LIBRARY ADVISOR

• Detect CUDA library optimization

opportunities

NEW TOOLS FEATURES IN CUDA 9

SUPPORT FOR VOLTA ARCHITECTURE

Support for GPUs with Compute Capability 7.0

CUDA-GDB

• GPU core dump generation is supported on Volta-MPS

• Reading lightweight GPU core dump files is supported

New features post CUDA 9

$ time CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 CUDA_ENABLE_LIGHTWEIGHT_COREDUMP=1 ./simple_cuda_program

Before: 9.85s user 12.33s system 173% cpu 12.794 total

After: 0.41s user 1.36s system 75% cpu 2.350 total

CUDA-MEMCHECK

• Support for host API functions with pitch parameter.

• Initial support for the Cooperative Groups programming model.

• Support for shared memory atomic instructions.

• Support for detecting invalid accesses to global memory on Pascal and later architectures that extend beyond the end of an allocation.

• Support for limiting the numbers of errors printed by cuda-memcheck.

• Racecheck analysis reports are assigned a severity level.

• Default print level changed from INFO to WARN.

• A new command line option to report deprecated instructions even when they are used in safe execution paths. (post CUDA 9)

CUDA VISUAL PROFILER

Unified Memory

NVLink

Multi-hop remote profiling

Tracing and profiling of Cooperative Kernel launches

PC sampling

Enhancements in CUDA 9.0

VISUAL PROFILER

Segment mode interval Heat map for CPU page faults

Segment mode timeline

VISUAL PROFILER

Selected interval Source location

CPU Page Fault Source Correlation

CPU PAGE FAULT SOURCE CORRELATION

Unguided Analysis

Option to collect Unified Memory

information

Summary of all CPU page faults

VISUAL PROFILER

Source line causing CPU page fault

CPU Page Fault Source Correlation

VISUAL PROFILER - NEW UNIFIED MEMORY EVENTSPage throttling, Memory thrashing, Remote map

Memory thrashing

Page throttling

Remote map

VISUAL PROFILERFilter and Analyze

Filtered intervals

OPTIMIZATION

int threadsPerBlock = 256;int numBlocks = (length + threadsPerBlock – 1) / threadsPerBlock;

kernel<<< numBlocks , threadsPerBlock >>>(A, B, C, length);

int threadsPerBlock = 256;int numBlocks = (length + threadsPerBlock – 1) / threadsPerBlock;

cudaMemAdvise(A, size, cudaMemAdviseSetReadMostly, 0);cudaMemAdvise(B, size, cudaMemAdviseSetReadMostly, 0);

kernel<<< numBlocks , threadsPerBlock >>>(A, B, C, length);

OPTIMIZED APPLICATION

No DtoH Migrations and thrashing

2.9 ms

Speedup 4x (2.9 vs 12.2)

VISUAL PROFILERNVLINK visualization

Color codes for

NVLink

Topology

Static

properties

Runtime

values

Option to collect

NVLink information

Unguided Analysis

Selected

NVLink

Version

VISUAL PROFILER

MemCpy API

NVLink Events on

Timeline

Color Coding of

NVLink Events

NVLink events on timeline - Segment mode

VISUAL PROFILER

Host Compute NodeLogin Node

Visual Profiler ScriptCUDA

Applicationssh

Multi-hop remote profiling

Script available on github: https://github.com/NVIDIA/cuda-profiler/tree/master/one_hop_profiling)

VISUAL PROFILER

Login Node

Script

Connect Visual Profiler to the login node

Configure script on the login node

Use the custom script option

Multi-hop remote profiling - One-Time Setup

VISUAL PROFILER

1 2 Application transparently runs on compute node and profiling data is displayed in the Visual Profiler

Select custom script, then create a remote session as usual

Multi-hop remote profiling - Application Profiling

VISUAL PROFILERCPU Sampling (post CUDA 9)

Percentage of time spent collectively by all threads

Range of time

spent across

all threads

Selected thread

is highlighted in

Orange

Bar chart of the

amount of time

spent by thread

VISUAL PROFILER - PC SAMPLINGOption to select sampling period (post CUDA 9)

VISUAL PROFILER

Pie chart for sample distribution for a CUDA function

Source-Assembly view

PC SAMPLING UI

MPI PROFILING

$ LD_PRELOAD=“libnvtx_pmpi.so mpirun -n 4 nvprof --process-name "MPI Rank %q{PMIX_RANK}" --context-name "MPI Rank %q{PMIX_RANK}" -o timeline.%q{PMIX_RANK}.nvprof ./simpleMPIRunning on 4 nodes==21977== NVPROF is profiling process 21977, command: ./simpleMPI==21983== NVPROF is profiling process 21983, command: ./simpleMPI==21979== NVPROF is profiling process 21979, command: ./simpleMPI==21982== NVPROF is profiling process 21982, command: ./simpleMPI<program output>==21982== Generated result file: timeline.0.nvprof==21977== Generated result file: timeline.3.nvprof==21983== Generated result file: timeline.1.nvprof==21979== Generated result file: timeline.2.nvprof

nvprof

MPI PROFILING

Importing into the Visual Profiler

MPI PROFILINGVisual Profiler

MPI Rank-based naming

NVTX Markers & Ranges

See: https://devblogs.nvidia.com/parallelforall/gpu-pro-

tip-track-mpi-calls-nvidia-visual-profiler

MPI PROFILINGMPI + NVTX

nvtxEventAttributes_t range = {0};range.message.ascii = "MPI_Scatter";nvtxRangePushEx(range);int result = MPI_Scatter(...);nvtxRangePop();

Auto-generate mpi_interception.so

LD_PRELOAD=mpi_interception.so

Run your MPI app with nvprof.

MPI calls will be auto-annotated using NVTX.

Manual mode Interception mode

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/

MPI PROFILINGInterception

int res = MPI_Scatter(...);

MPI app Interception library(LD_PRELOAD)

int MPI_Scatter(...) { nvtxRangePushEx(range);int res = PMPI_Scatter(...);nvtxRangePop();return res;

MPI library

int PMPI_Scatter(...)

FOR MORE INFORMATION …

CUDA 9 Features Revealed Parallel Forall Blog Post : https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

CUDA Documentation: http://docs.nvidia.com/cuda/

Download CUDA Toolkit: https://developer.nvidia.com/cuda-downloads

BACKUP SLIDES

DGX-1V NVLINK TOPOLOGY

PC SAMPLING

PC sampling feature is available for device with CC >= 5.2

Provides CPU PC sampling parity + additional information for warp states/stalls reasons for GPU kernels

Effective in optimizing large kernels, pinpoints performance bottlenecks at specific lines in source code or assembly instructions

Samples warp states periodically in round robin order over all active warps

No overheads in kernel runtime, CPU overheads to parse the records

FILTER AND ANALYZE

1 Select unified memory in the unguided analysis section

2 Select required events and click on ‘Filter and Analyze’

Summary of filtered intervals

FILTER AND ANALYZEUnfiltered

CPU PAGE FAULT SOURCE CORRELATION

Unguided Analysis

Option to collect Unified Memory

information

Summary of all CPU page faults

CPU SAMPLING

• CPU profile is gathered by periodically sampling the state of each thread in the running application.

• The CPU details view summarizes the samples collected into a call-tree, listing the number of samples (or amount of time) that was recorded in each function.

NVIDIA DEVELOPER TOOLS: NEW CAPABILITIES IN CUDA 9 · 7 CUDA-MEMCHECK • Support for host API...

Documents

NVIDIA CUDA Installation Guide for Microsoft … › cuda › archive › 9.2 › pdf › CUDA...Visual C++ 10.0 DEPRECATED Visual Studio 2010 YES YES x86_32 support is limited. See

Jared Law CUDA: Super-Computing Made Easy. Jared Law NVidia CUDA: Why CUDA? What is CUDA? Where/how is CUDA being used? What does CUDA mean to programmers?

Best Practices Guide -- CUDA 2 - Nc State Universitymoss.csc.ncsu.edu/.../2.3/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf · NVIDIA CUDA C Programming Best Practices Guide . CUDA ... CUDA

Programming with CUDA · Programming with CUDA ... CUDA C programming guide – CUDA Programming 4 …

CUDA-GDB (NVIDIA CUDA Debugger)

Debugging Experience with CUDA-GBD and CUDA-MEMCHECK · 2012-11-27 · Debugging Experience with CUDA-GDB and CUDA-MEMCHECK ... CUDA Debugging Solutions C UDA-G DB (Linux & Mac) C

CUDA-GDB: The NVIDIA CUDA Debugger · CUDA Debugger User Manual Version 2.1 Beta 1 Chapter 1. Introduction 1.1 CUDA-GDB: The NVIDIA CUDA Debugger CUDA-GDB is a ported version of GDB:

CUDA-MEMCHECKdeveloper.download.nvidia.com/compute/DevZone/docs/... · CUDA-MEMCHECK DU-05355-001_v041 | 5 Chapter 02 : CUDA-MEMCHECK FEATURES Nonblocking Mode By default, on sm_20

CUDA-GDB: The NVIDIA CUDA Debuggerdeveloper.download.nvidia.com/compute/cuda/2_1/cudagdb/... · 2008-12-24 · 1.1 CUDA-GDB: The NVIDIA CUDA Debugger ... You must select Linux 32-bit

vR450 | May 2020 CUDA COMPATIBILITY CUDA … CUDA Compatibility vR450 | 6 Chapter 3. SUPPORT 3.1. Hardware Support The current hardware support is shown in Table 2. Table 2 Compute

GPUDIRECT, CUDA AWARE MPI, & CUDA IPC€¦ · Steve Abbott, February 12, 2019 GPUDIRECT, CUDA AWARE MPI,& CUDA IPC

GPU Computing with CUDA Lecture 2 - CUDA · PDF fileGPU Computing with CUDA Lecture 2 - CUDA Memories ... August, 2011 UTFSM, Valparaíso, Chile 1. ... Memory hierarchy ‣CUDA works

Introduction to Scientific Programming using GPGPU and CUDA · Introduction to Scientific Programming using GPGPU and CUDA ... (NVIDIA CUDA Programming Guide) ... CUDA C OpenCL CUDA

Debugging Experience with CUDA-GDB and CUDA …

MD-CUDA · GPGPU CUDA N-body problem ... –Application programming interface (API) –CUDA runtime –CUFFT –CUBLAS. 20 CUDA Layers. 21 GPU Architecture In CUDA Memory Addressing

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5)rob/parallel-programming... · CUDA Toolkit 1.1 CUDA Toolkit 2.0 • Win XP 64 • Atomics support Tools updates • Multi-GPU

NVIDIA CUDA Toolkit 9.0 - nw.tsuda.ac.jpNVIDIA CUDA Toolkit 9.0.176 RN-06722-001 _v9.0 | 3 Chapter 2. NEW FEATURES 2.1. General CUDA ‣ CUDA 9.0 adds support for the Volta architecture

CUDA Fortran 2013 | GTC 2013...PGI CUDA Fortran 2013 New Features Texture memory support CUDA 5.0 Dynamic Parallelism Chevron launches within global subroutines Support for allocate,

CUDA Without Cuda (CUDA Libraries) - Nvidiadeveloper.download.nvidia.com/CUDA/training/ntrotoCUDALibraries.pdf · CUDA Without Cuda (CUDA Libraries) GPU Computing Webinar 7/16/2011

DU-05355-001 v9.0 | September 2017 CUDA-MEMCHECK User …karel.tsuda.ac.jp/lec/cuda/doc_v9_0/pdf/CUDA_Memcheck.pdfCUDA-MEMCHECK can be run in standalone mode where the user's application