Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation

Chris Davis, Sophie Voisin, Devin White, Andrew Hardin

Scalable and High Performance Geocomputation Team
Geographic Information Science and Technology Group
Oak Ridge National Laboratory

GTC 2017 – May 2017

ORNL is managed by UT-Battelle for the US Department of Energy




Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work


The Story

• We are:
– Developing an HPC suite of applications
– Spread across multiple R&D teams
– In an Agile development process
– Delivering to a production environment
– Needing to support multiple systems / multiple capabilities
– Collecting performance metrics for system optimization


Why We Use NVIDIA-Docker

[Figure: comparison of NVIDIA-Docker, Docker, and Virtual Machine on Resource Optimization, GPU Access, Flexibility, and Operating Systems]


Hardware – Quadro: Compute + Display

Card         M4000    P6000
Capability   5.2      6.1
Block        32       32
SM           13       30
Cores        1664     3840
Memory       8GB      24GB


Hardware – Tesla: Compute Only

Card         K40      K80
Capability   3.5      3.7
Block        16       16
SM           15       13
Cores        2880     2496
Memory       12GB     12GB
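As a quick sanity check on the Quadro and Tesla figures, the core counts should equal the SM count times the CUDA cores per SM. The per-SM values in this sketch (192 for Kepler, 128 for Maxwell and for Pascal 6.1) are not on the slides; they are assumed from NVIDIA's published architecture specifications.

```python
# Cores per SM are assumptions taken from NVIDIA architecture documentation,
# not from the slides: Kepler (3.x) = 192, Maxwell 5.2 and Pascal 6.1 = 128.
CORES_PER_SM = {"3.5": 192, "3.7": 192, "5.2": 128, "6.1": 128}

# (card, compute capability, SM count, core count listed on the slides)
CARDS = [
    ("M4000", "5.2", 13, 1664),
    ("P6000", "6.1", 30, 3840),
    ("K40", "3.5", 15, 2880),
    ("K80", "3.7", 13, 2496),
]


def expected_cores(capability: str, sm_count: int) -> int:
    """Core count implied by the SM count and the assumed cores per SM."""
    return sm_count * CORES_PER_SM[capability]


for card, capability, sm_count, listed in CARDS:
    print(card, expected_cores(capability, sm_count), listed)
```

Under these assumptions the computed and listed core counts agree for all four cards.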


Hardware – High End

DELL C4130

GPU 4 x K80

RAM 256GB

Cores 48

SSD Storage 400GB


Constructing Containers

• Build Container:
– Based on NVIDIA images at gitlab.com
– https://gitlab.com/nvidia/cuda/tree/centos7
– CentOS 7
– CUDA 8.0 / 7.5
– cuDNN 5.1
– GCC 4.9.2
– Cores: 24
– Mount local folder with code

• Compile against chosen compute capability
• Copy product inside container
• “docker commit” container updates to new image
• “docker save” to Isilon
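The build steps above can be sketched as a small command generator. This is a minimal illustration, not the team's build script: the container name, image tag, and archive path are hypothetical, and the commands are only printed, not executed. The `-gencode arch=compute_XX,code=sm_XX` form is standard nvcc syntax for targeting one compute capability.

```python
import shlex


def gencode_flags(capabilities):
    """nvcc flags for compute capabilities written as '37', '52', '61', ..."""
    flags = []
    for cc in capabilities:
        flags += ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
    return flags


def build_commands(container, image_tag, archive_path):
    """Commands that snapshot a build container and export the image."""
    return [
        ["docker", "commit", container, image_tag],         # container -> new image
        ["docker", "save", "-o", archive_path, image_tag],  # image -> tar archive
    ]


# Hypothetical names, for illustration only.
print(shlex.join(["nvcc"] + gencode_flags(["37"]) + ["-c", "kernel.cu"]))
for cmd in build_commands("geo-build", "geo-app:cc37",
                          "/mnt/isilon/images/geo-app-cc37.tar"):
    print(shlex.join(cmd))
```

Passing several capabilities to `gencode_flags` yields the flags for an "all capabilities" build.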

[Diagram: a Git repo, Isilon storage (containers, data), and PostgreSQL (Compile Stats, Profile Stats) connected to an HPC server running NVIDIA-Docker with GPUs, CPUs, a local drive, and a container]


Running Containers

• For each compute capability:
– “docker load” from Isilon storage
– Run container & profile script
– Send nvprof results to Profile Stats DB
– Container/Image removed
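The per-capability run cycle above could look like the following sketch. The archive path, image tag, and profile script path are hypothetical, and the commands are printed rather than executed. `NV_GPU` is the environment variable that nvidia-docker 1.x reads to select GPUs.

```python
import shlex


def run_cycle_commands(archive_path, image_tag, profile_script, gpu_id="0"):
    """Environment and commands for one load / run / clean-up cycle."""
    env = {"NV_GPU": gpu_id}  # GPU selection for the nvidia-docker wrapper
    commands = [
        ["docker", "load", "-i", archive_path],                       # import image
        ["nvidia-docker", "run", "--rm", image_tag, profile_script],  # run, then drop container
        ["docker", "rmi", image_tag],                                 # remove the image
    ]
    return env, commands


# Hypothetical names, for illustration only.
env, commands = run_cycle_commands(
    "/mnt/isilon/images/geo-app-cc37.tar", "geo-app:cc37", "/opt/app/profile.sh")
print(env)
for cmd in commands:
    print(shlex.join(cmd))
```

A real driver would pass each command to `subprocess.run` with the environment merged in, and would upload the nvprof results before removing the image.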

[Diagram: Isilon storage (containers, data) and PostgreSQL (Compile Stats, Profile Stats) connected to an HPC server running NVIDIA-Docker with GPUs, CPUs, a local drive, and a container]


Hooking It All Together

[Diagram: three HPC servers, each running NVIDIA-Docker with GPUs, CPUs, a local drive, and a container, connected to Isilon storage (containers, data), a Git repo, and PostgreSQL (Compile Stats, Profile Stats)]

• One server generates containers

• All servers pull containers from Isilon

• Data to be processed pulled from Isilon

• Container build stats stored in Compiler DB

• Container execution stats stored in Profiler DB


Profiling Combinations

• nvprof
– Output parsed
– Sent to Profile DB

• Containers for:
– CUDA version
– Each capability
– All capabilities
– CPU only

• Data sets: 4

• Total of 104 profiles
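The "output parsed" step can be sketched as below. This assumes the nvprof summary was captured in CSV mode (`nvprof --csv`) with the seven columns Time(%), Time, Calls, Avg, Min, Max, Name followed by a units row; the sample text, kernel name, and numbers are made up for illustration and are not results from the study.

```python
import csv
import io

# Keys chosen to mirror the field names on the Database slide (an assumption).
FIELDS = ["step_time_percent", "step_time", "num_calls",
          "ave_time", "min_time", "max_time", "step_name"]


def parse_nvprof_csv(text):
    """Turn an nvprof CSV summary into a list of dicts, one per kernel/API row."""
    # Drop blank lines and nvprof's "==PID==" banner lines.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("==")]
    rows = list(csv.reader(io.StringIO("\n".join(lines))))
    header, units, data = rows[0], rows[1], rows[2:]
    if header[0] != "Time(%)":
        raise ValueError("unexpected nvprof CSV header")
    records = []
    for row in data:
        rec = dict(zip(FIELDS, row))
        for key in ("step_time_percent", "step_time",
                    "ave_time", "min_time", "max_time"):
            rec[key] = float(rec[key])
        rec["num_calls"] = int(rec["num_calls"])
        rec["time_unit"] = units[1]  # unit reported for the Time column
        records.append(rec)
    return records


# Illustrative input only; not measured data.
SAMPLE = '''==100== Profiling result:
"Time(%)","Time","Calls","Avg","Min","Max","Name"
%,ms,,ms,ms,ms,
75.0,30.0,3,10.0,9.0,11.0,"nmi_kernel(int*)"
25.0,10.0,2,5.0,4.0,6.0,"[CUDA memcpy HtoD]"
'''

for record in parse_nvprof_csv(SAMPLE):
    print(record)
```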

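[Diagram: profiling combinations across CUDA 7.5 and CUDA 8.0 containers (CPU; compute capabilities 3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1; all capabilities), data sets D1 to D4, and GPUs K40, K80, M4000, and P6000]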


Database

• Postgres Databases
– Shared Fields
– Compile DB
– Run Time DB
– NVPROF DB

[Diagram: database fields. Labels shown: Hostname, CUDA Version, Compute Capability, Compile Time, Execution Time, GPU Device, Dataset (twice), Num CPU Threads (three times), Timestamp (three times), Kernel / API Call, Step Name, Step Time, Step Time Percent, Num Calls, Ave Time, Min Time, Max Time]
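A possible layout for the three databases is sketched below, using Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere. The table names and the assignment of fields to tables are assumptions; the slide lists the field names but the extracted text does not show how they are grouped. The inserted row is a placeholder, not a measurement.

```python
import sqlite3

# Hostname, CUDA version, and compute capability are treated as shared fields
# (an assumption); the remaining columns follow the labels on the slide.
SCHEMA = """
CREATE TABLE compile_stats (
    hostname TEXT, cuda_version TEXT, compute_capability TEXT,
    num_cpu_threads INTEGER, compile_time_s REAL, timestamp TEXT
);
CREATE TABLE run_stats (
    hostname TEXT, cuda_version TEXT, compute_capability TEXT,
    gpu_device TEXT, dataset TEXT, num_cpu_threads INTEGER,
    execution_time_s REAL, timestamp TEXT
);
CREATE TABLE nvprof_stats (
    hostname TEXT, cuda_version TEXT, compute_capability TEXT,
    dataset TEXT, num_cpu_threads INTEGER, kernel_or_api TEXT,
    step_name TEXT, step_time REAL, step_time_percent REAL,
    num_calls INTEGER, ave_time REAL, min_time REAL, max_time REAL,
    timestamp TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder row, for illustration only.
conn.execute(
    "INSERT INTO run_stats VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("host-a", "8.0", "3.7", "K80", "D1", 6, 1.0, "2017-05-01T00:00:00"),
)

# Example report query: average execution time per data set and CUDA version.
query = """
SELECT dataset, cuda_version, AVG(execution_time_s)
FROM run_stats GROUP BY dataset, cuda_version
"""
for row in conn.execute(query):
    print(row)
```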


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work


Example HPC Application

• Geospatial metadata generator
– Leverages open source 3rd-party libraries
  • OpenCV, Caffe, GDAL, …
– Computer vision algorithms – GPU enabled
  • SURF, ORB, NCC, NMI…
– Automated matching against control data
– Calculates geospatial metadata for input imagery

[Images: Satellites, Manned Aircraft, Unmanned Aerial Systems]


Example HPC Application - GTC16

• Two-step Image Re-alignment Application using NMI

[Figure: processing pipeline with stages Input Image, Preprocessing, Source Selection, Global Localization, Registration, Resection, and Metadata / Output Image; stages are marked CPU or GPU]

Core Libraries:
• NITRO
• GDAL
• Proj.4
• libpq (Postgres)
• OpenCV
• CUDA
• OpenMP

Normalized Mutual Information

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}$$

Histograms: S = source, C = control, J = joint


Example HPC Application - GTC16

• Global Localization

• Objective
– Re-align the source image with the control image.

• Method (in-house implementation)
– Roughly match source and control images.
– Coarse resolution
– Mask for non-valid data
– Exhaustive search

Control 382x100
Tactical 258x67
Solutions 4250
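The solution count follows from the two image sizes if every integer offset at which the tactical (source) image fits entirely inside the control image is a candidate. That reading is an assumption, but it reproduces the 4250 on the slide:

```python
def solution_count(control_w, control_h, source_w, source_h):
    """Number of offsets at which the source fits fully inside the control."""
    if source_w > control_w or source_h > control_h:
        raise ValueError("source image must fit inside the control image")
    return (control_w - source_w + 1) * (control_h - source_h + 1)


# Control 382x100, Tactical 258x67: 125 horizontal x 34 vertical offsets.
print(solution_count(382, 100, 258, 67))
```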


Example HPC Application - GTC16

• Global Localization


Example HPC Application - GTC16

• Similarity Metric
– Normalized Mutual Information
– Histogram with masked area
  • Missing data
  • Artifact
  • Homogeneous area

Source image and mask: $N_S \times M_S$ pixels

Control image and mask: $N_C \times M_C$ pixels

Solution space: $n \times m$ NMI coefficients

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}$$

$$H = -\sum_{i} p(i) \log_2 p(i)$$

$H$ is the entropy; $p(i)$ is the probability density function.

$i \in \{0, \dots, 255\}$ for $S$ and $C$; $i \in \{0, \dots, 65535\}$ for $J$
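The definitions above translate into a short reference implementation. This is a plain-Python sketch of the metric for one candidate alignment, assuming 8-bit intensities and a boolean validity mask; it is not the GPU kernel from the talk, and the function names are illustrative.

```python
import math


def entropy(counts, total):
    """Shannon entropy in bits of a histogram with `total` samples."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def nmi(source, control, mask=None):
    """NMI = (H_S + H_C) / H_J for two equal-length 8-bit pixel sequences.

    `mask`, if given, holds one boolean per pixel (True = valid pixel).
    """
    hist_s = [0] * 256      # source histogram, bins 0..255
    hist_c = [0] * 256      # control histogram, bins 0..255
    hist_j = [0] * 65536    # joint histogram, bins 0..65535
    n = 0
    for k, (s, c) in enumerate(zip(source, control)):
        if mask is not None and not mask[k]:
            continue        # skip missing data, artifacts, and so on
        hist_s[s] += 1
        hist_c[c] += 1
        hist_j[s * 256 + c] += 1
        n += 1
    if n == 0:
        raise ValueError("no valid pixels")
    h_j = entropy(hist_j, n)
    if h_j == 0:
        raise ValueError("joint entropy is zero (homogeneous area)")
    return (entropy(hist_s, n) + entropy(hist_c, n)) / h_j


print(nmi([0, 1, 2, 3], [0, 1, 2, 3]))  # identical images
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent images
```

With this definition, identical images give 2 and independent images give 1, so a larger value indicates a better alignment.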


Example HPC Application - GTC16

Summary
• Global Localization as coarse re-alignment
– Problem: joint histogram computation for each solution
  • No compromise on the number of bins - 65536
  • Exhaustive search
– Solution: leverage the K80 specifications
  • 12 GB of memory
  • 1 thread per solution
  • Less than 25 seconds - 61K solutions for a 131K pixel image

Kernel specifications (1 solution / thread)

occupancy                   100%
threads / block             128
stack frame                 264192
total memory / block        33.81 MB
total memory / SM           541.06 MB
total memory / GPU          7.03 GB
memory %                    61.06%
spill stores – spill loads  0 – 0
registers                   27
smem / block                0
smem / SM                   0
smem %                      0.00%
cmem[0] – cmem[2]           448 – 20
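The memory rows of the kernel specifications can be reproduced from the stack frame size. The sketch below makes two assumptions that the slides do not state: that the 264192-byte stack frame is one 65536-bin joint histogram plus two 256-bin histograms of 4-byte counters, and that the "Block 16" entry of the Tesla hardware table is the number of resident blocks per SM.

```python
JOINT_BINS = 65536
MARGINAL_BINS = 256
BYTES_PER_BIN = 4   # assumption: 32-bit counters

# One joint histogram plus source and control histograms, per thread.
stack_frame = (JOINT_BINS + 2 * MARGINAL_BINS) * BYTES_PER_BIN

threads_per_block = 128   # from the kernel specifications
blocks_per_sm = 16        # assumption: "Block" row of the Tesla table
sm_count = 13             # K80, from the Tesla table

per_block = stack_frame * threads_per_block
per_sm = per_block * blocks_per_sm
per_gpu = per_sm * sm_count

print("stack frame (bytes):", stack_frame)
print("per block (MB):", per_block / 1e6)
print("per SM (MB):", per_sm / 1e6)
print("per GPU (GB):", per_gpu / 1e9)
```

In decimal units these come to roughly 33.8 MB, 541.1 MB, and 7.03 GB, in line with the table, which suggests that per-thread histograms account for the kernel's memory footprint.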


Example HPC Application - GTC16

• Registration

Control 382x100
Tactical 258x67


Example HPC Application - GTC16

• Registration

Control 382x100
Tactical 258x67
Tactical & Control 4571x1555

• Objective
– Refine the localization

• Method
– Use higher resolution (~400 times)
– Keypoint matching


Example HPC Application - GTC16

• Registration Workflow

[Diagram: registration workflow. Labels shown: Source Image, detect, Keypoint list, describe, Descriptor; Control Image, Keypoint list, Search Windows, Descriptor; metric; Tiepoint list]

Descriptors: 11x11 intensity values

Search windows: 73x73 pixels


Application

• Similarity Metric
– Normalized Mutual Information
– Small “images” but numerous keypoints
  • Numerous keypoints
    – up to 65536 with GPU SURF detector
  • Image / Descriptor size
    – 11 x 11 intensity values to describe
  • Search area
    – 73 x 73 control sub-image
  • Solution space
    – 63 x 63 = 3969 / keypoint

Descriptors: 11x11 intensity values

Search windows: 73x73 pixels

Solution spaces: 63x63 NMI coefficients

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}$$

$$H = -\sum_{i} p(i) \log_2 p(i)$$

$H$ is the entropy; $p(i)$ is the probability density function.

$i \in \{0, \dots, 255\}$ for $S$ and $C$; $i \in \{0, \dots, 65535\}$ for $J$
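The size of the registration search follows from the numbers above, assuming the 11 x 11 descriptor is slid over every position at which it fits inside the 73 x 73 search window:

```python
window = 73            # search window side, in pixels
descriptor = 11        # descriptor side, in pixels
max_keypoints = 65536  # upper bound quoted for the GPU SURF detector

offsets_per_axis = window - descriptor + 1      # 63 positions per axis
solutions_per_keypoint = offsets_per_axis ** 2  # 63 x 63 = 3969
total_coefficients = solutions_per_keypoint * max_keypoints

print(offsets_per_axis, solutions_per_keypoint, total_coefficients)
```

At the maximum keypoint count this gives about 260 million NMI coefficients, the figure quoted in the registration summary.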


Example HPC Application - GTC16

Summary
• Registration refines the re-alignment
– Problem: joint histogram computation for each solution
  • No compromise on the number of bins - 65536
  • Exhaustive search
– Solution: leverage the K80 specifications
  • 12 GB of memory
  • 1 block per solution
  • Leverage the number of values of the descriptors: 121 (maximum) << 65536
  • Less than 100 seconds - 65K keypoints, 260M NMI coefficients
  • About 10K keypoints in less than 20 seconds

[Diagram: list of indices for the source and list of indices for the corresponding control subset, combined into the joint histogram]

Kernel
Find the best match for all keypoints
1 block per keypoint
Optimize for the 63 x 63 search windows
64 threads / block – 1 idle; each thread computes a “row” of solutions

Sparse joint histogram
65536 bins but only 121 values
Leverage the 11 x 11 descriptor size
Create 2 lists (length 121) of intensity values
Update joint histogram count from lists
Loop over lists to retrieve aggregate count
Set aggregate count to 0 after first retrieval
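The sparse joint histogram bookkeeping can be illustrated on the CPU. The sketch below follows the listed steps in plain Python, assuming 8-bit intensities; it is not the CUDA kernel, and the function name is illustrative. Because each bin is zeroed on first retrieval, repeated value pairs are counted once and the 65536-bin buffer is clean for the next candidate without a full reset.

```python
import math


def sparse_joint_entropy(src_vals, ctl_vals, joint):
    """Joint entropy in bits, visiting only the bins the value lists touch.

    `joint` is a preallocated list of 65536 zeros; it is all zeros again
    when the function returns.
    """
    n = len(src_vals)
    # Update joint histogram counts from the two intensity lists.
    for s, c in zip(src_vals, ctl_vals):
        joint[s * 256 + c] += 1
    # Loop over the lists again to retrieve each aggregate count once.
    h = 0.0
    for s, c in zip(src_vals, ctl_vals):
        count = joint[s * 256 + c]
        if count:
            p = count / n
            h -= p * math.log2(p)
            joint[s * 256 + c] = 0  # zero after first retrieval
    return h


buffer = [0] * 65536
print(sparse_joint_entropy([0, 0, 1, 1], [0, 1, 0, 1], buffer))
print(any(buffer))  # the buffer is clean again
```

For an 11 x 11 descriptor the two lists hold 121 values each, so each candidate costs on the order of 121 operations rather than a pass over 65536 bins.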


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work


Compile Time Results

[Chart: size of binary files in MB and time in seconds versus compute capability specification (OFF, 30, 35, 37, 50, 52, 60, 61, 30 - 52, 30 - 61), for CUDA 7.5 and CUDA 8.0]


Run Time Results

[Charts: average run time in seconds (axis 0 to 200) for data sets D1, D2, D3, and D4, each comparing CPU, CUDA 7.5, and CUDA 8]


K80 - Kernel Time Results in Seconds with nvprof

[Chart: Step 1 kernel timings versus CUDA version (7.5 and 8) for data sets D1 to D4, axis 0.1 to 0.35; series: average, min, max, std]

[Chart: Step 2 kernel timings versus CUDA version (7.5 and 8) for data sets D1 to D4, axis 10 to 35; series: average, min, max, std]


Run Time Results

[Charts: Step 2 kernel time in seconds (axis 0 to 200) for data sets D1, D2, D3, and D4, each comparing CUDA 7.5 and CUDA 8 on the K40, K80, M4000, and P6000; series: average, min, max, std]


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work


Lessons Learned

• GPU isolation: ran into an issue when swapping out the P6000 and K40.
– nvidia-smi swapped the GPU IDs for the K40 and M4000.
– This caused nvidia-docker to ignore the NV_GPU value
– UUID vs Index
– Our application can set the GPU index for a multi-GPU environment
  • (default to 0)
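One way to act on the "UUID vs Index" point is to select GPUs by UUID, which stays fixed when cards are swapped and indices shift. The sketch below parses the output of `nvidia-smi --query-gpu=index,uuid,name --format=csv,noheader` and builds an `NV_GPU` value; nvidia-docker 1.x is documented as accepting either indices or UUIDs there, but treat that as an assumption to verify on your version. The sample output, UUIDs, and device names are placeholders.

```python
import csv
import io


def gpu_uuids_by_name(smi_csv):
    """Map GPU product name to the list of UUIDs reported for it."""
    mapping = {}
    for row in csv.reader(io.StringIO(smi_csv), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        _index, uuid, name = row
        mapping.setdefault(name.strip(), []).append(uuid.strip())
    return mapping


def nv_gpu_for(name, smi_csv):
    """Comma-separated UUID list suitable for the NV_GPU variable."""
    return ",".join(gpu_uuids_by_name(smi_csv)[name])


# Placeholder output; real UUIDs and names come from nvidia-smi.
SAMPLE = """0, GPU-00000000-0000-0000-0000-000000000001, Quadro M4000
1, GPU-00000000-0000-0000-0000-000000000002, Tesla K40c
"""

print(nv_gpu_for("Tesla K40c", SAMPLE))
```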


Future Work

• Move off Desktop machines to full testing platform with dedicated hardware with multiple GPU types

• Investigate Docker Registry & Docker Swarm for managing containers

• Enhance Database analysis to autogenerate reports

• Generalize the process to containerize any GPU application to profile with this architecture


Thank you!


Customer Resources

DELL C4130

GPU 4 x K80

RAM 256GB

Cores 48

SSD Storage 400GB

[Chart: run time with 6 threads in seconds (axis 0 to 50) for data sets D1 to D4, comparing CPU and CUDA 7.5]