Initial Characterization of I/O in Large-Scale Deep Learning … · 2018-11-11 · Deep Learning (DL) applications demand large-scale computing facilities. DL applications require

Fahim Chowdhury, Jialin Liu,

Quincey Koziol, Thorsten Kurth,

Steven Farrell, Suren Byna,

Prabhat, and Weikuan Yu

Initial Characterization of

I/O in Large-Scale Deep

Learning Applications

- 1 -

November 12, 2018

Outline

- 2 -

Objectives

DL Benchmarks at NERSC

Profiling Approaches

Experimental Results

Future Work

Outline

- 3 -

Objectives




Future Work

Objectives

Deep Learning (DL) applications demand large-scale computing facilities.

DL applications require efficient I/O support in the data processing pipeline to

accelerate the training phase.

The goals of this project are

Exploring I/O patterns invoked through multiple DL applications running on

HPC systems

Addressing possible bottlenecks caused by I/O in the training phase

Developing optimization strategies to overcome the possible I/O bottlenecks

4

Objectives

Deep Learning (DL) applications demand large-scale computing facilities.

DL applications require efficient I/O support in the data processing pipeline to

accelerate the training phase.

The goals of this project are

Exploring I/O patterns invoked through multiple DL applications running on

HPC systems

Addressing possible bottlenecks caused by I/O in the training phase

Developing optimization strategies to overcome the possible I/O bottlenecks

5

Outline

- 6 -

Objectives




Future Work

HEPCNNB Overview

High Energy Physics Deep Learning Convolutional Neural Network Benchmark

(HEPCNNB)

Runs on distributed TensorFlow using Horovod

Can generate particle events that can be described by standard model physics

and particle events with R-parity violating Supersymmetry

Uses a 496 GB dataset of 2048 HDF5 files representing particle collisions

generated by a fast Monte-Carlo generator named Delphes at CERN

7

CDB Overview

Climate Data Benchmark (CDB)

Runs on distributed TensorFlow using Horovod

Can act as an image recognition model to detect patterns for extreme weather

Uses a 3.5 TB dataset of 62738 HDF5 images representing climate data

Leverages TensorFlow Dataset API and python’s multiprocessing package for

input pipelining

8

Outline

- 9 -

Objectives




Future Work


Develop TimeLogger tool based on python to profile application layer

Determine the total latency from merged interval list for each training component

Explore TensorFlow Runtime Tracing Metadata Visualization (TRTMV) tool

developed at Google and extract I/O specific metadata

Working on integration of runtime metadata from application and framework layer

Work available in: https://github.com/NERSC/DL-Parallel-IO

10

https://github.com/NERSC/DL-Parallel-IO

Outline

- 11 -

Objectives




Future Work

HEPCNNB Latency Breakdown

- 12 -

Local Shuffle Global Shuffle

3.60%

3.08%

3.17%

2.91%1.44%

8.01%

7.72%

6.83%

6.16%

1.49%

I/O takes more time when Global Shuffling is introduced

Global Shuffling affects I/O even for small dataset and only 5 epochs training

I/O bottleneck can become more severe with increasing epochs

HEPCNNB Read Bandwidth

- 13 -

I/O takes more time when Global Shuffling is introduced

Global Shuffling affects I/O even for small dataset and only 5 epochs training

I/O bottleneck can become more severe with increasing epochs

Local Shuffle Global Shuffle

9.9121.69

44.20

91.53

194.99

4.33 8.76 15.9030.86

187.99

CDB Latency and Read Bandwidth

- 14 -

The percentage of I/O in the training process is more when dataset is larger

The I/O percentage increases with the number of nodes

Training benefits more from the scaling than I/O

8.73%

15.05%

10.63%

11.04% 0.30 0.33

1.14

2.57

Outline

- 15 -

Objectives




Future Work

Future Work

To integrate TRTMV results with TimeLogger data for better profiling of highly

parallelized I/O pipeline

To explore the I/O patterns and determine possible I/O bottlenecks in distributed

TensorFlow

To develop an optimized cross-framework I/O strategy to overcome the possible

I/O bottlenecks

16

- 17 -

Thank You

Documents

Initial Characterization of I/O in Large-Scale Deep Learning … · 2018-11-11 · Deep Learning (DL) applications demand large-scale computing facilities. DL applications require