27
Do You Know What Your I/O Is Doing? William Gropp www.cs.illinois.edu/~wgropp

Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

Do You Know What Your I/O Is Doing?

William Gropp www.cs.illinois.edu/~wgropp

Page 2: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

2

Messages

• Current I/O performance is poor ♦ Even relative to what current systems

can achieve ♦ Part of the problem is the I/O

interface semantics • Big data is more than just I/O

♦ HPC has relevant insights

Page 3: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

3

Just How Bad Is Current I/O Performance?

•  Much of the data (and some slides) taken from “A Multiplatform Study of I/O Behavior on Petascale Supercomputers,” Huong Luu, Marianne Winslett, William Gropp, Robert Ross, Philip Carns, Kevin Harms, Prabhat, Suren Byna, and Yushu Yao, presented at HPDC’15. ♦  This paper has lots more data – consider this

presentation a sampling •  Thanks to Luu, Behzad, and the Blue Waters

staff and project for Blue Waters results ♦  Analysis part of PAID program at Blue Waters

Page 4: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

4

I/O Logs Captured By Darshan, A Lightweight I/O Characterization Tool

•  Instruments I/O functions at multiple levels

• Reports key I/O characteristics • Does not capture text I/O

functions • Low overhead à Automatically

deployed on multiple platforms.

Page 5: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

5

Caveats on Darshan Data

•  Users can opt out ♦ Not all applications recorded; typically about ½ on DOE systems

•  Data saved at MPI_Finalize ♦ Applications that don’t call MPI_Finalize,

e.g., run until time is expired and then restart from the last checkpoint, aren’t covered

•  About ½ of Blue Waters Darshan data not included in analysis ♦ Expect to be fixed soon

Page 6: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

6

I/O log dataset: 4 platforms, >1M jobs, almost 7 years combined

Intrepid Mira Edison Blue Waters

Architecture BG/P BG/Q Cray XC30 Cray XE6/XK7

Peak Flops 0.557 PF 10 PF 2.57 PF 13.34 PF Cores 160K 768K 130K 792K+59K

smx Total Storage 6 PB 24 PB 7.56 PB 26.4 PB Peak I/O Throughput

88 GB/s 240 GB/s 168 GB/s 963 GB/s

File System GPFS GPFS Lustre Lustre # of jobs 239K 137K 703K 300K Time period 4 years 18 months 9 months 6 months

Page 7: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

7

Very Low I/O Throughput Is The Norm

Page 8: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

8

System peak - 240 GB/s

10 USB

1 USB

1 B/s

1 KB/s

1 MB/s

1 GB/s

1TB/s

1 B 1 KB 1 MB 1 GB 1 TB 1 PBNumber of bytes transferred

I/O T

hrou

ghpu

t

Jobs Count

1 - 10

11 - 100

101 - 500

501 - 1k

1k1 - 5k

5k1 - 10k

Most jobs transfer little data. Many big-data jobs also have very low thruput System peak - 88 GB/s

1% peak

1 B/s

1 KB/s

1 MB/s

1 GB/s

1TB/s

1 B 1 KB 1 MB 1 GB 1 TB 1 PBNumber of bytes transferred

I/O T

hrou

ghpu

t

Jobs Count

1 - 10

11 - 100

101 - 500

501 - 1k

1k1 - 5k

5k1 - 10k

10k1 - 100k

Intrepid: Jobs I/O Throughput

Mira

Job

s’ I/

O T

hrou

ghpu

t

Page 9: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

9

Most Jobs Read/Write Little Data (Blue Waters data)

Page 10: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

10

I/O Thruput vs Relative Peak

Page 11: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

11

~50% of apps never transfer > 1GB

~20% of apps use only text I/O

Page 12: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

12

I/O Time Usage Is Dominated By A Small Number Of Jobs/Apps

Page 13: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

13

Improving the performance of the top 15 apps can save a lot of I/O time

Platform I/O time percent

Percent of platform I/O time saved if min thruput = 1 GB/s

Mira 83% 32% Intrepid 73% 31% Edison 70% 60% Blue Waters

75% 63%

Page 14: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

14

Top 15 apps with largest I/O time (Blue Waters)

• Consumed 1500 hours of I/O time (75% total system I/O time)

Page 15: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

15

POSIX I/O is far more widely used than parallel I/O libraries.

0

100,000

200,000

1 4 16 64 256 1K 4K 16K 64K 256KNumber of processesN

um

be

r o

f jo

bs interface MPI POSIX

Edison

0

20000

40000

1 4 16 64 256 1K 4K 16K 64K 256KNumber of processes

Nu

mb

er

of jo

bs Intrepid

0

10000

20000

1 4 16 64 256 1K 4K 16K 64K 256K 1MNumber of processes

Nu

mb

er

of jo

bs Mira

0

100,000

200,000

1 4 16 64 256 1K 4K 16K 64K 256KNumber of processesN

um

be

r o

f jo

bs interface MPI POSIX

Edison

0

20000

40000

1 4 16 64 256 1K 4K 16K 64K 256KNumber of processes

Nu

mb

er

of jo

bs Intrepid

0

10000

20000

1 4 16 64 256 1K 4K 16K 64K 256K 1MNumber of processes

Nu

mb

er

of jo

bs Mira

0

100,000

200,000

1 4 16 64 256 1K 4K 16K 64K 256KNumber of processesN

um

be

r o

f jo

bs interface MPI POSIX

Edison

0

20000

40000

1 4 16 64 256 1K 4K 16K 64K 256KNumber of processes

Nu

mb

er

of jo

bs Intrepid

0

10000

20000

1 4 16 64 256 1K 4K 16K 64K 256K 1MNumber of processes

Nu

mb

er

of jo

bs Mira

Page 16: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

16

What Are Some of the Problems?

•  POSIX I/O has a strong consistency model ♦  Hard to cache effectively ♦  Applications need to transfer block-aligned and sized

data to achieve performance •  Files as I/O objects add metadata “choke points”

♦  Serialize operations, even with “independent” files

•  Burst buffers will not fix these problems – must change the semantics of the operations

•  “Big Data” file systems have very different consistency models and meta data structures, designed for their application needs ♦  Why doesn’t HPC?

•  There have been some efforts, such as PVFS, but the requirement for POSIX has held up progress

Page 17: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

17

Big Data is More Than I/O

•  One example is distributed, out-of-core graph processing ♦ Constantly growing graph sizes with large

memory footprints ♦ Current distributed graph processing

frameworks assume graphs fit in memory •  Including all intermediate states •  “Easy” but expensive fix is very large memory nodes

♦ Can we use out-of-core techniques? •  This is work of Hassan Eslami, conducted

during a summer internship at Facebook

Page 18: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

18

Solution

• We need a strategy to automatically and intelligently decide which data should be in-memory or out-of-core.

• This is done by: ♦ Adaptive control of in-memory data ♦ Congestion control of incoming

messages ♦ Capacity control of outgoing

messages

Page 19: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

19

Adaptive Control of In-memory Data

• Data usage > High: offload data to disk until usage below Mid

• Data usage < Low: lazily load data of latest offload from disk

Low Mid High

Page 20: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

20

Congestion Control of Incoming Messages

Worker

Worker

Worker

Worker

Disk

Back-off

Back-off

Back-off

Page 21: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

21

Congestion Control of Incoming Messages

Worker

Worker

Worker

Worker

Disk

Resume

Resume

Resume

Page 22: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

22

Capacity Control of Outgoing Messages

• Keeps a count of outgoing on-the-fly messages per worker pair

• Limits in-transit messages per each worker pair in a two phase approach 1.  count > MAX-IN-TRANSIT: cache the

message 2.  size(cache) > MAX-CACHE-SIZE:

stop computation

Page 23: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

23

Result

0

1000

2000

3000

4000

120GB 100GB 90GB 80GB Exec

uti

on T

ime

(sec

)

Available Memory in Each Worker

Memory-CMS OOC-CMS

•  Implementation now available in Apache Giraph •  Results for PageRank on 8 workers on an input graph

where graph data and messages take roughly 650GB with CMS as garbage collection strategy

Ou

tOfM

emor

y Er

ror

Exce

ssiv

e G

C

Ove

rhea

d

Page 24: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

24

Observations

• Dealing with large graphs requires fast messaging ♦ Issues such as memory management

of “eager” data, flow control, nonblocking operations are important

♦ Latency hiding in I/O also important • Common programming model is

BSP

Page 25: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

25

Message 1

• Current I/O performance is poor ♦ Metadata operations often a significant

source of poor performance ♦ Related to mismatch between system

and user expectations • CS Challenge: Better I/O consistency and

programming models • Math Challenge: Match algorithms to

realities of (changing) hardware; need aggregates, realistic model of data transfer costs

Page 26: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

26

Message 2

• Big data is more than just I/O ♦ And more than just operations on

nearly independent data, for example…

♦ Need to handle large graphs • CS Challenge: Low latency, high

bandwidth, latency hiding programming and implementation, including multiple levels of memory hierarchy

• Math Challenge: Match algorithms to problems; exploit years of effective sparse matrix work

Page 27: Do You Know What Your I/O Is Doing?wgropp.cs.illinois.edu/bib/talks/tdata/2015/Gatlinburg.pdf · • Thanks to Luu, Behzad, and the Blue Waters ... Resume Worker Worker Worker …

27

Thanks!

•  Especially Huong Luu, Babak Behzad, Hassan Eslami

•  Funding from: ♦  NSF ♦  Blue Waters

•  Partners at ANL, LBNL; DOE funding •  Internship support for Eslami from Facebook