
Research @ Northeastern University


Page 1: Research @ Northeastern University


Research @ Northeastern University

• I/O storage modeling and performance – David Kaeli

• Soft error modeling and mitigation – Mehdi B. Tahoori

Page 2: Research @ Northeastern University

I/O Storage Research at Northeastern University

David Kaeli, Yijian Wang

Department of Electrical and Computer Engineering, Northeastern University

Boston, MA
[email protected]

Page 3: Research @ Northeastern University


Outline

• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work

Page 4: Research @ Northeastern University


Important File-based I/O Workloads

• Many subsurface sensing and imaging workloads involve file-based I/O
  – Cellular biology – in-vitro fertilization with NU biologists
  – Medical imaging – cancer therapy with MGH
  – Underwater mapping – multi-sensor fusion with Woods Hole Oceanographic Institution
  – Ground-penetrating radar – toxic waste tracking with Idaho National Labs

Page 5: Research @ Northeastern University


The Impact of Profile-guided Parallelization on SSI Applications

• Reduced the runtime of a single-body Steepest Descent Fast Multipole Method (SDFMM) application by 74% on a 32-node Beowulf cluster
  – Hot-path parallelization
  – Data restructuring
• Reduced the runtime of a Monte Carlo scattered-light simulation by 98% on a 16-node Silicon Graphics Origin 2000
  – Matlab-to-C compilation
  – Hot-path parallelization
• Obtained superlinear speedup of the Ellipsoid Algorithm run on a 16-node IBM SP2
  – Matlab-to-C compilation
  – Hot-path parallelization

[Figure: ground-penetrating radar scene showing soil, air, and a buried mine]

[Chart: Scattered Light Simulation Speedup – run time in seconds (log scale, 1 to 100,000) for the Original, Matlab-to-C, and hot-path-parallelized versions]

[Chart: Ellipsoid Algorithm Speedup (versus serial C version) – speedup vs. number of nodes (1, 2, 4, 8, 16) for 64-, 256-, and 1024-vector problem sizes, against linear speedup]

Page 6: Research @ Northeastern University


Limits of Parallelization

• For compute-bound workloads, Beowulf clusters can be used effectively to overcome computational barriers
• Middleware (e.g., MPI and MPI-IO) can significantly reduce the programming effort on parallel systems
• Multiple clusters can be combined using Grid middleware (the Globus Toolkit)
• For file-based, I/O-bound workloads, Beowulf clusters and Grid systems are presently ill-suited to exploit the potential parallelism present on these systems

Page 7: Research @ Northeastern University


Outline

• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work

Page 8: Research @ Northeastern University


Parallel I/O Acceleration

• The I/O bottleneck
  – The growing gap between the speed of processors, networks, and underlying I/O devices
  – Many imaging and scientific applications access disks very frequently
• I/O-intensive applications
  – Out-of-core applications: work on large datasets that cannot fit in main memory (see the sketch below)
  – File-intensive applications: access file-based datasets frequently, with a large number of file operations
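The out-of-core pattern is easy to picture in code. Below is a minimal, hypothetical Python sketch (not from the slides): the dataset is streamed through a fixed-size buffer rather than loaded into main memory whole.

# Minimal sketch of the out-of-core pattern: process a file far larger than
# main memory by streaming it through a fixed-size buffer. Illustrative only.
def out_of_core_checksum(path: str, chunk_bytes: int = 1 << 20) -> int:
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):  # one buffer-sized piece at a time
            total = (total + sum(chunk)) % (1 << 32)
    return total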

Page 9: Research @ Northeastern University


Introduction

• Storage architectures
  – Direct Attached Storage (DAS): the storage device is directly attached to the computer
  – Network Attached Storage (NAS): the storage subsystem is attached to a network of servers, and file requests are passed through a parallel filesystem to the centralized storage device
  – Storage Area Network (SAN): a dedicated network providing any-to-any connections between processors and disks

Page 10: Research @ Northeastern University


I/O Partitioning

[Diagram: an I/O-intensive application shown with a single process and disk; with data striping across multiple disks (i.e., RAID); with multiple processes (i.e., MPI-IO); and with data partitioning, where each process accesses its own disk]

Page 11: Research @ Northeastern University


I/O Partitioning

• I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning); a minimal MPI-IO sketch follows below
• Ideally, every process will only access files on its local disk (though this is typically not possible due to data sharing)
• How do we recognize the access patterns?
• Profile-guided approach
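To make the application-level half concrete, here is a minimal sketch using MPI-IO through mpi4py (an assumed binding; the original applications are MPI codes, not necessarily Python). Each rank writes its own contiguous region of one shared file, much like the Perf benchmark described later; run it with, e.g., mpiexec -n 4 python write_chunks.py.

# Minimal MPI-IO sketch (via mpi4py): every rank writes one chunk of a shared
# file at an offset determined by its rank, so the writes proceed in parallel.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

chunk = np.full(1 << 20, rank % 256, dtype=np.uint8)  # this rank's 1 MB chunk
fh = MPI.File.Open(comm, "shared.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at(rank * chunk.nbytes, chunk)               # rank-determined offset
fh.Close()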

Page 12: Research @ Northeastern University


Profile Generation

Run the application

Capture I/O execution profiles

Apply our partitioning algorithm

Rerun the tuned application

Page 13: Research @ Northeastern University


I/O Traces and Partitioning

• For every process, for every contiguous file access, we capture the following I/O profile information (a sketch of one such record follows below):
  – Process ID
  – File ID
  – Address
  – Chunk size
  – I/O operation (read/write)
  – Timestamp
• Generate a partition for every process
• Optimal partitioning is NP-complete, so we develop a greedy algorithm
• We have found we can use partial profiles to guide partitioning
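A minimal sketch of how one such trace record could be represented (the field names are illustrative assumptions, not the original profiler's format):

# One I/O trace record, with a field for each profiled attribute listed above.
from dataclasses import dataclass

@dataclass
class IORecord:
    pid: int          # process ID
    file_id: int      # file ID
    address: int      # starting offset of the contiguous access
    chunk_size: int   # bytes transferred
    op: str           # "read" or "write"
    timestamp: float  # when the access occurred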

Page 14: Research @ Northeastern University


Greedy File Partitioning Algorithm

for each I/O process, create a partition;
for each contiguous data chunk {
    total up the # of read/write accesses on a process-ID basis;
    if the chunk is accessed by only one process
        assign the chunk to the associated partition;
    if the chunk is read (but never written) by multiple processes
        duplicate the chunk in all partitions where read;
    if the chunk is written by one process, but later read by multiple
        assign the chunk to all partitions where read,
        and broadcast the updates on writes;
    else
        assign the chunk to a shared partition;
}
for each partition
    sort chunks based on the earliest timestamp for each chunk;
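For concreteness, here is a minimal Python rendering of the pass above, operating on the IORecord sketch from the previous slide. It follows the pseudocode's structure, with one simplification: the "written by one process, but later read by multiple" test is approximated by counting writers and readers rather than by ordering accesses in time. Names are illustrative, not the original implementation.

# Sketch of the greedy pass: tally readers/writers per chunk, then assign each
# chunk to private, replicated, or shared partitions as in the pseudocode.
from collections import defaultdict

def partition(records):
    readers = defaultdict(set)   # chunk -> process IDs that read it
    writers = defaultdict(set)   # chunk -> process IDs that write it
    first_seen = {}              # chunk -> earliest access timestamp
    for r in records:
        key = (r.file_id, r.address)
        (readers if r.op == "read" else writers)[key].add(r.pid)
        first_seen[key] = min(first_seen.get(key, r.timestamp), r.timestamp)

    parts = defaultdict(list)    # one partition per process, plus "shared"
    for key in first_seen:
        rd, wr = readers[key], writers[key]
        owners = rd | wr
        if len(owners) == 1:                # touched by a single process
            parts[owners.pop()].append(key)
        elif not wr:                        # read-only by many: replicate
            for p in rd:
                parts[p].append(key)
        elif len(wr) == 1 and len(rd) > 1:  # one writer, many readers:
            for p in rd | wr:               # replicate, and let the writer
                parts[p].append(key)        # broadcast updates on each write
        else:                               # conflicting writers
            parts["shared"].append(key)

    for chunks in parts.values():           # order by earliest access time
        chunks.sort(key=lambda k: first_seen[k])
    return parts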

Page 15: Research @ Northeastern University


Parallel I/O Workloads

• NAS Parallel Benchmarks (NPB2.4)/BT
  – Computational fluid dynamics
  – Generates a file (~1.6 GB) dynamically and then reads it back
  – Writes/reads sequentially in chunk sizes of 2040 bytes
• SPEChpc96/seismic
  – Seismic processing
  – Generates a file (~1.5 GB) dynamically and then reads it back
  – Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB
• Tile-IO
  – Parallel Benchmarking Consortium
  – Tiled access to a two-dimensional matrix (~1 GB) with overlap
  – Writes/reads sequential chunks of 32 KB, with 2 KB of overlap
• Perf
  – Parallel I/O test program within MPICH
  – Writes a 1 MB chunk at a location determined by rank, with no overlap
• Mandelbrot
  – An image-processing application that includes visualization
  – Chunk size is dependent on the number of processes

Page 16: Research @ Northeastern University


Beowulf Cluster

[Diagram: P2 350 MHz nodes connected by a 10/100 Mb Ethernet switch, with RAID nodes on the network and local PCI-IDE disks attached to the compute nodes]

Page 17: Research @ Northeastern University


Hardware Specifics

• DAS configuration
  – Linux box, Western Digital WD800BB (IDE), 80 GB, 7200 RPM
• Beowulf cluster (base configuration)
  – Fast Ethernet, 100 Mbits/sec
  – Network-attached RAID: Morstor TF200 with six 9 GB Seagate SCSI disks, 7200 RPM, RAID-5
  – Locally attached IDE disks: IBM UltraATA-350840, 5400 RPM
• Fibre Channel disks
  – Seagate Cheetah X15 ST-336752FC, 15000 RPM

Page 18: Research @ Northeastern University


Write/Read Bandwidth

[Charts: bandwidth (MB/sec) for Unix write/read, MPI-IO write/read, and P-IO write/read; NPB2.4/BT with 4, 9, 16, and 25 processes, and SPEChpc/seis with 4, 8, 16, and 24 processes]

Page 19: Research @ Northeastern University


Write/Read Bandwidth

[Charts: bandwidth (MB/sec) for MPI write/read and PIO write/read on MPI-Tile, Perf, and Mandelbrot, each with 4, 8, 16, and 24 processes]

Page 20: Research @ Northeastern University


Total Execution Time

[Chart: total execution time (seconds) for MPI-IO versus PIO]

Page 21: Research @ Northeastern University


Profile Training Sensitivity Analysis

• We have found that I/O access patterns are independent of file-based data values
• When we increase the problem size or reduce the number of processes, either:
  – the number of I/Os increases, but access patterns and chunk size remain the same (SPEChpc96, Mandelbrot), or
  – the number of I/Os and the I/O access patterns remain the same, but the chunk size increases (NPB/BT, Tile-IO, Perf)
• Re-profiling can be avoided

Page 22: Research @ Northeastern University


Execution-driven Parallel I/O Modeling

• Growing need to process large, complex datasets in high-performance parallel computing applications
• Efficient implementation of storage architectures can significantly improve system performance
• An accurate simulation environment lets users test and evaluate different storage architectures and applications

Page 23: Research @ Northeastern University


Execution-driven I/O Modeling

• Target applications: parallel scientific programs (MPI)
• Target machine/host machine: Beowulf clusters
• Use DiskSim as the underlying disk-drive simulator
• Direct execution to model CPU and network communication
• We execute the real parallel I/O accesses and, meanwhile, calculate the simulated I/O response time (a minimal sketch of this idea follows below)
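A minimal sketch of that last point, under stated assumptions: disksim_response_time() below stands in for a query to the underlying DiskSim model (the actual binding is not shown in the slides), and the cost model inside it is a toy placeholder.

# Direct-execution sketch: perform each real I/O access, but account for it
# using the simulated device's response time instead of the measured time.
import os

simulated_io_time = 0.0   # accumulated simulated I/O response time

def disksim_response_time(op: str, offset: int, size: int) -> float:
    # Placeholder for a real DiskSim query; crude seek + transfer model.
    return 5e-3 + size / 50e6

def modeled_pread(fd: int, size: int, offset: int) -> bytes:
    global simulated_io_time
    data = os.pread(fd, size, offset)            # execute the real access...
    simulated_io_time += disksim_response_time(  # ...but charge model time
        "read", offset, size)
    return data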

Page 24: Research @ Northeastern University


Validation – Synthetic I/O Workload on DAS

[Charts: modeled versus measured ("real") response times (seconds) for 1000 sequential writes and 1000 sequential reads at access sizes of 1–16 blocks, and for 1000 non-contiguous writes and reads (access size = 1 block) at seek distances of 1–32 blocks]

Page 25: Research @ Northeastern University


Simulation Framework – NAS

[Diagram: compute nodes issue logical file-access addresses over a LAN/WAN to a network file system; per-node local I/O traces are merged into a global I/O trace, and filesystem metadata maps I/O requests through a RAID controller into DiskSim]

Page 26: Research @ Northeastern University


Execution Time of NPB2.4/BT on NAS – base configuration

[Chart: modeled versus measured ("real") execution time (seconds) for 4, 9, 16, and 25 processors]

Page 27: Research @ Northeastern University


Simulation Framework – SAN-direct

• A variant of SAN in which disks are distributed across the network and each server is directly connected to a single device
• File partitioning
• Utilize I/O profiling and data-partitioning heuristics to distribute portions of files to disks close to the processing nodes

[Diagram: file systems connected over a LAN/WAN, each producing I/O traces that feed a per-node DiskSim instance]

Page 28: Research @ Northeastern University


Execution Time of NPB2.4/BT on SAN-direct – base configuration

[Chart: modeled versus measured ("real") execution time (seconds) for 4, 9, 16, and 25 processors]

Page 29: Research @ Northeastern University


Hardware Specifications

Page 30: Research @ Northeastern University


I/O Bandwidth of SPEChpc/seis

[Chart: bandwidth (MB/s) for 4, 8, and 16 processors across storage architectures: NAS-joulian, NAS-ATA, NAS-SCSI, NAS-FC, SAN-joulian, SAN-direct-ATA, SAN-direct-SCSI, and SAN-direct-FC]

Page 31: Research @ Northeastern University


I/O Bandwidth of Mandelbrot

[Chart: bandwidth (MB/s) for 4, 8, and 16 processors across the same eight storage architectures]

Page 32: Research @ Northeastern University


Publications

• “Profile-guided File Partitioning on Beowulf Clusters,” Journal of Cluster Computing, Special Issue on Parallel I/O, to appear 2005.
• “Execution-Driven Simulation of Network Storage Systems,” Proceedings of the 12th ACM/IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, pp. 604-611.
• “Profile-Guided I/O Partitioning,” Proceedings of the 17th ACM International Symposium on Supercomputing, June 2003, pp. 252-260.
• “Source Level Transformations to Apply I/O Data Partitioning,” Proceedings of the IEEE Workshop on Storage Network Architecture and Parallel I/O, October 2003, pp. 12-21.
• “Profile-Based Characterization and Tuning for Subsurface Sensing and Imaging Applications,” International Journal of Systems, Science and Technology, September 2002, pp. 40-55.

Page 33: Research @ Northeastern University


Summary of Cluster-based Work

• Many imaging applications are dominated by file-based I/O
• Parallel systems can only be effectively utilized if I/O is also parallelized
• Developed a profile-guided approach to I/O data partitioning
• Impacting clinical trials at MGH
• Reduced overall execution time by 27-82% over MPI-IO
• The execution-driven I/O model is highly accurate and provides significant modeling flexibility

Page 34: Research @ Northeastern University


Outline

• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work

Page 35: Research @ Northeastern University


I/O Qualification Laboratory

• Working with the Enterprise Strategy Group
• Develop a state-of-the-art facility to provide independent performance qualification of Enterprise Storage (ES) systems
• Provide a quarterly report to the ES customer base on the status of current ES offerings
• Work with leading ES vendors to provide them with custom early performance evaluation of their beta products

Page 36: Research @ Northeastern University


I/O Qualification Laboratory

• Contacted by IOIntegrity and SANGATE for product qualification

• Developed potential partnerships with leaders in the ES field

• Initial proposals already reviewed by IBM, Hitachi and other ES vendors

• Looking for initial endorsement from industry

Page 37: Research @ Northeastern University


I/O Qualification Laboratory

• Why @ NU
  – Track record with industry (EMC, IBM, Sun)
  – Experience with benchmarking and I/O characterization
  – Interesting set of applications (medical, environmental, etc.)
  – Great opportunity to work within the cooperative education model

Page 38: Research @ Northeastern University


Outline

• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work

Page 39: Research @ Northeastern University


Areas for Future Work

• Designing a peer-to-peer storage system on a Grid by partitioning datasets across geographically distributed storage devices

[Diagram: two clusters connected over the Internet: joulian.hpcl.neu.edu (head node, RAID, 31 sub-nodes) and keys.ece.neu.edu (head node, 8 sub-nodes), with 1 Gbit/s and 100 Mbit/s links]

Page 40: Research @ Northeastern University


NPB2.4/BT Read Performance

[Chart: read bandwidth (MB/s) for single-server, dual-server, and P2P configurations with 4, 9, 16, and 25 processes]

Page 41: Research @ Northeastern University


Areas for Future Work

• Reduce simulation time by identifying characteristic “phases” in I/O workloads
• Apply machine-learning algorithms to identify clusters of representative I/O behavior
• Utilize K-Means and multinomial clustering to obtain high fidelity in simulation runs that use sampled I/O behavior (a clustering sketch follows below)

“A Multinomial Clustering Model for Fast Simulation of Architecture Designs,” submitted to the 2005 ACM KDD Conference.
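A minimal sketch of the clustering step, assuming each I/O interval is first summarized as a small feature vector (read fraction, mean chunk size, and request rate are illustrative choices) and using scikit-learn's KMeans in place of the clustering machinery described in the paper:

# Sketch: group I/O intervals into representative phases with K-Means, then
# simulate one representative interval per cluster instead of the full trace.
import numpy as np
from sklearn.cluster import KMeans

# One row per interval: [read fraction, mean chunk size (KB), I/Os per second]
features = np.array([[0.9,  2.0, 450.0],
                     [0.1, 96.0,  30.0],
                     [0.9,  2.0, 440.0],
                     [0.1, 96.0,  28.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
for c in range(km.n_clusters):
    rep = int(np.where(km.labels_ == c)[0][0])  # first interval in cluster c
    print(f"phase {c}: simulate interval {rep} as its representative")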