
Page 1: Progress on Efficient Integration of Lustre* and Hadoop/YARN (opensfs.org/wp-content/uploads/2014/04/D3_S29_ProgressReporton...)

Progress on Efficient Integration of Lustre* and Hadoop/YARN

Robin Goldstone

Weikuan Yu Omkar Kulkarni Bryon Neitzel

* Some names and brands may be claimed as the property of others.

Page 2

MapReduce
•  A simple data-processing model for big data
•  Designed for commodity off-the-shelf hardware components
•  Strong merits for big data analytics
  –  Scalability: increase throughput by increasing the number of nodes
  –  Fault tolerance: quick, low-cost recovery from task failures
•  YARN, the next-generation Hadoop MapReduce implementation
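The model outlined above can be illustrated with a toy, single-process word count (plain Python, not Hadoop code): map emits key/value pairs, a shuffle groups values by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) for every word in each input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

records = ["big data analytics", "big data"]
print(reduce_phase(shuffle(map_phase(records))))
# {'big': 2, 'data': 2, 'analytics': 1}
```

In a real cluster each phase runs on many nodes in parallel, which is where scalability and fault tolerance come from.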

Page 3

High-Level Overview of YARN

•  Hadoop consists of the HDFS and MapReduce frameworks and exposes map and reduce interfaces.
•  YARN components: the ResourceManager and NodeManagers, plus a per-job MRAppMaster, MapTasks, and ReduceTasks.

[Figure: MapTasks read input splits from HDFS; ReduceTasks write their output back to HDFS; the ResourceManager and MRAppMaster coordinate the tasks.]

Page 4

Supercomputers and Lustre*
•  Lustre is popularly deployed on supercomputers
•  A vast number of compute nodes (CNs) for computation
•  A parallel pool of back-end storage servers, composing a large pool of storage nodes

[Figure: compute nodes connected to Lustre through an interconnect.]

Page 5

Lustre for MapReduce-based Analytics?
•  Desire
  –  Integrate Lustre as a storage solution
  –  Understand the requirements of MapReduce on data organization, task placement, and data movement, and their implications for Lustre
•  Approach
  –  Mitigate the impact of a centralized data store in Lustre
  –  Reduce repetitive data movement between compute nodes and storage nodes
  –  Cater to the preferences of task scheduling and data locality

Page 6

Overarching Goal
•  Enable analytics shipping to Lustre* storage servers
  –  Users ship their analytics jobs to SNs on demand
•  Retain the default I/O model for scientific applications, storing data to Lustre
•  Enable in-situ analytics at the storage nodes

[Figure: simulation output is stored to Lustre; analytics jobs are shipped to the storage nodes.]

Page 7

Technical Objectives
•  Segregate analytics and storage functionalities within the same storage nodes
  –  Mitigate interference between YARN and Lustre*
•  Develop coordinated data placement and task scheduling between Lustre and YARN
  –  Enable and exploit data and task locality
•  Improve intermediate data organization for efficient shuffling on Lustre

Page 8

YARN and Lustre* Integration with Performance Segregation
•  Leverage KVM to create virtual machine (VM) instances on SNs
•  Create Lustre storage servers on the physical machines (PMs)
•  Run YARN programs and Lustre clients on the VMs
•  Placement of YARN intermediate data: on Lustre or local disks?

[Figure: YARN runs in a KVM guest on each physical machine; the OSS/OST run on the physical machine, with a total of 8 OSTs.]

Page 9

Running Hadoop/YARN on KVM
•  50 TeraSort jobs, 1 GB input each; one job submitted every 3 seconds.
•  Running YARN on KVM incurs a large overhead.
•  Running IOR on 6 other machines: the impact is not very significant.

[Figure: average job execution time (sec); legend: Running Alone on PM, Running Alone on KVM, Running with IOR - POSIX, Running with IOR - MPIIO; annotations: 4.9%, 6.8%.]

KVM configuration on each node: 4 cores, 6 GB memory; a separate hard drive for Lustre*; a separate network port.

Page 10

KVM Overhead
•  4 cores are not enough for YARN jobs
•  6 cores improve the performance of YARN
•  Increasing memory from 4 GB to 6 GB has little effect when the number of cores is the bottleneck

[Figure: average job execution time (sec); legend: Running on PM, Running on KVM (4 Cores, 6GB Memory), Running on KVM (6 Cores, 4GB Memory), Running on KVM (6 Cores, 6GB Memory); annotations: 62.6%, 31.8%, 29.2%.]

Page 11

YARN Memory Utilization
•  Running YARN on the physical machines alone.
•  The NodeManager is given 8 GB of memory, with 1 GB per container and a 1 GB heap per task.
•  HDFS with local ext3 disks; intensive writes to HDFS (via local ext3).

[Figure: memory usage (GB) over execution time (sec); legend: Yarn Memory, Yarn and Ext3 Memory, Ext3 Memory, Total Memory.]

Page 12

Intermediate Data on Local Disk or Lustre*
•  Place intermediate data on the local ext3 file system or on Lustre, which is mapped into KVM (yarn.nodemanager.local-dirs).
•  YARN and the Lustre clients are placed in the KVM guests; the OSS/OST on the physical machines.
•  The TeraSort (4G) and PageRank (1G) benchmarks were measured.
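Switching the intermediate-data location mentioned above comes down to rewriting the yarn.nodemanager.local-dirs property in yarn-site.xml. A minimal sketch (the mount points /tmp/nm-local and /mnt/lustre/yarn/nm-local are hypothetical placeholders):

```python
import xml.etree.ElementTree as ET

def set_local_dirs(conf_xml, path):
    """Point yarn.nodemanager.local-dirs at a new directory in a
    yarn-site.xml-style configuration string. Hypothetical paths;
    the real mount points depend on the cluster."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "yarn.nodemanager.local-dirs":
            prop.find("value").text = path
            return ET.tostring(root, encoding="unicode")
    # Property absent: append it.
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = "yarn.nodemanager.local-dirs"
    ET.SubElement(prop, "value").text = path
    return ET.tostring(root, encoding="unicode")

conf = ("<configuration><property>"
        "<name>yarn.nodemanager.local-dirs</name>"
        "<value>/tmp/nm-local</value>"
        "</property></configuration>")
print(set_local_dirs(conf, "/mnt/lustre/yarn/nm-local"))
```

Pointing this property at a Lustre mount is what moves the map-output and spill files off the local ext3 disks.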

[Figure: job execution time (sec) with YARN's intermediate data directory on local ext3 vs. Lustre. TeraSort (4G): 97 s on ext3 vs. 149 s on Lustre; PageRank (1G): 301 s vs. 364 s; annotations: 35%, 17%. Setup: 8 OSS/OST + 8 OSC; YARN and OSC in the KVM guests, OSS/OST on the physical machines.]

Page 13

Data Locality for YARN on Lustre*

[Figure: data-locality workflow for YARN on Lustre*. A client submits a job to the ResourceManager; the MRAppMaster queries stripe layouts (lfs getstripe /lustre/stripe1, /lustre/stripe2, /lustre/stripe3) against the MDS and OSS1-OSS3 to build the mappings <stripe1, OST1>, <stripe2, OST2>, <stripe3, OST3>. The scheduler then assigns containers via assignNodeLocalContainers(), assignRackLocalContainers(), and assignOffSwitchContainers(), yielding <stripe1, Slave1, Map1>, <stripe2, Slave2, Map2>, and <stripe3, Slave3, Map3>; the NodeManagers on Slave1-Slave3 launch Map1-Map3 in containers next to the OSC/OST holding each stripe's data.]
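Building a <stripe, OST> mapping for the locality-aware scheduler means parsing `lfs getstripe` output. A hedged sketch: the sample below mimics a Lustre 2.x layout listing (the exact format varies across Lustre versions), and the parser pulls OST indices from the obdidx column.

```python
# Sample output in the style of `lfs getstripe <file>` on Lustre 2.x;
# the exact fields and spacing differ across versions.
SAMPLE = """/lustre/stripe1
lmm_stripe_count:  1
lmm_stripe_size:   134217728
lmm_stripe_offset: 0
        obdidx           objid           objid           group
             0         3014657        0x2e0001                0
"""

def osts_for_file(getstripe_output):
    """Return the OST indices (obdidx column) a file's stripes live on."""
    osts = []
    in_table = False
    for line in getstripe_output.splitlines():
        if "obdidx" in line:        # header row of the stripe table
            in_table = True
            continue
        if in_table and line.strip():
            osts.append(int(line.split()[0]))
    return osts

print(osts_for_file(SAMPLE))  # [0]
```

Mapping each OST index to the OSS node that serves it then gives the scheduler its node-local placement targets.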


Page 14

Background of the TeraSort Test
•  Four cases are compared
  –  Intermediate data on Lustre* or local disks
  –  Scheduling map tasks with or without data locality
  –  lustre_shfl_opt: on Lustre, with locality
  –  lustre_shfl_orig: on Lustre, without locality
  –  local_shfl_opt: on local disks, with locality
  –  local_shfl_orig: on local disks, without locality
•  Test environment: Lustre 2.5 with datasets from 10 GB to 30 GB, and a 128 MB stripe size and block size
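The with- and without-locality cases differ only in the scheduler's placement pass. A toy sketch of locality-first assignment (rack topology is omitted, so any non-local node stands in for both the rack-local and off-switch fallbacks from the scheduler diagram):

```python
def schedule(tasks, free_slots):
    """tasks: {task: preferred_node}; free_slots: {node: free slot count}.
    Returns {task: node}, preferring node-local placement and falling
    back to any node with a free slot. Toy model of the
    assignNodeLocalContainers()/assignOffSwitchContainers() order."""
    placement = {}
    # Pass 1: node-local assignments.
    for task, node in tasks.items():
        if free_slots.get(node, 0) > 0:
            placement[task] = node
            free_slots[node] -= 1
    # Pass 2: fallback for the tasks that missed their preferred node.
    for task in tasks:
        if task in placement:
            continue
        node = next((n for n, c in free_slots.items() if c > 0), None)
        if node is not None:
            placement[task] = node
            free_slots[node] -= 1
    return placement

tasks = {"map1": "slave1", "map2": "slave2", "map3": "slave1"}
slots = {"slave1": 1, "slave2": 1, "slave3": 1}
print(schedule(tasks, slots))
# {'map1': 'slave1', 'map2': 'slave2', 'map3': 'slave3'}
```

The "orig" cases effectively skip pass 1, which is why they show far fewer local map tasks in the next slide's measurements.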


Page 15

Average Number of Local Map Tasks

•  local_shfl_opt and lustre_shfl_opt achieve high locality
•  The other two cases have low locality.

[Figure: average number of local map tasks vs. input size (10G-30G); legend: local_shfl_orig, local_shfl_opt, lustre_shfl_orig, lustre_shfl_opt.]

Page 16

TeraSort under Lustre* 2.5
•  On average, local_shfl_orig has the best performance
•  lustre_shfl_opt falls between the best and worst cases

[Figure: execution time (sec) vs. input size (10G-30G); legend: local_shfl_orig, local_shfl_opt, lustre_shfl_orig, lustre_shfl_opt.]

Page 17

Data Flow in Original YARN over HDFS
•  This figure shows all of the disk I/O in original Hadoop
•  Map task: input split, spilled data, map output
•  Reduce task: shuffled data, merged data, output data

[Figure: the MapTask reads its input split from HDFS; map feeds sortAndSpill, which writes spilled data, and mergeParts writes the MapOutput. The ReduceTask shuffles map outputs into MergeManagerImpl, which writes shuffle spills and performs repetitive merges into merged data; reduce then writes the final output to HDFS.]

Page 18

Data Flow in YARN over Lustre*
•  This figure shows all of the disk I/O of YARN over Lustre
•  Avoid as much disk I/O as possible
•  Speed up reading input data and writing output data

[Figure: the same pipeline with Lustre in place of HDFS for the input split and the final output; the spilled data, MapOutput, shuffle spills, repetitive merges, and merged data still hit the disk path.]

Page 19

New Implementation Design Review
•  Improve I/O performance
  –  Read/write from/to the local OST
  –  Avoid unnecessary shuffle spill and repetitive merge
  –  After all MapOutput has been written, launch reduce tasks to read the data
•  Avoid Lustre* write/read lock issues?
•  Reduce Lustre write/read contention?
•  Reduce network contention
  –  Most data is written to and read from the local OST through the virtio bridged network
  –  Reserve more network bandwidth for the Lustre shuffle

[Figure: modified pipeline. The input split is read from Lustre; sortAndSpill and mergeParts write the spilled data and MapOutput to the local OST; the reduce-side shuffle spill and repetitive merge are avoided, with the shuffle performed through Lustre, and reduce writes its output locally.]
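The spill- and merge-avoidance described on this slide can be illustrated with a toy sketch (plain Python, not the actual YARN patch): each map task writes one sorted partition file per reducer into a shared directory standing in for Lustre, and each reducer streams a single merge over its partition files instead of shuffling over the network and re-merging on local disk. The file naming and layout here are illustrative.

```python
import heapq
import os
import pickle
import tempfile

def map_task(map_id, pairs, n_reducers, shared_dir):
    """Partition (key, count) pairs by reducer and write each
    partition, pre-sorted, to the shared (Lustre-like) directory."""
    parts = [[] for _ in range(n_reducers)]
    for k, v in pairs:
        parts[hash(k) % n_reducers].append((k, v))
    for r, part in enumerate(parts):
        with open(os.path.join(shared_dir, f"map{map_id}.part{r}"), "wb") as f:
            pickle.dump(sorted(part), f)

def reduce_task(r, n_maps, shared_dir):
    """Read this reducer's partition files directly from the shared
    directory and do one streaming merge -- no shuffle spill, no
    repetitive merge."""
    runs = []
    for m in range(n_maps):
        with open(os.path.join(shared_dir, f"map{m}.part{r}"), "rb") as f:
            runs.append(pickle.load(f))
    out = {}
    for k, v in heapq.merge(*runs):
        out[k] = out.get(k, 0) + v
    return out

shared = tempfile.mkdtemp()
map_task(0, [("a", 1), ("b", 1)], 1, shared)
map_task(1, [("a", 1)], 1, shared)
print(reduce_task(0, 2, shared))  # {'a': 2, 'b': 1}
```

Because the map outputs are already sorted and globally visible on the shared file system, the reducer needs only one merge pass, which is the intuition behind launching reduce tasks after all MapOutput has been written.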


Page 20

Evaluation Results
•  SATA disks for the OSTs, 10G networking, Lustre* 2.5
•  Running the TeraSort benchmark with 1 master node and 8 slave nodes
•  The optimized YARN performs about 21% better than the original YARN on average

[Figure: execution time (sec) for input sizes 8G-32G; legend: YARN-Lustre, YARN-Lustre-Opt.]

Page 21

Summary

•  Explored the design of an analytics-shipping framework by integrating Lustre* and YARN
•  Provided end-to-end optimizations on data organization, data movement, and task scheduling for efficient integration of Lustre and YARN
•  Demonstrated its performance benefits for analytics applications

Page 22

Acknowledgment
•  Contributions from Cong Xu, Yandong Wang, and Huansong Fu.

•  Awards to Auburn University from Intel, Alabama Department of Commerce and the US National Science Foundation.

•  This work was also performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-647813).