Progress on Efficient Integration of Lustre* and Hadoop/YARNopensfs.org/wp-content/uploads/2014/04/D3_S29_ProgressReporton... · A simple data processing model to process big data

Progress on Efficient Integration of Lustre* and Hadoop/YARN

Robin Goldstone

Weikuan Yu Omkar Kulkarni Bryon Neitzel

* Some name and brands may be claimed as the property of others.

LUG - S2

MapReduce l  A simple data processing model to process big data l  Designed for commodity off-the-shelf hardware components. l  Strong merits for big data analytics

l  Scalability: increase throughput by increasing # of nodes

l  Fault-tolerance (quick and low cost recovery of the failures of tasks)

l  YARN, the next generation of Hadoop MapReduce Implementation

LUG - S3

High-Level Overview of YARN

MRAppMaster

•  Consists of HDFS and MapReduce frameworks. •  Exposes map and reduce interfaces.

v ResourceManager and NodeManagers v MRAppMaster, MapTask, and ReduceTask.

H D F S

Map Task

Map Task

Map Task

Map Task

(2)

(3)

Reduce Task

Reduce Task

HDFS

(3)

HDFS

Resource Manager

LUG - S4

Supercomputers and Lustre* •  Lustre popularly deployed on supercomputers

•  A vast number of computer nodes (CN) for computation

•  A parallel pool of back-end storage servers, composed a large pool of storage nodes

Interconnect

CN CN CN

Lustre

LUG - S5

Lustre for MapReduce-based Analytics? •  Desire

–  Integration of Lustre as a storage solution –  Understand the requirements of MapReduce on data

organization, task placement, and data movement, and their implications to Lustre

•  Approach: –  Mitigate the impact of centralized data store at Lustre –  Reduce repetitive data movement from computer nodes

and storage nodes –  Cater to the preference of task scheduling and data locality

*other names and brands may be claimed by others

LUG - S6

Overarching Goal •  Enable analytics shipping on Lustre* storage servers

–  Users ship their analytics jobs to SNs on-demand

•  Retain the default I/O model for scientific applications, storing data to Lustre

•  Enable in-situ analytics at the storage nodes

Simulation Output

Analytics

Shipping

3


LUG - S7

Technical Objectives •  Segregate analytics and storage functionalities

within the same storage nodes –  Mitigate interference between YARN and Lustre*

•  Develop a coordinated data placement and task scheduling between Lustre and YARN –  Enable and exploit data and task locality

•  Improve Intermediate Data Organization for Efficient Shuffling on Lustre


LUG - S8

YARN and Lustre* Integration with Performance Segregation •  Leverage KVM to create VM (virtual machine)

instances on SNs

•  Create Lustre storage servers on the physical machines (PMs)

•  Run YARN programs and Lustre clients on the VMs

•  Placement of YARN Intermediate data –  On Lustre or local disks?

KVM

Physical Machine OST

OSS

A Total of 8 OSTs

Yarn


LUG - S9

Running Hadoop/YARN on KVM •  50 TeraSort jobs, 1GB input each. One job submitted every 3 seconds,

•  There is a huge overhead caused by running YARN on KVM.

•  Running IOR on 6 other machines. The impact is not very significant.

0

10

20

30

40

50

60

Average Job Execution Time (s)

Tim

e (s

ec)

Running Alone on PM Running Alone on KVM Running with IOR - POSIX Running with IOR - MPIIO


4.9% 6.8%

KVM configuration on each node: u 4 cores, 6 GB memory. u Using a different hard drive for Lustre*. u Using a different network port

LUG - S10

KVM Overhead •  4 cores are not enough for YARN jobs

•  6 cores help improve the performance of YARN

•  Increasing memory size from 4GB to 6GB has little effects when number of cores is the bottleneck

0

10

20

30

40

50

60

Tim

e (S

ec)

Average Job Excution Time (s)

Running on PM Running on KVM (4 Cores, 6GB Memory) Running on KVM (6 Cores, 4GB Memory) Running on KVM (6 Cores, 6GB Memory)

62.6%

31.8% 29.2%

LUG - S11

YARN Memory Utilization •  Running Yarn on the physical machines alone.

•  NodeManager is given 8GB memory, 1GB per container, 1GB heap per task. •  HDFS with local ext3 disks. Intensive writes to HDFS (via local ext3)

0

2

4

6

8

10

12

14

0 200 400 600 800 1000

Memo

ry Us

age (G

B)

Execution Time (Sec)

Yarn MemoryYarn and Ext3 MemoryExt3 MemoryTotal Memory

LUG - S12

Intermediate Data on Local Disk or Lustre* •  Place intermediate data on local ext3 file system or Lustre, which is

mapped to KVM (yarn.nodemanager.local-dirs).

•  Yarn and Lustre Clients are placed on the KVM, OSS/OST on the Physical Machine

•  Terasort (4G) and PageRank (1G) benchmarks have been measured

97

301

149

364

0

50

100

150

200

250

300

350

400

450

500

TeraSort (4G) Page Rank (1G)

Job Execu*

on Tim

e (sec)

Configuring Yarn’s intermediate data directory on Local ext3 or Lustre

Local Ext3 Lustre

35%

17%

KVM

Physical Machine

OST

OSS

OSC

8 OSS/OST + 8 OSC

Yarn


Slide --- 13

Data Locality for YARN on Lustre*

Lustre

Yarn

OST1 OST2 OST3

MDS

OSS1 OSS2 OSS3

OSC1

Resource Manager

Client Job

Submission

Node Manager

lfs getstripe /lustre/stripe1

Node Manager

Node Manager

Container MapTask1

MRAppMaster Services

Scheduler

OSC2 OSC3

stripe1 stripe2 stripe3

Container MapTask2

Container MapTask3

Master Slave 1 Slave 2 Slave 3

<stripe1, OST1> <stripe2, OST2> <stripe3, OST3>

assignNodeLocalContainers() assignRackLocalContainers() assignOffSwitchContainers()

<stripe1, Slave1, Map1>

<stripe1, Map1> <stripe2, Map2>

<stripe3, Map3>

Launch Map1 Launch Map2 Launch Map3

Stripe1 Data Stripe2 Data Stripe3 Data

<stripe2, Slave2, Map2> <stripe3, Slave3, Map3>

lfs getstripe /lustre/stripe2 lfs getstripe /lustre/stripe3


Slide --- 14

Background of TeraSort Test

•  Four cases being compared –  Intermediate Data on Lustre* or Local disks –  Scheduling Map tasks with or without data locality –  lustre_shfl_opt: (on lustre, with locality) –  lustre_shfl_orig: (on lustre, without locality) –  local_shfl_opt: (on local disks, with locality) –  local_shfl_orig: (on local disks, without locality)

•  Test environments -- Lustre 2.5 with dataset from 10GB to 30GB and 128MB stripe size and block size


Slide --- 15

Average Number of Local Map Tasks

•  local_shfl_opt and lustre_shfl_opt achieve high locality •  The other two have low locality.

Input Size

0

5

10

15

20

25

30

10G 15G 20G 25G 30G

Avg.

Loc

al M

ap T

asks

local_shfl_orig local_shfl_opt lustre_shfl_orig lustre_shfl_opt

Slide --- 16

Terasort under Lustre* 2.5

•  On average, local_shfl_orig has best performance •  lustre_shfl_opt is in the middle of best case and worst case;

Input Size

0

100

200

300

400

500

600

700

800

10G 15G 20G 25G 30G

Exec

utio

n T

ime

(Sec

) local_shfl_orig local_shfl_opt lustre_shfl_orig lustre_shfl_opt


Slide --- 17

Data Flow in Original YARN over HDFS •  This figure shows all of the Disk I/O in original Hadoop •  Map Task: Input Split, Spilled Data, MapOutput •  Reduce Task: Shuffled Data, Merged Data, Output Data

MapTask ReduceTask

HDFS Input Split

map sortAndSpill

Spilled

mergeParts

MapOutput

shuffle Merge ManagerImpl

Shuffle spilled

Repetitive Merge

Merged

reduce

Output

Slide --- 18

Data Flow in YARN over Lustre* •  This figure shows all of the Disk I/O of YARN over Lustre •  Avoid as much Disk I/O as possible •  Speed up Reading Input data and Writing Output data

MapTask ReduceTask

Lustre Input Split

map sortAndSpill

Spilled Data

mergeParts

MapOutput


Repetitive Merge

Merged Data

reduce

Output Spilled Data


Slide --- 19

New Implementation Design Review •  Improve I/O performance

–  Read/Write from/to local OST –  Avoid unnecessary shuffle spill and repetitive merge –  After all MapOutput has been written, launch reduce task to read data

•  Avoid Lustre* write/read lock issues? •  Reduce Lustre write/read contention?

•  Reduce network contention –  Most of data is written/read from local OST through virtio bridged network –  Reserve more network bandwidth for Lustre Shuffle

MapTask ReduceTask

Lustre Input Split

map sortAndSpill

Spilled Data

mergeParts

MapOutput


Repetitive Merge

Merged Data

reduce

Output

Local Local Local Local

Spilled Data

Avoid Avoid Lustre Shuffle


Slide --- 20

Evaluation Results •  SATA Disk for OST, 10G Networking, Lustre* 2.5 •  Running Terasort Benchmark, 1 master node, 8 slave nodes •  Optimized YARN performs on 21% better than the original YARN

on average

0

100

200

300

400

500

600

700

800

900

1000

8G 16G 24G 32G

Exec

utio

n Ti

me

(Sec

)

YARN-Lustre YARN-Lustre-Opt


LUG - S21

Summary

•  Explore the design of an Analytics Shipping Framework by integrating Lustre* and YARN

•  Provided End-to-End optimizations on data organization, movement and task scheduling for efficient integration of Lustre and YARN

•  Demonstrated its performance benefits to analytics applications


LUG - S22

Acknowledgment •  Contributions from Cong Xu, Yandong Wang,

and Huansong Fu.

•  Awards to Auburn University from Intel, Alabama Department of Commerce and the US National Science Foundation.

•  This work was also performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-647813).

Documents

Progress on Efficient Integration of Lustre* and Hadoop/YARNopensfs.org/wp-content/uploads/2014/04/D3_S29_ProgressReporton... · A simple data processing model to process big data