Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Progress on Efficient Integration of Lustre* and Hadoop/YARN
Robin Goldstone
Weikuan Yu Omkar Kulkarni Bryon Neitzel
* Some name and brands may be claimed as the property of others.
LUG - S2
MapReduce l A simple data processing model to process big data l Designed for commodity off-the-shelf hardware components. l Strong merits for big data analytics
l Scalability: increase throughput by increasing # of nodes
l Fault-tolerance (quick and low cost recovery of the failures of tasks)
l YARN, the next generation of Hadoop MapReduce Implementation
LUG - S3
High-Level Overview of YARN
MRAppMaster
• Consists of HDFS and MapReduce frameworks. • Exposes map and reduce interfaces.
v ResourceManager and NodeManagers v MRAppMaster, MapTask, and ReduceTask.
H D F S
Map Task
Map Task
Map Task
Map Task
(2)
(3)
Reduce Task
Reduce Task
HDFS
(3)
HDFS
Resource Manager
LUG - S4
Supercomputers and Lustre* • Lustre popularly deployed on supercomputers
• A vast number of computer nodes (CN) for computation
• A parallel pool of back-end storage servers, composed a large pool of storage nodes
Interconnect
CN CN CN
Lustre
LUG - S5
Lustre for MapReduce-based Analytics? • Desire
– Integration of Lustre as a storage solution – Understand the requirements of MapReduce on data
organization, task placement, and data movement, and their implications to Lustre
• Approach: – Mitigate the impact of centralized data store at Lustre – Reduce repetitive data movement from computer nodes
and storage nodes – Cater to the preference of task scheduling and data locality
*other names and brands may be claimed by others
LUG - S6
Overarching Goal • Enable analytics shipping on Lustre* storage servers
– Users ship their analytics jobs to SNs on-demand
• Retain the default I/O model for scientific applications, storing data to Lustre
• Enable in-situ analytics at the storage nodes
Simulation Output
Analytics
Shipping
3
*other names and brands may be claimed by others
LUG - S7
Technical Objectives • Segregate analytics and storage functionalities
within the same storage nodes – Mitigate interference between YARN and Lustre*
• Develop a coordinated data placement and task scheduling between Lustre and YARN – Enable and exploit data and task locality
• Improve Intermediate Data Organization for Efficient Shuffling on Lustre
*other names and brands may be claimed by others
LUG - S8
YARN and Lustre* Integration with Performance Segregation • Leverage KVM to create VM (virtual machine)
instances on SNs
• Create Lustre storage servers on the physical machines (PMs)
• Run YARN programs and Lustre clients on the VMs
• Placement of YARN Intermediate data – On Lustre or local disks?
KVM
Physical Machine OST
OSS
A Total of 8 OSTs
Yarn
*other names and brands may be claimed by others
LUG - S9
Running Hadoop/YARN on KVM • 50 TeraSort jobs, 1GB input each. One job submitted every 3 seconds,
• There is a huge overhead caused by running YARN on KVM.
• Running IOR on 6 other machines. The impact is not very significant.
0
10
20
30
40
50
60
Average Job Execution Time (s)
Tim
e (s
ec)
Running Alone on PM Running Alone on KVM Running with IOR - POSIX Running with IOR - MPIIO
*other names and brands may be claimed by others
4.9% 6.8%
KVM configuration on each node: u 4 cores, 6 GB memory. u Using a different hard drive for Lustre*. u Using a different network port
LUG - S10
KVM Overhead • 4 cores are not enough for YARN jobs
• 6 cores help improve the performance of YARN
• Increasing memory size from 4GB to 6GB has little effects when number of cores is the bottleneck
0
10
20
30
40
50
60
Tim
e (S
ec)
Average Job Excution Time (s)
Running on PM Running on KVM (4 Cores, 6GB Memory) Running on KVM (6 Cores, 4GB Memory) Running on KVM (6 Cores, 6GB Memory)
62.6%
31.8% 29.2%
LUG - S11
YARN Memory Utilization • Running Yarn on the physical machines alone.
• NodeManager is given 8GB memory, 1GB per container, 1GB heap per task. • HDFS with local ext3 disks. Intensive writes to HDFS (via local ext3)
0
2
4
6
8
10
12
14
0 200 400 600 800 1000
Memo
ry Us
age (G
B)
Execution Time (Sec)
Yarn MemoryYarn and Ext3 MemoryExt3 MemoryTotal Memory
LUG - S12
Intermediate Data on Local Disk or Lustre* • Place intermediate data on local ext3 file system or Lustre, which is
mapped to KVM (yarn.nodemanager.local-dirs).
• Yarn and Lustre Clients are placed on the KVM, OSS/OST on the Physical Machine
• Terasort (4G) and PageRank (1G) benchmarks have been measured
97
301
149
364
0
50
100
150
200
250
300
350
400
450
500
TeraSort (4G) Page Rank (1G)
Job Execu*
on Tim
e (sec)
Configuring Yarn’s intermediate data directory on Local ext3 or Lustre
Local Ext3 Lustre
35%
17%
KVM
Physical Machine
OST
OSS
OSC
8 OSS/OST + 8 OSC
Yarn
*other names and brands may be claimed by others
Slide --- 13
Data Locality for YARN on Lustre*
Lustre
Yarn
OST1 OST2 OST3
MDS
OSS1 OSS2 OSS3
OSC1
Resource Manager
Client Job
Submission
Node Manager
lfs getstripe /lustre/stripe1
Node Manager
Node Manager
Container MapTask1
MRAppMaster Services
Scheduler
OSC2 OSC3
stripe1 stripe2 stripe3
Container MapTask2
Container MapTask3
Master Slave 1 Slave 2 Slave 3
<stripe1, OST1> <stripe2, OST2> <stripe3, OST3>
assignNodeLocalContainers() assignRackLocalContainers() assignOffSwitchContainers()
<stripe1, Slave1, Map1>
<stripe1, Map1> <stripe2, Map2>
<stripe3, Map3>
Launch Map1 Launch Map2 Launch Map3
Stripe1 Data Stripe2 Data Stripe3 Data
<stripe2, Slave2, Map2> <stripe3, Slave3, Map3>
lfs getstripe /lustre/stripe2 lfs getstripe /lustre/stripe3
*other names and brands may be claimed by others
Slide --- 14
Background of TeraSort Test
• Four cases being compared – Intermediate Data on Lustre* or Local disks – Scheduling Map tasks with or without data locality – lustre_shfl_opt: (on lustre, with locality) – lustre_shfl_orig: (on lustre, without locality) – local_shfl_opt: (on local disks, with locality) – local_shfl_orig: (on local disks, without locality)
• Test environments -- Lustre 2.5 with dataset from 10GB to 30GB and 128MB stripe size and block size
*other names and brands may be claimed by others
Slide --- 15
Average Number of Local Map Tasks
• local_shfl_opt and lustre_shfl_opt achieve high locality • The other two have low locality.
Input Size
0
5
10
15
20
25
30
10G 15G 20G 25G 30G
Avg.
Loc
al M
ap T
asks
local_shfl_orig local_shfl_opt lustre_shfl_orig lustre_shfl_opt
Slide --- 16
Terasort under Lustre* 2.5
• On average, local_shfl_orig has best performance • lustre_shfl_opt is in the middle of best case and worst case;
Input Size
0
100
200
300
400
500
600
700
800
10G 15G 20G 25G 30G
Exec
utio
n T
ime
(Sec
) local_shfl_orig local_shfl_opt lustre_shfl_orig lustre_shfl_opt
*other names and brands may be claimed by others
Slide --- 17
Data Flow in Original YARN over HDFS • This figure shows all of the Disk I/O in original Hadoop • Map Task: Input Split, Spilled Data, MapOutput • Reduce Task: Shuffled Data, Merged Data, Output Data
MapTask ReduceTask
HDFS Input Split
map sortAndSpill
Spilled
mergeParts
MapOutput
shuffle Merge ManagerImpl
Shuffle spilled
Repetitive Merge
Merged
reduce
Output
Slide --- 18
Data Flow in YARN over Lustre* • This figure shows all of the Disk I/O of YARN over Lustre • Avoid as much Disk I/O as possible • Speed up Reading Input data and Writing Output data
MapTask ReduceTask
Lustre Input Split
map sortAndSpill
Spilled Data
mergeParts
MapOutput
shuffle Merge ManagerImpl
Repetitive Merge
Merged Data
reduce
Output Spilled Data
*other names and brands may be claimed by others
Slide --- 19
New Implementation Design Review • Improve I/O performance
– Read/Write from/to local OST – Avoid unnecessary shuffle spill and repetitive merge – After all MapOutput has been written, launch reduce task to read data
• Avoid Lustre* write/read lock issues? • Reduce Lustre write/read contention?
• Reduce network contention – Most of data is written/read from local OST through virtio bridged network – Reserve more network bandwidth for Lustre Shuffle
MapTask ReduceTask
Lustre Input Split
map sortAndSpill
Spilled Data
mergeParts
MapOutput
shuffle Merge ManagerImpl
Repetitive Merge
Merged Data
reduce
Output
Local Local Local Local
Spilled Data
Avoid Avoid Lustre Shuffle
*other names and brands may be claimed by others
Slide --- 20
Evaluation Results • SATA Disk for OST, 10G Networking, Lustre* 2.5 • Running Terasort Benchmark, 1 master node, 8 slave nodes • Optimized YARN performs on 21% better than the original YARN
on average
0
100
200
300
400
500
600
700
800
900
1000
8G 16G 24G 32G
Exec
utio
n Ti
me
(Sec
)
YARN-Lustre YARN-Lustre-Opt
*other names and brands may be claimed by others
LUG - S21
Summary
• Explore the design of an Analytics Shipping Framework by integrating Lustre* and YARN
• Provided End-to-End optimizations on data organization, movement and task scheduling for efficient integration of Lustre and YARN
• Demonstrated its performance benefits to analytics applications
*other names and brands may be claimed by others
LUG - S22
Acknowledgment • Contributions from Cong Xu, Yandong Wang,
and Huansong Fu.
• Awards to Auburn University from Intel, Alabama Department of Commerce and the US National Science Foundation.
• This work was also performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-647813).