Orchestra: Managing Data Transfers in Computer Clusters


Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica

UC Berkeley

Moving Data is Expensive

Typical MapReduce jobs at Facebook spend 33% of job running time in large data transfers

An application for training a spam classifier on Twitter data spends 40% of its time in communication


Limits Scalability

Scalability of Netflix-like recommendation system is bottlenecked by communication


Did not scale beyond 60 nodes
» Communication time increased faster than computation time decreased

Transfer Patterns

Transfer: the set of all flows transporting data between two stages of a job
» Acts as a barrier

Completion time: Time for the last receiver to finish

[Figure: transfer patterns within a job: shuffle between the map and reduce stages, broadcast, and incast*]


Contributions

1. Optimize at the level of transfers instead of individual flows

2. Inter-transfer coordination


Orchestra

[Figure: Orchestra architecture for an example workload with a shuffle and two broadcasts. An Inter-Transfer Controller (ITC), implementing policies such as fair sharing, FIFO, and priority, coordinates per-transfer Transfer Controllers (TCs); each broadcast TC chooses among HDFS, Tree, and Cornet, and each shuffle TC chooses between Hadoop shuffle and WSS.]
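To make the layering concrete, here is a minimal sketch of the control hierarchy in Python. The class and method names are assumptions for exposition, not Orchestra's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TransferController:
    """Manages one transfer (all flows between two stages of one job) and
    picks a mechanism for it, e.g. Cornet for a broadcast or WSS for a shuffle."""
    transfer_id: str
    kind: str          # "broadcast" or "shuffle"
    mechanism: str     # e.g. "Cornet", "HDFS", "Tree", "WSS", "Hadoop"
    weight: float = 1.0

@dataclass
class InterTransferController:
    """Coordinates all active transfers according to a cluster-wide policy
    (fair sharing, FIFO, or priority)."""
    policy: str = "fair"
    transfers: Dict[str, TransferController] = field(default_factory=dict)

    def register(self, tc: TransferController) -> None:
        self.transfers[tc.transfer_id] = tc

    def shares(self) -> Dict[str, float]:
        """Weighted fair sharing: each transfer's share of congested links
        is proportional to its weight."""
        total = sum(tc.weight for tc in self.transfers.values()) or 1.0
        return {tid: tc.weight / total for tid, tc in self.transfers.items()}

itc = InterTransferController(policy="fair")
itc.register(TransferController("job1-broadcast", "broadcast", "Cornet", weight=1))
itc.register(TransferController("job1-shuffle", "shuffle", "WSS", weight=2))
print(itc.shares())    # {'job1-broadcast': 0.33.., 'job1-shuffle': 0.66..}
```

The point of the split is that each TC optimizes its own transfer, while the ITC arbitrates between transfers.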

Outline

Cooperative broadcast (Cornet)
» Infer and utilize topology information

Weighted Shuffle Scheduling (WSS)
» Assign flow rates to optimize shuffle completion time

Inter-Transfer Controller
» Implement weighted fair sharing between transfers

End-to-end performance


Cornet: Cooperative broadcast

Broadcast the same data to every receiver
» Fast, scalable, adaptive to bandwidth, and resilient

Peer-to-peer mechanism optimized for cooperative environments

Observations → Cornet design decisions
1. High-bandwidth, low-latency network → Large block size (4-16 MB)
2. No selfish or malicious peers → No need for incentives (e.g., TFT); no (un)choking; everyone stays till the end
3. Topology matters → Topology-aware broadcast
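For intuition, here is a toy, round-based model of the cooperative broadcast idea (an illustration under simplified assumptions, not Cornet's implementation): data is split into large blocks, and every node that already holds blocks, including receivers, serves them to peers that are still missing them, with no choking or incentive mechanism.

```python
import random

def simulate_cornet_style_broadcast(num_blocks, num_receivers, seed=0):
    """Toy round-based model of cooperative, BitTorrent-like broadcast:
    in each round, every node that holds data (the source or any receiver)
    uploads one block it has to one peer that is still missing it.
    No choking, no incentives; all peers are assumed cooperative."""
    rng = random.Random(seed)
    SOURCE = -1
    have = {SOURCE: set(range(num_blocks))}
    have.update({r: set() for r in range(num_receivers)})
    rounds = 0
    while any(len(have[r]) < num_blocks for r in range(num_receivers)):
        rounds += 1
        for uploader, blocks in list(have.items()):
            if not blocks:
                continue
            needy = [r for r in range(num_receivers) if blocks - have[r]]
            if not needy:
                continue
            target = rng.choice(needy)
            have[target].add(rng.choice(sorted(blocks - have[target])))
    return rounds

# With cooperation, the number of rounds grows roughly logarithmically in the
# number of receivers instead of linearly (as it would with a single source).
print(simulate_cornet_style_broadcast(num_blocks=8, num_receivers=100))
```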

Cornet performance

1 GB data to 100 receivers on EC2

4.5x to 5x improvement over the status quo

Topology-aware Cornet

Many data center networks employ tree topologies

Each rack should receive exactly one copy of the broadcast data
» Minimizes cross-rack communication

Topology information reduces cross-rack data transfer
» Fit a mixture of spherical Gaussians to infer the network topology
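A rough sketch of how such inference could be done, assuming scikit-learn is available and using pairwise block-transfer throughput measurements as features (the feature choice and parameters are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def infer_node_clusters(throughput_matrix: np.ndarray, n_clusters: int = 3):
    """Group nodes into likely racks from observed peer-to-peer throughputs.
    Each node is described by the vector of its throughputs to every other
    node; nodes in the same rack see similar (and higher intra-rack)
    throughputs, so a mixture of spherical Gaussians separates them."""
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="spherical",
                          n_init=5, random_state=0)
    return gmm.fit_predict(throughput_matrix)

# Example: 6 nodes, two racks; intra-rack throughput ~1000 MB/s, cross-rack ~100 MB/s.
rng = np.random.default_rng(0)
base = np.full((6, 6), 100.0)
base[:3, :3] = base[3:, 3:] = 1000.0
labels = infer_node_clusters(base + rng.normal(0, 10, (6, 6)), n_clusters=2)
print(labels)   # nodes 0-2 and 3-5 should land in different components
```

Nodes assigned to the same component are treated as one rack, so the broadcast can keep cross-rack traffic down to roughly one copy of the data per rack.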


Topology-aware Cornet

200 MB data to 30 receivers on DETER, 3 inferred clusters

~2x faster than vanilla Cornet

Status quo in Shuffle

[Figure: shuffle from five senders (s1-s5) to two receivers (r1, r2); s3 has data for both receivers]

Links to r1 and r2 are full: 3 time units
Link from s3 is full: 2 time units
Completion time: 5 time units

Weighted Shuffle Scheduling

Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent

[Figure: the same shuffle with per-flow weights 1, 1, 2, 2, 1, 1]

Completion time: 4 time units

Up to 1.5x improvement
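A small sketch of the rate computation on this example, assuming unit-capacity access links and flow sizes read off the weights in the figure (a simplified illustration, not the paper's exact algorithm):

```python
def wss_rates(flows, link_capacity=1.0):
    """Weighted fair sharing on each receiver's downlink: every flow gets a
    share of the downlink proportional to its data size (its weight).
    `flows` maps (sender, receiver) -> units of data to transfer.
    Returns per-flow rates and the resulting completion time."""
    rates = {}
    for (s, r), size in flows.items():
        total_at_r = sum(sz for (_, rr), sz in flows.items() if rr == r)
        rates[(s, r)] = link_capacity * size / total_at_r
    finish = max(size / rates[f] for f, size in flows.items())
    return rates, finish

# The example from the slides: s1, s2 -> r1 with 1 unit each, s3 -> both
# receivers with 2 units each, s4, s5 -> r2 with 1 unit each.
flows = {("s1", "r1"): 1, ("s2", "r1"): 1, ("s3", "r1"): 2,
         ("s3", "r2"): 2, ("s4", "r2"): 1, ("s5", "r2"): 1}
rates, finish = wss_rates(flows)
print(finish)   # 4.0 time units, vs. 5 under plain fair sharing
print(sum(r for (s, _), r in rates.items() if s == "s3"))   # s3's uplink load: 1.0
```

Because s3's larger flows get twice the rate, its uplink stays fully utilized and every flow finishes at time 4.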

Inter-Transfer Controller, aka Conductor

Weighted fair sharing
» Each transfer is assigned a weight
» Congested links are shared proportionally to the transfers' weights

Implementation: Weighted Flow Assignment (WFA)
» Each transfer gets a number of TCP connections proportional to its weight
» Requires no changes in the network or in end-host OSes
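A toy sketch of that mapping from weights to connection counts (the helper and its rounding scheme are my assumptions, not Orchestra's code):

```python
import math

def assign_connections(transfer_weights: dict, total_connections: int) -> dict:
    """Split a fixed pool of TCP connections among active transfers in
    proportion to their weights (largest-remainder rounding), so that a
    congested link is shared roughly according to transfer weights without
    any changes to the network or end-host OSes."""
    total_weight = sum(transfer_weights.values())
    exact = {t: total_connections * w / total_weight
             for t, w in transfer_weights.items()}
    conns = {t: max(1, math.floor(x)) for t, x in exact.items()}   # at least 1 each
    # hand out any leftover connections to the largest fractional remainders
    leftover = total_connections - sum(conns.values())
    for t in sorted(exact, key=lambda t: exact[t] - math.floor(exact[t]), reverse=True):
        if leftover <= 0:
            break
        conns[t] += 1
        leftover -= 1
    return conns

# e.g. a high-priority shuffle with weight 3 vs. a background broadcast with weight 1
print(assign_connections({"shuffle-hi": 3, "broadcast-lo": 1}, total_connections=8))
# -> {'shuffle-hi': 6, 'broadcast-lo': 2}
```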


Benefits of the ITC

Two priority classes
» FIFO within each class

Low-priority transfer: 2 GB per reducer
High-priority transfers: 250 MB per reducer

Shuffle using 30 nodes on EC2
» 43% reduction for high-priority transfers
» 6% increase for the low-priority transfer

[Figure: completion times without inter-transfer scheduling vs. with priority scheduling in Conductor]
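A minimal sketch of such a two-class policy (structure and names assumed for illustration; not Orchestra's actual implementation):

```python
from collections import deque

class PriorityFifoPolicy:
    """Toy sketch of a two-class ITC policy: transfers are queued FIFO within
    their class, and the high-priority class is served before the low one."""

    def __init__(self):
        self.queues = {"high": deque(), "low": deque()}

    def submit(self, transfer_id: str, priority: str) -> None:
        self.queues[priority].append(transfer_id)

    def finish(self, transfer_id: str) -> None:
        for q in self.queues.values():
            if transfer_id in q:
                q.remove(transfer_id)

    def active_transfer(self):
        """FIFO within each class; high priority preempts low priority."""
        for cls in ("high", "low"):
            if self.queues[cls]:
                return self.queues[cls][0]
        return None

policy = PriorityFifoPolicy()
policy.submit("low-2GB-shuffle", "low")
policy.submit("high-250MB-shuffle", "high")
print(policy.active_transfer())      # high-250MB-shuffle runs first
policy.finish("high-250MB-shuffle")
print(policy.active_transfer())      # the low-priority shuffle resumes
```

Because the high-priority shuffles are small, the low-priority transfer is only delayed briefly, which is consistent with the 43% / 6% result above.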

End-to-end evaluation

Developed in the context of Spark – an iterative, in-memory MapReduce-like framework

Evaluated using two iterative applications developed by ML researchers at UC Berkeley
» Training a spam classifier on Twitter data
» Recommendation system for the Netflix challenge


Faster spam classification

Communication reduced from 42% to 28% of the iteration time

Overall 22% reduction in iteration time

Scalable recommendation system

[Figure: recommendation system scaling, before vs. after]

1.9x faster at 90 nodes

Related work

DCN architectures (VL2, fat-tree, etc.)
» Mechanisms for faster networks, not policies for better sharing

Schedulers for data-intensive applications (Hadoop scheduler, Quincy, Mesos, etc.)
» Schedule CPU, memory, and disk across the cluster

Hedera
» Transfer-unaware flow scheduling

Seawall
» Performance isolation among cloud tenants

Summary

Optimize transfers instead of individual flows
» Utilize knowledge about application semantics

Coordinate transfers
» Orchestra enables policy-based transfer management
» Cornet performs up to 4.5x better than the status quo
» WSS can outperform default solutions by 1.5x

No changes in the network or in end-host OSes

http://www.mosharaf.com/

Backup Slides


MapReduce logs

Week-long trace of 188,000 MapReduce jobs from a 3,000-node cluster

Maximum number of concurrent transfers is several hundred

33% of time spent in shuffle on average

Monarch (Oakland '11)

Real-time spam classification from 345,000 tweets with URLs
» Logistic regression
» Written in Spark

Spends 42% of the iteration time in transfers
» 30% broadcast
» 12% shuffle

100 iterations to converge

Collaborative Filtering

Netflix challenge
» Predict users' ratings for movies they haven't seen based on their ratings for other movies

385 MB of data broadcast in each iteration

Does not scale beyond 60 nodes

Cornet performance


1GB data to 100 receivers on EC2

4.5x to 6.5x improvement

Shuffle bottlenecks

An optimal shuffle schedule must keep at least one link fully utilized throughout the transfer

[Figure: a shuffle can bottleneck at a sender, at a receiver, or in the network]
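To make the sender and receiver cases concrete, here is a simple lower bound (the notation is mine, not the slides': each host has one access link of capacity C, sender i must send d_i^out bytes in total, and receiver j must fetch d_j^in bytes in total):

```latex
% Any shuffle schedule takes at least as long as the busiest access link needs;
% a schedule can only be optimal if some such link stays fully utilized
% throughout the transfer.
T_{\text{shuffle}} \;\ge\; \max\!\left( \max_i \frac{d_i^{\text{out}}}{C},\; \max_j \frac{d_j^{\text{in}}}{C} \right)
```

In the earlier example (with the flow sizes inferred from the figure), each receiver fetches 4 units and s3 sends 4 units over unit-capacity links, so the bound is 4 time units, which WSS attains.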

Current implementations


Shuffle 1GB to 30 reducers on EC2
