Orchestra Managing Data Transfers in Computer Clusters Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica UC Berkeley


Orchestra: Managing Data Transfers in Computer Clusters

Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica

UC Berkeley


Moving Data is Expensive

Typical MapReduce jobs at Facebook spend 33% of job running time in large data transfers

Application for training a spam classifier on Twitter data spends 40% of its time in communication


Limits Scalability

Scalability of Netflix-like recommendation system is bottlenecked by communication

Did not scale beyond 60 nodes
» Communication time increased faster than computation time decreased


Transfer Patterns

Transfer: set of all flows transporting data between two stages of a job
» Acts as a barrier

Completion time: Time for the last receiver to finish
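The definition of completion time can be made concrete with a small sketch. It assumes each flow is recorded as a hypothetical (sender, receiver, finish_time) tuple; that record layout and the sample numbers are illustrative, not taken from the slides.

```python
from collections import defaultdict

def transfer_completion_time(flows):
    """Return the time at which the last receiver has all of its data.

    `flows` is an iterable of (sender, receiver, finish_time) tuples.
    This record layout is an assumption made for illustration.
    """
    receiver_done = defaultdict(float)
    for _sender, receiver, finish_time in flows:
        # A receiver is done only when its slowest incoming flow is done.
        receiver_done[receiver] = max(receiver_done[receiver], finish_time)
    # The transfer acts as a barrier: the next stage waits for the last receiver.
    return max(receiver_done.values())

# Hypothetical shuffle with two mappers and two reducers.
shuffle = [("m1", "r1", 3.0), ("m2", "r1", 5.0),
           ("m1", "r2", 4.0), ("m2", "r2", 2.0)]
print(transfer_completion_time(shuffle))  # 5.0
```

Speeding up any single flow other than the slowest one into r1 leaves the result unchanged, which is why the deck argues for optimizing the transfer as a whole.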

[Figure: transfer patterns between Map and Reduce stages: Shuffle, Broadcast, Incast*]


Contributions

1. Optimize at the level of transfers instead of individual flows

2. Inter-transfer coordination


Orchestra

[Figure: Orchestra architecture. An Inter-Transfer Controller (ITC) with policies fair sharing, FIFO, and priority sits above one Transfer Controller (TC) per transfer. The shuffle TC chooses between Hadoop shuffle and WSS; each broadcast TC chooses between HDFS, Tree, and Cornet. Example transfers shown: shuffle, broadcast 1, broadcast 2.]


Outline

Cooperative broadcast (Cornet)
» Infer and utilize topology information

Weighted Shuffle Scheduling (WSS)
» Assign flow rates to optimize shuffle completion time

Inter-Transfer Controller
» Implement weighted fair sharing between transfers

End-to-end performance


Cornet: Cooperative broadcast

Broadcast same data to every receiver
» Fast, scalable, adaptive to bandwidth, and resilient

Peer-to-peer mechanism optimized for cooperative environments

Observations and Cornet design decisions

1. High-bandwidth, low-latency network
» Large block size (4-16MB)

2. No selfish or malicious peers
» No need for incentives (e.g., TFT)
» No (un)choking
» Everyone stays till the end

3. Topology matters
» Topology-aware broadcast


Cornet performance

1GB data to 100 receivers on EC2

[Figure: broadcast performance of Cornet compared with the status quo]

4.5x to 5x improvement


Topology-aware Cornet

Many data center networks employ tree topologies

Each rack should receive exactly one copy of broadcast

»Minimize cross-rack communication

Topology information reduces cross-rack data transfer

»Mixture of spherical Gaussians to infer network topology
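The goal of one cross-rack copy per rack can be illustrated with a simplified planning sketch. It is not Cornet's block-level peer-to-peer protocol, and it does not perform the Gaussian-mixture inference; it assumes rack (or inferred cluster) labels are already known, and the names `rack_aware_plan`, `cross_rack_copies`, and the node labels are hypothetical.

```python
def rack_aware_plan(source, rack_of):
    """Plan (sender, receiver) copies so each remote rack gets one cross-rack copy.

    `rack_of` maps node name -> rack label. The labels are assumed to be
    known here, e.g. from configuration or from an earlier clustering step.
    """
    members = {}
    for node, rack in rack_of.items():
        members.setdefault(rack, []).append(node)

    plan = []
    for rack, nodes in members.items():
        if rack == rack_of[source]:
            seed = source
        else:
            seed = nodes[0]
            plan.append((source, seed))  # the single cross-rack copy
        # Every other node in the rack fetches from the local seed.
        plan.extend((seed, node) for node in nodes if node != seed)
    return plan

def cross_rack_copies(plan, rack_of):
    """Count copies whose sender and receiver are in different racks."""
    return sum(1 for s, r in plan if rack_of[s] != rack_of[r])

rack_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
plan = rack_aware_plan("a1", rack_of)
print(cross_rack_copies(plan, rack_of))  # 2
```

With three racks and the source in rack A, only two copies cross rack boundaries, one per remote rack; a topology-unaware plan could send up to four.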


Topology-aware Cornet


~2x faster than vanilla Cornet

200MB data to 30 receivers on DETER

3 inferred clusters


Status quo in Shuffle

[Figure: example shuffle with senders s1-s5 and receivers r1 and r2]

Links to r1 and r2 are full: 3 time units

Link from s3 is full: 2 time units

Completion time: 5 time units


Weighted Shuffle Scheduling

Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent

[Figure: the same shuffle with senders s1-s5 and receivers r1 and r2; flow weights 1, 1, 2, 2, 1, 1]

Completion time: 4 time units

Up to 1.5X improvement
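The 5 versus 4 time units in this example can be reproduced with a small simulation. The sketch below makes several assumptions that are not stated on the slides: s1 and s2 each send 1 unit to r1, s4 and s5 each send 1 unit to r2, s3 sends 2 units to each receiver, every sender and receiver link carries 1 unit per time unit, and fair sharing is modeled as max-min allocation by progressive filling. It is an illustration of the idea, not the authors' implementation.

```python
from fractions import Fraction

def weighted_rates(flows, weights, capacity=Fraction(1)):
    """Weighted max-min rates via progressive filling on sender/receiver links."""
    rates = {f: Fraction(0) for f in flows}
    spare = {}
    for s, r in flows:
        spare[("up", s)] = capacity
        spare[("down", r)] = capacity
    active = set(flows)
    while active:
        # Total weight of still-growing flows on each link.
        load = {}
        for s, r in active:
            for link in (("up", s), ("down", r)):
                load[link] = load.get(link, 0) + weights[(s, r)]
        # Grow all active flows until the first link saturates.
        step = min(spare[link] / w for link, w in load.items())
        for f in active:
            rates[f] += weights[f] * step
        for link, w in load.items():
            spare[link] -= w * step
        full = {link for link in load if spare[link] == 0}
        active = {(s, r) for s, r in active
                  if ("up", s) not in full and ("down", r) not in full}
    return rates

def completion_time(sizes, weighted):
    """Simulate the shuffle; re-allocate rates whenever a flow finishes."""
    remaining = dict(sizes)
    now = Fraction(0)
    while remaining:
        # WSS: weight = data the flow has to send; fair sharing: weight = 1.
        weights = {f: (sizes[f] if weighted else Fraction(1)) for f in remaining}
        rates = weighted_rates(list(remaining), weights)
        dt = min(remaining[f] / rates[f] for f in remaining)
        now += dt
        remaining = {f: left - rates[f] * dt
                     for f, left in remaining.items()
                     if left - rates[f] * dt > 0}
    return now

# Assumed data amounts for the example shuffle (units of data).
SIZES = {("s1", "r1"): Fraction(1), ("s2", "r1"): Fraction(1),
         ("s3", "r1"): Fraction(2), ("s3", "r2"): Fraction(2),
         ("s4", "r2"): Fraction(1), ("s5", "r2"): Fraction(1)}

print(completion_time(SIZES, weighted=False))  # 5
print(completion_time(SIZES, weighted=True))   # 4
```

Under equal sharing the small flows finish after 3 time units and s3 then needs 2 more on its own link; with weights proportional to data, every flow finishes together at 4.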


Inter-Transfer Controller aka Conductor

Weighted fair sharing
» Each transfer is assigned a weight
» Congested links shared proportionally to transfers’ weights

Implementation: Weighted Flow Assignment (WFA)
» Each transfer gets a number of TCP connections proportional to its weight
» Requires no changes in the network or in end-host OSes
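A rough sketch of the WFA idea follows. The function names, the integer weights, and the connection budget are hypothetical, and the share estimate relies on the approximation that TCP divides a congested link roughly equally among connections; none of this is the authors' code.

```python
from fractions import Fraction

def assign_connections(weights, budget):
    """Give each transfer a number of TCP connections proportional to its weight.

    `weights` maps transfer name -> positive integer weight; `budget` is an
    assumed total number of connections to hand out. Each transfer gets at
    least one connection so that it can make progress.
    """
    total = sum(weights.values())
    return {name: max(1, budget * w // total) for name, w in weights.items()}

def expected_shares(connections):
    """Approximate share of a congested link, assuming equal per-connection rates."""
    total = sum(connections.values())
    return {name: Fraction(n, total) for name, n in connections.items()}

conns = assign_connections({"high": 3, "low": 1}, budget=8)
shares = expected_shares(conns)
print(conns)  # {'high': 6, 'low': 2}
```

Because the weighting is expressed only through connection counts, the sketch needs nothing beyond ordinary sockets, which matches the slide's point that neither the network nor the OS has to change.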


Benefits of the ITC

Shuffle using 30 nodes on EC2

Two priority classes
» FIFO within each class

Low priority transfer
» 2GB per reducer

High priority transfers
» 250MB per reducer

[Figure: two panels, "Without Inter-transfer Scheduling" and "Priority Scheduling in Conductor"]

43% reduction in high priority transfers

6% increase of the low priority transfer


End-to-end evaluation

Developed in the context of Spark – an iterative, in-memory MapReduce-like framework

Evaluated using two iterative applications developed by ML researchers at UC Berkeley

» Training spam classifier on Twitter data
» Recommendation system for the Netflix challenge


Faster spam classification

Communication reduced from 42% to 28% of the iteration time

Overall 22% reduction in iteration time


Scalable recommendation system

[Figure: two panels, "Before" and "After"]

1.9x faster at 90 nodes


Related work

DCN architectures (VL2, Fat-tree etc.)
» Mechanism for faster network, not policy for better sharing

Schedulers for data-intensive applications (Hadoop scheduler, Quincy, Mesos etc.)
» Schedules CPU, memory, and disk across the cluster

Hedera
» Transfer-unaware flow scheduling

Seawall
» Performance isolation among cloud tenants


Summary

Optimize transfers instead of individual flows
» Utilize knowledge about application semantics

Coordinate transfers
» Orchestra enables policy-based transfer management
» Cornet performs up to 4.5x better than the status quo
» WSS can outperform default solutions by 1.5x

No changes in the network or in end-host OSes

http://www.mosharaf.com/


Backup Slides


MapReduce logs

Weeklong trace of 188,000 MapReduce jobs from a 3000-node cluster

Maximum number of concurrent transfers is several hundred

33% of time spent in shuffle on average


Monarch (Oakland’11)

Real-time spam classification from 345,000 tweets with URLs
» Logistic Regression
» Written in Spark

Spends 42% of the iteration time in transfers
» 30% broadcast
» 12% shuffle

100 iterations to converge


Collaborative Filtering

Netflix challenge
» Predict users’ ratings for movies they haven’t seen based on their ratings for other movies

385MB data broadcast in each iteration

Does not scale beyond 60 nodes


Cornet performance


1GB data to 100 receivers on EC2

4.5x to 6.5x improvement


Shuffle bottlenecks

An optimal shuffle schedule must keep at least one link fully utilized throughout the transfer

[Figure: three bottleneck locations: at a sender, at a receiver, in the network]
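One consequence of the statement above is a simple lower bound on shuffle completion time: no schedule can finish before the most heavily loaded link has carried all of its data. The sketch below covers only sender and receiver links (in-network bottlenecks are ignored), and the data amounts, link capacity, and function name are illustrative assumptions.

```python
from collections import defaultdict

def shuffle_lower_bound(sizes, capacity=1.0):
    """Lower bound on completion time from the busiest sender or receiver link.

    `sizes` maps (sender, receiver) -> amount of data; `capacity` is the
    assumed rate of every sender uplink and receiver downlink.
    """
    sent = defaultdict(float)      # total data leaving each sender
    received = defaultdict(float)  # total data entering each receiver
    for (sender, receiver), amount in sizes.items():
        sent[sender] += amount
        received[receiver] += amount
    busiest = max(max(sent.values()), max(received.values()))
    return busiest / capacity

# Hypothetical shuffle: five senders, two receivers, unit-capacity links.
sizes = {("s1", "r1"): 1, ("s2", "r1"): 1, ("s3", "r1"): 2,
         ("s3", "r2"): 2, ("s4", "r2"): 1, ("s5", "r2"): 1}
print(shuffle_lower_bound(sizes))  # 4.0
```

A schedule that reaches this bound keeps the busiest link saturated from start to finish, which is the condition the slide describes.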


Current implementations


Shuffle 1GB to 30 reducers on EC2