36
Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica UC Berkeley

Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Sinbad

Leveraging Endpoint Flexibility in Data-Intensive Clusters

Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica UC  Berkeley  

Page 2: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Communication is Crucial for Analytics at Scale

Performance Facebook analytics jobs spend 33% of their runtime in communication1

As in-memory systems proliferate, the network is likely to become the primary bottleneck

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011

Page 3: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Network Usage is Imbalanced1

Facebook

0

0.25

0.5

0.75

1

0 1 2 3 4 5 6

Frac

tion

of T

ime

Coeff. of Var.2 of Load Across Core-Rack Links

Bing

0

0.25

0.5

0.75

1

0 1 2 3 4

Frac

tion

of T

ime

Coeff. of Var. of Load Across Core-Rack Links

More than 50% of the time, links have high imbalance (Cv > 1).

1. Imbalance considering all cross-rack bytes. Calculated in 10s bins. 2. Coefficient of variation, Cv = (stdev/mean).

Imbalance (Coeff. of Var.2 of Link Utilization)

Imbalance (Coeff. of Var.2 of Link Utilization)

Page 4: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Write Sources 1. Ingestion 2. Pre-processing 3.  Job outputs

What Are the Sources of Cross-Rack Traffic?

DFS Reads 14%

Shuffle 46%

DFS Writes

40%

DFS Reads 31%

Shuffle 15%

DFS Writes

54%

Facebook Bing

1. DFS = Distributed File System

Page 5: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Distributed File Systems (DFS)

Core

Rack 1 Rack 2 Rack 3

F

F F

Pervasive in BigData clusters •  E.g., GFS, HDFS, Cosmos •  Many frameworks interact w/ the same DFS

Files are divided into blocks •  64MB to 1GB in size

Each block is replicated •  To 3 machines for fault tolerance •  In 2 fault domains for partition tolerance.

•  Uniformly placed for a balanced storage

Synchronous operations

F I L E

I I I

E L L E

L E

Page 6: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Fixed Sources Destinations

Flexible Paths Rates

Core

Rack 1 Rack 2 Rack 3

F

F I I

E L L E

Pervasive in BigData clusters •  E.g., GFS, HDFS, Cosmos •  Many frameworks interact w/ the same DFS

Files are divided into blocks •  64MB to 1GB in size

Each block is replicated •  To 3 machines for fault tolerance •  In 2 fault domains for partition tolerance.

•  Uniformly placed for a balanced storage

Synchronous operations

How to handle DFS flows?

A few seconds long

Hedera, VLB,

Orchestra, Coflow, MicroTE, DevoFlow, …

Page 7: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Pervasive in BigData clusters •  Many frameworks interact w/ the same DFS

Files are divided into blocks •  64MB to 1GB in size

Each block is replicated •  To 3 machines for fault tolerance

•  In 2 fault domains for partition tolerance. •  Uniformly placed for a balanced storage

Synchronous operations

Fixed Sources Destinations

Flexible Paths Rates

Core

Rack 1 Rack 2 Rack 3

F

F I I

E L L E

Distributed File Systems (DFS)

Replica locations do not matter as long as constraints are met

Flexible Sources Destinations ✔

Page 8: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Sinbad Steers flexible replication traffic away from hotspots

1.  Faster Writes By avoiding contention during replication

2.  Faster Transfers Due to more balanced network usage closer to edges

Page 9: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

The Distributed Writing Problem

Core

Rack 1 Rack 2 Rack 3

Given •  Blocks of different size, and •  Links of different capacities,

Place blocks to minimize •  The average block write time •  The average file write time

is NP-Hard

F E I L

Given •  Jobs of different length, and •  Machines of different speed,

Schedule jobs to minimize •  The average job completion time

F E I L

Machine 1 Machine 2

Machine 3

Job Shop Scheduling

Page 10: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

The Distributed Writing Problem is NP-Hard

Lack of future knowledge about the •  Locations and durations of network hotspots, •  Size and arrival times of new replica placement requests

Online!^!

Page 11: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

1 C B

T T+1

Time

How to Make it Easy?

Assumptions 1.  Link utilizations are stable 2.  All blocks have the same size

A 2

Theorem: Greedy placement minimizes average block/file write times

Page 12: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

How to Make it Easy? – In Practice

Reality 1.  Average link utilizations are

temporarily stable1,2

2.  Fixed-size large blocks write 93% of all bytes

Assumptions 1.  Link utilizations are stable 2.  All blocks have the same size

1. Utilization is considered stable if its average over next x seconds remains within ±5% of the initial value 2. Typically, x ranges from 5 to 10 seconds. Time to write a 256MB block assuming 50MBps write throughput is 5 seconds

Page 13: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Sinbad Performs two-step greedy replica placement

1.  Pick the least-loaded link 2.  Send a block from the file with the least-remaining

blocks through the selected link

Page 14: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Sinbad Overview

Centralized master-slave architecture •  Agents collocated with DFS agents

Slaves periodically report information

Sinbad Master

DFS Master

DFS Slave

Sinbad Slave

DFS Slave

Sinbad Slave

DFS Slave

Sinbad Slave

Machine

Page 15: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Sinbad Master

Performs network-aware replica placement for large blocks

•  Periodically estimates network hotspots •  Takes greedy online decisions •  Adds hysteresis until next measurement

Sinbad Master Where to put block B?

•  Static Information •  Network topology •  Link, disk capacities •  Dynamic distributions of •  loads in links •  popularity of files

Information (from slaves)

{ Locations }

•  At least r replicas •  In f fault domains •  Collocate with block B’ •  …

Constraints & Hints

Page 16: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

1.  Does it improve performance? 2.  Does it balance the network? 3.  Does the storage remain balanced? YES

Evaluation A 3000-node trace-driven simulation matched against a 100-node EC2 deployment

Page 17: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Faster

Exp

Sim 1.39X 1.58X

1.26X 1.30X

Job Improv. DFS Improv.

1.60X In-memory!storage!

^!

1.79X

Page 18: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

More Balanced EC2 Deployment

0

0.25

0.5

0.75

1

0 1 2 3 4

Frac

tion

of T

ime

Coeff. of Var. of Load Across Rack-to-Host Links

Default Network-Aware

Facebook Trace Simulation

0

0.25

0.5

0.75

1

0 1 2 3 4

Frac

tion

of T

ime

Coeff. of Var. of Load Across Core-to-Rack Links

Default

Network-Aware

Imbalance (Coeff. of Var.1 of Link Utilization)

Imbalance (Coeff. of Var.1 of Link Utilization)

Page 19: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

What About Storage Balance?

Network is imbalanced in the short term; but, in the long term,

hotspots are uniformly distributed

Page 20: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Three Approaches

Toward Contention Mitigation

#3 Balance

Usage

Manage elephant flows Optimize intermediate comm.

Valiant load balancing (VLB), Hedera, Orchestra, Coflow, MicroTE, DevoFlow, …

#1 Increase Capacity

Fatter links/interfaces Increase Bisection B/W

Fat tree, VL2, DCell, BCube, F10, …

#2 Decrease

Load

Data locality Static optimization

Fair scheduling, Delay scheduling, Mantri, Quincy, PeriSCOPE, RoPE, Rhea, …

Page 21: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

•  Improves job performance by making the network more balanced •  Improves DFS write performance while keeping the storage balanced •  Sinbad will become increasingly more important as storage becomes faster

Sinbad Greedily steers replication traffic away from hotspots

We are planning to deploy Sinbad at

Mosharaf Chowdhury - @mosharaf

Page 22: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#
Page 23: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Trace Details

Facebook Microsoft Bing

Period Oct 2010 Mar – Apr 2012

Duration 1 Week 1 Month

Framework Hadoop MapReduce SCOPE

Jobs 175,000 O(10,000)

Tasks 30 Millions O(10 Millions)

File System HDFS Cosmos

Block Size 256MB 256MB

Number of Machines 3,000 O(1,000)

Number of Racks 150 O(100)

Core : Rack Oversubscription 10 : 1 Better / Less Oversubscribed

Page 24: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Writer Characteristics

0

0.25

0.5

0.75

1

0 0.25 0.5 0.75 1

CD

F (W

eigh

ted

by B

ytes

Wri

tten

)

Fraction of Task Duration in Write

Preproc./Ingest

Reducers

Combined 37% of all tasks write to the DFS Writers spend large fractions of runtime in writing

42% of the reducers and 91% of other writers spend at least 50% of run time

Page 25: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Big Blocks Write Most Bytes

0

0.25

0.5

0.75

1

1 100 10000 1000000

Frac

tion

of D

FS B

ytes

Block Size (KB)

30% blocks are full sized, i.e., they are capped at 256MB 35% blocks are of at least 128 MB or more in size

256 MB blocks write 81% of the bytes

Blocks of size at least 128 MB or more write 93% of the bytes

One third of the blocks generate almost all

replication bytes

Page 26: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Big Blocks Write Most Bytes

30% 81% Bytes are written by large blocks

Blocks are large (256 MB)

Page 27: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Hotspots are Stable1 in the Short Term

1. Utilization is considered stable if its average over next x seconds remains within ±5% of the initial value 2. Time to write a 256MB block assuming 50MBps write throughput is 5 seconds

0.94 0.89 0.8

0.66

0

0.2

0.4

0.6

0.8

1

5s 10s 20s 40s

Prob

abili

ty

Duration of Utilization Stability

Long enough2 to write a block even if disk is the bottleneck

Page 28: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Greedy assignment of blocks to the least-loaded link in the least-remaining-blocks-first order is optimal for minimizing the average write times Thm.

•  If the block size is fixed, and •  hotspots are temporarily stable,

the solution is … Greedy Placement

Page 29: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Utilization Estimator

Utilizations updated using EWMA at Δ intervals

•  vnew = α * vmeasured + (1-α) * vold

•  We use α = 0.2 Update interval (Δ)

•  Too small a Δ creates overhead, but too large gives stale data •  We use Δ = 1 second right now •  Missing updates are treated conservatively, as if the link is fully loaded

Page 30: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Utilization Estimator

Hysteresis after each placement decision

•  Temporarily bump up estimates to avoid putting too many blocks in the same location •  Once the next measurement update arrives, hysteresis is removed and

the actual estimation is used Hysteresis function

•  Proportional to the size of block just placed •  Inversely proportional to the time remaining till next update

Page 31: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Implementation

Implemented and integrated with HDFS •  Pluggable replica placement policy on https://github.com/facebook/hadoop-20

•  Slaves are integrated into DataNode and the master into NameNode •  Update comes over the Heartbeat messages •  Few hundred lines of Java

Page 32: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Methodology

HDFS deployment in EC2 •  Focus on large (in terms of network bytes) jobs only •  100 m1.xlarge nodes with 4x400GB disks •  55MBps/disk maximum write throughput •  700+Mbps/node during all-to-all communication

Trace-driven simulation •  Detailed replay of a day-long Facebook trace (circa October 2010) •  3000-node,150-rack cluster with 10:1 oversubscription

Page 33: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Breakdown [Time Spent]

1.11

1.08

1.08 1.

18 1.

26

1.19

1.45

1.27

1.19

1.24

1.28

1.29

1

1.25

1.5

1.75

Bin 1 Bin 2 Bin 3 Bin 4 Bin 5 ALL

Fact

or o

f Im

prov

emen

t End-to-End WriteTime Jobs spend varying amounts of

time in writing •  Jobs in Bin-1 the least and jobs in

Bin-5 the most

No clear correlation •  We improve block writes w/o

considering job characteristics

Page 34: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

1.07

1.11

1.08 1.13 1.

22

1.20

1.31

1.15

1.44

1.23

1

1.25

1.5

1.75

[0, 10) [10, 100) [100, 1E3) [1E3, 1E4) [1E4, 1E5)

Fact

or o

f Im

prov

emen

t

Job Write Size (MB)

End-to-End WriteTime

Breakdown [Bytes Written]

Jobs write varying amounts to HDFS as well •  Ingestion and pre-processing jobs

write the most

No clear correlation •  We do not use file characteristics

while selecting destinations

Page 35: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Balanced Storage [Simulation]

0.5

0.6

0.7

0.8

0.9

1

1 26 51 76 101 126 151

% o

f Byt

es

Rank of Racks

NetworkAware

Default

0.5

0.6

0.7

0.8

0.9

1

1 26 51 76 101 126 151

% o

f Blo

cks

Rank of Racks

NetworkAware Default

Byte Distribution Block Distribution

Reacting to the imbalance isn’t always perfect!

Page 36: Sinbad - mosharaf.com€¦ · Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters! Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica! UC#Berkeley#

Balanced Storage [EC2] An hour-long 100-node EC2 experiment •  Wrote ~10TB of data •  Calculated standard deviation of disk usage across all machines

6.7 15.8

1600

1

10

100

1000

10000

Default Network-Aware Capacity

GB

Imbalance is less than 1% of the storage capacity of each machine