Description

Mesos: A Flexible Cluster Resource Manager

Mesos is a platform for sharing clusters between multiple parallel applications, such as Hadoop, MPI, and web services. It allows organizations to consolidate workloads from multiple applications onto a single cluster, which increases utilization and simplifies management, while ensuring that each application gets suitable resources to run on (e.g. that Hadoop gets data locality). Hadoop users can leverage Mesos to run multiple isolated instances of Hadoop (e.g. production and experimental instances) on the same cluster, to run multiple versions of Hadoop concurrently (e.g. to test patches), and to run MPI and other applications alongside Hadoop. In addition, due to its simple design, Mesos is highly scalable (to tens of thousands of nodes) and fault tolerant (using ZooKeeper). In this talk, we'll provide an introduction to Mesos for Hadoop users. We have also been testing Mesos at Twitter and Facebook, and we'll present experience and results from those deployments.

Spark: A Framework for Iterative and Interactive Cluster Computing

Although the MapReduce programming model has been highly successful, it is not suitable for all applications. We present Spark, a framework optimized for one such type of application: iterative jobs in which a dataset is shared across multiple parallel operations, as is common in machine learning algorithms. Spark provides a functional programming model similar to MapReduce, but also lets users cache data between iterations, leading to up to 10x better performance than Hadoop. Spark also makes programming jobs easy by integrating into the Scala programming language. Finally, the ability of Spark to load a dataset into memory and query it repeatedly makes it especially suitable for interactive analysis of big datasets. We have modified the Scala interpreter to make it possible to use Spark interactively, providing a much more responsive experience than Hive and Pig. Spark is built on top of Mesos and can therefore run alongside Hadoop.

Project URL: http://www.cs.berkeley.edu/~matei/spark/
Mesos: A Resource Management Platform for Hadoop and Big Data Clusters
Benjamin Hindman, Andy Konwinski, Matei Zaharia,
Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
UC Berkeley
Challenges in Hadoop Cluster Management

Isolation (both fault and resource)
» E.g. if co-locating production and experimental jobs
Testing and deploying upgrades
JobTracker scalability (in larger clusters)
Running non-Hadoop jobs
» E.g. MPI, streaming MapReduce, etc.
Mesos

Mesos is a cluster resource manager over which multiple instances of Hadoop and other distributed applications can run.

[Diagram: Hadoop (production), Hadoop (ad-hoc), and MPI all running on top of Mesos, which spans the cluster's nodes.]
What's Different About It?

Other resource managers exist today, including:
» Hadoop on Demand
» Batch schedulers (e.g. Torque)
» VM schedulers (e.g. Eucalyptus)

However, these have two problems with Hadoop-like workloads:
» Data locality is compromised due to static partitioning of nodes
» Utilization is hurt because jobs hold nodes for their full duration

Mesos addresses these through two features:
» Fine-grained sharing at the level of tasks
» Two-level scheduling model where jobs control placement
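To make the two-level model concrete, here is a toy sketch in Scala (hypothetical types and names, not the real Mesos API): the master makes resource offers, and each framework's own scheduler decides which offers to accept, which is how a job can control placement and preserve data locality.

```scala
// Toy model of two-level scheduling: the master offers resources,
// frameworks accept or decline them.

case class Offer(node: String, cpus: Int, memMb: Int)

trait Scheduler {
  // Return the subset of offers this framework accepts; the rest are
  // declined and can be re-offered to other frameworks.
  def resourceOffers(offers: Seq[Offer]): Seq[Offer]
}

// A framework that only accepts offers on nodes holding its input data,
// illustrating how fine-grained sharing can preserve locality.
class LocalityScheduler(dataNodes: Set[String]) extends Scheduler {
  def resourceOffers(offers: Seq[Offer]): Seq[Offer] =
    offers.filter(o => dataNodes.contains(o.node))
}

val offers = Seq(Offer("node1", 4, 8192), Offer("node2", 4, 8192))
val accepted = new LocalityScheduler(Set("node1")).resourceOffers(offers)
println(accepted.map(_.node).mkString(", "))  // node1
```

In the real system the declined resources go back to the master, which re-offers them to other frameworks, so no job holds nodes it is not actively using.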
Fine-Grained Sharing

[Diagram: two cluster timelines compared. Under coarse-grained sharing (HOD, etc.), each of Hadoop 1, Hadoop 2, and Hadoop 3 holds a static block of nodes for its full duration. Under fine-grained sharing (Mesos), tasks from Hadoop 1, 2, and 3 are interleaved across all nodes.]
Mesos Architecture

[Diagram: a Mesos master hosts the framework schedulers (Hadoop 0.19 scheduler, Hadoop 0.20 scheduler, MPI scheduler), each attached to its job. Mesos slaves on each node run the matching executors (Hadoop 19 executor, Hadoop 20 executor, MPI executor), which run the frameworks' tasks.]
Design Goals

Scalability (to 10,000's of nodes)
Robustness (even to master failure)
Flexibility (support a wide variety of frameworks)

Resulting design: a simple, minimal core that pushes resource selection logic to frameworks.
Other Features

Master fault tolerance using ZooKeeper
Resource isolation using Linux Containers
» Isolate CPU and memory between tasks on each node
» In newest kernels, can isolate network & disk I/O too
Web UI for viewing cluster state
Deploy scripts for private clusters and EC2
Mesos Status

Prototype in ~10,000 lines of C++
Ported frameworks:
» Hadoop (0.20.2), MPI (MPICH2), Torque
New frameworks:
» Spark, a Scala framework for iterative & interactive jobs
Test deployments at Twitter and Facebook
Results

[Chart: Dynamic Resource Sharing — share of the cluster (%) held by MPI, Hadoop, and Spark over time (s), showing the frameworks' shares shifting dynamically.]

[Chart: Data Locality — percentage of local map tasks and job running times (s) under static partitioning vs. Mesos with no delay scheduling, 1s delay scheduling, and 5s delay scheduling.]

[Chart: Scalability — task startup overhead (s) vs. number of nodes, up to 50,000 nodes.]
Mesos Demo
Spark: A Framework for Iterative and Interactive Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica
Spark Goals

Support iterative applications
» Common in machine learning but problematic for MapReduce, Dryad, etc.
Retain MapReduce's fault tolerance & scalability
Experiment with programmability
» Integrate into the Scala programming language
» Support interactive use from the Scala interpreter
Key Abstraction

Resilient Distributed Datasets (RDDs)
» Collections of elements distributed across the cluster that can persist across parallel operations
» Can be stored in memory, on disk, etc.
» Can be transformed with map, filter, etc.
» Automatically rebuilt on failure
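The "rebuilt on failure" property comes from lineage: a dataset remembers the transformations used to derive it rather than replicating the data itself. A much-simplified toy sketch of the idea (a hypothetical ToyRDD class, not Spark's actual implementation):

```scala
// Toy RDD: each dataset carries a closure describing how to recompute
// itself from its parent (its lineage).
class ToyRDD[T](compute: () => Seq[T]) {
  private var cached: Option[Seq[T]] = None

  // Transformations record lineage instead of eagerly producing data.
  def map[U](f: T => U): ToyRDD[U] = new ToyRDD(() => collect().map(f))
  def filter(p: T => Boolean): ToyRDD[T] = new ToyRDD(() => collect().filter(p))

  def cache(): this.type = { cached = Some(compute()); this }

  // Use the cached copy if present; otherwise recompute from lineage.
  def collect(): Seq[T] = cached.getOrElse(compute())
  def count(): Int = collect().size

  // Simulate a failure that loses the cached data.
  def dropCache(): Unit = { cached = None }
}

val lines  = new ToyRDD(() => Seq("ERROR a", "INFO b", "ERROR c"))
val errors = lines.filter(_.startsWith("ERROR")).cache()
errors.dropCache()        // "failure": the cached copy is gone
println(errors.count())   // 2 -- transparently rebuilt from lineage
```

Real RDDs track lineage per partition, so only the lost partitions are recomputed, but the recovery principle is the same.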
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver builds a base RDD from HDFS blocks (Block 1-3), transforms it, and caches it across workers (Cache 1-3); for each parallel operation the driver sends tasks to the workers and collects results from the cached RDD.]
Example: Logistic Regression

Goal: find the best line separating two sets of points.

[Diagram: scatter plot of + and – points, with a random initial line converging toward the target separating line.]
Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p => {
    val scale = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    scale * p.x
  }).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
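The slide's code assumes helpers it does not show: readPoint, a point type with fields x and y, and a Vector class supporting dot, scaling, and subtraction. A self-contained plain-Scala sketch of the same gradient computation (hypothetical names, no Spark) shows what one step of the map/reduce body does:

```scala
import scala.math.exp

// Hypothetical stand-ins for the slide's unstated types.
case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

// One logistic-gradient step over the whole dataset, matching the
// map(...).reduce(_ + _) body on the slide.
def gradient(data: Seq[Point], w: Array[Double]): Array[Double] =
  data.map { p =>
    val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })

val data = Seq(Point(Array(1.0, 0.0), 1.0), Point(Array(0.0, 1.0), -1.0))
val g = gradient(data, Array(0.0, 0.0))
println(g.mkString(", "))  // -0.5, 0.5
```

In Spark, caching data once lets this step run entirely from memory on every iteration, which is where the 10x speedup over re-reading the input in Hadoop comes from.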
Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1 to 30) for Hadoop and Spark. Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for each further iteration.]
Spark Demo
Conclusion

Mesos provides a stable platform for multiplexing resources among diverse cluster applications.
Spark is a new cluster programming framework for iterative & interactive jobs, enabled by Mesos.
Both are open source (but in very early alpha!).
http://github.com/mesos
http://mesos.berkeley.edu