Cluster Computing with DryadLINQ
Mihai Budiu Microsoft Research, Silicon Valley
Cloudera, February 12, 2010
2
Goal
3
Design Space
[Figure: design-space chart. Axes span latency vs. throughput and Internet vs. private data center, with shared memory vs. data-parallel as the computation style. Dryad targets high-throughput, data-parallel computing in private data centers, near Grid and HPC; search, transaction processing, and shared-memory systems occupy other regions.]

4

Data-Parallel Computation

Application stacks compared (Language / Execution / Storage):

Language:  SQL                | Sawzall       | Pig, Hive (≈SQL) | DryadLINQ, Scope (LINQ, SQL)
Execution: Parallel Databases | Map-Reduce    | Hadoop           | Dryad (Cosmos, HPC, Azure)
Storage:   SQL Server         | GFS, BigTable | HDFS, S3         | Cosmos, Azure
5
Software Stack

[Figure: layered software stack. Machines run Windows Server; above them sit storage systems (Cosmos FS, NTFS, Azure XStore, SQL Server, Tidy FS) and execution layers (Dryad on Cosmos, Azure XCompute, Windows HPC); above those, programming layers (DryadLINQ, Scope, PSQL, SSIS, SQL server, Distributed Shell, C++ legacy code); at the top, applications (.Net distributed data structures, machine learning, graphs, data mining, optimization, analytics).]
6
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
• Conclusions
Outline
7
Dryad
• Continuously deployed since 2006
• Running on >> 10^4 machines
• Sifting through > 10 PB of data daily
• Runs on clusters of > 3000 machines
• Handles jobs with > 10^5 processes each
• Platform for a rich software ecosystem
• Used by >> 100 developers
• Written at Microsoft Research, Silicon Valley
8
Dryad = Execution Layer
Job (application) : Dryad : Cluster  ≈  Pipeline : Shell : Machine
9
2-D Piping

• Unix pipes: 1-D
  grep | sed | sort | awk | perl

• Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
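The 2-D generalization can be sketched in a few lines of Python. This is an illustrative stand-in (the stage functions and partition data are invented); in Dryad each per-partition copy of a stage would run on a different cluster machine.

```python
# Sketch: a 1-D pipeline applies each stage once; a 2-D pipeline applies
# one copy of each stage per data partition (grep^3 | sort^3 below).

def run_stage(stage_fn, partitions):
    # One copy of the stage per partition; Dryad schedules each copy
    # on a cluster machine.
    return [stage_fn(p) for p in partitions]

def grep_error(lines):
    return [l for l in lines if "error" in l]

parts = [["b error", "ok"], ["a error"], ["fine"]]   # 3 input partitions
found = run_stage(grep_error, parts)                 # grep^3
ordered = run_stage(sorted, found)                   # sort^3
```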
10
Virtualized 2-D Pipelines
11
Virtualized 2-D Pipelines
12
Virtualized 2-D Pipelines
13
Virtualized 2-D Pipelines
14
Virtualized 2-D Pipelines
• 2-D DAG
• multi-machine
• virtualized
15
Dryad Job Structure
[Figure: a Dryad job is a DAG. Input files feed vertices (processes) such as grep, sed, sort, awk, and perl, connected by channels; vertices of the same kind form a stage; results flow to output files.]
16
Channels
[Figure: a vertex M consumes items from a channel produced by vertex X.]

Channels are finite streams of items, implemented as:
• distributed filesystem files (persistent)
• SMB/NTFS files (temporary)
• TCP pipes (inter-machine)
• memory FIFOs (intra-machine)
17
Dryad System Architecture
[Figure: the job manager runs the control plane: it holds the job schedule and talks to the cluster's name server (NS), scheduler, and per-machine process daemons (PD). Vertices (V) run on cluster machines and exchange data over the data plane via files, TCP, and FIFOs.]
Fault Tolerance
19
Policy Managers
[Figure: each stage (R, X) has a stage manager, and each connection between stages has a connection manager (R-X); these policy managers plug into the job manager and steer scheduling.]

Dynamic Graph Rewriting

[Figure: vertices X[0], X[1], and X[3] have completed; X[2] is slow, so a duplicate vertex X'[2] is started, and the first copy to finish wins.]

Duplication Policy = f(running times, data volumes)
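A hedged sketch of what such a duplication policy might look like in Python; the 1.5x threshold and the three-sample minimum are illustrative assumptions, not Dryad's actual constants.

```python
# Sketch of a straggler-duplication policy in the spirit of
# "Duplication Policy = f(running times, data volumes)".

def should_duplicate(running_secs, completed_secs, data_ratio=1.0):
    """Duplicate a still-running vertex once it has run much longer
    than the median completed vertex, normalized by input size."""
    if len(completed_secs) < 3:          # too few finished copies to judge
        return False
    ordered = sorted(completed_secs)
    median = ordered[len(ordered) // 2]
    return running_secs > 1.5 * median * data_ratio

slow = should_duplicate(100, [40, 50, 60])   # 100s vs. median 50s
ok = should_duplicate(60, [40, 50, 60])
```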
Cluster network topology

[Figure: machines are grouped into racks; each rack has a top-of-rack switch, and racks connect through a top-level switch.]

22

Dynamic Aggregation

[Figure: source vertices (S) feed aggregation vertices (A) and a final target (T). The static plan is rewritten dynamically so that data from machines in the same rack (rack #1, #2, #3) is aggregated within the rack before crossing the top-level switch.]
23
Policy vs. Mechanism
Application-level (policy):
• Most complex; in C++ code
• Invoked with upcalls
• Need good default implementations
• DryadLINQ provides a comprehensive set

Built-in (mechanism):
• Scheduling
• Graph rewriting
• Fault tolerance
• Statistics and reporting
24
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
• Conclusions
Outline
25
LINQ
Dryad
=> DryadLINQ
26
LINQ = .Net + Queries
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
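A Python sketch of what the query above computes: filter the collection with IsLegal, then project each element to a (hash, value) pair. The predicate and hash function are stand-ins for the C# methods declared on the slide.

```python
# Python sketch of the LINQ query: where IsLegal(c.key),
# select new { Hash(c.key), c.value }.

def is_legal(key):
    return key >= 0                 # assumed predicate

def hash_key(key):
    return "h" + str(key)           # assumed hash function

collection = [(-1, "a"), (2, "b"), (3, "c")]    # (key, value) records

results = [(hash_key(key), value)
           for (key, value) in collection
           if is_legal(key)]
```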
27
Collections and Iterators

class Collection<T> : IEnumerable<T>;

public interface IEnumerable<T> {
    IEnumerator<T> GetEnumerator();
}

public interface IEnumerator<T> {
    T Current { get; }
    bool MoveNext();
    void Reset();
}
28
DryadLINQ Data Model
[Figure: a DryadLINQ collection is a set of partitions, each holding .Net objects.]
29
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
DryadLINQ = LINQ + Dryad
[Figure: the C# query is compiled into a Dryad query plan; generated C# vertex code runs over the partitions of the input collection and produces the results collection.]
30
Demo
31
Example: Histogram

public static IQueryable<Pair> Histogram(
    IQueryable<LineRecord> input, int k)
{
    var words = input.SelectMany(x => x.line.Split(' '));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    return top;
}
“A line of words of wisdom”
[“A”, “line”, “of”, “words”, “of”, “wisdom”]
[[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
[ {“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}]
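The same pipeline can be sketched in Python; the result matches the trace above ("A line of words of wisdom" reduced to its top-3 word counts).

```python
# Python sketch of the Histogram query: SelectMany, GroupBy+Count,
# OrderByDescending, Take(k).
from collections import Counter

def histogram(lines, k):
    words = [w for line in lines for w in line.split(' ')]  # SelectMany
    counts = Counter(words)                # GroupBy + Select(Count)
    ordered = counts.most_common()         # OrderByDescending
    return ordered[:k]                     # Take(k)

top = histogram(["A line of words of wisdom"], 3)
```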
32
Histogram Plan
[Figure: the histogram query plan. Per input partition: SelectMany, Sort, GroupBy+Select, HashDistribute. After the exchange: MergeSort, GroupBy, Select, Sort, Take. A final vertex does MergeSort, Take.]
33
Map-Reduce in DryadLINQ

public static IQueryable<S> MapReduce<T,M,K,S>(
    this IQueryable<T> input,
    Func<T, IEnumerable<M>> mapper,
    Func<M, K> keySelector,
    Func<IGrouping<K,M>, S> reducer)
{
    var map = input.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.Select(reducer);
    return result;
}
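A Python sketch of the operator's semantics: SelectMany(mapper), then GroupBy(keySelector), then Select(reducer). Sorting by key stands in for the shuffle, since itertools.groupby requires sorted input.

```python
# Python sketch of MapReduce as the composition of three operators.
from itertools import groupby

def map_reduce(inputs, mapper, key_selector, reducer):
    mapped = [m for x in inputs for m in mapper(x)]   # SelectMany
    mapped.sort(key=key_selector)                     # shuffle stand-in
    return [reducer(key, list(group))                 # Select over groups
            for key, group in groupby(mapped, key=key_selector)]

# Word count, the canonical instantiation:
counts = map_reduce(
    ["a b a", "b"],
    mapper=str.split,
    key_selector=lambda w: w,
    reducer=lambda key, group: (key, len(group)),
)
```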
34
Map-Reduce Plan
[Figure: the Map-Reduce query plan. The static plan runs map (M), sort (Q), groupby (G1), reduce (R), and distribute (D) over each input partition; mergesort (MS), groupby (G2), and reduce (R) vertices perform partial aggregation; a final mergesort, groupby, and reduce feed the consumer (X). The partial-aggregation tree is introduced dynamically at run time, following the rack topology as in the dynamic-aggregation slide.]
35
Distributed Sorting Plan
[Figure: the distributed sort plan. Each input partition is sampled (DS) and a histogram (H) of the samples determines range boundaries; data is range-distributed (D), then each output partition is merged (M) and sorted (S). Both the sampling stage and the number of range partitions are chosen dynamically.]
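The sampling idea behind this plan can be sketched in Python: sample the partitions (DS), derive range boundaries from the samples (H), range-distribute the records (D), then sort each range locally (S). Sampling every other element is an assumption made for determinism; a real plan samples randomly.

```python
# Sketch of sample-based range partitioning for a distributed sort.
import bisect

def range_partition_sort(partitions, n_out):
    samples = sorted(x for p in partitions for x in p[::2])  # DS: sample
    step = max(1, len(samples) // n_out)
    bounds = samples[step::step][:n_out - 1]                 # H: boundaries
    out = [[] for _ in range(n_out)]
    for p in partitions:
        for x in p:                                          # D: distribute
            out[bisect.bisect_right(bounds, x)].append(x)
    return [sorted(part) for part in out]                    # S: local sort

ranges = range_partition_sort([[5, 1, 9], [3, 7, 2]], 2)
```

Concatenating the output partitions yields the globally sorted sequence, since every element of partition i is smaller than every element of partition i+1.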
Expectation Maximization
36
• 160 lines
• 3 iterations shown
37
Probabilistic Index Maps

[Figure: sample images and their extracted features.]
38
Language Summary
Where, Select, GroupBy, OrderBy, Aggregate, Join, Apply, Materialize
39
LINQ System Architecture
[Figure: on the local machine, a .Net program (C#, VB, F#, etc.) issues a Query through a LINQ Provider to an execution engine, which returns Objects.]

LINQ providers include:
• LINQ-to-objects
• PLINQ
• LINQ-to-SQL
• LINQ-to-WS
• DryadLINQ
• Flickr
• Oracle
• LINQ-to-XML
• Your own
The DryadLINQ Provider
40
[Figure: on the client machine, invoking a query hands the .Net query expression to DryadLINQ, which compiles a distributed query plan and vertex code, serializes the context, and submits a Dryad job manager (JM) to the data center. Dryad executes over the input tables and writes output tables; ToCollection/foreach pull the results back to the client as .Net objects.]
41
Combining Query Providers
[Figure: on the local machine, a .Net program (C#, VB, F#, etc.) issues a Query; DryadLINQ composes with other LINQ providers (PLINQ, LINQ-to-SQL on SQL Server, LINQ-to-objects) as nested execution engines.]
42
Using PLINQ

[Figure: DryadLINQ hands each vertex's local query to PLINQ, which parallelizes it across the cores of a machine.]
43
Using LINQ to SQL Server

[Figure: DryadLINQ partitions the query and pushes sub-queries through LINQ-to-SQL, which runs them inside SQL Server instances.]
44
Using LINQ-to-objects

[Figure: the same query can run through LINQ-to-objects on the local machine for debugging, or through DryadLINQ on the cluster for production.]
45
• Introduction
• Dryad
• DryadLINQ
• Building on/for DryadLINQ
  – System monitoring with Artemis
  – Privacy-preserving query language (PINQ)
  – Machine learning
• Conclusions
Outline
46
Artemis: measuring clusters
[Figure: Artemis collects logs and cluster/job state from Cosmos, HPC, and Azure clusters through a Cluster/Job State API; a cluster browser/manager and a job browser sit on top, and DryadLINQ itself powers the statistics, visualization, database, and plug-in layers.]
47
DryadLINQ job browser
48
Automated diagnostics
49
Job statistics: schedule and critical path
50
Running time distribution
51
Performance counters
52
CPU Utilization
53
Load imbalance: rack assignment
54
PINQ
[Figure: analysts send queries (LINQ) to a privacy-sensitive database and receive answers.]
55
PINQ = Privacy-Preserving LINQ

• "Type-safety" for privacy
• Provides an interface to data that looks very much like LINQ
• All access through the interface gives differential privacy
• Analysts write arbitrary C# code against data sets, as in LINQ
• No privacy expertise needed to produce analyses
• Privacy currency is used to limit per-record information released
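The "privacy currency" idea can be sketched in Python: each aggregate spends epsilon from a per-dataset budget and is released with Laplace noise. The API and constants below are illustrative assumptions, not PINQ's actual interface.

```python
# Sketch of privacy-budget accounting with noisy counts.
import random

class PrivateDataset:
    def __init__(self, records, budget):
        self.records = records
        self.budget = budget                 # total epsilon available

    def noisy_count(self, predicate, epsilon):
        if epsilon > self.budget:
            raise ValueError("privacy budget exhausted")
        self.budget -= epsilon               # spend the currency
        true_count = sum(1 for r in self.records if predicate(r))
        # A count has sensitivity 1, so Laplace noise of scale 1/epsilon
        # gives epsilon-differential privacy; the difference of two
        # exponentials is Laplace-distributed.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

ds = PrivateDataset([1, 2, 3, 4], budget=1.0)
answer = ds.noisy_count(lambda r: r > 1, epsilon=0.5)
```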
56
Example: search logs mining
Distribution of queries about “Cricket”
// Open sensitive data set with state-of-the-art security
PINQueryable<VisitRecord> visits = OpenSecretData(password);

// Group visits by patient and identify frequent patients.
var patients = visits.GroupBy(x => x.Patient.SSN)
                     .Where(x => x.Count() > 5);

// Map each patient to their post code using their SSN.
var locations = patients.Join(SSNtoPost, x => x.SSN, y => y.SSN,
                              (x, y) => y.PostCode);

// Count post codes containing at least 10 frequent patients.
var activity = locations.GroupBy(x => x)
                        .Where(x => x.Count() > 10);

Visualize(activity); // Who knows what this does???
57
PINQ Download
• Implemented on top of DryadLINQ
• Allows mining very sensitive datasets privately
• Code is available: http://research.microsoft.com/en-us/projects/PINQ/
• Frank McSherry, Privacy Integrated Queries, SIGMOD 2009
58
Natal Training
59
Natal Problem
• Recognize players from a depth map
• At frame rate
• Using 15% of one Xbox CPU core
60
Learn from Data
[Figure: motion capture provides ground truth; rasterized training examples feed machine learning, which produces the classifier.]
61
Running on Xbox
62
Learning from data
[Figure: the same training pipeline at scale; training examples feed machine learning built on DryadLINQ and Dryad to produce the classifier.]
63
Large-Scale Machine Learning

• > 10^22 objects
• Sparse, multi-dimensional data structures
• Complex datatypes (images, video, matrices, etc.)
• Complex application logic and dataflow
  – > 35,000 lines of .Net
  – 140 CPU-days
  – > 10^5 processes
  – 30 TB of data analyzed
  – 140 average parallelism (235 machines)
  – 300% CPU utilization (4 cores/machine)
64
Highly efficient parallelization
65
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
• Conclusions
Outline
66
Lessons Learned

• Complete separation of storage / execution / language
• Using LINQ + .Net (language integration)
• Static typing
  – No protocol buffers (serialization code)
• Allowing flexible and powerful policies
• Centralized job manager: no replication, no consensus, no checkpointing
• Porting (HPC, Cosmos, Azure, SQL Server)
Conclusions
67
[Figure: Visual Studio + LINQ + Dryad = DryadLINQ.]
68
“What’s the point if I can’t have it?”
• Dryad + DryadLINQ available for download
  – Academic license
  – Commercial evaluation license
• Runs on the Windows HPC platform
• Dryad is in binary form, DryadLINQ in source
• Requires signing a 3-page licensing agreement
• http://connect.microsoft.com/site/sitehome.aspx?SiteID=891
69
Backup Slides
70
What does DryadLINQ do?

// User code:
public struct Data {
    …
    public static int Compare(Data left, Data right);
}

Data g = new Data();
var result = table.Where(s => Data.Compare(s, g) < 0);

// Generated data serialization:
public static void Read(this DryadBinaryReader reader, out Data obj);
public static int Write(this DryadBinaryWriter writer, Data obj);

// Generated data factory:
public class DryadFactoryType__0 : LinqToDryad.DryadFactory<Data>

// Generated vertex code: channel reader and writer, the LINQ code,
// and context serialization of the captured variable g:
DryadVertexEnv denv = new DryadVertexEnv(args);
var dwriter__2 = denv.MakeWriter(FactoryType__0);
var dreader__3 = denv.MakeReader(FactoryType__0);
var source__4 = DryadLinqVertex.Where(dreader__3,
    s => (Data.Compare(s, ((Data)DryadLinqObjectStore.Get(0)))
          < ((System.Int32)(0))), false);
dwriter__2.WriteItemSequence(source__4);
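The context-serialization step can be sketched in Python: the captured variable g is serialized into an object store on the client, and the vertex rebuilds the closure from it, mirroring the DryadLinqObjectStore.Get(0) call above. The store layout and helper names are illustrative.

```python
# Sketch of context serialization for a captured free variable.
import pickle

object_store = {}

def capture(index, value):
    object_store[index] = pickle.dumps(value)     # client side

def vertex_where(records, index):
    g = pickle.loads(object_store[index])         # vertex side
    return [s for s in records if s < g]          # the Where predicate

capture(0, 10)                      # ship g = 10 with the job
kept = vertex_where([3, 10, 15, 7], 0)
```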
71
Ongoing Dryad/DryadLINQ Research
• Performance modeling
• Scheduling and resource allocation
• Profiling and performance debugging
• Incremental computation
• Hardware acceleration
• High-level programming abstractions
• Many domain-specific applications
72
Sample applications written using DryadLINQ

Application                                            Class
Distributed linear algebra                             Numerical
Accelerated Page-Rank computation                      Web graph
Privacy-preserving query language                      Data mining
Expectation maximization for a mixture of Gaussians    Clustering
K-means                                                Clustering
Linear regression                                      Statistics
Probabilistic Index Maps                               Image processing
Principal component analysis                           Data mining
Probabilistic Latent Semantic Indexing                 Data mining
Performance analysis and visualization                 Debugging
Road network shortest-path preprocessing               Graph
Botnet detection                                       Data mining
Epitome computation                                    Image processing
Neural network training                                Statistics
Parallel machine learning framework infer.net          Machine learning
Distributed query caching                              Optimization
Image indexing                                         Image processing
Web indexing structure                                 Web graph
Staging

1. Build
2. Send .exe
3. Start JM (job manager code)
4. Query cluster resources (cluster services)
5. Generate graph
6. Initialize vertices
7. Serialize vertices (vertex code)
8. Monitor vertex execution
74
Bibliography

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks.
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly.
European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007.

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language.
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey.
Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008.

SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets.
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou.
Very Large Databases Conference (VLDB), Auckland, New Zealand, August 23-28, 2008.

Hunting for problems with Artemis.
Gabriela F. Creţu-Ciocârlie, Mihai Budiu, and Moises Goldszmidt.
USENIX Workshop on the Analysis of System Logs (WASL), San Diego, CA, December 7, 2008.

DryadInc: Reusing work in large-scale computations.
Lucian Popa, Mihai Budiu, Yuan Yu, and Michael Isard.
Workshop on Hot Topics in Cloud Computing (HotCloud), San Diego, CA, June 15, 2009.

Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations.
Yuan Yu, Pradeep Kumar Gunda, and Michael Isard.
ACM Symposium on Operating Systems Principles (SOSP), October 2009.

Quincy: Fair Scheduling for Distributed Computing Clusters.
Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg.
ACM Symposium on Operating Systems Principles (SOSP), October 2009.
Incremental Computation
[Figure: a distributed computation maps append-only input data to outputs; new inputs keep arriving.]

Goal: Reuse (part of) prior computations to:
- Speed up the current job
- Increase cluster throughput
- Reduce energy and costs
Propose Two Approaches
1. Reuse identical computations from the past
   (like make or memoization)

2. Do only incremental computation on the new data and merge results with the previous ones
   (like patch)
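Approach 1 can be sketched in Python: memoize sub-computations keyed by a fingerprint of their code and inputs, like make, so that a repeated run over identical inputs is served from the cache. All names here are illustrative.

```python
# Sketch of identical-computation reuse via fingerprint memoization.
import hashlib

cache = {}
executions = []                      # records actual compute calls

def fingerprint(code_id, inputs):
    h = hashlib.sha256(code_id.encode())
    for blob in inputs:
        h.update(blob.encode())
    return h.hexdigest()

def run_cached(code_id, inputs, compute):
    fp = fingerprint(code_id, inputs)
    if fp not in cache:
        executions.append(fp)        # only on a cache miss
        cache[fp] = compute(inputs)
    return cache[fp]

count = lambda parts: sum(len(p.split(",")) for p in parts)
first = run_cached("count-v1", ["r1,r2", "r3"], count)
second = run_cached("count-v1", ["r1,r2", "r3"], count)   # cache hit
```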
Context
• Implemented for Dryad
  – Dryad Job = computational DAG
    • Vertex: arbitrary computation + inputs/outputs
    • Edge: data flows
Simple Example: Record Count
[Figure: input partitions I1 and I2 each feed a Count vertex (C); an Add vertex (A) sums the counts into the output.]
Identical Computation: Record Count

[Figure: first-execution DAG. Counts (C) over I1 and I2 are added (A) into the output.]
Identical Computation

[Figure: second-execution DAG. A new input partition I3 arrives, adding a third Count vertex; I1 and I2 are unchanged.]
IDE – IDEntical Computation

[Figure: in the second-execution DAG, the sub-DAG over I1 and I2 is identical to the first run.]
Identical Computation: IDE Modified DAG

[Figure: the identical computational sub-DAG is replaced with the edge data cached from the previous execution; only I3's Count runs, and Add merges it with the cached counts.]
Identical Computation: IDE Modified DAG

Use DAG fingerprints to determine whether computations are identical, and replace identical computational sub-DAGs with edge data cached from previous executions.
Semantic Knowledge Can Help
[Figure: the first run's DAG over I1 and I2; its final output can be reused directly.]
Semantic Knowledge Can Help: Incremental DAG

[Figure: only I3's Count runs; a Merge (Add) vertex combines it with the previous output.]
Mergeable Computation

[Figure: the incremental DAG is automatically inferred, the merge plumbing is automatically built, and the Merge (Add) vertex is user-specified.]
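Approach 2 can be sketched in Python: keep the previous output, count only the new input partitions, and combine with the user-specified Merge (Add). Function names are illustrative.

```python
# Sketch of incremental computation with a user-specified merge.
def count_records(partition):
    return len(partition)

def incremental_count(previous_total, new_partitions, merge):
    delta = sum(count_records(p) for p in new_partitions)
    return merge(previous_total, delta)    # the Merge (Add) vertex

add = lambda a, b: a + b                   # user-specified merge
first_run = add(count_records(["r1", "r2"]), count_records(["r3"]))
total = incremental_count(first_run, [["r4", "r5"]], add)
```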
Mergeable Computation: Incremental DAG – Remove Old Inputs

[Figure: the old inputs I1 and I2 are dropped; their output, saved to the cache after the first run, replaces them (the old branch becomes empty), and the Merge vertex adds the new partition's count to the cached result.]