Cluster Computing with DryadLINQ
Mihai Budiu Microsoft Research, Silicon Valley
Cloudera, February 12, 2010
2
Goal
3
Design Space
[Figure: design-space chart. Axes span latency vs. throughput and Internet vs. private data center, with shared memory vs. data-parallel as the computation style. Dryad targets high-throughput, data-parallel computing in private data centers, near Grid and HPC; search, transaction processing, and shared-memory systems occupy other regions.]

4

Data-Parallel Computation

Application stacks compared (Language / Execution / Storage):

Language:  SQL                | Sawzall       | Pig, Hive (≈SQL) | DryadLINQ, Scope (LINQ, SQL)
Execution: Parallel Databases | Map-Reduce    | Hadoop           | Dryad (Cosmos, HPC, Azure)
Storage:   SQL Server         | GFS, BigTable | HDFS, S3         | Cosmos, Azure
5
Software Stack

[Figure: layered software stack. Machines run Windows Server; above them sit storage systems (Cosmos FS, NTFS, Azure XStore, SQL Server, Tidy FS) and execution layers (Dryad on Cosmos, Azure XCompute, Windows HPC); above those, programming layers (DryadLINQ, Scope, PSQL, SSIS, SQL server, Distributed Shell, C++ legacy code); at the top, applications (.Net distributed data structures, machine learning, graphs, data mining, optimization, analytics).]
6
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
• Conclusions
Outline
7
Dryad
• Continuously deployed since 2006
• Running on >> 10^4 machines
• Sifting through > 10 PB of data daily
• Runs on clusters of > 3000 machines
• Handles jobs with > 10^5 processes each
• Platform for a rich software ecosystem
• Used by >> 100 developers
• Written at Microsoft Research, Silicon Valley
8
Dryad = Execution Layer
Job (application) : Dryad : Cluster  ≈  Pipeline : Shell : Machine
9
2-D Piping

• Unix pipes: 1-D
  grep | sed | sort | awk | perl

• Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
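The 2-D generalization can be sketched in a few lines of Python. This is an illustrative stand-in (the stage functions and partition data are invented); in Dryad each per-partition copy of a stage would run on a different cluster machine.

```python
# Sketch: a 1-D pipeline applies each stage once; a 2-D pipeline applies
# one copy of each stage per data partition (grep^3 | sort^3 below).

def run_stage(stage_fn, partitions):
    # One copy of the stage per partition; Dryad schedules each copy
    # on a cluster machine.
    return [stage_fn(p) for p in partitions]

def grep_error(lines):
    return [l for l in lines if "error" in l]

parts = [["b error", "ok"], ["a error"], ["fine"]]   # 3 input partitions
found = run_stage(grep_error, parts)                 # grep^3
ordered = run_stage(sorted, found)                   # sort^3
```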
10
Virtualized 2-D Pipelines
11
Virtualized 2-D Pipelines
12
Virtualized 2-D Pipelines
13
Virtualized 2-D Pipelines
14
Virtualized 2-D Pipelines
• 2-D DAG
• multi-machine
• virtualized
15
Dryad Job Structure
[Figure: a Dryad job is a DAG. Input files feed vertices (processes) such as grep, sed, sort, awk, and perl, connected by channels; vertices of the same kind form a stage; results flow to output files.]
16
Channels
[Figure: a vertex M consumes items from a channel produced by vertex X.]

Channels are finite streams of items, implemented as:
• distributed filesystem files (persistent)
• SMB/NTFS files (temporary)
• TCP pipes (inter-machine)
• memory FIFOs (intra-machine)
17
Dryad System Architecture
[Figure: the job manager runs the control plane: it holds the job schedule and talks to the cluster's name server (NS), scheduler, and per-machine process daemons (PD). Vertices (V) run on cluster machines and exchange data over the data plane via files, TCP, and FIFOs.]
Fault Tolerance
19
Policy Managers
[Figure: each stage (R, X) has a stage manager, and each connection between stages has a connection manager (R-X); these policy managers plug into the job manager and steer scheduling.]

Dynamic Graph Rewriting

[Figure: vertices X[0], X[1], and X[3] have completed; X[2] is slow, so a duplicate vertex X'[2] is started, and the first copy to finish wins.]

Duplication Policy = f(running times, data volumes)
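A hedged sketch of what such a duplication policy might look like in Python; the 1.5x threshold and the three-sample minimum are illustrative assumptions, not Dryad's actual constants.

```python
# Sketch of a straggler-duplication policy in the spirit of
# "Duplication Policy = f(running times, data volumes)".

def should_duplicate(running_secs, completed_secs, data_ratio=1.0):
    """Duplicate a still-running vertex once it has run much longer
    than the median completed vertex, normalized by input size."""
    if len(completed_secs) < 3:          # too few finished copies to judge
        return False
    ordered = sorted(completed_secs)
    median = ordered[len(ordered) // 2]
    return running_secs > 1.5 * median * data_ratio

slow = should_duplicate(100, [40, 50, 60])   # 100s vs. median 50s
ok = should_duplicate(60, [40, 50, 60])
```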
Cluster network topology

[Figure: machines are grouped into racks; each rack has a top-of-rack switch, and racks connect through a top-level switch.]

22

Dynamic Aggregation

[Figure: source vertices (S) feed aggregation vertices (A) and a final target (T). The static plan is rewritten dynamically so that data from machines in the same rack (rack #1, #2, #3) is aggregated within the rack before crossing the top-level switch.]
23
Policy vs. Mechanism
Application-level (policy):
• Most complex; in C++ code
• Invoked with upcalls
• Need good default implementations
• DryadLINQ provides a comprehensive set

Built-in (mechanism):
• Scheduling
• Graph rewriting
• Fault tolerance
• Statistics and reporting
24
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
• Conclusions
Outline
25
LINQ
Dryad
=> DryadLINQ
26
LINQ = .Net + Queries
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
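A Python sketch of what the query above computes: filter the collection with IsLegal, then project each element to a (hash, value) pair. The predicate and hash function are stand-ins for the C# methods declared on the slide.

```python
# Python sketch of the LINQ query: where IsLegal(c.key),
# select new { Hash(c.key), c.value }.

def is_legal(key):
    return key >= 0                 # assumed predicate

def hash_key(key):
    return "h" + str(key)           # assumed hash function

collection = [(-1, "a"), (2, "b"), (3, "c")]    # (key, value) records

results = [(hash_key(key), value)
           for (key, value) in collection
           if is_legal(key)]
```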
27
Collections and Iterators

class Collection<T> : IEnumerable<T>;

public interface IEnumerable<T> {
    IEnumerator<T> GetEnumerator();
}

public interface IEnumerator<T> {
    T Current { get; }
    bool MoveNext();
    void Reset();
}
28
DryadLINQ Data Model
[Figure: a DryadLINQ collection is a set of partitions, each holding .Net objects.]
29
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
DryadLINQ = LINQ + Dryad
[Figure: the C# query is compiled into a Dryad query plan; generated C# vertex code runs over the partitions of the input collection and produces the results collection.]
30
Demo
31
Example: Histogram

public static IQueryable<Pair> Histogram(
    IQueryable<LineRecord> input, int k)
{
    var words = input.SelectMany(x => x.line.Split(' '));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    return top;
}
“A line of words of wisdom”
[“A”, “line”, “of”, “words”, “of”, “wisdom”]
[[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
[ {“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}]
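The same pipeline can be sketched in Python; the result matches the trace above ("A line of words of wisdom" reduced to its top-3 word counts).

```python
# Python sketch of the Histogram query: SelectMany, GroupBy+Count,
# OrderByDescending, Take(k).
from collections import Counter

def histogram(lines, k):
    words = [w for line in lines for w in line.split(' ')]  # SelectMany
    counts = Counter(words)                # GroupBy + Select(Count)
    ordered = counts.most_common()         # OrderByDescending
    return ordered[:k]                     # Take(k)

top = histogram(["A line of words of wisdom"], 3)
```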
32
Histogram Plan
[Figure: the histogram query plan. Per input partition: SelectMany, Sort, GroupBy+Select, HashDistribute. After the exchange: MergeSort, GroupBy, Select, Sort, Take. A final vertex does MergeSort, Take.]
33
Map-Reduce in DryadLINQ

public static IQueryable<S> MapReduce<T,M,K,S>(
    this IQueryable<T> input,
    Func<T, IEnumerable<M>> mapper,
    Func<M, K> keySelector,
    Func<IGrouping<K,M>, S> reducer)
{
    var map = input.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.Select(reducer);
    return result;
}
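A Python sketch of the operator's semantics: SelectMany(mapper), then GroupBy(keySelector), then Select(reducer). Sorting by key stands in for the shuffle, since itertools.groupby requires sorted input.

```python
# Python sketch of MapReduce as the composition of three operators.
from itertools import groupby

def map_reduce(inputs, mapper, key_selector, reducer):
    mapped = [m for x in inputs for m in mapper(x)]   # SelectMany
    mapped.sort(key=key_selector)                     # shuffle stand-in
    return [reducer(key, list(group))                 # Select over groups
            for key, group in groupby(mapped, key=key_selector)]

# Word count, the canonical instantiation:
counts = map_reduce(
    ["a b a", "b"],
    mapper=str.split,
    key_selector=lambda w: w,
    reducer=lambda key, group: (key, len(group)),
)
```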
34
Map-Reduce Plan
[Figure: the Map-Reduce query plan. The static plan runs map (M), sort (Q), groupby (G1), reduce (R), and distribute (D) over each input partition; mergesort (MS), groupby (G2), and reduce (R) vertices perform partial aggregation; a final mergesort, groupby, and reduce feed the consumer (X). The partial-aggregation tree is introduced dynamically at run time, following the rack topology as in the dynamic-aggregation slide.]
35
Distributed Sorting Plan
[Figure: the distributed sort plan. Each input partition is sampled (DS) and a histogram (H) of the samples determines range boundaries; data is range-distributed (D), then each output partition is merged (M) and sorted (S). Both the sampling stage and the number of range partitions are chosen dynamically.]
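The sampling idea behind this plan can be sketched in Python: sample the partitions (DS), derive range boundaries from the samples (H), range-distribute the records (D), then sort each range locally (S). Sampling every other element is an assumption made for determinism; a real plan samples randomly.

```python
# Sketch of sample-based range partitioning for a distributed sort.
import bisect

def range_partition_sort(partitions, n_out):
    samples = sorted(x for p in partitions for x in p[::2])  # DS: sample
    step = max(1, len(samples) // n_out)
    bounds = samples[step::step][:n_out - 1]                 # H: boundaries
    out = [[] for _ in range(n_out)]
    for p in partitions:
        for x in p:                                          # D: distribute
            out[bisect.bisect_right(bounds, x)].append(x)
    return [sorted(part) for part in out]                    # S: local sort

ranges = range_partition_sort([[5, 1, 9], [3, 7, 2]], 2)
```

Concatenating the output partitions yields the globally sorted sequence, since every element of partition i is smaller than every element of partition i+1.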
Expectation Maximization
36
• 160 lines
• 3 iterations shown
37
Probabilistic Index Maps

[Figure: sample images and their extracted features.]
38
Language Summary
Where, Select, GroupBy, OrderBy, Aggregate, Join, Apply, Materialize
39
LINQ System Architecture
[Figure: on the local machine, a .Net program (C#, VB, F#, etc.) issues a Query through a LINQ Provider to an execution engine, which returns Objects.]

LINQ providers include:
• LINQ-to-objects
• PLINQ
• LINQ-to-SQL
• LINQ-to-WS
• DryadLINQ
• Flickr
• Oracle
• LINQ-to-XML
• Your own
The DryadLINQ Provider
40
[Figure: on the client machine, invoking a query hands the .Net query expression to DryadLINQ, which compiles a distributed query plan and vertex code, serializes the context, and submits a Dryad job manager (JM) to the data center. Dryad executes over the input tables and writes output tables; ToCollection/foreach pull the results back to the client as .Net objects.]
41
Combining Query Providers
[Figure: on the local machine, a .Net program (C#, VB, F#, etc.) issues a Query; DryadLINQ composes with other LINQ providers (PLINQ, LINQ-to-SQL on SQL Server, LINQ-to-objects) as nested execution engines.]
42
Using PLINQ

[Figure: DryadLINQ hands each vertex's local query to PLINQ, which parallelizes it across the cores of a machine.]
43
Using LINQ to SQL Server

[Figure: DryadLINQ partitions the query and pushes sub-queries through LINQ-to-SQL, which runs them inside SQL Server instances.]
44
Using LINQ-to-objects

[Figure: the same query can run through LINQ-to-objects on the local machine for debugging, or through DryadLINQ on the cluster for production.]
45
• Introduction
• Dryad
• DryadLINQ
• Building on/for DryadLINQ
  – System monitoring with Artemis
  – Privacy-preserving query language (PINQ)
  – Machine learning
• Conclusions
Outline
46
Artemis: measuring clusters
[Figure: Artemis collects logs and cluster/job state from Cosmos, HPC, and Azure clusters through a Cluster/Job State API; a cluster browser/manager and a job browser sit on top, and DryadLINQ itself powers the statistics, visualization, database, and plug-in layers.]
47
DryadLINQ job browser
48
Automated diagnostics
49
Job statistics: schedule and critical path
50
Running time distribution
51
Performance counters
52
CPU Utilization
53
Load imbalance: rack assignment
54
PINQ
[Figure: analysts send queries (LINQ) to a privacy-sensitive database and receive answers.]
55
PINQ = Privacy-Preserving LINQ

• "Type-safety" for privacy
• Provides an interface to data that looks very much like LINQ
• All access through the interface gives differential privacy
• Analysts write arbitrary C# code against data sets, as in LINQ
• No privacy expertise needed to produce analyses
• Privacy currency is used to limit per-record information released
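The "privacy currency" idea can be sketched in Python: each aggregate spends epsilon from a per-dataset budget and is released with Laplace noise. The API and constants below are illustrative assumptions, not PINQ's actual interface.

```python
# Sketch of privacy-budget accounting with noisy counts.
import random

class PrivateDataset:
    def __init__(self, records, budget):
        self.records = records
        self.budget = budget                 # total epsilon available

    def noisy_count(self, predicate, epsilon):
        if epsilon > self.budget:
            raise ValueError("privacy budget exhausted")
        self.budget -= epsilon               # spend the currency
        true_count = sum(1 for r in self.records if predicate(r))
        # A count has sensitivity 1, so Laplace noise of scale 1/epsilon
        # gives epsilon-differential privacy; the difference of two
        # exponentials is Laplace-distributed.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

ds = PrivateDataset([1, 2, 3, 4], budget=1.0)
answer = ds.noisy_count(lambda r: r > 1, epsilon=0.5)
```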
56
Example: search logs mining
Distribution of queries about “Cricket”
// Open sensitive data set with state-of-the-art security
PINQueryable<VisitRecord> visits = OpenSecretData(password);

// Group visits by patient and identify frequent patients.
var patients = visits.GroupBy(x => x.Patient.SSN)
                     .Where(x => x.Count() > 5);

// Map each patient to their post code using their SSN.
var locations = patients.Join(SSNtoPost, x => x.SSN, y => y.SSN,
                              (x, y) => y.PostCode);

// Count post codes containing at least 10 frequent patients.
var activity = locations.GroupBy(x => x)
                        .Where(x => x.Count() > 10);

Visualize(activity); // Who knows what this does???
57
PINQ Download
• Implemented on top of DryadLINQ
• Allows mining very sensitive datasets privately
• Code is available: http://research.microsoft.com/en-us/projects/PINQ/
• Frank McSherry, Privacy Integrated Queries, SIGMOD 2009
58
Natal Training
59
Natal Problem
• Recognize players from a depth map
• At frame rate
• Using 15% of one Xbox CPU core
60
Learn from Data
[Figure: motion capture provides ground truth; rasterized training examples feed machine learning, which produces the classifier.]
61
Running on Xbox
62
Learning from data
[Figure: the same training pipeline at scale; training examples feed machine learning built on DryadLINQ and Dryad to produce the classifier.]
63
Large-Scale Machine Learning

• > 10^22 objects
• Sparse, multi-dimensional data structures
• Complex datatypes (images, video, matrices, etc.)
• Complex application logic and dataflow
  – > 35,000 lines of .Net
  – 140 CPU-days
  – > 10^5 processes
  – 30 TB of data analyzed
  – 140 average parallelism (235 machines)
  – 300% CPU utilization (4 cores/machine)
64
Highly efficient parallelization
65
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
• Conclusions
Outline
66
Lessons Learned

• Complete separation of storage / execution / language
• Using LINQ + .Net (language integration)
• Static typing
  – No protocol buffers (serialization code)
• Allowing flexible and powerful policies
• Centralized job manager: no replication, no consensus, no checkpointing
• Porting (HPC, Cosmos, Azure, SQL Server)
Conclusions
67
[Figure: Visual Studio + LINQ + Dryad = DryadLINQ.]
68
“What’s the point if I can’t have it?”
• Dryad + DryadLINQ available for download
  – Academic license
  – Commercial evaluation license
• Runs on the Windows HPC platform
• Dryad is in binary form, DryadLINQ in source
• Requires signing a 3-page licensing agreement
• http://connect.microsoft.com/site/sitehome.aspx?SiteID=891
69
Backup Slides
70
What does DryadLINQ do?

// User code:
public struct Data {
    …
    public static int Compare(Data left, Data right);
}

Data g = new Data();
var result = table.Where(s => Data.Compare(s, g) < 0);

// Generated data serialization:
public static void Read(this DryadBinaryReader reader, out Data obj);
public static int Write(this DryadBinaryWriter writer, Data obj);

// Generated data factory:
public class DryadFactoryType__0 : LinqToDryad.DryadFactory<Data>

// Generated vertex code: channel reader and writer, the LINQ code,
// and context serialization of the captured variable g:
DryadVertexEnv denv = new DryadVertexEnv(args);
var dwriter__2 = denv.MakeWriter(FactoryType__0);
var dreader__3 = denv.MakeReader(FactoryType__0);
var source__4 = DryadLinqVertex.Where(dreader__3,
    s => (Data.Compare(s, ((Data)DryadLinqObjectStore.Get(0)))
          < ((System.Int32)(0))), false);
dwriter__2.WriteItemSequence(source__4);
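The context-serialization step can be sketched in Python: the captured variable g is serialized into an object store on the client, and the vertex rebuilds the closure from it, mirroring the DryadLinqObjectStore.Get(0) call above. The store layout and helper names are illustrative.

```python
# Sketch of context serialization for a captured free variable.
import pickle

object_store = {}

def capture(index, value):
    object_store[index] = pickle.dumps(value)     # client side

def vertex_where(records, index):
    g = pickle.loads(object_store[index])         # vertex side
    return [s for s in records if s < g]          # the Where predicate

capture(0, 10)                      # ship g = 10 with the job
kept = vertex_where([3, 10, 15, 7], 0)
```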
71
Ongoing Dryad/DryadLINQ Research
• Performance modeling
• Scheduling and resource allocation
• Profiling and performance debugging
• Incremental computation
• Hardware acceleration
• High-level programming abstractions
• Many domain-specific applications
72
Sample applications written using DryadLINQ

Application                                            Class
Distributed linear algebra                             Numerical
Accelerated Page-Rank computation                      Web graph
Privacy-preserving query language                      Data mining
Expectation maximization for a mixture of Gaussians    Clustering
K-means                                                Clustering
Linear regression                                      Statistics
Probabilistic Index Maps                               Image processing
Principal component analysis                           Data mining
Probabilistic Latent Semantic Indexing                 Data mining
Performance analysis and visualization                 Debugging
Road network shortest-path preprocessing               Graph
Botnet detection                                       Data mining
Epitome computation                                    Image processing
Neural network training                                Statistics
Parallel machine learning framework infer.net          Machine learning
Distributed query caching                              Optimization
Image indexing                                         Image processing
Web indexing structure                                 Web graph
Staging

1. Build
2. Send .exe
3. Start JM (job manager code)
4. Query cluster resources (cluster services)
5. Generate graph
6. Initialize vertices
7. Serialize vertices (vertex code)
8. Monitor vertex execution
74
Bibliography

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks.
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly.
European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007.

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language.
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey.
Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008.

SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets.
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou.
Very Large Databases Conference (VLDB), Auckland, New Zealand, August 23-28, 2008.

Hunting for problems with Artemis.
Gabriela F. Creţu-Ciocârlie, Mihai Budiu, and Moises Goldszmidt.
USENIX Workshop on the Analysis of System Logs (WASL), San Diego, CA, December 7, 2008.

DryadInc: Reusing work in large-scale computations.
Lucian Popa, Mihai Budiu, Yuan Yu, and Michael Isard.
Workshop on Hot Topics in Cloud Computing (HotCloud), San Diego, CA, June 15, 2009.

Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations.
Yuan Yu, Pradeep Kumar Gunda, and Michael Isard.
ACM Symposium on Operating Systems Principles (SOSP), October 2009.

Quincy: Fair Scheduling for Distributed Computing Clusters.
Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg.
ACM Symposium on Operating Systems Principles (SOSP), October 2009.
Incremental Computation
[Figure: a distributed computation maps append-only input data to outputs; new inputs keep arriving.]

Goal: Reuse (part of) prior computations to:
- Speed up the current job
- Increase cluster throughput
- Reduce energy and costs
Propose Two Approaches
1. Reuse identical computations from the past
   (like make or memoization)

2. Do only incremental computation on the new data and merge results with the previous ones
   (like patch)
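Approach 1 can be sketched in Python: memoize sub-computations keyed by a fingerprint of their code and inputs, like make, so that a repeated run over identical inputs is served from the cache. All names here are illustrative.

```python
# Sketch of identical-computation reuse via fingerprint memoization.
import hashlib

cache = {}
executions = []                      # records actual compute calls

def fingerprint(code_id, inputs):
    h = hashlib.sha256(code_id.encode())
    for blob in inputs:
        h.update(blob.encode())
    return h.hexdigest()

def run_cached(code_id, inputs, compute):
    fp = fingerprint(code_id, inputs)
    if fp not in cache:
        executions.append(fp)        # only on a cache miss
        cache[fp] = compute(inputs)
    return cache[fp]

count = lambda parts: sum(len(p.split(",")) for p in parts)
first = run_cached("count-v1", ["r1,r2", "r3"], count)
second = run_cached("count-v1", ["r1,r2", "r3"], count)   # cache hit
```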
Context
• Implemented for Dryad
  – Dryad Job = computational DAG
    • Vertex: arbitrary computation + inputs/outputs
    • Edge: data flows
Simple Example: Record Count
[Figure: input partitions I1 and I2 each feed a Count vertex (C); an Add vertex (A) sums the counts into the output.]
Identical Computation: Record Count

[Figure: first-execution DAG. Counts (C) over I1 and I2 are added (A) into the output.]
Identical Computation

[Figure: second-execution DAG. A new input partition I3 arrives, adding a third Count vertex; I1 and I2 are unchanged.]
IDE – IDEntical Computation

[Figure: in the second-execution DAG, the sub-DAG over I1 and I2 is identical to the first run.]
Identical Computation: IDE Modified DAG

[Figure: the identical computational sub-DAG is replaced with the edge data cached from the previous execution; only I3's Count runs, and Add merges it with the cached counts.]
Identical Computation: IDE Modified DAG

Use DAG fingerprints to determine whether computations are identical, and replace identical computational sub-DAGs with edge data cached from previous executions.
Semantic Knowledge Can Help
[Figure: the first run's DAG over I1 and I2; its final output can be reused directly.]
Semantic Knowledge Can Help: Incremental DAG

[Figure: only I3's Count runs; a Merge (Add) vertex combines it with the previous output.]
Mergeable Computation

[Figure: the incremental DAG is automatically inferred, the merge plumbing is automatically built, and the Merge (Add) vertex is user-specified.]
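Approach 2 can be sketched in Python: keep the previous output, count only the new input partitions, and combine with the user-specified Merge (Add). Function names are illustrative.

```python
# Sketch of incremental computation with a user-specified merge.
def count_records(partition):
    return len(partition)

def incremental_count(previous_total, new_partitions, merge):
    delta = sum(count_records(p) for p in new_partitions)
    return merge(previous_total, delta)    # the Merge (Add) vertex

add = lambda a, b: a + b                   # user-specified merge
first_run = add(count_records(["r1", "r2"]), count_records(["r3"]))
total = incremental_count(first_run, [["r4", "r5"]], add)
```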
Mergeable Computation: Incremental DAG – Remove Old Inputs

[Figure: the old inputs I1 and I2 are dropped; their output, saved to the cache after the first run, replaces them (the old branch becomes empty), and the Merge vertex adds the new partition's count to the cached result.]