Outline
What is the purpose of Data Intensive Super Computing?
MapReduce
Pregel
Dryad
Spark/Shark
Distributed Graph Computing
Slide 3
Why DISC? DISC stands for Data Intensive Super Computing. Many
applications depend on it: scientific data, web search engines,
social networks, economics, GIS. New data are continuously generated,
and people want to understand them. Big Data analysis is now
considered a very important method for scientific research.
Slide 4
What are the required features for a platform that handles DISC?
Application specific: it is very difficult, or even impossible, to
build one system that fits them all. One example is the
POSIX-compatible file system. Each system should be re-configured or
even re-designed for a specific application; think about the
motivation for building the Google File System for the Google search
engine.
Programmer-friendly interfaces: the application programmer should not
have to worry about the infrastructure, such as machines and
networks.
Fault tolerant: the platform should handle faulty components
automatically, without any special treatment from the application.
Scalability: the platform should run on top of at least thousands of
machines and harness the power of all the components. Load balancing
should be achieved by the platform instead of the application itself.
Try to keep these four features in mind during the introduction of
the concrete platforms below.
Slide 5
Google MapReduce: Programming Model, Implementation, Refinements,
Evaluation, Conclusion
Slide 6
Motivation: large-scale data processing. Process lots of data to
produce other derived data. Input: crawled documents, web request
logs, etc. Output: inverted indices, web page graph structure, top
queries in a day, etc. Want to use hundreds or thousands of CPUs but
focus only on the functionality. MapReduce hides the messy details in
a library: parallelization, data distribution, fault tolerance, load
balancing.
Slide 7
Motivation: Large Scale Data Processing Want to process lots of
data ( > 1 TB) Want to parallelize across hundreds/thousands of
CPUs Want to make this easy "Google Earth uses 70.5 TB: 70 TB for
the raw imagery and 500 GB for the index data." From:
http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
Slide 8
MapReduce Automatic parallelization & distribution
Fault-tolerant Provides status and monitoring tools Clean
abstraction for programmers
Slide 9
Programming Model. Borrows from functional programming. Users
implement an interface of two functions:

    map(in_key, in_value) -> list of (out_key, intermediate_value)
    reduce(out_key, list of intermediate_value) -> list of out_value
Slide 10
map. Records from the data source (lines out of files, rows of a
database, etc.) are fed into the map function as key/value pairs:
e.g., (filename, line). map() produces one or more intermediate
values along with an output key from the input.
Slide 11
reduce. After the map phase is over, all the intermediate values for
a given output key are combined together into a list. reduce()
combines those intermediate values into one or more final values for
that same output key (in practice, usually only one final value per
key).
Slide 12
Architecture
Slide 13
Parallelism. map() functions run in parallel, creating different
intermediate values from different input data sets. reduce()
functions also run in parallel, each working on a different output
key. All values are processed independently. Bottleneck: the reduce
phase can't start until the map phase is completely finished.
Slide 14
Example: Count word occurrences

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // intermediate_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
Slide 15
Example vs. Actual Source Code Example is written in
pseudo-code Actual implementation is in C++, using a MapReduce
library Bindings for Python and Java exist via interfaces True code
is somewhat more involved (defines how the input key/values are
divided up and accessed, etc.)
Slide 16
Example. Page 1: "the weather is good". Page 2: "today is good".
Page 3: "good weather is good".
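To make the example concrete, here is a minimal, self-contained C++
sketch (illustrative only, not Google's MapReduce library; all names
are made up) that simulates the map, shuffle, and reduce phases over
the three pages. It prints good: 4, is: 3, the: 1, today: 1,
weather: 2.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // The three example pages.
        std::vector<std::string> pages = {
            "the weather is good",    // Page 1
            "today is good",          // Page 2
            "good weather is good"};  // Page 3

        // Map phase: emit an intermediate (word, 1) pair for every
        // word, as EmitIntermediate(w, "1") does in the pseudo-code.
        std::vector<std::pair<std::string, int>> intermediate;
        for (const std::string& page : pages) {
            std::istringstream words(page);
            std::string w;
            while (words >> w) intermediate.push_back({w, 1});
        }

        // Shuffle phase: group all intermediate values by key.
        std::map<std::string, std::vector<int>> groups;
        for (const auto& kv : intermediate)
            groups[kv.first].push_back(kv.second);

        // Reduce phase: sum the counts for each word.
        for (const auto& g : groups) {
            int result = 0;
            for (int v : g.second) result += v;
            std::cout << g.first << ": " << result << "\n";
        }
        return 0;
    }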
Some Other Real Examples: term frequencies through the whole Web
repository; count of URL access frequency; reverse web-link graph.
Slide 21
Implementation Overview Typical cluster: 100s/1000s of 2-CPU
x86 machines, 2-4 GB of memory Limited bisection bandwidth Storage
is on local IDE disks GFS: distributed file system manages data
(SOSP'03) Job scheduling system: jobs made up of tasks, scheduler
assigns tasks to machines Implementation is a C++ library linked
into user programs
Slide 22
Architecture
Slide 23
Execution
Slide 24
Parallel Execution
Slide 25
Task Granularity And Pipelining Fine granularity tasks: many
more map tasks than machines Minimizes time for fault recovery Can
pipeline shuffling with map execution Better dynamic load balancing
Often use 200,000 map/5000 reduce tasks w/ 2000 machines
Slide 26
Locality Master program divvies up tasks based on location of
data: (Asks GFS for locations of replicas of input file blocks)
tries to have map() tasks on same machine as physical file data, or
at least same rack map() task inputs are divided into 64 MB blocks:
same size as Google File System chunks Without this, rack switches
limit read rate Effect: Thousands of machines read input at local
disk speed
Slide 27
Fault Tolerance. The master detects worker failures and re-executes
completed and in-progress map() tasks and in-progress reduce() tasks.
The master also notices that particular input key/values cause
crashes in map(), and skips those values on re-execution. Effect: can
work around bugs in third-party libraries!
Slide 28
Fault Tolerance. On worker failure: detect failure via periodic
heartbeats; re-execute completed and in-progress map tasks;
re-execute in-progress reduce tasks; task completion is committed
through the master. Master failure: could be handled, but isn't yet
(master failure is unlikely). Robust: once lost 1600 of 1800
machines, but finished fine.
Slide 29
Optimizations. No reduce can start until the map is complete, so a
single slow disk controller can rate-limit the whole process. The
master therefore redundantly executes slow-moving map tasks and uses
the results of whichever copy finishes first. Why is it safe to
redundantly execute map tasks? Wouldn't this mess up the total
computation? Slow workers significantly lengthen completion time:
other jobs consuming resources on the machine, bad disks with soft
errors that transfer data very slowly, weird things like processor
caches being disabled (!!).
Slide 30
Optimizations. Combiner functions can run on the same machine as a
mapper, causing a mini-reduce phase to occur before the real reduce
phase, to save bandwidth. Under what conditions is it sound to use a
combiner?
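One sufficient condition: the reduce operation is associative and
commutative, as the summation in word count is. A hedged illustration
in plain C++ (not the MapReduce library's actual combiner API; names
are made up):

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // One mapper's local output: many ("word", 1) pairs.
    using KV = std::pair<std::string, int>;

    // Combiner: a mini-reduce run on the mapper's machine. Because
    // addition is associative and commutative, summing locally first
    // cannot change the final reduce result; it only shrinks the
    // data that crosses the network.
    std::vector<KV> Combine(const std::vector<KV>& map_output) {
        std::map<std::string, int> partial;
        for (const KV& kv : map_output) partial[kv.first] += kv.second;
        return std::vector<KV>(partial.begin(), partial.end());
    }

For Page 3 above, the mapper emits ("good", 1) twice; the combiner
ships a single ("good", 2) instead.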
Slide 31
Refinement Sorting guarantees within each reduce partition
Compression of intermediate data Combiner: useful for saving
network bandwidth Local execution for debugging/testing
User-defined counters
Slide 32
Performance. Tests run on a cluster of 1800 machines: 4 GB of memory;
dual-processor 2 GHz Xeons with Hyperthreading; dual 160 GB IDE
disks; gigabit Ethernet per machine; bisection bandwidth
approximately 100 Gbps. Two benchmarks:
MR_Grep: scan 10^10 100-byte records to extract records matching a
rare pattern (92K matching records).
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort
benchmark).
Slide 33
MR_Grep Locality optimization helps: 1800 machines read 1 TB of
data at peak of ~31 GB/s Without this, rack switches would limit to
10 GB/s Startup overhead is significant for short jobs
Slide 34
MR_Sort. Backup tasks reduce job completion time significantly. The
system deals well with failures. (The figure compares three runs:
normal, no backup tasks, and 200 processes killed.)
Slide 35
More and more MapReduce MapReduce Programs In Google Source
Tree Example uses: distributed grep distributed sort web link-graph
reversal term-vector per host web access log stats inverted index
construction document clustering machine learning statistical
machine translation
Slide 36
Real MapReduce: Rewrite of the Production Indexing System. Rewrote
Google's production indexing system using MapReduce, as a set of 10,
14, 17, 21, 24 MapReduce operations. The new code is simpler and
easier to understand. MapReduce takes care of failures and slow
machines. Easy to make indexing faster by adding more machines.
Slide 37
MapReduce Conclusions MapReduce has proven to be a useful
abstraction Greatly simplifies large-scale computations at Google
Functional programming paradigm can be applied to large-scale
applications Fun to use: focus on problem, let library deal w/
messy details
PageRank: Random Walks Over The Web If a user starts at a
random web page and surfs by clicking links and randomly entering
new URLs, what is the probability that s/he will arrive at a given
page? The PageRank of a page captures this notion More popular or
worthwhile pages get a higher rank
Slide 41
PageRank: Visually
Slide 42
PageRank: Formula. Given page A, and pages T_1 through T_n linking to
A, PageRank is defined as:

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))

C(P) is the cardinality (out-degree) of page P; d is the damping
(random URL) factor.
Slide 43
PageRank: Intuition. The calculation is iterative: PR_{i+1} is based
on PR_i. Each page distributes its PR_i to all pages it links to;
linkees add up their awarded rank fragments to find their PR_{i+1}.
d is a tunable parameter (usually d = 0.85) encapsulating the random
jump factor.

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
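As a quick sanity check of the formula: take two pages A and B that
link only to each other, with d = 0.85. Then C(A) = C(B) = 1, so
PR(A) = 0.15 + 0.85 * PR(B), and symmetrically for B. PR(A) = PR(B)
= 1 satisfies both equations, so the uniform seed is already the
fixed point; an asymmetric graph instead converges toward its fixed
point over several iterations.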
Slide 44
PageRank: First Implementation. Create two tables, 'current' and
'next', holding the PageRank for each page. Seed 'current' with
initial PR values. Iterate over all pages in the graph, distributing
PR from 'current' into 'next' of the linkees. Then current := next;
next := fresh_table(). Go back to the iteration step, or end if
converged.
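A minimal single-machine sketch of this two-table scheme in C++ (the
three-page graph and all names here are made-up illustrations, not
the actual implementation):

    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        const double d = 0.85;  // damping factor
        // Made-up adjacency list: page i links to pages in links[i].
        std::vector<std::vector<int>> links = {{1, 2}, {2}, {0}};
        const int n = static_cast<int>(links.size());

        std::vector<double> current(n, 1.0);  // seed with initial PR
        std::vector<double> next;

        for (int iter = 0; iter < 100; ++iter) {
            next.assign(n, 1.0 - d);  // the (1 - d) term per page
            // Distribute PR from 'current' into 'next' of linkees.
            for (int p = 0; p < n; ++p)
                for (int q : links[p])
                    next[q] += d * current[p] / links[p].size();
            // current := next, and stop if converged.
            double delta = 0.0;
            for (int p = 0; p < n; ++p)
                delta += std::fabs(next[p] - current[p]);
            current = next;
            if (delta < 1e-10) break;
        }
        for (int p = 0; p < n; ++p)
            std::printf("PR(%d) = %.4f\n", p, current[p]);
        return 0;
    }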
Slide 45
Distribution of the Algorithm Key insights allowing
parallelization: The 'next' table depends on 'current', but not on
any other rows of 'next' Individual rows of the adjacency matrix
can be processed in parallel Sparse matrix rows are relatively
small
Slide 46
Distribution of the Algorithm Consequences of insights: We can
map each row of 'current' to a list of PageRank fragments to assign
to linkees These fragments can be reduced into a single PageRank
value for a page by summing Graph representation can be even more
compact; since each element is simply 0 or 1, only transmit column
numbers where it's 1
Slide 47
Slide 48
Phase 1: Parse HTML. The map task takes (URL, page content) pairs and
maps them to (URL, (PR_init, list-of-urls)). PR_init is the seed
PageRank for URL; list-of-urls contains all pages pointed to by URL.
The reduce task is just the identity function.
Slide 49
Phase 2: PageRank Distribution. The map task takes (URL, (cur_rank,
url_list)). For each u in url_list, it emits (u,
cur_rank/|url_list|). It also emits (URL, url_list) to carry the
points-to list along through iterations.

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
Slide 50
Phase 2: PageRank Distribution. The reduce task gets (URL, url_list)
and many (URL, val) values. It sums the vals, fixes up with d, and
emits (URL, (new_rank, url_list)).

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
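A hedged single-process sketch of one Phase 2 round (plain C++
standing in for the MapReduce library; the Page struct and all names
are illustrative assumptions):

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct Page { double cur_rank; std::vector<std::string> url_list; };

    // One Phase 2 round, with the map and reduce steps written out
    // explicitly for a graph held in memory.
    std::map<std::string, Page> PageRankRound(
            const std::map<std::string, Page>& input, double d) {
        std::map<std::string, double> fragments;  // summed fragments
        std::map<std::string, std::vector<std::string>> lists;

        // Map: each page sends an equal rank fragment
        // cur_rank/|url_list| to every linkee, and re-emits its
        // url_list so it survives into the next iteration.
        for (const auto& kv : input) {
            const Page& page = kv.second;
            for (const std::string& u : page.url_list)
                fragments[u] += page.cur_rank / page.url_list.size();
            lists[kv.first] = page.url_list;
        }

        // Reduce: sum the fragments for each URL and fix up with d.
        std::map<std::string, Page> output;
        for (const auto& kv : lists)
            output[kv.first] =
                Page{(1.0 - d) + d * fragments[kv.first], kv.second};
        return output;
    }

    int main() {
        // Tiny made-up graph: a -> b, b -> a.
        std::map<std::string, Page> graph = {
            {"a", {1.0, {"b"}}}, {"b", {1.0, {"a"}}}};
        for (int i = 0; i < 10; ++i) graph = PageRankRound(graph, 0.85);
        for (const auto& kv : graph)
            std::printf("%s: %.4f\n", kv.first.c_str(),
                        kv.second.cur_rank);
        return 0;
    }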
Slide 51
Finishing up... A non-parallelizable component determines
whether convergence has been achieved (Fixed number of iterations?
Comparison of key values?) If so, write out the PageRank lists -
done! Otherwise, feed output of Phase 2 into another Phase 2
iteration
Slide 52
PageRank Conclusions. MapReduce isn't the greatest at iterated
computation, but it still helps run the heavy lifting. The key
element in parallelization is the independent PageRank computation in
a given step. Parallelization requires thinking about the minimum
data to transmit per partition (e.g., compact representations of
graph rows). Even the implementation shown today doesn't actually
scale to the whole Internet, but it works for intermediate-sized
graphs. So, do you think MapReduce is suitable for PageRank?
(Homework: give concrete reasons why and why not.)
Slide 53
Dryad: Design, Implementation, Policies as Plug-ins, Building on
Dryad
Slide 54
Design Space (figure): the axes are throughput vs. latency and
Internet vs. private data center; Dryad targets data-parallel,
high-throughput computing in a private data center, as opposed to
shared memory.
Slide 72
Aggregation Manager (figure): the static plan is refined dynamically,
with aggregation vertices grouped per rack.
Slide 73
Data Distribution (Group By) (figure): m source vertices feed n
destination vertices through an m x n distribution stage.
Slide 74
Range-Distribution Manager (figure): the static plan guesses the
partition boundaries ([0-?), [?-100)); at run time a sampled
histogram (e.g., [0-30), [30-100)) sets the ranges dynamically.
Slide 75
Goal: Declarative Programming (figure contrasting a static plan with
its dynamic expansion).
Slide 76
Dryad: Design, Implementation, Policies as Plug-ins, Building on
Dryad
Slide 77
Software Stack (figure), from bottom to top: Windows Server; cluster
services; distributed filesystem (plus CIFS/NTFS); Dryad; and above
it DryadLINQ, PSQL, SSIS, a distributed shell, and SQL Server, with
applications (queries, vectors, machine learning, and legacy code
such as sed, awk, grep) written in C++, C#, Perl, and SQL, plus job
queueing and monitoring.
Slide 78
SkyServer Query 18:

    select distinct P.ObjID into results
    from photoPrimary U, neighbors N, photoPrimary L
    where U.ObjID = N.ObjID
      and L.ObjID = N.NeighborObjID
      and P.ObjID < L.ObjID
      and abs((U.u - U.g) - (L.u - L.g)) < ...
Conclusions. Dryad is a distributed execution environment.
Application-independent (semantics oblivious), it supports a rich
software ecosystem: relational algebra, MapReduce, LINQ, etc.
DryadLINQ is a Dryad provider for LINQ. This is only the beginning!
Slide 97
Some other systems you should know about for Big Data processing:
Hadoop HDFS and MapReduce (open-source versions of GFS and
MapReduce); Hive/Pig/Sawzall (query-language processing); Spark/Shark
(efficient use of cluster memory, supporting iterative MapReduce
programs).
Slide 98
Thank you! Any Questions?
Slide 99
Pregel (backup slides)
Slide 100
Pregel: Introduction, Computation Model, Writing a Pregel Program,
System Implementation, Experiments, Conclusion
Slide 101
Introduction (1/2). Source: SIGMETRICS '09 tutorial "MapReduce: The
Programming Model and Practice," by Jerry Zhao.
Slide 102
Introduction (2/2). Many practical computing problems concern large
graphs: the Web graph, transportation routes, citation relationships,
social networks. Typical graph algorithms: PageRank, shortest path,
connected components, clustering techniques. MapReduce is ill-suited
for graph processing: many iterations are needed for parallel graph
processing, and materializations of intermediate results at every
MapReduce iteration harm performance.
Slide 103
Single Source Shortest Path (SSSP). Problem: find the shortest path
from a source node to all target nodes. Solution: on a
single-processor machine, Dijkstra's algorithm; in MapReduce/Pregel,
parallel breadth-first search (BFS).
Slide 111
MapReduce Execution Overview
Slide 112
Example: SSSP - Parallel BFS in MapReduce. The example graph (figure)
has five nodes with weighted directed edges.

Adjacency list:
    A: (B, 10), (D, 5)
    B: (C, 1), (D, 2)
    C: (E, 4)
    D: (B, 3), (C, 9), (E, 2)
    E: (A, 7), (C, 6)

Adjacency matrix (row = source, column = destination):
        A   B   C   D   E
    A       10      5
    B           1   2
    C                   4
    D       3   9       2
    E   7       6
Slide 113
Example: SSSP - Parallel BFS in MapReduce. Map input, one record
<node, <distance, adjacency list>> per node, with source A at
distance 0 and all others at inf:
    <A, <0, <(B,10), (D,5)>>>   <B, <inf, <(C,1), (D,2)>>>
    <C, <inf, <(E,4)>>>   <D, <inf, <(B,3), (C,9), (E,2)>>>
    <E, <inf, <(A,7), (C,6)>>>
Map output, a tentative distance <linkee, dist + weight> per edge
(plus each node's own record, to carry the graph along): <B, 10> and
<D, 5> from A; <C, inf> and <D, inf> from B; and so on. Flushed to
local disk!!
Slide 114
Example: SSSP - Parallel BFS in MapReduce. Reduce input, grouped by
node: each node's own record plus the candidate distances emitted for
it:
    A: <0, <(B,10), (D,5)>>, inf
    B: <inf, <(C,1), (D,2)>>, 10, inf
    C: <inf, <(E,4)>>, inf, inf, inf
    D: <inf, <(B,3), (C,9), (E,2)>>, 5, inf
    E: <inf, <(A,7), (C,6)>>, inf, inf
Slide 116
Example: SSSP - Parallel BFS in MapReduce. Reduce output, the minimum
distance per node, which becomes the map input for the next
iteration:
    <A, <0, ...>>  <B, <10, ...>>  <C, <inf, ...>>  <D, <5, ...>>
    <E, <inf, ...>>
Flushed to DFS!! The next map pass then emits fresh candidates (e.g.,
<B, 8>, <C, 14>, <E, 7> from D), flushed to local disk!!
Slide 117
Example: SSSP - Parallel BFS in MapReduce. Reduce input for the
second iteration, grouped by node:
    A: <0, ...>, inf
    B: <10, ...>, 10, 8
    C: <inf, ...>, 11, 14, inf
    D: <5, ...>, 5, 12
    E: <inf, ...>, 7, inf
Slide 119
Example: SSSP - Parallel BFS in MapReduce. Reduce output after the
second iteration (= map input for the next one), the rest omitted:
    <A, <0, ...>>  <B, <8, ...>>  <C, <11, ...>>  <D, <5, ...>>
    <E, <7, ...>>
Flushed to DFS!! One more iteration settles C at its final distance
of 9.
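The iterations above can be checked with a small C++ simulation of
the per-iteration relaxation (illustrative and single-process; each
loop pass plays the role of one MapReduce job over the example
graph):

    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    int main() {
        const int INF = 1 << 29;  // stands in for "inf" on the slides
        // The example graph: edges[u] lists (target, weight) pairs.
        std::map<char, std::vector<std::pair<char, int>>> edges = {
            {'A', {{'B', 10}, {'D', 5}}},
            {'B', {{'C', 1}, {'D', 2}}},
            {'C', {{'E', 4}}},
            {'D', {{'B', 3}, {'C', 9}, {'E', 2}}},
            {'E', {{'A', 7}, {'C', 6}}}};

        // Source A starts at distance 0; everyone else at inf.
        std::map<char, int> dist = {
            {'A', 0}, {'B', INF}, {'C', INF}, {'D', INF}, {'E', INF}};

        for (int iter = 1; iter <= 4; ++iter) {
            // "Map": each reached node emits (neighbor, dist+weight);
            // "Reduce": each node keeps the minimum candidate.
            std::map<char, int> next = dist;
            for (const auto& node : edges)
                if (dist[node.first] < INF)
                    for (const auto& edge : node.second)
                        if (dist[node.first] + edge.second <
                            next[edge.first])
                            next[edge.first] =
                                dist[node.first] + edge.second;
            dist = next;
            std::printf("after iteration %d:", iter);
            for (const auto& kv : dist)
                if (kv.second >= INF) std::printf("  %c=inf", kv.first);
                else std::printf("  %c=%d", kv.first, kv.second);
            std::printf("\n");
        }
        return 0;
    }

It reproduces the slides: iteration 1 yields B=10, D=5; iteration 2
yields B=8, C=11, E=7; iteration 3 settles C=9.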
Slide 120
Computation Model (1/3) Input Output Supersteps (a sequence of
iterations)
Slide 121
Computation Model (2/3). "Think like a vertex." Inspired by Valiant's
Bulk Synchronous Parallel model (1990). Source:
http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Slide 122
Computation Model (3/3). In a superstep, the vertices compute in
parallel. Each vertex: receives messages sent in the previous
superstep; executes the same user-defined function; modifies its
value or that of its outgoing edges; sends messages to other vertices
(to be received in the next superstep); may mutate the topology of
the graph; and votes to halt if it has no further work to do.
Termination condition: all vertices are simultaneously inactive and
there are no messages in transit.
Differences from MapReduce. Graph algorithms can be written as a
series of chained MapReduce invocations. Pregel keeps vertices and
edges on the machine that performs the computation and uses network
transfers only for messages. MapReduce passes the entire state of the
graph from one stage to the next and needs to coordinate the steps of
a chained MapReduce.
Slide 133
C++ API. Writing a Pregel program means subclassing the predefined
Vertex class and overriding its Compute() method ("Override this!" in
the figure), which consumes the incoming messages and sends the
outgoing ones.
Slide 134
Example: Vertex Class for SSSP
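The code figure did not survive extraction; it was presumably along
the lines of the shortest-path vertex in the Pregel paper (Malewicz
et al., SIGMOD 2010), sketched here against that paper's API (the
Vertex template, MessageIterator, SendMessageTo, VoteToHalt, and INF
are assumed from the framework, so this will not compile standalone):

    // Vertex<vertex value, edge value, message value>
    class ShortestPathVertex
        : public Vertex<int, int, int> {
      void Compute(MessageIterator* msgs) {
        // Take the minimum over messages from the previous superstep.
        int mindist = IsSource(vertex_id()) ? 0 : INF;
        for (; !msgs->Done(); msgs->Next())
          mindist = min(mindist, msgs->Value());
        if (mindist < GetValue()) {
          *MutableValue() = mindist;  // modify the vertex value
          // Send tentative distances to all out-neighbors.
          OutEdgeIterator iter = GetOutEdgeIterator();
          for (; !iter.Done(); iter.Next())
            SendMessageTo(iter.Target(), mindist + iter.GetValue());
        }
        VoteToHalt();  // go inactive until a new message arrives
      }
    };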
Slide 135
System Architecture. The Pregel system also uses the master/worker
model. Master: maintains the workers, recovers from worker faults,
and provides a Web-UI monitoring tool for job progress. Worker:
processes its task and communicates with the other workers.
Persistent data is stored as files on a distributed storage system
(such as GFS or BigTable); temporary data is stored on local disk.
Slide 136
Execution of a Pregel Program
1. Many copies of the program begin executing on a cluster of
machines.
2. The master assigns a partition of the input to each worker; each
worker loads the vertices and marks them as active.
3. The master instructs each worker to perform a superstep: each
worker loops through its active vertices and computes for each
vertex; messages are sent asynchronously, but are delivered before
the end of the superstep. This step is repeated as long as any
vertices are active or any messages are in transit.
4. After the computation halts, the master may instruct each worker
to save its portion of the graph.
Slide 137
Fault Tolerance. Checkpointing: the master periodically instructs the
workers to save the state of their partitions to persistent storage
(e.g., vertex values, edge values, incoming messages). Failure
detection: using regular ping messages. Recovery: the master
reassigns graph partitions to the currently available workers, and
the workers all reload their partition state from the most recent
available checkpoint.
Slide 138
Experiments. Environment: H/W: a cluster of 300 multicore commodity
PCs. Data: binary trees and log-normal random graphs (general
graphs). Naive SSSP implementation: the weight of all edges = 1; no
checkpointing.
Experiments: SSSP on random graphs, varying graph sizes on 800 worker
tasks.
Slide 142
Conclusion & Future Work. Pregel is a scalable and fault-tolerant
platform with an API that is sufficiently flexible to express
arbitrary graph algorithms. Future work: relaxing the synchronicity
of the model, so as not to wait for slower workers at inter-superstep
barriers; assigning vertices to machines to minimize inter-machine
communication; handling dense graphs in which most vertices send
messages to most other vertices.