Outline
What is the purpose of Data Intensive Super Computing?
MapReduce
Pregel
Dryad
Spark/Shark
Distributed Graph Computing
Slide 3
Why DISC? DISC stands for Data Intensive Super Computing. Many
applications depend on it: scientific data, web search engines,
social networks, economics, GIS. New data are continuously generated,
and people want to understand them. Big Data analysis is now
considered a very important method for scientific research.
Slide 4
What are the required features for a platform that handles DISC?
Application specific: it is very difficult, or even impossible, to
build one system that fits them all. One example is the
POSIX-compatible file system. Each system should be re-configured or
even re-designed for a specific application; think about the
motivation for building the Google File System for the Google search
engine.
Programmer-friendly interfaces: the application programmer should not
have to worry about the infrastructure, such as machines and
networks.
Fault tolerant: the platform should handle faulty components
automatically, without any special treatment from the application.
Scalability: the platform should run on top of at least thousands of
machines and harness the power of all the components. Load balancing
should be achieved by the platform instead of the application itself.
Try to keep these four features in mind during the introduction of
the concrete platforms below.
Slide 5
Google MapReduce: Programming Model, Implementation, Refinements,
Evaluation, Conclusion
Slide 6
Motivation: large-scale data processing. Process lots of data to
produce other derived data. Input: crawled documents, web request
logs, etc. Output: inverted indices, web page graph structure, top
queries in a day, etc. Want to use hundreds or thousands of CPUs but
focus only on the functionality. MapReduce hides the messy details in
a library: parallelization, data distribution, fault tolerance, load
balancing.
Slide 7
Motivation: Large Scale Data Processing Want to process lots of
data ( > 1 TB) Want to parallelize across hundreds/thousands of
CPUs Want to make this easy "Google Earth uses 70.5 TB: 70 TB for
the raw imagery and 500 GB for the index data." From:
http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
Slide 8
MapReduce Automatic parallelization & distribution
Fault-tolerant Provides status and monitoring tools Clean
abstraction for programmers
Slide 9
Programming Model. Borrows from functional programming. Users
implement an interface of two functions:

    map(in_key, in_value) -> list of (out_key, intermediate_value)
    reduce(out_key, list of intermediate_value) -> list of out_value
Slide 10
map. Records from the data source (lines out of files, rows of a
database, etc.) are fed into the map function as key/value pairs:
e.g., (filename, line). map() produces one or more intermediate
values along with an output key from the input.
Slide 11
reduce. After the map phase is over, all the intermediate values for
a given output key are combined together into a list. reduce()
combines those intermediate values into one or more final values for
that same output key (in practice, usually only one final value per
key).
Slide 12
Architecture
Slide 13
Parallelism. map() functions run in parallel, creating different
intermediate values from different input data sets. reduce()
functions also run in parallel, each working on a different output
key. All values are processed independently. Bottleneck: the reduce
phase can't start until the map phase is completely finished.
Slide 14
Example: Count word occurrences

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // intermediate_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
Slide 15
Example vs. Actual Source Code Example is written in
pseudo-code Actual implementation is in C++, using a MapReduce
library Bindings for Python and Java exist via interfaces True code
is somewhat more involved (defines how the input key/values are
divided up and accessed, etc.)
Slide 16
Example. Page 1: "the weather is good". Page 2: "today is good".
Page 3: "good weather is good".
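To make the example concrete, here is a minimal, self-contained C++
sketch (illustrative only, not Google's MapReduce library; all names
are made up) that simulates the map, shuffle, and reduce phases over
the three pages. It prints good: 4, is: 3, the: 1, today: 1,
weather: 2.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // The three example pages.
        std::vector<std::string> pages = {
            "the weather is good",    // Page 1
            "today is good",          // Page 2
            "good weather is good"};  // Page 3

        // Map phase: emit an intermediate (word, 1) pair for every
        // word, as EmitIntermediate(w, "1") does in the pseudo-code.
        std::vector<std::pair<std::string, int>> intermediate;
        for (const std::string& page : pages) {
            std::istringstream words(page);
            std::string w;
            while (words >> w) intermediate.push_back({w, 1});
        }

        // Shuffle phase: group all intermediate values by key.
        std::map<std::string, std::vector<int>> groups;
        for (const auto& kv : intermediate)
            groups[kv.first].push_back(kv.second);

        // Reduce phase: sum the counts for each word.
        for (const auto& g : groups) {
            int result = 0;
            for (int v : g.second) result += v;
            std::cout << g.first << ": " << result << "\n";
        }
        return 0;
    }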
Some Other Real Examples: term frequencies through the whole Web
repository; count of URL access frequency; reverse web-link graph.
Slide 21
Implementation Overview Typical cluster: 100s/1000s of 2-CPU
x86 machines, 2-4 GB of memory Limited bisection bandwidth Storage
is on local IDE disks GFS: distributed file system manages data
(SOSP'03) Job scheduling system: jobs made up of tasks, scheduler
assigns tasks to machines Implementation is a C++ library linked
into user programs
Slide 22
Architecture
Slide 23
Execution
Slide 24
Parallel Execution
Slide 25
Task Granularity And Pipelining Fine granularity tasks: many
more map tasks than machines Minimizes time for fault recovery Can
pipeline shuffling with map execution Better dynamic load balancing
Often use 200,000 map/5000 reduce tasks w/ 2000 machines
Slide 26
Locality Master program divvies up tasks based on location of
data: (Asks GFS for locations of replicas of input file blocks)
tries to have map() tasks on same machine as physical file data, or
at least same rack map() task inputs are divided into 64 MB blocks:
same size as Google File System chunks Without this, rack switches
limit read rate Effect: Thousands of machines read input at local
disk speed
Slide 27
Fault Tolerance. The master detects worker failures and re-executes
completed and in-progress map() tasks and in-progress reduce() tasks.
The master also notices that particular input key/values cause
crashes in map(), and skips those values on re-execution. Effect: can
work around bugs in third-party libraries!
Slide 28
Fault Tolerance. On worker failure: detect failure via periodic
heartbeats; re-execute completed and in-progress map tasks;
re-execute in-progress reduce tasks; task completion is committed
through the master. Master failure: could be handled, but isn't yet
(master failure is unlikely). Robust: once lost 1600 of 1800
machines, but finished fine.
Slide 29
Optimizations. No reduce can start until the map is complete, so a
single slow disk controller can rate-limit the whole process. The
master therefore redundantly executes slow-moving map tasks and uses
the results of whichever copy finishes first. Why is it safe to
redundantly execute map tasks? Wouldn't this mess up the total
computation? Slow workers significantly lengthen completion time:
other jobs consuming resources on the machine, bad disks with soft
errors that transfer data very slowly, weird things like processor
caches being disabled (!!).
Slide 30
Optimizations. Combiner functions can run on the same machine as a
mapper, causing a mini-reduce phase to occur before the real reduce
phase, to save bandwidth. Under what conditions is it sound to use a
combiner?
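One sufficient condition: the reduce operation is associative and
commutative, as the summation in word count is. A hedged illustration
in plain C++ (not the MapReduce library's actual combiner API; names
are made up):

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // One mapper's local output: many ("word", 1) pairs.
    using KV = std::pair<std::string, int>;

    // Combiner: a mini-reduce run on the mapper's machine. Because
    // addition is associative and commutative, summing locally first
    // cannot change the final reduce result; it only shrinks the
    // data that crosses the network.
    std::vector<KV> Combine(const std::vector<KV>& map_output) {
        std::map<std::string, int> partial;
        for (const KV& kv : map_output) partial[kv.first] += kv.second;
        return std::vector<KV>(partial.begin(), partial.end());
    }

For Page 3 above, the mapper emits ("good", 1) twice; the combiner
ships a single ("good", 2) instead.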
Slide 31
Refinement Sorting guarantees within each reduce partition
Compression of intermediate data Combiner: useful for saving
network bandwidth Local execution for debugging/testing
User-defined counters
Slide 32
Performance. Tests run on a cluster of 1800 machines: 4 GB of memory;
dual-processor 2 GHz Xeons with Hyperthreading; dual 160 GB IDE
disks; gigabit Ethernet per machine; bisection bandwidth
approximately 100 Gbps. Two benchmarks:
MR_Grep: scan 10^10 100-byte records to extract records matching a
rare pattern (92K matching records).
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort
benchmark).
Slide 33
MR_Grep Locality optimization helps: 1800 machines read 1 TB of
data at peak of ~31 GB/s Without this, rack switches would limit to
10 GB/s Startup overhead is significant for short jobs
Slide 34
MR_Sort. Backup tasks reduce job completion time significantly. The
system deals well with failures. (The figure compares three runs:
normal, no backup tasks, and 200 processes killed.)
Slide 35
More and more MapReduce MapReduce Programs In Google Source
Tree Example uses: distributed grep distributed sort web link-graph
reversal term-vector per host web access log stats inverted index
construction document clustering machine learning statistical
machine translation
Slide 36
Real MapReduce: Rewrite of the Production Indexing System. Rewrote
Google's production indexing system using MapReduce, as a set of 10,
14, 17, 21, 24 MapReduce operations. The new code is simpler and
easier to understand. MapReduce takes care of failures and slow
machines. Easy to make indexing faster by adding more machines.
Slide 37
MapReduce Conclusions MapReduce has proven to be a useful
abstraction Greatly simplifies large-scale computations at Google
Functional programming paradigm can be applied to large-scale
applications Fun to use: focus on problem, let library deal w/
messy details
PageRank: Random Walks Over The Web If a user starts at a
random web page and surfs by clicking links and randomly entering
new URLs, what is the probability that s/he will arrive at a given
page? The PageRank of a page captures this notion More popular or
worthwhile pages get a higher rank
Slide 41
PageRank: Visually
Slide 42
PageRank: Formula. Given page A, and pages T_1 through T_n linking to
A, PageRank is defined as:

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))

C(P) is the cardinality (out-degree) of page P; d is the damping
(random URL) factor.
Slide 43
PageRank: Intuition. The calculation is iterative: PR_{i+1} is based
on PR_i. Each page distributes its PR_i to all pages it links to;
linkees add up their awarded rank fragments to find their PR_{i+1}.
d is a tunable parameter (usually d = 0.85) encapsulating the random
jump factor.

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
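As a quick sanity check of the formula: take two pages A and B that
link only to each other, with d = 0.85. Then C(A) = C(B) = 1, so
PR(A) = 0.15 + 0.85 * PR(B), and symmetrically for B. PR(A) = PR(B)
= 1 satisfies both equations, so the uniform seed is already the
fixed point; an asymmetric graph instead converges toward its fixed
point over several iterations.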
Slide 44
PageRank: First Implementation. Create two tables, 'current' and
'next', holding the PageRank for each page. Seed 'current' with
initial PR values. Iterate over all pages in the graph, distributing
PR from 'current' into 'next' of the linkees. Then current := next;
next := fresh_table(). Go back to the iteration step, or end if
converged.
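A minimal single-machine sketch of this two-table scheme in C++ (the
three-page graph and all names here are made-up illustrations, not
the actual implementation):

    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        const double d = 0.85;  // damping factor
        // Made-up adjacency list: page i links to pages in links[i].
        std::vector<std::vector<int>> links = {{1, 2}, {2}, {0}};
        const int n = static_cast<int>(links.size());

        std::vector<double> current(n, 1.0);  // seed with initial PR
        std::vector<double> next;

        for (int iter = 0; iter < 100; ++iter) {
            next.assign(n, 1.0 - d);  // the (1 - d) term per page
            // Distribute PR from 'current' into 'next' of linkees.
            for (int p = 0; p < n; ++p)
                for (int q : links[p])
                    next[q] += d * current[p] / links[p].size();
            // current := next, and stop if converged.
            double delta = 0.0;
            for (int p = 0; p < n; ++p)
                delta += std::fabs(next[p] - current[p]);
            current = next;
            if (delta < 1e-10) break;
        }
        for (int p = 0; p < n; ++p)
            std::printf("PR(%d) = %.4f\n", p, current[p]);
        return 0;
    }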
Slide 45
Distribution of the Algorithm Key insights allowing
parallelization: The 'next' table depends on 'current', but not on
any other rows of 'next' Individual rows of the adjacency matrix
can be processed in parallel Sparse matrix rows are relatively
small
Slide 46
Distribution of the Algorithm Consequences of insights: We can
map each row of 'current' to a list of PageRank fragments to assign
to linkees These fragments can be reduced into a single PageRank
value for a page by summing Graph representation can be even more
compact; since each element is simply 0 or 1, only transmit column
numbers where it's 1
Slide 47
Slide 48
Phase 1: Parse HTML. The map task takes (URL, page content) pairs and
maps them to (URL, (PR_init, list-of-urls)). PR_init is the seed
PageRank for URL; list-of-urls contains all pages pointed to by URL.
The reduce task is just the identity function.
Slide 49
Phase 2: PageRank Distribution. The map task takes (URL, (cur_rank,
url_list)). For each u in url_list, it emits (u,
cur_rank/|url_list|). It also emits (URL, url_list) to carry the
points-to list along through iterations.

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
Slide 50
Phase 2: PageRank Distribution. The reduce task gets (URL, url_list)
and many (URL, val) values. It sums the vals, fixes up with d, and
emits (URL, (new_rank, url_list)).

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))
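A hedged single-process sketch of one Phase 2 round (plain C++
standing in for the MapReduce library; the Page struct and all names
are illustrative assumptions):

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct Page { double cur_rank; std::vector<std::string> url_list; };

    // One Phase 2 round, with the map and reduce steps written out
    // explicitly for a graph held in memory.
    std::map<std::string, Page> PageRankRound(
            const std::map<std::string, Page>& input, double d) {
        std::map<std::string, double> fragments;  // summed fragments
        std::map<std::string, std::vector<std::string>> lists;

        // Map: each page sends an equal rank fragment
        // cur_rank/|url_list| to every linkee, and re-emits its
        // url_list so it survives into the next iteration.
        for (const auto& kv : input) {
            const Page& page = kv.second;
            for (const std::string& u : page.url_list)
                fragments[u] += page.cur_rank / page.url_list.size();
            lists[kv.first] = page.url_list;
        }

        // Reduce: sum the fragments for each URL and fix up with d.
        std::map<std::string, Page> output;
        for (const auto& kv : lists)
            output[kv.first] =
                Page{(1.0 - d) + d * fragments[kv.first], kv.second};
        return output;
    }

    int main() {
        // Tiny made-up graph: a -> b, b -> a.
        std::map<std::string, Page> graph = {
            {"a", {1.0, {"b"}}}, {"b", {1.0, {"a"}}}};
        for (int i = 0; i < 10; ++i) graph = PageRankRound(graph, 0.85);
        for (const auto& kv : graph)
            std::printf("%s: %.4f\n", kv.first.c_str(),
                        kv.second.cur_rank);
        return 0;
    }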
Slide 51
Finishing up... A non-parallelizable component determines
whether convergence has been achieved (Fixed number of iterations?
Comparison of key values?) If so, write out the PageRank lists -
done! Otherwise, feed output of Phase 2 into another Phase 2
iteration
Slide 52
PageRank Conclusions. MapReduce isn't the greatest at iterated
computation, but it still helps run the heavy lifting. The key
element in parallelization is the independent PageRank computation in
a given step. Parallelization requires thinking about the minimum
data to transmit per partition (e.g., compact representations of
graph rows). Even the implementation shown today doesn't actually
scale to the whole Internet, but it works for intermediate-sized
graphs. So, do you think MapReduce is suitable for PageRank?
(Homework: give concrete reasons why and why not.)
Slide 53
Dryad: Design, Implementation, Policies as Plug-ins, Building on
Dryad
Slide 54
Design Space (figure): the axes are throughput vs. latency and
Internet vs. private data center; Dryad targets data-parallel,
high-throughput computing in a private data center, as opposed to
shared memory.
Slide 72
Aggregation Manager (figure): the static plan is refined dynamically,
with aggregation vertices grouped per rack.
Slide 73
Data Distribution (Group By) (figure): m source vertices feed n
destination vertices through an m x n distribution stage.
Slide 74
Range-Distribution Manager (figure): the static plan guesses the
partition boundaries ([0-?), [?-100)); at run time a sampled
histogram (e.g., [0-30), [30-100)) sets the ranges dynamically.
Slide 75
Goal: Declarative Programming (figure contrasting a static plan with
its dynamic expansion).
Slide 76
Dryad: Design, Implementation, Policies as Plug-ins, Building on
Dryad
Slide 77
Software Stack (figure), from bottom to top: Windows Server; cluster
services; distributed filesystem (plus CIFS/NTFS); Dryad; and above
it DryadLINQ, PSQL, SSIS, a distributed shell, and SQL Server, with
applications (queries, vectors, machine learning, and legacy code
such as sed, awk, grep) written in C++, C#, Perl, and SQL, plus job
queueing and monitoring.
Slide 78
SkyServer Query 18:

    select distinct P.ObjID into results
    from photoPrimary U, neighbors N, photoPrimary L
    where U.ObjID = N.ObjID
      and L.ObjID = N.NeighborObjID
      and P.ObjID < L.ObjID
      and abs((U.u - U.g) - (L.u - L.g)) < ...
Conclusions. Dryad is a distributed execution environment.
Application-independent (semantics oblivious), it supports a rich
software ecosystem: relational algebra, MapReduce, LINQ, etc.
DryadLINQ is a Dryad provider for LINQ. This is only the beginning!
Slide 97
Some other systems you should know about for Big Data processing:
Hadoop HDFS and MapReduce (open-source versions of GFS and
MapReduce); Hive/Pig/Sawzall (query-language processing); Spark/Shark
(efficient use of cluster memory, supporting iterative MapReduce
programs).
Slide 98
Thank you! Any Questions?
Slide 99
Pregel (backup slides)
Slide 100
Pregel: Introduction, Computation Model, Writing a Pregel Program,
System Implementation, Experiments, Conclusion
Slide 101
Introduction (1/2). Source: SIGMETRICS '09 tutorial "MapReduce: The
Programming Model and Practice," by Jerry Zhao.
Slide 102
Introduction (2/2). Many practical computing problems concern large
graphs: the Web graph, transportation routes, citation relationships,
social networks. Typical graph algorithms: PageRank, shortest path,
connected components, clustering techniques. MapReduce is ill-suited
for graph processing: many iterations are needed for parallel graph
processing, and materializations of intermediate results at every
MapReduce iteration harm performance.
Slide 103
Single Source Shortest Path (SSSP). Problem: find the shortest path
from a source node to all target nodes. Solution: on a
single-processor machine, Dijkstra's algorithm; in MapReduce/Pregel,
parallel breadth-first search (BFS).
Slide 111
MapReduce Execution Overview
Slide 112
Example: SSSP - Parallel BFS in MapReduce. The example graph (figure)
has five nodes with weighted directed edges.

Adjacency list:
    A: (B, 10), (D, 5)
    B: (C, 1), (D, 2)
    C: (E, 4)
    D: (B, 3), (C, 9), (E, 2)
    E: (A, 7), (C, 6)

Adjacency matrix (row = source, column = destination):
        A   B   C   D   E
    A       10      5
    B           1   2
    C                   4
    D       3   9       2
    E   7       6
Slide 113
Example: SSSP - Parallel BFS in MapReduce. Map input, one record
<node, <distance, adjacency list>> per node, with source A at
distance 0 and all others at inf:
    <A, <0, <(B,10), (D,5)>>>   <B, <inf, <(C,1), (D,2)>>>
    <C, <inf, <(E,4)>>>   <D, <inf, <(B,3), (C,9), (E,2)>>>
    <E, <inf, <(A,7), (C,6)>>>
Map output, a tentative distance <linkee, dist + weight> per edge
(plus each node's own record, to carry the graph along): <B, 10> and
<D, 5> from A; <C, inf> and <D, inf> from B; and so on. Flushed to
local disk!!
Slide 114
Example: SSSP - Parallel BFS in MapReduce. Reduce input, grouped by
node: each node's own record plus the candidate distances emitted for
it:
    A: <0, <(B,10), (D,5)>>, inf
    B: <inf, <(C,1), (D,2)>>, 10, inf
    C: <inf, <(E,4)>>, inf, inf, inf
    D: <inf, <(B,3), (C,9), (E,2)>>, 5, inf
    E: <inf, <(A,7), (C,6)>>, inf, inf
Slide 116
Example: SSSP - Parallel BFS in MapReduce. Reduce output, the minimum
distance per node, which becomes the map input for the next
iteration:
    <A, <0, ...>>  <B, <10, ...>>  <C, <inf, ...>>  <D, <5, ...>>
    <E, <inf, ...>>
Flushed to DFS!! The next map pass then emits fresh candidates (e.g.,
<B, 8>, <C, 14>, <E, 7> from D), flushed to local disk!!
Slide 117
Example: SSSP - Parallel BFS in MapReduce. Reduce input for the
second iteration, grouped by node:
    A: <0, ...>, inf
    B: <10, ...>, 10, 8
    C: <inf, ...>, 11, 14, inf
    D: <5, ...>, 5, 12
    E: <inf, ...>, 7, inf
Slide 119
Example: SSSP - Parallel BFS in MapReduce. Reduce output after the
second iteration (= map input for the next one), the rest omitted:
    <A, <0, ...>>  <B, <8, ...>>  <C, <11, ...>>  <D, <5, ...>>
    <E, <7, ...>>
Flushed to DFS!! One more iteration settles C at its final distance
of 9.
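The iterations above can be checked with a small C++ simulation of
the per-iteration relaxation (illustrative and single-process; each
loop pass plays the role of one MapReduce job over the example
graph):

    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    int main() {
        const int INF = 1 << 29;  // stands in for "inf" on the slides
        // The example graph: edges[u] lists (target, weight) pairs.
        std::map<char, std::vector<std::pair<char, int>>> edges = {
            {'A', {{'B', 10}, {'D', 5}}},
            {'B', {{'C', 1}, {'D', 2}}},
            {'C', {{'E', 4}}},
            {'D', {{'B', 3}, {'C', 9}, {'E', 2}}},
            {'E', {{'A', 7}, {'C', 6}}}};

        // Source A starts at distance 0; everyone else at inf.
        std::map<char, int> dist = {
            {'A', 0}, {'B', INF}, {'C', INF}, {'D', INF}, {'E', INF}};

        for (int iter = 1; iter <= 4; ++iter) {
            // "Map": each reached node emits (neighbor, dist+weight);
            // "Reduce": each node keeps the minimum candidate.
            std::map<char, int> next = dist;
            for (const auto& node : edges)
                if (dist[node.first] < INF)
                    for (const auto& edge : node.second)
                        if (dist[node.first] + edge.second <
                            next[edge.first])
                            next[edge.first] =
                                dist[node.first] + edge.second;
            dist = next;
            std::printf("after iteration %d:", iter);
            for (const auto& kv : dist)
                if (kv.second >= INF) std::printf("  %c=inf", kv.first);
                else std::printf("  %c=%d", kv.first, kv.second);
            std::printf("\n");
        }
        return 0;
    }

It reproduces the slides: iteration 1 yields B=10, D=5; iteration 2
yields B=8, C=11, E=7; iteration 3 settles C=9.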
Slide 120
Computation Model (1/3) Input Output Supersteps (a sequence of
iterations)
Slide 121
Computation Model (2/3). "Think like a vertex." Inspired by Valiant's
Bulk Synchronous Parallel model (1990). Source:
http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Slide 122
Computation Model (3/3). In a superstep, the vertices compute in
parallel. Each vertex: receives messages sent in the previous
superstep; executes the same user-defined function; modifies its
value or that of its outgoing edges; sends messages to other vertices
(to be received in the next superstep); may mutate the topology of
the graph; and votes to halt if it has no further work to do.
Termination condition: all vertices are simultaneously inactive and
there are no messages in transit.
Differences from MapReduce. Graph algorithms can be written as a
series of chained MapReduce invocations. Pregel keeps vertices and
edges on the machine that performs the computation and uses network
transfers only for messages. MapReduce passes the entire state of the
graph from one stage to the next and needs to coordinate the steps of
a chained MapReduce.
Slide 133
C++ API. Writing a Pregel program means subclassing the predefined
Vertex class and overriding its Compute() method ("Override this!" in
the figure), which consumes the incoming messages and sends the
outgoing ones.
Slide 134
Example: Vertex Class for SSSP
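The code figure did not survive extraction; it was presumably along
the lines of the shortest-path vertex in the Pregel paper (Malewicz
et al., SIGMOD 2010), sketched here against that paper's API (the
Vertex template, MessageIterator, SendMessageTo, VoteToHalt, and INF
are assumed from the framework, so this will not compile standalone):

    // Vertex<vertex value, edge value, message value>
    class ShortestPathVertex
        : public Vertex<int, int, int> {
      void Compute(MessageIterator* msgs) {
        // Take the minimum over messages from the previous superstep.
        int mindist = IsSource(vertex_id()) ? 0 : INF;
        for (; !msgs->Done(); msgs->Next())
          mindist = min(mindist, msgs->Value());
        if (mindist < GetValue()) {
          *MutableValue() = mindist;  // modify the vertex value
          // Send tentative distances to all out-neighbors.
          OutEdgeIterator iter = GetOutEdgeIterator();
          for (; !iter.Done(); iter.Next())
            SendMessageTo(iter.Target(), mindist + iter.GetValue());
        }
        VoteToHalt();  // go inactive until a new message arrives
      }
    };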
Slide 135
System Architecture. The Pregel system also uses the master/worker
model. Master: maintains the workers, recovers from worker faults,
and provides a Web-UI monitoring tool for job progress. Worker:
processes its task and communicates with the other workers.
Persistent data is stored as files on a distributed storage system
(such as GFS or BigTable); temporary data is stored on local disk.
Slide 136
Execution of a Pregel Program
1. Many copies of the program begin executing on a cluster of
machines.
2. The master assigns a partition of the input to each worker; each
worker loads the vertices and marks them as active.
3. The master instructs each worker to perform a superstep: each
worker loops through its active vertices and computes for each
vertex; messages are sent asynchronously, but are delivered before
the end of the superstep. This step is repeated as long as any
vertices are active or any messages are in transit.
4. After the computation halts, the master may instruct each worker
to save its portion of the graph.
Slide 137
Fault Tolerance. Checkpointing: the master periodically instructs the
workers to save the state of their partitions to persistent storage
(e.g., vertex values, edge values, incoming messages). Failure
detection: using regular ping messages. Recovery: the master
reassigns graph partitions to the currently available workers, and
the workers all reload their partition state from the most recent
available checkpoint.
Slide 138
Experiments. Environment: H/W: a cluster of 300 multicore commodity
PCs. Data: binary trees and log-normal random graphs (general
graphs). Naive SSSP implementation: the weight of all edges = 1; no
checkpointing.
Experiments: SSSP on random graphs, varying graph sizes on 800 worker
tasks.
Slide 142
Conclusion & Future Work. Pregel is a scalable and fault-tolerant
platform with an API that is sufficiently flexible to express
arbitrary graph algorithms. Future work: relaxing the synchronicity
of the model, so as not to wait for slower workers at inter-superstep
barriers; assigning vertices to machines to minimize inter-machine
communication; handling dense graphs in which most vertices send
messages to most other vertices.