
MapReduce and Friends

Craig C. Douglas, University of Wyoming

with thanks to Mookwon Seo

Why was it invented?

• MapReduce is a mergesort for large distributed memory computers.

• It was the basis for a web page search engine now known as Google, where pages are ranked.

• It is most useful for embarrassingly parallel applications.

• It has become a paradigm for implementation of parallel algorithms.


Is it efficient?

• Yes:
  – When communication time can be managed so as not to be overwhelming.
  – Encourages reconsidering how standard algorithms are implemented on distributed parallel machines.
• No:
  – Frequently uses files instead of pipes to communicate.
  – Encourages terrible algorithm implementations.

Distributed file system

• We assume that we have huge amounts of storage that has to be spread across a compute cluster made up of commodity PCs.
• Files are read and appended, not edited.
• Files are distributed in chunks (64-128 MB is typical).
• Chunks are replicated on multiple compute nodes (3 is typical).
• A master keeps an index of chunk locations.
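A minimal sketch of the master's chunk index, in plain Python. The function name `place_chunks`, the node names, and the round-robin placement policy are all assumptions made for illustration; real distributed file systems use more elaborate placement rules (for example, taking racks into account).

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the low end of the typical range
REPLICAS = 3                    # typical replication factor

def place_chunks(file_size, nodes, chunk_size=CHUNK_SIZE, replicas=REPLICAS):
    """Return a master index: chunk number -> list of nodes holding a copy."""
    n_chunks = -(-file_size // chunk_size)       # ceiling division
    index = {}
    for c in range(n_chunks):
        # Hypothetical policy: round-robin, so the copies of one chunk land
        # on distinct nodes as long as len(nodes) >= replicas.
        index[c] = [nodes[(c + r) % len(nodes)] for r in range(replicas)]
    return index

index = place_chunks(200 * 1024 * 1024, ["n0", "n1", "n2", "n3"])
print(index)
```

A 200 MB file needs four 64 MB chunks here, and losing any single node still leaves two copies of every chunk.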

Compute clusters

• Typically made up of racks of 1U boards with multiple CPUs per board plus local storage.

• Intra-rack communication is usually 1-40 Gb/s (Ethernet or InfiniBand).
• Inter-rack connection is usually something very fast (InfiniBand).
• Multiple communication fabrics are common on very big clusters.

Distributed/cluster file systems

• Common ones:
  – IBM GPFS, Lustre, Apache HDFS, Google GFS
  – Wikipedia lists 24 systems (1st is not on their list!).
• Proprietary and open source ones exist.
• Some are easy to set up and others are an ongoing nightmare.
• Reconfiguring a DFS is usually beyond painful.

How does it all interact?

[Figure: software stack, listed from top to bottom]
• SQL: PIG, OINK, HIVE, etc., or NoSQL: Cassandra, MongoDB, etc.
• Object store (key-value): BigTable, HBase, etc.
• Graph systems: GraphX, etc.
• MapReduce: Hadoop, MR-MPI, etc.
• Workflow: Spark, TensorFlow, etc.
• Graphs: Pregel, GraphLab
• Distributed file system
• Compute cluster

MapReduce systems

• Very common ones:
  – Google’s MapReduce (C++)
  – Apache’s Hadoop (Apache License, Java, open source)
  – MR-MPI (Sandia Nat’l Lab, BSD license, C/C++, open source)
• Important features:
  – Specialized parallel computing tools.
    • Typically, the user writes just two (or three) serial functions.
  – Avoid restart of the whole job if there is a compute node failure.

Key-value systems

• Popular ones:
  – Google BigTable
  – Apache HBase and Cassandra (NoSQL)
  – Almost any NoSQL has MapReduce built in
• Each row is associated with a key.
• The number of columns in a row can be variable.
• Each (row,column) has a set of values.

SQL-like systems

• Many, many available. Some popular ones:
  – Apache HIVE (open source)
    • Implements QL, a restricted subset of the SQL standard.
    • Sits on top of Hadoop.
  – Yahoo! PIG
    • Implements a relational algebra.
  – Microsoft Scope and SQL Server
  – Google Sawzall and SQL
    • Implements parallel select + aggregation.

NoSQL systems

• Not Only Structured Query Language
  – NoSQL implies a non-relational database
  – Key-store structure internally
    • PostgreSQL
    • Apache HBase and Cassandra
    • Dynamo
    • CouchDB
    • MongoDB
  – Eventually consistent over quiet periods.

SQL versus NoSQL

• SQL
  – Every record in a table has the same sequence of fields, though not necessarily the same number of fields (trailing fields may be empty).
• NoSQL
  – Documents in a collection may have fields that are completely different.
  – Documents are addressed by a unique key.
  – Queries allow a document type.

SQL Format

• Each column holds the same data type or is empty. The right edge of the matrix is ragged.

NoSQL Format

• Each element can be any data type.

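The two formats can be contrasted with a small plain-Python sketch. The column names, records, and keys below are invented purely for illustration.

```python
# SQL-style table: a fixed sequence of typed columns. Trailing fields may be
# missing, which gives the ragged right edge.
columns = ("id", "name", "age", "email")
rows = [
    (1, "Ada", 36, "ada@example.org"),
    (2, "Bob", 41),            # no email
    (3, "Cy"),                 # no age, no email
]

def field(row, name):
    """Look up a column by name; a missing trailing field reads as None."""
    i = columns.index(name)
    return row[i] if i < len(row) else None

# NoSQL-style collection: documents addressed by a unique key, and each
# document may have completely different fields of any type.
documents = {
    "u1": {"name": "Ada", "langs": ["C", "Python"]},
    "u2": {"title": "Invoice", "total": 12.5, "paid": True},
}

print(field(rows[1], "email"), documents["u2"]["total"])
```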

Why Mergesort?

• Comparisons and performance
  – Serial computer: O(n log n).
  – Parallel computer with n processors: O(log n).
  – Cache-aware versions of mergesort exist.
  – n/2 auxiliary storage standard, but only O(1) if a linked list is used.
  – Too much copying unless a linked list is used.
  – Lots of communication for parallel computing.
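For reference, a textbook top-down mergesort on Python lists that also counts comparisons. This is only a sketch: it copies sublists freely (the "too much copying" case above) and is not the cache-aware or linked-list variant. The `counter` argument is a hypothetical device for the illustration.

```python
import math

def merge_sort(a, counter):
    """Top-down mergesort; counter[0] accumulates the number of comparisons."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left = merge_sort(a[:mid], counter)
    right = merge_sort(a[mid:], counter)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1
        if left[i] <= right[j]:        # "<=" keeps the sort stable
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])               # at most one of these is non-empty
    out.extend(right[j:])
    return out

data = [5, 2, 9, 1, 5, 6, 3, 8]
count = [0]
result = merge_sort(data, count)
bound = len(data) * math.ceil(math.log2(len(data)))
print(result, count[0], bound)
```

The comparison count never exceeds n⌈log₂ n⌉, which is the O(n log n) serial bound quoted above.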

Other sorting methods

• Heapsort
  – Usually faster on serial computers.
    • Impractical for linked lists.
  – O(1) auxiliary storage standard.
• Quicksort (serial computers with caches)
  – If the quicksort average is C n log n comparisons (C ≈ 1.39 for base-2 logarithms), then the mergesort maximum is about n log n ≈ 0.72 C n log n comparisons, so quicksort averages about 39% more comparisons.
    • Quicksort on average is still much faster by clock time.

MapReduce Paradigm

• MapReduce system
  – Creates a large number of tasks for each of two functions.
  – Work is divided among tasks precisely.
• Two functions only:
  – Map tasks convert inputs from the DFS to key-value pairs, where the keys are not necessarily unique. Output is sorted by key.
  – Reduce tasks combine the key-value pairs for a given key. Usually one Reduce task per key. Output goes to the DFS.
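The paradigm can be sketched in a few lines of serial Python, using word counting as a stand-in application. Here each "chunk" is just a string, and `map_task`, `reduce_task`, and `map_reduce` are hypothetical names; a real system would run many such tasks in parallel and read from and write to the DFS.

```python
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    """Map: turn one input chunk into (key, value) pairs; keys may repeat."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_task(key, values):
    """Reduce: combine all the values that share one key."""
    return key, sum(values)

def map_reduce(chunks):
    pairs = [kv for chunk in chunks for kv in map_task(chunk)]
    pairs.sort(key=itemgetter(0))                  # the "sorted by key" step
    return [reduce_task(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

counts = map_reduce(["the cat sat", "the dog sat", "The end"])
print(counts)
```

The user supplies only the two serial functions; the sort and grouping in `map_reduce` stand in for what the MapReduce system does between them.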

What is MapReduce Good for?

• Matrix-vector and matrix-matrix multiplication
  – The iterative form of PageRank uses these operations extensively.
• General relational algebra operations.
  – Join operations in databases.
• Almost anything that is embarrassingly parallel that uses lots of data from a DFS.
• Dealing with failures efficiently.

Failure techniques

• Re-execute failed tasks, not whole jobs.
• Some systems do checkpointing and then restart at the last checkpoint.
  – Very expensive to dump everything to disk.
    • Adds the cost of extra disk drives that have to be on another compute rack and can lead to early disk failure if used extensively.
    • Moving data across the inter-rack network is slow and may take minutes or fractions of an hour.

What is the obvious Map function?

• A hash function h(x)!
  – Produce h(x) as the key.
  – Value is x and placed in the h(x) bucket.
  – When finished mapping, send the h(x) bucket to its Reduce task for combining.
• Efficient hash table code is imperative.
  – Use memory cache tricks and complicated hash table implementations, not textbook ones.
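A small sketch of hashing values into buckets, one bucket per Reduce task. It is exactly the kind of textbook code the slide warns against and is meant only to show the data flow. Using CRC32 as the hash and the names `h` and `map_to_buckets` are assumptions made for the example.

```python
import zlib
from collections import defaultdict

def h(x, k):
    """Bucket number in range(k). CRC32 is a stand-in for a tuned hash."""
    return zlib.crc32(repr(x).encode("utf-8")) % k

def map_to_buckets(items, k):
    """Map: key = h(x), value = x, collected in the h(x) bucket."""
    buckets = defaultdict(list)
    for x in items:
        buckets[h(x, k)].append(x)
    return buckets

items = ["apple", "pear", "apple", "fig", "pear"]
buckets = map_to_buckets(items, k=3)
for key in sorted(buckets):
    print(key, buckets[key])       # each bucket would go to one Reduce task
```

Equal values always land in the same bucket, which is what lets a single Reduce task combine them.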

MapReduce variant of Join

• Suppose we have a chunked file with lots of edges from one vertex to another for a graph.
• We want to find all edges of the form R(a,b) and S(b,c) and join them to create T(a,c) if it does not already exist.
• Map: key on the b value.
• Use a hash function h from b values to k buckets.
• Reduce: deal with a bucket.

MapReduce variant of Join

• Tuple R(a,b) goes to Reduce task h(b)
  – key = b, value = R(a,b)
• Tuple S(b,c) goes to Reduce task h(b)
  – key = b, value = S(b,c)
• If R(a,b) joins with S(b,c), then both edges are sent to Reduce task h(b).
  – Their join (a,b,c) is appended to the output file on the DFS.

Example of Join

• Suppose we have a directed graph.

[Figure: directed graph on vertices 1, 2, 3, 4 with edges (1,2), (1,3), (1,4), (2,3), (3,4).]

Example of Join

• Map (the key is the shared vertex b)

Key   R(a,b)          S(b,c)
1                     (1,2), (1,3), (1,4)
2     (1,2)           (2,3)
3     (1,3), (2,3)    (3,4)
4     (1,4), (3,4)

• Reduce: T(1,2,3), T(1,3,4), T(2,3,4)
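The join on this graph can be reproduced with a short serial sketch. Every edge is emitted twice by the Map step, once in the role of R(a,b) and once in the role of S(b,c), both keyed on the shared vertex b. The function names and the "R"/"S" tags are hypothetical, and one Reduce call per key is assumed rather than hashing keys to k buckets.

```python
from collections import defaultdict

def join_map(edges):
    """Emit each edge (u, v) in both roles, keyed on the shared vertex b."""
    for (u, v) in edges:
        yield v, ("R", (u, v))     # as R(a,b): b is the edge's end vertex
        yield u, ("S", (u, v))     # as S(b,c): b is the edge's start vertex

def join_reduce(b, values):
    """Pair every R(a,b) with every S(b,c) that arrived for this key."""
    r = [e for tag, e in values if tag == "R"]
    s = [e for tag, e in values if tag == "S"]
    return [(a, b, c) for (a, _) in r for (_, c) in s]

def join(edges):
    groups = defaultdict(list)
    for key, value in join_map(edges):
        groups[key].append(value)
    out = []
    for b in sorted(groups):
        out.extend(join_reduce(b, groups[b]))
    return out

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
print(join(edges))
```

Keys 1 and 4 produce nothing because no edge ends at vertex 1 and no edge starts at vertex 4.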

Matrix-vector multiplication

• Compute $y = Mx$. For an $N \times N$ matrix $M = [m_{ij}]$ and $N$-vectors $x = [x_i]$ and $y = [y_i]$,
  $y_i = m_{i1}x_1 + m_{i2}x_2 + \dots + m_{iN}x_N$.
• Simplest Map function
  – key = $i$, value = $m_{ij}x_j$ (optimization: ignore if 0).
  – Works for dense and sparse matrices.
• Reduce function adds up products for given $i$.
• Inexcusably inefficient in this form, however.

Matrix-vector multiplication example

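• Let $M = \begin{bmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{bmatrix}$ and $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$.

• Map

Key   Values
1     $m_{11}x_1$, $m_{12}x_2$, $m_{13}x_3$
2     $m_{21}x_1$, $m_{22}x_2$, $m_{23}x_3$
3     $m_{31}x_1$, $m_{32}x_2$, $m_{33}x_3$

• Reduce

Key   Values
1     $m_{11}x_1 + m_{12}x_2 + m_{13}x_3$
2     $m_{21}x_1 + m_{22}x_2 + m_{23}x_3$
3     $m_{31}x_1 + m_{32}x_2 + m_{33}x_3$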
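The simplest formulation, one (key, value) pair per nonzero product, might look like the following serial sketch. Indices are 0-based here, the matrix is a list of rows, and the function names are hypothetical.

```python
from collections import defaultdict

def matvec_map(M, x):
    """Map: key = row index i, value = one product m_ij * x_j."""
    for i, row in enumerate(M):
        for j, m_ij in enumerate(row):
            if m_ij != 0:                  # optimization: skip zero entries
                yield i, m_ij * x[j]

def matvec_reduce(pairs, n):
    """Reduce: add up the products for each row index i."""
    sums = defaultdict(float)
    for i, product in pairs:
        sums[i] += product
    return [sums[i] for i in range(n)]

M = [[1, 2, 0],
     [0, 3, 4],
     [5, 0, 6]]
x = [1, 1, 2]
y = matvec_reduce(matvec_map(M, x), len(M))
print(y)
```

Every single product travels from Map to Reduce as its own pair, which is why this form is so inefficient.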

Matrix-vector multiplication

• Better approach when x and y are small enough to fit on all nodes.
  – Input M by sets of rows and assign key k based on the row sets.
    • Compute whole $y_i$’s as a value element.
    • Store a subvector of y as the value for key k.
  – Reduce task just writes out the subvectors to the DFS.
• When x and y are too big, apply 2D domain decomposition methods to M, x, and y.
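A sketch of the row-set approach, assuming x fits on every node. Each Map call handles one set of rows and emits a whole subvector of y, and the Reduce step only concatenates. The names `row_sets`, `map_row_set`, and `reduce_write`, and the choice of two rows per set, are illustrative assumptions.

```python
def row_sets(n_rows, rows_per_set):
    """Split the row indices into consecutive sets; set number k is the key."""
    return [range(s, min(s + rows_per_set, n_rows))
            for s in range(0, n_rows, rows_per_set)]

def map_row_set(k, rows, M, x):
    """Map: compute whole y_i values for one row set, keyed by k."""
    return k, [sum(m * xj for m, xj in zip(M[i], x)) for i in rows]

def reduce_write(pairs):
    """Reduce: stand-in for writing subvectors to the DFS in key order."""
    y = []
    for _, sub in sorted(pairs):
        y.extend(sub)
    return y

M = [[1, 2, 0],
     [0, 3, 4],
     [5, 0, 6]]
x = [1, 1, 2]
pairs = [map_row_set(k, rows, M, x) for k, rows in enumerate(row_sets(3, 2))]
y = reduce_write(pairs)
print(pairs, y)
```

Only one pair per row set moves between Map and Reduce, instead of one pair per matrix entry.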

Matrix-matrix multiplication

• Compute $C = AB$. For an $N \times L$ matrix $A = [a_{ij}]$, an $L \times M$ matrix $B = [b_{ij}]$, and an $N \times M$ matrix $C = [c_{ij}]$, C is formed from $N \times M$ inner products.
• Simplest Map function
  – Apply the matrix-vector product formulation.
  – key = $i$, value = individual product.
  – Works for dense and sparse matrices.
• Reduce function adds up products for given $i$.

Matrix-matrix multiplication

• Unbelievably inefficient, but seen in practice.
• Better approach for Map function:
  – Assume each matrix is stored in blocks of size $n \times n$ (pad by zeroes at right and bottom of a matrix), where n is convenient to your DFS.
  – Do matrix-matrix multiplication using a block scheme.
  – Never, ever do a formal transpose on a DFS.
  – Still works for dense and sparse matrices.
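One possible block scheme, sketched serially with nested lists. Keying each block product on the block position (I, J) of C is an assumption of this sketch, as are the function names; the padding follows the rule above (zeroes at the right and bottom).

```python
from collections import defaultdict

def to_blocks(A, n):
    """Cut A into n-by-n blocks, zero-padding at the right and bottom."""
    rows, cols = len(A), len(A[0])
    blocks = {}
    for I in range(-(-rows // n)):             # ceiling division
        for J in range(-(-cols // n)):
            blocks[I, J] = [[A[i][j] if i < rows and j < cols else 0
                             for j in range(J * n, (J + 1) * n)]
                            for i in range(I * n, (I + 1) * n)]
    return blocks

def block_mul(X, Y):
    """Dense product of two n-by-n blocks."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def block_matmul(A, B, n):
    Ab, Bb = to_blocks(A, n), to_blocks(B, n)
    # Map: key = (I, J) of the C block, value = one block product A_IK B_KJ.
    pairs = [((I, J), block_mul(Ab[I, K], Bb[K2, J]))
             for (I, K) in Ab for (K2, J) in Bb if K == K2]
    # Reduce: add up the block products for each key.
    C = defaultdict(lambda: [[0] * n for _ in range(n)])
    for key, P in pairs:
        C[key] = [[c + p for c, p in zip(cr, pr)] for cr, pr in zip(C[key], P)]
    return dict(C)

A = [[1, 2, 3],
     [4, 5, 6]]            # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]               # 3 x 2
C = block_matmul(A, B, 2)
print(C)
```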

Security and scaling

• Most MapReduce systems (e.g., Hadoop) provide no security or firewalling abilities.
  – All users have access to everything in databases.
  – No encryption by default.
  – Omitting security allows for far better scaling on parallel systems.
  – Security is extremely difficult to add later and still scale.
  – Medical record systems in the USA using Hadoop are on notice that they are not in compliance with privacy laws in effect on January 1, 2014. Big disaster.

Choosing a MapReduce

• O(100) MapReduce systems available.
• Some are embedded (e.g., in Matlab, R, or a (No)SQL database system).
• You have to write, at minimum, a Map and a Reduce function in some language that you know, which is usually the key measure in your choice of a MapReduce system.
• Some are proprietary, others are open source.

Workflow systems

• MapReduce has two proper functions: the output from the map function goes to a reduce function, which provides the results.
• Workflow systems have multiple stages forming a graph of data flow (nominally acyclic, though loops are often allowed) before some function provides the results.
• Spark and TensorFlow are the best known.

Workflow systems

• Function j produces output only after all data has flowed to it from the other functions.
• Iteration is possible, too.
• A master controller, like in MapReduce, is required.

Workflow systems

• Master controller functions:
  – Divides up the input to the initial function(s).
  – Accepts the output from each function and delivers it to the next function in the graph, typically via the disk file system.
• Each function has the blocking property: it delivers output data only when the function completes, just like in MapReduce.
• Multiple, piped MapReduces are workflow examples.
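A toy master controller for an acyclic workflow, sketched with the standard-library `graphlib` module (Python 3.9 or later). The stage names, the `run_workflow` function, and the in-memory `results` dictionary (standing in for the disk file system) are all assumptions for illustration.

```python
from graphlib import TopologicalSorter

def run_workflow(functions, deps, inputs):
    """Run each function once all of its predecessors have completed."""
    results = dict(inputs)                    # stands in for the file system
    for name in TopologicalSorter(deps).static_order():
        if name in results:                   # initial input, nothing to run
            continue
        args = [results[d] for d in deps.get(name, ())]
        # Blocking property: the output is delivered only on completion.
        results[name] = functions[name](*args)
    return results

functions = {
    "tokens": str.split,
    "counts": lambda t: {w: t.count(w) for w in t},
    "longest": lambda t: max(t, key=len),
    "report": lambda counts, longest: (counts[longest], longest),
}
# deps maps each stage to the stages whose output it consumes.
deps = {
    "tokens": ["text"],
    "counts": ["tokens"],
    "longest": ["tokens"],
    "report": ["counts", "longest"],
}
results = run_workflow(functions, deps, {"text": "a bb a ccc"})
print(results["report"])
```

This handles only the acyclic case; iteration would need the controller to re-run stages.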

Spark

• Provides a more efficient way of…
  – Coping with failures.
  – Grouping tasks among compute nodes and scheduling execution of functions.
  – Integration of programming language features such as looping and function libraries.
• All data is grouped in distributed chunks called a Resilient Distributed Dataset, or RDD.

Spark

• RDDs
  – Contain data objects of one type.
  – (key,value) pairs are all we have seen so far (MapReduce).
  – Can apply 4 types of functions to a RDD:
    • Map, Filter, Flatmap, and relational database functions
• A Spark program is a set of functions, each applied to a RDD to get another RDD.
  – The end result, called an action, is returned to whatever called Spark.

Spark filter functions

• Processes each object in a RDD and assigns a 0 or 1 value based on the function. The new RDD consists only of the objects with a 1 value.

• Example: A RDD of words. The filter evaluates the first letter of each word and assigns a 1 to all words beginning with the letter a or A.

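The word example can be imitated in plain Python, with a list standing in for a RDD. The helper `rdd_filter` is hypothetical; in PySpark the corresponding transformation is the RDD's `filter` method.

```python
def rdd_filter(rdd, predicate):
    """Keep only the objects for which the predicate is true (the 1 value)."""
    return [obj for obj in rdd if predicate(obj)]

words = ["apple", "Banana", "Avocado", "cherry", "ant"]
a_words = rdd_filter(words, lambda w: w[:1] in ("a", "A"))
print(a_words)
```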

Spark map functions

• Similar to MapReduce map function, but not the same.

• Apply the map function to each object in a RDD and produce another object in a new RDD. Note that a RDD can only have one data type.
  – A MapReduce object is always a (key,value) pair.
  – A Spark object can be anything, but the same type of anything.

Spark map functions

• Example: A RDD of words. The map function turns each word w into the pair (w,1). A following reduce-by-key step then adds up the pairs for duplicate words and produces an output RDD of the form (w,c), where w is a unique word and c is the count of how many times it appeared in the input RDD.
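One way to obtain the (w,c) pairs, sketched in plain Python with lists standing in for RDDs: pair every word with 1, then sum per key. The helpers `rdd_map` and `reduce_by_key` are hypothetical; they play the roles of PySpark's `map` and `reduceByKey`.

```python
def rdd_map(rdd, f):
    """Apply f to each object: exactly one output object per input object."""
    return [f(obj) for obj in rdd]

def reduce_by_key(pairs, op):
    """Combine the values of equal keys with op; result is sorted by key."""
    acc = {}
    for k, v in pairs:
        acc[k] = op(acc[k], v) if k in acc else v
    return sorted(acc.items())

words = ["to", "be", "or", "not", "to", "be"]
pairs = rdd_map(words, lambda w: (w, 1))
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```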

Spark flatmap functions

• Identical to the MapReduce Map function except that the data is not required to be (key,value) pairs, but can be any single data type.
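A plain-Python imitation of a flatmap, where each input object may yield zero or more output objects of a single type. The helper `rdd_flat_map` is hypothetical; PySpark's counterpart is `flatMap`.

```python
from itertools import chain

def rdd_flat_map(rdd, f):
    """Apply f to each object and flatten the resulting sequences."""
    return list(chain.from_iterable(f(obj) for obj in rdd))

lines = ["map reduce", "", "spark and friends"]
words = rdd_flat_map(lines, str.split)
print(words)
```

The empty line contributes nothing, and the other lines contribute several words each.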

Spark reduce function

• In Spark, this is an action, not a transformation. A single value is returned, not another RDD.

• Example: Matrix-matrix multiplication could take as input a RDD of vectors. The reduce function takes the proper inner products and creates a new matrix, which is the end result of the action.

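The "single value, not another RDD" point can be seen with Python's own `functools.reduce` applied to a list that stands in for a RDD. This is a serial sketch only; a distributed reduce additionally needs the combining function to be associative and commutative.

```python
from functools import reduce

rdd = [3, 1, 4, 1, 5, 9]
total = reduce(lambda a, b: a + b, rdd)    # one number comes back
largest = reduce(max, rdd)
print(total, largest)
```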

Spark relational database functions

• A number of operations that are common in relational databases are implemented in Spark. The input RDDs consist of (key,value) pairs, and the result is a new RDD, also of (key,value) pairs.
• Examples:
  – The Join transformation takes (k,x) and (k,y) and produces (k,(x,y)).
  – GroupByKey is similar to SQL's group-by.
  – Many more functions.
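Both examples can be imitated on lists of (key,value) pairs. The helpers below are hypothetical plain-Python stand-ins, not Spark calls, and the data is invented.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all the values that share a key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def join(left, right):
    """For matching keys, turn (k, x) and (k, y) into (k, (x, y))."""
    r = group_by_key(right)
    return [(k, (x, y)) for k, x in left for y in r.get(k, [])]

ages = [("ann", 31), ("bob", 25)]
towns = [("ann", "Laramie"), ("ann", "Cheyenne"), ("cy", "Casper")]
joined = join(ages, towns)
print(group_by_key(towns))
print(joined)
```

Keys that appear on only one side ("bob" and "cy" here) drop out of the join.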

TensorFlow

• Another workflow system that usually works on acyclic graphs, but not always.
• Data consists of tensors, not RDDs. A tensor is a multi-dimensional matrix (called an array in many programming languages).
  – Example: A 2×2×2 tensor is represented by [ [ [1,2], [3,4] ], [ [11,12], [13,14] ] ]. In C or C++ it would be declared as int tensor[2][2][2].
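The 2×2×2 example, written with explicit nesting as a Python list of lists. The helpers `shape` and `flat_index` are hypothetical. They show the row-major layout that a C declaration `int tensor[2][2][2]` would use in memory.

```python
tensor = [[[1, 2], [3, 4]],
          [[11, 12], [13, 14]]]

def shape(t):
    """Dimensions of a regularly nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

def flat_index(i, j, k, dims):
    """Row-major offset of element [i][j][k], as in a C array."""
    _, d1, d2 = dims
    return (i * d1 + j) * d2 + k

flat = [v for plane in tensor for row in plane for v in row]
print(shape(tensor), flat, flat[flat_index(1, 0, 1, shape(tensor))])
```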

TensorFlow

• A tensor can be thought of as a restricted form of a RDD.
• TensorFlow has built-in functions to
  – Do linear algebra, e.g., C=tensorflow.matmul(A,B).
  – Create a tensor from input training data in one step that can be used in machine learning applications.

Quick summary

• MapReduce is a distributed mergesort, usually using disk files as intermediaries.
  – Replace mergesort with a fast sorting algorithm in each Reduce task.
• No reason to restrict to slow disk files if all fits in the global memory of the compute cluster.
  – Use MPI or OpenMP communication techniques from traditional supercomputing.

Quick summary

• Workflow systems provide a logical extension to MapReduce.
  – Learning TensorFlow is worthwhile and very useful in many fields, particularly in robotics and machine learning.
  – Spark performance is degraded by writing function output to the disk file system in large chunks.
  – While based on acyclic graphs, general graphs are usually implemented to allow for looping.