
Page 1: Distributed Processing Frameworks

Distributed Processing Frameworks

Author: Antonios Katsarakis

Page 2: Distributed Processing Frameworks

Literature

• MapReduce: Simplified Data Processing on Large Clusters

J. Dean et al. - OSDI '04.

• Spark: Cluster Computing with Working Sets

M. Zaharia et al. - HotCloud '10.

Page 3: Distributed Processing Frameworks

Why Big Data?

• More data to process: IoT, smart devices, web applications

- About 2.3 trillion GB of new data is generated every day

• The growth of CPU performance cannot keep up with the increasing amount of data to process

• This leads us to the Big Data era

- Big data: data sets so large that the processing power of a single machine is inadequate to deal with them

• We need to find ways to process these massive amounts of data

Page 4: Distributed Processing Frameworks

MapReduce

• Proposed by Jeff Dean et al. (Google), 2004

- Cited more than 18,000 times

• A programming model that enables the parallel and distributed processing of large data sets

• Typical MapReduce program (a code sketch follows the figure below):

- Read the input data

- Map: filtering of the data

- Shuffle and sort

- Reduce: summary operation on the data

- Write the results

[Figure: MapReduce data flow. The input data is split into thirds; three Map tasks each produce intermediate data, which is shuffled to two Reduce tasks that write the output data.]
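
To make the five steps concrete, here is a minimal word-count sketch of the model in plain Scala. It imitates the map, shuffle-and-sort, and reduce phases on a single machine; the names and structure are illustrative, not Hadoop's or Google's actual API.

// Minimal word-count sketch of the MapReduce model in plain Scala.
// Illustrative only; not Hadoop's or Google's actual API.
object WordCountSketch {
  // Map: emit an intermediate (word, 1) pair for every word.
  def map(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle and sort: group the intermediate pairs by key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2) }

  // Reduce: a summary operation per key (here, summing the counts).
  def reduce(word: String, counts: Seq[Int]): (String, Int) =
    word -> counts.sum

  def main(args: Array[String]): Unit = {
    val input = Seq("the quick brown fox", "the lazy dog")      // read data
    val intermediate = input.flatMap(map)                       // map
    val grouped = shuffle(intermediate)                         // shuffle and sort
    val output = grouped.map { case (w, cs) => reduce(w, cs) }  // reduce
    output.foreach(println)                                     // write results
  }
}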

Page 5: Distributed Processing Frameworks

Critical Reflection

• Outcome:

- Novel idea that led to a whole new era of distributed systems

- Big impact on industry (Hadoop MapReduce)

- Lowered the cost of computations

• Limitations:

- Restricted to batch processing

- It only supports map and reduce operations

- The shuffling phase introduces overheads

Page 6: Distributed Processing Frameworks

Spark

• Proposed by Matei Zaharia et al., 2010

- Cited about 1,500 times

• Another programming model, based on higher-order functions that execute user-defined functions in parallel

• Aims to replace MapReduce in industry

• Main ideas:

- Represent computations as DAGs

- Cache datasets in memory (sketched below)
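
To show the caching idea concretely, here is a small sketch against Spark's RDD API. The app name, the input file logs.txt, and the ERROR/FATAL filter strings are hypothetical; textFile(), filter(), cache(), and count() are real Spark calls.

import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))
    // cache() marks the RDD to be kept in memory once it is first computed
    // ("logs.txt" is a hypothetical input file).
    val errors = sc.textFile("logs.txt").filter(_.contains("ERROR")).cache()
    val total = errors.count()                 // first action: computes and caches
    val fatal = errors.filter(_.contains("FATAL")).count() // reuses the cached RDD
    println(s"errors: $total, fatal: $fatal")
    sc.stop()
  }
}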

Page 7: Distributed Processing Frameworks

Spark Model

• Resilient Distributed Datasets (RDDs): immutable collections of objects spread across a cluster

• Operations over RDDs:

1. Transformations: lazy operators that create new RDDs

2. Actions: launch a computation on an RDD

var count = readFile(…)
  .map(…)
  .filter(…)
  .reduceByKey()
  .count()

[Figure: Job (RDD) graph for the pipeline above. The input file is split into chunks (RDD0); the transformations create pipelined RDDs (RDD1-RDD4) grouped into two stages, and the final action produces the result.]
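
A runnable variant of the chain above, assuming a word-count pipeline since the slide elides the actual functions; the input path is a placeholder. The transformations only build the job graph, and nothing executes until the final action.

import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))
    // Transformations are lazy: each call below only adds a node to the DAG.
    val pairs = sc.textFile("input.txt")     // file split into chunks (RDD0)
      .flatMap(_.split("\\s+"))              // pipelined within the same stage
      .filter(_.nonEmpty)                    // still no data movement
      .map(w => (w, 1))
      .reduceByKey(_ + _)                    // shuffle boundary: a new stage begins
    // The action launches the computation of the whole graph.
    val distinctWords = pairs.count()
    println(s"distinct words: $distinctWords")
    sc.stop()
  }
}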

Page 8: Distributed Processing Frameworks

Critical Reflection

• Benefits:

- High-level API

- Supports more application types

- Performance optimizations

• Limitations:

- Detailed performance analysis at the thread level is hard

- Multipurpose application support makes performance improvements and tuning really challenging

- The shuffling phase introduces overheads

Page 9: Distributed Processing Frameworks

Conclusion

• Clusters provide the computational power to process Big Data

• MapReduce allows developers to build programs for clusters

• Spark tries to overcome the limitations of MapReduce

• These systems introduce many challenges in terms of measuring and improving their performance