Comp6611 Course LectureBig data applications
Yang PENG Network and System LabCSE, HKUST
Monday, March 11, [email protected]
Material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under the Creative Commons Attribution 3.0 License)
Today's Topics
MapReduce
- Background information/overview
- Map and Reduce -- from a programmer's perspective
- Architecture and workflow -- a global overview
- Virtues and defects
- Improvement
Spark
Background
MapReduce
Spark
MapReduce Background
Before MapReduce, large-scale data processing was difficult:
- Managing parallelization and distribution
- Application development was tedious and hard to debug
- Resource scheduling and load-balancing
- Data storage and distribution
  - Distributed file system: "Moving computation is cheaper than moving data."
- Fault/crash tolerance
- Scalability
Does the "divide and conquer" paradigm still work for big data?
[Figure: a work item is partitioned into parts w1, w2, w3, processed by parallel workers, and the partial results are combined into the final result.]
Programming Model
- Opportunity: design a software abstraction that undertakes the divide and conquer and reduces programmers' workload for
  - resource management
  - task scheduling
  - distributed synchronization and communication
- Functional programming, which has a long history, provides higher-order functions that support divide and conquer:
  - Map: do something to everything in a list
  - Fold: combine the results of a list in some way
[Figure: the abstraction layer sits between the application and the cluster of computers.]
Map
Map is a higher-order function. How map works:
- The function is applied to every element in a list
- The result is a new list
[Figure: f applied to each element of the list.]
Fold
Fold is also a higher-order function. How fold works:
- The accumulator is set to an initial value
- The function is applied to a list element and the accumulator
- The result is stored in the accumulator
- This repeats for every item in the list
- The result is the final value in the accumulator
[Figure: f folds each element into the accumulator, from the initial value to the final value.]
Map/Fold in Action
Simple map example:
  (map (lambda (x) (* x x)) '(1 2 3 4 5))  => '(1 4 9 16 25)
Fold examples:
  (fold + 0 '(1 2 3 4 5))  => 15
  (fold * 1 '(1 2 3 4 5))  => 120
Sum of squares:
  (define (sum-of-squares v)
    (fold + 0 (map (lambda (x) (* x x)) v)))
  (sum-of-squares '(1 2 3 4 5))  => 55
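The same map/fold pattern can be sketched in Python, as a minimal translation of the Scheme examples above (`fold` corresponds to `functools.reduce` with an initial accumulator):

```python
# map corresponds to Python's map(); fold corresponds to functools.reduce()
# with an explicit initial value for the accumulator.
from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))          # [1, 4, 9, 16, 25]
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0)     # 15
product = reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1)   # 120

def sum_of_squares(v):
    # Compose the two: map each element, then fold the results together.
    return reduce(lambda acc, x: acc + x, map(lambda x: x * x, v), 0)

print(sum_of_squares([1, 2, 3, 4, 5]))  # 55
```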
MapReduce
Programmers specify two functions:
  map (k1, v1) → list(k2, v2)
  reduce (k2, list(v2)) → list(v2)

function map(String name, String document):
  // K1 name: document name
  // V1 document: document contents
  for each word w in document:
    emit(w, 1)

function reduce(String word, Iterator partialCounts):
  // K2 word: a word
  // list(V2) partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit(word, sum)
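The pseudocode above can be simulated in a single Python process; a minimal sketch, where `run_job` and an in-memory dict stand in for the framework's distributed shuffle:

```python
# Single-process sketch of MapReduce WordCount: the "shuffle" that
# aggregates values by key is simulated with a defaultdict.
from collections import defaultdict

def map_fn(name, document):
    # K1 = document name, V1 = contents; emit (word, 1) pairs
    return [(w, 1) for w in document.split()]

def reduce_fn(word, partial_counts):
    return (word, sum(partial_counts))

def run_job(documents):
    # Map phase
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    # Shuffle phase: aggregate values by key (the barrier in the slides)
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_job({"d1": "big data big compute", "d2": "big data"})
print(counts)  # {'big': 3, 'data': 2, 'compute': 1}
```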
An implementation of WordCount
It's just divide and conquer!
[Figure: initial key/value pairs from the data store feed parallel map tasks; each map emits (k1, values…), (k2, values…), (k3, values…); a barrier aggregates values by key; reduce tasks then compute the final values for k1, k2, and k3.]
Behind the scenes…
Programming interface
- input reader
- Map function
- partition function
  - Given a key and the number of reducers, returns the index of the desired reducer.
  - Used for load-balance, e.g. a hash function
- compare function
  - Used to sort the map output; provides the ordering guarantee
- Reduce function
- output writer
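A minimal sketch of such a partition function, hash-based as in Hadoop's default `HashPartitioner` (the function name here is illustrative):

```python
# Hash partition function: given a key and the number of reducers,
# return the index of the reducer that will receive this key.
import zlib

def hash_partition(key, num_reducers):
    # The same key always maps to the same reducer, so all values for a
    # key meet in one place; a good hash spreads keys evenly for load
    # balance. crc32 is used here only to keep the sketch deterministic.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```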
Output of a Hadoop job
ypeng@vm115:~/hadoop-0.20.2$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/hduser/wordcount/15G-enwiki-input /user/hduser/wordcount/15G-enwiki-output
13/01/16 07:00:48 INFO input.FileInputFormat: Total input paths to process : 1
13/01/16 07:00:49 INFO mapred.JobClient: Running job: job_201301160607_0003
13/01/16 07:00:50 INFO mapred.JobClient: map 0% reduce 0%
.........................
13/01/16 07:01:50 INFO mapred.JobClient: map 18% reduce 0%
13/01/16 07:01:52 INFO mapred.JobClient: map 19% reduce 0%
13/01/16 07:02:06 INFO mapred.JobClient: map 20% reduce 0%
13/01/16 07:02:08 INFO mapred.JobClient: map 20% reduce 1%
13/01/16 07:02:10 INFO mapred.JobClient: map 20% reduce 2%
.........................
13/01/16 07:06:41 INFO mapred.JobClient: map 99% reduce 32%
13/01/16 07:06:47 INFO mapred.JobClient: map 100% reduce 33%
13/01/16 07:06:55 INFO mapred.JobClient: map 100% reduce 39%
.........................
13/01/16 07:07:21 INFO mapred.JobClient: map 100% reduce 99%
13/01/16 07:07:31 INFO mapred.JobClient: map 100% reduce 100%
13/01/16 07:07:43 INFO mapred.JobClient: Job complete: job_201301160607_0003
(Continued on the next slide.)
Progress
Counters in a Hadoop job
13/01/16 07:07:43 INFO mapred.JobClient: Counters: 18
13/01/16 07:07:43 INFO mapred.JobClient: Job Counters
13/01/16 07:07:43 INFO mapred.JobClient: Launched reduce tasks=24
13/01/16 07:07:43 INFO mapred.JobClient: Rack-local map tasks=17
13/01/16 07:07:43 INFO mapred.JobClient: Launched map tasks=249
13/01/16 07:07:43 INFO mapred.JobClient: Data-local map tasks=203
13/01/16 07:07:43 INFO mapred.JobClient: FileSystemCounters
13/01/16 07:07:43 INFO mapred.JobClient: FILE_BYTES_READ=12023025990
13/01/16 07:07:43 INFO mapred.JobClient: HDFS_BYTES_READ=15492905740
13/01/16 07:07:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=14330761040
13/01/16 07:07:43 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=752814339
13/01/16 07:07:43 INFO mapred.JobClient: Map-Reduce Framework
13/01/16 07:07:43 INFO mapred.JobClient: Reduce input groups=39698527
13/01/16 07:07:43 INFO mapred.JobClient: Combine output records=508662829
13/01/16 07:07:43 INFO mapred.JobClient: Map input records=279422018
13/01/16 07:07:43 INFO mapred.JobClient: Reduce shuffle bytes=2647359503
13/01/16 07:07:43 INFO mapred.JobClient: Reduce output records=39698527
13/01/16 07:07:43 INFO mapred.JobClient: Spilled Records=828280813
13/01/16 07:07:43 INFO mapred.JobClient: Map output bytes=24932976267
13/01/16 07:07:43 INFO mapred.JobClient: Combine input records=2813475352
13/01/16 07:07:43 INFO mapred.JobClient: Map output records=2376465967
13/01/16 07:07:43 INFO mapred.JobClient: Reduce input records=71653444
Summary of counters in job
Master in MapReduce
Resource Management
- Maintains the current resource usage of each Worker (CPU, RAM, used & free disk space, etc.)
- Examines workers for failure periodically.
Task Scheduling
- "Moving computation is cheaper than moving data."
- Map and reduce tasks are assigned to idle Workers.
- Tasks on failed workers are re-scheduled.
- When a job is close to the end, it launches backup tasks.
Counters
- provide interactive job progress.
- store the occurrences of various events.
- are helpful for performance tuning.
Data-oriented Map scheduling
[Figure: six workers (Workers 1-3 on Rack 1, Workers 4-6 on Rack 2) behind two switches, each storing replicas of input splits 1-5. The master launches each map on a worker holding its split: map 1 on Worker 3, map 2 on Worker 4, map 3 on Worker 1, map 4 on Worker 2, map 5 on Worker 5.]
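The locality preference sketched above can be written out as a small scheduling routine. This is a sketch only; the function and worker names are illustrative, not from an actual implementation:

```python
# Data-locality-aware map scheduling: prefer an idle worker that stores
# the split (data-local), else one on the same rack (rack-local),
# else any idle worker (non-local).
def schedule_map(split_replicas, racks, idle_workers):
    # split_replicas: set of workers holding a replica of this split
    # racks: mapping worker -> rack id
    for w in idle_workers:
        if w in split_replicas:                 # data-local
            return w
    replica_racks = {racks[w] for w in split_replicas}
    for w in idle_workers:
        if racks[w] in replica_racks:           # rack-local
            return w
    return idle_workers[0] if idle_workers else None  # non-local

racks = {"w1": 1, "w2": 1, "w3": 1, "w4": 2, "w5": 2, "w6": 2}
print(schedule_map({"w3", "w5"}, racks, ["w2", "w3", "w6"]))  # w3 (data-local)
```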
Data flow in MapReduce jobs
[Figure: a Mapper writes output to a circular in-memory buffer, spills it to disk, and merges the spills (optionally through a Combiner) into intermediate files on disk; a Reducer fetches intermediate files from this Mapper and the other Mappers, alongside the other Reducers fetching their own partitions.]
[Figure: a map task's input split read from GFS may be local, rack-local, or non-local.]
Map internal
- The map phase reads the task's input split from GFS, parses it into records (key/value pairs), and applies the map function to each record.
- After the map function has been applied to each record, the commit phase registers the final output with the Master, which will tell the reduce tasks the location of the map output.
Reduce internal
- The shuffle phase fetches the reduce task's input data.
- The sort phase groups records with the same key together.
- The reduce phase applies the user-defined reduce function to each key and its corresponding list of values.
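The three phases above can be sketched in a few lines of Python (illustrative only; `reduce_task` stands in for the framework's reduce side):

```python
# Reduce side of a MapReduce job: sort fetched records so equal keys are
# adjacent, then group and apply the user reduce function per key.
from itertools import groupby
from operator import itemgetter

def reduce_task(fetched, reduce_fn):
    # fetched: (key, value) pairs pulled from all map outputs (shuffle phase)
    ordered = sorted(fetched, key=itemgetter(0))            # sort phase
    return [reduce_fn(k, [v for _, v in group])             # reduce phase
            for k, group in groupby(ordered, key=itemgetter(0))]

pairs = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(reduce_task(pairs, lambda k, vs: (k, sum(vs))))  # [('a', 3), ('b', 2)]
```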
Backup Tasks
- There are barriers in a MapReduce job:
  - No reduce function executes until all maps finish.
  - The job cannot complete until all reduces finish.
- The execution time of a job is severely lengthened if a task is blocked.
- The Master schedules backup/speculative tasks for unfinished ones when the job is close to the end.
- A job takes 44% longer if backup tasks are disabled.
[Figure: map tasks run in parallel, then reduce tasks; the job completes only after the last task finishes.]
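The backup-task idea can be sketched as a small selection routine. The thresholds and names here are illustrative assumptions, not values from an actual scheduler:

```python
# Speculative execution sketch: near the end of the job, pick the
# slowest still-running tasks as candidates for backup copies.
def pick_backup_tasks(progress, done_fraction_needed=0.9, slow_below=0.5):
    # progress: task id -> fraction complete (1.0 means finished)
    finished = sum(1 for p in progress.values() if p >= 1.0)
    if finished / len(progress) < done_fraction_needed:
        return []                       # job not yet close to the end
    return [t for t, p in progress.items() if p < slow_below]

tasks = {f"m{i}": 1.0 for i in range(9)}
tasks["m9"] = 0.2                       # one straggler
print(pick_backup_tasks(tasks))         # ['m9']
```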
Virtues and defects of MR
Virtues
- Handles large-scale data
- Programming friendly: implicit parallelism
- Data-locality
- Fault/crash tolerance
- Scalability
- Open-source with a good ecosystem[1]
Defects
- Bad for iterative ML algorithms
- Not sure
[1] http://docs.hortonworks.com/CURRENT/index.htm#About_Hortonworks_Data_Platform/Understanding_Hadoop_Ecosystem.htm
Network traffic in MapReduce
1. A Map may read its split from a remote ChunkServer
2. A Reduce copies the output of the Maps
3. Reduce output is written to GFS
Disk R/W in MapReduce
1. A ChunkServer reads a local block for remote split fetching
2. Intermediate results are spilled to disk
3. The copied partition is written to local disk
4. The result output is written to the local ChunkServer
5. The result output is written to a remote ChunkServer
Iterative MapReduce
Performing a graph algorithm using MapReduce.
Motivation of Spark
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)
Programming Model
- Fine-grained: the computing outputs of every iteration are distributed and stored to stable storage
- Coarse-grained: only log the transformations used to build a dataset (i.e. its lineage)
Resilient distributed datasets (RDDs)
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
- Can be cached for efficient reuse
Actions on RDDs
- count, reduce, collect, save, …
Spark Operations
Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program):
  collect, reduce, count, save, lookupKey
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // Base RDD
errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count     // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the Driver sends tasks to three Workers, each reading one block of the file (Block 1-3) and holding a cache (Cache 1-3); results flow back to the Driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Figure: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
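Lineage-based recovery can be sketched as replaying the recorded transformations over the base data; a minimal illustration (the helper name and sample data are assumptions):

```python
# Rebuild a lost partition by replaying its recorded lineage, mirroring
# the filter-then-map lineage in the example above.
def rebuild_partition(base_partition, lineage):
    part = base_partition
    for transform in lineage:          # replay transformations in order
        part = transform(part)
    return part

lineage = [
    lambda p: [line for line in p if line.startswith("ERROR")],  # filter
    lambda p: [line.split("\t")[2] for line in p],               # map
]
base = ["INFO\tx\tok", "ERROR\t1\tfoo", "ERROR\t2\tbar"]
print(rebuild_partition(base, lineage))  # ['foo', 'bar']
```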
Example: Logistic Regression
Goal: find the best line separating two sets of points
[Figure: a scatter of + and - points; a random initial line iteratively converges to the target separating line.]
Example: Logistic Regression

// Keep the variable "data" in memory
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
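The Scala loop above can be sketched in plain Python without Spark: the per-point gradient terms are the "map" step and their sum is the "reduce" step. The data and names here are illustrative:

```python
# Batch gradient descent for logistic regression, mirroring the Scala
# example: gradient = sum over points of (1/(1+exp(-y*(w.x))) - 1)*y*x.
import math
import random

random.seed(0)  # deterministic for the example

def train(points, dims, iterations):
    # points: list of (x_vector, label) with label in {-1, +1}
    w = [random.random() for _ in range(dims)]
    for _ in range(iterations):
        gradient = [0.0] * dims
        for x, y in points:                        # the "map" step
            dot = sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            for j in range(dims):                  # the "reduce" (+) step
                gradient[j] += scale * x[j]
        w = [wi - gi for wi, gi in zip(w, gradient)]   # w -= gradient
    return w

pts = [([1.0, 2.0], 1), ([-1.0, -2.0], -1), ([2.0, 1.0], 1), ([-2.0, -1.0], -1)]
w = train(pts, 2, 50)
# After training, the sign of w.x should classify every point correctly.
correct = all((sum(wi * xi for wi, xi in zip(w, x)) > 0) == (y > 0) for x, y in pts)
print(correct)
```

Spark's advantage here is exactly the `cache()` call in the Scala version: the dataset stays in memory across iterations instead of being re-read from disk each pass, as in MapReduce.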
Logistic Regression Performance
[Figure: running time (s) vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations.]
Spark Programming Interface (e.g. PageRank)

Representing RDDs

Spark Scheduler
- Dryad-like DAGs
- Pipelines functions within a stage
- Cache-aware work reuse & locality
- Partitioning-aware to avoid shuffles
[Figure: DAG of stages; shaded boxes denote cached data partitions.]
Behavior with Not Enough RAM
[Figure: iteration time (s) vs % of working set in memory: cache disabled 68.8, 25% cached 58.1, 50% cached 40.7, 75% cached 29.7, fully cached 11.5.]
Fault Recovery Results
[Figure: iteration time (s) over iterations 1-10, with a failure compared against no failure: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59.]
Conclusion
- Both MapReduce and Spark are excellent big data frameworks: scalable, fault-tolerant, and programming friendly.
- In particular, Spark provides a more effective method for iterative computing jobs.
QUESTIONS? Thanks!