
Study Notes: Apache Spark


Page 1: Study Notes: Apache Spark

Summary of Apache Spark

Original Papers:
1. “Spark: Cluster Computing with Working Sets” by Matei Zaharia, et al. HotCloud 2010.
2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” by Matei Zaharia, et al. NSDI 2012.

Page 2: Study Notes: Apache Spark

Motivation
• MapReduce is great, but there are applications where you iterate on the same set of data, e.g.,

for many iterations { old_data = new_data; new_data = func(old_data) }

• Problem: The body of each iteration can be described as a MapReduce task, where the inputs and final outputs are GFS files. There is redundant work in storing the output data to GFS and then reading it back in the next iteration.
• Idea: Provide a mode to cache the final outputs in memory if possible (a small sketch of this pattern follows below).
  o Challenge: if a machine crashes, we lose the cached outputs; can we recover?
  o Solution: store the lineage of the data so that it can be reconstructed as needed (e.g., if it is lost or there is insufficient memory).
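
A minimal sketch of this iterative pattern in Spark (not from the original slides; the step function and data are made up for illustration):

// Keep the working set in memory across iterations instead of
// writing it back to distributed storage after every step.
var data = sc.parallelize(1 to 1000000).map(_.toDouble).cache()
for (i <- 1 to 10) {
  // stand-in for the per-iteration computation func(old_data)
  data = data.map(x => x * 0.5 + 1.0).cache()
}
data.count()  // action that forces evaluation of the final iteration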

Page 3: Study Notes: Apache Spark

Motivation
• Spark’s goal was to generalize MapReduce to support new apps within the same engine
  o MapReduce problems can be expressed in Spark too.
  o Where Spark shines and MapReduce does not: applications that need to reuse a working set of data across multiple parallel operations.
• Two reasonably small additions allowed the previous specialized models to be expressed within Spark:
  o fast data sharing
  o general DAGs

Page 4: Study Notes: Apache Spark

Motivation
Some key points about Spark:
• handles batch, interactive, and real-time within a single framework
• native integration with Java, Python, and Scala
  o has APIs written in these languages
• programming at a higher level of abstraction
• more general: map/reduce is just one set of supported constructs

Page 5: Study Notes: Apache Spark

Use Example
• We’ll run Spark’s interactive shell…

./bin/spark-shell

• Then from the “scala>” REPL prompt, let’s create some data…

val data = 1 to 10000

• Create an RDD based on that data…

val distData = sc.parallelize(data)

• Then use a filter to select values less than 10…

distData.filter(_ < 10).collect()

Page 6: Study Notes: Apache Spark

Resilient Distributed Datasets (RDD)

• Represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• RDDs can only be created through deterministic operations (aka transformations), as sketched below, on:
  o either data in stable storage:
    - any file stored in HDFS
    - or other storage systems supported by Hadoop
  o or other RDDs
• A program cannot reference an RDD that it cannot reconstruct after a failure.
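
A two-line sketch of both creation paths (the fuller example on Page 16 builds on this):

// from data in stable storage (HDFS)
val lines = sc.textFile("hdfs://...")
// from another RDD, via a deterministic transformation
val errors = lines.filter(_.startsWith("ERROR"))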

Page 7: Study Notes: Apache Spark

Resilient Distributed Datasets (RDD)

• Two types of operations on RDDs: transformations and actions.
  o Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter).
    - Transformations create a new dataset from an existing one.
    - Transformations are lazy (not computed immediately); instead, each RDD remembers the transformations applied to some base dataset.
    - The transformed RDD gets recomputed each time an action is run on it (the default).
  o They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system.
• However, an RDD can be persisted into storage in memory or on disk (see the sketch after this list).
  o Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset, often making future actions more than 10x faster.
  o The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
  o Note that by default, the base RDD (the original data in stable storage) is not loaded into RAM, because the useful data after transformation might be only a small fraction (small enough to fit into memory).
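
A minimal sketch of laziness and persistence (the HDFS path is a placeholder, as elsewhere in these notes):

// Nothing is computed yet: transformations are lazy
val lines  = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))

// Ask Spark to keep the transformed RDD in memory once it is computed
errors.persist()  // same as errors.cache()

// The first action triggers computation and populates the cache...
errors.count()
// ...later actions on the same RDD reuse the in-memory partitions
errors.filter(_.contains("timeout")).count()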

Page 9: Study Notes: Apache Spark

RDD Implementation
Common interface of each RDD (sketched below):
• A set of partitions: atomic pieces of the dataset.
• A set of dependencies on parent RDDs: one RDD can have multiple parents.
  o narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
  o wide dependencies: multiple child partitions may depend on one parent partition; these require a shuffle operation
• A function for computing the dataset based on its parents.
• Metadata about its partitioning scheme.
• Metadata about its data placement, e.g., preferredLocations(p) returns a list of nodes where partition p can be accessed faster due to data locality.
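
A hedged Scala sketch of that common interface (SimpleRDD is a made-up name; Partition, Dependency, and Partitioner are Spark’s actual classes, but the method shapes here only approximate Spark’s internals):

import org.apache.spark.{Dependency, Partition, Partitioner}

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                      // atomic pieces of the dataset
  def dependencies: Seq[Dependency[_]]                // parent RDDs (narrow or wide)
  def compute(p: Partition): Iterator[T]              // build one partition from its parents
  def partitioner: Option[Partitioner]                // partitioning-scheme metadata
  def preferredLocations(p: Partition): Seq[String]   // nodes where p is local
}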

Page 10: Study Notes: Apache Spark

Narrow vs Wide Dependencies
• Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.
• Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and the re-computation can be done in parallel.
(A sketch of operations that create each kind of dependency follows below.)
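
A minimal sketch of operations that typically produce each dependency type (the exact dependency can vary with partitioning, so treat this as an illustration only):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// narrow: each output partition depends on exactly one input partition
val doubled = pairs.mapValues(_ * 2)

// wide: output partitions need data from many input partitions, so a shuffle is required
val sums = pairs.reduceByKey(_ + _)
val grouped = pairs.groupByKey()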

Page 11: Study Notes: Apache Spark

Job Scheduling on RDDs
• Similar to Dryad, but takes data locality into account.
• When the user runs an action, the scheduler builds a DAG of stages to execute. Each stage contains as many pipelined transformations with narrow dependencies as possible.
• Stage boundaries are the shuffle operations required for wide dependencies.
• The scheduler launches tasks to compute missing partitions from each stage. Tasks are assigned to machines based on data locality using delay scheduling.
• For wide dependencies, intermediate records are materialized on the nodes holding parent partitions (similar to the intermediate map outputs of MapReduce) to simplify fault recovery.
(A quick way to inspect the stage/lineage structure Spark plans is shown below.)
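
A small illustration: RDD.toDebugString prints an RDD’s lineage, with indentation marking the shuffle (stage) boundaries. The word-count pipeline is just an assumed example:

val counts = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))     // narrow: pipelined within one stage
  .map(word => (word, 1))    // narrow: same stage
  .reduceByKey(_ + _)        // wide: shuffle, so a new stage begins here

// prints the lineage; indentation changes at the shuffle boundary
println(counts.toDebugString)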

Page 12: Study Notes: Apache Spark

Checkpointing
• Although lineage can always be used to recover RDDs after a failure, checkpointing can be helpful for RDDs with long lineage chains containing wide dependencies (see the sketch below).
• For RDDs with narrow dependencies on data in stable storage, checkpointing is not worthwhile: reconstruction can be done in parallel for these RDDs, at a fraction of the cost of replicating the whole RDD.
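
A minimal checkpointing sketch using Spark’s API (the checkpoint directory and the iterative job are placeholders):

// directory in reliable storage where checkpoint data will be written
sc.setCheckpointDir("hdfs://...")

var ranks = sc.parallelize(1 to 1000000).map(x => (x % 100, 1.0))
for (i <- 1 to 20) {
  // each iteration adds a shuffle, so the lineage chain keeps growing
  ranks = ranks.reduceByKey(_ + _).mapValues(_ / 2.0)
}

// truncate the long, shuffle-heavy lineage by saving the RDD to stable storage
ranks.checkpoint()
ranks.count()  // the checkpoint is written when the first action runs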

Page 13: Study Notes: Apache Spark

RDD vs Distributed Shared Memory (DSM)

• Previous frameworks that support data reuse, e.g., Pregel and Piccolo:
  o perform data sharing implicitly for these patterns
  o are specialized frameworks that do not provide abstractions for more general reuse
  o have programming interfaces that support fine-grained updates (reads and writes to each memory location), so fault tolerance requires expensive replication of data across machines or logging of updates across machines
• RDD:
  o only coarse-grained transformations (e.g., map, filter, and join) that apply the same operation to many data items; note that reads on RDDs can still be fine-grained
  o fault tolerance only requires logging the transformations used to build a dataset instead of the actual data
  o RDDs are not suitable for applications that make asynchronous fine-grained updates to shared state

Page 14: Study Notes: Apache Spark

RDD vs Distributed Shared Memory (DSM)

• Other benefits of RDDs (see the persistence sketch below):
  o RDDs are immutable. A system can mitigate slow nodes (stragglers) by running backup copies of slow tasks, as in MapReduce.
  o In bulk operations, a runtime can schedule tasks based on data locality to improve performance.
  o RDDs degrade gracefully when there is not enough memory to store them. An LRU eviction policy is used at the level of RDDs: a partition from the least recently accessed RDD is evicted to make room for a newly computed RDD partition. This is user-configurable via the “persistence priority” for each RDD.
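
A minimal sketch of controlling how an RDD is persisted via Spark’s StorageLevel (the memory-and-disk choice here is just an example, not the paper’s “persistence priority” mechanism itself):

import org.apache.spark.storage.StorageLevel

val messages = sc.textFile("hdfs://...").filter(_.startsWith("ERROR"))

// keep partitions in memory, spilling to disk when memory runs short
messages.persist(StorageLevel.MEMORY_AND_DISK)
messages.count()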

Page 15: Study Notes: Apache Spark

Debugging RDDs
• One can reconstruct the RDDs later from the lineage and let the user query them interactively.
• One can re-run any task from the job in a single-process debugger by recomputing the RDD partitions it depends on.
• Similar to replay debuggers, but without the capturing/recording overhead.

Page 16: Study Notes: Apache Spark

RDD Example

// load error messages from a log into memory; then interactively search for
// various patterns

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()

Pages 17–19: Study Notes: Apache Spark

(These pages repeat the same example, highlighting in turn which names are RDDs (lines, errors, messages), which calls are transformations (filter, map), and which actions (count) return values to the driver.)

Page 20: Study Notes: Apache Spark

Shared Variables
• Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks (see the sketch below).
  o For example, to give every node a copy of a large input dataset efficiently.
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
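
A minimal broadcast sketch (the lookup table and log format are assumed for illustration):

// ship the lookup table to each node once, instead of with every task
val severity = Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1)
val severityBc = sc.broadcast(severity)

val levels = sc.textFile("hdfs://...")
  .map(_.split("\t")(0))
  .map(tag => severityBc.value.getOrElse(tag, 0))  // workers read, never write

levels.reduce(_ + _)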

Page 21: Study Notes: Apache Spark

Shared Variables
• Accumulators are variables that can only be “added” to through an associative operation.
  o Used to implement counters and sums efficiently in parallel.
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend them for new types.
• Only the driver program can read an accumulator’s value, not the workers (see the sketch below).
  o Each accumulator is given a unique ID upon creation.
  o Each worker creates a separate copy of the accumulator.
  o Workers send messages to the driver about their updates to the accumulator.
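
A minimal accumulator sketch, assuming Spark 2.x’s sc.longAccumulator (older releases used sc.accumulator; the log format is made up):

// driver-side counter: workers only add to it, only the driver reads it
val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.textFile("hdfs://...").flatMap { line =>
  val fields = line.split("\t")
  if (fields.length < 2) { badRecords.add(1L); None }
  else Some(fields(1))
}

parsed.count()              // run an action so the worker-side adds actually happen
println(badRecords.value)   // read the total on the driver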