
COSC 6339

Big Data Analytics

Introduction to Spark

Edgar Gabriel

Spring 2015

What is SPARK?

• In-Memory Cluster Computing for Big Data Applications

• Fixes the weaknesses of MapReduce

– Iterative applications

– Streaming data processing

– Keeping data in memory across different functions

• Spark works across many environments

– Standalone

– Hadoop

– Mesos

• Spark supports accessing data from diverse sources (HDFS, HBase, Cassandra, …)


What is SPARK (II)

• Three modes of execution

– Spark shell

– Spark scripts

– Spark code

• APIs are defined for multiple languages

– Scala

– Python

– Java

A couple of words on Scala

• Object-oriented language: everything is an object and every operation is a method call.

• Scala is also a functional language

– Functions are first-class values

– Functions can be passed as arguments to other functions

– Functions should ideally be free of side effects

– Functions can be defined inside other functions (see the sketch below)

• Scala runs on the JVM

– Java and Scala classes can be freely mixed
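A minimal sketch of these functional features in plain Scala (the names here are illustrative, not from the slides):

object FunctionDemo {
  // functions are first-class values: store one in a val
  val square: Int => Int = x => x * x

  // a function passed as an argument to another function
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  def main(args: Array[String]): Unit = {
    // a function defined inside another function
    def greet(name: String): String = "Hello, " + name

    println(applyTwice(square, 3)) // square(square(3)) = 81
    println(greet("Spark"))        // Hello, Spark
  }
}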


A couple of words on Scala

• Scala supports type inference, i.e. automatic deduction of the data type of an expression

• val: ‘value’, i.e. an immutable object whose content cannot be changed after initial assignment

• var: ‘variable’, a mutable object
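A short illustration in the Scala shell (values chosen for illustration):

scala> val x = 42        // type Int inferred; x is immutable
x: Int = 42

scala> x = 43
<console>: error: reassignment to val

scala> var y = "hello"   // type String inferred; y is mutable
y: String = hello

scala> y = "world"       // fine: a var can be reassigned
y: String = world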

Spark Essentials

• A Spark program has to create a SparkContext object, which tells Spark how to access a cluster

• This is done automatically in the shell for Scala or Python: the context is accessible through the sc variable

• Standalone programs must use a constructor to instantiate a new SparkContext

scala> sc
res: spark.SparkContext = spark.SparkContext@470d1f30

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
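A minimal sketch of constructing a SparkContext in a standalone Spark 1.x program (the application name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]) {
    // describe the application and where it should run
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // ... use sc to create and transform RDDs ...
    sc.stop()
  }
}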


Spark Essentials

• The master parameter for a SparkContext determines which cluster to use, e.g.

shark:~> spark-shell --master yarn-cluster

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf

SPARK cluster utilization

1. The SparkContext (driver) connects to a cluster manager to allocate resources across applications

2. Spark acquires executors on cluster nodes: processes that run compute tasks and cache data

3. Spark sends the application code to the executors

4. The SparkContext sends tasks for the executors to run

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf


SPARK master parameter

Master URL         Meaning

local              Run Spark locally with one worker thread (i.e. no parallelism at all).

local[K]           Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

spark://HOST:PORT  Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.

mesos://HOST:PORT  Connect to the given Mesos cluster. The port must be whichever one your cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://....

yarn-cluster       Connect to a YARN cluster in cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
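For illustration, launching the shell against different masters might look like this (the host name is a placeholder):

shark:~> spark-shell --master local[4]
shark:~> spark-shell --master spark://master-node:7077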

Programming Model

• Resilient distributed datasets (RDDs)

– Immutable collections partitioned across a cluster that can be rebuilt if a partition is lost

– Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)

• Two types of RDDs defined today (see the sketch below):

– parallelized collections: take an existing Scala collection and run functions on it in parallel

– Hadoop datasets: run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
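A minimal sketch of both RDD types (paths are illustrative):

// parallelized collection: distribute an existing Scala collection
val nums = sc.parallelize(1 to 1000)
val doubled = nums.map(n => n * 2)

// Hadoop dataset: run functions on each record of a file in HDFS
val lines = sc.textFile("hdfs:///gabriel/input/input.txt")
val lengths = lines.map(line => line.length)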


Programming Model (II)

• Two types of operations on RDDs: transformations and actions

– transformations are lazy (not computed immediately); instead they remember the transformations applied to some base dataset

– the transformed RDD gets recomputed each time an action runs on it (the default, unless it is persisted)

• Remembering the lineage of transformations lets Spark

– optimize the required calculations

– recover from lost data partitions

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
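A minimal sketch of lazy evaluation (the file path is illustrative):

val lines = sc.textFile("/gabriel/input/log.txt")          // nothing is read yet
val errors = lines.filter(line => line.contains("ERROR"))  // transformation: lazy
val n = errors.count()                                     // action: triggers the computation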

Programming Model (III)

• Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g., local file system, Amazon S3, Hypertable, HBase, etc.

• Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /data/201404*)

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
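For example (paths are illustrative):

val oneFile = sc.textFile("hdfs:///data/20140401.log")  // a single file
val oneDir  = sc.textFile("/data/logs/")                // a whole directory
val april   = sc.textFile("/data/201404*")              // a glob over many files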

Transformations

[Two slides listing common RDD transformations in a table; the table did not survive the extraction]

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf

SPARK Word Count Example

val file = sc.textFile("/gabriel/input/input.txt")

val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)

counts.saveAsTextFile("/gabriel/output/output.txt")

• flatMap: similar to map, but returns an RDD which has all elements in a flat representation

• map: returns a new distributed dataset by passing each element through a function

• reduceByKey: aggregates a dataset of (K, V) pairs using the given reduce function func on a per-key basis; func must be of type (V, V) => V

map vs. flatMap

• Sample input file

shark:~> cat input.txt
This is a short sentence.
This is a second sentence

• Output of map vs. flatMap

scala> val file = sc.textFile("/gabriel/input/input.txt")
scala> val words = file.map(line => line.split(" "))
scala> words.collect
res1: Array[Array[String]] = Array(Array(This, is, a, short, sentence.), Array(This, is, a, second, sentence))

scala> val words2 = file.flatMap(line => line.split(" "))
scala> words2.collect
res2: Array[String] = Array(This, is, a, short, sentence., This, is, a, second, sentence)


Persistence

• Spark can persist (or cache) a dataset in memory across operations

• Each node stores in memory any slices of the dataset that it computes and reuses them in other actions on that dataset, often making future actions more than 10x faster

• The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it

val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).cache()

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
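Continuing the snippet above, the cache pays off once the RDD is reused (a sketch; actual timings vary):

w.count()  // first action: reads the file, computes the RDD, and caches it
w.count()  // later actions reuse the in-memory copy and run much faster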

Broadcast variables

• Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks

• For example, to give every node a copy of a large input dataset efficiently

• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost

val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
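A sketch of using a broadcast variable inside a task closure (the lookup table is illustrative):

// ship a read-only lookup table to every node once
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2, "c" -> 3))

val keys = sc.parallelize(Seq("a", "b", "a", "c"))
// tasks read the cached copy via .value instead of shipping the map with each task
val values = keys.map(k => lookup.value.getOrElse(k, 0))
values.collect()  // Array(1, 2, 1, 3)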


K-means example

// Imports as used by Spark's bundled LocalKMeans example; Vector and
// squaredDistance come from the breeze linear algebra library
import scala.collection.mutable.{HashMap, HashSet}
import breeze.linalg.{Vector, squaredDistance}

object LocalKMeans {

  // return the index of the cluster centroid closest to point p
  def closestPoint(p: Vector[Double], centers: HashMap[Int, Vector[Double]]): Int = {
    var bestIndex = 0
    var closest = Double.PositiveInfinity
    for (i <- 1 to centers.size) {
      val vCurr = centers.get(i).get
      val tempDist = squaredDistance(p, vCurr)
      if (tempDist < closest) {
        closest = tempDist
        bestIndex = i
      }
    }
    bestIndex
  }

  def main(args: Array[String]) {
    // Not shown here: reading a file to set the points, the initial
    // cluster centroids (kPoints), data, tempDist, and convergeDist
    var points = new HashSet[Vector[Double]]
    var kPoints = new HashMap[Int, Vector[Double]]

    while (tempDist > convergeDist) {
      // For every point, determine the closest cluster
      var closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))

      // Group points by closest cluster
      var mappings = closest.groupBy[Int](x => x._1)

Aside: ._1 and ._2 access the first and second element of a Scala tuple:

val pair = (a, b)
pair._1 // => a
pair._2 // => b


      // Calculate the sum of all points assigned to each cluster
      // and the number of points per cluster
      var pointStats = mappings.map { pair =>
        pair._2.reduceLeft[(Int, (Vector[Double], Int))] {
          case ((id1, (x1, y1)), (id2, (x2, y2))) => (id1, (x1 + x2, y1 + y2))
        }
      }

      // calculate new cluster centroids: sum of points divided by their count
      var newPoints = pointStats.map { mapping =>
        (mapping._1, mapping._2._1 * (1.0 / mapping._2._2))
      }

Aside: reduceLeft applies the given function to successive elements in the collection, e.g.
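For instance, on a plain Scala list:

scala> List(1, 2, 3, 4).reduceLeft(_ + _)
res0: Int = 10   // evaluated as ((1 + 2) + 3) + 4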

      // calculate the sum of distances between old and new centroids
      tempDist = 0.0
      for (mapping <- newPoints) {
        tempDist += squaredDistance(kPoints.get(mapping._1).get, mapping._2)
      }

      // set new cluster centroids
      for (newP <- newPoints) {
        kPoints.put(newP._1, newP._2)
      }
    } // end of while loop
  } // end of main
} // end of object LocalKMeans



SPARK software

[Diagram of the Spark software stack; not captured in the extraction]

More information

• Project webpage: http://spark.apache.org/

• A flurry of books coming up on the topic

– Most scheduled for later this spring

– Very good examples and documentation available on the project webpages

• Check out YouTube for long tutorials (~3 hours)