
COSC 6339

Big Data Analytics

Introduction to Spark

Edgar Gabriel

Spring 2015

What is SPARK?

• In-Memory Cluster Computing for Big Data Applications

• Fixes the weaknesses of MapReduce

– Iterative applications

– Streaming data processing

– Keep data in memory across different functions

• Spark works across many environments

– Standalone,

– Hadoop,

– Mesos,

• Spark supports accessing data from diverse sources (HDFS, HBase, Cassandra, …)


What is SPARK (II)

• Three modes of execution

– Spark shell

– Spark scripts

– Spark code

• API defined for multiple languages

– Scala

– Python

– Java

A couple of words on Scala

• Object-oriented language: everything is an object and every operation is a method call.

• Scala is also a functional language (sketched below)

– Functions are first-class values

– Functions can be passed as arguments to other functions

– Functions should be free of side effects

– Functions can be defined inside other functions

• Scala runs on the JVM

– Java and Scala classes can be freely mixed
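
A minimal sketch of these ideas (the names below are made up for illustration):

// Functions are first-class values: they can be stored in vals ...
val square = (x: Int) => x * x

// ... and passed as arguments to other functions
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
applyTwice(square, 3)   // => 81

// Functions can be defined inside other functions
def sumOfSquares(xs: List[Int]): Int = {
  def sq(x: Int) = x * x
  xs.map(sq).sum
}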


A couple of words on Scala

• Scala supports type inference, i.e. automatic deduction of the data type of an expression

• val: ‘value’, i.e. an immutable object whose content cannot be changed after initial assignment

• var: ‘variable’, a mutable object
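
A small sketch of type inference, val and var:

val n = 5         // type Int is inferred; n is immutable, so "n = 6" would not compile
var m = 5         // m is mutable
m = 6             // allowed
val s = "spark"   // inferred as String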

Spark Essentials

• A Spark program has to create a SparkContext object, which tells Spark how to access a cluster

• Automatically done in the shell for Scala or Python: accessible through the sc variable

• Programs must use a constructor to instantiate a new SparkContext

scala> sc

res: spark.SparkContext = spark.SparkContext@470d1f30

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
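
A minimal sketch of the constructor route in a standalone Scala program (the application name and master URL below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")      // placeholder application name
  .setMaster("local[4]")    // or e.g. "spark://HOST:7077"; see the master parameter table below
val sc = new SparkContext(conf)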


Spark Essentials

• The master parameter for a SparkContext determines which cluster to use, e.g.

shark> spark-shell --master yarn-cluster

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf

SPARK cluster utilization

1. master connects to a cluster manager to allocate resources across applications

2. acquires executors on cluster nodes – processes that run compute tasks and cache data

3. sends app code to the executors

4. sends tasks for the executors to run

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf


SPARK master parameter

Master URL          Meaning

local               Run Spark locally with one worker thread (i.e. no parallelism at all).

local[K]            Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

spark://HOST:PORT   Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.

mesos://HOST:PORT   Connect to the given Mesos cluster. The port must be whichever one your cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://....

yarn-cluster        Connect to a YARN cluster in cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf

Programming Model

• Resilient distributed datasets (RDDs)

– Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost

– Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)

• Two types of RDDs defined today (sketched below):

– parallelized collections – take an existing Scala collection and run functions on it in parallel

– Hadoop datasets – run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
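
A small sketch of both flavours (the paths and values are placeholders):

// parallelized collection: distribute an existing Scala collection
val nums  = sc.parallelize(1 to 1000)
val evens = nums.filter(_ % 2 == 0)

// Hadoop dataset: one record per line of a file in HDFS (or any Hadoop-supported storage)
val lines   = sc.textFile("hdfs:///gabriel/input/input.txt")
val lengths = lines.map(_.length)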


Programming Model (II)

• Two types of operations on RDDs: transformations and actions

– transformations are lazy (not computed immediately); see the sketch below

– the transformed RDD gets recomputed when an action is run on it (default)

– instead of materializing results, an RDD remembers the transformations applied to some base dataset, which lets Spark

• optimize the required calculations

• recover from lost data partitions

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
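
A sketch of the lazy behaviour (the path is a placeholder):

val lines = sc.textFile("/gabriel/input/input.txt")   // transformation: nothing is read yet
val words = lines.flatMap(_.split(" "))               // transformation: still nothing computed
val n = words.count()                                 // action: triggers reading the file and running both transformations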

Programming Model (III)

• Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g., local file system, Amazon S3, Hypertable, HBase, etc.

• Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /data/201404*)

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
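
For illustration (the paths are placeholders):

val april = sc.textFile("/data/201404*")                    // a glob over many files
val pairs = sc.sequenceFile[String, Int]("/data/seqfile")   // a Hadoop SequenceFile read as (K, V) pairs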


Transformations

(tables of common RDD transformations)

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
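
A few commonly used transformations, as a sketch (the input values are made up for illustration):

val nums  = sc.parallelize(Seq(1, 2, 3, 4))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

nums.map(_ * 2)               // 2, 4, 6, 8
nums.filter(_ % 2 == 0)       // 2, 4
nums.flatMap(n => Seq(n, n))  // 1, 1, 2, 2, 3, 3, 4, 4
pairs.groupByKey()            // ("a", [1, 3]), ("b", [2])
pairs.reduceByKey(_ + _)      // ("a", 4), ("b", 2)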


SPARK Word Count Example

val file = sc.textFile("/gabriel/input/input.txt")

val counts = file.flatMap(line => line.split(" "))
             .map(word => (word, 1))
             .reduceByKey(_ + _)

counts.saveAsTextFile("/gabriel/output/output.txt")

• map: returns a new distributed data set by passing each element through a function

• flatMap: similar to map, but returns an RDD which has all elements in a flat representation

• reduceByKey: a dataset consisting of (K, V) pairs is aggregated using the given reduce function func, on a per-key basis; the function must be of type (V, V) => V

map vs. flatMap

• Sample input file

shark:~> cat input.txt
This is a short sentence.
This is a second sentence

• Output of map vs. flatMap

scala> val file = sc.textFile("/gabriel/input/input.txt")
scala> val words = file.map(line => line.split(" "))
scala> words.collect
res1: Array[Array[String]] = Array(Array(This, is, a, short, sentence.), Array(This, is, a, second, sentence))

scala> val words2 = file.flatMap(line => line.split(" "))
scala> words2.collect
res2: Array[String] = Array(This, is, a, short, sentence., This, is, a, second, sentence)


Persistence

• Spark can persist (or cache) a dataset in memory across operations

• Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster

• The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it

val f = sc.textFile("README.md")

val w = f.flatMap(l => l.split(" ")).cache()

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
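
As a usage sketch (not in the original snippet): running two actions on the cached RDD w from above.

w.count()   // first action: computes the word RDD and caches its partitions in memory
w.count()   // second action: reuses the cached partitions, typically much faster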

Broadcast variables

• Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks

• For example, to give every node a copy of a large input dataset efficiently

• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost

val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar.value

Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
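
A small sketch of reading a broadcast variable inside a task (the lookup table is made up for illustration):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val keys   = sc.parallelize(Seq("a", "b", "a", "c"))
val codes  = keys.map(k => lookup.value.getOrElse(k, 0))   // each task reads the cached copy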


K-means example

// Imports assumed by this excerpt (in the Spark LocalKMeans example this code
// follows, Vector and squaredDistance come from breeze.linalg):
import scala.collection.mutable.{HashMap, HashSet}
import breeze.linalg.{Vector, squaredDistance}

object LocalKMeans {

def closestPoint(p: Vector[Double], centers: HashMap[Int, Vector[Double]]): Int = {

var index = 0

var bestIndex = 0

var closest = Double.PositiveInfinity

for (i <- 1 to centers.size) {

val vCurr = centers.get(i).get

val tempDist = squaredDistance(p, vCurr)

if (tempDist < closest) {

closest = tempDist

bestIndex = i

}

}

bestIndex

}

def main(args: Array[String]) {

// Not shown here: reading the input points (the 'data' collection used
// below) from a file and setting the initial cluster centroids

var points = new HashSet[Vector[Double]]

var kPoints = new HashMap[Int, Vector[Double]]

while(tempDist > convergeDist) {

// For every point, determine the closest cluster

var closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))

// Group points by closest cluster

var mappings = closest.groupBy[Int] (x => x._1)


Tuple elements are accessed positionally:

val pair = (a, b)
pair._1 // => a
pair._2 // => b


// Calculate the sum of all points assigned to each cluster and the
// number of points assigned to it

var pointStats = mappings.map { pair =>

pair._2.reduceLeft [(Int, (Vector[Double], Int))] {

case ((id1, (x1, y1)), (id2, (x2, y2))) => (id1, (x1 + x2, y1 + y2))

}

}

// calculate new cluster centroids

var newPoints = pointStats.map {mapping =>

(mapping._1, mapping._2._1 * (1.0 / mapping._2._2))}


reduceLeft applies the given function to successive elements in the collection, e.g.
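
For instance, a generic illustration (not from the slides):

List(1, 5, 3).reduceLeft((a, b) => a + b)   // ((1 + 5) + 3) => 9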

// calculate sum of distances

tempDist = 0.0

for (mapping <- newPoints) {

tempDist += squaredDistance(kPoints.get(mapping._1).get, mapping._2)

}

// set new cluster centroids

for (newP <- newPoints) {

kPoints.put(newP._1, newP._2)

}

} // end of while loop



SPARK software

More information

• Project webpage: http://spark.apache.org/

• A flurry of books coming up on the topic

– Most scheduled for later this spring

– Very good examples and documentation available on their webpages

• Check out YouTube for looooong tutorials (~3 hours)