COSC 6339
Big Data Analytics
Introduction to Spark
Edgar Gabriel
Spring 2015
What is SPARK?
• In-Memory Cluster Computing for Big Data Applications
• Fixes the weaknesses of MapReduce
– Iterative applications
– Streaming data processing
– Keep data in memory across different functions
• Spark works across many environments:
– Standalone
– Hadoop
– Mesos
• Spark supports accessing data from diverse sources (HDFS,
HBase, Cassandra, …)
What is SPARK (II)
• Three modes of execution
– Spark shell
– Spark scripts
– Spark code
• API defined for multiple languages
– Scala
– Python
– Java
A couple of words on Scala
• Object-oriented language: everything is an object and
every operation is a method-call.
• Scala is also a functional language
– Functions are first class values
– Can be passed as arguments to functions
– Are ideally free of side effects
– Can be defined inside other functions
• Scala runs on the JVM
– Java and Scala classes can be freely mixed
A couple of words on Scala
• Scala supports type inference, i.e. automatic deduction
of the data type of an expression
• val: ‘value’, i.e. an immutable object whose content
cannot be changed after initial assignment
• var: ‘variable’, a mutable object
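A minimal sketch of type inference and the val/var distinction (the names used here are arbitrary):
val greeting = "hello"   // type String is inferred; greeting cannot be reassigned
var counter = 0          // type Int is inferred; counter is mutable
counter += 1
// greeting = "bye"      // would not compile: a val cannot be reassigned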
Spark Essentials
• Spark program has to create a SparkContext object,
which tells Spark how to access a cluster
• Automatically done in the shell for Scala or Python: accessible through the sc variable
• Programs must use a constructor to instantiate a new SparkContext
scala> sc
res: spark.SparkContext = spark.SparkContext@470d1f30
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
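A minimal sketch of the constructor route in a standalone Scala program, using SparkConf; the application name and master URL are placeholders:
import org.apache.spark.{SparkConf, SparkContext}

// placeholder application name and master URL
val conf = new SparkConf().setAppName("MyApp").setMaster("local[2]")
val sc = new SparkContext(conf)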
Spark Essentials
• The master parameter for a SparkContext
determines which cluster to use, e.g.
shark> spark-shell --master yarn-cluster
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
SPARK cluster utilization
1. The master connects to a cluster manager to allocate
resources across applications
2. It acquires executors on cluster nodes – processes that run
compute tasks and cache data
3. It sends the application code to the executors
4. It sends tasks for the executors to run
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
SPARK master parameter

Master URL          Meaning
local               Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K]            Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
spark://HOST:PORT   Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use; 7077 by default.
mesos://HOST:PORT   Connect to the given Mesos cluster. The port must be whichever one your master is configured to use; 5050 by default. For a Mesos cluster using ZooKeeper, use mesos://zk://....
yarn-cluster        Connect to a YARN cluster in cluster mode. The cluster location is found based on HADOOP_CONF_DIR.
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
Programming Model
• Resilient distributed datasets (RDDs)
– Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
– Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
• Two types of RDDs defined today:
– parallelized collections – take an existing Scala collection
and run functions on it in parallel
– Hadoop datasets – run functions on each record of a file
in Hadoop distributed file system or any other storage
system supported by Hadoop
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
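A small sketch of both RDD types in the Scala shell (the file path is a placeholder):
// parallelized collection: distribute an existing Scala collection
scala> val distData = sc.parallelize(Array(1, 2, 3, 4, 5))
// Hadoop dataset: one element per line of a file in HDFS (placeholder path)
scala> val lines = sc.textFile("/gabriel/input/input.txt")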
Programming Model (II)
• Two types of operations on RDDs: transformations and
actions
– transformations are lazy (not computed immediately); instead, Spark remembers the transformations applied to some base dataset
– by default, the transformed RDD gets recomputed each time an action is run on it
– remembering this lineage allows Spark to
• optimize the required calculations
• recover lost data partitions
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
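For illustration, a sketch of the lazy behavior (path and filter condition are arbitrary): the transformation only records lineage; the action triggers the actual computation:
val lines = sc.textFile("/gabriel/input/input.txt")        // nothing is read yet
val errors = lines.filter(line => line.contains("ERROR"))  // transformation: lazy
val n = errors.count()                                     // action: computation runs here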
Programming Model (III)
• Spark can create RDDs from any file stored in HDFS or other
storage systems supported by Hadoop, e.g., local file
system, Amazon S3, Hypertable, HBase, etc.
• Spark supports text files, SequenceFiles, and any other
Hadoop InputFormat, and can also take a directory or a glob
(e.g. /data/201404*)
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
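A sketch of these input forms (all paths are placeholders):
val month = sc.textFile("/data/201404*")                  // glob, as in the example above
val dir   = sc.textFile("/gabriel/input/")                // every file in a directory
val seq   = sc.sequenceFile[String, Int]("/gabriel/seq/") // Hadoop SequenceFile of (K, V) pairs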
Transformations
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
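The full list of transformations is in the Spark documentation; as a small sketch, a few common ones applied to a parallelized collection (all of them lazy):
val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)            // one output element per input element
val evens   = nums.filter(_ % 2 == 0)    // keep elements satisfying a predicate
val pairs   = nums.map(n => (n % 3, n))  // build (K, V) pairs
val sums    = pairs.reduceByKey(_ + _)   // aggregate values per key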
Actions
Slide based on a talk found at http://cdn.liber118.com/workshop/itas_workshop.pdf
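Again, the full list is in the Spark documentation; a few common actions, each of which triggers computation and returns a result to the driver (the output path is a placeholder):
val nums = sc.parallelize(1 to 10)
nums.count()        // 10
nums.first()        // 1
nums.take(3)        // Array(1, 2, 3)
nums.reduce(_ + _)  // 55
nums.collect()      // the entire RDD as a local array
nums.saveAsTextFile("/gabriel/output/nums")  // write the RDD to storage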
SPARK Word Count Example
val file = sc.textFile("/gabriel/input/input.txt")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("/gabriel/output/output.txt")
• flatMap: similar to map, but returns an RDD which has all elements in a flat representation
• map: returns a new distributed dataset by passing each element through a function
• reduceByKey: a dataset of (K, V) pairs is aggregated using the given reduce function func, on a per-key basis; the function must be of type (V, V) => V
map vs. flatMap
• Sample input file
shark:~> cat input.txt
This is a short sentence.
This is a second sentence
• Output of map vs. flatMap
scala> val file = sc.textFile("/gabriel/input/input.txt")
scala> val words = file.map (line => line.split (" "))
scala> words.collect
res1: Array[Array[String]] = Array(Array(This, is, a, short,
sentence.), Array(This, is, a, second, sentence))
scala> val words2 = file.flatMap( line => line.split(" "))
scala> words2.collect
res2: Array[String] = Array(This, is, a, short, sentence.,
This, is, a, second, sentence)
Persistence
• Spark can persist (or cache) a dataset in memory across
operations
• Each node stores in memory any slices of it that it
computes and reuses them in other actions on that
dataset – often making future actions more than 10x
faster
• The cache is fault-tolerant: if any partition of an RDD
is lost, it will automatically be recomputed using the
transformations that originally created it
val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).cache()
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
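For illustration, two successive actions on the cached RDD w from the snippet above; the first computes and caches the partitions, the second reuses them from memory:
w.count()  // computes the split words and caches the partitions
w.count()  // reuses the cached partitions instead of re-reading the file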
Broadcast variables
• Broadcast variables let the programmer keep a read-only
variable cached on each machine rather than shipping a
copy of it with tasks
• For example, to give every node a copy of a large input
dataset efficiently
• Spark also attempts to distribute broadcast variables
using efficient broadcast algorithms to reduce
communication cost
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
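A sketch of using the broadcast value inside tasks (the RDD below is made up for illustration); workers read broadcastVar.value locally instead of receiving a copy of the array with every task:
val offsets = sc.parallelize(1 to 5).map(x => x + broadcastVar.value.sum)
offsets.collect()  // Array(7, 8, 9, 10, 11)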
K-means example
// Imports assumed for this snippet: mutable collections for the point sets
// and breeze.linalg for Vector and squaredDistance
import scala.collection.mutable.{HashMap, HashSet}
import breeze.linalg.{Vector, squaredDistance}

object LocalKMeans {
def closestPoint(p: Vector[Double], centers: HashMap[Int,
Vector[Double]]): Int = {
var index = 0
var bestIndex = 0
var closest = Double.PositiveInfinity
for (i <- 1 to centers.size) {
val vCurr = centers.get(i).get
val tempDist = squaredDistance(p, vCurr)
if (tempDist < closest) {
closest = tempDist
bestIndex = i
}
}
bestIndex
}
def main(args: Array[String]) {
// Not shown here; setting points and initial cluster
// centroids by reading a file
var points = new HashSet[Vector[Double]]
var kPoints = new HashMap[Int, Vector[Double]]
while(tempDist > convergeDist) {
// For every point, determine the closest cluster
var closest = points.map(p =>(closestPoint(p, kPoints), (p, 1)))
// Group points by closest cluster
var mappings = closest.groupBy[Int] (x => x._1)
K-means example
Tuple access in Scala:
val pair = (a, b)
pair._1 // => a
pair._2 // => b
// Calculate the sum of all points assigned to each cluster and
// the number of points in the cluster
var pointStats = mappings.map { pair =>
pair._2.reduceLeft [(Int, (Vector[Double], Int))] {
case ((id1, (x1, y1)), (id2, (x2, y2))) => (id1,
(x1 + x2, y1 + y2))
}
}
// calculate new cluster centroids
var newPoints = pointStats.map {mapping =>
(mapping._1, mapping._2._1 * (1.0 / mapping._2._2))}
K-means example
reduceLeft applies the given function to successive elements of the collection.
// calculate sum of distances
tempDist = 0.0
for (mapping <- newPoints) {
tempDist += squaredDistance(kPoints.get(mapping._1).get,
mapping._2)
}
// set new cluster centroids
for (newP <- newPoints) {
kPoints.put(newP._1, newP._2)
}
} // end of while loop
SPARK software
More information
• Project webpage
http://spark.apache.org/
• A flurry of books coming up on the topic
– Most scheduled for later spring this year
– Very good examples and documentation available on their
webpages
• Check out YouTube for long tutorials (~3 hours)