Description
In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster on existing Hive warehouses without modification. These systems have been adopted by many organizations, large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent), to implement data-intensive applications such as ETL, interactive SQL, and machine learning.
Spark and Shark: High-speed Analytics over Hadoop Data
Sep 12, 2013 @ Yahoo! Tech Conference
Reynold Xin (辛湜), AMPLab, UC Berkeley
The Spark Ecosystem
[Stack diagram: Shark (SQL), Spark Streaming, GraphX, and MLBase run on Spark, over HDFS / Hadoop storage and the Mesos / YARN resource manager]
Today’s Talk
[Ecosystem diagram, highlighting the components covered in this talk: Spark and Shark]
Apache Spark
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk); often 5× less code
Shark
An analytic query engine built on top of Spark:
» Supports both SQL and complex analytics
» Can be 100× faster than Apache Hive
Compatible with Hive data, metastore, and queries:
» HiveQL
» UDFs / UDAFs
» SerDes
» Scripts
Community
3000 people attended online training
1100+ meetup members
80+ GitHub contributors
Today’s Talk
[Ecosystem diagram, highlighting Spark]
Why a New Framework?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
Hadoop MapReduce
[Diagram: in iterative jobs, every iteration reads its input from the filesystem and writes its output back; in interactive use, each ad-hoc query re-reads the input to select a subset]
Slow due to replication and disk I/O, but necessary for fault tolerance.
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can optionally be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure
Programming interface:
» Functional APIs in Scala, Java, Python
» Interactive use from the Scala shell
Example: Log Mining
Exposes RDDs through a functional API in Scala
Usable interactively from the Scala shell:

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
errors.persist()                              // cache the errors in memory across workers

errors.filter(_.contains("foo")).count()      // action: the master ships tasks, workers return results
errors.filter(_.contains("bar")).count()

[Diagram: the master dispatches tasks to workers, each of which builds and caches a partition of the errors RDD (Errors 1-3) from HDFS blocks 1-3]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scanning 1 TB of data in 5 sec (vs 170 sec for on-disk data).
Data Sharing in Spark
[Diagram: with RDDs cached in memory, iterative jobs load the input once and reuse it across iterations, and interactive queries repeatedly select from the same in-memory data]
Memory access is 10-100x faster than network/disk.
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
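As a small illustration (not from the slides), the lineage an RDD records can be printed with RDD.toDebugString; the HDFS path below is hypothetical:

val messages = sc.textFile("hdfs://.../logs")   // hypothetical input path
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

// Prints the chain of transformations (e.g. MappedRDD <- FilteredRDD <- HadoopRDD)
// that Spark replays to rebuild any partition lost to a failure.
println(messages.toDebugString)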
Spark: Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
and broadcast variables, accumulators…
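As a hedged sketch (hypothetical data; the pre-2.0 sc.accumulator API current at the time of this talk), several of these operators can be chained together with a broadcast variable and an accumulator:

import org.apache.spark.SparkContext._   // pair-RDD operations (needed when compiling against 0.8-era Spark)

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val clicks = sc.parallelize(Seq((1, "/home"), (1, "/search"), (2, "/home")))

// Broadcast a small lookup table to every worker once
val pageTitles = sc.broadcast(Map("/home" -> "Home", "/search" -> "Search"))

// Accumulator to count records seen on the workers
// (note: accumulator updates inside transformations are best-effort if tasks are re-run)
val seen = sc.accumulator(0)

val clicksPerUser = clicks
  .map { case (uid, page) => seen += 1; (uid, pageTitles.value.getOrElse(page, "?")) }
  .join(users)                                     // (uid, (title, name))
  .map { case (uid, (title, name)) => (name, 1) }
  .reduceByKey(_ + _)                              // clicks per user name

clicksPerUser.collect().foreach(println)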
Word Count in Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count in Spark:

val docs = sc.textFile("hdfs://…")
docs.flatMap(line => line.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey((a, b) => a + b)

Or, using Scala’s placeholder syntax:

docs.flatMap(_.split("\\s+"))
    .map((_, 1))
    .reduceByKey(_ + _)
Spark in Java and Python

Python API:

lines = spark.textFile(…)
errors = lines.filter(lambda s: "ERROR" in s)
errors.count()

Java API:

JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("ERROR"); }
  });
errors.count();
Scheduler
» General task DAGs
» Pipelines functions within a stage
» Cache-aware data locality & reuse
» Partitioning-aware to avoid shuffles (see the sketch below)
» Launches jobs in milliseconds

[Diagram: an example DAG of RDDs A-G built from map, union, groupBy, and join, split into three stages; previously computed partitions are skipped]
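A minimal sketch (hypothetical key-value RDDs, assuming the org.apache.spark package layout of Spark 0.8+) of the partitioning-awareness noted above: co-partitioning two pair RDDs lets a later join be scheduled without a full shuffle:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD operations

// Hypothetical pair RDDs keyed by user id
val profiles = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val events   = sc.parallelize(Seq((1L, "click"), (2L, "view"), (1L, "view")))

// Co-partition both RDDs with the same partitioner and keep them cached
val part = new HashPartitioner(8)
val profilesP = profiles.partitionBy(part).cache()
val eventsP   = events.partitionBy(part).cache()

// Both sides share a partitioner, so this join can avoid reshuffling either input.
val joined = eventsP.join(profilesP)   // RDD[(Long, (String, String))]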
Example: Logistic Regression
Goal: find best line separating two sets of points
[Plot: two classes of points (+ and -); gradient descent moves a random initial line toward the target separating line]
Logistic Regression: Prep the Data
// Start a new SparkContext
val sc = new SparkContext("spark://master", "LRExp")

// Build a dictionary for file parsing and broadcast it to the cluster
def loadDict(): Map[String, Double] = { … }
val dict = sc.broadcast(loadDict())

// Line parser which uses the broadcast dictionary
def parser(s: String): (Array[Double], Double) = {
  val parts = s.split('\t')
  val y = parts(0).toDouble
  val x = parts.drop(1).map(dict.value.getOrElse(_, 0.0))
  (x, y)
}

// Load the data in memory once
val data = sc.textFile("hdfs://d").map(parser).cache()
Logistic Regression: Gradient Descent

val sc = new SparkContext("spark://master", "LRExp")

// Parsing …
val data: spark.RDD[(Array[Double], Double)] = …

// Initial parameter vector
var w = Vector.random(D)

// Set learning rate
val lambda = 1.0 / data.count

// Iterate
for (i <- 1 to ITERATIONS) {
  // Compute the gradient using map & reduce
  val gradient = data.map { case (x, y) =>
    (1.0 / (1 + exp(-y * (w dot x))) - 1) * y * x
  }.reduce((g1, g2) => g1 + g2)
  // Apply the gradient on the master
  w -= (lambda / sqrt(i)) * gradient
}
Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1 to 30). Hadoop: ~110 s per iteration. Spark: first iteration 80 s, further iterations ~1 s.]
Spark MLlib (Spark 0.8)

Classification: Logistic Regression, Linear SVM (+L1, L2)
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares
Clustering: K-Means
Optimization: SGD, Parallel Gradient
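A hedged usage sketch, assuming the 0.8-era MLlib API in which LabeledPoint features and model.predict take Array[Double] (later releases switched to MLlib Vector types); the input path and feature layout are hypothetical:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical training set: label (0/1) followed by tab-separated numeric features
val training = sc.textFile("hdfs://.../training").map { line =>
  val parts = line.split('\t').map(_.toDouble)
  LabeledPoint(parts(0), parts.tail)
}.cache()

// Train with 20 iterations of SGD
val model = LogisticRegressionWithSGD.train(training, 20)

// Score a new point
val prediction = model.predict(Array(0.5, 1.2, -0.3))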
[Ecosystem diagram repeated twice as a transition from the Spark section to Shark]
Shark
A data analytics system that:
» builds on Spark,
» scales out and tolerates worker failures,
» supports low-latency, interactive queries through (optional) in-memory computation,
» supports both SQL and complex analytics,
» is compatible with Hive (storage, SerDes, UDFs, types, metadata).
Hive Architecture
[Diagram: clients (CLI, JDBC) talk to a driver containing a SQL parser, query optimizer, and physical plan execution; queries run as MapReduce jobs over HDFS, with table metadata kept in the metastore]
Shark Architecture
[Diagram: same architecture, but physical plan execution runs on Spark instead of MapReduce, with an added cache manager alongside the query optimizer]
Shark Query Language
Hive QL and a few additional DDL add-ons:
Creating an in-memory table:
» CREATE TABLE src_cached AS SELECT * FROM …
Engine Features
Columnar Memory Store
Data Co-partitioning & Co-location
Partition Pruning based on Statistics (range, Bloom filter, and others)
Spark Integration
…
Columnar Memory Store
Simply caching Hive records as JVM objects is inefficient.
Shark employs column-oriented storage.
[Diagram: row storage holds each record as one object (1, john, 4.1 / 2, mike, 3.5 / 3, sally, 6.4); column storage holds each column contiguously (1, 2, 3 / john, mike, sally / 4.1, 3.5, 6.4)]
Benefit: compact representation, CPU-efficient compression, cache locality.
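A toy sketch (not Shark’s actual implementation) of why the column layout above is more compact and cache-friendly than row objects:

// Row layout: one JVM object per record, with object headers and pointers
case class Row(id: Int, name: String, score: Double)
val rows = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

// Column layout: one array per column -- compact, compressible, cache-friendly
val ids    = Array(1, 2, 3)
val names  = Array("john", "mike", "sally")
val scores = Array(4.1, 3.5, 6.4)

// Scanning a single column touches contiguous memory and no per-record objects
val avg = scores.sum / scores.length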
Spark Integration
Unified system for query processing and Spark analytics
Query processing and ML share the same set of workers and caches
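As a hedged sketch of that integration, Shark exposed query results as RDDs (via a sql2rdd call on its context); the table, query, and the extractFeatures helper below are hypothetical:

// Run a HiveQL query; the result comes back as an RDD on the same workers and caches.
val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")

// Feed the rows straight into Spark / MLlib code, e.g. feature extraction + K-means.
val features = youngUsers.map(row => extractFeatures(row))   // extractFeatures is hypothetical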
Performance
1.7 TB Real Warehouse Data on 100 EC2 nodes
More Information
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos/YARN
Email: [email protected]
Twitter: @rxin
Behavior with Insufficient RAM
[Chart: iteration time (s) vs. percent of working set in memory: 68.8 s at 0%, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s at 100%]
Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|
val links = … // RDD of (url, neighbors) pairs
var ranks = … // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }.reduceByKey(_ + _)
}
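One refinement not shown on the slide (a sketch, with a hypothetical loadLinks helper): since the link table is static, hash-partitioning it once and caching it lets every iteration’s join reuse that partitioning instead of reshuffling it:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD operations

val links = loadLinks()                  // hypothetical: RDD[(String, Seq[String])]
  .partitionBy(new HashPartitioner(64))  // fix a partitioning for the static link table
  .cache()                               // keep it in memory across iterations

var ranks = links.mapValues(_ => 1.0)    // start each page at rank 1; mapValues preserves the partitioner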