Description
In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster on existing Hive warehouses without modification. These systems have been adopted by many organizations, large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent), to implement data-intensive applications such as ETL, interactive SQL, and machine learning.
Spark and Shark: High-speed Analytics over Hadoop Data
Sep 12, 2013 @ Yahoo! Tech Conference
Reynold Xin (辛湜), AMPLab, UC Berkeley
The Spark Ecosystem
[Stack diagram: Shark (SQL), Spark Streaming, GraphX, and MLBase run on Spark, over HDFS / Hadoop storage and the Mesos / YARN resource manager]
Today’s Talk
[Ecosystem diagram, highlighting the components covered in this talk: Spark and Shark]
Apache Spark
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk); often 5× less code
Shark
An analytic query engine built on top of Spark:
» Supports both SQL and complex analytics
» Can be 100× faster than Apache Hive
Compatible with Hive data, metastore, and queries:
» HiveQL
» UDFs / UDAFs
» SerDes
» Scripts
Community
3000 people attended online training
1100+ meetup members
80+ GitHub contributors
Today’s Talk
[Ecosystem diagram, highlighting Spark]
Why a New Framework?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
Hadoop MapReduce
[Diagram: in iterative jobs, every iteration reads its input from the filesystem and writes its output back; in interactive use, each ad-hoc query re-reads the input to select a subset]
Slow due to replication and disk I/O, but necessary for fault tolerance.
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can optionally be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure
Programming interface:
» Functional APIs in Scala, Java, Python
» Interactive use from the Scala shell
Example: Log Mining
Exposes RDDs through a functional API in Scala
Usable interactively from the Scala shell:

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
errors.persist()                              // cache the errors in memory across workers

errors.filter(_.contains("foo")).count()      // action: the master ships tasks, workers return results
errors.filter(_.contains("bar")).count()

[Diagram: the master dispatches tasks to workers, each of which builds and caches a partition of the errors RDD (Errors 1-3) from HDFS blocks 1-3]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scanning 1 TB of data in 5 sec (vs 170 sec for on-disk data).
Data Sharing in Spark
[Diagram: with RDDs cached in memory, iterative jobs load the input once and reuse it across iterations, and interactive queries repeatedly select from the same in-memory data]
Memory access is 10-100x faster than network/disk.
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
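As a small illustration (not from the slides), the lineage an RDD records can be printed with RDD.toDebugString; the HDFS path below is hypothetical:

val messages = sc.textFile("hdfs://.../logs")   // hypothetical input path
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

// Prints the chain of transformations (e.g. MappedRDD <- FilteredRDD <- HadoopRDD)
// that Spark replays to rebuild any partition lost to a failure.
println(messages.toDebugString)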
Spark: Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
and broadcast variables, accumulators…
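As a hedged sketch (hypothetical data; the pre-2.0 sc.accumulator API current at the time of this talk), several of these operators can be chained together with a broadcast variable and an accumulator:

import org.apache.spark.SparkContext._   // pair-RDD operations (needed when compiling against 0.8-era Spark)

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val clicks = sc.parallelize(Seq((1, "/home"), (1, "/search"), (2, "/home")))

// Broadcast a small lookup table to every worker once
val pageTitles = sc.broadcast(Map("/home" -> "Home", "/search" -> "Search"))

// Accumulator to count records seen on the workers
// (note: accumulator updates inside transformations are best-effort if tasks are re-run)
val seen = sc.accumulator(0)

val clicksPerUser = clicks
  .map { case (uid, page) => seen += 1; (uid, pageTitles.value.getOrElse(page, "?")) }
  .join(users)                                     // (uid, (title, name))
  .map { case (uid, (title, name)) => (name, 1) }
  .reduceByKey(_ + _)                              // clicks per user name

clicksPerUser.collect().foreach(println)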
Word Count in Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count in Spark:

val docs = sc.textFile("hdfs://…")
docs.flatMap(line => line.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey((a, b) => a + b)

Or, using Scala’s placeholder syntax:

docs.flatMap(_.split("\\s+"))
    .map((_, 1))
    .reduceByKey(_ + _)
Spark in Java and Python

Python API:

lines = spark.textFile(…)
errors = lines.filter(lambda s: "ERROR" in s)
errors.count()

Java API:

JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("ERROR"); }
  });
errors.count();
Scheduler
» General task DAGs
» Pipelines functions within a stage
» Cache-aware data locality & reuse
» Partitioning-aware to avoid shuffles (see the sketch below)
» Launches jobs in milliseconds

[Diagram: an example DAG of RDDs A-G built from map, union, groupBy, and join, split into three stages; previously computed partitions are skipped]
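A minimal sketch (hypothetical key-value RDDs, assuming the org.apache.spark package layout of Spark 0.8+) of the partitioning-awareness noted above: co-partitioning two pair RDDs lets a later join be scheduled without a full shuffle:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD operations

// Hypothetical pair RDDs keyed by user id
val profiles = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val events   = sc.parallelize(Seq((1L, "click"), (2L, "view"), (1L, "view")))

// Co-partition both RDDs with the same partitioner and keep them cached
val part = new HashPartitioner(8)
val profilesP = profiles.partitionBy(part).cache()
val eventsP   = events.partitionBy(part).cache()

// Both sides share a partitioner, so this join can avoid reshuffling either input.
val joined = eventsP.join(profilesP)   // RDD[(Long, (String, String))]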
Example: Logistic Regression
Goal: find best line separating two sets of points
[Plot: two classes of points (+ and -); gradient descent moves a random initial line toward the target separating line]
Logistic Regression: Prep the Data
// Start a new SparkContext
val sc = new SparkContext("spark://master", "LRExp")

// Build a dictionary for file parsing and broadcast it to the cluster
def loadDict(): Map[String, Double] = { … }
val dict = sc.broadcast(loadDict())

// Line parser which uses the broadcast dictionary
def parser(s: String): (Array[Double], Double) = {
  val parts = s.split('\t')
  val y = parts(0).toDouble
  val x = parts.drop(1).map(dict.value.getOrElse(_, 0.0))
  (x, y)
}

// Load the data in memory once
val data = sc.textFile("hdfs://d").map(parser).cache()
Logistic Regression: Gradient Descent

val sc = new SparkContext("spark://master", "LRExp")

// Parsing …
val data: spark.RDD[(Array[Double], Double)] = …

// Initial parameter vector
var w = Vector.random(D)

// Set learning rate
val lambda = 1.0 / data.count

// Iterate
for (i <- 1 to ITERATIONS) {
  // Compute the gradient using map & reduce
  val gradient = data.map { case (x, y) =>
    (1.0 / (1 + exp(-y * (w dot x))) - 1) * y * x
  }.reduce((g1, g2) => g1 + g2)
  // Apply the gradient on the master
  w -= (lambda / sqrt(i)) * gradient
}
Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1 to 30). Hadoop: ~110 s per iteration. Spark: first iteration 80 s, further iterations ~1 s.]
Spark MLlib (Spark 0.8)

Classification: Logistic Regression, Linear SVM (+L1, L2)
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares
Clustering: K-Means
Optimization: SGD, Parallel Gradient
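A hedged usage sketch, assuming the 0.8-era MLlib API in which LabeledPoint features and model.predict take Array[Double] (later releases switched to MLlib Vector types); the input path and feature layout are hypothetical:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical training set: label (0/1) followed by tab-separated numeric features
val training = sc.textFile("hdfs://.../training").map { line =>
  val parts = line.split('\t').map(_.toDouble)
  LabeledPoint(parts(0), parts.tail)
}.cache()

// Train with 20 iterations of SGD
val model = LogisticRegressionWithSGD.train(training, 20)

// Score a new point
val prediction = model.predict(Array(0.5, 1.2, -0.3))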
[Ecosystem diagram repeated twice as a transition from the Spark section to Shark]
Shark
A data analytics system that:
» builds on Spark,
» scales out and tolerates worker failures,
» supports low-latency, interactive queries through (optional) in-memory computation,
» supports both SQL and complex analytics,
» is compatible with Hive (storage, SerDes, UDFs, types, metadata).
Hive Architecture
[Diagram: clients (CLI, JDBC) talk to a driver containing a SQL parser, query optimizer, and physical plan execution; queries run as MapReduce jobs over HDFS, with table metadata kept in the metastore]
Shark Architecture
[Diagram: same architecture, but physical plan execution runs on Spark instead of MapReduce, with an added cache manager alongside the query optimizer]
Shark Query Language
Hive QL and a few additional DDL add-ons:
Creating an in-memory table:
» CREATE TABLE src_cached AS SELECT * FROM …
Engine Features
Columnar Memory Store
Data Co-partitioning & Co-location
Partition Pruning based on Statistics (range, Bloom filter, and others)
Spark Integration
…
Columnar Memory Store
Simply caching Hive records as JVM objects is inefficient.
Shark employs column-oriented storage.
[Diagram: row storage holds each record as one object (1, john, 4.1 / 2, mike, 3.5 / 3, sally, 6.4); column storage holds each column contiguously (1, 2, 3 / john, mike, sally / 4.1, 3.5, 6.4)]
Benefit: compact representation, CPU-efficient compression, cache locality.
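A toy sketch (not Shark’s actual implementation) of why the column layout above is more compact and cache-friendly than row objects:

// Row layout: one JVM object per record, with object headers and pointers
case class Row(id: Int, name: String, score: Double)
val rows = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

// Column layout: one array per column -- compact, compressible, cache-friendly
val ids    = Array(1, 2, 3)
val names  = Array("john", "mike", "sally")
val scores = Array(4.1, 3.5, 6.4)

// Scanning a single column touches contiguous memory and no per-record objects
val avg = scores.sum / scores.length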
Spark Integration
Unified system for query processing and Spark analytics
Query processing and ML share the same set of workers and caches
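As a hedged sketch of that integration, Shark exposed query results as RDDs (via a sql2rdd call on its context); the table, query, and the extractFeatures helper below are hypothetical:

// Run a HiveQL query; the result comes back as an RDD on the same workers and caches.
val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")

// Feed the rows straight into Spark / MLlib code, e.g. feature extraction + K-means.
val features = youngUsers.map(row => extractFeatures(row))   // extractFeatures is hypothetical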
Performance
1.7 TB Real Warehouse Data on 100 EC2 nodes
More Information
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos/YARN
Email: [email protected]
Twitter: @rxin
Behavior with Insufficient RAM
[Chart: iteration time (s) vs. percent of working set in memory: 68.8 s at 0%, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s at 100%]
Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|
val links = … // RDD of (url, neighbors) pairs
var ranks = … // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }.reduceByKey(_ + _)
}
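One refinement not shown on the slide (a sketch, with a hypothetical loadLinks helper): since the link table is static, hash-partitioning it once and caching it lets every iteration’s join reuse that partitioning instead of reshuffling it:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD operations

val links = loadLinks()                  // hypothetical: RDD[(String, Seq[String])]
  .partitionBy(new HashPartitioner(64))  // fix a partitioning for the static link table
  .cache()                               // keep it in memory across iterations

var ranks = links.mapValues(_ => 1.0)    // start each page at rank 1; mapValues preserves the partitioner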