Spark SQL Deep Dive @ Melbourne Spark Meetup

Spark SQL Deep Dive

Michael Armbrust Melbourne Spark Meetup – June 1st 2015

What is Apache Spark?

Fast and general cluster computing system, interoperable with Hadoop, included in all major distros Improves efficiency through:

>  In-memory computing primitives >  General computation graphs

Improves usability through: >  Rich APIs in Scala, Java, Python >  Interactive shell

Up to 100× faster (2-10× on disk)

2-5× less code

Spark Model

Write programs in terms of transformations on distributed datasets Resilient Distributed Datasets (RDDs)

> Collections of objects that can be stored in memory or disk across a cluster

>  Parallel functional transformations (map, filter, …) > Automatically rebuilt on failure

More than Map & Reduce

filter

groupBy

leftOuterJoin

rightOuterJoin

reduce

reduceByKey

groupByKey

cogroup

sample

partitionBy

mapWith

On-Disk Sort Record: Time to sort 100TB

2100 machines 2013 Record: Hadoop

2014 Record: Spark

Source: Daytona GraySort benchmark, sortbenchmark.org

72 minutes

207 machines

23 minutes

Also sorted 1PB in 4 hours

Spark “Hall of Fame”

LARGEST SINGLE-DAY INTAKE

LONGEST-RUNNING JOB

LARGEST SHUFFLE

MOST INTERESTING APP

Tencent (1PB+ /day)

Alibaba (1 week on 1PB+ data)

Databricks PB Sort (1PB)

Jeremy Freeman Mapping the Brain at Scale

(with lasers!)

LARGEST CLUSTER

Tencent (8000+ nodes)

Based on Reynold Xin’s personal knowledge

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile(“hdfs://...”) val errors = lines.filter(_ startswith “ERROR”)

val messages = errors.map(_.split(“\t”)(2)) messages.cache() lines

Block 1

lines Block 2

lines Block 3

Worker

Driver

messages.filter(_ contains “foo”).count()

messages.filter(_ contains “bar”).count()

results

messages Cache 1

messages Cache 2

messages Cache 3

Base RDD Transformed RDD

Action

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec#(vs 170 sec for on-disk data)

A General Stack

Spark Streaming#

real-time

Spark SQL

GraphX graph

MLlib machine learning …

Spark SQL

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

Impala (SQL) Giraph (Graph)

non-test, non-example source lines

Powerful Stack – Agile Development

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

Streaming

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

SparkSQL Streaming

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

GraphX

Streaming SparkSQL

100000

120000

140000

Hadoop MapReduce

Storm (Streaming)

GraphX

Streaming SparkSQL

Your App?

About SQL

Spark SQL >  Part of the core distribution since Spark 1.0

(April 2014)

100 150 200 250

# Of Commits Per Month

2014-0

2014-1

2015-0

# of Contributors

SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)

(April 2014) > Runs SQL / HiveQL queries including UDFs

UDAFs and SerDes

About SQL

UDAFs and SerDes > Connect existing BI tools to Spark through

About SQL

UDAFs and SerDes > Connect existing BI tools to Spark through

JDBC > Bindings in Python, Scala, and Java

About SQL

The not-so-secret truth…

is not about SQL.

SQL: The whole story

Create and Run Spark Programs Faster: > Write less code > Read less data >  Let the optimizer do the hard work

DataFrame noun – [dey-tuh-freym] 1.  A distributed collection of rows

organized into named columns. 2.  An abstraction for selecting, filtering,

aggregating and plotting structured data (cf. R, Pandas).

3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).

Write Less Code: Input & Output

Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

{ JSON }

Built-In External

and more…

Unified interface to reading/writing data in a variety of formats: df = sqlContext.read \ .format("json") \ .option("samplingRatio", "0.1") \ .load("/home/michael/data.json") df.write \ .format("parquet") \ .mode("append") \ .partitionBy("year") \ .saveAsTable("fasterData")

read and write functions create new builders for

doing I/O

Builder methods are used to specify: •  Format •  Partitioning •  Handling of

existing data •  and more

load(…), save(…) or saveAsTable(…) functions create new builders for

doing I/O

ETL Using Custom Data Sources

sqlContext.read .format("com.databricks.spark.git") .option("url", "https://github.com/apache/spark.git") .option("numPartitions", "100") .option("branches", "master,branch-‐1.3,branch-‐1.2") .load() .repartition(1) .write .format("json") .save("/home/michael/spark.json")

Write Less Code: Powerful Operations

Common operations can be expressed concisely as calls to the DataFrame API: •  Selecting required columns •  Joining different data sources •  Aggregation (count, sum, average, etc) •  Filtering

Write Less Code: Compute an Average

private IntWritable one = new IntWritable(1) private IntWritable output = new IntWritable() proctected void map( LongWritable key, Text value, Context context) { String[] fields = value.split("\t") output.set(Integer.parseInt(fields[1])) context.write(one, output) } IntWritable one = new IntWritable(1) DoubleWritable average = new DoubleWritable() protected void reduce( IntWritable key, Iterable<IntWritable> values, Context context) { int sum = 0 int count = 0 for(IntWritable value : values) { sum += value.get() count++ } average.set(sum / (double) count) context.Write(key, average) }

data = sc.textFile(...).split("\t") data.map(lambda x: (x[0], [x.[1], 1])) \ .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \ .map(lambda x: [x[0], x[1][0] / x[1][1]]) \ .collect()

Write Less Code: Compute an Average

Using RDDs

data = sc.textFile(...).split("\t") data.map(lambda x: (x[0], [int(x[1]), 1])) \ .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \ .map(lambda x: [x[0], x[1][0] / x[1][1]]) \ .collect()

Using DataFrames

sqlCtx.table("people") \ .groupBy("name") \ .agg("name", avg("age")) \ .collect()

Using SQL

SELECT name, avg(age) FROM people GROUP BY name

Not Just Less Code: Faster Implementations

0 2 4 6 8 10

RDD Scala

RDD Python

DataFrame Scala

DataFrame Python

DataFrame SQL

Time to Aggregate 10 million int pairs (secs)

Demo: Data Frames Using Spark SQL to read, write, slice and dice your data using a simple functions

Read Less Data

Spark SQL can help you read less data automatically:

•  Converting to more efficient formats •  Using columnar formats (i.e. parquet) •  Using partitioning (i.e., /year=2014/month=02/…)1 •  Skipping data using statistics (i.e., min, max)2

•  Pushing predicates into storage systems (i.e., JDBC)

Optimization happens as late as possible, therefore Spark SQL can

optimize across functions.

def add_demographics(events): u = sqlCtx.table("users") # Load Hive table events \ .join(u, events.user_id == u.user_id) \ # Join on user_id .withColumn("city", zipToCity(df.zip)) # udf adds city column events = add_demographics(sqlCtx.load("/data/events", "json")) training_data = events.where(events.city == "Palo Alto") .select(events.timestamp).collect()

Logical Plan

filter

events file users table

expensive

only join relevant users

Physical Plan

scan (events) filter

scan (users)

def add_demographics(events): u = sqlCtx.table("users") # Load partitioned Hive table events \ .join(u, events.user_id == u.user_id) \ # Join on user_id .withColumn("city", zipToCity(df.zip)) # Run udf to add city column

Physical Plan with Predicate Pushdown

and Column Pruning

optimized scan

(events) optimized

scan (users)

events = add_demographics(sqlCtx.load("/data/events", "parquet")) training_data = events.where(events.city == "Palo Alto") .select(events.timestamp).collect()

Logical Plan

filter

events file users table

Physical Plan

scan (events) filter

scan (users)

Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words”) hashingTF = HashingTF(inputCol="words", outputCol="features”) lr = LogisticRegression(maxIter=10, regParam=0.01) pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) df = sqlCtx.load("/path/to/data") model = pipeline.fit(df)

df0 df1 df2 df3 tokenizer hashingTF lr.model

Pipeline Model

Set Footer from Insert Dropdown Menu 37

So how does it all work?

Plan Optimization & Execution

Set Footer from Insert Dropdown Menu 38

SQL AST

DataFrame

Unresolved Logical

Logical Plan

Optimized Logical

Plan RDDs

Selected Physical

Analysis Logical Optimization

Physical Planning

Physical Plans

Catalog

DataFrames and SQL share the same optimization/execution pipeline

Code Generation

An example query SELECT name

FROM (

SELECT id, name

FROM People) p

WHERE p.id = 1

Projectname

Projectid,name

Filterid = 1

People

LogicalPlan

Naïve Query Planning SELECT name

FROM (

SELECT id, name

FROM People) p

WHERE p.id = 1

Projectname

Projectid,name

Filterid = 1

People

LogicalPlan

Projectname

Projectid,name

Filterid = 1

TableScanPeople

PhysicalPlan

Optimized Execution Writing imperative code to optimize all possible patterns is hard.

Projectname

Projectid,name

Filterid = 1

People

LogicalPlan

Projectname

Projectid,name

Filterid = 1

People

IndexLookupid = 1

return: name

LogicalPlan

PhysicalPlan

Instead write simple rules: •  Each rule makes one change •  Run many rules together to

fixed point.

Prior Work: #Optimizer Generators Volcano / Cascades: •  Create a custom language for expressing

rules that rewrite trees of relational operators.

•  Build a compiler that generates executable code for these rules.

Cons: Developers need to learn this custom language. Language might not be powerful enough. 42

TreeNode Library Easily transformable trees of operators •  Standard collection functionality - foreach,

map,collect,etc. •  transform function – recursive modification

of tree fragments that match a pattern. •  Debugging support – pretty printing,

splicing, etc.

Tree Transformations Developers express tree transformations as PartialFunction[TreeType,TreeType] 1.  If the function does apply to an operator, that

operator is replaced with the result. 2.  When the function does not apply to an

operator, that operator is left unchanged. 3.  The transformation is applied recursively to all

children. 44

Writing Rules as Tree Transformations 1.  Find filters on top of

projections. 2.  Check that the filter

can be evaluated without the result of the project.

3.  If so, switch the operators.

Projectname

Projectid,name

Filterid = 1

People

OriginalPlan

Projectname

Projectid,name

Filterid = 1

People

FilterPush-Down

Filter Push Down Transformation

val newPlan = queryPlan transform {

case f @ Filter(_, p @ Project(_, grandChild))

if(f.references subsetOf grandChild.output) =>

p.copy(child = f.copy(child = grandChild)

Partial Function Tree

Find Filter on Project

Check that the filter can be evaluated without the result of the project.

If so, switch the order.

Scala: Pattern Matching

Catalyst: Attribute Reference Tracking

} Scala: Copy Constructors

Optimizing with Rules

Projectname

Projectid,name

Filterid = 1

People

OriginalPlan

Projectname

Projectid,name

Filterid = 1

People

FilterPush-Down

Projectname

Filterid = 1

People

CombineProjection

IndexLookupid = 1

return: name

PhysicalPlan

Future Work – Project Tungsten

Consider “abcd” – 4 bytes with UTF8 encoding java.lang.String object internals: OFFSET SIZE TYPE DESCRIPTION VALUE 0 4 (object header) ... 4 4 (object header) ... 8 4 (object header) ... 12 4 char[] String.value [] 16 4 int String.hash 0 20 4 int String.hash32 0 Instance size: 24 bytes (reported by Instrumentation API)

Project Tungsten

Overcome JVM limitations: •  Memory Management and Binary Processing:

leveraging application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection

•  Cache-aware computation: algorithms and data structures to exploit memory hierarchy

•  Code generation: using code generation to exploit modern compilers and CPUs

Questions? Learn more at: http://spark.apache.org/docs/latest/ Get Involved: https://github.com/apache/spark

Spark SQL Deep Dive @ Melbourne Spark Meetup

Software

Chicago Spark Meetup 03 01 2016 - Spark and Recommendations

PySpark Cassandra - Amsterdam Spark Meetup

Hadoop World Spark Meetup: Interactive Spark in your Browser

Spark Hsinchu meetup

Bellevue Big Data meetup: Dive Deep into Spark Streaming

Spark sql meetup

Dec6 meetup spark presentation

Jump Start into Apache Spark (Seattle Spark Meetup)

Talend spark meetup 03042017 - Paris Spark Meetup

Budapest Spark Meetup - Basics of Spark coding

Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

[Spark meetup] Spark Streaming Overview

Spark meetup - Zoomdata Streaming

Spark Meetup TensorFrames

Sparklint @ Spark Meetup Chicago

Spark meetup v2.0.5

Seattle spark-meetup-032317

Spark Streaming Resiliency (Bay Area Spark Meetup)

Ankara Spark Meetup - Big Data & Apache Spark Mimarisi Sunumu

IBM Spark Meetup - RDD & Spark Basics