Spark Architecture
A. Grishchenko
Pivotal Confidential – Internal Use Only



About me

Enterprise Architect @ Pivotal

• 7 years in data processing
• 5 years with MPP
• 4 years with Hadoop
• Spark contributor
• http://0x0fff.com

Outline

• Spark Motivation
• Spark Pillars
• Spark Architecture
• Spark Shuffle
• Spark DataFrames


Spark Motivation

• Difficulty of programming directly in Hadoop MapReduce

• Performance bottlenecks; batch processing does not fit all use cases

• Better support for iterative jobs, typical for machine learning


Difficulty of Programming in MR

Word Count implementations

• Hadoop MR – 61 lines in Java

• Spark – 1 line in the interactive shell:

sc.textFile('...').flatMap(lambda x: x.split()) \
    .map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y) \
    .saveAsTextFile('...')


Performance Bottlenecks

How many times is the data written to the HDD during a single MapReduce job?

• One
• Two
✓ Three
✓ More


Performance Bottlenecks

Consider Hive as the main SQL tool:

• A typical Hive query is translated into 3–5 MR jobs
• Each MR job puts data to the HDD 3+ times
• Each put to the HDD is a write followed by a read
• This sums up to 18–30 scans of the data during a single Hive query
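The 18–30 figure follows directly from the bullets above; a quick sanity check, using the slide's own per-job estimates:

```python
def hive_query_disk_scans(mr_jobs, puts_per_job=3, ops_per_put=2):
    # Each "put" to the HDD is a write followed by a read (2 disk scans),
    # and each MR job performs at least 3 such puts.
    return mr_jobs * puts_per_job * ops_per_put

print(hive_query_disk_scans(3))  # 18 — lower bound, a 3-job query
print(hive_query_disk_scans(5))  # 30 — upper bound, a 5-job query
```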


Performance Bottlenecks

Spark offers you:

• Lazy computations – optimize the job before executing it

• In-memory data caching – scan the HDD only once, then scan your RAM

• Efficient pipelining – avoids the data hitting the HDD by all means


Spark Pillars

Two main abstractions of Spark

• RDD – Resilient Distributed Dataset

• DAG – Directed Acyclic Graph


RDD

• Simple view – an RDD is a collection of data items split into partitions and stored in memory on the worker nodes of the cluster

• Complex view – an RDD is an interface for data transformation; it refers to data stored either in a persistent store (HDFS, Cassandra, HBase, etc.), in cache (memory, memory+disk, disk only, etc.), or in another RDD


RDD

• Complex view (cont'd)
  – Partitions are recomputed on failure or cache eviction
  – Metadata stored for the interface:
    ▪ Partitions – set of data splits associated with this RDD
    ▪ Dependencies – list of parent RDDs involved in the computation
    ▪ Compute – function to compute a partition of the RDD given the parent partitions from the Dependencies
    ▪ Preferred Locations – where is the best place to put computations on this partition (data locality)
    ▪ Partitioner – how the data is split into partitions
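The five pieces of metadata above map naturally onto an abstract interface. A minimal sketch in plain Python, purely illustrative — Spark's actual RDD class is written in Scala and is far richer:

```python
class RDD:
    """Sketch of the RDD interface: the metadata the slide lists."""
    def __init__(self, partitions, dependencies, partitioner=None):
        self.partitions = partitions      # set of data splits for this RDD
        self.dependencies = dependencies  # parent RDDs used in the computation
        self.partitioner = partitioner    # how the data is split into partitions

    def compute(self, partition):
        # Compute one partition from the parent partitions (Dependencies).
        raise NotImplementedError

    def preferred_locations(self, partition):
        # Best hosts for computing this partition (data locality); none by default.
        return []

class ParallelRDD(RDD):
    """Leaf RDD backed by in-memory data, split into partitions."""
    def __init__(self, data, num_partitions):
        super().__init__(list(range(num_partitions)), [])
        self.chunks = [data[i::num_partitions] for i in range(num_partitions)]

    def compute(self, partition):
        return self.chunks[partition]

class MappedRDD(RDD):
    """RDD produced by a map() transformation over a parent RDD."""
    def __init__(self, parent, fn):
        super().__init__(parent.partitions, [parent])
        self.fn = fn

    def compute(self, partition):
        return [self.fn(x) for x in self.dependencies[0].compute(partition)]

rdd = MappedRDD(ParallelRDD([1, 2, 3, 4], 2), lambda x: x * 10)
print([rdd.compute(p) for p in rdd.partitions])  # [[10, 30], [20, 40]]
```

Note how `MappedRDD` stores no data: it only records its parent (dependency) and the function to apply, which is exactly the lazy-computation model the next slides describe.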

RDD

• RDD is the main and only tool for data manipulation in Spark

• Two classes of operations:
  – Transformations
  – Actions

RDD

Lazy computation model: transformations cause only a metadata change.


DAG

Directed Acyclic Graph – a sequence of computations performed on data

• Node – an RDD partition
• Edge – a transformation on top of the data
• Acyclic – the graph cannot return to an older partition
• Directed – a transformation is an action that transitions a data partition's state (from A to B)
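The acyclicity constraint is easy to check mechanically. A small sketch representing the word-count lineage as an edge list and topologically sorting it (illustrative structure, not Spark's internals; `graphlib` is in the Python 3.9+ standard library):

```python
from graphlib import TopologicalSorter

# Word-count lineage: each (parent, child) edge is a transformation.
edges = [
    ("textFile", "flatMap"),
    ("flatMap", "map"),
    ("map", "reduceByKey"),
    ("reduceByKey", "foreach"),
]

# Map each node to the set of its predecessors.
graph = {}
for parent, child in edges:
    graph.setdefault(child, set()).add(parent)

# TopologicalSorter raises CycleError if the graph is not acyclic,
# i.e. if some transformation would return to an older partition.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['textFile', 'flatMap', 'map', 'reduceByKey', 'foreach']
```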

DAG – WordCount Example

[Diagram: HDFS input splits become RDD partitions; one RDD per step of sc.textFile('hdfs://…') → flatMap → map → reduceByKey → foreach]



Spark Cluster

[Diagram: a Driver Node running the Driver with its Spark Context, and three Worker Nodes, each running two Executors; every Executor holds a Cache and executes multiple Tasks]


Spark Cluster

• Driver
  – Entry point of the Spark Shell (Scala, Python, R)
  – The place where the SparkContext is created
  – Translates RDDs into the execution graph
  – Splits the graph into stages
  – Schedules tasks and controls their execution
  – Stores metadata about all the RDDs and their partitions
  – Brings up the Spark WebUI with job information


Spark Cluster

• Executor
  – Stores data in cache, in the JVM heap or on HDDs
  – Reads data from external sources
  – Writes data to external sources
  – Performs all the data processing

Executor Memory

Spark Cluster – Detailed

Spark Cluster – PySpark

Application Decomposition

• Application – a single instance of SparkContext that stores some data-processing logic and can schedule a series of jobs, sequentially or in parallel (SparkContext is thread-safe)

• Job – a complete set of transformations on an RDD that finishes with an action or data saving, triggered by the driver application


Application Decomposition

• Stage – a set of transformations that can be pipelined and executed by a single independent worker. Usually it is all the transformations between a "read", "shuffle", "action", or "save"

• Task – execution of a stage on a single data partition; the basic unit of scheduling
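For the WordCount job on the following slides this decomposition is concrete. A hedged sketch of the counting, with the stage boundary rule as stated above (names illustrative, not Spark's scheduler):

```python
def split_into_stages(transformations, shuffle_ops=("reduceByKey",)):
    # A stage pipelines consecutive transformations; a shuffle boundary
    # (here only reduceByKey) starts a new stage.
    stages, current = [], []
    for t in transformations:
        if t in shuffle_ops and current:
            stages.append(current)
            current = []
        current.append(t)
    stages.append(current)
    return stages

job = ["textFile", "flatMap", "map", "reduceByKey", "foreach"]
stages = split_into_stages(job)
num_partitions = 4
print(stages)
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'foreach']]
print(len(stages) * num_partitions)  # 8 tasks, as on the WordCount slides
```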


Stage 1 and Stage 2 – WordCount Example

[Diagram: HDFS input splits are read into RDD partitions; Stage 1 (Tasks 1–4) pipelines textFile → flatMap → map on each partition; a partition shuffle feeds Stage 2 (Tasks 5–8), which pipelines reduceByKey → foreach and writes back to HDFS]

Page 70: Spark Architecture · Spark Architecture Spark Shuffle ... Spark Shuffle Spark DataFrames . ... – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext

Persistence in SparkPersistence Level Description MEMORY_ONLY Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in

memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY Store the RDD partitions only on disk.

MEMORY_ONLY_2, DISK_ONLY_2, etc.

Same as the levels above, but replicate each partition on two cluster nodes.

Persistence in Spark

• Spark considers memory as a cache with LRU eviction rules

• If "Disk" is involved, data is evicted to disks:

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))
rdd.cache().count()
rdd.unpersist()  # a persisted RDD's storage level cannot be changed in place
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER).count()
rdd.unpersist()
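The LRU rule itself is easy to picture. A small cache sketch in plain Python — capacity measured in "partitions", purely illustrative of the eviction policy, not of Spark's block manager:

```python
from collections import OrderedDict

class LRUCache:
    """Keeps at most `capacity` entries, evicting the least recently used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.evicted = []  # what fell out (Spark would recompute or spill it)

    def access(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency on re-access
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            old_key, _ = self.entries.popitem(last=False)  # drop the LRU entry
            self.evicted.append(old_key)

cache = LRUCache(capacity=2)
cache.access("partition-0", "A")
cache.access("partition-1", "B")
cache.access("partition-0", "A")  # partition-0 becomes the most recent
cache.access("partition-2", "C")  # evicts partition-1, the LRU entry
print(list(cache.entries), cache.evicted)
# ['partition-0', 'partition-2'] ['partition-1']
```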



Shuffles in Spark

• Hash Shuffle – the default prior to 1.2.0

• Sort Shuffle – the default now

• Tungsten Sort – a new, optimized one

Hash Shuffle

[Diagram: inside the Executor JVM, spark.executor.cores / spark.task.cpus "map" tasks run concurrently over the input partitions; each "map" task writes one output file per "reducer" into spark.local.dir, so each executor produces (number of "map" tasks executed by this executor) × (number of "reducers") output files]

Page 81: Spark Architecture · Spark Architecture Spark Shuffle ... Spark Shuffle Spark DataFrames . ... – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext

Hash Shuffle with Consolidation

Page 82: Spark Architecture · Spark Architecture Spark Shuffle ... Spark Shuffle Spark DataFrames . ... – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext

Executor JVM P

artit

ion

Par

titio

n

Par

titio

n

Par

titio

n

Par

titio

n

Par

titio

n

Par

titio

n

Par

titio

n

Hash Shuffle With Consolidation

Page 83: Spark Architecture · Spark Architecture Spark Shuffle ... Spark Shuffle Spark DataFrames . ... – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext

Local Directory Executor JVM P

artit

ion

Par

titio

n

Par

titio

n

Par

titio

n

Par

titio

n

Par

titio

n

Par

titio

n

Par

titio

n

“map” task

Num

ber o

f “R

educ

ers”

Output File

Output File

Output File

spark.local.dir

Hash Shuffle With Consolidation sp

ark.

exec

utor

.cor

es /

spar

k.ta

sk.c

pus

Pages 84–88: Hash Shuffle With Consolidation (animation steps of the same diagram)

[Diagram: up to spark.executor.cores / spark.task.cpus "map" tasks run concurrently, and later "map" tasks reuse the existing output file groups; the executor therefore keeps at most (spark.executor.cores / spark.task.cpus) × number of "reducers" shuffle files in spark.local.dir]
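As a rough illustration of why consolidation matters (this is arithmetic only, not Spark source code; the function names are hypothetical), the per-executor file counts can be compared like this:

```python
# Rough sketch (not Spark source code): number of shuffle files produced on
# one executor by plain hash shuffle vs. consolidated hash shuffle.

def hash_shuffle_files(map_tasks: int, reducers: int) -> int:
    # Plain hash shuffle: every "map" task writes one file per "reducer".
    return map_tasks * reducers

def consolidated_shuffle_files(executor_cores: int, task_cpus: int,
                               reducers: int) -> int:
    # With consolidation there is one file group per concurrently running
    # task slot (spark.executor.cores / spark.task.cpus), each group holding
    # one file per "reducer"; later map tasks reuse existing groups.
    concurrent_tasks = executor_cores // task_cpus
    return concurrent_tasks * reducers

# Example: 100 map tasks, 8 cores, 1 CPU per task, 200 reducers.
print(hash_shuffle_files(100, 200))           # 20000 files
print(consolidated_shuffle_files(8, 1, 200))  # 1600 files
```

The output-file count stops depending on the number of map tasks and depends only on the executor's concurrency and the number of reducers.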

Page 89: Sort Shuffle

Page 90: Sort Shuffle

[Diagram: Executor JVM with RDD partitions cached in the storage memory pool, spark.storage.[safetyFraction * memoryFraction]]

Pages 91–96: Sort Shuffle (animation steps of the same diagram)

[Diagram: up to spark.executor.cores / spark.task.cpus "map" tasks run concurrently; each writes its output into an AppendOnlyMap in the shuffle memory pool, spark.shuffle.[safetyFraction * memoryFraction]; when the map fills up, its contents are sorted and spilled to the local directory, each spill producing an output file plus an index file]

Page 97: Sort Shuffle

[Diagram: "reduce" tasks read the spilled output files and combine them with a MinHeap merge]
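The sort-shuffle flow sketched on these slides can be illustrated with a small toy (not Spark source code — a plain dict-free buffer stands in for AppendOnlyMap, and `heapq.merge` stands in for the MinHeap merge):

```python
# Toy sketch of the sort-shuffle idea: buffer records in memory, spill runs
# sorted by reducer partition id, then merge the sorted spills with a min-heap.
import heapq

def sort_shuffle(records, num_partitions, buffer_limit):
    spills = []   # each spill is a sorted list of (partition_id, key, value)
    buffer = []
    for key, value in records:
        buffer.append((hash(key) % num_partitions, key, value))
        if len(buffer) >= buffer_limit:
            spills.append(sorted(buffer))   # "sort & spill"
            buffer = []
    if buffer:
        spills.append(sorted(buffer))
    # MinHeap merge over all sorted spill runs, as the "reduce" side does.
    return list(heapq.merge(*spills))

merged = sort_shuffle([("a", 1), ("b", 2), ("a", 3), ("c", 4)],
                      num_partitions=2, buffer_limit=2)
# Records come out globally ordered by partition id, so each "reduce" task
# can read its partition as one contiguous range.
assert merged == sorted(merged)
```

The key property is that each spill is individually sorted, so the final pass is a cheap k-way merge rather than a full re-sort.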

Page 98: Tungsten Sort Shuffle

Page 99: Tungsten Sort Shuffle

[Diagram: Executor JVM with RDD partitions cached in the storage memory pool, spark.storage.[safetyFraction * memoryFraction]]

Pages 100–107: Tungsten Sort Shuffle (animation steps of the same diagram)

[Diagram: up to spark.executor.cores / spark.task.cpus "map" tasks run concurrently; each stores serialized data in a LinkedList<MemoryBlock> within the shuffle memory pool, spark.shuffle.[safetyFraction * memoryFraction], alongside an array of data pointers and partition IDs (long[]); sorting this array and spilling produces partitioned output files plus index files in spark.local.dir]

Page 108: Tungsten Sort Shuffle

[Diagram: the spill files of a "map" task are merged into a single partitioned output file plus an index file]
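The central trick on these slides — sorting an array of packed (partition ID, pointer) longs instead of the serialized records themselves — can be sketched as follows. This is a toy, not Spark source code, and the 40-bit pointer layout here is an arbitrary choice for the example, not Spark's actual bit layout:

```python
# Toy sketch of the Tungsten-sort idea: records stay put in serialized form;
# only a compact array of packed integers is sorted.

POINTER_BITS = 40  # arbitrary split chosen for this example

def pack(partition_id: int, pointer: int) -> int:
    # High bits: partition id; low bits: pointer into the serialized data
    # (which would live in a LinkedList<MemoryBlock> in Spark).
    return (partition_id << POINTER_BITS) | pointer

def unpack(packed: int):
    return packed >> POINTER_BITS, packed & ((1 << POINTER_BITS) - 1)

pointers = [pack(2, 100), pack(0, 40), pack(1, 7), pack(0, 8)]
pointers.sort()  # orders records by partition id without moving any data

assert [unpack(p)[0] for p in pointers] == [0, 0, 1, 2]
```

Because the partition ID occupies the high bits, an ordinary integer sort of the pointer array yields partition order for free, and the sorted pointers can then be walked to write each partition's bytes out contiguously.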

Page 109: Spark Architecture · Spark Architecture Spark Shuffle ... Spark Shuffle Spark DataFrames . ... – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext

Outline�  Spark Motivation

�  Spark Pillars

�  Spark Architecture

�  Spark Shuffle

�  Spark DataFrame

Page 110: DataFrame Idea

Page 111: DataFrame Implementation

- Interface
  – A DataFrame is an RDD with a schema: field names, field data types and statistics
  – Unified transformation interface in all languages; all transformations are passed to the JVM
  – Can be accessed as an RDD, in which case it is transformed into an RDD of Row objects

Page 112: DataFrame Implementation

- Internals
  – Internally it is the same RDD
  – Data is stored in a row-columnar format; the row chunk size is set by spark.sql.inMemoryColumnarStorage.batchSize
  – Each column in each partition stores min/max values for partition pruning
  – Allows a better compression ratio than a standard RDD
  – Delivers faster performance for small subsets of columns
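The min/max partition pruning mentioned above can be illustrated with a small toy (not Spark source code; all names here are hypothetical): each column chunk keeps its min and max, so a filter can skip whole chunks without scanning them.

```python
# Toy sketch of min/max partition pruning: keep (min, max) per column chunk
# and skip chunks whose range cannot contain the filtered value.

def build_stats(chunks):
    # One (min, max) pair per chunk of a column.
    return [(min(c), max(c)) for c in chunks]

def prune(stats, predicate_value):
    # Keep only chunk indices whose [min, max] range could contain the value.
    return [i for i, (lo, hi) in enumerate(stats) if lo <= predicate_value <= hi]

chunks = [[1, 5, 3], [10, 12, 11], [6, 9, 7]]
stats = build_stats(chunks)
print(prune(stats, 11))  # only chunk 1 can contain 11 -> [1]
```

An equality filter like `value = 11` then only needs to scan the surviving chunks, which is where the speedup for selective queries comes from.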

Page 113: Questions?

Page 114: BUILT FOR THE SPEED OF BUSINESS