
Spark

CSCE 678

Why Not MapReduce?

• "MapReduce greatly simplified big-data analysis on large, unreliable clusters" – Matei Zaharia

• Users want more complex, multi-stage applications (e.g., iterative ML, graph processing)

• And more interactive, real-time jobs, not just batch jobs

Why Not MapReduce?

[Figure: an iterative job (Input → Iter. 1 → Iter. 2 → …) and interactive queries (Query 1-4 → Result 1-4) share data through HDFS, with an HDFS write and read between every stage.]

HDFS simplifies fault tolerance, but is too slow due to replication and disk I/O.

Goal: In-Memory Data Sharing

[Figure: the same iterative job and interactive queries, now sharing data in memory instead of through HDFS.]

Memory is 10-100x faster than network/disk.

Goal: In-Memory Data Sharing

But how do we achieve fault tolerance: what if the DRAM holding the data crashes?

Resilient Distributed Dataset (RDD)

• Immutable collection of objects

• Statically typed

• Distributed across a cluster

val lines = spark.textFile("hdfs://log.txt")
val errors = lines.filter(_.startsWith("ERROR"))
val msgs = errors.map(_.split('\t')(2))

msgs.saveAsTextFile("msgs.txt")   // the action kicks off computation (lazy evaluation)

Lineage: lines --filter(_.startsWith("ERROR"))--> errors --map(_.split('\t')(2))--> msgs

Spark Programming Interface

• Scala: a functional (lambda-style) language on the JVM

• Can be run interactively from the Scala interpreter

• Interface for specifying dataflow between RDDs (illustrated in the sketch after this list):

• Transformations (map, filter, reduceByKey, etc.)

• Actions (count, collect, save, etc.)

• User-defined controls for partitioning and persistence
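To make the transformation/action split concrete, here is a minimal sketch, not from the slides, written in PySpark (the Scala shell version is analogous); it assumes an existing SparkContext named sc:

nums    = sc.parallelize(range(1, 1001))        # distribute a local collection as an RDD
evens   = nums.filter(lambda x: x % 2 == 0)     # transformation (lazy, nothing runs yet)
squares = evens.map(lambda x: x * x)            # transformation (lazy)

print(squares.count())                          # action: triggers the job, returns a count
print(squares.take(5))                          # action: a few concrete elements at the driver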

Fault Tolerance

• RDDs track lineage to rebuild lost data

counts = file.map(lambda rec: (rec.type, 1)) \
             .reduceByKey(lambda x, y: x + y) \
             .filter(lambda pair: pair[1] > 10)
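To look at that lineage, a small sketch (not from the slides; it assumes PySpark's toDebugString(), which returns the lineage description as bytes):

# The recorded lineage (map -> reduceByKey -> filter) is what Spark replays
# on surviving nodes to rebuild any lost partition of `counts`.
print(counts.toDebugString().decode("utf-8"))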


Partitioning

• RDDs know their partitioning functions

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)

The reduceByKey output is hash-partitioned by key; filter preserves the same partitioning.

Example: PageRank

1. Start each page with rank = 1

2. For iteration = 1, …, N, update rank(p) = Σ_{i links to p} rank(i) / |neighbors(i)|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}

Example: PageRank

• Repeatedly join links & ranks

• Co-partition them (e.g., hash on url) to avoid shuffling

• Specify the partitioner:

links = links.partitionBy(new URLPartitioner())
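For comparison, a rough PySpark sketch of the same co-partitioning idea (not from the slides; NUM_PARTITIONS is an assumed tuning value, and links/ranks are pair RDDs keyed by URL as above):

# Hash-partition both RDDs the same way so the join needs no shuffle of links.
NUM_PARTITIONS = 64                                  # assumption: tune for the cluster

links = links.partitionBy(NUM_PARTITIONS).cache()    # (url, neighbors) pairs, kept in memory
ranks = ranks.partitionBy(NUM_PARTITIONS)            # (url, rank) pairs, same partitioner

contribs = links.join(ranks)                         # co-partitioned join: links stay put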

Spark Operations

• Transformations (define a new RDD):

map, filter, sample, groupByKey, reduceByKey, sortByKey, aggregateByKey, flatMap, union, join, intersection, cogroup, distinct, cartesian, pipe, coalesce, mapPartitions, mapPartitionsWithIndex, repartition, repartitionAndSortWithinPartitions

• Actions (return a result to the driver program):

reduce, collect, count, first, takeSample, takeOrdered, countByKey, foreach, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile

How General is Spark?

• RDDs can express many parallel algorithms

• Dataflow models: Hadoop, SQL, Dryad, …

• Specialized models: graph processing, machine learning, iterative MapReduce (HaLoop), …

Spark Stack

[Figure: the Spark stack]

• Storage and data sources: HDFS, S3, Cassandra, HBase; cluster managers: YARN, Mesos

• Apache Spark Core: the RDD abstraction

• Libraries on top of the core: Spark SQL & DataFrame (with the Catalyst optimizer), Spark Streaming, MLlib (machine learning), GraphX (graph processing), SparkR (R on Spark)

Spark SQL

• Relational (SQL) query processing on Spark (a short usage sketch follows this list)

• Uses RDDs to define and partition database state

• Optimized with DBMS techniques:

• Caching

• Query optimizers
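As a minimal usage sketch (not from the slides): a DataFrame can be registered as a temporary view and queried with SQL. The UserID/Name/Age schema mirrors the figure on the next slide, and spark is assumed to be an existing SparkSession:

from pyspark.sql import Row

users = spark.createDataFrame([Row(UserID=1, Name="Ann", Age=34),
                               Row(UserID=2, Name="Bob", Age=27)])
users.createOrReplaceTempView("users")     # make the DataFrame visible to SQL

adults = spark.sql("SELECT Name, Age FROM users WHERE Age >= 30")
adults.show()                              # runs through the same Catalyst/RDD machinery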

DataFrame API

• Spark SQL uses DataFrame to express structured dataset with a fixed schema

[Figure: a plain RDD holds opaque values; a structured RDD (DataFrame) holds rows that all follow a fixed schema: UserID, Name, Age.]

DataFrame API

cities = spark.read.csv('cities', header=True, inferSchema=True)

cities.show()
+---------+----------+-------+
|     city|population|   area|
+---------+----------+-------+
|Vancouver|   2463431|2878.52|
+---------+----------+-------+

DataFrame API

cities.printSchema()
root
 |-- city: string (nullable = true)
 |-- population: integer (nullable = true)
 |-- area: double (nullable = true)

DataFrame API

cities = spark.read.csv('cities', header=True, inferSchema=True)

small_cities = cities.where(cities['area'] < 5000)
density = small_cities.select(cities['population'] / cities['area'])

density.show()
+-------------------+
|(population / area)|
+-------------------+
|         855.797710|
+-------------------+

Lazy evaluation: the transformations are only processed when the result is output (here, by show()).

Limitations

• DataFrame API provides many built-in functions

• Math: abs(), cos(), avg(), ceil(), …

• Data processing: asc(), base64(), …

• Aggregation: array(), array_contains(), …

• String: concat(), substring(), …

• You can't directly use arbitrary Scala/Python functions in these expressions, because they are not part of the DataFrame API

• What to do? Write User-Defined Functions (UDFs)

User-Defined Function (UDF)

import math

from pyspark.sql import functions, types

@functions.udf(returnType=types.DoubleType())
def my_logic(a, b):
    return a + 2 * b * math.log2(a)

res = df.select(my_logic(df['a'], df['b']).alias('res'))

The UDF is included directly in the select expression.

User-Defined Function (UDF)


UDFs are much slower than the DataFrame API, because every row must be serialized and handed to a Python worker process:

                      Time
Spark DataFrame API   10 s
Python UDF            437 s

Query Optimization

• An RDBMS usually has a query optimizer that rewrites a query into a more efficient, equivalent form

• Example: (X + 1) + (Y + 2) ➔ X + Y + 3

• Spark SQL: the Catalyst optimizer

• Transforms a query into equivalent logical plans

• Selects the best physical plan based on a cost model (see the sketch after this list)
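A minimal way to look at what Catalyst produced (a sketch assuming the cities DataFrame from the earlier slides): DataFrame.explain(True) prints the parsed, analyzed, and optimized logical plans plus the chosen physical plan:

dense = cities.where(cities['area'] < 5000) \
              .select(cities['population'] / cities['area'])

dense.explain(True)    # show Catalyst's logical plans and the selected physical plan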

References

• Spark class hosted by Stanford ICME:

http://stanford.edu/~rezab/sparkclass/

• RDD talk at NSDI (by Zaharia): https://www.usenix.org/sites/default/files/conference/protected-files/nsdi_zaharia.pdf
