www.cwi.nl/~boncz/bads Big Data Infrastructures & Technologies Spark and MLLIB


Page 1:

Big Data Infrastructures & Technologies

Spark and MLLIB

Page 2:

OVERVIEW OF SPARK

Page 5:

What is Spark?

• Fast and expressive cluster computing system interoperable with Apache Hadoop

• Improves efficiency through:

– In-memory computing primitives

– General computation graphs

• Improves usability through:

– Rich APIs in Scala, Java, Python

– Interactive shell

Up to 100× faster (2-10× on disk)

Often 5× less code

Page 6:

The Spark Stack

• Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS)

Components on top of the Spark core: Spark Streaming (real-time), GraphX (graph), Spark SQL, MLlib (machine learning)

More details: amplab.berkeley.edu

Page 7:

Why a New Programming Model?

• MapReduce greatly simplified big data analysis

• But as soon as it got popular, users wanted more:

– More complex, multi-pass analytics (e.g. ML, graph)

– More interactive ad-hoc queries

– More real-time stream processing

• All 3 need faster data sharing across parallel jobs

Page 14:

Data Sharing in MapReduce

Iterative: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .

Interactive: Input → HDFS read → {query 1, query 2, query 3, . . .} → {result 1, result 2, result 3}

Slow due to replication, serialization, and disk IO

Page 15: Big Data Infrastructures & Technologieshomepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/05-Spark a… · •MapReduce greatly simplified big data analysis •But as soon as it got

www.cwi.nl/~boncz/bads

Data Sharing in Spark

Iterative: Input → one-time processing → distributed memory → iter. 1 → iter. 2 → . . .

Interactive: Input → one-time processing → distributed memory → {query 1, query 2, query 3, . . .}

~10× faster than network and disk

Page 16: Big Data Infrastructures & Technologieshomepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-ISfBD/05-Spark a… · •MapReduce greatly simplified big data analysis •But as soon as it got

www.cwi.nl/~boncz/bads

Spark Programming Model

• Key idea: resilient distributed datasets (RDDs)

– Distributed collections of objects that can be cached in memory across the cluster

– Manipulated through parallel operators

– Automatically recomputed on failure

• Programming interface

– Functional APIs in Scala, Java, Python

– Interactive use from Scala shell
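The lazy-evaluation contract behind RDDs — transformations only record lineage, and an action triggers computation — can be sketched with a toy local class (illustrative only, not Spark's actual API):

```python
# Toy illustration of lazy transformations vs. actions (not Spark's real API).
class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data          # the "partition" contents
        self.ops = ops or []      # recorded transformations (the lineage)

    def map(self, f):             # transformation: recorded, not executed
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):          # transformation: recorded, not executed
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):            # action: replays the lineage over the data
        out = self.data
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```

The same recorded lineage is what allows lost partitions to be recomputed after a failure.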

Page 28:

Lambda Functions

A lambda function (from functional programming) is an implicit, inline function definition:

errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])

The first lambda is shorthand for a named predicate, e.g. (C-style pseudocode):

bool detect_error(string x) {
    return x.startswith("ERROR");
}
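These lambdas behave exactly like Python's built-in filter/map over ordinary lists, which makes them easy to try without a cluster (the sample log lines below are made up for illustration):

```python
# The slide's lambdas, run on a plain Python list instead of an RDD.
lines = ["ERROR\tdisk\tfailed to mount", "INFO\tboot\tok", "ERROR\tnet\ttimeout"]

errors = list(filter(lambda x: x.startswith("ERROR"), lines))
messages = list(map(lambda x: x.split('\t')[2], errors))
print(messages)  # ['failed to mount', 'timeout']

# Equivalent with a named function (what the lambda abbreviates):
def detect_error(x):
    return x.startswith("ERROR")

assert list(filter(detect_error, lines)) == errors
```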

Page 45:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()             # action
messages.filter(lambda x: "bar" in x).count()
. . .

The driver turns each action into tasks, one per input block (Block 1-3); each worker processes its block, caches its messages partition (Cache 1-3), and sends results back. Later actions reuse the cached partitions.

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 46:

Fault Tolerance

RDDs track lineage info to rebuild lost data:

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda pair: pair[1] > 10)

Lineage graph: Input file → map → reduceByKey → filter


Page 49:

Example: Logistic Regression

(Chart: running time (s), 0-4000, vs number of iterations 1-30, for Hadoop and Spark)

Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
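Logistic regression is the archetypal multi-pass algorithm in this benchmark: every iteration rereads the same dataset, which is why caching it in memory pays off. A minimal local sketch of the update loop (plain-Python gradient descent on toy data, not MLlib's implementation):

```python
import math

# Tiny logistic regression by gradient descent -- the kind of multi-pass
# loop that benefits from keeping the dataset cached in memory.
data = [((0.0, 1.0), 0), ((1.0, 1.0), 0), ((3.0, 1.0), 1), ((4.0, 1.0), 1)]
w = [0.0, 0.0]
lr = 0.1

for _ in range(200):                  # each iteration re-reads `data`
    grad = [0.0, 0.0]
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1])))
        for j in range(2):
            grad[j] += (p - y) * x[j]
    for j in range(2):
        w[j] -= lr * grad[j]

def predict(x):
    return 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1]))) > 0.5

print(predict((0.0, 1.0)), predict((4.0, 1.0)))  # False True
```

In MapReduce each of those 200 passes would pay an HDFS read; in Spark the cached dataset is reread from memory.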

Page 50:

Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

Page 51:

Supported Operators

• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...
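The semantics of the pair-RDD operators can be checked locally; here is a plain-Python stand-in for two of them (illustrative, single-machine, no partitioning):

```python
from collections import defaultdict

# Local stand-ins showing what two pair-RDD operators compute.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def reduce_by_key(f, kvs):
    """reduceByKey: merge values per key with an associative function."""
    acc = {}
    for k, v in kvs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

print(reduce_by_key(lambda x, y: x + y, pairs))  # [('a', 4), ('b', 6)]

def group_by_key(kvs):
    """groupByKey: collect all values per key."""
    acc = defaultdict(list)
    for k, v in kvs:
        acc[k].append(v)
    return sorted(acc.items())

print(group_by_key(pairs))  # [('a', [1, 3]), ('b', [2, 4])]
```

On a cluster, reduceByKey pre-aggregates within partitions before shuffling, which is why it is usually preferred over groupByKey followed by a reduce.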

Page 52:

Software Components

• Spark client is a library in the user program (1 instance per app)

• Runs tasks locally or on a cluster
– Mesos, YARN, standalone mode

• Accesses storage systems via the Hadoop InputFormat API
– Can use HBase, HDFS, S3, ...

(Diagram: your application holds a SparkContext, which uses local threads or a cluster manager to launch Spark executors on workers, over HDFS or other storage)

Page 53:

Task Scheduler

• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware, to avoid shuffles

(Diagram: a DAG of RDDs A-F, built from map, groupBy, join, and filter, split into Stages 1-3; cached partitions allow already-computed stages to be skipped)

Page 54:

Spark SQL

• Columnar SQL analytics engine for Spark
– Supports both SQL and complex analytics

– Up to 100X faster than Apache Hive

• Compatible with Apache Hive

– HiveQL, UDF/UDAF, SerDes, Scripts

– Runs on existing Hive warehouses

• In use at Yahoo! for fast in-memory OLAP

Page 55:

Hive Architecture

(Diagram: clients (CLI, JDBC) → Driver: SQL Parser → Query Optimizer → Physical Plan → Execution on MapReduce; a Catalog holds metadata; data lives in HDFS)

Page 56:

Spark SQL Architecture

(Diagram: clients (CLI, JDBC) → Driver: SQL Parser → Query Optimizer → Physical Plan → Execution, as in Hive, but with a Cache Manager added and execution on Spark instead of MapReduce)

[Engle et al, SIGMOD 2012]

Page 57:

What Makes it Faster?

• Lower-latency engine (Spark OK with 0.5s jobs)

• Support for general DAGs

• Column-oriented storage and compression

• New optimizations (e.g. map pruning)
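To see why column-oriented storage helps, note that values of one attribute stored contiguously compress well, e.g. with run-length encoding (a sketch of the idea, not Spark SQL's actual storage format):

```python
# Row layout vs. column layout, with run-length encoding on a column.
rows = [("NL", 2013), ("NL", 2013), ("NL", 2014), ("US", 2014)]

# Column layout: one list per attribute.
countries = [r[0] for r in rows]
years = [r[1] for r in rows]

def rle(col):
    """Run-length encode a column: [(value, run_length), ...]"""
    out = []
    for v in col:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

print(rle(countries))  # [('NL', 3), ('US', 1)]
print(rle(years))      # [(2013, 2), (2014, 2)]
```

In a row layout the repeated values are interleaved with other attributes, so runs like these never form; columnar layouts also let a scan read only the columns a query touches.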

Page 58:

Other Spark Stack Projects

• Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)
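The reduceByWindow call aggregates over the last few micro-batches; the underlying idea can be sketched in plain Python (a local stand-in with a hypothetical 3-batch window, not Spark Streaming's API):

```python
from collections import Counter, deque

# A local sketch of windowed word counting (the idea behind reduceByWindow):
# keep only the micro-batches inside the window and re-aggregate them.
window = deque(maxlen=3)   # e.g. three 1-second micro-batches = a 3 s window

def on_batch(tweets):
    words = Counter(w for t in tweets for w in t.split(" "))
    window.append(words)               # oldest batch falls out automatically
    return sum(window, Counter())      # counts over the current window

on_batch(["spark is fast"])
on_batch(["spark streaming"])
counts = on_batch(["hello spark"])
print(counts["spark"])  # 3
```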

• MLlib: Library of high-quality machine learning algorithms (out since 0.8)

Page 61:

Performance

SQL (response time, s, 0-25): Impala (disk), Impala (mem), Redshift, Spark SQL (disk), Spark SQL (mem)

Streaming (throughput, MB/s/node, 0-35): Storm, Spark

Graph (response time, min, 0-30): Hadoop, Giraph, GraphX

Page 65:

What it Means for Users

• Separate frameworks: HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → . . .

• Spark: HDFS read → ETL → train → query (intermediate data stays in memory)

Page 66:

Conclusion

• Big data analytics is evolving to include:

– More complex analytics (e.g. machine learning)

– More interactive ad-hoc queries

– More real-time stream processing

• Spark is a fast platform that unifies these apps

• More info: spark-project.org

Page 67

SPARK MLLIB

Page 68

What is MLLIB?

MLlib is a Spark subproject providing machine learning primitives:

• initial contribution from AMPLab, UC Berkeley

• shipped with Spark since version 0.8

Page 69

What is MLLIB? Algorithms:

• classification: logistic regression, linear support vector machine (SVM), naive Bayes
• regression: generalized linear regression (GLM)
• collaborative filtering: alternating least squares (ALS)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal component analysis (PCA)
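One of the listed algorithms, k-means, is simple enough to sketch in plain Python, independent of MLlib. This is a minimal single-machine version of Lloyd's algorithm on hypothetical 2-D points, not the distributed implementation Spark ships:

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain-Python k-means sketch (Lloyd's algorithm) on 2-D points."""
    random.seed(seed)
    centers = random.sample(points, k)  # initialize centers from the data
    for _ in range(iterations):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster
        for c, members in enumerate(clusters):
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers

# two well-separated groups of points
pts = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.1), (9.2, 8.9)]
centers = sorted(kmeans(pts, k=2))
```

For well-separated data like this, the centers converge to the two group means regardless of the random initialization.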

Page 70

Collaborative Filtering

Page 71

Alternating Least Squares (ALS)
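The idea behind ALS can be illustrated with a rank-1 model in plain Python (a toy sketch with hypothetical data, not the MLlib implementation): each rating r(u, i) is approximated by the product of a user factor and an item factor, and we alternate closed-form least-squares updates, fixing one side while solving for the other:

```python
def als_rank1(ratings, users, items, iterations=20):
    """Toy rank-1 ALS: approximate ratings[(u, i)] by uf[u] * vf[i].

    ratings: dict mapping (user, item) -> observed rating.
    With vf fixed, the least-squares optimum for uf[u] is
    sum(r * vf[i]) / sum(vf[i]**2) over user u's observed items,
    and symmetrically for the item factors.
    """
    uf = {u: 1.0 for u in users}
    vf = {i: 1.0 for i in items}
    for _ in range(iterations):
        for u in users:  # fix item factors, solve each user factor
            obs = [(i, r) for (x, i), r in ratings.items() if x == u]
            uf[u] = (sum(r * vf[i] for i, r in obs)
                     / sum(vf[i] ** 2 for i, _ in obs))
        for i in items:  # fix user factors, solve each item factor
            obs = [(u, r) for (u, y), r in ratings.items() if y == i]
            vf[i] = (sum(r * uf[u] for u, r in obs)
                     / sum(uf[u] ** 2 for u, _ in obs))
    return uf, vf

# two users with proportional tastes, so a rank-1 model fits exactly
R = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 2.0, (1, 'b'): 4.0}
uf, vf = als_rank1(R, users=[0, 1], items=['a', 'b'])
```

Real ALS uses rank-k factor vectors and regularization, but each update is the same kind of small least-squares solve, which is what makes the per-user and per-item work easy to distribute.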

Page 72

Collaborative Filtering in Spark MLLIB

from pyspark.mllib.recommendation import ALS, Rating

trainset = (sc.textFile("s3n://bads-music-dataset/train_*.gz")
              .map(lambda l: l.split('\t'))
              .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2]))))

model = ALS.train(trainset, rank=10, iterations=10)  # train

testset = (sc.textFile("s3n://bads-music-dataset/test_*.gz")  # load testing set
             .map(lambda l: l.split('\t'))
             .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2]))))

# apply model to testing set (only first two cols) to predict
predictions = (model.predictAll(testset.map(lambda p: (p[0], p[1])))
                    .map(lambda r: ((r[0], r[1]), r[2])))
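A natural next step, not shown on the slide, is to score the predictions by joining them with the observed test ratings on the (user, item) key and computing the mean squared error. The pairing logic behind that RDD join can be sketched in plain Python with dicts standing in for the keyed RDDs (hypothetical ratings):

```python
def mse(actual, predicted):
    """Mean squared error over the (user, item) keys present in both dicts,
    mirroring the key-based join Spark would perform on the two RDDs."""
    common = actual.keys() & predicted.keys()
    return sum((actual[k] - predicted[k]) ** 2 for k in common) / len(common)

actual = {(1, 10): 3.0, (1, 11): 5.0, (2, 10): 4.0}
predicted = {(1, 10): 2.5, (1, 11): 5.5, (2, 10): 4.0}
err = mse(actual, predicted)  # (0.25 + 0.25 + 0.0) / 3
```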

Page 73

Spark MLLIB – ALS Performance

System     Wall-clock time (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481

• Dataset: Netflix data
• Cluster: 9 machines
• MLlib is an order of magnitude faster than Mahout.
• MLlib is within a factor of 2 of GraphLab.

Page 74

Spark Implementation of ALS

• Workers load data
• Models are instantiated at workers.
• At each iteration, models are shared via join between workers.
• Good scalability.
• Works on large datasets

[Diagram: Master and Workers]

Page 75

Spark SQL + MLLIB

Page 76

MLLIB Pointers

• Website: http://spark.apache.org

• Tutorials: http://ampcamp.berkeley.edu

• Spark Summit: http://spark-summit.org

• Github: https://github.com/apache/spark

• Mailing lists: [email protected] [email protected]