Stratosphere v0.4
Stephan Ewen ([email protected])

Release Candidate V 0 - Stratosphere (stratosphere.eu/assets/slides/stephan_ewen-stratosphere...)

Release Preview

Official release coming end of November

Hands-on sessions today with the latest code snapshot

New Features in a Nutshell

• Declarative Scala Programming API

• Iterative Programs
  o Bulk (batch-to-batch in memory) and Incremental (Delta Updates)
  o Automatic caching and cross-loop optimizations

• Runs on top of YARN (Hadoop Next Gen)

• Various deployment methods
  o VMs, Debian packages, EC2 scripts, ...

• Many usability fixes and bugfixes

Stratosphere System Stack

[Figure: layered system stack, top to bottom]

• APIs: Sky Java API, Sky Scala API, Meteor, ...
• Stratosphere Optimizer
• Stratosphere Runtime
• Storage: HDFS, Local Files, S3, ...
• Cluster Manager: YARN, EC2, Direct

MapReduce is nice and good, but...

[Figure: a program expressed as a cascade of chained Map and Reduce jobs]

Very verbose and low level. Only usable by system programmers.

Anything slightly more complex must be written as a cascade of jobs, which loses performance and optimization potential.

SQL (or Hive or Pig) is nice and good, but...

• Allows you to do a subset of the tasks efficiently and elegantly

• What about the cases that do not fit SQL?
  o Custom types
  o Custom non-relational functions (they occur a lot!)
  o Iterative algorithms (machine learning, graph analysis)

• How does it look to mix SQL with MapReduce?

SQL (or Hive or Pig) is nice and good, but...

Pig

  A = load 'WordcountInput.txt';
  B = MAPREDUCE wordcount.jar
        store A into 'inputDir'
        load 'outputDir' as (word:chararray, count: int)
        'org.myorg.WordCount inputDir outputDir';
  C = sort B by count;

Hive

  FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    AS dt, uid
    CLUSTER BY dt) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.dt, map_output.uid
    USING 'reduce_script'
    AS date, count;

• Program Fragmentation
• Impedance Mismatch
• Breaks optimization

Sky Language

MapReduce style functions (Map, Reduce, Join, CoGroup, Cross, ...)

Relational Set Operations (filter, map, group, join, aggregate, ...)

Database / UDF Runtime

Scala Embedded Language

Optimizer

Write like a programming language, execute like a database...

Sky Language

Add a bit of "languages and compilers" sauce to the database stack

Scala API by Example

• The classical word count example

val input = TextFile(textInput)

val words = input flatMap { line => line.split("\\W+") }

val counts = words groupBy { word => word } count()

Scala API by Example

• The classical word count example

Annotations on the program above:

• TextFile(textInput): in-situ data source
• flatMap { line => line.split("\\W+") }: transformation function
• groupBy { word => word }: group by entire data type (the words)
• count(): count per group
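
To try the word count logic without a Stratosphere installation, the same three steps can be sketched on plain Scala collections. This is only an illustration under that assumption: `WordCountSketch` and `wordCount` are hypothetical names, a `Seq[String]` stands in for the `TextFile` source, and nothing here runs through the Stratosphere optimizer or runtime.

```scala
object WordCountSketch {
  // Plain Scala collections stand in for Stratosphere data sets.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // Transformation function: split each line into words.
    // Empty strings (from leading separators) are dropped.
    val words = lines.flatMap(_.split("\\W+")).filter(_.nonEmpty)
    // Group by the word itself, then count per group.
    words.groupBy(identity).map { case (word, group) => (word, group.size) }
  }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "be quick"))
    counts.toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w\t$c") }
  }
}
```

The shape mirrors the slide: one source, one `flatMap`, one grouping with a count.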

Scala API by Example

• Graph Triangles (Friend-of-a-Friend problem)
  o Recommending friends, finding important connections

• 1) Enumerate candidate triads

• 2) Close as triangles

Scala API by Example

case class Edge(from: Int, to: Int)
case class Triangle(apex: Int, base1: Int, base2: Int)

val vertices = DataSource("hdfs:///...", CsvFormat[Edge])

val byDegree = vertices map { projectToLowerDegree }
val byID = byDegree map { (x) => if (x.from < x.to) x else Edge(x.to, x.from) }

val triads = byDegree groupBy { _.from } reduceGroup { buildTriads }

val triangles = triads join byID
                  where { t => (t.base1, t.base2) }
                  isEqualTo { e => (e.from, e.to) }
                  map { (triangle, edge) => triangle }

Scala API by Example

Annotations on the program above:

• case class Edge, case class Triangle: custom data types
• DataSource("hdfs:///...", CsvFormat[Edge]): in-situ data source

Scala API by Example

Further annotations on the same program:

• Relational join: triads join byID where { ... } isEqualTo { ... }
• Non-relational library function
• Non-relational function

Scala API by Example

Further annotations on the same program:

• Key references: the selector functions, such as groupBy { _.from }, where { t => (t.base1, t.base2) } and isEqualTo { e => (e.from, e.to) }
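
The slides do not show the bodies of `projectToLowerDegree` and `buildTriads`. The sketch below is one possible reading of the two steps (enumerate candidate triads, then close them as triangles), written on plain Scala collections. The helper bodies, the tie-break by vertex id, and the name `TriangleSketch` are assumptions made for illustration, not the Stratosphere implementation.

```scala
object TriangleSketch {
  case class Edge(from: Int, to: Int)
  case class Triangle(apex: Int, base1: Int, base2: Int)

  def triangles(edges: Seq[Edge]): Set[Triangle] = {
    // Undirected, de-duplicated edges with from < to (the role of byID).
    val byID = edges.filter(e => e.from != e.to)
      .map(e => if (e.from < e.to) e else Edge(e.to, e.from)).distinct

    // Vertex degrees, used to orient each edge away from its lower-degree endpoint.
    val degree: Map[Int, Int] = byID.flatMap(e => Seq(e.from, e.to))
      .groupBy(identity).map { case (v, vs) => (v, vs.size) }

    // Assumed projectToLowerDegree: ties are broken by vertex id.
    def projectToLowerDegree(e: Edge): Edge = {
      val (a, b) = (e.from, e.to)
      if (degree(a) < degree(b) || (degree(a) == degree(b) && a < b)) Edge(a, b)
      else Edge(b, a)
    }
    val byDegree = byID.map(projectToLowerDegree)

    // Assumed buildTriads: every pair of neighbours of an apex is a candidate.
    val triads = byDegree.groupBy(_.from).toSeq.flatMap { case (apex, out) =>
      val ns = out.map(_.to).sorted
      for (i <- ns.indices; j <- (i + 1) until ns.size)
        yield Triangle(apex, ns(i), ns(j))
    }

    // Close triads: keep those whose base edge actually exists.
    val edgeSet = byID.map(e => (e.from, e.to)).toSet
    triads.filter(t => edgeSet((t.base1, t.base2))).toSet
  }
}
```

Orienting every edge from the lower-ranked to the higher-ranked endpoint means each triangle is enumerated from exactly one apex, which keeps the number of candidate triads small on skewed graphs.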

Optimizing Programs

• Program optimization happens in two phases
  1. Data type and function code analysis inside the Scala Compiler
  2. Relational-style optimization of the data flow

[Figure: compilation pipeline spanning the Scala Compiler, the Stratosphere Optimizer, and the Run Time. Box labels: Parser, Program, Type Checker, Analyze Data Types, Generate Glue Code, Code Generation, Instantiate, Optimize, Finalize Glue Code, Create Schedule, Instantiate, Execution]

Type Analysis/Code Gen

• Types and Key Selectors are mapped to flat schema

• Generated code for interaction with runtime

Type category → flat schema, with examples:

• Primitive Types, Arrays, Lists → Single Value
    Int, Double, Array[String], ...

• Tuples / Classes → Tuples
    (a: Int, b: Int, c: String) → (a: Int, b: Int, c: String)
    class T(x: Int, y: Long) → (x: Int, y: Long)

• Nested Types → Recursively flattened
    class T(x: Int, y: Long) → (x: Int, y: Long)
    class R(id: String, value: T) → (id:String, x:Int, y:Long)

• Recursive types → Tuples (w/ BLOB for recursion)
    class Node(id: Int, left: Node, right: Node) → (id:Int, left:BLOB, right:BLOB)
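
The flattening rules can be sketched as a small recursive function. The description types below (`Primitive`, `Composite`, `SelfReference`) are hypothetical and exist only for this illustration; the real analysis runs inside the Scala compiler, as the previous slide says, and this sketch makes no attempt to handle field-name collisions.

```scala
object SchemaFlattenSketch {
  // A tiny, hypothetical description of types; not the Stratosphere type analyzer.
  sealed trait Type
  case class Primitive(name: String) extends Type                  // Int, Double, Array[String], ...
  case class Composite(fields: List[(String, Type)]) extends Type  // tuples and classes
  case object SelfReference extends Type                           // field that refers back to an enclosing class

  // Flatten a type into (fieldName, typeName) columns.
  def flatten(t: Type, field: String = "value"): List[(String, String)] = t match {
    case Primitive(name)   => List(field -> name)    // single value
    case SelfReference     => List(field -> "BLOB")  // recursion is stored opaquely
    case Composite(fields) => fields.flatMap { case (f, ft) => flatten(ft, f) }
  }

  // class T(x: Int, y: Long)
  val t = Composite(List("x" -> Primitive("Int"), "y" -> Primitive("Long")))
  // class R(id: String, value: T)
  val r = Composite(List("id" -> Primitive("String"), "value" -> t))
  // class Node(id: Int, left: Node, right: Node)
  val node = Composite(List("id" -> Primitive("Int"), "left" -> SelfReference, "right" -> SelfReference))
}
```

Calling `flatten(r)` yields the columns id, x, y, and `flatten(node)` yields id plus two BLOB columns, matching the rows listed above.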

Optimization

val orders = DataSource(...)
val items = DataSource(...)

val filtered = orders filter { ... }

val prio = filtered join items where { _.id } isEqualTo { _.id }
             map { (o, li) => PricedOrder(o.id, o.priority, li.price) }

val sales = prio groupBy { p => (p.id, p.priority) } aggregate ({ _.price }, SUM)

case class Order(id: Int, priority: Int, ...)
case class Item(id: Int, price: Double, ...)
case class PricedOrder(id, priority, price)

[Figure: the program's data flow (Orders → Filter → Join with Items → Grp/Agg) annotated with key fields (∅), (0) = (0) and (0,1), next to an optimized plan annotated with partition(0), sort (0,1), partition(0), sort (0)]
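
What the program computes can be sketched on plain Scala collections. This shows only the logical result; the point of the slide is that the Stratosphere optimizer picks the physical strategy (partitioning, sorting), and none of that is modelled here. The reduced case classes, the `keep` predicate standing in for the elided filter, and the name `PricedOrdersSketch` are assumptions.

```scala
object PricedOrdersSketch {
  // Simplified records: the fields elided on the slide are left out.
  case class Order(id: Int, priority: Int)
  case class Item(id: Int, price: Double)
  case class PricedOrder(id: Int, priority: Int, price: Double)

  def sales(orders: Seq[Order], items: Seq[Item], keep: Order => Boolean): Map[(Int, Int), Double] = {
    val filtered = orders.filter(keep)

    // Hash join on id: build a lookup table over the filtered orders, probe with the items.
    val byId = filtered.groupBy(_.id)
    val prio = for {
      li <- items
      o  <- byId.getOrElse(li.id, Seq.empty)
    } yield PricedOrder(o.id, o.priority, li.price)

    // Group by (id, priority) and sum the price.
    prio.groupBy(p => (p.id, p.priority)).map { case (key, ps) => (key, ps.map(_.price).sum) }
  }
}
```

Because the join key (id) is a prefix of the grouping key (id, priority), data that is already partitioned by id for the join does not need to be repartitioned for the aggregation; this appears to be the reuse that the partition(0) annotations in the plan point to.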

Iterative Programs

• Many programs have a loop and make multiple passes over the data
  o Machine Learning algorithms iteratively refine the model
  o Graph algorithms propagate information hop by hop

[Figure: "Loop outside the system" (a Client triggers one Step job after another) versus "Loop inside the system" (a single Iteration)]

Why Iterations

• Algorithms that need iterations
  o Clustering (K-Means, …)
  o Gradient descent
  o PageRank
  o Logistic Regression
  o Path algorithms on graphs (shortest paths, centralities, …)
  o Graph communities / dense sub-components
  o Inference (belief propagation)
  o …

All the hot algorithms for building predictive models

Two Types of Iterations

• Bulk Iterations: Initial Dataset → Iterative Function → Result

• Incremental Iterations (aka Workset Iterations): Initial Workset and Initial Solutionset → Iterative Function (with State) → Result
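
The difference between the two kinds of iteration can be sketched as two small driver functions on plain Scala values. `bulkIterate` and `worksetIterate` are hypothetical names for this illustration, not Stratosphere API, and the solution set is modelled as an in-memory `Map`.

```scala
object IterationSketch {
  // Bulk iteration: the whole data set is recomputed in every step.
  def bulkIterate[A](initial: A, steps: Int)(step: A => A): A =
    (0 until steps).foldLeft(initial)((data, _) => step(data))

  // Workset (incremental) iteration: each step returns a delta for the
  // solution set and the next workset; the loop ends when the workset is empty.
  @annotation.tailrec
  def worksetIterate[K, V, W](solution: Map[K, V], workset: Seq[W])(
      step: (Map[K, V], Seq[W]) => (Map[K, V], Seq[W])): Map[K, V] =
    if (workset.isEmpty) solution
    else {
      val (delta, next) = step(solution, workset)
      worksetIterate(solution ++ delta, next)(step)
    }

  def main(args: Array[String]): Unit = {
    // Bulk: Newton's method refines one estimate of sqrt(2) in every pass.
    println(bulkIterate(1.0, 20)(x => (x + 2.0 / x) / 2))
    // Workset: each element records its square and schedules its predecessor.
    println(worksetIterate(Map.empty[Int, Int], Seq(3)) { (_, ws) =>
      (ws.map(w => w -> w * w).toMap, ws.map(_ - 1).filter(_ > 0))
    })
  }
}
```

In the bulk form the work per step stays constant, while in the workset form it shrinks with the workset.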

Iterations inside the System

[Figure: two charts comparing Naïve and Incremental iterations. Left: # Vertices (thousands) per Iteration, for iterations 0 to 34. Right: Runtime (secs) on the Twitter and Webbase (20) graphs.]

Computations performed in each iteration for connected communities of a social graph

Iterative Program (Scala)

def step = (s: DataSet[Vertex], ws: DataSet[Vertex]) => {
  val minNeighbor = ws groupBy { _.id } reduceGroup { x => x.minBy { _.component } }
  val delta = s join minNeighbor where { _.id } isEqualTo { _.id }
                flatMap { (c, o) => if (c.component < o.component) Some(c) else None }
  val nextWs = delta join edges where { v => v.id } isEqualTo { e => e.from }
                 map { (v, e) => Vertex(e.to, v.component) }
  (delta, nextWs)
}

val components = vertices.iterateWithWorkset(initialWorkset, { _.id }, step)

Iterative Program (Scala)

Annotations on the program above:

• def step = (s, ws) => { ... }: define step function
• (delta, nextWs): return delta and next workset
• vertices.iterateWithWorkset(initialWorkset, {_.id}, step): invoke iteration
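
The same connected-components logic can be run on plain Scala collections to see how the workset shrinks. This is a sketch under assumptions: `ConnectedComponentsSketch`, `components` and `propagate` are hypothetical names, edges are treated as undirected, the solution set is a `Map` from vertex id to component id, and an ordinary `while` loop stands in for `iterateWithWorkset`.

```scala
object ConnectedComponentsSketch {
  case class Vertex(id: Int, component: Int)
  case class Edge(from: Int, to: Int)

  def components(vertexIds: Seq[Int], edges: Seq[Edge]): Map[Int, Int] = {
    // Treat edges as undirected by adding the reverse direction.
    val neighbours: Map[Int, Seq[Int]] =
      (edges ++ edges.map(e => Edge(e.to, e.from)))
        .groupBy(_.from).map { case (v, es) => (v, es.map(_.to)) }

    // Solution set: an index from vertex id to its current component id.
    var solution: Map[Int, Int] = vertexIds.map(v => v -> v).toMap
    // Initial workset: every vertex offers its own id to each neighbour.
    var workset: Seq[Vertex] = propagate(vertexIds.map(v => Vertex(v, v)), neighbours)

    while (workset.nonEmpty) {
      // Smallest candidate component per vertex.
      val minNeighbor = workset.groupBy(_.id).values.map(_.minBy(_.component))
      // Delta: only candidates that improve on the current solution.
      val delta = minNeighbor
        .filter(c => c.component < solution.getOrElse(c.id, Int.MaxValue)).toSeq
      solution = solution ++ delta.map(v => v.id -> v.component)
      // Next workset: changed vertices notify their neighbours.
      workset = propagate(delta, neighbours)
    }
    solution
  }

  // Send the component id of every changed vertex to all of its neighbours.
  private def propagate(changed: Seq[Vertex], neighbours: Map[Int, Seq[Int]]): Seq[Vertex] =
    for (v <- changed; n <- neighbours.getOrElse(v.id, Seq.empty)) yield Vertex(n, v.component)
}
```

Only vertices whose component changed produce new workset entries, so later passes touch a small part of the graph.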

Iterative Program (Java)

Graph Processing in Stratosphere

Optimizing Iterative Programs

• Caching loop-invariant data
• Pushing work "out of the loop"
• Maintain state as index

Support for YARN

• Clusters are typically shared between applications
  o Different users
  o Different systems, or different versions of the same system

• YARN manages the cluster as a collection of resources
  o Allows systems to deploy themselves on the cluster for a task

[Figure: Stratosphere Client and YARN Manager]

Project: http://stratosphere.eu
Dev: http://github.com/stratosphere
Tweet: #StratoSummit

Be Part of a Great Open Source Project

Use Stratosphere & give us feedback on the experience

Partner with us and become a pilot user/customer

Contribute to the system
