Aggregators: modeling data queries functionally Oscar Boykin, Twitter @posco

Aggregators: Data Day Texas, 2015

Page 1: Aggregators: Data Day Texas, 2015

Aggregators: modeling data queries functionally

Oscar Boykin, Twitter @posco

Page 2: Aggregators: Data Day Texas, 2015

Or:

Aggregators: composable aggregation for Scalding, Spark, Summingbird, and plain Scala

Page 3: Aggregators: Data Day Texas, 2015

@Twitter

How to compute size of a list in Map/Reduce?

2 3 5 7 11 13 17

Page 4: Aggregators: Data Day Texas, 2015

How to compute size of a list in Map/Reduce?

2 3 5 7 11 13 17

map(x => 1)

1 1 1 1 1 1 1

Page 5: Aggregators: Data Day Texas, 2015

How to compute size of a list in Map/Reduce?

2 3 5 7 11 13 17

1 1 1 1 1 1 1

reduce {(x, y) => x+y}

2 2 2

4 3

7
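The two steps on these slides can be tried on a plain Scala list. This is a minimal sketch; the object and value names are made up for illustration.

```scala
object ListSize {
  val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // Map step: replace every element with 1.
  val ones: List[Int] = primes.map(_ => 1)

  // Reduce step: add the ones pairwise until a single number is left.
  val size: Int = ones.reduce((x, y) => x + y)

  def main(args: Array[String]): Unit =
    println(size) // prints 7
}
```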

Page 6: Aggregators: Data Day Texas, 2015

Associative functions: f(a,f(b,c)) == f(f(a,b),c)

also called “semigroups”
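Associativity is what lets a reduce be split across machines: partial results can be combined in any grouping. The sketch below checks the law for addition, shows that subtraction fails it, and compares a single pass with a chunked reduce. The chunk size of 3 is arbitrary and all names are illustrative.

```scala
object Associativity {
  // Addition is associative, so Int with + forms a semigroup.
  def plus(a: Int, b: Int): Int = a + b

  // Subtraction is not associative, so it cannot serve as the reduce step.
  def minus(a: Int, b: Int): Int = a - b

  // f(a, f(b, c)) == f(f(a, b), c) holds for plus: both sides are 10.
  val plusAgrees: Boolean = plus(2, plus(3, 5)) == plus(plus(2, 3), 5)

  // It fails for minus: 2 - (3 - 5) is 4, but (2 - 3) - 5 is -6.
  val minusAgrees: Boolean = minus(2, minus(3, 5)) == minus(minus(2, 3), 5)

  val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // One left-to-right pass over the whole list.
  val sequential: Int = primes.reduce((x, y) => plus(x, y))

  // Reduce chunks of three independently, then combine the partial results,
  // roughly what separate reducers would do.
  val chunked: Int =
    primes
      .grouped(3)
      .map(chunk => chunk.reduce((x, y) => plus(x, y)))
      .reduce((x, y) => plus(x, y))
}
```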

Page 7: Aggregators: Data Day Texas, 2015

We want map+semigroup in one abstraction!

Page 8: Aggregators: Data Day Texas, 2015

Getting the average

2 3 5 7 11 13 17

Page 9: Aggregators: Data Day Texas, 2015

Getting the average

2 3 5 7 11 13 17

map(x => (1,x))

(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)

Page 10: Aggregators: Data Day Texas, 2015

Getting the average

2 3 5 7 11 13 17

(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)

reduce(Semigroup.plus)

(2,5) (2,12) (2,24)

(4,17) (3,41)

(7,58)

Page 11: Aggregators: Data Day Texas, 2015

Getting the average

2 3 5 7 11 13 17

(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)

(7,58)

map { case (c, s) => s/c.toDouble }

8.285
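The three steps for the average (map, reduce, map) can be run end to end on a plain list. This is a sketch with illustrative names; the componentwise addition is written out by hand here where the slide uses Semigroup.plus.

```scala
object ListAverage {
  val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // Map: pair every element with a count of 1.
  val pairs: List[(Int, Int)] = primes.map(x => (1, x))

  // Reduce: add counts and sums componentwise, which is associative.
  val (count, sum) =
    pairs.reduce((l: (Int, Int), r: (Int, Int)) => (l._1 + r._1, l._2 + r._2))

  // Map: turn the (count, sum) pair into the final answer.
  val average: Double = sum / count.toDouble
}
```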

Page 12: Aggregators: Data Day Texas, 2015

We really want map+semigroup+map in one abstraction!

Page 13: Aggregators: Data Day Texas, 2015

trait Aggregator[In, Middle, Out] {
  def prepare(i: In): Middle
  def semigroup: Semigroup[Middle]
  def present(m: Middle): Out
}

https://github.com/twitter/algebird

Page 14: Aggregators: Data Day Texas, 2015

How do we use this?
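One way to use the trait is sketched below. The `Semigroup` and `Aggregator` here are local stand-ins that mirror the slide, not the Algebird sources, and the `apply` helper is an assumption added for illustration: it runs prepare, reduce, and present over a non-empty collection.

```scala
object AggregatorSketch {
  // Minimal stand-in for a semigroup: one associative operation.
  trait Semigroup[T] { def plus(l: T, r: T): T }

  // The trait from the slide, plus an illustrative `apply` helper.
  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // map (prepare), reduce (semigroup), map (present); input must be non-empty.
    def apply(inputs: Iterable[In]): Out =
      present(inputs.iterator.map(i => prepare(i)).reduce[Middle](semigroup.plus(_, _)))
  }

  // Size: every item becomes 1, the ones are added, the total is returned as is.
  val size: Aggregator[Int, Long, Long] = new Aggregator[Int, Long, Long] {
    def prepare(i: Int): Long = 1L
    val semigroup: Semigroup[Long] = new Semigroup[Long] {
      def plus(l: Long, r: Long): Long = l + r
    }
    def present(m: Long): Long = m
  }

  // Average: carry (count, sum) through the reduce, divide at the end.
  val average: Aggregator[Int, (Long, Long), Double] =
    new Aggregator[Int, (Long, Long), Double] {
      def prepare(i: Int): (Long, Long) = (1L, i.toLong)
      val semigroup: Semigroup[(Long, Long)] = new Semigroup[(Long, Long)] {
        def plus(l: (Long, Long), r: (Long, Long)): (Long, Long) =
          (l._1 + r._1, l._2 + r._2)
      }
      def present(m: (Long, Long)): Double = m._2.toDouble / m._1
    }

  val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)
  val primeCount: Long = size(primes)
  val primeAverage: Double = average(primes)
}
```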

Page 19: Aggregators: Data Day Texas, 2015

Not such a new idea. Scalding had a mapReduceMap function in the first release:
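The shape of such a function can be sketched on plain Scala collections. The helper below borrows the name only: its signature is hypothetical and is not Scalding's API. It takes a map function, an associative reduce function, and a final map function.

```scala
object MapReduceMapSketch {
  // Hypothetical helper (not Scalding's signature): map, reduce, then map again.
  // The input must be non-empty because reduce has no starting value.
  def mapReduceMap[T, X, U](items: Iterable[T])(mapFn: T => X)(redFn: (X, X) => X)(mapFn2: X => U): U =
    mapFn2(items.iterator.map(mapFn).reduce[X](redFn))

  // The average from the earlier slides, written as one call.
  val average: Double =
    mapReduceMap(List(2, 3, 5, 7, 11, 13, 17))(x => (1, x))(
      (l, r) => (l._1 + r._1, l._2 + r._2)
    ) { case (count, sum) => sum / count.toDouble }
}
```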

Page 20: Aggregators: Data Day Texas, 2015

But why should we be excited?

Page 21: Aggregators: Data Day Texas, 2015

map (prepare)

reduce (semigroup)

map (present)

Page 22: Aggregators: Data Day Texas, 2015

“Does not compose” is the new “is a piece of crap”

paraphrasing Dan Rosen @mergeconflict

Page 23: Aggregators: Data Day Texas, 2015

Aggregators Compose

Aggregator != 💩

Page 25: Aggregators: Data Day Texas, 2015

map (prepare)

reduce (semigroup)

map (present)

composePrepare

Page 26: Aggregators: Data Day Texas, 2015

map (prepare)

reduce (semigroup)

map (present)

composePrepare

Function + Aggregator = Aggregator
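A sketch of how composePrepare could be written against the trait from the earlier slide. The traits are local stand-ins rather than the Algebird sources, and `sumInts` and `totalLength` are example names. The new aggregator runs a function on each input before handing it to the original prepare.

```scala
object ComposePrepareSketch {
  trait Semigroup[T] { def plus(l: T, r: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(inputs: Iterable[In]): Out =
      present(inputs.iterator.map(i => prepare(i)).reduce[Middle](semigroup.plus(_, _)))

    // Function + Aggregator = Aggregator: run `f` before `prepare`.
    def composePrepare[In2](f: In2 => In): Aggregator[In2, Middle, Out] =
      new Aggregator[In2, Middle, Out] {
        def prepare(i: In2): Middle = self.prepare(f(i))
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out = self.present(m)
      }
  }

  // Sums integers; input, middle, and output types are all Int.
  val sumInts: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = i
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(l: Int, r: Int): Int = l + r
    }
    def present(m: Int): Int = m
  }

  // Reuse the summing logic on strings by measuring their length first.
  val totalLength: Aggregator[String, Int, Int] =
    sumInts.composePrepare((s: String) => s.length)

  // Lengths 8 + 5 + 11.
  val result: Int = totalLength(List("scalding", "spark", "summingbird"))
}
```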

Page 28: Aggregators: Data Day Texas, 2015

map (prepare)

reduce (semigroup)

map (present)

andThenPresent

Page 29: Aggregators: Data Day Texas, 2015

map (prepare)

reduce (semigroup)

map (present)

andThenPresent

Aggregator + Function = Aggregator
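andThenPresent is the mirror image: a function applied after present. The sketch below uses local stand-in traits, not the Algebird sources, and the `labelled` aggregator is only an example.

```scala
object AndThenPresentSketch {
  trait Semigroup[T] { def plus(l: T, r: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(inputs: Iterable[In]): Out =
      present(inputs.iterator.map(i => prepare(i)).reduce[Middle](semigroup.plus(_, _)))

    // Aggregator + Function = Aggregator: run `f` after `present`.
    def andThenPresent[Out2](f: Out => Out2): Aggregator[In, Middle, Out2] =
      new Aggregator[In, Middle, Out2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out2 = f(self.present(m))
      }
  }

  val sumInts: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = i
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(l: Int, r: Int): Int = l + r
    }
    def present(m: Int): Int = m
  }

  // Same reduce as before; only the final presentation changes.
  val labelled: Aggregator[Int, Int, String] =
    sumInts.andThenPresent(total => s"total=$total")

  val result: String = labelled(List(2, 3, 5, 7, 11, 13, 17))
}
```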

Page 31: Aggregators: Data Day Texas, 2015

map (prepare)

reduce (semigroup)

map (present)

Aggregator 1 Aggregator 2

Page 32: Aggregators: Data Day Texas, 2015

map (prepare)

reduce (semigroup)

map (present)

Joined Aggregator

Aggregator * Aggregator = Aggregator
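Joining two aggregators amounts to pairing up each of the three stages, so both results come out of a single pass over the data. This is a sketch with local stand-in traits; `fromReduce` is defined here as a small local helper, and minimum and maximum are example aggregators.

```scala
object JoinSketch {
  trait Semigroup[T] { def plus(l: T, r: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(inputs: Iterable[In]): Out =
      present(inputs.iterator.map(i => prepare(i)).reduce[Middle](semigroup.plus(_, _)))

    // Aggregator * Aggregator = Aggregator: pair up prepare, semigroup, and present.
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(l: (Middle, M2), r: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(l._1, r._1), that.semigroup.plus(l._2, r._2))
        }
        def present(m: (Middle, M2)): (Out, O2) =
          (self.present(m._1), that.present(m._2))
      }
  }

  // Local helper: an aggregator whose prepare and present are the identity.
  def fromReduce[T](f: (T, T) => T): Aggregator[T, T, T] =
    new Aggregator[T, T, T] {
      def prepare(i: T): T = i
      val semigroup: Semigroup[T] = new Semigroup[T] {
        def plus(l: T, r: T): T = f(l, r)
      }
      def present(m: T): T = m
    }

  val minInt: Aggregator[Int, Int, Int] = fromReduce[Int]((l, r) => math.min(l, r))
  val maxInt: Aggregator[Int, Int, Int] = fromReduce[Int]((l, r) => math.max(l, r))

  // Minimum and maximum computed together in one pass.
  val minAndMax: Aggregator[Int, (Int, Int), (Int, Int)] = minInt.join(maxInt)

  val result: (Int, Int) = minAndMax(List(2, 3, 5, 7, 11, 13, 17))
}
```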

Page 33: Aggregators: Data Day Texas, 2015

Aggregators are Applicative Functors

Functor: has a map method: def map[T, U](t: A[T])(fn: T => U): A[U]

Applicative: has a join method: def join[T, U](t: A[T], u: A[U]): A[(T, U)]

Monad: has a flatMap method: def flatMap[T, U](t: A[T])(fn: T => A[U]): A[U]
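With the input type held fixed, map and join act on an aggregator's output type. The sketch below gives a local stand-in trait both methods (its `map` is the same idea as andThenPresent) and rebuilds the average from two smaller aggregators. `sumOf` is a hypothetical helper written for this example.

```scala
object ApplicativeSketch {
  trait Semigroup[T] { def plus(l: T, r: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(inputs: Iterable[In]): Out =
      present(inputs.iterator.map(i => prepare(i)).reduce[Middle](semigroup.plus(_, _)))

    // Functor: transform the output.
    def map[Out2](fn: Out => Out2): Aggregator[In, Middle, Out2] =
      new Aggregator[In, Middle, Out2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out2 = fn(self.present(m))
      }

    // Applicative: pair two aggregators that read the same input.
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(l: (Middle, M2), r: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(l._1, r._1), that.semigroup.plus(l._2, r._2))
        }
        def present(m: (Middle, M2)): (Out, O2) =
          (self.present(m._1), that.present(m._2))
      }
  }

  // Hypothetical helper: sum Longs after applying `prep` to each input.
  def sumOf[In](prep: In => Long): Aggregator[In, Long, Long] =
    new Aggregator[In, Long, Long] {
      def prepare(i: In): Long = prep(i)
      val semigroup: Semigroup[Long] = new Semigroup[Long] {
        def plus(l: Long, r: Long): Long = l + r
      }
      def present(m: Long): Long = m
    }

  val size: Aggregator[Int, Long, Long] = sumOf[Int](_ => 1L)
  val sum: Aggregator[Int, Long, Long] = sumOf[Int](i => i.toLong)

  // join pairs the two results; map turns the pair into the average.
  val average: Aggregator[Int, (Long, Long), Double] =
    size.join(sum).map { case (n, s) => s.toDouble / n }

  val result: Double = average(List(2, 3, 5, 7, 11, 13, 17))
}
```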

Page 35: Aggregators: Data Day Texas, 2015

Let’s go to the REPL

http://bit.ly/AggregatingWithAlice

https://gist.github.com/johnynek/814fc1e77aad1d295bb7

Page 36: Aggregators: Data Day Texas, 2015

Aggregators “just work” with scala collections

Aggregators are built into Scalding

Aggregators are easy to use with Spark

Page 37: Aggregators: Data Day Texas, 2015

Algebird with Spark: https://github.com/twitter/algebird/pull/397

Page 39: Aggregators: Data Day Texas, 2015

Key Points

1) Aggregators encapsulate very general query logic independent of how it is executed (in memory, Scalding, Spark, you name it)

2) Aggregators compose, so you can define the parts you use and easily glue them together

3) Algebird has many advanced, well-tested Aggregators: TopK, HyperLogLog, CountMinSketch, Mean, Stddev, …
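The first point can be illustrated without any cluster. In the sketch below, one aggregator value is handed to two different runners: a single in-memory pass, and a simulated partitioned run that reduces each partition separately and then merges the partial results. The traits, the runner names, and the partitioning are all assumptions made for this example.

```scala
object ExecutionIndependence {
  trait Semigroup[T] { def plus(l: T, r: T): T }

  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out
  }

  // The query logic: an average carried as (count, sum).
  val average: Aggregator[Int, (Long, Long), Double] =
    new Aggregator[Int, (Long, Long), Double] {
      def prepare(i: Int): (Long, Long) = (1L, i.toLong)
      val semigroup: Semigroup[(Long, Long)] = new Semigroup[(Long, Long)] {
        def plus(l: (Long, Long), r: (Long, Long)): (Long, Long) =
          (l._1 + r._1, l._2 + r._2)
      }
      def present(m: (Long, Long)): Double = m._2.toDouble / m._1
    }

  // Runner 1: one in-memory pass over non-empty data.
  def runInMemory[In, M, Out](agg: Aggregator[In, M, Out], data: Seq[In]): Out =
    agg.present(data.map(x => agg.prepare(x)).reduce[M](agg.semigroup.plus(_, _)))

  // Runner 2: reduce every non-empty partition on its own, then merge the partials.
  def runPartitioned[In, M, Out](agg: Aggregator[In, M, Out], partitions: Seq[Seq[In]]): Out = {
    val partials: Seq[M] = partitions
      .filter(_.nonEmpty)
      .map(p => p.map(x => agg.prepare(x)).reduce[M](agg.semigroup.plus(_, _)))
    agg.present(partials.reduce[M](agg.semigroup.plus(_, _)))
  }

  val primes: Seq[Int] = Seq(2, 3, 5, 7, 11, 13, 17)
  val partitions: Seq[Seq[Int]] = Seq(Seq(2, 3, 5), Seq(7, 11), Seq(13, 17))

  val inMemory: Double = runInMemory(average, primes)
  val partitioned: Double = runPartitioned(average, partitions)
}
```

Because the semigroup is associative, both runners reach the same (count, sum) pair and therefore the same average.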

Page 40: Aggregators: Data Day Texas, 2015

Oscar Boykin @posco

Algebird has these aggregators and more:

https://github.com/twitter/algebird