Upload
johnynek
View
94
Download
1
Embed Size (px)
Citation preview
Aggregators: modeling data queries functionally
Oscar Boykin, Twitter @posco
Or:
Aggregators: composable aggregation for scalding, spark, summingbird, and plain scala
How to compute size of a list in Map/Reduce?
3
2 3 5 7 11 13 17
How to compute size of a list in Map/Reduce?
4
2 3 5 7 11 13 17
1 1 1 1 1 1 1
map(x => 1)
How to compute size of a list in Map/Reduce?
5
2 3 5 7 11 13 17
1 1 1 1 1 1 1
222
374
reduce {(x, y) => x+y}
Associative functions: f(a,f(b,c)) == f(f(a,b),c)
also called “semigroups”
we want map+semigroup in one
abstraction!
Getting the average
8
2 3 5 7 11 13 17
Getting the average
9
2 3 5 7 11 13 17
(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
map(x => (1,x))
Getting the average
10
2 3 5 7 11 13 17
(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
2,242, 5
3,417,584,17
2,12
reduce(Semigroup.plus)
Getting the average
11
2 3 5 7 11 13 17
(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
7,58 8.285
map(case (c, s) => s/c.toDouble)
We really want map+semigroup+map in one abstraction!
trait Aggregator[In, Middle, Out] { def prepare(i: In): Middle def semigroup: Semigroup[Middle] def present(m: Middle): Out }
https://github.com/twitter/algebird
How do we use this?
@Twitter 15
@Twitter 16
@Twitter 17
@Twitter 18
Not such a new idea. Scalding had a mapReduceMap function in the first release:
But why should we be excited?
map (prepare)
reduce (semigroup)
map (present)
“Does not compose” is the new “is a piece of crap”
paraphrasing Dan Rosen @mergeconflict
Aggregators Compose
!=💩Aggregator
map (prepare)
reduce (semigroup)
map (present)
map (prepare)
reduce (semigroup)
map (present)
composePrepare
map (prepare)
reduce (semigroup)
map (present)
composePrepare
Function + Aggregator = Aggregator
map (prepare)
reduce (semigroup)
map (present)
map (prepare)
reduce (semigroup)
map (present)
andThenPresent
map (prepare)
reduce (semigroup)
map (present)
andThenPresent
Aggregator + Function = Aggregator
map (prepare)
reduce (semigroup)
map (present)
map (prepare)
reduce (semigroup)
map (present)
Aggregator 1 Aggregator 2
map (prepare)
reduce (semigroup)
map (present)
Joined Aggregator
Aggregator * Aggregator = Aggregator
Aggregators are Applicative Functors
Functor: has a map method map(t: A[T])(fn: T => U): A[U]
Applicative: has a join method: def join(t: A[T], u: A[U]): A[(T, U)] Monad: has a flatMap method: def flatMap(t: A[T])(fn: T => A[U]): A[U]
Aggregators are Applicative Functors
Functor: has a map method map(t: A[T])(fn: T => U): A[U]
Applicative: has a join method: def join(t: A[T], u: A[U]): A[(T, U)] Monad: has a flatMap method: def flatMap(t: A[T])(fn: T => A[U]): A[U]
Let’s go to the REPL
http://bit.ly/AggregatingWithAlice
https://gist.github.com/johnynek/814fc1e77aad1d295bb7
Aggregators “just work” with scala collections
Aggregators are built in to Scalding
Aggregators are easy to use with Spark
Algebird with spark: https://github.com/twitter/algebird/pull/397
37
Algebird with spark: https://github.com/twitter/algebird/pull/397
38
Key Points1) Aggregators encapsulate very general query
logic independent of how it is executed (in memory, scalding, spark, you name it)
2) Aggregators compose so you can define parts you use, and easily glue them together
3) Algebird has many advanced, well tested Aggregators: TopK, HyperLogLog, CountMinSketch, Mean, Stddev, …
Oscar Boykin @posco / [email protected]
Algebird has these aggregators and more:
https://github.com/twitter/algebird