Tetra Data Blitz 10/1/2015

Monoids monoids everywhere

Page 1: Monoids monoids everywhere

Tetra Data Blitz 10/1/2015

Page 2: Monoids monoids everywhere

Monoids Monoids Everywhere

in ~5 minutes

Kevin Faro

Page 3: Monoids monoids everywhere

http://s2.quickmeme.com/img/44/44b0bd758f8ee5c81362923f0d5c8e017c9ddf623925e60c29a4c015b89fbb45.jpg

Page 4: Monoids monoids everywhere

Oh, that wasn’t clear enough?

An operation ● (together with the set of values it acts on) is considered a monoid if:

1. it is associative
   a. (a●b)●c = a●(b●c)

2. it has an identity element e
   a. e●a = a●e = a
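The two laws can be written down as a small Scala sketch. The names `Monoid`, `empty`, `combine` and `MonoidLaws` are illustrative assumptions made for this example, not Algebird's API, and the checks only spot-test the laws for the values you pass in rather than proving them.

```scala
// A minimal sketch of the definition; all names here are hypothetical.
trait Monoid[A] {
  def empty: A                // the identity element e
  def combine(x: A, y: A): A  // the associative operation ●
}

object MonoidLaws {
  // Integer addition: identity 0, operation +
  val intAddition: Monoid[Int] = new Monoid[Int] {
    val empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Spot-check associativity for one triple: (a●b)●c == a●(b●c)
  def associative[A](m: Monoid[A])(a: A, b: A, c: A): Boolean =
    m.combine(m.combine(a, b), c) == m.combine(a, m.combine(b, c))

  // Spot-check identity for one value: e●a == a●e == a
  def identityHolds[A](m: Monoid[A])(a: A): Boolean =
    m.combine(m.empty, a) == a && m.combine(a, m.empty) == a
}
```

For example, `MonoidLaws.associative(MonoidLaws.intAddition)(1, 2, 3)` checks the same triple used on the Examples slide.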

Page 5: Monoids monoids everywhere

Examples

● Addition
  ○ associative: (1+2)+3 = 1+(2+3) = 6
  ○ identity: 0+1 = 1+0 = 1

● Multiplication
  ○ associative: (1*2)*3 = 1*(2*3) = 6
  ○ identity: 1*2 = 2*1 = 2

● Min
  ○ you get the idea ...

● Max

● Set Union
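The remaining examples can be sketched the same way. The slide leaves the identity elements for Min, Max and Set Union unstated; the ones below (largest `Int`, smallest `Int`, empty set) are the usual choices for bounded integers and sets. The trait and object names are hypothetical, chosen for this example only.

```scala
// Hypothetical names; a sketch of the slide's examples, not Algebird's API.
trait Monoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

object ExampleMonoids {
  val multiplication: Monoid[Int] = new Monoid[Int] {
    val empty: Int = 1
    def combine(x: Int, y: Int): Int = x * y
  }

  // Min over Int: the identity is the largest Int, since min(MaxValue, a) == a
  val min: Monoid[Int] = new Monoid[Int] {
    val empty: Int = Int.MaxValue
    def combine(x: Int, y: Int): Int = math.min(x, y)
  }

  // Max over Int: the identity is the smallest Int
  val max: Monoid[Int] = new Monoid[Int] {
    val empty: Int = Int.MinValue
    def combine(x: Int, y: Int): Int = math.max(x, y)
  }

  // Set union: the identity is the empty set
  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    val empty: Set[A] = Set.empty[A]
    def combine(x: Set[A], y: Set[A]): Set[A] = x union y
  }

  // Fold any sequence down to one value, starting from the identity
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A =
    xs.foldLeft(m.empty)(m.combine)
}
```

Because every instance carries its own identity, `combineAll` is well defined even for an empty sequence.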

Page 6: Monoids monoids everywhere

Let’s take a look at algebird

http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/

Page 7: Monoids monoids everywhere

https://izbicki.me/img/uploads/2013/05/fry-300x225.jpg

Page 8: Monoids monoids everywhere

Why is this so awesome?!?!

● Divide and Conquer
● Parallelization
● Incrementalism

Sound familiar?

● map/REDUCE
  ○ perfect for the reduce phase
  ○ see Scalding: expenses.groupBy('shoppingLocation) { _.sum[Double]('cost -> 'totalCost) }

● Streaming
  ○ perfect for maintaining running calculations on streams of data (Storm, …)
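A rough sketch of why associativity buys you these properties, using plain Scala collections rather than Scalding or Storm. The object and method names are made up for illustration, and costs are modelled as `Long` cents on purpose: floating-point addition is only approximately associative, so chunked `Double` sums can differ in the last digits.

```scala
// Hypothetical helper; illustrates the idea only, not how Scalding implements sum.
object MonoidReduce {
  val zero: Long = 0L                       // identity for addition
  def plus(x: Long, y: Long): Long = x + y  // associative operation

  // Divide and conquer: reduce each chunk on its own (each chunk could live
  // on a different machine), then combine the partial results.
  // chunkSize must be positive.
  def chunkedSum(costs: Seq[Long], chunkSize: Int): Long = {
    val partials = costs.grouped(chunkSize).map(_.foldLeft(zero)(plus)).toSeq
    partials.foldLeft(zero)(plus)
  }

  // Incrementalism: keep a running total as each new value arrives.
  def runningTotals(costs: Seq[Long]): Seq[Long] =
    costs.scanLeft(zero)(plus).tail
}
```

Associativity is what makes the chunk boundaries irrelevant: `chunkedSum` returns the same total for any positive `chunkSize`.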

Page 9: Monoids monoids everywhere

Approximate Data Structures

● HyperLogLog
  ○ an algorithm for the count-distinct problem, approximating the number of distinct elements in a Set.

● Count-min Sketch
  ○ a probabilistic data structure that provides an approximate frequency table.

● MinHash
  ○ estimates how similar two sets are (approximate Jaccard Similarity)

● Bloom filter
  ○ a probabilistic data structure that is used to test whether an element is a member of a Set
  ○ can answer definitely No or maybe Yes
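To see how one of these structures fits the monoid pattern, here is a toy Bloom filter. It is a sketch under assumptions, not Algebird's implementation: the `Bloom` class, its `numBits` and `numHashes` parameters, and the choice of seeded `MurmurHash3` hashes are all made up for this example. The empty filter acts as the identity and bitwise OR as the associative operation.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

// Toy Bloom filter; class and parameter names are hypothetical.
final case class Bloom(numBits: Int, numHashes: Int, bits: BitSet) {
  // One bit position per seeded hash; floorMod keeps indices non-negative.
  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), numBits)
    }

  def add(item: String): Bloom =
    copy(bits = positions(item).foldLeft(bits)(_ + _))

  // false means "definitely No"; true means "maybe Yes"
  def mightContain(item: String): Boolean =
    positions(item).forall(bits.contains)

  // Monoid operation: OR the bit sets. Only valid for equal parameters.
  def merge(other: Bloom): Bloom = {
    require(numBits == other.numBits && numHashes == other.numHashes)
    copy(bits = bits | other.bits)
  }
}

object Bloom {
  // Identity element: a filter with no bits set
  def empty(numBits: Int, numHashes: Int): Bloom =
    Bloom(numBits, numHashes, BitSet.empty)
}
```

Two filters built on separate partitions of the data can be merged into one that answers for both, which is the same divide-and-conquer property as plain addition.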

Page 10: Monoids monoids everywhere

Examples

● HyperLogLog
  ○ How many unique Twitter handles tweeted @justinbieber in the past month?

● Count-min Sketch
  ○ What are the frequencies of the hashtags in those tweets?

● MinHash
  ○ How similar are the followers of @justinbieber (~70M) to the followers of @katyperry (~76M)?

● Bloom filter
  ○ Did Kevin tweet to @justinbieber in the past month? Maybe yes. Must be a false positive, can you really trust a Bloom filter?!?!?
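The hashtag-frequency question can be sketched with a toy count-min sketch. As before, this is an illustration under assumptions rather than Algebird's implementation: the `CountMin` class, its `depth` and `width` parameters, and the seeded `MurmurHash3` hashing are hypothetical choices. Merging adds counters cell by cell, so the all-zero table is the identity.

```scala
import scala.util.hashing.MurmurHash3

// Toy count-min sketch: `depth` hash rows, `width` counters per row.
final case class CountMin(depth: Int, width: Int, table: Vector[Vector[Long]]) {
  // Column for an item in a given row; the row index doubles as the hash seed.
  private def column(item: String, row: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, row), width)

  def add(item: String): CountMin =
    copy(table = table.zipWithIndex.map { case (counts, row) =>
      val col = column(item, row)
      counts.updated(col, counts(col) + 1L)
    })

  // Never undercounts; hash collisions can only inflate the estimate.
  def estimate(item: String): Long =
    table.zipWithIndex.map { case (counts, row) => counts(column(item, row)) }.min

  // Monoid operation: add the tables cell by cell (equal dimensions only).
  def merge(other: CountMin): CountMin = {
    require(depth == other.depth && width == other.width)
    copy(table = table.zip(other.table).map { case (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    })
  }
}

object CountMin {
  // Identity element: every counter is zero
  def empty(depth: Int, width: Int): CountMin =
    CountMin(depth, width, Vector.fill(depth, width)(0L))
}
```

Sketches built from separate batches of tweets merge into exactly the sketch you would get from processing all the tweets in one pass.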

Page 11: Monoids monoids everywhere

How did that get in there?

Page 12: Monoids monoids everywhere

https://highlyscalable.files.wordpress.com/2012/04/probabilistic-sizes.png

This is better than Spanx™!

Page 13: Monoids monoids everywhere

Thanks Twitter

https://github.com/twitter/algebird*

* Sorry, Algebird doesn’t have a cool logo. Don’t blame me, blame Twitter!

Page 14: Monoids monoids everywhere

Kevin Faro

github.com/kevin-faro

http://cdn.meme.am/instances/500x/63234695.jpg