PROBABILISTIC ALGORITHMS for fun and pseudorandom profit Tyler Treat / 12.5.2015

Page 1: Probabilistic algorithms for fun and pseudorandom profit

PROBABILISTIC ALGORITHMS for fun and pseudorandom profit

Tyler Treat / 12.5.2015

Page 2: Probabilistic algorithms for fun and pseudorandom profit

ABOUT THE SPEAKER

➤ Backend engineer at Workiva

➤ Messaging platform tech lead

➤ Distributed systems

➤ bravenewgeek.com @tyler_treat


Page 4: Probabilistic algorithms for fun and pseudorandom profit

[Figure: chart plotting Time against Data]

Time axis: Batch (days, hours); Streaming (minutes, seconds); Real-Time™ (I need it now, dammit!)

Data axis: Meh, data (I can store this on my laptop); Oi, data! (We’re gonna need a bigger boat…); Big data™ (IoT, sensors)

Region labels: Not Interesting; Kinda Interesting; Pretty Interesting; /dev/null

Page 5: Probabilistic algorithms for fun and pseudorandom profit

http://bravenewgeek.com/stream-processing-and-probabilistic-methods/

Page 6: Probabilistic algorithms for fun and pseudorandom profit

THIS TALK IS NOT

➤ About Samza, Storm, Spark Streaming et al.

➤ Strictly about stream-processing techniques

➤ Mathy

➤ Statistics-y

Page 7: Probabilistic algorithms for fun and pseudorandom profit

THIS TALK IS

➤ About basic probability theory

➤ About practical design trade-offs

➤ About algorithms & data structures

➤ About dealing with large or unbounded datasets

➤ A marriage of CS & engineering

Page 8: Probabilistic algorithms for fun and pseudorandom profit

OUTLINE

➤ Terminology & context

➤ Why probabilistic algorithms?

➤ Bloom filters & variants

➤ Count-min sketch

➤ HyperLogLog

Page 9: Probabilistic algorithms for fun and pseudorandom profit

Randomized Algorithms

Random Input

➤ Las Vegas Algorithms: correct result, gamble on speed

➤ Monte Carlo Algorithms: deterministic speed, gamble on result
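
The two families can be contrasted with a toy search problem. This is only an illustrative sketch: the function names, the `probes` limit, and the sample list are assumptions made for the example, not something from the talk.

```python
import random


def las_vegas_find(items, target, rng):
    """Las Vegas: the returned index is always correct; the running time is random.

    Assumes target occurs in items at least once, otherwise this loops forever.
    """
    while True:
        i = rng.randrange(len(items))
        if items[i] == target:
            return i


def monte_carlo_find(items, target, rng, probes=20):
    """Monte Carlo: at most `probes` probes; may wrongly give up and return None."""
    for _ in range(probes):
        i = rng.randrange(len(items))
        if items[i] == target:
            return i
    return None


rng = random.Random(42)
items = ["a", "b"] * 500          # half of the entries are "a"
idx = las_vegas_find(items, "a", rng)
guess = monte_carlo_find(items, "a", rng)
print(idx, guess)
```

The Las Vegas version gambles on how many probes it needs, while the Monte Carlo version fixes the work up front and accepts a small chance of a wrong "not found".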

Page 11: Probabilistic algorithms for fun and pseudorandom profit

DEFINING SOME TERMINOLOGY

➤ Online - processing elements as they arrive

➤ Offline - entire dataset is known ahead of time

➤ Real-time - hard constraint on response time

➤ A priori knowledge - something known beforehand

Page 12: Probabilistic algorithms for fun and pseudorandom profit

BATCH VS STREAMING

➤ Batch

➤ Offline

➤ Heuristics/multiple passes

➤ Data structures less important

documents search index

Page 13: Probabilistic algorithms for fun and pseudorandom profit

BATCH VS STREAMING

➤ Streaming

➤ Online, one pass

➤ Usually real-time (but not necessarily)

➤ Potentially unbounded

transactions

caches

fraud

analytics

Page 14: Probabilistic algorithms for fun and pseudorandom profit

3 DATA INTEGRATION QUESTIONS

➤ How do you get the data?

➤ How do you disseminate the data?

➤ How do you process the data?

Page 15: Probabilistic algorithms for fun and pseudorandom profit

3 DATA INTEGRATION QUESTIONS

➤ How do you get the data (quickly)?

➤ How do you disseminate the data (quickly)?

➤ How do you process the data (quickly)?

Page 16: Probabilistic algorithms for fun and pseudorandom profit

Denormalization is critical to performance at scale.

Page 17: Probabilistic algorithms for fun and pseudorandom profit

How to count the number of distinct document views across Wikipedia?

Page 18: Probabilistic algorithms for fun and pseudorandom profit

10b531cb-914c-4b3e-ac1d-11678dd72f7a (16-byte GUID) → 3,042,568 (8-byte integer)

Page 19: Probabilistic algorithms for fun and pseudorandom profit

10b531cb-914c-4b3e-ac1d-11678dd72f7a → 3,042,568

5d5d5a78-f98f-4eee-bc83-762b3c78f1ea → 1,250,763

3558d299-45ef-4fc9-b9ec-902e4943c7f8 → 982,531

6febb745-c987-4c51-afd2-90a55f357d7b → 24,703,289

6f3f199e-4cc3-4c68-9d2a-00c31eb199f3 → 7,401,050

Page 20: Probabilistic algorithms for fun and pseudorandom profit

Wikipedia has ~38 million pages.

Page 21: Probabilistic algorithms for fun and pseudorandom profit

38,000,000 pages x

(16-byte guid + 8-byte integer)

≈ 1GB
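
As a rough check of that figure, assuming the only per-page cost is the 24-byte key/value pair (hash-table and object overhead, which a real runtime would add, is ignored here):

```latex
38{,}000{,}000 \times (16 + 8)\ \text{bytes}
  = 912{,}000{,}000\ \text{bytes}
  \approx 0.91\ \text{GB}
  \approx 0.85\ \text{GiB}
```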

Page 22: Probabilistic algorithms for fun and pseudorandom profit

➤ Not unreasonable for modern hardware

➤ Held in memory for lifetime of process so will move to old GC generations—expensive to collect!

➤ Now we want to track views per unique IP address

➤ >4 billion IPv4 addresses

➤ Naive solutions quickly become intractable

Page 23: Probabilistic algorithms for fun and pseudorandom profit

DISTRIBUTED SYSTEMS TRADE-OFFS

Consistency

Availability

Partition Tolerance

Page 24: Probabilistic algorithms for fun and pseudorandom profit

DATA PROCESSING TRADE-OFFS

Time

Accuracy

Space

Page 25: Probabilistic algorithms for fun and pseudorandom profit

HAVE YOUR CAKE AND EAT IT TOO?

Stream Processing

Batch Processing

App

The “Lambda Architecture”

Page 26: Probabilistic algorithms for fun and pseudorandom profit

Probabilistic algorithms trade accuracy for space and performance.

Page 27: Probabilistic algorithms for fun and pseudorandom profit

“Sketching” data structures make this trade by storing a summary of the dataset when storing it entirely is prohibitively expensive.

Page 28: Probabilistic algorithms for fun and pseudorandom profit

Bloom Filters

B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970.

Page 29: Probabilistic algorithms for fun and pseudorandom profit

Answers a simple question: is this element a member of a set?

Given S ⊆ 𝕌, is x ∈ S?

Page 30: Probabilistic algorithms for fun and pseudorandom profit

SET MEMBERSHIP

➤ Is this URL malicious?

➤ Is this IP address blacklisted?

➤ Is this word contained in the document?

➤ Is this record in the database?

➤ Has this transaction been processed?

Page 31: Probabilistic algorithms for fun and pseudorandom profit

Hash Table

entry for each member

Page 32: Probabilistic algorithms for fun and pseudorandom profit

Bit Array

bit for each element in universe

0101101000101110010100100110101…

Page 33: Probabilistic algorithms for fun and pseudorandom profit

BLOOM FILTERS

➤ Bloom filters store set memberships

➤ Answers “not in set” or “probably in set”

Page 34: Probabilistic algorithms for fun and pseudorandom profit

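[Figure: a Bloom filter sitting in front of a secondary store]

➤ Do you have key 1? Filter: no. Answer: no (store not accessed)

➤ Do you have key 2? Filter: yes. Store: here’s key 2 (necessary access)

➤ Do you have key 3? Filter: yes. Store: no (unnecessary access). Answer: no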

Page 35: Probabilistic algorithms for fun and pseudorandom profit

BLOOM FILTERS

➤ 2 operations: add, lookup

➤ Allocate bit array of length m

➤ k hash functions

➤ Configure m and k for desired false-positive rate
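
The slide does not spell out how m and k relate to the false-positive rate. As a hedged aside, the standard approximation for a filter holding n elements is given below; n and ε are symbols introduced here for the example, not taken from the talk.

```latex
\varepsilon \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\ln 2,
\qquad
m = -\frac{n \ln \varepsilon}{(\ln 2)^{2}}
```

For example, n = 1,000,000 and ε = 0.01 give m ≈ 9.59 million bits (about 1.2 MB) and k ≈ 6.6, which rounds to 7 hash functions.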

Page 36: Probabilistic algorithms for fun and pseudorandom profit

BLOOM FILTERS

➤ Add element:

➤ Hash with k functions to get k indices

➤ Set bits at each index

➤ Lookup:

➤ Hash with k functions to get k indices

➤ Check bit at each index

➤ If any bit is unset, element not in set
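
The add and lookup steps above can be sketched in a few lines. This is a minimal illustration rather than a production filter: the class name, the use of SHA-256, and the double-hashing trick for deriving k indices from one digest are all assumptions of this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of length m and k derived hash indices."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)   # m bits, all initially unset

    def _indices(self, item):
        # Derive k indices from two 64-bit slices of one SHA-256 digest
        # (double hashing), standing in for k independent hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force a non-zero stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def lookup(self, item):
        # False means "definitely not in set"; True means "probably in set".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add("foo")
print(bf.lookup("foo"))   # an added element is always reported as present
```

Because bits are only ever set, a lookup can return a false positive but never a false negative.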

Page 37: Probabilistic algorithms for fun and pseudorandom profit

BLOOM FILTERS

➤ Benefits:

➤ More space-efficient than hash table or bit array

➤ Can determine trade-off between accuracy and space

➤ Drawbacks:

➤ Some elements potentially more sensitive to false positives than others (solvable by partitioning)

➤ Can’t remove elements

➤ Requires a priori knowledge of the dataset

➤ Over-provisioned filter wastes space

Page 38: Probabilistic algorithms for fun and pseudorandom profit

Bloom filters are great for efficient offline processing, but what about streaming?

Page 39: Probabilistic algorithms for fun and pseudorandom profit

BLOOM FILTERS WITH A TWIST

➤ Rotating Bloom filters

➤ e.g. remember everything in the last hour

➤ Scalable Bloom Filters

➤ Dynamically allocating chained filters

➤ Stable Bloom Filters

➤ Continuously evict stale data

Page 40: Probabilistic algorithms for fun and pseudorandom profit

Scalable Bloom Filters

P. S. Almeida, C. Baquero, N. Preguiça, D. Hutchison. Scalable Bloom Filters. 2007.

Page 46: Probabilistic algorithms for fun and pseudorandom profit

[Figure: l chained filters, each with error probability P0]

P0 = error prob. of 1 filter

l = # filters

P = compound error prob.

$$P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$$

Page 47: Probabilistic algorithms for fun and pseudorandom profit

P0 = 0.1

$$P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$$

Page 48: Probabilistic algorithms for fun and pseudorandom profit

SCALABLE BLOOM FILTERS

➤ Questions:

➤ When to add a new filter?

➤ How to place a tight upper bound on P?

Page 49: Probabilistic algorithms for fun and pseudorandom profit

SCALABLE BLOOM FILTERS

➤ When to add a new filter?

➤ Fill ratio p = # set bits / # bits

➤ Add new filter when target p is reached

➤ Optimal target p = 0.5 (math follows from paper)

Page 50: Probabilistic algorithms for fun and pseudorandom profit

SCALABLE BLOOM FILTERS

➤ How to place a tight upper bound on P?

➤ Apply tightening ratio r to P0, where 0 < r < 1

➤ Start with 1 filter, error probability P0

➤ When full, add new filter, error probability P1 = P0·r

➤ Results in geometric series: $\sum_{i \ge 0} P_0 r^i = P_0 / (1 - r)$

➤ Series converges, bounding the compound error probability P

Page 51: Probabilistic algorithms for fun and pseudorandom profit

P0 = 0.1, r = 0.5

$$P = 1 - \prod_{i=0}^{l-1} (1 - P_0 r^i)$$
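
To see what the tightening ratio buys, the compound error probability can be evaluated numerically with and without it. This is a small sketch; the function name `compound_error` and the chosen values of l are assumptions made for illustration.

```python
def compound_error(p0, l, r=1.0):
    """P = 1 - prod_{i=0}^{l-1} (1 - p0 * r**i); r=1.0 means no tightening."""
    all_correct = 1.0
    for i in range(l):
        all_correct *= 1.0 - p0 * r ** i
    return 1.0 - all_correct


# Columns: number of filters, P without tightening, P with r = 0.5
for l in (1, 2, 4, 8, 16):
    print(l, round(compound_error(0.1, l), 4), round(compound_error(0.1, l, r=0.5), 4))
```

With P0 = 0.1, two untightened filters already give P = 0.19 and the value keeps climbing toward 1 as filters are added. With r = 0.5, two filters give P = 0.145, and P stays below P0 / (1 - r) = 0.2 however many filters are chained.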

Page 52: Probabilistic algorithms for fun and pseudorandom profit

SCALABLE BLOOM FILTERS

➤ Add elements to last filter

➤ Check each filter on lookups

➤ Tightening ratio r controls m and k for new filters

Page 53: Probabilistic algorithms for fun and pseudorandom profit

SCALABLE BLOOM FILTERS

➤ Benefits:

➤ Can grow dynamically to accommodate dataset

➤ Provides tight upper bound on false-positive rate

➤ Can control growth rate

➤ Drawbacks:

➤ Size still proportional to dataset

➤ Additional computation on adds (negligible amortized)

Page 54: Probabilistic algorithms for fun and pseudorandom profit

Stable Bloom Filters

F. Deng, D. Rafiei. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters. 2006.

Page 55: Probabilistic algorithms for fun and pseudorandom profit

DUPLICATE DETECTION

➤ Query processing

➤ URL crawling

➤ Monitoring distinct IP addresses

➤ Advertiser click streams

➤ Graph processing

Page 57: Probabilistic algorithms for fun and pseudorandom profit

Bloom filters are remarkably useful for dealing with graph data.

Page 58: Probabilistic algorithms for fun and pseudorandom profit

GRAPH PROCESSING

➤ Detecting cycles

➤ Pruning search space

➤ E.g. often used in bioinformatics

➤ Storing chemical structures, properties, and molecular fingerprints in filters to optimize searches and determine structural similarities

➤ Rapid classification of DNA sequences as large as the human genome

Page 59: Probabilistic algorithms for fun and pseudorandom profit

GRAPH PROCESSING

➤ Store crawled nodes in memory

➤ Set of nodes may be too large to fit in memory

➤ Store crawled nodes in secondary storage

➤ Too many searches to perform in limited time

Page 60: Probabilistic algorithms for fun and pseudorandom profit

Precisely eliminating duplicates in an unbounded stream isn’t feasible with

limited space and time.

Page 61: Probabilistic algorithms for fun and pseudorandom profit

Efficacy/Efficiency Conjecture: In many situations, a quick answer with an allowable error rate is better than a precise one that is slow.

Page 62: Probabilistic algorithms for fun and pseudorandom profit

Staleness Conjecture: In many situations, more recent data has more value than stale data.

Page 63: Probabilistic algorithms for fun and pseudorandom profit

STABLE BLOOM FILTERS

➤ Discards old data to make room for new data

➤ Replace bit array with array of d-bit counters

➤ Initialize counters to zero

➤ Maximum counter value Max = 2^d - 1

Page 64: Probabilistic algorithms for fun and pseudorandom profit

STABLE BLOOM FILTERS

➤ Add element:

➤ Select P random counters and decrement by one

➤ Hash with k functions to get k indices

➤ Set counters at each index to Max

➤ Lookup:

➤ Hash with k functions to get k indices

➤ Check counter at each index

➤ If any counter is zero, element not in set
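
The add and lookup steps above might look like this in code. It is a simplified sketch: the class name, the SHA-256 double hashing, and storing each d-bit counter in a plain Python int are assumptions of this example, and the parameter `p` stands for the slide's P (counters decremented per add).

```python
import hashlib
import random


class StableBloomFilter:
    """Sketch of a Stable Bloom filter: m counters of d bits each, k hashes."""

    def __init__(self, m, k, d, p, seed=None):
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1          # Max = 2^d - 1
        self.cells = [0] * m             # one int per counter, for clarity only
        self.rng = random.Random(seed)

    def _indices(self, item):
        # k indices derived from one digest (assumed stand-in for k hash functions).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Step 1: decrement p randomly chosen counters (never below zero).
        for _ in range(self.p):
            i = self.rng.randrange(self.m)
            if self.cells[i] > 0:
                self.cells[i] -= 1
        # Step 2: set the k hashed counters to Max.
        for i in self._indices(item):
            self.cells[i] = self.max

    def lookup(self, item):
        # Any zero counter means "not in set" (possibly a false negative
        # if the element was evicted by later decrements).
        return all(self.cells[i] > 0 for i in self._indices(item))


sbf = StableBloomFilter(m=1000, k=3, d=2, p=10, seed=1)
sbf.add("foo")
print(sbf.lookup("foo"))   # just-added elements are always found
```

Setting d=1 and p=0 disables eviction, which reduces this to the classic Bloom filter.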

Page 65: Probabilistic algorithms for fun and pseudorandom profit

STABLE BLOOM FILTERS

➤ Classic Bloom filter a special case of SBF w/ d=1, P=0

➤ Tight upper bound on false positives

➤ FP rate asymptotically approaches configurable fixed constant (stable-point property)

➤ See paper for math and parameter settings

➤ Evicting data introduces false negatives

Page 66: Probabilistic algorithms for fun and pseudorandom profit

STABLE BLOOM FILTERS

➤ Benefits:

➤ Fixed memory allocation

➤ Evicts old data to make room for new data

➤ Provides tight upper bound on false positives

➤ Drawbacks:

➤ Introduces false negatives

➤ Additional computation on adds

Page 67: Probabilistic algorithms for fun and pseudorandom profit

Count-Min Sketch

G. Cormode, S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications. 2003.

Page 68: Probabilistic algorithms for fun and pseudorandom profit

Can we count element frequencies using sub-linear space?

page views

94.136.205.1 → 7

132.208.90.15 → 4

54.222.151.15 → 11

Page 69: Probabilistic algorithms for fun and pseudorandom profit

COUNT-MIN SKETCH

➤ Approximates frequencies in sub-linear space

➤ Matrix with w columns and d rows

➤ Each row has a hash function

➤ Each cell initialized to zero

➤ When element arrives:

➤ Hash for each row

➤ Increment each counter by 1

➤ freq(element) = min counter value
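
The steps above translate almost directly into code. This is a minimal sketch under a few assumptions: the class name is made up, each row's hash function is simulated by salting SHA-256 with the row number, and the IP addresses and counts are reused from the page-views slide purely as sample data.

```python
import hashlib


class CountMinSketch:
    """Matrix of d rows by w columns; each row has its own hash function."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]   # every cell starts at zero

    def _index(self, row, item):
        # Salting with the row number stands in for d distinct hash functions.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item):
        for row in range(self.d):
            self.rows[row][self._index(row, item)] += 1

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][self._index(row, item)]
                   for row in range(self.d))


views = {"94.136.205.1": 7, "132.208.90.15": 4, "54.222.151.15": 11}
cms = CountMinSketch(w=64, d=4)
for ip, n in views.items():
    for _ in range(n):
        cms.add(ip)
for ip, n in views.items():
    print(ip, n, cms.estimate(ip))   # the estimate is never below the true count
```

The memory used is w × d counters regardless of how many distinct elements arrive.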

Page 70: Probabilistic algorithms for fun and pseudorandom profit

COUNT-MIN SKETCH

➤ Why the minimum?

➤ Possibility for collisions between elements

➤ Counter may be incremented by multiple elements

➤ Taking minimum counter value gives closer approximation

Page 71: Probabilistic algorithms for fun and pseudorandom profit

COUNT-MIN SKETCH

➤ Benefits:

➤ Simple!

➤ Sub-linear space

➤ Useful for detecting “heavy hitters”

➤ Easy to track top-k by adding a min-heap

➤ Drawbacks:

➤ Biased estimator: may overestimate, never underestimates

➤ Better suited to Zipfian distributions & rare events

Page 72: Probabilistic algorithms for fun and pseudorandom profit

HyperLogLog

P. Flajolet, É. Fusy, O. Gandouet, F. Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007.

Page 73: Probabilistic algorithms for fun and pseudorandom profit

How do we count distinct things in a stream?

Page 74: Probabilistic algorithms for fun and pseudorandom profit

COUNTING PROBLEMS

➤ E.g. how many different words are used in Wikipedia?

➤ Counter per element explodes memory

➤ Usually requires memory proportional to cardinality

➤ Can we approximate cardinality with constant space?

Page 75: Probabilistic algorithms for fun and pseudorandom profit

HYPERLOGLOG

➤ The name: can estimate cardinality of set w/ cardinality N_max using log log(N_max) + O(1) bits

➤ Hash element to integer

➤ Count number of leading 0’s in binary form of hash

➤ Track highest number of leading 0’s, n

➤ Cardinality ≈ 2^(n+1)

Page 76: Probabilistic algorithms for fun and pseudorandom profit

HYPERLOGLOG

➤ stream = [“foo”, “bar”, “baz”, “qux”]

➤ h(“foo”) = 10100001

➤ h(“bar”) = 01110111

➤ h(“baz”) = 01110100

➤ h(“qux”) = 10100011

➤ n = 1

➤ |stream| ≈ 2^(n+1) = 2^2 = 4
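
The worked example on this slide can be replayed in code. The 8-bit hash values are copied from the slide; the helper name `leading_zeros` is an assumption of this sketch.

```python
def leading_zeros(bits):
    """Number of consecutive '0' characters at the start of a bit string."""
    return len(bits) - len(bits.lstrip("0"))


# Hash values as given on the slide (8-bit strings).
hashes = {
    "foo": "10100001",
    "bar": "01110111",
    "baz": "01110100",
    "qux": "10100011",
}

n = max(leading_zeros(h) for h in hashes.values())   # highest run of leading 0's
estimate = 2 ** (n + 1)
print(n, estimate)   # prints "1 4"
```

Only the single number n has to be kept between elements, which is why the memory cost is so small.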

Page 78: Probabilistic algorithms for fun and pseudorandom profit

It’s actually not magic but just a few really clever observations.

Page 79: Probabilistic algorithms for fun and pseudorandom profit

With 50/50 odds, how long will it take to flip 3 heads in a row? 20? 100?

Page 80: Probabilistic algorithms for fun and pseudorandom profit

HYPERLOGLOG

➤ Replace “heads” and “tails” with 0’s and 1’s

➤ Count leading consecutive 0’s in binary form of hash

➤ E.g. imagine a 4-bit hash, 16 possible values:

➤ 0000 → 4 leading 0’s

➤ 0001 → 3 leading 0’s

➤ 0011, 0010 → 2 leading 0’s

➤ 0100, 0111, 0110, 0101 → 1 leading 0

➤ 1111, 1110, 1001, 1010, 1101, 1100, 1011, 1000 → 0 leading 0’s

➤ Assume good hash function → 1/16 odds for each permutation

Page 81: Probabilistic algorithms for fun and pseudorandom profit

HYPERLOGLOG

➤ Track highest number of leading 0’s, n

➤ n = 0 → 8/16=1/2 odds

➤ n = 1 → 4/16=1/4 odds

➤ n = 2 → 2/16=1/8 odds

➤ n = 3 → 1/16 odds

➤ Cardinality ≈ how many things did we have to look at?

➤ E.g. highest count = 1 → 1/4 odds → cardinality 4

Page 83: Probabilistic algorithms for fun and pseudorandom profit

HYPERLOGLOG

➤ 1/2 of all binary numbers start with 1

➤ Each additional bit cuts the probability in half:

➤ 1/4 start with 01

➤ 1/8 start with 001

➤ 1/16 start with 0001

➤ etc.

➤ P(run of length n) = 1 / 2^(n+1)

➤ Seeing 001 has 1/8 probability, meaning we had to look at approximately 8 things until we saw it (cardinality 8)

➤ Cardinality ≈ prob^(-1) (reciprocal of probability)

Page 84: Probabilistic algorithms for fun and pseudorandom profit

What about outliers?

Page 85: Probabilistic algorithms for fun and pseudorandom profit

HYPERLOGLOG

➤ Use multiple buckets

➤ Use first few bits of hash to determine bucket

➤ Use remaining bits to count 0’s

➤ Each bucket tracks its own count

➤ Take harmonic mean of all buckets to get cardinality

➤ min(x1…xn) ≤ H(x1…xn) ≤ n min(x1…xn)

[Figure: the 32-bit hash 01011010001011100101001001101010, with its leading bits labeled “bucket” and the remaining bits labeled “counting space”]
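
A bucketed estimator along these lines could be sketched as follows. Several details are assumptions of this example rather than content from the slides: the class name, SHA-256 truncated to 64 bits as the hash, and the bias constant and small-range correction, which follow the Flajolet et al. paper. As in the paper, each register stores the position of the first 1 bit (leading zeros plus one) rather than the raw zero count.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with m = 2**b registers (use b >= 7)."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        width = 64 - self.b
        bucket = x >> width                         # first b bits pick the bucket
        rest = x & ((1 << width) - 1)               # remaining bits are the counting space
        rank = width - rest.bit_length() + 1        # leading zeros in rest, plus one
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        # Normalized harmonic mean of 2**register over all buckets.
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting) while buckets are still empty.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(b=10)
for i in range(10_000):
    hll.add(f"item-{i}")
print(round(hll.count()))   # should land near 10,000
```

With 1,024 registers the expected relative error is roughly 1.04 / √1024 ≈ 3%, and the memory stays fixed no matter how many elements are added.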

Page 86: Probabilistic algorithms for fun and pseudorandom profit

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

Number of distinct words in all of Shakespeare's work

Page 87: Probabilistic algorithms for fun and pseudorandom profit

HYPERLOGLOG

➤ Benefits:

➤ Constant memory

➤ Super fast (calculating MSB is cheap)

➤ Can give accurate count with <1% error

➤ Drawbacks:

➤ Has a margin of error (albeit small)

Page 88: Probabilistic algorithms for fun and pseudorandom profit

What did we learn?

Page 89: Probabilistic algorithms for fun and pseudorandom profit

Data processing has trade-offs.

Page 90: Probabilistic algorithms for fun and pseudorandom profit

Probabilistic algorithms trade accuracy for speed and space.

Page 91: Probabilistic algorithms for fun and pseudorandom profit

Often we only care about answers that are mostly correct but available now.

Page 92: Probabilistic algorithms for fun and pseudorandom profit

Sometimes the “right” answer is impossible to compute or simply doesn’t exist.

Page 93: Probabilistic algorithms for fun and pseudorandom profit

But mostly…

Page 94: Probabilistic algorithms for fun and pseudorandom profit

Probabilistic algorithms are just damn cool.

Page 95: Probabilistic algorithms for fun and pseudorandom profit

What about the code?

Page 96: Probabilistic algorithms for fun and pseudorandom profit

ALGORITHM IMPLEMENTATIONS

➤ Algebird - https://github.com/twitter/algebird

➤ Bloom filter

➤ Count-min sketch

➤ HyperLogLog

➤ stream-lib - https://github.com/addthis/stream-lib

➤ Bloom filter

➤ Count-min sketch

➤ HyperLogLog

➤ Boom Filters - https://github.com/tylertreat/BoomFilters

➤ Bloom filter

➤ Scalable Bloom filter

➤ Stable Bloom filter

➤ Count-min sketch

➤ HyperLogLog

Page 97: Probabilistic algorithms for fun and pseudorandom profit

OTHER COOL PROBABILISTIC ALGORITHMS

➤ Counting Bloom filter (and many other Bloom variations)

➤ Bloomier filter (encode functions instead of sets)

➤ Cuckoo filter (Bloom filter w/ cuckoo hashing)

➤ q-digest (quantile approximation)

➤ t-digest (online accumulation of rank-based statistics)

➤ Locality-sensitive hashing (hash similar items to same buckets)

➤ MinHash (set similarity)

➤ Miller–Rabin (primality testing)

➤ Karger’s algorithm (min cut of connected graph)

Page 98: Probabilistic algorithms for fun and pseudorandom profit

@tyler_treat

github.com/tylertreat

bravenewgeek.com

Thanks

We’re hiring!

Page 99: Probabilistic algorithms for fun and pseudorandom profit

BIBLIOGRAPHY

Almeida, P., Baquero, C., Preguiça, N., Hutchison, D. 2007. Scalable Bloom Filters; http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf

Bloom, B. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors; https://www.cs.upc.edu/~diaz/p422-bloom.pdf

Cormode, G., & Muthukrishnan, S. 2003. An Improved Data Stream Summary: The Count-Min Sketch and its Applications; http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf

Deng, F., & Rafiei, D. 2006. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters; https://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf

Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm; http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J. 2010. Classification of DNA sequences using Bloom filters. Bioinformatics, 26(13); http://bioinformatics.oxfordjournals.org/content/26/13/1595.full.pdf

Tarkoma, S., Rothenberg, C., & Lagerspetz, E. 2011. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials, 14(1); https://gnunet.org/sites/default/files/TheoryandPracticeBloomFilter2011Tarkoma.pdf

Treat, T. 2015. Stream Processing and Probabilistic Methods: Data at Scale; http://bravenewgeek.com/stream-processing-and-probabilistic-methods