Sublinear Algorihms for Big Data

Sublinear Algorihms for Big Data

Grigory Yaroslavtsevhttp://grigory.us

Lecture 3

http://grigory.us/

SOFSEM 2015

• URL: http://www.sofsem.cz• 41st International Conference on Current Trends in Theory

and Practice of Computer Science (SOFSEM’15)• When and where?

– January 24-29, 2015. Czech Republic, Pec pod Snezkou• Deadlines:

– August 1st (tomorrow!): Abstract submission– August 15th: Full papers (proceedings in LNCS)

• I am on the Program Committee ;)

http://www.sofsem.cz/

Today

• Count-Min (continued), Count Sketch• Sampling methods– -sampling– -sampling

• Sparse recovery– -sparse recovery

• Graph sketching

Recap

• Stream: elements from universe , e.g.

• = frequency of in the stream = # of occurrences of value

Count-Min

• are -wise independent hash functions• Maintain counters with values:

elements in the stream with • For every the value and so:

• If and then:.

More about Count-Min

• Authors: Graham Cormode, S. Muthukrishnan [LATIN’04]• Count-Min is linear:

Count-Min(S1 + S2) = Count-Min(S1) + Count-Min(S2)

• Deterministic version: CR-Precis• Count-Min vs. Bloom filters

– Allows to approximate values, not just 0/1 (set membership)– Doesn’t require mutual independence (only 2-wise)

• FAQ and Applications: – https://sites.google.com/site/countminsketch/home/– https://sites.google.com/site/countminsketch/home/faq

https://sites.google.com/site/countminsketch/home/

https://sites.google.com/site/countminsketch/home/faq

Fully Dynamic Streams

• Stream: updates that define vector where . • Example: For

• Count Sketch: Count-Min with random signs and median instead of min:

Count Sketch

• In addition to use random signs

• Estimate:

• Parameters:

-Sampling

• Stream: updates that define vector where . • -Sampling: Return random and :

Application: Social Networks

• Each of people in a social network is friends with some arbitrary set of other people

• Each person knows only about their friends• With no communication in the network, each

person sends a postcard to Mark Z.• If Mark wants to know if the graph is

connected, how long should the postcards be?

Optimal estimation

• Yesterday: -approximate – space for – space for

• New algorithm: Let be an -sample.Return , where is an estimate of

• Expectation:

Optimal estimation

• New algorithm: Let be an -sample.Return , where is an estimate of

• Variance:

• Exercise: Show that • Overall:

– Apply average + median to copies

-Sampling: Basic Overview

• Assume Weight by where

where • For some value return if there is a unique such

that

• Probability is returned if is large enough:

• Probability some value is returned repeat times.

-Sampling: Part 1

• Use Count-Sketch with parameters to sketch • To estimate

and • Lemma: With high probability if

• Corollary: With high probability if and

• Exercise: for large

Proof of Lemma• Let • By the analysis of Count Sketch and by Markov:

• If then • If , then

where the last inequality holds with probability

• Take median over repetitions high probability

-Sampling: Part 2

• Let if and otherwise• If there is a unique with then return • Note that if then and so

,therefore

• Lemma: With probability there is a unique such that . If so then

• Thm: Repeat times. Space:

Proof of Lemma

• Let We can upper-bound

Similarly, • Using independence of , probability of unique

with :

Proof of Lemma

• Let We can upper-bound

Similarly, • We just showed:

• If there is a unique i, probability is:

-sampling

• Maintain and -approximation to • Hash items using for • For each , maintain:

• Lemma: At level there is a unique element in the streams that maps to (with constant probability)

• Uniqueness is verified if . If so, then output as the index and as the count.

Proof of Lemma

• Let and note that • For any • Probability there exists a unique such that ,

• Holds even if are only 2-wise independent

Sparse Recovery

• Goal: Find such that is minimized among s with at most non-zero entries.

• Definition: • Exercise: where are indices of largest

• Using space we can find such that and

Count-Min Revisited• Use Count-Min with • For let for some row • Let be the indices with max. frequencies. Let

be the event there doesn’t exist with • Then for

• Because w.h.p. all ’s approx . up to

Sparse Recovery Algorithm

• Use Count-Min with • Let be frequency estimates:

• Let be with all but the k-th largest entries replaced by 0.

• Lemma:

• Let be indices corresponding to largest value of and .• For a vector and denote as the vector formed by

zeroing out all entries of except for those in .

+ 2 + 2

Thank you!

• Questions?

Documents

Sublinear Algorihms for Big Data