Upload
niesha
View
54
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Sublinear Algorihms for Big Data. Lecture 3. Grigory Yaroslavtsev http://grigory.us. SOFSEM 2015. URL: http://www.sofsem.cz 41 st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’15) When and where? - PowerPoint PPT Presentation
Citation preview
SOFSEM 2015
• URL: http://www.sofsem.cz• 41st International Conference on Current Trends in Theory
and Practice of Computer Science (SOFSEM’15)• When and where?
– January 24-29, 2015. Czech Republic, Pec pod Snezkou• Deadlines:
– August 1st (tomorrow!): Abstract submission– August 15th: Full papers (proceedings in LNCS)
• I am on the Program Committee ;)
Today
• Count-Min (continued), Count Sketch• Sampling methods– -sampling– -sampling
• Sparse recovery– -sparse recovery
• Graph sketching
Recap
• Stream: elements from universe , e.g.
• = frequency of in the stream = # of occurrences of value
Count-Min
• are -wise independent hash functions• Maintain counters with values:
elements in the stream with • For every the value and so:
• If and then:.
More about Count-Min
• Authors: Graham Cormode, S. Muthukrishnan [LATIN’04]• Count-Min is linear:
Count-Min(S1 + S2) = Count-Min(S1) + Count-Min(S2)
• Deterministic version: CR-Precis• Count-Min vs. Bloom filters
– Allows to approximate values, not just 0/1 (set membership)– Doesn’t require mutual independence (only 2-wise)
• FAQ and Applications: – https://sites.google.com/site/countminsketch/home/– https://sites.google.com/site/countminsketch/home/faq
Fully Dynamic Streams
• Stream: updates that define vector where . • Example: For
• Count Sketch: Count-Min with random signs and median instead of min:
Count Sketch
• In addition to use random signs
• Estimate:
• Parameters:
-Sampling
• Stream: updates that define vector where . • -Sampling: Return random and :
Application: Social Networks
• Each of people in a social network is friends with some arbitrary set of other people
• Each person knows only about their friends• With no communication in the network, each
person sends a postcard to Mark Z.• If Mark wants to know if the graph is
connected, how long should the postcards be?
Optimal estimation
• Yesterday: -approximate – space for – space for
• New algorithm: Let be an -sample.Return , where is an estimate of
• Expectation:
Optimal estimation
• New algorithm: Let be an -sample.Return , where is an estimate of
• Variance:
• Exercise: Show that • Overall:
– Apply average + median to copies
-Sampling: Basic Overview
• Assume Weight by where
where • For some value return if there is a unique such
that
• Probability is returned if is large enough:
• Probability some value is returned repeat times.
-Sampling: Part 1
• Use Count-Sketch with parameters to sketch • To estimate
and • Lemma: With high probability if
• Corollary: With high probability if and
• Exercise: for large
Proof of Lemma• Let • By the analysis of Count Sketch and by Markov:
• If then • If , then
where the last inequality holds with probability
• Take median over repetitions high probability
-Sampling: Part 2
• Let if and otherwise• If there is a unique with then return • Note that if then and so
,therefore
• Lemma: With probability there is a unique such that . If so then
• Thm: Repeat times. Space:
Proof of Lemma
• Let We can upper-bound
Similarly, • Using independence of , probability of unique
with :
Proof of Lemma
• Let We can upper-bound
Similarly, • We just showed:
• If there is a unique i, probability is:
-sampling
• Maintain and -approximation to • Hash items using for • For each , maintain:
• Lemma: At level there is a unique element in the streams that maps to (with constant probability)
• Uniqueness is verified if . If so, then output as the index and as the count.
Proof of Lemma
• Let and note that • For any • Probability there exists a unique such that ,
• Holds even if are only 2-wise independent
Sparse Recovery
• Goal: Find such that is minimized among s with at most non-zero entries.
• Definition: • Exercise: where are indices of largest
• Using space we can find such that and
Count-Min Revisited• Use Count-Min with • For let for some row • Let be the indices with max. frequencies. Let
be the event there doesn’t exist with • Then for
• Because w.h.p. all ’s approx . up to
Sparse Recovery Algorithm
• Use Count-Min with • Let be frequency estimates:
• Let be with all but the k-th largest entries replaced by 0.
• Lemma:
• Let be indices corresponding to largest value of and .• For a vector and denote as the vector formed by
zeroing out all entries of except for those in .
+ 2 + 2
Thank you!
• Questions?