25
Sublinear Algorihms for Big Data Grigory Yaroslavtsev http://grigory.us Lectur e 3

Sublinear Algorihms for Big Data

  • Upload
    niesha

  • View
    54

  • Download
    0

Embed Size (px)

DESCRIPTION

Sublinear Algorihms for Big Data. Lecture 3. Grigory Yaroslavtsev http://grigory.us. SOFSEM 2015. URL: http://www.sofsem.cz 41 st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’15) When and where? - PowerPoint PPT Presentation

Citation preview

Page 1: Sublinear Algorihms  for Big Data

Sublinear Algorihms for Big Data

Grigory Yaroslavtsevhttp://grigory.us

Lecture 3

Page 2: Sublinear Algorihms  for Big Data

SOFSEM 2015

• URL: http://www.sofsem.cz• 41st International Conference on Current Trends in Theory

and Practice of Computer Science (SOFSEM’15)• When and where?

– January 24-29, 2015. Czech Republic, Pec pod Snezkou• Deadlines:

– August 1st (tomorrow!): Abstract submission– August 15th: Full papers (proceedings in LNCS)

• I am on the Program Committee ;)

Page 3: Sublinear Algorihms  for Big Data

Today

• Count-Min (continued), Count Sketch• Sampling methods– -sampling– -sampling

• Sparse recovery– -sparse recovery

• Graph sketching

Page 4: Sublinear Algorihms  for Big Data

Recap

• Stream: elements from universe , e.g.

• = frequency of in the stream = # of occurrences of value

Page 5: Sublinear Algorihms  for Big Data

Count-Min

• are -wise independent hash functions• Maintain counters with values:

elements in the stream with • For every the value and so:

• If and then:.

Page 6: Sublinear Algorihms  for Big Data

More about Count-Min

• Authors: Graham Cormode, S. Muthukrishnan [LATIN’04]• Count-Min is linear:

Count-Min(S1 + S2) = Count-Min(S1) + Count-Min(S2)

• Deterministic version: CR-Precis• Count-Min vs. Bloom filters

– Allows to approximate values, not just 0/1 (set membership)– Doesn’t require mutual independence (only 2-wise)

• FAQ and Applications: – https://sites.google.com/site/countminsketch/home/– https://sites.google.com/site/countminsketch/home/faq

Page 7: Sublinear Algorihms  for Big Data

Fully Dynamic Streams

• Stream: updates that define vector where . • Example: For

• Count Sketch: Count-Min with random signs and median instead of min:

Page 8: Sublinear Algorihms  for Big Data

Count Sketch

• In addition to use random signs

• Estimate:

• Parameters:

Page 9: Sublinear Algorihms  for Big Data

-Sampling

• Stream: updates that define vector where . • -Sampling: Return random and :

Page 10: Sublinear Algorihms  for Big Data

Application: Social Networks

• Each of people in a social network is friends with some arbitrary set of other people

• Each person knows only about their friends• With no communication in the network, each

person sends a postcard to Mark Z.• If Mark wants to know if the graph is

connected, how long should the postcards be?

Page 11: Sublinear Algorihms  for Big Data

Optimal estimation

• Yesterday: -approximate – space for – space for

• New algorithm: Let be an -sample.Return , where is an estimate of

• Expectation:

Page 12: Sublinear Algorihms  for Big Data

Optimal estimation

• New algorithm: Let be an -sample.Return , where is an estimate of

• Variance:

• Exercise: Show that • Overall:

– Apply average + median to copies

Page 13: Sublinear Algorihms  for Big Data

-Sampling: Basic Overview

• Assume Weight by where

where • For some value return if there is a unique such

that

• Probability is returned if is large enough:

• Probability some value is returned repeat times.

Page 14: Sublinear Algorihms  for Big Data

-Sampling: Part 1

• Use Count-Sketch with parameters to sketch • To estimate

and • Lemma: With high probability if

• Corollary: With high probability if and

• Exercise: for large

Page 15: Sublinear Algorihms  for Big Data

Proof of Lemma• Let • By the analysis of Count Sketch and by Markov:

• If then • If , then

where the last inequality holds with probability

• Take median over repetitions high probability

Page 16: Sublinear Algorihms  for Big Data

-Sampling: Part 2

• Let if and otherwise• If there is a unique with then return • Note that if then and so

,therefore

• Lemma: With probability there is a unique such that . If so then

• Thm: Repeat times. Space:

Page 17: Sublinear Algorihms  for Big Data

Proof of Lemma

• Let We can upper-bound

Similarly, • Using independence of , probability of unique

with :

Page 18: Sublinear Algorihms  for Big Data

Proof of Lemma

• Let We can upper-bound

Similarly, • We just showed:

• If there is a unique i, probability is:

Page 19: Sublinear Algorihms  for Big Data

-sampling

• Maintain and -approximation to • Hash items using for • For each , maintain:

• Lemma: At level there is a unique element in the streams that maps to (with constant probability)

• Uniqueness is verified if . If so, then output as the index and as the count.

Page 20: Sublinear Algorihms  for Big Data

Proof of Lemma

• Let and note that • For any • Probability there exists a unique such that ,

• Holds even if are only 2-wise independent

Page 21: Sublinear Algorihms  for Big Data

Sparse Recovery

• Goal: Find such that is minimized among s with at most non-zero entries.

• Definition: • Exercise: where are indices of largest

• Using space we can find such that and

Page 22: Sublinear Algorihms  for Big Data

Count-Min Revisited• Use Count-Min with • For let for some row • Let be the indices with max. frequencies. Let

be the event there doesn’t exist with • Then for

• Because w.h.p. all ’s approx . up to

Page 23: Sublinear Algorihms  for Big Data

Sparse Recovery Algorithm

• Use Count-Min with • Let be frequency estimates:

• Let be with all but the k-th largest entries replaced by 0.

• Lemma:

Page 24: Sublinear Algorihms  for Big Data

• Let be indices corresponding to largest value of and .• For a vector and denote as the vector formed by

zeroing out all entries of except for those in .

+ 2 + 2

Page 25: Sublinear Algorihms  for Big Data

Thank you!

• Questions?