Lower bounds on data stream computations Seminar in Communication Complexity By Michael Umansky Instructor: Ronitt Rubinfeld


Page 1: Lower bounds on data stream computations

Lower bounds on data stream computations

Seminar in Communication Complexity

By Michael Umansky
Instructor: Ronitt Rubinfeld

Page 2: Lower bounds on data stream computations

Previously...

We proved 3 theorems concerning space complexity of data stream algorithms.

Using the streaming model discussed earlier, we derived lower bounds for the MAX, MAXNEIGHBOR, MAXTOTAL and MAXPATH problems.

And now, for something completely different.

Page 3: Lower bounds on data stream computations

Today

In this lecture, I introduce lower bounds from communication complexity.

Trust me, they are correct.

Using these bounds and (mostly) reductions, our goal is to prove even more theorems. Theorems are good.

I'll prove 3 of them.

Starting with “Theorem 4”.

Page 4: Lower bounds on data stream computations

Theorem 4

Setting: Sequence of m numbers in {1,...,n}.
– Multiple occurrences are allowed.

Claim: In one pass, finding one of the k most frequent items requires Ω(n/k) space.

Moreover, random sampling yields an upper bound of O(n (log m + log n) / k).

We're going to use a blackbox to prove it.

Page 5: Lower bounds on data stream computations

Theorem 4 blackbox

Alon-Matias-Szegedy: Finding the most frequent number in a sequence of length m in range {1,...,n} takes Ω(n) space.

Proof outline: Reduction. Namely, we create a new stream that we can (ab)use this blackbox on.

The reduction will replace each number in the sequence with a sequence of numbers:
– Each i in {1,...,n} is replaced with ki+1,...,ki+k.
– In total, nk possible numbers.

Page 6: Lower bounds on data stream computations

Reduction example

Our data stream is {4,5,3,2,7,3,4,5,1} in range {1,...,10} and we want to obtain the 2 most frequent numbers.

The reduction will create the numbers: {9,10}, {11,12}, {7,8}, {5,6}, {15,16}, {7,8}, {9,10}, {11,12}, {3,4}

The most frequent numbers in the original sequence (3, 4 and 5) map to the most frequent numbers in the new sequence (7 through 12).
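The block-expansion step can be sketched in a few lines of Python. This is only an illustration of the reduction as described on the slides, not code from the lecture; the helper names `expand_stream` and `original_item` are made up for this sketch.

```python
from collections import Counter

def expand_stream(stream, k):
    """Replace each item i by the block k*i+1, ..., k*i+k (as on the slide)."""
    out = []
    for i in stream:
        out.extend(range(k * i + 1, k * i + k + 1))
    return out

def original_item(j, k):
    """Map an item j of the expanded stream back to the item it came from."""
    return (j - 1) // k

stream = [4, 5, 3, 2, 7, 3, 4, 5, 1]
k = 2
expanded = expand_stream(stream, k)

# Exact counting is used here only to inspect the result; a streaming
# algorithm would of course not be allowed this much memory.
counts = Counter(expanded)
top = max(counts.values())
most_frequent_new = sorted(j for j, c in counts.items() if c == top)
most_frequent_old = sorted({original_item(j, k) for j in most_frequent_new})
```

Here `most_frequent_new` holds the tied leaders of the expanded stream, and mapping any one of them back with `original_item` yields a most frequent item of the original stream, which is the property the reduction relies on.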

Page 7: Lower bounds on data stream computations

Proof outline

If x_i = x_j, then the sequences created by the reduction coincide. Otherwise, they are disjoint.

If x_i occurs l times in the stream, each of its k copies occurs l times in the new stream (kl occurrences in total).

It follows that finding one of the k most frequent items in one pass requires Ω(n/k) space. Running this 'algorithm' k times we get the AMS theorem.

Great success.

Page 8: Lower bounds on data stream computations

As for the upper bound

Reminder: a Monte-Carlo algorithm is a randomized algorithm that succeeds with a high probability.

So, to get the upper bound, we'll show a Monte-Carlo algorithm that succeeds with constant probability.

Page 9: Lower bounds on data stream computations

The Monte-Carlo algorithm

Before reading the stream:
– Sample each number with probability 1/k.
– Only keep a counter for the sampled numbers.

Read the stream normally.

Output the successfully sampled number with largest count.

With constant probability, one of the k most frequent numbers has been sampled successfully.

This requires O(n (log m + log n) / k) space. Epic win.
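A minimal sketch of the sampling algorithm above, assuming the universe {1,...,n} is known in advance. The function name `sample_and_count` and the explicit `rng` parameter are choices made for this example, not part of the lecture. The memory held is the sampled set (about n/k identifiers of log n bits each in expectation) plus one counter of at most log m bits per sampled number, which is where the O(n (log m + log n) / k) bound comes from.

```python
import random
from collections import Counter

def sample_and_count(stream, n, k, rng):
    """One pass over `stream`, counting only values sampled beforehand."""
    # Before reading the stream: keep each value of 1..n with probability 1/k.
    sampled = {v for v in range(1, n + 1) if rng.random() < 1.0 / k}

    # Read the stream normally, updating counters of sampled values only.
    counts = Counter()
    for x in stream:
        if x in sampled:
            counts[x] += 1

    # Output the sampled value with the largest count (None if none appeared).
    best = max(counts, key=counts.get) if counts else None
    return best, sampled

stream = [4, 5, 3, 2, 7, 3, 4, 5, 1, 3]
guess, kept = sample_and_count(stream, n=10, k=2, rng=random.Random())
```

The chance that at least one of the k most frequent values lands in the sample is 1 - (1 - 1/k)^k, which is at least 1 - 1/e, so the success probability is a constant independent of k.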

Page 10: Lower bounds on data stream computations

And now for something completely different

Introducing the approximate median problem (AMP).

Reminder: The median is the value which separates the higher half of the set from the lower half.

We want to approximate that. Why? Because it's cool.

Page 11: Lower bounds on data stream computations

This slide isn't the median problem

First, a blackbox from communication complexity.

Consider the bit-vector probing problem:

– Let A have a bit sequence x of length m and B an index i. B needs to know x_i, the i-th input bit.

– But the communication is one-way only: B cannot send anything to A.

Ideas?

Page 12: Lower bounds on data stream computations

Blackbox cont.

Turns out there isn't a better method for A to convey the i-th bit than to send the entire string to B.
– So it takes Ω(m) bits of communication.

But what about randomization?
– Too bad: any algorithm that succeeds in guessing x_i with probability better than (1+ε)/2 requires at least εm bits of communication.

Page 13: Lower bounds on data stream computations

Approximate median problem

Goal: Find a number whose rank is in the interval [m/2 – εm, m/2 + εm].

It can be solved by a one-pass Monte-Carlo algorithm with 1/10 error probability.

Takes O(log n (log(1/ε))^2 / ε) space.

I have a truly magnificent proof of this theorem. This slideshow is too small to contain it.

Page 14: Lower bounds on data stream computations

AMP cont.

Motivation: We want to prove a corresponding lower bound on this problem.

How: We show that any 1-pass Las Vegas algorithm that solves the ε-AMP requires Ω(1/ε) space.

We show a reduction from the bit-vector probing problem.

Page 15: Lower bounds on data stream computations

AMP lower bound proof

Let B = (b_1,...,b_n) be a bit vector, followed by a query index i.

This is translated to a sequence of numbers as follows:
– First, output 2j+b_j, for each j.
– Then, upon getting the query, output n-i+1 copies of 0 and i+1 copies of 2(n+1).

Page 16: Lower bounds on data stream computations

Reduction example

B = (0,1,0,1,1,0,1,1,0,1), n=10, i=5.

The reduction maps:
– 2j+b_j: [2,5,6,9,11,12,15,17,18,21]
– n-i+1 = 6 copies of 0: [0,0,0,0,0,0]
– i+1 = 6 copies of 22 = 2(n+1): [22,22,22,22,22,22]

The median of this sequence (the 11th smallest of its 22 numbers) is 11. Its LSB is 1, which is exactly the value of b_5.
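The example can be checked mechanically. The sketch below builds the reduced stream exactly as described on the slides and reads the queried bit off the median. The function names are invented for this illustration, and "median" is taken to mean the element of rank m/2 (the lower median), which is an assumption consistent with the rank interval used in the problem statement.

```python
def probing_to_stream(bits, i):
    """Turn a bit vector (b_1..b_n) and a 1-indexed query i into a number stream."""
    n = len(bits)
    # First part: 2j + b_j for each j = 1..n (bits[j - 1] holds b_j).
    stream = [2 * j + bits[j - 1] for j in range(1, n + 1)]
    # Second part, emitted once the query index is known.
    stream += [0] * (n - i + 1)
    stream += [2 * (n + 1)] * (i + 1)
    return stream

def lower_median(stream):
    """Element of rank len(stream) / 2 (1-indexed) in sorted order."""
    s = sorted(stream)
    return s[len(s) // 2 - 1]

bits = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
stream = probing_to_stream(bits, 5)
median = lower_median(stream)
recovered_bit = median & 1  # least significant bit of the median
```

Sorting is used here only to verify the reduction. The point of the proof is that a small-space streaming median algorithm would let A pass its memory state to B in place of the bit vector.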

Page 17: Lower bounds on data stream computations

AMP proof cont.

It is easily verified that the least significant bit of the median of this sequence is the value of b_i (that is, the bit we seek).

Choose ε = 1/(2n). Then, up to a constant factor in ε, the ε-approximate median is the exact median. This is true because we have 2n+2 numbers in the “reduced” stream, so the allowed rank window εm is about 1 (strictly, any ε < 1/(2n+2) leaves only the exact median).

Therefore any one-pass algorithm that requires fewer than 1/(2ε) = n bits of memory can be used...

Page 18: Lower bounds on data stream computations

AMP proof cont.

… to derive a communication protocol that requires fewer than n bits to be communicated from A to B in solving bit vector probing.

But every protocol that solves bit vector probing must communicate n bits.

Contradiction. Quod erat demonstrandum.

Page 19: Lower bounds on data stream computations

Corollary

What's the point I've been trying to make?

Randomization can sometimes reduce space complexity significantly, at the cost of guaranteed output correctness.

Moving right along.

Page 20: Lower bounds on data stream computations

Some graph theory

A graph can be considered as a stream.
– Example: Adjacency list.

This means some graph-theoretic problems can be approximated or solved using data stream and communication complexity techniques.

I'll address a small part of them.

Page 21: Lower bounds on data stream computations

Why is this good?

Suppose we can read the stream more than once (we don't have enough memory to store it but we do have access).

But the number of times we can read the stream is bounded.

What possible graph theoretic problems could we approximate with this method?

Page 22: Lower bounds on data stream computations

Theorem 6

In P passes, the following problems on an n-node graph take Ω(n / P) space:
– Computing connected components.
– Computing k-edge connected components.
– Computing k-vertex connected components.
– Testing graph planarity.
– Finding the sinks of a directed graph.

I'll prove graph connectivity.

Page 23: Lower bounds on data stream computations

Connected components

Proof by reduction of DISJOINT to the graph connectivity problem. Reminder: DISJOINT(x,y) asks whether there exists i such that x_i = y_i = 1.

Given bit vectors A and B, construct a graph with vertices {a,b,1,...,n}.

Insert an edge (a,i) iff bit i is set in A's vector and an edge (i,b) iff it's set in B's vector.

Then a and b are in the same connected component iff there exists a bit that's set in both vectors.
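The construction can be sketched as follows. The function names `build_edges` and `same_component` are hypothetical, and the union-find used to check the result needs linear memory, so it is a way to sanity-check the reduction and not a small-space streaming algorithm.

```python
def build_edges(x, y):
    """Edge stream for the reduction: A's edges first, then B's edges."""
    edges = [("a", i) for i, bit in enumerate(x, start=1) if bit]
    edges += [(i, "b") for i, bit in enumerate(y, start=1) if bit]
    return edges

def same_component(edges, s, t):
    """Check with union-find whether s and t end up in one component."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return find(s) == find(t)

x = [1, 0, 1, 0]
y = [0, 1, 1, 0]
intersects = same_component(build_edges(x, y), "a", "b")  # bit 3 is set in both
```

Note that A's edges depend only on A's vector and B's edges only on B's vector, which is what lets the two parties simulate a streaming algorithm by handing over its memory state at the boundary.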

Page 24: Lower bounds on data stream computations

Connectivity cont.

From communication complexity, we know that every DISJOINT-solving protocol sends Ω(n) bits.

So if we have P passes over the data, one of the passes must use Ω(n / P) space. This is a total cheating hack by the way. Blame HRR.
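One way to spell out the pass-counting step, as a sketch under assumptions: suppose a hypothetical P-pass algorithm uses S bits of memory, and let c > 0 be the constant hidden in the Ω(n) bound for DISJOINT (both symbols are introduced here for illustration). In each pass A runs the algorithm on its edges and sends the S-bit memory state to B, who continues on its edges and sends the state back for the next pass, so at most 2P-1 messages of S bits are exchanged.

```latex
% S = memory in bits of the assumed P-pass algorithm (hypothetical symbol)
% c = constant in the DISJOINT communication lower bound (hypothetical symbol)
\[
  \underbrace{(2P-1)\,S}_{\text{bits exchanged over } P \text{ passes}}
  \;\ge\; c\,n
  \quad\Longrightarrow\quad
  S \;\ge\; \frac{c\,n}{2P-1} \;=\; \Omega\!\left(\frac{n}{P}\right).
\]
```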

QED anyway.

That's all folks!