Lower bounds on data stream computations


Seminar in Communication Complexity

By Michael Umansky

Instructor: Ronitt Rubinfeld

Previously...

We proved 3 theorems concerning space complexity of data stream algorithms.

Using the streaming model discussed earlier, we found some lower bounds for the MAX, MAXNEIGHBOR, MAXTOTAL and MAXPATH problems.

And now, for something completely different.

Today

In this lecture, I introduce lower bounds from communication complexity.

Trust me, they are correct.

Using these bounds and (mostly) reductions, our goal is to prove even more theorems. Theorems are good.

I'll prove 3 of them.

Starting with “Theorem 4”.

Theorem 4

Setting: Sequence of m numbers in {1,...,n}.

– Multiple occurrences are allowed.

Claim: Finding one of the k most frequent items in one pass requires Ω(n/k) space.

Moreover, random sampling yields an upper bound of O(n (log m + log n) / k).

We're going to use a blackbox to prove it.

Theorem 4 blackbox

Alon-Matias-Szegedy: Finding the most frequent number in a sequence of length m in range {1,...,n} takes Ω(n) space.

Proof outline: Reduction. Namely, we create a new stream that we can (ab)use this blackbox on.

The reduction will replace each number in the sequence with a sequence of numbers:

– Each i in {1,...,n} is replaced with ki+1,...,ki+k.

– In total, nk numbers.

Reduction example

Our data stream is {4,5,3,2,7,3,4,5,1} in range {1,...,10} and we want to obtain the 2 most frequently occurring numbers.

The reduction will create the numbers: {9,10}, {11,12}, {7,8}, {5,6}, {15,16}, {7,8}, {9,10}, {11,12}, {3,4}

The most frequently occurring numbers in the original sequence correspond to the most frequently occurring numbers in the new sequence.
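A minimal Python sketch of this reduction, replaying the example above. The helper names `expand` and `block_owner` are illustrative assumptions, not part of the slides; the sketch only checks that counts carry over from the original stream to the new one.

```python
from collections import Counter

def expand(stream, k):
    """Replace each item i by the block k*i+1, ..., k*i+k."""
    for i in stream:
        for offset in range(1, k + 1):
            yield k * i + offset

def block_owner(y, k):
    """Map a number of the new stream back to the original item that produced it."""
    return (y - 1) // k

stream = [4, 5, 3, 2, 7, 3, 4, 5, 1]
k = 2
new_stream = list(expand(stream, k))

old_counts = Counter(stream)
new_counts = Counter(new_stream)

# Original items whose blocks hold the most frequent numbers of the new stream.
top_count = max(new_counts.values())
top_owners = {block_owner(y, k) for y, c in new_counts.items() if c == top_count}
```

Here `top_owners` collects the original items behind the most frequent new numbers, so any answer on the new stream can be translated back with `block_owner`.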

Proof outline

If x_i = x_j, then the sequences created by the reduction coincide. Otherwise, they are disjoint.

If x_i occurs l times in the stream, each of the k numbers replacing it occurs l times in the new stream (kl occurrences in total).

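The new stream has a range of size N = nk, and any one of its k most frequent items identifies a most frequent item of the original stream, which by AMS needs Ω(n) space, i.e. Ω(N/k) in terms of the new range. It follows that finding one of the k most frequent items in one pass requires Ω(n/k) space. Running this 'algorithm' k times we get the AMS theorem.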

Great success.

As for the upper bound

Reminder: a Monte-Carlo algorithm is a randomized algorithm that succeeds with a high probability.

So we'll show a Monte-Carlo algorithm that succeeds with high probability to get the right upper bound.

The Monte-Carlo algorithm

Before reading the stream:

– Sample each number with probability 1/k.

– Only keep a counter for the sampled numbers.

Read the stream normally.

Output the successfully sampled number with largest count.

With constant probability (at least 1 - (1 - 1/k)^k ≥ 1 - 1/e), one of the k most frequent numbers has been sampled successfully.

This requires O(n (log m + log n) / k) space. Epic win.
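The sampling algorithm above could be sketched in Python roughly as follows. The function names and the explicit `rng` parameter are assumptions made for illustration. In expectation about n/k numbers are sampled, each costing log n bits for its identity and log m bits for its counter, which is where the stated bound comes from.

```python
import random
from collections import Counter

def sample_candidates(n, k, rng):
    """Before reading the stream: keep each number in {1..n} with probability 1/k."""
    return {x for x in range(1, n + 1) if rng.random() < 1.0 / k}

def most_frequent_sampled(stream, sampled):
    """One pass: count only sampled numbers, return the one with the largest count.

    Returns None if no sampled number appears in the stream.
    """
    counts = Counter()
    for x in stream:
        if x in sampled:
            counts[x] += 1
    if not counts:
        return None
    return max(counts, key=counts.get)

# Usage: the answer depends on the random sample, so it may be wrong or None.
rng = random.Random()
sampled = sample_candidates(n=10, k=2, rng=rng)
answer = most_frequent_sampled([4, 5, 3, 2, 7, 3, 4, 5, 1], sampled)
```

If at least one of the k most frequent numbers was sampled, the returned number has a count at least as large as that number's, so it is itself among the k most frequent (up to ties).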

And now for something completely different

Introducing the approximate median problem (AMP).

Reminder: The median is the value which separates the higher half of the set from the lower half.

We want to approximate that. Why? Because it's cool.

This slide isn't the median problem

First, a blackbox from communication complexity.

Consider the bit-vector probing problem:

– Let A have a bit sequence x of length m and B an index i. B needs to know x_i, the i-th input bit.

– But the communication is one-way only: B cannot send anything to A.

Ideas?

Blackbox cont.

Turns out there isn't a better method for A to send the i-th bit than to send the entire string to B.

– So it takes Ω(m) bits of communication.

But what about randomization?

– Too bad: any protocol that succeeds in guessing x_i with probability better than (1+ε)/2 requires at least εm bits of communication.

Approximate median problem

Goal: Find a number whose rank is in the interval [m/2 – εm, m/2 + εm].

It can be solved by a one-pass Monte-Carlo algorithm with 1/10 error probability.

Takes O(log n (log 1/ε)^2 / ε) space.

I have a truly magnificent proof of this theorem. This slideshow is too small to contain it.
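To make the goal concrete, here is a small Python checker for the rank condition. It is not the Monte-Carlo algorithm from the theorem: it reads the whole stream twice over in memory and only tests whether a candidate answer is acceptable. The function name and the treatment of duplicates (a value counts if any of its sorted positions falls in the interval) are assumptions of this sketch.

```python
def is_approx_median(stream, x, eps):
    """True iff x occurs in the stream and some rank of x lies in [m/2 - eps*m, m/2 + eps*m]."""
    m = len(stream)
    less = sum(1 for y in stream if y < x)
    equal = sum(1 for y in stream if y == x)
    if equal == 0:
        return False
    lo, hi = m / 2 - eps * m, m / 2 + eps * m
    # In sorted order, x occupies positions less+1 .. less+equal (1-indexed).
    return less + 1 <= hi and less + equal >= lo

# Usage: with m = 100 and eps = 0.25, the accepted ranks are 25..75.
data = list(range(1, 101))
ok = is_approx_median(data, 50, 0.25)
```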

AMP cont.

Motivation: We want to prove a corresponding lower bound on this problem.

How: We show that any 1-pass Las Vegas algorithm that solves the ε-AMP requires Ω(1/ε) space.

We show a reduction from the bit-vector probing problem.

AMP lower bound proof

Let B be a bit vector, followed by a query index i.

This is translated to a sequence of numbers as follows:

– First, output 2j+b_j, for each j.

– Then, upon getting the query, output n-i+1 copies of 0 and i+1 copies of 2(n+1).

Reduction example

B = (0,1,0,1,1,0,1,1,0,1), i=5.

The reduction maps:

– 2j+b_j: [2,5,6,9,11,12,15,17,18,21]

– n-i+1=6 copies of 0: [0,0,0,0,0,0]

– i+1=6 copies of 22=2(n+1): [22,22,22,22,22,22]

The median of this set (the element of rank m/2 = 11 among the m = 22 numbers) is 11. Its LSB is 1, which is exactly the value of b_5.
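The example can be replayed with a short Python sketch. The names `amp_stream` and `lower_median` are illustrative, and the sketch assumes the median means the element of rank m/2 (the lower median), which is what the example uses.

```python
def amp_stream(bits, i):
    """Build the reduced stream for bit vector b_1..b_n and query index i (1-indexed)."""
    n = len(bits)
    # Part determined by the bit vector: 2j + b_j for each j.
    first_part = [2 * j + b for j, b in enumerate(bits, start=1)]
    # Part determined by the query: n-i+1 copies of 0, then i+1 copies of 2(n+1).
    second_part = [0] * (n - i + 1) + [2 * (n + 1)] * (i + 1)
    return first_part + second_part

def lower_median(values):
    """Element of rank m/2 (1-indexed) in sorted order; m is even for these streams."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2 - 1]

bits = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
stream = amp_stream(bits, 5)
median = lower_median(stream)
recovered_bit = median & 1  # least significant bit of the median
```

Looping `i` over 1..n recovers every bit of the vector the same way, which is the property the lower bound proof relies on.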

AMP proof cont.

It is easily verified that the least significant bit of the median of this sequence is the value of b_i (that is, the bit we seek).
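One way to carry out that verification, assuming the median is taken to be the element of rank m/2 in sorted order, is to count the stream length and the items below 2i + b_i:

```latex
\begin{align*}
m &= \underbrace{n}_{2j + b_j} + \underbrace{(n - i + 1)}_{\text{copies of } 0}
     + \underbrace{(i + 1)}_{\text{copies of } 2(n+1)} = 2n + 2, \\
\#\{\text{items} < 2i + b_i\} &= \underbrace{(n - i + 1)}_{\text{zeros}}
     + \underbrace{(i - 1)}_{2j + b_j,\ j < i} = n, \\
\operatorname{rank}(2i + b_i) &= n + 1 = m/2, \qquad
(2i + b_i) \bmod 2 = b_i .
\end{align*}
```

The second line uses 2j + b_j ≤ 2j + 1 < 2i for every j < i, so exactly those i - 1 items and the zeros lie below 2i + b_i.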

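Choose ε=1/2n. Then the ε-approximate median is (essentially) the exact median: the “reduced” stream has 2n+2 numbers, so the allowed rank window εm is only about one position wide. Strictly, any ε < 1/(2n+2) makes εm < 1 and pins the rank exactly; this changes only constants.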

Therefore any one-pass algorithm that requires fewer than 1/2ε = n bits of memory can be used...

AMP proof cont.

… to derive a communication protocol that requires fewer than n bits to be communicated from A to B in solving bit vector probing.

But every protocol that solves bit vector probing must communicate n bits.

Contradiction. Quod erat demonstrandum.

Corollary

What's the point I've been trying to make?

Randomization can sometimes reduce space complexity significantly, at the cost of guarantee of output correctness.

Moving right along.

Some graph theory

A graph can be considered as a stream.

– Example: Adjacency list.

This means some graph-theoretic problems can be approximated or solved using data stream and communication complexity techniques.

I'll address a small part of them.

Why is this good?

Suppose we can read the stream more than once (we don't have enough memory to store it but we do have access).

But the number of times we can read the stream is bounded.

What possible graph theoretic problems could we approximate with this method?

Theorem 6

In P passes, the following problems on an n-node graph take Ω(n / P) space:

– Computing connected components.

– Computing k-edge connected components.

– Computing k-vertex connected components.

– Testing graph planarity.

– Finding the sinks of a directed graph.

I'll prove graph connectivity.

Connected components

Proof by reduction of DISJOINT to the graph connectivity problem.

Reminder: DISJOINT(x,y) returns 1 iff there exists i such that x_i = y_i = 1.

Given bit vectors A and B, construct a graph with vertices {a,b,1,...,n}.

Insert an edge (a,i) iff i is in A's vector and an edge (i,b) iff it's in B's vector.

The vertices a and b lie in the same connected component iff there exists a bit that's set in both vectors.
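A minimal Python sketch of this construction follows. The function names are illustrative, and union-find is just one convenient way to compute components over an edge stream; the slides do not prescribe it. The check is whether a and b end up in the same component.

```python
def build_edges(x, y):
    """Edges of the graph on vertices {'a', 'b', 1..n} built from bit vectors x and y."""
    edges = [("a", i) for i, bit in enumerate(x, start=1) if bit]
    edges += [(i, "b") for i, bit in enumerate(y, start=1) if bit]
    return edges

def same_component(edges, u, v):
    """Union-find over the edge stream; True iff u and v end up connected."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for s, t in edges:
        parent[find(s)] = find(t)
    return find(u) == find(v)

# Usage: index 3 is set in both vectors, so a and b are connected through vertex 3.
x = [1, 0, 1, 0]
y = [0, 0, 1, 1]
connected = same_component(build_edges(x, y), "a", "b")
```

Note that this sketch keeps one parent entry per vertex, so it does not try to beat the lower bound; it only illustrates why a connectivity answer decides the DISJOINT instance.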

Connectivity cont.

From communication complexity, we know that every DISJOINT-solving protocol sends Ω(n) bits.

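So if we have P passes over the data and S bits of memory, simulating the algorithm gives a protocol that sends O(P·S) bits, because the memory contents are handed between A and B about twice per pass. Hence one of the passes must use Ω(n / P) space. This is a total cheating hack by the way. Blame HRR.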

QED anyway.

That's all folks!