Derandomizing Algorithms on Product Distributions and Other

Applications of Order-Based Extraction

Ariel Gabizon∗   Avinatan Hassidim†

Abstract

Getting the deterministic complexity closer to the best known randomized complexity is an important goal in algorithms and communication protocols. In this work, we investigate the case where instead of one input, the algorithm/protocol is given multiple inputs sampled independently from an arbitrary unknown distribution. We show that in this case a strong and generic derandomization result can be obtained by a simple argument.

Our method relies on extracting randomness from “same-source” product distributions, which are distributions generated from multiple independent samples from the same source. The extraction process succeeds even for arbitrarily low min-entropy, and is based on the order of the values and not on the values themselves (this may be seen as a generalization of the classical method of von Neumann [26], extended by Elias [7], for extracting randomness from a biased coin).

The tools developed in the paper are generic, and can be used in several other problems. We present applications to streaming algorithms, and to implicit probe search [8]. We also refine our method to handle product distributions where the i'th sample comes from one of several arbitrary unknown distributions. This requires creating a new set of tools, which may also be of independent interest.

∗ Department of Computing Science, Simon Fraser University, Vancouver, Canada. [email protected].
† MIT, Cambridge, Massachusetts. [email protected].


1 Introduction

A central goal in complexity theory is achieving derandomization in as many settings as possible. The object of derandomization is to take computational tasks that can be achieved with the aid of randomness, and find ways to perform them using less randomness, or ideally none at all. We want to achieve derandomization without increasing the use of other resources by much. For example, we would like the amount of time, space, communication, etc. used in the deterministic solution to be similar to the corresponding quantities in the original randomized solution.

In this paper we deal with both algorithms for decision problems and communication complexity protocols. In the first case, a long line of work initiated by [24, 4, 28, 21, 13] shows that, assuming certain circuit lower bounds, any randomized polynomial-time algorithm can be converted into a deterministic polynomial-time algorithm. However, proving such lower bounds seems well beyond reach, and in fact, Kabanets and Impagliazzo [16], building on Impagliazzo et al. [12], show that proving lower bounds is necessary for proving such results. For communication complexity, there are exponential separations between deterministic and randomized protocols (see [18]).

It thus seems well motivated to look for relaxed (but still interesting) models where derandomization can be achieved. Consider the case of time-bounded algorithms. A first (very naïve) attempt at such a relaxation might be to require that instead of succeeding on every input, we succeed with high probability on any distribution of inputs. Of course, this is no relaxation at all, as we can consider distributions concentrated on one hard input. A natural way to further relax this is to require high probability of success only on distributions of inputs that can be efficiently sampled. Impagliazzo and Wigderson [14], followed by Trevisan and Vadhan [25], give conditional derandomizations (and unconditional Gap Theorems for BPP) in this model. Another type of relaxation, which we investigate here, is to allow arbitrary distributions on individual inputs, but to require multiple independent samples from the same distribution¹. In this setting, when receiving $k$ inputs for large enough $k$, we would like our deterministic algorithm to solve all $k$ inputs correctly, at a running time close to $k$ times the running time of the randomized algorithm. Note that in the case of a distribution concentrated on one hard input, the running time on this input will be amortized over the $k$ instances. Similarly, we would like a deterministic communication protocol that, when receiving $k$ inputs from an arbitrary distribution (over the inputs of both parties), solves all instances correctly with a number of communication bits close to $k$ times the number of bits used by the randomized protocol. We show that such results can be achieved by a simple argument, and that our constructions are, in some sense, almost optimal. Here is a concrete example, which gives the feel of the parameters.

1.1 A motivating example and result

Consider the equality problem in communication complexity: Alice and Bob receive $n$-bit strings $x$ and $y$, respectively, and want to decide whether $x = y$. The deterministic communication complexity is $n$, and shared randomness reduces this to $O(1)$. By repeating the randomized protocol, for any $k$, $O(\log k)$ communication bits suffice such that Alice and Bob answer incorrectly with probability at most $1/100k$.

Consider the setting discussed above: Alice and Bob are now given $k$-tuples of instances $(x_1, \ldots, x_k)$ and $(y_1, \ldots, y_k)$ respectively, such that each pair $(x_i, y_i)$ is sampled independently from the same arbitrary unknown distribution $D$. Obviously, we have a deterministic protocol that uses $k \cdot n$ communication bits for solving the entire sequence correctly, and a public-coin randomized protocol using $O(k \log k)$ communication bits solving the entire sequence correctly with high probability. We show that when $k > c \cdot n \log n$ for some universal constant $c$, there exists a deterministic protocol using $O(k \log k)$ communication bits, which solves all instances correctly with probability $2/3$. This result is almost optimal: if $k < n/\log n$, then facing the same hard input $k$ times, any deterministic protocol must send more than $k \log k$ bits to succeed.

¹ We also consider the case of multiple samples that are not from the same distribution. Moreover, one might want to consider the case of multiple samples that are correlated in some way; this might be a direction for further work.

1.2 Main Results

The parameters presented above are derived from the following theorem.

Theorem 1.1. Let $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ be any function. Let $P_R$ be a public-coin randomized protocol with error $\varepsilon$ for $f$ using $c_r$ communication bits and $r$ random bits. Let $P_D$ be a deterministic protocol for $f$ using $c_d$ communication bits. For every integer $k \geq \min\{10 \cdot r \cdot n,\; 100 \cdot r^2 \cdot (c_d/c_r)\}$, there exists a deterministic protocol $P$ using at most $k \cdot (c_r + \log r + 6)$ communication bits, such that for any distribution $D$ on $\{0,1\}^n \times \{0,1\}^n$,

$$\Pr_{((x_1,y_1),\ldots,(x_k,y_k)) \leftarrow D^{\otimes k}}\big(P((x_1, y_1), \ldots, (x_k, y_k)) = (f(x_1, y_1), \ldots, f(x_k, y_k))\big) \geq 1 - (\varepsilon \cdot k + 2^{-r}),$$

where $D^{\otimes k}$ denotes the distribution consisting of $k$ independent copies of $D$.

We get a similar theorem in the case of algorithms for decision problems.

Theorem 1.2. Let $\mathcal{C}$ be the class of product distributions on $(\{0,1\}^n)^k$. Let $f : \{0,1\}^n \to \{0,1\}$ be any function. Let $A_R$ be a randomized, two-sided error algorithm with error $\varepsilon$ for $f$, running in time $t_r$ and using $r$ random bits, and let $A_D$ be a deterministic algorithm for $f$ running in time $t_d$. For every integer $k \geq 10 \cdot (t_d/t_r) \cdot r$, there exists a deterministic algorithm $A$ that runs in time $k \cdot t_r + O(n \cdot k)$, such that for any distribution $D$ on $\{0,1\}^n$,

$$\Pr_{(x_1,\ldots,x_k) \leftarrow D^{\otimes k}}\big(A(x_1, \ldots, x_k) = (f(x_1), \ldots, f(x_k))\big) \geq 1 - (\varepsilon \cdot k + 2^{-r}).$$

Remark 1.3.

• We state Theorem 1.2 for two-sided error randomized algorithms, but it is easy to see from our proofs that if the original randomized algorithm has one-sided error, so will the resulting deterministic algorithm for multiple inputs. Similarly, if we start with a Las Vegas randomized algorithm, we get a deterministic algorithm for multiple inputs that either answers correctly on all inputs or answers 'I don't know'. (The same holds when one-sided error or Las Vegas randomized protocols are used in Theorem 1.1.)

• The reader may wonder whether Theorem 1.2 is interesting, as in case the original deterministic algorithm $A_D$ is exponential, we will require an exponential number of independent inputs to use the theorem, and thus still need exponential time. We note again that nothing better is possible in this model (unless a worst-case derandomization is achieved). Also, one gets more interesting instantiations in the case where $A_D$'s running time is a larger polynomial than $A_R$'s (this is the case in the currently known algorithms for primality testing), and in cases where $A_D$ is polynomial or linear and $A_R$ is sublinear, as is the case in many property testing algorithms. That said, we agree that the communication complexity setting of Theorem 1.1 is probably more convincing.

We also consider the case where the inputs are sampled from several arbitrary distributions. To formally present our results, we need the following definition.

Definition 1.4. Let $D_1, \ldots, D_d$ be any distributions on $\{0,1\}^n \times \{0,1\}^n$. A d-part product distribution (defined using $D_1, \ldots, D_d$) on $(\{0,1\}^n \times \{0,1\}^n)^k$ is a distribution $X = (X_1, \ldots, X_k)$ such that the $X_i$'s are all independent, and for each $1 \leq i \leq k$, $X_i$ is distributed according to $D_j$ for some $1 \leq j \leq d$.


Our main theorem for d-part product distributions is as follows.

Theorem 1.5. Fix any positive integers $d, n$ and $k$, and let $\mathcal{C}$ be the class of d-part product distributions on $(\{0,1\}^n)^k$. Let $f : \{0,1\}^n \to \{0,1\}$ be any function. Let $A_R$ and $A_D$ be algorithms for $f$, as in Theorem 1.2. For any $0 < \gamma < 1$ and any $k \geq (t_d/t_r) \cdot \big(r \cdot 8(d+1) + \frac{16 \cdot (2d)^5}{\gamma}\big)$, there exists a deterministic algorithm $A$ that runs in time at most $k \cdot t_r + O(n \cdot k \cdot d^2)$ and solves $f$ on $\mathcal{C}$ with error $\varepsilon \cdot k + \gamma$.

The equivalent theorem for communication protocols can be found in Appendix A.

1.3 Overview of technique — using ‘content independent’ extraction

We sketch the proof of Theorem 1.2. The proof of Theorem 1.1 is similar but requires additional technical details. We are given a sequence of inputs $(x_1, \ldots, x_k)$ and want to deterministically compute $f(x_1), \ldots, f(x_k)$ very efficiently. We distinguish between two cases. In the first, there are 'few' distinct inputs among $x_1, \ldots, x_k$. In this case we simply run the deterministic algorithm $A_D$ on all these inputs; as there are few of them, it will not take too long.² In the second case, we have 'many'³ distinct inputs among $x_1, \ldots, x_k$. In this case, we extract a random string from the sequence, and use that random string to run the randomized algorithm $A_R$ on each input. Let $z_1, \ldots, z_s$ be the distinct values among $x_1, \ldots, x_k$. A potential problem with this approach would be that the random string we are using depends on the values $z_1, \ldots, z_s$ and thus might be a 'bad' string for some $z_i$ with high probability. This does not occur, as our extraction method is essentially independent of the actual values of the inputs. More specifically, the random string we extract is simply a function of the order in which (the potentially multiple instances of) $z_1, \ldots, z_s$ appear in the sequence. This may be seen as a generalization of the classical method of von Neumann [26], extended by Elias [7], for extracting randomness from a biased coin (see also the work of Peres [22]).
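
To illustrate the flavor of order-based extraction, here is a minimal Python sketch (ours, not from the paper) of the classical von Neumann trick: each pair of coin flips is kept only if its two outcomes differ, and the output bit is determined by the order within the pair, regardless of the coin's bias.

```python
def von_neumann_extract(flips):
    """Extract unbiased bits from i.i.d. flips of a coin with unknown bias.

    A pair is kept only if its two outcomes differ; the output bit depends
    on the order within the pair, not on the values themselves.
    """
    out = []
    for a, b in zip(flips[0::2], flips[1::2]):
        if a != b:
            out.append(1 if a > b else 0)
    return out
```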

Remark 1.6. As the inputs in the sequence are independent, a more straightforward approach might have been to apply a (deterministic) multi-source extractor on the inputs. However, multi-source extractors require that each input be sampled from a distribution having a certain min-entropy. Thus, to use a multi-source extractor we would have needed to assume the individual inputs come from such a distribution, and would not get results for arbitrary distributions.

Generalizing to multiple distributions. We now sketch the ideas used to prove Theorem 1.5. As above, the problem essentially reduces to extracting randomness from d-part product distributions conditioned on seeing 'many' distinct values. Moreover, the extraction procedure should be independent of the actual values and depend only on their order.

Consider the following simple example: We are given 3 independent samples, such that the first and third are sampled from a distribution $D_1$ distributed on values $a, b \in \{0,1\}^n$. The second sample comes from a distribution $D_2$ that gives probability one to a value $c \in \{0,1\}^n$ such that $c \neq a, b$. In our terminology, this is a 2-part product distribution $D$ on $(\{0,1\}^n)^3$. Let us look at $D$ conditioned on seeing 3 distinct values. In this case we have a uniform distribution on the sequences $(a, c, b)$ and $(b, c, a)$ (note that we indeed have a uniform distribution on these sequences no matter how $D_1$ is distributed on $a$ and $b$). This suggests the following method for extracting one bit: Given $x \in (\{0,1\}^n)^3$, for each pair of indices $i < j \in \{1, \ldots, 3\}$, let $z_{i,j}$ be 1 if $x_i < x_j$ by lexicographical ordering, and 0 otherwise. Now output the sum mod 2 of the $z_{i,j}$'s. Let us call this function the 'all-pairs compare' (APC) function.

² In the case of Theorem 1.1 there is an additional complication here of having Alice and Bob conclude what indeed are the distinct input pairs $(x_i, y_i)$ with small communication.

³ 'Many' in this sketch roughly corresponds to the number of random bits used by the randomized algorithm for $f$.


The APC function has the property that if $(x_1, x_2, x_3)$ are all distinct, then swapping any pair of elements changes the output value. Note that it is essential that all the $x_i$'s are distinct. For example, it is easy to check that for any $a < b$, $APC(a, a, b) = APC(b, a, a)$. Thus, to extract many random bits, we need many 'blocks' where all inputs are distinct. This suggests the following extraction scheme for d-part product distributions conditioned on seeing many distinct values: Given a sequence of inputs, delete the values that appear 'too many times' in the sequence. Now divide the (possibly trimmed) sequence into blocks of $d + 1$ inputs each. Count the number of blocks such that all inputs in the block are distinct. If there are at least $m$ such blocks, where $m$ is the number of bits we want to extract, output the APC function on each block. It can be shown that if we start out with enough distinct values (where the exact number is a function of $d$ and $m$), then with high probability we will indeed have $m$ blocks of distinct inputs.

1.4 Related work

Goldreich and Wigderson [10], using an observation of Noam Nisan, attain results similar to ours for the case of the uniform distribution⁴. Their technique uses seeded extractors, and their correctness argument is different (and would not work for product distributions of arbitrary distributions). Barak, Braverman, Chen and Rao [?] show that randomized communication protocols require about $k$ times the communication bits to solve $k$ instances with high probability over product distributions. Together with our result, this shows that deterministic and randomized protocols have approximately⁵ the same computational power in this setting.

1.5 Applications

Besides our main results, we present two applications of extracting randomness based on the order of elements in a sequence.

1.5.1 Implicit probe search

For domain size $m$ and table size $n$, implicit probe search is the problem of searching for an element $x \in [m]$ in a table $T$ containing $n$ elements from $[m]$, using as few queries as possible to $T$. Arranging $T$ by the regular ordering of $[m]$ and using binary search, we can always use at most $\log n$ queries. Yao [27] showed that when $m$ is allowed to be arbitrarily large as a function of $n$, $\log n$ queries are necessary. Fiat and Naor [8] showed that when $m = \mathrm{poly}(n)$, $T$ can be efficiently arranged such that a constant number of queries suffices⁶. The results of [8] are obtained by reducing this problem to that of explicitly constructing rainbows, which may be viewed as a kind of randomness extraction problem (see Appendix B for details). Using this reduction we extend their results, showing that for any $m \leq 2^n$, $O(\frac{\log n}{\log \log n}) = o(\log n)$ queries suffice. Thus, we show that even when the domain is exponentially large, there is a scheme significantly better than binary search.

1.5.2 Streaming algorithms

The data stream model was introduced by Munro and Paterson [19] (see also the seminal work by Alon, Matias and Szegedy [1]). In this model, an algorithm is presented with a sequence of $n$ elements, and its goal is to estimate a function of it, while being allowed to pass over the data just once. The algorithm runs in bounded space, usually poly-logarithmic in $n$. We restrict our attention to algorithms which run in poly-logarithmic space and compute a frequency moment of the input. For this problem, it is known that even when the order of appearance of elements in the stream is chosen in an adversarial manner, the algorithm can approximate the $p$'th moment for $0 \leq p \leq 2$ [15, 1], and that this is not possible for moments $p > 2$; see [3, 6].

⁴ In fact, for this case their probability of error is smaller than ours.
⁵ With the exception that in our result deterministic algorithms only work for large enough $k$.
⁶ Gabizon and Shaltiel [9] showed that for $m = n^{\mathrm{polylog}\, n}$ a constant number of queries also suffices, although with today's dispersers [8] could have gotten the same result.

A relaxation of the problem assumes that the adversary chooses the values of the elements, but they are presented to the algorithm in a random order [5, 11, 2]. For a random ordering of the elements, known bounds only imply that one cannot approximate moments larger than 2.5, although it is believed that the right lower bound is 2, as in the adversarial-ordering case. We show a strong derandomization result in this model, which enables concentrating on proving lower bounds for deterministic algorithms⁷.

We briefly sketch the proof, showing how any randomized algorithm can be simulated by a deterministic one. Let $R$ be a randomized algorithm which approximates (to within any constant) a moment $p > 2$, with any constant success probability. We present a deterministic algorithm $D$ with the same success probability and approximation ratio, up to a $(1 + n^{-\alpha})$ factor, for a constant $\alpha < 0.25$. To simulate $R$, $D$ first extracts randomness from the beginning of the stream, using the extractors presented later. If the number of elements required to extract enough randomness is small, it uses this randomness to load a pseudorandom generator against space-bounded machines, and uses this to simulate the randomized algorithm on the rest of the input; we prove that with high probability this does not change the quality of the approximation by much. If the randomness requires many elements, the deterministic algorithm simply counts the number of appearances of the first $\mathrm{polylog}\, n$ different elements in the stream; we prove that with high probability this is sufficient to approximate the frequency moments over the entire stream⁸. See details in Appendix C.

⁷ Approximating frequency moments is perhaps the most commonly studied problem in this model. Our results apply to other problems as well. For adversarial order there are separations between randomized and deterministic algorithms.
⁸ We note that it is impossible to use PRGs and exhaust over the seeds, as the stream appears just once (in the adversarial-order model deterministic algorithms are provably weaker than randomized ones). Also, the coins used by the randomized algorithm should be uncorrelated with the ordering of the elements; this is the reason for running it only on part of the stream. Getting a strong result requires some fine-tuning of the parameters.

2 Preliminaries

For background on communication complexity we refer the reader to [18]. The following definitions will be useful for discussing high probability of success on a sequence of inputs.

Definition 2.1. Let $\mathcal{C}$ be a class of distributions on $(\{0,1\}^n)^k$. Let $f : \{0,1\}^n \to \{0,1\}$ be any function. We say that a deterministic algorithm $A$ solves $f$ on $\mathcal{C}$ with error $\varepsilon$ if for any distribution $X$ in $\mathcal{C}$, when sampling a sequence $(x_1, \ldots, x_k)$ according to $X$, $A$ answers correctly on all inputs in the sequence with probability at least $1 - \varepsilon$. That is,

$$\Pr_{(x_1,\ldots,x_k) \leftarrow X}\big(A(x_1, \ldots, x_k) = (f(x_1), \ldots, f(x_k))\big) \geq 1 - \varepsilon.$$

An equivalent definition holds for communication protocols; see Appendix A for details. We also define extractors for families of distributions. Note that we consider only deterministic extractors.

Definition 2.2. Let $\mathcal{C}$ be a class of distributions on a set $\Omega$. A function $E : \Omega \to \{0,1\}^m$ is an extractor for $\mathcal{C}$ with error $\gamma$ (also called a $\gamma$-extractor for $\mathcal{C}$) if for every distribution $X$ in $\mathcal{C}$, $E(X)$ is $\gamma$-close to uniform.

3 The main result

A product distribution consists of multiple independent samples from an arbitrary distribution.


Definition 3.1 (Product distributions). A distribution $X = (X_1, \ldots, X_k)$ on $(\{0,1\}^n)^k$ is a product distribution if it consists of $k$ independent samples from the same distribution $D$, where $D$ can be any distribution over $\{0,1\}^n$.

Our method relies on the fact that product distributions can be written as convex combinations of distributions that merely permute a fixed multiset of values. We now define these distributions.

Definition 3.2 (Multinomial distributions). The class of multinomial distributions on $(\{0,1\}^n)^k$ consists of all distributions of the following form: Let $z_1, \ldots, z_s \in \{0,1\}^n$ be distinct strings and let $a_1, \ldots, a_s$ be positive integers such that $\sum_{i=1}^{s} a_i = k$. The multinomial distribution $X$ on $(\{0,1\}^n)^k$ defined by $z_1, \ldots, z_s, a_1, \ldots, a_s$ is the uniform distribution on sequences of $n$-bit strings of length $k$ such that for $1 \leq i \leq s$, the string $z_i$ appears $a_i$ times in the sequence. Moreover, we call such a distribution $X$ an s-valued multinomial distribution.

Lemma 3.3. Any product distribution is a convex combination of multinomial distributions.

Proof. Let $X = (X_1, \ldots, X_k)$ be a product distribution on $(\{0,1\}^n)^k$. Fix any distinct $z_1, \ldots, z_s \in \{0,1\}^n$ and positive integers $a_1, \ldots, a_s$ such that $\sum_{i=1}^{s} a_i = k$, and condition $X$ on the event that the distinct strings outputted were $z_1, \ldots, z_s$, with $z_i$ appearing $a_i$ times. Under this conditioning, because of the independence of $X_1, \ldots, X_k$, any sequence in which $z_i$ appears $a_i$ times has equal probability, and therefore we get a multinomial distribution. Writing $X$ as a convex combination of such conditional distributions, the lemma follows.

For positive integers $k, a_1, \ldots, a_s$ such that $\sum_{i=1}^{s} a_i = k$, the multinomial coefficient $\binom{k}{a_1,\ldots,a_s}$ is the number of different sequences of length $k$ consisting of $s$ distinct elements such that the $i$'th element appears $a_i$ times: $\binom{k}{a_1,\ldots,a_s} = \frac{k!}{a_1! \cdots a_s!}$. We use the following estimate:

Lemma 3.4. For any integers $s \leq k$ with $k \geq 32$ and $s \geq 4$, we have $\log \binom{k}{a_1,\ldots,a_s} \geq \frac{s \cdot \log k}{4}$.
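
As a quick sanity check on Lemma 3.4, here is a small Python sketch (ours, not from the paper) computing the multinomial coefficient directly from its factorial formula and testing the bound on one instance.

```python
from math import factorial, log2

def multinomial(k, parts):
    """k! / (a_1! * ... * a_s!): the number of arrangements of the multiset."""
    assert sum(parts) == k
    result = factorial(k)
    for a in parts:
        result //= factorial(a)
    return result

# Example: k = 32, s = 4 equal parts; log2 of the coefficient exceeds s*log2(k)/4.
parts = [8, 8, 8, 8]
assert log2(multinomial(32, parts)) >= len(parts) * log2(32) / 4
```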

The following claim will enable us to convert uniform distributions over arbitrarily sized sets into distributions over binary strings that are close to uniform. A proof can be found in [17].

Claim 3.5. Let $N > M$ be any integers. Suppose that $R$ is uniformly distributed over $\{1, \ldots, N\}$. Then $R \bmod M$ is $\frac{1}{\lfloor N/M \rfloor}$-close to uniform on $\{0, \ldots, M-1\}$.

Lemma 3.6. Fix integers $s \leq k$ with $k \geq 32$ and $s \geq 4$, and let $t = \lfloor \frac{s \cdot \log k}{8} \rfloor$. There exists an extractor $E : (\{0,1\}^n)^k \to \{0,1\}^t$ for the class of s-valued multinomial distributions with error $\gamma = 2^{-t}$. $E$ is computable in time $O(k \cdot n)$.

Proof. Given a sequence $(x_1, \ldots, x_k)$, let $z_1 < z_2 < \ldots < z_s$ be the distinct elements that appear in the sequence, where $<$ denotes the lexicographical ordering. For $i = 1, \ldots, s$, denote by $a_i$ the number of times $z_i$ appears in the sequence. Let $S$ be the set of sequences of length $k$ over $\{1, \ldots, s\}$ such that $i$ appears $a_i$ times. Then $|S| = \binom{k}{a_1,\ldots,a_s}$. The work of Ryabko and Matchikina [23] gives a correspondence of $S$ with $\{1, \ldots, \binom{k}{a_1,\ldots,a_s}\}$, computable in time $O(k \cdot n)$.⁹ Let $r$ be the image of the sequence $(x_1, \ldots, x_k)$ in $\{1, \ldots, \binom{k}{a_1,\ldots,a_s}\}$ under this correspondence. Define $E(x_1, \ldots, x_k) \triangleq r \pmod{2^t}$. For any s-valued multinomial distribution $X$, $r$ is uniformly distributed. Thus, using Claim 3.5, $E(X)$ will be $\gamma$-close to uniform for $\gamma \leq \frac{1}{\lfloor \binom{k}{a_1,\ldots,a_s} / 2^t \rfloor} \leq \frac{1}{2^t}$, where we used the definition of $t$ and Lemma 3.4 in the second inequality.

⁹ [23] do this for the case $s = 2$, but it is easy to reduce to this case.
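
The extractor of Lemma 3.6 can be sketched directly: rank the input sequence among all arrangements of its multiset of values, then reduce the rank modulo $2^t$. The following is our own straightforward ranking (quadratic time, rather than the fast encoding of [23]), which suffices for illustration.

```python
from math import factorial
from collections import Counter

def arrangements(counts):
    """Number of distinct sequences realizing the given multiset of counts."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total

def multiset_rank(seq):
    """Lexicographic rank of seq among all arrangements of its values."""
    counts, rank = Counter(seq), 0
    for x in seq:
        for z in sorted(counts):
            if z >= x:
                break
            if counts[z] > 0:  # count arrangements starting with the smaller z
                counts[z] -= 1
                rank += arrangements(counts)
                counts[z] += 1
        counts[x] -= 1
    return rank

def extract(seq, t):
    """Order-based extraction: rank mod 2^t (cf. Lemma 3.6)."""
    return multiset_rank(seq) % (2 ** t)
```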


A basic principle in this work is that when we restrict our input distribution to a component that only 'reorders' a fixed set of values, we can use randomness extracted from the input to run our algorithm or protocol. The following definition and two lemmata formalize this.

Definition 3.7. We say a distribution $X$ on $(\{0,1\}^n)^k$ is same-valued if there is a fixed set of values $\{z_1, \ldots, z_s\} \subseteq \{0,1\}^n$ such that the support of $X$ consists of sequences $x_1, \ldots, x_k$ whose set of distinct values is exactly $\{z_1, \ldots, z_s\}$.

Lemma 3.8. Let $\mathcal{C}$ be a class of same-valued distributions on $(\{0,1\}^n)^k$. Let $E : (\{0,1\}^n)^k \to \{0,1\}^m$ be a $\gamma$-extractor for $\mathcal{C}$ computable in time $t_E$. Fix any $f : \{0,1\}^n \to \{0,1\}$ and any $\varepsilon > 0$, and let $A_R$ be a randomized algorithm computing $f$ with error $\varepsilon$ running in time $t_r$. Then, there exists a deterministic algorithm $A$ running in time $k \cdot t_r + t_E$ that solves $f$ on $\mathcal{C}$ with error $\varepsilon \cdot k + \gamma$.

Proof. Given $(x_1, \ldots, x_k) \in (\{0,1\}^n)^k$, $A$ computes $r = E(x_1, \ldots, x_k)$ and outputs $A_R(x_i, r)$ for every $i \in [k]$. Since $X$ is same-valued, the set of strings $r$ that are bad for some value in the support is fixed, and the probability that a uniformly chosen $r \in \{0,1\}^m$ falls in it is at most $\varepsilon \cdot k$. As $E(X)$ is $\gamma$-close to uniform, the probability that $r = E(X)$ is bad for some $x_i$ is at most $\varepsilon \cdot k + \gamma$.

We are now ready to prove the theorems stated in the introduction.

Proof of Theorem 1.2. Given $k$ instances $x_1, \ldots, x_k$, let $z_1, z_2, \ldots, z_s$ be the distinct elements that appear in $x_1, \ldots, x_k$. The algorithm $A$ works as follows. Denote $t = \lfloor \frac{s \cdot \log k}{8} \rfloor$. If $t \leq r$ or $s \leq 4$, we run $A_D$ on $z_i$ for every $i \in [s]$. This takes time $t_d \cdot s \leq t_d \cdot 10 \cdot r \leq k \cdot t_r$, which satisfies the theorem for the chosen value of $k$. Thus we can assume $t \geq r$ and $s \geq 4$. In this case, we compute $y = E(x_1, \ldots, x_k)$, where $E$ is the extractor for multinomial distributions from Lemma 3.6.¹⁰ We now apply the randomized algorithm $A_R$ on all inputs using $y$ as randomness. Let $X$ be a product distribution on $(\{0,1\}^n)^k$. By Lemma 3.3, $X$ is a convex combination of multinomial distributions. Thus, it is enough to prove the theorem in the case that $X$ itself is a multinomial distribution. But in this case, as a multinomial distribution is same-valued, the claim follows immediately from Lemma 3.8. As the running time of $E$ is $O(n \cdot k)$, the total running time of $A$ is at most $k \cdot t_r + O(n \cdot k)$.
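
A minimal sketch (ours) of the resulting case dispatch; `run_AD`, `run_AR`, and `extract` are stand-ins for $A_D$, $A_R$, and the order-based extractor sketched after Lemma 3.6:

```python
from math import floor, log2

def solve_all(xs, run_AD, run_AR, extract, r):
    """Deterministic multi-instance solver (cf. proof of Theorem 1.2).

    run_AD(x): slow deterministic algorithm; run_AR(x, y): fast randomized
    algorithm taking a random string y; extract(xs, t): order-based extractor.
    """
    distinct = sorted(set(xs))
    s, k = len(distinct), len(xs)
    t = floor(s * log2(k) / 8)
    if t <= r or s <= 4:
        # Few distinct values: answer each one once with the slow algorithm.
        answers = {z: run_AD(z) for z in distinct}
        return [answers[x] for x in xs]
    # Many distinct values: the order of the sequence supplies the randomness.
    y = extract(xs, t)
    return [run_AR(x, y) for x in xs]
```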

For communication protocols, the proof requires additional details, as Alice and Bob need to communicate in order to find out what the distinct input pairs $(x, y)$ are. The proof appears in Appendix A.

3.1 On the optimality of our scheme

Using the notation of Theorem 1.2, our method works given at least $k = O((t_d/t_r) \cdot r)$ samples. Can we get a similar result for smaller $k$? It is easy to see that to get running time $k \cdot t_r$ we need $k \geq t_d/t_r$. In Appendix E we prove a stronger lower bound for a restricted type of scheme. We also show that our extraction scheme is almost optimal.

4 Handling multiple distributions

In this section we show that a similar derandomization can be achieved when the sequence of inputs is sampled independently from several distributions. It is convenient to view d-part product distributions (see Definition 1.4) as convex combinations of certain same-valued distributions.

¹⁰ Note that using $t_d, r > 1$ (otherwise the claim is trivial), we get $k \geq 10 \cdot t_d \cdot r \geq 40$, and thus can use Lemma 3.6.


Definition 4.1. A d-multinomial distribution on $(\{0,1\}^n)^k$ is a distribution $X = (X_1, \ldots, X_k)$ such that there is a partition $C_1 \cup \ldots \cup C_d = [k]$ into disjoint subsets such that for every $i \neq j \in [d]$, $X|_{C_i}$ and $X|_{C_j}$ are independent, and for every $i \in [d]$, $X|_{C_i}$ is a multinomial distribution. It will be convenient to allow some of the $C_i$'s to be empty. Thus, every $d'$-multinomial distribution for some $1 \leq d' \leq d$ is also a d-multinomial distribution.

For distinct strings $z_1, \ldots, z_s \in \{0,1\}^n$ and positive integers $a_1, \ldots, a_s$ such that $\sum_{i=1}^{s} a_i = k$, denote by $\mathcal{D}^d_{z_1,\ldots,z_s,a_1,\ldots,a_s}$ the set of d-multinomial distributions whose support consists of sequences where $z_i$ appears $a_i$ times.

Lemma 4.2. A d-part product distribution is a convex combination of d-multinomial distributions.

4.1 The all-pairs extractor

In the following definition, for strings $x, y \in \{0,1\}^n$ we denote by $(x < y)$ the value 1 if $x < y$ (by lexicographical ordering of strings) and 0 otherwise. For an integer $\ell$, define the $\ell$-string all-pairs compare function $APC : (\{0,1\}^n)^\ell \to \{0,1\}$ by

$$APC(x_1, \ldots, x_\ell) \triangleq \bigoplus_{1 \leq i < j \leq \ell} (x_i < x_j),$$

where $\oplus$ denotes addition modulo 2. That is, we take the parity of the comparisons between all pairs.
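
A direct transcription of the APC function (a minimal sketch; Python strings and tuples compare lexicographically, matching the ordering used above):

```python
from itertools import combinations

def apc(block):
    """All-pairs compare: parity of [x_i < x_j] over all pairs i < j."""
    return sum(x < y for x, y in combinations(block, 2)) % 2
```

When all elements of the block are distinct, this is exactly the sign (as a bit) of the permutation sorting the block, which is why swapping any two distinct elements flips the output.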

Claim 4.3. Fix integers $\ell$ and $n$. The $\ell$-string all-pairs compare function $APC : (\{0,1\}^n)^\ell \to \{0,1\}$ is an extractor with error $\varepsilon = 0$ for the subclass $\mathcal{D}^d_{z_1,\ldots,z_\ell,1,\ldots,1}$ of d-multinomial distributions, for any $d < \ell$.

Proof. We first prove the following claim. Let $x = (x_1, \ldots, x_\ell) \in (\{0,1\}^n)^\ell$ be a sequence such that $x_i \neq x_j$ for all $i < j \in [\ell]$. Denote by $x^{i \leftrightarrow j}$ the sequence obtained from $x$ by swapping $x_i$ and $x_j$. We show that for all $i < j \in [\ell]$, $APC(x) \neq APC(x^{i \leftrightarrow j})$. To see this,¹¹ notice that swapping adjacent values changes the value of $APC$. That is, for every $1 \leq i \leq \ell - 1$, $APC(x) \neq APC(x^{i \leftrightarrow i+1})$. Loosely speaking, this is because one comparison has changed and the rest have stayed the same. Formally,

$$APC(x^{i \leftrightarrow i+1}) = APC(x) \oplus (x_i < x_{i+1}) \oplus (x_{i+1} < x_i) = APC(x) \oplus 1.$$

Now note that $x^{i \leftrightarrow j}$ can be obtained from $x$ by an odd number of swap operations performed on adjacent positions: $j - (i+1)$ swaps to move $x_i$ to the $(j-1)$'th position and another $j - i$ swaps to move $x_j$ to the $i$'th position. Thus we have shown that $APC(x) \neq APC(x^{i \leftrightarrow j})$ for all $i < j \in [\ell]$.

Returning to the original claim, let $X$ be a distribution in $\mathcal{D}^d_{z_1,\ldots,z_\ell,1,\ldots,1}$. Recall that this means there are disjoint subsets $C_1 \cup \ldots \cup C_d = [\ell]$ such that each $X|_{C_i}$ is a multinomial distribution. As $d < \ell$, there must be an $i$ such that $|C_i| > 1$. Assume w.l.o.g. that $|C_1| > 1$, and fix two indices $i < j \in C_1$. Look at the distribution $X$ conditioned on a fixing of the values in all positions except $i$ and $j$. Under such a conditioning, we are left with two distinct strings $z$ and $z'$ that are to be assigned to these positions, and as $X|_{C_1}$ is a multinomial distribution, each of the two possible assignments has probability half. From our previous argument it follows that the two assignments lead to different values of $APC$. Thus, under any such conditioning, $APC(X)$ is uniform. Viewing $X$ as a convex combination of such conditional distributions finishes the proof.

¹¹ Another way to see this is that if $x_1, \ldots, x_\ell$ are distinct, the APC function just corresponds to the sign of the permutation which sorts the values; swapping two elements changes the sign.


4.2 Reducing to all pairs

It will be useful to talk about d-multinomial distributions in which no value appears 'too frequently'. The following definition formalizes this notion.

Definition 4.4. Let $X$ be a d-multinomial distribution on $(\{0,1\}^n)^k$. We say that $X$ is $\delta$-bounded if $X$ belongs to a subclass $\mathcal{D}^d_{z_1,\ldots,z_s,a_1,\ldots,a_s}$ of d-multinomial distributions such that for every $i \in [s]$, $a_i \leq \delta \cdot k$. That is, no value $z_i$ appears in more than a $\delta$-fraction of the indices.

The following lemma shows that a general d-multinomial distribution can be converted into a δ-bounded one, provided it has enough distinct values.

Lemma 4.5. Fix any $0 < \delta < 1$ and integers $n$ and $k$. There is a deterministic algorithm $F$ such that for any s-valued d-multinomial distribution $X$ on $(\{0,1\}^n)^k$ with $s \geq (1/\delta) \cdot \log k$, the distribution $F(X)$ is a convex combination of $s'$-valued $\delta$-bounded d-multinomial distributions on $(\{0,1\}^n)^{k'}$, for some $s' \geq s - (1/\delta) \cdot \log k$ and $k' \leq k$.

Proof. Given $x = (x_1, \ldots, x_k)$, $F$ operates as follows:

1. Check if there exists a value $z \in \{0,1\}^n$ such that $x_i = z$ for more than a $\delta$-fraction of the $x_i$'s. If so, let $z$ be the most common value in the sequence and remove all $x_i$'s with $x_i = z$.

2. If the sequence was changed and is non-empty, repeat the first step on the newly obtained sequence.

Each application of the first step on a d-multinomial distribution results in a convex combination of d-multinomial distributions. After $m$ repetitions of the first step we are left with at most $(1 - \delta)^m \cdot k < e^{-\delta \cdot m} \cdot k$ strings. Thus, after $(1/\delta) \cdot \log k$ repetitions we would be left with an empty sequence, and therefore the number of repetitions is bounded by $(1/\delta) \cdot \log k$. Since each repetition reduces the number of values by one, the final components are $s'$-valued for some $s' \geq s - (1/\delta) \cdot \log k$, as required.
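
A minimal sketch (ours) of the trimming procedure $F$, repeatedly removing the most common value while it exceeds a $\delta$-fraction of the current sequence:

```python
from collections import Counter

def trim_frequent(xs, delta):
    """Remove values occupying more than a delta-fraction of the (current)
    sequence, most common first (cf. the procedure F of Lemma 4.5)."""
    xs = list(xs)
    while xs:
        value, count = Counter(xs).most_common(1)[0]
        if count <= delta * len(xs):
            break
        xs = [x for x in xs if x != value]
    return xs
```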

Theorem 4.6 shows how to extract many random bits from δ-bounded d-multinomial distributions.

Theorem 4.6. Fix any integers $n, m, d$ and $k$ such that $(d+1) \mid k$ and $m \leq \frac{k}{d+1}$. Let the extractor $E : (\{0,1\}^n)^k \to \{0,1\}^m$ be defined as follows. Given input $x \in (\{0,1\}^n)^k$, first partition $x$ into $\frac{k}{d+1}$ blocks $x^1, \ldots, x^{k/(d+1)}$, each containing $d+1$ $n$-bit strings. We say a block is good if all $d+1$ $n$-bit strings in the block are distinct. If there are at least $m$ good blocks $x^{i_1}, \ldots, x^{i_m}$, $E$ outputs the all-pairs compare function on each one: $E(x) = APC(x^{i_1}), \ldots, APC(x^{i_m})$. Otherwise, $E$ outputs the all-zeros string.

Fix any $0 < \gamma < 1$ and let $\delta = \frac{\gamma}{16 d^5}$. Then $E$ is a $\gamma$-extractor for the class of s-valued $\delta$-bounded d-multinomial distributions on $(\{0,1\}^n)^k$, whenever $s \geq m \cdot 4(d+1)$.
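
A self-contained sketch (ours) of this block extractor, using the `apc` parity from Section 4.1:

```python
from itertools import combinations

def apc(block):
    """Parity of [x_i < x_j] over all pairs i < j in the block."""
    return sum(x < y for x, y in combinations(block, 2)) % 2

def block_extract(xs, d, m):
    """Blocks of d+1 inputs; one APC bit per good (all-distinct) block, or
    the all-zeros string if fewer than m blocks are good (cf. Theorem 4.6)."""
    assert len(xs) % (d + 1) == 0
    blocks = [xs[i:i + d + 1] for i in range(0, len(xs), d + 1)]
    good = [b for b in blocks if len(set(b)) == d + 1]
    if len(good) < m:
        return [0] * m
    return [apc(b) for b in good[:m]]
```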

The theorem will follow easily from the following lemma.

Lemma 4.7. Fix any integers $n, s, d$ and $k$ such that $(d+1) \mid k$. Fix any $0 < \gamma < 1$ and let $\delta = \frac{\gamma}{8 \cdot (d+1) \cdot d^4}$. Let $X$ be an s-valued $\delta$-bounded d-multinomial distribution on $(\{0,1\}^n)^k$. Partitioning a string $x \in (\{0,1\}^n)^k$ and defining a good block as in Theorem 4.6, we have

$$\Pr_{x \leftarrow X}\Big(x \text{ has fewer than } \frac{s}{4 \cdot (d+1)} \text{ good blocks}\Big) \leq \gamma.$$

Proof. Let $C_1 \cup \ldots \cup C_d = [k]$ be the subsets defining $X$; that is, each $X|_{C_i}$ is some multinomial distribution. We show that after removing frequent values as in Lemma 4.5, each one of the underlying distributions $X|_{C_i}$ is either rare or bounded. Note that if for some $0 < \eta < 1$ and $i \in [d]$, $X|_{C_i}$ is not $\eta$-bounded, then $|C_i| \leq \frac{\delta}{\eta} \cdot k$. Taking $\eta = 2\delta \cdot d \cdot (d+1)^2$, we get that at most $\big(\frac{\delta}{2\delta \cdot d \cdot (d+1)^2} \cdot d\big) \cdot k = \frac{1}{2(d+1)^2} \cdot k$ indices belong to sets $C_i$ such that $X|_{C_i}$ is not $\eta$-bounded. Thus, at most $(d+1) \cdot \big(\frac{1}{2(d+1)^2} \cdot k\big) = \frac{k}{2(d+1)}$ blocks contain an index $j \in [k]$ belonging to a subset $C_i$ where $X|_{C_i}$ is not $\eta$-bounded. Therefore, we have at least $\frac{k}{d+1} - \frac{k}{2(d+1)} = \frac{k}{2(d+1)}$ blocks such that all indices in the block belong to a set $C_i$ such that $X|_{C_i}$ is $\eta$-bounded. Assume without loss of generality that the first $\frac{k}{2(d+1)}$ blocks have this property. For each $i = 1, \ldots, \frac{k}{2(d+1)}$, define a random variable $Z_i$ by $Z_i = 1$ if the $i$'th block is bad (i.e., not good), and 0 otherwise.

Note that $\mathbb{E}(Z_i) = \Pr(Z_i = 1) \leq \frac{(d+1) \cdot d}{2} \cdot \eta = \delta \cdot d^2 (d+1)^3$. Define $Z = \sum_{i=1}^{k/(2(d+1))} Z_i$. Then,

$$\mathbb{E}(Z) \leq \delta \cdot d^2 (d+1)^3 \cdot \frac{k}{2(d+1)} = \frac{\delta \cdot d^2 \cdot (d+1)^2}{2} \cdot k \leq \delta \cdot 2 d^4 \cdot k.$$

Therefore, using Markov's inequality, for any $0 < \gamma < 1$, $\Pr\big(Z > \frac{1}{\gamma} \cdot (\delta \cdot 2 d^4 \cdot k)\big) \leq \gamma$. Conversely, with probability at least $1 - \gamma$, we have at least $\frac{k}{2(d+1)} - \frac{\delta}{\gamma} \cdot 2 d^4 \cdot k$ good blocks. Finally, noting that $k \geq s$ and using the value of $\delta$, we get

$$\frac{k}{2(d+1)} - \frac{\delta}{\gamma} \cdot 2 d^4 \cdot k \geq \frac{s}{2(d+1)} - \frac{\delta}{\gamma} \cdot 2 d^4 \cdot s \geq \frac{s}{4(d+1)},$$

and the lemma follows.

Proof of Theorem 4.6. Let $X$ be a $\delta$-bounded s-valued d-multinomial distribution on $(\{0,1\}^n)^k$. Let $\ell = \frac{k}{d+1}$. For subsets $Z_1, \ldots, Z_\ell \subseteq \{0,1\}^n$ with $|Z_i| \leq d+1$, we define the distribution $X_{Z_1,\ldots,Z_\ell}$ to be $X$ conditioned on the event that for every $1 \leq i \leq \ell$, the set of distinct values in the $i$'th block is exactly $Z_i$. We can view $X$ as a convex combination of the distributions $X_{Z_1,\ldots,Z_\ell}$. Note that these distributions are simply concatenations of independent d-multinomial distributions on $(\{0,1\}^n)^{d+1}$. Call a distribution $X_{Z_1,\ldots,Z_\ell}$ 'good' if for at least $m$ values $i \in [\ell]$, $|Z_i| = d+1$, i.e., the $i$'th block contains $d+1$ distinct elements. It follows from Lemma 4.7 that the mass of 'good' distributions in the convex combination representing $X$ is at least $1 - \gamma$. Using Claim 4.3, for a good distribution $X_{Z_1,\ldots,Z_\ell}$, $E(X_{Z_1,\ldots,Z_\ell})$ is completely uniform. Thus, $E(X)$ is $\gamma$-close to uniform.

Using our conversion from general d-multinomial distributions to δ-bounded d-multinomial distributions, we get an extractor for general d-multinomial distributions.

Corollary 4.8 (Extractors for d-multinomial distributions). Fix any integers $n, m, d$ and $k$, and any $0 < \gamma < 1$. There is a $\gamma$-extractor $E : (\{0,1\}^n)^k \to \{0,1\}^m$ for the class of s-valued d-multinomial distributions whenever $s \geq m \cdot 8(d+1) + \frac{16 \cdot (2d)^5}{\gamma} \cdot \log k$. $E$ is computable in time $O(nk \cdot d^2)$.

Proof. Let $F$ be the algorithm from Lemma 4.5 with $\delta = \frac{\gamma}{16 \cdot (2d)^5}$. Given $x \in (\{0,1\}^n)^k$, our extractor $E$ works by first applying $F$ to $x$. We then possibly add at most $d$ $n$-bit strings to $F(x)$ to make the number of $n$-bit strings it contains a multiple of $d+1$ (at each step, we add the lexicographically first string that does not yet appear in $F(x)$). We then compute $E'(F(x))$, where $E'$ is the extractor for $\delta$-bounded d-multinomial distributions from Theorem 4.6. Let $X$ be an s-valued d-multinomial distribution. Lemma 4.5 guarantees that $F(X)$ is a convex combination of $s'$-valued $\delta$-bounded d-multinomial distributions for $s' \geq s - (1/\delta) \cdot \log k \geq m \cdot 8(d+1)$. The possible additions make the components of $F(X)$ $\delta$-bounded 2d-multinomial distributions. As $s' \geq m \cdot 8(d+1) > m \cdot 4(2d+1)$, it now follows from Theorem 4.6 that $E(X) = E'(F(X))$ is $\gamma$-close to uniform.

Using Corollary 4.8, the proof of Theorem 1.5 is similar to the one of Theorem 1.2.


Acknowledgments

We thank Oded Goldreich, Aram Harrow, Jelani Nelson, Noam Nisan, Krzysztof Onak, Ran Raz, Ronen Shaltiel, Leonard Schulman, Chris Umans and Avi Wigderson for stimulating discussions and helpful comments.

References

[1] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In STOC '96: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pages 20–29, New York, NY, USA, 1996. ACM.

[2] Alexandr Andoni, Andrew McGregor, Krzysztof Onak, and Rina Panigrahy. Better bounds for frequency moments in random-order streams. http://arxiv.org/abs/0808.2222, 2008.

[3] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702–732, 2004.

[4] M. Blum and S. Micali. How to generate cryptographically strong sequences of pseudo-random bits. SIAM Journal on Computing, 13(4):850–864, November 1984.

[5] A. Chakrabarti, G. Cormode, and A. McGregor. Robust lower bounds for communication and stream computation. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 641–650. ACM, New York, NY, USA, 2008.

[6] Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In IEEE Conference on Computational Complexity, pages 107–117, 2003.

[7] P. Elias. The efficient construction of an unbiased random sequence. Ann. Math. Statist., 43(3):865–870, 1972.

[8] A. Fiat and M. Naor. Implicit O(1) probe search. SIAM Journal on Computing, 22, 1993.

[9] A. Gabizon and R. Shaltiel. Increasing the output length of zero-error dispersers. In RANDOM, pages 430–443, 2008.

[10] O. Goldreich and A. Wigderson. Derandomization that is rarely wrong from short advice that is typically good. In RANDOM, Lecture Notes in Computer Science, pages 209–223, 2002.

[11] S. Guha and A. McGregor. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 2008.

[12] Russell Impagliazzo, Valentine Kabanets, and Avi Wigderson. In search of an easy witness: Exponential time vs. probabilistic polynomial time. In Sixteenth Annual IEEE Conference on Computational Complexity, pages 1–12, 2001.

[13] Russell Impagliazzo and Avi Wigderson. P = BPP if E requires exponential circuits: Derandomizing the XOR lemma. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 220–229, El Paso, Texas, 4–6 May 1997.

[14] Russell Impagliazzo and Avi Wigderson. Randomness vs. time: De-randomization under a uniform assumption. In 39th Annual Symposium on Foundations of Computer Science. IEEE, 1998.

[15] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307–323, 2006.

[16] V. Kabanets and R. Impagliazzo. Derandomizing polynomial identity tests means proving circuit lower bounds. Computational Complexity, 13, 2004.

[17] J. Kamp and D. Zuckerman. Deterministic extractors for bit-fixing sources and exposure-resilient cryptography. SIAM J. Comput., 36(5):1231–1247, 2007.

[18] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, New York, 1997.

[19] J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. In 19th Annual Symposium on Foundations of Computer Science, pages 253–258, 1978.

[20] N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.

[21] N. Nisan and A. Wigderson. Hardness vs randomness. Journal of Computer and System Sciences, 49(2):149–167, October 1994.

[22] Y. Peres. Iterating von Neumann's procedure for extracting random bits. Annals of Statistics, 20, 1992.

[23] Ryabko and Matchikina. Fast and efficient construction of an unbiased random sequence. IEEE Transactions on Information Theory, 46, 2000.

[24] A. Shamir. On the generation of cryptographically strong pseudorandom sequences. ACM Trans. on Computer Sys., 1(1):38, February 1983.

[25] L. Trevisan and S. Vadhan. Pseudorandomness and average-case complexity via uniform reductions. In Annual IEEE Conference on Computational Complexity, volume 17, 2002.

[26] J. von Neumann. Various techniques used in connection with random digits. Applied Math Series, 12:36–38, 1951.

[27] A. C.-C. Yao. Should tables be sorted? J. ACM, 28(3):615–628, 1981.

[28] Andrew C. Yao. Theory and applications of trapdoor functions (extended abstract). In 23rd Annual Symposium on Foundations of Computer Science, pages 80–91, Chicago, Illinois, November 1982. IEEE.

A Communication Complexity

In this appendix we present the proofs for the derandomization results in the communication complexity model. The main difference is that the players need to coordinate their actions. For example, one player may have the same value many times, while the other player has a distribution on many values. In case both distributions do not have many different values, we must solve all possible $\binom{s}{2}$ pairs, and thus need to increase $k$. We begin by defining a good deterministic protocol for multiple instances.


Definition A.1. Let $\mathcal{C}$ be a class of distributions on $(\{0,1\}^n \times \{0,1\}^n)^k$. Let $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ be any function. We say that a deterministic protocol $P$ solves $f$ on $\mathcal{C}$ with error $\varepsilon$ if for any distribution $Z$ in $\mathcal{C}$, when sampling a sequence $((x_1, y_1), \ldots, (x_k, y_k))$ according to $Z$, $P$ answers correctly on all inputs in the sequence with probability at least $1 - \varepsilon$. That is,

$$\Pr_{((x_1,y_1),\ldots,(x_k,y_k)) \leftarrow Z}\big(P((x_1, y_1), \ldots, (x_k, y_k)) = (f(x_1, y_1), \ldots, f(x_k, y_k))\big) \geq 1 - \varepsilon.$$

The proof of the following lemma is similar to that of Lemma 3.8.

Lemma A.2. Let $\mathcal{C}'$ be a class of distributions on $(\{0,1\}^n \times \{0,1\}^n)^k$. Let $\mathcal{C}$ be a class of same-valued distributions on $(\{0,1\}^n)^k$ such that for any $Z = ((X_1, Y_1), \ldots, (X_k, Y_k))$ in $\mathcal{C}'$, $X = (X_1, \ldots, X_k)$ is in $\mathcal{C}$. Let $E : (\{0,1\}^n)^k \to \{0,1\}^m$ be a $\gamma$-extractor for $\mathcal{C}$. Fix any $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ and any $\varepsilon > 0$, and let $P_R$ be a randomized public-coin protocol for $f$ using $r$ random bits and $c_r$ communication bits. Then, there exists a deterministic communication protocol $P$ using $k \cdot c_r + r$ communication bits that solves $f$ on $\mathcal{C}'$ with error $\varepsilon \cdot k + \gamma$.

Using this lemma, we can prove the main derandomization result for the communication complexity model.

Proof of Theorem 1.1. Let $z_1, z_2, \ldots, z_s$ be the distinct elements that appear in the sequence $(x_1, \ldots, x_k)$ received by Alice, and let $v_1, \ldots, v_l$ be the distinct elements that appear in the sequence $(y_1, \ldots, y_k)$ received by Bob. The protocol $P$ works as follows:

1. Denote $t_1 = \lfloor \frac{s \cdot \log k}{8} \rfloor$ and $t_2 = \lfloor \frac{l \cdot \log k}{8} \rfloor$. Alice checks whether $t_1 \leq r$ or $s \leq 4$, Bob checks whether $t_2 \leq r$ or $l \leq 4$, and they communicate the results of these checks.

2. If both answers were positive,

(a) Bob sends Alice $k \cdot \log l \leq k \cdot (\log r + 4)$ bits indicating, for each $1 \leq i \leq k$, for which $j \in [l]$ it holds that $y_i = v_j$.

(b) They run the more efficient of the following two protocols (according to the values $r, k, c_d$, and using (only) that $s, l \leq 10 \cdot r$):

• Bob sends Alice $v_1, \ldots, v_l$, and Alice computes $f(x_i, y_i)$ for all $i \in [k]$.

• For $1 \leq i \leq k$, Alice checks whether the instance $(x_i, y_i)$ has appeared previously in the sequence and indicates to Bob whether to solve this instance using $P_D$ or move to the next one (note that Alice can do this using the information sent in Step 2a).

3. Otherwise, assume w.l.o.g. that $t_1 \geq r$ and $s \geq 4$. In this case Alice computes $y = E(x_1, \ldots, x_k)$, where $E$ is the extractor from Lemma 3.6, and sends $y$ to Bob. Alice and Bob use $y$ as a shared random string to run $P_R$ on all instances.

Let $Z = ((X_1, Y_1), \ldots, (X_k, Y_k))$ be a product distribution on $(\{0,1\}^{2n})^k$. By Lemma 3.3, $Z$ is a convex combination of multinomial distributions. Thus, it is enough to prove the theorem in the case that $Z$ itself is a multinomial distribution. In case we have run one of the protocols of Step 2, the correctness is obvious. Note that in this case, the number of communication bits used is at most

$$2 + k \cdot (\log r + 4) + \min\{10 \cdot r \cdot n,\; 100 \cdot r^2 \cdot c_d\} + k \leq k \cdot (\log r + 6 + c_r),$$

using our assumption on $k$, which satisfies the claim of the theorem. Otherwise we have used Step 3. In this case, note that $X = (X_1, \ldots, X_k)$ and $Y = (Y_1, \ldots, Y_k)$ are multinomial distributions, one of which is s-valued for $s$ such that $\lfloor \frac{s \cdot \log k}{8} \rfloor \geq r$. As multinomial distributions are same-valued, the correctness now follows from Lemma A.2. In this case, the number of communication bits used is

$$2 + r + k \cdot c_r \leq k \cdot (c_r + \log r + 6),$$

as required.

Finally, we present the result for communication protocols with multiple distributions. The proof is similar to that of Theorem 1.5, with the required adjustments of Theorem 1.1.

Theorem A.3. Fix any integer $d$, and let $\mathcal{C}$ be the class of d-part product distributions on $(\{0,1\}^{2n})^k$. Let $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ be any function. Let

• $P_R$ be a public-coin randomized protocol with error $\varepsilon$ for $f$ using $c_r$ communication bits and $r$ random bits;

• $P_D$ be a deterministic protocol for $f$ using $c_d$ communication bits.

Fix any $0 < \gamma < 1$ and denote $\ell = r \cdot 8(d+1) + \frac{16 \cdot (2d)^5}{\gamma}$. For every integer $k \geq \ell \cdot \min\{\ell \cdot (c_d/c_r),\; n\}$, there is a deterministic protocol $P$ using at most $k \cdot (c_r + \log r + 6)$ communication bits that solves $f$ on $\mathcal{C}$ with error $\varepsilon \cdot k + \gamma$.

B Rainbows and Implicit Probe Schemes

In this section we discuss an application of same-source extractors to the problem of implicit probe search. Loosely speaking, this is the problem of searching for an element in a table with few probes, when no additional information but the elements themselves is stored.

Definition B.1 (Implicit probe search scheme). Fix integer parameters $n$, $k$ and $q$ such that $k < n$. The implicit probe search problem is as follows: Store a subset $S \subseteq \{0,1\}^n$ of size $2^k$ in a table $T$ of size $2^k$ (where every table entry holds only a single element of $S$), such that given $x \in \{0,1\}^n$ we can determine whether $x \in S$ using $q$ queries to $T$. A solution to this problem is called an implicit q-probe scheme with table size $2^k$ and domain size $2^n$. We will say a scheme is efficient if, given $x \in \{0,1\}^n$ (and given access to the table $T$ storing the elements of $S$), we can determine whether $x \in S$ in $\mathrm{poly}(n)$ time.

Fiat and Naor [8] investigated implicit O(1)-probe schemes, i.e., schemes where the number of queries is a constant not depending on $n$ and $k$. They showed that this problem is unsolvable when $n$ is large enough relative to $k$ (this improves a previous bound by Yao [27]). They also gave an efficient implicit O(1)-probe scheme whenever $k = \delta \cdot n$ for any constant $\delta > 0$. They did this by reducing the problem to the task of constructing a combinatorial object called a rainbow.

Definition B.2 (Taken from [8]).

• A t-sequence over a set $U$ is a sequence of length $t$, without repetitions, of elements in $U$.

• An $(m, n, k, t)$-rainbow is a coloring of all t-sequences over $\{0,1\}^n$ with $2^m$ colors such that for any $S \subseteq \{0,1\}^n$ of size $2^k$, the t-sequences over $S$ are colored in all colors.

We will say an $(m, n, k, t)$-rainbow is explicit if it is $\mathrm{poly}(n)$-time computable.¹²

¹² This definition makes sense as the other parameters $m$, $k$ and $t$ will always be at most $n$ in this section.


For readers familiar with the extractor literature, a rainbow may be viewed as a seedless 'zero-error disperser' where our 'weak random source' consists of multiple independent copies of the same source, with the additional restriction that the same sample cannot appear twice.

Fiat and Naor showed that rainbows imply implicit probe schemes.

Theorem B.3 ([8]). Fix any $n, k$ with $\log n \leq k \leq n$. Given an explicit $(k, n, k, t)$-rainbow, we can construct an efficient implicit O(t)-probe scheme with table size $2^k$ and domain size $2^n$.

We first present a simple rainbow construction for small $k$ with few colors (which is implicit in [8]). Later on, we show how to get to $2^k$ colors.

Lemma B.4. For any $n, k$ and $t$ such that $t \geq 9$ and $\log t \leq k \leq n$, there is an explicit $(t \log t / 3,\, n, k, t)$-rainbow.

Proof. The rainbow will simply output the 'permutation order' of the input. More precisely, given input $x_1, \ldots, x_t$, where the $x_i$'s are distinct, let $\sigma \in S_t$ be the unique permutation of $\{1, \ldots, t\}$ such that $x_{\sigma(1)} < \ldots < x_{\sigma(t)}$ in lexicographical order. Output the index of $\sigma$ in $S_t$. Since, for $t \geq 9$ (using Stirling's approximation),

$$\log |S_t| = \log t! \geq t \log t / 3,$$

we are done.
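
A minimal sketch (ours) of this rainbow: map the sorting permutation of a repetition-free sequence to its index via the standard Lehmer-code ranking.

```python
def rainbow_color(xs):
    """Rank (in S_t) of the permutation sorting a repetition-free sequence.

    Sequences over the same value set in different orders get different
    colors, and all t! colors are hit (cf. Lemma B.4).
    """
    assert len(set(xs)) == len(xs), "t-sequences have no repetitions"
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    rank = 0
    for i, pos in enumerate(order):
        # Lehmer code digit: remaining positions smaller than pos
        smaller = sum(q < pos for q in order[i + 1:])
        rank = rank * (len(xs) - i) + smaller
    return rank
```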

The following instantiation of Lemma B.4 will be convenient for us.

Corollary B.5. Fix any constant $c \geq 1$ and let $d = 6c$. For every large enough $k$ and $n$ with $k < n$ and $k > \frac{d \cdot \log n}{\log \log n}$, there is an explicit $(c \cdot \log n,\, n, k, \frac{d \cdot \log n}{\log \log n})$-rainbow.

Proof. Taking $t = 6c \cdot \frac{\log n}{\log \log n}$ in Lemma B.4, we get an $(m, n, k, t)$-rainbow where

$$m = t \log t / 3 \geq 2c \cdot \frac{\log n}{\log \log n} \cdot (\log \log n - \log \log \log n) \geq c \cdot \log n,$$

for large enough $n$.

To increase the number of colors we use the following lemma from [9].

Lemma B.6. There exist constants $0 < \eta < 1$ and $c$ such that for every sufficiently large $k$, $n$ and $m = \eta k$: There is a $\mathrm{poly}(n)$-time computable function $F : (\{0,1\}^n)^2 \times \{0,1\}^{d}$, with $d = c \cdot \log n$, to $\{0,1\}^m$ such that for any 2 independent sources $X_1, X_2$ with min-entropy at least $k$ and for every $b \in \{0,1\}^m$, there exists $y \in \{0,1\}^d$ such that

$$\Pr(F(X_1, X_2, y) = b) \geq 2^{-m+1}.$$

Using this lemma, we get the following rainbow construction.

Theorem B.7. For every large enough $k$ and $n$ with $\log n \leq k < n$, there is an explicit $(k, n, k, O(\log n / \log \log n))$-rainbow.

Proof. We will show that for some constant $0 < \eta < 1$ there is an explicit $(\eta k, n, k, O(\log n / \log \log n))$-rainbow. It then follows from [8] (see the remark after Theorem 6 in that paper) that there is an explicit $(k, n, k, O(\log n / \log \log n))$-rainbow.


Let $F$ be the function from Lemma B.6. Let $D_1$ be the $(c \cdot \log n, n, k, t)$-rainbow from Corollary B.5, taking $c$ to be the coefficient of $\log n$ in the seed length of $F$, where $t = O(\log n / \log \log n)$. We define a function $D$ by

$$D(x_1, \ldots, x_{t+2}) \triangleq F(x_{t+1}, x_{t+2}, D_1(x_1, \ldots, x_t)).$$

Let $S \subseteq \{0,1\}^n$ be a subset of size $2^k$ and let $(X_{t+1}, X_{t+2})$ be the distribution consisting of 2 independent copies of the uniform distribution on $S$. Fix any $b \in \{0,1\}^m$, and fix a $y$ such that

$$\Pr(F(X_{t+1}, X_{t+2}, y) = b) \geq 2^{-m+1}.$$

Since $\Pr(X_{t+1} = X_{t+2}) = 2^{-k}$ and $2^{-k} < 2^{-m+1}$ (assuming, say, $\eta \leq 1/2$), there must be distinct $x_{t+1}, x_{t+2} \in S$ such that $F(x_{t+1}, x_{t+2}, y) = b$. Fixing a t-sequence $x_1, \ldots, x_t$ over $S$ such that $D_1(x_1, \ldots, x_t) = y$, we get

$$D(x_1, \ldots, x_{t+2}) = F(x_{t+1}, x_{t+2}, y) = b.$$

Thus, for any $b \in \{0,1\}^m$ we have found a $(t+2)$-sequence over $S$ mapped to $b$, and the claim follows.

Corollary B.8. For every large enough $k$ and $n$ with $\log n \leq k < n$, there exists an efficient implicit $O(\log n / \log \log n)$-probe scheme for table size $2^k$ and domain size $2^n$.

We phrase the result, perhaps more naturally, with query complexity and domain size as a functionof table size.

Corollary B.9. For every large enough $m$ and $n$ with $m \leq 2^n$, there exists an efficient implicit $O(\log \log m / \log \log \log m)$-probe scheme for table size $n$. In particular, for $m = 2^n$ there is an efficient implicit $O(\log n / \log \log n)$-probe scheme.

C Data Stream

Let $x_1, \ldots, x_n$ be the stream of elements, and assume that we are trying to compute the $p$'th moment, for a constant $p > 2$. Let $R$ be a randomized algorithm which computes the moment up to a multiplicative factor $(1 + \varepsilon)$ with success probability $1 - \delta$, where the probability is taken over the ordering of the elements as well as the coin tosses of the algorithm. Let $S$ be a bound on the space required by $R$, where $S$ is poly-logarithmic in $n$, and let $r$ be a bound on the number of coins $R$ uses. Assume $r$ is at most polynomial in $n$ (having $r$ quasi-polynomial does not change the result). We present a deterministic algorithm $D$, which approximates the $p$'th moment¹³ up to a multiplicative factor of $(1 + \varepsilon + \varepsilon_D)$ with success probability $1 - \delta - \varepsilon_D$, with a space bound of $O(\log^2 n \cdot S \cdot \log r)$, where $\varepsilon_D = n^{-\alpha}$ for some $\alpha > 0$ (the analysis here requires $\alpha < 0.25$; this can be improved by refining the algorithm).

C.1 Preliminaries

Let fp(~y) denote the p’th moment; that is, if the distinct elements of ~y are y1, . . . ym where yi appearsci times then fp(~y) = (

∑i cpi )

1/p. The algorithm will use a Pseudo Random Generator (PRG) againstbounded space. Nisan showed in [20], that

Theorem (Theorem 1 from [20]). There exists a constant c > 0 and a deterministic pseudorandom generator which converts a random seed of length cS log r bits to r bits, such that if R is a deterministic algorithm which runs in space S, the probability that R distinguishes the truly random bits from the output of the PRG is O(1/n).
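The following is a schematic Python rendering of the recursive structure behind Nisan's generator: each level doubles the output length by using the seed once directly and once through a fresh pairwise-independent hash function. The block length, field size, and number of levels are illustrative choices of ours, not the exact parameters of [20].

    import random

    P = (1 << 61) - 1  # a Mersenne prime; arithmetic mod P gives pairwise independence

    def make_hash():
        # A random pairwise-independent function x -> (a*x + b) mod P.
        return (random.randrange(1, P), random.randrange(P))

    def nisan_prg(x, hashes):
        # G_0(x) = x;  G_i(x) = G_{i-1}(x) concatenated with G_{i-1}(h_i(x)).
        # The output consists of one block times 2^len(hashes).
        if not hashes:
            return [x]
        a, b = hashes[-1]
        rest = hashes[:-1]
        return nisan_prg(x, rest) + nisan_prg((a * x + b) % P, rest)

    # Example: a seed of one block plus t hash descriptions expands to 2^t blocks.
    t = 4
    hs = [make_hash() for _ in range(t)]
    out = nisan_prg(random.randrange(P), hs)
    assert len(out) == 2 ** t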



We ignore the errors induced by the PRG by absorbing them into the constants. The generator itself runs in space O(S log r).

The common way to use this PRG is to exhaust over all possible random seeds and simulate the randomized algorithm on all the resulting outputs. However, this approach is not possible in the data streaming model, as the input appears just once. Therefore we use the randomness of the ordering to load the PRG. In general, this may not be possible, as the coin flips of the randomized algorithm need to be independent of the ordering. To solve this, we later prove a continuity result on frequency moments. Results of a similar flavor hold for other problems as well.

C.2 Algorithm

The deterministic algorithm begins by compressing the beginning of the stream. If this builds up entropy quickly, it uses this entropy to load the PRG and simulates the randomized algorithm on the rest of the stream. If the entropy rises slowly, it goes over the stream, counting the number of appearances of the elements which appeared in the beginning. Using these counts, it produces an estimate of the total p'th moment (a schematic Python sketch follows the listing below). Formally:

1. Let k be the first index such that the multinomial coefficient describing x_1, . . . , x_k is at least 2^{d·log n}, where d = c · S · log r and c is the constant from the PRG. Store x_1, . . . , x_k by storing the multinomial coefficient as well as the distinct values.

2. If k is small (at most n^{3/4}):

(a) D applies the extractor E from 3.6 to the multinomial coefficient, extracting r bits which are 1/n-close to uniform.

(b) D uses the r random bits to load Nisan's PRG from [20], and uses it to simulate R on the rest of the stream.

(c) Output R(x_{k+1}, . . . , x_n).

3. If k > n^{3/4}:

(a) Let s denote the number of distinct values of x_1, . . . , x_k. Assume without loss of generality that the distinct elements are x_1, . . . , x_s.

(b) D goes over the stream, counting the number of appearances of x_1, . . . , x_s. Let a_i denote the number of appearances of x_i.

(c) Output (Σ_i a_i^p)^{1/p}.
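The control flow above is summarized in the following sketch. The helpers order_extractor, nisan_prg and the simulated algorithm R are hypothetical placeholders for the extractor E of 3.6, the PRG of [20], and the randomized algorithm; for simplicity the stream is a list here, whereas a real implementation is one-pass and respects the stated space bound.

    from collections import Counter
    from math import lgamma, log, log2

    def multinomial_log2(prefix):
        # log2 of the multinomial coefficient k! / (c_1! * ... * c_m!)
        # describing the prefix, computed via log-gamma to avoid big ints.
        counts = Counter(prefix)
        k = len(prefix)
        return (lgamma(k + 1) - sum(lgamma(c + 1) for c in counts.values())) / log(2)

    def deterministic_D(stream, n, d, p, order_extractor, nisan_prg, R):
        prefix = []
        for x in stream:
            prefix.append(x)
            if multinomial_log2(prefix) >= d * log2(n):
                break
        k = len(prefix)
        if k <= n ** 0.75:
            # Case 2: the ordering accumulated entropy quickly. Extract it,
            # stretch it with the PRG, and simulate R on the rest of the stream.
            coins = nisan_prg(order_extractor(prefix))
            return R(stream[k:], coins)
        # Case 3: few distinct values. Count exact occurrences over the stream.
        distinct = set(prefix)
        counts = Counter(x for x in stream if x in distinct)
        return sum(c ** p for c in counts.values()) ** (1.0 / p)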

C.3 Analysis

We first show that D runs in polylogarithmic space.

Theorem C.1. D runs in space at most O(log^2 n · c · S · log r).

Proof. Recall that d = c · S · log r. Computing the multinomial coefficient of x_1, . . . , x_k is done using at most O(d · log n · log log n) bits. Storing the distinct elements of x_1, . . . , x_k requires at most d · log^2 n bits, as there are at most d · log n such elements. If k is small, D proceeds by simulating R; this is done in space S < d. If k is large, one needs to compute a_1, . . . , a_s. To bound s, note that if there are s different values then the multinomial coefficient is at least s! > 2^s. Therefore, s < d · log n, and this counting can also be done in space d · log^2 n, which gives the required bound.


The correctness proof consists of two different cases.

Theorem C.2. For any constant p > 2 and α < 0.25, if k < n^{3/4}, then f_p(x_{k+1}, . . . , x_n) ≤ f_p(x_1, . . . , x_n) ≤ (1 + n^{−α}) f_p(x_{k+1}, . . . , x_n).

Proof. The first inequality is trivial. To show the second inequality, we need to differentiate between elements which appeared many times in x_1, . . . , x_k (and will therefore probably appear again) and elements which appeared few times (which will change the approximation ratio only slightly). Assume without loss of generality that x_1, . . . , x_l appeared more than 2 log n + 1 times in x_1, . . . , x_k, and that x_{l+1}, . . . , x_s appeared at most 2 log n + 1 times. Let β_i denote the number of appearances of x_i in x_1, . . . , x_k. We need to prove

f_p(x_1, . . . , x_n) < (1 + n^{−α}) f_p(x_{k+1}, . . . , x_n),

or equivalently f_p^p(x_1, . . . , x_n) < (1 + n^{−α})^p f_p^p(x_{k+1}, . . . , x_n). Using the bound (1 + n^{−α})^p > 1 + p·n^{−α} from the Binomial Theorem, we prove a stronger statement, namely that

f_p^p(x_1, . . . , x_n) − f_p^p(x_{k+1}, . . . , x_n) < p·n^{−α} f_p^p(x_1, . . . , x_n).    (1)

We analyze the LHS of the inequality (1).

f_p^p(x_1, . . . , x_n) − f_p^p(x_{k+1}, . . . , x_n) = Σ_{i=1}^{s} [(γ_i + β_i)^p − γ_i^p],

where γ_i denotes the number of appearances of the i'th element in x_{k+1}, . . . , x_n.

The following lemma helps us treat heavy elements and light ones differently.

Lemma C.3. If β_i > 2 log n + 1, then for every ε > 0 and large enough n, Pr(γ_i < (1 − ε)(n/k)·β_i) < O(1/n).

Note that (1 − ε)n/k > n^α > n^{α/p}. Applying a union bound over the at most s heavy elements, with probability 1 − O(polylog n / n),

Σ_{i : γ_i > n^{α/p} β_i} [(γ_i + β_i)^p − γ_i^p] < (p/2) n^{−α} Σ_{i : γ_i > n^{α/p} β_i} γ_i^p.

When β_i is small, assuming we are in the probable event that γ_i is also small,

Σ_{i : γ_i < n^{α/p} β_i} [(γ_i + β_i)^p − γ_i^p] < (p/2) n^{1−α}.

To finish the proof, we look at the RHS of (1):

p·n^{−α} f_p^p(x_1, . . . , x_n) ≥ p·n^{−α} max{n, Σ_i γ_i^p} ≥ (p/2)(n^{1−α} + n^{−α} Σ_i γ_i^p).

Combining the two displayed bounds on the LHS with this lower bound on the RHS establishes (1).
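Spelled out, the combination is the following routine verification (under the event of Lemma C.3 and the union bound above, and using f_p^p(x_1, . . . , x_n) ≥ max{n, Σ_i γ_i^p}):

\begin{align*}
f_p^p(x_1,\dots,x_n) - f_p^p(x_{k+1},\dots,x_n)
 &= \sum_{i=1}^{s}\bigl[(\gamma_i+\beta_i)^p - \gamma_i^p\bigr]
  \;<\; \frac{p}{2}\,n^{-\alpha}\sum_i \gamma_i^p + \frac{p}{2}\,n^{1-\alpha} \\
 &= \frac{p}{2}\,n^{-\alpha}\Bigl(\sum_i \gamma_i^p + n\Bigr)
  \;\le\; p\,n^{-\alpha}\max\Bigl\{n,\sum_i \gamma_i^p\Bigr\}
  \;\le\; p\,n^{-\alpha}\, f_p^p(x_1,\dots,x_n).
\end{align*}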

The theorem gives that if k is small, D has almost the same success probability as R (up to an O(log n / n) term which comes from the PRG and from the density of the heavy elements), and outputs an approximation which is almost as good (up to a factor of 1 + n^{−α}).

If k > n^{3/4}, assume without loss of generality that x_1, . . . , x_s are all the distinct elements of x_1, . . . , x_k, and note that s < d · log n. For 1 ≤ i ≤ s, let a_i denote the number of appearances of x_i in x_1, . . . , x_n. D computes a_1, . . . , a_s, and outputs (Σ_i a_i^p)^{1/p}.


Theorem C.4. For any constant p > 2 and α < 0.25, if k > n^{3/4}, then with probability 1 − n^{−α} we have

(Σ_i a_i^p)^{1/p} ≤ f_p(x_1, . . . , x_n) ≤ (1 + n^{−α}) (Σ_i a_i^p)^{1/p}.

Proof. As in Theorem C.2, the first inequality is trivial, and holds with probability 1. As for the second, denote x = n − Σ_i a_i. Given that x_{s+1}, . . . , x_k contain only the elements of x_1, . . . , x_s, the probability that x > n^{0.25} is exponentially small. We only require a weaker bound, for example

Pr(x > √n) < O(1/n).

To prove such a bound (which is stronger than the one we need), go over x_1, . . . , x_n, putting red balls in all the places where an element of x_1, . . . , x_s appeared, and blue balls elsewhere. Given that the first s balls are red, if there are x > √n blue balls, the probability that none of them appeared in x_{s+1}, . . . , x_k is

(1 − x/(n−s))(1 − x/(n−s−1)) · · · (1 − x/(n−k+1)) ≤ (1 − x/n)^{k−s},

and this probability is exponentially small.
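To see the quantitative decay (a one-line check, using x > √n, k > n^{3/4} and s = O(polylog n)):

\[
\Bigl(1-\frac{x}{n}\Bigr)^{k-s}
 \;\le\; e^{-x(k-s)/n}
 \;\le\; e^{-\sqrt{n}\,(n^{3/4}-s)/n}
 \;=\; e^{-\Omega(n^{1/4})}.
\]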

Assume then that x < √n. The pigeonhole principle gives that there are heavy elements among the s first elements. These elements dominate the moment, and the ratio between the two estimates is bounded by 1 + n^{−α}.

D Proofs of Technical Lemmas

This appendix presents calculations deferred from the body of the paper.

Lemma 3.4. For any integers s ≤ k with k ≥ 32 and s ≥ 4, and positive integers a_1, . . . , a_s summing to k, we have

log (k choose a_1, . . . , a_s) ≥ (s · log k)/4,

where (k choose a_1, . . . , a_s) = k!/(a_1! · · · a_s!) denotes the multinomial coefficient.

Proof. It follows from Stirling's formula that for any integer k,

√(2πk) · (k/e)^k ≤ k! ≤ 3·√(2πk) · (k/e)^k.

Thus,

(k choose a_1, . . . , a_s) ≥ k!/(k − s + 1)! ≥ [√(2πk) · (k/e)^k] / [3 · √(2π(k − s + 1)) · ((k − s + 1)/e)^{k−s+1}] ≥ (k/e)^k / [3 · (k/e)^{k−s+1}] = (1/3) · (k/e)^{s−1}.

Thus,

log (k choose a_1, . . . , a_s) ≥ (s − 1) log k − (s − 1) log e − log 3 ≥ (3/4)·s·log k − 2.5·s ≥ (1/4)·s·log k,

where the middle inequality uses s ≥ 4. The last inequality follows as

(s · log k)/2 ≥ 2.5·s ⟺ log k ≥ 5 ⟺ k ≥ 32.
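A quick numeric sanity check of the bound, using exact integer arithmetic (illustrative only; the parameters and helper are ours):

    from math import factorial, log2

    def log2_multinomial(k, parts):
        # log2 of k! / (a_1! * ... * a_s!), where parts = (a_1, ..., a_s).
        assert sum(parts) == k
        v = factorial(k)
        for a in parts:
            v //= factorial(a)
        return log2(v)

    # Worst case in the proof: one part of size k - s + 1, the rest singletons.
    k, s = 64, 8
    parts = [k - s + 1] + [1] * (s - 1)
    assert log2_multinomial(k, parts) >= s * log2(k) / 4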


E Lower bounds

How large does k have to be to get a deterministic scheme for product distributions running in time that is almost k times the randomized running time? We have given an upper bound of k = O(t_d · r / t_r). In the theorem below we give a somewhat weaker lower bound for algorithms of the following form (a schematic sketch follows the list).

Given (x_1, . . . , x_k), either

1. for all i run A_R(x_i, E(x_1, . . . , x_k)) for some function E; or

2. for all i run A_D(x_i).
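In Python terms, the restricted form reads roughly as follows, where A_R, A_D, the seed-producing function E, and the mode decision are hypothetical callables of ours:

    def restricted_A(xs, A_R, A_D, E, use_randomized):
        # The lower bound applies to algorithms committed to one of two modes:
        # either every instance goes through A_R with one shared seed E(xs),
        # or every instance goes through the deterministic A_D.
        if use_randomized:
            seed = E(xs)                        # one seed extracted from all inputs
            return [A_R(x, seed) for x in xs]   # mode 1
        return [A_D(x) for x in xs]             # mode 2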

Theorem E.1. Fix deterministic and randomized algorithms A_D, A_R, respectively, for a language L. Denote the running times of A_D and A_R by t_d and t_r respectively. Let k and n be any large enough integers. Assume that there is an algorithm A of the above form running in time k · t_r + f(n, k), with f(n, k) < k · t_r, that answers correctly on all instances x_1, . . . , x_k ∈ {0,1}^n with probability 4/5. Let r* be the minimal number of random bits used by an algorithm for L running in time t_r + f(n, k) + k · log^2 k with error 1/3. Then,

k · log k = Ω(r* · t_d / t_r).

Thus, either k is super-polynomial in r* or

k = Ω(r* · t_d / (log r* · t_r)).

Proof. Assume we have such an A for k = r* · t_d / (c · log k · t_r), for a large enough constant c to be determined later. We will construct a randomized algorithm A'_R for L with error 1/3, running in time t_r + f(n, k) + k · log^2 k and using less than r* random bits, leading to a contradiction. Define a distribution D on {0,1}^n as follows: Let s = 8 · r* / (c · log k), and fix distinct elements z_0, . . . , z_s ∈ {0,1}^n. D gives z_0 probability 1 − 10·s/k and, for 1 ≤ i ≤ s, D gives z_i probability 10/k. By a Chernoff bound, with probability at least 99/100, for large enough s (if s is bounded by a constant then so is r*/log k, in which case the theorem already follows from the bound k ≥ t_d/t_r), there are between 5s and 15s appearances of the elements z_1, . . . , z_s in a sequence of k independent samples from D (while the rest of the sequence is equal to z_0). Also, with probability at least 99/100 there are at least s/4 distinct elements in the sequence. (Indeed, the probability of a particular z_i appearing is 1 − (1 − 10/k)^k ≥ 1 − e^{−10} ≥ 999/1000; letting Y be the number of distinct elements among z_1, . . . , z_s that appear, E(Y) ≥ (999/1000)·s, and denoting δ = Pr(Y ≤ s/4) we have (999/1000)·s ≤ δ·s/4 + (1 − δ)·s, which implies δ ≤ 1/100.) Denote by X the distribution D^{⊕k} conditioned on having between 5s and 15s appearances of z_1, . . . , z_s and at least s/4 distinct elements. As X has mass at least 98/100 in D^{⊕k}, A must be correct on sequences from X with probability at least 4/5 − 2/100 > 3/4. Note also that, as sequences from X contain at least s/4 distinct elements, A must invoke A_R on all of them: otherwise it would have running time at least

s/4 · t_d = 2 · r* · t_d / (c · log k) ≥ 2k · t_r > k · t_r + f(n, k).

The algorithm A'_R works as follows: Given input z, it substitutes z for z_0 and attempts to sample (x_1, . . . , x_k) from the distribution X as described below. If it succeeds, it runs A_R(z, E(x_1, . . . , x_k)) and answers accordingly. We attempt to sample from X as follows: Choose a subset of [k] of size 15 · s. For each index in the subset, for 1 ≤ i ≤ s, choose z_i with probability 10/k,



and choose z_0 with probability 1 − 10·s/k. Put z_0 in the rest of the indices. As noted before, with probability at least 98/100 we will have at least s/4 distinct elements and at least 5s appearances of z_1, . . . , z_s. Conditioned on this event we have a uniform element from X.
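A direct rendering of this sampling step (a sketch; the function name is ours, z is the input playing the role of z_0, zs holds z_1, . . . , z_s, and floating-point draws stand in for the bit-counting in the text):

    import random

    def sample_attempt(z, zs, k):
        # One attempt to sample from X, as described in the proof.
        s = len(zs)
        seq = [z] * k
        hits = []
        for idx in random.sample(range(k), 15 * s):   # the chosen subset of [k]
            r = random.random()
            if r < 10 * s / k:                        # some z_i lands here...
                seq[idx] = zs[int(r * k / 10)]        # ...each with probability 10/k
                hits.append(seq[idx])
        # The event conditioned on in the proof:
        if len(hits) >= 5 * s and len(set(hits)) >= s / 4:
            return seq
        return None                                   # sampling attempt failed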

The sampling process takes at most (c/16) · s · log k ≤ r*/2 random bits (for the large enough constant c chosen above) and time k · log^2 k. The algorithm A'_R is correct with probability 3/4 conditioned on succeeding to generate a sample from X, and thus its total error is at most 1/4 + 2/100 < 1/3.

Altogether, we have obtained a randomized algorithm using less than r* random bits and running in time at most t_r + f(n, k) + k · log^2 k with error at most 1/3, which is a contradiction to the minimality of r*.

In the theorem below we show our extraction scheme is almost optimal.

Theorem E.2. Let C be the class of product distributions on ({0,1}^n)^k conditioned on having at least s distinct values. Then C contains a distribution X with H∞(X) ≤ O(s · log k). Thus, any extractor for C with error ε ≤ 1/2 can extract at most O(s · log k) bits.

Proof. As a first step to constructing X, we define a distribution D on {0,1}^n as follows: Fix distinct elements z_0, . . . , z_{2s} ∈ {0,1}^n. D gives z_0 probability 1 − 2s/k and, for 1 ≤ i ≤ 2s, D gives z_i probability 1/k. Let us denote by X the product distribution D^{⊕k} conditioned on seeing at least s distinct elements. Denote by X' the distribution X conditioned on having between s and 3s appearances of z_1, . . . , z_{2s}. As the min-entropy of a distribution is at most the log of its support size, using the bound of Lemma 3.4, we have H∞(X') ≤ O(s · log k + log s) = O(s · log k). Note that the log s term came from having 2s + 1 options as to how many appearances of z_1, . . . , z_{2s} we have.

By Chebyshev, with probability at least 1 − 2/s there are between s and 3s appearances of the elements z_1, . . . , z_{2s} in a sequence of k independent samples from D. Thus, X' has mass at least 1 − 2/s in X, and therefore H∞(X) ≤ H∞(X') + O(log s) = O(s · log k).
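One way to make the support-size count explicit (our bookkeeping, under the proof's conventions): a sequence in the support of X' is determined by the number j of non-z_0 positions, their locations, and which z_i occupies each of them, so

\[
\bigl|\mathrm{supp}(X')\bigr|
 \;\le\; \sum_{j=s}^{3s} \binom{k}{j}\,(2s)^{j}
 \;\le\; (2s+1)\,\binom{k}{3s}\,(2s)^{3s},
\]

and hence log |supp(X')| ≤ O(log s) + O(s · log k) + O(s · log s) = O(s · log k), since s ≤ k.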
