Chernoff Bound


  • 7/30/2019 Chernoff Bound


    Chernoff Bounds

Let X1,..., Xn be independent 0-1 random variables with

Pr(Xi = 1) = pi,  Pr(Xi = 0) = 1 − pi.

Let X = Σ_{i=1}^n Xi, and

μ = E[X] = Σ_{i=1}^n E[Xi] = Σ_{i=1}^n pi.

We want a bound on

Pr(|X − μ| ≥ δμ).


The Basic Idea

Using the Markov inequality we have: for any t > 0,

Pr(X ≥ a) = Pr(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}] / e^{ta}.

Similarly, for any t < 0,

Pr(X ≤ a) = Pr(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}] / e^{ta}.

Thus

Pr(X ≥ a) ≤ min_{t>0} E[e^{tX}] / e^{ta},

Pr(X ≤ a) ≤ min_{t<0} E[e^{tX}] / e^{ta}.
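As a sanity check, this generic recipe can be evaluated exactly for a small binomial variable. The parameters below (n = 20, p = 1/2, a = 15) are illustrative choices, not from the slides:

```python
import math

# Exact check of Pr(X >= a) <= min_{t>0} E[e^{tX}]/e^{ta} for X ~ Binomial(n, p).
n, p, a = 20, 0.5, 15

def pmf(k):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

true_tail = sum(pmf(k) for k in range(a, n + 1))

def chernoff(t):
    mgf = sum(pmf(k) * math.exp(t * k) for k in range(n + 1))  # E[e^{tX}] exactly
    return mgf / math.exp(t * a)

best = min(chernoff(i / 100) for i in range(1, 300))  # crude grid search over t > 0
assert true_tail <= best < 1  # Markov guarantees the bound for every t > 0
```

The grid search stands in for the closed-form optimization over t done in the slides that follow.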


    Moment Generating Function

    Definition

The moment generating function of a random variable X is defined for any real value t as

M_X(t) = E[e^{tX}].


    Theorem

Let X be a random variable with moment generating function M_X(t). Assuming that exchanging the expectation and differentiation operands is legitimate, for all n ≥ 1,

E[X^n] = M_X^{(n)}(0),

where M_X^{(n)}(0) is the n-th derivative of M_X(t) evaluated at t = 0.

Proof.

M_X^{(n)}(t) = E[X^n e^{tX}].

Computed at t = 0 we get

M_X^{(n)}(0) = E[X^n].
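A quick numerical illustration, using a Bernoulli(p) variable with p = 0.3 (an arbitrary choice): its MGF is M_X(t) = (1 − p) + p·e^t, and finite differences at t = 0 should recover the moments E[X^n], which all equal p for a 0-1 variable:

```python
import math

# For X ~ Bernoulli(p): M_X(t) = (1-p) + p*e^t, and E[X^n] = p for all n >= 1.
p = 0.3
M = lambda t: (1 - p) + p * math.exp(t)

h = 1e-4
first = (M(h) - M(-h)) / (2 * h)           # central difference ~ M'(0) = E[X]
second = (M(h) - 2 * M(0) + M(-h)) / h**2  # ~ M''(0) = E[X^2]

assert abs(first - p) < 1e-6
assert abs(second - p) < 1e-5
```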


    Theorem

Let X and Y be two random variables. If

M_X(t) = M_Y(t)

for all t ∈ (−δ, δ) for some δ > 0, then X and Y have the same distribution.

Theorem

If X and Y are independent random variables, then

M_{X+Y}(t) = M_X(t) M_Y(t).

Proof.

M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t),

where the second equality uses the independence of X and Y.


    Chernoff Bound for Sum of Bernoulli Trials

Let X1,..., Xn be a sequence of independent Bernoulli trials with Pr(Xi = 1) = pi. Let X = Σ_{i=1}^n Xi, and let

μ = E[X] = E[Σ_{i=1}^n Xi] = Σ_{i=1}^n E[Xi] = Σ_{i=1}^n pi.

Then

M_{Xi}(t) = E[e^{tXi}] = pi e^t + (1 − pi) = 1 + pi(e^t − 1) ≤ e^{pi(e^t − 1)},

using 1 + x ≤ e^x.


Taking the product of the n generating functions we get

M_X(t) = Π_{i=1}^n M_{Xi}(t) ≤ Π_{i=1}^n e^{pi(e^t − 1)} = e^{Σ_{i=1}^n pi(e^t − 1)} = e^{μ(e^t − 1)}.
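This MGF bound is easy to check numerically; the sketch below draws ten random pi (an illustrative setup, not from the slides) and verifies the inequality at a few values of t:

```python
import math, random

# Check M_X(t) = prod(1 + p_i*(e^t - 1)) <= exp(mu*(e^t - 1)) for random p_i.
random.seed(1)
ps = [random.random() for _ in range(10)]
mu = sum(ps)
for t in [0.1, 0.5, 1.0, 2.0]:
    mgf = math.prod(1 + p * (math.exp(t) - 1) for p in ps)
    assert mgf <= math.exp(mu * (math.exp(t) - 1))
```

The inequality holds term by term, since each factor 1 + pi(e^t − 1) is at most e^{pi(e^t − 1)}.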


Theorem

Let X1,..., Xn be independent Bernoulli random variables such that Pr(Xi = 1) = pi. Let X = Σ_{i=1}^n Xi and μ = E[X]. For any δ > 0,

Pr(X ≥ (1 + δ)μ) ≤ (e^δ / (1 + δ)^{1+δ})^μ.

Proof. For any t > 0,

Pr(X ≥ (1 + δ)μ) = Pr(e^{tX} ≥ e^{t(1+δ)μ}) ≤ E[e^{tX}] / e^{t(1+δ)μ} ≤ e^{μ(e^t − 1)} / e^{t(1+δ)μ}.

Since δ > 0, we can set t = ln(1 + δ) > 0 to get:

Pr(X ≥ (1 + δ)μ) ≤ (e^δ / (1 + δ)^{1+δ})^μ.
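A Monte Carlo sketch of this tail bound, with illustrative parameters (n = 100, all pi = 0.5, δ = 0.5) that are assumptions rather than values from the slides:

```python
import math, random

# Empirical tail Pr(X >= (1+d)*mu) vs the Chernoff bound (e^d/(1+d)^(1+d))^mu.
random.seed(7)
n, p, d = 100, 0.5, 0.5
mu = n * p
trials = 20000
hits = sum(sum(random.random() < p for _ in range(n)) >= (1 + d) * mu
           for _ in range(trials))
bound = (math.exp(d) / (1 + d) ** (1 + d)) ** mu
assert hits / trials <= bound  # the bound should dominate the empirical tail
```

With these parameters the bound is roughly 4.5 × 10⁻³, while the true tail probability is far smaller; the bound holds but is not tight.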


We show that for 0 < δ < 1,

e^δ / (1 + δ)^{1+δ} ≤ e^{−δ²/3},

or that

f(δ) = δ − (1 + δ) ln(1 + δ) + δ²/3 ≤ 0

in that interval. Computing the derivatives of f(δ) we get

f′(δ) = 1 − (1 + δ)/(1 + δ) − ln(1 + δ) + (2/3)δ = −ln(1 + δ) + (2/3)δ,

f″(δ) = −1/(1 + δ) + 2/3.

f″(δ) < 0 for 0 ≤ δ < 1/2, and f″(δ) > 0 for δ > 1/2. Thus f′(δ) first decreases and then increases over the interval [0, 1]. Since f′(0) = 0 and f′(1) < 0, f′(δ) ≤ 0 in the interval [0, 1]. Since f(0) = 0, we have that f(δ) ≤ 0 in that interval.
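The inequality can also be confirmed numerically on a grid over (0, 1):

```python
import math

# Verify e^d / (1+d)^(1+d) <= e^(-d^2/3) for d on a grid over (0, 1).
for i in range(1, 100):
    d = i / 100
    assert math.exp(d) / (1 + d) ** (1 + d) <= math.exp(-d * d / 3) + 1e-15
```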


For R ≥ 6μ, set δ = R/μ − 1 ≥ 5. Then

Pr(X ≥ R) = Pr(X ≥ (1 + δ)μ) ≤ (e^δ / (1 + δ)^{1+δ})^μ ≤ (e / (1 + δ))^{(1+δ)μ} ≤ (e/6)^R ≤ 2^{−R}.


    Theorem

Let X1,..., Xn be independent Bernoulli random variables such that Pr(Xi = 1) = pi. Let X = Σ_{i=1}^n Xi and μ = E[X]. For 0 < δ < 1,

Pr(X ≤ (1 − δ)μ) ≤ (e^{−δ} / (1 − δ)^{1−δ})^μ ≤ e^{−μδ²/2}.


We need to show:

f(δ) = −δ − (1 − δ) ln(1 − δ) + δ²/2 ≤ 0.

Differentiating f(δ) we get

f′(δ) = ln(1 − δ) + δ,

f″(δ) = −1/(1 − δ) + 1.

f′(0) = 0, and since f″(δ) ≤ 0 in the range [0, 1), f′(δ) is monotonically decreasing in that interval. Hence f′(δ) ≤ 0 there, and since f(0) = 0, f(δ) ≤ 0 in [0, 1).


    Example: Coin flips

Let X be the number of heads in a sequence of n independent fair coin flips. Then μ = n/2, and

Pr(|X − n/2| ≥ (1/2)√(4n ln n))

= Pr(X ≥ (n/2)(1 + √(4 ln n / n))) + Pr(X ≤ (n/2)(1 − √(4 ln n / n)))

≤ e^{−(1/3)(n/2)(4 ln n / n)} + e^{−(1/2)(n/2)(4 ln n / n)}

= n^{−2/3} + n^{−1} ≤ 2/n^{2/3}.
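A Monte Carlo sketch of this example: deviations of (1/2)√(4n ln n) or more should occur with probability at most 2/n^{2/3}, since the two exponential terms evaluate to n^{−2/3} and n^{−1}. Here n = 500 is an arbitrary choice:

```python
import math, random

# Fraction of runs in which |X - n/2| reaches 0.5*sqrt(4 n ln n), vs the bound.
random.seed(5)
n, trials = 500, 2000
dev = 0.5 * math.sqrt(4 * n * math.log(n))
hits = sum(abs(sum(random.getrandbits(1) for _ in range(n)) - n / 2) >= dev
           for _ in range(trials))
assert hits / trials <= 2 / n ** (2 / 3)
```

The threshold is about five standard deviations here, so essentially no run should reach it.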


Using the Chebyshev bound we had:

Pr(|X − n/2| ≥ n/4) ≤ 4/n.

Using the Chernoff bound in this case, we obtain

Pr(|X − n/2| ≥ n/4) = Pr(X ≥ (n/2)(1 + 1/2)) + Pr(X ≤ (n/2)(1 − 1/2))

≤ e^{−(1/3)(n/2)(1/4)} + e^{−(1/2)(n/2)(1/4)}

≤ 2e^{−n/24}.


    Example: Estimating a Parameter

Evaluating the probability that a particular gene mutation occurs in the population.

Given a DNA sample, a lab test can determine if it carries the mutation.

The test is expensive and we would like to obtain a relatively reliable estimate from a minimum number of samples. Let p be the unknown value, n the number of samples, of which p̃n had the mutation.

Given a sufficient number of samples we expect the value p to be in the neighborhood of the sampled value p̃, but we cannot predict any single value with high confidence.


    Confidence Interval

Instead of predicting a single value for the parameter we give an interval that is likely to contain the parameter.

Definition

A 1 − q confidence interval for a parameter p is an interval [p̃ − δ, p̃ + δ] such that

Pr(p ∈ [p̃ − δ, p̃ + δ]) ≥ 1 − q.

We want to minimize 2δ and q, with minimum n. Using p̃ as our estimate for p, we need to compute δ and q such that

Pr(p ∈ [p̃ − δ, p̃ + δ]) = Pr(np ∈ [n(p̃ − δ), n(p̃ + δ)]) ≥ 1 − q.


The random variable here is the interval [p̃ − δ, p̃ + δ] (or the value p̃), while p is a fixed (unknown) value.

np̃ has a binomial distribution with parameters n and p, and E[p̃] = p. If p ∉ [p̃ − δ, p̃ + δ] then we have one of the following two events:

1. If p < p̃ − δ, then np̃ > n(p + δ) = np(1 + δ/p), i.e. np̃ is larger than its expectation by a (1 + δ/p) factor.

2. If p > p̃ + δ, then np̃ < n(p − δ) = np(1 − δ/p), and np̃ is smaller than its expectation by a (1 − δ/p) factor.


Pr(p ∉ [p̃ − δ, p̃ + δ])

= Pr(np̃ ≤ np(1 − δ/p)) + Pr(np̃ ≥ np(1 + δ/p))

≤ e^{−(1/2)np(δ/p)²} + e^{−(1/3)np(δ/p)²}

= e^{−nδ²/(2p)} + e^{−nδ²/(3p)}.

But the value of p is unknown. A simple solution is to use the fact that p ≤ 1 to prove

Pr(p ∉ [p̃ − δ, p̃ + δ]) ≤ e^{−nδ²/2} + e^{−nδ²/3}.

Setting q = e^{−nδ²/2} + e^{−nδ²/3}, we obtain a tradeoff between δ, n and the error probability q.
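The tradeoff can be turned into a sample-size rule. Since q = e^{−nδ²/2} + e^{−nδ²/3} ≤ 2e^{−nδ²/3}, taking n ≥ 3 ln(2/q)/δ² suffices (this conservative reading of the tradeoff is my own rearrangement, not a formula from the slides):

```python
import math

# Conservative sample size from q <= 2*exp(-n*delta^2/3): n >= 3*ln(2/q)/delta^2.
def samples_needed(delta, q):
    return math.ceil(3 * math.log(2 / q) / delta ** 2)

delta, q_target = 0.05, 0.05
n = samples_needed(delta, q_target)
q_actual = math.exp(-n * delta**2 / 2) + math.exp(-n * delta**2 / 3)
assert q_actual <= q_target  # the exact error probability meets the target
```

For δ = 0.05 and q = 0.05 this prescribes a few thousand samples.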


    Better Bound

The binomial probabilities are monotone increasing up to the expectation, and then monotone decreasing.

Pr(p ∉ [p̃ − δ, p̃ + δ]) ≤ Pr(np̃ ≤ np(1 − δ/p)) + Pr(np̃ ≥ np(1 + δ/p))

≤ max_{p ≥ p̃+δ} e^{−n(p − p̃)²/(2p)} + max_{p ≤ p̃−δ} e^{−n(p̃ − p)²/(3p)}

≤ e^{−nδ²/(2(p̃+δ))} + e^{−nδ²/(3(p̃−δ))}.

Setting

q = e^{−nδ²/(2(p̃+δ))} + e^{−nδ²/(3(p̃−δ))}

gives a tighter tradeoff between δ, n and q.


    Application: Set Balancing

Given an n × n matrix A with entries in {0, 1}, let

( a11 a12 ... a1n ) ( b1 )   ( c1 )
( a21 a22 ... a2n ) ( b2 )   ( c2 )
( ...         ... ) ( .. ) = ( .. )
( an1 an2 ... ann ) ( bn )   ( cn )

Find a vector b with entries in {−1, 1} that minimizes

||Ab||∞ = max_{i=1,...,n} |ci|.
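A small simulation makes the point of the theorems that follow: a uniformly random ±1 vector b already achieves small discrepancy. The sketch below draws a random 0/1 matrix and checks the discrepancy against the √(4n ln n) threshold proved later in the deck (n = 200 and the random instance are illustrative assumptions):

```python
import math, random

# Discrepancy max_i |c_i| of a random +-1 coloring of a random 0/1 matrix.
random.seed(3)
n = 200
A = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
b = [random.choice((-1, 1)) for _ in range(n)]
disc = max(abs(sum(a * x for a, x in zip(row, b))) for row in A)
assert disc <= math.sqrt(4 * n * math.log(n))  # fails with probability O(1/n)
```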


    Theorem

For a random vector b, with entries chosen independently and with equal probability from the set {−1, 1},

Pr(||Ab||∞ ≥ √(12 n ln n)) ≤ 4/n.


Consider the i-th row a_i = (a_{i,1},..., a_{i,n}). Let k be the number of 1s in that row.

If k ≤ √(12 n ln n), clearly |a_i · b| ≤ √(12 n ln n). If k > √(12 n ln n), let

Xi = |{j | a_{i,j} = 1 and bj = 1}|

and

Yi = |{j | a_{i,j} = 1 and bj = −1}|.

Thus, Xi counts the number of +1s in the sum Σ_{j=1}^n a_{i,j} bj, Yi counts the number of −1s, and Xi + Yi = k.


If |Xi − Yi| ≥ √(12 n ln n) then |Xi − (k − Xi)| ≥ √(12 n ln n), which implies

Xi ≥ (k/2)(1 + √(12 n ln n)/k)  or  Xi ≤ (k/2)(1 − √(12 n ln n)/k).


Using the Chernoff bounds, and the fact that k ≤ n,

Pr(Xi ≥ (k/2)(1 + √(12 n ln n)/k)) ≤ e^{−(k/2)(1/3)(12 n ln n / k²)} ≤ e^{−2 ln n},

Pr(Xi ≤ (k/2)(1 − √(12 n ln n)/k)) ≤ e^{−(k/2)(1/2)(12 n ln n / k²)} ≤ e^{−3 ln n}.

Hence, for a given row,

Pr(|Xi − Yi| ≥ √(12 n ln n)) ≤ n^{−2} + n^{−3} ≤ 2/n².

Since there are n rows, the probability that any row exceeds that bound is bounded by 2/n.

Chernoff Bound for Sum of {−1, +1} Random Variables



    Theorem

Let X1,..., Xn be independent random variables with

Pr(Xi = 1) = Pr(Xi = −1) = 1/2.

Let X = Σ_{i=1}^n Xi. For any a > 0,

Pr(X ≥ a) ≤ e^{−a²/2n}.


For any t > 0,

E[e^{tXi}] = (1/2)e^t + (1/2)e^{−t}.

Since

e^t = 1 + t + t²/2! + ... + t^i/i! + ...

and

e^{−t} = 1 − t + t²/2! + ... + (−1)^i t^i/i! + ...,

we have

E[e^{tXi}] = (1/2)e^t + (1/2)e^{−t} = Σ_{i≥0} t^{2i}/(2i)! ≤ Σ_{i≥0} (t²/2)^i / i! = e^{t²/2},

using (2i)! ≥ 2^i i!.
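The inequality just derived says cosh(t) ≤ e^{t²/2}, which is easy to spot-check on a grid:

```python
import math

# Verify E[e^{tX_i}] = cosh(t) <= e^{t^2/2} on a grid of t values.
for i in range(-50, 51):
    t = i / 10
    assert math.cosh(t) <= math.exp(t * t / 2) + 1e-12
```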


E[e^{tX}] = Π_{i=1}^n E[e^{tXi}] ≤ e^{nt²/2},

so

Pr(X ≥ a) = Pr(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}] / e^{ta} ≤ e^{nt²/2 − ta}.

Setting t = a/n yields

Pr(X ≥ a) ≤ e^{−a²/2n}.


    By symmetry we also have

    Corollary

Let X1,..., Xn be independent random variables with

Pr(Xi = 1) = Pr(Xi = −1) = 1/2.

Let X = Σ_{i=1}^n Xi. Then for any a > 0,

Pr(|X| > a) ≤ 2e^{−a²/2n}.
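A Monte Carlo sketch of the corollary for a random ±1 sum; n = 100 and a = 30 are illustrative parameters, not from the slides:

```python
import math, random

# Empirical two-sided tail of a +-1 random walk vs the bound 2*e^{-a^2/2n}.
random.seed(11)
n, a, trials = 100, 30, 5000
hits = sum(abs(sum(random.choice((-1, 1)) for _ in range(n))) > a
           for _ in range(trials))
assert hits / trials <= 2 * math.exp(-a * a / (2 * n))
```

Here a = 30 is three standard deviations, so the empirical tail (well under 1%) sits comfortably below the bound of about 2.2%.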

    Application: Set Balancing Revisited



    Theorem

For a random vector b, with entries chosen independently and with equal probability from the set {−1, 1},

Pr(||Ab||∞ ≥ √(4 n ln n)) ≤ 2/n.

Consider the i-th row a_i = (a_{i,1},..., a_{i,n}). Let k be the number of 1s in that row, with indices i_1,..., i_k. Then

Zi = Σ_{j=1}^k a_{i,i_j} b_{i_j} = Σ_{j=1}^k b_{i_j}.

If k ≤ √(4 n ln n) then clearly Zi satisfies the bound.


If k > √(4 n ln n), the k non-zero terms in the sum Zi are independent random variables, each with probability 1/2 of being either +1 or −1. Using the Chernoff bound:

Pr(|Zi| > √(4 n ln n)) ≤ 2e^{−4n ln n / 2k} ≤ 2/n²,

where we use the fact that n ≥ k.

    Packet Routing on Parallel Computer


Communication network:

Nodes: processors and switching nodes. Edges: communication links.


The n-cube: N = 2^n nodes. Let x = (x1,..., xn) be the binary representation of node x. Nodes x and y are connected by an edge iff their binary representations differ in exactly one bit.

Bit-fixing routing: correct bit i in the i-th transition; a route has length at most n.


A permutation communication request: each node is the source and destination of exactly one packet. Up to one packet can cross an edge per step, and each packet can cross up to one edge per step.

What is the time to route an arbitrary permutation on the n-cube?


Two-phase routing algorithm:

1. Send the packet to a randomly chosen intermediate destination.

2. Send the packet from the random intermediate destination to its real destination.

Path: correct the bits in order, from x_0 to x_{n−1}. Any greedy queuing method: if some packet can traverse an edge, one does.
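The bit-fixing path used in each phase can be sketched in a few lines; the function name and the convention of fixing the lowest-order bit first are illustrative choices:

```python
# Bit-fixing path on the n-cube: flip the bits of the current node to match the
# destination, lowest index first; each flip is one hypercube edge.
def bit_fixing_path(src: int, dst: int, n: int) -> list[int]:
    path, cur = [src], src
    for i in range(n):            # correct bit i in the i-th transition
        if (cur ^ dst) >> i & 1:  # bit i still differs from the destination
            cur ^= 1 << i         # traverse the edge that flips bit i
            path.append(cur)
    return path

# Example: route from 0b000 to 0b101 on the 3-cube.
print(bit_fixing_path(0b000, 0b101, 3))  # [0, 1, 5]
```

Consecutive nodes on the returned path differ in exactly one bit, so the path length equals the number of differing bits, at most n.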


    Theorem

The two-phase routing algorithm routes an arbitrary permutation on the n-cube in O(log N) = O(n) parallel steps with high probability.

We focus first on phase 1. We bound the routing time of a given packet M.

Let e1,..., em be the m ≤ n edges traversed by the packet M in phase 1.

Let X(e) be the total number of packets that traverse edge e in that phase. Let T(M) be the number of steps until M finishes phase 1.


    Lemma

T(M) ≤ Σ_{i=1}^m X(ei).

We call any path P = (e1, e2,..., em) of m ≤ n edges that follows the bit-fixing algorithm a possible packet path. We denote the corresponding nodes v0, v1,..., vm, with ei = (v_{i−1}, vi).

For any possible packet path P, let T(P) = Σ_{i=1}^m X(ei).

If phase 1 takes more than T steps, then for some possible


packet path P,

T(P) ≥ T.

There are at most 2^n · 2^n = 2^{2n} possible packet paths. Assume that ek connects (a1,..., ai,..., an) to (a1,..., āi,..., an). Only packets that started in an address

(*,..., *, ai,..., an)

can traverse edge ek, and only if their destination addresses are

(a1,..., a_{i−1}, āi, *,..., *).

There are 2^{i−1} possible packets, each with probability 2^{−i} to traverse ek.


Hence

E[X(ek)] ≤ 2^{i−1} · 2^{−i} = 1/2,

and

E[T(P)] = Σ_{i=1}^m E[X(ei)] ≤ m/2 ≤ n/2.

Problem: the X(ei)'s are not independent.

A packet is active with respect to possible packet path P if it


ever uses an edge of P.

For k = 1,..., N, let Hk = 1 if the packet starting at node k is active, and Hk = 0 otherwise.

The Hk are independent, since each Hk depends only on the choice of the intermediate destination of the packet starting at node k, and these choices are independent for all packets.

Let H = Σ_{k=1}^N Hk be the total number of active packets. Since every active packet traverses at least one edge of P,

E[H] ≤ E[T(P)] ≤ n.

Since H is the sum of independent 0-1 random variables, we can apply the Chernoff bound, with 6n ≥ 6E[H]:

Pr(H ≥ 6n) ≤ 2^{−6n}.


For a given possible packet path P,

Pr(T(P) ≥ 36n) ≤ Pr(H ≥ 6n) + Pr(T(P) ≥ 36n | H < 6n) ≤ 2^{−6n} + Pr(T(P) ≥ 36n | H < 6n).



    Lemma

If a packet leaves a path (of another packet) it cannot return to that path in the same phase.

Proof.

Leaving a path at the i-th transition implies a different i-th bit; this bit cannot be changed again in that phase.

Lemma

The number of transitions that a packet takes on a given path is distributed G(1/2).

Proof.

The packet has probability 1/2 of leaving the path in each transition.

The probability that the active packets cross edges of P more than


36n times is less than the probability that a fair coin flipped 36n times comes up heads fewer than 6n times. Letting Z be the number of heads in 36n fair coin flips (so E[Z] = 18n), we now apply the Chernoff bound with δ = 2/3:

Pr(T(P) ≥ 36n | H ≤ 6n) ≤ Pr(Z ≤ 6n) ≤ e^{−18n(2/3)²/2} = e^{−4n} ≤ 2^{−3n−1}.

Hence

Pr(T(P) ≥ 36n) ≤ Pr(H ≥ 6n) + Pr(T(P) ≥ 36n | H ≤ 6n) ≤ 2^{−6n} + 2^{−3n−1} ≤ 2^{−3n}.


As there are at most 2^{2n} possible packet paths in the hypercube, the probability that there is any possible packet path for which T(P) ≥ 36n is bounded by

2^{2n} · 2^{−3n} = 2^{−n} = O(N^{−1}).


The proof of phase 2 is by symmetry:

The proof of phase 1 argued about the number of packets crossing a given path, with no timing considerations.

The paths from one packet per node to random locations are similar to the paths from random locations to one packet per node, in reverse order.

Thus, the distribution of the number of packets that cross a path of a given packet is the same.

    Oblivious Routing


    Definition

A routing algorithm is oblivious if the path taken by one packet is independent of the sources and destinations of any other packets in the system.

Theorem

Given an N-node network with maximum degree d, the routing time of any deterministic oblivious routing scheme is Ω(√(N/d³)).