Ryan O’Donnell - Microsoft Mike Saks - Rutgers Oded Schramm - Microsoft Rocco Servedio - Columbia

Page 1:

Ryan O’Donnell - Microsoft
Mike Saks - Rutgers
Oded Schramm - Microsoft
Rocco Servedio - Columbia

Page 2:

Part I: Decision trees have large influences

Page 3:

[Figure: a printer-troubleshooting decision tree. Questions include "Does anything print?", "Can print from Notepad?", "Right size paper?", "Printer mis-setup?", "File too complicated?", "Network printer?", and "Driver OK?"; leaves are labeled "Solved" or "Call tech support".]

Printer troubleshooter

Page 4:

f : {Attr_1} × {Attr_2} × ··· × {Attr_n} → {−1,1}.

What’s the “best” DT for f, and how to find it?

Depth = worst case # of questions.

Expected depth = avg. # of questions.

Decision tree complexity
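To make the two measures concrete, here is a minimal sketch (my own illustration, not from the talk) representing a Boolean DT as nested tuples and computing its depth D and expected depth δ under the uniform distribution; `maj3` is the tree drawn on a later slide.

```python
# A Boolean decision tree as nested tuples (illustrative sketch).
# A leaf is +1 or -1; an internal node (j, lo, hi) means:
# "query x_j; follow lo if x_j = -1, hi if x_j = +1".

def depth(t):
    """Worst-case number of queries, D."""
    if t in (-1, 1):                      # leaf: no more queries
        return 0
    _, lo, hi = t
    return 1 + max(depth(lo), depth(hi))

def expected_depth(t):
    """Average number of queries, delta, under a uniformly random input:
    each branch of every node is taken with probability 1/2."""
    if t in (-1, 1):
        return 0.0
    _, lo, hi = t
    return 1.0 + (expected_depth(lo) + expected_depth(hi)) / 2

# Maj3: x3 is queried only on the paths where x1 and x2 disagree.
maj3 = (1, (2, -1, (3, -1, 1)),
           (2, (3, -1, 1), 1))

print(depth(maj3))           # 3
print(expected_depth(maj3))  # 2.5
```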

Page 5:

1. Identify the most ‘influential’/‘decisive’/‘relevant’ variable.

2. Put it at the root.

3. Recursively build DTs for its children.

Almost all real-world learning algs based on this – CART, C4.5, …

Almost no theoretical (PAC-style) learning algs based on this –

[Blum92, KM93, BBVKV97, PTF-folklore, OS04] – no;

[EH89, SJ03] – sorta.

Conj’d to be good for some problems (e.g., percolation [SS04]) but unprovable…

Building decision trees
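The three-step recipe above, as a toy brute-force sketch (my own code — real learners like CART and C4.5 split on impurity scores over data, not exact influences):

```python
from itertools import product

# Toy greedy tree-builder: split on the most influential variable over the
# current subcube, then recurse. f takes a dict {variable index: +-1};
# exponential time, fine only for tiny n.

def influences(f, fixed, free):
    infl = {}
    for j in free:
        rest = [k for k in free if k != j]
        flips = 0
        for bits in product((-1, 1), repeat=len(rest)):
            x = dict(fixed, **dict(zip(rest, bits)))
            x[j] = -1; lo = f(x)
            x[j] = +1; hi = f(x)
            flips += (lo != hi)
        infl[j] = flips / 2 ** len(rest)     # Pr[flipping x_j changes f]
    return infl

def build(f, fixed, free):
    values = {f(dict(fixed, **dict(zip(free, bits))))
              for bits in product((-1, 1), repeat=len(free))}
    if len(values) == 1:                     # constant on this subcube: leaf
        return values.pop()
    infl = influences(f, fixed, free)
    j = max(infl, key=infl.get)              # most influential variable at root
    rest = [k for k in free if k != j]
    return (j, build(f, {**fixed, j: -1}, rest),
               build(f, {**fixed, j: +1}, rest))

maj3 = lambda x: 1 if x[1] + x[2] + x[3] > 0 else -1
print(build(maj3, {}, [1, 2, 3]))
# -> (1, (2, -1, (3, -1, 1)), (2, (3, -1, 1), 1))
```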

Page 6:

Boolean DTs

f : {−1,1}^n → {−1,1}.

D(f) = min depth of a DT for f.

0 ≤ D(f) ≤ n.

[Figure: a depth-3 decision tree computing Maj3 — the root queries x1, its children query x2, and x3 is queried only when x1 and x2 disagree; leaves are labeled −1 and 1.]

Maj3

Page 7:

Boolean DTs

• {−1,1}^n is viewed as a probability space, with the uniform probability distribution.

• A uniformly random path down a DT, plus a uniformly random setting of the unqueried variables, defines a uniformly random input.

• expected depth: δ(f).

Page 8:

Influences

influence of coordinate j on f = the probability that x_j is relevant for f:

I_j(f) = Pr[ f(x) ≠ f(x^{⊕j}) ],

where x^{⊕j} denotes x with its j-th coordinate flipped.

0 ≤ I_j(f) ≤ 1.
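A Monte Carlo estimate of this definition (a sketch; `f` maps a tuple in {−1,1}^n to {−1,1}, coordinates 0-indexed):

```python
import random

# Estimate I_j(f) = Pr[f(x) != f(x with coordinate j flipped)] by sampling.

def influence(f, n, j, samples=100_000):
    hits = 0
    for _ in range(samples):
        x = tuple(random.choice((-1, 1)) for _ in range(n))
        y = x[:j] + (-x[j],) + x[j + 1:]   # flip coordinate j
        hits += (f(x) != f(y))
    return hits / samples

maj3 = lambda x: 1 if sum(x) > 0 else -1
print(influence(maj3, 3, 0))   # ~0.5: a bit of Maj3 matters iff the other two disagree
```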

Page 9:

Main question:

If a function f has a “shallow” decision tree, does it have a variable with “significant” influence?

Page 10:

Main question:

No.

But for a silly reason:

Suppose f is highly biased; say Pr[f = 1] = p ≪ 1.

Then for any j,

I_j(f) = Pr[f(x) = 1, f(x^{⊕j}) = −1] + Pr[f(x) = −1, f(x^{⊕j}) = 1]

≤ Pr[f(x) = 1] + Pr[f(x^{⊕j}) = 1]

= p + p = 2p.

Page 11:

Variance

⇒ Influences are always at most 2 min{p,q} (writing q = 1 − p).

An analytically nicer expression: Var[f].

• Var[f] = E[f²] − E[f]² = 1 − (p − q)² = 1 − (2p − 1)² = 4p(1 − p) = 4pq.

• 2 min{p,q} ≤ 4pq ≤ 4 min{p,q}.

• It’s 1 for balanced functions.

So I_j(f) ≤ Var[f], and it is fair to say I_j(f) is “significant” if it is a significant fraction of Var[f].
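Spelling out the middle bullet (my addition): write m = min{p,q} and M = max{p,q}, so m + M = 1 and ½ ≤ M ≤ 1; then

```latex
4pq \;=\; 4mM, \qquad \tfrac12 \le M \le 1
\quad\Longrightarrow\quad
2m \;\le\; 4mM \;\le\; 4m,
```

i.e., 2 min{p,q} ≤ 4pq ≤ 4 min{p,q}.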

Page 12:

Main question:

If a function f has a “shallow” decision tree, does it have a variable with influence at least a “significant” fraction of Var[f]?

Page 13:

Notation

τ(d) = min_{f : D(f) ≤ d} max_j { I_j(f) / Var[f] }.

Page 14:

Known lower bounds

Suppose f : {−1,1}^n → {−1,1}.

• An elementary old inequality states Var[f] ≤ Σ_{j=1}^n I_j(f). Thus f has a variable with influence at least Var[f]/n.

• A deep inequality of [KKL88] shows there is always a coordinate j such that I_j(f) ≥ Var[f] · Ω(log n / n).

If D(f) = d then f really has at most 2^d variables. Hence we get τ(d) ≥ 1/2^d from the first, and τ(d) ≥ Ω(d/2^d) from KKL.
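To spell out the first deduction (a short step not written on the slide): a depth-d tree has at most 2^d − 1 internal nodes, hence at most 2^d relevant variables, so

```latex
\mathrm{Var}[f] \;\le\; \sum_{j} I_j(f) \;\le\; 2^{d}\,\max_{j} I_j(f)
\quad\Longrightarrow\quad
\max_{j} \frac{I_j(f)}{\mathrm{Var}[f]} \;\ge\; 2^{-d},
```

and the KKL bound is the same computation with the Poincaré inequality replaced by I_j(f) ≥ Var[f] · Ω(log n / n) at n ≤ 2^d.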

Page 15:

Our result

τ(d) ≥ 1/d.

This is tight: consider the function “SEL” below.

Var[SEL] = 1, d = 2, and all three variables have influence ½.

(Forming the recursive version, SEL(SEL, SEL, SEL) etc., gives a variance-1 function with d = 2^h and all influences 2^{−h}, for any h.)

[Figure: the tree for SEL — the root queries x1; one child queries x2 (with leaves −1 and 1), the other queries x3 (with leaves 1 and −1).]

“SEL”
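A brute-force check of the example over all 8 inputs (my own sketch; I read SEL as "x1 selects which of x2, x3 to output" — the leaf signs in the figure don't change any of the computed quantities):

```python
from itertools import product

# Verify: SEL is balanced (Var = 1) and every coordinate has influence 1/2.

def sel(x):                      # x = (x1, x2, x3); x1 selects x2 or x3
    return x[1] if x[0] == -1 else x[2]

inputs = list(product((-1, 1), repeat=3))
mean = sum(sel(x) for x in inputs) / len(inputs)
print("Var =", 1 - mean ** 2)    # 1.0, since E[f^2] = 1 and E[f] = 0

for j in range(3):
    flipped = lambda x: x[:j] + (-x[j],) + x[j + 1:]
    infl = sum(sel(x) != sel(flipped(x)) for x in inputs) / len(inputs)
    print(f"I_{j + 1} =", infl)  # 0.5 for each of the three coordinates
```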

Page 16:

Our actual main theorem

Given a decision tree computing f, let δ_j(f) = Pr[the tree queries x_j].

Then

Var[f] ≤ Σ_{j=1}^n δ_j(f) I_j(f).

Cor: Fix the tree with the smallest expected depth. Then

Σ_{j=1}^n δ_j(f) = E[depth of a path] =: δ(f) ≤ D(f)

⇒ Var[f] ≤ (max_j I_j) · Σ_{j=1}^n δ_j = (max_j I_j) · δ(f)

⇒ max_j I_j ≥ Var[f] / δ(f) ≥ Var[f] / D(f).
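The theorem can be sanity-checked by brute force on any small tree (my own sketch, reusing the nested-tuple representation from the earlier block):

```python
from itertools import product

# Check Var[f] <= sum_j delta_j(f) * I_j(f) exactly for a given tree.

def evaluate(tree, x, queried):
    """Follow the tree on input x, recording which variables get queried."""
    if tree in (-1, 1):
        return tree
    j, lo, hi = tree
    queried.add(j)
    return evaluate(lo if x[j - 1] == -1 else hi, x, queried)

def check(tree, n):
    inputs = list(product((-1, 1), repeat=n))
    f, delta = {}, [0.0] * n
    for x in inputs:
        q = set()
        f[x] = evaluate(tree, x, q)
        for j in q:
            delta[j - 1] += 1 / len(inputs)        # delta_j = Pr[x_j queried]
    infl = [sum(f[x] != f[x[:j] + (-x[j],) + x[j+1:]] for x in inputs)
            / len(inputs) for j in range(n)]
    var = 1 - (sum(f.values()) / len(inputs)) ** 2   # E[f^2] = 1
    rhs = sum(d * i for d, i in zip(delta, infl))
    print(f"Var = {var}, RHS = {rhs}, inequality holds: {var <= rhs + 1e-12}")

maj3 = (1, (2, -1, (3, -1, 1)), (2, (3, -1, 1), 1))
check(maj3, 3)   # Var = 1.0 <= 1*(1/2) + 1*(1/2) + (1/2)*(1/2) = 1.25
```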

Page 17:

Proof

Pick a random path in the tree. This gives some set of variables, P = (x_{J_1}, …, x_{J_T}), along with an assignment to them, β_P.

Call the remaining set of variables P̄, and pick a random assignment β_P̄ for them too.

Let X be the uniformly random string given by combining these two assignments, (β_P, β_P̄).

Also, define J_{T+1}, …, J_n = ⊥.

Page 18:

Proof (cont.)

Let β′_P be an independent random assignment to the variables in P. Let Z = (β′_P, β_P̄).

Note: Z is also uniformly random.

[Figure: a path in the tree with x_{J_1} = −1, x_{J_2} = 1, x_{J_3} = −1, …, x_{J_T} = 1. Below it, the strings X = (−1, 1, −1, …, 1 | 1, −1, 1, −1) and Z = (1, −1, −1, …, −1 | 1, −1, 1, −1), with coordinates ordered J_1, J_2, …, J_T and then J_{T+1} = ··· = J_n = ⊥; X and Z share the P̄-part (1, −1, 1, −1) and have independent P-parts.]

Page 19:

Proof (cont.)

Finally, for t = 0…T, let Y_t be the same string as X, except that Z’s assignments (β′_P) to the variables x_{J_1}, …, x_{J_t} are swapped in.

Note: Y_0 = X, Y_T = Z.

Y_0 = X = (−1,  1, −1, …,  1,  1, −1, 1, −1)
Y_1 =     ( 1,  1, −1, …,  1,  1, −1, 1, −1)
Y_2 =     ( 1, −1, −1, …,  1,  1, −1, 1, −1)
 ···
Y_T = Z = ( 1, −1, −1, …, −1,  1, −1, 1, −1)

Also define Y_{T+1} = ··· = Y_n = Z.

Page 20:

Var[f] = E[f²] − E[f]²

= E[ f(X)f(X) ] − E[ f(X)f(Z) ]

= E[ f(X)f(Y_0) ] − E[ f(X)f(Y_n) ]

= Σ_{t=1}^n E[ f(X)(f(Y_{t−1}) − f(Y_t)) ]

≤ Σ_{t=1}^n E[ |f(Y_{t−1}) − f(Y_t)| ]

= Σ_{t=1}^n 2 Pr[f(Y_{t−1}) ≠ f(Y_t)]

= Σ_{t=1}^n Σ_{j=1}^n Pr[J_t = j] · 2 Pr[f(Y_{t−1}) ≠ f(Y_t) | J_t = j].

(Here E[f(X)f(Z)] = E[f]² because f(X) is determined by the path, and conditioned on the path Z is still uniform; the inequality uses |f(X)| = 1.)

Page 21:

Proof (cont.)

… = Σ_{j=1}^n Σ_{t=1}^n Pr[J_t = j] · 2 Pr[f(Y_{t−1}) ≠ f(Y_t) | J_t = j]

Utterly Crucial Observation:

Conditioned on J_t = j, (Y_{t−1}, Y_t) are jointly distributed exactly as (W, W′), where W is uniformly random, and W′ is W with its j-th bit rerandomized. (The two hybrids differ only in coordinate J_t: Y_{t−1} carries the path’s bit there, Y_t the independent bit from β′_P.)

Page 22:

[Figure repeated from the two previous slides: the random path with the strings X and Z, and the hybrid chain Y_0 = X, Y_1, Y_2, …, Y_T = Z.]

Page 23:

Proof (cont.)

… = Σ_{j=1}^n Σ_{t=1}^n Pr[J_t = j] · 2 Pr[f(Y_{t−1}) ≠ f(Y_t) | J_t = j]

= Σ_{j=1}^n Σ_{t=1}^n Pr[J_t = j] · 2 Pr[f(W) ≠ f(W′)]

= Σ_{j=1}^n Σ_{t=1}^n Pr[J_t = j] · I_j(f)

(rerandomizing bit j flips it with probability ½, so 2 Pr[f(W) ≠ f(W′)] = I_j(f))

= Σ_{j=1}^n I_j · Σ_{t=1}^n Pr[J_t = j]

= Σ_{j=1}^n I_j δ_j

(the events {J_t = j} are disjoint over t, so Σ_t Pr[J_t = j] = Pr[the tree queries x_j] = δ_j). ∎

Page 24:

Part II: Lower bounds for monotone graph properties

Page 25:

Monotone graph properties

Consider graphs on v vertices; let n = (v choose 2).

“Nontrivial monotone graph property”:

• “nontrivial property”: a (nonempty, nonfull) subset of all v-vertex graphs

• “graph property”: closed under permutations of the vertices (so no edge is ‘distinguished’)

• monotone: adding edges can only put you into the property, not take you out

e.g.: Contains-A-Triangle, Connected, Has-Hamiltonian-Path, Non-Planar, Has-at-least-n/2-edges, …

Page 26:

Aanderaa-Karp-Rosenberg conj.

Every nontrivial monotone graph property has D(f) = n.

[Rivest-Vuillemin-75]: D(f) ≥ v²/16.

[Kleitman-Kwiatkowski-80]: D(f) ≥ v²/9.

[Kahn-Saks-Sturtevant-84]: D(f) ≥ n/2 in general; D(f) = n if v is a prime power. [Topology + group theory!]

[Yao-88]: D(f) = n in the bipartite case.

Page 27:

Randomized DTs

• Have ‘coin flip’ nodes in the tree that cost nothing.

• Or, a probability distribution over deterministic DTs.

Note: We want both zero error and worst-case inputs.

R(f) = min, over randomized DTs that compute f with zero error, of the max over inputs x of the expected # of queries. The expectation is only over the DT’s internal coins.

Page 28:

D(Maj3) = 3.

Maj3: Pick two of the three variables at random and check whether they agree. If not, query the third.

R(Maj3) ≤ 8/3.

Let f = recursive-Maj3 [Maj3(Maj3, Maj3, Maj3), etc.]. For the depth-h version (n = 3^h):

D(f) = 3^h.

R(f) ≤ (8/3)^h.

(Not best possible…!)
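A quick simulation of the strategy (my own sketch), confirming zero error and the worst-case expected query count:

```python
import random
from itertools import product

# Randomized Maj3: query two random variables; if they agree, they are the
# majority (so zero error); otherwise the third variable is the majority.

def n_queries(x):
    i, j = random.sample(range(3), 2)
    return 2 if x[i] == x[j] else 3

for x in product((-1, 1), repeat=3):
    avg = sum(n_queries(x) for _ in range(100_000)) / 100_000
    print(x, round(avg, 2))   # 2.0 on unanimous inputs, ~2.67 = 8/3 otherwise
```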

Page 29:

Randomized AKR / Yao conj.

Yao conjectured in ’77 that every nontrivial monotone graph property f has R(f) ≥ Ω(v²).

Lower bound Ω(·)           Who
v                          [Yao-77]
v·log^{1/12} v             [Yao-87]
v^{5/4}                    [King-88]
v^{4/3}                    [Hajnal-91]
v^{4/3}·log^{1/3} v        [Chakrabarti-Khot-01]
min{ v/p, v²/log v }       [Friedgut-Kahn-Wigderson-02]
v^{4/3} / p^{1/3}          [us]

Page 30:

Outline

• Extend the main inequality to the p-biased case. (Then the LHS is 1.)

• Use Yao’s minmax principle: show that under the p-biased distribution on {−1,1}^n, δ = Σ_j δ_j = avg # of queries is large for any tree.

• Main inequality: max influence is small ⇒ δ is large.

• Graph property ⇒ all variables have the same influence.

• Hence: sum of influences is small ⇒ δ is large.

• [OS04]: f monotone ⇒ sum of influences ≤ √δ.

• Hence: sum of influences is large ⇒ δ is large.

• So either way, δ is large.

Page 31:

Generalizing the inequality

Var[f] ≤ Σ_{j=1}^n δ_j(f) I_j(f).

Generalizations (which basically require no proof change):

• holds for randomized DTs

• holds for randomized “subcube partitions”

• holds for functions on any product probability space

f : Ω_1 × ··· × Ω_n → {−1,1}

(with notion of “influence” suitably generalized)

• holds for real-valued functions, with a (necessary) loss of a factor of at most √δ
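A sketch of what the product-space generalization looks like operationally (my own code and naming; I use the resampling form of influence, Pr[f(X) ≠ f(X′)] with coordinate j redrawn, which is one standard way to define influence off the uniform cube):

```python
import random

# Influence under the p-biased product measure, in resampling form.

def biased_bit(p):
    return 1 if random.random() < p else -1

def influence_p(f, n, j, p, samples=200_000):
    hits = 0
    for _ in range(samples):
        x = [biased_bit(p) for _ in range(n)]
        y = x[:]
        y[j] = biased_bit(p)        # rerandomize coordinate j only
        hits += (f(x) != f(y))
    return hits / samples

maj3 = lambda x: 1 if sum(x) > 0 else -1
print(influence_p(maj3, 3, 0, p=0.3))   # ~(2pq)^2 = 0.1764 for Maj3 at p = 0.3
```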

Page 32:

Closing thought

It’s funny that our bound gets stuck at roughly the same level as Hajnal / Chakrabarti-Khot: n^{2/3} = v^{4/3}.

Note that n^{2/3} [I believe] cannot be improved by more than a log factor merely for monotone transitive functions, due to [BSW04]. Thus to get better than v^{4/3} for monotone graph properties, you must use the fact that it’s a graph property.

Chakrabarti-Khot definitely does use the fact that it’s a graph property (all sorts of graph-packing lemmas).

Or do they? Since they get stuck at essentially v^{4/3}, I wonder if there’s any chance their result doesn’t truly need the fact that it’s a graph property…