Computational and Statistical Learning Theory
TTIC 31120
Prof. Nati Srebro
Lecture 4: MDL and PAC-Bayes
Uniform vs Non-Uniform Bias
• No Free Lunch: we need some “inductive bias”
• Limiting attention to hypothesis class ℋ: “flat” bias
• 𝑝(ℎ) = 1/|ℋ| for ℎ ∈ ℋ, and 𝑝(ℎ) = 0 otherwise
• Non-uniform bias: 𝑝 ℎ encodes bias
• Can use any 𝑝(ℎ) ≥ 0 s.t. Σ_ℎ 𝑝(ℎ) ≤ 1
• E.g. choose a prefix-disambiguous encoding 𝑑(ℎ) and use 𝑝(ℎ) = 2^(−|𝑑(ℎ)|)
• Or, choose 𝑐: 𝒰 → 𝒴^𝒳 over prefix-disambiguous programs 𝒰 ⊂ {0,1}*
and use 𝑝(ℎ) = 2^(−min{|𝜎| : 𝑐(𝜎) = ℎ})
• The choice of 𝑝(⋅), 𝑑(⋅) or 𝑐(⋅) encodes our expert knowledge / inductive bias
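To make the prefix-code example concrete, here is a minimal sketch (a hypothetical toy code, not from the lecture) checking prefix-disambiguity and the induced prior 𝑝(ℎ) = 2^(−|𝑑(ℎ)|); Kraft's inequality is what guarantees Σ_ℎ 𝑝(ℎ) ≤ 1 for any prefix code.

```python
# Minimal sketch (hypothetical toy example): a prefix-disambiguous code over four
# hypotheses and the induced prior p(h) = 2^(-|d(h)|).  Kraft's inequality
# guarantees sum_h p(h) <= 1 for any prefix code.

code = {            # d(h): no codeword is a prefix of another
    "h_const_plus":  "0",
    "h_const_minus": "10",
    "h_parity":      "110",
    "h_majority":    "111",
}

def is_prefix_free(code):
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

prior = {h: 2.0 ** (-len(w)) for h, w in code.items()}

assert is_prefix_free(code)
print(prior)                 # {'h_const_plus': 0.5, ...}
print(sum(prior.values()))   # <= 1 (here exactly 1.0)
```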
Minimum Description Length Learning
• Choose a “prior” 𝑝(ℎ) s.t. Σ_ℎ 𝑝(ℎ) ≤ 1 (or a description language 𝑑(⋅) or 𝑐(⋅))
• Minimum Description Length learning rule (based on the above prior / description language):
𝑀𝐷𝐿_𝑝(𝑆) = argmax_{ℎ: 𝐿_𝑆(ℎ)=0} 𝑝(ℎ) = argmin_{ℎ: 𝐿_𝑆(ℎ)=0} |𝑑(ℎ)|
• For any 𝒟, w.p. ≥ 1 − 𝛿,
𝐿_𝒟(𝑀𝐷𝐿_𝑝(𝑆)) ≤ inf_{ℎ s.t. 𝐿_𝒟(ℎ)=0} √( (−log 𝑝(ℎ) + log(2/𝛿)) / (2𝑚) )
Sample complexity: 𝑚(𝜖, 𝛿, ℎ) = 𝑂( (−log 𝑝(ℎ)) / 𝜖² ) = 𝑂( |𝑑(ℎ)| / 𝜖² )
(more careful analysis: 𝑂( |𝑑(ℎ*)| / 𝜖 ))
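The MDL rule itself is straightforward to state in code. Below is a minimal sketch (hypothetical names and toy hypotheses, not the course's code): among hypotheses with zero training error, return the one with the shortest description, i.e. the largest prior.

```python
# Minimal sketch (hypothetical names): MDL over an enumerated hypothesis list.
# Each hypothesis is a callable h(x) -> {+1, -1} with a description length |d(h)|.

def empirical_error(h, S):
    """Training error L_S(h) on a sample S = [(x, y), ...]."""
    return sum(h(x) != y for x, y in S) / len(S)

def mdl(hypotheses, S):
    """argmin |d(h)| over consistent hypotheses (L_S(h) = 0), i.e. argmax p(h)."""
    consistent = [(length, h) for h, length in hypotheses if empirical_error(h, S) == 0]
    if not consistent:
        return None  # no consistent hypothesis in the enumerated list
    return min(consistent, key=lambda t: t[0])[1]

# Toy usage: threshold functions on the integers, description length ~ log of the threshold.
hypotheses = [(lambda x, t=t: 1 if x >= t else -1, t.bit_length() + 1) for t in range(1, 64)]
S = [(3, -1), (10, 1), (7, 1)]
h_hat = mdl(hypotheses, S)
print(h_hat(5), h_hat(20))
```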
MDL and Universal Learning
• Theorem: For any ℋ and 𝑝: ℋ → [0,1] s.t. Σ_ℎ 𝑝(ℎ) ≤ 1, and any source distribution 𝒟, if there exists ℎ with 𝐿_𝒟(ℎ) = 0 and 𝑝(ℎ) > 0, then w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
𝐿_𝒟(𝑀𝐷𝐿_𝑝(𝑆)) ≤ √( (−log 𝑝(ℎ) + log(2/𝛿)) / (2𝑚) )
• Can learn any countable class!
• The class of all computable functions, with 𝑝(ℎ) = 2^(−min{|𝜎| : 𝑐(𝜎) = ℎ}).
• Any class enumerable by 𝑛: ℋ → ℕ, with 𝑝(ℎ) = 2^(−𝑛(ℎ))
• But VCdim(all computable functions)=∞ !
• Why no contradiction to Fundamental Theorem?
• PAC Learning: Sample complexity 𝑚(𝜖, 𝛿) is uniform for all ℎ ∈ ℋ. Depends only on the class ℋ, not on the specific ℎ*
• MDL: Sample complexity 𝑚(𝜖, 𝛿, 𝒉) depends on ℎ.
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class ℋ is agnostically PAC-Learnable if there
exists a learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∃𝑚(𝜖, 𝛿), ∀𝒟, ∀ℎ, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Definition: A hypothesis class ℋ is non-uniformly learnable if there exists
a learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∃𝑚(𝜖, 𝛿, ℎ), ∀𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Theorem: ∀𝒟, if there exists ℎ with 𝐿_𝒟(ℎ) = 0, then w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
𝐿_𝒟(𝑀𝐷𝐿_𝑝(𝑆)) ≤ √( (−log 𝑝(ℎ) + log(2/𝛿)) / (2𝑚) )
Can we also compete with ℎ s.t. 𝐿_𝒟(ℎ) > 0?
Allowing Errors: From MDL to SRM
𝐿_𝒟(ℎ) ≤ 𝐿_𝑆(ℎ) + √( (−log 𝑝(ℎ) + log(2/𝛿)) / (2𝑚) )
• Structural Risk Minimization:
𝑆𝑅𝑀_𝑝(𝑆) = argmin_ℎ 𝐿_𝑆(ℎ) + √( (−log 𝑝(ℎ)) / (2𝑚) )
• Theorem: For any prior 𝑝(ℎ) with Σ_ℎ 𝑝(ℎ) ≤ 1 and any source distribution 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
𝐿_𝒟(𝑆𝑅𝑀_𝑝(𝑆)) ≤ inf_ℎ [ 𝐿_𝒟(ℎ) + 2·√( (−log 𝑝(ℎ) + log(2/𝛿)) / 𝑚 ) ]
(In the SRM objective, the 𝐿_𝑆(ℎ) term is minimized by ERM: fit the data. The −log 𝑝(ℎ) term is minimized by MDL: match the prior / simple, short description.)
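A corresponding sketch of the SRM_p rule (again with hypothetical names): each hypothesis carries its penalty −log 𝑝(ℎ), and we trade training error against it as in the bound above.

```python
import math

# Minimal sketch (hypothetical names): Structural Risk Minimization over an
# enumerated hypothesis list.  Each entry is (h, neg_log_prior) with
# neg_log_prior = -log p(h), e.g. the description length of h.

def empirical_error(h, S):
    return sum(h(x) != y for x, y in S) / len(S)

def srm(hypotheses, S):
    m = len(S)
    def objective(entry):
        h, neg_log_p = entry
        return empirical_error(h, S) + math.sqrt(neg_log_p / (2 * m))
    return min(hypotheses, key=objective)[0]
```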
Non-Uniform Learning: Beyond Cardinality
• MDL is still essentially based on cardinality (“how many hypotheses are simpler than me”) and ignores relationships between predictors.
• Generalizes the cardinality bound: using 𝑝(ℎ) = 1/|ℋ| we get
𝑚(𝜖, 𝛿, ℎ) = 𝑚(𝜖, 𝛿) = (log|ℋ| + log(2/𝛿)) / 𝜖²
• Can we treat continuous classes (e.g. linear predictors)? Move from cardinality to the “growth function”?
• E.g.: ℋ = { 𝑠𝑖𝑔𝑛(𝑓(𝜙(𝑥))) | 𝑓: ℝ^𝑑 → ℝ is a polynomial }, 𝜙: 𝒳 → ℝ^𝑑
• VCdim(ℋ) = ∞
• ℋ is uncountable, and there is no distribution with 𝑝(ℎ) > 0 for all ℎ ∈ ℋ
• But what if we bias toward lower-order polynomials?
• Answer 1: a prior over hypothesis classes
• Write ℋ = ∪_𝑟 ℋ_𝑟 (e.g. ℋ_𝑟 = degree-𝑟 polynomials)
• Use a prior 𝑝(ℋ_𝑟) over hypothesis classes
Prior Over Hypothesis Classes
• VC bound: ∀𝑟, ℙ( ∀ℎ ∈ ℋ_𝑟: 𝐿_𝒟(ℎ) ≤ 𝐿_𝑆(ℎ) + 𝑂(√( (VCdim(ℋ_𝑟) + log(1/𝛿_𝑟)) / 𝑚 )) ) ≥ 1 − 𝛿_𝑟
• Setting 𝛿_𝑟 = 𝑝(ℋ_𝑟)·𝛿 and taking a union bound, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
∀ℋ_𝑟, ∀ℎ ∈ ℋ_𝑟: 𝐿_𝒟(ℎ) ≤ 𝐿_𝑆(ℎ) + 𝑂(√( (VCdim(ℋ_𝑟) − log 𝑝(ℋ_𝑟) + log(1/𝛿)) / 𝑚 ))
• Structural Risk Minimization over hypothesis classes:
𝑆𝑅𝑀_𝑝(𝑆) = argmin_{ℋ_𝑟, ℎ∈ℋ_𝑟} 𝐿_𝑆(ℎ) + 𝐶·√( (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟)) / 𝑚 )
• Theorem: w.p. ≥ 1 − 𝛿,
𝐿_𝒟(𝑆𝑅𝑀_𝑝(𝑆)) ≤ min_{ℋ_𝑟, ℎ∈ℋ_𝑟} [ 𝐿_𝒟(ℎ) + 𝑂(√( (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟) + log(1/𝛿)) / 𝑚 )) ]
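As an illustration, here is a hedged sketch of SRM over a nested family of classes with prior 𝑝(ℋ_𝑟) = 2^(−𝑟); the per-class ERM routines and VC-dimension bounds (erm_by_class, vcdim_by_class) are assumed inputs, not part of the lecture.

```python
import math

# Hedged sketch (hypothetical interface): SRM over a nested family H_1 ⊆ H_2 ⊆ ...
# For each class r we are given an ERM routine and an upper bound on its VC
# dimension; the prior is p(H_r) = 2^(-r).

def srm_over_classes(erm_by_class, vcdim_by_class, S, C=1.0):
    """erm_by_class[r](S) returns (h_r, L_S(h_r)); vcdim_by_class[r] bounds VCdim(H_r)."""
    m = len(S)
    best, best_obj = None, float("inf")
    for r, erm in erm_by_class.items():
        h_r, train_err = erm(S)
        neg_log_prior = r * math.log(2.0)           # -log p(H_r) for p(H_r) = 2^(-r)
        penalty = C * math.sqrt((neg_log_prior + vcdim_by_class[r]) / m)
        obj = train_err + penalty
        if obj < best_obj:
            best, best_obj = h_r, obj
    return best
```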
Structural Risk Minimization
• Theorem: For a prior 𝑝(ℋ_𝑟) with Σ_{ℋ_𝑟} 𝑝(ℋ_𝑟) ≤ 1 and any 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
𝐿_𝒟(𝑆𝑅𝑀_𝑝(𝑆)) ≤ min_{ℋ_𝑟, ℎ∈ℋ_𝑟} [ 𝐿_𝒟(ℎ) + 𝑂(√( (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟) + log(1/𝛿)) / 𝑚 )) ]
• For singleton classes ℋ_𝑖 = {ℎ_𝑖}:
• VCdim(ℋ_𝑖) = 0
• Reduces to “standard” SRM with a prior over hypotheses
• For a single class with 𝑝(ℋ) = 1:
• Reduces to ERM over a finite-VC class
• More generally, e.g. for polynomials over 𝜙(𝑥) ∈ ℝ^𝑑 with 𝑝(degree 𝑟) = 2^(−𝑟):
𝑚(𝜖, 𝛿, ℎ) = 𝑂( (degree(ℎ) + (𝑑+1)^degree(ℎ) + log(1/𝛿)) / 𝜖² )
• Allows non-uniform learning of a countable union of finite-VC classes
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class ℋ is agnostically PAC-Learnable if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∃𝑚(𝜖, 𝛿), ∀𝒟, ∀ℎ, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Definition: A hypothesis class ℋ is non-uniformly learnable if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∃𝑚(𝜖, 𝛿, ℎ), ∀𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Theorem: A hypothesis class ℋ is non-uniformly learnable if and only if it is a countable union of finite-VC classes (ℋ = ∪_{𝑖∈ℕ} ℋ_𝑖 with VCdim(ℋ_𝑖) < ∞)
• Definition: A hypothesis class ℋ is “consistently learnable” if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∀𝒟, ∃𝑚(𝜖, 𝛿, ℎ, 𝒟), w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ,𝒟)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
Consistency
• 𝒳 countable (e.g. 𝒳 = {0,1}*), ℋ = {±1}^𝒳 (all possible functions)
• ℋ is uncountable, it is not a countable union of finite VC classes, and is thus not non-uniformly learnable
• Claim: ℋ is “consistently learnable” using
𝐸𝑅𝑀_ℋ(𝑆)(𝑥) = 𝑀𝐴𝐽𝑂𝑅𝐼𝑇𝑌( 𝑦_𝑖 s.t. (𝑥_𝑖, 𝑦_𝑖) ∈ 𝑆 and 𝑥_𝑖 = 𝑥 )
• Proof sketch: for any 𝒟,
• Sort 𝒳 by decreasing probability. The tail has diminishing probability, and thus for any 𝜖 there exists some prefix 𝒳′ of the sort s.t. the tail 𝒳 ∖ 𝒳′ has probability mass ≤ 𝜖/2.
• We’ll give up on the tail. 𝒳′ is finite, and so the restriction {±1}^{𝒳′} is also finite.
• Why only “consistently learnable”?
• The size of 𝒳′ required to capture 1 − 𝜖/2 of the mass depends on 𝒟.
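The memorizing learner from the claim is easy to write down; here is a minimal sketch (hypothetical code): predict the majority of the labels observed for 𝑥 in 𝑆, and an arbitrary default label on unseen points.

```python
from collections import Counter

# Minimal sketch: the memorizing ERM over all functions {+1,-1}^X.
# On points seen in S it predicts the majority of observed labels; on unseen
# points it predicts an arbitrary default (here +1).

def majority_erm(S, default=1):
    votes = {}
    for x, y in S:
        votes.setdefault(x, Counter())[y] += 1
    def h(x):
        if x in votes:
            return votes[x].most_common(1)[0][0]
        return default
    return h

h = majority_erm([("00", 1), ("00", 1), ("01", -1)])
print(h("00"), h("01"), h("11"))   # 1 -1 1
```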
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class ℋ is agnostically PAC-Learnable if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∃𝑚(𝜖, 𝛿), ∀𝒟, ∀ℎ, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• (Agnostically) PAC-Learnable iff VCdim ℋ < ∞
• Definition: A hypothesis class ℋ is non-uniformly learnable if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∃𝑚(𝜖, 𝛿, ℎ), ∀𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Non-uniformly learnable iff ℋ is a countable union of finite VC classes
• Definition: A hypothesis class ℋ is “consistently learnable” if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∀𝒟, ∃𝑚(𝜖, 𝛿, ℎ, 𝒟), w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ,𝒟)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
SRM In Practice
𝑆𝑅𝑀_𝑝(𝑆) = argmin_{ℋ_𝑟, ℎ∈ℋ_𝑟} 𝐿_𝑆(ℎ) + 𝐶·√( (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟)) / 𝑚 )
• The bound is loose anyway. Better to view it as a bi-criteria optimization:
argmin 𝐿_𝑆(ℎ) and (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟))
E.g. scalarize as
argmin 𝐿_𝑆(ℎ) + 𝜆·(−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟))
• Typically −log 𝑝(ℋ_𝑟) and VCdim(ℋ_𝑟) are monotone in the “complexity” 𝑟:
argmin 𝐿_𝑆(ℎ) and 𝑟(ℎ)
where 𝑟(ℎ) = min{ 𝑟 s.t. ℎ ∈ ℋ_𝑟 }
SRM as a Bi-Criteria Problem
argmin 𝐿_𝑆(ℎ) and 𝑟(ℎ)
Regularization Path = { argmin_ℎ 𝐿_𝑆(ℎ) + 𝜆·𝑟(ℎ) | 0 ≤ 𝜆 ≤ ∞ }
Select 𝜆 using a validation set—exact bound not needed
[Figure: the regularization path traces the Pareto frontier in the (𝐿_𝑆(ℎ), 𝑟(ℎ)) plane, from 𝜆 = ∞ to 𝜆 = 0.]
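A hedged sketch of the practical recipe (hypothetical interface: the train and error routines are assumed to be supplied): sweep 𝜆 over a grid to trace the regularization path, then pick 𝜆 on a held-out validation set rather than via the bound.

```python
# Hedged sketch (hypothetical interface): regularization path + validation selection.
# `train(S, lam)` is assumed to return argmin_h L_S(h) + lam * r(h);
# `error(h, V)` is the 0/1 error of h on a held-out set V.

def select_by_validation(train, error, S_train, S_val, lambdas):
    path = [(lam, train(S_train, lam)) for lam in lambdas]   # regularization path
    lam_best, h_best = min(path, key=lambda t: error(t[1], S_val))
    return lam_best, h_best

# Typical usage: a logarithmic grid of lambda values.
# lambdas = [10.0 ** k for k in range(-4, 3)]
# lam, h = select_by_validation(train, error, S_train, S_val, lambdas)
```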
Non-Uniform Learning: Beyond Cardinality
• MDL is still essentially based on cardinality (“how many hypotheses are simpler than me”) and ignores relationships between predictors.
• Can we treat continuous classes (e.g. linear predictors)? Move from cardinality? Take into account that many predictors are similar?
• Answer 1: a prior 𝑝(ℋ_𝑟) over hypothesis classes
• Answer 2: PAC-Bayes Theory
• Prior distribution 𝑃 (not necessarily discrete) over ℋ
David McAllester
PAC-Bayes
• Until now (MDL, SRM) we used a discrete “prior” (a discrete “distribution” 𝑝(ℎ) over hypotheses, or a discrete “distribution” 𝑝(ℋ_𝑟) over hypothesis classes)
• Instead: encode the inductive bias as a distribution 𝑃 over hypotheses
• Use the randomized (averaged) predictor ℎ_𝑄, which for each prediction draws ℎ ∼ 𝑄 and predicts ℎ(𝑥)
• ℎ_𝑄(𝑥) = 𝑦 w.p. ℙ_{ℎ∼𝑄}[ℎ(𝑥) = 𝑦]
• 𝐿_𝒟(ℎ_𝑄) = 𝔼_{ℎ∼𝑄}[𝐿_𝒟(ℎ)]
• Theorem: for any distribution 𝑃 over hypotheses and any 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚, for every distribution 𝑄 over ℋ:
𝐿_𝒟(ℎ_𝑄) − 𝐿_𝑆(ℎ_𝑄) ≤ √( (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (2(𝑚 − 1)) )
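A small sketch (hypothetical helper) that evaluates the right-hand side of this bound for a given 𝐾𝐿(𝑄‖𝑃), sample size 𝑚 and confidence 𝛿.

```python
import math

# Minimal sketch: the McAllester-style PAC-Bayes gap bound
#   L_D(h_Q) - L_S(h_Q) <= sqrt( (KL(Q||P) + log(2m/delta)) / (2(m-1)) ).

def pac_bayes_gap(kl_qp, m, delta):
    return math.sqrt((kl_qp + math.log(2 * m / delta)) / (2 * (m - 1)))

print(pac_bayes_gap(kl_qp=5.0, m=10_000, delta=0.05))   # ≈ 0.03
```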
KL-Divergence
𝐾𝐿(𝑄‖𝑃) = 𝔼_{ℎ∼𝑄}[ log(𝑑𝑄/𝑑𝑃) ]
= Σ_ℎ 𝑞(ℎ) log( 𝑞(ℎ)/𝑝(ℎ) ) for discrete distributions with pmfs 𝑞, 𝑝
= ∫ 𝑓_𝑄(ℎ) log( 𝑓_𝑄(ℎ)/𝑓_𝑃(ℎ) ) 𝑑ℎ for continuous distributions with densities 𝑓_𝑄, 𝑓_𝑃
• Measures how much 𝑄 deviates from 𝑃
• 𝐾𝐿(𝑄‖𝑃) ≥ 0, and 𝐾𝐿(𝑄‖𝑃) = 0 if and only if 𝑄 = 𝑃
• If 𝑄(𝐴) > 0 while 𝑃(𝐴) = 0, then 𝐾𝐿(𝑄‖𝑃) = ∞ (the other direction is allowed)
• 𝐾𝐿(𝐻₁‖𝐻₀) = information per sample for rejecting 𝐻₀ when 𝐻₁ is true
• 𝐾𝐿(𝑄‖Unif(𝑛)) = log 𝑛 − 𝐻(𝑄)
• 𝐼(𝑋; 𝑌) = 𝐾𝐿( 𝑃(𝑋, 𝑌) ‖ 𝑃(𝑋)𝑃(𝑌) )
Solomon Kullback, Richard Leibler
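A minimal sketch (toy numbers) computing KL between discrete distributions and checking the identity 𝐾𝐿(𝑄‖Unif(𝑛)) = log 𝑛 − 𝐻(𝑄).

```python
import math

# Minimal sketch: KL divergence between discrete distributions (natural log),
# with the convention 0*log(0/p) = 0, and q>0 where p=0 giving infinity.

def kl(q, p):
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0:
            continue
        if pi == 0:
            return float("inf")
        total += qi * math.log(qi / pi)
    return total

def entropy(q):
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

q = [0.7, 0.2, 0.1, 0.0]
unif = [0.25] * 4
print(kl(q, unif))                     # KL(Q || Unif(4))
print(math.log(4) - entropy(q))        # same value: log n - H(Q)
```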
PAC-Bayes
• For any distribution 𝑃 over hypotheses and any 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚, for every 𝑄:
𝐿_𝒟(ℎ_𝑄) − 𝐿_𝑆(ℎ_𝑄) ≤ √( (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (2(𝑚 − 1)) )
• Can only use hypotheses in the support of 𝑃 (otherwise 𝐾𝐿(𝑄‖𝑃) = ∞)
• For a finite ℋ with 𝑃 = Unif(ℋ)
• Consider 𝑄 = point mass on h
• 𝐾𝐿(𝑄‖𝑃) = log|ℋ|
• Generalizes cardinality bound (up to log𝑚)
• More generally, for a discrete 𝑃 and 𝑄 = point mass on h
• 𝐾𝐿(𝑄‖𝑃) = Σ_{ℎ′} 𝑞(ℎ′) log( 𝑞(ℎ′)/𝑝(ℎ′) ) = log( 1/𝑝(ℎ) ) = −log 𝑝(ℎ)
• Generalizes MDL/SRM (up to log𝑚)
• For a continuous 𝑃 (e.g. over linear predictors or polynomials):
• For 𝑄 = a point mass (or any discrete 𝑄), 𝐾𝐿(𝑄‖𝑃) = ∞
• Take ℎ_𝑄 as an average over similar hypotheses (e.g. with the same behavior on 𝑆)
PAC-Bayes
𝐿_𝒟(ℎ_𝑄) ≤ 𝐿_𝑆(ℎ_𝑄) + √( (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (2(𝑚 − 1)) )
• What learning rule does the PAC-Bayes bound suggest?
𝑄_𝜆 = argmin_𝑄 [ 𝐿_𝑆(ℎ_𝑄) + 𝜆·𝐾𝐿(𝑄‖𝑃) ]
• Theorem: 𝑞_𝜆(ℎ) ∝ 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ)) for some “inverse temperature” 𝛽
• As 𝜆 → ∞ we ignore the data, corresponding to infinite temperature, 𝛽 → 0
• As 𝜆 → 0 we insist on minimizing 𝐿𝑆 ℎ𝑄 , corresponding to zero temperature, 𝛽 → ∞, and the prediction becomes ERM (or rather, a distribution over the ERM hypothesis in the support of 𝑃)
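A minimal sketch (toy prior and training errors) of the minimizing Gibbs distribution 𝑞_𝜆(ℎ) ∝ 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ)) over a finite hypothesis set, illustrating the two temperature limits.

```python
import math

# Minimal sketch: the Gibbs distribution q(h) ∝ p(h) * exp(-beta * L_S(h))
# over a finite hypothesis set, for a given inverse temperature beta.

def gibbs(prior, train_errors, beta):
    weights = [p * math.exp(-beta * err) for p, err in zip(prior, train_errors)]
    z = sum(weights)
    return [w / z for w in weights]

prior        = [0.5, 0.25, 0.25]     # p(h)
train_errors = [0.30, 0.05, 0.40]    # L_S(h)

print(gibbs(prior, train_errors, beta=0))     # = prior (ignore the data)
print(gibbs(prior, train_errors, beta=200))   # ≈ point mass on the ERM hypothesis
```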
PAC-Bayes vs Bayes
Bayesian approach:
• Assume ℎ ∼ 𝒫,
• 𝑦₁, …, 𝑦_𝑚 i.i.d. conditioned on ℎ, with 𝑦_𝑖 | 𝑥_𝑖, ℎ = ℎ(𝑥_𝑖) w.p. 1 − 𝜈, and = −ℎ(𝑥_𝑖) w.p. 𝜈
Use the posterior:
𝑝(ℎ | 𝑆) ∝ 𝑝(ℎ)·𝑝(𝑆 | ℎ)
= 𝑝(ℎ)·Π_𝑖 𝑝(𝑥_𝑖)·𝑝(𝑦_𝑖 | 𝑥_𝑖, ℎ)
∝ 𝑝(ℎ)·Π_𝑖 ( 𝜈/(1−𝜈) )^[ℎ(𝑥_𝑖) ≠ 𝑦_𝑖]
= 𝑝(ℎ)·( 𝜈/(1−𝜈) )^(Σ_𝑖 [ℎ(𝑥_𝑖) ≠ 𝑦_𝑖])
= 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ))    where 𝛽 = 𝑚·log( (1−𝜈)/𝜈 )
PAC-Bayes vs Bayes
PAC-Bayes:
• 𝑃 encodes inductive bias, not assumption about reality
• SRM-type bound minimized by Gibbs distribution
𝑞_𝜆(ℎ) ∝ 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ))
• Post-hoc guarantee always valid (w.p. ≥ 1 − 𝛿 over 𝑆),
with no assumption about reality
𝐿_𝒟(ℎ_𝑄) ≤ 𝐿_𝑆(ℎ_𝑄) + √( (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (2(𝑚 − 1)) )
• Bound valid for any 𝑄
• If inductive bias very different from reality, bound will be high
Bayesian Approach:
• 𝒫 is a prior over reality
• Posterior given by the Gibbs distribution 𝑞_𝜆(ℎ) ∝ 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ))
• Risk analysis assuming the prior
PAC-Bayes: Tighter Version
• For any distribution 𝑃 over hypotheses and any source distribution 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚, for every 𝑄:
𝐾𝐿( 𝐿_𝑆(ℎ_𝑄) ‖ 𝐿_𝒟(ℎ_𝑄) ) ≤ (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (𝑚 − 1)
where 𝐾𝐿(𝛼‖𝛽) = 𝛼·log(𝛼/𝛽) + (1 − 𝛼)·log( (1−𝛼)/(1−𝛽) ) for 𝛼, 𝛽 ∈ [0,1]
𝐿_𝒟(ℎ_𝑄) ≤ 𝐿_𝑆(ℎ_𝑄) + √( 2·𝐿_𝑆(ℎ_𝑄)·(𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (𝑚 − 1) ) + 2·(𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (𝑚 − 1)
• This generalizes the realizable case (𝐿_𝑆(ℎ_𝑄) = 0, so only the 1/𝑚 term appears) and the agnostic case (where the 1/√𝑚 term is dominant)
• Numerically much tighter
• Can also be used as a tail bound instead of Hoeffding or Bernstein, also with cardinality- or VC-based guarantees. It arises naturally in PAC-Bayes.
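In practice the tighter statement is used by numerically inverting the binary 𝐾𝐿: given 𝐿_𝑆(ℎ_𝑄) and the right-hand side, find the largest 𝐿_𝒟 consistent with the inequality. A small sketch (hypothetical helper, binary search) follows.

```python
import math

# Minimal sketch: invert the binary KL divergence kl(a || b) to turn the
# tighter PAC-Bayes statement kl(L_S(h_Q) || L_D(h_Q)) <= eps into an explicit
# upper bound on L_D(h_Q), via binary search over b in [a, 1].

def binary_kl(a, b):
    tiny = 1e-12
    a = min(max(a, tiny), 1 - tiny)
    b = min(max(b, tiny), 1 - tiny)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def kl_inverse_upper(a, eps, iters=60):
    lo, hi = a, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binary_kl(a, mid) > eps:   # kl(a, b) is increasing in b for b >= a
            hi = mid
        else:
            lo = mid
    return hi

# Toy example: L_S(h_Q) = 0.05, KL(Q||P) = 10, m = 10000, delta = 0.05.
m, delta, kl_qp, train = 10_000, 0.05, 10.0, 0.05
eps = (kl_qp + math.log(2 * m / delta)) / (m - 1)
print(kl_inverse_upper(train, eps))   # upper bound on L_D(h_Q)
```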