Computational and Statistical Learning Theory
TTIC 31120
Prof. Nati Srebro
Lecture 4: MDL and PAC-Bayes
Uniform vs Non-Uniform Bias
• No Free Lunch: we need some “inductive bias”
• Limiting attention to hypothesis class ℋ: “flat” bias
• 𝑝(ℎ) = 1/|ℋ| for ℎ ∈ ℋ, and 𝑝(ℎ) = 0 otherwise
• Non-uniform bias: 𝑝 ℎ encodes bias
• Can use any 𝑝(ℎ) ≥ 0 s.t. Σ_ℎ 𝑝(ℎ) ≤ 1
• E.g. choose a prefix-disambiguous encoding 𝑑(ℎ) and use 𝑝(ℎ) = 2^(−|𝑑(ℎ)|)
• Or, choose 𝑐: 𝒰 → 𝒴^𝒳 over prefix-disambiguous programs 𝒰 ⊂ {0,1}*
and use 𝑝(ℎ) = 2^(−min{|𝜎| : 𝑐(𝜎) = ℎ})
• The choice of 𝑝(⋅), 𝑑(⋅) or 𝑐(⋅) encodes our expert knowledge / inductive bias
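To make the prefix-code example concrete, here is a minimal sketch (a hypothetical toy code, not from the lecture) checking prefix-disambiguity and the induced prior 𝑝(ℎ) = 2^(−|𝑑(ℎ)|); Kraft's inequality is what guarantees Σ_ℎ 𝑝(ℎ) ≤ 1 for any prefix code.

```python
# Minimal sketch (hypothetical toy example): a prefix-disambiguous code over four
# hypotheses and the induced prior p(h) = 2^(-|d(h)|).  Kraft's inequality
# guarantees sum_h p(h) <= 1 for any prefix code.

code = {            # d(h): no codeword is a prefix of another
    "h_const_plus":  "0",
    "h_const_minus": "10",
    "h_parity":      "110",
    "h_majority":    "111",
}

def is_prefix_free(code):
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

prior = {h: 2.0 ** (-len(w)) for h, w in code.items()}

assert is_prefix_free(code)
print(prior)                 # {'h_const_plus': 0.5, ...}
print(sum(prior.values()))   # <= 1 (here exactly 1.0)
```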
Minimum Description Length Learning
• Choose a “prior” 𝑝(ℎ) s.t. Σ_ℎ 𝑝(ℎ) ≤ 1 (or a description language 𝑑(⋅) or 𝑐(⋅))
• Minimum Description Length learning rule (based on the above prior / description language):
𝑀𝐷𝐿_𝑝(𝑆) = argmax_{ℎ: 𝐿_𝑆(ℎ)=0} 𝑝(ℎ) = argmin_{ℎ: 𝐿_𝑆(ℎ)=0} |𝑑(ℎ)|
• For any 𝒟, w.p. ≥ 1 − 𝛿,
𝐿_𝒟(𝑀𝐷𝐿_𝑝(𝑆)) ≤ inf_{ℎ s.t. 𝐿_𝒟(ℎ)=0} √( (−log 𝑝(ℎ) + log(2/𝛿)) / (2𝑚) )
Sample complexity: 𝑚(𝜖, 𝛿, ℎ) = 𝑂( (−log 𝑝(ℎ)) / 𝜖² ) = 𝑂( |𝑑(ℎ)| / 𝜖² )
(more careful analysis: 𝑂( |𝑑(ℎ*)| / 𝜖 ))
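The MDL rule itself is straightforward to state in code. Below is a minimal sketch (hypothetical names and toy hypotheses, not the course's code): among hypotheses with zero training error, return the one with the shortest description, i.e. the largest prior.

```python
# Minimal sketch (hypothetical names): MDL over an enumerated hypothesis list.
# Each hypothesis is a callable h(x) -> {+1, -1} with a description length |d(h)|.

def empirical_error(h, S):
    """Training error L_S(h) on a sample S = [(x, y), ...]."""
    return sum(h(x) != y for x, y in S) / len(S)

def mdl(hypotheses, S):
    """argmin |d(h)| over consistent hypotheses (L_S(h) = 0), i.e. argmax p(h)."""
    consistent = [(length, h) for h, length in hypotheses if empirical_error(h, S) == 0]
    if not consistent:
        return None  # no consistent hypothesis in the enumerated list
    return min(consistent, key=lambda t: t[0])[1]

# Toy usage: threshold functions on the integers, description length ~ log of the threshold.
hypotheses = [(lambda x, t=t: 1 if x >= t else -1, t.bit_length() + 1) for t in range(1, 64)]
S = [(3, -1), (10, 1), (7, 1)]
h_hat = mdl(hypotheses, S)
print(h_hat(5), h_hat(20))
```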
MDL and Universal Learning
• Theorem: For any ℋ and 𝑝: ℋ → [0,1] s.t. Σ_ℎ 𝑝(ℎ) ≤ 1, and any source distribution 𝒟, if there exists ℎ with 𝐿_𝒟(ℎ) = 0 and 𝑝(ℎ) > 0, then w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
𝐿_𝒟(𝑀𝐷𝐿_𝑝(𝑆)) ≤ √( (−log 𝑝(ℎ) + log(2/𝛿)) / (2𝑚) )
• Can learn any countable class!
• The class of all computable functions, with 𝑝(ℎ) = 2^(−min{|𝜎| : 𝑐(𝜎) = ℎ}).
• Any class enumerable by 𝑛: ℋ → ℕ, with 𝑝(ℎ) = 2^(−𝑛(ℎ))
• But VCdim(all computable functions)=∞ !
• Why no contradiction to Fundamental Theorem?
• PAC Learning: Sample complexity 𝑚(𝜖, 𝛿) is uniform for all ℎ ∈ ℋ. Depends only on the class ℋ, not on the specific ℎ*
• MDL: Sample complexity 𝑚(𝜖, 𝛿, 𝒉) depends on ℎ.
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class ℋ is agnostically PAC-Learnable if there
exists a learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∃𝑚(𝜖, 𝛿), ∀𝒟, ∀ℎ, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Definition: A hypothesis class ℋ is non-uniformly learnable if there exists
a learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∃𝑚(𝜖, 𝛿, ℎ), ∀𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Theorem: ∀𝒟, if there exists ℎ with 𝐿_𝒟(ℎ) = 0, then w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
𝐿_𝒟(𝑀𝐷𝐿_𝑝(𝑆)) ≤ √( (−log 𝑝(ℎ) + log(2/𝛿)) / (2𝑚) )
Can we also compete with ℎ s.t. 𝐿_𝒟(ℎ) > 0?
Allowing Errors: From MDL to SRM
𝐿_𝒟(ℎ) ≤ 𝐿_𝑆(ℎ) + √( (−log 𝑝(ℎ) + log(2/𝛿)) / (2𝑚) )
• Structural Risk Minimization:
𝑆𝑅𝑀_𝑝(𝑆) = argmin_ℎ 𝐿_𝑆(ℎ) + √( (−log 𝑝(ℎ)) / (2𝑚) )
• Theorem: For any prior 𝑝(ℎ) with Σ_ℎ 𝑝(ℎ) ≤ 1 and any source distribution 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
𝐿_𝒟(𝑆𝑅𝑀_𝑝(𝑆)) ≤ inf_ℎ [ 𝐿_𝒟(ℎ) + 2·√( (−log 𝑝(ℎ) + log(2/𝛿)) / 𝑚 ) ]
(In the SRM objective, the 𝐿_𝑆(ℎ) term is minimized by ERM: fit the data. The −log 𝑝(ℎ) term is minimized by MDL: match the prior / simple, short description.)
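A corresponding sketch of the SRM_p rule (again with hypothetical names): each hypothesis carries its penalty −log 𝑝(ℎ), and we trade training error against it as in the bound above.

```python
import math

# Minimal sketch (hypothetical names): Structural Risk Minimization over an
# enumerated hypothesis list.  Each entry is (h, neg_log_prior) with
# neg_log_prior = -log p(h), e.g. the description length of h.

def empirical_error(h, S):
    return sum(h(x) != y for x, y in S) / len(S)

def srm(hypotheses, S):
    m = len(S)
    def objective(entry):
        h, neg_log_p = entry
        return empirical_error(h, S) + math.sqrt(neg_log_p / (2 * m))
    return min(hypotheses, key=objective)[0]
```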
Non-Uniform Learning: Beyond Cardinality
• MDL is still essentially based on cardinality (“how many hypotheses are simpler than me”) and ignores relationships between predictors.
• Generalizes the cardinality bound: using 𝑝(ℎ) = 1/|ℋ| we get
𝑚(𝜖, 𝛿, ℎ) = 𝑚(𝜖, 𝛿) = (log|ℋ| + log(2/𝛿)) / 𝜖²
• Can we treat continuous classes (e.g. linear predictors)? Move from cardinality to the “growth function”?
• E.g.: ℋ = { 𝑠𝑖𝑔𝑛(𝑓(𝜙(𝑥))) | 𝑓: ℝ^𝑑 → ℝ is a polynomial }, 𝜙: 𝒳 → ℝ^𝑑
• VCdim(ℋ) = ∞
• ℋ is uncountable, and there is no distribution with 𝑝(ℎ) > 0 for all ℎ ∈ ℋ
• But what if we bias toward lower-order polynomials?
• Answer 1: a prior over hypothesis classes
• Write ℋ = ∪_𝑟 ℋ_𝑟 (e.g. ℋ_𝑟 = degree-𝑟 polynomials)
• Use a prior 𝑝(ℋ_𝑟) over hypothesis classes
Prior Over Hypothesis Classes
• VC bound: ∀𝑟, ℙ( ∀ℎ ∈ ℋ_𝑟: 𝐿_𝒟(ℎ) ≤ 𝐿_𝑆(ℎ) + 𝑂(√( (VCdim(ℋ_𝑟) + log(1/𝛿_𝑟)) / 𝑚 )) ) ≥ 1 − 𝛿_𝑟
• Setting 𝛿_𝑟 = 𝑝(ℋ_𝑟)·𝛿 and taking a union bound, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
∀ℋ_𝑟, ∀ℎ ∈ ℋ_𝑟: 𝐿_𝒟(ℎ) ≤ 𝐿_𝑆(ℎ) + 𝑂(√( (VCdim(ℋ_𝑟) − log 𝑝(ℋ_𝑟) + log(1/𝛿)) / 𝑚 ))
• Structural Risk Minimization over hypothesis classes:
𝑆𝑅𝑀_𝑝(𝑆) = argmin_{ℋ_𝑟, ℎ∈ℋ_𝑟} 𝐿_𝑆(ℎ) + 𝐶·√( (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟)) / 𝑚 )
• Theorem: w.p. ≥ 1 − 𝛿,
𝐿_𝒟(𝑆𝑅𝑀_𝑝(𝑆)) ≤ min_{ℋ_𝑟, ℎ∈ℋ_𝑟} [ 𝐿_𝒟(ℎ) + 𝑂(√( (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟) + log(1/𝛿)) / 𝑚 )) ]
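As an illustration, here is a hedged sketch of SRM over a nested family of classes with prior 𝑝(ℋ_𝑟) = 2^(−𝑟); the per-class ERM routines and VC-dimension bounds (erm_by_class, vcdim_by_class) are assumed inputs, not part of the lecture.

```python
import math

# Hedged sketch (hypothetical interface): SRM over a nested family H_1 ⊆ H_2 ⊆ ...
# For each class r we are given an ERM routine and an upper bound on its VC
# dimension; the prior is p(H_r) = 2^(-r).

def srm_over_classes(erm_by_class, vcdim_by_class, S, C=1.0):
    """erm_by_class[r](S) returns (h_r, L_S(h_r)); vcdim_by_class[r] bounds VCdim(H_r)."""
    m = len(S)
    best, best_obj = None, float("inf")
    for r, erm in erm_by_class.items():
        h_r, train_err = erm(S)
        neg_log_prior = r * math.log(2.0)           # -log p(H_r) for p(H_r) = 2^(-r)
        penalty = C * math.sqrt((neg_log_prior + vcdim_by_class[r]) / m)
        obj = train_err + penalty
        if obj < best_obj:
            best, best_obj = h_r, obj
    return best
```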
Structural Risk Minimization
• Theorem: For a prior 𝑝(ℋ_𝑟) with Σ_{ℋ_𝑟} 𝑝(ℋ_𝑟) ≤ 1 and any 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚:
𝐿_𝒟(𝑆𝑅𝑀_𝑝(𝑆)) ≤ min_{ℋ_𝑟, ℎ∈ℋ_𝑟} [ 𝐿_𝒟(ℎ) + 𝑂(√( (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟) + log(1/𝛿)) / 𝑚 )) ]
• For singleton classes ℋ_𝑖 = {ℎ_𝑖}:
• VCdim(ℋ_𝑖) = 0
• Reduces to “standard” SRM with a prior over hypotheses
• For a single class with 𝑝(ℋ) = 1:
• Reduces to ERM over a finite-VC class
• More generally, e.g. for polynomials over 𝜙(𝑥) ∈ ℝ^𝑑 with 𝑝(degree 𝑟) = 2^(−𝑟):
𝑚(𝜖, 𝛿, ℎ) = 𝑂( (degree(ℎ) + (𝑑+1)^degree(ℎ) + log(1/𝛿)) / 𝜖² )
• Allows non-uniform learning of a countable union of finite-VC classes
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class ℋ is agnostically PAC-Learnable if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∃𝑚(𝜖, 𝛿), ∀𝒟, ∀ℎ, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Definition: A hypothesis class ℋ is non-uniformly learnable if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∃𝑚(𝜖, 𝛿, ℎ), ∀𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Theorem: A hypothesis class ℋ is non-uniformly learnable if and only if it is a countable union of finite-VC classes (ℋ = ∪_{𝑖∈ℕ} ℋ_𝑖 with VCdim(ℋ_𝑖) < ∞)
• Definition: A hypothesis class ℋ is “consistently learnable” if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∀𝒟, ∃𝑚(𝜖, 𝛿, ℎ, 𝒟), w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ,𝒟)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
Consistency
• 𝒳 countable (e.g. 𝒳 = {0,1}*), ℋ = {±1}^𝒳 (all possible functions)
• ℋ is uncountable, it is not a countable union of finite VC classes, and is thus not non-uniformly learnable
• Claim: ℋ is “consistently learnable” using
𝐸𝑅𝑀_ℋ(𝑆)(𝑥) = 𝑀𝐴𝐽𝑂𝑅𝐼𝑇𝑌( 𝑦_𝑖 s.t. (𝑥_𝑖, 𝑦_𝑖) ∈ 𝑆 and 𝑥_𝑖 = 𝑥 )
• Proof sketch: for any 𝒟,
• Sort 𝒳 by decreasing probability. The tail has diminishing probability, and thus for any 𝜖 there exists some prefix 𝒳′ of the sort s.t. the tail 𝒳 ∖ 𝒳′ has probability mass ≤ 𝜖/2.
• We’ll give up on the tail. 𝒳′ is finite, and so the restriction {±1}^{𝒳′} is also finite.
• Why only “consistently learnable”?
• The size of 𝒳′ required to capture 1 − 𝜖/2 of the mass depends on 𝒟.
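The memorizing learner from the claim is easy to write down; here is a minimal sketch (hypothetical code): predict the majority of the labels observed for 𝑥 in 𝑆, and an arbitrary default label on unseen points.

```python
from collections import Counter

# Minimal sketch: the memorizing ERM over all functions {+1,-1}^X.
# On points seen in S it predicts the majority of observed labels; on unseen
# points it predicts an arbitrary default (here +1).

def majority_erm(S, default=1):
    votes = {}
    for x, y in S:
        votes.setdefault(x, Counter())[y] += 1
    def h(x):
        if x in votes:
            return votes[x].most_common(1)[0][0]
        return default
    return h

h = majority_erm([("00", 1), ("00", 1), ("01", -1)])
print(h("00"), h("01"), h("11"))   # 1 -1 1
```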
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class ℋ is agnostically PAC-Learnable if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∃𝑚(𝜖, 𝛿), ∀𝒟, ∀ℎ, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• (Agnostically) PAC-Learnable iff VCdim ℋ < ∞
• Definition: A hypothesis class ℋ is non-uniformly learnable if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∃𝑚(𝜖, 𝛿, ℎ), ∀𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
• Non-uniformly learnable iff ℋ is a countable union of finite VC classes
• Definition: A hypothesis class ℋ is “consistently learnable” if there exists a
learning rule 𝐴 such that ∀𝜖, 𝛿 > 0, ∀ℎ, ∀𝒟, ∃𝑚(𝜖, 𝛿, ℎ, 𝒟), w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^{𝑚(𝜖,𝛿,ℎ,𝒟)}:
𝐿_𝒟(𝐴(𝑆)) ≤ 𝐿_𝒟(ℎ) + 𝜖
SRM In Practice
𝑆𝑅𝑀_𝑝(𝑆) = argmin_{ℋ_𝑟, ℎ∈ℋ_𝑟} 𝐿_𝑆(ℎ) + 𝐶·√( (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟)) / 𝑚 )
• The bound is loose anyway. Better to view it as a bi-criteria optimization:
argmin 𝐿_𝑆(ℎ) and (−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟))
E.g. scalarize as
argmin 𝐿_𝑆(ℎ) + 𝜆·(−log 𝑝(ℋ_𝑟) + VCdim(ℋ_𝑟))
• Typically −log 𝑝(ℋ_𝑟) and VCdim(ℋ_𝑟) are monotone in the “complexity” 𝑟:
argmin 𝐿_𝑆(ℎ) and 𝑟(ℎ)
where 𝑟(ℎ) = min{ 𝑟 s.t. ℎ ∈ ℋ_𝑟 }
SRM as a Bi-Criteria Problem
argmin 𝐿_𝑆(ℎ) and 𝑟(ℎ)
Regularization Path = { argmin_ℎ 𝐿_𝑆(ℎ) + 𝜆·𝑟(ℎ) | 0 ≤ 𝜆 ≤ ∞ }
Select 𝜆 using a validation set—exact bound not needed
[Figure: the regularization path traces the Pareto frontier in the (𝐿_𝑆(ℎ), 𝑟(ℎ)) plane, from 𝜆 = ∞ to 𝜆 = 0.]
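A hedged sketch of the practical recipe (hypothetical interface: the train and error routines are assumed to be supplied): sweep 𝜆 over a grid to trace the regularization path, then pick 𝜆 on a held-out validation set rather than via the bound.

```python
# Hedged sketch (hypothetical interface): regularization path + validation selection.
# `train(S, lam)` is assumed to return argmin_h L_S(h) + lam * r(h);
# `error(h, V)` is the 0/1 error of h on a held-out set V.

def select_by_validation(train, error, S_train, S_val, lambdas):
    path = [(lam, train(S_train, lam)) for lam in lambdas]   # regularization path
    lam_best, h_best = min(path, key=lambda t: error(t[1], S_val))
    return lam_best, h_best

# Typical usage: a logarithmic grid of lambda values.
# lambdas = [10.0 ** k for k in range(-4, 3)]
# lam, h = select_by_validation(train, error, S_train, S_val, lambdas)
```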
Non-Uniform Learning: Beyond Cardinality
• MDL is still essentially based on cardinality (“how many hypotheses are simpler than me”) and ignores relationships between predictors.
• Can we treat continuous classes (e.g. linear predictors)? Move from cardinality? Take into account that many predictors are similar?
• Answer 1: a prior 𝑝(ℋ_𝑟) over hypothesis classes
• Answer 2: PAC-Bayes Theory
• Prior distribution 𝑃 (not necessarily discrete) over ℋ
David McAllester
PAC-Bayes
• Until now (MDL, SRM) we used a discrete “prior” (a discrete “distribution” 𝑝(ℎ) over hypotheses, or a discrete “distribution” 𝑝(ℋ_𝑟) over hypothesis classes)
• Instead: encode the inductive bias as a distribution 𝑃 over hypotheses
• Use the randomized (averaged) predictor ℎ_𝑄, which for each prediction draws ℎ ∼ 𝑄 and predicts ℎ(𝑥)
• ℎ_𝑄(𝑥) = 𝑦 w.p. ℙ_{ℎ∼𝑄}[ℎ(𝑥) = 𝑦]
• 𝐿_𝒟(ℎ_𝑄) = 𝔼_{ℎ∼𝑄}[𝐿_𝒟(ℎ)]
• Theorem: for any distribution 𝑃 over hypotheses and any 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚, for every distribution 𝑄 over ℋ:
𝐿_𝒟(ℎ_𝑄) − 𝐿_𝑆(ℎ_𝑄) ≤ √( (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (2(𝑚 − 1)) )
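A small sketch (hypothetical helper) that evaluates the right-hand side of this bound for a given 𝐾𝐿(𝑄‖𝑃), sample size 𝑚 and confidence 𝛿.

```python
import math

# Minimal sketch: the McAllester-style PAC-Bayes gap bound
#   L_D(h_Q) - L_S(h_Q) <= sqrt( (KL(Q||P) + log(2m/delta)) / (2(m-1)) ).

def pac_bayes_gap(kl_qp, m, delta):
    return math.sqrt((kl_qp + math.log(2 * m / delta)) / (2 * (m - 1)))

print(pac_bayes_gap(kl_qp=5.0, m=10_000, delta=0.05))   # ≈ 0.03
```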
KL-Divergence
𝐾𝐿(𝑄‖𝑃) = 𝔼_{ℎ∼𝑄}[ log(𝑑𝑄/𝑑𝑃) ]
= Σ_ℎ 𝑞(ℎ) log( 𝑞(ℎ)/𝑝(ℎ) ) for discrete distributions with pmfs 𝑞, 𝑝
= ∫ 𝑓_𝑄(ℎ) log( 𝑓_𝑄(ℎ)/𝑓_𝑃(ℎ) ) 𝑑ℎ for continuous distributions with densities 𝑓_𝑄, 𝑓_𝑃
• Measures how much 𝑄 deviates from 𝑃
• 𝐾𝐿(𝑄‖𝑃) ≥ 0, and 𝐾𝐿(𝑄‖𝑃) = 0 if and only if 𝑄 = 𝑃
• If 𝑄(𝐴) > 0 while 𝑃(𝐴) = 0, then 𝐾𝐿(𝑄‖𝑃) = ∞ (the other direction is allowed)
• 𝐾𝐿(𝐻₁‖𝐻₀) = information per sample for rejecting 𝐻₀ when 𝐻₁ is true
• 𝐾𝐿(𝑄‖Unif(𝑛)) = log 𝑛 − 𝐻(𝑄)
• 𝐼(𝑋; 𝑌) = 𝐾𝐿( 𝑃(𝑋, 𝑌) ‖ 𝑃(𝑋)𝑃(𝑌) )
Solomon Kullback, Richard Leibler
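A minimal sketch (toy numbers) computing KL between discrete distributions and checking the identity 𝐾𝐿(𝑄‖Unif(𝑛)) = log 𝑛 − 𝐻(𝑄).

```python
import math

# Minimal sketch: KL divergence between discrete distributions (natural log),
# with the convention 0*log(0/p) = 0, and q>0 where p=0 giving infinity.

def kl(q, p):
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0:
            continue
        if pi == 0:
            return float("inf")
        total += qi * math.log(qi / pi)
    return total

def entropy(q):
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

q = [0.7, 0.2, 0.1, 0.0]
unif = [0.25] * 4
print(kl(q, unif))                     # KL(Q || Unif(4))
print(math.log(4) - entropy(q))        # same value: log n - H(Q)
```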
PAC-Bayes
• For any distribution 𝑃 over hypotheses and any 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚, for every 𝑄:
𝐿_𝒟(ℎ_𝑄) − 𝐿_𝑆(ℎ_𝑄) ≤ √( (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (2(𝑚 − 1)) )
• Can only use hypotheses in the support of 𝑃 (otherwise 𝐾𝐿(𝑄‖𝑃) = ∞)
• For a finite ℋ with 𝑃 = Unif(ℋ)
• Consider 𝑄 = point mass on h
• 𝐾𝐿(𝑄‖𝑃) = log|ℋ|
• Generalizes cardinality bound (up to log𝑚)
• More generally, for a discrete 𝑃 and 𝑄 = point mass on h
• 𝐾𝐿(𝑄‖𝑃) = Σ_{ℎ′} 𝑞(ℎ′) log( 𝑞(ℎ′)/𝑝(ℎ′) ) = log( 1/𝑝(ℎ) ) = −log 𝑝(ℎ)
• Generalizes MDL/SRM (up to log𝑚)
• For a continuous 𝑃 (e.g. over linear predictors or polynomials):
• For 𝑄 = a point mass (or any discrete 𝑄), 𝐾𝐿(𝑄‖𝑃) = ∞
• Take ℎ_𝑄 as an average over similar hypotheses (e.g. with the same behavior on 𝑆)
PAC-Bayes
𝐿_𝒟(ℎ_𝑄) ≤ 𝐿_𝑆(ℎ_𝑄) + √( (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (2(𝑚 − 1)) )
• What learning rule does the PAC-Bayes bound suggest?
𝑄_𝜆 = argmin_𝑄 [ 𝐿_𝑆(ℎ_𝑄) + 𝜆·𝐾𝐿(𝑄‖𝑃) ]
• Theorem: 𝑞_𝜆(ℎ) ∝ 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ)) for some “inverse temperature” 𝛽
• As 𝜆 → ∞ we ignore the data, corresponding to infinite temperature, 𝛽 → 0
• As 𝜆 → 0 we insist on minimizing 𝐿𝑆 ℎ𝑄 , corresponding to zero temperature, 𝛽 → ∞, and the prediction becomes ERM (or rather, a distribution over the ERM hypothesis in the support of 𝑃)
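A minimal sketch (toy prior and training errors) of the minimizing Gibbs distribution 𝑞_𝜆(ℎ) ∝ 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ)) over a finite hypothesis set, illustrating the two temperature limits.

```python
import math

# Minimal sketch: the Gibbs distribution q(h) ∝ p(h) * exp(-beta * L_S(h))
# over a finite hypothesis set, for a given inverse temperature beta.

def gibbs(prior, train_errors, beta):
    weights = [p * math.exp(-beta * err) for p, err in zip(prior, train_errors)]
    z = sum(weights)
    return [w / z for w in weights]

prior        = [0.5, 0.25, 0.25]     # p(h)
train_errors = [0.30, 0.05, 0.40]    # L_S(h)

print(gibbs(prior, train_errors, beta=0))     # = prior (ignore the data)
print(gibbs(prior, train_errors, beta=200))   # ≈ point mass on the ERM hypothesis
```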
PAC-Bayes vs Bayes
Bayesian approach:
• Assume ℎ ∼ 𝒫,
• 𝑦₁, …, 𝑦_𝑚 i.i.d. conditioned on ℎ, with 𝑦_𝑖 | 𝑥_𝑖, ℎ = ℎ(𝑥_𝑖) w.p. 1 − 𝜈, and = −ℎ(𝑥_𝑖) w.p. 𝜈
Use the posterior:
𝑝(ℎ | 𝑆) ∝ 𝑝(ℎ)·𝑝(𝑆 | ℎ)
= 𝑝(ℎ)·Π_𝑖 𝑝(𝑥_𝑖)·𝑝(𝑦_𝑖 | 𝑥_𝑖, ℎ)
∝ 𝑝(ℎ)·Π_𝑖 ( 𝜈/(1−𝜈) )^[ℎ(𝑥_𝑖) ≠ 𝑦_𝑖]
= 𝑝(ℎ)·( 𝜈/(1−𝜈) )^(Σ_𝑖 [ℎ(𝑥_𝑖) ≠ 𝑦_𝑖])
= 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ))    where 𝛽 = 𝑚·log( (1−𝜈)/𝜈 )
PAC-Bayes vs Bayes
PAC-Bayes:
• 𝑃 encodes inductive bias, not assumption about reality
• SRM-type bound minimized by Gibbs distribution
𝑞_𝜆(ℎ) ∝ 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ))
• Post-hoc guarantee always valid (w.p. ≥ 1 − 𝛿 over 𝑆),
with no assumption about reality
𝐿_𝒟(ℎ_𝑄) ≤ 𝐿_𝑆(ℎ_𝑄) + √( (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (2(𝑚 − 1)) )
• Bound valid for any 𝑄
• If inductive bias very different from reality, bound will be high
Bayesian Approach:
• 𝒫 is a prior over reality
• Posterior given by the Gibbs distribution 𝑞_𝜆(ℎ) ∝ 𝑝(ℎ)·𝑒^(−𝛽 𝐿_𝑆(ℎ))
• Risk analysis assuming the prior
PAC-Bayes: Tighter Version
• For any distribution 𝑃 over hypotheses and any source distribution 𝒟, w.p. ≥ 1 − 𝛿 over 𝑆 ∼ 𝒟^𝑚, for every 𝑄:
𝐾𝐿( 𝐿_𝑆(ℎ_𝑄) ‖ 𝐿_𝒟(ℎ_𝑄) ) ≤ (𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (𝑚 − 1)
where 𝐾𝐿(𝛼‖𝛽) = 𝛼·log(𝛼/𝛽) + (1 − 𝛼)·log( (1−𝛼)/(1−𝛽) ) for 𝛼, 𝛽 ∈ [0,1]
𝐿_𝒟(ℎ_𝑄) ≤ 𝐿_𝑆(ℎ_𝑄) + √( 2·𝐿_𝑆(ℎ_𝑄)·(𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (𝑚 − 1) ) + 2·(𝐾𝐿(𝑄‖𝑃) + log(2𝑚/𝛿)) / (𝑚 − 1)
• This generalizes the realizable case (𝐿_𝑆(ℎ_𝑄) = 0, so only the 1/𝑚 term appears) and the agnostic case (where the 1/√𝑚 term is dominant)
• Numerically much tighter
• Can also be used as a tail bound instead of Hoeffding or Bernstein, also with cardinality- or VC-based guarantees. It arises naturally in PAC-Bayes.
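In practice the tighter statement is used by numerically inverting the binary 𝐾𝐿: given 𝐿_𝑆(ℎ_𝑄) and the right-hand side, find the largest 𝐿_𝒟 consistent with the inequality. A small sketch (hypothetical helper, binary search) follows.

```python
import math

# Minimal sketch: invert the binary KL divergence kl(a || b) to turn the
# tighter PAC-Bayes statement kl(L_S(h_Q) || L_D(h_Q)) <= eps into an explicit
# upper bound on L_D(h_Q), via binary search over b in [a, 1].

def binary_kl(a, b):
    tiny = 1e-12
    a = min(max(a, tiny), 1 - tiny)
    b = min(max(b, tiny), 1 - tiny)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def kl_inverse_upper(a, eps, iters=60):
    lo, hi = a, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binary_kl(a, mid) > eps:   # kl(a, b) is increasing in b for b >= a
            hi = mid
        else:
            lo = mid
    return hi

# Toy example: L_S(h_Q) = 0.05, KL(Q||P) = 10, m = 10000, delta = 0.05.
m, delta, kl_qp, train = 10_000, 0.05, 10.0, 0.05
eps = (kl_qp + math.log(2 * m / delta)) / (m - 1)
print(kl_inverse_upper(train, eps))   # upper bound on L_D(h_Q)
```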