Lecture 10: PAC learning
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)
Sample complexity analysis
We mentioned some relationships between sample complexity and learning:
- Larger sample size ⟹ smaller overfitting.
- A richer class H ⟹ larger sample complexity.
- ERM on linear predictors: sample complexity is O(d).
- Hard-SVM: sample complexity is O(min(d, 1/γ∗²)).
Why does this relationship hold?
What properties of H make it “richer” or “simpler” to learn?
What is the actual statistical complexity of learning with H?
We want answers that hold for all distributions D over X × Y.
The question
Given a hypothesis class H, how many examples are needed to guarantee that an ERM algorithm over H will output a low-error predictor?
Simplifying assumptions
Assume that D is realizable by H.
Definition
D is realizable by H if there exists some h∗ ∈ H such that err(h∗, D) = 0.
For any x with non-zero probability in D, its label must be h∗(x).
So, for any training sample S ∼ D^m, err(h∗, S) = 0.
Suppose we run an ERM with H on S. Then

h_S ∈ argmin_{h∈H} err(h, S).

For any S ∼ D^m, err(h_S, S) = 0.
But h_S could be different from h∗.
- E.g. thresholds: the exact threshold is not known from S.
Can we guarantee that err(h_S, D) = 0?
What can we guarantee?

We cannot guarantee that for all training samples, err(h_S, D) = 0:
- There is a chance that S turns out very different from D.
- Even if S is quite good, we cannot always find h∗ exactly.
So, we will only require that for almost all samples, err(h_S, D) is low.
- Set a confidence parameter δ ∈ (0, 1). Allow a δ-fraction of the samples to cause the algorithm to choose a very bad h_S.
- Set an error parameter ε ∈ (0, 1). Require all remaining (non-bad) samples to yield err(h_S, D) ≤ ε.
- We want a training sample size m that guarantees

P_{S∼D^m}[err(h_S, D) ≤ ε] ≥ 1 − δ.

(ε, δ)-sample complexity: the sample size m needed to get error ε with probability 1 − δ for any distribution.
What can we guarantee?
Assume that D is realizable by H: err(h∗, D) = 0.
Run an ERM algorithm. Get h_S such that err(h_S, S) = 0.
The algorithm may select any h ∈ H with err(h, S) = 0.
We need to guarantee that for any h ∈ H that the algorithm might select, err(h, D) ≤ ε.
S is a good sample if:
All h ∈ H with err(h,D) > ε have err(h,S) > 0.
We need to find a sample size m such that:
For any distribution D which is realizable by H,
P_{S∼D^m}[S is a good sample] ≥ 1 − δ.
What can we guarantee?
Fix some “bad” h_bad ∈ H, i.e. with err(h_bad, D) > ε.
What is the probability that err(h_bad, S) = 0 (i.e., h_bad “looks good”)?
err(h_bad, S) = 0 iff for all (x, y) ∈ S, h_bad(x) = y.
We have

P_{(X,Y)∼D}[h_bad(X) ≠ Y] = err(h_bad, D) > ε.

S ∼ D^m consists of independent random pairs from D, so

P[err(h_bad, S) = 0] = P_{S∼D^m}[∀(x, y) ∈ S, h_bad(x) = y]
                     = (P_{(X,Y)∼D}[h_bad(X) = Y])^m
                     ≤ (1 − ε)^m
                     ≤ e^{−εm}.
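A quick numeric sanity check of this bound (a sketch; the distribution and the bad hypothesis here are invented for illustration):

```python
import math
import random

# Sketch: estimate P[err(h_bad, S) = 0] for a made-up D and a bad hypothesis.
# X = [0, 1), the true label is 1 iff x < 0.5, and h_bad labels 1 iff
# x < 0.5 + eps, so err(h_bad, D) = eps exactly.
eps, m, trials = 0.1, 50, 100_000

def looks_good():
    # h_bad is consistent with S iff no sample point falls in the
    # disagreement region [0.5, 0.5 + eps), which has probability mass eps.
    return all(not (0.5 <= random.random() < 0.5 + eps) for _ in range(m))

estimate = sum(looks_good() for _ in range(trials)) / trials
print(f"empirical:  {estimate:.4f}")             # ~0.0052
print(f"(1-eps)^m:  {(1 - eps) ** m:.4f}")       # 0.0052
print(f"e^(-eps*m): {math.exp(-eps * m):.4f}")   # 0.0067, an upper bound
```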
But we need that no bad h looks good on S.
Guarantee for a finite hypothesis class
Assume a finite H = {h_1, h_2, …, h_k}.
- If H is not finite (e.g. thresholds), we can usually discretize it.
[Figure: the space of all samples S ∼ D^m; small circles mark the sets of samples on which err(h_1, S) = 0, err(h_7, S) = 0, and err(h_18, S) = 0.]
Suppose h_1, h_7, h_18 have err(h, D) > ε.
For samples outside the small circles, ERM cannot select a bad h.
Probability mass of the circle for h_i:

p_i := P_{S∼D^m}[err(h_i, S) = 0] ≤ e^{−εm}.

Probability mass outside the small circles: at least

1 − Σ_{i : err(h_i, D) > ε} p_i.

This is an application of the union bound: P[A or B] ≤ P[A] + P[B].
Guarantee for a finite hypothesis class

Probability that ERM selects a bad h:

P_{S∼D^m}[err(h_S, D) > ε] ≤ P[∃h ∈ H s.t. err(h, D) > ε and err(h, S) = 0]
                           ≤ Σ_{h : err(h, D) > ε} P[err(h, S) = 0]
                           ≤ Σ_{h : err(h, D) > ε} e^{−εm}
                           ≤ |H| e^{−εm}.
Our confidence parameter is δ, so we want

P_{S∼D^m}[err(h_S, D) > ε] ≤ δ.

If m ≥ (log(|H|) + log(1/δ)) / ε, then

P_{S∼D^m}[err(h_S, D) > ε] ≤ |H| e^{−εm} ≤ δ.
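This bound is easy to evaluate. A minimal helper (a sketch; the function name is ours):

```python
import math

def realizable_sample_size(h_size: int, eps: float, delta: float) -> int:
    # Smallest integer m with |H| * exp(-eps * m) <= delta,
    # i.e. m >= (log|H| + log(1/delta)) / eps.
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# E.g. |H| = 1000, eps = 0.1, delta = 0.05:
print(realizable_sample_size(1000, 0.1, 0.05))  # 100 = ceil((6.91 + 3.00) / 0.1)
```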
Probably Approximately Correct learning
Theorem
Let ε, δ ∈ (0, 1). For any finite hypothesis class H, and any distribution D over X × Y which is realizable by H, if the training sample size m has

m ≥ (log(|H|) + log(1/δ)) / ε,

then any ERM algorithm with training sample size m gets an error of at most ε, with probability at least 1 − δ over the random training samples.
The ERM algorithm Probably finds an Approximately Correct hypothesis.
This is called PAC-learning.
“With high probability” (w.h.p.) ≡ with probability at least 1− δ.
Probably Approximately Correct learning
The (ε, δ)-sample complexity for learning H in the realizable setting is at most

(log(|H|) + log(1/δ)) / ε.
For a better accuracy (= lower ε), need linearly more samples.
For higher confidence (= lower δ), need logarithmically more samples.
If H is larger, need more examples for same confidence and accuracy!
What happens if H includes all possible functions?
Overfitting: err(h_S, S) ≪ err(h_S, D).
With probability 1 − δ,

err(h_S, D) − err(h_S, S) ≤ (log(|H|) + log(1/δ)) / m.

Larger sample size, smaller H ⟹ less overfitting.
Example: Which diet allows living beyond 90?
Given a person’s diet, predict whether they will live beyond 90.
Suppose we consider d possible foods.
A person’s diet is encoded as a binary vector describing which foods they eat: X = {0, 1}^d.
Require 95% probability of getting prediction error less than 10%.
- δ = 0.05, ε = 0.1.
Sufficient training sample size: m ≥ (log(|H|) + log(1/δ)) / ε.
Set H = Boolean conjunctions of some of the features (foods) or their negations.
- E.g. h(x) = ¬x(2) ∧ x(14) ∧ x(17) ∧ ¬x(32).
Then |H| = 3^d (each feature can be included, included negated, or excluded).
Sufficient sample size: m ≥ (d log(3) + log(1/0.05)) / 0.1 ≈ 11d + 30.
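A quick check of the arithmetic (the choice d = 40 is arbitrary):

```python
import math

d, eps, delta = 40, 0.1, 0.05
# |H| = 3^d, so log|H| = d*log(3); sufficient m = (d*log(3) + log(1/delta)) / eps.
m = (d * math.log(3) + math.log(1 / delta)) / eps
print(math.ceil(m), 11 * d + 30)  # 470 470 -- matches the ~11d + 30 estimate
```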
Smaller d means m can be smaller, but the analysis holds only if D remains realizable by H.
PAC learning in the agnostic setting
The agnostic setting
Make no assumptions on D. Given ε, δ ∈ (0, 1), require that with probability at least 1 − δ over S ∼ D^m,

err(h_S, D) ≤ inf_{h∈H} err(h, D) + ε.
Try to get close to the best rule in H.
If D happens to be realizable by H, the agnostic requirement is the same as the requirement in the realizable setting.
What sample size do we need in the agnostic setting with ERM?
Sample size for the agnostic setting

Suppose we run ERM on S ∼ D^m and get h_S.
In the agnostic setting, err(h_S, S) might be non-zero.
Can we guarantee that err(h_S, S) is close to err(h_S, D)?
Fix some h ∈ H. We will bound

|err(h, S) − err(h, D)| = |(1/m) Σ_{i=1}^{m} I[h(x_i) ≠ y_i] − P_{(X,Y)∼D}[h(X) ≠ Y]|.

Define Z_i = I[h(x_i) ≠ y_i].
Z_1, …, Z_m are statistically independent.
∀i ≤ m, P[Z_i = 1] = err(h, D).
Hoeffding’s inequality

Let Z_1, …, Z_m be independent random variables over {0, 1}, where for all i ≤ m, P[Z_i = 1] = p. Then

P[|(1/m) Σ_{i=1}^{m} Z_i − p| ≥ ε] ≤ 2 exp(−2ε²m).

Conclusion: for any fixed h ∈ H, P[|err(h, S) − err(h, D)| ≥ ε] ≤ 2 exp(−2ε²m).
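A small simulation illustrating Hoeffding's inequality (a sketch; the values of p, m, and ε are arbitrary choices):

```python
import math
import random

# Sketch: check Hoeffding's bound empirically for Bernoulli(p) variables.
p, m, eps, trials = 0.3, 100, 0.1, 100_000

def deviates():
    # Draw Z_1..Z_m i.i.d. Bernoulli(p); test whether |mean - p| >= eps.
    mean = sum(random.random() < p for _ in range(m)) / m
    return abs(mean - p) >= eps

freq = sum(deviates() for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * eps**2 * m)
print(f"empirical deviation freq: {freq:.3f}")   # roughly 0.04
print(f"Hoeffding bound:          {bound:.3f}")  # 0.271
```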
Sample size for the agnostic setting
We showed that for any ε ∈ (0, 1), and any h ∈ H,
P[|err(h, S) − err(h, D)| ≥ ε] ≤ 2 exp(−2ε²m).

For the ERM algorithm, h_S ∈ argmin_{h∈H} err(h, S).
S is a good sample if for all h ∈ H, |err(h, S) − err(h, D)| ≤ ε/2.
Let h∗ ∈ argmin_{h∈H} err(h, D). For a good sample S:

err(h_S, D) ≤ err(h_S, S) + ε/2 ≤ err(h∗, S) + ε/2 ≤ err(h∗, D) + ε.
What is the probability that a sample S ∼ D^m is not good?

P[∃h ∈ H, |err(h, S) − err(h, D)| ≥ ε/2] ≤ |H| · 2 exp(−ε²m/2).

Set

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε².

Then

P_{S∼D^m}[S is a good sample] ≥ 1 − δ.
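The same kind of helper as in the realizable case (again a sketch; the function name is ours):

```python
import math

def agnostic_sample_size(h_size: int, eps: float, delta: float) -> int:
    # Smallest integer m with 2|H| * exp(-eps^2 * m / 2) <= delta,
    # i.e. m >= (2 log|H| + 2 log(2/delta)) / eps^2.
    return math.ceil(2 * (math.log(h_size) + math.log(2 / delta)) / eps**2)

# Same |H| = 1000, eps = 0.1, delta = 0.05 as the realizable example:
# the 1/eps^2 dependence makes the bound much larger (2120 vs. 100).
print(agnostic_sample_size(1000, 0.1, 0.05))
```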
Agnostic PAC learning guarantees
Theorem
Let ε, δ ∈ (0, 1). For any finite hypothesis class H, and any distribution D over X × Y, if the training sample size m has

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε²,

then any ERM algorithm with training sample size m gets an error of at most inf_{h∈H} err(h, D) + ε, with probability at least 1 − δ over the random training samples.
Compare to the sample size for ERM in the realizable case:

m ≥ (log(|H|) + log(1/δ)) / ε.

Main difference: in the agnostic setting, the dependence on ε is stronger (1/ε² instead of 1/ε).
The Bias-Complexity tradeoff

Let H ⊆ H′, both finite.
Approximation error: err_app(H, D) := inf_{h∈H} err(h, D).
For all D, err_app(H, D) ≥ err_app(H′, D).
Estimation error: Let h_{S,H} be the output of ERM for H on S.

err_est(S, H, D) := err(h_{S,H}, D) − inf_{h∈H} err(h, D).

With probability 1 − δ,

err_est(S, H, D) ≤ √((2 log(|H|) + 2 log(2/δ)) / m).

A bound on overfitting: with probability 1 − δ,

|err(h_{S,H}, S) − err(h_{S,H}, D)| ≤ √((log(|H|) + log(2/δ)) / (2m)).

Bounds for H are smaller than bounds for H′.
Trade-off: approximation error vs. estimation error/overfitting.
Computational complexity of ERM
ERM with a hypothesis class H

Given a training sample S ∼ D^m, output h_S such that

h_S ∈ argmin_{h∈H} err(h, S).

We showed a bound on the statistical complexity of ERM in the realizable and agnostic cases.
What about the computational complexity?
Naive algorithm (finite H): calculate err(h, S) for all h ∈ H, and choose the smallest.
If H is infinite, discretize it, or try all possible labelings.
But even a finite H might be very large:
- H = Boolean conjunctions.
- Sample size is O(d).
- But |H| = 3^d, so the naive algorithm takes time O(3^d).
The Computational complexity of ERM

The true computational complexity of ERM depends on H.
Sometimes it can be much better than enumerating H.
Example for the realizable setting: H = Boolean conjunctions over d features.
ERM algorithm for Boolean conjunctions (realizable setting)

input: a training sample S
output: a function h_S : X → Y

1: X_pos ← {x | (x, 1) ∈ S}
2: Start with the conjunction of all 2d literals (this h always returns 0)
3: for x ∈ X_pos do
4:   for i = 1 to d do
5:     if x(i) is positive then
6:       remove the negation of feature i from the conjunction
7:     else
8:       remove feature i from the conjunction
9:     end if
10:   end for
11: end for
12: return the final conjunction
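A minimal runnable sketch of this algorithm in Python (the representation and names are ours: examples are 0/1 tuples, and a conjunction is a set of (feature, required value) literals):

```python
def erm_conjunction(samples):
    """ERM for Boolean conjunctions in the realizable setting.
    samples: list of (x, y) with x a 0/1 tuple and y in {0, 1}.
    Returns a set of literals (i, s), meaning x(i) must equal s."""
    d = len(samples[0][0])
    # Start with all 2d literals; this conjunction always returns 0.
    literals = {(i, s) for i in range(d) for s in (0, 1)}
    for x, y in samples:
        if y == 1:
            # Remove every literal that this positive example violates.
            literals -= {(i, 1 - x[i]) for i in range(d)}
    return literals

def predict(literals, x):
    return int(all(x[i] == s for i, s in literals))

# Tiny example: target is x(0) AND NOT x(2) over d = 3 features.
S = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
h = erm_conjunction(S)
print(sorted(h))                      # [(0, 1), (2, 0)]: x(0)=1 and x(2)=0
print([predict(h, x) for x, _ in S])  # [1, 1, 0, 0] -- consistent with S
```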
Computational complexity of ERM
This ERM algorithm for Boolean conjunctions is linear in d.
- For the agnostic (non-realizable) setting: NP-hard.
Some hypothesis classes don’t have an efficient ERM algorithm, even in the realizable setting.
H = 3-DNF: all disjunctions of 3 Boolean conjunctions:

h(x) := A_1(x) ∨ A_2(x) ∨ A_3(x), where the A_i(x) are Boolean conjunctions.

- |H| ≤ 3^{3d}. Sufficient sample size: (log(|H|) + log(1/δ)) / ε ≤ (3d log(3) + log(1/δ)) / ε.
- But there is no ERM algorithm polynomial in d, unless RP = NP.
For 3-DNF, there is a trick.
- There is a class H′ which contains the class 3-DNF and has an efficient ERM algorithm.
- H′ is richer than 3-DNF: higher sample complexity.
- Tradeoff between statistical complexity and computational complexity!
Computational complexity in agnostic setting
Many hypothesis classes have an efficient algorithm in the realizable setting, but not in the agnostic setting.
- Recall linear predictors.
A possible solution:
- Try to find an h_S ∈ H with a low err(h_S, S).
- If m ≥ (log(|H|) + log(2/δ)) / (2ε²), then with probability 1 − δ,

∀h ∈ H, |err(h, S) − err(h, D)| ≤ ε.

- Use any heuristic to find an h_S.
- Get the guarantee err(h_S, D) ≤ err(h_S, S) + ε.
- No guarantee on the distance from min_{h∈H} err(h, D).
- Soft-SVM is based on the same idea (but with an infinite class).
A heuristic learning algorithm for Boolean conjunctions

Boolean conjunctions: ERM is NP-hard in the agnostic case.
A greedy heuristic (a code sketch follows this list):
- Start with a function h that is always true (the empty conjunction).
- In each iteration t:
  - Add to h the literal that would decrease err(h, S) the most.
  - Stop when no literal decreases the error anymore.
- Return the last h.
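A minimal sketch of this greedy heuristic, using the same representation as the ERM sketch above (names are ours):

```python
def greedy_conjunction(samples):
    # Greedy heuristic for Boolean conjunctions in the agnostic setting.
    # samples: list of (x, y), x a 0/1 tuple, y in {0, 1}. This is not ERM:
    # it may stop at a local minimum of the empirical error.
    d = len(samples[0][0])

    def err(literals):
        # Empirical error of the conjunction "x(i) == s for all (i, s)".
        # The empty conjunction is always true.
        preds = [int(all(x[i] == s for i, s in literals)) for x, _ in samples]
        return sum(p != y for p, (_, y) in zip(preds, samples)) / len(samples)

    h = set()
    while True:
        candidates = [(i, s) for i in range(d) for s in (0, 1) if (i, s) not in h]
        if not candidates:
            return h
        best = min(candidates, key=lambda lit: err(h | {lit}))
        if err(h | {best}) >= err(h):
            return h  # no literal strictly decreases the empirical error
        h.add(best)
```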
No guarantee that err(h_S, S) is close to min_{h∈H} err(h, S).
No guarantee that err(h_S, S) is low.
But if m ≥ (log(|H|) + log(2/δ)) / (2ε²), then with high probability,

err(h_S, D) ≤ err(h_S, S) + ε.
Infinite hypothesis classes
For a finite H:
Realizable setting: err(h_S, D) ≤ ε with probability 1 − δ if

m ≥ (log(|H|) + log(1/δ)) / ε.

Agnostic setting: err(h_S, D) ≤ inf_{h∈H} err(h, D) + ε with probability 1 − δ if

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε².
Required sample size depends on log(|H|).
What if H is infinite?
Infinite hypothesis classes
General H, could be infinite.
Need a property of H that measures its sample complexity.
VC(H): the VC-dimension of H.
- VC(H) is the size of the largest set of examples that can be labeled in all possible label combinations using hypotheses from H (a brute-force check is sketched below).
- VC(H) measures how much “variation” exists in the functions in H.
- For linear predictors: VC(H) = d.
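For intuition, a brute-force shattering check on a tiny example (a sketch; the finite threshold class below stands in for the full class of thresholds):

```python
from itertools import product

def shatters(hypotheses, points):
    # `hypotheses` shatters `points` iff every one of the 2^|points|
    # labelings is realized by some h. Brute force; tiny examples only.
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(lab in realized for lab in product([0, 1], repeat=len(points)))

# Thresholds on the line: h_t(x) = 1 iff x >= t. One point is shattered,
# two points are not (the labeling "left=1, right=0" is impossible),
# so VC(thresholds) = 1.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1, 0.5, 2, 10)]
print(shatters(thresholds, [1]))     # True
print(shatters(thresholds, [1, 3]))  # False
```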
Sample complexity bounds for infinite H use VC(H) instead of log(|H|).
Think of 2^{VC(H)} as the “effective size” of H.
2^{VC(H)} ≤ |H| for all H.
There are classes with an infinite VC-dimension:
- Such classes are not learnable for a general distribution D.
- If a class is not learnable, there is no sample size that guarantees (ε, δ) PAC-learning for all distributions D.
- All finite classes are learnable.
- Some infinite classes are learnable.
- If X is infinite, the class of all functions H = Y^X is not learnable.
PAC learning: Summary
PAC-learning addresses distribution-free learning with a given hypothesis class.
PAC analysis provides bounds on sample complexity and on overfitting.
The sample complexity of ERM is near-optimal among distribution-free algorithms.
But for many problems there is no efficient ERM algorithm.
Approaches to get efficient algorithms:
- Use a heuristic to find a hypothesis with a low error on the sample.
- Change the hypothesis class to one for which ERM can be computed efficiently.
- Change the goal: try to minimize a different loss that matches the task.