Lecture 10: PAC learning
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)
Sample complexity analysis
We mentioned some relationships between sample complexity and learning:
- Larger sample size ⟹ smaller overfitting.
- A richer class H ⟹ larger sample complexity.
- ERM on linear predictors: sample complexity is O(d).
- Hard-SVM: sample complexity is O(min(d, 1/γ∗²)).
Why does this relationship hold?
What properties of H make it “richer” or “simpler” to learn?
What is the actual statistical complexity of learning with H?
We want answers that hold for all distributions D over X × Y.
The question
Given a hypothesis class H, how many examples are needed to guarantee that an ERM algorithm over H will output a low-error predictor?
Simplifying assumptions
Assume that D is realizable by H.
Definition
D is realizable by H if there exists some h∗ ∈ H such that err(h∗, D) = 0.
For any x with non-zero probability in D, its label must be h∗(x).
So, for any training sample S ∼ D^m, err(h∗, S) = 0.
Suppose we run an ERM with H on S. Then

h_S ∈ argmin_{h∈H} err(h, S).

For any S ∼ D^m, err(h_S, S) = 0.
But h_S could be different from h∗.
- E.g. thresholds: the exact threshold is not known from S.
Can we guarantee that err(h_S, D) = 0?
What can we guarantee?

We cannot guarantee that for all training samples, err(h_S, D) = 0:
- There is a chance that S turns out very different from D.
- Even if S is quite good, we cannot always find h∗ exactly.
So, we will only require that for almost all samples, err(h_S, D) is low.
- Set a confidence parameter δ ∈ (0, 1). Allow a δ-fraction of the samples to cause the algorithm to choose a very bad h_S.
- Set an error parameter ε ∈ (0, 1). Require all remaining (non-bad) samples to yield err(h_S, D) ≤ ε.
- We want a training sample size m that guarantees

P_{S∼D^m}[err(h_S, D) ≤ ε] ≥ 1 − δ.

(ε, δ)-sample complexity: the sample size m needed to get error ε with probability 1 − δ for any distribution.
What can we guarantee?
Assume that D is realizable by H: err(h∗, D) = 0.
Run an ERM algorithm. Get h_S such that err(h_S, S) = 0.
The algorithm may select any h ∈ H with err(h, S) = 0.
We need to guarantee that for any h ∈ H that the algorithm might select, err(h, D) ≤ ε.
S is a good sample if:
All h ∈ H with err(h,D) > ε have err(h,S) > 0.
We need to find a sample size m such that:
For any distribution D which is realizable by H,
P_{S∼D^m}[S is a good sample] ≥ 1 − δ.
What can we guarantee?
Fix some “bad” h_bad ∈ H, i.e. with err(h_bad, D) > ε.
What is the probability that err(h_bad, S) = 0 (i.e., h_bad “looks good”)?
err(h_bad, S) = 0 iff for all (x, y) ∈ S, h_bad(x) = y.
We have

P_{(X,Y)∼D}[h_bad(X) ≠ Y] = err(h_bad, D) > ε.

S ∼ D^m consists of independent random pairs from D, so

P[err(h_bad, S) = 0] = P_{S∼D^m}[∀(x, y) ∈ S, h_bad(x) = y]
                     = (P_{(X,Y)∼D}[h_bad(X) = Y])^m
                     ≤ (1 − ε)^m
                     ≤ e^{−εm}.
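A quick numeric sanity check of this bound (a sketch; the distribution and the bad hypothesis here are invented for illustration):

```python
import math
import random

# Sketch: estimate P[err(h_bad, S) = 0] for a made-up D and a bad hypothesis.
# X = [0, 1), the true label is 1 iff x < 0.5, and h_bad labels 1 iff
# x < 0.5 + eps, so err(h_bad, D) = eps exactly.
eps, m, trials = 0.1, 50, 100_000

def looks_good():
    # h_bad is consistent with S iff no sample point falls in the
    # disagreement region [0.5, 0.5 + eps), which has probability mass eps.
    return all(not (0.5 <= random.random() < 0.5 + eps) for _ in range(m))

estimate = sum(looks_good() for _ in range(trials)) / trials
print(f"empirical:  {estimate:.4f}")             # ~0.0052
print(f"(1-eps)^m:  {(1 - eps) ** m:.4f}")       # 0.0052
print(f"e^(-eps*m): {math.exp(-eps * m):.4f}")   # 0.0067, an upper bound
```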
But we need that no bad h looks good on S.
Guarantee for a finite hypothesis class
Assume a finite H = {h_1, h_2, …, h_k}.
- If H is not finite (e.g. thresholds), we can usually discretize it.
[Figure: the space of all samples S ∼ D^m; small circles mark the sets of samples on which err(h_1, S) = 0, err(h_7, S) = 0, and err(h_18, S) = 0.]
Suppose h_1, h_7, h_18 have err(h, D) > ε.
For samples outside the small circles, ERM cannot select a bad h.
Probability mass of the circle for h_i:

p_i := P_{S∼D^m}[err(h_i, S) = 0] ≤ e^{−εm}.

Probability mass outside the small circles: at least

1 − Σ_{i : err(h_i, D) > ε} p_i.

This is an application of the union bound: P[A or B] ≤ P[A] + P[B].
Guarantee for a finite hypothesis class

Probability that ERM selects a bad h:

P_{S∼D^m}[err(h_S, D) > ε] ≤ P[∃h ∈ H s.t. err(h, D) > ε and err(h, S) = 0]
                           ≤ Σ_{h : err(h, D) > ε} P[err(h, S) = 0]
                           ≤ Σ_{h : err(h, D) > ε} e^{−εm}
                           ≤ |H| e^{−εm}.
Our confidence parameter is δ, so we want

P_{S∼D^m}[err(h_S, D) > ε] ≤ δ.

If m ≥ (log(|H|) + log(1/δ)) / ε, then

P_{S∼D^m}[err(h_S, D) > ε] ≤ |H| e^{−εm} ≤ δ.
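This bound is easy to evaluate. A minimal helper (a sketch; the function name is ours):

```python
import math

def realizable_sample_size(h_size: int, eps: float, delta: float) -> int:
    # Smallest integer m with |H| * exp(-eps * m) <= delta,
    # i.e. m >= (log|H| + log(1/delta)) / eps.
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# E.g. |H| = 1000, eps = 0.1, delta = 0.05:
print(realizable_sample_size(1000, 0.1, 0.05))  # 100 = ceil((6.91 + 3.00) / 0.1)
```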
Probably Approximately Correct learning
Theorem
Let ε, δ ∈ (0, 1). For any finite hypothesis class H, and any distribution D over X × Y which is realizable by H, if the training sample size m has

m ≥ (log(|H|) + log(1/δ)) / ε,

then any ERM algorithm with training sample size m gets an error of at most ε, with probability at least 1 − δ over the random training samples.
The ERM algorithm Probably finds an Approximately Correct hypothesis.
This is called PAC-learning.
“With high probability” (w.h.p.) ≡ with probability at least 1− δ.
Probably Approximately Correct learning
The (ε, δ)-sample complexity for learning H in the realizable setting is at most

(log(|H|) + log(1/δ)) / ε.
For a better accuracy (= lower ε), need linearly more samples.
For higher confidence (= lower δ), need logarithmically more samples.
If H is larger, need more examples for same confidence and accuracy!
What happens if H includes all possible functions?
Overfitting: err(h_S, S) ≪ err(h_S, D).
With probability 1 − δ,

err(h_S, D) − err(h_S, S) ≤ (log(|H|) + log(1/δ)) / m.

Larger sample size, smaller H ⟹ less overfitting.
Example: Which diet allows living beyond 90?
Given a person’s diet, predict whether they will live beyond 90.
Suppose we consider d possible foods.
A person’s diet is encoded as a binary vector describing which foods they eat: X = {0, 1}^d.
Require 95% probability of getting prediction error less than 10%.
- δ = 0.05, ε = 0.1.
Sufficient training sample size: m ≥ (log(|H|) + log(1/δ)) / ε.
Set H = Boolean conjunctions of some of the features (foods) or their negations.
- E.g. h(x) = ¬x(2) ∧ x(14) ∧ x(17) ∧ ¬x(32).
Then |H| = 3^d (each feature can be included, included negated, or excluded).
Sufficient sample size: m ≥ (d log(3) + log(1/0.05)) / 0.1 ≈ 11d + 30.
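A quick check of the arithmetic (the choice d = 40 is arbitrary):

```python
import math

d, eps, delta = 40, 0.1, 0.05
# |H| = 3^d, so log|H| = d*log(3); sufficient m = (d*log(3) + log(1/delta)) / eps.
m = (d * math.log(3) + math.log(1 / delta)) / eps
print(math.ceil(m), 11 * d + 30)  # 470 470 -- matches the ~11d + 30 estimate
```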
Smaller d means m can be smaller, but the analysis holds only if D remains realizable by H.
PAC learning in the agnostic setting
The agnostic setting
Make no assumptions on D. Given ε, δ ∈ (0, 1), require that with probability at least 1 − δ over S ∼ D^m,

err(h_S, D) ≤ inf_{h∈H} err(h, D) + ε.
Try to get close to the best rule in H.
If D happens to be realizable by H, the agnostic requirement is the same as the requirement in the realizable setting.
What sample size do we need in the agnostic setting with ERM?
Sample size for the agnostic setting

Suppose we run ERM on S ∼ D^m and get h_S.
In the agnostic setting, err(h_S, S) might be non-zero.
Can we guarantee that err(h_S, S) is close to err(h_S, D)?
Fix some h ∈ H. We will bound

|err(h, S) − err(h, D)| = |(1/m) Σ_{i=1}^{m} I[h(x_i) ≠ y_i] − P_{(X,Y)∼D}[h(X) ≠ Y]|.

Define Z_i = I[h(x_i) ≠ y_i].
Z_1, …, Z_m are statistically independent.
∀i ≤ m, P[Z_i = 1] = err(h, D).
Hoeffding’s inequality

Let Z_1, …, Z_m be independent random variables over {0, 1}, where for all i ≤ m, P[Z_i = 1] = p. Then

P[|(1/m) Σ_{i=1}^{m} Z_i − p| ≥ ε] ≤ 2 exp(−2ε²m).

Conclusion: for any fixed h ∈ H, P[|err(h, S) − err(h, D)| ≥ ε] ≤ 2 exp(−2ε²m).
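A small simulation illustrating Hoeffding's inequality (a sketch; the values of p, m, and ε are arbitrary choices):

```python
import math
import random

# Sketch: check Hoeffding's bound empirically for Bernoulli(p) variables.
p, m, eps, trials = 0.3, 100, 0.1, 100_000

def deviates():
    # Draw Z_1..Z_m i.i.d. Bernoulli(p); test whether |mean - p| >= eps.
    mean = sum(random.random() < p for _ in range(m)) / m
    return abs(mean - p) >= eps

freq = sum(deviates() for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * eps**2 * m)
print(f"empirical deviation freq: {freq:.3f}")   # roughly 0.04
print(f"Hoeffding bound:          {bound:.3f}")  # 0.271
```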
Sample size for the agnostic setting
We showed that for any ε ∈ (0, 1), and any h ∈ H,
P[|err(h, S) − err(h, D)| ≥ ε] ≤ 2 exp(−2ε²m).

For the ERM algorithm, h_S ∈ argmin_{h∈H} err(h, S).
S is a good sample if for all h ∈ H, |err(h, S) − err(h, D)| ≤ ε/2.
Let h∗ ∈ argmin_{h∈H} err(h, D). For a good sample S:

err(h_S, D) ≤ err(h_S, S) + ε/2 ≤ err(h∗, S) + ε/2 ≤ err(h∗, D) + ε.
What is the probability that a sample S ∼ D^m is not good?

P[∃h ∈ H, |err(h, S) − err(h, D)| ≥ ε/2] ≤ |H| · 2 exp(−ε²m/2).

Set

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε².

Then

P_{S∼D^m}[S is a good sample] ≥ 1 − δ.
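The same kind of helper as in the realizable case (again a sketch; the function name is ours):

```python
import math

def agnostic_sample_size(h_size: int, eps: float, delta: float) -> int:
    # Smallest integer m with 2|H| * exp(-eps^2 * m / 2) <= delta,
    # i.e. m >= (2 log|H| + 2 log(2/delta)) / eps^2.
    return math.ceil(2 * (math.log(h_size) + math.log(2 / delta)) / eps**2)

# Same |H| = 1000, eps = 0.1, delta = 0.05 as the realizable example:
# the 1/eps^2 dependence makes the bound much larger (2120 vs. 100).
print(agnostic_sample_size(1000, 0.1, 0.05))
```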
Agnostic PAC learning guarantees
Theorem
Let ε, δ ∈ (0, 1). For any finite hypothesis class H, and any distribution D over X × Y, if the training sample size m has

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε²,

then any ERM algorithm with training sample size m gets an error of at most inf_{h∈H} err(h, D) + ε, with probability at least 1 − δ over the random training samples.
Compare to the sample size for ERM in the realizable case:

m ≥ (log(|H|) + log(1/δ)) / ε.

Main difference: in the agnostic setting, the dependence on ε is stronger (1/ε² instead of 1/ε).
The Bias-Complexity tradeoff

Let H ⊆ H′, both finite.
Approximation error: err_app(H, D) := inf_{h∈H} err(h, D).
For all D, err_app(H, D) ≥ err_app(H′, D).
Estimation error: Let h_{S,H} be the output of ERM for H on S.

err_est(S, H, D) := err(h_{S,H}, D) − inf_{h∈H} err(h, D).

With probability 1 − δ,

err_est(S, H, D) ≤ √((2 log(|H|) + 2 log(2/δ)) / m).

A bound on overfitting: with probability 1 − δ,

|err(h_{S,H}, S) − err(h_{S,H}, D)| ≤ √((log(|H|) + log(2/δ)) / (2m)).

Bounds for H are smaller than bounds for H′.
Trade-off: approximation error vs. estimation error/overfitting.
Computational complexity of ERM
ERM with a hypothesis class H

Given a training sample S ∼ D^m, output h_S such that

h_S ∈ argmin_{h∈H} err(h, S).

We showed a bound on the statistical complexity of ERM in the realizable and agnostic cases.
What about the computational complexity?
Naive algorithm (finite H): calculate err(h, S) for all h ∈ H, and choose the smallest.
If H is infinite, discretize it, or try all possible labelings.
But even a finite H might be very large:
- H = Boolean conjunctions.
- Sample size is O(d).
- But |H| = 3^d, so the naive algorithm takes time O(3^d).
The Computational complexity of ERM

The true computational complexity of ERM depends on H.
Sometimes it can be much better than enumerating H.
Example for the realizable setting: H = Boolean conjunctions over d features.
ERM algorithm for Boolean conjunctions (realizable setting)

input: a training sample S
output: a function h_S : X → Y

1: X_pos ← {x | (x, 1) ∈ S}
2: Start with the conjunction of all 2d literals (this h always returns 0)
3: for x ∈ X_pos do
4:   for i = 1 to d do
5:     if x(i) is positive then
6:       remove the negation of feature i from the conjunction
7:     else
8:       remove feature i from the conjunction
9:     end if
10:   end for
11: end for
12: return the final conjunction
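A minimal runnable sketch of this algorithm in Python (the representation and names are ours: examples are 0/1 tuples, and a conjunction is a set of (feature, required value) literals):

```python
def erm_conjunction(samples):
    """ERM for Boolean conjunctions in the realizable setting.
    samples: list of (x, y) with x a 0/1 tuple and y in {0, 1}.
    Returns a set of literals (i, s), meaning x(i) must equal s."""
    d = len(samples[0][0])
    # Start with all 2d literals; this conjunction always returns 0.
    literals = {(i, s) for i in range(d) for s in (0, 1)}
    for x, y in samples:
        if y == 1:
            # Remove every literal that this positive example violates.
            literals -= {(i, 1 - x[i]) for i in range(d)}
    return literals

def predict(literals, x):
    return int(all(x[i] == s for i, s in literals))

# Tiny example: target is x(0) AND NOT x(2) over d = 3 features.
S = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
h = erm_conjunction(S)
print(sorted(h))                      # [(0, 1), (2, 0)]: x(0)=1 and x(2)=0
print([predict(h, x) for x, _ in S])  # [1, 1, 0, 0] -- consistent with S
```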
Computational complexity of ERM
This ERM algorithm for Boolean conjunctions is linear in d.
- For the agnostic (non-realizable) setting: NP-hard.
Some hypothesis classes don’t have an efficient ERM algorithm, even in the realizable setting.
H = 3-DNF: all disjunctions of 3 Boolean conjunctions:

h(x) := A_1(x) ∨ A_2(x) ∨ A_3(x), where the A_i(x) are Boolean conjunctions.

- |H| ≤ 3^{3d}. Sufficient sample size: (log(|H|) + log(1/δ)) / ε ≤ (3d log(3) + log(1/δ)) / ε.
- But there is no ERM algorithm polynomial in d, unless RP = NP.
For 3-DNF, there is a trick.
- There is a class H′ which contains the class 3-DNF and has an efficient ERM algorithm.
- H′ is richer than 3-DNF: higher sample complexity.
- Tradeoff between statistical complexity and computational complexity!
Computational complexity in agnostic setting
Many hypothesis classes have an efficient algorithm in the realizable setting, but not in the agnostic setting.
- Recall linear predictors.
A possible solution:
- Try to find an h_S ∈ H with a low err(h_S, S).
- If m ≥ (log(|H|) + log(2/δ)) / (2ε²), then with probability 1 − δ,

∀h ∈ H, |err(h, S) − err(h, D)| ≤ ε.

- Use any heuristic to find an h_S.
- Get the guarantee err(h_S, D) ≤ err(h_S, S) + ε.
- No guarantee on the distance from min_{h∈H} err(h, D).
- Soft-SVM is based on the same idea (but with an infinite class).
A heuristic learning algorithm for Boolean conjunctions

Boolean conjunctions: ERM is NP-hard in the agnostic case.
A greedy heuristic (a code sketch follows this list):
- Start with a function h that is always true (the empty conjunction).
- In each iteration t:
  - Add to h the literal that would decrease err(h, S) the most.
  - Stop when no literal decreases the error anymore.
- Return the last h.
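A minimal sketch of this greedy heuristic, using the same representation as the ERM sketch above (names are ours):

```python
def greedy_conjunction(samples):
    # Greedy heuristic for Boolean conjunctions in the agnostic setting.
    # samples: list of (x, y), x a 0/1 tuple, y in {0, 1}. This is not ERM:
    # it may stop at a local minimum of the empirical error.
    d = len(samples[0][0])

    def err(literals):
        # Empirical error of the conjunction "x(i) == s for all (i, s)".
        # The empty conjunction is always true.
        preds = [int(all(x[i] == s for i, s in literals)) for x, _ in samples]
        return sum(p != y for p, (_, y) in zip(preds, samples)) / len(samples)

    h = set()
    while True:
        candidates = [(i, s) for i in range(d) for s in (0, 1) if (i, s) not in h]
        if not candidates:
            return h
        best = min(candidates, key=lambda lit: err(h | {lit}))
        if err(h | {best}) >= err(h):
            return h  # no literal strictly decreases the empirical error
        h.add(best)
```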
No guarantee that err(h_S, S) is close to min_{h∈H} err(h, S).
No guarantee that err(h_S, S) is low.
But if m ≥ (log(|H|) + log(2/δ)) / (2ε²), then with high probability,

err(h_S, D) ≤ err(h_S, S) + ε.
Infinite hypothesis classes
For a finite H:
Realizable setting: err(h_S, D) ≤ ε with probability 1 − δ if

m ≥ (log(|H|) + log(1/δ)) / ε.

Agnostic setting: err(h_S, D) ≤ inf_{h∈H} err(h, D) + ε with probability 1 − δ if

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε².
Required sample size depends on log(|H|).
What if H is infinite?
Infinite hypothesis classes
General H, could be infinite.
Need a property of H that measures its sample complexity.
VC(H): the VC-dimension of H.
- VC(H) is the size of the largest set of examples that can be labeled in all possible label combinations using hypotheses from H (a brute-force check is sketched below).
- VC(H) measures how much “variation” exists in the functions in H.
- For linear predictors: VC(H) = d.
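For intuition, a brute-force shattering check on a tiny example (a sketch; the finite threshold class below stands in for the full class of thresholds):

```python
from itertools import product

def shatters(hypotheses, points):
    # `hypotheses` shatters `points` iff every one of the 2^|points|
    # labelings is realized by some h. Brute force; tiny examples only.
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(lab in realized for lab in product([0, 1], repeat=len(points)))

# Thresholds on the line: h_t(x) = 1 iff x >= t. One point is shattered,
# two points are not (the labeling "left=1, right=0" is impossible),
# so VC(thresholds) = 1.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1, 0.5, 2, 10)]
print(shatters(thresholds, [1]))     # True
print(shatters(thresholds, [1, 3]))  # False
```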
Sample complexity bounds for infinite H use VC(H) instead of log(|H|).
Think of 2^{VC(H)} as the “effective size” of H.
2^{VC(H)} ≤ |H| for all H.
There are classes with an infinite VC-dimension:
- Such classes are not learnable for a general distribution D.
- If a class is not learnable, there is no sample size that guarantees (ε, δ) PAC-learning for all distributions D.
- All finite classes are learnable.
- Some infinite classes are learnable.
- If X is infinite, the class of all functions H = Y^X is not learnable.
PAC learning: Summary
PAC-learning addresses distribution-free learning with a given hypothesis class.
PAC analysis provides bounds on sample complexity and on overfitting.
The sample complexity of ERM is near-optimal among distribution-free algorithms.
But for many problems there is no efficient ERM algorithm.
Approaches to get efficient algorithms:
- Use a heuristic to find a hypothesis with a low error on the sample.
- Change the hypothesis class to one for which ERM can be computed efficiently.
- Change the goal: try to minimize a different loss that matches the task.