
CS 886 Applied Machine Learning
Learning Theory Light

Dan Lizotte

University of Waterloo

23 May 2013


Lecture 12: Learning theory

• True error vs. training error of a hypothesis

• VC-dimension


Binary classification: The golden goal

Given:

• The set of all possible instances X

• A target function (or concept) f : X → {0, 1}

• A set of hypotheses H

• A set of training examples D (containing positive and negative examples of the target function)

〈x1, f(x1)〉, . . . , 〈xn, f(xn)〉

Determine: A hypothesis h ∈ H such that h(x) = f(x) for all x ∈ X.


Approximate Concept Learning

• Requiring a learner to acquire the right concept is too strict

• Instead, we will allow the learner to produce a good approximation to the actual concept

• For any instance space, there is a non-uniform likelihood of seeing different instances

• We assume that there is a fixed probability distribution P on the space of instances X

• The learner is trained and tested on examples whose inputs are drawn independently and randomly according to P.


Recall: Two Notions of Error

• The empirical error of hypothesis h with respect to target concept f and a data sample tells how often h(x) ≠ f(x) over the sample

• The true error of hypothesis h with respect to target concept f tells how often h(x) ≠ f(x) over future, unseen instances (but drawn according to P)

• Questions:

• Can we bound the true error of a hypothesis given only empirical error?

• How many examples are needed for a good approximation? This is called the sample complexity of the problem


True Error of a Hypothesis

[Figure: the instance space X, with regions labeled + and − by the target f and by the hypothesis h; the shaded area marks where f and h disagree.]


True Error Definition

• The set of instances on which the target concept and the hypothesis disagree is denoted: S = {x | h(x) ≠ f(x)}

• The true error of h with respect to f is:

    ∑_{x∈S} P(x)

This is the probability of making an error on an instance randomly drawn from X according to P

• Let ε ∈ (0, 1) be an error tolerance parameter. We say that h is a good approximation of f (to within ε) if and only if the true error of h is less than ε.


Example: Rote Learner

• Let X = {0, 1}^p. Let P be the uniform distribution over X.

• Let the concept f be generated by randomly assigning a label to every instance in X.

• Let D ⊂ X be a set of training instances. The hypothesis h is generated by memorizing D and giving a random answer otherwise.

• What is the empirical error of h on D?

• What is the true error of h?


Empirical risk minimization

• Suppose we are given a hypothesis class H

• We have a magical learning machine that can sift through H and output the hypothesis with the smallest training error, h_emp

• This process is called empirical risk minimization

• Is this a good idea?

• What can we say about the error of the other hypotheses in H?


First tool: The union bound

• Let E1, . . . , Ek be k different events (not necessarily independent). Then:

    P(E1 ∪ · · · ∪ Ek) ≤ P(E1) + · · · + P(Ek)

• Note that this is usually loose, as events may be correlated
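A quick numerical check of the union bound (this snippet is an added illustration, not part of the original slides; the two events and the sample size are arbitrary choices): even for two highly correlated events the bound holds, but it is loose.

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal(100_000)

    # Two correlated (in fact nested) events: E1 = {z > 1.0}, E2 = {z > 0.5}
    e1 = z > 1.0
    e2 = z > 0.5

    p_union = np.mean(e1 | e2)           # P(E1 ∪ E2)
    p_sum = np.mean(e1) + np.mean(e2)    # P(E1) + P(E2), the union bound
    print(f"P(E1 ∪ E2) ≈ {p_union:.3f}, union bound ≈ {p_sum:.3f}")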


Second tool: Hoeffding bound

• Let Z1, . . . , Zn be n independent identically distributed (iid) binary variables, drawn from a Bernoulli (weighted coin) distribution:

    P(Zi = 1) = φ and P(Zi = 0) = 1 − φ

• Let φ̂ be the sample mean:

    φ̂ = (1/n) ∑_{i=1}^n Zi

• Let ε be a fixed error tolerance parameter. Then:

    P(|φ̂ − φ| > ε) ≤ 2e^(−2ε²n)

• In other words, if you have lots of examples, the empirical mean is a good estimator of the true probability.

• Note: other similar concentration inequalities can be used (e.g. Chernoff, Bernstein, etc.)
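As an added illustration (not from the original slides; the values of φ, n and ε are arbitrary), the following sketch estimates the deviation probability by simulation and compares it with the Hoeffding bound 2e^(−2ε²n).

    import numpy as np

    rng = np.random.default_rng(1)
    phi, n, eps, trials = 0.3, 200, 0.1, 20_000

    # Each row is one sample of n Bernoulli(phi) draws; phi_hat is its sample mean.
    z = rng.random((trials, n)) < phi
    phi_hat = z.mean(axis=1)

    observed = np.mean(np.abs(phi_hat - phi) > eps)   # empirical deviation frequency
    bound = 2 * np.exp(-2 * eps**2 * n)               # Hoeffding upper bound
    print(f"P(|phi_hat - phi| > {eps}) ≈ {observed:.4f} <= {bound:.4f}")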


Finite hypothesis space

• Suppose we are considering a finite hypothesis class H = {h1, . . . , hk}

• Choose a hypothesis hi ∈ H at random.

• Suppose we sample data according to our distribution and let Zj = 1 iff hi(xj) ≠ yj

• So e(hi) (the true error of hi) is the expected value of Zj

• Let ê(hi) = (1/n) ∑_{j=1}^n Zj (this is the empirical error of hi on the data set we have)

• Using the Hoeffding bound, we have:

    P(|ê(hi) − e(hi)| > ε) ≤ 2e^(−2ε²n)

• So, if we have lots of data, the empirical error of a hypothesis hi will be close to its true error with high probability. NOTE: WE DID NOT TRAIN ON THE DATA.


What about all hypotheses?

• We showed that the empirical error is “close” to the true error for one hypothesis.

• Let Ei denote the event |ê(hi) − e(hi)| > ε

• Can we guarantee this is true for all hypotheses?

    P(∃hi ∈ H, |ê(hi) − e(hi)| > ε) = P(E1 ∪ · · · ∪ Ek)
                                    ≤ ∑_{i=1}^k P(Ei)          (union bound)
                                    ≤ ∑_{i=1}^k 2e^(−2ε²n)     (shown before)
                                    = 2k e^(−2ε²n)
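The following added sketch (an idealized simulation, not from the slides: each of k hypotheses is modeled by an independent Bernoulli error indicator per example) estimates how often the worst deviation over the class exceeds ε and compares it with 2k e^(−2ε²n).

    import numpy as np

    rng = np.random.default_rng(2)
    k, n, eps, trials = 20, 2_000, 0.05, 1_000

    true_err = rng.uniform(0.1, 0.4, size=k)   # e(h_i), one true error per hypothesis

    exceed = 0
    for _ in range(trials):
        # Z[i, j] = 1 iff hypothesis i errs on example j (independent across i here)
        z = rng.random((k, n)) < true_err[:, None]
        emp_err = z.mean(axis=1)               # empirical error ê(h_i) on this sample
        exceed += np.max(np.abs(emp_err - true_err)) > eps

    bound = 2 * k * np.exp(-2 * eps**2 * n)
    print(f"P(max_i |ê - e| > {eps}) ≈ {exceed / trials:.4f} <= {bound:.4f}")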


A uniform convergence bound

• We showed that:

    P(∃hi ∈ H, |ê(hi) − e(hi)| > ε) ≤ 2k e^(−2ε²n)

• So we have:

    1 − P(∃hi ∈ H, |ê(hi) − e(hi)| > ε) ≥ 1 − 2k e^(−2ε²n)

or, in other words:

    P(∀hi ∈ H, |ê(hi) − e(hi)| ≤ ε) ≥ 1 − 2k e^(−2ε²n)

• This is called a uniform convergence result because the bound holds for all hypotheses

• What is this good for? If all the empirical errors are close to the true errors, the hi with the best empirical error is probably the hi with the best true error.


Sample complexity

• Suppose we want to guarantee that with probability at least 1 − δ, the sample (training) error is within ε of the true error.

• From our bound, we can set δ ≥ 2k e^(−2ε²n)

• Solving for n, we get that the number of samples should be:

    n ≥ (1/(2ε²)) log(2k/δ) = (1/(2ε²)) log(2|H|/δ)

• So the number of samples needed is logarithmic in the size of the hypothesis space
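For concreteness, here is a small helper (added as a sketch; the function name and the example values of ε and δ are mine, and the natural logarithm is used, matching the e in the Hoeffding bound) that evaluates the sample-complexity formula.

    import math

    def sample_complexity(hyp_count: int, eps: float, delta: float) -> int:
        """Smallest n satisfying n >= (1 / (2 eps^2)) * log(2 * |H| / delta)."""
        return math.ceil(math.log(2 * hyp_count / delta) / (2 * eps**2))

    # Example: |H| = 1000, error tolerance eps = 0.1, confidence 1 - delta = 0.95
    print(sample_complexity(1000, eps=0.1, delta=0.05))   # -> 530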


Example: Conjunctions of Boolean Literals

• Let H be the space of all pure conjunctive formulae over p Boolean attributes. Then |H| = 3^p (why? each attribute can appear positively, appear negated, or be left out)

• From the previous result, we get:

    n ≥ (1/(2ε²)) log(2|H|/δ) = (1/(2ε²)) (p log 3 + log(2/δ))

• This is linear in p

• Hence, conjunctions are “easy to learn”


Example: Arbitrary Boolean functions

• Let H be the space of all Boolean functions on p Boolean attributes. Then |H| = 2^(2^p) (why? there are 2^p possible inputs, and each can independently be mapped to 0 or 1)

• From the previous result, we get:

    n ≥ (1/(2ε²)) log(2|H|/δ) = (1/(2ε²)) (2^p log 2 + log(2/δ))

• This is exponential in p

• Hence, arbitrary Boolean functions are “hard to learn”
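Plugging the two examples into the same formula (an added illustration; p, ε, δ are arbitrary choices, and log|H| is computed analytically to avoid forming the astronomically large 2^(2^p)) makes the contrast concrete.

    import math

    def n_from_log_h(log_h: float, eps: float, delta: float) -> int:
        """n >= (1 / (2 eps^2)) * (log|H| + log(2 / delta)), given log|H| directly."""
        return math.ceil((log_h + math.log(2 / delta)) / (2 * eps**2))

    p, eps, delta = 30, 0.1, 0.05
    log_conj = p * math.log(3)           # |H| = 3^p      (conjunctions)
    log_bool = (2 ** p) * math.log(2)    # |H| = 2^(2^p)  (all Boolean functions)

    print(n_from_log_h(log_conj, eps, delta))   # ~1.8e3  : linear in p
    print(n_from_log_h(log_bool, eps, delta))   # ~3.7e10 : exponential in p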


Another application: Bounding the true error

• Our inequality revisited:

    P(∀hi ∈ H, |ê(hi) − e(hi)| ≤ ε) ≥ 1 − 2k e^(−2ε²n) = 1 − δ

• Suppose we hold n and δ fixed, and we solve for ε. Then we get:

    |ê(hi) − e(hi)| ≤ √((1/(2n)) log(2k/δ))

inside the probability term.

• Can we now prove anything about the generalization power of the empirical risk minimization algorithm?
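As a sketch (the function name and numbers below are my own illustration), the resulting ε, i.e. the half-width of the confidence interval around every empirical error, is easy to compute:

    import math

    def generalization_gap(n: int, hyp_count: int, delta: float) -> float:
        """eps = sqrt((1 / (2n)) * log(2 * |H| / delta)); with prob. >= 1 - delta,
        every h in H satisfies |ê(h) - e(h)| <= eps."""
        return math.sqrt(math.log(2 * hyp_count / delta) / (2 * n))

    # n = 10,000 examples, |H| = 1000 hypotheses, 95% confidence
    print(f"{generalization_gap(10_000, 1000, 0.05):.4f}")   # ~0.023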


Empirical risk minimization

Let h∗ be the best hypothesis in our class (in terms of true error). Based on our uniform convergence result, we can bound the true error of h_emp as follows:

    e(h_emp) ≤ ê(h_emp) + ε                        (for ε as per previous slide)
             ≤ ê(h∗) + ε                           (because h_emp has better training error than any other hypothesis)
             ≤ e(h∗) + 2ε                          (by using the result on h∗)
             ≤ e(h∗) + 2 √((1/(2n)) log(2|H|/δ))   (from previous slide)

This bounds how much worse h_emp is, w.r.t. the best hypothesis we can hope for!
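Numerically (again an added sketch with made-up values; the factor of two comes from the derivation above), the excess true error of h_emp over h∗ shrinks like 1/√n:

    import math

    def erm_excess_risk(n: int, hyp_count: int, delta: float) -> float:
        """Bound on e(h_emp) - e(h*): twice the uniform-convergence gap eps."""
        return 2 * math.sqrt(math.log(2 * hyp_count / delta) / (2 * n))

    for n in (100, 1_000, 10_000, 100_000):
        print(n, round(erm_excess_risk(n, 1000, 0.05), 4))
    # The excess risk decays like 1/sqrt(n): roughly 0.46, 0.15, 0.046, 0.015 here.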


Overfitting/Underfitting Revisited

• We showed that, given n examples, with probability at least 1 − δ,

    e(h_emp) ≤ (min_{h∈H} e(h)) + 2 √((1/(2n)) log(2|H|/δ))

• Suppose now that we are considering two hypothesis classes H ⊆ H′

• The first term would be smaller for H′ (we have a larger hypothesis class, hence more able to fit the ‘truth’)

• The second term would be larger for H′ (overfitting is more likely)

• Note, though, that if H is infinite, this result is not very useful...


Shattering a set of instances

• A dichotomy of a set S is a partition of S into two disjoint subsets.

• A set of instances D is shattered by hypothesis space H if and only if for every dichotomy of D there exists some hypothesis in H consistent with this dichotomy.

• This idea will let us measure the complexity of a hypothesis space even if it is infinite.


Example: Three instances

Can these three points be shattered by the hypothesis space consisting of a set of circles?

[Figures: the three points are shown under several assignments of + and − labels, with a circle realizing each of the dichotomies.]

What about 4 points?


Example: Four instances

• These cannot be shattered: if we label the two farthest-apart points as +, the circle that contains them will necessarily contain the other points

• So circles can shatter one data set of three points (the one we’ve been analyzing), but there is no set of four points that can be shattered by circles (check this yourself!)

• Note that not all sets of size 3 can be shattered!

• We say that the VC dimension of circles is 3


The Vapnik-Chervonenkis (VC) Dimension

• The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.

• VC dimension measures how many distinctions the hypotheses from H are able to make

• This is, in some sense, the number of “effective degrees of freedom”


Establishing the VC dimension

• Play the following game with the adversary:

• You are allowed to choose k points. This actually gives you a lot of freedom!

• The adversary then labels these points any way it wants

• You now have to produce a hypothesis, out of your hypothesis class, which correctly produces these labels.

If you are able to succeed at this game, the VC dimension is at least k.

• To show that it is no greater than k, you have to show that for any set of k + 1 points, the adversary can find a labeling that you cannot correctly reproduce with any of your hypotheses.


VC dimension of linear decision surfaces

• Consider a linear threshold unit in the plane.

• First, show there exists a set of 3 points that can be shattered by a line ⇒ VC dimension of lines in the plane is at least 3.

• To show it is at most 3, show that NO set of 4 points can be shattered.

• For a p-dimensional space, the VC dimension of linear estimators is p + 1.
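The shattering game can be played mechanically for lines in the plane. The sketch below is my own illustration (it assumes scikit-learn is available and uses a nearly hard-margin linear SVM as the hypothesis-producing player): it confirms that three non-collinear points can be shattered, while the XOR labeling of four points cannot be realized by any line.

    import itertools
    import numpy as np
    from sklearn.svm import SVC

    def separable(points: np.ndarray, labels: np.ndarray) -> bool:
        """True iff some linear threshold unit reproduces the given labels exactly."""
        if len(set(labels)) < 2:           # all-same labelings are trivially realizable
            return True
        clf = SVC(kernel="linear", C=1e9)  # (nearly) hard-margin linear classifier
        clf.fit(points, labels)
        return bool((clf.predict(points) == labels).all())

    three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])            # non-collinear
    print(all(separable(three, np.array(lab))
              for lab in itertools.product([0, 1], repeat=3)))        # True: shattered

    four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
    print(separable(four, np.array([1, 1, 0, 0])))                    # False: XOR labeling

The very large C is used so that the soft-margin SVM behaves like a hard-margin separator: when the labeling is linearly separable it fits the training points exactly, and when it is not (the XOR case) at least one training point is misclassified.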


Error bounds using VC dimension

• Recall our error bound in the finite case:

    e(h_emp) ≤ (min_{h∈H} e(h)) + 2 √((1/(2n)) log(2|H|/δ))

• Vapnik showed a similar result, but using VC dimension instead of the size of the hypothesis space:

• For a hypothesis class H with VC dimension VC(H), given n examples, with probability at least 1 − δ, we have:

    e(h_emp) ≤ (min_{h∈H} e(h)) + O( √( (VC(H)/n) log(n/VC(H)) + (1/n) log(1/δ) ) )


Remarks on VC dimension

• The previous bound is tight up to log factors. In other words, for hypothesis classes with large VC dimension, we can show that there exists some data distribution which will produce a bad approximation.

• For many reasonable hypothesis classes (e.g. linear approximators) the VC dimension is linear in the number of “parameters” of the hypothesis.

• This shows that to learn “well”, we need a number of examples that is linear in the VC dimension (so linear in the number of parameters, in this case).

• An important property: if H1 ⊆ H2 then VC(H1) ≤ VC(H2).


Structural risk minimization

    e(h_emp) ≤ (min_{h∈H} e(h)) + O( √( (VC(H)/n) log(n/VC(H)) + (1/n) log(1/δ) ) )

• We have used this bound to bound the true error of the hypothesis with the smallest training error

• Why not use the bound directly to get the best hypothesis?

• We can measure the training error, and add to that the quantity suggested by the rightmost term

• We pick the hypothesis that is best in terms of this sum!

• This approach is called structural risk minimization, and can be used instead of cross-validation
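A minimal sketch of the idea (not code from the course; the structure of the penalty term, the candidate classes keyed by their VC dimension, and their training errors are all invented for illustration): add the complexity term to each class's training error and pick the minimizer.

    import math

    def vc_penalty(vc_dim: int, n: int, delta: float) -> float:
        """Complexity term ~ sqrt((d/n) log(n/d) + (1/n) log(1/delta)), constants omitted."""
        return math.sqrt((vc_dim / n) * math.log(n / vc_dim) + math.log(1 / delta) / n)

    def srm_select(train_errors: dict[int, float], n: int, delta: float = 0.05) -> int:
        """Return the VC dimension of the class minimizing training error + penalty."""
        return min(train_errors, key=lambda d: train_errors[d] + vc_penalty(d, n, delta))

    # Hypothetical nested classes (keyed by VC dimension) and their training errors:
    train_errors = {2: 0.30, 5: 0.18, 20: 0.11, 100: 0.05, 1000: 0.01}
    print(srm_select(train_errors, n=2_000))   # -> 5: an intermediate class wins

Note how the most complex class has the lowest training error but is not selected: the penalty plays the role that a validation set would play in cross-validation.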
