The VC dimension
Nuno Vasconcelos ECE Department, UCSD
2
Margins and VC dimension
we have been talking about techniques that achieve good generalization by maximizing the margin
it turns out that the quantity of interest is the so-called VC dimension of the family of functions implemented by the learning machine
the margin allows us to control this dimension, and this is what makes it important
in the next two lectures we are going to formalize this a little better
disclaimer: the topic is a bit technical, we will only do a superficial review of the main results
the idea is to get the big picture
3
Loss functions and Risk
goal of the learning machine: to find the set of parameters that minimizes the risk (expected value of the loss)

    R[f] = E_{X,Y}[ L(y, f(x)) ] = ∫ L(y, f(x)) P_{X,Y}(x, y) dx dy

in practice it is impossible to evaluate the risk, because we do not know what P_{X,Y}(x, y) is. all we have is a training set

    D = { (x₁, y₁), ..., (xₙ, yₙ) }

we estimate the risk by the empirical risk on this training set

    R_emp[f] = (1/n) Σᵢ₌₁ⁿ L(yᵢ, f(xᵢ))
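in code, the empirical risk under the 0/1 loss is just the average number of mistakes on the training set; a minimal Python sketch (the names `zero_one_loss` and `empirical_risk` are mine, for illustration):

```python
def zero_one_loss(y, fx):
    """0/1 loss: 1 if the prediction disagrees with the label, else 0."""
    return 0 if y == fx else 1

def empirical_risk(f, D):
    """Average loss of classifier f over a training set D = [(x, y), ...]."""
    return sum(zero_one_loss(y, f(x)) for x, y in D) / len(D)

# a toy classifier: predict the sign of x
f = lambda x: 1 if x >= 0 else -1
D = [(-2, -1), (-1, -1), (0.5, 1), (3, 1), (-0.5, 1)]  # last point is misclassified
# empirical_risk(f, D) is 1/5 = 0.2
```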
4
Empirical risk minimization
the ERM principle recommends choosing the function f that minimizes this empirical risk
we have already seen that this is a bad idea when F, the set of possible f, is unconstrained
here is an f* that will always achieve zero empirical risk

    f*(x) = yᵢ,  if x = xᵢ
            0,   otherwise

f* clearly has zero empirical error, but
• on a different sample it is unlikely that we will get the same xᵢ
• outside the training set it always says “0”, which is clearly bad
the point is that we have to constrain F to have meaningful generalization
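the memorizing f* above is easy to implement, which makes the failure mode concrete; a Python sketch (the helper `make_memorizer` is my own name for it):

```python
def make_memorizer(D):
    """f*(x) = y_i if x equals some training point x_i, and 0 otherwise."""
    table = {x: y for x, y in D}
    return lambda x: table.get(x, 0)

D = [(1, 1), (2, -1), (3, 1)]
f_star = make_memorizer(D)

train_errors = sum(f_star(x) != y for x, y in D)  # zero empirical risk
off_sample = f_star(4)                            # always 0 outside the sample
```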
5
The law of large numbers
let’s start by trying to understand the relationship between risk and empirical risk
since we are dealing with classification we use the loss

    L(yᵢ, f(xᵢ)) = |yᵢ − f(xᵢ)|

if we have a sequence of iid errors ξᵢ = |f(xᵢ) − yᵢ|, the convergence of the empirical risk

    (1/n) Σᵢ ξᵢ

to the risk E[ξ] is determined by the law of large numbers
there are many variations; we will do here the so-called Hoeffding bound
for this we need some intermediate results
6
The Markov inequality
the basic result is the Markov inequality
Markov: let ξ be a non-negative r.v. with distribution P_ξ(ξ). Then, for all λ > 0,

    P(ξ ≥ λ E[ξ]) ≤ 1/λ

Proof:

    E[ξ] = ∫₀^∞ ξ dP_ξ(ξ) ≥ ∫_{λE[ξ]}^∞ ξ dP_ξ(ξ) ≥ λ E[ξ] ∫_{λE[ξ]}^∞ dP_ξ(ξ) = λ E[ξ] P(ξ ≥ λ E[ξ])

dividing both sides by λ E[ξ] gives the result
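a quick numeric sanity check of Markov’s inequality (a sketch of mine, not from the slides, using an Exp(1) variable, for which E[ξ] = 1):

```python
import random

random.seed(0)

lam = 2.0           # the λ in P(ξ ≥ λ E[ξ]) ≤ 1/λ
n_trials = 100_000

# ξ ~ Exp(1), so E[ξ] = 1 and the event is {ξ ≥ λ}
samples = [random.expovariate(1.0) for _ in range(n_trials)]
emp_prob = sum(s >= lam for s in samples) / n_trials  # ≈ e^{-2} ≈ 0.135
markov_bound = 1.0 / lam                              # 0.5, much looser
```

for this distribution the inequality is quite loose, as expected from such a generic bound.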
7
The law of large numbers
we will also use the following
Lemma: let x be a r.v. with E[x] = 0 and a ≤ x ≤ b. Then, for s > 0,

    E[e^{sx}] ≤ e^{s²(b−a)²/8}

Proof:
• by convexity of the exponential, for x ∈ [a, b],

    e^{sx} ≤ ((b−x)/(b−a)) e^{sa} + ((x−a)/(b−a)) e^{sb}

• and, from E[x] = 0,

    E[e^{sx}] ≤ (b/(b−a)) e^{sa} − (a/(b−a)) e^{sb} = (1−p) e^{sa} + p e^{sb},   with p = −a/(b−a)
8
The law of large numbers
• continuing,

    (1−p) e^{sa} + p e^{sb} = e^{−pu} (1 − p + p e^{u}) = e^{φ(u)}

  with u = s(b−a) and φ(u) = −pu + log(1 − p + p e^{u})
• taking derivatives,

    ∂φ/∂u = −p + p e^{u} / (1 − p + p e^{u}),   so that ∂φ/∂u(0) = 0,

  and

    ∂²φ/∂u² = p(1−p) e^{u} / (1 − p + p e^{u})² ≤ 1/4

• by Taylor expansion with remainder, there is a point ū such that

    φ(u) = φ(0) + u ∂φ/∂u(0) + (u²/2) ∂²φ/∂u²(ū) ≤ u²/8 = s²(b−a)²/8
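the final inequality φ(u) ≤ u²/8 can be spot-checked numerically on a grid of (u, p) values; a small sketch of mine, not part of the proof:

```python
import math

def phi(u, p):
    """phi(u) = -p*u + log(1 - p + p*e^u), as in the Hoeffding lemma proof."""
    return -p * u + math.log(1 - p + p * math.exp(u))

# worst violation of phi(u) <= u^2/8 over a grid of p and u values
worst_gap = max(
    phi(u, p) - u * u / 8.0
    for p in [0.1, 0.25, 0.5, 0.75, 0.9]
    for u in [k / 10.0 for k in range(-50, 51)]
)
# worst_gap should never be (meaningfully) positive
```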
9
The law of large numbers
we are now ready to show Hoeffding’s result
Theorem: if the xᵢ are iid with xᵢ ∈ [a, b] and Sₙ = (1/n) Σᵢ xᵢ, then for all ε > 0 we have

    P{ Sₙ − E[Sₙ] ≥ ε } ≤ e^{−2nε²/(b−a)²}

Proof:
• from Markov, for a non-negative r.v. x,

    P(x ≥ ε) ≤ E[x]/ε

• hence, for all s > 0 and ε > 0,

    P(x ≥ ε) = P(e^{sx} ≥ e^{sε}) ≤ E[e^{sx}] e^{−sε}

• and

    P( Sₙ − E[Sₙ] ≥ ε ) ≤ e^{−sε} E[ e^{s(Sₙ − E[Sₙ])} ]
10
The law of large numbers
• expanding the expectation,

    P( Sₙ − E[Sₙ] ≥ ε )
      ≤ e^{−sε} E[ exp{ (s/n) Σᵢ (xᵢ − E[xᵢ]) } ]
      = e^{−sε} Πᵢ E[ exp{ (s/n)(xᵢ − E[xᵢ]) } ]        (xᵢ iid)
      ≤ e^{−sε} Πᵢ exp{ s²(b−a)²/(8n²) }                 (by the lemma)
      = e^{−sε} exp{ s²(b−a)²/(8n) }

• choosing s = (4εn)/(b−a)²,

    P( Sₙ − E[Sₙ] ≥ ε ) ≤ e^{−2nε²/(b−a)²}
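Hoeffding’s bound can be checked by simulation with fair coin flips (so a = 0, b = 1); a sketch of mine, with arbitrarily chosen parameters:

```python
import math
import random

random.seed(1)

n, eps, n_trials = 100, 0.1, 20_000
p = 0.5  # fair Bernoulli variables, so E[S_n] = 0.5

def deviation_exceeds():
    """One experiment: does the sample mean exceed its expectation by eps?"""
    s_n = sum(random.random() < p for _ in range(n)) / n
    return s_n - p >= eps

emp_prob = sum(deviation_exceeds() for _ in range(n_trials)) / n_trials
bound = math.exp(-2 * n * eps**2)  # e^{-2} ≈ 0.135 for these parameters
```

the empirical probability (around 0.03 for these parameters) sits comfortably below the bound.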
11
Sidenote
there are many variations on the law of large numbers, this is only one
they all have the flavor that

    P{ |Sₙ − E[Sₙ]| ≥ ε } ≤ O(e^{−n})

this is an amazing result when you think of it
• estimating expectations by empirical means converges exponentially fast
note that this is not an abstract theoretical result
• it is the reason why it is worth studying statistics
• why would we want to compute statistics if they did not converge to the true quantities?
12
Empirical risk vs risk
noting that

    R_emp[f] = (1/n) Σᵢ ξᵢ = Sₙ   and   R[f] = E[ξ] = E[Sₙ]

the theorem

    P( Sₙ − E[Sₙ] ≥ ε ) ≤ e^{−2nε²/(b−a)²}

seems to indicate that (since ξᵢ ∈ [0, 1], so b − a = 1)

    P( R[f] − R_emp[f] ≥ ε ) ≤ e^{−2nε²}

i.e. the empirical risk converges to the risk exponentially fast
this is the best that one could hope for; there seems to be no reason to use anything other than ERM
13
Empirical risk vs risk
since we know that ERM is not that good, something must be wrong
the problem is that the bounds assume independent errors ξᵢ
since we are choosing f (by ERM) so that the mean of the ξᵢ is as small as possible, this is not the case
we need to look for alternative ways to understand the relationship between the two risks
for them to be equivalent we need the ERM solution f* to converge to the lowest value of R[f]
this turns out not to be possible unless we restrict F
14
Convergence
consider the values of the empirical risk and the risk over F
[figure: the curves of R and R_emp over F, with f_opt at the minimum of R and f* at the minimum of R_emp]
f_opt minimizes the risk, f* the empirical risk for a particular sample
clearly

    R[f] − R[f_opt] ≥ 0, ∀f ∈ F   and   R_emp[f] − R_emp[f*] ≥ 0, ∀f ∈ F

and, in particular,

    R[f*] − R[f_opt] ≥ 0   and   R_emp[f_opt] − R_emp[f*] ≥ 0
15
Convergence
or

    R[f*] − R[f_opt] ≥ 0,   R_emp[f_opt] − R_emp[f*] ≥ 0

hence

    0 ≤ R[f*] − R[f_opt]
      = R[f*] − R_emp[f*] + R_emp[f*] − R[f_opt]
      ≤ R[f*] − R_emp[f*] + R_emp[f_opt] − R[f_opt]
      ≤ sup_{f∈F} ( R[f] − R_emp[f] ) + ( R_emp[f_opt] − R[f_opt] )

note that, because f_opt is a fixed function, independent of the sample, we can use Hoeffding on the last term
16
Convergence
which leads to

    0 ≤ R[f*] − R[f_opt] ≤ sup_{f∈F} ( R[f] − R_emp[f] ) + ( R_emp[f_opt] − R[f_opt] )

hence, since R_emp[f_opt] − R[f_opt] → 0 as n → ∞ (by Hoeffding), if

    sup_{f∈F} ( R[f] − R_emp[f] ) → 0 as n → ∞    (*)

then

    R[f*] − R[f_opt] → 0   and   R_emp[f*] − R[f*] → 0    (**)

it turns out that (*) is also a necessary condition for (**)
17
Consistency of ERM
Theorem: (VC) The condition

    lim_{n→∞} P[ sup_{f∈F} ( R[f] − R_emp[f] ) > ε ] = 0,   ∀ε > 0

is necessary and sufficient for consistency of ERM
this shows that consistency depends on the class of functions F, but is not terribly useful in practice
we next look at properties of F that ensure convergence
the first thing to do is to try to bound this quantity
18
VC Bounds
the probability

    P[ sup_{f∈F} ( R[f] − R_emp[f] ) > ε ]

is easy to bound when F is finite
let F be the set F = {f₁, ..., f_M} and

    Cᵢ_ε = { ((x₁,y₁), ..., (xₙ,yₙ)) : |R_emp[fᵢ] − R[fᵢ]| > ε }

the set of samples for which the risks obtained with the i-th function differ by more than ε
then, if M = 2,

    P[ sup_{f∈F} |R[f] − R_emp[f]| > ε ] = P( C¹_ε ∪ C²_ε )
      = P(C¹_ε) + P(C²_ε) − P(C¹_ε ∩ C²_ε)
      ≤ P(C¹_ε) + P(C²_ε)
19
VC Bounds
in general,

    P[ sup_{f∈F} |R[f] − R_emp[f]| > ε ] ≤ Σᵢ₌₁^M P(Cᵢ_ε)

this is called the union bound
recalling that

    Cᵢ_ε = { ((x₁,y₁), ..., (xₙ,yₙ)) : |R_emp[fᵢ] − R[fᵢ]| > ε }

and noting that the fᵢ are fixed, so that the errors are independent, we can now
• just apply the LLN to each of the P(Cᵢ_ε)
• each of them is bounded by O(e^{−nε²})
• the overall bound is O(M e^{−nε²}), and convergence is exponentially fast
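the union bound can also be checked by simulation: draw losses for M fixed functions and compare the probability of a large maximal deviation with 2M e^{−2nε²} (the two-sided per-function Hoeffding bound); a sketch of mine, with arbitrary parameters:

```python
import math
import random

random.seed(2)

M, n, eps, n_trials = 5, 200, 0.1, 5_000
p = 0.5  # true risk of every (fixed) function; losses are Bernoulli(p)

def max_deviation_exceeds():
    """Does any of the M fixed functions deviate by more than eps on one sample?"""
    for _ in range(M):
        r_emp = sum(random.random() < p for _ in range(n)) / n
        if abs(r_emp - p) > eps:
            return True
    return False

emp_prob = sum(max_deviation_exceeds() for _ in range(n_trials)) / n_trials
union_bound = 2 * M * math.exp(-2 * n * eps**2)  # two-sided Hoeffding, M functions
```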
20
VC Bounds
the problem is the case when F contains an infinite number of functions. In this case the method of proof will fail.
one of the main VC results is the solution to this
they showed that
• the probability of the empirical risk on a sample of n points differing from the risk by more than ε can be bounded by
• twice the probability that it differs from the empirical risk on a second sample of size 2n by more than ε/2
21
VC Bounds
Theorem: for nε² > 2,

    P[ sup_{f∈F} |R[f] − R_emp[f]| > ε ] ≤ 2 P[ sup_{f∈F} |R_emp[f] − R′_emp[f]| > ε/2 ]

where
• the 1st P refers to a sample of size n and the 2nd to one of size 2n
• in the latter case, R_emp measures the loss on the first half and R′_emp the loss on the second half
this is intuitive:
• if the R_emps on two independent n-samples are close, then they should also be close to the true error rate
the practical significance is that
• this makes F effectively finite
• when we restrict the functions to 2n points there are at most 2^{2n} different elements in the set
22
VC dimension
graphically
[figure: several functions over the sample points x₁, x₂, ..., x_{2n}; all these functions are the same when restricted to the sample points]
since there are 2n points where the functions are either +1 or −1, the total number of distinct elements is at most 2^{2n}
in practice it can, of course, be smaller
the VC dimension is determined by this number
23
VC dimension
to formalize this, we denote the 2n point sample by

    Z_{2n} = { (x₁,y₁), ..., (x_{2n},y_{2n}) }

and the cardinality of the set of distinct functions (restricted to Z_{2n}) by

    N(F, Z_{2n})

the maximum cardinality over all possible samples of size 2n,

    N(F, 2n),

is the shattering coefficient (or covering number) of F
it is a measure of the complexity of F, the number of ways in which it can separate the two classes
24
VC dimension
if N(F, n) = 2ⁿ, all possible separations can be implemented by functions in F
in this case the functions in F are said to shatter n points
note:
• this means that there are n points that can be separated in all possible ways
• it does not mean that this applies to all sets of n points
example:
• let F be the set of lines in R². For 3 points in general position, every labeling of the points can be separated by a line
[figure: the eight labelings of three points into x’s and o’s, each separated by a line]
• the set of lines shatters three points in R²!
25
VC dimension
example:
• find a set of four points that is shattered (that is, can be separated in all possible ways) by a line in R²
• the following configuration is never possible (xor)
[figure: four points with the two diagonally opposite pairs labeled o and x]
• the set of lines does not shatter four points in R²!
this brings us to the VC dimension

    VC(F) = max # of points that can be shattered by functions in F

example:
• the VC dimension of the set of hyperplanes in R^d is d+1
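both facts — three points in general position are shattered by lines, the four xor points are not — can be verified by brute force; the sketch below (mine, not from the lecture) uses the perceptron with a generous iteration cap as a linear-separability test: for separable dichotomies the perceptron is guaranteed to converge, and for non-separable ones it never does.

```python
from itertools import product

def separable(points, labels, max_epochs=1000):
    """Linear-separability test: run the perceptron (with bias) for at most
    max_epochs passes; at this scale it converges iff the data is separable."""
    w = [0.0, 0.0, 0.0]  # [w1, w2, bias]
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + w[2]) <= 0:
                w[0] += y * x1; w[1] += y * x2; w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered(points):
    """True if every dichotomy of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]          # three points in general position
xor4 = [(0, 0), (1, 1), (1, 0), (0, 1)]   # the xor configuration
```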
26
VC dimension
why is the VC dimension important?
this will be clear next class, where we will show that

    R[f] ≤ R_emp[f] + φ[VC(F)]

by controlling the VC dimension
• we upper bound the difference between risk and empirical risk
• we effectively control the generalization ability of the learning machine
and this is really what structural risk minimization is about
but before we get into that, I will tie all of this to the margin, which is the quantity that we can control in practice
there are various results relating margin and VC dimension; we next go over a simple one
27
The role of the margin
Theorem: consider hyperplanes of the form wᵀx = 0, where w is normalized with respect to a sample D = {(x₁,y₁), ..., (xₙ,yₙ)} in the usual way,

    minᵢ |wᵀxᵢ| = 1

Then, the set of functions

    F = { sgn(wᵀx) }

defined on X* = {x₁, ..., xₙ} and satisfying

    ||w|| ≤ λ

has VC dimension such that

    VC(F) ≤ R²λ²

where R is the radius of the smallest sphere centered at the origin and containing X*.
28
The role of the margin
Proof:
• we need to show that the max number of points shattered by normalized hyperplanes with ||w|| ≤ λ is upper bounded by R²λ²
• assume {x₁, ..., x_r} are shattered by normalized hyperplanes of ||w|| ≤ λ
• then, for all y₁, ..., y_r in {1,−1}, there is a w s.t. ||w|| ≤ λ and

    yᵢ wᵀxᵢ ≥ 1,  ∀i

• summing over all i,

    wᵀ Σᵢ yᵢ xᵢ ≥ r

• and by Cauchy-Schwarz,

    r ≤ wᵀ Σᵢ yᵢ xᵢ ≤ ||w|| ||Σᵢ yᵢ xᵢ|| ≤ λ ||Σᵢ yᵢ xᵢ||   ⇔   r/λ ≤ ||Σᵢ yᵢ xᵢ||    (*)
29
The role of the margin
• next, consider that the yᵢ are iid r.v.s uniformly distributed on {1,−1}

    E[ ||Σᵢ yᵢ xᵢ||² ] = E[ (Σᵢ yᵢ xᵢ)ᵀ (Σⱼ yⱼ xⱼ) ]
      = E[ Σᵢ yᵢ² xᵢᵀxᵢ + Σᵢ Σ_{j≠i} yᵢ yⱼ xᵢᵀxⱼ ]
      = Σᵢ xᵢᵀxᵢ E[yᵢ²] + Σᵢ Σ_{j≠i} xᵢᵀxⱼ E[yᵢ yⱼ]
      = Σᵢ ||xᵢ||²          (E[yᵢ²] = 1 and, by independence, E[yᵢ yⱼ] = E[yᵢ]E[yⱼ] = 0 for i ≠ j)
      ≤ rR²
30
The role of the margin
• hence

    E[ ||Σᵢ yᵢ xᵢ||² ] ≤ rR²

• which means that there must be at least one set of labels for which this inequality holds, i.e. there is at least one set of labels such that

    ||Σᵢ yᵢ xᵢ||² ≤ rR²

• since (*) holds for this set, we have

    (r/λ)² ≤ ||Σᵢ yᵢ xᵢ||² ≤ rR²   ⇔   r ≤ R²λ²

• since this holds for any r, it will also hold for the max number of points that can be shattered, i.e. the VC dimension, and

    VC(F) ≤ R²λ²
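the bound of the theorem is easy to compute for a toy sample; a sketch of mine (the function name `margin_vc_bound` is illustrative), which first rescales w so that the canonical normalization minᵢ |wᵀxᵢ| = 1 holds:

```python
import math

def margin_vc_bound(points, w):
    """Upper bound R^2 * lambda^2 on the VC dimension of the hyperplanes
    sgn(w^T x) with min_i |w^T x_i| = 1 and ||w|| <= lambda."""
    # rescale w so that min_i |w^T x_i| = 1 (the canonical normalization)
    m = min(abs(sum(wi * xi for wi, xi in zip(w, x))) for x in points)
    lam = math.sqrt(sum(wi * wi for wi in w)) / m   # ||w|| after rescaling
    # radius of the smallest origin-centered sphere containing the points
    R = max(math.sqrt(sum(xi * xi for xi in x)) for x in points)
    return R * R * lam * lam

points = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
w = (1.0, 1.0)
bound = margin_vc_bound(points, w)  # R^2 = 2 and lambda^2 = 2, so the bound is 4
```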
31
The role of the margin
notes:
• there are extensions to the case where b ≠ 0, but they are a bit more complicated
• the theorem basically says that if

    ||w|| ≤ λ

then

    VC(F) ≤ R²λ²

• when we maximize the margin (minimize ||w||) we are decreasing the value of λ and, therefore,
• decreasing the upper bound on the VC dimension
• note that, as long as ||w|| is finite, the VC dimension is finite even if the dimension of the space is infinite
• since γ = 1/||w||, this is always true if the margin is strictly greater than zero