The VC dimension
Nuno Vasconcelos ECE Department, UCSD
2
Margins and VC dimension
we have been talking about techniques that achieve good generalization by maximizing the margin
it turns out that the quantity of interest is the so-called VC dimension of the family of functions implemented by the learning machine
the margin allows us to control this dimension, and this is what makes it important
in the next two lectures we are going to formalize this a little better
disclaimer: the topic is a bit technical, we will only do a superficial review of the main results
the idea is to get the big picture
3
Loss functions and Risk
goal of the learning machine: to find the set of parameters that minimizes the risk (expected value of the loss)

    R[f] = E_{X,Y}[ L(y, f(x)) ] = ∫ L(y, f(x)) P_{X,Y}(x, y) dx dy

in practice it is impossible to evaluate the risk, because we do not know what P_{X,Y}(x, y) is. all we have is a training set

    D = { (x₁, y₁), ..., (xₙ, yₙ) }

we estimate the risk by the empirical risk on this training set

    R_emp[f] = (1/n) Σᵢ₌₁ⁿ L(yᵢ, f(xᵢ))
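in code, the empirical risk under the 0/1 loss is just the average number of mistakes on the training set; a minimal Python sketch (the names `zero_one_loss` and `empirical_risk` are mine, for illustration):

```python
def zero_one_loss(y, fx):
    """0/1 loss: 1 if the prediction disagrees with the label, else 0."""
    return 0 if y == fx else 1

def empirical_risk(f, D):
    """Average loss of classifier f over a training set D = [(x, y), ...]."""
    return sum(zero_one_loss(y, f(x)) for x, y in D) / len(D)

# a toy classifier: predict the sign of x
f = lambda x: 1 if x >= 0 else -1
D = [(-2, -1), (-1, -1), (0.5, 1), (3, 1), (-0.5, 1)]  # last point is misclassified
# empirical_risk(f, D) is 1/5 = 0.2
```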
4
Empirical risk minimization
the ERM principle recommends choosing the function f that minimizes this empirical risk
we have already seen that this is a bad idea when F, the set of possible f, is unconstrained
here is an f* that will always achieve zero empirical risk

    f*(x) = yᵢ,  if x = xᵢ
            0,   otherwise

f* clearly has zero empirical error, but
• on a different sample it is unlikely that we will get the same xᵢ
• outside the training set it always says “0”, which is clearly bad
the point is that we have to constrain F to have meaningful generalization
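the memorizing f* above is easy to implement, which makes the failure mode concrete; a Python sketch (the helper `make_memorizer` is my own name for it):

```python
def make_memorizer(D):
    """f*(x) = y_i if x equals some training point x_i, and 0 otherwise."""
    table = {x: y for x, y in D}
    return lambda x: table.get(x, 0)

D = [(1, 1), (2, -1), (3, 1)]
f_star = make_memorizer(D)

train_errors = sum(f_star(x) != y for x, y in D)  # zero empirical risk
off_sample = f_star(4)                            # always 0 outside the sample
```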
5
The law of large numbers
let’s start by trying to understand the relationship between risk and empirical risk
since we are dealing with classification we use the loss

    L(yᵢ, f(xᵢ)) = |yᵢ − f(xᵢ)|

if we have a sequence of iid errors ξᵢ = |f(xᵢ) − yᵢ|, the convergence of the empirical risk

    (1/n) Σᵢ ξᵢ

to the risk E[ξ] is determined by the law of large numbers
there are many variations; we will do here the so-called Hoeffding bound
for this we need some intermediate results
6
The Markov inequality
the basic result is the Markov inequality
Markov: let ξ be a non-negative r.v. with distribution P_ξ(ξ). Then, for all λ > 0,

    P(ξ ≥ λ E[ξ]) ≤ 1/λ

Proof:

    E[ξ] = ∫₀^∞ ξ dP_ξ(ξ) ≥ ∫_{λE[ξ]}^∞ ξ dP_ξ(ξ) ≥ λ E[ξ] ∫_{λE[ξ]}^∞ dP_ξ(ξ) = λ E[ξ] P(ξ ≥ λ E[ξ])

dividing both sides by λ E[ξ] gives the result
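a quick numeric sanity check of Markov’s inequality (a sketch of mine, not from the slides, using an Exp(1) variable, for which E[ξ] = 1):

```python
import random

random.seed(0)

lam = 2.0           # the λ in P(ξ ≥ λ E[ξ]) ≤ 1/λ
n_trials = 100_000

# ξ ~ Exp(1), so E[ξ] = 1 and the event is {ξ ≥ λ}
samples = [random.expovariate(1.0) for _ in range(n_trials)]
emp_prob = sum(s >= lam for s in samples) / n_trials  # ≈ e^{-2} ≈ 0.135
markov_bound = 1.0 / lam                              # 0.5, much looser
```

for this distribution the inequality is quite loose, as expected from such a generic bound.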
7
The law of large numbers
we will also use the following
Lemma: let x be a r.v. with E[x] = 0 and a ≤ x ≤ b. Then, for s > 0,

    E[e^{sx}] ≤ e^{s²(b−a)²/8}

Proof:
• by convexity of the exponential, for x ∈ [a, b],

    e^{sx} ≤ ((b−x)/(b−a)) e^{sa} + ((x−a)/(b−a)) e^{sb}

• and, from E[x] = 0,

    E[e^{sx}] ≤ (b/(b−a)) e^{sa} − (a/(b−a)) e^{sb} = (1−p) e^{sa} + p e^{sb},   with p = −a/(b−a)
8
The law of large numbers
• continuing,

    (1−p) e^{sa} + p e^{sb} = e^{−pu} (1 − p + p e^{u}) = e^{φ(u)}

  with u = s(b−a) and φ(u) = −pu + log(1 − p + p e^{u})
• taking derivatives,

    ∂φ/∂u = −p + p e^{u} / (1 − p + p e^{u}),   so that ∂φ/∂u(0) = 0,

  and

    ∂²φ/∂u² = p(1−p) e^{u} / (1 − p + p e^{u})² ≤ 1/4

• by Taylor expansion with remainder, there is a point ū such that

    φ(u) = φ(0) + u ∂φ/∂u(0) + (u²/2) ∂²φ/∂u²(ū) ≤ u²/8 = s²(b−a)²/8
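the final inequality φ(u) ≤ u²/8 can be spot-checked numerically on a grid of (u, p) values; a small sketch of mine, not part of the proof:

```python
import math

def phi(u, p):
    """phi(u) = -p*u + log(1 - p + p*e^u), as in the Hoeffding lemma proof."""
    return -p * u + math.log(1 - p + p * math.exp(u))

# worst violation of phi(u) <= u^2/8 over a grid of p and u values
worst_gap = max(
    phi(u, p) - u * u / 8.0
    for p in [0.1, 0.25, 0.5, 0.75, 0.9]
    for u in [k / 10.0 for k in range(-50, 51)]
)
# worst_gap should never be (meaningfully) positive
```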
9
The law of large numbers
we are now ready to show Hoeffding’s result
Theorem: if the xᵢ are iid with xᵢ ∈ [a, b] and Sₙ = (1/n) Σᵢ xᵢ, then for all ε > 0 we have

    P{ Sₙ − E[Sₙ] ≥ ε } ≤ e^{−2nε²/(b−a)²}

Proof:
• from Markov, for a non-negative r.v. x,

    P(x ≥ ε) ≤ E[x]/ε

• hence, for all s > 0 and ε > 0,

    P(x ≥ ε) = P(e^{sx} ≥ e^{sε}) ≤ E[e^{sx}] e^{−sε}

• and

    P( Sₙ − E[Sₙ] ≥ ε ) ≤ e^{−sε} E[ e^{s(Sₙ − E[Sₙ])} ]
10
The law of large numbers
• expanding the expectation,

    P( Sₙ − E[Sₙ] ≥ ε )
      ≤ e^{−sε} E[ exp{ (s/n) Σᵢ (xᵢ − E[xᵢ]) } ]
      = e^{−sε} Πᵢ E[ exp{ (s/n)(xᵢ − E[xᵢ]) } ]        (xᵢ iid)
      ≤ e^{−sε} Πᵢ exp{ s²(b−a)²/(8n²) }                 (by the lemma)
      = e^{−sε} exp{ s²(b−a)²/(8n) }

• choosing s = (4εn)/(b−a)²,

    P( Sₙ − E[Sₙ] ≥ ε ) ≤ e^{−2nε²/(b−a)²}
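Hoeffding’s bound can be checked by simulation with fair coin flips (so a = 0, b = 1); a sketch of mine, with arbitrarily chosen parameters:

```python
import math
import random

random.seed(1)

n, eps, n_trials = 100, 0.1, 20_000
p = 0.5  # fair Bernoulli variables, so E[S_n] = 0.5

def deviation_exceeds():
    """One experiment: does the sample mean exceed its expectation by eps?"""
    s_n = sum(random.random() < p for _ in range(n)) / n
    return s_n - p >= eps

emp_prob = sum(deviation_exceeds() for _ in range(n_trials)) / n_trials
bound = math.exp(-2 * n * eps**2)  # e^{-2} ≈ 0.135 for these parameters
```

the empirical probability (around 0.03 for these parameters) sits comfortably below the bound.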
11
Sidenote
there are many variations on the law of large numbers, this is only one
they all have the flavor that

    P{ |Sₙ − E[Sₙ]| ≥ ε } ≤ O(e^{−n})

this is an amazing result when you think of it
• estimating expectations by empirical means converges exponentially fast
note that this is not an abstract theoretical result
• it is the reason why it is worth studying statistics
• why would we want to compute statistics if they did not converge to the true quantities?
12
Empirical risk vs risk
noting that

    R_emp[f] = (1/n) Σᵢ ξᵢ = Sₙ   and   R[f] = E[ξ] = E[Sₙ]

the theorem

    P( Sₙ − E[Sₙ] ≥ ε ) ≤ e^{−2nε²/(b−a)²}

seems to indicate that (since ξᵢ ∈ [0, 1], so b − a = 1)

    P( R[f] − R_emp[f] ≥ ε ) ≤ e^{−2nε²}

i.e. the empirical risk converges to the risk exponentially fast
this is the best that one could hope for; there seems to be no reason to use anything other than ERM
13
Empirical risk vs risk
since we know that ERM is not that good, something must be wrong
the problem is that the bounds assume independent errors ξᵢ
since we are choosing f (by ERM) so that the mean of the ξᵢ is as small as possible, this is not the case
we need to look for alternative ways to understand the relationship between the two risks
for them to be equivalent we need the ERM solution f* to converge to the lowest value of R[f]
this turns out not to be possible unless we restrict F
14
Convergence
consider the values of the empirical risk and the risk over F
[figure: the curves of R and R_emp over F, with f_opt at the minimum of R and f* at the minimum of R_emp]
f_opt minimizes the risk, f* the empirical risk for a particular sample
clearly

    R[f] − R[f_opt] ≥ 0, ∀f ∈ F   and   R_emp[f] − R_emp[f*] ≥ 0, ∀f ∈ F

and, in particular,

    R[f*] − R[f_opt] ≥ 0   and   R_emp[f_opt] − R_emp[f*] ≥ 0
15
Convergence
or

    R[f*] − R[f_opt] ≥ 0,   R_emp[f_opt] − R_emp[f*] ≥ 0

hence

    0 ≤ R[f*] − R[f_opt]
      = R[f*] − R_emp[f*] + R_emp[f*] − R[f_opt]
      ≤ R[f*] − R_emp[f*] + R_emp[f_opt] − R[f_opt]
      ≤ sup_{f∈F} ( R[f] − R_emp[f] ) + ( R_emp[f_opt] − R[f_opt] )

note that, because f_opt is a fixed function, independent of the sample, we can use Hoeffding on the last term
16
Convergence
which leads to

    0 ≤ R[f*] − R[f_opt] ≤ sup_{f∈F} ( R[f] − R_emp[f] ) + ( R_emp[f_opt] − R[f_opt] )

hence, since R_emp[f_opt] − R[f_opt] → 0 as n → ∞ (by Hoeffding), if

    sup_{f∈F} ( R[f] − R_emp[f] ) → 0 as n → ∞    (*)

then

    R[f*] − R[f_opt] → 0   and   R_emp[f*] − R[f*] → 0    (**)

it turns out that (*) is also a necessary condition for (**)
17
Consistency of ERM
Theorem: (VC) The condition

    lim_{n→∞} P[ sup_{f∈F} ( R[f] − R_emp[f] ) > ε ] = 0,   ∀ε > 0

is necessary and sufficient for consistency of ERM
this shows that consistency depends on the class of functions F, but is not terribly useful in practice
we next look at properties of F that ensure convergence
the first thing to do is to try to bound this quantity
18
VC Bounds
the probability

    P[ sup_{f∈F} ( R[f] − R_emp[f] ) > ε ]

is easy to bound when F is finite
let F be the set F = {f₁, ..., f_M} and

    Cᵢ_ε = { ((x₁,y₁), ..., (xₙ,yₙ)) : |R_emp[fᵢ] − R[fᵢ]| > ε }

the set of samples for which the risks obtained with the i-th function differ by more than ε
then, if M = 2,

    P[ sup_{f∈F} |R[f] − R_emp[f]| > ε ] = P( C¹_ε ∪ C²_ε )
      = P(C¹_ε) + P(C²_ε) − P(C¹_ε ∩ C²_ε)
      ≤ P(C¹_ε) + P(C²_ε)
19
VC Bounds
in general,

    P[ sup_{f∈F} |R[f] − R_emp[f]| > ε ] ≤ Σᵢ₌₁^M P(Cᵢ_ε)

this is called the union bound
recalling that

    Cᵢ_ε = { ((x₁,y₁), ..., (xₙ,yₙ)) : |R_emp[fᵢ] − R[fᵢ]| > ε }

and noting that the fᵢ are fixed, so that the errors are independent, we can now
• just apply the LLN to each of the P(Cᵢ_ε)
• each of them is bounded by O(e^{−nε²})
• the overall bound is O(M e^{−nε²}), and convergence is exponentially fast
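the union bound can also be checked by simulation: draw losses for M fixed functions and compare the probability of a large maximal deviation with 2M e^{−2nε²} (the two-sided per-function Hoeffding bound); a sketch of mine, with arbitrary parameters:

```python
import math
import random

random.seed(2)

M, n, eps, n_trials = 5, 200, 0.1, 5_000
p = 0.5  # true risk of every (fixed) function; losses are Bernoulli(p)

def max_deviation_exceeds():
    """Does any of the M fixed functions deviate by more than eps on one sample?"""
    for _ in range(M):
        r_emp = sum(random.random() < p for _ in range(n)) / n
        if abs(r_emp - p) > eps:
            return True
    return False

emp_prob = sum(max_deviation_exceeds() for _ in range(n_trials)) / n_trials
union_bound = 2 * M * math.exp(-2 * n * eps**2)  # two-sided Hoeffding, M functions
```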
20
VC Bounds
the problem is the case when F contains an infinite number of functions. In this case the method of proof will fail.
one of the main VC results is the solution to this
they showed that
• the probability of the empirical risk on a sample of n points differing from the risk by more than ε can be bounded by
• twice the probability that it differs from the empirical risk on a second sample of size 2n by more than ε/2
21
VC Bounds
Theorem: for nε² > 2,

    P[ sup_{f∈F} |R[f] − R_emp[f]| > ε ] ≤ 2 P[ sup_{f∈F} |R_emp[f] − R′_emp[f]| > ε/2 ]

where
• the 1st P refers to a sample of size n and the 2nd to one of size 2n
• in the latter case, R_emp measures the loss on the first half and R′_emp the loss on the second half
this is intuitive:
• if the R_emps on two independent n-samples are close, then they should also be close to the true error rate
the practical significance is that
• this makes F effectively finite
• when we restrict the functions to 2n points there are at most 2^{2n} different elements in the set
22
VC dimension
graphically
[figure: several functions over the sample points x₁, x₂, ..., x_{2n}; all these functions are the same when restricted to the sample points]
since there are 2n points where the functions are either +1 or −1, the total number of distinct elements is at most 2^{2n}
in practice it can, of course, be smaller
the VC dimension is determined by this number
23
VC dimension
to formalize this, we denote the 2n point sample by

    Z_{2n} = { (x₁,y₁), ..., (x_{2n},y_{2n}) }

and the cardinality of the set of distinct functions (restricted to Z_{2n}) by

    N(F, Z_{2n})

the maximum cardinality over all possible samples of size 2n,

    N(F, 2n),

is the shattering coefficient (or covering number) of F
it is a measure of the complexity of F, the number of ways in which it can separate the two classes
24
VC dimension
if N(F, n) = 2ⁿ, all possible separations can be implemented by functions in F
in this case the functions in F are said to shatter n points
note:
• this means that there are n points that can be separated in all possible ways
• it does not mean that this applies to all sets of n points
example:
• let F be the set of lines in R². For 3 points in general position, every labeling of the points can be separated by a line
[figure: the eight labelings of three points into x’s and o’s, each separated by a line]
• the set of lines shatters three points in R²!
25
VC dimension
example:
• find a set of four points that is shattered (that is, can be separated in all possible ways) by a line in R²
• the following configuration is never possible (xor)
[figure: four points with the two diagonally opposite pairs labeled o and x]
• the set of lines does not shatter four points in R²!
this brings us to the VC dimension

    VC(F) = max # of points that can be shattered by functions in F

example:
• the VC dimension of the set of hyperplanes in R^d is d+1
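both facts — three points in general position are shattered by lines, the four xor points are not — can be verified by brute force; the sketch below (mine, not from the lecture) uses the perceptron with a generous iteration cap as a linear-separability test: for separable dichotomies the perceptron is guaranteed to converge, and for non-separable ones it never does.

```python
from itertools import product

def separable(points, labels, max_epochs=1000):
    """Linear-separability test: run the perceptron (with bias) for at most
    max_epochs passes; at this scale it converges iff the data is separable."""
    w = [0.0, 0.0, 0.0]  # [w1, w2, bias]
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + w[2]) <= 0:
                w[0] += y * x1; w[1] += y * x2; w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered(points):
    """True if every dichotomy of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]          # three points in general position
xor4 = [(0, 0), (1, 1), (1, 0), (0, 1)]   # the xor configuration
```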
26
VC dimension
why is the VC dimension important?
this will be clear next class, where we will show that

    R[f] ≤ R_emp[f] + φ[VC(F)]

by controlling the VC dimension
• we upper bound the difference between risk and empirical risk
• we effectively control the generalization ability of the learning machine
and this is really what structural risk minimization is about
but before we get into that, I will tie all of this to the margin, which is the quantity that we can control in practice
there are various results relating margin and VC dimension; we next go over a simple one
27
The role of the margin
Theorem: consider hyperplanes of the form wᵀx = 0, where w is normalized with respect to a sample D = {(x₁,y₁), ..., (xₙ,yₙ)} in the usual way,

    minᵢ |wᵀxᵢ| = 1

Then, the set of functions

    F = { sgn(wᵀx) }

defined on X* = {x₁, ..., xₙ} and satisfying

    ||w|| ≤ λ

has VC dimension such that

    VC(F) ≤ R²λ²

where R is the radius of the smallest sphere centered at the origin and containing X*.
28
The role of the margin
Proof:
• we need to show that the max number of points shattered by normalized hyperplanes with ||w|| ≤ λ is upper bounded by R²λ²
• assume {x₁, ..., x_r} are shattered by normalized hyperplanes of ||w|| ≤ λ
• then, for all y₁, ..., y_r in {1,−1}, there is a w s.t. ||w|| ≤ λ and

    yᵢ wᵀxᵢ ≥ 1,  ∀i

• summing over all i,

    wᵀ Σᵢ yᵢ xᵢ ≥ r

• and by Cauchy-Schwarz,

    r ≤ wᵀ Σᵢ yᵢ xᵢ ≤ ||w|| ||Σᵢ yᵢ xᵢ|| ≤ λ ||Σᵢ yᵢ xᵢ||   ⇔   r/λ ≤ ||Σᵢ yᵢ xᵢ||    (*)
29
The role of the margin
• next, consider that the yᵢ are iid r.v.s uniformly distributed on {1,−1}

    E[ ||Σᵢ yᵢ xᵢ||² ] = E[ (Σᵢ yᵢ xᵢ)ᵀ (Σⱼ yⱼ xⱼ) ]
      = E[ Σᵢ yᵢ² xᵢᵀxᵢ + Σᵢ Σ_{j≠i} yᵢ yⱼ xᵢᵀxⱼ ]
      = Σᵢ xᵢᵀxᵢ E[yᵢ²] + Σᵢ Σ_{j≠i} xᵢᵀxⱼ E[yᵢ yⱼ]
      = Σᵢ ||xᵢ||²          (E[yᵢ²] = 1 and, by independence, E[yᵢ yⱼ] = E[yᵢ]E[yⱼ] = 0 for i ≠ j)
      ≤ rR²
30
The role of the margin
• hence

    E[ ||Σᵢ yᵢ xᵢ||² ] ≤ rR²

• which means that there must be at least one set of labels for which this inequality holds, i.e. there is at least one set of labels such that

    ||Σᵢ yᵢ xᵢ||² ≤ rR²

• since (*) holds for this set, we have

    (r/λ)² ≤ ||Σᵢ yᵢ xᵢ||² ≤ rR²   ⇔   r ≤ R²λ²

• since this holds for any r, it will also hold for the max number of points that can be shattered, i.e. the VC dimension, and

    VC(F) ≤ R²λ²
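the bound of the theorem is easy to compute for a toy sample; a sketch of mine (the function name `margin_vc_bound` is illustrative), which first rescales w so that the canonical normalization minᵢ |wᵀxᵢ| = 1 holds:

```python
import math

def margin_vc_bound(points, w):
    """Upper bound R^2 * lambda^2 on the VC dimension of the hyperplanes
    sgn(w^T x) with min_i |w^T x_i| = 1 and ||w|| <= lambda."""
    # rescale w so that min_i |w^T x_i| = 1 (the canonical normalization)
    m = min(abs(sum(wi * xi for wi, xi in zip(w, x))) for x in points)
    lam = math.sqrt(sum(wi * wi for wi in w)) / m   # ||w|| after rescaling
    # radius of the smallest origin-centered sphere containing the points
    R = max(math.sqrt(sum(xi * xi for xi in x)) for x in points)
    return R * R * lam * lam

points = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
w = (1.0, 1.0)
bound = margin_vc_bound(points, w)  # R^2 = 2 and lambda^2 = 2, so the bound is 4
```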
31
The role of the margin
notes:
• there are extensions to the case where b ≠ 0, but they are a bit more complicated
• the theorem basically says that if

    ||w|| ≤ λ

then

    VC(F) ≤ R²λ²

• when we maximize the margin (minimize ||w||) we are decreasing the value of λ and, therefore,
• decreasing the upper bound on the VC dimension
• note that, as long as ||w|| is finite, the VC dimension is finite even if the dimension of the space is infinite
• since γ = 1/||w||, this is always true if the margin is strictly greater than zero