
10-701 Introduction to Machine Learning

Homework 3, version 1.4 Due Oct 30, 11:59 am

Rules:

1. Homework submission is done via the CMU Autolab system. Please package your writeup and code into a zip or tar file, e.g., let submit.zip contain writeup.pdf and your code. Submit the package to https://autolab.cs.cmu.edu/courses/10701-f15.

2. As on conference websites, repeated submission is allowed. Please feel free to refine your answers, since we will only grade the latest version. Submitting incomplete solutions early will also help prevent last-minute panic.

3. Autolab may allow submission after the deadline; note, however, that this is only because of the late-day policy. Please see the course website for the policy on late submission.

4. We recommend that you typeset your homework using appropriate software such as LaTeX. If you are writing by hand, please make sure your homework is cleanly written up and legible. The TAs will not invest undue effort to decrypt bad handwriting.

5. You are allowed to collaborate on the homework, but you should write up your own solution and code. Please indicate your collaborators in your submission.

1 Neural Networks (50 Points) (Zhiting)

1.1 Neural network for regression

Figure 1 shows a two-layer neural network which learns a function f : X → Y, where X = (X1, X2) ∈ R^2. The weights w = (w1, . . . , w6) can be arbitrary. There are two possible choices for the function implemented by each unit in this network:

• S: signed sigmoid function S(a) = sign[σ(a) − 0.5] = sign[1/(1 + exp(−a)) − 0.5]

• L: linear function L(a) = ca

where in both cases a = ∑_i wiXi.

1. Assign proper activation functions (S or L) to each unit in Figure 1 so this neural network simulates a linear regression: Y = β1X1 + β2X2.

Answer:

L, L, L (both hidden units and the output unit use the linear function)

2. Assign proper activation functions (S or L) to each unit in Figure 1 so this neural network simulates a binary logistic regression classifier: Y = argmax_y P(Y = y|X), where P(Y = 1|X) = exp(β1X1 + β2X2) / (1 + exp(β1X1 + β2X2)) and P(Y = −1|X) = 1 / (1 + exp(β1X1 + β2X2)). Derive β1 and β2 in terms of w1, . . . , w6.

Answer:

L, L, S

β1 = c(w1w5 + w2w6)

β2 = c(w3w5 + w4w6)
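For completeness, a brief check of where these coefficients come from. Figure 1 is not reproduced here, so the wiring below (X1 feeding the hidden units through w1, w2 and X2 through w3, w4, with the hidden units feeding the output through w5, w6) is inferred from the stated answer rather than read off the figure. With both hidden units L and the output unit S,

z1 = c(w1X1 + w3X2),  z2 = c(w2X1 + w4X2),

a_out = w5 z1 + w6 z2 = c(w1w5 + w2w6) X1 + c(w3w5 + w4w6) X2,

and since sign[σ(a) − 0.5] = sign(a), the network outputs Y = sign(β1X1 + β2X2), which is exactly argmax_y P(Y = y|X) for the logistic model above.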

3. Assign proper activation functions (S or L) to each unit in Figure 1 so this neural network simulates a boosting classifier which combines two logistic regression classifiers, f1 : X → Y1 and f2 : X → Y2, to produce its final prediction: Y = sign[α1Y1 + α2Y2]. Use the same distribution as in problem 1.1.2 for f1 and f2. Derive α1 and α2 in terms of w1, . . . , w6.

Answer:

S, S, S

α1 = w5

α2 = w6
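Again as a short check, under the same assumed wiring as above: with both hidden units S, z1 = sign[σ(w1X1 + w3X2) − 0.5] = Y1 and z2 = sign[σ(w2X1 + w4X2) − 0.5] = Y2, so each hidden unit is itself a binary classifier of the logistic form in 1.1.2. With the output unit S,

Y = sign[σ(w5 z1 + w6 z2) − 0.5] = sign[w5 Y1 + w6 Y2],

which gives α1 = w5 and α2 = w6.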

1.2 Convolutional neural networks

1. Count the total number of parameters in LeNet (p. 46 of the Lecture 8 slides). How many parameters are in all of the convolutional layers? How many parameters are in all of the fully-connected layers?

Note:

(a) The filter sizes of the convolutional and pooling (subsampling) layers:
C1: 5 × 5 (i.e., each unit of C1 has a 5 × 5 receptive field in its preceding layer);
S2: 2 × 2;
C3: 5 × 5;
S4: 2 × 2.


(b) Fully-connected layers in LeNet include C5, F6, and OUTPUT.

Figure 1: A two-layer neural network.

Answer:

Convolutional layers:

C1: (5 * 5 + 1) * 6 = 156

S2: 2 * 6 = 12 (one trainable coefficient and one bias per map; 0 if the subsampling layers are treated as having no trainable parameters)

C3: (5 * 5 * 3 + 1) * 6 + (5 * 5 * 4 + 1) * 9 + (5 * 5 * 6 + 1) = 1516, using LeNet-5's sparse connection table (or 16 * (6 * 5 * 5 + 1) = 2416 if every C3 map is connected to all 6 S2 maps)

S4: 2 * 16 = 32 (or 0, as for S2)

Total for the convolutional layers: 156 + 12 + 1516 + 32 = 1716

Fully-connected layers:

C5: (5 * 5 * 16 + 1) * 120 = 48120

F6: (120 + 1) * 84 = 10164

OUTPUT: (84 + 1) * 10 = 850

Total for the fully-connected layers: 48120 + 10164 + 850 = 59134, giving 1716 + 59134 = 60850 parameters in total.
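As a quick arithmetic check of these totals, the following MATLAB sketch simply restates the per-layer formulas above (with C3 counted via LeNet-5's sparse connection table and the subsampling layers counted with one trainable coefficient and one bias per map) and prints the sums:

conv_counts = [ (5*5 + 1)*6, ...                                 % C1
                2*6, ...                                         % S2
                (5*5*3 + 1)*6 + (5*5*4 + 1)*9 + (5*5*6 + 1), ... % C3, sparse connection table
                2*16 ];                                          % S4
fc_counts   = [ (5*5*16 + 1)*120, ...                            % C5
                (120 + 1)*84, ...                                % F6
                (84 + 1)*10 ];                                   % OUTPUT
fprintf('conv: %d, fully connected: %d, total: %d\n', ...
        sum(conv_counts), sum(fc_counts), sum(conv_counts) + sum(fc_counts));

This prints conv: 1716, fully connected: 59134, total: 60850.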

2. In a convolutional layer the units are organized into planes, each of which is called a feature map. The units within a feature map (indexed by q) have different inputs, but all share a common weight vector, w^(q). A convolutional network is usually trained through backpropagation. Let J^(q) be the number of units in the qth feature map, z_j^(q) the activation of the jth unit, x_ji^(q) the ith input of the jth unit, w_i^(q) the ith element of w^(q), and L the training loss. Derive the gradient of L with respect to w_i^(q).

Answer:

∂L/∂w_i^(q) = ∑_{j=1}^{J^(q)} (∂L/∂z_j^(q)) (∂z_j^(q)/∂w_i^(q)) = ∑_{j=1}^{J^(q)} δ_j^(q) x_ji^(q),

where δ_j^(q) = ∂L/∂z_j^(q) is the error backpropagated to the jth unit (if the unit applies a nonlinearity, the corresponding σ′ factor is absorbed into δ_j^(q), as usual in backpropagation). Because the weight vector is shared, its gradient accumulates a contribution from every unit in the feature map.
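A minimal numerical sketch of this shared-weight gradient (toy sizes and variable names are assumed here, not taken from the homework): delta(j) plays the role of δ_j^(q) and Xpatch(j, :) holds the inputs x_j1^(q), . . . , x_jI^(q) of unit j.

J = 8; I = 25;               % assumed toy sizes: J units in the feature map, I shared weights
Xpatch = randn(J, I);        % Xpatch(j, i) = x_ji, the i-th input of the j-th unit
delta  = randn(J, 1);        % delta(j) = backpropagated error for the j-th unit
grad_w = Xpatch' * delta;    % grad_w(i) = sum_j delta(j) * Xpatch(j, i), matching the formula above

Because the same weight vector is used at every position, the gradient is a sum over all J units rather than a single term.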

1.3 Gradient vanishing/explosion

In this problem we will study the difficulty of back-propagation in training deep neural networks. For simplicity, we consider the simplest deep neural network: one with just a single neuron in each layer, where the output of the neuron in the jth layer is z_j = σ(a_j) = σ(w_j z_{j−1} + b_j). Here σ is some activation function whose derivative at x is σ′(x). Let m be the number of layers in the neural network and L the training loss.


1. Derive the derivative of L w.r.t. b1 (the bias of the neuron in the first layer).

Answer:

∂L/∂b_1 = (∂L/∂z_m) σ′(a_1) ∏_{k=2}^{m} σ′(a_k) w_k
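For reference, this is just the chain rule applied along the single-neuron chain:

∂L/∂b_1 = (∂L/∂z_m)(∂z_m/∂z_{m−1}) · · · (∂z_2/∂z_1)(∂z_1/∂b_1), with ∂z_k/∂z_{k−1} = σ′(a_k) w_k for k = 2, . . . , m and ∂z_1/∂b_1 = σ′(a_1).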

2. Assume the activation function is the usual sigmoid function σ(x) = 1/(1 + exp(−x)). The weights w are initialized to be |w_j| < 1 (j = 1, . . . , m).

(a) Explain why the above gradient (∂L/∂b1) tends to vanish (→ 0) when m is large.

Answer:

The derivative of the sigmoid function attains its maximum at σ′(0) = 1/4. Since |w_j| < 1, we have

|w_j σ′(a_j)| < 1/4 < 1,

so the product ∏_{k=2}^{m} σ′(a_k) w_k in ∂L/∂b_1 shrinks at least geometrically with the depth m, and the gradient tends to 0 when m is large.

(b) Even if |w| is large, the above gradient would also tend to vanish, rather than explode (→ ∞). Explain why. (A rigorous proof is not required.)

Answer:

To avoid the vanishing gradient we would need |w σ′(a)| ≥ 1. But the σ′(a) term also depends on w, since a = wz + b: if we make |w| large we tend to make |wz + b| very large, which drives σ′(a) = σ′(wz + b) toward 0 because the sigmoid saturates. So the factors |w σ′(a)| still end up small, and the gradient vanishes rather than explodes.

3. One of the approaches to (partially) address the gradient vanishing/explosion problem is to use the rectified linear (ReL) activation function instead of the sigmoid. The ReL activation function is σ(x) = max{0, x}. Explain why ReL can alleviate the gradient vanishing problem as faced by sigmoid.

Answer:

As long as a > 0, σ′(a) = 1 regardless of how large the weights are, so the per-layer factors w_k σ′(a_k) are not forced to be small and the issue described in 2(b) does not arise.
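A small numerical illustration of parts 2 and 3 (a minimal sketch under assumed toy settings: a 50-layer single-neuron chain with w_j = 0.9, b_j = 0, and input z_0 = 0.5; it evaluates ∂z_m/∂b_1 = σ′(a_1) ∏_{k=2}^{m} σ′(a_k) w_k for a sigmoid chain and for a ReL chain):

m = 50;                                   % assumed depth
w = 0.9 * ones(m, 1); b = zeros(m, 1);    % |w_j| < 1, as in part 2(a)
sig   = @(x) 1 ./ (1 + exp(-x));
dsig  = @(x) sig(x) .* (1 - sig(x));      % sigmoid derivative, at most 1/4
relu  = @(x) max(0, x);
drelu = @(x) double(x > 0);               % ReL derivative: 1 whenever a > 0

z = 0.5; grad_sig = 1;                    % forward pass + gradient factors, sigmoid chain
for k = 1:m
    a = w(k) * z + b(k);
    if k == 1, grad_sig = dsig(a); else, grad_sig = grad_sig * w(k) * dsig(a); end
    z = sig(a);
end

z = 0.5; grad_relu = 1;                   % same computation with ReL activations
for k = 1:m
    a = w(k) * z + b(k);
    if k == 1, grad_relu = drelu(a); else, grad_relu = grad_relu * w(k) * drelu(a); end
    z = relu(a);
end

fprintf('dz_m/db_1: sigmoid %.3g, ReL %.3g\n', grad_sig, grad_relu);

With these settings every sigmoid factor is below 0.9 × 1/4, so the sigmoid gradient is astronomically small, while the ReL factors are exactly 0.9 (all pre-activations stay positive), so the gradient decays only as 0.9^(m−1): with ReL it is the weights, not the activation function, that determine whether the gradient shrinks or grows.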

4. A second approach to (partially) address the gradient vanishing/explosion problem is layer-wise pre-training. The restricted Boltzmann machine (RBM) is one of the widely used models for layer-wise pre-training. Figure 2 shows an example of an RBM which includes K hidden units h and J input units v. Let us define the joint distribution in the following general form:

P(v, h) = (1/Z) exp( ∑_i θ_i φ_i(v, h) ),    (1)

where Z = ∑_{v,h} exp( ∑_i θ_i φ_i(v, h) ) is the normalization term, φ_i(v, h) are some features, and θ_i are the parameters corresponding to the weights in the RBM. Consider the simplest learning algorithm, gradient descent. Show that

∂ log P(v) / ∂θ_i = ∑_h φ_i(v, h) P(h|v) − ∑_{v,h} φ_i(v, h) P(v, h).    (2)

Answer:


Marginalizing over the hidden units, P(v) = ∑_h P(v, h) = (1/Z) ∑_h exp( ∑_j θ_j φ_j(v, h) ), so

log P(v) = log ∑_h exp( ∑_j θ_j φ_j(v, h) ) − log ∑_{v,h} exp( ∑_j θ_j φ_j(v, h) ).

Differentiating with respect to θ_i,

∂ log P(v)/∂θ_i = ∑_h [ exp(∑_j θ_j φ_j(v, h)) / ∑_{h′} exp(∑_j θ_j φ_j(v, h′)) ] φ_i(v, h) − ∑_{v,h} [ exp(∑_j θ_j φ_j(v, h)) / ∑_{v′,h′} exp(∑_j θ_j φ_j(v′, h′)) ] φ_i(v, h)

= ∑_h [ P(v, h) / P(v) ] φ_i(v, h) − ∑_{v,h} P(v, h) φ_i(v, h)

= ∑_h P(h|v) φ_i(v, h) − ∑_{v,h} P(v, h) φ_i(v, h),

which is exactly (2).
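As a concrete instance (a standard RBM fact, stated here for context rather than taken from the problem): for a binary RBM whose weight W_jk connects v_j and h_k, we can take φ_jk(v, h) = v_j h_k with θ_jk = W_jk, and (2) becomes

∂ log P(v)/∂W_jk = ∑_h P(h|v) v_j h_k − ∑_{v,h} P(v, h) v_j h_k = E_{P(h|v)}[v_j h_k] − E_{P(v,h)}[v_j h_k],

i.e., the gradient is the difference between the data-driven ("clamped") expectation and the model ("free") expectation of v_j h_k.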


Figure 2: A restricted Boltzmann machine.


2 Support Vector Machines (50 Points) (Yuntian)

2.1 Support Vector Regression (25 Points)

We now extend support vector machines (SVM) to regression problems. Recall that in regression problems, we have n data points (xi, yi)_{i=1}^{n}, where xi ∈ R^m and yi ∈ R. Given a function class F (e.g., linear or quadratic functions), we want to fit a function f ∈ F on the training set:

f* = argmin_{f∈F}  C ∑_{i=1}^{n} l(f(xi), yi) + R(f)    (3)

where l(·, ·) is the loss function, R(f) is the regularization term, and C controls the regularization strength. The first part tries to fit the data, and the second part penalizes complex f to avoid over-fitting.

In the support vector regression (SVR) framework, we consider the linear function class F = {x → w^T x} (we do not consider an intercept term, for simplicity). We use the ℓ2-regularizer R(f) = (1/2)‖w‖_2^2 for f(x) = w^T x. For the loss function l, similar to the hinge loss in SVM classification, we employ an ε-insensitive error function

l_ε(f(x), y) = 0 if |f(x) − y| < ε, and |f(x) − y| − ε otherwise.    (4)
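As a compact way to read (4): the ε-insensitive loss can be written as a single hinge-like expression, e.g. in MATLAB (l_eps is an illustrative name used only here):

% epsilon-insensitive loss: zero inside the epsilon-tube, linear outside it
l_eps = @(f, y, eps) max(0, abs(f - y) - eps);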

Then we get the following optimization problem:

w* = argmin_w  C ∑_{i=1}^{n} l_ε(w^T xi, yi) + (1/2)‖w‖_2^2.    (5)

1. Write down the dual problem of SVR. (Hint: follow the derivations for SVM)

Answer:

We introduce slack variables ξi, ξi* and write the original problem as

min_{w, ξ, ξ*}  (1/2)‖w‖_2^2 + C ∑_{i=1}^{n} (ξi + ξi*)
s.t.  ξi ≥ yi − w^T xi − ε,  ξi* ≥ w^T xi − yi − ε,  ξi, ξi* ≥ 0    (6)

By introducing dual variables ηi, ηi*, αi, αi* ≥ 0, the Lagrangian is

L = (1/2)‖w‖_2^2 + C ∑_{i=1}^{n} (ξi + ξi*) − ∑_{i=1}^{n} ηiξi − ∑_{i=1}^{n} ηi*ξi*
  + ∑_{i=1}^{n} αi(yi − w^T xi − ε − ξi) + ∑_{i=1}^{n} αi*(w^T xi − yi − ε − ξi*)    (7)

The primal problem is

min_{w, ξ, ξ*}  max_{α, α*, η, η*}  L    s.t.  αi, αi*, ηi, ηi* ≥ 0    (8)

So the dual problem is

max_{α, α*, η, η*}  min_{w, ξ, ξ*}  L    s.t.  αi, αi*, ηi, ηi* ≥ 0    (9)

We can obtain min_{w, ξ, ξ*} L by taking derivatives and solving for w, ξi, ξi*, so the dual problem is

max_{α, α*}  −(1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} (αi − αi*)(αj − αj*) xi^T xj + ∑_{i=1}^{n} (αi − αi*) yi − ∑_{i=1}^{n} (αi + αi*) ε
s.t.  0 ≤ αi ≤ C,  0 ≤ αi* ≤ C    (10)

2. Write down the KKT conditions, and explain what the “support vectors” are.

Answer: The KKT conditions are:

∂L/∂w = w − ∑_{i=1}^{n} (αi − αi*) xi = 0    (11)
∂L/∂ξi = C − αi − ηi = 0    (12)
∂L/∂ξi* = C − αi* − ηi* = 0    (13)
αi, αi*, ηi, ηi* ≥ 0    (14)
ηiξi = 0,  ηi*ξi* = 0    (15)
αi(yi − w^T xi − ε − ξi) = 0    (16)
αi*(w^T xi − yi − ε − ξi*) = 0    (17)

The support vectors are the training points with αi − αi* ≠ 0; by (11), w is a linear combination of these points alone, and by the complementary slackness conditions (16)–(17) they are exactly the points that lie on or outside the ε-tube.
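For completeness, here is the substitution step behind the dual (10) that the answer to part 1 skips. Plugging the stationarity conditions (11)–(13) into the Lagrangian (7): the slack terms cancel because (C − αi − ηi)ξi = 0 and (C − αi* − ηi*)ξi* = 0, and with w = ∑_i (αi − αi*) xi we have ∑_i (αi − αi*) w^T xi = ‖w‖_2^2, so

L = (1/2)‖w‖_2^2 − ‖w‖_2^2 + ∑_i (αi − αi*) yi − ε ∑_i (αi + αi*)
  = −(1/2) ∑_i ∑_j (αi − αi*)(αj − αj*) xi^T xj + ∑_i (αi − αi*) yi − ε ∑_i (αi + αi*),

and the box constraints 0 ≤ αi, αi* ≤ C follow from ηi = C − αi ≥ 0 and ηi* = C − αi* ≥ 0.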

3. Derive a kernelized version of SVR. For a test point x, write down the prediction rule.

Answer: In the kernelized version, for any kernel function K(·, ·), we simply replace xi^T xj in the dual (10) with K(xi, xj). For a test point x, the prediction rule is

f(x) = ∑_{i=1}^{n} (αi − αi*) K(xi, x)    (18)

4. Give one reason why we usually solve the dual problem of SVR and SVM instead of the primal.

Answer: The dual depends on the data only through the inner products xi^T xj, so we can introduce kernel functions (the kernel trick) and learn non-linear functions without ever computing an explicit feature map.

5. Implement SVR on a 1-D toy dataset. Each line of the dataset contains a training instance (xi, yi) (separated by a tab). For this problem, you need to

• Use the RBF kernel k(xi, xj) = exp(−‖xi − xj‖_2^2 / (2h^2)), and take h = 0.5, C = 4, ε = 0.1.

• Plot the prediction curve for x ∈ [0, 1] and show the support vectors versus the other training points in the training dataset.

(Hint: You are allowed to use optimization toolkits such as CVX or Matlab's inbuilt function quadprog to solve the dual problem.)

Answer:

clc; clear;

% Load the 1-D toy dataset: each row is a tab-separated training instance (x_i, y_i).
data = dlmread('SVR dataset.txt');
x = data(:, 1);
y = data(:, 2);

h = 0.5; C = 4; eps = 0.1;
n = size(x, 1);

% Parameters form a 2n-by-1 vector whose first n elements are
% p = alpha - alpha^* and whose next n elements are q = alpha + alpha^*.
H = zeros(2 * n, 2 * n);
for i = 1:n
    for j = 1:n
        temp = norm(x(i) - x(j), 2);
        H(i, j) = exp(-temp^2 / 2 / h^2);    % RBF kernel K(x_i, x_j)
    end
end

f = -eps * ones(2 * n, 1);
f(1:n, 1) = y;

% Inequality constraints A * [p; q] <= b encode 0 <= alpha_i, alpha_i^* <= C.
A = zeros(4 * n, 2 * n);
for i = 1:n
    A(i, i) = 0.5;            A(i, i + n) = 0.5;           % alpha_i <= C
    A(i + n, i) = -0.5;       A(i + n, i + n) = 0.5;       % alpha_i^* <= C
    A(i + 2 * n, i) = -0.5;   A(i + 2 * n, i + n) = -0.5;  % alpha_i >= 0
    A(i + 3 * n, i) = 0.5;    A(i + 3 * n, i + n) = -0.5;  % alpha_i^* >= 0
end

b = zeros(4 * n, 1);
b(1:2 * n, 1) = C;

% quadprog minimizes (1/2) z' H z + f' z, so pass -f to maximize the dual objective.
params = quadprog(H, -f, A, b);

%% Plot
p = params(1:n, 1);
q = params(n + 1:2 * n, 1);

limit = 0.00001;
sv_idx  = find(abs(p) >  limit);
nsv_idx = find(abs(p) <= limit);

x_sv  = x(sv_idx, 1);   y_sv  = y(sv_idx, 1);
x_nsv = x(nsv_idx, 1);  y_nsv = y(nsv_idx, 1);

cut = 99;
x_axis = 0:1/cut:1;
y_axis = zeros(size(x_axis));

for i = 1:cut + 1
    K = zeros(n, 1);
    for j = 1:n
        K(j, 1) = exp(-norm(x_axis(i) - x(j), 2)^2 / 2 / h^2);
    end
    y_axis(i) = dot(K, p);    % f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x)
end

figure  % opens a new figure window
plot(x_axis, y_axis);
hold on;
scatter(x_sv, y_sv, '*');
hold on;
scatter(x_nsv, y_nsv);
legend('prediction curve', 'support vectors', 'other training points');

Figure 3: SVR results.

2.2 Support Kernel Machines (20 Points)

In SVM, the kernel function can be viewed as a similarity measure between data points. In some classification scenarios, features may come from different sources or modalities; e.g., in some tasks the data may contain both image features and text features. In that case, since these are different representations, they have different measures of similarity corresponding to different kernels. In such a case, we want to learn a combination of kernels instead of using a single kernel. There is a significant amount of work on combining kernels; here we adapt the notation of [1].

We begin by considering a linear case of Support Kernel Machine (SKM). Suppose the data points xi ∈ X = R^k. We also assume we are given a decomposition R^k = R^{k1} × · · · × R^{km}, so that each data point xi can be decomposed into m block components, i.e. xi = (x_1i, · · · , x_mi), where each x_ji is in general a vector. In real tasks, each block may correspond to a certain kind of representation, e.g. x_1i may correspond to image features and x_2i may be text features.

Our goal is to find a linear classifier of the form y = sign(w^T x + b), where w has the same block decomposition w = (w_1, · · · , w_m) ∈ R^{k1+···+km}. Recall that in linear SVM the objective is:

minimize_{w, ξi ≥ 0, b}  (1/2)‖w‖_2^2 + C ∑_{i=1}^{n} ξi    (19)

subject to  yi( ∑_j w_j^T x_ji + b ) ≥ 1 − ξi,  ∀ i ∈ {1, · · · , n}    (20)

In SKM, we encourage the sparsity of the vector w at the level of blocks. The primal problem for the SKM is defined as:

minimize_{w, ξi ≥ 0, b}  (1/2)( ∑_{j=1}^{m} d_j‖w_j‖_2 )^2 + C ∑_{i=1}^{n} ξi    (21)

subject to  yi( ∑_j w_j^T x_ji + b ) ≥ 1 − ξi,  ∀ i ∈ {1, · · · , n}    (22)

where the d_j > 0 can be seen as constants.

1. By introducing dual variables αi ≥ 0 and βi ≥ 0, we get the Lagrangian function

L = (1/2)( ∑_{j=1}^{m} d_j‖w_j‖_2 )^2 + C ∑_{i=1}^{n} ξi − ∑_{i=1}^{n} αi( yi( ∑_{j=1}^{m} w_j^T x_ji + b ) − 1 + ξi ) − ∑_{i=1}^{n} βiξi    (23)

Denote γ = ∑_{j=1}^{m} d_j‖w_j‖_2.

(a) Show that at the minimum of the Lagrangian function, i.e. w = argmin_w L (for this and the following questions),

‖w_j‖_2 d_j γ = w_j^T ∑_{i=1}^{n} αi yi x_ji,  ∀ j ∈ {1, · · · , m}    (24)

Answer: It is apparent that the conclusion holds when w_j = 0. When w_j ≠ 0, we have

∂L/∂w_j = γ d_j w_j / ‖w_j‖_2 − ∑_{i=1}^{n} αi yi x_ji    (25)

At the minimum, ∂L/∂w_j = 0, so

γ d_j w_j / ‖w_j‖_2 = ∑_{i=1}^{n} αi yi x_ji    (26)

Multiplying both sides by w_j^T, we get

‖w_j‖_2 γ d_j = w_j^T ∑_{i=1}^{n} αi yi x_ji    (27)

(b) Show that ‖∑_{i=1}^{n} αi yi x_ji‖_2 ≤ d_j γ, ∀ j ∈ {1, · · · , m}.

Note: for (b) you can get full credit if you only consider w_j ≠ 0, but you can get 5 extra points if you include the w_j = 0 case in your proof. A hint is that L is not differentiable w.r.t. w_j if w_j = 0, and you may refer to this link for how to deal with that case by using ∂‖x‖_2 = {g : ‖g‖_2 ≤ 1} if x = 0.

Answer: If w_j ≠ 0, using the intermediate result of (a) we have

γ d_j w_j / ‖w_j‖_2 = ∑_{i=1}^{n} αi yi x_ji    (28)

Taking the ℓ2 norm of both sides, we have

γ d_j = ‖∑_{i=1}^{n} αi yi x_ji‖_2    (29)

If w_j = 0, then 0 ∈ ∂L, so for some g with ‖g‖_2 ≤ 1 we have

γ d_j g = ∑_{i=1}^{n} αi yi x_ji    (30)

Taking the ℓ2 norm of both sides, we have

γ d_j ‖g‖_2 = ‖∑_{i=1}^{n} αi yi x_ji‖_2    (31)

As ‖g‖_2 ≤ 1, we have d_j γ ≥ ‖∑_{i=1}^{n} αi yi x_ji‖_2.

(c) Show that

• if ‖∑_i αi yi x_ji‖_2 < d_j γ, then w_j = 0;

• if ‖∑_i αi yi x_ji‖_2 = d_j γ, then ∃ η_j ≥ 0 such that w_j = η_j ∑_i αi yi x_ji.

Answer: If w_j ≠ 0, using the intermediate result of (b) we have

γ d_j = ‖∑_{i=1}^{n} αi yi x_ji‖_2    (32)

So if ‖∑_{i=1}^{n} αi yi x_ji‖_2 < d_j γ, the case w_j ≠ 0 is impossible, hence w_j = 0.

On the other hand, we proved in (a) that if w_j ≠ 0, then γ d_j w_j / ‖w_j‖_2 = ∑_{i=1}^{n} αi yi x_ji. Let η_j = ‖w_j‖_2 / (γ d_j); then w_j = η_j ∑_{i=1}^{n} αi yi x_ji. (If w_j = 0 we can simply take η_j = 0.)

2. Recall from homework 2 that the ℓ1 norm can encourage sparsity. Explain the effect of the regularization term (1/2)( ∑_{j=1}^{m} d_j‖w_j‖_2 )^2.

Answer: The term ∑_{j=1}^{m} d_j‖w_j‖_2 acts as an ℓ1-type penalty on the block norms ‖w_j‖_2. Recall that an ℓ1 regularizer tends to drive some coefficients exactly to zero; here the "coefficients" are the block norms, and d_j‖w_j‖_2 = 0 implies w_j = 0. Therefore some entire blocks w_j tend to be set to zero by this regularizer (a group-lasso effect), i.e., it encourages sparsity at the level of blocks.

3. Now we extend the above analysis to a kernelized version. Assume that we have a mapping φ : X → R^f which is generally a non-linear function. We assume that φ(x) has m block components φ(x) = (φ_1(x), · · · , φ_m(x)), and we also assume w has the same decomposition w = (w_1, · · · , w_m). Show that at the minimum of the Lagrangian function, ∃ η_j ≥ 0 such that w_j = η_j ∑_{i=1}^{n} αi yi φ_j(xi).

Answer: Since φ is simply a mapping of x, we can substitute x_ji with φ_j(xi) and all of the conclusions above still hold. The proof is completed by applying the conclusion of 1(c) with x_ji replaced by φ_j(xi).
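A short consequence worth noting, using the shorthand K_j(x, x′) = φ_j(x)^T φ_j(x′) (this notation is ours, not the problem's): by the result just shown,

w^T φ(x) + b = ∑_j w_j^T φ_j(x) + b = ∑_j η_j ∑_i αi yi φ_j(xi)^T φ_j(x) + b = ∑_i αi yi ( ∑_j η_j K_j(xi, x) ) + b,

so the learned classifier effectively uses the single combined kernel ∑_j η_j K_j with non-negative weights η_j; blocks with w_j = 0 (η_j = 0) drop out of the combination. This is why the SKM can be seen as learning a combination of kernels, as stated at the beginning of this section.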


2.3 SVM Error Analysis (5 Points)

In this problem, we want to analyze the error of SVM classification. Assume that we have n data points (xi, yi)_{i=1}^{n}, where xi ∈ R^m and yi ∈ {1, · · · , K}. Assume that we train an SVM classifier f_{(x1,y1),··· ,(xn,yn)} on these n data points. For a randomly drawn test data point (x_{n+1}, y_{n+1}), the prediction is y^pred_{n+1} = f_{(x1,y1),··· ,(xn,yn)}(x_{n+1}). We assume that the n training data points and the test data point (x_{n+1}, y_{n+1}) are drawn i.i.d. from some unknown underlying distribution. The expected error rate is defined as:

err = E_{(x1,y1),··· ,(xn,yn)} E_{(x_{n+1},y_{n+1})} [ 1{ f_{(x1,y1),··· ,(xn,yn)}(x_{n+1}) ≠ y_{n+1} } ]    (33)

where the indicator function 1{A} = 1 if A is true, and 0 otherwise.

1. Show that the expected error rate is equal to the expectation of the leave-one-out cross-validation error for n + 1 data points.

Answer: Since the points are i.i.d., for any i we have

E_{(x1,y1),··· ,(xn,yn)} E_{(x_{n+1},y_{n+1})} [ 1{ f_{(x1,y1),··· ,(xn,yn)}(x_{n+1}) ≠ y_{n+1} } ]
= E_{(x1,y1),··· ,(x_{n+1},y_{n+1})} [ 1{ f_{(x1,y1),··· ,(x_{i−1},y_{i−1}),(x_{i+1},y_{i+1}),··· ,(x_{n+1},y_{n+1})}(xi) ≠ yi } ]    (34)

since training on the n points other than the ith and testing on the ith has the same distribution as training on the first n points and testing on the (n + 1)th. Averaging (34) over i = 1 to n + 1, we have

E_{(x1,y1),··· ,(xn,yn)} E_{(x_{n+1},y_{n+1})} [ 1{ f_{(x1,y1),··· ,(xn,yn)}(x_{n+1}) ≠ y_{n+1} } ]
= (1/(n + 1)) ∑_{i=1}^{n+1} E_{(x1,y1),··· ,(x_{n+1},y_{n+1})} [ 1{ f_{(x1,y1),··· ,(x_{i−1},y_{i−1}),(x_{i+1},y_{i+1}),··· ,(x_{n+1},y_{n+1})}(xi) ≠ yi } ]
= E_{(x1,y1),··· ,(x_{n+1},y_{n+1})} [ (1/(n + 1)) ∑_{i=1}^{n+1} 1{ f_{(x1,y1),··· ,(x_{i−1},y_{i−1}),(x_{i+1},y_{i+1}),··· ,(x_{n+1},y_{n+1})}(xi) ≠ yi } ]    (35)

The right-hand side is the expectation of the leave-one-out cross-validation error rate on n + 1 data points.

2. In the lecture, we have the statement that “the leave-one-out cross-validation error does not depend on the dimensionality of the feature space but only on the number of support vectors”. Show that this statement is true by explaining why

err_loocv ≤ n_s / (n + 1)    (36)

where err_loocv is the leave-one-out cross-validation error for the training set (xi, yi)_{i=1}^{n+1} and n_s is the number of support vectors.

Answer: If we remove a non-support vector and retrain the SVM, the decision boundary obtained before removing it is still optimal, and that boundary classifies the removed non-support vector correctly. So a leave-one-out error can occur only in the rounds where the removed point is a support vector, and there are at most n_s such rounds out of n + 1, hence

err_loocv ≤ n_s / (n + 1)    (37)

Acknowledgements

Thanks to Desai’s homework for providing this solution.


References

[1] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
