HW5 Solutions
Gabe Hope
March 2017
1 Problem 1
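The derivations below assume the per-example multiclass hinge loss from the notes; written out explicitly (and consistent with the svmObj implementation in the code section), it is:
\[
L((w_1, \dots, w_k), (x, y)) = \max\Big(0,\; (\max_{y' \neq y} w_{y'}^T x) + 1 - w_y^T x\Big)
\]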
1.1 Part a
Supposing $w_y^T x > (\max_{y' \neq y} w_{y'}^T x) + 1$. In this case our loss function is defined to be 0, therefore we know that any partial derivative of the function will also be 0, so:
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_{j,l}} = 0
\]
1.2 Part b
Supposing $w_y^T x < (\max_{y' \neq y} w_{y'}^T x) + 1$ and $j = y$. In this case we can write the relevant piece of the loss function as:
\[
(\max_{y' \neq j} w_{y'}^T x) + 1 - w_j^T x \;\equiv\; (\max_{y' \neq j} w_{y'}^T x) + 1 - \sum_{i=1}^{D} w_{j,i} x_i
\]
Taking the derivative with respect to $w_{j,l}$ we get:
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_{j,l}} = -x_l
\]
1.3 Part c
Supposing $w_y^T x < (\max_{y' \neq y} w_{y'}^T x) + 1$ and $j = \hat{y}$, where $\hat{y} = \arg\max_{y' \neq y} w_{y'}^T x$. In this case we can write the relevant piece of the loss function as:
\[
(\max_{y' \neq y} w_{y'}^T x) + 1 - w_y^T x \;\equiv\; w_j^T x + 1 - w_y^T x \;\equiv\; \sum_{i=1}^{D} w_{j,i} x_i + 1 - w_y^T x
\]
Taking the derivative with respect to $w_{j,l}$ we get:
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_{j,l}} = x_l
\]
1.4 Part d
Supposing $w_y^T x < (\max_{y' \neq y} w_{y'}^T x) + 1$ and $j \neq y$ and $j \neq \hat{y}$, where $\hat{y} = \arg\max_{y' \neq y} w_{y'}^T x$. In this case we can write the relevant piece of the loss function as:
\[
w_{\hat{y}}^T x + 1 - w_y^T x
\]
Noting that this expression does not include $w_j$, we get that the partial derivative with respect to $w_{j,l}$ is 0:
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_{j,l}} = 0
\]
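Combining parts a through d, the partial derivatives with respect to a full weight vector $w_j$ can be summarized in one piecewise expression (this is the case analysis that the gradient computation in svmObj implements):
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_j} =
\begin{cases}
-x & \text{if } w_y^T x < w_{\hat{y}}^T x + 1 \text{ and } j = y \\
\phantom{-}x & \text{if } w_y^T x < w_{\hat{y}}^T x + 1 \text{ and } j = \hat{y} \\
\phantom{-}0 & \text{otherwise}
\end{cases}
\qquad \hat{y} = \arg\max_{y' \neq y} w_{y'}^T x
\]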
2 Problem 2
2.1 Part a
Recall that the decision function for the multiclass SVM is:
\[
y_i = \arg\max_{y} w_y^T x_i
\]
For the 2-class (0 or 1) case, this is:
\[
y_i = \arg\max(w_0^T x_i,\; w_1^T x_i)
\]
We can equivalently write this as:
\[
y_i = 1 \iff w_1^T x_i > w_0^T x_i
\]
Rearranging our condition we get:
\[
w_1^T x_i - w_0^T x_i > 0
\]
\[
(w_1 - w_0)^T x_i > 0
\]
Note that this is equivalent to a decision function of the form:
\[
y_i = 1 \iff w^T x_i > 0
\]
which is a linear perceptron. Therefore we see that our 2-class multiclass SVM with parameters $w_0$ and $w_1$ is equivalent to a perceptron with parameter:
\[
w = w_1 - w_0
\]
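As an informal sanity check of this equivalence, a small MATLAB sketch, with made-up weights and data, that compares the two decision rules:

% Sanity check: 2-class multiclass-SVM rule vs. the equivalent perceptron rule
% (w0, w1, and X below are arbitrary made-up values, for illustration only)
d = 5;  N = 20;
w0 = randn(d, 1);  w1 = randn(d, 1);     % per-class weight vectors
X  = randn(d, N);                        % columns are observations
% Multiclass SVM: pick the class with the larger score
[~, idx] = max([w0' * X; w1' * X], [], 1);
ySvm  = idx - 1;                         % map row index {1,2} to label {0,1}
% Perceptron with w = w1 - w0
yPerc = double((w1 - w0)' * X > 0);
isequal(ySvm, yPerc)                     % prints 1 (the rules agree)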
2.2 Part b
Now we consider the generalized case, where again our decision function is defined as:
\[
y_i = \arg\max_{y} w_y^T x_i
\]
Consider the decision boundary around a particular class $k$ from our set of $K$ classes. We see that:
\[
y_i = k \iff w_k^T x_i > w_j^T x_i \quad \forall j \neq k
\]
Equivalently:
\[
(w_k - w_j)^T x_i > 0 \quad \forall j \neq k
\]
We see then that the decision region for class $k$ is formed by the intersection of the positive regions of $K - 1$ implicitly defined linear perceptrons, where each separates $k$ from another class $j$. Therefore the boundary around class $k$ is made up of pieces of up to $K - 1$ linear decision boundaries. Accounting for all of the classes, the multiclass SVM's decision function is defined by $\binom{K}{2}$ implicit perceptrons, so all of the boundaries defined by the model will have a piecewise linear form. For the $K = 3$ case, the decision boundaries can be illustrated with the sketch below.
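A small MATLAB sketch (with arbitrary made-up weight vectors, purely illustrative) that draws such piecewise-linear regions for $K = 3$ on a 2-D grid:

% Illustrative only: color a 2-D grid by argmax_y w_y'*x for 3 made-up classes
rng(0);
W = randn(3, 3);                         % d x k convention: one column per class (2 features + bias)
[x1, x2] = meshgrid(linspace(-3, 3, 400));
Xg = [x1(:)'; x2(:)'; ones(1, numel(x1))];
[~, region] = max(W' * Xg, [], 1);       % class with the highest score at each grid point
figure
imagesc(linspace(-3, 3, 400), linspace(-3, 3, 400), reshape(region, size(x1)));
axis xy equal tight
title('Piecewise-linear decision regions for K = 3')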
3 Problem 3
See the svmObj function in the code section.
4 Problem 4
4.1 Part a
See the gradOpt function in the code section.
4.2 Part b
See the code section for the implementation. The plot resulting from training with C = 0.01 is shown below.
4.3 Part c
Training and testing with 7 different C values, we get the following results:

Test accuracy for each C value:

  C        Accuracy
  0.0001   74.86%
  0.001    84.82%
  0.01     87.28%
  0.1      79.66%
  1        79.44%
  10       77.54%
  100      78.22%
The best accuracy (with C = 0.01) was 87.28%. The corresponding confusion matrix is shown below.
4.4 Part d
Plot of the weight vector $w_k$ for each class:
5 Code
5.1 Code
function [ X, Y ] = setupData( varargin )
%Preprocess a dataset into an appropriate form for
% our SVM training/prediction code.
%
% varargin: A set of (N' x d) matrices where each
%           matrix contains all (N') observations for
%           a given class.
%
% X: Combined matrix of all observations (d x N)
% Y: Vector of labels for each observation (1 x N)

%Concatenate the 'X' matrices into one training set
X = double(vertcat(varargin{:})');
%Add the extra dimension
X = [X ; ones(1, size(X, 2))];
%Create a vector with the label for each observation
Y = zeros(1, size(X, 2));
start = 1;
for label = 1:nargin
    numlabel = size(varargin{label}, 1);
    Y(start:(start + numlabel - 1)) = label;
    start = start + numlabel;
end
end
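A toy usage example (the two small matrices below are made up) showing the shapes setupData produces:

% Toy example (made-up data): 3 observations of class 1 and 2 of class 2, each 4-dimensional
A = rand(3, 4);                 % class 1 observations (N' x d)
B = rand(2, 4);                 % class 2 observations (N' x d)
[X, Y] = setupData(A, B);
size(X)                         % 5 x 5: d+1 = 5 rows (constant feature appended), N = 5 columns
Y                               % [1 1 1 2 2]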
function [ obj, grad ] = svmObj( w, C, X, Y )
%Computes the SVM objective and gradient.
%
% w: Current model weights (d x k)
% C: Regularization tradeoff hyperparameter
% X: Input observations (d x N)
% Y: Input labels (1 x N)
%
% obj: Objective value
% grad: Gradient matrix

%'Score' of each class for each observation
wx = w' * X;
%Index of the true class' score for each obs.
yind = sub2ind(size(wx), Y, 1:size(wx, 2));
%Get the true class scores, then 'remove' them
wyx = wx(yind);
wx(yind) = -inf;
%Max scores after removing the true classes
[wyhatx, ~] = max(wx, [], 1);
%Loss for each observation
Loss = max(0, wyhatx + 1 - wyx);
%Objective as given in the notes
obj = 0.5 * (w(:)' * w(:)) + C * sum(Loss);
%If needed compute the gradient
if nargout > 1
    %Create a #classes x #obs. matrix, where
    % entry (i,j) indicates how obs. 'j' should
    % contribute to the gradient for class 'i'
    %[bsxfun applies a given operation row-by-row]
    cm = double(bsxfun(@eq, wx, wyhatx));
    cm(yind) = -1;
    cm = bsxfun(@times, cm, Loss > 0);
    %Compute grad by multiplying our contrib.
    % matrix by our obs. matrix and adding the
    % regularization term
    grad = w + C * (cm * X')';
end
end
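One way to check the analytic gradient in svmObj is against finite differences on a tiny random problem; a minimal sketch (all sizes below are made up):

% Finite-difference check of the svmObj gradient (illustrative sizes)
d = 4;  k = 3;  N = 10;  C = 1;
X = randn(d, N);  Y = randi(k, 1, N);  w = randn(d, k);
[~, grad] = svmObj(w, C, X, Y);
h = 1e-5;
numgrad = zeros(size(w));
for i = 1:numel(w)
    wp = w;  wp(i) = wp(i) + h;
    wm = w;  wm(i) = wm(i) - h;
    numgrad(i) = (svmObj(wp, C, X, Y) - svmObj(wm, C, X, Y)) / (2 * h);
end
max(abs(numgrad(:) - grad(:)))  % should be tiny unless an example sits exactly at a hinge kink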
function [ w, traceout ] = gradOpt( func, w0, r0, T, varargin )
%Generic gradient descent with line search.
%
% func: Function to compute objective and gradient
% w0: Initial values of parameters to optimize
% r0: Base learning rate
% T: # of iterations for gradient descent
% varargin: Additional arguments passed to 'func'
%
% w: Final trained weights
% traceout: Objective value at each iteration

%Initialize the weights and objective trace
trace = zeros(T, 1);
w = w0;
%Run for T iterations
for iter = 1:T
    [obj, grad] = func(w, varargin{:});
    trace(iter) = obj;
    %Line search to find a step size that decreases the objective
    r = r0;
    for lsiter = 1:20
        wnew = w - r * grad;
        r = r * 0.5;
        if func(wnew, varargin{:}) < obj
            w = wnew;
            break;
        end
    end
end
if nargout > 1
    traceout = trace;
end
end
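gradOpt works with any function of the form [obj, grad] = func(w, ...); as a small illustration (the random sizes below are made up), here it is applied directly to svmObj, which is also how svmTrain uses it:

% Small illustration of the gradOpt interface (made-up random data)
d = 4;  k = 3;  N = 20;  C = 1;
X = randn(d, N);  Y = randi(k, 1, N);
[w, trace] = gradOpt(@svmObj, randn(d, k), 0.5, 100, C, X, Y);
figure
plot(trace)                      % objective should decrease across iterations
xlabel('Iteration');  ylabel('SVM objective')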
function [ w, trace ] = svmTrain( X, Y, C, T, r )
%Trains a support vector machine given a dataset.
%
% X: Input observations (d x N)
% Y: Input labels (1 x N)
% C: Regularization tradeoff hyperparameter
% T: # of iterations for gradient descent
% r: Base learning rate for gradient descent
%
% w: Final trained model weights (d x k)
% trace: Objective value at each iteration

%Defaults for the hyperparameters
if nargin < 5
    r = 0.5;
end
if nargin < 4
    T = 100;
end
if nargin < 3
    C = 1;
end
%Randomly initialize the model
nlabels = numel(unique(Y));
dim = size(X, 1);
w0 = randn(dim, nlabels);
%Run gradient descent (w/ or w/o the plot data)
if nargout == 1
    w = gradOpt(@svmObj, w0, r, T, C, X, Y);
else
    [w, trace] = gradOpt(@svmObj, w0, r, T, C, X, Y);
end
end
function [ Y ] = svmPredict( w, X )
%Makes predictions for a set of observations given
% a trained model.
%
% w: Trained model weights (d x k)
% X: Input observations (d x N)
%
% Y: Output labels (1 x N)
wx = w' * X;
[~, Y] = max(wx, [], 1);
end

function [ acc ] = score( Ytrue, Ypred )
%Computes the accuracy of a set of predictions
%
% Ytrue: True labels of data (1 x N)
% Ypred: Predicted labels of data (1 x N)
%
% acc: Accuracy
acc = sum(double(Ytrue == Ypred)) / numel(Ytrue);
end
function [ mat ] = confusion( Ytrue, Ypred )
%Generate a confusion matrix for a set of predictions.
%
% Ytrue: True labels of data (1 x N)
% Ypred: Predicted labels of data (1 x N)
%
% mat: Confusion matrix
labels = unique([Ytrue, Ypred]);
nlabels = numel(labels);
mat = zeros(nlabels);
for l1 = 1:nlabels
    for l2 = 1:nlabels
        mat(l1, l2) = sum(double(Ytrue == labels(l1) & Ypred == labels(l2)));
    end
end
end
%hw5.m
%Runs all of the experiments for homework 5

%Load and combine the training and test data matrices
load digits.mat
[Xtrain, Ytrain] = setupData(train0, train1, train2, ...
    train3, train4, train5, train6, train7, train8, ...
    train9);
[Xtest, Ytest] = setupData(test0, test1, test2, test3, ...
    test4, test5, test6, test7, test8, test9);
%Variables to save our best result
bestC = 0;
bestTrace = [];
bestPred = [];
bestW = [];
bestAcc = 0;
%Train and evaluate the SVM for a range of C values
for C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
    [w, trace] = svmTrain(Xtrain, Ytrain, C);
    Ypred = svmPredict(w, Xtest);
    acc = score(Ytest, Ypred);
    %Save the best result
    if acc > bestAcc
        bestC = C;
        bestTrace = trace;
        bestPred = Ypred;
        bestW = w;
        bestAcc = acc;
    end
end
%Report the test accuracy and confusion matrix
%(Alternatively , use Matlab 's plotconfusion func.)
bestAcc
confusionMat = confusion(Ytest, bestPred)
%Plot the convergence of gradient descent
figure
plot(bestTrace)
title('Objective vs. Iterations')
xlabel('Gradient descent iterations')
ylabel('SVM objective')
%Visualize the weights for the best classifier
figure
title('Best Classifier Visualization')
for digit = 1:10
    img = bestW(1:(end-1), digit);
    subplot(2, 5, digit);
    imagesc(reshape(img, 28, 28)');
    axis off
end