HW5 Solutions
Gabe Hope
March 2017
1 Problem 1
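The derivations below assume the per-example multiclass hinge loss from the notes; written out explicitly (and consistent with the svmObj implementation in the code section), it is:
\[
L((w_1, \dots, w_k), (x, y)) = \max\Big(0,\; (\max_{y' \neq y} w_{y'}^T x) + 1 - w_y^T x\Big)
\]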
1.1 Part a
Supposing $w_y^T x > (\max_{y' \neq y} w_{y'}^T x) + 1$. In this case our loss function is defined to be 0, therefore we know that any partial derivative of the function will also be 0, so:
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_{j,l}} = 0
\]
1.2 Part b
Supposing $w_y^T x < (\max_{y' \neq y} w_{y'}^T x) + 1$ and $j = y$. In this case we can write the relevant piece of the loss function as:
\[
(\max_{y' \neq j} w_{y'}^T x) + 1 - w_j^T x \;\equiv\; (\max_{y' \neq j} w_{y'}^T x) + 1 - \sum_{i=1}^{D} w_{j,i} x_i
\]
Taking the derivative with respect to $w_{j,l}$ we get:
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_{j,l}} = -x_l
\]
1.3 Part c
Supposing $w_y^T x < (\max_{y' \neq y} w_{y'}^T x) + 1$ and $j = \hat{y}$, where $\hat{y} = \arg\max_{y' \neq y} w_{y'}^T x$. In this case we can write the relevant piece of the loss function as:
\[
(\max_{y' \neq y} w_{y'}^T x) + 1 - w_y^T x \;\equiv\; w_j^T x + 1 - w_y^T x \;\equiv\; \sum_{i=1}^{D} w_{j,i} x_i + 1 - w_y^T x
\]
Taking the derivative with respect to $w_{j,l}$ we get:
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_{j,l}} = x_l
\]
1.4 Part d
Supposing $w_y^T x < (\max_{y' \neq y} w_{y'}^T x) + 1$ and $j \neq y$ and $j \neq \hat{y}$, where $\hat{y} = \arg\max_{y' \neq y} w_{y'}^T x$. In this case we can write the relevant piece of the loss function as:
\[
w_{\hat{y}}^T x + 1 - w_y^T x
\]
Noting that this expression does not include $w_j$, we get that the partial derivative with respect to $w_{j,l}$ is 0:
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_{j,l}} = 0
\]
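Combining parts a through d, the partial derivatives with respect to a full weight vector $w_j$ can be summarized in one piecewise expression (this is the case analysis that the gradient computation in svmObj implements):
\[
\frac{\partial L((w_1, \dots, w_k), (x, y))}{\partial w_j} =
\begin{cases}
-x & \text{if } w_y^T x < w_{\hat{y}}^T x + 1 \text{ and } j = y \\
\phantom{-}x & \text{if } w_y^T x < w_{\hat{y}}^T x + 1 \text{ and } j = \hat{y} \\
\phantom{-}0 & \text{otherwise}
\end{cases}
\qquad \hat{y} = \arg\max_{y' \neq y} w_{y'}^T x
\]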
2 Problem 2
2.1 Part a
Recall that the decision function for the multiclass SVM is:
\[
y_i = \arg\max_{y} w_y^T x_i
\]
For the 2-class (0 or 1) case, this is:
\[
y_i = \arg\max(w_0^T x_i,\; w_1^T x_i)
\]
We can equivalently write this as:
\[
y_i = 1 \iff w_1^T x_i > w_0^T x_i
\]
Rearranging our condition we get:
\[
w_1^T x_i - w_0^T x_i > 0
\]
\[
(w_1 - w_0)^T x_i > 0
\]
Note that this is equivalent to a decision function of the form:
\[
y_i = 1 \iff w^T x_i > 0
\]
which is a linear perceptron. Therefore we see that our 2-class multiclass SVM with parameters $w_0$ and $w_1$ is equivalent to a perceptron with parameter:
\[
w = w_1 - w_0
\]
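As an informal sanity check of this equivalence, a small MATLAB sketch, with made-up weights and data, that compares the two decision rules:

% Sanity check: 2-class multiclass-SVM rule vs. the equivalent perceptron rule
% (w0, w1, and X below are arbitrary made-up values, for illustration only)
d = 5;  N = 20;
w0 = randn(d, 1);  w1 = randn(d, 1);     % per-class weight vectors
X  = randn(d, N);                        % columns are observations
% Multiclass SVM: pick the class with the larger score
[~, idx] = max([w0' * X; w1' * X], [], 1);
ySvm  = idx - 1;                         % map row index {1,2} to label {0,1}
% Perceptron with w = w1 - w0
yPerc = double((w1 - w0)' * X > 0);
isequal(ySvm, yPerc)                     % prints 1 (the rules agree)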
2.2 Part b
Now we consider the generalized case, where again our decision function is defined as:
\[
y_i = \arg\max_{y} w_y^T x_i
\]
Consider the decision boundary around a particular class $k$ from our set of $K$ classes. We see that:
\[
y_i = k \iff w_k^T x_i > w_j^T x_i \quad \forall j \neq k
\]
Equivalently:
\[
(w_k - w_j)^T x_i > 0 \quad \forall j \neq k
\]
We see then that the decision region for class $k$ is formed by the intersection of the positive regions of $K - 1$ implicitly defined linear perceptrons, where each separates $k$ from another class $j$. Therefore the boundary around class $k$ is made up of pieces of up to $K - 1$ linear decision boundaries. Accounting for all of the classes, the multiclass SVM's decision function is defined by $\binom{K}{2}$ implicit perceptrons, so all of the boundaries defined by the model will have a piecewise linear form. For the $K = 3$ case, the decision boundaries can be illustrated with the sketch below.
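A small MATLAB sketch (with arbitrary made-up weight vectors, purely illustrative) that draws such piecewise-linear regions for $K = 3$ on a 2-D grid:

% Illustrative only: color a 2-D grid by argmax_y w_y'*x for 3 made-up classes
rng(0);
W = randn(3, 3);                         % d x k convention: one column per class (2 features + bias)
[x1, x2] = meshgrid(linspace(-3, 3, 400));
Xg = [x1(:)'; x2(:)'; ones(1, numel(x1))];
[~, region] = max(W' * Xg, [], 1);       % class with the highest score at each grid point
figure
imagesc(linspace(-3, 3, 400), linspace(-3, 3, 400), reshape(region, size(x1)));
axis xy equal tight
title('Piecewise-linear decision regions for K = 3')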
3 Problem 3
See the svmObj function in the code section.
4 Problem 4
4.1 Part a
See the gradOpt function in the code section.
4.2 Part b
See the code section for the implementation. The plot resulting from training with C = 0.01 is shown below.
4.3 Part c
Training and testing with 7 different C values, we get the following results:

Test accuracy for each C value:

  C        Accuracy
  0.0001   74.86%
  0.001    84.82%
  0.01     87.28%
  0.1      79.66%
  1        79.44%
  10       77.54%
  100      78.22%
The best accuracy (with C = 0.01) was 87.28%. The corresponding confusion matrix is shown below.
4.4 Part d
Plot of the weight vector $w_k$ for each class:
5 Code
5.1 Code
function [ X, Y ] = setupData( varargin )
%Preprocess a dataset into an appropriate form for
% our SVM training/prediction code.
%
% varargin: A set of (N' x d) matrices where each
%           matrix contains all (N') observations for
%           a given class.
%
% X: Combined matrix of all observations (d x N)
% Y: Vector of labels for each observation (1 x N)

%Concatenate the 'X' matrices into one training set
X = double(vertcat(varargin{:})');
%Add the extra dimension
X = [X ; ones(1, size(X, 2))];
%Create a vector with the label for each observation
Y = zeros(1, size(X, 2));
start = 1;
for label = 1:nargin
    numlabel = size(varargin{label}, 1);
    Y(start:(start + numlabel - 1)) = label;
    start = start + numlabel;
end
end
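A toy usage example (the two small matrices below are made up) showing the shapes setupData produces:

% Toy example (made-up data): 3 observations of class 1 and 2 of class 2, each 4-dimensional
A = rand(3, 4);                 % class 1 observations (N' x d)
B = rand(2, 4);                 % class 2 observations (N' x d)
[X, Y] = setupData(A, B);
size(X)                         % 5 x 5: d+1 = 5 rows (constant feature appended), N = 5 columns
Y                               % [1 1 1 2 2]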
function [ obj, grad ] = svmObj( w, C, X, Y )
%Computes the SVM objective and gradient.
%
% w: Current model weights (d x k)
% C: Regularization tradeoff hyperparameter
% X: Input observations (d x N)
% Y: Input labels (1 x N)
%
% obj: Objective value
% grad: Gradient matrix

%'Score' of each class for each observation
wx = w' * X;
%Index of the true class' score for each obs.
yind = sub2ind(size(wx), Y, 1:size(wx, 2));
%Get the true class scores, then 'remove' them
wyx = wx(yind);
wx(yind) = -inf;
%Max scores after removing the true classes
[wyhatx, ~] = max(wx, [], 1);
%Loss for each observation
Loss = max(0, wyhatx + 1 - wyx);
%Objective as given in the notes
obj = 0.5 * (w(:)' * w(:)) + C * sum(Loss);
%If needed compute the gradient
if nargout > 1
    %Create a #classes x #obs. matrix, where
    % entry (i,j) indicates how obs. 'j' should
    % contribute to the gradient for class 'i'
    %[bsxfun applies a given operation row-by-row]
    cm = double(bsxfun(@eq, wx, wyhatx));
    cm(yind) = -1;
    cm = bsxfun(@times, cm, Loss > 0);
    %Compute grad by multiplying our contrib.
    % matrix by our obs. matrix and adding the
    % regularization term
    grad = w + C * (cm * X')';
end
end
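One way to check the analytic gradient in svmObj is against finite differences on a tiny random problem; a minimal sketch (all sizes below are made up):

% Finite-difference check of the svmObj gradient (illustrative sizes)
d = 4;  k = 3;  N = 10;  C = 1;
X = randn(d, N);  Y = randi(k, 1, N);  w = randn(d, k);
[~, grad] = svmObj(w, C, X, Y);
h = 1e-5;
numgrad = zeros(size(w));
for i = 1:numel(w)
    wp = w;  wp(i) = wp(i) + h;
    wm = w;  wm(i) = wm(i) - h;
    numgrad(i) = (svmObj(wp, C, X, Y) - svmObj(wm, C, X, Y)) / (2 * h);
end
max(abs(numgrad(:) - grad(:)))  % should be tiny unless an example sits exactly at a hinge kink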
function [ w, traceout ] = gradOpt( func, w0, r0, T, varargin )
%Generic gradient descent with line search.
%
% func: Function to compute objective and gradient
% w0: Initial values of parameters to optimize
% r0: Base learning rate
% T: # of iterations for gradient descent
% varargin: Additional arguments passed to 'func'
%
% w: Final trained weights
% traceout: Objective value at each iteration

%Initialize the weights and objective trace
trace = zeros(T, 1);
w = w0;
%Run for T iterations
for iter = 1:T
    [obj, grad] = func(w, varargin{:});
    trace(iter) = obj;
    %Line search to find a step size that decreases the objective
    r = r0;
    for lsiter = 1:20
        wnew = w - r * grad;
        r = r * 0.5;
        if func(wnew, varargin{:}) < obj
            w = wnew;
            break;
        end
    end
end
if nargout > 1
    traceout = trace;
end
end
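gradOpt works with any function of the form [obj, grad] = func(w, ...); as a small illustration (the random sizes below are made up), here it is applied directly to svmObj, which is also how svmTrain uses it:

% Small illustration of the gradOpt interface (made-up random data)
d = 4;  k = 3;  N = 20;  C = 1;
X = randn(d, N);  Y = randi(k, 1, N);
[w, trace] = gradOpt(@svmObj, randn(d, k), 0.5, 100, C, X, Y);
figure
plot(trace)                      % objective should decrease across iterations
xlabel('Iteration');  ylabel('SVM objective')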
function [ w, trace ] = svmTrain( X, Y, C, T, r )
%Trains a support vector machine given a dataset.
%
% X: Input observations (d x N)
% Y: Input labels (1 x N)
% C: Regularization tradeoff hyperparameter
% T: # of iterations for gradient descent
% r: Base learning rate for gradient descent
%
% w: Final trained model weights (d x k)
% trace: Objective value at each iteration

%Defaults for the hyperparameters
if nargin < 5
    r = 0.5;
end
if nargin < 4
    T = 100;
end
if nargin < 3
    C = 1;
end
%Randomly initialize the model
nlabels = numel(unique(Y));
dim = size(X, 1);
w0 = randn(dim, nlabels);
%Run gradient descent (w/ or w/o the plot data)
if nargout == 1
    w = gradOpt(@svmObj, w0, r, T, C, X, Y);
else
    [w, trace] = gradOpt(@svmObj, w0, r, T, C, X, Y);
end
end
function [ Y ] = svmPredict( w, X )
%Makes predictions for a set of observations given
% a trained model.
%
% w: Trained model weights (d x k)
% X: Input observations (d x N)
%
% Y: Output labels (1 x N)
wx = w' * X;
[~, Y] = max(wx, [], 1);
end

function [ acc ] = score( Ytrue, Ypred )
%Computes the accuracy of a set of predictions
%
% Ytrue: True labels of data (1 x N)
% Ypred: Predicted labels of data (1 x N)
%
% acc: Accuracy
acc = sum(double(Ytrue == Ypred)) / numel(Ytrue);
end
function [ mat ] = confusion( Ytrue, Ypred )
%Generate a confusion matrix for a set of predictions.
%
% Ytrue: True labels of data (1 x N)
% Ypred: Predicted labels of data (1 x N)
%
% mat: Confusion matrix
labels = unique([Ytrue, Ypred]);
nlabels = numel(labels);
mat = zeros(nlabels);
for l1 = 1:nlabels
    for l2 = 1:nlabels
        mat(l1, l2) = sum(double(Ytrue == labels(l1) & Ypred == labels(l2)));
    end
end
end
%hw5.m
%Runs all of the experiments for homework 5

%Load and combine the training and test data matrices
load digits.mat
[Xtrain, Ytrain] = setupData(train0, train1, train2, ...
    train3, train4, train5, train6, train7, train8, ...
    train9);
[Xtest, Ytest] = setupData(test0, test1, test2, test3, ...
    test4, test5, test6, test7, test8, test9);
%Variables to save our best result
bestC = 0;
bestTrace = [];
bestPred = [];
bestW = [];
bestAcc = 0;
%Train and evaluate the SVM for a range of C values
for C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
    [w, trace] = svmTrain(Xtrain, Ytrain, C);
    Ypred = svmPredict(w, Xtest);
    acc = score(Ytest, Ypred);
    %Save the best result
    if acc > bestAcc
        bestC = C;
        bestTrace = trace;
        bestPred = Ypred;
        bestW = w;
        bestAcc = acc;
    end
end
%Report the test accuracy and confusion matrix
%(Alternatively , use Matlab 's plotconfusion func.)
bestAcc
confusionMat = confusion(Ytest, bestPred)
%Plot the convergence of gradient descent
figure
plot(bestTrace)
title('Objective vs. Iterations')
xlabel('Gradient descent iterations')
ylabel('SVM objective')
%Visualize the weights for the best classifier
figure
title('Best Classifier Visualization')
for digit = 1:10
    img = bestW(1:(end-1), digit);
    subplot(2, 5, digit);
    imagesc(reshape(img, 28, 28)');
    axis off
end