Stochastic Gradient, Multi-Class Classification, Regularization
CS540 Introduction to Artificial Intelligence, Lecture 4
Young Wu (based on lecture slides by Jerry Zhu, Yingyu Liang, and Charles Dyer)
June 15, 2020


Perceptron Algorithm vs Logistic Regression (Motivation)

For LTU perceptrons, $w$ is updated for each instance $x_i$ sequentially.

$w \leftarrow w - \alpha (a_i - y_i) x_i$

For logistic perceptrons, $w$ is updated using the gradient that involves all instances in the training data.

$w \leftarrow w - \alpha \sum_{i=1}^{n} (a_i - y_i) x_i$
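
A minimal NumPy sketch (not from the slides) contrasting the two update rules; $a_i$ denotes the activation (model output) for instance $x_i$, and the function names are illustrative only.

```python
import numpy as np

def perceptron_step(w, x_i, a_i, y_i, alpha):
    """LTU perceptron: update w using one instance (x_i, y_i) at a time."""
    return w - alpha * (a_i - y_i) * x_i

def logistic_batch_step(w, X, a, y, alpha):
    """Logistic perceptron: update w using the gradient summed over all n instances."""
    # X has shape (n, m); a and y have shape (n,).
    return w - alpha * X.T @ (a - y)
```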

Stochastic Gradient Descent (Motivation)

Each gradient descent step requires computing the gradients for all training instances $i = 1, 2, ..., n$, which is very costly.

Stochastic gradient descent picks one instance $x_i$ at random, computes the gradient for that instance, and updates the weights and biases.

When a small subset of instances is selected randomly each time, it is called mini-batch gradient descent.
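
A sketch of how one update step differs between the three variants; it assumes a helper gradient(w, X, y) (not defined in the slides) that returns the summed gradient over the instances it is given.

```python
import numpy as np

def descent_step(w, X, y, alpha, gradient, batch_size=None):
    """One update step: full gradient, single random instance, or mini-batch."""
    n = X.shape[0]
    if batch_size is None:                 # full gradient descent: all n instances
        idx = np.arange(n)
    elif batch_size == 1:                  # stochastic gradient descent: one random instance
        idx = np.random.choice(n, size=1)
    else:                                  # mini-batch gradient descent: a small random subset
        idx = np.random.choice(n, size=batch_size, replace=False)
    return w - alpha * gradient(w, X[idx], y[idx])
```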

Stochastic Gradient Descent Diagram 1 (Motivation)

Stochastic Gradient Descent Diagram 2 (Motivation)

Stochastic Gradient Descent, Part 1 (Algorithm)

Inputs, Outputs: same as backpropagation.

Initialize the weights.

Randomly permute (shuffle) the training set. Evaluate the activation functions at one instance at a time.

Compute the gradient using the chain rule.

$\dfrac{\partial C}{\partial w^{(l)}_{j'j}} = \delta^{(l)}_{ij} a^{(l-1)}_{ij'}, \qquad \dfrac{\partial C}{\partial b^{(l)}_{j}} = \delta^{(l)}_{ij}$

Stochastic Gradient Descent, Part 2 (Algorithm)

Update the weights and biases using gradient descent.

For $l = 1, 2, ..., L$:

$w^{(l)}_{j'j} \leftarrow w^{(l)}_{j'j} - \alpha \dfrac{\partial C}{\partial w^{(l)}_{j'j}}, \quad j' = 1, 2, ..., m^{(l-1)}, \; j = 1, 2, ..., m^{(l)}$

$b^{(l)}_{j} \leftarrow b^{(l)}_{j} - \alpha \dfrac{\partial C}{\partial b^{(l)}_{j}}, \quad j = 1, 2, ..., m^{(l)}$

Repeat the process until convergence.

$\left| C - C_{\text{prev}} \right| < \varepsilon$
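
A sketch of the full loop for the special case of a single logistic unit (rather than a general multi-layer network), with the shuffling step and the stopping criterion above; the cross-entropy cost is used so that the per-instance gradient is $(a_i - y_i) x_i$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sgd(X, y, alpha=0.1, eps=1e-6, max_epochs=1000):
    """Stochastic gradient descent for one logistic unit with weights w and bias b."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    c_prev = np.inf
    for epoch in range(max_epochs):
        for i in np.random.permutation(n):       # randomly permute (shuffle) the training set
            a_i = sigmoid(X[i] @ w + b)          # activation for one instance at a time
            w -= alpha * (a_i - y[i]) * X[i]     # gradient step for the weights
            b -= alpha * (a_i - y[i])            # gradient step for the bias
        a = sigmoid(X @ w + b)                   # cost over the whole training set
        c = -np.mean(y * np.log(a + 1e-12) + (1 - y) * np.log(1 - a + 1e-12))
        if abs(c - c_prev) < eps:                # stop when |C - C_prev| < eps
            break
        c_prev = c
    return w, b
```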

Choice of Learning Rate (Discussion)

Changing the learning rate $\alpha$ as the weights get closer to the optimal weights could speed up convergence.

Popular choices of learning rate include $\dfrac{\alpha}{\sqrt{t}}$ and $\dfrac{\alpha}{t}$, where $t$ is the current number of iterations.

Other methods of choosing the step size include using second-derivative (Hessian) information, such as Newton's method and BFGS, or using information about the gradients from previous steps, as in adaptive gradient methods like AdaGrad and Adam.
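
The two decaying schedules can be written directly as functions of the iteration count $t$ (a sketch; alpha0 is the base learning rate):

```python
import math

def lr_inverse_sqrt(alpha0, t):
    """Learning rate alpha0 / sqrt(t) at iteration t >= 1."""
    return alpha0 / math.sqrt(t)

def lr_inverse(alpha0, t):
    """Learning rate alpha0 / t at iteration t >= 1."""
    return alpha0 / t
```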

Stochastic vs Full Gradient Descent (Quiz)

Given the same initial weights and biases, stochastic gradient descent with instances picked randomly without replacement and full gradient descent lead to the same updated weights.

A: Do not choose this.

B: True.

C: Do not choose this.

D: False.

E: Do not choose this.

Multi-Class Classification (Motivation)

When there are $K$ categories to classify, the labels can take $K$ different values, $y_i \in \{1, 2, ..., K\}$. Binary logistic regression and a neural network with a single sigmoid output cannot be applied directly to these problems.

Method 1, One VS All (Discussion)

Train a binary classification model with labels $y'_i = 1_{\{y_i = j\}}$ for each $j = 1, 2, ..., K$.

Given a new test instance $x_i$, evaluate the activation function $a^{(j)}_i$ from model $j$.

$\hat{y}_i = \arg\max_{j} \; a^{(j)}_i$

One problem is that the scale of $a^{(j)}_i$ may be different for different $j$.
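
A sketch of one-vs-all training and prediction; train_binary and activation are hypothetical helpers (a binary trainer and its activation function), and classes are indexed 0, ..., K-1 here.

```python
import numpy as np

def one_vs_all_train(X, y, K, train_binary):
    """Train K binary models; model j is trained on the labels 1{y_i = j}."""
    return [train_binary(X, (y == j).astype(float)) for j in range(K)]

def one_vs_all_predict(models, activation, x_i):
    """Predict the class j whose model gives the largest activation a_i^(j)."""
    scores = [activation(model, x_i) for model in models]
    return int(np.argmax(scores))
```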

Method 2, One VS One (Discussion)

Train a binary classification model for each of the $\dfrac{K(K-1)}{2}$ pairs of labels.

Given a new test instance $x_i$, apply all $\dfrac{K(K-1)}{2}$ models and output the class that receives the largest number of votes.

$\hat{y}_i = \arg\max_{j} \sum_{j' \neq j} \hat{y}^{(j \text{ vs } j')}_i$

One problem is that it is not clear what to do if multiple classes receive the same number of votes.
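
A sketch of one-vs-one voting with the same kind of hypothetical helpers; np.argmax resolves ties by picking the smallest class index, which is one arbitrary choice among several.

```python
import numpy as np
from itertools import combinations

def one_vs_one_train(X, y, K, train_binary):
    """Train one binary model per pair (j, j'), using only the instances labeled j or j'."""
    models = {}
    for j, jp in combinations(range(K), 2):
        mask = (y == j) | (y == jp)
        models[(j, jp)] = train_binary(X[mask], (y[mask] == j).astype(float))
    return models

def one_vs_one_predict(models, predict, x_i, K):
    """Each pairwise model votes for one class; output the class with the most votes."""
    votes = np.zeros(K, dtype=int)
    for (j, jp), model in models.items():
        votes[j if predict(model, x_i) == 1 else jp] += 1   # model (j, jp) returns 1 for class j
    return int(np.argmax(votes))
```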

One Hot Encoding (Discussion)

If $y$ is not binary, use one-hot encoding for $y$.

For example, if y has three categories, then

$y_i \in \left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\}$
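
One-hot encoding takes one line in NumPy; this sketch assumes integer labels 0, ..., K-1 rather than 1, ..., K.

```python
import numpy as np

def one_hot(y, K):
    """Convert integer labels y (values 0..K-1) into one-hot row vectors."""
    return np.eye(K)[y]

# Example: one_hot(np.array([0, 2, 1]), 3) returns
# [[1., 0., 0.],
#  [0., 0., 1.],
#  [0., 1., 0.]]
```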

Method 3, Softmax Function (Discussion)

For both logistic regression and neural networks, the last layer will have $K$ units, $a_{ij}$ for $j = 1, 2, ..., K$, and the softmax function is used instead of the sigmoid function.

$a_{ij} = g\left(w_j^{T} x_i + b_j\right) = \dfrac{\exp\left(w_j^{T} x_i + b_j\right)}{\sum_{j'=1}^{K} \exp\left(w_{j'}^{T} x_i + b_{j'}\right)}, \quad j = 1, 2, ..., K$
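
A sketch of the softmax output layer; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax_layer(W, b, x_i):
    """Return a_ij = exp(w_j^T x_i + b_j) / sum_j' exp(w_j'^T x_i + b_j').

    W has shape (K, m) with row j equal to w_j, b has shape (K,), x_i has shape (m,).
    """
    z = W @ x_i + b
    z = z - np.max(z)        # softmax is unchanged by subtracting a constant
    e = np.exp(z)
    return e / np.sum(e)
```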

Softmax Derivatives (Discussion)

Cross-entropy loss is also commonly used with a softmax activation function.

The gradient of the cross-entropy loss with respect to $a_{ij}$, component $j$ of the output-layer activation for instance $i$, has the same form as the one for logistic regression.

$\dfrac{\partial C}{\partial a_{ij}} = a_{ij} - y_{ij} \;\Rightarrow\; \nabla_{a_i} C = a_i - y_i$

The gradient with respect to the weights can be found using the chain rule.
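
For one instance, applying the chain rule gives the weight gradient as the outer product of the error with the input; a sketch with the same shapes as above, where y_i is the one-hot label.

```python
import numpy as np

def softmax_cross_entropy_grad(a_i, y_i, x_i):
    """Gradients of the cross-entropy cost for one instance with softmax outputs."""
    delta = a_i - y_i                # error term a_i - y_i (shape (K,))
    grad_W = np.outer(delta, x_i)    # shape (K, m): gradient with respect to the weights
    grad_b = delta                   # shape (K,):   gradient with respect to the biases
    return grad_W, grad_b
```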

Generalization Error (Motivation)

With a large number of hidden units and a small enough learning rate $\alpha$, a multi-layer neural network can fit every finite training set perfectly.

It does not imply the performance on the test set will be good.

This problem is called overfitting.

Generalization Error Diagram (Motivation)

Method 1, Validation Set (Discussion)

Set aside a subset of the training set as the validation set.

During training, the cost on the training set will keep decreasing (and the training accuracy will keep increasing) until the cost reaches 0.

Train the network only until the cost on the validation set begins to increase (equivalently, until the accuracy on the validation set begins to decrease).
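
A sketch of this early-stopping loop, assuming hypothetical train_one_epoch and cost helpers.

```python
def train_with_early_stopping(model, train_set, val_set, train_one_epoch, cost, max_epochs=100):
    """Train until the cost on the validation set begins to increase."""
    best_val_cost = float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)     # one pass of (stochastic) gradient descent
        val_cost = cost(model, val_set)
        if val_cost > best_val_cost:          # validation cost started to increase: stop
            break
        best_val_cost = val_cost
    return model
```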

Validation Set Diagram (Discussion)

Method 2, Drop Out (Discussion)

At each hidden layer, a random set of units from that layer is set to 0.

For example, each unit is retained with probability $p = 0.5$. At test time, the activations are scaled by $p = 0.5$ (reduced by 50 percent) to compensate.

The intuition is that a hidden unit that works well with many different random combinations of the other units cannot rely on any particular unit, so it is likely to be individually useful.
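
A sketch of dropout applied to one layer's hidden activations, with the test-time scaling by p described above.

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """Drop hidden units during training; scale the activations by p at test time.

    During training, each unit is retained with probability p and set to 0 otherwise.
    """
    if train:
        mask = (np.random.rand(*h.shape) < p).astype(h.dtype)
        return h * mask
    return h * p
```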

Drop Out Diagram (Discussion)

Method 3, L1 and L2 Regularization (Discussion)

The idea is to include an additional cost for non-zero weights.

The models are simpler if many weights are zero.

For example, if logistic regression has only a few non-zero weights, it means only a few features are relevant, so only these features are used for prediction.

Method 3, L1 Regularization (Discussion)

For L1 regularization, add the 1-norm of the weights to the cost.

$C = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\|_1 = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left( \sum_{j=1}^{m} |w_j| + |b| \right)$

Linear regression with L1 regularization is called LASSO (least absolute shrinkage and selection operator).
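
A direct translation of the L1-regularized cost above (squared loss plus lambda times the 1-norm of the weights and bias):

```python
import numpy as np

def l1_cost(a, y, w, b, lam):
    """C = sum_i (a_i - y_i)^2 + lam * (sum_j |w_j| + |b|)."""
    return np.sum((a - y) ** 2) + lam * (np.sum(np.abs(w)) + abs(b))
```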

Method 3, L2 Regularization (Discussion)

For L2 regularization, add the squared 2-norm of the weights to the cost.

$C = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\|_2^2 = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left( \sum_{j=1}^{m} w_j^2 + b^2 \right)$
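
And the corresponding L2-regularized cost; its gradient with respect to each weight simply adds the extra term 2 * lam * w_j to the unregularized gradient.

```python
import numpy as np

def l2_cost(a, y, w, b, lam):
    """C = sum_i (a_i - y_i)^2 + lam * (sum_j w_j^2 + b^2)."""
    return np.sum((a - y) ** 2) + lam * (np.sum(w ** 2) + b ** 2)
```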

L1 and L2 Regularization Comparison (Discussion)

L1 regularization leads to more weights that are exactly 0. It is useful for feature selection.

L2 regularization leads to more weights that are close to 0 but not exactly 0. It is also easier to use with gradient descent, because the 2-norm is differentiable everywhere while the 1-norm is not differentiable at 0.

L1 and L2 Regularization Diagram (Discussion)

Method 4, Data Augmentation (Discussion)

More training data can be created from the existing instances, for example by translating or rotating handwritten digit images.
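
For image-shaped digit instances, shifted and rotated copies can be generated with scipy.ndimage (a sketch; the shift and rotation ranges are arbitrary choices):

```python
import numpy as np
from scipy import ndimage

def augment_digit(image, max_shift=2, max_angle=15):
    """Create one randomly translated and rotated copy of a 2D digit image."""
    dx, dy = np.random.randint(-max_shift, max_shift + 1, size=2)
    angle = np.random.uniform(-max_angle, max_angle)
    shifted = ndimage.shift(image, (dy, dx), mode="constant", cval=0.0)
    return ndimage.rotate(shifted, angle, reshape=False, mode="constant", cval=0.0)
```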

Hyperparameters (Discussion)

It is not clear how to choose the learning rate $\alpha$, the stopping criterion $\varepsilon$, and the regularization parameters.

For neural networks, it is also not clear how to choose the number of hidden layers and the number of hidden units in each layer.

The parameters that are not parameters of the functions in the hypothesis space are called hyperparameters.

K Fold Cross Validation (Discussion)

Partition the training set into K groups.

Pick one group as the validation set.

Train the model on the remaining training set.

Repeat the process for each of the K groups.

Compare the accuracy (or cost) of models trained with different hyperparameters and select the best one.
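
A sketch of this loop for one hyperparameter setting, assuming hypothetical train and accuracy helpers; running it for each candidate setting and keeping the best average score implements the selection step.

```python
import numpy as np

def k_fold_cv_score(X, y, K, train, accuracy):
    """Average validation accuracy over K folds for one hyperparameter setting."""
    n = X.shape[0]
    folds = np.array_split(np.random.permutation(n), K)   # partition indices into K groups
    scores = []
    for k in range(K):
        val_idx = folds[k]                                 # one group as the validation set
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train(X[train_idx], y[train_idx])          # train on the remaining groups
        scores.append(accuracy(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```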

5 Fold Cross Validation Example (Discussion)

Partition the training set $S$ into 5 subsets $S_1, S_2, S_3, S_4, S_5$ with

$S_i \cap S_j = \varnothing$ for $i \neq j$, and $\bigcup_{i=1}^{5} S_i = S$

Iteration   Training set                Validation set
1           S_2 ∪ S_3 ∪ S_4 ∪ S_5       S_1
2           S_1 ∪ S_3 ∪ S_4 ∪ S_5       S_2
3           S_1 ∪ S_2 ∪ S_4 ∪ S_5       S_3
4           S_1 ∪ S_2 ∪ S_3 ∪ S_5       S_4
5           S_1 ∪ S_2 ∪ S_3 ∪ S_4       S_5

Leave One Out Cross Validation (Discussion)

If $K = n$, exactly one training instance is left out as the validation set each time. This special case is called Leave One Out Cross Validation (LOOCV).