Stochastic Gradient, Multi-Class Classification, Regularization
CS540 Introduction to Artificial Intelligence, Lecture 4
Young Wu (based on lecture slides by Jerry Zhu, Yingyu Liang, and Charles Dyer)
June 15, 2020


Perceptron Algorithm vs Logistic Regression (Motivation)

For LTU perceptrons, $w$ is updated for each instance $x_i$ sequentially.

$w \leftarrow w - \alpha (a_i - y_i) x_i$

For logistic perceptrons, $w$ is updated using the gradient that involves all instances in the training data.

$w \leftarrow w - \alpha \sum_{i=1}^{n} (a_i - y_i) x_i$
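
A minimal NumPy sketch (not from the slides) contrasting the two update rules; $a_i$ denotes the activation (model output) for instance $x_i$, and the function names are illustrative only.

```python
import numpy as np

def perceptron_step(w, x_i, a_i, y_i, alpha):
    """LTU perceptron: update w using one instance (x_i, y_i) at a time."""
    return w - alpha * (a_i - y_i) * x_i

def logistic_batch_step(w, X, a, y, alpha):
    """Logistic perceptron: update w using the gradient summed over all n instances."""
    # X has shape (n, m); a and y have shape (n,).
    return w - alpha * X.T @ (a - y)
```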

Stochastic Gradient Descent (Motivation)

Each gradient descent step requires computing the gradients for all training instances $i = 1, 2, ..., n$, which is very costly.

Stochastic gradient descent picks one instance $x_i$ at random, computes the gradient for that instance, and updates the weights and biases.

When a small subset of instances is selected randomly each time, it is called mini-batch gradient descent.
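
A sketch of how one update step differs between the three variants; it assumes a helper gradient(w, X, y) (not defined in the slides) that returns the summed gradient over the instances it is given.

```python
import numpy as np

def descent_step(w, X, y, alpha, gradient, batch_size=None):
    """One update step: full gradient, single random instance, or mini-batch."""
    n = X.shape[0]
    if batch_size is None:                 # full gradient descent: all n instances
        idx = np.arange(n)
    elif batch_size == 1:                  # stochastic gradient descent: one random instance
        idx = np.random.choice(n, size=1)
    else:                                  # mini-batch gradient descent: a small random subset
        idx = np.random.choice(n, size=batch_size, replace=False)
    return w - alpha * gradient(w, X[idx], y[idx])
```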

Stochastic Gradient Descent Diagram 1 (Motivation)

Stochastic Gradient Descent Diagram 2 (Motivation)

Stochastic Gradient Descent, Part 1 (Algorithm)

Inputs, Outputs: same as backpropagation.

Initialize the weights.

Randomly permute (shuffle) the training set. Evaluate the activation functions at one instance at a time.

Compute the gradient using the chain rule.

$\dfrac{\partial C}{\partial w^{(l)}_{j'j}} = \delta^{(l)}_{ij} a^{(l-1)}_{ij'}, \qquad \dfrac{\partial C}{\partial b^{(l)}_{j}} = \delta^{(l)}_{ij}$

Stochastic Gradient Descent, Part 2 (Algorithm)

Update the weights and biases using gradient descent.

For $l = 1, 2, ..., L$:

$w^{(l)}_{j'j} \leftarrow w^{(l)}_{j'j} - \alpha \dfrac{\partial C}{\partial w^{(l)}_{j'j}}, \quad j' = 1, 2, ..., m^{(l-1)}, \; j = 1, 2, ..., m^{(l)}$

$b^{(l)}_{j} \leftarrow b^{(l)}_{j} - \alpha \dfrac{\partial C}{\partial b^{(l)}_{j}}, \quad j = 1, 2, ..., m^{(l)}$

Repeat the process until convergence.

$\left| C - C_{\text{prev}} \right| < \varepsilon$
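
A sketch of the full loop for the special case of a single logistic unit (rather than a general multi-layer network), with the shuffling step and the stopping criterion above; the cross-entropy cost is used so that the per-instance gradient is $(a_i - y_i) x_i$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sgd(X, y, alpha=0.1, eps=1e-6, max_epochs=1000):
    """Stochastic gradient descent for one logistic unit with weights w and bias b."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    c_prev = np.inf
    for epoch in range(max_epochs):
        for i in np.random.permutation(n):       # randomly permute (shuffle) the training set
            a_i = sigmoid(X[i] @ w + b)          # activation for one instance at a time
            w -= alpha * (a_i - y[i]) * X[i]     # gradient step for the weights
            b -= alpha * (a_i - y[i])            # gradient step for the bias
        a = sigmoid(X @ w + b)                   # cost over the whole training set
        c = -np.mean(y * np.log(a + 1e-12) + (1 - y) * np.log(1 - a + 1e-12))
        if abs(c - c_prev) < eps:                # stop when |C - C_prev| < eps
            break
        c_prev = c
    return w, b
```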

Choice of Learning Rate (Discussion)

Changing the learning rate $\alpha$ as the weights get closer to the optimal weights could speed up convergence.

Popular choices of learning rate include $\dfrac{\alpha}{\sqrt{t}}$ and $\dfrac{\alpha}{t}$, where $t$ is the current number of iterations.

Other methods of choosing the step size include using second-derivative (Hessian) information, such as Newton's method and BFGS, or using information about the gradients from previous steps, as in adaptive gradient methods like AdaGrad and Adam.
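
The two decaying schedules can be written directly as functions of the iteration count $t$ (a sketch; alpha0 is the base learning rate):

```python
import math

def lr_inverse_sqrt(alpha0, t):
    """Learning rate alpha0 / sqrt(t) at iteration t >= 1."""
    return alpha0 / math.sqrt(t)

def lr_inverse(alpha0, t):
    """Learning rate alpha0 / t at iteration t >= 1."""
    return alpha0 / t
```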

Stochastic vs Full Gradient Descent (Quiz)

Given the same initial weights and biases, stochastic gradient descent with instances picked randomly without replacement and full gradient descent lead to the same updated weights.

A: Do not choose this.

B: True.

C: Do not choose this.

D: False.

E: Do not choose this.

Multi-Class Classification (Motivation)

When there are $K$ categories to classify, the labels can take $K$ different values, $y_i \in \{1, 2, ..., K\}$. Binary logistic regression and a neural network with a single sigmoid output cannot be applied directly to these problems.

Method 1, One VS All (Discussion)

Train a binary classification model with labels $y'_i = 1_{\{y_i = j\}}$ for each $j = 1, 2, ..., K$.

Given a new test instance $x_i$, evaluate the activation function $a^{(j)}_i$ from model $j$.

$\hat{y}_i = \arg\max_{j} \; a^{(j)}_i$

One problem is that the scale of $a^{(j)}_i$ may be different for different $j$.
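
A sketch of one-vs-all training and prediction; train_binary and activation are hypothetical helpers (a binary trainer and its activation function), and classes are indexed 0, ..., K-1 here.

```python
import numpy as np

def one_vs_all_train(X, y, K, train_binary):
    """Train K binary models; model j is trained on the labels 1{y_i = j}."""
    return [train_binary(X, (y == j).astype(float)) for j in range(K)]

def one_vs_all_predict(models, activation, x_i):
    """Predict the class j whose model gives the largest activation a_i^(j)."""
    scores = [activation(model, x_i) for model in models]
    return int(np.argmax(scores))
```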

Method 2, One VS One (Discussion)

Train a binary classification model for each of the $\dfrac{K(K-1)}{2}$ pairs of labels.

Given a new test instance $x_i$, apply all $\dfrac{K(K-1)}{2}$ models and output the class that receives the largest number of votes.

$\hat{y}_i = \arg\max_{j} \sum_{j' \neq j} \hat{y}^{(j \text{ vs } j')}_i$

One problem is that it is not clear what to do if multiple classes receive the same number of votes.
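
A sketch of one-vs-one voting with the same kind of hypothetical helpers; np.argmax resolves ties by picking the smallest class index, which is one arbitrary choice among several.

```python
import numpy as np
from itertools import combinations

def one_vs_one_train(X, y, K, train_binary):
    """Train one binary model per pair (j, j'), using only the instances labeled j or j'."""
    models = {}
    for j, jp in combinations(range(K), 2):
        mask = (y == j) | (y == jp)
        models[(j, jp)] = train_binary(X[mask], (y[mask] == j).astype(float))
    return models

def one_vs_one_predict(models, predict, x_i, K):
    """Each pairwise model votes for one class; output the class with the most votes."""
    votes = np.zeros(K, dtype=int)
    for (j, jp), model in models.items():
        votes[j if predict(model, x_i) == 1 else jp] += 1   # model (j, jp) returns 1 for class j
    return int(np.argmax(votes))
```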

One Hot Encoding (Discussion)

If $y$ is not binary, use one-hot encoding for $y$.

For example, if y has three categories, then

$y_i \in \left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\}$
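
One-hot encoding takes one line in NumPy; this sketch assumes integer labels 0, ..., K-1 rather than 1, ..., K.

```python
import numpy as np

def one_hot(y, K):
    """Convert integer labels y (values 0..K-1) into one-hot row vectors."""
    return np.eye(K)[y]

# Example: one_hot(np.array([0, 2, 1]), 3) returns
# [[1., 0., 0.],
#  [0., 0., 1.],
#  [0., 1., 0.]]
```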

Method 3, Softmax Function (Discussion)

For both logistic regression and neural networks, the last layer will have $K$ units, $a_{ij}$ for $j = 1, 2, ..., K$, and the softmax function is used instead of the sigmoid function.

$a_{ij} = g\left(w_j^{T} x_i + b_j\right) = \dfrac{\exp\left(w_j^{T} x_i + b_j\right)}{\sum_{j'=1}^{K} \exp\left(w_{j'}^{T} x_i + b_{j'}\right)}, \quad j = 1, 2, ..., K$
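
A sketch of the softmax output layer; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax_layer(W, b, x_i):
    """Return a_ij = exp(w_j^T x_i + b_j) / sum_j' exp(w_j'^T x_i + b_j').

    W has shape (K, m) with row j equal to w_j, b has shape (K,), x_i has shape (m,).
    """
    z = W @ x_i + b
    z = z - np.max(z)        # softmax is unchanged by subtracting a constant
    e = np.exp(z)
    return e / np.sum(e)
```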

Softmax Derivatives (Discussion)

Cross-entropy loss is also commonly used with a softmax activation function.

The gradient of the cross-entropy loss with respect to $a_{ij}$, component $j$ of the output-layer activation for instance $i$, has the same form as the one for logistic regression.

$\dfrac{\partial C}{\partial a_{ij}} = a_{ij} - y_{ij} \;\Rightarrow\; \nabla_{a_i} C = a_i - y_i$

The gradient with respect to the weights can be found using the chain rule.
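
For one instance, applying the chain rule gives the weight gradient as the outer product of the error with the input; a sketch with the same shapes as above, where y_i is the one-hot label.

```python
import numpy as np

def softmax_cross_entropy_grad(a_i, y_i, x_i):
    """Gradients of the cross-entropy cost for one instance with softmax outputs."""
    delta = a_i - y_i                # error term a_i - y_i (shape (K,))
    grad_W = np.outer(delta, x_i)    # shape (K, m): gradient with respect to the weights
    grad_b = delta                   # shape (K,):   gradient with respect to the biases
    return grad_W, grad_b
```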

Generalization Error (Motivation)

With a large number of hidden units and a small enough learning rate $\alpha$, a multi-layer neural network can fit every finite training set perfectly.

It does not imply the performance on the test set will be good.

This problem is called overfitting.

Generalization Error Diagram (Motivation)

Method 1, Validation Set (Discussion)

Set aside a subset of the training set as the validation set.

During training, the cost on the training set will keep decreasing (and the training accuracy will keep increasing) until the cost reaches 0.

Train the network only until the cost on the validation set begins to increase (equivalently, until the accuracy on the validation set begins to decrease).
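
A sketch of this early-stopping loop, assuming hypothetical train_one_epoch and cost helpers.

```python
def train_with_early_stopping(model, train_set, val_set, train_one_epoch, cost, max_epochs=100):
    """Train until the cost on the validation set begins to increase."""
    best_val_cost = float("inf")
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)     # one pass of (stochastic) gradient descent
        val_cost = cost(model, val_set)
        if val_cost > best_val_cost:          # validation cost started to increase: stop
            break
        best_val_cost = val_cost
    return model
```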

Validation Set Diagram (Discussion)

Method 2, Drop Out (Discussion)

At each hidden layer, a random set of units from that layer is set to 0.

For example, each unit is retained with probability $p = 0.5$. At test time, the activations are scaled by $p = 0.5$ (reduced by 50 percent) to compensate.

The intuition is that a hidden unit that works well with many different random combinations of the other units cannot rely on any particular unit, so it is likely to be individually useful.
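
A sketch of dropout applied to one layer's hidden activations, with the test-time scaling by p described above.

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """Drop hidden units during training; scale the activations by p at test time.

    During training, each unit is retained with probability p and set to 0 otherwise.
    """
    if train:
        mask = (np.random.rand(*h.shape) < p).astype(h.dtype)
        return h * mask
    return h * p
```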

Drop Out Diagram (Discussion)

Method 3, L1 and L2 Regularization (Discussion)

The idea is to include an additional cost for non-zero weights.

The models are simpler if many weights are zero.

For example, if logistic regression has only a few non-zero weights, it means only a few features are relevant, so only these features are used for prediction.

Method 3, L1 Regularization (Discussion)

For L1 regularization, add the 1-norm of the weights to the cost.

$C = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\|_1 = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left( \sum_{j=1}^{m} |w_j| + |b| \right)$

Linear regression with L1 regularization is called LASSO (least absolute shrinkage and selection operator).
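
A direct translation of the L1-regularized cost above (squared loss plus lambda times the 1-norm of the weights and bias):

```python
import numpy as np

def l1_cost(a, y, w, b, lam):
    """C = sum_i (a_i - y_i)^2 + lam * (sum_j |w_j| + |b|)."""
    return np.sum((a - y) ** 2) + lam * (np.sum(np.abs(w)) + abs(b))
```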

Method 3, L2 Regularization (Discussion)

For L2 regularization, add the squared 2-norm of the weights to the cost.

$C = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\|_2^2 = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left( \sum_{j=1}^{m} w_j^2 + b^2 \right)$
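
And the corresponding L2-regularized cost; its gradient with respect to each weight simply adds the extra term 2 * lam * w_j to the unregularized gradient.

```python
import numpy as np

def l2_cost(a, y, w, b, lam):
    """C = sum_i (a_i - y_i)^2 + lam * (sum_j w_j^2 + b^2)."""
    return np.sum((a - y) ** 2) + lam * (np.sum(w ** 2) + b ** 2)
```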

L1 and L2 Regularization Comparison (Discussion)

L1 regularization leads to more weights that are exactly 0. It is useful for feature selection.

L2 regularization leads to more weights that are close to 0 but not exactly 0. It is also easier to use with gradient descent, because the 2-norm is differentiable everywhere while the 1-norm is not differentiable at 0.

L1 and L2 Regularization Diagram (Discussion)

Method 4, Data Augmentation (Discussion)

More training data can be created from the existing instances, for example by translating or rotating handwritten digit images.
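
For image-shaped digit instances, shifted and rotated copies can be generated with scipy.ndimage (a sketch; the shift and rotation ranges are arbitrary choices):

```python
import numpy as np
from scipy import ndimage

def augment_digit(image, max_shift=2, max_angle=15):
    """Create one randomly translated and rotated copy of a 2D digit image."""
    dx, dy = np.random.randint(-max_shift, max_shift + 1, size=2)
    angle = np.random.uniform(-max_angle, max_angle)
    shifted = ndimage.shift(image, (dy, dx), mode="constant", cval=0.0)
    return ndimage.rotate(shifted, angle, reshape=False, mode="constant", cval=0.0)
```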

Hyperparameters (Discussion)

It is not clear how to choose the learning rate $\alpha$, the stopping criterion $\varepsilon$, and the regularization parameters.

For neural networks, it is also not clear how to choose the number of hidden layers and the number of hidden units in each layer.

The parameters that are not parameters of the functions in the hypothesis space are called hyperparameters.

K Fold Cross Validation (Discussion)

Partition the training set into K groups.

Pick one group as the validation set.

Train the model on the remaining training set.

Repeat the process for each of the K groups.

Compare the accuracy (or cost) of models trained with different hyperparameters and select the best one.
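
A sketch of this loop for one hyperparameter setting, assuming hypothetical train and accuracy helpers; running it for each candidate setting and keeping the best average score implements the selection step.

```python
import numpy as np

def k_fold_cv_score(X, y, K, train, accuracy):
    """Average validation accuracy over K folds for one hyperparameter setting."""
    n = X.shape[0]
    folds = np.array_split(np.random.permutation(n), K)   # partition indices into K groups
    scores = []
    for k in range(K):
        val_idx = folds[k]                                 # one group as the validation set
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train(X[train_idx], y[train_idx])          # train on the remaining groups
        scores.append(accuracy(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```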

5 Fold Cross Validation Example (Discussion)

Partition the training set $S$ into 5 subsets $S_1, S_2, S_3, S_4, S_5$ with

$S_i \cap S_j = \varnothing$ for $i \neq j$, and $\bigcup_{i=1}^{5} S_i = S$

Iteration   Training set                Validation set
1           S_2 ∪ S_3 ∪ S_4 ∪ S_5       S_1
2           S_1 ∪ S_3 ∪ S_4 ∪ S_5       S_2
3           S_1 ∪ S_2 ∪ S_4 ∪ S_5       S_3
4           S_1 ∪ S_2 ∪ S_3 ∪ S_5       S_4
5           S_1 ∪ S_2 ∪ S_3 ∪ S_4       S_5

Leave One Out Cross Validation (Discussion)

If $K = n$, exactly one training instance is left out as the validation set each time. This special case is called Leave One Out Cross Validation (LOOCV).