Machine Learning using Matlab
Lecture 6 Neural Network (cont.)
Cost function
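As a reference, a standard form of the regularized cost function for a multi-class neural network (a hedged reconstruction, following the logistic-style outputs used throughout this lecture, not the slide's original formula):

\[
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
\left[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k
     + \big(1 - y_k^{(i)}\big) \log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \right]
+ \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2
\]

where m is the number of training examples, K the number of output units, and λ the regularization strength.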
Forward propagation
● Forward propagation from layer l to layer l+1 is computed as z^(l+1) = ϴ^(l) a^(l) and a^(l+1) = g(z^(l+1)), where g is the sigmoid activation and a bias unit a_0^(l) = 1 is prepended to a^(l) (a Matlab sketch is given below the figure).
● Note that when l = 1, a^(1) = x, i.e. the activations of the first layer are the input features themselves.
[Figure: four-layer network, Layer 1 – Layer 4]
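A minimal Matlab sketch of forward propagation through a four-layer network (an illustration, not original slide content; x, Theta1, Theta2 and Theta3 are assumed to already exist):

sigmoid = @(z) 1 ./ (1 + exp(-z));          % logistic activation g(z)
a1 = [1; x];                                % layer 1: input plus bias unit
z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];   % layer 2
z3 = Theta2 * a2;  a3 = [1; sigmoid(z3)];   % layer 3
z4 = Theta3 * a3;  a4 = sigmoid(z4);        % layer 4: a4 = h(x), the network output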
Backpropagation
● Backpropagation from layer l+1 to layer l is computed as δ^(l) = (ϴ^(l))ᵀ δ^(l+1) .* g′(z^(l)), where for the sigmoid activation g′(z^(l)) = a^(l) .* (1 − a^(l)).
● When l = L, δ^(L) = a^(L) − y, i.e. the error at the output layer is the difference between the hypothesis and the label.
[Figure: four-layer network, Layer 1 – Layer 4]
Example
[Figure: forward propagation and backpropagation through the four-layer network]
Given a training example (x, y), the cost function is first simplified to the cost of that single example; forward propagation and backpropagation are then computed as illustrated above.
Gradient computation
1. Given a training set {(x^(1), y^(1)), …, (x^(m), y^(m))}
2. Set Δ^(l) = 0 for every layer l
3. For i = 1 to m
   ○ Set a^(1) = x^(i)
   ○ Perform forward propagation to compute a^(l) for l = 2, 3, …, L
   ○ Using y^(i), compute δ^(L) = a^(L) − y^(i)
   ○ Compute δ^(L−1), δ^(L−2), …, δ^(2) by backpropagation
   ○ Accumulate Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))ᵀ
4. The partial derivatives are D^(l) = (1/m) Δ^(l), with the regularization term (λ/m) ϴ^(l) added to the non-bias columns, so that ∂J(ϴ)/∂ϴ^(l) = D^(l) (a Matlab sketch follows below).
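A minimal Matlab sketch of steps 2–4 for the four-layer network used in this lecture (an illustration, not the original course code; X, Y, Theta1–Theta3 and lambda are assumed variables):

sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);                       % X: m-by-n inputs, Y: m-by-K one-hot labels
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
Delta3 = zeros(size(Theta3));
for i = 1:m
    % forward propagation
    a1 = [1; X(i, :)'];
    z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];
    z3 = Theta2 * a2;  a3 = [1; sigmoid(z3)];
    z4 = Theta3 * a3;  a4 = sigmoid(z4);
    % backpropagation
    d4 = a4 - Y(i, :)';
    d3 = (Theta3' * d4) .* (a3 .* (1 - a3));  d3 = d3(2:end);   % drop bias term
    d2 = (Theta2' * d3) .* (a2 .* (1 - a2));  d2 = d2(2:end);
    % accumulate
    Delta1 = Delta1 + d2 * a1';
    Delta2 = Delta2 + d3 * a2';
    Delta3 = Delta3 + d4 * a3';
end
% average and regularize (bias columns are not regularized)
D1 = Delta1 / m;  D1(:, 2:end) = D1(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) = D2(:, 2:end) + (lambda / m) * Theta2(:, 2:end);
D3 = Delta3 / m;  D3(:, 2:end) = D3(:, 2:end) + (lambda / m) * Theta3(:, 2:end);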
Random initialization
● Instead of initializing the parameters to all zeros, it is important to initialize them randomly.
● Random initialization serves the purpose of symmetry breaking: if all weights start out equal, every unit in a hidden layer computes the same value and receives the same update.
[Figure: forward propagation and backpropagation through the four-layer network]
Random initialization - Matlab function
● Initialize each parameter to a random value in [-epsilon_init, epsilon_init]:

function W = randInitializeWeights(L_in, L_out)
    % Randomly initialize the weights of a layer with L_in incoming
    % connections and L_out outgoing connections (the +1 is the bias column).
    epsilon_init = 0.1;
    W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
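For example, with the architecture used later in this lecture (s1 = 4, s2 = 5, s3 = 5, s4 = 4), the three weight matrices could be initialized as follows (a usage sketch, not original slide content):

Theta1 = randInitializeWeights(4, 5);   % 5 x 5
Theta2 = randInitializeWeights(5, 5);   % 5 x 6
Theta3 = randInitializeWeights(5, 4);   % 4 x 6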
Advanced optimization
● We have already seen how to call existing numerical optimization functions to obtain the optimal parameters:
  ○ function [J, grad] = costFunction(theta) ...
  ○ optTheta = minFunc(@costFunction, initialTheta, options)
● In the following neural network we have three parameter matrices; how do we feed them into the “minFunc” function?
[Figure: four-layer network, Layer 1 – Layer 4]
“Unroll” into vectors
Advanced optimization - example
L = 4, s1 = 4, s2 = 5, s3 = 5, s4 = 4
ϴ(1) ∈ ℝ^(5×5), ϴ(2) ∈ ℝ^(5×6), ϴ(3) ∈ ℝ^(4×6)
Matlab implementation:
1. Unroll: thetaVec = [Theta1(:); Theta2(:); Theta3(:)]
2. Feed thetaVec into “minFunc”
3. Reshape thetaVec inside “costFunction” (a skeleton is sketched below):
   a. Theta1 = reshape(thetaVec(1:25), 5, 5);
   b. Theta2 = reshape(thetaVec(26:55), 5, 6);
   c. Theta3 = reshape(thetaVec(56:79), 4, 6);
   d. Compute J and grad
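Putting these steps together, a skeleton of such a cost function might look like this (a hedged sketch, not the original code; J, D1, D2 and D3 are placeholders that would be produced by forward propagation and backpropagation):

function [J, grad] = costFunction(thetaVec)
    % Recover the parameter matrices from the unrolled vector.
    Theta1 = reshape(thetaVec(1:25),  5, 5);
    Theta2 = reshape(thetaVec(26:55), 5, 6);
    Theta3 = reshape(thetaVec(56:79), 4, 6);

    % Placeholders: in the real function these come from forward
    % propagation and backpropagation over the training set.
    J  = 0;
    D1 = zeros(size(Theta1));
    D2 = zeros(size(Theta2));
    D3 = zeros(size(Theta3));

    % Unroll the gradient matrices so grad matches the shape of thetaVec.
    grad = [D1(:); D2(:); D3(:)];
end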
[Figure: four-layer network, Layer 1 – Layer 4]
Gradient check
● With so many parameters, how can we be sure the computed gradient is correct?
● Recalling the definition of the numerical estimate of a gradient, we can compare the backpropagation gradient with the numerical estimate.
Gradient check
Gradient check
● Implementation notes (a Matlab sketch follows below):
  ○ Implement backpropagation to compute the gradient
  ○ Implement a numerical gradient check to compute the estimated gradient
  ○ Make sure the two have similar values (their difference is below a small threshold)
  ○ Turn off the gradient check before training
● Notes:
  ○ Be sure to disable your gradient check code during training, otherwise learning will be very slow
  ○ The gradient check can be generalized to check the gradient of any cost function
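A minimal Matlab sketch of a two-sided numerical gradient check (an illustration, not the original course code; costFunction is any function that returns [J, grad] for an unrolled parameter vector thetaVec):

epsilon = 1e-4;
[~, grad] = costFunction(thetaVec);           % gradient from backpropagation
numGrad = zeros(size(thetaVec));
for j = 1:numel(thetaVec)
    perturb = zeros(size(thetaVec));
    perturb(j) = epsilon;
    % two-sided difference: (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
    numGrad(j) = (costFunction(thetaVec + perturb) ...
                - costFunction(thetaVec - perturb)) / (2 * epsilon);
end
% the relative difference should be very small (e.g. below 1e-9) if backprop is correct
relDiff = norm(numGrad - grad) / norm(numGrad + grad)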
Overview: training a neural network
● Design a network architecture
● Randomly initialize the weights
● Implement forward propagation to get the hypothesis h(x^(i)) for any x^(i)
● Implement code to compute the cost function J(ϴ)
● Implement backpropagation to compute the partial derivatives
● Use the gradient check to compare the backpropagation gradients with the numerical estimate of the gradient of J(ϴ); if they agree, disable the gradient checking code
● Use gradient descent or an advanced optimization method to minimize J(ϴ) as a function of the parameters ϴ
Deep feedforward Neural Networks
Other architectures
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Application 1
Application 2
Discussion
● More parameters make the model more powerful.
  ○ Which is better: more layers or more neurons?
  ○ What are the disadvantages?
● The neural network objective is non-convex, so gradient descent is susceptible to local optima; in practice, however, it works fairly well even when the optimum found is not global.
● Neural networks are a black-box model.
From logistic regression to SVM
Logistic regression
● Label:
● Hypothesis:
● Objective:
Support Vector Machine (SVM)
● Label:
● Hypothesis:
● Objective: (see the reconstruction below)
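The label, hypothesis, and objective formulas on these two slides can be sketched as follows (a hedged reconstruction; the SVM side assumes the hinge-loss formulation with labels in {−1, +1}, consistent with the sub-gradient discussion later in this lecture):

Logistic regression:
  Label: $y \in \{0, 1\}$
  Hypothesis: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
  Objective: $\min_\theta \; -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

SVM:
  Label: $y \in \{-1, +1\}$
  Hypothesis: $h_\theta(x) = \operatorname{sign}(\theta^T x)$
  Objective: $\min_\theta \; \frac{1}{m} \sum_{i=1}^{m} \max\big(0,\, 1 - y^{(i)} \theta^T x^{(i)}\big) + \frac{\lambda}{2} \lVert \theta \rVert^2$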
From logistic regression to SVM
● Cost function (logistic regression) vs. cost function (SVM)
[Figure: per-example cost curves of the two models]
From logistic regression to SVM
● Logistic regression:
● SVM:
SVM - model representation
● Given training examples (x^(i), y^(i)), i = 1, …, m, SVM aims to find an optimal separating hyperplane such that the constraints reconstructed below hold.
● This is equivalent to minimizing the cost function sketched below.
● Here the term max(0, 1 − y θᵀx) is called the hinge loss.
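The missing formulas can be sketched as follows (a reconstruction, assuming labels y ∈ {−1, +1} and a hyperplane defined by the parameter vector θ):

Constraints (ideal separation with margin): $y^{(i)} \theta^T x^{(i)} \ge 1, \quad i = 1, \dots, m$

Relaxing violations of these constraints through the hinge loss gives the unconstrained objective:
$\min_\theta \; \frac{1}{m} \sum_{i=1}^{m} \max\big(0,\, 1 - y^{(i)} \theta^T x^{(i)}\big) + \frac{\lambda}{2} \lVert \theta \rVert^2$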
SVM - gradient computing
● Because the hinge loss is not differentiable everywhere, a sub-gradient is computed:
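As a hedged reconstruction of the missing formula, a standard sub-gradient of the per-example hinge loss with respect to θ is:

$\partial_\theta \max\big(0,\, 1 - y^{(i)} \theta^T x^{(i)}\big) \ni
\begin{cases} -\,y^{(i)} x^{(i)} & \text{if } y^{(i)} \theta^T x^{(i)} < 1 \\ 0 & \text{otherwise} \end{cases}$

so a sub-gradient of the full objective is the average of these per-example terms plus the regularization term $\lambda \theta$.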
SVM - intuition
● Which of the linear classifiers is optimal?
SVM - intuition
SVM - intuition
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors matter, while the other training examples can be ignored.
[Figure: maximum-margin hyperplane with the support vectors labeled]
SVM - intuition
● Which linear classifier has better performance?