Machine Learning using Matlab
Lecture 6 Neural Network (cont.)
Cost function
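As a reference, a standard form of the regularized cost function for a multi-class neural network (a hedged reconstruction, following the logistic-style outputs used throughout this lecture, not the slide's original formula):

\[
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
\left[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k
     + \big(1 - y_k^{(i)}\big) \log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \right]
+ \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2
\]

where m is the number of training examples, K the number of output units, and λ the regularization strength.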
Forward propagation
● Forward propagation from layer l to layer l+1 is computed as z^(l+1) = ϴ^(l) a^(l) and a^(l+1) = g(z^(l+1)), where g is the sigmoid activation and a bias unit a_0^(l) = 1 is prepended to a^(l) (a Matlab sketch is given below the figure).
● Note that when l = 1, a^(1) = x, i.e. the activations of the first layer are the input features themselves.
[Figure: four-layer network, Layer 1 – Layer 4]
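A minimal Matlab sketch of forward propagation through a four-layer network (an illustration, not original slide content; x, Theta1, Theta2 and Theta3 are assumed to already exist):

sigmoid = @(z) 1 ./ (1 + exp(-z));          % logistic activation g(z)
a1 = [1; x];                                % layer 1: input plus bias unit
z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];   % layer 2
z3 = Theta2 * a2;  a3 = [1; sigmoid(z3)];   % layer 3
z4 = Theta3 * a3;  a4 = sigmoid(z4);        % layer 4: a4 = h(x), the network output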
Backpropagation
● Backpropagation from layer l+1 to layer l is computed as δ^(l) = (ϴ^(l))ᵀ δ^(l+1) .* g′(z^(l)), where for the sigmoid activation g′(z^(l)) = a^(l) .* (1 − a^(l)).
● When l = L, δ^(L) = a^(L) − y, i.e. the error at the output layer is the difference between the hypothesis and the label.
[Figure: four-layer network, Layer 1 – Layer 4]
Example
[Figure: forward propagation and backpropagation through the four-layer network]
Given a training example (x, y), the cost function is first simplified to the cost of that single example; forward propagation and backpropagation are then computed as illustrated above.
Gradient computation
1. Given a training set {(x^(1), y^(1)), …, (x^(m), y^(m))}
2. Set Δ^(l) = 0 for every layer l
3. For i = 1 to m
   ○ Set a^(1) = x^(i)
   ○ Perform forward propagation to compute a^(l) for l = 2, 3, …, L
   ○ Using y^(i), compute δ^(L) = a^(L) − y^(i)
   ○ Compute δ^(L−1), δ^(L−2), …, δ^(2) by backpropagation
   ○ Accumulate Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))ᵀ
4. The partial derivatives are D^(l) = (1/m) Δ^(l), with the regularization term (λ/m) ϴ^(l) added to the non-bias columns, so that ∂J(ϴ)/∂ϴ^(l) = D^(l) (a Matlab sketch follows below).
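A minimal Matlab sketch of steps 2–4 for the four-layer network used in this lecture (an illustration, not the original course code; X, Y, Theta1–Theta3 and lambda are assumed variables):

sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);                       % X: m-by-n inputs, Y: m-by-K one-hot labels
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
Delta3 = zeros(size(Theta3));
for i = 1:m
    % forward propagation
    a1 = [1; X(i, :)'];
    z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];
    z3 = Theta2 * a2;  a3 = [1; sigmoid(z3)];
    z4 = Theta3 * a3;  a4 = sigmoid(z4);
    % backpropagation
    d4 = a4 - Y(i, :)';
    d3 = (Theta3' * d4) .* (a3 .* (1 - a3));  d3 = d3(2:end);   % drop bias term
    d2 = (Theta2' * d3) .* (a2 .* (1 - a2));  d2 = d2(2:end);
    % accumulate
    Delta1 = Delta1 + d2 * a1';
    Delta2 = Delta2 + d3 * a2';
    Delta3 = Delta3 + d4 * a3';
end
% average and regularize (bias columns are not regularized)
D1 = Delta1 / m;  D1(:, 2:end) = D1(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) = D2(:, 2:end) + (lambda / m) * Theta2(:, 2:end);
D3 = Delta3 / m;  D3(:, 2:end) = D3(:, 2:end) + (lambda / m) * Theta3(:, 2:end);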
Random initialization
● Instead of initializing the parameters to all zeros, it is important to initialize them randomly.
● Random initialization serves the purpose of symmetry breaking: if all weights start out equal, every unit in a hidden layer computes the same value and receives the same update.
[Figure: forward propagation and backpropagation through the four-layer network]
Random initialization - Matlab function
● Initialize each parameter to a random value in [-epsilon_init, epsilon_init]:

function W = randInitializeWeights(L_in, L_out)
    % Randomly initialize the weights of a layer with L_in incoming
    % connections and L_out outgoing connections (the +1 is the bias column).
    epsilon_init = 0.1;
    W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
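For example, with the architecture used later in this lecture (s1 = 4, s2 = 5, s3 = 5, s4 = 4), the three weight matrices could be initialized as follows (a usage sketch, not original slide content):

Theta1 = randInitializeWeights(4, 5);   % 5 x 5
Theta2 = randInitializeWeights(5, 5);   % 5 x 6
Theta3 = randInitializeWeights(5, 4);   % 4 x 6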
Advanced optimization
● We have already seen how to call existing numerical optimization functions to obtain the optimal parameters:
  ○ function [J, grad] = costFunction(theta) ...
  ○ optTheta = minFunc(@costFunction, initialTheta, options)
● In the following neural network we have three parameter matrices; how do we feed them into the “minFunc” function?
[Figure: four-layer network, Layer 1 – Layer 4]
“Unroll” into vectors
Advanced optimization - example
L = 4, s1 = 4, s2 = 5, s3 = 5, s4 = 4
ϴ(1) ∈ ℝ^(5×5), ϴ(2) ∈ ℝ^(5×6), ϴ(3) ∈ ℝ^(4×6)
Matlab implementation:
1. Unroll: thetaVec = [Theta1(:); Theta2(:); Theta3(:)]
2. Feed thetaVec into “minFunc”
3. Reshape thetaVec inside “costFunction” (a skeleton is sketched below):
   a. Theta1 = reshape(thetaVec(1:25), 5, 5);
   b. Theta2 = reshape(thetaVec(26:55), 5, 6);
   c. Theta3 = reshape(thetaVec(56:79), 4, 6);
   d. Compute J and grad
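Putting these steps together, a skeleton of such a cost function might look like this (a hedged sketch, not the original code; J, D1, D2 and D3 are placeholders that would be produced by forward propagation and backpropagation):

function [J, grad] = costFunction(thetaVec)
    % Recover the parameter matrices from the unrolled vector.
    Theta1 = reshape(thetaVec(1:25),  5, 5);
    Theta2 = reshape(thetaVec(26:55), 5, 6);
    Theta3 = reshape(thetaVec(56:79), 4, 6);

    % Placeholders: in the real function these come from forward
    % propagation and backpropagation over the training set.
    J  = 0;
    D1 = zeros(size(Theta1));
    D2 = zeros(size(Theta2));
    D3 = zeros(size(Theta3));

    % Unroll the gradient matrices so grad matches the shape of thetaVec.
    grad = [D1(:); D2(:); D3(:)];
end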
[Figure: four-layer network, Layer 1 – Layer 4]
Gradient check
● With so many parameters, how can we be sure the computed gradient is correct?
● Recalling the definition of the numerical estimate of a gradient, we can compare the backpropagation gradient with the numerical estimate.
Gradient check
Gradient check
● Implementation notes (a Matlab sketch follows below):
  ○ Implement backpropagation to compute the gradient
  ○ Implement a numerical gradient check to compute the estimated gradient
  ○ Make sure the two have similar values (their difference is below a small threshold)
  ○ Turn off the gradient check before training
● Notes:
  ○ Be sure to disable your gradient check code during training, otherwise learning will be very slow
  ○ The gradient check can be generalized to check the gradient of any cost function
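A minimal Matlab sketch of a two-sided numerical gradient check (an illustration, not the original course code; costFunction is any function that returns [J, grad] for an unrolled parameter vector thetaVec):

epsilon = 1e-4;
[~, grad] = costFunction(thetaVec);           % gradient from backpropagation
numGrad = zeros(size(thetaVec));
for j = 1:numel(thetaVec)
    perturb = zeros(size(thetaVec));
    perturb(j) = epsilon;
    % two-sided difference: (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
    numGrad(j) = (costFunction(thetaVec + perturb) ...
                - costFunction(thetaVec - perturb)) / (2 * epsilon);
end
% the relative difference should be very small (e.g. below 1e-9) if backprop is correct
relDiff = norm(numGrad - grad) / norm(numGrad + grad)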
Overview: training a neural network
● Design a network architecture
● Randomly initialize the weights
● Implement forward propagation to get the hypothesis h(x^(i)) for any x^(i)
● Implement code to compute the cost function J(ϴ)
● Implement backpropagation to compute the partial derivatives
● Use the gradient check to compare the backpropagation gradients with the numerical estimate of the gradient of J(ϴ); if they agree, disable the gradient checking code
● Use gradient descent or an advanced optimization method to minimize J(ϴ) as a function of the parameters ϴ
Deep feedforward Neural Networks
Other architectures
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Application 1
Application 2
Discussion
● More parameters make the model more powerful.
  ○ Which is better: more layers or more neurons?
  ○ What are the disadvantages?
● The neural network objective is non-convex, so gradient descent is susceptible to local optima; in practice, however, it works fairly well even when the optimum found is not global.
● Neural networks are a black-box model.
From logistic regression to SVM
Logistic regression
● Label:
● Hypothesis:
● Objective:
Support Vector Machine (SVM)
● Label:
● Hypothesis:
● Objective: (see the reconstruction below)
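The label, hypothesis, and objective formulas on these two slides can be sketched as follows (a hedged reconstruction; the SVM side assumes the hinge-loss formulation with labels in {−1, +1}, consistent with the sub-gradient discussion later in this lecture):

Logistic regression:
  Label: $y \in \{0, 1\}$
  Hypothesis: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
  Objective: $\min_\theta \; -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

SVM:
  Label: $y \in \{-1, +1\}$
  Hypothesis: $h_\theta(x) = \operatorname{sign}(\theta^T x)$
  Objective: $\min_\theta \; \frac{1}{m} \sum_{i=1}^{m} \max\big(0,\, 1 - y^{(i)} \theta^T x^{(i)}\big) + \frac{\lambda}{2} \lVert \theta \rVert^2$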
From logistic regression to SVM
● Cost function (logistic regression) vs. cost function (SVM)
[Figure: per-example cost curves of the two models]
From logistic regression to SVM
● Logistic regression:
● SVM:
SVM - model representation
● Given training examples (x^(i), y^(i)), i = 1, …, m, SVM aims to find an optimal separating hyperplane such that the constraints reconstructed below hold.
● This is equivalent to minimizing the cost function sketched below.
● Here the term max(0, 1 − y θᵀx) is called the hinge loss.
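The missing formulas can be sketched as follows (a reconstruction, assuming labels y ∈ {−1, +1} and a hyperplane defined by the parameter vector θ):

Constraints (ideal separation with margin): $y^{(i)} \theta^T x^{(i)} \ge 1, \quad i = 1, \dots, m$

Relaxing violations of these constraints through the hinge loss gives the unconstrained objective:
$\min_\theta \; \frac{1}{m} \sum_{i=1}^{m} \max\big(0,\, 1 - y^{(i)} \theta^T x^{(i)}\big) + \frac{\lambda}{2} \lVert \theta \rVert^2$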
SVM - gradient computing
● Because the hinge loss is not differentiable everywhere, a sub-gradient is computed:
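As a hedged reconstruction of the missing formula, a standard sub-gradient of the per-example hinge loss with respect to θ is:

$\partial_\theta \max\big(0,\, 1 - y^{(i)} \theta^T x^{(i)}\big) \ni
\begin{cases} -\,y^{(i)} x^{(i)} & \text{if } y^{(i)} \theta^T x^{(i)} < 1 \\ 0 & \text{otherwise} \end{cases}$

so a sub-gradient of the full objective is the average of these per-example terms plus the regularization term $\lambda \theta$.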
SVM - intuition
● Which of the linear classifiers is optimal?
SVM - intuition
SVM - intuition
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors matter, while the other training examples can be ignored.
[Figure: maximum-margin hyperplane with the support vectors labeled]
SVM - intuition
● Which linear classifier has better performance?