Intro to Neural Networks and Deep Learningjjl5sw/documents/DeepLearningIntro.pdf · Intro to Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 UVA CS 6316. Neurons

Intro to Neural Networks and Deep Learning

Jack LanchantinDr. Yanjun Qi

1

UVA CS 6316

Neurons1-Layer Neural Network

Multi-layer Neural Network

Loss Functions

Backpropagation

Nonlinearity Functions

NNs in Practice2

3

x1

x2

x3

W1

w3

W2

x ŷ

ewx+b

1 + ewx+b

Logistic Regression

Sigmoid Function(aka logistic, logit, “S”, soft-step)

4

P(Y=1|x) =

x

ez

1 + ez

Expanded Logistic Regression

x1

x2

x3

Σ

+1

z

z = wT x + b

y = sigmoid(z) =5

px11xp

p = 3

1x1

w1

w2

w3

b1SummingFunction

SigmoidFunction

Multiply by weights

ŷ = P(Y=1|x,w)

Input x

“Neuron”

x1

x2

x3

Σ

+1

z

6

w1

w2

w3

b1

Neurons

7http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Neuron

x1

x2

x3

Σ z ŷ

8

ez

1 + ez

z = wT x

ŷ = sigmoid(z) =px11xp1x1

From here on, we leave out bias for simplicity

Input x

w1

w2

w3

“Block View” of a Neuron

x *

Dot Product Sigmoid

w z

9

y

Input output

Dot Product SigmoidInput output

x *w z ŷ

parameterized block

ez

1 + ez

z = wT x

ŷ = sigmoid(z) =px11xp1x1

Neuron Representation

Σ

10

The linear transformation and nonlinearity together is typically considered a single neuron

ŷ

x1

x2

x3

x

w1

w2

w3

Neuron Representation

*w

11

ŷx

The linear transformation and nonlinearity together is typically considered a single neuron

Neurons

1-Layer Neural NetworkMulti-layer Neural Network

Loss Functions

Backpropagation


NNs in Practice12

13

x1

x2

x3

W1

w3

W2

x ŷ

1-Layer Neural Network (with 4 neurons)

x1

x2

x3

Σ

Σ

Σ

Σ

Input x Output ŷ

14

ŷ1

ŷ2

ŷ3

ŷ4

Wz

Linear Sigmoid

1 layer

matrixvector


x1

x2

x3

Σ

Σ

Σ

Σ

Input x

15

d = 4

p = 3

Wz

Output ŷ

ez

1 + ez

z =WT x

ŷ = sigmoid(z) =px1dxpdx1

dx1 dx1

Element-wise on vector z

ŷ1

ŷ2

ŷ3

ŷ4


x1

x2

x3

Input x

16

ez

1 + ez

z =WT x


dx1 dx1

d = 4

p = 3

W Σ

Σ

Σ

Σ

Output ŷ

ŷ1

ŷ2

ŷ3

ŷ4

Element-wise on vector z

“Block View” of a Neural Network

x *

Dot Product Sigmoid

w z

17

y

Input output

Dot Product SigmoidInput output

x *W z ŷ

W is now a matrix

ez

1 + ez

z =WT x


dx1 dx1

z is now a vector

Neurons

1-Layer Neural Network

Multi-layer Neural NetworkLoss Functions

Backpropagation


NNs in Practice18

19

x1

x2

x3

W1

w3

W2

x ŷ

Multi-Layer Neural Network(Multi-Layer Perceptron (MLP) Network)

20

Outputlayer

x1

x2

x3

x ŷ

W1

w2

Hiddenlayer

weight subscript represents layer number

2-layer NN

Multi-Layer Neural Network (MLP)

21

1st hidden layer

2nd hiddenlayer

Outputlayer

x1

x2

x3

x ŷ

3-layer NN

W1

w3

W2

Multi-Layer Neural Network (MLP)

22

z1 =WT xh1 = sigmoid(z1)z2 =WT h1 h2 = sigmoid(z2)z3 =wT h2 ŷ = sigmoid(z3)

x1

x2

x3

h1 h2

W1

w3

W2

x ŷ

1

2

3

hidden layer 1 output

Multi-Class Output MLP

x1

x2

x3

x ŷ

23

h1 h2

z1 =WT xh1 = sigmoid(z1)z2 =WT h1 h2 = sigmoid(z2)z3 =WT h2 ŷ = sigmoid(z2)

1

2

3

W1

w3

W2

“Block View” Of MLP

x y

1st hidden layer

2nd hidden layer

Output layer

24

*W1

*W2

*W3

z1 z2 z3h1 h2

“Deep” Neural Networks (i.e. > 1 hidden layer)

25Researchers have successfully used 1000 layers to train an object classifier

Neurons



Loss FunctionsBackpropagation


NNs in Practice26

27

x1

x2

x3

W1

w3

W2

x ŷ E (ŷ)

ŷ = P(y=1|X,W)

28

x1

x2

x3

x

Binary Classification Loss

E= loss = - logP(Y = ŷ | X= x ) = - y log(ŷ) - (1 - y) log(1-ŷ)

W1

w3

W2

this example is for a single sample x

true output

29

x1

x2

x3

Σ

W1

w3

W2

x

Regression Loss

E = loss= ( y - ŷ )212

ŷ

true output

z1

z2

z3

30

x1

x2

x3

x

Σ

Σ

Σ

ŷ1

ŷ2

ŷ3

Multi-Class Classification Loss

“Softmax” function. Normalizing function which converts each class output to a probability.

E = loss = - yj ln ŷjΣj = 1...K

= P( ŷi = 1 | x )

W1 W3

W2

ŷi

“0” for all except true class

K = 3

Neurons



Loss Functions

BackpropagationNonlinearity Functions

NNs in Practice31

32

x1

x2

x3

W1

w3

W2

x ŷ E (ŷ)

Training Neural Networks

33

How do we learn the optimal weights WL for our task??● Gradient descent:

LeCun et. al. Efficient Backpropagation. 1998

WL(t+1) = WL(t) - E WL(t)

But how do we get gradients of lower layers?● Backpropagation!

○ Repeated application of chain rule of calculus○ Locally minimize the objective○ Requires all “blocks” of the network to be differentiable

x ŷ

W1w3

W2

34

Backpropagation Intro

http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

35



36



37



38



39



40



41



42



43



44



45



46


Tells us: by increasing x by a scale of 1, we decrease f by a scale of 4


47

ŷ = P(y=1|X,W)

x1

x2

x3

W1

w2

x

Backpropagation(binary classification example)

Example on 1-hidden layer NN for binary classification

48


x y*W1

*w2

z1z2h1

49

x y*W1

*w2

z1z2h1

E = loss =


Gradient Descent to Minimize loss: Need to find these!

50


x y*W1

*w2

z1z2h1

51


= ??

= ??

x y*W1

*w2

z1z2h1

52


= ??

= ??

Exploit the chain rule!

x y*W1

*w2

z1z2h1

53


x y*W1

*w2

z1z2h1

54


chain rule

x y*W1

*w2

z1z2h1

55


x y*W1

*w2

z1z2h1

56


x y*W1

*w2

z1z2h1

57


x y*W1

*w2

z1z2h1

58


x y*W1

*w2

z1z2h1

59


x y*W1

*w2

z1z2h1

60


x y*W1

*w2

z1z2h1

61


x y*W1

*w2

z1z2h1

62


already computed

x y*W1

*w2

z1z2h1

63

“Local-ness” of Backpropagation

fx y

“local gradients”activations

gradients


64

“Local-ness” of Backpropagation

fx y

“local gradients”activations

gradients


x

65

Example: Sigmoid Block

sigmoid(x) = (x)


Deep Learning = Concatenation of Differentiable Parameterized Layers (linear & nonlinearity functions)

x y

1st hidden layer

2nd hidden layer

Output layer

66

*W1

*W2

*W3

z1 z2 z3h1 h2

Want to find optimal weights W to minimize some loss function E!

Backprop Whiteboard Demo

x1

x2

1

Σ

Σ

ŷ

67

w1 z1w2

w3w4

b1

b2

z2

h1

h2

Σ

w5

w6

1b3

z1 = x1w1 + x2w3 + b1z2 = x1w2 + x2w4 + b2

h1 =exp(z1)

1 + exp(z1)exp(z2)

1 + exp(z2)h2 =

ŷ = h1w5 + h2w6 + b3

E = ( y - ŷ )2

f1

f2

f3

f4

w(t+1) = w(t) - E w(t)

E w = ??

Neurons



Loss Functions

Backpropagation

Nonlinearity FunctionsNNs in Practice

68

Nonlinearity Functions (i.e. transfer or activation functions)

x1

x2

x3

Σ

SummingFunction

SigmoidFunction

w1

w2

w3

+1

b1

z

x﹡w

Multiply by weights

69


x﹡w

70https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions

Name Plot Equation Derivative ( w.r.t x )


x﹡w



Nonlinearity Functions (aka transfer or activation functions)

x﹡w



usually works best in practice

Neurons



Loss Functions

Backpropagation


NNs in Practice73

74

Neural Net Pipeline

x ŷ

W1w3

1. Initialize weights2. For each batch of input x samples S:

a. Run the network “Forward” on S to compute outputs and lossb. Run the network “Backward” using outputs and loss to compute gradientsc. Update weights using SGD (or a similar method)

3. Repeat step 2 until loss convergence

W2

E

75

Non-Convexity of Neural Nets

In very high dimensions, there exists many local minimum which are about the same.

Pascanu, et. al. On the saddle point problem for non-convex optimization 2014

76

Building Deep Neural Nets


fx

y

77

Building Deep Neural Nets

“GoogLeNet” for Object Classification

78

Block Example Implementation


79

Advantage of Neural Nets

As long as it’s fully differentiable, we can train the model to automatically learn features for us.

Documents

Intro to Neural Networks and Deep Learningjjl5sw/documents/DeepLearningIntro.pdf · Intro to Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 UVA CS 6316. Neurons