Neural Networks
CMSC 422
MARINE CARPUAT
XOR slides by Graham Neubig (CMU)
Neural Networks
• Today
  – What are Neural Networks?
  – How to make a prediction given an input?
  – Why are neural networks powerful?
• Next time
  – How to train them?
A warm-up example:
sentiment analysis of movie reviews
• "the movie was horrible" → +1
• "the actors are excellent" → -1
• "the movie was not horrible" → -1
• "he is usually an excellent actor, but not in this movie" → +1
(here +1 marks a negative review and -1 a positive one)
Binary classification via hyperplanes
• At test time, we check on what side of the hyperplane examples fall:
  y = sign(wᵀx + b)
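A minimal numpy sketch of this prediction rule (the names w, b, predict are illustrative, not from the slides):

import numpy as np

# Perceptron prediction: the sign of the score w . x + b says on which
# side of the hyperplane the example x falls.
def predict(w, b, x):
    return np.sign(w @ x + b)

w = np.array([1.0, -2.0])                    # example weight vector
b = 0.5                                      # example bias
print(predict(w, b, np.array([3.0, 1.0])))   # -> 1.0 (positive side)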
Function Approximation with Perceptron

Problem setting
• Set of possible instances X
  – Each instance x ∈ X is a feature vector x = [x1, …, xD]
• Unknown target function f: X → Y
  – Y is binary valued: {-1, +1}
• Set of function hypotheses H = {h | h: X → Y}
  – Each hypothesis h is a hyperplane in D-dimensional space

Input
• Training examples {(x(1), y(1)), …, (x(N), y(N))} of the unknown target function f

Output
• Hypothesis h ∈ H that best approximates the target function f
Aside: biological inspiration
Analogy: the perceptron as a neuron
Neural Networks
• We can think of neural networks as combinations of multiple perceptrons
  – Multilayer perceptron
• Why would we want to do that?
  – Discover more complex decision boundaries
  – Learn combinations of features
What does a 2-layer perceptron look like? (illustration on board)
• Key concepts:
  – Input dimensionality
  – Hidden units
  – Hidden layer
  – Output layer
  – Activation functions
Activation functions (aka link functions)
• Activation functions are non-linear functions
  – the sign function, as in the perceptron
  – the hyperbolic tangent and other sigmoid functions that approximate sign but are differentiable
• What happens if the hidden units use the identity function as an activation function? (see the sketch below)
The two-layer network prediction can be written as ŷ = sign(vᵀ tanh(Wx + b)), where W is the matrix of hidden layer parameters and v is the vector of output layer parameters.
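A minimal numpy sketch of this forward pass (shapes and names are illustrative); the final comment answers the identity-activation question above:

import numpy as np

# Two-layer perceptron prediction with a K x D hidden-layer matrix W,
# hidden biases b (K,), and output weights v (K,).
def predict(W, b, v, x):
    hidden = np.tanh(W @ x + b)   # non-linear hidden unit activations
    return np.sign(v @ hidden)    # combine with output weights, take sign

# If the hidden activation were the identity, v @ (W @ x + b) would equal
# (v @ W) @ x + v @ b: a single linear model, so the hidden layer adds nothing.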
What functions can we approximate with a 2-layer perceptron?

Problem setting
• Set of possible instances X
  – Each instance x ∈ X is a feature vector x = [x1, …, xD]
• Unknown target function f: X → Y
  – Y is binary valued: {-1, +1}
• Set of function hypotheses H = {h | h: X → Y}

Input
• Training examples {(x(1), y(1)), …, (x(N), y(N))} of the unknown target function f

Output
• Hypothesis h ∈ H that best approximates the target function f
Two-Layer Networks are Universal Function Approximators
• Theorem (Th 9 in CIML): Let F be a continuous function on a bounded subset of D-dimensional space. Then there exists a two-layer neural network F̂ with a finite number of hidden units that approximates F arbitrarily well. Namely, for all x in the domain of F, |F(x) − F̂(x)| < ε.
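A small numpy sketch of the flavor of this result (an assumed setup, not the proof): a two-layer net F̂(x) = Σᵢ v[i]·tanh(w[i]·x + b[i]) with enough hidden units can match a continuous target closely. Here the hidden layer is random and only the output weights are fitted by least squares (proper training is next lecture):

import numpy as np

rng = np.random.default_rng(0)
K = 50                            # number of hidden units
w = rng.normal(size=K)            # random hidden-layer weights
b = rng.normal(size=K)            # random hidden-layer biases

x = np.linspace(-3, 3, 200)       # 1-D domain, target F(x) = sin(x)
H = np.tanh(np.outer(x, w) + b)   # hidden activations, shape (200, K)

# Fit output weights v by least squares and check the worst-case error.
v, *_ = np.linalg.lstsq(H, np.sin(x), rcond=None)
print(np.max(np.abs(H @ v - np.sin(x))))   # small max error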
Example: a neural network to solve the XOR problem

[Figure: the four XOR points in input space. Class X: φ0(x2) = {1, 1} and φ0(x3) = {-1, -1}; class O: φ0(x1) = {-1, 1} and φ0(x4) = {1, -1}. Two hidden units with weights {1, 1} and {-1, -1}, each with bias -1, map the points to φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}.]
Example – In the new space, the examples are linearly separable!

[Figure: the same points plotted in φ1 space: the X points φ1(x2) = {1, -1} and φ1(x3) = {-1, 1} now lie on one side of a line and the O points φ1(x1) = φ1(x4) = {-1, -1} on the other. An output unit φ2[0] = y with weights {1, 1} and bias 1 separates them.]
Example – The final net

[Figure: the complete network. Inputs φ0[0], φ0[1] plus a bias input 1 feed two tanh hidden units, φ1[0] = tanh(φ0[0] + φ0[1] − 1) and φ1[1] = tanh(−φ0[0] − φ0[1] − 1); the output is φ2[0] = tanh(φ1[0] + φ1[1] + 1).]
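A numpy sketch that reads the weights off this figure and checks that the net reproduces the XOR labels (the function name xor_net is illustrative):

import numpy as np

# Hidden layer: weights {1, 1} and {-1, -1}, both with bias -1;
# output layer: weights {1, 1} with bias 1, as in the figure.
def xor_net(phi0):
    phi1 = np.tanh(np.array([[1, 1], [-1, -1]]) @ phi0 + np.array([-1, -1]))
    phi2 = np.tanh(np.array([1, 1]) @ phi1 + 1)
    return np.sign(phi2)

# X points {1, 1} and {-1, -1} get +1; O points {-1, 1} and {1, -1} get -1.
for phi0, y in [((1, 1), 1), ((-1, -1), 1), ((-1, 1), -1), ((1, -1), -1)]:
    assert xor_net(np.array(phi0)) == y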
Discussion
• The 2-layer perceptron lets us
  – Discover more complex decision boundaries than the perceptron
  – Learn combinations of features that are useful for classification
• Key design question
  – How many hidden units?
  – More hidden units yield more complex functions
  – Fewer hidden units require fewer examples to train
Neural Networks
• Today
  – What are Neural Networks?
    • Multilayer perceptron
  – How to make a prediction given an input?
    • Simple matrix operations + non-linearities
  – Why are neural networks powerful?
    • Universal function approximators!
• Next
  – How to train them?