L. Mihalkova, CSMC498F, Fall2010
Administrativia
Topics this week:
Finish up discussion of how to evaluate hypotheses
Start talking about perceptrons and neural nets
This week’s reading:
Chapter 4
Big Picture:
This is the 4th week of talking about supervised learning
After this, 3 classes on learning theory, 1 class on midterm preparation
So, midterm is in 3 weeks!
Perceptrons and Neural Nets
Neural Networks
Biologically inspired to emulate the brain
Many simple components (analogous to brain cells) work together by passing stimuli to each other to produce complex behavior
Perceptrons
The simple components that make up a neural network (sort of)
Perceptrons
Sometimes called “linear threshold functions”
Interesting as building blocks of neural nets
But also interesting in their own right
Very simple and easy to use model
Surprisingly effective in many applications
What is a Perceptron
A thresholded linear combination of the attributes
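Written out (a standard formulation, consistent with the sums used on the later slides, assuming a dummy attribute $x_0 = 1$ so that $w_0$ plays the role of the threshold):

$o(x_1, \dots, x_n) = \begin{cases} +1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$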
Training Perceptrons
Can use various techniques
The “Perceptron” Algorithm ✔
Gradient Descent ✔
Linear programming
The Perceptron Algo
Given:
D: data set of linearly separable examples
η: learning rate
Set each $w_i$ to a small arbitrary initial value
While the learner makes mistakes:
  For each $\langle x_1, \dots, x_n; y \rangle \in D$:
    If $y \cdot \left( \sum_{i=0}^{n} x_i w_i \right) \leq 0$:
      Set $w_i \leftarrow w_i + \eta \, y \, x_i$
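A minimal Python sketch of the algorithm above (illustrative only: the function name and data format are not from the slides; each example is assumed to be a pair (x, y) with x already containing the dummy attribute x[0] = 1 and y in {+1, -1}):

def train_perceptron(data, eta=0.1, max_epochs=1000):
    # max_epochs is only a safety cap; it is not part of the algorithm above.
    n = len(data[0][0])
    w = [0.01] * n                      # small arbitrary initial weights
    for _ in range(max_epochs):         # "while learner makes mistakes"
        mistakes = 0
        for x, y in data:
            # mistake: the prediction disagrees with y (or lies on the boundary)
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:               # a full pass with no mistakes: done
            return w
    return w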
Perceptron Algo Properties
Provided that the data are linearly separable, the perceptron algorithm converges within a finite number of iterations
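Continuing the hypothetical train_perceptron sketch above, a tiny usage example on a linearly separable concept (boolean AND); the loop stops after a few passes, once a full pass produces no mistakes:

# Boolean AND, encoded with dummy attribute x0 = 1 and labels in {+1, -1}.
and_data = [([1, 0, 0], -1),
            ([1, 0, 1], -1),
            ([1, 1, 0], -1),
            ([1, 1, 1], +1)]
w = train_perceptron(and_data, eta=0.1)
print(w)   # a weight vector that classifies all four examples correctly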
Data Not Linearly Separable
Option A: Add attributes formed as functions of the existing attributes (see the XOR example below)
Option B: Settle for good enough -- find a hypothesis that minimizes the (squared) error on the training data
$E(W) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
where $t_d$ is the true value of the target attribute and $o_d$ is its unthresholded predicted value
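A standard illustration of Option A (not from the slides): XOR of two boolean attributes is not linearly separable, but it becomes separable once a product attribute is added. With the extra attribute $x_3 = x_1 x_2$, the threshold test $x_1 + x_2 - 2x_3 > 0.5$ outputs exactly XOR($x_1$, $x_2$), so a perceptron over the expanded attribute set can represent it.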
Minimizing Squared Error
Will use gradient descent
Initialize W to small arbitrary values
Iterate over examples, each time
computing the gradient
moving in the direction opposite to the gradient
$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$
$\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id})$
where $x_{id}$ is the i-th attribute of the d-th example and $o_d$ is the unthresholded output
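As a concrete sketch (illustrative names, not from the slides), one batch gradient-descent step under these definitions, with the output computed as the unthresholded linear combination:

def gradient_step(w, data, eta=0.05):
    # data: list of (x, t) pairs; x includes the dummy attribute x[0] = 1.
    grad = [0.0] * len(w)                             # accumulates dE/dw_i over all of D
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))      # unthresholded output o_d
        for i, xi in enumerate(x):
            grad[i] += (t - o) * (-xi)                # dE/dw_i = sum_d (t_d - o_d)(-x_id)
    return [wi - eta * gi for wi, gi in zip(w, grad)] # move opposite to the gradient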
Stochastic Gradient Descent
Initialize W to small arbitrary values
Iterate over the examples; for each example d:
  compute the gradient of the error on d alone
  move in the direction opposite to that gradient:
  $w_i \leftarrow w_i + \eta \, (t_d - o_d) \, x_{id}$
($x_{id}$: i-th attribute of the d-th example; $o_d$: unthresholded output)
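The per-example version as a sketch, under the same illustrative conventions as the batch step above:

def sgd_epoch(w, data, eta=0.05):
    # Update the weights after every example instead of after a full pass over D.
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))      # unthresholded output for this example
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w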
Difference
What is the difference between the perceptron and gradient descent?
Perceptron Algo: converges to a perfect hypothesis when the data are linearly separable
Gradient descent: converges asymptotically toward the minimum-error hypothesis, regardless of whether the data are linearly separable
i.e., it may need an unbounded number of iterations to reach the minimum exactly
Neural Networks
The typical NN looks something like this:
[Figure: a feed-forward network in which the inputs feed into a layer of hidden units, whose outputs feed into the output units]
Backpropagation Algo
Set-up
Each instance is a pair $\langle \bar{X}, \bar{t} \rangle$
The NN has
  $n_{in}$ inputs
  $n_{hidden}$ hidden units
  $n_{out}$ outputs
Initial weights are small arbitrary values
Backpropagation Algo
While not done
  Iterate over each $\langle \bar{X}, \bar{t} \rangle$:
    Propagate $\bar{X}$ through the network to compute the output values
    Compute the error $\delta_k$ for each output unit k
    Compute the error $\delta_h$ for each hidden unit h
    Update each weight: $w_{ji} \leftarrow w_{ji} + \eta \, \delta_j \, x_{ji}$
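A compact Python sketch of the loop above, assuming a single hidden layer and sigmoid units; the δ formulas used are the standard ones for sigmoid units from Chapter 4 (they are not spelled out on this slide), and the function names and list-of-lists weight representation are illustrative:

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_epoch(W_hid, W_out, data, eta=0.1):
    # W_hid: n_hidden rows of length n_in + 1 (index 0 holds the bias weight)
    # W_out: n_out rows of length n_hidden + 1 (index 0 holds the bias weight)
    # data:  list of (x, t) pairs; x has n_in attributes, t has n_out targets
    for x, t in data:
        # Propagate x through the network to compute the output values.
        xb = [1.0] + list(x)
        h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in W_hid]
        hb = [1.0] + h
        o = [sigmoid(sum(w * hi for w, hi in zip(row, hb))) for row in W_out]
        # Error for each output unit k: delta_k = o_k (1 - o_k)(t_k - o_k).
        delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
        # Error for each hidden unit h: delta_h = o_h (1 - o_h) sum_k w_kh delta_k.
        delta_h = [h[j] * (1 - h[j]) *
                   sum(W_out[k][j + 1] * delta_o[k] for k in range(len(W_out)))
                   for j in range(len(h))]
        # Update each weight: w_ji <- w_ji + eta * delta_j * x_ji.
        for k, row in enumerate(W_out):
            for i in range(len(row)):
                row[i] += eta * delta_o[k] * hb[i]
        for j, row in enumerate(W_hid):
            for i in range(len(row)):
                row[i] += eta * delta_h[j] * xb[i]
    return W_hid, W_out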