Page 1: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Artificial Neural Networks #1
Machine Learning CH4: 4.1 – 4.5

Promethea Pythaitha

Page 2: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Artificial Neural Networks

Robust approach to approximating target functions over attributes with continuous as well as discrete domains.

Can approximate unknown target functions. The target function can be:
Discrete-valued, real-valued, or vector-valued.

Page 3: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Neural Net Applications
Robust to noise in training data. Among the most successful methods at interpreting noisy real-world sensor data:

Microphone input / speech recognition
Camera input: handwriting recognition, face recognition, and image processing
Robotic control
Fuzzy neural nets

Page 4: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Inspiration for Artificial Neural Nets

Natural learning systems (i.e. brains) are composed of very complex webs of interconnected neurons. Each neuron receives signals (current spikes). When the neuron's threshold is reached, it sends its signal downstream to:
Other neurons
Physical actuators
Perception in the neocortex
Etc.

Page 5: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Artificial Neural Nets are built out of densely interconnected sets of units. Each unit (artificial neuron) takes many real-valued inputs and produces a real-valued output.

The output is then sent downstream to:
Other units within the net
The output layer of the net

Page 6: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

The brain is estimated to contain ~100 billion neurons.

Each neuron connects to an average of 10,000 others.

Neuron switching speeds are on the order of a thousandth of a second (versus a ten-billionth of a second for logic gates).

Yet brains can make decisions, recognize images, etc. VERY fast.

Page 7: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Hypothesis: Thought / information processing in the brain is the result of massively parallelized processing of distributed inputs.

Page 8: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Neural nets are built on this idea: process distributed data in parallel.

Page 9: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Artificial vs. Natural.

Many complexities of natural neural nets are not present in artificial ones, e.g. feedback (uncommon in ANNs), etc.

Many features of artificial neural nets are not compatible with natural ones:

Units in artificial neural nets produce one constant output rather than a time-varying sequence of current pulses.

Page 10: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Neural Network representations.

ALVINN learned to drive an autonomous vehicle on the highway.
Input: a 30 x 32 pixel matrix from a camera, i.e. 960 values (B/W pixel intensities).
Output: steering direction for the vehicle (30 real values).
Two-layer neural net: input (not counted), hidden layer, output.

Page 11: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

ALVINN

Page 12: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

ALVINN explained: typical neural net structure.
All 960 inputs in the 30x32 matrix are sent to the four hidden neurons/units, where weighted linear sums are computed.
Hidden unit: a unit whose output is only accessible inside the net, not at the output layer.
Outputs from the 4 hidden neurons are sent downstream to 30 output neurons, each of which outputs a confidence value corresponding to steering in a specific direction. Fuzzy truth?? A probability measure?
The program chooses the direction with the highest confidence.
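A minimal sketch of this two-layer structure in Python (the layer sizes match the description, but the weights here are random placeholders rather than ALVINN's learned weights, and the sigmoid hidden units are an assumption):

import numpy as np

# Illustrative ALVINN-style forward pass: 960 pixel inputs -> 4 hidden units -> 30 outputs.
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.05, size=(4, 961))   # 960 inputs + 1 bias weight per hidden unit
W_output = rng.normal(scale=0.05, size=(30, 5))    # 4 hidden outputs + 1 bias weight per output unit

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def steer(pixels):
    x = np.append(1.0, pixels)                 # prepend constant bias input x0 = 1
    h = sigmoid(W_hidden @ x)                  # 4 hidden activations
    o = sigmoid(W_output @ np.append(1.0, h))  # 30 confidence values, one per steering direction
    return int(np.argmax(o))                   # choose the direction with highest confidence

print(steer(rng.uniform(size=960)))            # index of the chosen steering direction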

Page 13: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Typical Neural Net structure.
Usually, layers are connected in a directed acyclic graph.
In general, a net can have any structure:

Cyclic / acyclic
Directed / undirected
Feedforward / feedback

The most common and practical nets are trained using backpropagation.
Learning = selecting a weight value for each connection.
Assumes the net is directed. Cycles are restricted; usually there are none in practice.

Page 14: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Appropriate problems for A.N.N.s
Great for problems with noisy data and complex sensor inputs:
Camera / microphone, etc.
vs.
Shaft encoder / light sensor, etc.

Symbolic problems: as good as Decision Tree Learning!!

BACKPROPAGATION: the most widely used technique.

Page 15: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Appropriate problems for A.N.N.s
Neural net learning is suitable for problems with the following attributes:
Instances represented by many attribute-value pairs.
– The target function depends on a vector of predefined attributes
– Real-valued inputs
– Attributes can be correlated or independent.

The target function can be discrete-valued, real-valued, or a vector!

– ALVINN's target function was a vector of 30 real values.

Page 16: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Appropriate problems for A.N.N.s
• Training data can contain errors/noise.
• Neural nets are very robust to noisy data (thankfully, so are natural ones ;-) ).
• Long training times are acceptable.
• Neural nets usually take longer to train than other machine-learning algorithms: a few minutes to several hours.
• Fast evaluation of the learned function may be required.
• Neural nets compute the learned function very fast.
• ALVINN re-computes its bearing several times per second.

Page 17: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Appropriate problems for A.N.N.s
It is not important whether humans can understand the learned function!!

ALVINN: 960 inputs, 4 hidden nodes, 30 outputs.

The learned weights get somewhat messy-looking to humans, thanks to the massive parallelism and distributed data.

Page 18: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Perceptrons
The basic building block of neural nets.
Takes several real-valued inputs.
Computes a weighted sum.
Checks the sum against a threshold (the threshold is -w0):
If the sum exceeds the threshold, output +1; else output -1.

o(x1, …, xn) = +1 if w0 + w1*x1 + … + wn*xn > 0, and -1 otherwise.
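For illustration, a direct transcription of this output rule in Python (the function name and example weights are hypothetical):

# Perceptron output: +1 if the weighted sum w0 + w1*x1 + ... + wn*xn is positive, else -1.
def perceptron_output(w, x):
    # w has n+1 entries (w[0] is the threshold weight); x has n entries.
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

print(perceptron_output([-1.5, 1.0, 1.0], [1, 1]))   # +1: these weights implement AND
print(perceptron_output([-1.5, 1.0, 1.0], [1, 0]))   # -1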

Page 19: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

For simplicity, let x0 = 1; then

o(x1, …, xn) = sgn(w.x)
Vectors are denoted in bold!! The . is the vector dot product!!

Hypothesis space = all possible combinations of real-valued weights: all w in R^(n+1).

Page 20: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

A perceptron can represent any linearly separable concept.

The learned hypothesis is a hyperplane in R^n; the equation of the hyperplane is w.x = 0.

Examples: AND, OR, NAND, NOR are linearly separable, vs. XOR, etc., which are not.

Any boolean function can be represented by a 2-layer perceptron network!!
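For example, XOR is not representable by a single perceptron, but a two-layer perceptron network can compute it as (x1 OR x2) AND (x1 NAND x2). A sketch, with one workable choice of weights (these particular values are just an illustration):

def perceptron(w, x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1

# Layer 1: one unit computing OR, one computing NAND (inputs here are 0/1).
# Layer 2: an AND unit over the two hidden outputs (which are -1/+1).
def xor(x1, x2):
    h_or   = perceptron([-0.5,  1.0,  1.0], [x1, x2])
    h_nand = perceptron([ 1.5, -1.0, -1.0], [x1, x2])
    return perceptron([-1.5, 1.0, 1.0], [h_or, h_nand])   # AND of the hidden outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # +1 only when exactly one input is 1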

Page 21: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Training a single Perceptron

Perceptron training rule.

Delta training rule / gradient descent.
These converge to different hypotheses, under different conditions!!

Page 22: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Perceptron training rule.
Start with random weights.
Go through the training examples.
When an error is made, update the weights: for each wi,

wi ← wi + Δwi

Δwi = η(t - o)xi

Terminology:

η is the learning rate: typically small, and sometimes decreased as training proceeds.

t is the value of the target function. o is the value output by the perceptron.

Page 23: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

When every training example in the set is classified correctly, STOP.
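A minimal Python sketch of the whole rule, assuming training examples given as (x, t) pairs with t in {-1, +1}; the learning rate, initialization range, and epoch cap are illustrative choices:

import random

def perceptron_output(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(examples, n, eta=0.1, max_epochs=1000):
    # examples: list of (x, t) with x a length-n tuple and t in {-1, +1}.
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]   # w[0] is the threshold weight
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            xb = (1.0,) + tuple(x)                  # x0 = 1 for the threshold weight
            o = perceptron_output(w, xb)
            if o != t:                              # update only when an error is made
                errors += 1
                for i in range(n + 1):
                    w[i] += eta * (t - o) * xb[i]   # delta_wi = eta * (t - o) * xi
        if errors == 0:                             # all training examples classified correctly: STOP
            return w
    return w                                        # may never stop if the data is not linearly separable

and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(and_data, n=2))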

Page 24: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Pros and Cons
It can be proven to converge to a w that correctly classifies all training examples in finite time, if η is small enough.

Convergence is guaranteed only if the concept is linearly separable. If not, no convergence and no stopping!! It can be an infinite loop of indecision!!

Page 25: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Delta rule & gradient descent.
Addresses the problem of non-convergence for non-linearly-separable concepts.
Gives a linear approximation to such concepts that minimizes error.

…how do we measure error??

Page 26: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Consider a perceptron with the "thresholding" function removed. Then o = w.x.
Define the error as the sum of squared deviations:

E = ½ Σd (td – od)², summed over all training examples d.

td is the target function value for example d.
od is the computed value of the weighted sum for d.
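In code, this error measure for a linear (unthresholded) unit might look like the following sketch (the example weights and data are made up for illustration):

# Squared error E = 1/2 * sum over examples d of (t_d - o_d)^2 for a linear (unthresholded) unit.
def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))     # o = w . x, with x[0] = 1

def squared_error(w, examples):
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

examples = [((1.0, 0.0, 1.0), 1.0), ((1.0, 1.0, 0.0), -1.0)]   # each x includes x0 = 1
print(squared_error([0.0, 0.5, -0.5], examples))                # 2.25 for these made-up values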

Page 27: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

With this choice of error, it can be proven that minimizing E leads to the most probable hypothesis that fits the training data, under the assumption that the noise is normally distributed with mean 0.
Note: the "most probable" hypothesis and the "correct" hypothesis can still be different.

Page 28: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Graph E versus weights:

Page 29: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

E is always parabolic (it is quadratic in the weights), so it has only one global minimum.

Goal: descend to the global minimum ASAP!

How? The gradient: its definition, its meaning, and gradient descent.

Page 30: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

The gradient of E tells us the direction of steepest ascent.

So –∇E tells us the direction of steepest descent. Go in that direction with step size η. The learning rule becomes: w ← w + Δw, where Δw = –η∇E(w), i.e. Δwi = –η ∂E/∂wi.

Page 31: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Derivation of the simple update rule: differentiating E gives ∂E/∂wi = –Σd (td – od) xid.

Finally, the Delta Rule weight update is

Δwi = η Σd (td – od) xid , summed over all training examples d.

Page 32: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Gradient descent pseudocode.
Pick an initial random weight vector w.
Until the termination condition is met, do:
  Set Δwi = 0 for all i.
  For each <x, t> in D, do:
    Run the net on x: compute o(x).
    For each wi, do:
      Δwi ← Δwi + η (t – o) xi
  For each wi, do:
    wi ← wi + Δwi
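A runnable Python version of this pseudocode for a single linear unit (the learning rate, epoch count, initialization, and toy data are illustrative choices, and a fixed number of passes stands in for the termination condition):

import random

def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))       # x is assumed to include x0 = 1

def gradient_descent(examples, n, eta=0.05, epochs=500):
    # examples: list of (x, t), where x is a length-(n+1) tuple with x[0] = 1.
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):                           # termination: fixed number of passes
        dw = [0.0] * (n + 1)                          # set delta_wi = 0 for all i
        for x, t in examples:
            o = linear_output(w, x)                   # run the unit on x
            for i in range(n + 1):
                dw[i] += eta * (t - o) * x[i]         # accumulate delta_wi
        for i in range(n + 1):
            w[i] += dw[i]                             # update all weights once per pass
    return w

data = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0)]        # fits o = 2*x1 - 1 (approximately)
print(gradient_descent(data, n=1))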

Page 33: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Results of gradient descent.
Because the error surface has only a single (global) minimum, the algorithm will converge to a w with minimum squared deviation/error as long as η is small enough. This holds regardless of linear separability.

If η is too large, the algorithm may skip over the global minimum instead of settling in. A common remedy is to decrease η with time.

Page 34: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Gradient descent can be used to search large or infinite hypothesis spaces when:
The hypothesis space is continuously parameterized.
The error can be differentiated with respect to the hypothesis parameters.

However:
Convergence can be very slow.
If there are lots of local minima, there is no guarantee it will find the global one.

Page 35: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Stochastic gradient descent, a.k.a. incremental gradient descent.

Instead of updating the weights in one batch after going through all the training examples, we update them after each example.

This really descends the gradient of a single-example error function (one example per step):

Ed = ½ (td – od)²

If η is small enough, this approximates true gradient descent arbitrarily closely.

Page 36: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Stochastic Gradient Descent.

Pick an initial random weight vector w.
Until the termination condition is met, do:
  For each <x, t> in D, do:
    Run the net on x: compute o(x).
    For each wi, do:
      wi ← wi + η (t – o) xi
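The same sketch with the stochastic/incremental update, where the weights change after every example instead of once per pass (again with illustrative η, epoch count, and toy data):

import random

def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def stochastic_gradient_descent(examples, n, eta=0.05, epochs=500):
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in examples:                       # update weights after each example
            o = linear_output(w, x)
            for i in range(n + 1):
                w[i] += eta * (t - o) * x[i]        # wi = wi + eta * (t - o) * xi
    return w

data = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0)]
print(stochastic_gradient_descent(data, n=1))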

Page 37: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Results. Compared to stochastic gradient descent, standard gradient descent takes more computation per weight-update step, but can generally use a larger step size.

When E has multiple local minima, stochastic gradient descent can sometimes avoid them.

Page 38: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Perceptrons with discrete output.
For delta-learning/gradient descent, we discussed the unthresholded perceptron. The method can simply be modified for thresholded perceptrons: just use the thresholded t values as the targets in the delta-learning algorithm (with the unthresholded o values).

Unfortunately, this does not necessarily minimize the number of errors the thresholded output makes on the training data, only the squared error of the unthresholded output.

Page 39: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Multilayer Networks and the Backpropagation Algorithm
In order to learn non-linear decision surfaces, a system more complex than perceptrons is needed.

Page 40: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Choice of base unit.
Multiple layers of linear units (unthresholded perceptrons) are still linear.
The thresholded perceptron has a non-differentiable thresholding function, so we cannot compute the gradient of E.

We need something different: non-linear and continuously differentiable.

Page 41: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Sigmoid unit.

In place of the perceptron step function, use the sigmoid function as the thresholding function.

Sigmoid: σ(y) = 1 / (1 + e^(-y))

Page 42: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

The sigmoid unit computes the weighted linear sum w.x, then applies the sigmoid "squashing function".

The steepness of the incline increases with the coefficient of –y (using e^(-ky) with larger k gives a steeper sigmoid).

Continuously differentiable, with derivative: dσ(y)/dy = σ(y)(1 - σ(y))
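In code, the sigmoid and its derivative (a minimal sketch):

import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))            # sigma(y) = 1 / (1 + e^-y)

def sigmoid_derivative(y):
    s = sigmoid(y)
    return s * (1.0 - s)                         # d sigma / dy = sigma(y) * (1 - sigma(y))

print(sigmoid(0.0), sigmoid_derivative(0.0))     # 0.5 0.25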

Page 43: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Backpropagation algorithm.
Learns the weights that minimize the squared error, given a fixed number of units/neurons and interconnections.

Employs gradient descent, similar to the delta rule.

Error is measured over all output units:

E = ½ Σd Σk (tkd – okd)², summed over all training examples d and all output units k.

Page 44: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

The error surface can have multiple local minima. There is no guarantee the algorithm will find the global minimum.

However, in practice, backpropagation performs very well.

Recall stochastic gradient descent vs. gradient descent.

Page 45: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Stochastic Backpropagation Algorithm

Consider a feedforward network with two layers of sigmoid units, fully connected in one direction.
Each unit is assigned an index (i = 0, 1, 2, …).
xji denotes the input from unit i into unit j.

wji denotes the weight on the connection from i to j.

δn is the error term associated with unit n.

Page 46: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Backpropagation algorithm.

Page 47: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Backpropagation explained.
Build the network and randomly initialize the weights.
For each training example d, apply the network, calculate the squared error Ed, compute the gradient, and take a step of size η in the direction of steepest decline.

Weight update rule: recall the delta rule, Δwi ← Δwi + η (t – o) xi. Here we have Δwji = η δj xji.

The error term δj is more complex here.

Page 48: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Error term for unit j.
Intuitively, if j is an output node k, then its error is the standard tk – ok multiplied by ok(1 – ok), the derivative of the sigmoid function. The derivative of the sigmoid appears because we are computing the gradient of E:

δk = ok(1 – ok)(tk – ok)

If j is a hidden node h, we have no th to compare it with. We must sum the error terms δk of the output nodes k influenced by h, weighted by how much they are influenced by h (wkh):

δh = oh(1 – oh) Σk wkh δk
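A sketch of one stochastic backpropagation step for a fully connected two-layer sigmoid network using these error terms (the layer sizes, learning rate, and random initialization are illustrative; bias weights are handled by a constant input of 1):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(W_h, W_o, x, t, eta=0.1):
    # W_h: hidden weights, shape (n_hidden, n_in + 1); W_o: output weights, shape (n_out, n_hidden + 1).
    xb = np.append(1.0, x)                       # bias input x0 = 1
    h = sigmoid(W_h @ xb)                        # hidden unit outputs
    hb = np.append(1.0, h)
    o = sigmoid(W_o @ hb)                        # output unit outputs

    delta_o = o * (1 - o) * (t - o)              # delta_k = o_k(1 - o_k)(t_k - o_k)
    delta_h = h * (1 - h) * (W_o[:, 1:].T @ delta_o)   # delta_h = o_h(1 - o_h) * sum_k w_kh delta_k

    W_o += eta * np.outer(delta_o, hb)           # delta w_ji = eta * delta_j * x_ji
    W_h += eta * np.outer(delta_h, xb)
    return W_h, W_o

W_h = rng.normal(scale=0.05, size=(3, 3))        # 2 inputs (+bias) -> 3 hidden units
W_o = rng.normal(scale=0.05, size=(1, 4))        # 3 hidden (+bias) -> 1 output
W_h, W_o = backprop_step(W_h, W_o, x=np.array([0.0, 1.0]), t=np.array([1.0]))
print(W_o)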

Page 49: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Derivation of error term for unit j.

Page 50: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha
Page 51: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha
Page 52: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha
Page 53: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Termination condition.
Stop after a fixed number of iterations.
Stop when E on the training data drops below a given level.
Stop when E on the test data drops below a certain level.
Too few iterations: too much error remains. Too many: overfitting the data.

Page 54: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Variant of backpropagation with momentum.
Adding momentum = making the weight update at step n depend partially on the update at step n-1:

Δwji(n) = η δj xji + α Δwji(n - 1)

α is a number between 0 and 1.

Analogous to a ball rolling down a bumpy hill: the momentum term (α) tries to keep the ball moving in the same direction as before.
Can keep the 'ball' moving through local minima and plateaus.
If the gradient of E does not change, it increases the effective step size.
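A small sketch of the momentum update for one weight vector (η, α, and the stand-in gradient values are illustrative):

import numpy as np

def momentum_update(w, grad_step, velocity, eta=0.1, alpha=0.9):
    # One momentum update: delta_w(n) = eta * grad_step + alpha * delta_w(n-1).
    # grad_step stands for the current delta_j * x_ji term; velocity holds the previous delta_w.
    velocity = eta * grad_step + alpha * velocity
    return w + velocity, velocity

w = np.zeros(3)
velocity = np.zeros(3)
for step in range(3):
    grad_step = np.array([1.0, -0.5, 0.2])        # stand-in for delta_j * x_ji at this step
    w, velocity = momentum_update(w, grad_step, velocity)
print(w, velocity)                                # steps grow while the gradient direction is unchanged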

Page 55: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Pros and cons of momentum.
Can provide quicker convergence.
However, in theory it can also roll right through the global minimum and keep going.

Page 56: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Generalization to an n-layer network

We simply evaluate the δk for the output nodes as before.

Then we backpropagate the errors through the network layer by layer. For a node r at layer m:

δr = or(1 – or) Σs wsr δs , summed over all nodes s in layer m+1.

Layer m+1 is downstream from layer m: r feeds input into the s nodes.

Page 57: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Generalization to an arbitrary acyclic network
We simply evaluate the δk for the output nodes as before.
Then we backpropagate the errors through the network. For a node r:

δr = or(1 – or) Σs wsr δs , summed over all nodes s in Downstream(r).

Downstream(r) = {s | s receives input from r}
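A compact sketch of this generalization: compute the δ values in reverse topological order from a Downstream adjacency map (the tiny example network and its numbers are purely illustrative):

# Backpropagate error terms through an arbitrary acyclic net.
# outputs[r]    : o_r for every unit r (already computed by a forward pass)
# downstream[r] : list of units s that receive input from r
# weights[(s,r)]: w_sr, the weight on the connection from r into s
# delta starts out holding the output-unit error terms o_k(1 - o_k)(t_k - o_k).

def backpropagate_deltas(order, outputs, downstream, weights, delta):
    for r in reversed(order):                      # reverse topological order: downstream units first
        if downstream[r]:                          # hidden unit: sum weighted downstream deltas
            s_sum = sum(weights[(s, r)] * delta[s] for s in downstream[r])
            delta[r] = outputs[r] * (1 - outputs[r]) * s_sum
    return delta

# Tiny illustrative net: hidden units h1 and h2 feed output unit k.
order = ["h1", "h2", "k"]
outputs = {"h1": 0.6, "h2": 0.4, "k": 0.7}
downstream = {"h1": ["k"], "h2": ["k"], "k": []}
weights = {("k", "h1"): 0.5, ("k", "h2"): -0.3}
delta = {"k": 0.7 * (1 - 0.7) * (1.0 - 0.7)}       # output error term for target t_k = 1.0
print(backpropagate_deltas(order, outputs, downstream, weights, delta))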

Page 58: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Summary. Artificial neural networks are:

A practical way to learn discrete-, real-, and vector-valued functions.

Robust to noisy data.
Usually trained via backpropagation.
Used for many real-world tasks:

Robot control
Computer creativity

http://www.venturacountystar.com/news/2007/jul/09/computers-compose-original-melodies/

Page 59: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Feedforward networks with 3 layers can approximate any function to any desired accuracy, given sufficient units/artificial neurons and connections. Good accuracy is achieved even with small nets.

Backpropagation is able to find intermediate features within the net that are not explicitly defined as attributes of the input or output.