Page 1: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Artificial Neural Networks #1
Machine Learning CH4: 4.1 – 4.5

Promethea Pythaitha

Page 2: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Artificial Neural Networks

Robust approach to approximating target functions over attributes with continuous as well as discrete domains.

Can approximate unknown target functions. The target function can be:
Discrete-valued, real-valued, or vector-valued.

Page 3: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Neural Net Applications
Robust to noise in training data. Among the most successful methods at interpreting noisy real-world sensor data:

Microphone input / speech recognition
Camera input: handwriting recognition, face recognition, and image processing
Robotic control
Fuzzy neural nets

Page 4: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Inspiration for Artificial Neural Nets

Natural learning systems (i.e. brains) are composed of very complex webs of interconnected neurons. Each neuron receives signals (current spikes). When the neuron's threshold is reached, it sends its signal downstream to:
Other neurons
Physical actuators
Perception in the neocortex
Etc.

Page 5: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Artificial Neural Nets are built out of densely interconnected sets of units. Each unit (artificial neuron) takes many real-valued inputs and produces a real-valued output.

The output is then sent downstream to:
Other units within the net
The output layer of the net

Page 6: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

The brain is estimated to contain ~100 billion neurons.

Each neuron connects to an average of 10,000 others.

Neuron switching speeds are on the order of a thousandth of a second (versus a ten-billionth of a second for logic gates).

Yet brains can make decisions, recognize images, etc. VERY fast.

Page 7: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Hypothesis: Thought / information processing in the brain is the result of massively parallelized processing of distributed inputs.

Page 8: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Neural nets are built on this idea: process distributed data in parallel.

Page 9: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Artificial vs. Natural.

Many complexities of natural neural nets are not present in artificial ones, e.g. feedback (uncommon in ANNs), etc.

Many features of artificial neural nets are not compatible with natural ones:

Units in artificial neural nets produce one constant output rather than a time-varying sequence of current pulses.

Page 10: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Neural Network representations.

ALVINN learned to drive an autonomous vehicle on the highway.
Input: a 30 x 32 pixel matrix from a camera, i.e. 960 values (B/W pixel intensities).
Output: steering direction for the vehicle (30 real values).
Two-layer neural net: input (not counted), hidden layer, output.

Page 11: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

ALVINN

Page 12: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

ALVINN explained: typical neural net structure.
All 960 inputs in the 30x32 matrix are sent to the four hidden neurons/units, where weighted linear sums are computed.
Hidden unit: a unit whose output is only accessible inside the net, not at the output layer.
Outputs from the 4 hidden neurons are sent downstream to 30 output neurons, each of which outputs a confidence value corresponding to steering in a specific direction. Fuzzy truth?? A probability measure?
The program chooses the direction with the highest confidence.
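A minimal sketch of this two-layer structure in Python (the layer sizes match the description, but the weights here are random placeholders rather than ALVINN's learned weights, and the sigmoid hidden units are an assumption):

import numpy as np

# Illustrative ALVINN-style forward pass: 960 pixel inputs -> 4 hidden units -> 30 outputs.
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.05, size=(4, 961))   # 960 inputs + 1 bias weight per hidden unit
W_output = rng.normal(scale=0.05, size=(30, 5))    # 4 hidden outputs + 1 bias weight per output unit

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def steer(pixels):
    x = np.append(1.0, pixels)                 # prepend constant bias input x0 = 1
    h = sigmoid(W_hidden @ x)                  # 4 hidden activations
    o = sigmoid(W_output @ np.append(1.0, h))  # 30 confidence values, one per steering direction
    return int(np.argmax(o))                   # choose the direction with highest confidence

print(steer(rng.uniform(size=960)))            # index of the chosen steering direction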

Page 13: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Typical Neural Net structure.
Usually, layers are connected in a directed acyclic graph.
In general, a net can have any structure:

Cyclic / acyclic
Directed / undirected
Feedforward / feedback

The most common and practical nets are trained using backpropagation.
Learning = selecting a weight value for each connection.
Assumes the net is directed. Cycles are restricted; usually there are none in practice.

Page 14: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Appropriate problems for A.N.N.s
Great for problems with noisy data and complex sensor inputs:
Camera / microphone, etc.
vs.
Shaft encoder / light sensor, etc.

Symbolic problems: as good as Decision Tree Learning!!

BACKPROPAGATION: the most widely used technique.

Page 15: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Appropriate problems for A.N.N.s
Neural net learning is suitable for problems with the following attributes:
Instances represented by many attribute-value pairs.
– The target function depends on a vector of predefined attributes
– Real-valued inputs
– Attributes can be correlated or independent.

The target function can be discrete-valued, real-valued, or a vector!

– ALVINN's target function was a vector of 30 real values.

Page 16: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Appropriate problems for A.N.N.s
• Training data can contain errors/noise.
• Neural nets are very robust to noisy data (thankfully, so are natural ones ;-) ).
• Long training times are acceptable.
• Neural nets usually take longer to train than other machine-learning algorithms: a few minutes to several hours.
• Fast evaluation of the learned function may be required.
• Neural nets compute the learned function very fast.
• ALVINN re-computes its bearing several times per second.

Page 17: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Appropriate problems for A.N.N.s
It is not important whether humans can understand the learned function!!

ALVINN: 960 inputs, 4 hidden nodes, 30 outputs.

The learned weights get somewhat messy-looking to humans, thanks to the massive parallelism and distributed data.

Page 18: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Perceptrons
The basic building block of neural nets.
Takes several real-valued inputs.
Computes a weighted sum.
Checks the sum against a threshold (the threshold is -w0):
If the sum exceeds the threshold, output +1; else output -1.

o(x1, …, xn) = +1 if w0 + w1*x1 + … + wn*xn > 0, and -1 otherwise.
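For illustration, a direct transcription of this output rule in Python (the function name and example weights are hypothetical):

# Perceptron output: +1 if the weighted sum w0 + w1*x1 + ... + wn*xn is positive, else -1.
def perceptron_output(w, x):
    # w has n+1 entries (w[0] is the threshold weight); x has n entries.
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

print(perceptron_output([-1.5, 1.0, 1.0], [1, 1]))   # +1: these weights implement AND
print(perceptron_output([-1.5, 1.0, 1.0], [1, 0]))   # -1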

Page 19: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

For simplicity, let x0 = 1; then

o(x1, …, xn) = sgn(w.x)
Vectors are denoted in bold!! The . is the vector dot product!!

Hypothesis space = all possible combinations of real-valued weights: all w in R^(n+1).

Page 20: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

A perceptron can represent any linearly separable concept.

The learned hypothesis is a hyperplane in R^n; the equation of the hyperplane is w.x = 0.

Examples: AND, OR, NAND, NOR are linearly separable, vs. XOR, etc., which are not.

Any boolean function can be represented by a 2-layer perceptron network!!
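For example, XOR is not representable by a single perceptron, but a two-layer perceptron network can compute it as (x1 OR x2) AND (x1 NAND x2). A sketch, with one workable choice of weights (these particular values are just an illustration):

def perceptron(w, x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1

# Layer 1: one unit computing OR, one computing NAND (inputs here are 0/1).
# Layer 2: an AND unit over the two hidden outputs (which are -1/+1).
def xor(x1, x2):
    h_or   = perceptron([-0.5,  1.0,  1.0], [x1, x2])
    h_nand = perceptron([ 1.5, -1.0, -1.0], [x1, x2])
    return perceptron([-1.5, 1.0, 1.0], [h_or, h_nand])   # AND of the hidden outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # +1 only when exactly one input is 1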

Page 21: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Training a single Perceptron

Perceptron training rule.

Delta training rule / gradient descent.
These converge to different hypotheses, under different conditions!!

Page 22: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Perceptron training rule.
Start with random weights.
Go through the training examples.
When an error is made, update the weights: for each wi,

wi ← wi + Δwi

Δwi = η(t - o)xi

Terminology:

η is the learning rate: typically small, and sometimes decreased as training proceeds.

t is the value of the target function. o is the value output by the perceptron.

Page 23: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

When every training example in the set is classified correctly, STOP.
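A minimal Python sketch of the whole rule, assuming training examples given as (x, t) pairs with t in {-1, +1}; the learning rate, initialization range, and epoch cap are illustrative choices:

import random

def perceptron_output(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(examples, n, eta=0.1, max_epochs=1000):
    # examples: list of (x, t) with x a length-n tuple and t in {-1, +1}.
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]   # w[0] is the threshold weight
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            xb = (1.0,) + tuple(x)                  # x0 = 1 for the threshold weight
            o = perceptron_output(w, xb)
            if o != t:                              # update only when an error is made
                errors += 1
                for i in range(n + 1):
                    w[i] += eta * (t - o) * xb[i]   # delta_wi = eta * (t - o) * xi
        if errors == 0:                             # all training examples classified correctly: STOP
            return w
    return w                                        # may never stop if the data is not linearly separable

and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(and_data, n=2))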

Page 24: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Pros and Cons
It can be proven to converge to a w that correctly classifies all training examples in finite time, if η is small enough.

Convergence is guaranteed only if the concept is linearly separable. If not, no convergence and no stopping!! It can be an infinite loop of indecision!!

Page 25: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Delta rule & gradient descent.
Addresses the problem of non-convergence for non-linearly-separable concepts.
Gives a linear approximation to such concepts that minimizes error.

…how do we measure error??

Page 26: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Consider a perceptron with the "thresholding" function removed. Then o = w.x.
Define the error as the sum of squared deviations:

E = ½ Σd (td – od)², summed over all training examples d.

td is the target function value for example d.
od is the computed value of the weighted sum for d.
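In code, this error measure for a linear (unthresholded) unit might look like the following sketch (the example weights and data are made up for illustration):

# Squared error E = 1/2 * sum over examples d of (t_d - o_d)^2 for a linear (unthresholded) unit.
def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))     # o = w . x, with x[0] = 1

def squared_error(w, examples):
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

examples = [((1.0, 0.0, 1.0), 1.0), ((1.0, 1.0, 0.0), -1.0)]   # each x includes x0 = 1
print(squared_error([0.0, 0.5, -0.5], examples))                # 2.25 for these made-up values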

Page 27: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

With this choice of error, it can be proven that minimizing E leads to the most probable hypothesis that fits the training data, under the assumption that the noise is normally distributed with mean 0.
Note: the "most probable" hypothesis and the "correct" hypothesis can still be different.

Page 28: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Graph E versus weights:

Page 29: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

E is always parabolic (it is quadratic in the weights), so it has only one global minimum.

Goal: descend to the global minimum ASAP!

How? The gradient: its definition, its meaning, and gradient descent.

Page 30: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

The gradient of E tells us the direction of steepest ascent.

So –∇E tells us the direction of steepest descent. Go in that direction with step size η. The learning rule becomes: w ← w + Δw, where Δw = –η∇E(w), i.e. Δwi = –η ∂E/∂wi.

Page 31: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Derivation of the simple update rule: differentiating E gives ∂E/∂wi = –Σd (td – od) xid.

Finally, the Delta Rule weight update is

Δwi = η Σd (td – od) xid , summed over all training examples d.

Page 32: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Gradient descent pseudocode.
Pick an initial random weight vector w.
Until the termination condition is met, do:
  Set Δwi = 0 for all i.
  For each <x, t> in D, do:
    Run the net on x: compute o(x).
    For each wi, do:
      Δwi ← Δwi + η (t – o) xi
  For each wi, do:
    wi ← wi + Δwi
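A runnable Python version of this pseudocode for a single linear unit (the learning rate, epoch count, initialization, and toy data are illustrative choices, and a fixed number of passes stands in for the termination condition):

import random

def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))       # x is assumed to include x0 = 1

def gradient_descent(examples, n, eta=0.05, epochs=500):
    # examples: list of (x, t), where x is a length-(n+1) tuple with x[0] = 1.
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):                           # termination: fixed number of passes
        dw = [0.0] * (n + 1)                          # set delta_wi = 0 for all i
        for x, t in examples:
            o = linear_output(w, x)                   # run the unit on x
            for i in range(n + 1):
                dw[i] += eta * (t - o) * x[i]         # accumulate delta_wi
        for i in range(n + 1):
            w[i] += dw[i]                             # update all weights once per pass
    return w

data = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0)]        # fits o = 2*x1 - 1 (approximately)
print(gradient_descent(data, n=1))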

Page 33: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Results of gradient descent.
Because the error surface has only a single (global) minimum, the algorithm will converge to a w with minimum squared deviation/error as long as η is small enough. This holds regardless of linear separability.

If η is too large, the algorithm may skip over the global minimum instead of settling in. A common remedy is to decrease η with time.

Page 34: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Gradient descent can be used to search large or infinite hypothesis spaces when:
The hypothesis space is continuously parameterized.
The error can be differentiated with respect to the hypothesis parameters.

However:
Convergence can be very slow.
If there are lots of local minima, there is no guarantee it will find the global one.

Page 35: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Stochastic gradient descent, a.k.a. incremental gradient descent.

Instead of updating the weights in one batch after going through all the training examples, we update them after each example.

This really descends the gradient of a single-example error function (one example per step):

Ed = ½ (td – od)²

If η is small enough, this approximates true gradient descent arbitrarily closely.

Page 36: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Stochastic Gradient Descent.

Pick an initial random weight vector w.
Until the termination condition is met, do:
  For each <x, t> in D, do:
    Run the net on x: compute o(x).
    For each wi, do:
      wi ← wi + η (t – o) xi
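The same sketch with the stochastic/incremental update, where the weights change after every example instead of once per pass (again with illustrative η, epoch count, and toy data):

import random

def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def stochastic_gradient_descent(examples, n, eta=0.05, epochs=500):
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in examples:                       # update weights after each example
            o = linear_output(w, x)
            for i in range(n + 1):
                w[i] += eta * (t - o) * x[i]        # wi = wi + eta * (t - o) * xi
    return w

data = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0)]
print(stochastic_gradient_descent(data, n=1))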

Page 37: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Results. Compared to stochastic gradient descent, standard gradient descent takes more computation per weight-update step, but can generally use a larger step size.

When E has multiple local minima, stochastic gradient descent can sometimes avoid them.

Page 38: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Perceptrons with discrete output.
For delta-learning/gradient descent, we discussed the unthresholded perceptron. The method can simply be modified for thresholded perceptrons: just use the thresholded t values as the targets in the delta-learning algorithm (with the unthresholded o values).

Unfortunately, this does not necessarily minimize the number of errors the thresholded output makes on the training data, only the squared error of the unthresholded output.

Page 39: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Multilayer Networks and the Backpropagation Algorithm
In order to learn non-linear decision surfaces, a system more complex than perceptrons is needed.

Page 40: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Choice of base unit.
Multiple layers of linear units (unthresholded perceptrons) are still linear.
The thresholded perceptron has a non-differentiable thresholding function, so we cannot compute the gradient of E.

We need something different: non-linear and continuously differentiable.

Page 41: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Sigmoid unit.

In place of the perceptron step function, use the sigmoid function as the thresholding function.

Sigmoid: σ(y) = 1 / (1 + e^(-y))

Page 42: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

The sigmoid unit computes the weighted linear sum w.x, then applies the sigmoid "squashing function".

The steepness of the incline increases with the coefficient of –y (using e^(-ky) with larger k gives a steeper sigmoid).

Continuously differentiable, with derivative: dσ(y)/dy = σ(y)(1 - σ(y))
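In code, the sigmoid and its derivative (a minimal sketch):

import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))            # sigma(y) = 1 / (1 + e^-y)

def sigmoid_derivative(y):
    s = sigmoid(y)
    return s * (1.0 - s)                         # d sigma / dy = sigma(y) * (1 - sigma(y))

print(sigmoid(0.0), sigmoid_derivative(0.0))     # 0.5 0.25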

Page 43: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Backpropagation algorithm.
Learns the weights that minimize the squared error, given a fixed number of units/neurons and interconnections.

Employs gradient descent, similar to the delta rule.

Error is measured over all output units:

E = ½ Σd Σk (tkd – okd)², summed over all training examples d and all output units k.

Page 44: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

The error surface can have multiple local minima. There is no guarantee the algorithm will find the global minimum.

However, in practice, backpropagation performs very well.

Recall stochastic gradient descent vs. gradient descent.

Page 45: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Stochastic Backpropagation Algorithm

Consider a feedforward network with two layers of sigmoid units, fully connected in one direction.
Each unit is assigned an index (i = 0, 1, 2, …).
xji denotes the input from unit i into unit j.

wji denotes the weight on the connection from i to j.

δn is the error term associated with unit n.

Page 46: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Backpropagation algorithm.

Page 47: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Backpropagation explained.
Build the network and randomly initialize the weights.
For each training example d, apply the network, calculate the squared error Ed, compute the gradient, and take a step of size η in the direction of steepest decline.

Weight update rule: recall the delta rule, Δwi ← Δwi + η (t – o) xi. Here we have Δwji = η δj xji.

The error term δj is more complex here.

Page 48: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Error term for unit j.
Intuitively, if j is an output node k, then its error is the standard tk – ok multiplied by ok(1 – ok), the derivative of the sigmoid function. The derivative of the sigmoid appears because we are computing the gradient of E:

δk = ok(1 – ok)(tk – ok)

If j is a hidden node h, we have no th to compare it with. We must sum the error terms δk of the output nodes k influenced by h, weighted by how much they are influenced by h (wkh):

δh = oh(1 – oh) Σk wkh δk
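A sketch of one stochastic backpropagation step for a fully connected two-layer sigmoid network using these error terms (the layer sizes, learning rate, and random initialization are illustrative; bias weights are handled by a constant input of 1):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(W_h, W_o, x, t, eta=0.1):
    # W_h: hidden weights, shape (n_hidden, n_in + 1); W_o: output weights, shape (n_out, n_hidden + 1).
    xb = np.append(1.0, x)                       # bias input x0 = 1
    h = sigmoid(W_h @ xb)                        # hidden unit outputs
    hb = np.append(1.0, h)
    o = sigmoid(W_o @ hb)                        # output unit outputs

    delta_o = o * (1 - o) * (t - o)              # delta_k = o_k(1 - o_k)(t_k - o_k)
    delta_h = h * (1 - h) * (W_o[:, 1:].T @ delta_o)   # delta_h = o_h(1 - o_h) * sum_k w_kh delta_k

    W_o += eta * np.outer(delta_o, hb)           # delta w_ji = eta * delta_j * x_ji
    W_h += eta * np.outer(delta_h, xb)
    return W_h, W_o

W_h = rng.normal(scale=0.05, size=(3, 3))        # 2 inputs (+bias) -> 3 hidden units
W_o = rng.normal(scale=0.05, size=(1, 4))        # 3 hidden (+bias) -> 1 output
W_h, W_o = backprop_step(W_h, W_o, x=np.array([0.0, 1.0]), t=np.array([1.0]))
print(W_o)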

Page 49: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Derivation of error term for unit j.

Page 50: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha
Page 51: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha
Page 52: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha
Page 53: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Termination condition.
Stop after a fixed number of iterations.
Stop when E on the training data drops below a given level.
Stop when E on the test data drops below a certain level.
Too few iterations: too much error remains. Too many: overfitting the data.

Page 54: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Variant of backpropagation with momentum.
Adding momentum = making the weight update at step n depend partially on the update at step n-1:

Δwji(n) = η δj xji + α Δwji(n - 1)

α is a number between 0 and 1.

Analogous to a ball rolling down a bumpy hill: the momentum term (α) tries to keep the ball moving in the same direction as before.
Can keep the 'ball' moving through local minima and plateaus.
If the gradient of E does not change, it increases the effective step size.
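A small sketch of the momentum update for one weight vector (η, α, and the stand-in gradient values are illustrative):

import numpy as np

def momentum_update(w, grad_step, velocity, eta=0.1, alpha=0.9):
    # One momentum update: delta_w(n) = eta * grad_step + alpha * delta_w(n-1).
    # grad_step stands for the current delta_j * x_ji term; velocity holds the previous delta_w.
    velocity = eta * grad_step + alpha * velocity
    return w + velocity, velocity

w = np.zeros(3)
velocity = np.zeros(3)
for step in range(3):
    grad_step = np.array([1.0, -0.5, 0.2])        # stand-in for delta_j * x_ji at this step
    w, velocity = momentum_update(w, grad_step, velocity)
print(w, velocity)                                # steps grow while the gradient direction is unchanged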

Page 55: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Pros and cons of momentum.
Can provide quicker convergence.
However, in theory it can also roll right through the global minimum and keep going.

Page 56: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Generalization to an n-layer network

We simply evaluate the δk for the output nodes as before.

Then we backpropagate the errors through the network layer by layer. For a node r at layer m:

δr = or(1 – or) Σs wsr δs , summed over all nodes s in layer m+1.

Layer m+1 is downstream from layer m: r feeds input into the s nodes.

Page 57: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Generalization to an arbitrary acyclic network
We simply evaluate the δk for the output nodes as before.
Then we backpropagate the errors through the network. For a node r:

δr = or(1 – or) Σs wsr δs , summed over all nodes s in Downstream(r).

Downstream(r) = {s | s receives input from r}
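A compact sketch of this generalization: compute the δ values in reverse topological order from a Downstream adjacency map (the tiny example network and its numbers are purely illustrative):

# Backpropagate error terms through an arbitrary acyclic net.
# outputs[r]    : o_r for every unit r (already computed by a forward pass)
# downstream[r] : list of units s that receive input from r
# weights[(s,r)]: w_sr, the weight on the connection from r into s
# delta starts out holding the output-unit error terms o_k(1 - o_k)(t_k - o_k).

def backpropagate_deltas(order, outputs, downstream, weights, delta):
    for r in reversed(order):                      # reverse topological order: downstream units first
        if downstream[r]:                          # hidden unit: sum weighted downstream deltas
            s_sum = sum(weights[(s, r)] * delta[s] for s in downstream[r])
            delta[r] = outputs[r] * (1 - outputs[r]) * s_sum
    return delta

# Tiny illustrative net: hidden units h1 and h2 feed output unit k.
order = ["h1", "h2", "k"]
outputs = {"h1": 0.6, "h2": 0.4, "k": 0.7}
downstream = {"h1": ["k"], "h2": ["k"], "k": []}
weights = {("k", "h1"): 0.5, ("k", "h2"): -0.3}
delta = {"k": 0.7 * (1 - 0.7) * (1.0 - 0.7)}       # output error term for target t_k = 1.0
print(backpropagate_deltas(order, outputs, downstream, weights, delta))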

Page 58: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Summary. Artificial neural networks are:

A practical way to learn discrete-, real-, and vector-valued functions.

Robust to noisy data.
Usually trained via backpropagation.
Used for many real-world tasks:

Robot control
Computer creativity

http://www.venturacountystar.com/news/2007/jul/09/computers-compose-original-melodies/

Page 59: Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5 Promethea Pythaitha

Feedforward networks with 3 layers can approximate any function to any desired accuracy, given sufficient units/artificial neurons and connections. Good accuracy is achieved even with small nets.

Backpropagation is able to find intermediate features within the net that are not explicitly defined as attributes of the input or output.