Soft Computing




    Introduction

    What is an artificial neural network?

An artificial neural network is a system based on the operation of biological neural networks; in other words, it is an emulation of a biological neural system. Why would the implementation of artificial neural networks be necessary? Although computing these days is truly advanced, there are certain tasks that a program made for a common microprocessor is unable to perform; even so, a software implementation of a neural network can be made, with its advantages and disadvantages.

Advantages:

A neural network can perform tasks that a linear program cannot.
When an element of the neural network fails, it can continue without any problem, thanks to its parallel nature.
A neural network learns and does not need to be reprogrammed.
It can be implemented in any application.
It can be implemented without any problem.

    Disadvantages:

The neural network needs training to operate.
The architecture of a neural network is different from the architecture of microprocessors, and therefore needs to be emulated.
It requires high processing time for large neural networks.

Another aspect of artificial neural networks is that there are different architectures, which consequently require different types of algorithms; but despite being an apparently complex system, a neural network is relatively simple.

Artificial neural networks are among the newest signal processing technologies nowadays. The field of work is very interdisciplinary, but the explanation I will give here is restricted to an engineering perspective.

In the world of engineering, neural networks have two main functions: pattern classifiers and nonlinear adaptive filters. Like its biological predecessor, an artificial neural network is an adaptive system. By adaptive, we mean that each parameter is changed during its operation as it is deployed to solve the problem at hand. This is called the training phase.

An artificial neural network is developed with a systematic step-by-step procedure which optimizes a criterion commonly known as the learning rule. The input/output training data is fundamental for these networks, as it conveys the information which is necessary to discover the optimal operating point. In addition, their nonlinear nature makes neural network processing elements a very flexible system.


Basically, an artificial neural network is a system. A system is a structure that receives an input, processes the data, and provides an output. Commonly, the input consists of a data array, which can be anything: data from an image file, a WAVE sound, or any kind of data that can be represented in an array. Once an input is presented to the neural network, and a corresponding desired or target response is set at the output, an error is computed from the difference between the desired response and the real system output.

The error information is fed back to the system, which adjusts its parameters in a systematic fashion (according to what is commonly known as the learning rule). This process is repeated until the desired output is acceptable. It is important to notice that the performance hinges heavily on the data; hence, the data should be pre-processed with third-party algorithms such as DSP algorithms.

In neural network design, the engineer or designer chooses the network topology, the trigger (or performance) function, the learning rule, and the criteria for stopping the training phase. It is therefore quite difficult to determine the size and parameters of the network, as there is no rule or formula for doing so. The best we can do to succeed with a design is to experiment with it. The problem with this method is that when the system does not work properly, it is hard to refine the solution. Despite this issue, neural-network-based solutions are very efficient in terms of development time and resources. From experience, I can tell that artificial neural networks provide real solutions that are difficult to match with other technologies.

Fifteen years ago, Denker said that artificial neural networks "are the second best way to implement a solution", motivated by their simplicity, design and universality. Nowadays, neural network technologies are emerging as the technology of choice for many applications, such as pattern recognition, prediction, system identification and control.

    The Biological Model


Artificial neural networks were born after McCulloch and Pitts introduced a set of simplified neurons in 1943. These neurons were represented as models of biological networks, turned into conceptual components for circuits that could perform computational tasks. The basic model of the artificial neuron is founded upon the functionality of the biological neuron. By definition, neurons are basic signaling units of the nervous system of a living being, in which each neuron is a discrete cell whose several processes arise from its cell body.

The biological neuron has four main regions to its structure. The cell body, or soma, has two offshoots from it: the dendrites, and the axon, which ends in pre-synaptic terminals. The cell body is the heart of the cell. It contains the nucleus and maintains protein synthesis. A neuron has many dendrites, which look like a tree structure and receive signals from other neurons.

A single neuron usually has one axon, which expands off from a part of the cell body called the axon hillock. The axon's main purpose is to conduct electrical signals generated at the axon hillock down its length. These signals are called action potentials.

The other end of the axon may split into several branches, which end in pre-synaptic terminals. The electrical signals (action potentials) that the neurons use to convey the information of the brain are all identical. The brain can determine which type of information is being received based on the path of the signal.

The brain analyzes all patterns of signals sent, and from that information it interprets the type of information received. The myelin is a fatty tissue that insulates the axon. The non-insulated parts of the axon are called Nodes of Ranvier. At these nodes, the signal traveling down the axon is regenerated. This ensures that the signal travels down the axon quickly and at constant strength.

The synapse is the area of contact between two neurons. They do not physically touch, because they are separated by a cleft. The electric signals are sent through chemical interaction. The neuron sending the signal is called the pre-synaptic cell and the neuron receiving the signal is called the post-synaptic cell.

The electrical signals are generated by the membrane potential, which is based on the differences in concentration of sodium and potassium ions inside and outside the cell membrane.


Biological neurons can be classified by their function or by the quantity of processes they carry out. When classified by processes, they fall into three categories: unipolar neurons, bipolar neurons and multipolar neurons.

Unipolar neurons have a single process. Their dendrites and axon are located on the same stem. These neurons are found in invertebrates.

Bipolar neurons have two processes. Their dendrites and axon are on two separate processes.

Multipolar neurons are commonly found in mammals. Some examples of these neurons are spinal motor neurons, pyramidal cells and Purkinje cells.

When biological neurons are classified by function, they fall into three categories. The first group is sensory neurons. These neurons provide all information for perception and motor coordination. The second group provides information to muscles and glands; these are called motor neurons.

The last group, the interneurons, contains all other neurons and has two subclasses. One group, called relay or projection interneurons, is usually found in the brain and connects different parts of it. The other group, called local interneurons, is only used in local circuits.

    The Mathematical Model

When modeling an artificial functional model from the biological neuron, we must take into account three basic components. First, the synapses of the biological neuron are modeled as weights. Let's remember that the synapse of the biological neuron is what interconnects the neural network and gives the strength of the connection. For an artificial neuron, the weight is a number, and represents the synapse. A negative weight reflects an inhibitory connection, while positive values designate excitatory connections. The following components of the model represent the actual activity of the neuron cell. All inputs are summed together and modified by the weights. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be -1 and 1.

Mathematically, this process is described in the figure below.


From this model, the internal activity of the neuron can be shown to be:
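The equation itself was an image in the original and is missing here; reconstructed from the surrounding definitions (inputs x_j weighted by synaptic weights w_kj and summed), the standard form is

\[ v_k = \sum_{j=1}^{m} w_{kj}\, x_j \]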

The output of the neuron, y_k, would therefore be the outcome of some activation function applied to the value of v_k.

    Activation functions

As mentioned previously, the activation function acts as a squashing function, such that the output of a neuron in a neural network is between certain values (usually 0 and 1, or -1 and 1). In general, there are three types of activation functions, denoted by φ(·). First, there is the Threshold Function, which takes on a value of 0 if the summed input is less than a certain threshold value θ, and the value 1 if the summed input is greater than or equal to the threshold value.
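In symbols (a standard reconstruction, since the original equation image is missing):

\[ \varphi(v) = \begin{cases} 1 & \text{if } v \ge \theta \\ 0 & \text{if } v < \theta \end{cases} \]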

Secondly, there is the Piecewise-Linear Function. This function again can take on the values of 0 or 1, but can also take on values in between, depending on the amplification factor in a certain region of linear operation.

Thirdly, there is the Sigmoid Function. This function can range between 0 and 1, but it is also sometimes useful to use the -1 to 1 range. An example of a sigmoid function is the hyperbolic tangent function.

The artificial neural networks we describe are all variations on the parallel distributed processing (PDP) idea. The architecture of each neural network is based on very similar building blocks which perform the processing. In this chapter we first discuss these processing units, then discuss different neural network topologies and learning strategies as a basis for an adaptive system.


    A framework for distributed representation

An artificial neural network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections. A set of major aspects of a parallel distributed model can be distinguished (a code sketch follows this list):

a set of processing units ('neurons', 'cells');
a state of activation y_k for every unit, which is equivalent to the output of the unit;
connections between the units; generally, each connection is defined by a weight w_jk which determines the effect which the signal of unit j has on unit k;
a propagation rule, which determines the effective input s_k of a unit from its external inputs;
an activation function F_k, which determines the new level of activation based on the effective input s_k(t) and the current activation y_k(t) (i.e., the update);
an external input (aka bias, offset) θ_k for each unit;
a method for information gathering (the learning rule);
an environment within which the system must operate, providing input signals and, if necessary, error signals.
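Purely as an illustration, the components listed above map naturally onto code. The following Python sketch is not from the original text (the class and method names are my own), with the logistic function standing in as one possible choice of activation function F_k:

```python
import math

class Unit:
    """One processing unit in a parallel distributed model."""
    def __init__(self, n_inputs):
        self.weights = [0.0] * n_inputs  # w_jk: effect of each incoming unit j on this unit k
        self.bias = 0.0                  # external input (bias, offset) theta_k
        self.activation = 0.0            # state of activation y_k (the unit's output)

    def effective_input(self, inputs):
        # propagation rule: s_k = sum_j w_jk * y_j + theta_k
        return sum(w * y for w, y in zip(self.weights, inputs)) + self.bias

    def update(self, inputs):
        # activation function F_k: here a logistic squashing function
        self.activation = 1.0 / (1.0 + math.exp(-self.effective_input(inputs)))
        return self.activation
```

A learning rule would then adjust self.weights over time, driven by the input and error signals that the environment provides.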

    Processing units

Each unit performs a relatively simple job: receive input from neighbours or external sources and use this to compute an output signal which is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time. Within neural systems it is useful to distinguish three types of units: input units (indicated by an index i), which receive data from outside the neural network; output units (indicated by an index o), which send data out of the neural network; and hidden units (indicated by an index h), whose input and output signals remain within the neural network. During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time t, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.

    Neural Network topologies

In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data. As for this pattern of connections, the main distinction we can make is between:

Feed-forward neural networks, where the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple (layers of) units, but no feedback connections are present; that is, no connections extend from outputs of units to inputs of units in the same layer or previous layers.

Recurrent neural networks, which do contain feedback connections. Contrary to feed-forward networks, the dynamical properties of such a network are important. In some cases, the activation values of the units undergo a relaxation process such that the neural network will evolve to a stable state in which these activations do not change anymore. In other applications, the changes of the activation values of the output neurons are significant, such that the dynamical behaviour constitutes the output of the neural network (Pearlmutter, 1990).

Classical examples of feed-forward neural networks are the Perceptron and the Adaline. Examples of recurrent networks have been presented by Anderson (Anderson, 1977), Kohonen (Kohonen, 1977), and Hopfield (Hopfield, 1982).

Training of artificial neural networks

A neural network has to be configured such that the application of a set of inputs produces (either 'directly' or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule.

    We can categorise the learning situations in two distinct sorts. These are:

Supervised learning or associative learning, in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised).

Unsupervised learning or self-organisation, in which an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike in the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli.


Reinforcement learning, which may be considered an intermediate form of the above two types of learning. Here the learning machine performs some action on the environment and gets a feedback response from the environment. The learning system grades its action as good (rewarding) or bad (punishable) based on the environmental response, and adjusts its parameters accordingly. Generally, parameter adjustment is continued until an equilibrium state occurs, following which there will be no more changes in its parameters. Self-organizing neural learning may be categorized under this type of learning.

    Modifying patterns of connectivity of Neural Networks

Both learning paradigms, supervised and unsupervised learning, result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (1949) (Hebb, 1949). The basic idea is that if two units j and k are active simultaneously, their interconnection must be strengthened. If j receives input from k, the simplest version of Hebbian learning prescribes modifying the weight w_jk with

\[ \Delta w_{jk} = \gamma\, y_j\, y_k \]

where γ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit k but the difference between the actual and desired activation for adjusting the weights:

\[ \Delta w_{jk} = \gamma\, y_j\, (d_k - y_k) \]

in which d_k is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next chapter. Many variants (often very exotic ones) have been published in the last few years.

The field of neural networks started about 50 years ago. Its early abilities were exaggerated, casting doubts on the field as a whole. There is a recent renewed interest in the field, however, because of new techniques and a better theoretical understanding of its capabilities.

    Motivation for neural networks:

Scientists are challenged to use machines more effectively for tasks currently solved by humans.
Symbolic rules don't reflect the processes actually used by humans.
Traditional computing excels in many areas, but not in others.

    Types of Applications

    Machine learning:


Having a computer program itself from a set of examples, so you don't have to program it yourself. This will be a strong focus of this course: neural networks that learn from a set of examples.

Optimization: given a set of constraints and a cost function, how do you find an optimal solution? E.g. the traveling salesman problem.

Classification: grouping patterns into classes, e.g. handwritten characters into letters.

Associative memory: recalling a memory based on a partial match.

Regression: function mapping.

    Cognitive science:

Modelling higher-level reasoning: language, problem solving.

Modelling lower-level reasoning: vision, audition/speech recognition, speech generation.

Neurobiology: modelling how the brain works, at the neuron level and at higher levels (vision, hearing, etc.). Overlaps with cognitive science.

    Mathematics:

Nonparametric statistical analysis and regression.

Philosophy:

Can human souls/behavior be explained in terms of symbols, or does it require something lower-level, like a neurally based model?

    Where are neural networks being used?

Signal processing: suppressing line noise, adaptive echo canceling, blind source separation.
Control: e.g. backing up a truck (cab position, rear position, and match with the dock are converted to steering instructions); controlling automated machines in manufacturing plants. Siemens successfully uses neural networks for process automation in basic industries; e.g., in rolling mill control, more than 100 neural networks do their job, 24 hours a day.
Robotics: navigation, vision recognition.
Pattern recognition: e.g. recognizing handwritten characters; the current version of Apple's Newton uses a neural net.
Medicine: e.g. storing medical records based on case information.
Speech production: reading text aloud (NETtalk).


Speech recognition.
Vision: face recognition, edge detection, visual search engines.
Business: e.g. rules for mortgage decisions are extracted from past decisions made by experienced evaluators, resulting in a network that has a high level of agreement with human experts.
Financial applications: time series analysis, stock market prediction.
Data compression: speech signals, images (e.g. faces).
Game playing: backgammon, chess, go, ...

    Neural Networks in the Brain

The brain is not homogeneous. At the largest anatomical scale, we distinguish cortex, midbrain, brainstem, and cerebellum. Each of these can be hierarchically subdivided into many regions, and areas within each region, either according to the anatomical structure of the neural networks within it, or according to the function performed by them.

The overall pattern of projections (bundles of neural connections) between areas is extremely complex, and only partially known. The best mapped (and largest) system in the human brain is the visual system, where the first 10 or 11 processing stages have been identified. We distinguish feedforward projections that go from earlier processing stages (near the sensory input) to later ones (near the motor output), from feedback connections that go in the opposite direction.

In addition to these long-range connections, neurons also link up with many thousands of their neighbours. In this way they form very dense, complex local networks.

    Artificial Neuron Models

Computational neurobiologists have constructed very elaborate computer models of neurons in order to run detailed simulations of particular circuits in the brain. As computer scientists, we are more interested in the general properties of neural networks, independent of how they are actually "implemented" in the brain. This means that we can use much simpler, abstract "neurons", which (hopefully) capture the essence of neural computation even if they leave out much of the details of how biological neurons work.

People have implemented model neurons in hardware as electronic circuits, often integrated on VLSI chips. Remember, though, that computers run much faster than brains; we can therefore run fairly large networks of simple model neurons as software simulations in reasonable time. This has obvious advantages over having to use special "neural" computer hardware.

    A Simple Artificial Neuron

Our basic computational element (model neuron) is often called a node or unit. It receives input from some other units, or perhaps from an external source. Each input has an associated weight w, which can be modified so as to model synaptic learning. The unit computes some function f of the weighted sum of its inputs:

\[ y_i = f(\mathrm{net}_i) = f\Big(\sum_j w_{ij}\, y_j\Big) \]

    Its output, in turn, can serve as input to other units.

The weighted sum is called the net input to unit i, often written net_i. Note that w_ij refers to the weight from unit j to unit i (not the other way around). The function f is the unit's activation function. In the simplest case, f is the identity function, and the unit's output is just its net input. This is called a linear unit.
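As a minimal sketch (Python with numpy; the function name is mine, not the text's), such a unit looks like this:

```python
import numpy as np

def unit_output(weights, inputs, f=lambda net: net):
    """Output of a model neuron: f applied to the weighted sum of its inputs.

    weights[j] holds w_ij, the weight from unit j to this unit i.
    The default f is the identity, which makes this a linear unit.
    """
    net = np.dot(weights, inputs)  # net input: net_i = sum_j w_ij * y_j
    return f(net)

# A linear unit simply returns its net input:
print(unit_output(np.array([0.5, -0.2]), np.array([1.0, 2.0])))  # ~0.1
```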


    Neurons and Synapses

    The basic computational unit in the nervous system is the nerve cell, or neuron. A neuron has:

Dendrites (inputs)
Cell body
Axon (output)

A neuron receives input from other neurons (typically many thousands). Inputs sum (approximately). Once the input exceeds a critical level, the neuron discharges a spike: an electrical pulse that travels from the body, down the axon, to the next neuron(s) (or other receptors). This spiking event is also called depolarization, and is followed by a refractory period, during which the neuron is unable to fire.

The axon endings (output zone) almost touch the dendrites or cell body of the next neuron. Transmission of an electrical signal from one neuron to the next is effected by neurotransmitters, chemicals which are released from the first neuron and which bind to receptors in the second. This link is called a synapse. The extent to which the signal from one neuron is passed on to the next depends on many factors, e.g. the amount of neurotransmitter available, the number and arrangement of receptors, the amount of neurotransmitter reabsorbed, etc.


Synaptic Learning

Brains learn. Of course. From what we know of neuronal structures, one way brains learn is by altering the strengths of connections between neurons, and by adding or deleting connections between neurons. Furthermore, they learn "on-line", based on experience, and typically without the benefit of a benevolent teacher.

The efficacy of a synapse can change as a result of experience, providing both memory and learning through long-term potentiation. One way this happens is through the release of more neurotransmitter. Many other changes may also be involved.

    Long-term Potentiation:

An enduring (>1 hour) increase in synaptic efficacy that results from high-frequency stimulation of an afferent (input) pathway.

Hebb's Postulate:

    "When an axon of cell A... excites[s] cell B and repeatedly or

    persistently takes part in firing it, some growth process or

    metabolic change takes place in one or both cells so that A's

    efficiency as one of the cells firing B is increased."

    Bliss and Lomo discovered LTP in the hippocampus in1973


    Points to note about LTP:

Synapses become more or less important over time (plasticity).
LTP is based on experience.
LTP is based only on local information (Hebb's postulate).

    Summary

The following properties of nervous systems will be of particular interest in our neurally-inspired models:

parallel, distributed information processing
high degree of connectivity among basic units
connections are modifiable based on experience
learning is a constant process, and usually unsupervised
learning is based only on local information
performance degrades gracefully if some units are removed

    Examples of Activation Functions

    Identity Function

The identity function is given by f(x) = x.

    > f1 := proc(x) x; end;

    > plot(f1(x),x=-10..10);


    Step Function

The step function is f(x) = 1 if x >= 0 and f(x) = 0 if x < 0. This is also called the Heaviside function. Another common variation is for it to take on the values -1 and +1, as shown below.

    > f2 := proc(x) 2*Heaviside(x) -1 end;

    > plot(f2(x),x=-10..10);


    Logistic Function (Sigmoid)

The logistic function has the form f(x) = 1/(1 + exp(-a*x)).

    > f3 := proc(x,a) 1/(1 + exp(-a*x)) end;

    > plot(f3(x,1),x=-10..10);

    > with(plots):

    The parameter "a" in the logistic function determines how steep it is. The larger "a", the steeper itis.

    > display([plot(f3(x,1),x=-10..10),

    > plot(f3(x,2),x=-10..10, color = blue),

    > plot(f3(x,.5),x=-10..10, color = green)]);


    Symmetric Sigmoid

The symmetric sigmoid is simply the sigmoid stretched so that its y range is 2 and then shifted down by 1, so that it ranges between -1 and 1. If g(x) is the standard sigmoid, then the symmetric sigmoid is f(x) = 2 g(x) - 1.

    > f4 := proc(x) 2*f3(x,1)-1 end;

    > plot(f4(x),x=-10..10);


The symmetric sigmoid is the hyperbolic tangent with its argument rescaled by a constant factor: 2 g(x) - 1 = tanh(x/2). As you can see, the graph below is identical to the graph for the symmetric sigmoid.

    > plot(tanh(.5*x),x=-10..10);

    Radial Basis Functions

A radial basis function is simply a Gaussian, f(x) = exp(-a*x^2). It is called local because, unlike the previous functions, it is essentially zero everywhere except in a small region.

    > f5 := proc(x,a) exp(-a * x^2); end;

    > plot(f5(x,1),x=-10..10);


    Derivatives

The derivative of the identity function is just 1. That is, if f(x) is the identity, then f'(x) = 1.

The derivative of the step function is not defined at the step (and is zero everywhere else), which is exactly why it isn't used.

The nice feature of sigmoids is that their derivatives are easy to compute. If f(x) is the logistic function above, then f'(x) = a f(x) (1 - f(x)).

    > diff(f3(x,a),x);

> simplify(a*f3(x,a)*(1-f3(x,a)));


This is also true of the hyperbolic tangent. If f(x) = tanh(x), then f'(x) = 1 - tanh(x)^2.

    > diff(tanh(x),x);


    Linear Regression

    Fitting a Model to Data

Consider the data below (for more complete auto data, see the data description, raw data, and Maple plots):


    (Fig. 1)

Each dot in the figure provides information about the weight (x-axis, units: U.S. pounds) and fuel consumption (y-axis, units: miles per gallon) for one of 74 cars (data from 1979). Clearly weight and fuel consumption are linked, so that, in general, heavier cars use more fuel.

Now suppose we are given the weight of a 75th car, and asked to predict how much fuel it will use, based on the above data. Such questions can be answered by using a model - a short mathematical description - of the data (see also optical illusions). The simplest useful model here is of the form


    y = w1 x + w0 (1)

This is a linear model: in an xy-plot, equation 1 describes a straight line with slope w1 and intercept w0 with the y-axis, as shown in Fig. 2. (Note that we have rescaled the coordinate axes; this does not change the problem in any fundamental way.)

How do we choose the two parameters w0 and w1 of our model? Clearly, any straight line drawn somehow through the data could be used as a predictor, but some lines will do a better job than others. The line in Fig. 2 is certainly not a good model: for most cars, it will predict too much fuel consumption for a given weight.


    (Fig. 2)

The Loss Function

In order to make precise what we mean by being a "good predictor", we define a loss (also called objective or error) function E over the model parameters. A popular choice for E is the sum-squared error:


\[ E = \frac{1}{2} \sum_i \big(t_i - y_i\big)^2 \qquad (2) \]

In words, it is the sum over all points i in our data set of the squared difference between the target value t_i (here: actual fuel consumption) and the model's prediction y_i, calculated from the input value x_i (here: the weight of the car) by equation 1. For a linear model, the sum-squared error is a quadratic function of the model parameters. Figure 3 shows E for a range of values of w0 and w1. Figure 4 shows the same function as a contour plot.

    (Fig. 3)


    (Fig. 4)

Minimizing the Loss

The loss function E provides us with an objective measure of predictive error for a specific choice of model parameters. We can thus restate our goal of finding the best (linear) model as finding the values for the model parameters that minimize E.

For linear models, linear regression provides a direct way to compute these optimal model parameters. (See any statistics textbook for details.) However, this analytical approach does not generalize to nonlinear models (which we will get to by the end of this lecture). Even though the solution cannot be calculated explicitly in that case, the problem can still be solved by an iterative numerical technique called gradient descent. It works as follows:

1. Choose some (random) initial values for the model parameters.
2. Calculate the gradient G of the error function with respect to each model parameter.
3. Change the model parameters so that we move a short distance in the direction of the greatest rate of decrease of the error, i.e., in the direction of -G.
4. Repeat steps 2 and 3 until G gets close to zero.

How does this work? The gradient of E gives us the direction in which the loss function at the current setting of the weights w has the steepest slope. In order to decrease E, we take a small step in the opposite direction, -G (Fig. 5).

    (Fig. 5)

By repeating this over and over, we move "downhill" in E until we reach a minimum, where G = 0, so that no further progress is possible (Fig. 6).

    (Fig. 6)

    Fig. 7 shows the best linear model for our car data, found by this procedure.
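As a minimal sketch of this whole procedure in Python with numpy, for the linear model of equation 1: the data points, learning rate, and stopping threshold below are illustrative choices standing in for the actual 74-car data set, not the real values.

```python
import numpy as np

# illustrative stand-in data: car weight (rescaled) vs. miles per gallon
x = np.array([1.8, 2.1, 2.6, 3.0, 3.5, 4.1])
t = np.array([32.0, 30.0, 24.0, 21.0, 18.0, 15.0])

w1, w0 = np.random.randn(), np.random.randn()  # step 1: random initial parameters
eta = 0.01                                     # learning rate

for step in range(100000):
    y = w1 * x + w0                 # predictions of equation 1
    g1 = -np.sum((t - y) * x)       # step 2: gradient of E = 1/2 sum (t - y)^2
    g0 = -np.sum(t - y)
    w1 -= eta * g1                  # step 3: small step in the direction of -G
    w0 -= eta * g0
    if g1 ** 2 + g0 ** 2 < 1e-12:   # step 4: stop when G is close to zero
        break

print(f"best linear model: y = {w1:.2f} x + {w0:.2f}")
```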


    (Fig. 7)

It's a neural network!

Our linear model of equation 1 can in fact be implemented by the simple neural network shown in Fig. 8. It consists of a bias unit, an input unit, and a linear output unit. The input unit makes the external input x (here: the weight of a car) available to the network, while the bias unit always has a constant output of 1. The output unit computes the sum:


    y2 = y1 w21 + 1.0 w20 (3)

It is easy to see that this is equivalent to equation 1, with w21 implementing the slope of the straight line, and w20 its intercept with the y-axis.

    Linear Neural Networks

    Multiple regression

Our car example showed how we could discover an optimal linear function for predicting one variable (fuel consumption) from another (weight). Suppose now that we are also given one or more additional variables which could be useful as predictors. Our simple neural network model can easily be extended to this case by adding more input units (Fig. 1).

Similarly, we may want to predict more than one variable from the data that we're given. This can easily be accommodated by adding more output units (Fig. 2). The loss function for a network with multiple outputs is obtained simply by adding the loss for each output unit together. The network now has a typical layered structure: a layer of input units (and the bias), connected by a layer of weights to a layer of output units.


    (Fig. 1) (Fig. 2)

    Computing the gradient

In order to train neural networks such as the ones shown above by gradient descent, we need to be able to compute the gradient G of the loss function with respect to each weight w_ij of the network. It tells us how a small change in that weight will affect the overall error E. We begin by splitting the loss function into separate terms for each point p in the training data:

\[ E = \sum_p E^p, \qquad E^p = \frac{1}{2} \sum_o \big(t_o^p - y_o^p\big)^2 \qquad (1) \]

where o ranges over the output units of the network. (Note that we use the superscript p to denote the training point; this is not an exponentiation!) Since differentiation and summation are interchangeable, we can likewise split the gradient into separate components for each training point:

\[ \frac{\partial E}{\partial w_{ij}} = \sum_p \frac{\partial E^p}{\partial w_{ij}} \qquad (2) \]

In what follows, we describe the computation of the gradient for a single data point, omitting the superscript p in order to make the notation easier to follow.


First use the chain rule to decompose the gradient into two factors:

\[ \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial w_{ij}} \qquad (3) \]

The first factor can be obtained by differentiating Eqn. 1 above:

\[ \frac{\partial E}{\partial y_i} = -(t_i - y_i) \qquad (4) \]

Using y_i = \sum_j w_{ij} y_j, the second factor becomes

\[ \frac{\partial y_i}{\partial w_{ij}} = y_j \qquad (5) \]

Putting the pieces (equations 3-5) back together, we obtain

\[ \frac{\partial E}{\partial w_{ij}} = -(t_i - y_i)\, y_j \qquad (6) \]

To find the gradient G for the entire data set, we sum, at each weight, the contribution given by equation 6 over all the data points. We can then subtract a small proportion μ (called the learning rate) of G from the weights to perform gradient descent.

    The Gradient Descent Algorithm

1. Initialize all weights to small random values.
2. REPEAT until done:
   1. For each weight w_ij, set Δw_ij := 0.
   2. For each data point (x, t)^p:
      1. set the input units to x
      2. compute the value of the output units
      3. for each weight w_ij, set Δw_ij := Δw_ij + (t_i - y_i) y_j
   3. For each weight w_ij, set w_ij := w_ij + μ Δw_ij.

The algorithm terminates once we are at, or sufficiently near to, the minimum of the error function, where G = 0. We say then that the algorithm has converged.

In summary:

                        general case                              linear network
Training data           (x, t)                                    (x, t)
Model parameters        w                                         w
Model                   y = g(w, x)                               y_i = \sum_j w_{ij} y_j
Error function          E(y, t)                                   E = \frac{1}{2} \sum_i (t_i - y_i)^2
Gradient w.r.t. w_ij    \partial E / \partial w_{ij}              -(t_i - y_i)\, y_j
Weight update rule      \Delta w_{ij} = -\mu\, \partial E / \partial w_{ij}    \Delta w_{ij} = \mu\, (t_i - y_i)\, y_j

    The Learning Rate

    An important consideration is the learning rate , which determines by how much we change theweights w at each step. If is too small, the algorithm will take a long time to converge (Fig. 3).


    (Fig. 3)

Conversely, if μ is too large, we may end up bouncing around the error surface out of control; the algorithm diverges (Fig. 4). This usually ends with an overflow error in the computer's floating-point arithmetic.

    (Fig. 4)

    Batch vs. Online Learning

Above we have accumulated the gradient contributions for all data points in the training set before updating the weights. This method is often referred to as batch learning. An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient G (Fig. 5), this is also called stochastic (noisy) gradient descent.


    (Fig. 5)

    Online learning has a number of advantages:

it is often much faster, especially when the training set is redundant (contains many similar data points);
it can be used when there is no fixed training set (new data keeps coming in);
it is better at tracking nonstationary environments (where the best model gradually changes over time);
the noise in the gradient can help to escape from local minima (which are a problem for gradient descent in nonlinear models).

These advantages are, however, bought at a price: many powerful optimization techniques (such as conjugate and second-order gradient methods, support vector machines, Bayesian methods, etc.) - which we will not talk about in this course! - are batch methods that cannot be used online. (Of course this also means that in order to implement batch learning really well, one has to learn an awful lot about these rather complicated methods!)

    A compromise between batch and online learning is the use of "mini-batches": the weights areupdated after every n data points, where n is greater than 1 but smaller than the training set size.
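The three regimes differ only in where the weight update happens. A schematic Python sketch, assuming a hypothetical function grad(w, x, t) that returns the gradient for a single data point (all names here are illustrative, not from the text):

```python
import random

def batch_step(w, data, grad, eta):
    # batch: accumulate the gradient over the whole training set, update once
    g = sum(grad(w, x, t) for x, t in data)
    return w - eta * g

def online_step(w, data, grad, eta):
    # online (stochastic): update immediately after each data point
    random.shuffle(data)
    for x, t in data:
        w = w - eta * grad(w, x, t)
    return w

def minibatch_step(w, data, grad, eta, n=32):
    # compromise: update after every n data points
    random.shuffle(data)
    for i in range(0, len(data), n):
        w = w - eta * sum(grad(w, x, t) for x, t in data[i:i + n])
    return w
```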

    Multi-layer networks

    A nonlinear problem

Consider again the best linear fit we found for the car data. Notice that the data points are not evenly distributed around the line: for low weights, we see more miles per gallon than our model predicts. In fact, it looks as if a simple curve might fit these data better than the straight line. We can enable our neural network to do such curve fitting by giving it an additional node which has a suitably curved (nonlinear) activation function. A useful function for this purpose is the S-shaped hyperbolic tangent (tanh) function (Fig. 1).


    (Fig. 1) (Fig. 2)

Fig. 2 shows our new network: an extra node (unit 2) with a tanh activation function has been inserted between input and output. Since such a node is "hidden" inside the network, it is commonly called a hidden unit. Note that the hidden unit also has a weight from the bias unit. In general, all non-input neural network units have such a bias weight. For simplicity, the bias unit and weights are usually omitted from neural network diagrams; unless it's explicitly stated otherwise, you should always assume that they are there.


    (Fig. 3)

When this network is trained by gradient descent on the car data, it learns to fit the tanh function to the data (Fig. 3). Each of the four weights in the network plays a particular role in this process: the two bias weights shift the tanh function in the x- and y-directions, respectively, while the other two weights scale it along those two directions. Fig. 2 gives the weight values that produced the solution shown in Fig. 3.

    Hidden Layers

One can argue that in the example above we have cheated by picking a hidden unit activation function that could fit the data well. What would we do if the data looked like this (Fig. 4)?


    (Fig. 4)

(Relative concentration of NO and NO2 in exhaust fumes as a function of the richness of the ethanol/air mixture burned in a car engine.)

Obviously the tanh function can't fit this data at all. We could cook up a special activation function for each data set we encounter, but that would defeat our purpose of learning to model the data. We would like to have a general, nonlinear function approximation method which would allow us to fit any given data set, no matter what it looks like.

    (Fig. 5)

Fortunately there is a very simple solution: add more hidden units! In fact, a network with just two hidden units using the tanh function (Fig. 5) can fit the data in Fig. 4 quite well - can you see how? The fit can be further improved by adding yet more units to the hidden layer. Note, however, that having too large a hidden layer - or too many hidden layers - can degrade the network's performance (more on this later). In general, one shouldn't use more hidden units than necessary to solve a given problem. (One way to ensure this is to start training with a very small network. If gradient descent fails to find a satisfactory solution, grow the network by adding a hidden unit, and repeat.)

Theoretical results indicate that, given enough hidden units, a network like the one in Fig. 5 can approximate any reasonable function to any required degree of accuracy. In other words, any function can be expressed as a linear combination of tanh functions: tanh is a universal basis function. Many functions form a universal basis; the two classes of activation functions commonly used in neural networks are the sigmoidal (S-shaped) basis functions (to which tanh belongs), and the radial basis functions.

    Error Backpropagation

We have already seen how to train linear networks by gradient descent. In trying to do the same for multi-layer networks, we encounter a difficulty: we don't have any target values for the hidden units. This seems to be an insurmountable problem - how could we tell the hidden units just what to do? This unsolved question was in fact the reason why neural networks fell out of favor after an initial period of high popularity in the 1950s. It took 30 years before the error backpropagation (or, in short, backprop) algorithm popularized a way to train hidden units, leading to a new wave of neural network research and applications.

    (Fig. 1)

In principle, backprop provides a way to train networks with any number of hidden units arranged in any number of layers. (There are clear practical limits, which we will discuss later.) In fact, the network does not have to be organized in layers - any pattern of connectivity that permits a partial ordering of the nodes from input to output is allowed. In other words, there must be a way to order the units such that all connections go from "earlier" (closer to the input) to "later" ones (closer to the output). This is equivalent to stating that their connection pattern must not contain any cycles. Networks that respect this constraint are called feedforward networks; their connection pattern forms a directed acyclic graph or DAG.


    The Algorithm

We want to train a multi-layer feedforward network by gradient descent to approximate an unknown function, based on some training data consisting of pairs (x, t). The vector x represents a pattern of input to the network, and the vector t the corresponding target (desired output). As we have seen before, the overall gradient with respect to the entire training set is just the sum of the gradients for each pattern; in what follows we will therefore describe how to compute the gradient for just a single training pattern. As before, we will number the units, and denote the weight from unit j to unit i by w_ij.

1. Definitions:
   o the error signal for unit j: \delta_j = -\partial E / \partial \mathrm{net}_j
   o the (negative) gradient for weight w_ij: \Delta w_{ij} = -\partial E / \partial w_{ij}
   o the set of nodes anterior to unit i: A_i = \{ j : \exists\, w_{ij} \}
   o the set of nodes posterior to unit j: P_j = \{ i : \exists\, w_{ij} \}

2. The gradient. As we did for linear networks before, we expand the gradient into two factors by use of the chain rule:

\[ \Delta w_{ij} = -\frac{\partial E}{\partial \mathrm{net}_i}\,\frac{\partial \mathrm{net}_i}{\partial w_{ij}} \]

The first factor is the error of unit i. The second is

\[ \frac{\partial \mathrm{net}_i}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_{k \in A_i} w_{ik}\, y_k = y_j \]

Putting the two together, we get

\[ \Delta w_{ij} = \delta_i\, y_j \]

To compute this gradient, we thus need to know the activity and the error for all relevant nodes in the network.


3. Forward activation. The activity of the input units is determined by the network's external input x. For all other units, the activity is propagated forward:

\[ y_i = f_i\Big( \sum_{j \in A_i} w_{ij}\, y_j \Big) \]

Note that before the activity of unit i can be calculated, the activity of all its anterior nodes (forming the set A_i) must be known. Since feedforward networks do not contain cycles, there is an ordering of nodes from input to output that respects this condition.

4. Calculating output error. Assuming that we are using the sum-squared loss

\[ E = \frac{1}{2} \sum_o (t_o - y_o)^2 \]

the error for output unit o is simply

\[ \delta_o = t_o - y_o \]

5. Error backpropagation. For hidden units, we must propagate the error back from the output nodes (hence the name of the algorithm). Again using the chain rule, we can expand the error of a hidden unit in terms of its posterior nodes:

\[ \delta_j = \sum_{i \in P_j} \delta_i\, \frac{\partial \mathrm{net}_i}{\partial y_j}\, \frac{\partial y_j}{\partial \mathrm{net}_j} \]

Of the three factors inside the sum, the first is just the error of node i. The second is

\[ \frac{\partial \mathrm{net}_i}{\partial y_j} = w_{ij} \]

while the third is the derivative of node j's activation function:

\[ \frac{\partial y_j}{\partial \mathrm{net}_j} = f_j'(\mathrm{net}_j) \]

For hidden units h that use the tanh activation function, we can make use of the special identity tanh(u)' = 1 - tanh(u)^2, giving us f_h'(net_h) = 1 - y_h^2.


Putting all the pieces together, we get

\[ \delta_j = f_j'(\mathrm{net}_j) \sum_{i \in P_j} \delta_i\, w_{ij} \]

Note that in order to calculate the error for unit j, we must first know the error of all its posterior nodes (forming the set P_j). Again, as long as there are no cycles in the network, there is an ordering of nodes from the output back to the input that respects this condition. For example, we can simply use the reverse of the order in which activity was propagated forward.

    Matrix Form

For layered feedforward networks that are fully connected - that is, each node in a given layer connects to every node in the next layer - it is often more convenient to write the backprop algorithm in matrix notation rather than the more general graph form given above. In this notation, the bias weights, net inputs, activations, and error signals for all units in a layer are combined into vectors, while all the non-bias weights from one layer to the next form a matrix W. Layers are numbered from 0 (the input layer) to L (the output layer). The backprop algorithm then looks as follows:

1. Initialize the input layer: y_0 = x.

2. Propagate activity forward: for l = 1, 2, ..., L,

\[ y_l = f_l\big(W_l\, y_{l-1} + b_l\big) \]

where b_l is the vector of bias weights.

3. Calculate the error in the output layer: \delta_L = t - y_L.

4. Backpropagate the error: for l = L-1, L-2, ..., 1,

\[ \delta_l = f_l'(\mathrm{net}_l)\,\odot\,\big(W_{l+1}^T\, \delta_{l+1}\big) \]

where T is the matrix transposition operator and \odot denotes elementwise multiplication.

  • 8/2/2019 Soft Com Putting

    42/64

5. Update the weights and biases:

\[ \Delta W_l = \mu\, \delta_l\, y_{l-1}^T, \qquad \Delta b_l = \mu\, \delta_l \]

You can see that this notation is significantly more compact than the graph form, even though it describes exactly the same sequence of operations.
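As a hedged illustration, here is how the five steps might look in Python with numpy, for a fully connected network with tanh hidden layers and a linear output layer (so the output error is simply t - y_L). The function name, initialization, and learning rate are my own choices, not from the text:

```python
import numpy as np

def backprop_step(x, t, W, b, eta=0.1):
    """One gradient step; W[l], b[l] connect layer l to layer l+1 (l = 0..L-1)."""
    L = len(W)
    y = [x]                                    # step 1: initialize the input layer, y_0 = x
    for l in range(L):                         # step 2: propagate activity forward
        net = W[l] @ y[-1] + b[l]
        y.append(net if l == L - 1 else np.tanh(net))

    delta = t - y[-1]                          # step 3: error in the (linear) output layer
    for l in reversed(range(L)):
        # step 4: backpropagate through W[l] before it is updated; tanh'(net) = 1 - y^2
        prev = (W[l].T @ delta) * (1 - y[l] ** 2) if l > 0 else None
        W[l] += eta * np.outer(delta, y[l])    # step 5: Delta W_l = eta * delta_l * y_{l-1}^T
        b[l] += eta * delta                    #         Delta b_l = eta * delta_l
        delta = prev

# illustrative use: 1 input -> 2 tanh hidden units -> 1 linear output
rng = np.random.default_rng(0)
W = [rng.normal(size=(2, 1)), rng.normal(size=(1, 2))]
b = [np.zeros(2), np.zeros(1)]
for _ in range(1000):
    backprop_step(np.array([0.5]), np.array([1.0]), W, b)
```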


    Classification

Pattern Classification and Single-Layer Networks

    Intro

We have just seen how a network can be trained to perform linear regression. That is, given a set of inputs (x) and output/target values (y), the network finds the best linear mapping from x to y. Given an x value that we have not seen, our trained network can predict what the most likely y value will be. The ability to (correctly) predict the output for an input the network has not seen is called generalization.

This style of learning is referred to as supervised learning (or learning with a teacher) because we are given the target values. Later we will see examples of unsupervised learning, which is used for finding patterns in the data rather than modeling input/output mappings.

We now step away from linear regression for a moment and look at another type of supervised learning problem called pattern classification. We start by considering only single-layer networks.

Pattern classification

A classic example of pattern classification is letter recognition. We are given, for example, a set of pixel values associated with an image of a letter. We want the computer to determine what letter it is. The pixel values are referred to as the inputs or the decision variables, and the letter categories are referred to as classes.


Now, a given letter such as "A" can look quite different depending on the type of font that is used or, in the case of handwritten letters, different people's handwriting. Thus, there will be a range of values for the decision variables that map to the same class. That is, if we plot the values of the decision variables, different regions will correspond to different classes.

    Example 1:

    Two Classes (class 0 and class 1), Two Inputs (x1 and x2).

    Example 2:

Another example (see data description, data, Maple plots): class = types of iris; decision variables = sepal and petal sizes.

    Example 3:

Example of zipcode digits in Maple.

Single-Layer Networks for Pattern Classification

We can apply a similar approach as in linear regression, where the targets are now the classes. Note that the outputs are no longer continuous but rather take on discrete values.


    Two Classes:

What does the network look like? If there are just 2 classes, we only need 1 output node. The target is 1 if the example is in, say, class 1, and the target is 0 (or -1) if the example is in class 0. It seems reasonable to use a binary step function to guarantee an appropriate output value.

    Training Methods:

We will discuss two kinds of methods for training single-layer networks that do pattern classification:

the Perceptron - guaranteed to find the right weights if they exist;
the Adaline (uses the Delta Rule) - can easily be generalized to multi-layer nets (nonlinear problems).

But how do we know if the right weights exist at all?

Let's look at what a single-layer architecture can do...

    Single Layer with a Binary Step Function

    Consider a network with 2 inputs and 1 output node (2 classes).

The net input of the network is a linear function of the weights and the inputs:

net = W X = x1 w1 + x2 w2

    y = f(net)


Setting x1 w1 + x2 w2 = 0 defines a straight line through the input space:

x2 = - (w1/w2) x1


    Linear Separability

Classification problems for which there is a line that exactly separates the classes are called linearly separable. Single-layer networks are only able to solve linearly separable problems. Most real-world problems are not linearly separable.

    The Perceptron

    The perceptron learning rule is a method for finding the weights in a network.


We consider the problem of supervised learning for classification, although other types of problems can also be solved. A nice feature of the perceptron learning rule is that if there exists a set of weights that solve the problem, then the perceptron will find these weights. This is true for either binary or bipolar representations.

    Assumptions:

We have a single-layer network whose output is, as before, output = f(net) = f(W X), where f is a binary step function whose values are ±1.

We assume that the bias is treated as just an extra input whose value is 1; p = number of training examples (x, t), where t = +1 or -1.

    Geometric Interpretation:

With this binary function f, the problem reduces to finding weights such that

sign(W X) = t

That is, the weights must be chosen so that the projection of pattern X onto W has the same sign as the target t. The boundary between positive and negative projections is just the plane W X = 0, i.e. the same decision boundary we saw before.

    The Perceptron Algorithm

1. Initialize the weights (either to zero or to a small random value).
2. Pick a learning rate α (a number between 0 and 1).
3. Until the stopping condition is satisfied (e.g. the weights don't change), for each training pattern (x, t):

   compute the output activation y = f(w x)
   if y = t, don't change the weights
   if y != t, update the weights:

   w(new) = w(old) + 2 α t x

   or, equivalently,

   w(new) = w(old) + α (t - y) x, which covers both cases (when y = t, the change is zero).

Consider what happens below when training pattern p1 or p2 is chosen. Before updating the weight W, we note that both p1 and p2 are incorrectly classified (the red dashed line is the decision boundary). Suppose we choose p1 to update the weights, as in the picture below on the left. p1 has target value t = 1, so the weight is moved a small amount in the direction of p1. Suppose we choose p2 to update the weights. p2 has target value t = -1, so the weight is moved a small amount in the direction of -p2. In either case, the new boundary (blue dashed line) is better than before.
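A direct Python transcription of the algorithm above (numpy assumed; the bias is handled as an extra input fixed at 1, and the AND data set at the end is just an illustrative linearly separable problem, not from the text):

```python
import numpy as np

def perceptron(patterns, targets, alpha=0.1, max_epochs=100):
    """Perceptron learning with bipolar (+1/-1) targets and a step activation."""
    X = np.hstack([patterns, np.ones((len(patterns), 1))])  # bias as extra input of 1
    w = np.zeros(X.shape[1])                                # 1. initialize the weights
    for _ in range(max_epochs):                             # 3. until weights stop changing
        changed = False
        for x, t in zip(X, targets):
            y = 1 if w @ x >= 0 else -1                     # y = f(w x), binary step
            if y != t:                                      # update only on mistakes
                w += alpha * (t - y) * x                    # w(new) = w(old) + alpha (t - y) x
                changed = True
        if not changed:
            break
    return w

# the AND function with bipolar inputs and targets is linearly separable:
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
t = np.array([-1, -1, -1, 1])
print(perceptron(X, t))
```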


    Comments on Perceptron

The choice of learning rate does not matter, because it just changes the scaling of w.

The decision surface (for 2 inputs and one bias) has the equation:

x2 = - (w1/w2) x1 - w3/w2

where we have defined w3 to be the bias: W = (w1, w2, b) = (w1, w2, w3). From this we see that the equation remains the same if W is scaled by a constant.

Outline of the convergence proof: find a lower bound L(k) for |w|^2 as a function of iteration k. Then find an upper bound U(k) for |w|^2. Then show that the lower bound grows at a faster rate than the upper bound. Since the lower bound can't be larger than the upper bound, there must be a finite k at which the weight is no longer updated. However, this can only happen if all patterns are correctly classified.

    Perceptron Decision Boundaries

    Delta Rule

    Also known by the names:

the Adaline rule
the Widrow-Hoff rule
the Least Mean Squares (LMS) rule

    Change from Perceptron:

Replace the step function in the perceptron with a continuous (differentiable) activation function, e.g. a linear one.
For classification problems, use the step function only to determine the class, and not to update the weights.

Note: this is the same algorithm we saw for regression. All that really differs is how the classes are determined.


    Delta Rule: Training by Gradient Descent Revisited

Construct a cost function E that measures how well the network has learned. For example (one output node):

\[ E = \frac{1}{2} \sum_{i=1}^{n} (t_i - y_i)^2 \]

where

n = number of examples
t_i = desired target value associated with the i-th example
y_i = output of the network when the i-th input pattern is presented to the network

To train the network, we adjust the weights in the network so as to decrease the cost (this is where we require differentiability). This is called gradient descent.


    Algorithm

Initialize the weights with some small random values.
Until E is within the desired tolerance, update the weights according to

\[ W(\text{new}) = W(\text{old}) - \mu\, \nabla E \]

where \nabla E is evaluated at W(old), μ is the learning rate, and the gradient is

\[ \nabla E = -\sum_{i=1}^{n} (t_i - y_i)\, x_i \]
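A minimal Adaline sketch in Python with numpy (the learning rate and epoch count are illustrative choices; eta plays the role of μ above, the linear activation is used for the weight updates, and the step function only for reading off the class):

```python
import numpy as np

def adaline(X, t, eta=0.01, epochs=200):
    """Delta-rule (LMS) training of a single linear unit."""
    X = np.hstack([X, np.ones((len(X), 1))])  # bias as an extra input of 1
    w = 0.01 * np.random.randn(X.shape[1])    # small random initial weights
    for _ in range(epochs):
        y = X @ w                             # linear output, used for training
        w += eta * (X.T @ (t - y))            # W(new) = W(old) - eta * grad E
    return w

def classify(X, w):
    # the step function is used only to determine the class
    X = np.hstack([X, np.ones((len(X), 1))])
    return np.where(X @ w >= 0, 1, -1)
```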

    More than Two Classes.

If there are more than 2 classes, we could still use the same network, but instead of having a binary target we could let the target take on discrete values. For example, if there are 5 classes, we could have t = 1, 2, 3, 4, 5 or t = -2, -1, 0, 1, 2. It turns out, however, that the network has a much easier time if we have one output per class. We can think of each output node as trying to solve a binary problem (the example either is in the given class or it isn't).

  • 8/2/2019 Soft Com Putting

    52/64

Two-Layer Net: The above is not the most general region; here we have assumed that the top layer computes an AND function.

Problem: In general, for the 2- and 3-layer cases there is no simple way to determine the weights.

    Delta Rule

    Also known by the names:

Adaline Rule
Widrow-Hoff Rule
Least Mean Squares (LMS) Rule

Change from Perceptron:

Replace the step function with a continuous (differentiable) activation function, e.g. a linear function.
For classification problems, use the step function only to determine the class, not to update the weights.

Note: this is the same algorithm we saw for regression. All that really differs is how the classes are determined.

    Delta Rule: Training by Gradient Descent Revisited

Construct a cost function E that measures how well the network has learned. For example, with one output node we can use the squared error

E = (1/2) Σ_{i=1..n} (t_i − y_i)^2

    where


n = number of examples

t_i = desired target value associated with the i-th example

y_i = output of the network when the i-th input pattern is presented

To train the network, we adjust the weights so as to decrease the cost (this is where we require differentiability). This is called gradient descent.

    Algorithm

Initialize the weights with some small random value.
Until E is within the desired tolerance, update the weights according to

W(new) = W(old) − η ∇E

where the gradient ∇E is evaluated at W(old) and η is the learning rate. For a linear network y = W · x with the cost above, the gradient is

∇E = ∂E/∂W = − Σ_i (t_i − y_i) x_i
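Putting the pieces together, here is a minimal NumPy sketch of delta-rule training by gradient descent. The tolerance, learning rate, and data are hypothetical placeholders:

```python
import numpy as np

def delta_rule_train(X, t, eta=0.01, tol=1e-4, max_epochs=10000):
    """Gradient descent on E = 0.5 * sum_i (t_i - y_i)^2 for a linear unit y = w . x."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # fold the bias into the weights
    w = 0.01 * np.random.randn(X.shape[1])        # small random initial weights
    for _ in range(max_epochs):
        y = X @ w                                 # linear activation
        E = 0.5 * np.sum((t - y) ** 2)            # cost
        if E < tol:                               # stop when E is within tolerance
            break
        grad = -X.T @ (t - y)                     # gradient of E w.r.t. w
        w -= eta * grad                           # w(new) = w(old) - eta * grad
    return w

# Hypothetical usage for classification: train with -1/+1 targets, then use the
# step function only to read off the class (not to update the weights).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w = delta_rule_train(X, t)
classes = np.where(np.hstack([X, np.ones((4, 1))]) @ w >= 0, 1, -1)
```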

    More than Two Classes.

If there are more than 2 classes we could still use the same network, but instead of having a binary target we can let the target take on discrete values. For example, if there are 5 classes, we could have t = 1, 2, 3, 4, 5 or t = −2, −1, 0, 1, 2. It turns out, however, that the network has a much easier time if we have one output per class. We can think of each output node as trying to solve a binary problem (the pattern is either in the given class or it isn't). A sketch of this encoding follows below.
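For instance, a small sketch of the one-output-per-class encoding (the class labels and counts here are hypothetical):

```python
import numpy as np

def one_per_class_targets(labels, num_classes):
    """Encode integer class labels as one output per class:
    1 for the correct class, 0 for all others."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

labels = np.array([0, 3, 1, 4])        # classes for four hypothetical examples
T = one_per_class_targets(labels, 5)   # each row is a 5-node target vector
```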


    Doing Classification Correctly

    The Old Way

When there are more than 2 classes, so far we have suggested doing the following:

Assign one output node to each class.
Set the target value of each node to be 1 if it is the correct class and 0 otherwise.
Use a linear network with a mean squared error function.
Determine the network's class prediction by picking the output node with the largest value.

There are problems with this method. First, there is a disconnect between the definition of the error function and the determination of the class: a minimum error does not necessarily produce the network with the largest number of correct predictions.


By varying the above method a little bit we can remove this inconsistency. Let us start by changing the interpretation of the output:

    The New Way

New Interpretation: The output y_i is interpreted as the probability that i is the correct class. This means that:

The output of each node must be between 0 and 1.
The sum of the outputs over all nodes must be equal to 1.

    How do we achieve this? There are several things to vary.

We can vary the activation function, for example by using a sigmoid. Sigmoids range continuously between 0 and 1. Is a sigmoid a good choice?

We can vary the cost function. We need not use mean squared error (MSE). What are our other options?

To decide, let's start by thinking about what makes sense intuitively. With a linear network using gradient descent on an MSE cost function, we found that the weight updates were proportional to the error (t − y). This seems to make sense. If we use a sigmoid activation function, we obtain a more complicated update, proportional to (t − y) y (1 − y) x.

    See derivatives of activation functions to see where this comes from.

This is not quite what we want. It turns out that there is a better error function/activation function combination that gives us what we want.

    Error Function:

Cross entropy is defined as

E = − Σ_{i=1..c} t_i ln(y_i)

where c is the number of classes (i.e. the number of output nodes).


This equation comes from information theory and is often applied when the outputs (y) are interpreted as probabilities. We won't worry about where it comes from, but let's see if it makes sense for certain special cases.

Suppose the network is trained perfectly, so that the targets exactly match the network output, and suppose class 3 is the correct class. This means that the output of node 3 is 1 (i.e. the probability is 1 that class 3 is correct) and the outputs of the other nodes are 0 (i.e. the probability is 0 that any class != 3 is correct). In this case the above equation evaluates to 0, as desired.

Suppose instead the network gives an output of y = 0.5 for every output node, i.e. there is complete uncertainty about which is the correct class. It turns out that E takes its maximum value in this case.

Thus, the more uncertain the network is, the larger the error E. This is as it should be.

Activation function:

Softmax is defined as

y_i = f_i = exp(net_i) / Σ_{j=1..c} exp(net_j)

where f_i is the activation function of the i-th output node, net_i is the net input to that node, and c is the number of classes.

    Note that this has the following good properties:

it is always a number between 0 and 1
when combined with the cross-entropy error function, it gives a weight update proportional to (t − y) (see the sketch below)
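A small NumPy sketch of this combination. The net inputs and target are hypothetical, and the final loop checks numerically that the gradient of the cross entropy with respect to the net inputs is exactly (y − t):

```python
import numpy as np

def softmax(net):
    net = net - net.max()              # shift for numerical stability
    e = np.exp(net)
    return e / e.sum()

def cross_entropy(t, y):
    return -np.sum(t * np.log(y))

net = np.array([2.0, -1.0, 0.5])       # hypothetical net inputs, 3 classes
t = np.array([1.0, 0.0, 0.0])          # one output per class: class 1 is correct
y = softmax(net)
E = cross_entropy(t, y)

# Analytic gradient dE/dnet_i = y_i - t_i; verify by central finite differences.
eps = 1e-6
for i in range(3):
    d = np.zeros(3); d[i] = eps
    num = (cross_entropy(t, softmax(net + d)) -
           cross_entropy(t, softmax(net - d))) / (2 * eps)
    assert abs(num - (y[i] - t[i])) < 1e-5
```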

Optimal Weights and Learning Rates for Linear Networks

    Regression Revisited

Suppose we are given a set of data (x(1), y(1)), (x(2), y(2)), ..., (x(p), y(p)), sampled from some unknown function y = g(x).


If we assume that g is linear, then finding the best line that fits the data (linear regression) can be done algebraically. The solution is based on minimizing the squared error (cost) between the network output and the data:

E = Σ_{i=1..p} (y(i) − w x(i))^2

where the network output is y = w x.

    Finding the best set of weights

1-input, 1 output, 1 weight

For a single weight the cost is E(w) = Σ_{i=1..p} (y(i) − w x(i))^2. The derivative of E is zero at the minimum, so we can solve for w_opt:

dE/dw = −2 Σ_i (y(i) − w x(i)) x(i) = 0  =>  w_opt = (Σ_i x(i) y(i)) / (Σ_i x(i)^2)

    n-inputs, m outputs: nm weights

The same analysis can be done in the multi-dimensional case, except that now everything becomes matrices:


w_opt = P H^{-1}

where w_opt is an m×n matrix, H = Σ_i x(i) x(i)^T is an n×n matrix, and P = Σ_i y(i) x(i)^T is an m×n matrix.

Matrix inversion is an expensive operation. Also, if the input dimension n is very large, then H is huge and may not even be possible to compute. If we are not able to compute the inverse Hessian, or if we don't want to spend the time, then we can use gradient descent.
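For small problems the algebraic solution above can be computed directly; a minimal NumPy sketch (the data shapes and values are hypothetical, and np.linalg.solve is used instead of forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, m = 200, 3, 2                    # examples, inputs, outputs (hypothetical)
X = rng.normal(size=(p, n))            # inputs x(i), one per row
W_true = rng.normal(size=(m, n))       # "unknown" weights generating the data
Y = X @ W_true.T + 0.01 * rng.normal(size=(p, m))  # slightly noisy targets

H = X.T @ X                            # n x n correlation matrix (Hessian)
P = Y.T @ X                            # m x n cross-correlation matrix
# w_opt = P H^{-1}; solve H W^T = P^T rather than inverting H explicitly.
W_opt = np.linalg.solve(H, P.T).T
assert np.allclose(W_opt, W_true, atol=0.05)
```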

    Gradient Descent: Picking the Best Learning Rate

For linear networks E is quadratic, so we can write

E(w) = E(w0) + E'(w0) (w − w0) + (1/2) (w − w0)^T H (w − w0)

and this expression is exact. But this is just a Taylor series expansion of E(w) about w0. Now, suppose we want to determine the optimal weight w_opt. We can differentiate E(w) and evaluate the result at w_opt, noting that E'(w_opt) is zero:

E'(w_opt) = E'(w0) + H (w_opt − w0) = 0


Solving for w_opt we obtain:

w_opt = w0 − H^{-1} E'(w0)

Comparing this to the gradient descent update equation w(new) = w(old) − η E'(w(old)), we find that the learning "rate" that takes us directly to the minimum is the inverse Hessian, which is a matrix and not a scalar. Why do we need a matrix?
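Since E is exactly quadratic for a linear network, a single step with the inverse Hessian as the "learning rate" lands on the minimum. A tiny sketch (the quadratic here is hypothetical):

```python
import numpy as np

# A hypothetical quadratic cost E(w) = 0.5 * w^T H w - b^T w, minimized at H^{-1} b.
H = np.array([[4.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda w: H @ w - b             # E'(w)

w0 = np.array([5.0, -3.0])             # arbitrary starting point
w1 = w0 - np.linalg.solve(H, grad(w0)) # one step with "rate" H^{-1}
assert np.allclose(grad(w1), 0.0)      # gradient vanishes: we are at the minimum
```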

    2-D Example


Curvature axes aligned with the coordinate axes: we can use a separate learning rate for each weight,

Δw1 = −η1 ∂E/∂w1
Δw2 = −η2 ∂E/∂w2

or in matrix form:

η = diag(η1, η2)

η1 and η2 are inversely related to the size of the curvature along each axis. Using the above learning rate matrix has the effect of scaling the gradient differently along each axis, to make the surface "look" spherical.

If the curvature axes are not aligned with the coordinate axes, then we need a full matrix of learning rates. This matrix is just the inverse Hessian. In general, H^{-1} is not diagonal. We can obtain the curvature along each axis, however, by computing the eigenvalues of H. Anyone remember what eigenvalues are??

    Taking a Step Back

We have been spending a lot of time on some pretty tough math. Why? Because training a network can take a long time if you just blindly apply the basic algorithms. There are techniques that can improve the rate of convergence by orders of magnitude. However, understanding these techniques requires a deep understanding of the underlying characteristics of the problem (i.e. the mathematics). Knowing which speed-up techniques to apply can make the difference between a net that takes 100 iterations to train and one that takes 10,000 (assuming it trains at all).

The previous slides are trying to make the following points for linear networks (i.e. those networks whose cost function is a quadratic function of the weights):

1. The shape of the cost surface has a significant effect on how fast a net can learn. Ideally, we want a spherically symmetric surface.
2. The correlation matrix is defined as the average over all inputs of x x^T.
3. The Hessian is the second derivative of E with respect to w. For linear nets, the Hessian is the same as the correlation matrix.
4. The Hessian tells you about the shape of the cost surface.
5. The eigenvalues of H are a measure of the steepness of the surface along the curvature directions.
6. A large eigenvalue => steep curvature => need a small learning rate.
7. The learning rate should be proportional to 1/eigenvalue.
8. If we are forced to use a single learning rate for all weights, then we must use a learning rate that will not cause divergence along the steep directions (large-eigenvalue directions). Thus, we must choose a learning rate on the order of 1/λmax, where λmax is the largest eigenvalue (see the sketch after this list).
9. If we can use a matrix of learning rates, this matrix is proportional to H^{-1}.
10. For real problems (i.e. nonlinear ones), you don't know the eigenvalues, so you just have to guess. Of course, there are algorithms that will estimate λmax; we won't be considering these here.
11. An alternative way to speed up learning is to transform the inputs (that is, x -> Px for some transformation matrix P) so that the resulting correlation matrix, (Px)(Px)^T, is equal to the identity.
12. The above statements are only strictly true for linear networks. However, the cost surface of a nonlinear network can be modeled as a quadratic in the vicinity of the current weights, so similar techniques can be applied there as approximations.
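A sketch of point 8: estimate λmax of the Hessian/correlation matrix and derive a divergence-safe learning rate from it (the data and scaling factors are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# Anisotropic inputs: very different variances per dimension.
X = rng.normal(size=(500, 4)) * np.array([1.0, 3.0, 0.5, 2.0])

H = (X.T @ X) / len(X)          # correlation matrix = Hessian for a linear net
lams = np.linalg.eigvalsh(H)    # eigenvalues: curvature along principal axes
lam_max = lams.max()

eta = 1.0 / lam_max             # on the order of 1/lambda_max: safe along
print(f"eigenvalues: {lams}, safe learning rate ~ {eta:.3f}")  # steepest axis
```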

    Summary of Linear Nets

    Characteristics of Networks

number of layers
number of nodes per layer
activation function (linear, binary, softmax)
error function (mean squared error (MSE), cross entropy)
type of learning algorithm (gradient descent, perceptron, delta rule)

    Types of Applications and Associated Nets

Regression:
  o uses a one-layer linear network (activation function is the identity)
  o uses the MSE cost function
  o uses gradient descent learning

Classification - Perceptron Learning:
  o uses a one-layer network with a binary step activation function
  o uses the MSE cost function
  o uses the perceptron learning algorithm (identical to gradient descent when the targets are +1 and -1)

Classification - Delta Rule:
  o uses a one-layer network with a linear activation function
  o uses the MSE cost function
  o uses gradient descent


where δ_qr = 1 if q = r and zero otherwise (the Kronecker delta). Note that if r is the correct class, then t_r = 1 and the RHS of the above equation reduces to (t_r − y_r) x_s. If some q != r is the correct class, then t_r = 0 and the above also reduces to (t_r − y_r) x_s. Thus, in every case we have a weight update proportional to (t_r − y_r) x_s, which is exactly the (t − y) behavior we wanted from the softmax/cross-entropy combination.
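For completeness, here is a reconstruction of the derivation this passage concludes, using the softmax and cross-entropy definitions from above; the index names are assumptions, since the earlier steps of the derivation were lost in extraction.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
With $E = -\sum_{q=1}^{c} t_q \ln y_q$ and
$y_q = e^{\mathrm{net}_q} / \sum_{j} e^{\mathrm{net}_j}$,
where $\mathrm{net}_r = \sum_s w_{rs} x_s$, the softmax derivative is
\[
\frac{\partial y_q}{\partial \mathrm{net}_r} = y_q(\delta_{qr} - y_r),
\qquad
\delta_{qr} = \begin{cases} 1 & q = r \\ 0 & q \neq r. \end{cases}
\]
Then, using $\sum_q t_q = 1$,
\[
\frac{\partial E}{\partial w_{rs}}
= -\sum_{q} \frac{t_q}{y_q}\, y_q(\delta_{qr} - y_r)\, x_s
= -(t_r - y_r)\, x_s ,
\]
so gradient descent gives $\Delta w_{rs} = \eta\,(t_r - y_r)\, x_s$.
\end{document}
```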