

CS-424 Gregory Dudek

Today’s Lecture

• Administrative Details

• Learning
  – Decision trees: cleanup & details
  – Belief nets
  – Sub-symbolic learning

• Neural networks


Administrativia

• Assignment 3 now available. Due in 1 week.
  <http://www.cim.mcgill.ca/~dudek/424/game.pdf>

• The game for the final project has been defined. Its description can be found via the
  course home page at: <http://www.cim.mcgill.ca/~dudek/424/game.pdf>

• Examine the game now, try it, and see what good strategies might be.

• Note: the normal late policy will not apply to the project.
  – You **must** submit the electronic (executable) version on time, or it may not be
    evaluated (i.e. you get zero)!
  – It must run on Linux. Be certain to compile and test it on one of the Linux machines
    in the lab well before it's due.
    • If you are developing on another platform, regularly test on Linux during development.


ID3 (more…)

• Last class we discussed using entropy to select a question for building a decision tree.
  This idea was first developed by Quinlan (1979) in the ID3 system and later improved,
  resulting in C4.5.

• Recap:
  – Entropy for classification into sets p+ and p- is I(p+,p-).
  – Consider the information gain per attribute A (see the sketch below).
    • For each subtree Ai, the number of bits needed is given by the distribution of cases
      on that subtree, I(pi+,pi-). To fully classify all sub-cases, we need
      Remainder(A) = ∑i weight(i) · I(pi+,pi-)
    • Thus, the information gain is what's left over:
      Gain(A) = I(p+,p-) - Remainder(A)
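To make the recap concrete, here is a minimal Python sketch (not from the slides) of the entropy and gain computations for a two-class problem; the function names and the example counts are illustrative assumptions.

```python
import math

def entropy(pos, neg):
    """I(p+, p-): bits needed to classify a set with pos/neg examples."""
    total = pos + neg
    bits = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            bits -= p * math.log2(p)
    return bits

def info_gain(pos, neg, splits):
    """Gain(A) = I(p+, p-) - Remainder(A), where `splits` lists the
    (pos_i, neg_i) counts in each subtree produced by attribute A and
    weight(i) is the fraction of the cases that fall in subtree i."""
    total = pos + neg
    remainder = sum(((pi + ni) / total) * entropy(pi, ni) for pi, ni in splits)
    return entropy(pos, neg) - remainder

# Made-up example: 6+/6- cases split into subtrees (4+,0-), (2+,2-), (0+,4-).
print(info_gain(6, 6, [(4, 0), (2, 2), (0, 4)]))   # ≈ 0.667 bits
```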


Final thoughts on entropy...

• Provoked by the seminar yesterday.
• Idea:
  – Entropy tells you about unpredictability, or randomness.
  – When selecting a question, the one with the highest entropy will carry the most
    information with respect to what you already knew, because its answer is hardest
    to predict.
  – When asking how much we already know about something, high entropy is bad.
  – Consider the PDF of a robot's position estimate: more entropy means more uncertainty.


Training & testing

• When constructing a learning system, we want it to generalize to new examples
  (already discussed ad nauseam).

• How can we tell if it's working?
  – Look at how well we do on our training data.
    • But… what if we just learned a "quirk" of the data? Overfitting? Bad features?
      – (tank classification example; table lookup)
  – Look at how we do on a set of examples never used for any training: a "test set".
    • But… what if we can't afford the data?
  – Cross-validation: for learner L on training set X (sketched in code below),
    e(L;X) = ∑i∈X error(L(S) on case i)²,  where S = X − {i}
    (i.e. train on all cases except i, test on the held-out case i, and sum the squared errors).
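As a rough illustration of the cross-validation formula above, here is a minimal Python sketch of leave-one-out evaluation; `train` and `error` are assumed interfaces standing in for the learner L and its per-case error, not functions defined in the lecture.

```python
def loo_cv_error(train, error, cases):
    """e(L;X) = sum over i in X of error(L(X - {i}) on case i)^2.

    `train(examples)` returns a fitted model and `error(model, case)`
    returns its error on one held-out case; both are assumed interfaces."""
    total = 0.0
    for i, held_out in enumerate(cases):
        rest = cases[:i] + cases[i + 1:]      # S = X - {i}: leave case i out
        model = train(rest)                   # L trained on S
        total += error(model, held_out) ** 2  # squared error on the held-out case
    return total
```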


Simple functions?

Is there a fixed circuit (network) topology that can be used to represent a family of functions?

Yes! Neural-like networks (a.k.a. artificial neural networks) allow us this flexibility and more; we can represent arbitrary families of continuous functions using fixed-topology networks.


Belief networks (ch. 15) - briefly

• We will cover only R&N Sections 15.1 & 15.2 (briefly), and then segue to chapter 19.
  You should read 15.3. This will be cursory coverage only.

• A belief net (a.k.a. Bayes net) is a formalism for describing probabilistic relationships.

• A graph (in fact a DAG) G(V,E). Nodes are random variables. Directed edges indicate that
  a node has direct influence on another node.

• Each node has an associated conditional probability table quantifying the effects of its
  "parents" (see the sketch below).

• No directed cycles.
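The following is a small, hypothetical sketch (not from the slides or R&N) of how a node and its conditional probability table might be represented for Boolean variables; the Rain/WetGrass network and its numbers are invented for illustration.

```python
class Node:
    """A belief-net node: a random variable with parents and a CPT that
    maps each tuple of parent values to P(node = True | parents)."""
    def __init__(self, name, parents, cpt):
        self.name = name
        self.parents = parents
        self.cpt = cpt

    def p(self, value, parent_values):
        p_true = self.cpt[tuple(parent_values)]
        return p_true if value else 1.0 - p_true

# Invented two-node network: Rain -> WetGrass.
rain = Node("Rain", [], {(): 0.2})
wet = Node("WetGrass", [rain], {(True,): 0.9, (False,): 0.1})

# One entry of the joint distribution, via the chain rule for belief nets:
# P(Rain=True, WetGrass=True) = P(Rain=True) * P(WetGrass=True | Rain=True)
print(rain.p(True, ()) * wet.p(True, (True,)))   # 0.2 * 0.9 = 0.18
```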


Why?

• Objective:
  – Compute probabilities of the variables of interest, the query variables,
  – given observations of specific phenomena in the world, the evidence variables.

• The net you get depends on how you construct it, not just on the problem and the probabilities.

• Seek compactness (fewer links, tighter clusters): such nets are called locally structured.


• See overheads...


Issues

• Where do the probabilities come from?
  – They can be learned or inferred from data.

• Where does the causal structure (the topology) come from?
  – It's (sometimes) very hard to learn.
  – Problem: lots of alternative topologies are possible.
    What's really the cause and what's the effect?
    • Did it really rain because I brought my umbrella? Can a computer infer this
      (or the opposite) just from weather data?

• Both of these topics are current research areas.

Not in text


Neural Networks?

Artificial Neural Nets
a.k.a.
Connectionist Nets (connectionist learning)
a.k.a.
Sub-symbolic learning
a.k.a.
Perceptron learning (a special case)


Networks that model the brain?

• Note there is an interesting connection to Bayes nets; it isn't considered in the book.
  – Something to reflect on.

• Idea: model intelligence without "jumping ahead" to symbolic representations.

• Related to the earliest work on cybernetics.


The idealized neuron

• Artificial neural networks come in several "flavors".
  – Most are based on a simplified model of a neuron:
    • A set of (many) inputs.
    • One output.
    • The output is a function of the sum of the inputs.
  – Typical functions (see the sketch below):
    • Weighted sum
    • Threshold
    • Gaussian
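A minimal Python sketch (not from the slides) of this idealized unit, with made-up inputs and weights; the `kind` parameter simply selects one of the typical output functions listed above.

```python
import math

def unit_output(inputs, weights, kind="threshold", t=0.0, sigma=1.0):
    """One idealized unit: sum the weighted inputs, then apply an output function."""
    s = sum(x * w for x, w in zip(inputs, weights))    # sum of the weighted inputs
    if kind == "weighted_sum":
        return s                                       # output the sum itself
    if kind == "threshold":
        return 1 if s > t else 0                       # fire only above threshold t
    if kind == "gaussian":
        return math.exp(-(s ** 2) / (2 * sigma ** 2))  # bell-shaped response around 0
    raise ValueError(kind)

# Made-up inputs and weights: the weighted sum is 1.4, which exceeds t = 1.0.
print(unit_output([1, 0, 1], [0.5, -0.3, 0.9], kind="threshold", t=1.0))   # 1
```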


Why neural nets?

• Motives:
  – We wish to create systems with abilities akin to those of the human mind.
    • The mind is usually assumed to be a direct consequence of the structure of the brain.
    – Let's mimic the structure of the brain!
  – By using simple computing elements, we obtain a system that might scale up easily
    to parallel hardware.
  – Avoids (or solves?) the key unresolved problem of how to get from the "signal domain"
    to symbolic representations.
  – Fault tolerance.

Not in text



Real and fake neurons

• Signals in neurons are coded by "spike rate".

• In ANNs, inputs can be either:
  – 0 or 1 (binary)
  – [0,1]
  – [-1,1]
  – R (real)

• Each input Ii has an associated real-valued weight wi.

• Learning happens by changing the weights at the synapses.


Brains

• The brain seems divided into functional areas.
• These are often seen as analogous to modules in a software system.
• Why would it be like this? (2 possible answers)
  – Evolution: incremental improvement easier in a modular system.
  – Advantage of combining complementary solutions.
  – It isn't!


Inductive bias?

• Where's the inductive bias?
  – In the topology and architecture of the network.
  – In the learning rules.
  – In the input and output representation.
  – In the initial weights.

Not in text


Simple neural models

• The oldest ANN model is the McCulloch-Pitts neuron [1943].
  – Inputs are +1 or -1, with real-valued weights.
  – If the sum of the weighted inputs is > 0, then the neuron "fires" and gives +1 as its output.
  – Showed you can compute logical functions (see the sketch below).
  – The relation to learning was proposed (later!) by Donald Hebb [1949].

• Perceptron model [Rosenblatt, 1958].
  – Single-layer network with the same kind of neuron.
    • Fires when the input is above a threshold: ∑xiwi > t.
  – Added a learning rule to allow weight selection.

Not in text
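As an illustration of the McCulloch-Pitts unit described above, here is a small Python sketch; the AND weights and the bias term are my own illustrative choices (the bias is equivalent to a weight on a constant input), not values given in the lecture.

```python
def mp_unit(inputs, weights, bias=0.0):
    """McCulloch-Pitts-style unit: inputs are +1/-1; fires (+1) if the
    weighted sum (plus an optional bias) exceeds 0, otherwise outputs -1."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else -1

# Logical AND of two +1/-1 inputs: fires only when both inputs are +1.
and_weights, and_bias = [1.0, 1.0], -1.5
for a in (+1, -1):
    for b in (+1, -1):
        print(a, b, mp_unit([a, b], and_weights, and_bias))
```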


Perceptron nets


Perceptron learning

• Perceptron learning:
  – Have a set of training examples (TS) encoded as input values (i.e. in the form of
    binary vectors).
  – Have a set of desired output values associated with these inputs.
    • This is supervised learning.
  – Problem: how to adjust the weights to make the actual outputs match the training examples.

• NOTE: we do not allow the topology to change! [You should be thinking of a question here.]

• Intuition: when a perceptron makes a mistake, its weights are wrong.
  – Modify them to make the output bigger or smaller, as desired.


Learning algorithm

• Desired output Ti, actual output Oi.

• Weight update formula (weight from unit j to unit i):
  Wj,i = Wj,i + k * xj * (Ti - Oi)
  where k is the learning rate.

• If the examples can be learned (encoded), then the perceptron learning rule will find
  the weights (see the sketch below).
  – How? Gradient descent.
    The key thing to prove is the absence of local minima.
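Here is a minimal Python sketch of the update rule above applied to a single output unit; the OR dataset, the learning rate, and the fixed threshold are illustrative assumptions, not values from the lecture.

```python
def predict(weights, x, threshold=0.5):
    """Perceptron output: 1 if the weighted sum exceeds the threshold, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > threshold else 0

def train_perceptron(examples, n_inputs, k=0.1, epochs=50):
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for x, target in examples:
            output = predict(weights, x)
            # The perceptron rule: nudge each weight by k * x_j * (T - O).
            weights = [w + k * xj * (target - output) for w, xj in zip(weights, x)]
    return weights

# Binary OR on two inputs (linearly separable, so the rule converges).
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = train_perceptron(data, n_inputs=2)
print(w, [predict(w, x) for x, _ in data])   # predictions match the targets
```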


Perceptrons: what can they learn?

• Only linearly separable functions [Minsky & Papert, 1969].
  – e.g. XOR is not linearly separable, so a single perceptron cannot learn it.

• In N dimensions, the decision boundary is a hyperplane.


More general networks

• Generalize in 3 ways:
  – Allow continuous output values in [0,1].
  – Allow multiple layers.
    • This is key to learning a larger class of functions.
  – Allow a more complicated function than thresholded summation. [why??]

• Generalize the learning rule to accommodate this: let's see how it works.


The threshold

• The key variant:
  – Change the threshold into a differentiable function.
  – The sigmoid, known as a "soft non-linearity" (silly).

M = ∑xiwi
O = 1 / (1 + e^(-kM))
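A small Python sketch of such a sigmoid unit, with made-up inputs, weights, and steepness k:

```python
import math

def sigmoid_unit(inputs, weights, k=1.0):
    """Sigmoid unit: squash the weighted sum M smoothly into (0, 1)."""
    m = sum(x * w for x, w in zip(inputs, weights))   # M = sum_i x_i * w_i
    return 1.0 / (1.0 + math.exp(-k * m))             # O = 1 / (1 + e^(-kM))

print(sigmoid_unit([1, 0, 1], [0.5, -0.3, 0.9]))      # ~0.80 for M = 1.4
```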