

CS-424 Gregory Dudek

Today’s Lecture

• Administrative Details

• Learning
  – Decision trees: cleanup & details
  – Belief nets
  – Sub-symbolic learning

• Neural networks


Administrativia

• Assignment 3 now available. Due in 1 week.
  <http://www.cim.mcgill.ca/~dudek/424/game.pdf>

• The game for the final project has been defined. Its description can be found via the
  course home page at: <http://www.cim.mcgill.ca/~dudek/424/game.pdf>

• Examine the game now, try it, and see what good strategies might be.

• Note: the normal late policy will not apply to the project.
  – You **must** submit the electronic (executable) version on time, or it may not be
    evaluated (i.e. you get zero)!
  – It must run on Linux. Be certain to compile and test it on one of the Linux machines
    in the lab well before it's due.
    • If you are developing on another platform, regularly test on Linux during development.


ID3 (more…)

• Last class we discussed using entropy to select a question for building a decision tree.
  This idea was first developed by Quinlan (1979) in the ID3 system and later improved,
  resulting in C4.5.

• Recap:
  – Entropy for classification into sets p+ and p- is I(p+,p-).
  – Consider the information gain per attribute A (see the sketch below).
    • For each subtree Ai, the number of bits needed is given by the distribution of cases
      on that subtree, I(pi+,pi-). To fully classify all sub-cases, we need
      Remainder(A) = ∑i weight(i) · I(pi+,pi-)
    • Thus, the information gain is what's left over:
      Gain(A) = I(p+,p-) - Remainder(A)
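To make the recap concrete, here is a minimal Python sketch (not from the slides) of the entropy and gain computations for a two-class problem; the function names and the example counts are illustrative assumptions.

```python
import math

def entropy(pos, neg):
    """I(p+, p-): bits needed to classify a set with pos/neg examples."""
    total = pos + neg
    bits = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            bits -= p * math.log2(p)
    return bits

def info_gain(pos, neg, splits):
    """Gain(A) = I(p+, p-) - Remainder(A), where `splits` lists the
    (pos_i, neg_i) counts in each subtree produced by attribute A and
    weight(i) is the fraction of the cases that fall in subtree i."""
    total = pos + neg
    remainder = sum(((pi + ni) / total) * entropy(pi, ni) for pi, ni in splits)
    return entropy(pos, neg) - remainder

# Made-up example: 6+/6- cases split into subtrees (4+,0-), (2+,2-), (0+,4-).
print(info_gain(6, 6, [(4, 0), (2, 2), (0, 4)]))   # ≈ 0.667 bits
```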


Final thoughts on entropy...

• Provoked by the seminar yesterday.
• Idea:
  – Entropy tells you about unpredictability, or randomness.
  – When selecting a question, the one with the highest entropy will carry the most
    information with respect to what you already knew, because its answer is hardest
    to predict.
  – When asking how much we already know about something, high entropy is bad.
  – Consider the PDF of a robot's position estimate: more entropy means more uncertainty.


Training & testing

• When constructing a learning system, we want it to generalize to new examples
  (already discussed ad nauseam).

• How can we tell if it's working?
  – Look at how well we do on our training data.
    • But… what if we just learned a "quirk" of the data? Overfitting? Bad features?
      – (tank classification example; table lookup)
  – Look at how we do on a set of examples never used for any training: a "test set".
    • But… what if we can't afford the data?
  – Cross-validation: for learner L on training set X (sketched in code below),
    e(L;X) = ∑i∈X error(L(S) on case i)²,  where S = X − {i}
    (i.e. train on all cases except i, test on the held-out case i, and sum the squared errors).
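As a rough illustration of the cross-validation formula above, here is a minimal Python sketch of leave-one-out evaluation; `train` and `error` are assumed interfaces standing in for the learner L and its per-case error, not functions defined in the lecture.

```python
def loo_cv_error(train, error, cases):
    """e(L;X) = sum over i in X of error(L(X - {i}) on case i)^2.

    `train(examples)` returns a fitted model and `error(model, case)`
    returns its error on one held-out case; both are assumed interfaces."""
    total = 0.0
    for i, held_out in enumerate(cases):
        rest = cases[:i] + cases[i + 1:]      # S = X - {i}: leave case i out
        model = train(rest)                   # L trained on S
        total += error(model, held_out) ** 2  # squared error on the held-out case
    return total
```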


Simple functions?

Is there a fixed circuit (network) topology that can be used to represent a family of functions?

Yes! Neural-like networks (a.k.a. artificial neural networks) allow us this flexibility and more; we can represent arbitrary families of continuous functions using fixed-topology networks.


Belief networks (ch. 15) - briefly

• We will cover only R&N Sections 15.1 & 15.2 (briefly), and then segue to chapter 19.
  You should read 15.3. This will be cursory coverage only.

• A belief net (a.k.a. Bayes net) is a formalism for describing probabilistic relationships.

• A graph (in fact a DAG) G(V,E). Nodes are random variables. Directed edges indicate that
  a node has direct influence on another node.

• Each node has an associated conditional probability table quantifying the effects of its
  "parents" (see the sketch below).

• No directed cycles.
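The following is a small, hypothetical sketch (not from the slides or R&N) of how a node and its conditional probability table might be represented for Boolean variables; the Rain/WetGrass network and its numbers are invented for illustration.

```python
class Node:
    """A belief-net node: a random variable with parents and a CPT that
    maps each tuple of parent values to P(node = True | parents)."""
    def __init__(self, name, parents, cpt):
        self.name = name
        self.parents = parents
        self.cpt = cpt

    def p(self, value, parent_values):
        p_true = self.cpt[tuple(parent_values)]
        return p_true if value else 1.0 - p_true

# Invented two-node network: Rain -> WetGrass.
rain = Node("Rain", [], {(): 0.2})
wet = Node("WetGrass", [rain], {(True,): 0.9, (False,): 0.1})

# One entry of the joint distribution, via the chain rule for belief nets:
# P(Rain=True, WetGrass=True) = P(Rain=True) * P(WetGrass=True | Rain=True)
print(rain.p(True, ()) * wet.p(True, (True,)))   # 0.2 * 0.9 = 0.18
```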


Why?

• Objective:
  – Compute probabilities of the variables of interest, the query variables,
  – given observations of specific phenomena in the world, the evidence variables.

• The net you get depends on how you construct it, not just on the problem and the probabilities.

• Seek compactness (fewer links, tighter clusters): such nets are called locally structured.


• See overheads...


Issues

• Where do the probabilities come from?
  – They can be learned or inferred from data.

• Where does the causal structure (the topology) come from?
  – It's (sometimes) very hard to learn.
  – Problem: lots of alternative topologies are possible.
    What's really the cause and what's the effect?
    • Did it really rain because I brought my umbrella? Can a computer infer this
      (or the opposite) just from weather data?

• Both of these topics are current research areas.

Not in text


Neural Networks?

Artificial Neural Nets
a.k.a.
Connectionist Nets (connectionist learning)
a.k.a.
Sub-symbolic learning
a.k.a.
Perceptron learning (a special case)


Networks that model the brain?

• Note there is an interesting connection to Bayes nets; it isn't considered in the book.
  – Something to reflect on.

• Idea: model intelligence without "jumping ahead" to symbolic representations.

• Related to the earliest work on cybernetics.


The idealized neuron

• Artificial neural networks come in several "flavors".
  – Most are based on a simplified model of a neuron:
    • A set of (many) inputs.
    • One output.
    • The output is a function of the sum of the inputs.
  – Typical functions (see the sketch below):
    • Weighted sum
    • Threshold
    • Gaussian
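A minimal Python sketch (not from the slides) of this idealized unit, with made-up inputs and weights; the `kind` parameter simply selects one of the typical output functions listed above.

```python
import math

def unit_output(inputs, weights, kind="threshold", t=0.0, sigma=1.0):
    """One idealized unit: sum the weighted inputs, then apply an output function."""
    s = sum(x * w for x, w in zip(inputs, weights))    # sum of the weighted inputs
    if kind == "weighted_sum":
        return s                                       # output the sum itself
    if kind == "threshold":
        return 1 if s > t else 0                       # fire only above threshold t
    if kind == "gaussian":
        return math.exp(-(s ** 2) / (2 * sigma ** 2))  # bell-shaped response around 0
    raise ValueError(kind)

# Made-up inputs and weights: the weighted sum is 1.4, which exceeds t = 1.0.
print(unit_output([1, 0, 1], [0.5, -0.3, 0.9], kind="threshold", t=1.0))   # 1
```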


Why neural nets?

• Motives:
  – We wish to create systems with abilities akin to those of the human mind.
    • The mind is usually assumed to be a direct consequence of the structure of the brain.
    – Let's mimic the structure of the brain!
  – By using simple computing elements, we obtain a system that might scale up easily
    to parallel hardware.
  – Avoids (or solves?) the key unresolved problem of how to get from the "signal domain"
    to symbolic representations.
  – Fault tolerance.

Not in text



Real and fake neurons

• Signals in neurons are coded by "spike rate".

• In ANNs, inputs can be either:
  – 0 or 1 (binary)
  – [0,1]
  – [-1,1]
  – R (real)

• Each input Ii has an associated real-valued weight wi.

• Learning happens by changing the weights at the synapses.


Brains

• The brain seems divided into functional areas.
• These are often seen as analogous to modules in a software system.
• Why would it be like this? (2 possible answers)
  – Evolution: incremental improvement easier in a modular system.
  – Advantage of combining complementary solutions.
  – It isn't!


Inductive bias?

• Where's the inductive bias?
  – In the topology and architecture of the network.
  – In the learning rules.
  – In the input and output representation.
  – In the initial weights.

Not in text


Simple neural models

• The oldest ANN model is the McCulloch-Pitts neuron [1943].
  – Inputs are +1 or -1, with real-valued weights.
  – If the sum of the weighted inputs is > 0, then the neuron "fires" and gives +1 as its output.
  – Showed you can compute logical functions (see the sketch below).
  – The relation to learning was proposed (later!) by Donald Hebb [1949].

• Perceptron model [Rosenblatt, 1958].
  – Single-layer network with the same kind of neuron.
    • Fires when the input is above a threshold: ∑xiwi > t.
  – Added a learning rule to allow weight selection.

Not in text
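As an illustration of the McCulloch-Pitts unit described above, here is a small Python sketch; the AND weights and the bias term are my own illustrative choices (the bias is equivalent to a weight on a constant input), not values given in the lecture.

```python
def mp_unit(inputs, weights, bias=0.0):
    """McCulloch-Pitts-style unit: inputs are +1/-1; fires (+1) if the
    weighted sum (plus an optional bias) exceeds 0, otherwise outputs -1."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else -1

# Logical AND of two +1/-1 inputs: fires only when both inputs are +1.
and_weights, and_bias = [1.0, 1.0], -1.5
for a in (+1, -1):
    for b in (+1, -1):
        print(a, b, mp_unit([a, b], and_weights, and_bias))
```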


Perceptron nets


Perceptron learning

• Perceptron learning:
  – Have a set of training examples (TS) encoded as input values (i.e. in the form of
    binary vectors).
  – Have a set of desired output values associated with these inputs.
    • This is supervised learning.
  – Problem: how to adjust the weights to make the actual outputs match the training examples.

• NOTE: we do not allow the topology to change! [You should be thinking of a question here.]

• Intuition: when a perceptron makes a mistake, its weights are wrong.
  – Modify them to make the output bigger or smaller, as desired.


Learning algorithm

• Desired output Ti, actual output Oi.

• Weight update formula (weight from unit j to unit i):
  Wj,i = Wj,i + k * xj * (Ti - Oi)
  where k is the learning rate.

• If the examples can be learned (encoded), then the perceptron learning rule will find
  the weights (see the sketch below).
  – How? Gradient descent.
    The key thing to prove is the absence of local minima.
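Here is a minimal Python sketch of the update rule above applied to a single output unit; the OR dataset, the learning rate, and the fixed threshold are illustrative assumptions, not values from the lecture.

```python
def predict(weights, x, threshold=0.5):
    """Perceptron output: 1 if the weighted sum exceeds the threshold, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > threshold else 0

def train_perceptron(examples, n_inputs, k=0.1, epochs=50):
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for x, target in examples:
            output = predict(weights, x)
            # The perceptron rule: nudge each weight by k * x_j * (T - O).
            weights = [w + k * xj * (target - output) for w, xj in zip(weights, x)]
    return weights

# Binary OR on two inputs (linearly separable, so the rule converges).
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = train_perceptron(data, n_inputs=2)
print(w, [predict(w, x) for x, _ in data])   # predictions match the targets
```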


Perceptrons: what can they learn?

• Only linearly separable functions [Minsky & Papert, 1969].
  – e.g. XOR is not linearly separable, so a single perceptron cannot learn it.

• In N dimensions, the decision boundary is a hyperplane.


More general networks

• Generalize in 3 ways:
  – Allow continuous output values in [0,1].
  – Allow multiple layers.
    • This is key to learning a larger class of functions.
  – Allow a more complicated function than thresholded summation. [why??]

• Generalize the learning rule to accommodate this: let's see how it works.


The threshold

• The key variant:
  – Change the threshold into a differentiable function.
  – The sigmoid, known as a "soft non-linearity" (silly).

M = ∑xiwi
O = 1 / (1 + e^(-kM))
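A small Python sketch of such a sigmoid unit, with made-up inputs, weights, and steepness k:

```python
import math

def sigmoid_unit(inputs, weights, k=1.0):
    """Sigmoid unit: squash the weighted sum M smoothly into (0, 1)."""
    m = sum(x * w for x, w in zip(inputs, weights))   # M = sum_i x_i * w_i
    return 1.0 / (1.0 + math.exp(-k * m))             # O = 1 / (1 + e^(-kM))

print(sigmoid_unit([1, 0, 1], [0.5, -0.3, 0.9]))      # ~0.80 for M = 1.4
```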