PROBABILISTIC MODELS OF LEARNING AND MEMORY

Uncertainty and Bayesian inference

MÁTÉ LENGYEL

Computational and Biological Learning Lab

Department of Engineering

University of Cambridge

http://www.eng.cam.ac.uk/~m.lengyel

CEU, Budapest, 22-26 June 2009

listen to the words

UNCONSCIOUS INFERENCES

Hermann von Helmholtz, 1867

[Figure: checker shadow illusion (Adelson, unpubl)]

stimulus

percept

prior knowledge

UNCONSCIOUS INFERENCES, CONT’D

stimulus

percept

prior knowledge

[Figure: word-list demonstration, "rocking chair" (Roediger & McDermott, 1995)]

experience

memories

learning

FORMALISING INFERENCES

[Figure: belief (axis from impossible to possible) over the shade of square A, annotated with "physical luminance" and "knowledge of checkerboards"; the same plot is repeated for the shade of square B]

REPRESENTING DEGREES OF BELIEFS AS PROBABILITIES

[Figure: P(shade of square B) plotted over the shade of square B]

Dutch Book Theorem: If you are willing to bet on your beliefs, then unless they satisfy the axioms of probability, there will always be a set of bets (a “Dutch book”) that you would accept which is guaranteed to lose you money, no matter what the outcome is!

\[
\mathrm{odds}(\text{shade of square B} = x)
= \frac{-\$ \ \text{if shade of square B} \neq x}{+\$ \ \text{if shade of square B} = x}
= \frac{P(\text{shade of square B} = x)}{P(\text{shade of square B} \neq x)}
\]
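The Dutch book argument can be illustrated with a small numerical sketch. It assumes a simple betting setup that is not part of the slides: an agent values a ticket paying 1 unit if an event occurs at its degree of belief in that event, and buys one ticket on x and one on not-x at those prices. The function names and the belief values (0.7, 0.5, 0.3) are hypothetical.

```python
def ticket_payoff(price, stake, event_occurs):
    """Net payoff to the buyer of a ticket that pays `stake` if the event occurs."""
    return (stake if event_occurs else 0.0) - price


def fair_price(belief, stake=1.0):
    """Price that an agent with degree of belief `belief` regards as fair (assumption)."""
    return belief * stake


def net_outcomes(belief_x, belief_not_x, stake=1.0):
    """Agent buys one ticket on x and one on not-x at its own fair prices.

    Returns the agent's net payoff (if x is true, if x is false).
    """
    price_x = fair_price(belief_x, stake)
    price_not_x = fair_price(belief_not_x, stake)
    if_x = (ticket_payoff(price_x, stake, True)
            + ticket_payoff(price_not_x, stake, False))
    if_not_x = (ticket_payoff(price_x, stake, False)
                + ticket_payoff(price_not_x, stake, True))
    return if_x, if_not_x


# Beliefs that sum to more than 1 violate normalisation: sure loss either way.
incoherent = net_outcomes(0.7, 0.5)
# Beliefs that sum to 1: the agent breaks even whatever happens.
coherent = net_outcomes(0.7, 0.3)

print("incoherent beliefs:", incoherent)
print("coherent beliefs:  ", coherent)
```

If the two beliefs summed to less than 1, the same guaranteed loss would arise with the agent selling the two tickets instead of buying them.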

THE AXIOMS OF PROBABILITY (HOW TO REPRESENT YOUR BELIEFS)

P(x|y): belief in x if we know y is true

properties:

\( 0 \le P(x) \le 1 \)

\( P(x) = 1 \Leftrightarrow x \) is certainly true

\( P(x) = 0 \Leftrightarrow x \) is certainly not true

\( P(x, y) = P(x) \cdot P(y) \Leftrightarrow x \) and \( y \) are independent

axioms:

probabilities are non-negative: \( P(x) \ge 0 \)

probabilities are normalised: \( \sum_x P(x) = 1 \)

joint probability to marginal probability: \( P(x, y) \rightarrow P(x) = \sum_y P(x, y) \)

conditional probability: \( P(x, y) = P(x|y) \cdot P(y) = P(y|x) \cdot P(x) \)

Bayes’ rule: \( P(x|y) = \dfrac{P(y|x)\, P(x)}{P(y)} \)

posterior ∝ likelihood × prior
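As a rough check on how the sum rule, the product rule, and Bayes' rule fit together, the sketch below applies them to a small joint table. The joint probabilities, the labels `x0`, `x1`, `y0`, `y1`, and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical joint distribution P(x, y) over two binary variables.
joint = {
    ("x0", "y0"): 0.35, ("x0", "y1"): 0.15,
    ("x1", "y0"): 0.10, ("x1", "y1"): 0.40,
}


def marginal_x(joint):
    """Sum rule: P(x) = sum_y P(x, y)."""
    out = {}
    for (x, _y), p in joint.items():
        out[x] = out.get(x, 0.0) + p
    return out


def marginal_y(joint):
    """Sum rule: P(y) = sum_x P(x, y)."""
    out = {}
    for (_x, y), p in joint.items():
        out[y] = out.get(y, 0.0) + p
    return out


def conditional_x_given_y(joint, y):
    """Product rule rearranged: P(x | y) = P(x, y) / P(y)."""
    p_y = marginal_y(joint)[y]
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}


def conditional_y_given_x(joint, x):
    """Product rule rearranged: P(y | x) = P(x, y) / P(x)."""
    p_x = marginal_x(joint)[x]
    return {y: p / p_x for (xx, y), p in joint.items() if xx == x}


def bayes_x_given_y(joint, y):
    """Bayes' rule: P(x | y) = P(y | x) P(x) / P(y)."""
    p_x = marginal_x(joint)
    p_y = marginal_y(joint)[y]
    return {x: conditional_y_given_x(joint, x)[y] * p_x[x] / p_y for x in p_x}


direct = conditional_x_given_y(joint, "y0")
via_bayes = bayes_x_given_y(joint, "y0")
print("P(x | y0) directly: ", direct)
print("P(x | y0) via Bayes:", via_bayes)
```

Both routes give the same conditional distribution, which is all that Bayes' rule asserts: it is the product rule written in two ways and then rearranged.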

P(shade of square B | luminance, checkerboard, shadows) ∝ P(luminance of square B | shade of square B) × P(shade of square B | checkerboard)

posterior ∝ likelihood × prior

Here the left-hand side is the posterior, the first factor on the right is the likelihood, and the second factor is the prior.

BAYESIAN DECISION THEORY (HOW TO MAKE POINT ESTIMATES)

loss function L(a, x):

                state of the world
                x1          x2          x3
action   a1     L(a1,x1)    L(a1,x2)    L(a1,x3)
         a2     L(a2,x1)    L(a2,x2)    L(a2,x3)
         a3     L(a3,x1)    L(a3,x2)    L(a3,x3)

action to choose:

\[ a^* = \operatorname*{argmin}_a \sum_x L(a, x)\, P(x) \]

note: a and x need not live in the same space

special cases when a and x do live in the same space, \( a = \hat{x} \):

loss function: \( L(\hat{x}, x) = (\hat{x} - x)^2 \)
action to choose: \( \hat{x} = \sum_x x\, P(x) \) (posterior mean)

loss function: \( L(\hat{x}, x) = |\hat{x} - x| \)
action to choose: \( \sum_{x=-\infty}^{\hat{x}} P(x) = \frac{1}{2} \) (posterior median)

loss function: \( L(\hat{x}, x) = 0 \) if \( x = \hat{x} \), \( 1 \) otherwise
action to choose: \( \hat{x} = \operatorname*{argmax}_x P(x) \) (maximum a posteriori, MAP)
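The correspondence between loss functions and point estimates can be checked numerically. The sketch below minimises the expected loss by brute force over a grid of candidate estimates and compares the result with the posterior mean, median, and mode. The five-point posterior, the candidate grid, and the function names are hypothetical; the median is taken as the smallest value whose cumulative probability reaches 1/2.

```python
def expected_loss(action, values, probs, loss):
    """Expected loss of an action: sum_x L(a, x) P(x)."""
    return sum(loss(action, x) * p for x, p in zip(values, probs))


def best_action(candidates, values, probs, loss):
    """Brute-force argmin of the expected loss over candidate actions."""
    return min(candidates, key=lambda a: expected_loss(a, values, probs, loss))


def squared_loss(a, x):
    return (a - x) ** 2


def absolute_loss(a, x):
    return abs(a - x)


def zero_one_loss(a, x):
    return 0.0 if a == x else 1.0


def posterior_median(values, probs):
    """Smallest value whose cumulative probability reaches 1/2."""
    running = 0.0
    for x, p in zip(values, probs):
        running += p
        if running >= 0.5:
            return x
    return values[-1]


# Hypothetical discrete posterior P(x); values are sorted in increasing order.
values = [0, 1, 2, 3, 4]
probs = [0.35, 0.20, 0.10, 0.05, 0.30]
candidates = [i / 100 for i in range(401)]  # estimates 0.00, 0.01, ..., 4.00

mean_estimate = sum(x * p for x, p in zip(values, probs))
median_estimate = posterior_median(values, probs)
map_estimate = values[probs.index(max(probs))]

best_squared = best_action(candidates, values, probs, squared_loss)
best_absolute = best_action(candidates, values, probs, absolute_loss)
best_zero_one = best_action(candidates, values, probs, zero_one_loss)

print("squared loss: ", best_squared, "vs posterior mean  ", mean_estimate)
print("absolute loss:", best_absolute, "vs posterior median", median_estimate)
print("0-1 loss:     ", best_zero_one, "vs MAP             ", map_estimate)
```

The posterior was chosen so that the three estimates differ, which makes it visible that the choice of loss function, not the posterior alone, determines the reported value.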

EXAMPLE: PREDICTING LIFE SPAN

You meet someone who is t years old. What will be his total life span \( t_{\text{total}} \)?

\[ P(t_{\text{total}} \mid t) \propto P(t \mid t_{\text{total}})\, P(t_{\text{total}}) \]

\( P(t \mid t_{\text{total}}) \): the probability that you meet someone at the age of t when s/he will have a total life span of \( t_{\text{total}} \)

\[
P(t \mid t_{\text{total}}) =
\begin{cases}
1 / t_{\text{total}} & \text{if } t < t_{\text{total}} \\
0 & \text{otherwise}
\end{cases}
\]

\( P(t_{\text{total}}) \): prior on the life span distribution of people

+ decision theory: e.g. report the median of the posterior
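The life span prediction can be sketched numerically. The code below assumes the likelihood 1/t_total for t < t_total and a Gaussian prior with mean 75 years and standard deviation 16 years, the values quoted in the Griffiths & Tenenbaum (2006) excerpt shown in these slides. The grid bounds, step size, and function names are hypothetical choices for this illustration, and the posterior median is approximated on a grid rather than computed in closed form.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def predict_total(t, prior_pdf, t_max=200.0, step=0.1):
    """Approximate posterior median of t_total given the current age t.

    Posterior weight on each grid cell: P(t | t_total) * P(t_total)
    = prior(t_total) / t_total, restricted to t_total > t.
    """
    n = int(round((t_max - t) / step))
    grid = [t + step * (i + 0.5) for i in range(n)]  # cell midpoints above t
    weights = [prior_pdf(total) / total for total in grid]
    half = 0.5 * sum(weights)
    running = 0.0
    for total, w in zip(grid, weights):
        running += w
        if running >= half:
            return total
    return grid[-1]


def life_span_prior(total):
    """Gaussian prior on total life span (mean 75, sd 16, as quoted in the excerpt)."""
    return gaussian_pdf(total, 75.0, 16.0)


predictions = {t: predict_total(float(t), life_span_prior) for t in (18, 39, 75, 96)}
for age, total in predictions.items():
    print(f"met at age {age}: predicted total life span about {total:.1f}")
```

Under these assumptions the predictions stay in the low seventies for young ages and rise only slowly above the observed age for older people, the qualitative pattern the excerpt describes for Gaussian priors.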

gous to the Copernican anthropic principle in Bayesian cosmology (Buch, 1994; Caves, 2000; Garrett & Coles, 1993; Gott, 1993, 1994; Ledford, Marriott, & Crowder, 2001) and the generic-view principle in Bayesian models of visual perception (Freeman, 1994; Knill & Richards, 1996). The prior probability \( p(t_{\text{total}}) \) reflects our general expectations about the relevant class of events, in this case about how likely it is that a man’s life span will be \( t_{\text{total}} \). Analysis of actuarial data shows that the distribution of life spans in our society is (ignoring infant mortality) approximately Gaussian (normally distributed), with a mean, \( \mu \), of about 75 years and a standard deviation, \( \sigma \), of about 16 years.

Combining the prior with the likelihood according to Equation 1 yields a probability distribution \( p(t_{\text{total}} \mid t) \) over all possible total life spans \( t_{\text{total}} \) for a man encountered at age t. A good guess for \( t_{\text{total}} \) is the median of this distribution, that is, the point at which it is equally likely that the true life span is longer or shorter. Taking the median of \( p(t_{\text{total}} \mid t) \) defines a Bayesian prediction function, specifying a predicted value of \( t_{\text{total}} \) for each observed value of t. Prediction functions for events with Gaussian priors are nonlinear: For values of t much less than the mean of the prior, the predicted value of \( t_{\text{total}} \) is approximately the mean; once t approaches the mean, the predicted value of \( t_{\text{total}} \) increases slowly, converging to t as t increases but always remaining slightly higher, as shown in Figure 1. Although its mathematical form is complex, this prediction function makes intuitive sense for human life spans: A predicted life span of about 75 years would be reasonable for a man encountered at age 18, 39, or 51; if we met a man at age 75, we might be inclined to give him several more years at least; but if we met someone at age 96, we probably would not expect him to live much longer.

This approach to prediction is quite general, applicable to any problem that requires estimating the upper limit of a duration, extent, or other numerical quantity given a sample drawn from that interval (Buch, 1994; Caves, 2000; Garrett & Coles, 1993; Gott, 1993, 1994; Jaynes, 2003; Jeffreys, 1961; Ledford et al., 2001; Leslie, 1996; Maddox, 1994; Shepard, 1987; Tenenbaum & Griffiths, 2001). However, different priors will be appropriate for different kinds of phenomena, and the prediction function will vary substantially as a result. For example, imagine trying to predict the total box-office gross of a movie given its take so far. The total gross of movies follows a power-law distribution, with \( p(t_{\text{total}}) \propto t_{\text{total}}^{-\gamma} \) for some \( \gamma > 0 \) (see footnote 1). This distribution has a highly non-Gaussian shape (see Fig. 1), with most movies taking in only modest amounts, but occasional blockbusters making huge amounts of money. In the appendix, we show that for power-law priors, the Bayesian prediction function picks a value for \( t_{\text{total}} \) that is a multiple of the observed sample t. The exact multiple depends on the parameter \( \gamma \). For the particular power law that best fits the actual distribution of movie grosses, an optimal Bayesian observer would estimate the total gross to be approximately 50% greater than the current gross: Thus, if we observe a movie has made $40 million to date, we should guess a total gross of around $60 million; if we observe a current gross of only $6 million, we should guess about $9 million for the total.

Although such constant-multiple prediction rules are optimal for event classes that follow power-law priors, they are clearly inappropriate for predicting life spans or other kinds of events with Gaussian priors. For instance, upon meeting a 10-year-old girl and her 75-year-old grandfather, we would never predict that the girl will live a total of 15 years (1.5 × 10) and the grandfather will live to be 112 (1.5 × 75). Other classes of priors, such as the exponential-tailed Erlang distribution, \( p(t_{\text{total}}) \propto t_{\text{total}} \exp(-t_{\text{total}}/\beta) \) for \( \beta > 0 \) (see footnote 2), are also associated with distinctive optimal prediction functions. For the Erlang distribution, the

Fig. 1. Bayesian prediction functions and their associated prior distributions. The three columns represent qualitatively different statistical models appropriate for different kinds of events. The top row of plots shows three parametric families of prior distributions for the total duration or extent, \( t_{\text{total}} \), that could describe events in a particular class. Lines of different styles represent different parameter values (e.g., different mean durations) within each family. The bottom row of plots shows the optimal predictions for \( t_{\text{total}} \) as a function of t, the observed duration or extent of an event so far, assuming the prior distributions shown in the top panel. For Gaussian priors (left column), the prediction function always has a slope less than 1 and an intercept near the mean \( \mu \): Predictions are never much smaller than the mean of the prior distribution, nor much larger than the observed duration. Power-law priors (middle column) result in linear prediction functions with variable slope and a zero intercept. Erlang priors (right column) yield a linear prediction function that always has a slope equal to 1 and a nonzero intercept.

Footnote 1: When \( \gamma > 1 \), a power-law distribution is often referred to in statistics and economics as a Pareto distribution.

Footnote 2: The Erlang distribution is a special case of the gamma distribution. The gamma distribution is \( p(t_{\text{total}}) \propto t_{\text{total}}^{k-1} \exp(-t_{\text{total}}/\beta) \), where \( k > 0 \) and \( \beta > 0 \) are real numbers. The Erlang distribution assumes that k is an integer. Following Shepard (1987), we use a one-parameter Erlang distribution, fixing k at 2.

768 Volume 17—Number 9

Everyday Predictions

EVERYDAY PREDICTIONS

best guess of \( t_{\text{total}} \) is simply t plus a constant determined by the parameter \( \beta \), as shown in the appendix and illustrated in Figure 1.

Our experiment compared these ideal Bayesian analyses with the judgments of a large sample of human participants, examining whether people’s predictions were sensitive to the distributions of different quantities that arise in everyday contexts. We used publicly available data to identify the true prior distributions for several classes of events (the sources of these data are given in Table 1). For example, as shown in Figure 2, human life spans and the run time of movies are approximately Gaussian, the gross of movies and the length of poems are approximately power-law distributed, and the distributions of the number of years in office for members of the U.S. House of Representatives and of the length of the reigns of pharaohs are approximately Erlang. The experiment examined how well people’s predictions corresponded to optimal statistical inference in these different settings.
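The constant-multiple and constant-offset rules described in the excerpt can be reproduced with a short calculation. This is a sketch worked out here, not the paper's appendix, and it assumes the likelihood 1/t_total for t_total > t. With a power-law prior the posterior is proportional to t_total^(-(gamma + 1)) above t, so the posterior median is 2^(1/gamma) times t. With the Erlang prior the factor t_total cancels against the likelihood, the posterior is proportional to exp(-t_total/beta) above t, and the posterior median is t + beta ln 2. The code compares these closed forms with a grid approximation; the parameter values, grid settings, and function names are hypothetical.

```python
import math


def posterior_median(t, prior, t_max=2000.0, step=0.02):
    """Grid approximation to the posterior median of t_total given t.

    Posterior weight per cell: prior(t_total) / t_total for t_total > t,
    assuming the likelihood P(t | t_total) = 1 / t_total.
    """
    n = int((t_max - t) / step)
    grid = [t + step * (i + 0.5) for i in range(n)]  # cell midpoints above t
    weights = [prior(total) / total for total in grid]
    half = 0.5 * sum(weights)
    running = 0.0
    for total, w in zip(grid, weights):
        running += w
        if running >= half:
            return total
    return grid[-1]


def power_law_prior(gamma):
    """Unnormalised power-law prior, p(t_total) proportional to t_total ** -gamma."""
    return lambda total: total ** (-gamma)


def erlang_prior(beta):
    """Unnormalised Erlang prior with k = 2: t_total * exp(-t_total / beta)."""
    return lambda total: total * math.exp(-total / beta)


gamma, beta = 2.0, 10.0  # hypothetical parameter values

numeric_power = posterior_median(10.0, power_law_prior(gamma))
closed_power = 2.0 ** (1.0 / gamma) * 10.0   # constant multiple of t

numeric_erlang = posterior_median(20.0, erlang_prior(beta))
closed_erlang = 20.0 + beta * math.log(2.0)  # t plus a constant

print(f"power law: grid {numeric_power:.2f}, closed form {closed_power:.2f}")
print(f"Erlang:    grid {numeric_erlang:.2f}, closed form {closed_erlang:.2f}")
```

The grid is truncated at a finite upper bound, so the agreement is approximate; for heavier-tailed power laws (smaller gamma) a larger bound would be needed.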

METHOD

Participants and Procedure

Participants were tested in two groups, with each group making predictions about five different phenomena. One group of 208 undergraduates made predictions about movie grosses, poem lengths, life spans, reigns of pharaohs, and lengths of marriages. A second group of 142 undergraduates made predictions about movie run times, terms of U.S. representatives, baking times for cakes, waiting times, and lengths of marriages. The surveys were

TABLE 1

Sources of Data for Estimating Prior Distributions

Data set                       Source (number of data points)
Movie grosses                  http://www.worldwideboxoffice.com/ (5,302)
Poem lengths                   http://www.emule.com/ (1,000)
Life spans                     http://www.demog.berkeley.edu/wilmoth/mortality/states.html (complete life table)
Movie run times                http://www.imdb.com/charts/usboxarchive/ (233 top-10 movies from 1998 through 2003)
U.S. representatives’ terms    http://www.bioguide.congress.gov/ (2,150 members since 1945)
Cake baking times              http://www.allrecipes.com/ (619)
Pharaohs’ reigns               http://www.touregypt.com/ (126)

Note. Data were collected from these Web sites between July and December 2003.

Fig. 2. People’s predictions for various everyday phenomena. The top row of plots shows the empirical distributions of the total duration or extent, \( t_{\text{total}} \), for each of these phenomena. The first two distributions are approximately Gaussian, the third and fourth are approximately power-law, and the fifth and sixth are approximately Erlang. The bottom row shows participants’ predicted values of \( t_{\text{total}} \) for a single observed sample t of a duration or extent for each phenomenon. Black dots show the participants’ median predictions of \( t_{\text{total}} \). Error bars indicate 68% confidence intervals (estimated by a 1,000-sample bootstrap). Solid lines show the optimal Bayesian predictions based on the empirical prior distributions shown above. Dashed lines show predictions made by estimating a subjective prior, for the pharaohs and waiting-times stimuli, as explained in the main text. Dotted lines show predictions based on a fixed uninformative prior (Gott, 1993).

Volume 17—Number 9 769

Thomas L. Griffiths and Joshua B. Tenenbaum

RATIONAL VS IRRATIONAL

Bernoulli (1713)

Kahneman & Tversky (2002 Nobel Prize in Economics)

John Anderson: rational analysis

• computational cost
• ecology vs economy
• certainty vs uncertainty
• implicit vs explicit (esp. verbal) computations

for more, see Anderson (1990)

REFERENCES

Adelson (unpubl) http://web.mit.edu/persci/people/adelson/checkershadow_illusion.html

Anderson (1990) The adaptive character of thought. Lawrence Erlbaum Associates, Hillsdale, NJ.

Bernoulli J (1713) Ars conjectandi. Thurnisiorum, Basel.

Griffiths TL, Tenenbaum JB (2006) Optimal predictions in everyday cognition. Psychol Sci 17:767-773.

Helmholtz H (1867) Handbuch der physiologischen Optik. L. Voss, Leipzig. (translated into English by JPC Southall as “Treatise on Physiological Optics”)

Kahneman D, Tversky A (1973) On the psychology of prediction. Psychol Rev 80:237-251.

Roediger HL III, McDermott KB (1995) Creating false memories: Remembering words not presented in lists. J Exp Psychol Learn Mem Cogn 21:803-814.
