Bayesian models of inductive learning
Josh Tenenbaum & Tom Griffiths
MIT Computational Cognitive Science Group
Department of Brain and Cognitive Sciences
Computer Science and AI Lab (CSAIL)
What to expect
• What you’ll get out of this tutorial:
  – Our view of what Bayesian models have to offer cognitive science.
  – In-depth examples of basic and advanced models: how the math works & what it buys you.
  – Some comparison to other approaches.
  – Opportunities to ask questions.
• What you won’t get:
  – Detailed, hands-on how-to.
• Where you can learn more:
  http://bayesiancognition.com
Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
Bayesian models in cognitive science
• Vision
• Motor control
• Memory
• Language
• Inductive learning and reasoning….
Everyday inductive leaps
• Learning concepts and words from examples
“horse”
“horse”
“horse”
Learning concepts and words
“tufa”
“tufa”
“tufa”
Can you pick out the tufas?
Inductive reasoning
Input:
  Cows can get Hick’s disease. (premise)
  Gorillas can get Hick’s disease. (premise)
  All mammals can get Hick’s disease. (conclusion)
Task: Judge how likely the conclusion is to be true, given that the premises are true.
Inferring causal relations
Input:
          Took vitamin B23    Headache
  Day 1   yes                 no
  Day 2   yes                 yes
  Day 3   no                  yes
  Day 4   yes                 no
  . . .   . . .               . . .
Does vitamin B23 cause headaches?
Task: Judge the probability of a causal link given several joint observations.
Everyday inductive leaps
How can we learn so much about . . .
  – Properties of natural kinds
  – Meanings of words
  – Future outcomes of a dynamic process
  – Hidden causal properties of an object
  – Causes of a person’s action (beliefs, goals)
  – Causal laws governing a domain
. . . from such limited data?
The Challenge
• How do we generalize successfully from very limited data?
  – Just one or a few examples
  – Often only positive examples
• Philosophy:
  – Induction is a “problem”, a “riddle”, a “paradox”, a “scandal”, or a “myth”.
• Machine learning and statistics:
  – Focus on generalization from many examples, both positive and negative.
Rational statistical inference (Bayes, Laplace)

    p(h|d) = p(d|h) p(h) / Σ_{h′∈H} p(d|h′) p(h′)

Posterior probability = Likelihood × Prior probability, normalized by a sum over the space of hypotheses.
Bayesian models of inductive learning: some recent history
• Shepard (1987)
  – Analysis of one-shot stimulus generalization, to explain the universal exponential law.
• Anderson (1990)
  – Models of categorization and causal induction.
• Oaksford & Chater (1994)
  – Model of conditional reasoning (Wason selection task).
• Heit (1998)
  – Framework for category-based inductive reasoning.
Theory-Based Bayesian Models
• Rational statistical inference (Bayes):

    p(h|d) = p(d|h) p(h) / Σ_{h′∈H} p(d|h′) p(h′)

• Learners’ domain theories generate their hypothesis space H and prior p(h).
  – Well-matched to structure of the natural world.
  – Learnable from limited data.
  – Computationally tractable inference.
What is a theory?
• Working definition– An ontology and a system of abstract principles
that generates a hypothesis space of candidate world structures along with their relative probabilities.
• Analogy to grammar in language.
• Example: Newton’s laws
Structure and statistics
• A framework for understanding how structured knowledge and statistical inference interact.
  – How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. → Hierarchical Bayes.
  – How simplicity trades off with fit to the data in evaluating structural hypotheses. → Bayesian Occam’s Razor.
  – How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance. → Non-parametric Bayes.
Alternative approaches to inductive generalization
• Associative learning
• Connectionist networks
• Similarity to examples
• Toolkit of simple heuristics
• Constraint satisfaction
• Analogical mapping
Marr’s Three Levels of Analysis
• Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?”
• Representation and algorithm: Cognitive psychology
• Implementation: Neurobiology
Why Bayes?
• A framework for explaining cognition.
  – How people can learn so much from such limited data.
  – Why process-level models work the way that they do.
  – Strong quantitative models with minimal ad hoc assumptions.
• A framework for understanding how structured knowledge and statistical inference interact.
  – How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning.
  – How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam’s razor).
  – How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.
Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Bayes’ rule
For data D and a hypothesis H, we have:

    P(H|D) = P(D|H) P(H) / P(D)

• “Posterior probability”: P(H|D)
• “Prior probability”: P(H)
• “Likelihood”: P(D|H)
The origin of Bayes’ rule
• A simple consequence of using probability to represent degrees of belief
• For any two random variables:
    p(A & B) = p(A) p(B|A)
    p(A & B) = p(B) p(A|B)

    ⇒ p(B) p(A|B) = p(A) p(B|A)

    ⇒ p(A|B) = p(B|A) p(A) / p(B)
Why represent degrees of belief with probabilities?
• Good statistics
  – consistency, and worst-case error bounds.
• Cox Axioms
  – necessary to cohere with common sense.
• “Dutch Book” + Survival of the Fittest
  – if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
• Provides a theory of learning
  – a common currency for combining prior knowledge and the lessons of experience.
Bayes’ rule
For data D and a hypothesis H, we have:

    P(H|D) = P(D|H) P(H) / P(D)

• “Posterior probability”: P(H|D)
• “Prior probability”: P(H)
• “Likelihood”: P(D|H)
Hypotheses in Bayesian inference
• Hypotheses H refer to processes that could have generated the data D
• Bayesian inference provides a distribution over these hypotheses, given D
• P(D|H) is the probability of D being generated by the process identified by H
• Hypotheses H are mutually exclusive: only one process could have generated D
Hypotheses in coin flipping
• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...
Describe processes by which D could be generated (statistical models)
D = HHTHT
Hypotheses in coin flipping
• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...
Describe processes by which D could be generated (generative models)
D = HHTHT
Representing generative models
• Graphical model notation– Pearl (1988), Jordan (1998)
• Variables are nodes, edges indicate dependency
• Directed edges show causal process of data generation
D = HHTHT (d1 d2 d3 d4 d5)
Fair coin, P(H) = 0.5: d1, d2, d3, d4 independent
Markov model: d1 → d2 → d3 → d4
Models with latent structure
• Not all nodes in a graphical model need to be observed
• Some variables reflect latent structure, used in generating D but unobserved
D = HHTHT (d1 d2 d3 d4 d5)
Hidden Markov model: latent states s1 → s2 → s3 → s4, with each si → di
P(H) = p: latent parameter p → d1, d2, d3, d4
Coin flipping
• Comparing two simple hypotheses– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses– P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses– P(H) = p
• Psychology: Representativeness
Comparing two simple hypotheses
• Contrast simple hypotheses:
  – H1: “fair coin”, P(H) = 0.5
  – H2: “always heads”, P(H) = 1.0
• Bayes’ rule:

    P(H|D) = P(D|H) P(H) / P(D)

• With two hypotheses, use the odds form.
Bayes’ rule in odds form
    P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

• D: data
• H1, H2: models
• P(H1|D): posterior probability that H1 generated the data
• P(D|H1): likelihood of the data under model H1
• P(H1): prior probability that H1 generated the data
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
Comparing two simple hypotheses
    P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHTHT;  H1: “fair coin”, H2: “always heads”
P(D|H1) = 1/2^5      P(H1) = 999/1000
P(D|H2) = 0          P(H2) = 1/1000
P(H1|D) / P(H2|D) = infinity
Comparing two simple hypotheses
    P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHHHH;  H1: “fair coin”, H2: “always heads”
P(D|H1) = 1/2^5      P(H1) = 999/1000
P(D|H2) = 1          P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 30
Comparing two simple hypotheses
    P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHHHHHHHHH (10 heads);  H1: “fair coin”, H2: “always heads”
P(D|H1) = 1/2^10     P(H1) = 999/1000
P(D|H2) = 1          P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 1
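The three comparisons above can be checked with a few lines of code. This is a minimal sketch of the odds-form calculation, using the slides’ priors P(H1) = 999/1000, P(H2) = 1/1000; the function names are ours, not the tutorial’s.

```python
# Posterior odds for "fair coin" (H1) vs. "always heads" (H2).

def likelihood_fair(seq):
    """P(seq | fair coin): each flip has probability 1/2."""
    return 0.5 ** len(seq)

def likelihood_always_heads(seq):
    """P(seq | always heads): 1 if the sequence is all heads, else 0."""
    return 1.0 if set(seq) <= {"H"} else 0.0

def posterior_odds(seq, prior_h1=0.999, prior_h2=0.001):
    """P(H1|seq)/P(H2|seq) = [P(seq|H1)/P(seq|H2)] * [P(H1)/P(H2)].
    Returns infinity when H2 assigns the data zero probability."""
    l1, l2 = likelihood_fair(seq), likelihood_always_heads(seq)
    if l2 == 0.0:
        return float("inf")
    return (l1 / l2) * (prior_h1 / prior_h2)

print(posterior_odds("HHTHT"))    # infinity: "always heads" cannot produce a tail
print(posterior_odds("HHHHH"))    # ~31: the fair coin is still favored
print(posterior_odds("H" * 10))   # ~0.98: the two hypotheses are now about even
```

Note how the prior (999:1 in favor of a fair coin) is steadily eaten away as the run of heads grows, exactly as the slides describe.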
Comparing two simple hypotheses
• Bayes’ rule tells us how to combine prior beliefs with new data
  – top-down and bottom-up influences
• As a model of human inference
  – predicts conclusions drawn from data
  – identifies the point at which prior beliefs are overwhelmed by new experiences
• But… more complex cases?
Coin flipping
• Comparing two simple hypotheses– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses– P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses– P(H) = p
• Psychology: Representativeness
Comparing simple and complex hypotheses
• Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

    Fair coin, P(H) = 0.5 (d1 d2 d3 d4)   vs.   P(H) = p (p → d1 d2 d3 d4)

• P(H) = p is more complex than P(H) = 0.5 in two ways:
  – P(H) = 0.5 is a special case of P(H) = p
  – for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
Comparing simple and complex hypotheses
Comparing simple and complex hypotheses
[Plot: probability of each possible sequence of five flips]
Comparing simple and complex hypotheses
[Plot: sequence probabilities when p is chosen to maximize P(HHHHH): p = 1.0]
Comparing simple and complex hypotheses
[Plot: sequence probabilities when p is chosen to maximize P(HHTHT): p = 0.6]
• P(H) = p is more complex than P(H) = 0.5 in two ways:
  – P(H) = 0.5 is a special case of P(H) = p
  – for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
• How can we deal with this?
  – frequentist: hypothesis testing
  – information theorist: minimum description length
  – Bayesian: just use probability theory!
Comparing simple and complex hypotheses
    P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

Computing P(D|H1) is easy:

    P(D|H1) = 1/2^N

Compute P(D|H2) by averaging over p:

    P(D|H2) = ∫0^1 P(D|p) p(p) dp
Comparing simple and complex hypotheses
Comparing simple and complex hypotheses
[Plot: probability of each sequence under P(H) = p; the distribution is an average over all values of p]
• Simple and complex hypotheses can be compared directly using Bayes’ rule– requires summing over latent variables
• Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor”
• This principle is used in model selection methods in psychology (e.g. Myung & Pitt, 1997)
Comparing simple and complex hypotheses
Coin flipping
• Comparing two simple hypotheses– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses– P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses– P(H) = p
• Psychology: Representativeness
Comparing infinitely many hypotheses
• Assume data are generated from a model: P(H) = p (p → d1 d2 d3 d4)
• What is the value of p?
  – each value of p is a hypothesis H
  – requires inference over infinitely many hypotheses
• Flip a coin 10 times and see 5 heads, 5 tails.
• P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• “Future will be like the past.”

• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
Comparing infinitely many hypotheses
• Posterior distribution P(p | D) is a probability density over p = P(H)
• Need to work out the likelihood P(D | p) and specify the prior distribution P(p)

    P(p|D) = P(D|p) P(p) / P(D)
Integrating prior knowledge and data
    P(p | D) ∝ P(D | p) P(p)
Likelihood and prior
• Likelihood:

    P(D | p) = p^NH (1−p)^NT

  – NH: number of heads
  – NT: number of tails
• Prior:

    P(p) ∝ p^(FH−1) (1−p)^(FT−1) ?
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a set of previous experiences– strategy often used with neural networks
• e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...
Likelihood and prior
• Likelihood:

    P(D | p) = p^NH (1−p)^NT

  – NH: number of heads
  – NT: number of tails
• Prior:

    P(p) ∝ p^(FH−1) (1−p)^(FT−1)    [Beta(FH, FT)]

  – FH: fictitious observations of heads
  – FT: fictitious observations of tails
Conjugate priors
• Exist for many standard distributions– formula for exponential family conjugacy
• Define prior in terms of fictitious observations
• Beta is conjugate to Bernoulli (coin-flipping)
[Plots: Beta priors with FH = FT = 1, FH = FT = 3, and FH = FT = 1000]
Likelihood and prior
• Likelihood:

    P(D | p) = p^NH (1−p)^NT

  – NH: number of heads
  – NT: number of tails
• Prior:

    P(p) ∝ p^(FH−1) (1−p)^(FT−1)    [Beta(FH, FT)]

  – FH: fictitious observations of heads
  – FT: fictitious observations of tails
Comparing infinitely many hypotheses
• Posterior is Beta(NH+FH, NT+FT)
  – same form as conjugate prior

    P(p | D) ∝ P(D | p) P(p) ∝ p^(NH+FH−1) (1−p)^(NT+FT−1)

• Posterior mean: (NH+FH) / (NH+FH+NT+FT)
• Posterior predictive distribution: the probability of heads on the next flip equals the posterior mean.
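The conjugate update is just addition of counts, which makes it easy to sketch in code. Function names here are ours, used for illustration only.

```python
# Beta-Bernoulli conjugacy: with prior p ~ Beta(FH, FT) and data
# containing NH heads and NT tails, the posterior is
# Beta(NH + FH, NT + FT) -- the same family as the prior.

def posterior_params(n_h, n_t, f_h, f_t):
    """Return (alpha, beta) of the Beta posterior over p."""
    return n_h + f_h, n_t + f_t

def posterior_mean(n_h, n_t, f_h, f_t):
    """Posterior mean of p; also the predictive P(heads) on the next flip."""
    a, b = posterior_params(n_h, n_t, f_h, f_t)
    return a / (a + b)

# Strong fair-coin prior, F = {1000 heads, 1000 tails}, data = 4 heads, 6 tails:
print(posterior_params(4, 6, 1000, 1000))  # (1004, 1006)
print(posterior_mean(4, 6, 1000, 1000))    # ~0.4995
```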
Some examples
• e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95%
• e.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%
Prior knowledge too weak.
But… flipping thumbtacks
• e.g., F ={4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67%
• Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
• Suppose F = { }: After seeing 2 heads, 0 tails, P(H) on next flip = 2 / (2+0) = 100%
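All of the slides’ fictitious-observation examples, including the degenerate empty prior, follow from one formula. A minimal sketch (hypothetical helper name):

```python
def predictive_heads(n_h, n_t, f_h, f_t):
    """P(heads on next flip) = (NH + FH) / (NH + NT + FH + FT)."""
    total = n_h + n_t + f_h + f_t
    return (n_h + f_h) / total if total > 0 else None

print(predictive_heads(4, 6, 1000, 1000))  # 0.4995: strong fair-coin prior
print(predictive_heads(4, 6, 3, 3))        # 0.4375: weak fair-coin prior
print(predictive_heads(2, 0, 4, 3))        # ~0.667: weak thumbtack prior
print(predictive_heads(2, 0, 0, 0))        # 1.0: no prior at all -> hasty 100%
```

The last line is the hasty conclusion from the slide: with F = { }, two heads in a row already imply certainty.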
Origin of prior knowledge
• Tempting answer: prior experience
• Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails
• By assuming all coins (and flips) are alike, these observations of other coins are as good as observations of the present coin
Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any flips of a thumbtack– Prior knowledge is stronger than raw experience justifies
• Haven’t seen exactly equal number of heads and tails– Prior knowledge is smoother than raw experience justifies
• Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins– Prior knowledge is more structured than raw experience
A simple theory
• “Coins are manufactured by a standardized procedure that is effective but not perfect.” – Justifies generalizing from previous coins to the present
coin.
– Justifies smoother and stronger prior than raw experience alone.
– Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.
• “Tacks are asymmetric, and manufactured to less exacting standards.”
Limitations
• Can all domain knowledge be represented so simply, in terms of an equivalent number of fictional observations?
• Suppose you flip a coin 25 times and get all heads. Something funny is going on…
• But with F ={1000 heads, 1000 tails}, P(H) on next flip = 1025 / (1025+1000) = 50.6%.
Looks like nothing unusual
Hierarchical priors
• Higher-order hypothesis: is this coin fair or unfair?
• Example probabilities:
  – P(fair) = 0.99
  – P(p|fair) is Beta(1000,1000)
  – P(p|unfair) is Beta(1,1)
• 25 heads in a row propagates up, affecting p and then P(fair|D)

[Graphical model: fair → p → d1 d2 d3 d4]

    P(fair|25 heads) / P(unfair|25 heads)
      = [P(25 heads|fair) / P(25 heads|unfair)] × [P(fair) / P(unfair)]
      ≈ 9 × 10^−5
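The ≈ 9 × 10^−5 figure can be reproduced by integrating p out under each higher-order hypothesis, using marginal likelihoods expressed through the Beta function. A sketch with our own helper names:

```python
from math import lgamma, exp

# Posterior odds of "fair" vs. "unfair" after 25 heads in a row:
# P(fair) = 0.99, p|fair ~ Beta(1000, 1000), p|unfair ~ Beta(1, 1).

def log_beta(a, b):
    """log of the Beta function B(a, b), via log-gamma for stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(n_h, n_t, f_h, f_t):
    """P(data | Beta(f_h, f_t) prior on p), with p integrated out:
    B(n_h + f_h, n_t + f_t) / B(f_h, f_t)."""
    return exp(log_beta(n_h + f_h, n_t + f_t) - log_beta(f_h, f_t))

odds = (marginal_likelihood(25, 0, 1000, 1000) * 0.99) / \
       (marginal_likelihood(25, 0, 1, 1) * 0.01)
print(odds)  # ~9e-5: "unfair" is now the far better explanation
```

Despite the 99:1 prior in favor of fairness, 25 heads is enough to overturn it, which is exactly the propagation the slide describes.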
More hierarchical priors
• Latent structure can capture coin variability
• 10 flips from 200 coins is better than 2000 flips from a single coin: allows estimation of FH, FT

[Graphical model: FH,FT → one p per coin → d1 … d4, for Coin 1, Coin 2, …, Coin 200; each p ~ Beta(FH,FT)]
Yet more hierarchical priors
• Discrete beliefs (e.g. symmetry) can influence estimation of continuous properties (e.g. FH, FT)

[Graphical model: physical knowledge → FH,FT → one p per coin → d1 … d4]
• Apply Bayes’ rule to obtain posterior probability density
• Requires prior over all hypotheses– computation simplified by conjugate priors– richer structure with hierarchical priors
• Hierarchical priors indicate how simple theories can inform statistical inferences– one step towards structure and statistics
Comparing infinitely many hypotheses
Coin flipping
• Comparing two simple hypotheses– P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses– P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses– P(H) = p
• Psychology: Representativeness
Psychology: Representativeness
Which sequence is more likely from a fair coin?
HHTHT
HHHHH
more representative of a fair coin
(Kahneman & Tversky, 1972)
What might representativeness mean?
    P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

H1: random process (fair coin)
H2: alternative processes

Evidence for a random generating process: the likelihood ratio.
A constrained hypothesis space
Four hypotheses:
h1 fair coin HHTHTTTH
h2 “always alternates” HTHTHTHT
h3 “mostly heads” HHTHTHHH
h4 “always heads” HHHHHHHH
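The constrained hypothesis space above can be turned into a toy representativeness score. This is only a sketch: it uses the parameter values reported on the results slide (alternation reliable 99% of the time, “mostly heads” P(H) = 0.85, “always heads” P(H) = 0.99) and assumes, for illustration, equal weights on the alternative processes; the actual model in Tenenbaum & Griffiths (2001) differs in its details.

```python
from math import log

def p_fair(seq):
    return 0.5 ** len(seq)

def p_alternating(seq, stick=0.99):
    # First flip is 50/50; afterwards the sequence alternates with prob. `stick`.
    p = 0.5
    for prev, cur in zip(seq, seq[1:]):
        p *= stick if cur != prev else 1 - stick
    return p

def p_biased(seq, p_heads):
    return p_heads ** seq.count("H") * (1 - p_heads) ** seq.count("T")

def representativeness(seq):
    """Log likelihood ratio: fair coin vs. (equally weighted) alternatives."""
    alternatives = [p_alternating(seq), p_biased(seq, 0.85),
                    p_biased(seq, 0.99)]
    return log(p_fair(seq) / (sum(alternatives) / len(alternatives)))

for seq in ["HHTHTTTH", "HTHTHTHT", "HHTHTHHH", "HHHHHHHH"]:
    print(seq, representativeness(seq))
```

Even this crude version gets the ordering right: the irregular sequence scores positive (representative of a fair coin), while the all-heads and perfectly alternating sequences score negative, because some alternative process explains them far better.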
Representativeness judgments
Results
• Good account of representativeness data, with three pseudo-free parameters, r = 0.91
  – “always alternates” means 99% of the time
  – “mostly heads” means P(H) = 0.85
  – “always heads” means P(H) = 0.99
• With scaling parameter, r = 0.95
(Tenenbaum & Griffiths, 2001)
The role of theories
The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works.
  – Easy to imagine how a trick all-heads coin could work: high prior probability.
  – Hard to imagine how a trick “HHTHT” coin could work: low prior probability.
Summary
• Three kinds of Bayesian inference– comparing two simple hypotheses– comparing simple and complex hypotheses– comparing an infinite number of hypotheses
• Critical notions:– generative models, graphical models– Bayesian Occam’s razor– priors: conjugate, hierarchical (theories)
Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
Rules and similarity
Structure versus statistics
RulesLogicSymbols
StatisticsSimilarityTypicality
A better metaphor
Structure and statistics
RulesLogicSymbols
StatisticsSimilarityTypicality
Structure and statistics
• Basic case study #1: Flipping coins– Learning and reasoning with structured
statistical models.
• Basic case study #2: Rules and similarity– Statistical learning with structured
representations.
The number game
• Program input: number between 1 and 100
• Program output: “yes” or “no”
The number game
• Learning task:– Observe one or more positive (“yes”) examples.– Judge whether other numbers are “yes” or “no”.
The number game
Examples of “yes” numbers → Generalization judgments (N = 20)
  60 → Diffuse similarity
The number game
Examples of “yes” numbers → Generalization judgments (N = 20)
  60 → Diffuse similarity
  60 80 10 30 → Rule: “multiples of 10”
The number game
Examples of “yes” numbers → Generalization judgments (N = 20)
  60 → Diffuse similarity
  60 80 10 30 → Rule: “multiples of 10”
  60 52 57 55 → Focused similarity: numbers near 50-60
The number game
Examples of “yes” numbers → Generalization judgments (N = 20)
  16 → Diffuse similarity
  16 8 2 64 → Rule: “powers of 2”
  16 23 19 20 → Focused similarity: numbers near 20
The number game
Main phenomena to explain:
  – Generalization can appear either similarity-based (graded) or rule-based (all-or-none).
  – Learning from just a few positive examples.
Examples of “yes” numbers → Generalization judgments
  60 → Diffuse similarity
  60 80 10 30 → Rule: “multiples of 10”
  60 52 57 55 → Focused similarity: numbers near 50-60
Rule/similarity hybrid models
• Category learning– Nosofsky, Palmeri et al.: RULEX– Erickson & Kruschke: ATRIUM
Divisions into “rule” and “similarity” subsystems
• Category learning– Nosofsky, Palmeri et al.: RULEX– Erickson & Kruschke: ATRIUM
• Language processing– Pinker, Marcus et al.: Past tense morphology
• Reasoning– Sloman – Rips– Nisbett, Smith et al.
Rule/similarity hybrid models
• Why two modules?
• Why do these modules work the way that they do, and interact as they do?
• How do people infer a rule or similarity metric from just a few positive examples?
• H: Hypothesis space of possible concepts:– h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”)
– h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”)
– h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”)
– h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”)
– . . .
Bayesian model
Representational interpretations for H:– Candidate rules
– Features for similarity
– “Consequential subsets” (Shepard, 1987)
Inferring hypotheses from similarity judgment
Additive clustering (Shepard & Arabie, 1977):

    s_ij = Σ_k w_k f_ik f_jk

  s_ij: similarity of stimuli i, j
  w_k: weight of cluster k
  f_ik: membership of stimulus i in cluster k (1 if stimulus i is in cluster k, 0 otherwise)

Equivalent to similarity as a weighted sum of common features (Tversky, 1977).
Additive clustering for the integers 0-9:

    s_ij = Σ_k w_k f_ik f_jk

Rank  Weight  Stimuli in cluster  Interpretation
1     .444    2 4 8               powers of two
2     .345    0 1 2               small numbers
3     .331    3 6 9               multiples of three
4     .291    6 7 8 9             large numbers
5     .255    2 3 4 5 6           middle numbers
6     .216    1 3 5 7 9           odd numbers
7     .214    1 2 3 4             smallish numbers
8     .172    4 5 6 7 8           largish numbers
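Computing similarity from such a clustering is a one-line sum over shared clusters. The sketch below uses an illustrative two-cluster toy, not the full fitted solution; the function name is ours.

```python
# Additive-clustering similarity: s_ij = sum over clusters k that
# contain both i and j of the cluster weight w_k.

def additive_similarity(i, j, clusters):
    """clusters: list of (weight, set_of_members) pairs."""
    return sum(w for w, members in clusters
               if i in members and j in members)

clusters = [(0.444, {2, 4, 8}),   # "powers of two" (illustrative)
            (0.345, {0, 1, 2})]   # "small numbers" (illustrative)

print(additive_similarity(2, 4, clusters))  # 0.444: one shared cluster
print(additive_similarity(1, 2, clusters))  # 0.345
print(additive_similarity(2, 2, clusters))  # ~0.789: 2 is in both clusters
print(additive_similarity(3, 5, clusters))  # 0.0: no shared cluster
```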
Three hypothesis subspaces for number concepts
• Mathematical properties (24 hypotheses): – Odd, even, square, cube, prime numbers– Multiples of small integers– Powers of small integers
• Raw magnitude (5050 hypotheses): – All intervals of integers with endpoints between 1 and
100.
• Approximate magnitude (10 hypotheses):– Decades (1-10, 10-20, 20-30, …)
Hypothesis spaces and theories
• Why a hypothesis space is like a domain theory:
  – Represents one particular way of classifying entities in a domain.
  – Not just an arbitrary collection of hypotheses, but a principled system.
• What’s missing?
  – Explicit representation of the principles.
• Hypothesis spaces (and priors) are generated by theories. Some analogies:
  – Grammars generate languages (and priors over structural descriptions).
  – Hierarchical Bayesian modeling.
Bayesian model
• H: Hypothesis space of possible concepts:
  – Mathematical properties: even, odd, square, prime, . . . .
  – Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
  – Raw magnitude: all intervals between 1 and 100.
• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:

    p(h|X) = p(X|h) p(h) / p(X)

  – p(h) [“prior”]: domain knowledge, pre-existing biases
  – p(X|h) [“likelihood”]: statistical information in examples.
  – p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Bayesian model
• H: Hypothesis space of possible concepts:
  – Mathematical properties: even, odd, square, prime, . . . .
  – Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
  – Raw magnitude: all intervals between 1 and 100.
• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:

    p(h|X) = p(X|h) p(h) / Σ_{h′∈H} p(X|h′) p(h′)

  – p(h) [“prior”]: domain knowledge, pre-existing biases
  – p(X|h) [“likelihood”]: statistical information in examples.
  – p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases.

    p(X|h) = [1 / size(h)]^n   if x1, …, xn ∈ h
           = 0                 if any xi ∉ h

• Follows from the assumption of randomly sampled examples.
• Captures the intuition of a representative sample.
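The size principle is simple enough to state directly in code. A minimal sketch, using the number-game hypotheses from earlier slides (our function name):

```python
# Size principle: p(X|h) = (1/|h|)^n when all n examples fall inside h,
# and 0 otherwise. Smaller consistent hypotheses win, exponentially in n.

def size_principle_likelihood(examples, hypothesis):
    if all(x in hypothesis for x in examples):
        return (1 / len(hypothesis)) ** len(examples)
    return 0.0

even = set(range(2, 101, 2))        # "even numbers": 50 members
mult10 = set(range(10, 101, 10))    # "multiples of 10": 10 members

X = [60, 80, 10, 30]
print(size_principle_likelihood(X, even))    # (1/50)^4 = 1.6e-7
print(size_principle_likelihood(X, mult10))  # (1/10)^4 = 1e-4
# "multiples of 10" is 625x more likely than "even numbers" to have generated X
```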
Illustrating the size principle

[Number line 2-100 showing hypotheses h1 and h2]
Illustrating the size principle

[Number line 2-100 showing hypotheses h1 and h2]
Data slightly more of a coincidence under h1
Illustrating the size principle

[Number line 2-100 showing hypotheses h1 and h2]
Data much more of a coincidence under h1
Bayesian Occam’s Razor

[Plot: p(D = d | M) over all possible data sets d, for models M1 and M2]

For any model M,

    Σ_{all d ∈ D} p(D = d | M) = 1

Law of “Conservation of Belief”
Comparing simple and complex hypotheses

[Plot: probability of each sequence under P(H) = p; the distribution is an average over all values of p]
Prior: p(h)
• Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”.
Prior: p(h)
• Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”.
• p(h) encodes relative weights of alternative theories:
H1: Math properties (24)
• even numbers• powers of two• multiples of three ….
H2: Raw magnitude (5050)
• 10-15• 20-32• 37-54 ….
H3: Approx. magnitude (10)
• 10-20• 20-30• 30-40 ….
H: Total hypothesis space
  p(H1) = 1/5    p(H2) = 3/5    p(H3) = 1/5
  p(h) = p(H1) / 24    p(h) = p(H2) / 5050    p(h) = p(H3) / 10
A more complex approach to priors
• Start with a base set of regularities R and combination operators C.
• Hypothesis space = closure of R under C.
– C = {and, or}: H = unions and intersections of regularities in R (e.g., “multiples of 10 between 30 and 70”).
– C = {and-not}: H = regularities in R with exceptions (e.g., “multiples of 10 except 50 and 70”).
• Two qualitatively similar priors:
– Description length: number of combinations in C needed to generate hypothesis from R.
– Bayesian Occam’s Razor, with model classes defined by number of combinations: more combinations → more hypotheses → lower prior
Posterior: p(h|X)

    p(h|X) = p(X|h) p(h) / Σ_{h′∈H} p(X|h′) p(h′)

• X = {60, 80, 10, 30}
• Why prefer “multiples of 10” over “even numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of 10 except 50 and 20”? p(h).
• Why does a good generalization need both high prior and high likelihood? p(h|X) ∝ p(X|h) p(h)

Bayesian Occam’s Razor
Probabilities provide a common currency for balancing model complexity with fit to the data.
Generalizing to new objects
Given p(h|X), how do we compute p(y ∈ C | X), the probability that C applies to some new stimulus y?

Generalizing to new objects
Hypothesis averaging:
Compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X):

    p(y ∈ C | X) = Σ_{h∈H} p(y ∈ C | h) p(h|X)

where p(y ∈ C | h) = 1 if y ∈ h, and 0 otherwise. Equivalently:

    p(y ∈ C | X) = Σ_{h ⊇ {y, X}} p(h|X)
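Hypothesis averaging combines with the size principle to give the full number-game prediction. A minimal sketch over a tiny three-hypothesis space with a uniform prior (the real model's hypothesis space and prior are richer; all names here are ours):

```python
# Number-game generalization: p(y in C | X) is the total posterior mass
# of hypotheses that contain both the new item y and all the examples X.

hypotheses = {
    "even":        set(range(2, 101, 2)),
    "mult of 10":  set(range(10, 101, 10)),
    "powers of 2": {2, 4, 8, 16, 32, 64},
}

def posterior(X, hyps):
    """Uniform prior over hypotheses; size-principle likelihood."""
    prior = 1 / len(hyps)
    scores = {}
    for name, h in hyps.items():
        consistent = all(x in h for x in X)
        scores[name] = prior * (1 / len(h)) ** len(X) if consistent else 0.0
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

def p_in_concept(y, X, hyps):
    post = posterior(X, hyps)
    return sum(p for name, p in post.items() if y in hyps[name])

print(p_in_concept(32, [16], hypotheses))           # ~1.0: in every viable hypothesis
print(p_in_concept(90, [16], hypotheses))           # ~0.11: only "even" covers it
print(p_in_concept(90, [16, 8, 2, 64], hypotheses)) # ~0: "powers of 2" now dominates
```

With one example the posterior is spread out and generalization is graded; after four examples one small hypothesis grabs almost all the mass and generalization looks all-or-none, which is the phenomenon the slides set out to explain.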
Examples: 16
Connection to feature-based similarity
• Additive clustering model of similarity:

    s_ij = Σ_k w_k f_ik f_jk

• Bayesian hypothesis averaging:

    p(y ∈ C | X) = Σ_{h ⊇ {y, X}} p(h|X)

• Equivalent if we identify features f_k with hypotheses h, and weights w_k with p(h|X).
Examples: 16 8 2 64
Examples: 16 23 19 20
Model fits
Examples of “yes” numbers: 60;  60 80 10 30;  60 52 57 55
Generalization judgments (N = 20) vs. Bayesian Model (r = 0.96)
Model fits
Examples of “yes” numbers: 16;  16 8 2 64;  16 23 19 20
Generalization judgments (N = 20) vs. Bayesian Model (r = 0.93)
Summary of the Bayesian model
• How do the statistics of the examples interact with prior knowledge to guide generalization?
• Why does generalization appear rule-based or similarity-based?
  prior;  likelihood: size principle;  posterior: hypothesis averaging
  broad p(h|X): similarity gradient;  narrow p(h|X): all-or-none rule
Summary of the Bayesian model
• How do the statistics of the examples interact with prior knowledge to guide generalization?
• Why does generalization appear rule-based or similarity-based?
  prior;  likelihood: size principle;  posterior: hypothesis averaging
  Many h of similar size: broad p(h|X);  One h much smaller: narrow p(h|X)
Alternative models
• Neural networks

[Network diagram: inputs 60, 80, 10, 30 → features: even, multiple of 10, power of 2, multiple of 3]
Alternative models
• Neural networks
• Hypothesis ranking and elimination

[Network diagram: inputs 60, 80, 10, 30 → features: even, multiple of 10, power of 2, multiple of 3]
Hypothesis ranking: 1 2 3 4 ….
Data vs. Model (r = 0.80)
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
  – Average similarity:

    p(y ∈ C | X) = (1/|X|) Σ_{xj ∈ X} sim(y, xj)

Examples: 60;  60 80 10 30;  60 52 57 55
Data vs. Model (r = 0.64)
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
  – Max similarity:

    p(y ∈ C | X) = max_{xj ∈ X} sim(y, xj)

Examples: 60;  60 80 10 30;  60 52 57 55
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars– Average similarity– Max similarity– Flexible similarity? Bayes.
Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
• Toolbox of simple heuristics– 60: “general” similarity– 60 80 10 30: most specific rule (“subset principle”).– 60 52 57 55: similarity in magnitude
Why these heuristics? When to use which heuristic? Bayes.
Summary
• Generalization from limited data is possible via the interaction of structured knowledge and statistics.
  – Structured knowledge: space of candidate rules; theories generate the hypothesis space (cf. hierarchical priors).
  – Statistics: Bayesian Occam’s razor.
• Better understand the interactions between traditionally opposing concepts:
  – Rules and statistics
  – Rules and similarity
  – Rules and representativeness
• Explains why central but notoriously slippery processing-level concepts work the way they do.
  – Similarity
  – Representativeness
Why Bayes?
• A framework for explaining cognition.
  – How people can learn so much from such limited data.
  – Why process-level models work the way that they do.
  – Strong quantitative models with minimal ad hoc assumptions.
• A framework for understanding how structured knowledge and statistical inference interact.
  – How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning.
  – How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam’s razor).
  – How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance.
• Rational statistical inference (Bayes):
• Learners’ domain theories generate their hypothesis space H and prior p(h). – Well-matched to structure of the natural world.– Learnable from limited data. – Computationally tractable inference.
  p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)
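The Bayes rule above can be sketched in a few lines of Python; the two-hypothesis space and its priors and likelihoods below are illustrative numbers, not values from the tutorial:

```python
def posterior(priors, likelihoods):
    """p(h|d) = p(d|h) p(h) / sum over h' of p(d|h') p(h')."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    z = sum(joint.values())  # normalizing constant: sum over the hypothesis space
    return {h: j / z for h, j in joint.items()}

# Two hypotheses with equal priors; the data are 4x more likely under h1.
post = posterior(priors={"h1": 0.5, "h2": 0.5},
                 likelihoods={"h1": 0.8, "h2": 0.2})
print(post)  # h1 -> 0.8, h2 -> 0.2
```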
Theory-Based Bayesian Models
Looking towards the afternoon
• How do we apply these ideas to more natural and complex aspects of cognition?
• Where do the hypothesis spaces come from?
• Can we formalize the contributions of domain theories?
Outline• Morning
– Introduction (Josh) – Basic case study #1: Flipping coins (Tom)– Basic case study #2: Rules and similarity (Josh)
• Afternoon– Advanced case study #1: Causal induction (Tom)– Advanced case study #2: Property induction (Josh)– Quick tour of more advanced topics (Tom)
Outline• Morning
– Introduction (Josh) – Basic case study #1: Flipping coins (Tom)– Basic case study #2: Rules and similarity (Josh)
• Afternoon– Advanced case study #1: Causal induction (Tom)– Advanced case study #2: Property induction (Josh)– Quick tour of more advanced topics (Tom)
Marr’s Three Levels of Analysis
• Computation: “What is the goal of the computation, why is it
appropriate, and what is the logic of the strategy by which it can be carried out?”
• Representation and algorithm: Cognitive psychology
• Implementation: Neurobiology
Working at the computational level
• What is the computational problem?– input: data– output: solution
statistical
Working at the computational level
• What is the computational problem?– input: data– output: solution
• What knowledge is available to the learner?
• Where does that knowledge come from?
statistical
• Rational statistical inference (Bayes):
• Learners’ domain theories generate their hypothesis space H and prior p(h). – Well-matched to structure of the natural world.– Learnable from limited data. – Computationally tractable inference.
  p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)
Theory-Based Bayesian Models
Causality
Bayes nets and beyond...
• Increasingly popular approach to studying human causal inferences
(e.g. Glymour, 2001; Gopnik et al., 2004)
• Three reactions:– Bayes nets are the solution!– Bayes nets are missing the point, not sure why…– what is a Bayes net?
Bayes nets and beyond...
• What are Bayes nets?– graphical models– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…– other knowledge in causal induction– formalizing causal theories
Bayes nets and beyond...
• What are Bayes nets?– graphical models– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…– other knowledge in causal induction– formalizing causal theories
Graphical models
• Express the probabilistic dependency structure among a set of variables (Pearl, 1988)
• Consist of– a set of nodes, corresponding to variables– a set of edges, indicating dependency– a set of functions defined on the graph that
defines a probability distribution
Undirected graphical models
• Consist of– a set of nodes– a set of edges– a potential for each clique, multiplied together to yield the distribution over variables
• Examples– statistical physics: Ising model, spinglasses– early neural networks (e.g. Boltzmann machines)
[Figure: example undirected graph over X1–X5]
Directed graphical models
[Figure: example directed acyclic graph over X1–X5]
• Consist of– a set of nodes– a set of edges– a conditional probability distribution for each node, conditioned on its parents, multiplied
together to yield the distribution over variables
• Constrained to directed acyclic graphs (DAG)• AKA: Bayesian networks, Bayes nets
Bayesian networks and Bayes
• Two different problems– Bayesian statistics is a method of inference– Bayesian networks are a form of representation
• There is no necessary connection– many users of Bayesian networks rely upon
frequentist statistical methods (e.g. Glymour)– many Bayesian inferences cannot be easily
represented using Bayesian networks
Properties of Bayesian networks
• Efficient representation and inference– exploiting dependency structure makes it easier
to represent and compute with probabilities
• Explaining away– pattern of probabilistic reasoning characteristic
of Bayesian networks, especially early use in AI
Efficient representation and inference
• Three binary variables: Cavity, Toothache, Catch
• Specifying P(Cavity, Toothache, Catch) requires 7 parameters (1 for each set of values, minus 1 because it’s a probability distribution)
• With n variables, we need 2^n – 1 parameters• Here n = 3. Realistically, many more: X-ray, diet,
oral hygiene, personality, . . . .
Conditional independence
• All three variables are dependent, but Toothache and Catch are independent given the presence or absence of Cavity
• In probabilistic terms:
  P(ache, catch | cav) = P(ache | cav) P(catch | cav)
  P(ache, catch | ¬cav) = P(ache | ¬cav) P(catch | ¬cav)
• With n evidence variables x1, …, xn, we need only 2n conditional probabilities: P(xi | cav), P(xi | ¬cav)
• Graphical representation of relations between a set of random variables:
• Probabilistic interpretation: factorizing complex terms
A simple Bayesian network
[Figure: Cavity → Toothache, Cavity → Catch]
  P(Ache, Catch, Cav) = P(Ache, Catch | Cav) P(Cav) = P(Ache | Cav) P(Catch | Cav) P(Cav)
In general:
  P(A, B, C) = Π_{V ∈ {A, B, C}} P(V | parents[V])
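A quick check of the factorization on this slide; the conditional probability tables below are made-up illustrative numbers, not values from the tutorial:

```python
# Network: Cavity -> Toothache, Cavity -> Catch.
p_cav = {True: 0.1, False: 0.9}       # P(Cav)
p_ache = {True: 0.7, False: 0.05}     # P(Ache=1 | Cav)
p_catch = {True: 0.8, False: 0.1}     # P(Catch=1 | Cav)

def bern(p, v):
    return p if v else 1 - p

def joint(ache, catch, cav):
    # P(Ache, Catch, Cav) = P(Ache | Cav) P(Catch | Cav) P(Cav)
    return bern(p_ache[cav], ache) * bern(p_catch[cav], catch) * p_cav[cav]

# 5 parameters instead of 2**3 - 1 = 7, and the 8 joint entries still sum to 1.
total = sum(joint(a, c, v)
            for a in (True, False) for c in (True, False) for v in (True, False))
print(round(total, 10))  # 1.0
```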
• Joint distribution sufficient for any inference:
A more complex system
[Figure: Battery → Radio, Battery → Ignition, Ignition → Starts, Gas → Starts, Starts → On time to work]
  P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)
  P(O | G) = P(O, G) / P(G) = Σ_{B,R,I,S} P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S) / P(G)
• Joint distribution sufficient for any inference:
A more complex system
[Figure: Battery → Radio, Battery → Ignition, Ignition → Starts, Gas → Starts, Starts → On time to work]
  P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)
  P(O | G) = P(O, G) / P(G) = Σ_{B,I,S} P(B) P(I|B) P(S|I,G) P(O|S)
• Joint distribution sufficient for any inference:
• General inference algorithm: local message passing (belief propagation; Pearl, 1988)– efficiency depends on sparseness of graph structure
A more complex system
[Figure: Battery → Radio, Battery → Ignition, Ignition → Starts, Gas → Starts, Starts → On time to work]
  P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)
• Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
Explaining away
[Figure: Rain → Grass Wet ← Sprinkler]
  P(R, S, W) = P(R) P(S) P(W | R, S)
  P(W = w | R, S) = 1 if R = r or S = s;  0 if R = ¬r and S = ¬s.
Compute probability it rained last night, given that the grass is wet:
  P(r | w) = P(w | r) P(r) / P(w)
           = P(w | r) P(r) / Σ_{R,S} P(w | R, S) P(R, S)
           = P(r) / [P(r, s) + P(r, ¬s) + P(¬r, s)]
           = P(r) / [P(r) + P(¬r, s)]
           = P(r) / [P(r) + P(¬r) P(s)]
  This lies between P(r) and 1.
Compute probability it rained last night, given that the grass is wet and sprinklers were left on:
  P(r | w, s) = P(w | r, s) P(r | s) / P(w | s) = P(r)      (both P(w | r, s) and P(w | s) equal 1)
"Discounting" to prior probability: P(r | w, s) = P(r) < P(r | w).
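The explaining-away pattern can be verified by brute-force enumeration; P(rain) = 0.3 and P(sprinkler) = 0.5 below are illustrative values, not from the slides:

```python
from itertools import product

p_rain, p_spr = 0.3, 0.5

def joint(r, s, w):
    # P(R, S, W) = P(R) P(S) P(W | R, S), with W = (R or S) deterministically.
    pw = 1.0 if w == (r or s) else 0.0
    return (p_rain if r else 1 - p_rain) * (p_spr if s else 1 - p_spr) * pw

def cond(query, given):
    # Conditional probability by enumerating the 8 possible worlds.
    worlds = list(product([True, False], repeat=3))
    num = sum(joint(*v) for v in worlds if query(*v) and given(*v))
    den = sum(joint(*v) for v in worlds if given(*v))
    return num / den

p_r_w = cond(lambda r, s, w: r, lambda r, s, w: w)          # P(r | w)
p_r_ws = cond(lambda r, s, w: r, lambda r, s, w: w and s)   # P(r | w, s)
print(p_r_w, p_r_ws)
# P(r | w) = 0.3 / (0.3 + 0.7 * 0.5) ~ 0.46; P(r | w, s) = 0.3: explaining away
```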
• Formulate IF-THEN rules:– IF Rain THEN Wet– IF Wet THEN Rain
• Rules do not distinguish directions of inference• Requires combinatorial explosion of rules
Contrast w/ production system
Rain
Grass Wet
Sprinkler
IF Wet AND NOT Sprinkler THEN Rain
• Observing rain, Wet becomes more active. • Observing grass wet, Rain and Sprinkler become more active.• Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!
• Excitatory links: Rain → Wet, Sprinkler → Wet
Contrast w/ spreading activation
Rain Sprinkler
Grass Wet
• Observing grass wet, Rain and Sprinkler become more active.• Observing grass wet and sprinkler, Rain becomes less active: explaining away.
• Excitatory links: Rain → Wet, Sprinkler → Wet• Inhibitory link between Rain and Sprinkler
Contrast w/ spreading activation
Rain Sprinkler
Grass Wet
• Each new variable requires more inhibitory connections.• Interactions between variables are not causal.• Not modular.
– Whether a connection exists depends on what other connections exist, in non-transparent ways. – Big holism problem. – Combinatorial explosion.
Contrast w/ spreading activationRain
Sprinkler
Grass Wet
Burst pipe
Graphical models
• Capture dependency structure in distributions
• Provide an efficient means of representing and reasoning with probabilities
• Allow kinds of inference that are problematic for other representations: explaining away– hard to capture in a production system– hard to capture with spreading activation
Bayes nets and beyond...
• What are Bayes nets?– graphical models– causal graphical models
• An example: causal induction
• Beyond Bayes nets…– other knowledge in causal induction– formalizing causal theories
Causal graphical models
• Graphical models represent statistical dependencies among variables (i.e., correlations)– can answer questions about observations
• Causal graphical models represent causal dependencies among variables– express underlying causal structure– can answer questions about both observations and
interventions (actions upon a variable)
Observation and interventionBattery
Radio Ignition Gas
Starts
On time to work
Graphical model: P(Radio|Ignition)
Causal graphical model: P(Radio|do(Ignition))
Observation and interventionBattery
Radio Ignition Gas
Starts
On time to work
Graphical model: P(Radio|Ignition)
Causal graphical model: P(Radio|do(Ignition))
“graph surgery” produces “mutilated graph”
Assessing interventions
• To compute P(Y|do(X=x)), delete all edges coming into X and reason with the resulting Bayesian network (“do calculus”; Pearl, 2000)
• Allows a single structure to make predictions about both observations and interventions
• Using a representation in which the direction of causality is correct produces sparser graphs• Suppose we get the direction of causality wrong, thinking that "symptoms" cause "diseases":
• Does not capture the correlation between symptoms: falsely believe P(Ache, Catch) = P(Ache) P(Catch).
Causality simplifies inference
Ache Catch
Cavity
• Using a representation in which the direction of causality is correct produces sparser graphs• Suppose we get the direction of causality wrong, thinking that “symptoms” causes “diseases”:
• Inserting a new arrow allows us to capture this correlation.• This model is too complex: it no longer captures the conditional independence
  P(Ache, Catch | Cav) = P(Ache | Cav) P(Catch | Cav)
[Figure: Ache → Cavity, Catch → Cavity, plus a new Ache → Catch arrow]
Causality simplifies inference
• Using a representation in which the direction of causality is correct produces sparser graphs• Suppose we get the direction of causality wrong, thinking that “symptoms” causes “diseases”:
• New symptoms require a combinatorial proliferation of new arrows. This reduces efficiency of inference.
Ache Catch
Cavity
X-ray
Causality simplifies inference
• Strength: how strong is a relationship?• Structure: does a relationship exist?
E
B C
E
B CB B
Learning causal graphical models
• Strength: how strong is a relationship?
E
B C
E
B CB B
Causal structure vs. causal strength
• Strength: how strong is a relationship?– requires defining nature of relationship
E
B C
w0 w1
E
B C
w0
B B
Causal structure vs. causal strength
Parameterization
• Structures: h1 = h0 =
• Parameterization:
E
B C
E
B C
   C  B     h1: P(E=1 | C, B)     h0: P(E=1 | C, B)
   0  0          p00                   p0
   1  0          p10                   p0
   0  1          p01                   p1
   1  1          p11                   p1
Generic
Parameterization
• Structures: h1 = h0 =
• Parameterization:
E
B C
E
B C
w0 w1w0
w0, w1: strength parameters for B, C
   C  B     h1: P(E=1 | C, B)     h0: P(E=1 | C, B)
   0  0          0                     0
   1  0          w1                    0
   0  1          w0                    w0
   1  1          w1 + w0               w0
Linear
Parameterization
• Structures: h1 = h0 =
• Parameterization:
E
B C
E
B C
w0 w1w0
w0, w1: strength parameters for B, C
   C  B     h1: P(E=1 | C, B)     h0: P(E=1 | C, B)
   0  0          0                     0
   1  0          w1                    0
   0  1          w0                    w0
   1  1          w1 + w0 – w1 w0       w0
“Noisy-OR”
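The two parameterizations in these tables, sketched as functions of the strength parameters (w0 for the background B, w1 for the cause C); the particular strengths below are arbitrary illustrations:

```python
def linear(c, b, w0, w1):
    # Linear: strengths simply add.
    return w1 * c + w0 * b

def noisy_or(c, b, w0, w1):
    # Noisy-OR: independent generative mechanisms, each sufficient on its own.
    return 1 - (1 - w1) ** c * (1 - w0) ** b

w0, w1 = 0.4, 0.5
for c, b in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(c, b, linear(c, b, w0, w1), noisy_or(c, b, w0, w1))
# Last row (C=1, B=1): linear gives w1 + w0 = 0.9,
# noisy-OR gives w1 + w0 - w1*w0 = 0.7.
```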
Parameter estimation
• Maximum likelihood estimation:
maximize ∏_i P(bi, ci, ei; w0, w1)
• Bayesian methods: as in the “Comparing infinitely many hypotheses” example…
• Structure: does a relationship exist?
E
B C
E
B CB B
Causal structure vs. causal strength
Approaches to structure learning
• Constraint-based:– dependency from statistical tests (e.g., χ²)– deduce structure from dependencies
(Pearl, 2000; Spirtes et al., 1993)
[Figure: B → E, C → E]
Attempts to reduce inductive problem to deductive problem
Approaches to structure learning
• Bayesian:– compute posterior probability of structures, given observed data
  P(S | data) ∝ P(data | S) P(S);  compare P(S1 | data) vs. P(S0 | data)
[Figure: candidate structures S1 (with a C → E link) and S0 (without)]
• Constraint-based:– dependency from statistical tests (e.g., χ²)– deduce structure from dependencies
(Pearl, 2000; Spirtes et al., 1993)
(Heckerman, 1998; Friedman, 1999)
Causal graphical models
• Extend graphical models to deal with interventions as well as observations
• Respecting the direction of causality results in efficient representation and inference
• Two steps in learning causal models– parameter estimation– structure learning
Bayes nets and beyond...
• What are Bayes nets?– graphical models– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…– other knowledge in causal induction– formalizing causal theories
Elemental causal induction
“To what extent does C cause E?”
              C present    C absent
  E present       a            b
  E absent        c            d
• Strength: how strong is a relationship?• Structure: does a relationship exist?
E
B C
w0 w1
E
B C
w0
B B
Causal structure vs. causal strength
Causal strength
• Assume structure:
• Leading models (ΔP and causal power) are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E|B,C): – linear → ΔP; Noisy-OR → causal power
E
B C
w0 w1
B
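ΔP and causal power can be computed directly from the 2×2 contingency counts a, b, c, d introduced earlier; the counts below are illustrative, and causal power is written in Cheng's (1997) form:

```python
def strength_estimates(a, b, c, d):
    p_e_c = a / (a + c)     # P(e+ | c+)
    p_e_nc = b / (b + d)    # P(e+ | c-)
    delta_p = p_e_c - p_e_nc               # MLE of w1 under the linear form
    power = delta_p / (1 - p_e_nc)         # MLE of w1 under noisy-OR (causal power)
    return delta_p, power

dp, power = strength_estimates(a=6, b=2, c=2, d=6)   # 6/8 with cause, 2/8 without
print(dp, power)  # dp = 0.5, power = 0.5 / 0.75 ~ 0.667
```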
• Hypotheses: h1 = h0 =
• Bayesian causal inference:
[h1: B → E and C → E;  h0: B → E only]
  support = log [ P(data | h1) / P(data | h0) ]
  P(data | h1) = ∫₀¹ ∫₀¹ P(data | w0, w1, h1) p(w0, w1 | h1) dw0 dw1
  P(data | h0) = ∫₀¹ P(data | w0, h0) p(w0 | h0) dw0
Causal structure
[Figure: model fits to the data of Buehner and Cheng (1997): ΔP (r = 0.89), causal power (r = 0.88), causal support (r = 0.97)]
The importance of parameterization
• Noisy-OR incorporates mechanism assumptions:– generativity: causes increase probability of effects
– each cause is sufficient to produce the effect
– causes act via independent mechanisms(Cheng, 1997)
• Consider other models:– statistical dependence: χ² test
– generic parameterization (Anderson, computer science)
[Figure: human judgments vs. Support (Noisy-OR), χ², and Support (generic)]
Generativity is essential
• Predictions result from “ceiling effect”– ceiling effects only matter if you believe a cause increases the
probability of an effect
[Figure: Support for conditions with P(e+|c+) = P(e+|c–) = 8/8, 6/8, 4/8, 2/8, 0/8]
Bayes nets and beyond...
• What are Bayes nets?– graphical models– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…– other knowledge in causal induction– formalizing causal theories
Hamadeh et al. (2002) Toxicological sciences.
[Figure: gene expression profiles over chemicals (Clofibrate, Wyeth 14,643, Gemfibrozil, Phenobarbital) and genes (p450 2B1, Carnitine Palmitoyl Transferase 1); an unknown Chemical X is matched to the peroxisome proliferators]
Using causal graphical models
• Three questions (usually solved by researcher)– what are the variables?– what structures are plausible?– how do variables interact?
• How are these questions answered if causal graphical models are used in cognition?
Bayes nets and beyond...
• What are Bayes nets?– graphical models– causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…– other knowledge in causal induction– formalizing causal theories
Theory-based causal induction
Causal theory– Ontology
– Plausible relations
– Functional form
[Figure: candidate causal graphs h0, h1 over variables X, Y, Z, B]
P(h1) = α,  P(h0) = 1 – α
Hypothesis space of causal graphical models
Generates
P(h|data) P(data|h) P(h)Evaluated by statistical inference
Blicket detector (Gopnik, Sobel, and colleagues)
See this? It’s a blicket machine. Blickets make it go.
Let’s put this oneon the machine.
Oooh, it’s a blicket!
– Two objects: A and B– Trial 1: A on detector – detector active– Trial 2: B on detector – detector inactive– Trials 3,4: A B on detector – detector active– 3, 4-year-olds judge whether each object is a blicket
• A: a blicket• B: not a blicket
“Blocking”
Trial 1 Trials 3, 4A B Trial 2
A deductive inference?• Causal law: detector activates if and only if one or
more objects on top of it are blickets. • Premises:
– Trial 1: A on detector – detector active– Trial 2: B on detector – detector inactive– Trials 3,4: A B on detector – detector active
• Conclusions deduced from premises and causal law:– A: a blicket– B: not a blicket
– Two objects: A and B– Trial 1: A B on detector – detector active– Trial 2: A on detector – detector active– 4-year-olds judge whether each object is a blicket
• A: a blicket (100% of judgments)
• B: probably not a blicket (66% of judgments)
“Backwards blocking” (Sobel, Tenenbaum & Gopnik, 2004)
Trial 1 Trial 2A B
• Ontology– Types: Block, Detector, Trial
– Predicates:
Contact(Block, Detector, Trial)
Active(Detector, Trial)
• Constraints on causal relations– For any Block b and Detector d, with prior probability
q: Cause(Contact(b,d,t), Active(d,t))
• Functional form of causal relations– Causes of Active(d,t) are independent mechanisms, with
causal strengths wi. A background cause has strength w0. Assume a near-deterministic mechanism: wi ~ 1, w0 ~ 0.
Theory
• Ontology– Types: Block, Detector, Trial
– Predicates:
Contact(Block, Detector, Trial)
Active(Detector, Trial)
Theory
E
A B
• Ontology– Types: Block, Detector, Trial
– Predicates:
Contact(Block, Detector, Trial)
Active(Detector, Trial)
Theory
E
A B
A = 1 if Contact(block A, detector, trial), else 0B = 1 if Contact(block B, detector, trial), else 0E = 1 if Active(detector, trial), else 0
• Constraints on causal relations– For any Block b and Detector d, with prior probability
q: Cause(Contact(b,d,t), Active(d,t))
Theory
h00 : h10 :
h01 : h11 :
E
A B
E
A B
E
A B
E
A B
P(h00) = (1 – q)²   P(h10) = q(1 – q)
P(h01) = (1 – q) q   P(h11) = q²
No hypotheses with E → B, E → A, A → B, etc.
= “A is a blicket”
E
A
• Functional form of causal relations– Causes of Active(d,t) are independent mechanisms, with
causal strengths wb. A background cause has strength w0. Assume a near-deterministic mechanism: wb ~ 1, w0 ~ 0.
Theory
“Activation law”: E=1 if and only if A=1 or B=1.
                      h00   h01   h10   h11
P(E=1 | A=0, B=0):     0     0     0     0
P(E=1 | A=1, B=0):     0     0     1     1
P(E=1 | A=0, B=1):     0     1     0     1
P(E=1 | A=1, B=1):     0     1     1     1

P(h00) = (1 – q)²   P(h10) = q(1 – q)   P(h01) = (1 – q) q   P(h11) = q²
Bayesian inference
• Evaluating causal models in light of data:
• Inferring a particular causal relation:
  P(hi | d) = P(d | hi) P(hi) / Σ_{hj ∈ H} P(d | hj) P(hj)
  P(A → E | d) = Σ_{hj ∈ H} P(A → E | hj) P(hj | d)
Modeling backwards blocking
                      h00   h01   h10   h11
P(E=1 | A=0, B=0):     0     0     0     0
P(E=1 | A=1, B=0):     0     0     1     1
P(E=1 | A=0, B=1):     0     1     0     1
P(E=1 | A=1, B=1):     0     1     1     1

P(h00) = (1 – q)²   P(h10) = q(1 – q)   P(h01) = (1 – q) q   P(h11) = q²

Prior odds that B is a blicket:
  P(B → E | d) / P(B ↛ E | d) = [P(h01) + P(h11)] / [P(h00) + P(h10)] = q / (1 – q)

After trial 1 (A and B on the detector, detector active), h00 is ruled out:
  P(B → E | d) / P(B ↛ E | d) = [P(h01) + P(h11)] / P(h10) = 1 / (1 – q)
Trial 1 (A=1, B=1, E=1) is consistent with h01, h10, h11, but not h00:
  P(E=1 | A=1, B=1):   h00: 0   h01: 1   h10: 1   h11: 1
P(h00) = (1 – q)²   P(h10) = q(1 – q)   P(h01) = (1 – q) q   P(h11) = q²
Modeling backwards blocking
After trial 2 (A alone on the detector, detector active), h01 is also ruled out:

                      h01   h10   h11
P(E=1 | A=1, B=0):     0     1     1
P(E=1 | A=1, B=1):     1     1     1

P(h10) = q(1 – q)   P(h01) = (1 – q) q   P(h11) = q²

  P(B → E | d) / P(B ↛ E | d) = P(h11) / P(h10) = q / (1 – q)
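The backwards-blocking pattern can be simulated over the four hypotheses, treating the near-deterministic detector (wb ~ 1, w0 ~ 0) as exact; q = 1/3 below is an arbitrary illustrative prior:

```python
from itertools import product

def predicts(h, a, b):
    # Activation law: E = 1 iff some block on the detector is a blicket.
    return 1 if (a and h[0]) or (b and h[1]) else 0

def blicket_posterior(trials, q):
    hyps = list(product([0, 1], repeat=2))          # h = (A -> E?, B -> E?)
    prior = {h: (q if h[0] else 1 - q) * (q if h[1] else 1 - q) for h in hyps}
    unnorm = {}
    for h in hyps:
        # Deterministic likelihood: 1 if every trial matches, else 0.
        consistent = all(predicts(h, a, b) == e for a, b, e in trials)
        unnorm[h] = prior[h] if consistent else 0.0
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

q = 1 / 3
after1 = blicket_posterior([(1, 1, 1)], q)              # trial 1: A B on, active
after2 = blicket_posterior([(1, 1, 1), (1, 0, 1)], q)   # trial 2: A alone, active
p_b_1 = after1[(0, 1)] + after1[(1, 1)]   # P(B is a blicket) after trial 1
p_b_2 = after2[(0, 1)] + after2[(1, 1)]   # ... after trial 2
print(p_b_1, p_b_2)  # rises to ~0.6, then falls back to ~q: backwards blocking
```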
After each trial, adults judge the probability that each object is a blicket.
Trial 1 Trial 2BA
I. Pre-training phase: Blickets are rare . . . .
II. Backwards blocking phase:
Manipulating the prior
• “Rare” condition: First observe 12 objects on detector, of which 2 set it off.
• “Common” condition: First observe 12 objects on detector, of which 10 set it off.
After each trial, adults judge the probability that each object is a blicket.
Trial 1 Trial 2BA
I. Pre-training phase: Blickets are rare . . . .
II. Two trials: A B on detector, B C on detector
Inferences from ambiguous data
C
• Hypotheses: h000, h100, h010, h001, h110, h011, h101, h111 (all subsets of the links A → E, B → E, C → E)
• Likelihoods:
  P(E=1 | A, B, C; h) = 1 if A = 1 and A → E exists, or B = 1 and B → E exists, or C = 1 and C → E exists; else 0.
Same domain theory generates hypothesis space for 3 objects:
• “Rare” condition: First observe 12 objects on detector, of which 2 set it off.
The role of causal mechanism knowledge
• Is mechanism knowledge necessary?– Constraint-based learning using 2 tests of
conditional independence.
• How important is the deterministic functional form of causal relations?– Bayes with “noisy sufficient causes” theory (c.f.,
Cheng’s causal power theory).
Bayes with correct theory:
Bayes with “noisy sufficient causes” theory:
Theory-based causal induction
• Explains one-shot causal inferences about physical systems: blicket detectors
• Captures a spectrum of inferences:– unambiguous data: adults and children make all-or-
none inferences
– ambiguous data: adults and children make more graded inferences
• Extends to more complex cases with hidden variables, dynamic systems: come to my talk!
Summary
• Causal graphical models provide a language for asking questions about causality
• Key issues in modeling causal induction:– what do we mean by causal induction?– how do knowledge and statistics interact?
• Bayesian approach allows exploration of different answers to these questions
Outline• Morning
– Introduction (Josh) – Basic case study #1: Flipping coins (Tom)– Basic case study #2: Rules and similarity (Josh)
• Afternoon– Advanced case study #1: Causal induction (Tom)– Advanced case study #2: Property induction (Josh)– Quick tour of more advanced topics (Tom)
Property induction
Collaborators
Charles Kemp Neville Sanjana
Lauren Schmidt Amy Perfors
Fei Xu Liz Baraff
Pat Shafto
The Big Question
• How can we generalize new concepts reliably from just one or a few examples? – Learning word meanings
“horse” “horse” “horse”
The Big Question
• How can we generalize new concepts reliably from just one or a few examples? – Learning word meanings, causal relations, social rules,
….– Property induction
How probable is the conclusion (target) given the premises (examples)?
Gorillas have T4 cells.Squirrels have T4 cells.
All mammals have T4 cells.
The Big Question
• How can we generalize new concepts reliably from just one or a few examples? – Learning word meanings, causal relations,
social rules, ….– Property induction
Gorillas have T4 cells.Squirrels have T4 cells.
All mammals have T4 cells.
Gorillas have T4 cells.Chimps have T4 cells.
All mammals have T4 cells.
More diverse examples → stronger generalization
Is rational inference the answer?• Everyday induction often appears to follow
principles of rational scientific inference. – Could that explain its success?
• Goal of this work: a rational computational model of human inductive generalization.– Explain people’s judgments as approximations to optimal
inference in natural environments.
– Close quantitative fits to people’s judgments with a minimum of free parameters or assumptions.
• Rational statistical inference (Bayes):
• Learners’ domain theories generate their hypothesis space H and prior p(h). – Well-matched to structure of the natural world.– Learnable from limited data. – Computationally tractable inference.
  p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)
Theory-Based Bayesian Models
The plan• Similarity-based models
• Theory-based model
• Bayesian models– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes– Learning with multiple domain theories
– Learning domain theories
The plan• Similarity-based models
• Theory-based model
• Bayesian models– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes– Learning with multiple domain theories
– Learning domain theories
• 20 subjects rated the strength of 45 arguments:
X1 have property P.
X2 have property P.
X3 have property P.
All mammals have property P.
• 40 different subjects rated the similarity of all pairs of 10 mammals.
An experiment(Osherson et al., 1990)
Similarity-based models(Osherson et al.)

  strength("all mammals" | X) = Σ_{i ∈ mammals} sim(i, X)

[Figure: the examples X marked among the mammals]

• Sum-Similarity:
  sim(i, X) = Σ_{j ∈ X} sim(i, j)

• Max-Similarity:
  sim(i, X) = max_{j ∈ X} sim(i, j)
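A sketch of the two models with a toy similarity matrix; the numbers below are invented stand-ins for the pairwise similarity judgments Osherson et al. collected:

```python
SIM = {  # symmetric toy similarities among four mammals
    ("horse", "cow"): 0.9, ("horse", "seal"): 0.2, ("horse", "dolphin"): 0.15,
    ("cow", "seal"): 0.2, ("cow", "dolphin"): 0.15, ("seal", "dolphin"): 0.85,
}
MAMMALS = ["horse", "cow", "seal", "dolphin"]

def sim(i, j):
    return 1.0 if i == j else SIM.get((i, j), SIM.get((j, i), 0.0))

def strength(examples, sim_to_set):
    # strength("all mammals" | X) = sum over mammals i of sim(i, X)
    return sum(sim_to_set(i, examples) for i in MAMMALS)

def sum_sim(i, X):
    return sum(sim(i, j) for j in X)

def max_sim(i, X):
    return max(sim(i, j) for j in X)

diverse, close = ["horse", "dolphin"], ["horse", "cow"]
print(strength(diverse, max_sim), strength(close, max_sim))
# Under Max-sim the diverse premise set covers the category better
# (a diversity effect).
```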
Sum-sim versus Max-sim• Two models appear functionally similar:
– Both increase monotonically as new examples are observed.
• Reasons to prefer Sum-sim:– Standard form of exemplar models of categorization,
memory, and object recognition.– Analogous to kernel density estimation techniques in
statistical pattern recognition.
• Reasons to prefer Max-sim:– Fit to generalization judgments . . . .
Data vs. models
[Scatter plots: human data vs. model predictions]
Each point represents one argument:X1 have property P.X2 have property P.X3 have property P.
All mammals have property P.
Three data sets
[Figure: data vs. Max-sim and Sum-sim predictions for three data sets. Conclusion kind: "all mammals", "horses", "horses"; number of examples: 3, 2, and 1, 2, or 3]
Feature rating data(Osherson and Wilkie)
• People were given 48 animals, 85 features, and asked to rate whether each animal had each feature.
• E.g., elephant: 'gray' 'hairless' 'toughskin' 'big' 'bulbous' 'longleg' 'tail' 'chewteeth' 'tusks' 'smelly' 'walks' 'slow' 'strong' 'muscle’ 'quadrapedal' 'inactive' 'vegetation' 'grazer' 'oldworld' 'bush' 'jungle' 'ground' 'timid' 'smart' 'group'
• Compute similarity based on Hamming distance, or cosine.
• Generalize based on Max-sim or Sum-sim.
[Figure: matrix of Species 1–10 × features, plus a new-property column with unknown entries]
Three data sets
[Figure: data vs. model predictions. Conclusion kind: "all mammals", "horses", "horses"; number of examples: 3, 2, and 1, 2, or 3]
  Max-Sim: r = 0.77, r = 0.75, r = 0.94
  Sum-Sim: r = –0.21, r = 0.63, r = 0.19
Problems for sim-based approach• No principled explanation for why Max-Sim works so
well on this task, and Sum-Sim so poorly, when Sum-Sim is the standard in other similarity-based models.
• Free parameters mixing similarity and coverage terms, and possibly Max-Sim and Sum-Sim terms.
• Does not extend to induction with other kinds of properties, e.g., from Smith et al., 1993:
Dobermanns can bite through wire.
German shepherds can bite through wire.
Poodles can bite through wire.
German shepherds can bite through wire.
Marr’s Three Levels of Analysis
• Computation: “What is the goal of the computation, why is it
appropriate, and what is the logic of the strategy by which it can be carried out?”
• Representation and algorithm: Max-sim, Sum-sim
• Implementation: Neurobiology
The plan• Similarity-based models
• Theory-based model
• Bayesian models– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes– Learning with multiple domain theories
– Learning domain theories
• Scientific biology: species generated by an evolutionary branching process.– A tree-structured taxonomy of species.
• Taxonomy also central in folkbiology (Atran).
Theory-based induction
Begin by reconstructing intuitive taxonomy from similarity judgments:
[Figure: hierarchical clustering of chimp, gorilla, horse, cow, elephant, rhino, mouse, squirrel, dolphin, seal]
Theory-based induction
How taxonomy constrains induction
• Atran (1998): “Fundamental principle of systematic induction” (Warburton 1967, Bock 1973)– Given a property found among members of any
two species, the best initial hypothesis is that the property is also present among all species that are included in the smallest higher-order taxon containing the original pair of species.
[Figure: taxonomic tree over elephant, squirrel, chimp, gorilla, horse, cow, rhino, mouse, dolphin, seal; the smallest taxon containing the examples is "all mammals"]
Cows have property P.Dolphins have property P.Squirrels have property P.
All mammals have property P.
Strong: 0.76 [max = 0.82]
[Figure: taxonomic tree over the ten mammals]
Cows have property P.Dolphins have property P.Squirrels have property P.
All mammals have property P.
Cows have property P.Horses have property P.Rhinos have property P.
All mammals have property P.
“large herbivores”
Strong: 0.76 [max = 0.82];  Weak: 0.17 [min = 0.14]
[Figure: taxonomic tree over the ten mammals]
Seals have property P.Dolphins have property P.Squirrels have property P.
All mammals have property P.
Cows have property P.Dolphins have property P.Squirrels have property P.
All mammals have property P.
“all mammals”
Strong: 0.76 [max = 0.82];  Weak: 0.30 [min = 0.14]
[Figure: data vs. predictions of Max-sim, Sum-sim, and taxonomic distance for the three data sets. Conclusion kind: "all mammals", "horses", "horses"; number of examples: 3, 2, and 1, 2, or 3]
The challenge
• Can we build models with the best of both traditional approaches?– Quantitatively accurate predictions.– Strong rational basis.
• Will require novel ways of integrating structured knowledge with statistical inference.
The plan• Similarity-based models
• Theory-based model
• Bayesian models– “Empiricist” Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes– Learning with multiple domain theories
– Learning domain theories
The Bayesian approach

[Figure: feature matrix for Species 1–10, plus a column for a new property that is observed for a few species and unknown ("?") for the rest. The observed examples are the data d; a hypothesis h is a candidate extension of the new property; generalization asks which other species have the property]
The Bayesian approach

[Figure: the same schematic, annotated with the prior p(h) over hypotheses and the likelihood p(d|h)]

Bayes' rule:

  p(h|d) = p(d|h) p(h) / Σ_{h' ∈ H} p(d|h') p(h')
Probability that property Q holds for species x:

  p(Q(x)|d) = Σ_{h consistent with Q(x)} p(h|d)
The likelihood p(d|h):

  p(d|h) = 1/|h|   if d is consistent with h
         = 0       otherwise

"Size principle": |h| = # of positive instances of h. (With n examples sampled independently from the concept, the likelihood becomes 1/|h|^n.)
The size principle

[Figure: the even numbers 2–100 on a number line, with observed examples marked]

  h1 = "even numbers"    h2 = "multiples of 10"

• With a few examples (all multiples of 10), the data are slightly more of a coincidence under h1.
• With many examples (all multiples of 10), the data are much more of a coincidence under h1.
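The size principle can be sketched numerically. The two number concepts come from the example above; the particular observed examples below are hypothetical choices for illustration:

```python
# Size principle: p(d | h) = (1/|h|)^n for n examples consistent with h,
# and 0 if any example falls outside h.
def likelihood(examples, hypothesis):
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(examples)

evens = set(range(2, 101, 2))       # h1: "even numbers", |h1| = 50
tens = set(range(10, 101, 10))      # h2: "multiples of 10", |h2| = 10

few = [10, 20]                      # consistent with both hypotheses
many = [10, 20, 30, 40, 50, 60]

# The likelihood ratio favoring the smaller hypothesis grows
# exponentially with the number of examples: 5^2 vs. 5^6.
print(likelihood(few, tens) / likelihood(few, evens))    # ~25
print(likelihood(many, tens) / likelihood(many, evens))  # ~15625
```

With two examples the smaller hypothesis is favored about 25 to 1; with six, about 15625 to 1, which is why a long run of multiples of 10 looks like too much of a coincidence under "even numbers".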
Illustrating the size principle
Which argument is stronger?

  Grizzly bears have property P.
  ----------------------------------
  All mammals have property P.

  Grizzly bears have property P.
  Brown bears have property P.
  Polar bears have property P.
  ----------------------------------
  All mammals have property P.

"Non-monotonicity": adding premises can make the conclusion less likely.
[Figure: schematic with the feature matrix, new property, hypotheses h, and data d]

Probability that property Q holds for species x, with the size principle folded in:

  p(Q(x)|d) = [ Σ_{h consistent with Q(x), d} p(h)/|h| ] / [ Σ_{h consistent with d} p(h)/|h| ]
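A numeric sketch of this hypothesis-averaging computation. The hypothesis space, uniform prior, and species below are hypothetical; the likelihood uses (1/|h|)^n for n examples, following the size principle:

```python
# Toy hypothesis averaging with the size principle. Hypothesis extensions
# and priors are made up for illustration.
hypotheses = {
    "all mammals": {"cow", "horse", "dolphin", "seal", "squirrel"},
    "farm animals": {"cow", "horse"},
    "sea mammals": {"dolphin", "seal"},
    "bovines": {"cow"},
}
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}

def p_generalize(x, d):
    """p(Q(x)|d): posterior mass on hypotheses whose extension contains x."""
    def weight(h):
        ext = hypotheses[h]
        return prior[h] / len(ext) ** len(d) if set(d) <= ext else 0.0
    denom = sum(weight(h) for h in hypotheses)
    return sum(weight(h) for h in hypotheses if x in hypotheses[h]) / denom

# An example from "cow" projects to "horse" more strongly than an example
# from "dolphin" does, because small consistent hypotheses carry more weight.
print(p_generalize("horse", ["cow"]) > p_generalize("horse", ["dolphin"]))  # True
```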
Specifying the prior p(h)
• A good prior must focus on a small subset of all 2^n possible hypotheses, in order to:
  – Match the distribution of properties in the world.
  – Be learnable from limited data.
  – Be efficiently computable.
• We consider two approaches:
  – "Empiricist" Bayes: unstructured prior based directly on known features.
  – "Theory-based" Bayes: structured prior based on a rational domain theory, tuned to known features.
"Empiricist" Bayes (Heit, 1998):

[Figure: hypotheses h1 … h12 read off the known feature matrix for Species 1–10; the prior p(h) is the empirical frequency of each candidate extension (1/15 for most hypotheses, 2/15 and 3/15 for repeated ones)]
Results

  Max-sim:             r = 0.77   r = 0.75   r = 0.94
  "Empiricist" Bayes:  r = 0.38   r = 0.16   r = 0.79
Why doesn't "Empiricist" Bayes work?
• With no structural bias, it requires too many features to estimate the prior reliably.
• An analogy: estimating a smooth probability density function by local interpolation.
  [Figure: interpolated density estimates with N = 5, 100, and 500 samples]
• Assuming an appropriately structured form for the density (e.g., Gaussian) leads to better generalization from sparse data.
  [Figure: interpolation vs. a Gaussian fit, both with N = 5]
"Theory-based" Bayes
Theory: two principles based on the structure of species and properties in the natural world.
1. Species are generated by an evolutionary branching process.
   – A tree-structured taxonomy of species (Atran, 1998).
2. Features are generated by a stochastic mutation process and passed on to descendants.
   – Novel features can appear anywhere in the tree, but some distributions are more likely than others.
[Figure: taxonomic tree T over species s1 … s10]

The mutation process generates p(h|T):
– Choose a label for the root.
– The label mutates along branch b with probability

    (1 - e^(-2λ|b|)) / 2

  where λ = mutation rate and |b| = length of branch b.
Samples from the prior
• Labelings that cut the data along fewer branches are more probable: "monophyletic" > "polyphyletic".
• Labelings that cut the data along longer branches are more probable: "more distinctive" > "less distinctive".
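These properties can be checked by sampling from the prior. The tree, branch lengths, and mutation rate below are hypothetical; the flip probability along a branch follows the (1 - e^(-2λ|b|))/2 mutation process described above:

```python
import math
import random

def mutation_prob(branch_length, lam):
    """Probability a binary label flips along a branch of given length."""
    return 0.5 * (1.0 - math.exp(-2.0 * lam * branch_length))

def sample_labeling(tree, root, lam, rng=random):
    """Sample one hypothesis h: a binary label for every node.
    `tree` maps a node to a list of (child, branch_length) pairs."""
    labels = {root: rng.random() < 0.5}   # uniform label at the root
    stack = [root]
    while stack:
        node = stack.pop()
        for child, length in tree.get(node, []):
            flip = rng.random() < mutation_prob(length, lam)
            labels[child] = labels[node] ^ flip
            stack.append(child)
    return labels

# Tiny hypothetical tree: root -> (a, b); a -> (s1, s2); b -> (s3, s4).
tree = {"root": [("a", 1.0), ("b", 1.0)],
        "a": [("s1", 0.5), ("s2", 0.5)],
        "b": [("s3", 0.5), ("s4", 0.5)]}

# With a low mutation rate, sampled labelings tend to be "monophyletic":
# s1 and s2 (same subtree) agree far more often than s1 and s3.
h = sample_labeling(tree, "root", lam=0.3)
```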
[Figure: schematic with tree T over species s1 … s10]

• The mutation process over tree T generates p(h|T).
• Message passing over tree T efficiently sums over all h.
• How do we know which tree T to use? The same mutation process generates p(Features|T):
  – Assume each feature is generated independently over the tree.
  – Use MCMC to infer the most likely tree T and mutation rate λ given the observed features.
  – No free parameters!
Results

  Max-sim:              r = 0.77   r = 0.75   r = 0.94
  "Empiricist" Bayes:   r = 0.38   r = 0.16   r = 0.79
  "Theory-based" Bayes: r = 0.91   r = 0.95   r = 0.91
Grounding in similarity
Reconstruct the intuitive taxonomy from similarity judgments:

[Figure: hierarchical clustering of chimp, gorilla, horse, cow, elephant, rhino, mouse, squirrel, dolphin, seal]

[Figure: Max-sim, Sum-sim, and Theory-based Bayes fits for conclusion kind "all mammals" (3 examples) and "horses" (2 or 1, 2, or 3 examples)]
Explaining similarity
• Why does Max-sim fit so well?
  – It is an efficient and accurate approximation to the Theory-based Bayesian model.
  – Theorem: nearest-neighbor classification approximates evolutionary Bayes in the limit of high mutation rate, if the domain is tree-structured.
  – Mean r = 0.94: correlation with Bayes on three-premise general arguments, over 100 simulated trees.
  [Figure: histogram of correlations r]
Alternative feature-based models
• Taxonomic Bayes: strictly taxonomic ("monophyletic") hypotheses, with no mutation process.
• PDP network (Rogers and McClelland).
  [Figure: network mapping Species units to Feature units]
Results

  PDP network:         r = 0.41   r = 0.62   r = 0.71   (bias is too weak)
  Taxonomic Bayes:     r = 0.51   r = 0.53   r = 0.85   (bias is too strong)
  Theory-based Bayes:  r = 0.91   r = 0.95   r = 0.91   (bias is just right!)
Mutation principle versus pure Occam's Razor
• The mutation principle provides a version of Occam's Razor, by favoring hypotheses that span fewer disjoint clusters.
• Could we use a more generic Bayesian Occam's Razor, without the biological motivation of mutation?
[Figure: the tree schematic repeated twice, once with the mutation prior, where labels flip with probability (1 - e^(-2λ|b|))/2 along each branch b, and once with a generic Occam prior over taxonomic hypotheses that ignores branch lengths]
Premise typicality effect (Rips, 1975; Osherson et al., 1990):

  Strong:  Horses have property P.  →  All mammals have property P.
  Weak:    Seals have property P.   →  All mammals have property P.

[Figure: fits of Bayes (taxonomy + mutation), Bayes (taxonomy + Occam), and Max-sim for conclusion kind "all mammals" with 1 example]
Typicality meets hierarchies
• Collins and Quillian: semantic memory is structured hierarchically.
• Traditional story: a simple hierarchical structure sits uncomfortably with typicality effects and exceptions.
• New story: typicality and exceptions are compatible with rational statistical inference over a hierarchy.
Intuitive versus scientific theories of biology
• Same structure for how species are related.
  – Tree-structured taxonomy.
• Same probabilistic model for traits.
  – Small probability of occurring along any branch at any time, plus inheritance.
• Different features.
  – Scientists: genes.
  – People: coarse anatomy and behavior.
Induction in biology: summary
• Theory-based Bayesian inference explains taxonomic inductive reasoning in folk biology.
• Insight into processing-level accounts.
  – Why Max-sim over Sum-sim in this domain?
  – How is a hierarchical representation compatible with typicality effects and exceptions?
• Reveals essential principles of the domain theory.
  – Category structure: taxonomic tree.
  – Feature distribution: stochastic mutation process + inheritance.
[Figure: seven savanna species (lion, cheetah, hyena, giraffe, gazelle, gorilla, monkey) organized under three theories]

  Property type:     Generic "essence"   Size-related   Food-carried
  Theory structure:  Taxonomic tree      Dimensional    Directed acyclic network
One-dimensional predicates
• Q = "Have skins that are more resistant to penetration than most synthetic fibers."
  – Unknown relevant property: skin toughness.
  – Model the influence of known properties via the judged prior probability that each species has Q.

[Figure: skin toughness dimension, ordered house cat < camel < elephant < rhino, with a threshold for Q]

[Figure: fits of Max-sim, Bayes (taxonomy + mutation), and Bayes (1D model) for one-dimensional predicates]
Food web model fits (Shafto et al.):
  Disease:   r = 0.77 (mammals),  r = 0.82 (island)
  Property:  r = -0.35 (mammals), r = -0.05 (island)

Taxonomic tree model fits (Shafto et al.):
  Disease:   r = -0.12 (mammals), r = 0.16 (island)
  Property:  r = 0.81 (mammals),  r = 0.62 (island)
Theory:
• Species organized in a taxonomic tree structure.
• Feature i generated by a mutation process with rate λi.

Domain structure: p(S|T)
[Figure: tree over S1 … S10 with features F1 … F14 attached to its branches; a feature like F10 that arises on several branches has a high rate λ10, and hence low weight]

Data: p(D|S)
[Figure: observed feature matrix for Species 1–10]
[Figure: the same theory and tree, extended with a new Species X whose features are partly unobserved ("?"); its observed features place it at a node SX in the tree, which supports predictions about the unobserved ones]
Where does the domain theory come from?
• Innate.
  – Atran (1998): the tendency to group living kinds into hierarchies reflects an "innately determined cognitive structure".
• Emerges (only approximately) through learning in unstructured connectionist networks.
  – McClelland and Rogers (2003).
Bayesian inference to theories
• A challenge to the nativist-empiricist dichotomy:
  – We really do have structured domain theories.
  – We really do learn them.
• The Bayesian framework applies over multiple levels:
  – Given hypothesis space + data, infer concepts.
  – Given theory + data, infer hypothesis space.
  – Given X + data, infer theory.
Bayesian inference to theories
• Candidate theories for biological species and their features:
  – T0: Features generated independently for each species (cf. naive Bayes, Anderson's rational model).
  – T1: Features generated by mutation in a tree-structured taxonomy of species.
  – T2: Features generated by mutation in a one-dimensional chain of species.
• Score theories by their likelihood on the object-feature matrix:

  p(D|T) = Σ_S p(D|S, T) p(S|T)  ≈  max_S p(D|S, T) p(S|T)
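A toy version of this kind of theory scoring. Everything here is a hypothetical stand-in for the richer models above: a small binary matrix, a "structured" theory that simply fixes a two-group partition of the objects, and Beta(1,1) priors on the cell probabilities, integrated out in closed form:

```python
import math
from itertools import product

# Score each theory T by its marginal likelihood p(D|T) on a binary
# object-feature matrix D.
# T0: every cell independent Bernoulli(theta), theta ~ Beta(1,1), per feature.
# T1: objects fall into two known groups; each (group, feature) pair gets
#     its own theta ~ Beta(1,1).

def beta_binomial_log_ml(ones, total):
    """log of the integral of theta^ones (1-theta)^(total-ones) under Beta(1,1)."""
    return (math.lgamma(ones + 1) + math.lgamma(total - ones + 1)
            - math.lgamma(total + 2))

D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
groups = [0, 0, 1, 1]          # the structure assumed by T1
n_obj, n_feat = len(D), len(D[0])

# T0: pool all objects for each feature.
logp_T0 = sum(beta_binomial_log_ml(sum(D[i][j] for i in range(n_obj)), n_obj)
              for j in range(n_feat))

# T1: separate counts for each (group, feature) pair.
logp_T1 = 0.0
for g, j in product([0, 1], range(n_feat)):
    members = [i for i in range(n_obj) if groups[i] == g]
    logp_T1 += beta_binomial_log_ml(sum(D[i][j] for i in members), len(members))

print(logp_T0 < logp_T1)   # True: the structured theory wins on clustered data
```

The same comparison run on a shuffled (unclustered) matrix would favor T0, mirroring the synthetic-data results below.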
T0:
• No organizational structure for species.
• Features distributed independently over species.

[Figure: feature sets F1 … F14 assigned to Species 1–10 with no shared structure]
Scoring both theories on tree-structured data:

  T1 (tree + mutation):       p(Data|T1) ≈ 2.42 × 10^-32
  T0 (independent features):  p(Data|T0) ≈ 1.83 × 10^-41

[Figure: a feature matrix in which pairs of species share feature sets; T1 lays the features F1 … F14 out on a recovered taxonomy over S1 … S10, and assigns the data roughly nine orders of magnitude higher probability than T0]
Scoring both theories on independently generated data:

  T0 (independent features):  p(Data|T0) ≈ 2.29 × 10^-42
  T1 (tree + mutation):       p(Data|T1) ≈ 4.38 × 10^-53

[Figure: a feature matrix with no shared structure; the best tree T1 can find still cuts most features across many branches, so the null theory T0 now wins by roughly eleven orders of magnitude]
Empirical tests
• Synthetic data: 32 objects, 120 features.
  – Tree-structured generative model.
  – Linear-chain generative model.
  – Unconstrained (independent features).
• Real data:
  – Animal feature judgments: 48 species, 85 features.
  – US Supreme Court decisions, 1981–1985: 9 people, 637 cases.
Results: preferred model

  Synthetic tree-structured data:  Tree
  Synthetic linear-chain data:     Linear
  Synthetic unconstrained data:    Null
  Animal feature judgments:        Tree
  Supreme Court decisions:         Linear
Theory acquisition: summary
• So far, just a computational proof of concept.
• Future work:
  – Experimental studies of theory acquisition in the lab, with adult and child subjects.
  – Modeling developmental or historical trajectories of theory change.
• Sources of hypotheses for candidate theories:
  – What is innate?
  – Role of analogy?
Advanced topics
Structure and statistics
• Statistical language modeling
  – topic models
• Relational categorization
  – attributes and relations

Statistical language modeling
• A variety of approaches to statistical language modeling are used in cognitive science:
  – e.g., LSA (Landauer & Dumais, 1997)
  – distributional clustering (Redington, Chater, & Finch, 1998)
• Generative models have unique advantages:
  – they identify the assumed causal structure of language
  – they make use of the standard tools of Bayesian statistics
  – they are easily extended to capture more complex structure
Generative models for language

  latent structure  →  observed data
  meaning           →  sentences
Topic models
• Each document is a mixture of topics.
• Each word is chosen from a single topic.
• Introduced by Blei, Ng, and Jordan (2001); a reinterpretation of PLSI (Hofmann, 1999).
• The idea of probabilistic topics is widely used (e.g., Bigi et al., 1997; Iyer & Ostendorf, 1996; Ueda & Saito, 2003).
Generating a document

[Figure: graphical model; θ (a distribution over topics) generates topic assignments z, each of which generates an observed word w]

  topic 1, P(w|z = 1):  HEART .2, LOVE .2, SOUL .2, TEARS .2, JOY .2 (science words 0)
  topic 2, P(w|z = 2):  SCIENTIFIC .2, KNOWLEDGE .2, WORK .2, RESEARCH .2, MATHEMATICS .2 (emotion words 0)

Choose mixture weights θ = {P(z = 1), P(z = 2)} for each document, then generate a "bag of words":

  {0, 1}:        MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
  {0.25, 0.75}:  SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
  {0.5, 0.5}:    MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
  {0.75, 0.25}:  WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
  {1, 0}:        TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
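The generative process above can be written in a few lines. The two topics match the example; the sampling routine itself is an illustrative sketch:

```python
import random

# Two-topic toy generator: each word is produced by first sampling a topic
# z from the document's mixture weights, then a word from P(w|z).
topics = {
    1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
    2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"],
}

def generate_document(p_topic1, n_words, rng=random):
    words = []
    for _ in range(n_words):
        z = 1 if rng.random() < p_topic1 else 2
        words.append(rng.choice(topics[z]))   # uniform: each word has prob 0.2
    return words

print(generate_document(0.75, 10))   # mostly emotion words
print(generate_document(0.0, 10))    # only science words
```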
A selection of topics (from 500)

[Figure: top words for a handful of the 500 topics, e.g. scientific method (THEORY, SCIENTISTS, EXPERIMENT, OBSERVATIONS, …); space (SPACE, EARTH, MOON, PLANET, …); art (ART, PAINT, ARTIST, PAINTING, …); teaching (STUDENTS, TEACHER, CLASS, …); the senses (BRAIN, NERVE, SENSES, …); electricity (CURRENT, ELECTRICITY, CIRCUIT, …); philosophy (NATURE, WORLD, HUMAN, PHILOSOPHY, …); ordinals (THIRD, FIRST, SECOND, …); stories (STORY, TELL, CHARACTER, …); mind and dreams (MIND, WORLD, DREAM, …); the sea (WATER, FISH, SWIM, …); disease (DISEASE, BACTERIA, GERMS, …); magnetism (FIELD, MAGNETIC, MAGNET, …); science (SCIENCE, STUDY, KNOWLEDGE, …); sports (BALL, GAME, TEAM, …); jobs (JOB, WORK, CAREER, …)]
Learning topic hierarchies
(Blei, Griffiths, Jordan, & Tenenbaum, 2004)
Syntax and semantics from statistics

[Figure: graphical model combining topic assignments z over words w with a chain of syntactic classes x]

  semantics: probabilistic topics
  syntax: probabilistic regular grammar

Factorization of language based on statistical dependency patterns:
  – long-range, document-specific dependencies → topics
  – short-range dependencies constant across all documents → syntax

(Griffiths, Steyvers, Blei, & Tenenbaum, submitted)
[Figure: a composite model. One "semantic" class emits words from the document's topics (z = 1: HEART, LOVE, SOUL, TEARS, JOY, each 0.2; z = 2: SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS, each 0.2; here P(z = 1) = 0.4, P(z = 2) = 0.6). Two syntactic classes emit function words directly (THE .6, A .3, MANY .1; and OF .6, FOR .3, BETWEEN .1). Transition probabilities (0.9, 0.1, 0.2, 0.8, 0.7, 0.3) link the classes]

The model generates a sentence one word at a time:

  THE … → THE LOVE … → THE LOVE OF … → THE LOVE OF RESEARCH …
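A rough sketch of the composite generator. The class structure, transition scheme, and word weightings below are hypothetical simplifications of the model on the slide:

```python
import random

# Toy topics-plus-syntax model: an HMM over word classes, where one
# "semantic" class emits from the document's topic mixture and the other
# classes emit function words directly.
topics = {1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
          2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]}
topic_weights = {1: 0.4, 2: 0.6}              # document-specific mixture

class_words = {"det": ["THE", "THE", "A"],    # crude probability weighting
               "prep": ["OF", "OF", "FOR"]}
transitions = {"det": ["sem"],                # simplified deterministic chain
               "sem": ["prep"],
               "prep": ["sem"]}

def emit(state, rng):
    if state == "sem":
        z = 1 if rng.random() < topic_weights[1] else 2
        return rng.choice(topics[z])
    return rng.choice(class_words[state])

def generate(n_words, rng=random):
    state, words = "det", []
    for _ in range(n_words):
        words.append(emit(state, rng))
        state = rng.choice(transitions[state])
    return " ".join(words)

print(generate(4))   # e.g. a sentence of the form "THE LOVE OF RESEARCH"
```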
Semantic categories

[Figure: clusters of content words, e.g. food (FOOD, FOODS, NUTRIENTS, DIET, …); maps (MAP, NORTH, EARTH, EQUATOR, …); medicine (DOCTOR, PATIENT, HEALTH, HOSPITAL, …); books (BOOK, READING, LIBRARY, …); metals (GOLD, IRON, SILVER, COPPER, …); behavior (BEHAVIOR, SELF, PERSONALITY, …); cells (CELLS, ORGANISMS, BACTERIA, …); plants (PLANTS, LEAVES, SEEDS, …)]

Syntactic categories

[Figure: clusters of function and frame words, e.g. adjectives (GOOD, SMALL, NEW, IMPORTANT, …); determiners (THE, HIS, THEIR, YOUR, …); comparatives (MORE, SUCH, LESS, MUCH, …); prepositions (ON, AT, INTO, FROM, …); verbs of saying (SAID, ASKED, THOUGHT, TOLD, …); quantifiers (ONE, SOME, MANY, TWO, …); pronouns (HE, YOU, THEY, I, …); common verbs (BE, MAKE, GET, HAVE, …)]
Statistical language modeling
• Generative models provide:
  – transparent assumptions about the causal process
  – opportunities to combine and extend models
• Richer generative models:
  – probabilistic context-free grammars
  – paragraph- or sentence-level dependencies
  – more complex semantics
Structure and statistics
• Statistical language modeling
  – topic models
• Relational categorization
  – attributes and relations
Relational categorization
• Most approaches to categorization in psychology and machine learning focus on attributes, i.e. properties of objects.
  – e.g., words in the titles of CogSci posters
• But a significant portion of knowledge is organized in terms of relations.
  – co-authors on posters
  – who talks to whom

(Kemp, Griffiths, & Tenenbaum, 2004)
Attributes and relations

Data: an objects × attributes matrix X, and an objects × objects matrix Y.

Mixture model (cf. Anderson, 1990):

  P(X) = Π_i Σ_{z_i} Π_k P(x_ik | z_i) P(z_i)

Stochastic blockmodel:

  P(Y) = Σ_z Π_{i,j} P(y_ij | z_i, z_j) Π_i P(z_i)
Stochastic blockmodels
• For any pair of objects (i, j), the probability of a relation is determined by their classes (z_i, z_j).
  [Figure: matrix η, with entry η_ij giving the probability of a link from type i to type j]
• Each entity has a type, collected in Z. Inference follows Bayes' rule:

  P(Z, η | Y) ∝ P(Y | Z, η) P(Z) P(η)

• This allows the types of objects and class probabilities to be learned from data.
Stochastic blockmodels

[Figure: example graphs over entities of types A, B, and C, with a new type D introduced when the data demand it]
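A minimal sketch of this generative model. The classes and link probabilities are hypothetical:

```python
import math
import random

def sample_graph(z, eta, rng=random):
    """Sample a directed adjacency matrix: link i -> j with prob eta[z[i]][z[j]]."""
    n = len(z)
    return [[1 if rng.random() < eta[z[i]][z[j]] else 0 for j in range(n)]
            for i in range(n)]

def log_likelihood(Y, z, eta):
    """log P(Y | Z, eta): links are independent Bernoulli given the classes."""
    ll = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            p = eta[z[i]][z[j]]
            ll += math.log(p if y else 1.0 - p)
    return ll

z_true = [0, 0, 0, 1, 1, 1]            # two classes of three objects
eta = [[0.9, 0.1],                     # dense within a class, sparse across
       [0.1, 0.9]]

sample_graph(z_true, eta)              # a random draw from the model

# A perfectly block-structured graph: links exactly within the two halves.
Y = [[1 if (i < 3) == (j < 3) else 0 for j in range(6)] for i in range(6)]
z_bad = [0, 1, 0, 1, 0, 1]             # a mismatched class assignment

print(log_likelihood(Y, z_true, eta) > log_likelihood(Y, z_bad, eta))  # True
```

Searching over assignments Z (and η) to maximize this posterior is what recovers the word and actor classes below.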
Categorizing words
• Relational data: word association norms (Nelson, McEvoy, & Schreiber, 1998).
• 5018 × 5018 matrix of associations:
  – symmetrized
  – all words with < 50 and > 10 associates
  – 2513 nodes, 34716 links

[Figure: recovered word classes, e.g. instruments (BAND, INSTRUMENT, HORN, FLUTE, …); clothing (TIE, COAT, SHOES, …); fabric (SEW, MATERIAL, WOOL, YARN, …); washing (WASH, LIQUID, SINK, …)]
Categorizing actors
• Internet Movie Database (IMDB) data, from the start of cinema to 1960 (Jeremy Kubica).
• Relational data: collaboration.
• 5000 × 5000 matrix of the most prolific actors:
  – all actors with < 400 and > 1 collaborators
  – 2275 nodes, 204761 links

[Figure: recovered actor classes, interpretable as national industries and genres (Germany, UK, British comedy, Italian, US Westerns), e.g. Anton Walbrook, Margaret Lockwood, Gina Lollobrigida, Helen Gibson]
Structure and statistics
• The Bayesian approach allows us to specify structured probabilistic models.
• Explore novel representations and domains:
  – topics for semantic representation
  – relational categorization
• Use powerful methods for inference, developed in statistics and machine learning.
Other methods and tools...
• Inference algorithms:
  – belief propagation
  – dynamic programming
  – the EM algorithm and variational methods
  – Markov chain Monte Carlo
• More complex models:
  – Dirichlet processes and Bayesian nonparametrics
  – Gaussian processes and kernel methods

Reading list at http://www.bayesiancognition.com
Taking stock

Bayesian models of inductive learning
• Inductive leaps can be explained with hierarchical Theory-based Bayesian models:

  Domain theory → structural hypotheses → data
  (the probabilistic generative model runs downward; Bayesian inference runs upward)

[Figure: a theory T generates structures S1, S2, …, each of which generates many data sets D]
Bayesian models of inductive learning
• Inductive leaps can be explained with hierarchical Theory-based Bayesian models.
• What the approach offers:
  – Strong quantitative models of generalization behavior.
  – Flexibility to model the different patterns of reasoning that arise in different tasks and domains, using differently structured theories but the same general-purpose Bayesian engine.
  – A framework for explaining why inductive generalization works, and where knowledge comes from as well as how it is used.
Bayesian models of inductive learning
• Inductive leaps can be explained with hierarchical Theory-based Bayesian models.
• Challenges:
  – Theories are hard.
• The interaction between structure and statistics is crucial:
  – How structured knowledge supports statistical learning, by constraining hypothesis spaces.
  – How statistics supports reasoning with, and learning of, structured knowledge.
  – How complex structures can grow from data, rather than being fully specified in advance.