Logistics
• Class size? Who is new? Who is listening?
• Everyone on Athena mailing list “concepts-and-theories”? If not, write to me.
• Everyone on Stellar yet? If not, write to Melissa Yeh ([email protected]).
• Interest in having a printed course pack, even if a few readings get changed?
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Virtues of Bayesian framework
• Generates principled models with strong explanatory and descriptive power.
• Unifies models of cognition across tasks and domains.
  – Tasks: categorization, concept learning, word learning, inductive reasoning, causal inference, conceptual change
  – Domains: biology, physics, psychology, language, . . .
• Explains which processing models work, and why.
  – Associative learning
  – Connectionist networks
  – Similarity to examples
  – Toolkit of simple heuristics
• Allows us to move beyond classic dichotomies.
  – Symbols (rules, logic, hierarchies, relations) versus statistics
  – Domain-general versus domain-specific
  – Nature versus nurture
• A framework for understanding theory-based cognition:
  – How are theories used to learn about the structure of the world?
  – How are theories acquired?
Rational statistical inference (Bayes, Laplace)
• Fundamental question: How do we update beliefs in light of data?
• Fundamental (and only) assumption: represent degrees of belief as probabilities.
• The answer: the mathematics of probability theory.
What does probability mean?
Frequentists: Probability as expected frequency
• P(A) = 1: A will always occur.
• P(A) = 0: A will never occur.
• 0.5 < P(A) < 1: A will occur more often than not.
• P(“heads”) = 0.5 ~ “If we flip 100 times, we expect to see about 50 heads.”
Subjectivists: Probability as degree of belief
• P(A) = 1: believe A is true.
• P(A) = 0: believe A is false.
• 0.5 < P(A) < 1: believe A is more likely to be true than false.
• P(“heads”) = 0.5 ~ “On the next flip, it’s an even bet whether it comes up heads or tails.”
• P(“rain tomorrow”) = 0.8
• P(“Saddam Hussein is dead”) = 0.1
• . . .
Is subjective probability cognitively viable?
• Evolutionary psychologists (Gigerenzer, Cosmides, Tooby, Pinker) argue it is not.
“To understand the design of statistical inference mechanisms, then, one needs to examine what form inductive-reasoning problems -- and the information relevant to solving them -- regularly took in ancestral environments. […] Asking for the probability of a single event seems unexceptionable in the modern world, where we are bombarded with numerically expressed statistical information, such as weather forecasts telling us there is a 60% chance of rain today. […] In ancestral environments, the only external database available from which to reason inductively was one's own observations and, possibly, those communicated by the handful of other individuals with whom one lived. The ‘probability’ of a single event cannot be observed by an individual, however. Single events either happen or they don’t -- either it will rain today or it will not. Natural selection cannot build cognitive mechanisms designed to reason about, or receive as input, information in a format that did not regularly exist.”
(Brase, Cosmides and Tooby, 1998)
Is subjective probability cognitively viable?
• Evolutionary psychologists (Gigerenzer, Cosmides, Tooby, Pinker) argue it is not.
• Reasons to think it is:
  – Intuitions are old and potentially universal (Aristotle, the Talmud).
  – Represented in semantics (and syntax?) of natural language.
  – Extremely useful . . . .
Why be subjectivist?
• Often need to make inferences about singular events
  – e.g., How likely is it to rain tomorrow?
• Cox axioms
  – A formal model of common sense.
• “Dutch book” + survival of the fittest
  – If your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
• Provides a theory of learning
  – A common currency for combining prior knowledge and the lessons of experience.
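The sure-loss argument behind the Dutch book can be made concrete in a few lines of Python. This is a toy sketch: `sure_profit` and the betting setup (agent posts prices equal to its degrees of belief; a bookie buys or sells unit bets at those prices) are illustrative, not from the lecture.

```python
# Dutch book sketch: an agent whose beliefs violate P(A) + P(not-A) = 1
# can be out-gambled. The bookie sells (or buys) a $1 bet on A and a $1
# bet on not-A at the agent's own prices; exactly one bet pays out.
def sure_profit(bel_a, bel_not_a, stake=1.0):
    """Bookie's guaranteed profit per unit stake against these beliefs."""
    total = bel_a + bel_not_a
    if total > 1:       # sell both bets: collect total, pay out exactly 1
        return (total - 1) * stake
    elif total < 1:     # buy both bets: pay total, collect exactly 1
        return (1 - total) * stake
    return 0.0          # coherent beliefs: no sure profit exists here

print(sure_profit(0.7, 0.6))  # ≈ 0.3: the agent loses whatever happens
print(sure_profit(0.5, 0.5))  # coherent: 0.0
```

The point of the sketch is only the sign of the outcome: any gap between Bel(A) + Bel(¬A) and 1 is a guaranteed loss for the incoherent agent.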
Cox Axioms (via Jaynes)
• Degrees of belief are represented by real numbers.
• Qualitative correspondence with common sense, e.g.:
  Bel(¬A) = f[Bel(A)]
  Bel(A ∧ B) = g[Bel(A), Bel(B|A)]
• Consistency:
  – If a conclusion can be reasoned in more than one way, then every possible way must lead to the same result.
  – All available evidence should be taken into account when inferring a degree of belief.
  – Equivalent states of knowledge should be represented with equivalent degrees of belief.
• Accepting these axioms implies Bel can be represented as a probability measure.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Example: flipping coins
• Flip a coin 10 times and see 5 heads, 5 tails.
• P(heads) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• “Future will be like the past.”
• Suppose we had seen 4 heads and 6 tails.
• P(heads) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
Example: flipping coins
• Represent prior knowledge as fictional observations F.
• E.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair.
• After seeing 4 heads, 6 tails, P(heads) on next flip = 1004 / (1004+1006) = 49.95%.
• E.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair.
• After seeing 4 heads, 6 tails, P(heads) on next flip = 7 / (7+9) = 43.75%. Prior knowledge too weak.
Example: flipping thumbtacks
• Represent prior knowledge as fictional observations F.
• E.g., F = {4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads.
• After seeing 2 heads, 0 tails, P(heads) on next flip = 6 / (6+3) = 67%.
• Some prior knowledge is always necessary to avoid jumping to hasty conclusions.
• Suppose F = { }: after seeing 2 heads, 0 tails, P(heads) on next flip = 2 / (2+0) = 100%.
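The fictional-observation arithmetic above can be collected into one helper. A minimal Python sketch; `predict_heads` is a name introduced here for illustration, not from the lecture.

```python
# Predicting the next flip by pooling real and fictional observations.
def predict_heads(heads, tails, fict_heads, fict_tails):
    """P(heads on next flip) given real counts and fictional counts F."""
    return (heads + fict_heads) / (heads + tails + fict_heads + fict_tails)

# Strong fair-coin prior, F = {1000 heads, 1000 tails}:
print(predict_heads(4, 6, 1000, 1000))  # 1004/2010 ≈ 0.4995
# Weak fair-coin prior, F = {3 heads, 3 tails}:
print(predict_heads(4, 6, 3, 3))        # 7/16 = 0.4375
# Thumbtacks, F = {4 heads, 3 tails}, after 2 heads, 0 tails:
print(predict_heads(2, 0, 4, 3))        # 6/9 ≈ 0.667
# No prior at all, F = { }: jumps to a hasty conclusion.
print(predict_heads(2, 0, 0, 0))        # 2/2 = 1.0
```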
Origin of prior knowledge
• Tempting answer: prior experience.
• Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails.
• By assuming all coins (and flips) are alike, these observations of other coins are as good as actual observations of the present coin.
Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any thumbtack flips.
  – Prior knowledge is stronger than raw experience justifies.
• Haven’t seen exactly equal numbers of heads and tails.
  – Prior knowledge is smoother than raw experience justifies.
• Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins.
  – Prior knowledge is more structured than raw experience.
A simple theory
• “Coins are manufactured by a standardized procedure that is effective but not perfect.”
  – Justifies generalizing from previous coins to the present coin.
  – Justifies a smoother and stronger prior than raw experience alone.
  – Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.
• “Tacks are asymmetric, and manufactured to less exacting standards.”
Limitations
• Can all domain knowledge be represented so simply, in terms of an equivalent number of fictional observations?
• Suppose you flip a coin 25 times and get all heads. Something funny is going on ….
• But with F ={1000 heads, 1000 tails}, P(heads) on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Basics
• Propositions: A, B, C, . . . .
• Negation: ¬A
• Logical operators “and”, “or”: A ∧ B, A ∨ B
• Obey classical logic, e.g., ¬(A ∧ B) = ¬A ∨ ¬B
Basics
• Conservation of belief: P(A) + P(¬A) = 1
• “Joint probability”: P(A ∧ B), also written P(A,B)
• For independent propositions: P(A,B) = P(A) P(B)
• More generally: P(A,B) = P(A) P(B|A) = P(B) P(A|B), where P(B|A) is the “conditional probability” of B given A.
Basics
• Example: A = “Heads on flip 2”, B = “Tails on flip 2”
• If A and B were independent: P(A,B) = P(A) P(B) = 1/2 × 1/2 = 1/4
• In fact: P(A,B) = P(A) P(B|A) = 1/2 × 0 = 0
Basics
• All probabilities should be conditioned on background knowledge K: e.g., P(A|K)
• All the same rules hold conditioned on any K: e.g., P(A,B|K) = P(A|K) P(B|A,K)
• Often background knowledge will be implicit, brought in as needed.
Bayesian inference
• Definition of conditional probability: P(A,B) = P(A) P(B|A) = P(B) P(A|B)
• Bayes’ theorem: P(B|A) = P(A|B) P(B) / P(A)
Bayesian inference
• Definition of conditional probability: P(A,B) = P(A) P(B|A) = P(B) P(A|B)
• Bayes’ rule: P(H|D) = P(D|H) P(H) / P(D)
• “Posterior probability”: P(H|D)
• “Prior probability”: P(H)
• “Likelihood”: P(D|H)
Bayesian inference
• Bayes’ rule: P(H|D) = P(D|H) P(H) / P(D)
• What makes a good scientific argument? P(H|D) is high if:
  – Hypothesis is plausible: P(H) is high
  – Hypothesis strongly predicts the observed data: P(D|H) is high
  – Data are surprising: P(D) is low
Bayesian inference
• Deriving a more useful version:
  P(B|A) = P(A|B) P(B) / P(A)
  P(¬B|A) = P(A|¬B) P(¬B) / P(A)
• Since P(B|A) + P(¬B|A) = 1:
  [P(B) P(A|B) + P(¬B) P(A|¬B)] / P(A) = 1
• So P(A) = P(B) P(A|B) + P(¬B) P(A|¬B)  (“conditionalization”)
  Equivalently, P(A,B) + P(A,¬B) = P(A)  (“marginalization”)
• Substituting back into Bayes’ rule:
  P(B|A) = P(B) P(A|B) / [P(B) P(A|B) + P(¬B) P(A|¬B)]
• In terms of hypotheses and data:
  P(H|D) = P(H) P(D|H) / [P(H) P(D|H) + P(¬H) P(D|¬H)]
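The final formula can be checked numerically. A minimal Python sketch; `posterior` and the example numbers are illustrative, not from the lecture.

```python
# P(H|D) = P(D|H) P(H) / (P(D|H) P(H) + P(D|~H) P(~H))
def posterior(prior_h, lik_h, lik_not_h):
    """Posterior for a hypothesis H versus its complement."""
    num = lik_h * prior_h
    return num / (num + lik_not_h * (1 - prior_h))

# Assumed example: evidence 9x likelier under H than under not-H,
# but H starts with a low prior of 0.01.
print(posterior(0.01, 0.9, 0.1))  # ≈ 0.083: strong evidence, weak prior
```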
Random variables
• Random variable X denotes a set of mutually exclusive, exhaustive propositions (states of the world):
  X = {x1, . . . , xn},  Σi P(X = xi) = 1
• Bayes’ theorem for random variables:
  P(Y = yi | X = x) = P(X = x | Y = yi) P(Y = yi) / Σj P(X = x | Y = yj) P(Y = yj)
• Equivalently, Bayes’ rule for more than two hypotheses:
  P(H = hi | D = d) = P(D = d | H = hi) P(H = hi) / Σj P(D = d | H = hj) P(H = hj)
Sherlock Holmes
• “How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?” (The Sign of the Four)
  P(hi | d) = P(hi) P(d | hi) / [ P(hi) P(d | hi) + Σ_{h ≠ hi} P(h) P(d | h) ]
• Eliminating the impossible means P(d | h) = 0 for every other hypothesis h, so the second term in the denominator vanishes:
  P(hi | d) = P(hi) P(d | hi) / [ P(hi) P(d | hi) ] = 1
• However improbable hi was a priori, it must now be the truth.
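A small sketch of Bayes’ rule over several hypotheses, with the “impossible” ones given zero likelihood. The function name and all numbers here are illustrative, not from the lecture.

```python
def normalize_posteriors(priors, likelihoods):
    """Bayes' rule over a set of mutually exclusive hypotheses."""
    joints = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joints)
    return [j / z for j in joints]

# Holmes: eliminate the impossible (likelihood 0); the sole survivor
# gets posterior 1, however improbable its prior was.
priors      = [0.90, 0.09, 0.01]
likelihoods = [0.0,  0.0,  0.5]   # first two hypotheses ruled out by d
print(normalize_posteriors(priors, likelihoods))  # [0.0, 0.0, 1.0]
```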
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Representativeness in reasoning
Which sequence is more likely to be produced by flipping a fair coin?
HHTHT
HHHHH
A reasoning fallacy
Kahneman & Tversky: people judge the probability of an outcome based on the extent to which it is representative of the generating process.
But how does “representativeness” work?
Predictive versus inductive reasoning
• Prediction: given hypothesis H, infer data D.
  Likelihood: P(D|H)
• Induction: given data D, infer hypothesis H. How?
• Representativeness compares only likelihoods, P(D|H1) versus P(D|H2).
[Diagram: arrow from Hypothesis (H) to Data (D) for prediction; arrow from D back to H, marked “?”, for induction.]
Bayes’ Rule in odds form
  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• D: data; H1, H2: models
• P(H1|D): posterior probability that model 1 generated the data.
• P(D|H1): likelihood of the data given model 1.
• P(H1): prior probability that model 1 generated the data.
Bayesian analysis of coin flipping
  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• D: HHTHT
• H1, H2: fair coin, trick “all heads” coin.
• P(D|H1) = 1/32,  P(H1) = 999/1000
• P(D|H2) = 0,  P(H2) = 1/1000
• P(H1|D) / P(H2|D) = infinity
Bayesian analysis of coin flipping
  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• D: HHHHH
• H1, H2: fair coin, trick “all heads” coin.
• P(D|H1) = 1/32,  P(H1) = 999/1000
• P(D|H2) = 1,  P(H2) = 1/1000
• P(H1|D) / P(H2|D) = 999/32 ~ 30:1
Bayesian analysis of coin flipping
  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• D: HHHHHHHHHH
• H1, H2: fair coin, trick “all heads” coin.
• P(D|H1) = 1/1024,  P(H1) = 999/1000
• P(D|H2) = 1,  P(H2) = 1/1000
• P(H1|D) / P(H2|D) = 999/1024 ~ 1:1
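The three analyses can be reproduced in a few lines of Python; `posterior_odds` is a name introduced here for illustration.

```python
# Posterior odds for H1 = fair coin versus H2 = trick "all heads" coin.
def posterior_odds(data, prior_h1=999/1000, prior_h2=1/1000):
    lik_h1 = 0.5 ** len(data)                    # fair coin: each flip 1/2
    lik_h2 = 1.0 if set(data) <= {"H"} else 0.0  # trick coin only makes heads
    if lik_h2 == 0.0:
        return float("inf")                      # trick coin is eliminated
    return (lik_h1 * prior_h1) / (lik_h2 * prior_h2)

print(posterior_odds("HHTHT"))       # inf: any tail rules out the trick coin
print(posterior_odds("HHHHH"))       # 999/32 ≈ 31: still favors the fair coin
print(posterior_odds("HHHHHHHHHH"))  # 999/1024 ≈ 1: now an even bet
```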
The role of theories
The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works.
– Easy to imagine how a trick “all heads” coin could work: high prior probability.
– Hard to imagine how a trick “HHTHT” coin could work: low prior probability.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Scaling up
• Three binary variables: Cavity, Toothache, Catch (whether the dentist’s probe catches in your tooth).
  P(cav | ache, catch) = P(ache, catch | cav) P(cav) / [ P(ache, catch | cav) P(cav) + P(ache, catch | ¬cav) P(¬cav) ]
• With n pieces of evidence, we need 2^(n+1) conditional probabilities.
• Here n = 2. Realistically, many more: X-ray, diet, oral hygiene, personality, . . . .
Conditional independence
• All three variables are dependent, but Toothache and Catch are independent given the presence or absence of Cavity.
• Both Toothache and Catch are caused by Cavity, but via independent causal mechanisms.
• In probabilistic terms:
  P(ache, catch | cav) = P(ache | cav) P(catch | cav)
  P(ache, catch | ¬cav) = P(ache | ¬cav) P(catch | ¬cav)
• With n pieces of evidence, x1, . . . , xn, we need only 2n conditional probabilities:
  P(xi | cav), P(xi | ¬cav)
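The two parameter counts can be tabulated directly. A Python sketch; it reads the slide’s count for the full joint model as 2^(n+1) (one table entry per evidence combination, for each value of the cause).

```python
# Parameter counts for n binary evidence variables and one binary cause.
def full_joint_params(n):
    """Without conditional independence: 2^n evidence combos x 2 cause values."""
    return 2 ** (n + 1)

def naive_params(n):
    """With conditional independence: P(x_i|cav) and P(x_i|~cav) for each i."""
    return 2 * n

for n in (2, 10, 20):
    print(n, full_joint_params(n), naive_params(n))
# The gap is exponential: conditional independence is what makes
# large evidence sets tractable.
```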
A simple Bayes net
• Graphical representation of relations between a set of random variables:
  Cavity → Toothache, Cavity → Catch
• Causal interpretation: independent local mechanisms
• Probabilistic interpretation: factorizing complex terms
  P(Ache, Catch, Cav) = P(Ache, Catch | Cav) P(Cav) = P(Ache | Cav) P(Catch | Cav) P(Cav)
• In general: P(A, B, C, . . .) = Π over V in {A, B, C, . . .} of P(V | parents[V])
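The factorization can be checked numerically. A Python sketch; all CPT values below are assumed for illustration, not from the lecture.

```python
from itertools import product

# Assumed conditional probability tables for the Cavity net.
P_CAV = 0.1
P_ACHE = {True: 0.8, False: 0.05}    # P(Ache=true | Cav)
P_CATCH = {True: 0.9, False: 0.1}    # P(Catch=true | Cav)

def bern(p, x):
    """Probability of binary outcome x under success probability p."""
    return p if x else 1.0 - p

def joint(ache, catch, cav):
    # P(Ache, Catch, Cav) = P(Ache|Cav) P(Catch|Cav) P(Cav)
    return bern(P_ACHE[cav], ache) * bern(P_CATCH[cav], catch) * bern(P_CAV, cav)

# Sanity check: the joint sums to 1 over all 8 states.
total = sum(joint(a, c, v) for a, c, v in product([True, False], repeat=3))
print(round(total, 10))  # 1.0

# The joint supports any inference, e.g. P(Cav | ache, catch):
num = joint(True, True, True)
den = num + joint(True, True, False)
print(num / den)  # ≈ 0.94 with these assumed CPTs
```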
A more complex system
  Battery → Radio, Battery → Ignition; Ignition, Gas → Starts; Starts → On time to work
• Joint distribution sufficient for any inference:
  P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)
  P(O|G) = P(O, G) / P(G)
         = Σ over B, R, I, S of P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S) / P(G)
         = Σ_S P(O|S) Σ_I P(S|I,G) Σ_B P(B) P(I|B)
  (sums pushed inward; Σ_R P(R|B) = 1 and P(G) cancels)
• General inference algorithm: local message passing
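The pushed-in sums are exactly variable elimination with local messages. A Python sketch computing P(O=true | G=true); every CPT number below is assumed for illustration, none are from the lecture.

```python
# Variable elimination in the car network, with sums pushed inward
# as on the slide. Radio drops out entirely (sum_R P(R|B) = 1).
P_B = 0.95                                        # P(Battery=true)
P_I = {True: 0.9, False: 0.1}                     # P(Ignition=true | B)
P_S = {(True, True): 0.95, (True, False): 0.0,    # P(Starts=true | I, G)
       (False, True): 0.0, (False, False): 0.0}
P_O = {True: 0.9, False: 0.0}                     # P(OnTime=true | S)

def bern(p, x):
    return p if x else 1.0 - p

# Innermost sum over B: message m_I(i) = sum_B P(B) P(I=i | B)
m_I = {i: sum(bern(P_B, b) * bern(P_I[b], i) for b in (True, False))
       for i in (True, False)}
# Next sum over I: message m_S(s) = sum_I P(S=s | I, G=true) m_I(i)
m_S = {s: sum(bern(P_S[(i, True)], s) * m_I[i] for i in (True, False))
       for s in (True, False)}
# Outer sum over S: P(O=true | G=true) = sum_S P(O=true | S) m_S(s)
p_on_time = sum(bern(P_O[s], True) * m_S[s] for s in (True, False))
print(p_on_time)  # ≈ 0.735 with these assumed CPTs
```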
Explaining away
  Rain → Grass Wet ← Sprinkler
• Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
  P(W = w | R, S) = 1 if R = r or S = s
  P(W = w | R, S) = 0 if R = ¬r and S = ¬s
  P(R, S, W) = P(R) P(S) P(W | R, S)
• Compute the probability it rained last night, given that the grass is wet:
  P(r | w) = P(w | r) P(r) / P(w)
           = P(w | r) P(r) / Σ over r′, s′ of P(w | r′, s′) P(r′, s′)
           = P(r) / [ P(r, s) + P(r, ¬s) + P(¬r, s) ]
           = P(r) / [ P(r) + P(¬r) P(s) ]
  The denominator lies between P(s) and 1, so P(r | w) ≥ P(r): wet grass makes rain more probable.
• Compute the probability it rained last night, given that the grass is wet and the sprinklers were left on:
  P(r | w, s) = P(w | r, s) P(r | s) / P(w | s)
  Both P(w | r, s) and P(w | s) equal 1, and Rain and Sprinkler are a priori independent, so
  P(r | w, s) = P(r | s) = P(r)
• “Discounting” to the prior probability: learning that the sprinklers were on explains away the wet grass.
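Both computations can be verified numerically. A Python sketch; the values of P(r) and P(s) are assumed for illustration.

```python
# Explaining away with deterministic wetness: w iff (r or s).
P_R, P_S = 0.2, 0.4   # assumed priors for rain and sprinkler

def p_w(r, s):
    """P(Wet=true | R, S): grass is wet iff it rained or sprinklers ran."""
    return 1.0 if (r or s) else 0.0

def bern(p, x):
    return p if x else 1.0 - p

states = [(r, s) for r in (True, False) for s in (True, False)]

# P(r | w) = P(r) / P(w), since P(w | r) = 1.
p_wet = sum(bern(P_R, r) * bern(P_S, s) * p_w(r, s) for r, s in states)
p_r_given_w = P_R / p_wet                 # raised above the prior P_R

# P(r | w, s): sprinkler alone already explains the wet grass.
p_r_given_ws = (P_R * P_S * p_w(True, True)) / sum(
    bern(P_R, r) * P_S * p_w(r, True) for r in (True, False))

print(p_r_given_w, p_r_given_ws)  # ≈ 0.385 vs 0.2: discounting to the prior
```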
Contrast w/ spreading activation
  Rain → Grass Wet ← Sprinkler
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active.
• Observing grass wet, Rain and Sprinkler become more active.
• Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!
Contrast w/ spreading activation
  Rain → Grass Wet ← Sprinkler
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link: Rain — Sprinkler
• Observing grass wet, Rain and Sprinkler become more active.
• Observing grass wet and sprinkler, Rain becomes less active: explaining away.
Contrast w/ spreading activation
  Rain, Sprinkler, Burst pipe → Grass Wet
• Each new variable requires more inhibitory connections.
• Interactions between variables are not causal.
• Not modular.
  – Whether a connection exists depends on what other connections exist, in non-transparent ways.
  – Big holism problem.
  – Combinatorial explosion.
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
  P(A, B, C, . . .) = Π over V in {A, B, C, . . .} of P(V | parents[V])
• Example: Cavity → Ache, Cavity → Catch
  P(Ache, Catch | Cav) = P(Ache, Catch, Cav) / P(Cav)
                       = P(Ache | Cav) P(Catch | Cav) P(Cav) / P(Cav)
                       = P(Ache | Cav) P(Catch | Cav)
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
• Example: Rain → Grass Wet ← Sprinkler
  P(Rain, Sprinkler) = Σ over Wet of P(Rain, Sprinkler, Wet)
                     = Σ over Wet of P(Wet | Rain, Sprinkler) P(Rain) P(Sprinkler)
                     = P(Rain) P(Sprinkler)
  since Σ over Wet of P(Wet | Rain, Sprinkler) = 1, for any values of Rain and Sprinkler.
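The marginalization can be checked for arbitrary CPT values. A Python sketch; the numbers are assumed for illustration.

```python
# Summing Wet out of P(Rain, Sprinkler, Wet) recovers P(Rain) P(Sprinkler),
# because sum_Wet P(Wet | R, S) = 1 for every (R, S).
P_R, P_S = 0.2, 0.4
P_W = {(True, True): 0.99, (True, False): 0.9,    # P(Wet=true | R, S), assumed
       (False, True): 0.85, (False, False): 0.05}

def bern(p, x):
    return p if x else 1.0 - p

for r in (True, False):
    for s in (True, False):
        marg = sum(bern(P_R, r) * bern(P_S, s) * bern(P_W[(r, s)], w)
                   for w in (True, False))
        assert abs(marg - bern(P_R, r) * bern(P_S, s)) < 1e-12

print("P(Rain, Sprinkler) = P(Rain) P(Sprinkler) for all values")
```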
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
  Ache → Cavity ← Catch
• Does not capture the correlation between symptoms: falsely implies P(Ache, Catch) = P(Ache) P(Catch).
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
  Ache → Cavity ← Catch, plus a new arrow Ache → Catch
• Inserting a new arrow allows us to capture this correlation.
• But this model is too complex: it no longer encodes our belief that P(Ache, Catch | Cav) = P(Ache | Cav) P(Catch | Cav).
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
  Ache, Catch, X-ray → Cavity, with arrows among the symptoms
• New symptoms require a combinatorial proliferation of new arrows. Too general, not modular, holism, yuck . . . .
Still to come
• Applications to models of categorization
• More on the relation between causality and probability:
  – Causal structure → statistical dependencies
• Learning causal graph structures
• Learning causal abstractions (“diseases cause symptoms”)
• What’s missing
The end
Mathcamp data: raw
Mathcamp data: collapsed over parity
Zenith radio data: collapsed over parity