Logistics
• Class size? Who is new? Who is listening?
• Everyone on Athena mailing list “concepts-and-theories”? If not, write to me.
• Everyone on Stellar yet? If not, write to Melissa Yeh ([email protected]).
• Interest in having a printed course pack, even if a few readings get changed?
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Virtues of Bayesian framework
• Generates principled models with strong explanatory and descriptive power.
• Unifies models of cognition across tasks and domains.
  – Tasks: categorization, concept learning, word learning, inductive reasoning, causal inference, conceptual change
  – Domains: biology, physics, psychology, language, . . .
• Explains which processing models work, and why.
  – Associative learning
  – Connectionist networks
  – Similarity to examples
  – Toolkit of simple heuristics
• Allows us to move beyond classic dichotomies.
  – Symbols (rules, logic, hierarchies, relations) versus statistics
  – Domain-general versus domain-specific
  – Nature versus nurture
• A framework for understanding theory-based cognition:
  – How are theories used to learn about the structure of the world?
  – How are theories acquired?
Rational statistical inference (Bayes, Laplace)
• Fundamental question: How do we update beliefs in light of data?
• Fundamental (and only) assumption: represent degrees of belief as probabilities.
• The answer: the mathematics of probability theory.
What does probability mean?
Frequentists: Probability as expected frequency
• P(A) = 1: A will always occur.
• P(A) = 0: A will never occur.
• 0.5 < P(A) < 1: A will occur more often than not.
• P(“heads”) = 0.5 ~ “If we flip 100 times, we expect to see about 50 heads.”
Subjectivists: Probability as degree of belief
• P(A) = 1: believe A is true.
• P(A) = 0: believe A is false.
• 0.5 < P(A) < 1: believe A is more likely to be true than false.
• P(“heads”) = 0.5 ~ “On the next flip, it’s an even bet whether it comes up heads or tails.”
• P(“rain tomorrow”) = 0.8
• P(“Saddam Hussein is dead”) = 0.1
• . . .
Is subjective probability cognitively viable?
• Evolutionary psychologists (Gigerenzer, Cosmides, Tooby, Pinker) argue it is not.
“To understand the design of statistical inference mechanisms, then, one needs to examine what form inductive-reasoning problems -- and the information relevant to solving them -- regularly took in ancestral environments. […] Asking for the probability of a single event seems unexceptionable in the modern world, where we are bombarded with numerically expressed statistical information, such as weather forecasts telling us there is a 60% chance of rain today. […] In ancestral environments, the only external database available from which to reason inductively was one's own observations and, possibly, those communicated by the handful of other individuals with whom one lived. The ‘probability’ of a single event cannot be observed by an individual, however. Single events either happen or they don’t -- either it will rain today or it will not. Natural selection cannot build cognitive mechanisms designed to reason about, or receive as input, information in a format that did not regularly exist.”
(Brase, Cosmides and Tooby, 1998)
Is subjective probability cognitively viable?
• Evolutionary psychologists (Gigerenzer, Cosmides, Tooby, Pinker) argue it is not.
• Reasons to think it is:
  – Intuitions are old and potentially universal (Aristotle, the Talmud).
  – Represented in semantics (and syntax?) of natural language.
  – Extremely useful . . . .
Why be subjectivist?
• Often need to make inferences about singular events
  – e.g., How likely is it to rain tomorrow?
• Cox axioms
  – A formal model of common sense.
• “Dutch book” + survival of the fittest
  – If your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
• Provides a theory of learning
  – A common currency for combining prior knowledge and the lessons of experience.
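The sure-loss argument behind the Dutch book can be made concrete in a few lines of Python. This is a toy sketch: `sure_profit` and the betting setup (agent posts prices equal to its degrees of belief; a bookie buys or sells unit bets at those prices) are illustrative, not from the lecture.

```python
# Dutch book sketch: an agent whose beliefs violate P(A) + P(not-A) = 1
# can be out-gambled. The bookie sells (or buys) a $1 bet on A and a $1
# bet on not-A at the agent's own prices; exactly one bet pays out.
def sure_profit(bel_a, bel_not_a, stake=1.0):
    """Bookie's guaranteed profit per unit stake against these beliefs."""
    total = bel_a + bel_not_a
    if total > 1:       # sell both bets: collect total, pay out exactly 1
        return (total - 1) * stake
    elif total < 1:     # buy both bets: pay total, collect exactly 1
        return (1 - total) * stake
    return 0.0          # coherent beliefs: no sure profit exists here

print(sure_profit(0.7, 0.6))  # ≈ 0.3: the agent loses whatever happens
print(sure_profit(0.5, 0.5))  # coherent: 0.0
```

The point of the sketch is only the sign of the outcome: any gap between Bel(A) + Bel(¬A) and 1 is a guaranteed loss for the incoherent agent.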
Cox Axioms (via Jaynes)
• Degrees of belief are represented by real numbers.
• Qualitative correspondence with common sense, e.g.:
  Bel(¬A) = f[Bel(A)]
  Bel(A ∧ B) = g[Bel(A), Bel(B|A)]
• Consistency:
  – If a conclusion can be reasoned in more than one way, then every possible way must lead to the same result.
  – All available evidence should be taken into account when inferring a degree of belief.
  – Equivalent states of knowledge should be represented with equivalent degrees of belief.
• Accepting these axioms implies Bel can be represented as a probability measure.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Example: flipping coins
• Flip a coin 10 times and see 5 heads, 5 tails.
• P(heads) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• “Future will be like the past.”
• Suppose we had seen 4 heads and 6 tails.
• P(heads) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
Example: flipping coins
• Represent prior knowledge as fictional observations F.
• E.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair.
• After seeing 4 heads, 6 tails, P(heads) on next flip = 1004 / (1004+1006) = 49.95%.
• E.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair.
• After seeing 4 heads, 6 tails, P(heads) on next flip = 7 / (7+9) = 43.75%. Prior knowledge too weak.
Example: flipping thumbtacks
• Represent prior knowledge as fictional observations F.
• E.g., F = {4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads.
• After seeing 2 heads, 0 tails, P(heads) on next flip = 6 / (6+3) = 67%.
• Some prior knowledge is always necessary to avoid jumping to hasty conclusions.
• Suppose F = { }: after seeing 2 heads, 0 tails, P(heads) on next flip = 2 / (2+0) = 100%.
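The fictional-observation arithmetic above can be collected into one helper. A minimal Python sketch; `predict_heads` is a name introduced here for illustration, not from the lecture.

```python
# Predicting the next flip by pooling real and fictional observations.
def predict_heads(heads, tails, fict_heads, fict_tails):
    """P(heads on next flip) given real counts and fictional counts F."""
    return (heads + fict_heads) / (heads + tails + fict_heads + fict_tails)

# Strong fair-coin prior, F = {1000 heads, 1000 tails}:
print(predict_heads(4, 6, 1000, 1000))  # 1004/2010 ≈ 0.4995
# Weak fair-coin prior, F = {3 heads, 3 tails}:
print(predict_heads(4, 6, 3, 3))        # 7/16 = 0.4375
# Thumbtacks, F = {4 heads, 3 tails}, after 2 heads, 0 tails:
print(predict_heads(2, 0, 4, 3))        # 6/9 ≈ 0.667
# No prior at all, F = { }: jumps to a hasty conclusion.
print(predict_heads(2, 0, 0, 0))        # 2/2 = 1.0
```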
Origin of prior knowledge
• Tempting answer: prior experience.
• Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails.
• By assuming all coins (and flips) are alike, these observations of other coins are as good as actual observations of the present coin.
Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any thumbtack flips.
  – Prior knowledge is stronger than raw experience justifies.
• Haven’t seen exactly equal numbers of heads and tails.
  – Prior knowledge is smoother than raw experience justifies.
• Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins.
  – Prior knowledge is more structured than raw experience.
A simple theory
• “Coins are manufactured by a standardized procedure that is effective but not perfect.”
  – Justifies generalizing from previous coins to the present coin.
  – Justifies a smoother and stronger prior than raw experience alone.
  – Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.
• “Tacks are asymmetric, and manufactured to less exacting standards.”
Limitations
• Can all domain knowledge be represented so simply, in terms of an equivalent number of fictional observations?
• Suppose you flip a coin 25 times and get all heads. Something funny is going on ….
• But with F ={1000 heads, 1000 tails}, P(heads) on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Basics
• Propositions: A, B, C, . . . .
• Negation: ¬A
• Logical operators “and”, “or”: A ∧ B, A ∨ B
• Obey classical logic, e.g., ¬(A ∧ B) = ¬A ∨ ¬B
Basics
• Conservation of belief: P(A) + P(¬A) = 1
• “Joint probability”: P(A ∧ B), also written P(A,B)
• For independent propositions: P(A,B) = P(A) P(B)
• More generally: P(A,B) = P(A) P(B|A) = P(B) P(A|B), where P(B|A) is the “conditional probability” of B given A.
Basics
• Example: A = “Heads on flip 2”, B = “Tails on flip 2”
• If A and B were independent: P(A,B) = P(A) P(B) = 1/2 × 1/2 = 1/4
• In fact: P(A,B) = P(A) P(B|A) = 1/2 × 0 = 0
Basics
• All probabilities should be conditioned on background knowledge K: e.g., P(A|K)
• All the same rules hold conditioned on any K: e.g., P(A,B|K) = P(A|K) P(B|A,K)
• Often background knowledge will be implicit, brought in as needed.
Bayesian inference
• Definition of conditional probability: P(A,B) = P(A) P(B|A) = P(B) P(A|B)
• Bayes’ theorem: P(B|A) = P(A|B) P(B) / P(A)
Bayesian inference
• Definition of conditional probability: P(A,B) = P(A) P(B|A) = P(B) P(A|B)
• Bayes’ rule: P(H|D) = P(D|H) P(H) / P(D)
• “Posterior probability”: P(H|D)
• “Prior probability”: P(H)
• “Likelihood”: P(D|H)
Bayesian inference
• Bayes’ rule: P(H|D) = P(D|H) P(H) / P(D)
• What makes a good scientific argument? P(H|D) is high if:
  – Hypothesis is plausible: P(H) is high
  – Hypothesis strongly predicts the observed data: P(D|H) is high
  – Data are surprising: P(D) is low
Bayesian inference
• Deriving a more useful version:
  P(B|A) = P(A|B) P(B) / P(A)
  P(¬B|A) = P(A|¬B) P(¬B) / P(A)
• Since P(B|A) + P(¬B|A) = 1:
  [P(B) P(A|B) + P(¬B) P(A|¬B)] / P(A) = 1
• So P(A) = P(B) P(A|B) + P(¬B) P(A|¬B)  (“conditionalization”)
  Equivalently, P(A,B) + P(A,¬B) = P(A)  (“marginalization”)
• Substituting back into Bayes’ rule:
  P(B|A) = P(B) P(A|B) / [P(B) P(A|B) + P(¬B) P(A|¬B)]
• In terms of hypotheses and data:
  P(H|D) = P(H) P(D|H) / [P(H) P(D|H) + P(¬H) P(D|¬H)]
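The final formula can be checked numerically. A minimal Python sketch; `posterior` and the example numbers are illustrative, not from the lecture.

```python
# P(H|D) = P(D|H) P(H) / (P(D|H) P(H) + P(D|~H) P(~H))
def posterior(prior_h, lik_h, lik_not_h):
    """Posterior for a hypothesis H versus its complement."""
    num = lik_h * prior_h
    return num / (num + lik_not_h * (1 - prior_h))

# Assumed example: evidence 9x likelier under H than under not-H,
# but H starts with a low prior of 0.01.
print(posterior(0.01, 0.9, 0.1))  # ≈ 0.083: strong evidence, weak prior
```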
Random variables
• Random variable X denotes a set of mutually exclusive, exhaustive propositions (states of the world):
  X = {x1, . . . , xn},  Σi P(X = xi) = 1
• Bayes’ theorem for random variables:
  P(Y = yi | X = x) = P(X = x | Y = yi) P(Y = yi) / Σj P(X = x | Y = yj) P(Y = yj)
• Equivalently, Bayes’ rule for more than two hypotheses:
  P(H = hi | D = d) = P(D = d | H = hi) P(H = hi) / Σj P(D = d | H = hj) P(H = hj)
Sherlock Holmes
• “How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?” (The Sign of the Four)
  P(hi | d) = P(hi) P(d | hi) / [ P(hi) P(d | hi) + Σ_{h ≠ hi} P(h) P(d | h) ]
• Eliminating the impossible means P(d | h) = 0 for every other hypothesis h, so the second term in the denominator vanishes:
  P(hi | d) = P(hi) P(d | hi) / [ P(hi) P(d | hi) ] = 1
• However improbable hi was a priori, it must now be the truth.
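A small sketch of Bayes’ rule over several hypotheses, with the “impossible” ones given zero likelihood. The function name and all numbers here are illustrative, not from the lecture.

```python
def normalize_posteriors(priors, likelihoods):
    """Bayes' rule over a set of mutually exclusive hypotheses."""
    joints = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joints)
    return [j / z for j in joints]

# Holmes: eliminate the impossible (likelihood 0); the sole survivor
# gets posterior 1, however improbable its prior was.
priors      = [0.90, 0.09, 0.01]
likelihoods = [0.0,  0.0,  0.5]   # first two hypotheses ruled out by d
print(normalize_posteriors(priors, likelihoods))  # [0.0, 0.0, 1.0]
```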
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Representativeness in reasoning
Which sequence is more likely to be produced by flipping a fair coin?
HHTHT
HHHHH
A reasoning fallacy
Kahneman & Tversky: people judge the probability of an outcome based on the extent to which it is representative of the generating process.
But how does “representativeness” work?
Predictive versus inductive reasoning
• Prediction: given hypothesis H, infer data D.
  Likelihood: P(D|H)
• Induction: given data D, infer hypothesis H. How?
• Representativeness compares only likelihoods, P(D|H1) versus P(D|H2).
[Diagram: arrow from Hypothesis (H) to Data (D) for prediction; arrow from D back to H, marked “?”, for induction.]
Bayes’ Rule in odds form
  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• D: data; H1, H2: models
• P(H1|D): posterior probability that model 1 generated the data.
• P(D|H1): likelihood of the data given model 1.
• P(H1): prior probability that model 1 generated the data.
Bayesian analysis of coin flipping
  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• D: HHTHT
• H1, H2: fair coin, trick “all heads” coin.
• P(D|H1) = 1/32,  P(H1) = 999/1000
• P(D|H2) = 0,  P(H2) = 1/1000
• P(H1|D) / P(H2|D) = infinity
Bayesian analysis of coin flipping
  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• D: HHHHH
• H1, H2: fair coin, trick “all heads” coin.
• P(D|H1) = 1/32,  P(H1) = 999/1000
• P(D|H2) = 1,  P(H2) = 1/1000
• P(H1|D) / P(H2|D) = 999/32 ~ 30:1
Bayesian analysis of coin flipping
  P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• D: HHHHHHHHHH
• H1, H2: fair coin, trick “all heads” coin.
• P(D|H1) = 1/1024,  P(H1) = 999/1000
• P(D|H2) = 1,  P(H2) = 1/1000
• P(H1|D) / P(H2|D) = 999/1024 ~ 1:1
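The three analyses can be reproduced in a few lines of Python; `posterior_odds` is a name introduced here for illustration.

```python
# Posterior odds for H1 = fair coin versus H2 = trick "all heads" coin.
def posterior_odds(data, prior_h1=999/1000, prior_h2=1/1000):
    lik_h1 = 0.5 ** len(data)                    # fair coin: each flip 1/2
    lik_h2 = 1.0 if set(data) <= {"H"} else 0.0  # trick coin only makes heads
    if lik_h2 == 0.0:
        return float("inf")                      # trick coin is eliminated
    return (lik_h1 * prior_h1) / (lik_h2 * prior_h2)

print(posterior_odds("HHTHT"))       # inf: any tail rules out the trick coin
print(posterior_odds("HHHHH"))       # 999/32 ≈ 31: still favors the fair coin
print(posterior_odds("HHHHHHHHHH"))  # 999/1024 ≈ 1: now an even bet
```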
The role of theories
The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works.
– Easy to imagine how a trick “all heads” coin could work: high prior probability.
– Hard to imagine how a trick “HHTHT” coin could work: low prior probability.
Plan for tonight
• Why be Bayesian?
• Informal introduction to learning as probabilistic inference
• Formal introduction to probabilistic inference
• A little bit of mathematical psychology
• An introduction to Bayes nets
Scaling up
• Three binary variables: Cavity, Toothache, Catch (whether the dentist’s probe catches in your tooth).
  P(cav | ache, catch) = P(ache, catch | cav) P(cav) / [ P(ache, catch | cav) P(cav) + P(ache, catch | ¬cav) P(¬cav) ]
• With n pieces of evidence, we need 2^(n+1) conditional probabilities.
• Here n = 2. Realistically, many more: X-ray, diet, oral hygiene, personality, . . . .
Conditional independence
• All three variables are dependent, but Toothache and Catch are independent given the presence or absence of Cavity.
• Both Toothache and Catch are caused by Cavity, but via independent causal mechanisms.
• In probabilistic terms:
  P(ache, catch | cav) = P(ache | cav) P(catch | cav)
  P(ache, catch | ¬cav) = P(ache | ¬cav) P(catch | ¬cav)
• With n pieces of evidence, x1, . . . , xn, we need only 2n conditional probabilities:
  P(xi | cav), P(xi | ¬cav)
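The two parameter counts can be tabulated directly. A Python sketch; it reads the slide’s count for the full joint model as 2^(n+1) (one table entry per evidence combination, for each value of the cause).

```python
# Parameter counts for n binary evidence variables and one binary cause.
def full_joint_params(n):
    """Without conditional independence: 2^n evidence combos x 2 cause values."""
    return 2 ** (n + 1)

def naive_params(n):
    """With conditional independence: P(x_i|cav) and P(x_i|~cav) for each i."""
    return 2 * n

for n in (2, 10, 20):
    print(n, full_joint_params(n), naive_params(n))
# The gap is exponential: conditional independence is what makes
# large evidence sets tractable.
```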
A simple Bayes net
• Graphical representation of relations between a set of random variables:
  Cavity → Toothache, Cavity → Catch
• Causal interpretation: independent local mechanisms
• Probabilistic interpretation: factorizing complex terms
  P(Ache, Catch, Cav) = P(Ache, Catch | Cav) P(Cav) = P(Ache | Cav) P(Catch | Cav) P(Cav)
• In general: P(A, B, C, . . .) = Π over V in {A, B, C, . . .} of P(V | parents[V])
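The factorization can be checked numerically. A Python sketch; all CPT values below are assumed for illustration, not from the lecture.

```python
from itertools import product

# Assumed conditional probability tables for the Cavity net.
P_CAV = 0.1
P_ACHE = {True: 0.8, False: 0.05}    # P(Ache=true | Cav)
P_CATCH = {True: 0.9, False: 0.1}    # P(Catch=true | Cav)

def bern(p, x):
    """Probability of binary outcome x under success probability p."""
    return p if x else 1.0 - p

def joint(ache, catch, cav):
    # P(Ache, Catch, Cav) = P(Ache|Cav) P(Catch|Cav) P(Cav)
    return bern(P_ACHE[cav], ache) * bern(P_CATCH[cav], catch) * bern(P_CAV, cav)

# Sanity check: the joint sums to 1 over all 8 states.
total = sum(joint(a, c, v) for a, c, v in product([True, False], repeat=3))
print(round(total, 10))  # 1.0

# The joint supports any inference, e.g. P(Cav | ache, catch):
num = joint(True, True, True)
den = num + joint(True, True, False)
print(num / den)  # ≈ 0.94 with these assumed CPTs
```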
A more complex system
  Battery → Radio, Battery → Ignition; Ignition, Gas → Starts; Starts → On time to work
• Joint distribution sufficient for any inference:
  P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)
  P(O|G) = P(O, G) / P(G)
         = Σ over B, R, I, S of P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S) / P(G)
         = Σ_S P(O|S) Σ_I P(S|I,G) Σ_B P(B) P(I|B)
  (sums pushed inward; Σ_R P(R|B) = 1 and P(G) cancels)
• General inference algorithm: local message passing
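The pushed-in sums are exactly variable elimination with local messages. A Python sketch computing P(O=true | G=true); every CPT number below is assumed for illustration, none are from the lecture.

```python
# Variable elimination in the car network, with sums pushed inward
# as on the slide. Radio drops out entirely (sum_R P(R|B) = 1).
P_B = 0.95                                        # P(Battery=true)
P_I = {True: 0.9, False: 0.1}                     # P(Ignition=true | B)
P_S = {(True, True): 0.95, (True, False): 0.0,    # P(Starts=true | I, G)
       (False, True): 0.0, (False, False): 0.0}
P_O = {True: 0.9, False: 0.0}                     # P(OnTime=true | S)

def bern(p, x):
    return p if x else 1.0 - p

# Innermost sum over B: message m_I(i) = sum_B P(B) P(I=i | B)
m_I = {i: sum(bern(P_B, b) * bern(P_I[b], i) for b in (True, False))
       for i in (True, False)}
# Next sum over I: message m_S(s) = sum_I P(S=s | I, G=true) m_I(i)
m_S = {s: sum(bern(P_S[(i, True)], s) * m_I[i] for i in (True, False))
       for s in (True, False)}
# Outer sum over S: P(O=true | G=true) = sum_S P(O=true | S) m_S(s)
p_on_time = sum(bern(P_O[s], True) * m_S[s] for s in (True, False))
print(p_on_time)  # ≈ 0.735 with these assumed CPTs
```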
Explaining away
  Rain → Grass Wet ← Sprinkler
• Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
  P(W = w | R, S) = 1 if R = r or S = s
  P(W = w | R, S) = 0 if R = ¬r and S = ¬s
  P(R, S, W) = P(R) P(S) P(W | R, S)
• Compute the probability it rained last night, given that the grass is wet:
  P(r | w) = P(w | r) P(r) / P(w)
           = P(w | r) P(r) / Σ over r′, s′ of P(w | r′, s′) P(r′, s′)
           = P(r) / [ P(r, s) + P(r, ¬s) + P(¬r, s) ]
           = P(r) / [ P(r) + P(¬r) P(s) ]
  The denominator lies between P(s) and 1, so P(r | w) ≥ P(r): wet grass makes rain more probable.
• Compute the probability it rained last night, given that the grass is wet and the sprinklers were left on:
  P(r | w, s) = P(w | r, s) P(r | s) / P(w | s)
  Both P(w | r, s) and P(w | s) equal 1, and Rain and Sprinkler are a priori independent, so
  P(r | w, s) = P(r | s) = P(r)
• “Discounting” to the prior probability: learning that the sprinklers were on explains away the wet grass.
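Both computations can be verified numerically. A Python sketch; the values of P(r) and P(s) are assumed for illustration.

```python
# Explaining away with deterministic wetness: w iff (r or s).
P_R, P_S = 0.2, 0.4   # assumed priors for rain and sprinkler

def p_w(r, s):
    """P(Wet=true | R, S): grass is wet iff it rained or sprinklers ran."""
    return 1.0 if (r or s) else 0.0

def bern(p, x):
    return p if x else 1.0 - p

states = [(r, s) for r in (True, False) for s in (True, False)]

# P(r | w) = P(r) / P(w), since P(w | r) = 1.
p_wet = sum(bern(P_R, r) * bern(P_S, s) * p_w(r, s) for r, s in states)
p_r_given_w = P_R / p_wet                 # raised above the prior P_R

# P(r | w, s): sprinkler alone already explains the wet grass.
p_r_given_ws = (P_R * P_S * p_w(True, True)) / sum(
    bern(P_R, r) * P_S * p_w(r, True) for r in (True, False))

print(p_r_given_w, p_r_given_ws)  # ≈ 0.385 vs 0.2: discounting to the prior
```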
Contrast w/ spreading activation
  Rain → Grass Wet ← Sprinkler
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active.
• Observing grass wet, Rain and Sprinkler become more active.
• Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!
Contrast w/ spreading activation
  Rain → Grass Wet ← Sprinkler
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link: Rain — Sprinkler
• Observing grass wet, Rain and Sprinkler become more active.
• Observing grass wet and sprinkler, Rain becomes less active: explaining away.
Contrast w/ spreading activation
  Rain, Sprinkler, Burst pipe → Grass Wet
• Each new variable requires more inhibitory connections.
• Interactions between variables are not causal.
• Not modular.
  – Whether a connection exists depends on what other connections exist, in non-transparent ways.
  – Big holism problem.
  – Combinatorial explosion.
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
  P(A, B, C, . . .) = Π over V in {A, B, C, . . .} of P(V | parents[V])
• Example: Cavity → Ache, Cavity → Catch
  P(Ache, Catch | Cav) = P(Ache, Catch, Cav) / P(Cav)
                       = P(Ache | Cav) P(Catch | Cav) P(Cav) / P(Cav)
                       = P(Ache | Cav) P(Catch | Cav)
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
• Example: Rain → Grass Wet ← Sprinkler
  P(Rain, Sprinkler) = Σ over Wet of P(Rain, Sprinkler, Wet)
                     = Σ over Wet of P(Wet | Rain, Sprinkler) P(Rain) P(Sprinkler)
                     = P(Rain) P(Sprinkler)
  since Σ over Wet of P(Wet | Rain, Sprinkler) = 1, for any values of Rain and Sprinkler.
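The marginalization can be checked for arbitrary CPT values. A Python sketch; the numbers are assumed for illustration.

```python
# Summing Wet out of P(Rain, Sprinkler, Wet) recovers P(Rain) P(Sprinkler),
# because sum_Wet P(Wet | R, S) = 1 for every (R, S).
P_R, P_S = 0.2, 0.4
P_W = {(True, True): 0.99, (True, False): 0.9,    # P(Wet=true | R, S), assumed
       (False, True): 0.85, (False, False): 0.05}

def bern(p, x):
    return p if x else 1.0 - p

for r in (True, False):
    for s in (True, False):
        marg = sum(bern(P_R, r) * bern(P_S, s) * bern(P_W[(r, s)], w)
                   for w in (True, False))
        assert abs(marg - bern(P_R, r) * bern(P_S, s)) < 1e-12

print("P(Rain, Sprinkler) = P(Rain) P(Sprinkler) for all values")
```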
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
  Ache → Cavity ← Catch
• Does not capture the correlation between symptoms: falsely implies P(Ache, Catch) = P(Ache) P(Catch).
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
  Ache → Cavity ← Catch, plus a new arrow Ache → Catch
• Inserting a new arrow allows us to capture this correlation.
• But this model is too complex: it no longer encodes our belief that P(Ache, Catch | Cav) = P(Ache | Cav) P(Catch | Cav).
Causality and the Markov property
• Markov property: any variable is conditionally independent of its non-descendants, given its parents.
• Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”:
  Ache, Catch, X-ray → Cavity, with arrows among the symptoms
• New symptoms require a combinatorial proliferation of new arrows. Too general, not modular, holism, yuck . . . .
Still to come
• Applications to models of categorization
• More on the relation between causality and probability:
  – Causal structure → statistical dependencies
• Learning causal graph structures
• Learning causal abstractions (“diseases cause symptoms”)
• What’s missing
The end
Mathcamp data: raw
Mathcamp data: collapsed over parity
Zenith radio data: collapsed over parity