Bayesian models of human inductive learning
Josh Tenenbaum, MIT
Department of Brain and Cognitive Sciences
Computer Science and AI Lab (CSAIL)

Collaborators: Charles Kemp, Pat Shafto, Lauren Schmidt, Chris Baker, Vikash Mansinghka, Tom Griffiths, Takeshi Yamada, Naonori Ueda

Funding: US NSF, AFOSR, ONR, DARPA, NTT Communication Sciences Laboratories, Schlumberger, Eli Lilly & Co., James S. McDonnell Foundation
The probabilistic revolution in AI
• Principled and effective solutions for inductive inference from ambiguous data:
  – Vision
  – Robotics
  – Machine learning
  – Expert systems / reasoning
  – Natural language processing
• Standard view: no necessary connection to how the human brain solves these problems.
Bayesian models of cognition

Visual perception [Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman, Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...]
Language acquisition and processing [Brent, de Marcken, Niyogi, Klein, Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …]
Motor learning and motor control [Ghahramani, Jordan, Wolpert, Kording, Kawato, Doya, Todorov, Shadmehr, …]
Associative learning [Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …]
Memory [Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …]
Attention [Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, …]
Categorization and concept learning [Anderson, Nosofsky, Rehder, Navarro, Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …]
Reasoning [Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …]
Causal inference [Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …]
Decision making and theory of mind [Lee, Stankiewicz, Rao, Baker, Goodman, Tenenbaum, …]
Everyday inductive leaps
How can people learn so much about the world from such limited evidence?
– Learning concepts from examples
“horse” “horse” “horse”
Everyday inductive leaps
How can people learn so much about the world from such limited evidence?
– Kinds of objects and their properties
– The meanings of words, phrases, and sentences
– Cause-effect relations
– The beliefs, goals and plans of other people
– Social structures, conventions, and rules
The solution
Strong prior knowledge (inductive bias).
– How does background knowledge guide learning from sparsely observed data?
– What form does the knowledge take, across different domains and tasks?
– How is that knowledge itself learned?
Our goal: Computational models that answer these questions, with strong quantitative fits to human behavioral data and a bridge to state-of-the-art AI and machine learning.
1. How does background knowledge guide learning from sparsely observed data?
Bayesian inference:
2. What form does background knowledge take, across different domains and tasks?
Probabilities defined over structured representations: graphs, grammars, predicate logic, schemas, theories.
3. How is background knowledge itself acquired, constraining learning while maintaining flexibility?
Hierarchical probabilistic models, with inference at multiple levels of abstraction. Nonparametric models in which complexity grows automatically as the data require.
The approach: from statistics to intelligence
$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h_i \in H} P(d \mid h_i)\,P(h_i)}$$
Basics of Bayesian inference
• Bayes’ rule:
• An example
– Data: John is coughing
– Some hypotheses:
  1. John has a cold
  2. John has lung cancer
  3. John has a stomach flu
– Likelihood P(d|h) favors 1 and 2 over 3
– Prior probability P(h) favors 1 and 3 over 2
– Posterior probability P(h|d) favors 1 over 2 and 3
$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h_i \in H} P(d \mid h_i)\,P(h_i)}$$
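To make the example concrete, here is a minimal Python sketch of Bayes' rule applied to the coughing example. The numerical priors and likelihoods are invented for illustration; only the qualitative ordering (colds common, cancer rare, flu rarely causes coughing) comes from the slide.

```python
# A minimal sketch of Bayes' rule for the coughing example.
# All numbers below are assumed, chosen to match the slide's qualitative claims.

hypotheses = ["cold", "lung cancer", "stomach flu"]

# P(h): colds and stomach flu are common, lung cancer is rare (assumed values)
prior = {"cold": 0.5, "lung cancer": 0.01, "stomach flu": 0.49}

# P(d|h): probability of coughing under each hypothesis (assumed values)
likelihood = {"cold": 0.8, "lung cancer": 0.9, "stomach flu": 0.1}

# Bayes' rule: P(h|d) = P(d|h) P(h) / sum_i P(d|h_i) P(h_i)
evidence = sum(likelihood[h] * prior[h] for h in hypotheses)
posterior = {h: likelihood[h] * prior[h] / evidence for h in hypotheses}

for h in hypotheses:
    print(f"P({h} | coughing) = {posterior[h]:.3f}")
# "cold" comes out most probable: it is favored by both prior and likelihood.
```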
• How likely is the conclusion, given the premises?
“Similarity”, “Typicality”, “Diversity”
Gorillas have T9 hormones.
Seals have T9 hormones.
Squirrels have T9 hormones.
----
Horses have T9 hormones.

Gorillas have T9 hormones.
Chimps have T9 hormones.
Monkeys have T9 hormones.
Baboons have T9 hormones.
----
Horses have T9 hormones.

Gorillas have T9 hormones.
Seals have T9 hormones.
Squirrels have T9 hormones.
----
Flies have T9 hormones.
Property induction
The computational problem
[Figure: matrix of species (Horse, Cow, Chimp, Gorilla, Mouse, Squirrel, Dolphin, Seal, Rhino, Elephant) × features, plus a new-property column with unknown (?) entries]
85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’,…
“Transfer Learning”, “Semi-Supervised Learning”
[Figure: the same species × features matrix, with the new-property column mostly unknown]
Horses have T9 hormones
Rhinos have T9 hormones
Cows have T9 hormones
$$P(Y \mid X) = \frac{\sum_{h \text{ consistent with } X,\,Y} P(h)}{\sum_{h \text{ consistent with } X} P(h)}$$
Prior P(h)
Hypotheses h
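A minimal Python sketch of this computation follows. The hypothesis space and prior weights below are made up for illustration; in the full model the prior comes from a structured representation such as a tree, as developed next.

```python
# Bayesian property induction: P(Y|X) = (sum of priors of hypotheses
# consistent with X and Y) / (sum of priors consistent with X alone).

species = ["horse", "cow", "rhino", "elephant", "mouse", "squirrel"]

# Hypotheses h are candidate extensions of the new property (sets of species),
# each with a prior P(h). These sets and weights are illustrative assumptions.
hypotheses = [
    ({"horse", "cow", "rhino", "elephant"}, 0.4),                       # large herbivores
    ({"horse", "cow", "rhino", "elephant", "mouse", "squirrel"}, 0.3),  # all mammals
    ({"mouse", "squirrel"}, 0.2),                                       # rodents
    ({"horse"}, 0.1),                                                   # just horses
]

def predict(X, y):
    """P(y has the property | all species in X have it)."""
    consistent_X = sum(p for h, p in hypotheses if X <= h)
    consistent_Xy = sum(p for h, p in hypotheses if X <= h and y in h)
    return consistent_Xy / consistent_X

X = {"horse", "cow", "rhino"}
for y in ["elephant", "squirrel"]:
    print(f"P({y} | {sorted(X)}) = {predict(X, y):.2f}")
# elephant: 1.00 (every consistent hypothesis includes it)
# squirrel: 0.43 (only the "all mammals" hypothesis includes it)
```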
F: form
S: structure
D: data
[Figure: tree with species at leaf nodes (mouse, squirrel, chimp, gorilla), generating a feature matrix F1, F2, F3, F4, … plus a new column “Has T9 hormones” with unknown (?) entries]
P(structure | form)
P(data | structure)
P(form)
Hierarchical Bayesian Framework
P(D|S): How the structure constrains the data of experience

• Define a stochastic process over structure S that generates candidate property extensions h.
  – Intuition: properties should vary smoothly over structure.
  – Smooth: P(h) high. Not smooth: P(h) low.
• Mechanism: a Gaussian process (~ random walk, diffusion) over S generates continuous values y, which are thresholded to give a binary property h [Zhu, Lafferty & Ghahramani 2003].
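A minimal sketch of this smoothness prior in Python, using one common Laplacian-based covariance in the spirit of the Zhu, Lafferty & Ghahramani construction cited above. The graph, the regularization value, and the zero threshold are illustrative assumptions.

```python
# Smoothness prior P(h|S): a Gaussian process over a graph S, thresholded
# to produce a binary property extension h.
import numpy as np

# Adjacency for a small chain-structured graph S: 0-1-2-3-4
A = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

L = np.diag(A.sum(axis=1)) - A               # graph Laplacian
sigma2 = 0.1                                 # regularization (assumed value)
K = np.linalg.inv(L + sigma2 * np.eye(5))    # GP covariance: smooth y is likely

rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(5), K)  # latent continuous property values
h = (y > 0).astype(int)                      # threshold -> binary property

print("latent y:  ", np.round(y, 2))
print("property h:", h)  # tends to label connected nodes alike (smooth over S)
```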
[Figure: structure S (a tree over Species 1–10) generates data D (a species × feature matrix); a new-property column has unknown entries]

85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’, …
Gorillas have property P.
Mice have property P.
Seals have property P.
----
All mammals have property P.

Cows have property P.
Elephants have property P.
----
Horses have property P.
[Figure: model vs. human judgments for property induction, comparing Tree and 2D representations]

Learning about spatial properties. Geographic inference task: “Given that a certain kind of Native American artifact has been found in sites near city X, how likely is the same artifact to be found near city Y?”

[Figure: model vs. human judgments for the geographic task, comparing Tree and 2D representations]
Hierarchical Bayesian Framework

F: form
S: structure
D: data

[Figure: alternative forms for the same four species (mouse, squirrel, chimp, gorilla): a tree, a chain, and a 2D space, each generating the feature data F1, F2, F3, F4]
Discovering structural forms
[Figure: candidate structures over Ostrich, Robin, Crocodile, Snake, Bat, Orangutan, Turtle and their feature matrix: a tree, and a “great chain” ordering that also includes Plant, Rock, Angel, God]
Discovering structural forms
People can discover structural forms

• Scientific discoveries
  – “Great chain of being” (1579)
  – Tree structure for biological species (1837; Linnaeus, Systema Naturae, 1735: Kingdom Animalia, Phylum Chordata, Class Mammalia, Order Primates, Family Hominidae, Genus Homo, Species Homo sapiens)
  – Periodic structure for chemical elements
• Children’s cognitive development
  – Hierarchical structure of category labels
  – Clique structure of social groups
  – Cyclical structure of seasons or days of the week
  – Transitive structure for value
Typical structure learning algorithms assume a fixed structural form:

  Flat clusters: K-means, mixture models, competitive learning
  Line: Guttman scaling, ideal point models
  Tree: hierarchical clustering, Bayesian phylogenetics
  Circle: circumplex models
  Euclidean space: MDS, PCA, factor analysis
  Grid: self-organizing map, generative topographic mapping
The ultimate goal: a “Universal Structure Learner” that subsumes K-means, hierarchical clustering, factor analysis, Guttman scaling, circumplex models, self-organizing maps, … taking data in and giving the appropriate representation out.
F: form
S: structure (P(S|F) favors simplicity)
D: data (P(D|S) favors smoothness [Zhu et al., 2003])

[Figure: candidate forms for mouse, squirrel, chimp, gorilla: tree, linear chain, and flat clusters, each generating the feature data F1, F2, F3, F4]
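A toy sketch of how two candidate structures can be scored against each other on the same data, using the Laplacian-based Gaussian likelihood for P(D|S) from the earlier sketch. The chain-vs-flat comparison, the synthetic data, and all parameters are illustrative assumptions, not the authors' implementation.

```python
# Form-level model selection in miniature: compare log P(D|S) for a chain
# structure vs. an undifferentiated "flat" graph on chain-smooth data.
import numpy as np

rng = np.random.default_rng(0)
n = 6

def log_p_data_given_S(A, D, sigma2=0.5):
    """log P(D|S): each feature column drawn from N(0, (L + sigma2*I)^-1)."""
    precision = np.diag(A.sum(axis=1)) - A + sigma2 * np.eye(len(A))
    _, logdet_prec = np.linalg.slogdet(precision)
    return sum(0.5 * (logdet_prec - f @ precision @ f - len(A) * np.log(2 * np.pi))
               for f in D.T)

# chain structure 0-1-2-3-4-5
chain = np.zeros((n, n))
for i in range(n - 1):
    chain[i, i + 1] = chain[i + 1, i] = 1

# "flat" structure: every object connected to every other (no differentiation)
flat = np.ones((n, n)) - np.eye(n)

# synthetic data that really does vary smoothly along the chain
D = np.column_stack([np.arange(n) + 0.3 * rng.standard_normal(n)
                     for _ in range(20)])
D = D - D.mean(axis=0)

print("log P(D | chain):", round(log_p_data_given_S(chain, D), 1))
print("log P(D | flat): ", round(log_p_data_given_S(flat, D), 1))
# With a uniform prior over forms, the chain wins: the data are smooth on it.
```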
Summary so far
Bayesian inference over hierarchies of structured representations provides a framework to understand core questions of human cognition:
– What is the content and form of human knowledge, at multiple levels of abstraction?
– How does abstract domain knowledge guide learning of new concepts?
– How is abstract domain knowledge learned? What must be built in?
[Figure: form F, structure S, data D hierarchy, illustrated with a tree over mouse, squirrel, chimp, gorilla and features F1, F2, F3, F4]
Other questions
• How can we learn domain structures if we do not already know in advance which features are relevant?
• How can we discover richer models of a domain, with multiple ways of structuring objects?
• How can we learn models for more complex domains, with not just a single object-property matrix but multiple different types of objects, their properties and relations to each other?
• How do these ideas & tools apply to other aspects of cognition, beyond categorizing and predicting the properties of objects?
Conventional clustering (CRP mixture): a single way of structuring a domain rarely describes all its features…

CrossCat: learning multiple structures (System 1, System 2, System 3) to explain different feature subsets (Shafto et al.; Shafto, Mansinghka, Tenenbaum, Yamada & Ueda, 2007)
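The CRP (Chinese restaurant process) mentioned above is the nonparametric prior that lets the number of clusters grow with the data. A minimal sketch, with an assumed concentration parameter:

```python
# The Chinese restaurant process prior underlying CRP mixtures (and, per
# feature system, CrossCat): customer i joins an existing table with
# probability proportional to its size, or a new table with weight alpha.
import random

def crp(n_customers, alpha=1.0, seed=0):
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        weights = tables + [alpha]          # existing tables + one new table
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(0)                # open a new table
        tables[k] += 1
        assignments.append(k)
    return assignments

print(crp(20))  # a handful of clusters; more data allows more clusters
```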
Discovering structure in relational data
[Figure: input and output for a social relation TalksTo(person, person): the raw person × person adjacency matrix, and the same matrix with rows and columns sorted by the learned cluster assignments z, revealing block structure]
Infinite Relational Model (IRM) (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
[Figure: learned block structure: entities sorted into three clusters, with a 3 × 3 matrix of link probabilities (≈0.9 for some cluster pairs, ≈0.1 elsewhere); the same analysis applies to concept × concept predicate data]
Biomedical predicate data from UMLS (McCray et al.):
– 134 concepts: enzyme, hormone, organ, disease, cell function, …
– 49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease), …
Infinite Relational Model (IRM) (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)

Learning a medical ontology, e.g.:
  Diseases affect Organisms
  Chemicals interact with Chemicals
  Chemicals cause Diseases
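A minimal generative sketch of the IRM's core assumptions: entities are clustered via a CRP, and each pair of clusters has its own link probability. The hyperparameters and data size are illustrative, and this shows only the forward (generative) direction, not the inference algorithm of the paper.

```python
# Generative sketch of the Infinite Relational Model for one binary relation.
import numpy as np

rng = np.random.default_rng(0)

def crp_assignments(n, alpha=1.0):
    z = []
    for _ in range(n):
        counts = np.bincount(z) if z else np.array([])
        weights = np.append(counts, alpha).astype(float)
        z.append(rng.choice(len(weights), p=weights / weights.sum()))
    return np.array(z)

n = 15
z = crp_assignments(n)                     # cluster assignment per entity
k = z.max() + 1
eta = rng.beta(0.5, 0.5, size=(k, k))      # link probability per cluster pair
R = rng.random((n, n)) < eta[z[:, None], z[None, :]]  # e.g. TalksTo(x, y)

print("clusters:", z)
print("relation matrix:\n", R.astype(int))
# Sorting rows/columns of R by z would reveal the block structure in the figure.
```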
International relations circa 1965 (Rummel):
– 14 countries: UK, USA, USSR, China, …
– 54 binary relations representing interactions between countries: exports to(USA, UK), protests(USA, USSR), …
– 90 (dynamic) country features: purges, protests, unemployment, communists, # languages, assassinations, …
Infinite Relational Model (IRM) (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
Abstract causal theories (→ : possible causal link)

Classes = {R, D, S}; Laws = {R → D, D → S}
Classes = {R, D, S}; Laws = {S → D}
Classes = {C}; Laws = {C → C}
Data: has(patient, condition), a patients × conditions matrix

Abstract theory: Classes = {R, D, S}; Laws = {R → D, D → S}
  R: working in factory, smoking, stress, high fat diet, …
  D: flu, bronchitis, lung cancer, heart disease, …
  S: headache, fever, coughing, chest pain, …

Levels: Abstract theory → Bayesian network → Observed events
Learning causal theories
[Figure: hierarchical model for theory learning, with the IRM at the top level: an abstract theory (classes z over variables labeled ‘B’, ‘D’, ‘S’ and a matrix of law probabilities between classes) generates a causal graphical model (a Bayesian network over variables 1–12), which generates the observed events has(patient, condition)]
[Figure: simulation from Mansinghka, Kemp, Tenenbaum & Griffiths (UAI 06). Given data from a known Bayesian network N, the model recovers both the network edges (N) and the abstract classes (Z, here c1 and c2) as the number of samples grows (20, 80, 1000); the abstract class structure is recovered from less data than the full network]
“blessing of abstraction”
The flexibility of a nonparametric prior
[Figure: when the true network has no class structure (all variables behave as a single class c1), the nonparametric prior adapts: with 40, 100, and 1000 samples the model infers one class and still recovers the network edges]
Grammar:
  S → NP VP
  NP → Det [Adj] Noun [RelClause]
  RelClause → [Rel] NP V
  VP → VP NP
  VP → Verb

Levels: Grammar → Phrase structure → Utterance → Speech signal
“Universal Grammar” Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
P(phrase structure | grammar)
P(utterance | phrase structure)
P(speech | utterance)
(c.f. Chater and Manning, 2006)
P(grammar | UG)
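To illustrate the generative direction of this hierarchy, here is a minimal sketch that samples utterances from a probabilistic version of a subset of the toy grammar above. The rule probabilities and the lexicon are invented; the RelClause rule is omitted for brevity.

```python
# Sampling from a toy PCFG: each nonterminal expands according to a
# probability distribution over its rules (probabilities are assumed values).
import random

rng = random.Random(0)

# nonterminal -> list of (expansion, probability)
grammar = {
    "S":    [(["NP", "VP"], 1.0)],
    "NP":   [(["Det", "Noun"], 0.7), (["Det", "Adj", "Noun"], 0.3)],
    "VP":   [(["Verb"], 0.8), (["VP", "NP"], 0.2)],
    "Det":  [(["the"], 1.0)],
    "Adj":  [(["big"], 0.5), (["small"], 0.5)],
    "Noun": [(["dog"], 0.5), (["cat"], 0.5)],
    "Verb": [(["sees"], 0.5), (["chases"], 0.5)],
}

def generate(symbol):
    if symbol not in grammar:          # terminal: emit the word itself
        return [symbol]
    expansions, weights = zip(*grammar[symbol])
    expansion = rng.choices(expansions, weights=weights)[0]
    return [w for s in expansion for w in generate(s)]

for _ in range(3):
    print(" ".join(generate("S")))     # e.g. "the dog sees the small cat"
```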
Goal inference as inverse probabilistic planning
(Baker, Tenenbaum & Saxe)
Constraints + Goals → Rational planning ((PO)MDP) → Actions

[Figure: model predictions vs. human judgments]
Conclusions
• The big questions: How does the mind build rich models of the world from sparse data? What is the form and function of abstract knowledge, and how can abstractions be learned?
  – These questions are central in vision, language, categorization, causal reasoning, planning, social understanding… perhaps all of cognition?
• Some powerful tools for making progress on these questions:
  – Bayesian inference in probabilistic generative models
  – Hierarchical models, with inference at all levels of abstraction
  – Structured representations: graphs, grammars, logic
  – Flexible representations, growing in response to observed data
• New ways to think about development of cognitive systems.
  – Domain-specific representations can be learned by domain-general mechanisms.
  – Structured symbolic knowledge can support and even be acquired via statistical learning.
  – Powerful abstractions can be learned “from the top down”, together with or prior to learning more concrete knowledge.
Summary
Modeling human inductive learning as Bayesian inference over hierarchies of flexibly structured representations.

[Figure: abstract knowledge → structure → data, illustrated by a tree over mouse, squirrel, chimp, gorilla with features F1, F2, F3, F4, and by a causal theory with classes of variables B, D, S and causal laws B → D, D → S]
[Figure: novel objects labeled “dax”, “zav”, “fep”: several examples of each word]
Shape varies across categories but not within categories.
Texture, color, size vary within categories.
Word learning · Property induction · Causal learning
Conclusions
• Learning algorithms for discovering domain structure, given feature or relational data.
• Broader themes:
  – Combining structured representations with statistical inference yields powerful knowledge discovery tools.
  – Hierarchical Bayesian modeling allows us to learn domain structure at multiple levels of abstraction.
  – Nonparametric Bayesian formulations allow the complexity of representations to be determined automatically and on the fly, growing as the data require.
Beyond similarity-based induction

• Reasoning based on dimensional thresholds (Smith et al., 1993):
    Poodles can bite through wire.
    ----
    German shepherds can bite through wire.

    Dobermans can bite through wire.
    ----
    German shepherds can bite through wire.

• Reasoning based on causal relations (Medin et al., 2004; Coley & Shafto, 2003):
    Salmon carry E. Spirus bacteria.
    ----
    Grizzly bears carry E. Spirus bacteria.

    Grizzly bears carry E. Spirus bacteria.
    ----
    Salmon carry E. Spirus bacteria.
Different sources for priors

Chimps have T9 hormones. → Gorillas have T9 hormones.  (taxonomic similarity)
Poodles can bite through wire. → Dobermans can bite through wire.  (jaw strength)
Salmon carry E. Spirus bacteria. → Grizzly bears carry E. Spirus bacteria.  (food web relations)

Property type:     “has T9 hormones”     “can bite through wire”   “carry E. Spirus bacteria”
Theory structure:  taxonomic tree        directed chain            directed network
                   + diffusion process   + drift process           + noisy transmission
[Figure: hypothesis spaces generated by each theory over classes A–G: a taxonomic tree with diffusion, a linear ordering with drift, and a directed network with noisy transmission]
Reasoning with two property types
[Figure: model vs. human judgments for a biological property and a disease property, under Tree and Web structures, over the food web: kelp, herring, tuna, mako shark, sand shark, dolphin, human]
(Shafto, Kemp, Bonawitz, Coley & Tenenbaum)
“Given that X has property P, how likely is it that Y does?”
Summary so far
• A framework for modeling human inductive reasoning as rational statistical inference over structured knowledge representations:
  – Qualitatively different priors are appropriate for different domains of property induction.
  – In each domain, a prior that matches the world’s structure fits people’s judgments well, and better than alternative priors.
  – A language for representing different theories: graph structure defined over objects + probabilistic model for the distribution of properties over that graph.
• Remaining question: How can we learn appropriate theories for different domains?
Learning word meanings

Levels: Principles → Structure → Data
Principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias

[Figure: three novel objects, each labeled “tufa”]

Word learning as Bayesian inference over a tree-structured hypothesis space (Xu & Tenenbaum; Schmidt & Tenenbaum)
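A minimal sketch of the key computation in Xu & Tenenbaum's model: the "size principle" likelihood P(X|h) = (1/|h|)^n for n examples sampled from extension h, which increasingly favors the smallest consistent hypothesis as examples accumulate. The nested hypothesis space and prior below are made-up stand-ins for the tree-structured space in the slides.

```python
# Bayesian word learning with the size principle over nested hypotheses.

hypotheses = {          # |h|: number of objects in each candidate extension
    "subordinate":   8,     # e.g. just tufas (assumed sizes)
    "basic":         40,    # e.g. mushroom-like objects
    "superordinate": 200,   # e.g. all plants
}
prior = {"subordinate": 0.2, "basic": 0.5, "superordinate": 0.3}  # assumed

def posterior(n_examples):
    # every hypothesis is consistent with the examples in this toy setup,
    # so only the size-principle likelihood (1/|h|)^n distinguishes them
    scores = {h: prior[h] * (1.0 / size) ** n_examples
              for h, size in hypotheses.items()}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

for n in [1, 3]:
    print(n, "example(s):", {h: round(p, 3) for h, p in posterior(n).items()})
# With 3 examples of "tufa", the smallest consistent extension dominates:
# a "suspicious coincidence" if the word really meant the larger class.
```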
Causal learning with prior knowledge (Griffiths, Sobel, Tenenbaum & Gopnik)

“Backwards blocking” paradigm: Initial → AB trial → A trial
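A minimal sketch of the backwards-blocking computation: a prior over which blocks are causes ("blickets"), updated by each trial. The base rate and the deterministic-detector likelihood are simplifying assumptions for illustration.

```python
# Bayesian backwards blocking: after seeing A and B activate the machine
# together, then A alone activate it, belief that B is a cause falls back
# toward its prior.
from itertools import product

prior_blicket = 1 / 3  # assumed base rate that any block is a blicket

# Hypotheses h = (A is a cause, B is a cause), with independent priors
hypotheses = {}
for a, b in product([0, 1], repeat=2):
    hypotheses[(a, b)] = ((prior_blicket if a else 1 - prior_blicket) *
                          (prior_blicket if b else 1 - prior_blicket))

def likelihood(h, trial):
    """Deterministic detector: it activates iff some cause is on it."""
    blocks_on, activated = trial
    predicted = any(h[i] for i in blocks_on)
    return 1.0 if predicted == activated else 0.0

def update(post, trial):
    post = {h: p * likelihood(h, trial) for h, p in post.items()}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

post = dict(hypotheses)
post = update(post, ((0, 1), True))   # AB trial: machine activates
print("P(B is a blicket | AB):  ", sum(p for h, p in post.items() if h[1]))
post = update(post, ((0,), True))     # A trial: A alone activates
print("P(B is a blicket | AB,A):", sum(p for h, p in post.items() if h[1]))
# 0.60 after the AB trial, back down to ~0.33 after the A trial.
```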
Learning grounded causal models (Goodman, Mansinghka & Tenenbaum)

A child learns that petting the cat leads to purring, while pounding leads to growling. But how can the child learn the symbolic event concepts over which such causal links are defined?

[Figure: candidate causal graphs over event variables a, b, c]
The big picture
• What we need to understand: the mind’s ability to build rich models of the world from sparse data.
  – Learning about objects, categories, and their properties
  – Causal inference
  – Understanding other people’s actions, plans, thoughts, goals
  – Language comprehension and production
  – Scene understanding
• What do we need to understand these abilities?
  – Bayesian inference in probabilistic generative models
  – Hierarchical models, with inference at all levels of abstraction
  – Structured representations: graphs, grammars, logic
  – Flexible representations, growing in response to observed data
Overhypotheses

• Syntax: Universal Grammar (Chomsky)
• Phonology: faithfulness constraints, markedness constraints (Prince, Smolensky)
• Word learning: shape bias, principle of contrast, whole object bias (Markman)
• Folk physics: objects are unified, bounded and persistent bodies (Spelke)
• Predicability: M-constraint (Keil)
• Folk biology: taxonomic principle (Atran)
…
Model fitting
• Evaluate each form in parallel.
• For each form, heuristic search over structures based on greedy growth from a one-node seed.

Model selection results on synthetic 2D data (continuous features drawn from a Gaussian field over the points):

[Figure: log posterior probabilities for each candidate form (Flat, Line, Ring, Tree, Grid) on each synthetic dataset]
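A toy sketch of the greedy-growth idea for one form (a chain), scored with the same graph-Laplacian smoothness likelihood used earlier. The data, parameters, and search details are illustrative assumptions, not the authors' search procedure.

```python
# Greedy structure growth for a chain: start from a one-object seed and
# repeatedly attach, at either end, the remaining object that yields the
# best-scoring partial structure.
import numpy as np

rng = np.random.default_rng(1)

def score(seq, D, sigma2=0.5):
    """log P(D[seq] | chain over seq) under the graph-Laplacian smoothness prior."""
    k = len(seq)
    A = np.zeros((k, k))
    for i in range(k - 1):
        A[i, i + 1] = A[i + 1, i] = 1
    precision = np.diag(A.sum(axis=1)) - A + sigma2 * np.eye(k)
    _, logdet_prec = np.linalg.slogdet(precision)
    X = D[list(seq)]
    return sum(0.5 * (logdet_prec - f @ precision @ f - k * np.log(2 * np.pi))
               for f in X.T)

# toy data: 6 objects whose 10 features vary smoothly along a hidden linear order
pos = rng.permutation(6)  # hidden position of each object on a line
D = np.column_stack([pos + 0.3 * rng.standard_normal(6) for _ in range(10)])
D = D - D.mean(axis=0)

seq, remaining = [0], set(range(1, 6))
while remaining:
    candidates = ([([obj] + seq, obj) for obj in remaining] +
                  [(seq + [obj], obj) for obj in remaining])
    best_seq, best_obj = max(candidates, key=lambda c: score(c[0], D))
    seq = best_seq
    remaining.remove(best_obj)

print("hidden order:   ", list(np.argsort(pos)))
print("recovered chain:", seq, "(possibly reversed)")
```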
Clustering models for relational data
• Social networks: block models
Does person x respect person y?
Does prisoner x like prisoner y?
Conclusions
• Computational tools for studying core questions of human learning (and building more human-like ML?):
  – What is the content and form of human knowledge, at multiple levels of abstraction?
  – How does abstract domain knowledge guide new learning?
  – How can abstract domain knowledge itself be learned?
  – How can inductive biases be so strong yet so flexible?
• Go beyond the traditional dichotomies of cog sci (and AI).
  – Instead of “nature vs. nurture”: powerful abstractions can be learned “from the top down”, together with or prior to learning more concrete knowledge.
  – Instead of “domain-general” vs. “domain-specific”: can domain-general learning mechanisms acquire domain-specific knowledge representations?
  – Instead of “statistics” vs. “structure”: how can structured symbolic representations be acquired by statistical learning?