Categorization and density estimation
Tom Griffiths, UC Berkeley
Categorization

[Figure: example images labeled "dog" and "cat", plus a new image marked "?"]
Categorization
cat = small furry domestic carnivore

[Figure: example images of cats]
Borderline cases

• Is a tomato a vegetable?
  – around 50% say yes (Hampton, 1979)
• Is an olive a fruit?
  – around 22% change their mind (McCloskey & Glucksberg, 1978)
Typicality

[Figures: images of category members varying in typicality, with examples labeled "Typical" and "Atypical"]
Typicality and generalization

Penguins can catch disease X → All birds can catch disease X
Robins can catch disease X → All birds can catch disease X

Arguments from a typical premise category (robins) are judged stronger than arguments from an atypical one (penguins). (Rips, 1975)
How can we explain typicality?

• One answer: reject definitions, and adopt a new representation for categories
• Prototype theory:
  – categories are represented by a prototype
  – other members share a family-resemblance relation to the prototype
  – typicality is a function of similarity to the prototype
Prototypes

[Figure: dot-pattern stimuli generated by distorting a prototype (Posner & Keele, 1968)]
Posner and Keele (1968)
• Prototype effect in categorization accuracy
• Constructed categories by perturbing prototypical dot arrays
• Ordering of categorization accuracy at test:
  – old exemplars
  – prototypes
  – new exemplars
Formalizing prototype theories
Representation:
Each category (e.g., A, B) has a corresponding prototype (π_A, π_B)

Categorization (for a new stimulus x):
Choose the category that minimizes the distance (equivalently, maximizes the similarity) from x to its prototype (e.g., Reed, 1972)
Formalizing prototype theories
The prototype is the most frequent or "typical" member.

Spaces:
  Prototype: e.g., the average of the members of the category
  Distance: e.g., Euclidean distance
    $d(x, \pi_A) = \left[ \sum_k (x_k - \mu_{A,k})^2 \right]^{1/2}$

(Binary) Features:
  Prototype: e.g., the binary vector with the most frequent feature values
  Distance: e.g., Hamming (city-block) distance
    $d(x, \pi_A) = \sum_k | x_k - \mu_{A,k} |$
Formalizing prototype theories
[Figure: exemplars of Category A and Category B with their prototypes (the category means); the decision boundary lies at equal distance from the two prototypes, and is always a straight line for two categories]
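To make the rule concrete, here is a minimal sketch of prototype classification in a continuous space, assuming Euclidean distance and equal priors; the function and variable names are ours, not from the slides:

```python
import numpy as np

def prototype_classify(x, prototypes):
    """Assign x to the category whose prototype (mean) is closest
    in Euclidean distance."""
    distances = {c: np.linalg.norm(x - mu) for c, mu in prototypes.items()}
    return min(distances, key=distances.get)

# Two categories summarized by their means; the decision boundary is
# the set of points equidistant from the two prototypes.
prototypes = {"A": np.array([0.0, 0.0]), "B": np.array([3.0, 1.0])}
print(prototype_classify(np.array([1.0, 0.2]), prototypes))  # -> "A"
```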
Predicting prototype effects
• Prototype effects are built into the model:
  – assume categorization becomes easier as proximity to the prototype increases…
  – …or as distance from the boundary increases
• But what about the old-exemplar advantage? (Posner & Keele, 1968)
• Prototype models are not the only way to get prototype effects…
Exemplar theories
Store every member (“exemplar”) of the category
Formalizing exemplar theories
Representation:
A set of stored exemplars y1, y2, …, yn, each with its own category label
Categorization (for a new stimulus x):
Choose category A with probability given by the "Luce-Shepard choice rule"

$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$

where η_xy is the similarity of x to y, and β_A is the bias towards A.
The context model (Medin & Schaffer, 1978)

Defined for stimuli with binary features (color, form, size, number):

1111 = (red, triangle, big, one); 0000 = (green, circle, small, two)

Define similarity as the product of the similarities on each dimension:

$\eta_{xy} = \prod_k \eta_{x y_k}$, where $\eta_{x y_k} = 1$ if $x_k = y_k$, and $s_k$ otherwise.
The generalized context model (Nosofsky, 1986)

Defined for stimuli in a psychological space, where

$\eta_{xy} = \exp\{-c \, d(x, y)^p\}$

c is the "specificity"; p = 1 gives an exponential similarity gradient, p = 2 a Gaussian one. Choice again follows the Luce-Shepard rule:

$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$
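A minimal sketch of GCM choice probabilities, assuming Euclidean distance, a Gaussian gradient (p = 2), and two categories; the parameter and function names are ours:

```python
import numpy as np

def gcm_prob_A(x, exemplars_A, exemplars_B, c=1.0, p=2, beta_A=0.5):
    """P(A|x) under the generalized context model: summed
    exponential-decay similarity to each category's exemplars,
    combined by the Luce-Shepard choice rule."""
    def summed_similarity(exemplars):
        d = np.linalg.norm(np.asarray(exemplars) - x, axis=1)  # distances
        return np.sum(np.exp(-c * d**p))                       # summed eta_xy
    s_A = beta_A * summed_similarity(exemplars_A)
    s_B = (1 - beta_A) * summed_similarity(exemplars_B)
    return s_A / (s_A + s_B)

A = [[0.0, 0.0], [0.5, 0.2]]
B = [[3.0, 1.0], [2.5, 0.8]]
print(gcm_prob_A(np.array([0.4, 0.1]), A, B))  # close to 1: x is near A
```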
The generalized context model
[Figure: exemplars of Category A and Category B; the decision boundary is determined by the exemplars, with regions where the model chooses A 90%, 50%, and 10% of the time]
Prototypes vs. exemplars

• Exemplar models produce prototype effects
  – if the prototype minimizes the distance to all exemplars in a category, then it has high probability
• They also predict the old-exemplar advantage
  – being close (or identical) to an old exemplar of the category gives high probability
• They predict new effects that prototype models cannot produce…
  – stimuli close to an old exemplar should have high probability, even far from the prototype
Prototypes vs. exemplars

Exemplar models can capture complex category boundaries.

[Figures: categories with nonlinear decision boundaries that exemplar models capture but prototype models cannot]
Some questions…
• Both prototype and exemplar models seem reasonable… but are they "rational"?
  – are they solutions to the computational problem?
• Should we use prototypes, or exemplars?
• How can we define other models that handle more complex categorization problems?
A computational problem

• Categorization is a classic inductive problem
  – data: stimulus x
  – hypotheses: category c
• We can apply Bayes' rule:

$P(c \mid x) = \frac{p(x \mid c) \, P(c)}{\sum_c p(x \mid c) \, P(c)}$

and choose c such that P(c|x) is maximized.
Density estimation
• We need to estimate some probability distributions
  – what is P(c)?
  – what is p(x|c)?
• Two approaches:
  – parametric
  – nonparametric

[Figure: graphical model with c → x, annotated with P(c) and p(x|c)]
Parametric density estimation

• Assume that p(x|c) has a simple form, characterized by parameters θ
• Given stimuli X = x₁, x₂, …, x_n from category c, find θ by maximum-likelihood estimation

$\hat{\theta} = \arg\max_\theta \; p(X \mid c, \theta)$

or by some form of Bayesian estimation, e.g. the MAP estimate

$\hat{\theta} = \arg\max_\theta \; \left[ \log p(X \mid c, \theta) + \log p(\theta) \right]$
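For a one-dimensional Gaussian the maximum-likelihood estimates have a closed form, so the argmax reduces to the sample mean and variance; a minimal sketch with toy data and our variable names:

```python
import numpy as np

# ML estimation for a one-dimensional Gaussian has a closed form:
# the sample mean and (biased, divide-by-n) sample variance maximize
# the likelihood, so no numerical optimization is needed here.
X = np.array([1.2, 0.8, 1.5, 1.1, 0.9])  # toy stimuli from one category
mu_hat = X.mean()
var_hat = X.var()
print(mu_hat, var_hat)
```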
Binary features

• x = (x₁, x₂, …, x_m) ∈ {0,1}^m
• Rather than estimating a distribution over all 2^m possibilities, assume feature independence
• Called "naive Bayes", because independence is a naive assumption!

[Figure: graphical model with c → x₁, x₂, …, x_m]

$P(x \mid c) = \prod_{k=1}^{m} P(x_k \mid c)$
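A minimal naive-Bayes sketch for binary features, assuming the per-feature probabilities θ are already known; the dictionary-based API and toy numbers are ours:

```python
import numpy as np

def naive_bayes_posterior(x, theta, prior):
    """P(c|x) for binary features under the independence ('naive')
    assumption, where theta[c][k] = P(x_k = 1 | c)."""
    scores = {}
    for c in prior:
        p = np.prod([theta[c][k] if x[k] == 1 else 1 - theta[c][k]
                     for k in range(len(x))])
        scores[c] = p * prior[c]
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

theta = {"cat": [0.9, 0.8, 0.1], "dog": [0.2, 0.7, 0.9]}
prior = {"cat": 0.5, "dog": 0.5}
print(naive_bayes_posterior([1, 1, 0], theta, prior))  # strongly favors "cat"
```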
Spatial representations
• Assume a simple parametric form for p(x|c): a Gaussian
• For each category, estimate the parameters
  – mean μ
  – variance σ²

[Figure: graphical model with c → x, where p(x|c) is Gaussian]
Bayesian inference
$P(c \mid x) = \frac{P(x \mid c) \, P(c)}{\sum_c P(x \mid c) \, P(c)}$

[Figure: two Gaussian category densities over x and the posterior they imply]
Nonparametric density estimation
• Rather than estimating a probability density from a parametric family, use a scheme that can give you any possible density
• "Nonparametric" can mean:
  – not making distributional assumptions
  – covering a broad class of functions
  – a model with infinitely many parameters
Kernel density estimation
Approximate a probability distribution as a sum of many "kernels" (one per data point). Given X = {x₁, x₂, …, x_n} independently sampled from an unknown distribution,

$p(x) \approx \frac{1}{n} \sum_{i=1}^{n} k(x, x_i)$

for a kernel function k(x, y) such that $\int k(x, y) \, dx = 1$, e.g. the Gaussian kernel

$k(x, y) = \frac{1}{\sqrt{2\pi} \, h} \exp\{ -(x - y)^2 / 2h^2 \}$

where h is the "window size" (bandwidth).
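A minimal sketch of a Gaussian kernel density estimate, vectorized over query points; the function name and toy data are ours:

```python
import numpy as np

def kde(x, data, h=0.25):
    """Gaussian kernel density estimate at point(s) x: the average of
    one Gaussian bump per observation, with window size h."""
    x = np.atleast_1d(x)[:, None]          # shape (query points, 1)
    data = np.asarray(data)[None, :]       # shape (1, n)
    kernels = np.exp(-(x - data)**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)
    return kernels.mean(axis=1)            # average over the n kernels

data = [0.2, 0.5, 0.9, 1.1, 1.4]
print(kde([0.5, 2.0], data))  # high density near the data, low far away
```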
Kernel density estimation

[Figures: the estimated function, individual kernels, and true function for h = 0.25 as the sample grows: n = 1, 2, 5, 10, 100. With more data, the estimate approaches the true density.]
Bayesian inference

$P(c \mid x) = \frac{P(x \mid c) \, P(c)}{\sum_c P(x \mid c) \, P(c)}$

[Figures: posteriors computed from kernel density estimates with h = 0.5, h = 0.1, and h = 2.0; the window size controls how smooth the implied category densities and decision boundary are]
Advantages and disadvantages
• Which method should we choose?
• The methods are complementary:
  – parametric estimation requires less data, but is severely constrained in the distributions it can represent
  – nonparametric estimation requires a lot of data, but can model (nearly) any distribution
• This is an instance of the bias-variance tradeoff.
Categorization as density estimation
• Prototype and exemplar models can be interpreted as rational Bayesian inference
• They correspond to different forms of density estimation:
  – prototype: parametric density estimation
  – exemplar: nonparametric density estimation
• This suggests other categorization models…
Prototype models
Prototype model: choose the category with the closest prototype.

Bayesian categorization: choose the category with the highest P(c|x).

These are equivalent if P(x|c) decreases as a function of the distance of x from the prototype of c, and the prior probability is equal for all categories.
Exemplar models
Exemplar model: choose category A with probability

$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$

This is the posterior probability of category A when P(x|A) is approximated with a kernel density estimator.
Bayesian exemplars
• β_A is the prior probability of category A (divided by the number of exemplars in A)
• The summed similarity is proportional to p(x|A) under a kernel density estimator:

$P(A \mid x) = \frac{p(x \mid A) \, P(A)}{p(x \mid A) \, P(A) + p(x \mid B) \, P(B)}$

$p(x \mid A) \approx \frac{1}{n_A} \sum_{y \in A} k(x, y) \propto \sum_{y \in A} \eta_{xy}$

for an appropriate kernel and similarity function.
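A quick numerical check of this correspondence, assuming an exponential kernel (p = 1), equal biases and priors, and equally sized categories, so the 1/n_A factor cancels; the toy exemplar values are ours:

```python
import numpy as np

# With an exponential kernel (p = 1), equal biases, equal priors, and
# n_A = n_B, the GCM choice probability equals the Bayesian posterior
# computed from kernel density estimates.
A = np.array([0.0, 0.5])                 # exemplars of A (1-D toy values)
B = np.array([3.0, 2.5])                 # exemplars of B
x, c = 1.0, 1.0
summed_sim = lambda ex: np.sum(np.exp(-c * np.abs(ex - x)))
p_gcm = summed_sim(A) / (summed_sim(A) + summed_sim(B))
kde = lambda ex: np.mean((c / 2) * np.exp(-c * np.abs(ex - x)))  # density
p_bayes = kde(A) / (kde(A) + kde(B))
print(np.isclose(p_gcm, p_bayes))        # True: the 1/n and c/2 cancel
```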
The generalized context model
[Figure: the exemplar similarity gradient over x, i.e. the summed-similarity function implied by the stored exemplars]
Implications
• Prototype and exemplar models can be interpreted as rational Bayesian inference
• They have different strengths and limitations
  – exemplar models are rational categorization models for any category density (given large n)
• This suggests other categorization models…
  – alternatives between prototypes and exemplars
Prototypes vs. exemplars

• Prototype model
  – parametric density estimation
  – P(x|A) specified by one Gaussian
• Exemplar model
  – nonparametric density estimation
  – P(x|A) is a sum of n_A Gaussians
• Compromise
  – "semiparametric" density estimation
  – a sum of more than one but fewer than n_A Gaussians
  – a "mixture of Gaussians" (Rosseel, 2003)
The Rational Model of Categorization (RMC; Anderson, 1990, 1991)

• Computational problem: predicting a feature based on observed data
  – assume that category labels are just features
• Predictions are made on the assumption that objects form clusters with similar properties
  – each object belongs to a single cluster
  – feature values are likely to be the same within clusters
  – the number of clusters is unbounded
Representation in the RMC

A flexible representation can interpolate between prototype and exemplar models.

[Figures: probability as a function of feature value when all objects share one cluster (a prototype model) versus one cluster per object (an exemplar model)]
The "optimal solution"

The probability of the missing feature (i.e., the category label) taking a certain value is

$P(j \mid F_n) = \sum_{x_n} P(j \mid x_n, F_n) \, P(x_n \mid F_n)$

where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into clusters; the second factor is the posterior over partitions.
The posterior over partitions

$P(x_n \mid F_n) = \frac{P(F_n \mid x_n) \, P(x_n)}{\sum_{x_n} P(F_n \mid x_n) \, P(x_n)}$

(likelihood × prior in the numerator; a sum over all partitions in the denominator)
The prior over partitions

• An object is assumed to have a constant probability of joining the same cluster as another object, known as the coupling probability c
• This leaves some probability that a stimulus forms a new cluster, so the probability that the ith object is assigned to the kth cluster is

$P(\text{cluster } k) = \frac{c \, M_k}{(1 - c) + c \, (i - 1)}$ for an existing cluster with M_k members, and

$P(\text{new cluster}) = \frac{1 - c}{(1 - c) + c \, (i - 1)}$

(A sketch of this prior appears below.)
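A minimal sketch of this prior; the function name and example sizes are ours:

```python
def rmc_prior(cluster_sizes, c):
    """Prior probability that the next object joins each existing
    cluster or starts a new one, given coupling probability c
    (the RMC prior over partitions). cluster_sizes[k] is the number
    of objects already in cluster k."""
    n = sum(cluster_sizes)                   # objects assigned so far (i - 1)
    denom = (1 - c) + c * n
    probs = [c * m_k / denom for m_k in cluster_sizes]
    probs.append((1 - c) / denom)            # probability of a new cluster
    return probs

print(rmc_prior([3, 1], c=0.5))  # [0.6, 0.2, 0.2] -- sums to 1
```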
Nonparametric Bayes

• Bayes: treat density estimation as a problem of Bayesian inference, defining a prior over the number of components
• Nonparametric: use a prior that places no limit on the number of components
• Dirichlet process mixture models (DPMMs) use a prior of this kind…
The prior over components

• Each sample is assumed to come from a single (possibly previously unseen) mixture component
• The ith sample is drawn from the kth component with probability

$P(\text{component } k) = \frac{m_k}{i - 1 + \alpha}$ for an existing component with m_k samples, and

$P(\text{new component}) = \frac{\alpha}{i - 1 + \alpha}$

where α is a parameter of the model.
Equivalence

Neal (1998) showed that the priors for the RMC and the DPMM are the same, with α = (1 − c)/c:

RMC prior: $P(\text{cluster } k) = \frac{c \, M_k}{(1 - c) + c \, (i - 1)}$

DPMM prior: $P(\text{component } k) = \frac{m_k}{i - 1 + \alpha}$

(dividing the RMC expression through by c recovers the DPMM form)
The computational challenge

The probability of the missing feature (i.e., the category label) taking a certain value is

$P(j \mid F_n) = \sum_{x_n} P(j \mid x_n, F_n) \, P(x_n \mid F_n)$

where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into groups. The number of partitions |x_n| grows explosively with n (the Bell numbers):

n      1   2   3   4    5    6     7     8      9       10
|x_n|  1   2   5   15   52   203   877   4140   21147   115975
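The |x_n| row is the Bell numbers; a short sketch computing them with the Bell-triangle recurrence (function name is ours):

```python
def bell_numbers(n_max):
    """Number of partitions of n objects for n = 1..n_max, computed
    with the Bell-triangle recurrence."""
    row, bells = [1], [1]
    for _ in range(n_max - 1):
        new_row = [row[-1]]               # each row starts with the last
        for entry in row:                 # entry of the previous row
            new_row.append(new_row[-1] + entry)
        row = new_row
        bells.append(row[-1])             # Bell number = row's last entry
    return bells

print(bell_numbers(10))  # [1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```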
Anderson's approximation

• Data are observed sequentially
• Each object is deterministically assigned to the cluster with the highest posterior probability
• Call this the "Local MAP"
  – choosing the cluster with the maximum a posteriori probability (see the sketch below)

[Figure: stimuli such as 111, 011, 100 arrive one at a time; at each step the assignment with the higher posterior probability (e.g., 0.54 vs. 0.46, then 0.33 vs. 0.67) is chosen, yielding a single final partition]
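A minimal sketch of the Local MAP procedure, combining the coupling-probability prior with a smoothed Bernoulli likelihood per binary feature; the likelihood and Laplace smoothing are our assumptions, not Anderson's exact scheme:

```python
import numpy as np

def local_map(stimuli, c=0.5):
    """Anderson's 'local MAP' approximation: assign each stimulus, in
    order, to the single cluster with the highest posterior probability."""
    clusters = []  # each cluster is a list of binary feature vectors
    for x in stimuli:
        x = np.asarray(x)
        n = sum(len(cl) for cl in clusters)       # objects assigned so far
        denom = (1 - c) + c * n
        scores = []
        for cl in clusters:                       # existing clusters
            counts = np.sum(cl, axis=0)
            p_feat = (counts + 1) / (len(cl) + 2)           # Laplace smoothing
            lik = np.prod(np.where(x == 1, p_feat, 1 - p_feat))
            scores.append((c * len(cl) / denom) * lik)      # prior * likelihood
        scores.append(((1 - c) / denom) * 0.5 ** len(x))    # new cluster
        best = int(np.argmax(scores))
        if best == len(clusters):
            clusters.append([x])
        else:
            clusters[best].append(x)
    return clusters

print(len(local_map([[1, 1, 1], [0, 1, 1], [1, 0, 0]])))  # clusters found
```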
Two uses of Monte Carlo methods
1. For solving problems of probabilistic inference involved in developing computational models
2. As a source of hypotheses about how the mind might solve problems of probabilistic inference
Alternative approximation schemes
• There are several methods for approximating the posterior in DPMMs
  – Gibbs sampling
  – Particle filtering
• These methods provide asymptotic performance guarantees (in contrast to Anderson's procedure)

(Sanborn, Griffiths, & Navarro, 2006)
Gibbs sampling for the DPMM

• All the data are required at once (a batch procedure)
• Each stimulus is sequentially reassigned to a cluster based on the assignments of all of the remaining stimuli
• Assignments are made probabilistically, using the full conditional distribution (a sketch follows below)

[Figure: starting from an initial partition of the stimuli 111, 011, 100, each stimulus is reassigned in turn by sampling from its full conditional (e.g., probabilities 0.33 vs. 0.67, or 0.48, 0.12, 0.40 across three clusters); repeated sweeps yield successive samples from the posterior over partitions]
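A sketch of one Gibbs sweep over cluster assignments, sampling each stimulus's assignment from its full conditional; it reuses the same Bernoulli-likelihood assumption as the Local MAP sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(stimuli, assign, c=0.5):
    """One sweep of Gibbs sampling for the RMC/DPMM posterior over
    partitions: remove each stimulus in turn and reassign it, sampling
    from its full conditional given all other assignments.
    assign[i] is the cluster id of stimulus i."""
    X = np.asarray(stimuli)
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        labels = sorted({assign[j] for j in others})
        denom = (1 - c) + c * len(others)
        probs = []
        for lab in labels:                        # existing clusters
            members = X[[j for j in others if assign[j] == lab]]
            p_feat = (members.sum(axis=0) + 1) / (len(members) + 2)
            lik = np.prod(np.where(X[i] == 1, p_feat, 1 - p_feat))
            probs.append((c * len(members) / denom) * lik)
        probs.append(((1 - c) / denom) * 0.5 ** X.shape[1])  # new cluster
        probs = np.array(probs) / sum(probs)
        choice = rng.choice(len(probs), p=probs)
        assign[i] = labels[choice] if choice < len(labels) else max(assign) + 1
    return assign

print(gibbs_sweep([[1, 1, 1], [0, 1, 1], [1, 0, 0]], assign=[0, 0, 1]))
```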
Particle filter for the DPMM

• Data are observed sequentially
• The posterior distribution at each point is approximated by a set of "particles"
• Particles are updated, and a fixed number are carried over from trial to trial (a sketch follows below)

[Figure: each particle is a partition of the stimuli 111, 011, 100 seen so far; several weighted partitions (e.g., with probabilities 0.27, 0.23, 0.17, …) are propagated in parallel, and samples are drawn from the resulting set]
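A sketch of a simple particle filter for the same model: each particle carries a partition, assignments are sampled per particle, and particles are resampled in proportion to the marginal likelihood of the new stimulus; again the Bernoulli likelihood is our assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def assignment_probs(x, clusters, c):
    """Unnormalized prior-times-likelihood for adding x to each
    existing cluster or a new one (same assumptions as above)."""
    denom = (1 - c) + c * sum(len(cl) for cl in clusters)
    probs = []
    for cl in clusters:
        p_feat = (np.sum(cl, axis=0) + 1) / (len(cl) + 2)
        lik = np.prod(np.where(x == 1, p_feat, 1 - p_feat))
        probs.append((c * len(cl) / denom) * lik)
    probs.append(((1 - c) / denom) * 0.5 ** len(x))   # new cluster
    return np.array(probs)

def particle_filter(stimuli, n_particles=100, c=0.5):
    particles = [[] for _ in range(n_particles)]  # particle = list of clusters
    for x in map(np.asarray, stimuli):
        weights = np.empty(n_particles)
        for m, clusters in enumerate(particles):
            probs = assignment_probs(x, clusters, c)
            weights[m] = probs.sum()              # marginal likelihood of x
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(clusters):
                clusters.append([x])
            else:
                clusters[k].append(x)
        idx = rng.choice(n_particles, size=n_particles, p=weights / weights.sum())
        particles = [[list(cl) for cl in particles[i]] for i in idx]  # resample
    return particles

sizes = [len(p) for p in particle_filter([[1, 1, 1], [0, 1, 1], [1, 0, 0]])]
print(np.mean(sizes))  # average number of clusters across particles
```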
Approximating the posterior

• For a single order, the Local MAP will produce a single partition
• The Gibbs sampler and particle filter will approximate the exact DPMM distribution

[Figure: the exact posterior over partitions of the stimuli 111, 011, 100 compared with the output of the Local MAP (a single partition), the Gibbs sampler, and the particle filter]
Order effects in human data

• The probabilistic model underlying the DPMM does not produce any order effects
  – this follows from exchangeability
• But… human data show order effects (e.g., Medin & Bettger, 1994)
• Anderson and Matessa tested Local MAP predictions about order effects in an unsupervised clustering experiment (Anderson, 1990)
Anderson and Matessa's experiment

• Subjects were shown all sixteen stimuli that had four binary features
• Front-anchored orders emphasized the first two features in the first eight trials; end-anchored orders emphasized the last two

Front-Anchored Order: scadsporm, scadstirm, sneksporb, snekstirb, sneksporm, snekstirm, scadsporb, scadstirb

End-Anchored Order: snadstirb, snekstirb, scadsporm, sceksporm, sneksporm, snadsporm, scedstirb, scadstirb
Anderson and Matessa's experiment

Proportion of partitions divided along a front-anchored feature:

                        Front-Anchored Order   End-Anchored Order
Experimental Data              0.55                   0.30
Local MAP                      1.00                   0.00
Particle Filter (1)            0.59                   0.38
Particle Filter (100)          0.50                   0.50
Gibbs Sampler                  0.48                   0.49
A "rational process model"

• A rational model clarifies a problem and serves as a benchmark for performance
• Using a psychologically plausible approximation can change a rational model into a "rational process model"
• Research in machine learning and statistics has produced useful approximations to statistical models, which can be tested as general-purpose psychological heuristics
Summary

• Traditional models of the cognitive processes involved in categorization can be reinterpreted as rational models (via density estimation)
• Prototypes vs. exemplars is a debate about schemes for density estimation (and representations)
• Nonparametric Bayes lets us explore options between these extremes (as well as some new models)
  – all of these models are instances of the "hierarchical Dirichlet process" (Griffiths, Canini, Sanborn, & Navarro, 2007)
• Monte Carlo methods provide hypotheses about how people address the computational challenges