Categorization and density estimation
Tom Griffiths, UC Berkeley
Categorization

[Figure: example images labeled "dog" and "cat", plus a new image marked "?"]
Categorization
cat = small furry domestic carnivore

[Figure: example images of cats]
Borderline cases

• Is a tomato a vegetable?
  – around 50% say yes (Hampton, 1979)
• Is an olive a fruit?
  – around 22% change their mind (McCloskey & Glucksberg, 1978)
Typicality

[Figures: images of category members varying in typicality, with examples labeled "Typical" and "Atypical"]
Typicality and generalization

Penguins can catch disease X → All birds can catch disease X
Robins can catch disease X → All birds can catch disease X

Arguments from a typical premise category (robins) are judged stronger than arguments from an atypical one (penguins). (Rips, 1975)
How can we explain typicality?

• One answer: reject definitions, and adopt a new representation for categories
• Prototype theory:
  – categories are represented by a prototype
  – other members share a family-resemblance relation to the prototype
  – typicality is a function of similarity to the prototype
Prototypes

[Figure: dot-pattern stimuli generated by distorting a prototype (Posner & Keele, 1968)]
Posner and Keele (1968)
• Prototype effect in categorization accuracy
• Constructed categories by perturbing prototypical dot arrays
• Ordering of categorization accuracy at test:
  – old exemplars
  – prototypes
  – new exemplars
Formalizing prototype theories
Representation:
Each category (e.g., A, B) has a corresponding prototype (π_A, π_B)

Categorization (for a new stimulus x):
Choose the category that minimizes the distance (equivalently, maximizes the similarity) from x to its prototype (e.g., Reed, 1972)
Formalizing prototype theories
The prototype is the most frequent or "typical" member.

Spaces:
  Prototype: e.g., the average of the members of the category
  Distance: e.g., Euclidean distance
    $d(x, \pi_A) = \left[ \sum_k (x_k - \mu_{A,k})^2 \right]^{1/2}$

(Binary) Features:
  Prototype: e.g., the binary vector with the most frequent feature values
  Distance: e.g., Hamming (city-block) distance
    $d(x, \pi_A) = \sum_k | x_k - \mu_{A,k} |$
Formalizing prototype theories
[Figure: exemplars of Category A and Category B with their prototypes (the category means); the decision boundary lies at equal distance from the two prototypes, and is always a straight line for two categories]
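To make the rule concrete, here is a minimal sketch of prototype classification in a continuous space, assuming Euclidean distance and equal priors; the function and variable names are ours, not from the slides:

```python
import numpy as np

def prototype_classify(x, prototypes):
    """Assign x to the category whose prototype (mean) is closest
    in Euclidean distance."""
    distances = {c: np.linalg.norm(x - mu) for c, mu in prototypes.items()}
    return min(distances, key=distances.get)

# Two categories summarized by their means; the decision boundary is
# the set of points equidistant from the two prototypes.
prototypes = {"A": np.array([0.0, 0.0]), "B": np.array([3.0, 1.0])}
print(prototype_classify(np.array([1.0, 0.2]), prototypes))  # -> "A"
```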
Predicting prototype effects
• Prototype effects are built into the model:
  – assume categorization becomes easier as proximity to the prototype increases…
  – …or as distance from the boundary increases
• But what about the old-exemplar advantage? (Posner & Keele, 1968)
• Prototype models are not the only way to get prototype effects…
Exemplar theories
Store every member (“exemplar”) of the category
Formalizing exemplar theories
Representation:
A set of stored exemplars y1, y2, …, yn, each with its own category label
Categorization (for a new stimulus x):
Choose category A with probability given by the "Luce-Shepard choice rule"

$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$

where η_xy is the similarity of x to y, and β_A is the bias towards A.
The context model (Medin & Schaffer, 1978)

Defined for stimuli with binary features (color, form, size, number):

1111 = (red, triangle, big, one); 0000 = (green, circle, small, two)

Define similarity as the product of the similarities on each dimension:

$\eta_{xy} = \prod_k \eta_{x y_k}$, where $\eta_{x y_k} = 1$ if $x_k = y_k$, and $s_k$ otherwise.
The generalized context model (Nosofsky, 1986)

Defined for stimuli in a psychological space, where

$\eta_{xy} = \exp\{-c \, d(x, y)^p\}$

c is the "specificity"; p = 1 gives an exponential similarity gradient, p = 2 a Gaussian one. Choice again follows the Luce-Shepard rule:

$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$
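A minimal sketch of GCM choice probabilities, assuming Euclidean distance, a Gaussian gradient (p = 2), and two categories; the parameter and function names are ours:

```python
import numpy as np

def gcm_prob_A(x, exemplars_A, exemplars_B, c=1.0, p=2, beta_A=0.5):
    """P(A|x) under the generalized context model: summed
    exponential-decay similarity to each category's exemplars,
    combined by the Luce-Shepard choice rule."""
    def summed_similarity(exemplars):
        d = np.linalg.norm(np.asarray(exemplars) - x, axis=1)  # distances
        return np.sum(np.exp(-c * d**p))                       # summed eta_xy
    s_A = beta_A * summed_similarity(exemplars_A)
    s_B = (1 - beta_A) * summed_similarity(exemplars_B)
    return s_A / (s_A + s_B)

A = [[0.0, 0.0], [0.5, 0.2]]
B = [[3.0, 1.0], [2.5, 0.8]]
print(gcm_prob_A(np.array([0.4, 0.1]), A, B))  # close to 1: x is near A
```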
The generalized context model
[Figure: exemplars of Category A and Category B; the decision boundary is determined by the exemplars, with regions where the model chooses A 90%, 50%, and 10% of the time]
Prototypes vs. exemplars

• Exemplar models produce prototype effects
  – if the prototype minimizes the distance to all exemplars in a category, then it has high probability
• They also predict the old-exemplar advantage
  – being close (or identical) to an old exemplar of the category gives high probability
• They predict new effects that prototype models cannot produce…
  – stimuli close to an old exemplar should have high probability, even far from the prototype
Prototypes vs. exemplars

Exemplar models can capture complex category boundaries.

[Figures: categories with nonlinear decision boundaries that exemplar models capture but prototype models cannot]
Some questions…
• Both prototype and exemplar models seem reasonable… but are they "rational"?
  – are they solutions to the computational problem?
• Should we use prototypes, or exemplars?
• How can we define other models that handle more complex categorization problems?
A computational problem

• Categorization is a classic inductive problem
  – data: stimulus x
  – hypotheses: category c
• We can apply Bayes' rule:

$P(c \mid x) = \frac{p(x \mid c) \, P(c)}{\sum_c p(x \mid c) \, P(c)}$

and choose c such that P(c|x) is maximized.
Density estimation
• We need to estimate some probability distributions
  – what is P(c)?
  – what is p(x|c)?
• Two approaches:
  – parametric
  – nonparametric

[Figure: graphical model with c → x, annotated with P(c) and p(x|c)]
Parametric density estimation

• Assume that p(x|c) has a simple form, characterized by parameters θ
• Given stimuli X = x₁, x₂, …, x_n from category c, find θ by maximum-likelihood estimation

$\hat{\theta} = \arg\max_\theta \; p(X \mid c, \theta)$

or by some form of Bayesian estimation, e.g. the MAP estimate

$\hat{\theta} = \arg\max_\theta \; \left[ \log p(X \mid c, \theta) + \log p(\theta) \right]$
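For a one-dimensional Gaussian the maximum-likelihood estimates have a closed form, so the argmax reduces to the sample mean and variance; a minimal sketch with toy data and our variable names:

```python
import numpy as np

# ML estimation for a one-dimensional Gaussian has a closed form:
# the sample mean and (biased, divide-by-n) sample variance maximize
# the likelihood, so no numerical optimization is needed here.
X = np.array([1.2, 0.8, 1.5, 1.1, 0.9])  # toy stimuli from one category
mu_hat = X.mean()
var_hat = X.var()
print(mu_hat, var_hat)
```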
Binary features

• x = (x₁, x₂, …, x_m) ∈ {0,1}^m
• Rather than estimating a distribution over all 2^m possibilities, assume feature independence
• Called "naive Bayes", because independence is a naive assumption!

[Figure: graphical model with c → x₁, x₂, …, x_m]

$P(x \mid c) = \prod_{k=1}^{m} P(x_k \mid c)$
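A minimal naive-Bayes sketch for binary features, assuming the per-feature probabilities θ are already known; the dictionary-based API and toy numbers are ours:

```python
import numpy as np

def naive_bayes_posterior(x, theta, prior):
    """P(c|x) for binary features under the independence ('naive')
    assumption, where theta[c][k] = P(x_k = 1 | c)."""
    scores = {}
    for c in prior:
        p = np.prod([theta[c][k] if x[k] == 1 else 1 - theta[c][k]
                     for k in range(len(x))])
        scores[c] = p * prior[c]
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

theta = {"cat": [0.9, 0.8, 0.1], "dog": [0.2, 0.7, 0.9]}
prior = {"cat": 0.5, "dog": 0.5}
print(naive_bayes_posterior([1, 1, 0], theta, prior))  # strongly favors "cat"
```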
Spatial representations
• Assume a simple parametric form for p(x|c): a Gaussian
• For each category, estimate the parameters
  – mean μ
  – variance σ²

[Figure: graphical model with c → x, where p(x|c) is Gaussian]
Bayesian inference
$P(c \mid x) = \frac{P(x \mid c) \, P(c)}{\sum_c P(x \mid c) \, P(c)}$

[Figure: two Gaussian category densities over x and the posterior they imply]
Nonparametric density estimation
• Rather than estimating a probability density from a parametric family, use a scheme that can give you any possible density
• "Nonparametric" can mean:
  – not making distributional assumptions
  – covering a broad class of functions
  – a model with infinitely many parameters
Kernel density estimation
Approximate a probability distribution as a sum of many "kernels" (one per data point). Given X = {x₁, x₂, …, x_n} independently sampled from an unknown distribution,

$p(x) \approx \frac{1}{n} \sum_{i=1}^{n} k(x, x_i)$

for a kernel function k(x, y) such that $\int k(x, y) \, dx = 1$, e.g. the Gaussian kernel

$k(x, y) = \frac{1}{\sqrt{2\pi} \, h} \exp\{ -(x - y)^2 / 2h^2 \}$

where h is the "window size" (bandwidth).
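A minimal sketch of a Gaussian kernel density estimate, vectorized over query points; the function name and toy data are ours:

```python
import numpy as np

def kde(x, data, h=0.25):
    """Gaussian kernel density estimate at point(s) x: the average of
    one Gaussian bump per observation, with window size h."""
    x = np.atleast_1d(x)[:, None]          # shape (query points, 1)
    data = np.asarray(data)[None, :]       # shape (1, n)
    kernels = np.exp(-(x - data)**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)
    return kernels.mean(axis=1)            # average over the n kernels

data = [0.2, 0.5, 0.9, 1.1, 1.4]
print(kde([0.5, 2.0], data))  # high density near the data, low far away
```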
Kernel density estimation

[Figures: the estimated function, individual kernels, and true function for h = 0.25 as the sample grows: n = 1, 2, 5, 10, 100. With more data, the estimate approaches the true density.]
Bayesian inference

$P(c \mid x) = \frac{P(x \mid c) \, P(c)}{\sum_c P(x \mid c) \, P(c)}$

[Figures: posteriors computed from kernel density estimates with h = 0.5, h = 0.1, and h = 2.0; the window size controls how smooth the implied category densities and decision boundary are]
Advantages and disadvantages
• Which method should we choose?
• The methods are complementary:
  – parametric estimation requires less data, but is severely constrained in the distributions it can represent
  – nonparametric estimation requires a lot of data, but can model (nearly) any distribution
• This is an instance of the bias-variance tradeoff.
Categorization as density estimation
• Prototype and exemplar models can be interpreted as rational Bayesian inference
• They correspond to different forms of density estimation:
  – prototype: parametric density estimation
  – exemplar: nonparametric density estimation
• This suggests other categorization models…
Prototype models
Prototype model: choose the category with the closest prototype.

Bayesian categorization: choose the category with the highest P(c|x).

These are equivalent if P(x|c) decreases as a function of the distance of x from the prototype of c, and the prior probability is equal for all categories.
Exemplar models
Exemplar model: choose category A with probability

$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$

This is the posterior probability of category A when P(x|A) is approximated with a kernel density estimator.
Bayesian exemplars
• β_A is the prior probability of category A (divided by the number of exemplars in A)
• The summed similarity is proportional to p(x|A) under a kernel density estimator:

$P(A \mid x) = \frac{p(x \mid A) \, P(A)}{p(x \mid A) \, P(A) + p(x \mid B) \, P(B)}$

$p(x \mid A) \approx \frac{1}{n_A} \sum_{y \in A} k(x, y) \propto \sum_{y \in A} \eta_{xy}$

for an appropriate kernel and similarity function.
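A quick numerical check of this correspondence, assuming an exponential kernel (p = 1), equal biases and priors, and equally sized categories, so the 1/n_A factor cancels; the toy exemplar values are ours:

```python
import numpy as np

# With an exponential kernel (p = 1), equal biases, equal priors, and
# n_A = n_B, the GCM choice probability equals the Bayesian posterior
# computed from kernel density estimates.
A = np.array([0.0, 0.5])                 # exemplars of A (1-D toy values)
B = np.array([3.0, 2.5])                 # exemplars of B
x, c = 1.0, 1.0
summed_sim = lambda ex: np.sum(np.exp(-c * np.abs(ex - x)))
p_gcm = summed_sim(A) / (summed_sim(A) + summed_sim(B))
kde = lambda ex: np.mean((c / 2) * np.exp(-c * np.abs(ex - x)))  # density
p_bayes = kde(A) / (kde(A) + kde(B))
print(np.isclose(p_gcm, p_bayes))        # True: the 1/n and c/2 cancel
```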
The generalized context model
[Figure: the exemplar similarity gradient over x, i.e. the summed-similarity function implied by the stored exemplars]
Implications
• Prototype and exemplar models can be interpreted as rational Bayesian inference
• They have different strengths and limitations
  – exemplar models are rational categorization models for any category density (given large n)
• This suggests other categorization models…
  – alternatives between prototypes and exemplars
Prototypes vs. exemplars

• Prototype model
  – parametric density estimation
  – P(x|A) specified by one Gaussian
• Exemplar model
  – nonparametric density estimation
  – P(x|A) is a sum of n_A Gaussians
• Compromise
  – "semiparametric" density estimation
  – a sum of more than one but fewer than n_A Gaussians
  – a "mixture of Gaussians" (Rosseel, 2003)
The Rational Model of Categorization (RMC; Anderson, 1990, 1991)

• Computational problem: predicting a feature based on observed data
  – assume that category labels are just features
• Predictions are made on the assumption that objects form clusters with similar properties
  – each object belongs to a single cluster
  – feature values are likely to be the same within clusters
  – the number of clusters is unbounded
Representation in the RMC

A flexible representation can interpolate between prototype and exemplar models.

[Figures: probability as a function of feature value when all objects share one cluster (a prototype model) versus one cluster per object (an exemplar model)]
The "optimal solution"

The probability of the missing feature (i.e., the category label) taking a certain value is

$P(j \mid F_n) = \sum_{x_n} P(j \mid x_n, F_n) \, P(x_n \mid F_n)$

where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into clusters; the second factor is the posterior over partitions.
The posterior over partitions

$P(x_n \mid F_n) = \frac{P(F_n \mid x_n) \, P(x_n)}{\sum_{x_n} P(F_n \mid x_n) \, P(x_n)}$

(likelihood × prior in the numerator; a sum over all partitions in the denominator)
The prior over partitions

• An object is assumed to have a constant probability of joining the same cluster as another object, known as the coupling probability c
• This leaves some probability that a stimulus forms a new cluster, so the probability that the ith object is assigned to the kth cluster is

$P(\text{cluster } k) = \frac{c \, M_k}{(1 - c) + c \, (i - 1)}$ for an existing cluster with M_k members, and

$P(\text{new cluster}) = \frac{1 - c}{(1 - c) + c \, (i - 1)}$

(A sketch of this prior appears below.)
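A minimal sketch of this prior; the function name and example sizes are ours:

```python
def rmc_prior(cluster_sizes, c):
    """Prior probability that the next object joins each existing
    cluster or starts a new one, given coupling probability c
    (the RMC prior over partitions). cluster_sizes[k] is the number
    of objects already in cluster k."""
    n = sum(cluster_sizes)                   # objects assigned so far (i - 1)
    denom = (1 - c) + c * n
    probs = [c * m_k / denom for m_k in cluster_sizes]
    probs.append((1 - c) / denom)            # probability of a new cluster
    return probs

print(rmc_prior([3, 1], c=0.5))  # [0.6, 0.2, 0.2] -- sums to 1
```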
Nonparametric Bayes

• Bayes: treat density estimation as a problem of Bayesian inference, defining a prior over the number of components
• Nonparametric: use a prior that places no limit on the number of components
• Dirichlet process mixture models (DPMMs) use a prior of this kind…
The prior over components

• Each sample is assumed to come from a single (possibly previously unseen) mixture component
• The ith sample is drawn from the kth component with probability

$P(\text{component } k) = \frac{m_k}{i - 1 + \alpha}$ for an existing component with m_k samples, and

$P(\text{new component}) = \frac{\alpha}{i - 1 + \alpha}$

where α is a parameter of the model.
Equivalence

Neal (1998) showed that the priors for the RMC and the DPMM are the same, with α = (1 − c)/c:

RMC prior: $P(\text{cluster } k) = \frac{c \, M_k}{(1 - c) + c \, (i - 1)}$

DPMM prior: $P(\text{component } k) = \frac{m_k}{i - 1 + \alpha}$

(dividing the RMC expression through by c recovers the DPMM form)
The computational challenge

The probability of the missing feature (i.e., the category label) taking a certain value is

$P(j \mid F_n) = \sum_{x_n} P(j \mid x_n, F_n) \, P(x_n \mid F_n)$

where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into groups. The number of partitions |x_n| grows explosively with n (the Bell numbers):

n      1   2   3   4    5    6     7     8      9       10
|x_n|  1   2   5   15   52   203   877   4140   21147   115975
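The |x_n| row is the Bell numbers; a short sketch computing them with the Bell-triangle recurrence (function name is ours):

```python
def bell_numbers(n_max):
    """Number of partitions of n objects for n = 1..n_max, computed
    with the Bell-triangle recurrence."""
    row, bells = [1], [1]
    for _ in range(n_max - 1):
        new_row = [row[-1]]               # each row starts with the last
        for entry in row:                 # entry of the previous row
            new_row.append(new_row[-1] + entry)
        row = new_row
        bells.append(row[-1])             # Bell number = row's last entry
    return bells

print(bell_numbers(10))  # [1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```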
Anderson's approximation

• Data are observed sequentially
• Each object is deterministically assigned to the cluster with the highest posterior probability
• Call this the "Local MAP"
  – choosing the cluster with the maximum a posteriori probability (see the sketch below)

[Figure: stimuli such as 111, 011, 100 arrive one at a time; at each step the assignment with the higher posterior probability (e.g., 0.54 vs. 0.46, then 0.33 vs. 0.67) is chosen, yielding a single final partition]
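A minimal sketch of the Local MAP procedure, combining the coupling-probability prior with a smoothed Bernoulli likelihood per binary feature; the likelihood and Laplace smoothing are our assumptions, not Anderson's exact scheme:

```python
import numpy as np

def local_map(stimuli, c=0.5):
    """Anderson's 'local MAP' approximation: assign each stimulus, in
    order, to the single cluster with the highest posterior probability."""
    clusters = []  # each cluster is a list of binary feature vectors
    for x in stimuli:
        x = np.asarray(x)
        n = sum(len(cl) for cl in clusters)       # objects assigned so far
        denom = (1 - c) + c * n
        scores = []
        for cl in clusters:                       # existing clusters
            counts = np.sum(cl, axis=0)
            p_feat = (counts + 1) / (len(cl) + 2)           # Laplace smoothing
            lik = np.prod(np.where(x == 1, p_feat, 1 - p_feat))
            scores.append((c * len(cl) / denom) * lik)      # prior * likelihood
        scores.append(((1 - c) / denom) * 0.5 ** len(x))    # new cluster
        best = int(np.argmax(scores))
        if best == len(clusters):
            clusters.append([x])
        else:
            clusters[best].append(x)
    return clusters

print(len(local_map([[1, 1, 1], [0, 1, 1], [1, 0, 0]])))  # clusters found
```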
Two uses of Monte Carlo methods
1. For solving problems of probabilistic inference involved in developing computational models
2. As a source of hypotheses about how the mind might solve problems of probabilistic inference
Alternative approximation schemes
• There are several methods for approximating the posterior in DPMMs
  – Gibbs sampling
  – Particle filtering
• These methods provide asymptotic performance guarantees (in contrast to Anderson's procedure)

(Sanborn, Griffiths, & Navarro, 2006)
Gibbs sampling for the DPMM

• All the data are required at once (a batch procedure)
• Each stimulus is sequentially reassigned to a cluster based on the assignments of all of the remaining stimuli
• Assignments are made probabilistically, using the full conditional distribution (a sketch follows below)

[Figure: starting from an initial partition of the stimuli 111, 011, 100, each stimulus is reassigned in turn by sampling from its full conditional (e.g., probabilities 0.33 vs. 0.67, or 0.48, 0.12, 0.40 across three clusters); repeated sweeps yield successive samples from the posterior over partitions]
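A sketch of one Gibbs sweep over cluster assignments, sampling each stimulus's assignment from its full conditional; it reuses the same Bernoulli-likelihood assumption as the Local MAP sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(stimuli, assign, c=0.5):
    """One sweep of Gibbs sampling for the RMC/DPMM posterior over
    partitions: remove each stimulus in turn and reassign it, sampling
    from its full conditional given all other assignments.
    assign[i] is the cluster id of stimulus i."""
    X = np.asarray(stimuli)
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        labels = sorted({assign[j] for j in others})
        denom = (1 - c) + c * len(others)
        probs = []
        for lab in labels:                        # existing clusters
            members = X[[j for j in others if assign[j] == lab]]
            p_feat = (members.sum(axis=0) + 1) / (len(members) + 2)
            lik = np.prod(np.where(X[i] == 1, p_feat, 1 - p_feat))
            probs.append((c * len(members) / denom) * lik)
        probs.append(((1 - c) / denom) * 0.5 ** X.shape[1])  # new cluster
        probs = np.array(probs) / sum(probs)
        choice = rng.choice(len(probs), p=probs)
        assign[i] = labels[choice] if choice < len(labels) else max(assign) + 1
    return assign

print(gibbs_sweep([[1, 1, 1], [0, 1, 1], [1, 0, 0]], assign=[0, 0, 1]))
```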
Particle filter for the DPMM

• Data are observed sequentially
• The posterior distribution at each point is approximated by a set of "particles"
• Particles are updated, and a fixed number are carried over from trial to trial (a sketch follows below)

[Figure: each particle is a partition of the stimuli 111, 011, 100 seen so far; several weighted partitions (e.g., with probabilities 0.27, 0.23, 0.17, …) are propagated in parallel, and samples are drawn from the resulting set]
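A sketch of a simple particle filter for the same model: each particle carries a partition, assignments are sampled per particle, and particles are resampled in proportion to the marginal likelihood of the new stimulus; again the Bernoulli likelihood is our assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def assignment_probs(x, clusters, c):
    """Unnormalized prior-times-likelihood for adding x to each
    existing cluster or a new one (same assumptions as above)."""
    denom = (1 - c) + c * sum(len(cl) for cl in clusters)
    probs = []
    for cl in clusters:
        p_feat = (np.sum(cl, axis=0) + 1) / (len(cl) + 2)
        lik = np.prod(np.where(x == 1, p_feat, 1 - p_feat))
        probs.append((c * len(cl) / denom) * lik)
    probs.append(((1 - c) / denom) * 0.5 ** len(x))   # new cluster
    return np.array(probs)

def particle_filter(stimuli, n_particles=100, c=0.5):
    particles = [[] for _ in range(n_particles)]  # particle = list of clusters
    for x in map(np.asarray, stimuli):
        weights = np.empty(n_particles)
        for m, clusters in enumerate(particles):
            probs = assignment_probs(x, clusters, c)
            weights[m] = probs.sum()              # marginal likelihood of x
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(clusters):
                clusters.append([x])
            else:
                clusters[k].append(x)
        idx = rng.choice(n_particles, size=n_particles, p=weights / weights.sum())
        particles = [[list(cl) for cl in particles[i]] for i in idx]  # resample
    return particles

sizes = [len(p) for p in particle_filter([[1, 1, 1], [0, 1, 1], [1, 0, 0]])]
print(np.mean(sizes))  # average number of clusters across particles
```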
Approximating the posterior

• For a single order, the Local MAP will produce a single partition
• The Gibbs sampler and particle filter will approximate the exact DPMM distribution

[Figure: the exact posterior over partitions of the stimuli 111, 011, 100 compared with the output of the Local MAP (a single partition), the Gibbs sampler, and the particle filter]
Order effects in human data

• The probabilistic model underlying the DPMM does not produce any order effects
  – this follows from exchangeability
• But… human data show order effects (e.g., Medin & Bettger, 1994)
• Anderson and Matessa tested Local MAP predictions about order effects in an unsupervised clustering experiment (Anderson, 1990)
Anderson and Matessa's experiment

• Subjects were shown all sixteen stimuli that had four binary features
• Front-anchored orders emphasized the first two features in the first eight trials; end-anchored orders emphasized the last two

Front-Anchored Order: scadsporm, scadstirm, sneksporb, snekstirb, sneksporm, snekstirm, scadsporb, scadstirb

End-Anchored Order: snadstirb, snekstirb, scadsporm, sceksporm, sneksporm, snadsporm, scedstirb, scadstirb
Anderson and Matessa's experiment

Proportion of partitions divided along a front-anchored feature:

                        Front-Anchored Order   End-Anchored Order
Experimental Data              0.55                   0.30
Local MAP                      1.00                   0.00
Particle Filter (1)            0.59                   0.38
Particle Filter (100)          0.50                   0.50
Gibbs Sampler                  0.48                   0.49
A "rational process model"

• A rational model clarifies a problem and serves as a benchmark for performance
• Using a psychologically plausible approximation can change a rational model into a "rational process model"
• Research in machine learning and statistics has produced useful approximations to statistical models, which can be tested as general-purpose psychological heuristics
Summary

• Traditional models of the cognitive processes involved in categorization can be reinterpreted as rational models (via density estimation)
• Prototypes vs. exemplars is a debate about schemes for density estimation (and representations)
• Nonparametric Bayes lets us explore options between these extremes (as well as some new models)
  – all of these models are instances of the "hierarchical Dirichlet process" (Griffiths, Canini, Sanborn, & Navarro, 2007)
• Monte Carlo methods provide hypotheses about how people address the computational challenges