Categorization and density estimation Tom Griffiths UC Berkeley


Page 1: Categorization and density estimation Tom Griffiths UC Berkeley

Categorization and density estimation

Tom Griffiths, UC Berkeley

Page 2: Categorization and density estimation Tom Griffiths UC Berkeley

[images of several dogs and cats, each labeled “dog” or “cat”, plus one unlabeled image marked “?”]

Categorization

Page 3: Categorization and density estimation Tom Griffiths UC Berkeley

Categorization

cat: a small, furry, domestic carnivore


Page 4: Categorization and density estimation Tom Griffiths UC Berkeley

Borderline cases

• Is a tomato a vegetable?
  – around 50% say yes

(Hampton, 1979)

• Is an olive a fruit?
  – around 22% change their mind

(McCloskey & Glucksberg, 1978)


Page 6: Categorization and density estimation Tom Griffiths UC Berkeley

Typicality

Page 7: Categorization and density estimation Tom Griffiths UC Berkeley


Typicality

Page 8: Categorization and density estimation Tom Griffiths UC Berkeley


Page 9: Categorization and density estimation Tom Griffiths UC Berkeley

Typical vs. Atypical

[images of typical and atypical category members]

Page 10: Categorization and density estimation Tom Griffiths UC Berkeley

Typicality and generalization

Penguins can catch disease X  ⇒  All birds can catch disease X

Robins can catch disease X  ⇒  All birds can catch disease X

The argument from the typical member (robins) is judged stronger than the one from the atypical member (penguins).

(Rips, 1975)

Page 11: Categorization and density estimation Tom Griffiths UC Berkeley

How can we explain typicality?

• One answer: reject definitions, and have a new representation for categories

• Prototype theory:
  – categories are represented by a prototype
  – other members share a family resemblance relation to the prototype
  – typicality is a function of similarity to the prototype

Page 12: Categorization and density estimation Tom Griffiths UC Berkeley

Prototypes

(Posner & Keele, 1968)

Prototype

Page 13: Categorization and density estimation Tom Griffiths UC Berkeley

Posner and Keele (1968)

• Prototype effect in categorization accuracy

• Constructed categories by perturbing prototypical dot arrays

• Ordering of categorization accuracy at test:
  – old exemplars
  – prototypes
  – new exemplars

Page 14: Categorization and density estimation Tom Griffiths UC Berkeley

Formalizing prototype theories

Representation:

Each category (e.g., A, B) has a corresponding prototype (π_A, π_B)

Categorization: (for a new stimulus x)

Choose the category whose prototype minimizes the distance to x (equivalently, maximizes the similarity)

(e.g., Reed, 1972)

Page 15: Categorization and density estimation Tom Griffiths UC Berkeley

Formalizing prototype theories

Prototype is most frequent or “typical” member

Prototype is most frequent or “typical” member:

Spaces:
  Prototype: e.g., average of the members of the category
  Distance: e.g., Euclidean distance

$$d(x, \pi_A) = \left[ \sum_k (x_k - \mu_{A,k})^2 \right]^{1/2}$$

(Binary) Features:
  Prototype: e.g., binary vector with the most frequent feature values
  Distance: e.g., Hamming distance

$$d(x, \pi_A) = \sum_k \left| x_k - \mu_{A,k} \right|$$
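As an illustration, here is a minimal Python sketch (not from the slides) of this prototype rule, assuming numpy and a continuous stimulus space; the Hamming-distance variant for binary features is noted in the comments.

```python
import numpy as np

def prototype_classify(x, exemplars_by_category):
    """Assign x to the category whose prototype (mean of its members) is closest.

    exemplars_by_category: dict mapping category label -> array of shape (n, d).
    Distances are Euclidean; for binary features, replace the mean with the
    modal feature vector and the distance with Hamming distance.
    """
    best_label, best_dist = None, np.inf
    for label, members in exemplars_by_category.items():
        prototype = np.asarray(members).mean(axis=0)        # category mean
        dist = np.linalg.norm(np.asarray(x) - prototype)    # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Example: two categories in a 2D space (arbitrary values)
categories = {
    "A": np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1]]),
    "B": np.array([[3.0, 3.0], [2.8, 3.2], [3.1, 2.9]]),
}
print(prototype_classify([1.5, 1.5], categories))  # -> "A"
```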

Page 16: Categorization and density estimation Tom Griffiths UC Berkeley

Formalizing prototype theories

Category A Category B

Prototypes (category means)

Decision boundary at equal distance (always a straight line for two categories)

Page 17: Categorization and density estimation Tom Griffiths UC Berkeley

Predicting prototype effects

• Prototype effects are built into the model:
  – assume categorization becomes easier as proximity to the prototype increases…
  – …or as distance from the boundary increases

• But what about the old exemplar advantage? (Posner & Keele, 1968)

• Prototype models are not the only way to get prototype effects…

Page 18: Categorization and density estimation Tom Griffiths UC Berkeley

Exemplar theories

Store every member (“exemplar”) of the category

Page 19: Categorization and density estimation Tom Griffiths UC Berkeley

Formalizing exemplar theories

Representation:

A set of stored exemplars y1, y2, …, yn, each with its own category label

Categorization: (for a new stimulus x)

Choose category A with probability

$$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$$

(the “Luce-Shepard choice rule”)

η_xy is the similarity of x to y

β_A is the bias towards A

Page 20: Categorization and density estimation Tom Griffiths UC Berkeley

The context model (Medin & Schaffer, 1978)

Defined for stimuli with binary features (color, form, size, number)

1111 = (red, triangle, big, one) 0000 = (green, circle, small, two)

Define similarity as the product of similarity on each dimension:

$$\eta_{xy} = \prod_k \eta_{xy_k}, \qquad \eta_{xy_k} = \begin{cases} 1 & \text{if } x_k = y_k \\ s_k & \text{otherwise} \end{cases}$$
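A minimal Python sketch of this product-similarity rule; the per-dimension mismatch parameters s_k in the example are arbitrary illustrative values.

```python
def context_similarity(x, y, s):
    """Context-model similarity: a product over dimensions, where matching
    dimensions contribute 1 and mismatching dimensions contribute s_k."""
    sim = 1.0
    for xk, yk, sk in zip(x, y, s):
        sim *= 1.0 if xk == yk else sk
    return sim

# All four dimensions mismatch, so the similarities multiply
print(context_similarity((1, 1, 1, 1), (0, 0, 0, 0), s=(0.2, 0.3, 0.4, 0.5)))  # ≈ 0.012
```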

Page 21: Categorization and density estimation Tom Griffiths UC Berkeley

The generalized context model (Nosofsky, 1986)

Defined for stimuli in psychological space

$$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$$

where

$$\eta_{xy} = \exp\{-c \, d(x, y)^p\}$$

c is the “specificity”
p = 1 gives an exponential similarity gradient
p = 2 gives a Gaussian similarity gradient
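The following Python sketch (an illustration, not Nosofsky's code) combines the exponential-of-distance similarity with the Luce-Shepard choice rule; the parameter values in the example (c, p, the exemplars) are arbitrary.

```python
import numpy as np

def gcm_prob(x, exemplars, labels, c=1.0, p=2, r=2, betas=None):
    """Generalized context model: P(category | x) via summed similarity.

    exemplars: array (n, d); labels: length-n list of category labels.
    Similarity eta(x, y) = exp(-c * d(x, y)**p), with d an r-norm distance.
    betas: optional dict of response biases (default: equal biases).
    """
    exemplars = np.asarray(exemplars, dtype=float)
    x = np.asarray(x, dtype=float)
    dists = np.sum(np.abs(exemplars - x) ** r, axis=1) ** (1.0 / r)
    sims = np.exp(-c * dists ** p)

    cats = sorted(set(labels))
    betas = betas or {cat: 1.0 for cat in cats}
    summed = {cat: betas[cat] * sims[np.array([l == cat for l in labels])].sum()
              for cat in cats}
    total = sum(summed.values())
    return {cat: s / total for cat, s in summed.items()}  # Luce-Shepard choice rule

# Example with two small categories
X = [[1, 1], [1.2, 0.8], [3, 3], [2.9, 3.1]]
y = ["A", "A", "B", "B"]
print(gcm_prob([1.5, 1.5], X, y, c=1.0, p=2))
```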

Page 22: Categorization and density estimation Tom Griffiths UC Berkeley

The generalized context model

Category A vs. Category B: decision boundary determined by the exemplars

[figure: contours marking the regions where P(A | x) is 90%, 50%, and 10%]

Page 23: Categorization and density estimation Tom Griffiths UC Berkeley

Prototypes vs. exemplars

• Exemplar models produce prototype effects
  – if the prototype minimizes distance to all exemplars in a category, then it has high probability
• Also predicts the old exemplar advantage
  – being close (or identical) to an old exemplar of the category gives high probability
• Predicts new effects prototype models cannot produce…
  – stimuli close to an old exemplar should have high probability, even far from the prototype

Page 24: Categorization and density estimation Tom Griffiths UC Berkeley

Prototypes vs. exemplars

Exemplar models can capture complex boundaries

[figure]

Page 25: Categorization and density estimation Tom Griffiths UC Berkeley

Prototypes vs. exemplars


Exemplar models can capture complex boundaries

Page 26: Categorization and density estimation Tom Griffiths UC Berkeley

Some questions…

• Both prototype and exemplar models seem reasonable… are they “rational”?
  – are they solutions to the computational problem?

• Should we use prototypes, or exemplars?

• How can we define other models that handle more complex categorization problems?

Page 27: Categorization and density estimation Tom Griffiths UC Berkeley

A computational problem

• Categorization is a classic inductive problem
  – data: stimulus x
  – hypotheses: category c
• We can apply Bayes’ rule:

$$P(c \mid x) = \frac{p(x \mid c) P(c)}{\sum_c p(x \mid c) P(c)}$$

and choose c such that P(c|x) is maximized

Page 28: Categorization and density estimation Tom Griffiths UC Berkeley

Density estimation

• We need to estimate some probability distributions
  – what is P(c)?
  – what is p(x|c)?
• Two approaches:
  – parametric
  – nonparametric

[graphical model: c → x, with prior P(c) on c and likelihood p(x|c) on x]

Page 29: Categorization and density estimation Tom Griffiths UC Berkeley

Parametric density estimation

• Assume that p(x|c) has a simple form, characterized by parameters θ

• Given stimuli X = x1, x2, …, xn from category c, find θ by maximum-likelihood estimation

$$\hat{\theta} = \arg\max_\theta \; p(X \mid c, \theta)$$

or some form of Bayesian estimation, e.g. the MAP estimate

$$\hat{\theta} = \arg\max_\theta \; \left[ \log p(X \mid c, \theta) + \log p(\theta) \right]$$
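A minimal numpy sketch of parametric estimation for one category, assuming a diagonal Gaussian form for p(x|c); a MAP variant would simply add log p(θ) before maximizing. The example data are arbitrary.

```python
import numpy as np

def fit_gaussian_mle(X):
    """Maximum-likelihood estimates of a diagonal Gaussian for one category.

    X: array of shape (n, d) of stimuli from the category.
    Returns the mean vector and per-dimension variance vector.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)          # MLE of the mean
    var = X.var(axis=0)          # MLE of the variance (divides by n)
    return mu, var

def log_likelihood(x, mu, var):
    """log p(x | c, theta) under the fitted diagonal Gaussian."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)))

X_cat = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]])
mu, var = fit_gaussian_mle(X_cat)
print(mu, var, log_likelihood([1.0, 2.0], mu, var))
```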

Page 30: Categorization and density estimation Tom Griffiths UC Berkeley

Binary features

• x = (x1, x2, …, xm) ∈ {0,1}^m

• Rather than estimating a distribution over 2^m possibilities, assume feature independence

[graphical model: c → x, with prior P(c) on c and likelihood P(x|c) on x]

Page 31: Categorization and density estimation Tom Griffiths UC Berkeley

Binary features

• x = (x1, x2, …, xm) ∈ {0,1}^m

• Rather than estimating a distribution over 2^m possibilities, assume feature independence

• Called “Naïve Bayes”, because independence is a naïve assumption!

[graphical model: c → x1, x2, …, xm]

$$P(x \mid c) = \prod_{k=1}^{m} P(x_k \mid c)$$
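A short Python sketch of naïve Bayes for binary features; the Laplace smoothing count `alpha` is an added assumption for illustration, not something from the slides.

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate P(c) and P(x_k = 1 | c) for binary features.

    X: (n, m) array of 0/1 features; y: length-n list of category labels.
    alpha is a Laplace smoothing count (an assumption, not from the slides).
    """
    X = np.asarray(X)
    cats = sorted(set(y))
    priors, feature_probs = {}, {}
    for c in cats:
        Xc = X[np.array([label == c for label in y])]
        priors[c] = len(Xc) / len(X)                          # P(c)
        feature_probs[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return priors, feature_probs

def naive_bayes_posterior(x, priors, feature_probs):
    """P(c | x), assuming the features are independent given the category."""
    x = np.asarray(x)
    scores = {}
    for c in priors:
        p = feature_probs[c]
        lik = np.prod(np.where(x == 1, p, 1 - p))             # product over features
        scores[c] = priors[c] * lik
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = ["A", "A", "B", "B"]
priors, probs = fit_naive_bayes(X, y)
print(naive_bayes_posterior(np.array([1, 1, 0]), priors, probs))  # mostly "A"
```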

Page 32: Categorization and density estimation Tom Griffiths UC Berkeley

Spatial representations

• Assume a simple parametric form for p(x|c): a Gaussian

• For each category, estimate the parameters
  – mean μ
  – variance σ²

[graphical model: c → x, with prior P(c) on c and Gaussian likelihood p(x|c) on x]

Page 33: Categorization and density estimation Tom Griffiths UC Berkeley

Bayesian inference

$$P(c \mid x) = \frac{P(x \mid c) P(c)}{\sum_c P(x \mid c) P(c)}$$

[plot: probability as a function of x for each category]

Page 34: Categorization and density estimation Tom Griffiths UC Berkeley

Nonparametric density estimation

• Rather than estimating a probability density from a parametric family, use a scheme that can give you any possible density

• “Nonparametric” can mean:
  – not making distributional assumptions
  – covering a broad class of functions
  – a model with infinitely many parameters

Page 35: Categorization and density estimation Tom Griffiths UC Berkeley

Kernel density estimation

Approximate a probability distribution as a sum of many “kernels” (one per data point)

X = {x1, x2, …, xn} independently sampled from something

$$p(x) \approx \frac{1}{n} \sum_{i=1}^{n} k(x, x_i)$$

for a kernel function k(x, y) such that $\int k(x, y) \, dx = 1$

e.g. the Gaussian kernel:

$$k(x, y) = \frac{1}{\sqrt{2\pi}\, h} \exp\{ -(x - y)^2 / 2h^2 \}$$

where h is the “window size” (bandwidth)
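A minimal numpy sketch of kernel density estimation with the Gaussian kernel above; the sample data, bandwidth, and evaluation points are arbitrary.

```python
import numpy as np

def gaussian_kernel(x, y, h):
    """Gaussian kernel with bandwidth ("window size") h; integrates to 1 over x."""
    return np.exp(-(x - y) ** 2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)

def kde(x, data, h=0.25):
    """Kernel density estimate: the average of one kernel per observed data point."""
    return float(np.mean(gaussian_kernel(x, np.asarray(data, dtype=float), h)))

# Example: estimate the density of 100 samples at a few points
samples = np.random.default_rng(0).normal(0.0, 1.0, size=100)
print([round(kde(g, samples, h=0.25), 3) for g in np.linspace(-2, 2, 5)])
```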

Page 36: Categorization and density estimation Tom Griffiths UC Berkeley

Kernel density estimation

[plot: true function, individual kernels, and estimated function; h = 0.25, n = 1]

Page 37: Categorization and density estimation Tom Griffiths UC Berkeley

Kernel density estimation

[plot: true function, individual kernels, and estimated function; h = 0.25, n = 2]

Page 38: Categorization and density estimation Tom Griffiths UC Berkeley

Kernel density estimation

[plot: true function, individual kernels, and estimated function; h = 0.25, n = 5]

Page 39: Categorization and density estimation Tom Griffiths UC Berkeley

Kernel density estimation

[plot: true function, individual kernels, and estimated function; h = 0.25, n = 10]

Page 40: Categorization and density estimation Tom Griffiths UC Berkeley

Kernel density estimation

[plot: true function, individual kernels, and estimated function; h = 0.25, n = 100]

Page 41: Categorization and density estimation Tom Griffiths UC Berkeley

Bayesian inference

$$P(c \mid x) = \frac{P(x \mid c) P(c)}{\sum_c P(x \mid c) P(c)}$$

[plot: probability as a function of x, with p(x|c) estimated by kernel density estimation; h = 0.5]

Page 42: Categorization and density estimation Tom Griffiths UC Berkeley

Bayesian inference

$$P(c \mid x) = \frac{P(x \mid c) P(c)}{\sum_c P(x \mid c) P(c)}$$

[plot: probability as a function of x, with p(x|c) estimated by kernel density estimation; h = 0.1]

Page 43: Categorization and density estimation Tom Griffiths UC Berkeley

Bayesian inference

$$P(c \mid x) = \frac{P(x \mid c) P(c)}{\sum_c P(x \mid c) P(c)}$$

[plot: probability as a function of x, with p(x|c) estimated by kernel density estimation; h = 2.0]

Page 44: Categorization and density estimation Tom Griffiths UC Berkeley

Advantages and disadvantages

• Which method should we choose?

• Methods are complementary:
  – parametric estimation requires less data, but is severely constrained in the possible distributions
  – nonparametric estimation requires a lot of data, but can model any function

An instance of the bias-variance tradeoff

Page 45: Categorization and density estimation Tom Griffiths UC Berkeley

Categorization as density estimation

• Prototype and exemplar models can be interpreted as rational Bayesian inference

• Different forms of density estimation:
  – prototype: parametric density estimation
  – exemplar: nonparametric density estimation

• Suggests other categorization models…

Page 46: Categorization and density estimation Tom Griffiths UC Berkeley

Prototype models

Prototype model:

Choose category with the closest prototype

Bayesian categorization:

Choose category with highest P(c|x)

Equivalent if P(x|c) decreases as a function of the distance of x from the prototype π_c and the prior probability is equal for all categories

Page 47: Categorization and density estimation Tom Griffiths UC Berkeley

Exemplar models

Exemplar model:

Choose category A with probability

$$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$$

P(A|x) is the posterior probability of category A when p(x|A) is approximated using a kernel density estimator

Page 48: Categorization and density estimation Tom Griffiths UC Berkeley

Bayesian exemplars

• β_A is the prior probability of category A (divided by the number of exemplars in A)

• The summed similarity is proportional to p(x|A) using a kernel density estimator

$$P(A \mid x) = \frac{\beta_A \sum_{y \in A} \eta_{xy}}{\beta_A \sum_{y \in A} \eta_{xy} + \beta_B \sum_{y \in B} \eta_{xy}}$$

$$P(A \mid x) = \frac{p(x \mid A) P(A)}{p(x \mid A) P(A) + p(x \mid B) P(B)}$$

$$p(x \mid A) \approx \frac{1}{n_A} \sum_{y \in A} k(x, y) \; \propto \; \sum_{y \in A} \eta_{xy}$$

for an appropriate kernel and similarity function
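A quick numeric check of this correspondence (a sketch under the stated assumptions): with a Gaussian kernel of bandwidth h, GCM specificity c = 1/(2h²), p = 2, and β_A set to P(A)/n_A, the GCM response probability and the Bayes-plus-KDE posterior coincide. The exemplar values below are arbitrary.

```python
import numpy as np

# Exemplars for two categories along one dimension (illustrative values)
A = np.array([0.0, 0.5, 1.0])
B = np.array([3.0, 3.5])
h = 0.5                      # kernel bandwidth
c = 1.0 / (2 * h ** 2)       # matching GCM specificity for a Gaussian kernel
x = 1.8

def kde(x, data, h):
    """Gaussian kernel density estimate of p(x | category)."""
    return np.mean(np.exp(-(x - data) ** 2 / (2 * h ** 2))) / (np.sqrt(2 * np.pi) * h)

# Bayes' rule with kernel density estimates and priors proportional to category size
prior_A, prior_B = len(A) / (len(A) + len(B)), len(B) / (len(A) + len(B))
post_A = kde(x, A, h) * prior_A / (kde(x, A, h) * prior_A + kde(x, B, h) * prior_B)

# GCM with Gaussian similarity (p = 2) and beta_A proportional to P(A) / n_A
sim_A = np.sum(np.exp(-c * (x - A) ** 2))
sim_B = np.sum(np.exp(-c * (x - B) ** 2))
beta_A, beta_B = prior_A / len(A), prior_B / len(B)
gcm_A = beta_A * sim_A / (beta_A * sim_A + beta_B * sim_B)

print(round(post_A, 6), round(gcm_A, 6))   # the two values match
```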

Page 49: Categorization and density estimation Tom Griffiths UC Berkeley

The generalized context model

[plot: exemplar similarity gradients as a function of x]

Page 50: Categorization and density estimation Tom Griffiths UC Berkeley

Implications

• Prototype and exemplar models can be interpreted as rational Bayesian inference

• Different strengths and limitations
  – exemplar models are rational categorization models for any category density (and large n)
• Suggests other categorization models…
  – alternatives between prototypes and exemplars

Page 51: Categorization and density estimation Tom Griffiths UC Berkeley

Prototypes vs. exemplars

• Prototype model
  – parametric density estimation
  – P(x|A) specified by one Gaussian
• Exemplar model
  – nonparametric density estimation
  – P(x|A) is a sum of n_A Gaussians
• Compromise
  – “semiparametric” density estimation
  – a sum of more than one but fewer than n_A Gaussians
  – “mixture of Gaussians” (Rosseel, 2003); see the sketch below
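A sketch of the semiparametric compromise using scikit-learn's GaussianMixture (assuming scikit-learn is available; Rosseel's 2003 model adds further structure, so this only illustrates the basic idea of a few Gaussians per category).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_category_density(X, n_components=2, seed=0):
    """Fit p(x | category) as a mixture of a few Gaussians: a middle ground between
    one Gaussian (prototype model) and one Gaussian per exemplar (exemplar model)."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(np.asarray(X, dtype=float))
    return gmm

def posterior(x, gmm_A, gmm_B, prior_A=0.5):
    """P(A | x) via Bayes' rule with mixture-density estimates of p(x | c)."""
    x = np.atleast_2d(x)
    p_A = np.exp(gmm_A.score_samples(x))[0] * prior_A
    p_B = np.exp(gmm_B.score_samples(x))[0] * (1 - prior_A)
    return p_A / (p_A + p_B)

rng = np.random.default_rng(0)
gmm_A = fit_category_density(rng.normal(0.0, 1.0, size=(50, 2)))
gmm_B = fit_category_density(rng.normal(3.0, 1.0, size=(50, 2)))
print(round(posterior([0.0, 0.0], gmm_A, gmm_B), 3))   # close to 1
```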

Page 52: Categorization and density estimation Tom Griffiths UC Berkeley

The Rational Model of Categorization (RMC; Anderson, 1990, 1991)

• Computational problem: predicting a feature based on observed data
  – assume that category labels are just features
• Predictions are made on the assumption that objects form clusters with similar properties
  – each object belongs to a single cluster
  – feature values are likely to be the same within clusters
  – the number of clusters is unbounded

Page 53: Categorization and density estimation Tom Griffiths UC Berkeley

Representation in the RMC

Flexible representation can interpolate between prototype and exemplar models

[two panels: probability as a function of feature value, one per clustering]

Page 54: Categorization and density estimation Tom Griffiths UC Berkeley

The “optimal solution”

The probability of the missing feature (i.e., the category label) taking a certain value is

$$P(j \mid F_n) = \sum_{x_n} P(j \mid x_n, F_n) \, P(x_n \mid F_n)$$

where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into clusters; the second factor, P(x_n | F_n), is the posterior over partitions.

Page 55: Categorization and density estimation Tom Griffiths UC Berkeley

The posterior over partitions

$$P(x_n \mid F_n) = \frac{P(F_n \mid x_n) \, P(x_n)}{\sum_{x_n} P(F_n \mid x_n) \, P(x_n)}$$

likelihood × prior in the numerator, with a sum over all partitions in the denominator

Page 56: Categorization and density estimation Tom Griffiths UC Berkeley

The prior over partitions

• An object is assumed to have a constant probability of joining the same cluster as another object, known as the coupling probability c
• This allows some probability that a stimulus forms a new cluster, so the probability that the ith object is assigned to the kth cluster is

$$P(z_i = k \mid z_1, \ldots, z_{i-1}) = \begin{cases} \dfrac{c \, M_k}{(1 - c) + c\,(i - 1)} & \text{if } k \text{ is an existing cluster with } M_k \text{ members} \\[2ex] \dfrac{1 - c}{(1 - c) + c\,(i - 1)} & \text{if } k \text{ is a new cluster} \end{cases}$$

Page 57: Categorization and density estimation Tom Griffiths UC Berkeley

Nonparametric Bayes

• Bayes: treat density estimation as a problem of Bayesian inference, defining a prior over the number of components

• Nonparametric: use a prior that places no limit on the number of components

• Dirichlet process mixture models (DPMMs) use a prior of this kind…

Page 58: Categorization and density estimation Tom Griffiths UC Berkeley

The prior over components

• Each sample is assumed to come from a single (possibly previously unseen) mixture component
• The ith sample is drawn from the kth component with probability

$$P(z_i = k \mid z_1, \ldots, z_{i-1}) = \begin{cases} \dfrac{M_k}{i - 1 + \alpha} & \text{if } k \text{ is an existing component with } M_k \text{ members} \\[2ex] \dfrac{\alpha}{i - 1 + \alpha} & \text{if } k \text{ is a new component} \end{cases}$$

where α is a parameter of the model
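A small Python sketch of this sequential prior (the Chinese-restaurant-process form of the DPMM prior), with an arbitrary example assignment history:

```python
import numpy as np

def component_probs(assignments, alpha=1.0):
    """Probabilities that the next sample joins each existing component, or a new
    one, given the previous assignments (the sequential DPMM prior above)."""
    i = len(assignments) + 1                  # index of the next sample
    counts = np.bincount(assignments)         # M_k for each existing component
    existing = counts / (i - 1 + alpha)
    new = alpha / (i - 1 + alpha)
    return existing, new

# Example: three earlier samples assigned to components 0, 0, 1
print(component_probs([0, 0, 1], alpha=1.0))  # (array([0.5, 0.25]), 0.25)
```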

Page 59: Categorization and density estimation Tom Griffiths UC Berkeley

Equivalence

Neal (1998) showed that the priors for the RMC and the DPMM are the same, with α = (1 − c) / c:

RMC prior: P(z_i = k) ∝ c·M_k for an existing cluster, (1 − c) for a new cluster (each normalized by (1 − c) + c(i − 1))

DPMM prior: P(z_i = k) ∝ M_k for an existing component, α for a new component (each normalized by i − 1 + α)
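A quick numeric check of the stated equivalence, assuming the prior forms reconstructed on this and the two previous slides:

```python
def rmc_prior(M_k, i, c):
    """Anderson's prior that object i joins an existing cluster with M_k members
    (as reconstructed above); (1 - c) replaces c * M_k for a new cluster."""
    return c * M_k / ((1 - c) + c * (i - 1))

def dpmm_prior(M_k, i, alpha):
    """DPMM / CRP prior for the same event; alpha replaces M_k for a new cluster."""
    return M_k / (i - 1 + alpha)

c = 0.4
alpha = (1 - c) / c
print(rmc_prior(M_k=3, i=6, c=c), dpmm_prior(M_k=3, i=6, alpha=alpha))  # both ≈ 0.4615
```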

Page 60: Categorization and density estimation Tom Griffiths UC Berkeley

The computational challenge

The probability of the missing feature (i.e., the category label) taking a certain value is

$$P(j \mid F_n) = \sum_{x_n} P(j \mid x_n, F_n) \, P(x_n \mid F_n)$$

where j is a feature value, F_n are the observed features of a set of n objects, and x_n is a partition of the objects into groups

n        1    2    3    4     5     6      7      8       9        10
|x_n|    1    2    5    15    52    203    877    4140    21147    115975

Page 61: Categorization and density estimation Tom Griffiths UC Berkeley

Anderson’s approximation

• Data are observed sequentially
• Each object is deterministically assigned to the cluster with the highest posterior probability
• Call this the “Local MAP”
  – choosing the cluster with the maximum a posteriori probability

[illustration: stimuli 111, 011, 100 are processed one at a time; at each step the candidate clusterings are scored (e.g., 0.54 vs. 0.46, then 0.33 vs. 0.67) and only the most probable one is kept, yielding a single final partition]
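A Python sketch of the Local MAP procedure for binary stimuli. The Beta-Bernoulli predictive distribution used for p(x | cluster) and the parameter values (alpha, beta) are illustrative assumptions, not Anderson's exact settings; with these choices the final partition happens to match the illustration above.

```python
import numpy as np

def predictive(x, members, beta=1.0):
    """P(x | cluster) for binary features, assuming a Beta(beta, beta) prior
    on each feature's probability (an illustrative choice, not from the slides)."""
    members = np.asarray(members, dtype=float).reshape(-1, len(x))
    p1 = (members.sum(axis=0) + beta) / (len(members) + 2 * beta)   # P(feature = 1)
    return float(np.prod(np.where(np.asarray(x) == 1, p1, 1 - p1)))

def local_map(stimuli, alpha=1.0, beta=1.0):
    """Anderson's approximation: process stimuli in order, deterministically
    assigning each one to the cluster with the highest posterior probability."""
    clusters = []                                    # list of lists of stimuli
    for i, x in enumerate(stimuli, start=1):
        scores = [len(c) / (i - 1 + alpha) * predictive(x, c, beta) for c in clusters]
        scores.append(alpha / (i - 1 + alpha) * predictive(x, [], beta))  # new cluster
        k = int(np.argmax(scores))                   # deterministic MAP choice
        if k == len(clusters):
            clusters.append([x])
        else:
            clusters[k].append(x)
    return clusters

print(local_map([[1, 1, 1], [0, 1, 1], [1, 0, 0]]))
# -> [[[1, 1, 1], [0, 1, 1]], [[1, 0, 0]]] with these settings
```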

Page 62: Categorization and density estimation Tom Griffiths UC Berkeley

Two uses of Monte Carlo methods

1. For solving problems of probabilistic inference involved in developing computational models

2. As a source of hypotheses about how the mind might solve problems of probabilistic inference

Page 63: Categorization and density estimation Tom Griffiths UC Berkeley

Alternative approximation schemes

• There are several methods for approximating the posterior in DPMMs
  – Gibbs sampling
  – Particle filtering

• These methods provide asymptotic performance guarantees (in contrast to Anderson’s procedure)

(Sanborn, Griffiths, & Navarro, 2006)

Page 64: Categorization and density estimation Tom Griffiths UC Berkeley

Gibbs sampling for the DPMM

[illustration: starting from a partition of 111, 011, 100, each stimulus is reassigned in turn given the others (e.g., with probabilities 0.33/0.67 or 0.48/0.12/0.40), producing successive samples (Sample #1, Sample #2, …)]

• All the data are required at once (a batch procedure)

• Each stimulus is sequentially assigned to a cluster based on the assignments of all of the remaining stimuli

• Assignments are made probabilistically, using the full conditional distribution
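A sketch of one way to implement such a Gibbs sampler over cluster assignments, again using a Beta-Bernoulli predictive for binary features (an illustrative assumption):

```python
import numpy as np

def predictive(x, members, beta=1.0):
    """P(x | cluster) for binary features under a Beta(beta, beta) prior (assumption)."""
    members = np.asarray(members, dtype=float).reshape(-1, len(x))
    p1 = (members.sum(axis=0) + beta) / (len(members) + 2 * beta)
    return float(np.prod(np.where(np.asarray(x) == 1, p1, 1 - p1)))

def gibbs_sweep(stimuli, z, rng, alpha=1.0):
    """One sweep of Gibbs sampling: reassign each stimulus given all the others,
    sampling its cluster from the full conditional distribution."""
    n = len(stimuli)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        labels = sorted({z[j] for j in others})
        scores = []
        for k in labels:                                          # existing clusters
            members = [stimuli[j] for j in others if z[j] == k]
            scores.append(len(members) / (n - 1 + alpha) * predictive(stimuli[i], members))
        scores.append(alpha / (n - 1 + alpha) * predictive(stimuli[i], []))  # new cluster
        choice = rng.choice(len(scores), p=np.array(scores) / sum(scores))
        z[i] = labels[choice] if choice < len(labels) else max(z) + 1
    return z

rng = np.random.default_rng(0)
stimuli = [[1, 1, 1], [0, 1, 1], [1, 0, 0]]
z = [0, 0, 1]                          # starting partition
for _ in range(5):                     # each sweep yields one (correlated) sample
    z = gibbs_sweep(stimuli, z, rng)
    print(z)
```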

Page 65: Categorization and density estimation Tom Griffiths UC Berkeley

Particle filter for the DPMM

• Data are observed sequentially
• The posterior distribution at each point is approximated by a set of “particles”
• Particles are updated, and a fixed number of particles are carried over from trial to trial

[illustration: particles representing partitions of 111, 011, 100 are updated and resampled as each stimulus arrives, yielding Sample #1 and Sample #2]
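A sketch of a simple particle filter for the same assignment problem (an illustration under the same Beta-Bernoulli assumption; many variants exist):

```python
import numpy as np

def predictive(x, members, beta=1.0):
    """P(x | cluster) for binary features under a Beta(beta, beta) prior (assumption)."""
    members = np.asarray(members, dtype=float).reshape(-1, len(x))
    p1 = (members.sum(axis=0) + beta) / (len(members) + 2 * beta)
    return float(np.prod(np.where(np.asarray(x) == 1, p1, 1 - p1)))

def particle_filter(stimuli, n_particles=100, alpha=1.0, seed=0):
    """Each particle is a vector of cluster assignments for the stimuli seen so far.
    New assignments are sampled from their conditional distribution, particles are
    weighted by how well they predicted the new stimulus, and then resampled."""
    rng = np.random.default_rng(seed)
    particles = [[] for _ in range(n_particles)]
    for i, x in enumerate(stimuli, start=1):
        weights = np.zeros(n_particles)
        for p, z in enumerate(particles):
            n_clusters = max(z) + 1 if z else 0
            scores = []
            for k in range(n_clusters):                           # existing clusters
                members = [stimuli[j] for j, zj in enumerate(z) if zj == k]
                scores.append(len(members) / (i - 1 + alpha) * predictive(x, members))
            scores.append(alpha / (i - 1 + alpha) * predictive(x, []))   # new cluster
            weights[p] = sum(scores)                              # p(x | this particle)
            z.append(int(rng.choice(len(scores), p=np.array(scores) / sum(scores))))
        # carry over a fixed number of particles, resampled in proportion to weight
        idx = rng.choice(n_particles, size=n_particles, p=weights / weights.sum())
        particles = [list(particles[j]) for j in idx]
    return particles

parts = particle_filter([[1, 1, 1], [0, 1, 1], [1, 0, 0]], n_particles=100)
print(sum(z[0] == z[1] for z in parts) / len(parts))   # how often 111 and 011 co-cluster
```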

Page 66: Categorization and density estimation Tom Griffiths UC Berkeley

Approximating the posterior

• For a single order, the Local MAP will produce a single partition

• The Gibbs sampler and particle filter will approximate the exact DPMM distribution

[illustration: the distributions over partitions of 111, 011, 100 produced by each method]

Page 67: Categorization and density estimation Tom Griffiths UC Berkeley

Order effects in human data

• The probabilistic model underlying the DPMM does not produce any order effects
  – follows from exchangeability
• But… human data show order effects (e.g., Medin & Bettger, 1994)

• Anderson and Matessa tested local MAP predictions about order effects in an unsupervised clustering experiment

(Anderson, 1990)

Page 68: Categorization and density estimation Tom Griffiths UC Berkeley

Anderson and Matessa’s Experiment

• Subjects were shown all sixteen stimuli that had four binary features

• Front-anchored ordered stimuli emphasized the first two features in the first eight trials; end-anchored ordered stimuli emphasized the last two

Front-Anchored Order: scadsporm, scadstirm, sneksporb, snekstirb, sneksporm, snekstirm, scadsporb, scadstirb

End-Anchored Order: snadstirb, snekstirb, scadsporm, sceksporm, sneksporm, snadsporm, scedstirb, scadstirb

Page 69: Categorization and density estimation Tom Griffiths UC Berkeley

Anderson and Matessa’s Experiment

Proportion that are divided along a front-anchored feature:

                         Front-Anchored Order    End-Anchored Order
Experimental Data        0.55                    0.30
Local MAP                1.00                    0.00
Particle Filter (1)      0.59                    0.38
Particle Filter (100)    0.50                    0.50
Gibbs Sampler            0.48                    0.49

Page 70: Categorization and density estimation Tom Griffiths UC Berkeley

A “rational process model”

• A rational model clarifies a problem and serves as a benchmark for performance
• Using a psychologically plausible approximation can change a rational model into a “rational process model”

• Research in machine learning and statistics has produced useful approximations to statistical models which can be tested as general-purpose psychological heuristics

Page 71: Categorization and density estimation Tom Griffiths UC Berkeley

Summary

• Traditional models of the cognitive processes involved in categorization can be reinterpreted as rational models (via density estimation)

• Prototypes vs. exemplars is about schemes for density estimation (and representations)

• Nonparametric Bayes lets us explore options between these extremes (as well as some new models)
  – all of these models are instances of the “hierarchical Dirichlet process”

(Griffiths, Canini, Sanborn, Navarro, 2007)

• Monte Carlo provides hypotheses about how people address the computational challenges

Page 72: Categorization and density estimation Tom Griffiths UC Berkeley