Models of the primary visual cortex
Part of the course Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
1
Contents
0. Basic terminology and properties of the visual system
1. Simple cells and linear models
2. Nonlinearities in simple cells
3. Complex cells and complex Gabor functions
4. Inhibition and normalization
5. Why are the receptive fields as they are?
6. Introduction to statistical models of natural images
7. Principal component analysis
8. Sparse coding
9. Independent component analysis
10. Basic dependencies between components
11. Complex cells and estimation of energy detectors
12. Conclusion
2
0.
Basic terminology and properties of the visual system
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
3
Early visual system
• From the retina to the visual cortex
4
The visual cortex
• Many different areas, most relatively unknown
5
Neurons
• Information transmitted between neurons by spikes (action
potentials)
• Activity of a neuron measured by emitted spike rate
= average number of spikes per second
• Neurons gather input signals from other neurons, and compute their
output mainly based on them.
• Spontaneous firing rate is not zero, but often defined as zero
(baseline)
6
Contrast and selectivity
• Contrast describes intensity differences in the stimulus. Important
because:
– The visual system is specialised in detecting intensity differences.
– Most important visual information may be conveyed by intensity
differences.
• Most cells in sensory areas are selective to (or “tuned” for) certain
stimulus properties
– Only active (firing rate increased) when the input stimulus has
those properties
– Most of the time they are not active.
7
1.
Simple cells and linear models
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
8
Simple and complex cells
• Basic dichotomy of neurons in the primary visual cortex
• Relatively simple vs. less simple response characteristics
• The division into two groups may not be so clear in practice
9
Receptive fields in simple cells
• RF means the area where contrast changes the neuron’s firing rate
• “On” / “off” areas: positive or negative contrast increases firing rate
10
Receptive fields in simple cells (2)
• Some stimuli and the spikes they elicit
11
Simple cell responses
• For certain stimuli (input), simple cells increase their firing rate
(excitation).
• Simple cells respond strongly to bars and edges, not single dots.
• Receptive fields (RF) small, localized.
• Contrast of “wrong” sign decreases firing rate (inhibition)
• Selective for orientation: respond mainly to edges/bars of given
orientation
• Selective for frequency/scale
• response ∝ contrast
12
Linear model of responses
• Input image gray-scale value (0=gray) I(x,y)
• RF weight parameters W (x,y)
• Response modelled as weighted sum ∑x,y I(x,y)W (x,y)
• Alternative notations: ∫ I(x,y)W(x,y) dx dy or ⟨I, W⟩
• Alternative terminology: correlation, dot-product, inner product
• If neuron is linear, its weights can be computed by observing
responses to different stimuli, using well-known statistical theory.
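As a concrete illustration (an added sketch, not part of the original slides), the linear response model as a numpy inner product; the example patch and RF values below are arbitrary:

```python
import numpy as np

def linear_response(I, W):
    """Linear model of a simple cell: response = sum over x,y of I(x,y) * W(x,y)."""
    return np.sum(np.asarray(I, dtype=float) * np.asarray(W, dtype=float))

# Example: a tiny 3x3 patch and a matching horizontal edge-like RF
patch = np.array([[ 1.0,  1.0,  1.0],
                  [ 0.0,  0.0,  0.0],
                  [-1.0, -1.0, -1.0]])
rf = patch.copy()
print(linear_response(patch, rf))   # strong positive response when patch matches the RF
```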
13
A linear receptive field
• Measurement from V1:
• How to describe this kind of RF mathematically?
14
Fourier Analysis
15
Fourier series
• Basic observation: Neurons are selective to frequency, so we need
Fourier analysis
• Express a 1-D function f (x),x ∈ [−π,π] as a sum of frequency
components
f(x) = a0 + ∑k≥1 [ak cos(kx) + bk sin(kx)]    (1)
• Decompose the function into “coarse” and “fine” parts:
16
Fourier series (cntd)
• The coefficients are computed as the inner products
ak = (1/π) ∫_{−π}^{π} f(x) cos(kx) dx    (2)
bk = (1/π) ∫_{−π}^{π} f(x) sin(kx) dx    (3)
[But a0 = (1/2π) ∫_{−π}^{π} f(x) dx]    (4)
• Squared sum (energy) of Fourier coefficients ak² + bk² gives the
“strength” of the frequency k in the function f .
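A hedged numerical sketch (added, not from the slides) of approximating the coefficients in (2)-(3) by discretizing the integrals; the square-wave example is only for illustration:

```python
import numpy as np

def fourier_coefficients(f, k_max, n=2048):
    """Approximate a_k, b_k of f on [-pi, pi] by Riemann sums of eqs. (2)-(3)."""
    x = np.linspace(-np.pi, np.pi, n, endpoint=False)
    dx = 2 * np.pi / n
    fx = f(x)
    a0 = np.sum(fx) * dx / (2 * np.pi)
    a = np.array([np.sum(fx * np.cos(k * x)) * dx / np.pi for k in range(1, k_max + 1)])
    b = np.array([np.sum(fx * np.sin(k * x)) * dx / np.pi for k in range(1, k_max + 1)])
    return a0, a, b

# Example: a square wave has only odd sine components, b_k = 4/(pi*k)
a0, a, b = fourier_coefficients(np.sign, k_max=5)
print(np.round(b, 3))   # approximately [1.273, 0, 0.424, 0, 0.255]
```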
17
Example of frequency bands
• Original image, and its low- and high-frequency parts
18
Gabor Analysis
19
Gabor functions
• Receptive fields are localized (“small”) which is not true of Fourier
analysis
• Gabor functions are a local Fourier analysis.
• Simple way of obtaining a W so that the receptive fields have the
basic properties of simple cells.
• Selectivity for frequency: Fourier analysis with sinusoids.
• Selectivity for location: do the analysis in a small area using a
windowing function, W zero elsewhere.
(Figure: windowing function × sinusoid = Gabor function)
20
One-dimensional Gabor functions
• Definition of two functions:
g1(x; α, β, γ, x0) = exp(−α²(x − x0)²) cos(2πβ(x − x0) + γ)
g2(x; α, β, γ, x0) = exp(−α²(x − x0)²) sin(2πβ(x − x0) + γ)    (5)
– α determines the width of the function in space.
– x0 defines the center (location).
– β gives the frequency of oscillation.
– γ (often 0) is the phase of the oscillation.
21
One-dimensional Gabor functions (cntd)
• Examples:
22
Two-dimensional Gabor functions
• Selectivity for orientation: Fourier analysis in one orientation only.
• Take a 1-D Gabor function along one dimension and multiply it by a
gaussian envelope in the other dimension:
g2d(x,y) = exp(−α2²(y − y0)²) g1(x)    (6)
• Rotate by any angle to obtain different orientations
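A minimal sketch (added, not from the slides) of constructing such a rotated two-dimensional Gabor receptive field; the function name gabor_2d and all parameter values are illustrative choices, not part of the course material:

```python
import numpy as np

def gabor_2d(size, alpha, alpha2, beta, gamma=0.0, theta=0.0):
    """2-D Gabor RF: gaussian envelope times a sinusoid, rotated by angle theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # rotate coordinates so that the oscillation runs along xr
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-alpha**2 * xr**2 - alpha2**2 * yr**2)
    return envelope * np.cos(2 * np.pi * beta * xr + gamma)

# Example: a 21x21 RF tuned to 45-degree orientation, about 0.1 cycles/pixel
W = gabor_2d(size=21, alpha=0.15, alpha2=0.15, beta=0.1, theta=np.pi / 4)
```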
23
What kind of parameters are good?
• The values for the parameters can be determined from single-cell
recordings from V1: different methods give somewhat different
results
• But some basic properties seem to be there most of the time
• The gaussian envelope is either circular (α = α2), or a bit shorter in
the orientation of the oscillation
• There are usually not many oscillations inside the envelope, just 2-3
different regions.
• So, the Gabor models are not that far from the intuitive idea of bars
and edges.
24
Spatio-temporal RFs
• Previously, the receptive fields were static, purely spatial.
• In reality, change of stimulus in time is important
• Spatio-temporal receptive fields: Neurons may respond only when
there is movement of a bar in a given direction
• Linear model has weights that change with respect to time: W (x,y, t)
• Some cells are selective to direction of motion and speed: best
stimulus is an edge moving with a certain speed over the receptive
field.
• Directional selectivity possible if W is “inseparable”, i.e. “oblique” in
the x-y-t -space
25
A spatio-temporal RF
26
Colour and stereopsis
• Some simple cells respond to colour and not general luminance (e.g.
borders between red and green)
• Colour vision can still be modelled by linear filters, if input is RGB
values (red, green and blue light intensities).
• Stereopsis: 3-D structure can be inferred because the two eyes give
views from slightly different angles
• Stereopsis can be modelled by couples of linear filters that are
slightly different for the input from the two eyes (binocular cells) or
respond only to input from one eye (monocular cells).
• Conclusion: linear models offer a flexible tool that can model many
kinds of selectivities.
27
2.
Nonlinearities in simple cells
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
28
Problem of negative responses
• In the linear model, responses can be positive or negative
• In reality, firing rate cannot decrease much ⇒ asymmetry
• Negative outputs of linear model are not observed in full.
• Correct model by a half-wave rectifying nonlinearity
• y = max(0,x)
• Every linear filter then corresponds to two cells, one for the positive part and one for the negative part.
29
Problem of small responses
• When the linear model gives small outputs, no output (increase in
firing rate) is observed in simple cells.
• The cell is using a threshold:
• y = max(0,x− c)
30
Saturation
• Due to biological properties, neurons have a maximum firing rate
• But the linear model has no maximum response
• Correction: use a nonlinearity that saturates as well, i.e. has a
maximum
• y = min(d,max(0,x− c))
31
Final model for a single cell (so far)
• output is f (∑x,y I(x,y)W (x,y)), where
W(x,y) is a Gabor function, and
f is the nonlinearity f (u) = min(d,max(0,u− c))
• Smoother version: f(u) = d u² / (c′ + u²).
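Putting the stages together, a small illustrative sketch (added, not from the slides) of the simple cell model so far: a linear RF followed by the threshold/saturation nonlinearity or its smooth version. The constants c, d, c′ are arbitrary, and the RF W could be, e.g., a Gabor function as above:

```python
import numpy as np

def simple_cell_response(I, W, c=0.1, d=1.0):
    """Simple cell model: linear filtering followed by threshold and saturation."""
    u = np.sum(I * W)                      # linear stage: inner product with the RF
    return min(d, max(0.0, u - c))         # f(u) = min(d, max(0, u - c))

def simple_cell_response_smooth(I, W, c_prime=0.5, d=1.0):
    """Smoother version f(u) = d*u^2/(c' + u^2), applied here to the rectified output."""
    u = max(0.0, np.sum(I * W))
    return d * u**2 / (c_prime + u**2)
```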
32
3.
Complex cells and complex Gabor functions
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
33
Basic properties of complex cells
• Selective to frequency and orientation
• Not selective to phase (white/black edge/bar, illustrated below):
fundamental difference to simple cells
• Less selective to location than simple cells
• Responses to moving sinusoidal gratings are relatively constant
(simple cells: oscillating response).
• Basically, like the sum of responses of several simple cells.
34
Complex cells and drifting (moving) sinusoidal gratings
• A simple test of linearity to distinguish simple and complex cells
35
Complex notation for Fourier series
• Convenient (?) notation using complex exponents:
since by definition
exp(ix) = cos(x) + i sin(x)    (7)
we can write the Fourier coefficients as:
ak + i bk = (1/π) ∫ f(x) exp(ikx) dx
= (1/π) ∫ f(x) cos(kx) dx + i (1/π) ∫ f(x) sin(kx) dx    (8)
• The function ∫ f(x) exp(iξx) dx is called the Fourier Transform (N.B.
different definitions exist, with minus signs and multiplying
constants)
36
Modulus vs. phase
• Fourier Transform is a complex number, so it has modulus (absolute
value, “length”) and phase (direction/angle in complex plane)
• Modulus and phase each contain half of the information
• To compute the “power” of a frequency, you take the (square of the)
modulus of the Fourier transform
[∫ f(x) sin(αx) dx]² + [∫ f(x) cos(αx) dx]² = |∫ f(x) exp(−iαx) dx|²
37
Complex Gabor functions
• Definition as complex function:
g(x; α, β, x0) = exp(−α²(x − x0)²) exp(i2πβ(x − x0))
= exp(−α²(x − x0)²) [cos(2πβ(x − x0)) + i sin(2πβ(x − x0))]
• One complex Gabor function defines two functions: real part (cosine)
and imaginary part (sine).
• The two functions have same envelope and Fourier power, only
90 degree difference in phase (“quadrature-phase”).
38
Modulus of Gabors
• The local Fourier power can be obtained as square of the modulus of
the output of the complex Gabor filter:
|(1/π) ∫ f(x) [exp(−α²(x − x0)²) exp(i2πβ(x − x0))] dx|²    (9)
• Equal to the sum of the squares of the inner products with the two
real Gabor functions:
[(1/π) ∫ f(x) exp(−α²(x − x0)²) cos(2πβ(x − x0)) dx]²
+ [(1/π) ∫ f(x) exp(−α²(x − x0)²) sin(2πβ(x − x0)) dx]²    (10)
39
Invariance of Fourier power
• Modulus of Fourier coefficient (or its square, “Fourier power”) is
invariant to change in location:
equal for f (x) and f (x− c) where c is some constant.
• Proof: by the definition of the Fourier transform
∫ f(x − c) exp(−iξx) dx = ∫ f(y) exp(−iξy) dy · exp(−iξc)    (11)
(make the change of variable y = x − c)
Here, exp(−iξc) has constant modulus (one), and therefore does not
affect the modulus.
• Location shift is only visible in the phase of the Fourier transform.
• Likewise, global change in the phase of the input (multiplication by a
constant of modulus one) does not change the Fourier power.
40
Complex cells and complex Gabors
• Thus, modulus of the output of a complex Gabor filter has invariance
with respect to phase (sign black/white or edge/bar) and small
location shift.
• A complex Gabor function or filter is thus a valid model of the
complex cell output
• In this model, a complex cell pools (adds) nonlinearly the outputs of
two linear filters / four rectified linear filters / several simple cells.
• Could also pool several complex Gabors in nearby positions to get
more invariance.
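As an added sketch (not from the slides), the resulting complex cell model: the squared modulus of the complex Gabor output, i.e. the sum of squared outputs of two quadrature-phase linear filters. The gabor_2d helper from the earlier sketch is assumed:

```python
import numpy as np

def complex_cell_response(I, W_cos, W_sin):
    """Energy model: square of the modulus of the complex Gabor filter output."""
    return np.sum(I * W_cos)**2 + np.sum(I * W_sin)**2

# W_cos and W_sin would be two Gabor RFs with the same envelope and frequency but a
# 90-degree phase difference ("quadrature pair"), e.g. gabor_2d(...) with gamma=0
# (cosine phase) and gamma=-np.pi/2 (sine phase).
```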
41
4.
Inhibition and normalization
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
42
Inhibition
• An added stimulus (mask) that is orthogonal to the receptive field can
make the response of a simple cell smaller when it is excited
(inhibition).
(Figure: original stimulus, mask, and stimulus + mask)
43
Interaction of linear filters
• This inhibition cannot be explained by the previous models, since the
mask should have no effect on the linear filter stage, because the
mask is orthogonal to the receptive field.
• There must be some interaction between the linear filters (cells or
groups):
The outputs of some cells must reduce the outputs of others.
• Previously, the linear filters were considered “independent channels”,
but this must be modified.
44
Divisive normalization model
• Cells with receptive fields in (more or less) the same location inhibit each other.
• First, linear filters modelled as Li = ∑x,y I(x,y)Wi(x,y)
• These are “half-squared” (rectified): Ai = max(0, Li)².
• Divisive normalization: Simple cell outputs are given as:
Si = k Ai / (σ² + ∑j Aj)    (12)
• k, σ are constants.
• The sum in the denominator could be the sum of outputs of (non-normalized) complex cells.
• The sum is taken among linear filters that are more or less in the same location, but have different phases, frequencies, and orientations (including the filter i itself).
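A minimal sketch (added, not from the slides) of this normalization stage, assuming the linear filter outputs have already been collected into a vector; k and sigma are arbitrary illustrative constants:

```python
import numpy as np

def divisive_normalization(L, k=1.0, sigma=0.1):
    """Divisive normalization of simple cell outputs, eq. (12).

    L : array of linear filter outputs L_i for filters sharing (roughly) the same
        location but differing in phase, frequency and orientation.
    """
    A = np.maximum(0.0, L)**2                 # half-squared rectification
    return k * A / (sigma**2 + np.sum(A))     # each output divided by the pooled activity

# Example: one strongly driven filter among weakly driven ones
print(divisive_normalization(np.array([2.0, 0.1, -0.3, 0.2])))
```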
45
Example
• The normalization model includes thresholding and saturation: if
only the neuron in question is stimulated, the nonlinearity after the
linear filtering is (solid line):
Horizontal: contrast, Vertical: cell output
• Dashed line shows the effect with some stimulation of other
normalizing cells.
46
BTW: What is a receptive field?
• Area where W(x,y) ≠ 0 in linear model is called “classical” RF.
• A white or black dot in the classical RF is sufficient to increase firing
rate (if strong enough contrast).
• Inhibitory effects come from an area that is larger: “non-classical
RF”, “extended RF”.
• Outside the classical RF, a white or black dot alone does not make the
cell fire, because normalization is divisive.
47
5. Why are the receptive fields as they are?
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
48
Descriptive and normative models / theories
• Descriptive models try to describe how a system behaves, e.g. linear
Gabor model of simple cells.
• Normative models try to show how the system should behave in order
to be optimal in some sense.
(Another meaning: what you should do to the system, e.g. in
economics)
• Normative models can often be used in biology based on the
hypothesis that evolution has brought the system close to optimality.
• The hypothesis can be quite wrong in some cases, but quite useful in
others.
• Normative models can give a deeper understanding of the system,
answering the question of why the system is as it is.
49
Theory 1: Optimal space-frequency localization
• Classic attempt of a normative theory for Gabor analysis.
• The (complex) Gabor function is localized both in space and in
frequency.
• Modulus of Fourier transform is a Gaussian kernel
• Compromise between single pixels, and sinusoids (Fourier basis):
Pixels localized in space only, Fourier basis localized in frequency
only
50
Theory 2: Contour detection
• Another goal attributed to visual neurons is to detect contours of
objects (edges or possibly bars).
• Simple cells do this separately for different orientations
• First step to object recognition, basic features of objects?
• Complex cells are less sensitive to small shifts/transformations in the
image
51
Are these theories any good?
• Problem: they give only vague and limited predictions.
• Many different kinds of RF’s are localized both in space and
frequency, and detect edges
• Also: why should the RF’s be localized in space and frequency, and
why should they detect edges?
⇒ These models are not really normative in the sense of using an
optimality criterion.
• Can we find a principle that explains the form and purpose of RF’s
better?
• Yes, see the rest of this course!
52
Theory 3: Statistical-ecological approach
• Ecology (situatedness): What is important in a real environment?
• Statistics: Natural images have statistical regularities.
• Logic:
1. Different sets of features (RF’s) are good for different kinds of
data.
2. The images that our eyes receive have certain statistical
properties.
3. The visual cortex has learned these statistical properties.
4. This enables optimal statistical inference and statistical signal
processing.
53
Theory 3: Statistical-ecological approach (cntd)
• In other words, can we “explain” receptive fields by basic statistical
properties of natural images?
• We will see below that we can, and we have a phenomenon of
“Emergence”:
many precise predictions from very few statistical assumptions.
• Statistical models of natural images (below) give the optimal RF’s in
a statistical sense
• In this case, the representation in V1 is dictated by the need for a
general-purpose code or statistical model, not a particular goal (in
contrast to texture recognition / contour detection)
54
6. Introduction to statistical models of natural images
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
55
Why statistical models?
• Input to visual system is noisy, fuzzy, uncertain, highly complex and
in many ways difficult.
• Modern framework to handle such data: Bayesian inference
• Statistical models of natural images incorporate the prior
information: they give the prior probability distribution to be used in
Bayesian inference.
• Bayesian inference combines the prior information with the incoming
input (observed signal).
• Prior tells us what (natural) images are typically like
• Here, we concentrate on finding such prior models, and how they
provide models of receptive fields.
56
Some examples of Bayesian inference
• Reduction of noise:
• Completion of contours
57
Linear statistical models of natural images
(Figure: image patch = s1 · basis vector 1 + s2 · basis vector 2 + · · · + sk · basis vector k)
• Model each image (or small part, patch) as a linear superposition of
basis vectors (features).
58
Linear statistical models of images (2)
• Denote by x = (x1, ...,xn) a vector that contains one image (patch),
e.g. gray-scale values of pixels (previously I(x,y)).
• We observe x many times, so we can consider it a random vector.
• Model x by a statistical model:
x = As = ∑i ai si   or   xj = ∑i aji si    (13)
• Image is superposition of basis vectors ai with random coefficients si
• Inverting the system: si = ∑ j wi jx j , we see that the si are linear filter
(simple cell) outputs.
• How to find suitable basis vectors by observing only x?
This is called unsupervised learning.
• Gives the “best” basis vectors from a statistical (Bayesian) viewpoint.
59
7. Principal component analysis
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
60
Principal component analysis
• There are different methods for unsupervised learning of a linear
representation.
• PCA is a classic method of estimating a statistically motivated
representation for multivariate data.
• Basic idea: find directions of maximum variance
61
PCA and projections of maximum variance
• In general, consider a random vector x = (x1,x2, ...,xn)T that has zero
mean: E{x}= 0 (i.e. the mean has been subtracted from each
variable).
• Goal: Find a linear combination wT x = ∑i wixi that has maximum
variance.
• We must constrain the norm of w: ‖w‖= 1, otherwise solution is that
w is infinite.
• We must solve:
max_{‖w‖=1} E{(wT x)²}    (14)
• We have
E{(wT x)²} = E{(wT x)(xT w)} = E{wT (xxT) w} = wT E{xxT} w
• A well-known problem in linear algebra.
62
Covariance matrix
• PCA is based on information in the covariance matrix C, i.e. the matrix whose elements are the covariances:
Cij = cov(xi, xj) = E{xi xj} − E{xi}E{xj}    (15)
• Because the variables have zero mean, we have simply:
C = E{xxT} (16)
• The covariance matrix of y = Mx equals MCMT .
• Near-by pixels in images have strong covariances:
(Figure: scatterplot of the gray-scale values of two neighbouring pixels)
63
Computing PCA
• Solution for
max_{‖w‖=1} wT C w    (17)
is given by the eigenvector of C that corresponds to the largest
eigenvalue. The eigenvalue decomposition is given by:
C = E diag(di) ET    (18)
where the columns of E are the eigenvectors, which are orthogonal.
• Eigenvalues are real because C is symmetric, and non-negative
because C is positive semidefinite.
64
Computing PCA (2)
• To compute more than one principal component, find the direction of
maximum variance which is orthogonal to the components previously
found. This is solved by the k-th eigenvector.
• Thus, PCA can be done by computing the eigenvalue decomposition
of the covariance matrix C
• columns of the matrix E give the directions of the principal
components (the basis vectors).
• The values di in the diagonal matrix show their contribution to the
variance.
• Ordering the columns according to the descending order of di gives
the first, second etc. principal components.
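A small sketch (added, not from the slides) of PCA via the eigenvalue decomposition of the covariance matrix, for data stored with one observation per row; the function name pca is a hypothetical helper, not course code:

```python
import numpy as np

def pca(X, n_components=None):
    """PCA of data matrix X (observations in rows, variables in columns)."""
    Xc = X - X.mean(axis=0)                      # remove the mean of each variable
    C = (Xc.T @ Xc) / Xc.shape[0]                # covariance matrix C = E{x x^T}
    d, E = np.linalg.eigh(C)                     # eigenvalues ascending, eigenvectors in columns
    order = np.argsort(d)[::-1]                  # sort by descending variance
    d, E = d[order], E[:, order]
    if n_components is not None:                 # optional dimension reduction
        d, E = d[:n_components], E[:, :n_components]
    return E, d                                  # principal directions and their variances

# Example with random data
E, d = pca(np.random.randn(1000, 5), n_components=2)
```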
65
PCA of image patches
160 first basis vectors given by PCA for 16×16 image patches.
These are like Fourier analysis, not like simple cells!
66
PCA and dimension reduction
• PCA is no good as a model of simple cell RF’s
• However, it is good for dimension reduction.
• Assume we have a very large number of random variables x1, . . . ,xm,
and computations that use all the variables would be too burdensome.
• Let’s linearly transform the variables into a smaller number of
variables z1, . . . ,zn:
zi = ∑_{j=1}^{m} wij xj,   for all i = 1, ..., n    (19)
• How to do this so that we preserve maximum amount of information
(i.e. error when we reconstruct the original data xi linearly from the
z j is minimized)?
• Solution: Take n first principal components
67
PCA, decorrelation, and whitening
• PCA is also useful for decorrelating and whitening the data.
• Decorrelation means we make a linear transformation y = Wx so that
the yi are uncorrelated, i.e.
cov(yi, yj) = E{yi yj} = 0 for i ≠ j    (20)
(we assume that the mean is zero)
• Basic result: principal components are uncorrelated!
• Often we need a normalized form of decorrelation, called whitening
or sphering, where E{yyT} = I, i.e. the variables yi are uncorrelated
and have unit variance E{yi²} = 1.
• We can whiten x by simply dividing the principal components by
their standard deviations.
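A sketch (added, not from the slides) of whitening by PCA: the components are decorrelated and divided by their standard deviations. V below is the symmetric ("ZCA") form of the whitening matrix, one of many valid choices:

```python
import numpy as np

def whiten(X, eps=1e-10):
    """Whiten X: decorrelate with PCA and scale each component to unit variance."""
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / Xc.shape[0]
    d, E = np.linalg.eigh(C)
    V = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T    # symmetric whitening matrix
    Z = Xc @ V.T
    return Z, V                                      # E{z z^T} is (close to) the identity

Z, V = whiten(np.random.randn(1000, 5))
print(np.round(np.cov(Z, rowvar=False), 2))          # approximately the identity matrix
```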
68
PCA and gaussianity
• What is missing in PCA?
• In PCA and whitening, it is assumed that the only interesting aspect
of the data is variances and covariances.
• This is the case with gaussian data whose pdf is:
p(x) = 1 / ((2π)^{n/2} |C|^{1/2}) · exp(−(1/2) xT C⁻¹ x)    (21)
where C is the covariance matrix of the data.
• Thus, the probability distribution is completely characterized by the
covariances (and the means that are assumed zero here).
• So, for gaussian data, we can’t do much more than PCA.
• But what if the data is nongaussian? That’s the theme of the rest of this
course.
69
8. Sparse coding
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
70
Nongaussianity in image data
• For gaussian data, all we can do is to analyze covariances and do
something like PCA.
• Fortunately, image data is very nongaussian.
• Histogram of outputs of a Gabor-like filter with natural image input,
compared with gaussian pdf of same variance:
71
Sparseness
• What kind of nongaussianity can we see in image data?
• Sparseness is a form of nongaussianity (higher-order structure) often
encountered in natural signals
• Sparseness means that a random variable is “active” only rarely
(Figure: sample time courses of a gaussian and a sparse variable. These variables have the same variance.)
72
Sparseness (2)
• A random variable is sparse if its density has heavy tails, and a peak
at zero.
• Classic sparse pdf: Laplacian density (standardized to unit variance):
p(s) = (1/√2) exp(−√2 |s|)    (22)
73
Kurtosis
• Classic measure of sparseness.
• Definition:
kurt(s) = E{s⁴} − 3 (E{s²})²    (23)
• Depends on the variance as well, so we have to fix the variance
(to unity) to make this a proper measure of sparseness
• If variance constrained to unity, essentially 4th moment.
• Simple algebraic properties (for independent s1 and s2):
kurt(s1 + s2) = kurt(s1) + kurt(s2)    (24)
kurt(αs1) = α⁴ kurt(s1)    (25)
74
Why kurtosis is not optimal
• Sensitive to outliers:
Consider a sample of 1000 values with unit var, and one value equal
to 10.
Kurtosis equals at least 10⁴/1000 − 3 = 7.
• Especially for image data, other measures should be used.
• Consider measures of the form
E{h(s²)}    (26)
where h is some nonlinear function, and s standardized to zero mean
and unit variance.
• Sparseness measures obtained when h is convex.
75
Convexity
• Convexity means that the graph of the function is always below the line segment that connects two points on the graph.
• Can be expressed as
h(αx1 + (1−α)x2) < αh(x1) + (1−α)h(x2),   for 0 < α < 1    (27)
• This is true if the second derivative of h is positive.
• Why convexity? Expectation of a convex function has a large value if the data is concentrated in the extremes, in this case near zero and very far from zero.
76
Robust sparseness measures
• Kurtosis is obtained by setting h(u) = u².
• By choosing a different h, we can obtain a measure that is robust (not
sensitive) to outliers.
• Take a function that does not grow fast when going far from the
origin, for example:
h2(u) = −√u    (28)
• In other words, sparseness measure takes the form
E{√(s²)} = E{|s|}    (29)
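A small sketch (added, not from the slides) of both measures estimated from a sample, with the expectations replaced by sample averages:

```python
import numpy as np

def kurtosis(s):
    """kurt(s) = E{s^4} - 3 (E{s^2})^2, estimated from a sample."""
    s = np.asarray(s, dtype=float)
    return np.mean(s**4) - 3 * np.mean(s**2)**2

def robust_sparseness(s):
    """Robust measure based on h(u) = -sqrt(u): returns E{-|s|} on standardized data
    (larger value = sparser)."""
    s = np.asarray(s, dtype=float)
    s = (s - s.mean()) / s.std()          # standardize to zero mean, unit variance
    return -np.mean(np.abs(s))

rng = np.random.default_rng(0)
print(kurtosis(rng.normal(size=100000)))     # close to 0 for gaussian data
print(kurtosis(rng.laplace(size=100000)))    # clearly positive for sparse data
```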
77
Maximally sparse features of natural images
• Maximizing sparseness for natural image patches we get
• These are like Gabor functions and simple cell RF’s!
– Localized in space, frequency, orientation
– Right aspect ratio (length vs. width)
– Suitable number of oscillations
• Good normative model of simple cell receptive fields!
• (Of course, they are a bit noisy because they are learned from real
data.)
78
Tuning curves
(Figure: tuning curves of the learned features.)
Change one of the parameters of an optimal grating stimulus and plot the response as a function of that parameter.
Left: frequency. Middle: orientation. Right: phase.
79
How is sparseness useful?
• The central idea here: it is useful to find good statistical models,
needed in Bayesian inference
• But in addition: firing of cells consumes energy. Sparse coding is
energy-efficient.
• A fortunate coincidence !?
80
9. Independent component analysis
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
81
Problems with maximization of sparseness
• The choice of the sparseness measure was rather ad hoc.
• How to learn a set of features?
• How do we get a prior distribution? Just finding maximally sparse
features does not give us a probability distribution.
• To solve these problems, we introduce independent component
analysis (ICA).
82
Historical background: Blind source separation
ICA was first developed to solve the source separation problem.
Assume we have four “source signals”:
Due to some external circumstances, only linear mixtures of the
source signals are observed.
Estimate (separate) original signals!
83
Blind source separation (2)
• Use only information on statistical independence to recover:
These are the “independent components”!
84
Independent Component Analysis (ICA)
• In ICA, we approach the problem from statistical model estimation:
• Observed random vector x is modelled by a linear latent variable
model
xi = ∑_{j=1}^{m} aij sj,   i = 1, ..., n    (30)
or in matrix form:
x = As (31)
where
– The ’mixing’ matrix A is constant (a parameter matrix).
– The si are latent random variables called the independent components.
– Estimate both A and s, observing only x.
• In the end, very similar to image representation problem!
85
Basic properties of the ICA model
• Must assume:
– The si are mutually statistically independent
– The si are nongaussian.
– For simplicity: The matrix A is square.
• The si are defined only up to a multiplicative constant.
• The si are not ordered.
86
Statistical independence
• Random variables y1 and y2 are independent if information on the
value of y1 does not give any information on the value of y2, and vice
versa.
• Formal definition: p(y1,y2) = p1(y1)p2(y2) i.e. the joint probability
density is the product of the individual (marginal) probability
densities.
• If yi and yj are independent, any nonlinear transformations of them are
uncorrelated:
cov(g1(yi), g2(yj)) = E{g1(yi) g2(yj)} − E{g1(yi)}E{g2(yj)} = 0
for any two functions g1 and g2.
• Thus, independence is a much more general property than
uncorrelatedness.
87
Whitening is not enough for ICA
• Independent random variables are uncorrelated. So, why not estimate ICA by whitening? Find a linear transformation that whitens the data.
• Are whitened variables necessarily equal to the si? NO! Whitening can be done with many different matrices (e.g. PCA).
• Any orthogonal transformation of a white y is white as well. Denote by U any orthogonal matrix. Then z = Uy is also white: E{(Uy)(Uy)T} = U E{yyT} UT = UUT = I
• So, whitening gives us the independent components only up to an orthogonal transformation. In other words, after whitening, we can consider the mixing matrix A to be orthogonal:
E{xxT} = I = A E{ssT} AT = AAT    (32)
• Another viewpoint: Whitening uses only the covariance matrix: ≈ n²/2 equations, but A has n² elements.
88
Gaussian data is no good
• ICA is not possible if the data is gaussian, because PCA and
whitening exhaust all the information in the data.
• The key to estimating the ICA model is nongaussianity.
• For nongaussian data, independence gives much more information
than uncorrelatedness and ICA is possible.
• In fact, uncorrelatedness implies independence in the gaussian case.
• But in the general case, independence implies uncorrelatedness of any
nonlinear transformations (see above). In fact, nongaussianity
(“higher-order structure”) enables estimation of ICA model.
89
Illustration of whitening of nongaussian data
• Assume that there are just two pixels, and the si have uniform
distributions (this is not true at all, but it makes a simple illustration):
Distributions of:
independent components si, pixels xi, whitened pixels.
90
Whitening of gaussian data
• Now assume the si have gaussian distributions:
Distributions of:
independent components si, pixels xi, whitened pixels.
Whitening gives the original distribution!
91
Basic intuitive principle of ICA estimation.
• Instead of variance, look at other properties of the linear combination
wT x = ∑i wixi.
• In fact, we ignore here the variance completely by constraining wT x to have constant variance (equal to one).
• Note that the linear combination is also a linear combination of the
independent components: ∑j wj xj = wT x = ∑j qj sj = qT s
• Our approach is motivated by the Central Limit Theorem, a
fundamental theorem in probability theory.
92
Central limit theorem (CLT)
• Basic idea: average of many independent random variables will have
a distribution that is close(r) to gaussian
• In the limit of an infinite number of random variables, the distribution
tends to gaussian:
lim_{N→∞} (1/N) ∑_{n=1}^{N} sn = gaussian    (33)
• Some technical restrictions are necessary for this results to hold
exactly. (E.g. the sn all have the same distribution which has finite
moments.)
93
Application of CLT to ICA
• CLT says something like: mixture of independent components is
more gaussian than the original independent components (at least if
they all have the same distribution.)
• That is: qi si + qj sj is more gaussian than si, for qi, qj ≠ 0
• Maximizing the nongaussianity of qT s, we can find si.
94
Illustration of changes in nongaussianity (1)
Marginal and joint densities, uniform distributions.
Marginal and joint densities, whitened mixtures of uniform ICs
95
Illustration of changes in nongaussianity (2)
Marginal and joint densities, supergaussian distributions.
Whitened mixtures of supergaussian ICs
96
Kurtosis as nongaussianity measure.
• Problem: how to measure nongaussianity?
• Classic solution: take square (or absolute value) of kurtosis
• zero for gaussian random variable, non-zero for most nongaussian
random variables.
• positive vs. negative kurtosis have typical forms of pdf.
97
Classic super- and subgaussian densities
Left: Laplacian pdf (1/√2) exp(−√2 |x|), positive kurt (“supergaussian”).
Right: Uniform pdf, negative kurt (“subgaussian”).
(Gaussian probability density function given for comparison as
dashed line)
98
The extrema of kurtosis give independent components
• by the properties of kurtosis:
kurt(wT x) = kurt(qT s) = q1⁴ kurt(s1) + q2⁴ kurt(s2)    (34)
• constrain variance to equal unity
E{(wT x)²} = E{(qT s)²} = q1² + q2² = 1    (35)
• for simplicity, consider kurtoses equal to one.
• maxima of kurtosis give independent components (see next slides)
• general result: absolute value of kurtosis is maximized when wT x = si
• Note: extrema are orthogonal due to whitening.
99
Optimization landscape for kurtosis
Thick curve is unit sphere, thin curves are contours where kurtosis is
constant.
100
Illustration of optimization of kurtosis
Kurtosis as a function of the direction of projection. For positive
kurtosis, kurtosis (and its absolute value) are maximized in the
directions of the independent components.
101
ICA and sparseness
• Now we see that ICA and sparse coding are very closely related.
• Maximization of sparseness is the same as maximization of
nongaussianity for sparse independent components
– E.g. max of square of kurtosis is max of kurtosis for positive
kurtosis
• Now we have a properly defined probability model and distribution to
be used in Bayesian inference
– Estimate the marginal distributions of the si, or approximate them
by well-known distribution (Laplacian), and you have a pdf.
• But how does ICA help with the other problems we had with sparse
coding?
• Answer lies in formulating maximum likelihood estimation for ICA.
102
Maximum likelihood estimation of ICA
• Maximum likelihood: find parameter values that give maximum
probability for the observed data sample x(1), . . . ,x(T ).
• Assume data is whitened, so A can be constrained to be orthogonal
• Log-pdf can be formulated as:
log p(x(1), ..., x(T)) = ∑_{t=1}^{T} ∑_{i=1}^{n} log psi(aiT x(t))    (36)
where psi is pdf of each independent component, and ai is one
column of A (one basis vector).
• This is sum of sparsenesses!
103
Maximum likelihood estimation of ICA (2)
• Now we know how to estimate a set of features:
– Maximize this sum of sparsenesses under orthogonality constraint
• An optimal measure of sparseness (according to this criterion) is
given by
h(si²) = log psi(si)    (37)
Typically, taking h(u) = −√u is not far from the truth
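A hedged toy sketch (added, not from the slides, and not the FastICA algorithm) of this estimation principle: maximize the sum of sparsenesses, with h(u) = −√u, over an orthogonal filter matrix on whitened data by projected gradient ascent:

```python
import numpy as np

def ica_sparseness(Z, n_iter=200, lr=0.1, seed=0):
    """Toy ICA on whitened data Z (samples in rows): maximize sum of E{h(s^2)},
    h(u) = -sqrt(u), over an orthogonal filter matrix W (filters in rows)."""
    rng = np.random.default_rng(seed)
    n = Z.shape[1]
    W = np.linalg.qr(rng.standard_normal((n, n)))[0]     # random orthogonal start
    for _ in range(n_iter):
        S = Z @ W.T                                      # component estimates s_i = w_i^T x
        grad = -(np.sign(S).T @ Z) / Z.shape[0]          # gradient of (1/T) sum_t -|s_i(t)|
        W = W + lr * grad                                # ascent step
        U, _, Vt = np.linalg.svd(W)                      # project back onto orthogonal matrices
        W = U @ Vt
    return W

# Example: unmix whitened mixtures of two Laplacian (sparse) sources
rng = np.random.default_rng(1)
S = rng.laplace(size=(5000, 2))
X = S @ rng.standard_normal((2, 2)).T                    # observed mixtures
Xc = X - X.mean(0)
d, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ (E / np.sqrt(d)) @ E.T                          # whiten the mixtures
W = ica_sparseness(Z)
```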
104
Independent Component Analysis of image patches
ICA basis vectors for image patches, after reducing dimension with
PCA. (Note that basis vectors are not quite the same as RF’s)
105
Sparse coding
• Sparse coding properly speaking means: For random vector x, find
linear representation:
x = As (38)
so that the components si are as sparse as possible.
• Important property: a given data point is represented using only a
limited number of “active” (clearly non-zero) components si.
• In contrast to PCA, the active components change from one image patch
to another.
• Cf. vocabulary of a language which can describe many different
things by combining a small number of active words.
106
10. Inhibition and normalization
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
107
Dependence of Gabor filters
• The “independent components” of images are not really independent.
• The statistical structure of image data is much more complicated (of
course).
• In fact, independent components can not be found for most kinds of
data.
• More sophisticated statistical models of images consider some
dependencies of the simple cell outputs.
108
Correlation of squares
• What is the most important form of dependencies that remains after
linear ICA transformation?
• Answer: the squares of simple cell outputs si are correlated (i.e. their
covariance is not zero).
• This means “simultaneous activation”.
• General activity levels are correlated.
109
Illustration of correlation of squares
Two signals that are uncorrelated but whose squares are correlated.
110
A statistical model of square correlations
• Assume that the variances of the components change from patch to patch:
I(x,y) = ∑_{i=1}^{m} Ai(x,y) (v si) = v ∑_{i=1}^{m} Ai(x,y) si    (39)
• Here, v is a random variable.
• Doing ICA, what we observe is components of the form vsi, all
multiplied by v.
• In this model, the components vsi have square correlations. The
variable v implies “simultaneous activation”.
111
Normalization as model estimation
• We want to get rid of v. We can estimate v and divide the image patch
by v:
I(x,y) ← I(x,y) / v    (40)
• A very simple estimator:
v = c √( ∑x,y I(x,y)² )    (41)
• Rather similar to a divisive normalization model!
• Normalization can be justified by structure of natural images.
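A minimal sketch (added, not from the slides) of this normalization step for a single patch, under the assumption that the estimator (41) is a scaled norm of the patch; c and the small constant that guards against division by zero are arbitrary:

```python
import numpy as np

def normalize_patch(I, c=1.0, eps=1e-8):
    """Divide an image patch by an estimate of its overall contrast level v."""
    I = np.asarray(I, dtype=float)
    v = c * np.sqrt(np.sum(I**2))          # estimate of v, eq. (41)
    return I / (v + eps)                   # I(x,y) <- I(x,y) / v, eq. (40)
```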
112
Physical interpretation
• The same scene can be observed under very different lighting
conditions: daylight, dusk, indoors. Single objects can receive
different amounts of light.
• The light coming onto the retina is a function of the reflectances of
the surfaces in the scene (R), and the light (illumination) level (L):
I(x,y) = L(x,y)R(x,y) (42)
• Changes in the illumination can change the observed data drastically.
113
Normalization models and natural images (2)
• Basically, one is interested in the objects (R), and not in the
illumination conditions (L)
• Estimation of R can be accomplished by estimating L (similar to v a
couple of slides ago), and dividing I by the estimate
• This is possible because L is more or less constant for small image
patches.
• In reality, such operations are done in several parts of the visual
system:
– Retinal cells adapt to the mean luminance
– Cells in V1 adapt to the mean contrast
114
Another viewpoint on normalization
• Contrasts can have very different values in different contexts, but
neurons have a limited output range due to saturation (and threshold)
• Problem: in high-contrast environments, the cell outputs could be
saturated almost all the time, in low-contrast environments they could
be zero most of the time.
• Divisive normalization looks at the general contrast level, and
normalizes the cell outputs so that they transmit more information,
staying in the useful range.
115
11. Complex cells and estimation of energy detectors
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
116
Dependence of Gabor filters after normalization
• Even after divisive normalization, the outputs of simple cell models
are not independent.
• A nice thing, we can hope to model further properties!
• Basic models group components (cells) by combining two ideas:
– Division of components into subspaces, and
– Spherically symmetric distributions.
117
Division into independent subspaces
• A very basic approach to modelling dependencies.
• Assumption: the si can be divided into groups of n components, such
that
– the si in the same group may be dependent on each other
– dependencies between different groups are not allowed.
• Every group corresponds to a subspace (spanned by the
corresponding basis vectors)
• We also need to specify the distributions inside the subspaces...
118
Invariant-feature subspaces
• An abstract approach to representing invariant features:
Generalization of the idea of complex cell models.
• Linear filters (like in ICA) necessarily lack any invariance.
• Principle: invariant feature is a linear subspace in a feature space.
• The value of the invariant feature is given by the norm of the projection
on that subspace:
√( ∑_{i=1}^{k} ( ∑x,y Wi(x,y) I(x,y) )² )    (43)
• (Sum of) squares is also called “energy” especially in Fourier
analysis, so these are “energy detectors”
119
Graphical illustration
(Figure: network diagram of an energy detector: the linear filter outputs ⟨wi, I⟩ are squared and summed.)
120
Independent Subspace Analysis (ISA)
• Combination of independent subspaces and invariant-feature subspaces.
• The probability density inside each subspace is spherically symmetric, i.e. depends only on the norm of the projection (the value of the invariant feature).
• The invariant features have sparse distributions, i.e. exponential.
• For whitened data, the log-pdf of the data equals
log p(x) = ∑k h( ∑_{i∈S(k)} (wiT x)² )    (44)
where S(k) gives the indices of the components in the k-th subspace (e.g. S(1) = {1,2,3,4} for a subspace size of four).
• Nonlinear function h is chosen as with ICA to measure sparseness (e.g. negative of square root).
121
Estimation of ISA
• Estimation can be done by maximizing likelihood. Given
observations x(1), ..., x(T), maximize
∑_{t=1}^{T} ∑k h( ∑_{i∈S(k)} (wiT x(t))² )    (45)
with respect to the wi
• Estimation equivalent to maximizing sparsenesses of invariant
feature detectors.
• Receptive fields wi estimated at the same time as their grouping.
• Nonlinear function h could also be estimated from the data by
maximization of likelihood, but fixing it to the negative of the square root is
not far from optimal.
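A brief sketch (added, not from the slides) of evaluating the invariant features and the objective (45) for given filters, with subspaces of equal size and h(u) = −√u; the filter matrix W and its grouping are assumptions for the illustration, e.g. obtained from an optimization like the ICA sketch above extended with this objective:

```python
import numpy as np

def invariant_features(Z, W, subspace_size=4):
    """Energy detector outputs: norm of the projection onto each subspace.

    Z : whitened data, samples in rows.  W : filters in rows, grouped so that
    consecutive blocks of `subspace_size` rows form one subspace.
    """
    S = Z @ W.T                                          # linear filter outputs w_i^T x
    E = S**2                                             # squared outputs
    n_groups = W.shape[0] // subspace_size
    pooled = E.reshape(Z.shape[0], n_groups, subspace_size).sum(axis=2)
    return np.sqrt(pooled)                               # one "complex cell" output per subspace

def isa_objective(Z, W, subspace_size=4):
    """Likelihood-style objective (45) with h(u) = -sqrt(u): sum of negated invariant features."""
    return -np.sum(invariant_features(Z, W, subspace_size))
```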
122
Dependencies further analyzed
• What kind of dependencies of linear filter outputs are actually
modelled by independent subspace analysis?
• Actually, it is the same kind of square correlations that were found to
be dominant after a linear transformation
• For typical convex h, components in the same subspace have square
correlations, whereas components in different subspaces have none.
• In practice, divisive normalization cannot cancel all square
correlations, and the RF’s are estimated so that the correlations are
maximized inside subspaces, and minimized between subspaces.
123
Application on natural image data
• Applied on image data, independent subspace analysis shows
emergence of complex cell properties.
• Each subspace can be interpreted as a complex cell.
• We have phase invariance,
as well as orientation and frequency selectivity.
• Similar to complex Gabor models for complex cells
124
Independent subspaces of natural image data
Each group of 4 basis vectors corresponds to one complex cell.
125
Tuning curves
(Figure: tuning curves of the learned subspaces.)
Change one of the parameters of an optimal grating stimulus and plot the response as a function of that parameter.
Left: frequency. Middle: orientation. Right: phase.
126
Independent Subspace Analysis: Conclusions
• A simple way of modelling some of the dependencies of simple cell
outputs, after divisive normalization or without it.
• Instead of simple cell outputs, only complex cell outputs are
statistically independent.
• Learns groups of simple cells that resemble quadrature-phase filter
pairs.
127
12. Conclusion
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
128
Gabor models
• Classic descriptive models of receptive fields
• Widely used in different simulations and experiments
• Closely related to wavelet models, which provide a whole basis but
with a poorer choice of parameters.
• Do not satisfactorily answer the normative “Why” question.
129
Statistical models
• We can estimate statistical models of natural images.
• Taking into account nongaussian structure mostly amounts to
different kinds of sparse coding.
• Models based on ICA show clear similarities to simple cells,
complex cells etc.
• A modern normative model of why the receptive fields are as they are:
They provide a statistical model or code of ecologically valid stimuli.
• Statistical models can be used as priors in Bayesian inference.
• An open field of research, new models developed all the time.
• Actual applications for these models developed mainly in image
processing, often using (fixed) wavelet bases.
130