Models of the primary visual cortex
Part of the course Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
1
Contents
0. Basic terminology and properties of the visual system
1. Simple cells and linear models
2. Nonlinearities in simple cells
3. Complex cells and complex Gabor functions
4. Inhibition and normalization
5. Why are the receptive fields as they are?
6. Introduction to statistical models of natural images
7. Principal component analysis
8. Sparse coding
9. Independent component analysis
10. Basic dependencies between components
11. Complex cells and estimation of energy detectors
12. Conclusion
2
0.
Basic terminology and properties of the visual system
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
3
Early visual system
• From the retina to the visual cortex
4
The visual cortex
• Many different areas, most relatively unknown
5
Neurons
• Information transmitted between neurons by spikes (action
potentials)
• Activity of a neuron measured by emitted spike rate
= average number of spikes per second
• Neurons gather input signals from other neurons, and compute their
output mainly based on them.
• Spontaneous firing rate is not zero, but often defined as zero
(baseline)
6
Contrast and selectivity
• Contrast describes intensity differences in the stimulus. Important
because:
– The visual system is specialised in detecting intensity differences.
– Most important visual information may be conveyed by intensity
differences.
• Most cells in sensory areas are selective to (or “tuned” for) certain
stimulus properties
– Only active (firing rate increased) when the input stimulus has
those properties
– Most of the time they are not active.
7
1.
Simple cells and linear models
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
8
Simple and complex cells
• Basic dichotomy of neurons in the primary visual cortex
• Relatively simple vs. less simple response characteristics
• The division into two groups may not be so clear in practice
9
Receptive fields in simple cells
• RF means the area where contrast changes the neuron’s firing rate
• “On” / “off” areas: positive or negative contrast increases firing rate
10
Receptive fields in simple cells (2)
• Some stimuli and the spikes they elicit
11
Simple cell responses
• For certain stimuli (input), simple cells increase their firing rate
(excitation).
• Simple cells respond strongly to bars and edges, not single dots.
• Receptive fields (RF) small, localized.
• Contrast of “wrong” sign decreases firing rate (inhibition)
• Selective for orientation: respond mainly to edges/bars of given
orientation
• Selective for frequency/scale
• response ∝ contrast
12
Linear model of responses
• Input image gray-scale value (0=gray) I(x,y)
• RF weight parameters W (x,y)
• Response modelled as weighted sum ∑x,y I(x,y)W (x,y)
• Alternative notations: ∫ I(x,y)W(x,y) dx dy or ⟨I, W⟩
• Alternative terminology: correlation, dot-product, inner product
• If neuron is linear, its weights can be computed by observing
responses to different stimuli, using well-known statistical theory.
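As a concrete illustration (an added sketch, not part of the original slides), the linear response model as a numpy inner product; the example patch and RF values below are arbitrary:

```python
import numpy as np

def linear_response(I, W):
    """Linear model of a simple cell: response = sum over x,y of I(x,y) * W(x,y)."""
    return np.sum(np.asarray(I, dtype=float) * np.asarray(W, dtype=float))

# Example: a tiny 3x3 patch and a matching horizontal edge-like RF
patch = np.array([[ 1.0,  1.0,  1.0],
                  [ 0.0,  0.0,  0.0],
                  [-1.0, -1.0, -1.0]])
rf = patch.copy()
print(linear_response(patch, rf))   # strong positive response when patch matches the RF
```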
13
A linear receptive field
• Measurement from V1:
• How to describe this kind of RF mathematically?
14
Fourier Analysis
15
Fourier series
• Basic observation: Neurons are selective to frequency, so we need
Fourier analysis
• Express a 1-D function f (x),x ∈ [−π,π] as a sum of frequency
components
f(x) = a0 + ∑k≥1 [ak cos(kx) + bk sin(kx)]    (1)
• Decompose the function into “coarse” and “fine” parts:
16
Fourier series (cntd)
• The coefficients are computed as the inner products
ak = (1/π) ∫_{−π}^{π} f(x) cos(kx) dx    (2)
bk = (1/π) ∫_{−π}^{π} f(x) sin(kx) dx    (3)
[But a0 = (1/2π) ∫_{−π}^{π} f(x) dx]    (4)
• Squared sum (energy) of Fourier coefficients ak² + bk² gives the
“strength” of the frequency k in the function f .
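A hedged numerical sketch (added, not from the slides) of approximating the coefficients in (2)-(3) by discretizing the integrals; the square-wave example is only for illustration:

```python
import numpy as np

def fourier_coefficients(f, k_max, n=2048):
    """Approximate a_k, b_k of f on [-pi, pi] by Riemann sums of eqs. (2)-(3)."""
    x = np.linspace(-np.pi, np.pi, n, endpoint=False)
    dx = 2 * np.pi / n
    fx = f(x)
    a0 = np.sum(fx) * dx / (2 * np.pi)
    a = np.array([np.sum(fx * np.cos(k * x)) * dx / np.pi for k in range(1, k_max + 1)])
    b = np.array([np.sum(fx * np.sin(k * x)) * dx / np.pi for k in range(1, k_max + 1)])
    return a0, a, b

# Example: a square wave has only odd sine components, b_k = 4/(pi*k)
a0, a, b = fourier_coefficients(np.sign, k_max=5)
print(np.round(b, 3))   # approximately [1.273, 0, 0.424, 0, 0.255]
```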
17
Example of frequency bands
• Original image, and its low- and high-frequency parts
18
Gabor Analysis
19
Gabor functions
• Receptive fields are localized (“small”) which is not true of Fourier
analysis
• Gabor functions are a local Fourier analysis.
• Simple way of obtaining a W so that the receptive fields have the
basic properties of simple cells.
• Selectivity for frequency: Fourier analysis with sinusoids.
• Selectivity for location: do the analysis in a small area using a
windowing function, W zero elsewhere.
(Figure: windowing function × sinusoid = Gabor function)
20
One-dimensional Gabor functions
• Definition of two functions:
g1(x; α, β, γ, x0) = exp(−α²(x − x0)²) cos(2πβ(x − x0) + γ)
g2(x; α, β, γ, x0) = exp(−α²(x − x0)²) sin(2πβ(x − x0) + γ)    (5)
– α determines the width of the function in space.
– x0 defines the center (location).
– β gives the frequency of oscillation.
– γ (often 0) is the phase of the oscillation.
21
One-dimensional Gabor functions (cntd)
• Examples:
22
Two-dimensional Gabor functions
• Selectivity for orientation: Fourier analysis in one orientation only.
• Take a 1-D Gabor function along one dimension and multiply it by a
gaussian envelope in the other dimension:
g2d(x,y) = exp(−α2²(y − y0)²) g1(x)    (6)
• Rotate by any angle to obtain different orientations
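A minimal sketch (added, not from the slides) of constructing such a rotated two-dimensional Gabor receptive field; the function name gabor_2d and all parameter values are illustrative choices, not part of the course material:

```python
import numpy as np

def gabor_2d(size, alpha, alpha2, beta, gamma=0.0, theta=0.0):
    """2-D Gabor RF: gaussian envelope times a sinusoid, rotated by angle theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # rotate coordinates so that the oscillation runs along xr
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-alpha**2 * xr**2 - alpha2**2 * yr**2)
    return envelope * np.cos(2 * np.pi * beta * xr + gamma)

# Example: a 21x21 RF tuned to 45-degree orientation, about 0.1 cycles/pixel
W = gabor_2d(size=21, alpha=0.15, alpha2=0.15, beta=0.1, theta=np.pi / 4)
```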
23
What kind of parameters are good?
• The values for the parameters can be determined from single-cell
recordings from V1: different methods give somewhat different
results
• But some basic properties seem to be there most of the time
• The gaussian envelope is either circular (α = α2), or a bit shorter in
the orientation of the oscillation
• There are usually not many oscillations inside the envelope, just 2-3
different regions.
• So, the Gabor models are not that far from the intuitive idea of bars
and edges.
24
Spatio-temporal RFs
• Previously, the receptive fields were static, purely spatial.
• In reality, change of stimulus in time is important
• Spatio-temporal receptive fields: Neurons may respond only when
there is movement of a bar in a given direction
• Linear model has weights that change with respect to time: W (x,y, t)
• Some cells are selective to direction of motion and speed: best
stimulus is an edge moving with a certain speed over the receptive
field.
• Directional selectivity possible if W is “inseparable”, i.e. “oblique” in
the x-y-t -space
25
A spatio-temporal RF
26
Colour and stereopsis
• Some simple cells respond to colour and not general luminance (e.g.
borders between red and green)
• Colour vision can still be modelled by linear filters, if input is RGB
values (red, green and blue light intensities).
• Stereopsis: 3-D structure can be inferred because the two eyes give
views from slightly different angles
• Stereopsis can be modelled by couples of linear filters that are
slightly different for the input from the two eyes (binocular cells) or
respond only to input from one eye (monocular cells).
• Conclusion: linear models offer a flexible tool that can model many
kinds of selectivities.
27
2.
Nonlinearities in simple cells
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
28
Problem of negative responses
• In the linear model, responses can be positive or negative
• In reality, firing rate cannot decrease much ⇒ asymmetry
• Negative outputs of linear model are not observed in full.
• Correct model by a half-wave rectifying nonlinearity
• y = max(0,x)
• Every linear filter then corresponds to two cells, one for the positive part and one for the negative part.
29
Problem of small responses
• When the linear model gives small outputs, no output (increase in
firing rate) is observed in simple cells.
• The cell is using a threshold:
• y = max(0,x− c)
30
Saturation
• Due to biological properties, neurons have a maximum firing rate
• But the linear model has no maximum response
• Correction: use a nonlinearity that saturates as well, i.e. has a
maximum
• y = min(d,max(0,x− c))
31
Final model for a single cell (so far)
• output is f (∑x,y I(x,y)W (x,y)), where
W(x,y) is a Gabor function, and
f is the nonlinearity f (u) = min(d,max(0,u− c))
• Smoother version: f(u) = d u² / (c′ + u²).
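Putting the stages together, a small illustrative sketch (added, not from the slides) of the simple cell model so far: a linear RF followed by the threshold/saturation nonlinearity or its smooth version. The constants c, d, c′ are arbitrary, and the RF W could be, e.g., a Gabor function as above:

```python
import numpy as np

def simple_cell_response(I, W, c=0.1, d=1.0):
    """Simple cell model: linear filtering followed by threshold and saturation."""
    u = np.sum(I * W)                      # linear stage: inner product with the RF
    return min(d, max(0.0, u - c))         # f(u) = min(d, max(0, u - c))

def simple_cell_response_smooth(I, W, c_prime=0.5, d=1.0):
    """Smoother version f(u) = d*u^2/(c' + u^2), applied here to the rectified output."""
    u = max(0.0, np.sum(I * W))
    return d * u**2 / (c_prime + u**2)
```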
32
3.
Complex cells and complex Gabor functions
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
33
Basic properties of complex cells
• Selective to frequency and orientation
• Not selective to phase (white/black edge/bar, illustrated below):
fundamental difference to simple cells
• Less selective to location than simple cells
• Responses to moving sinusoidal gratings are relatively constant
(simple cells: oscillating response).
• Basically, like the sum of responses of several simple cells.
34
Complex cells and drifting (moving) sinusoidal gratings
• A simple test of linearity to distinguish simple and complex cells
35
Complex notation for Fourier series
• Convenient (?) notation using complex exponents:
since by definition
exp(ix) = cos(x) + i sin(x)    (7)
we can write the Fourier coefficients as:
ak + i bk = (1/π) ∫ f(x) exp(ikx) dx
= (1/π) ∫ f(x) cos(kx) dx + i (1/π) ∫ f(x) sin(kx) dx    (8)
• The function ∫ f(x) exp(iξx) dx is called the Fourier Transform (N.B.
different definitions exist, with minus signs and multiplying
constants)
36
Modulus vs. phase
• Fourier Transform is a complex number, so it has modulus (absolute
value, “length”) and phase (direction/angle in complex plane)
• Modulus and phase each contain half of the information
• To compute the “power” of a frequency, you take the (square of the)
modulus of the Fourier transform
[∫ f(x) sin(αx) dx]² + [∫ f(x) cos(αx) dx]² = |∫ f(x) exp(−iαx) dx|²
37
Complex Gabor functions
• Definition as complex function:
g(x; α, β, x0) = exp(−α²(x − x0)²) exp(i2πβ(x − x0))
= exp(−α²(x − x0)²) [cos(2πβ(x − x0)) + i sin(2πβ(x − x0))]
• One complex Gabor function defines two functions: real part (cosine)
and imaginary part (sine).
• The two functions have same envelope and Fourier power, only
90 degree difference in phase (“quadrature-phase”).
38
Modulus of Gabors
• The local Fourier power can be obtained as square of the modulus of
the output of the complex Gabor filter:
|(1/π) ∫ f(x) [exp(−α²(x − x0)²) exp(i2πβ(x − x0))] dx|²    (9)
• Equal to the sum of the squares of the inner products with the two
real Gabor functions:
[(1/π) ∫ f(x) exp(−α²(x − x0)²) cos(2πβ(x − x0)) dx]²
+ [(1/π) ∫ f(x) exp(−α²(x − x0)²) sin(2πβ(x − x0)) dx]²    (10)
39
Invariance of Fourier power
• Modulus of Fourier coefficient (or its square, “Fourier power”) is
invariant to change in location:
equal for f (x) and f (x− c) where c is some constant.
• Proof: by the definition of the Fourier transform
∫ f(x − c) exp(−iξx) dx = ∫ f(y) exp(−iξy) dy · exp(−iξc)    (11)
(make the change of variable y = x − c)
Here, exp(−iξc) has constant modulus (one), and therefore does not
affect the modulus.
• Location shift is only visible in the phase of the Fourier transform.
• Likewise, global change in the phase of the input (multiplication by a
constant of modulus one) does not change the Fourier power.
40
Complex cells and complex Gabors
• Thus, modulus of the output of a complex Gabor filter has invariance
with respect to phase (sign black/white or edge/bar) and small
location shift.
• A complex Gabor function or filter is thus a valid model of the
complex cell output
• In this model, a complex cell pools (adds) nonlinearly the outputs of
two linear filters / four rectified linear filters / several simple cells.
• Could also pool several complex Gabors in nearby positions to get
more invariance.
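As an added sketch (not from the slides), the resulting complex cell model: the squared modulus of the complex Gabor output, i.e. the sum of squared outputs of two quadrature-phase linear filters. The gabor_2d helper from the earlier sketch is assumed:

```python
import numpy as np

def complex_cell_response(I, W_cos, W_sin):
    """Energy model: square of the modulus of the complex Gabor filter output."""
    return np.sum(I * W_cos)**2 + np.sum(I * W_sin)**2

# W_cos and W_sin would be two Gabor RFs with the same envelope and frequency but a
# 90-degree phase difference ("quadrature pair"), e.g. gabor_2d(...) with gamma=0
# (cosine phase) and gamma=-np.pi/2 (sine phase).
```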
41
4.
Inhibition and normalization
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
42
Inhibition
• An added stimulus (mask) that is orthogonal to the receptive field can
make the response of a simple cell smaller when it is excited
(inhibition).
(Figure: original stimulus, mask, and stimulus + mask)
43
Interaction of linear filters
• This inhibition cannot be explained by the previous models, since the
mask should have no effect on the linear filter stage, because the
mask is orthogonal to the receptive field.
• There must be some interaction between the linear filters (cells or
groups):
The outputs of some cells must reduce the outputs of others.
• Previously, the linear filters were considered “independent channels”,
but this must be modified.
44
Divisive normalization model
• Cells with receptive fields in (more or less) the same location inhibit each other.
• First, linear filters modelled as Li = ∑x,y I(x,y)Wi(x,y)
• These are “half-squared” (rectified): Ai = max(0, Li)².
• Divisive normalization: Simple cell outputs are given as:
Si = k Ai / (σ² + ∑j Aj)    (12)
• k, σ are constants.
• The sum in the denominator could be the sum of outputs of (non-normalized) complex cells.
• The sum is taken among linear filters that are more or less in the same location, but have different phases, frequencies, and orientations (including the filter i itself).
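A minimal sketch (added, not from the slides) of this normalization stage, assuming the linear filter outputs have already been collected into a vector; k and sigma are arbitrary illustrative constants:

```python
import numpy as np

def divisive_normalization(L, k=1.0, sigma=0.1):
    """Divisive normalization of simple cell outputs, eq. (12).

    L : array of linear filter outputs L_i for filters sharing (roughly) the same
        location but differing in phase, frequency and orientation.
    """
    A = np.maximum(0.0, L)**2                 # half-squared rectification
    return k * A / (sigma**2 + np.sum(A))     # each output divided by the pooled activity

# Example: one strongly driven filter among weakly driven ones
print(divisive_normalization(np.array([2.0, 0.1, -0.3, 0.2])))
```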
45
Example
• The normalization model includes thresholding and saturation: if
only the neuron in question is stimulated, the nonlinearity after the
linear filtering is (solid line):
Horizontal: contrast, Vertical: cell output
• Dashed line shows the effect with some stimulation of other
normalizing cells.
46
BTW: What is a receptive field?
• Area where W(x,y) ≠ 0 in linear model is called “classical” RF.
• A white or black dot in the classical RF is sufficient to increase firing
rate (if strong enough contrast).
• Inhibitory effects come from an area that is larger: “non-classical
RF”, “extended RF”.
• Outside the classical RF, a white or black dot alone does not make the
cell fire, because normalization is divisive.
47
5. Why are the receptive fields as they are?
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
48
Descriptive and normative models / theories
• Descriptive models try to describe how a system behaves, e.g. linear
Gabor model of simple cells.
• Normative models try to show how the system should behave in order
to be optimal in some sense.
(Another meaning: what you should do to the system, e.g. in
economics)
• Normative models can often be used in biology based on the
hypothesis that evolution has brought the system close to optimality.
• The hypothesis can be quite wrong in some cases, but quite useful in
others.
• Normative models can give a deeper understanding of the system,
answering the question of why the system is as it is.
49
Theory 1: Optimal space-frequency localization
• Classic attempt of a normative theory for Gabor analysis.
• The (complex) Gabor function is localized both in space and in
frequency.
• Modulus of Fourier transform is a Gaussian kernel
• Compromise between single pixels, and sinusoids (Fourier basis):
Pixels localized in space only, Fourier basis localized in frequency
only
50
Theory 2: Contour detection
• Another goal attributed to visual neurons is to detect contours of
objects (edges or possibly bars).
• Simple cells do this separately for different orientations
• First step to object recognition, basic features of objects?
• Complex cells are less sensitive to small shifts/transformations in the
image
51
Are these theories any good?
• Problem: they give only vague and limited predictions.
• Many different kinds of RF’s are localized both in space and
frequency, and detect edges
• Also: why should the RF’s be localized in space and frequency, and
why should they detect edges?
⇒ These models are not really normative in the sense of using an
optimality criterion.
• Can we find a principle that explains the form and purpose of RF’s
better?
• Yes, see the rest of this course!
52
Theory 3: Statistical-ecological approach
• Ecology (situatedness): What is important in a real environment?
• Statistics: Natural images have statistical regularities.
• Logic:
1. Different sets of features (RF’s) are good for different kinds of
data.
2. The images that our eyes receive have certain statistical
properties.
3. The visual cortex has learned these statistical properties.
4. This enables optimal statistical inference and statistical signal
processing.
53
Theory 3: Statistical-ecological approach (cntd)
• In other words, can we “explain” receptive fields by basic statistical
properties of natural images?
• We will see below that we can, and we have a phenomenon of
“Emergence”:
many precise predictions from very few statistical assumptions.
• Statistical models of natural images (below) give the optimal RF’s in
a statistical sense
• In this case, the representation in V1 is dictated by the need for a
general-purpose code or statistical model, not a particular goal (in
contrast to texture recognition / contour detection)
54
6. Introduction to statistical models of natural images
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
55
Why statistical models?
• Input to visual system is noisy, fuzzy, uncertain, highly complex and
in many ways difficult.
• Modern framework to handle such data: Bayesian inference
• Statistical models of natural images incorporate the prior
information: they give the prior probability distribution to be used in
Bayesian inference.
• Bayesian inference combines the prior information with the incoming
input (observed signal).
• Prior tells us what (natural) images are typically like
• Here, we concentrate on finding such prior models, and how they
provide models of receptive fields.
56
Some examples of Bayesian inference
• Reduction of noise:
• Completion of contours
57
Linear statistical models of natural images
(Figure: image patch = s1 · basis vector 1 + s2 · basis vector 2 + · · · + sk · basis vector k)
• Model each image (or small part, patch) as a linear superposition of
basis vectors (features).
58
Linear statistical models of images (2)
• Denote by x = (x1, ...,xn) a vector that contains one image (patch),
e.g. gray-scale values of pixels (previously I(x,y)).
• We observe x many times, so we can consider it a random vector.
• Model x by a statistical model:
x = As = ∑i ai si   or   xj = ∑i aji si    (13)
• Image is superposition of basis vectors ai with random coefficients si
• Inverting the system: si = ∑ j wi jx j , we see that the si are linear filter
(simple cell) outputs.
• How to find suitable basis vectors by observing only x?
This is called unsupervised learning.
• Gives the “best” basis vectors from a statistical (Bayesian) viewpoint.
59
7. Principal component analysis
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
60
Principal component analysis
• There are different methods for unsupervised learning of a linear
representation.
• PCA is a classic method of estimating a statistically motivated
representation for multivariate data.
• Basic idea: find directions of maximum variance
61
PCA and projections of maximum variance
• In general, consider a random vector x = (x1,x2, ...,xn)T that has zero
mean: E{x}= 0 (i.e. the mean has been subtracted from each
variable).
• Goal: Find a linear combination wT x = ∑i wixi that has maximum
variance.
• We must constrain the norm of w: ‖w‖= 1, otherwise solution is that
w is infinite.
• We must solve:
max_{‖w‖=1} E{(wT x)²}    (14)
• We have
E{(wT x)²} = E{(wT x)(xT w)} = E{wT (xxT) w} = wT E{xxT} w
• A well-known problem in linear algebra.
62
Covariance matrix
• PCA is based on information in the covariance matrix C, i.e. the matrix whose elements are the covariances:
Cij = cov(xi, xj) = E{xi xj} − E{xi}E{xj}    (15)
• Because the variables have zero mean, we have simply:
C = E{xxT} (16)
• The covariance matrix of y = Mx equals MCMT .
• Near-by pixels in images have strong covariances:
(Figure: scatterplot of the gray-scale values of two neighbouring pixels)
63
Computing PCA
• Solution for
max_{‖w‖=1} wT C w    (17)
is given by the eigenvector of C that corresponds to the largest
eigenvalue. The eigenvalue decomposition is given by:
C = E diag(di) ET    (18)
where the columns of E are the eigenvectors, which are orthogonal.
• Eigenvalues are real because C is symmetric, and non-negative
because C is positive semidefinite.
64
Computing PCA (2)
• To compute more than one principal component, find the direction of
maximum variance which is orthogonal to the components previously
found. This is solved by the k-th eigenvector.
• Thus, PCA can be done by computing the eigenvalue decomposition
of the covariance matrix C
• columns of the matrix E give the directions of the principal
components (the basis vectors).
• The values di in the diagonal matrix show their contribution to the
variance.
• Ordering the columns according to the descending order of di gives
the first, second etc. principal components.
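A small sketch (added, not from the slides) of PCA via the eigenvalue decomposition of the covariance matrix, for data stored with one observation per row; the function name pca is a hypothetical helper, not course code:

```python
import numpy as np

def pca(X, n_components=None):
    """PCA of data matrix X (observations in rows, variables in columns)."""
    Xc = X - X.mean(axis=0)                      # remove the mean of each variable
    C = (Xc.T @ Xc) / Xc.shape[0]                # covariance matrix C = E{x x^T}
    d, E = np.linalg.eigh(C)                     # eigenvalues ascending, eigenvectors in columns
    order = np.argsort(d)[::-1]                  # sort by descending variance
    d, E = d[order], E[:, order]
    if n_components is not None:                 # optional dimension reduction
        d, E = d[:n_components], E[:, :n_components]
    return E, d                                  # principal directions and their variances

# Example with random data
E, d = pca(np.random.randn(1000, 5), n_components=2)
```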
65
PCA of image patches
160 first basis vectors given by PCA for 16×16 image patches.
These are like Fourier analysis, not like simple cells!
66
PCA and dimension reduction
• PCA is no good as a model of simple cell RF’s
• However, it is good for dimension reduction.
• Assume we have a very large number of random variables x1, . . . ,xm,
and computations that use all the variables would be too burdensome.
• Let’s linearly transform the variables into a smaller number of
variables z1, . . . ,zn:
zi = ∑_{j=1}^{m} wij xj,   for all i = 1, ..., n    (19)
• How to do this so that we preserve maximum amount of information
(i.e. error when we reconstruct the original data xi linearly from the
z j is minimized)?
• Solution: Take n first principal components
67
PCA, decorrelation, and whitening
• PCA is also useful for decorrelating and whitening the data.
• Decorrelation means we make a linear transformation y = Wx so that
the yi are uncorrelated, i.e.
cov(yi, yj) = E{yi yj} = 0 for i ≠ j    (20)
(we assume that the mean is zero)
• Basic result: principal components are uncorrelated!
• Often we need a normalized form of decorrelation, called whitening
or sphering, where E{yyT} = I, i.e. the variables yi are uncorrelated
and have unit variance E{yi²} = 1.
• We can whiten x by simply dividing the principal components by
their standard deviations.
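A sketch (added, not from the slides) of whitening by PCA: the components are decorrelated and divided by their standard deviations. V below is the symmetric ("ZCA") form of the whitening matrix, one of many valid choices:

```python
import numpy as np

def whiten(X, eps=1e-10):
    """Whiten X: decorrelate with PCA and scale each component to unit variance."""
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / Xc.shape[0]
    d, E = np.linalg.eigh(C)
    V = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T    # symmetric whitening matrix
    Z = Xc @ V.T
    return Z, V                                      # E{z z^T} is (close to) the identity

Z, V = whiten(np.random.randn(1000, 5))
print(np.round(np.cov(Z, rowvar=False), 2))          # approximately the identity matrix
```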
68
PCA and gaussianity
• What is missing in PCA?
• In PCA and whitening, it is assumed that the only interesting aspect
of the data is variances and covariances.
• This is the case with gaussian data whose pdf is:
p(x) = 1 / ((2π)^{n/2} |C|^{1/2}) · exp(−(1/2) xT C⁻¹ x)    (21)
where C is the covariance matrix of the data.
• Thus, the probability distribution is completely characterized by the
covariances (and the means that are assumed zero here).
• So, for gaussian data, we can’t do much more than PCA.
• But what if the data is nongaussian? That’s the theme of the rest of this
course.
69
8. Sparse coding
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
70
Nongaussianity in image data
• For gaussian data, all we can do is to analyze covariances and do
something like PCA.
• Fortunately, image data is very nongaussian.
• Histogram of outputs of a Gabor-like filter with natural image input,
compared with gaussian pdf of same variance:
71
Sparseness
• What kind of nongaussianity can we see in image data?
• Sparseness is a form of nongaussianity (higher-order structure) often
encountered in natural signals
• Sparseness means that a random variable is “active” only rarely
(Figure: sample time courses of a gaussian and a sparse variable. These variables have the same variance.)
72
Sparseness (2)
• A random variable is sparse if its density has heavy tails, and a peak
at zero.
• Classic sparse pdf: Laplacian density (standardized to unit variance):
p(s) = (1/√2) exp(−√2 |s|)    (22)
73
Kurtosis
• Classic measure of sparseness.
• Definition:
kurt(s) = E{s⁴} − 3 (E{s²})²    (23)
• Depends on the variance as well, so we have to fix the variance
(to unity) to make this a proper measure of sparseness
• If variance constrained to unity, essentially 4th moment.
• Simple algebraic properties (for independent s1 and s2):
kurt(s1 + s2) = kurt(s1) + kurt(s2)    (24)
kurt(αs1) = α⁴ kurt(s1)    (25)
74
Why kurtosis is not optimal
• Sensitive to outliers:
Consider a sample of 1000 values with unit var, and one value equal
to 10.
Kurtosis equals at least 10⁴/1000 − 3 = 7.
• Especially for image data, other measures should be used.
• Consider measures of the form
E{h(s²)}    (26)
where h is some nonlinear function, and s standardized to zero mean
and unit variance.
• Sparseness measures obtained when h is convex.
75
Convexity
• Convexity means that the graph of the function is always below the line segment that connects two points on the graph.
• Can be expressed as
h(αx1 + (1−α)x2) < αh(x1) + (1−α)h(x2),   for 0 < α < 1    (27)
• This is true if the second derivative of h is positive.
• Why convexity? Expectation of a convex function has a large value if the data is concentrated in the extremes, in this case near zero and very far from zero.
76
Robust sparseness measures
• Kurtosis is obtained by setting h(u) = u².
• By choosing a different h, we can obtain a measure that is robust (not
sensitive) to outliers.
• Take a function that does not grow fast when going far from the
origin, for example:
h2(u) = −√u    (28)
• In other words, sparseness measure takes the form
E{√(s²)} = E{|s|}    (29)
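A small sketch (added, not from the slides) of both measures estimated from a sample, with the expectations replaced by sample averages:

```python
import numpy as np

def kurtosis(s):
    """kurt(s) = E{s^4} - 3 (E{s^2})^2, estimated from a sample."""
    s = np.asarray(s, dtype=float)
    return np.mean(s**4) - 3 * np.mean(s**2)**2

def robust_sparseness(s):
    """Robust measure based on h(u) = -sqrt(u): returns E{-|s|} on standardized data
    (larger value = sparser)."""
    s = np.asarray(s, dtype=float)
    s = (s - s.mean()) / s.std()          # standardize to zero mean, unit variance
    return -np.mean(np.abs(s))

rng = np.random.default_rng(0)
print(kurtosis(rng.normal(size=100000)))     # close to 0 for gaussian data
print(kurtosis(rng.laplace(size=100000)))    # clearly positive for sparse data
```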
77
Maximally sparse features of natural images
• Maximizing sparseness for natural image patches we get
• These are like Gabor functions and simple cell RF’s!
– Localized in space, frequency, orientation
– Right aspect ratio (length vs. width)
– Suitable number of oscillations
• Good normative model of simple cell receptive fields!
• (Of course, they are a bit noisy because they are learned from real
data.)
78
Tuning curves
(Figure: tuning curves of the learned features.)
Change one of the parameters of an optimal grating stimulus and plot the response as a function of that parameter.
Left: frequency. Middle: orientation. Right: phase.
79
How is sparseness useful?
• The central idea here: it is useful to find good statistical models,
needed in Bayesian inference
• But in addition: firing of cells consumes energy. Sparse coding is
energy-efficient.
• A fortunate coincidence !?
80
9. Independent component analysis
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
81
Problems with maximization of sparseness
• The choice of the sparseness measure was rather ad hoc.
• How to learn a set of features?
• How do we get a prior distribution? Just finding maximally sparse
features does not give us a probability distribution.
• To solve these problems, we introduce independent component
analysis (ICA).
82
Historical background: Blind source separation
ICA was first developed to solve the source separation problem.
Assume we have four “source signals”:
Due to some external circumstances, only linear mixtures of the
source signals are observed.
Estimate (separate) original signals!
83
Blind source separation (2)
• Use only information on statistical independence to recover:
These are the “independent components”!
84
Independent Component Analysis (ICA)
• In ICA, we approach the problem from statistical model estimation:
• Observed random vector x is modelled by a linear latent variable
model
xi = ∑_{j=1}^{m} aij sj,   i = 1, ..., n    (30)
or in matrix form:
x = As (31)
where
– The ’mixing’ matrix A is constant (a parameter matrix).
– The si are latent random variables called the independent components.
– Estimate both A and s, observing only x.
• In the end, very similar to image representation problem!
85
Basic properties of the ICA model
• Must assume:
– The si are mutually statistically independent
– The si are nongaussian.
– For simplicity: The matrix A is square.
• The si are defined only up to a multiplicative constant.
• The si are not ordered.
86
Statistical independence
• Random variables y1 and y2 are independent if information on the
value of y1 does not give any information on the value of y2, and vice
versa.
• Formal definition: p(y1,y2) = p1(y1)p2(y2) i.e. the joint probability
density is the product of the individual (marginal) probability
densities.
• If yi and yj are independent, any nonlinear transformations of them are
uncorrelated:
cov(g1(yi), g2(yj)) = E{g1(yi) g2(yj)} − E{g1(yi)}E{g2(yj)} = 0
for any two functions g1 and g2.
• Thus, independence is a much more general property than
uncorrelatedness.
87
Whitening is not enough for ICA
• Independent random variables are uncorrelated. So, why not estimate ICA by whitening? Find a linear transformation that whitens the data.
• Are whitened variables necessarily equal to the si? NO! Whitening can be done with many different matrices (e.g. PCA).
• Any orthogonal transformation of a white y is white as well. Denote by U any orthogonal matrix. Then z = Uy is also white: E{(Uy)(Uy)T} = U E{yyT} UT = UUT = I
• So, whitening gives us the independent components only up to an orthogonal transformation. In other words, after whitening, we can consider the mixing matrix A to be orthogonal:
E{xxT} = I = A E{ssT} AT = AAT    (32)
• Another viewpoint: Whitening uses only the covariance matrix: ≈ n²/2 equations, but A has n² elements.
88
Gaussian data is no good
• ICA is not possible if the data is gaussian, because PCA and
whitening exhaust all the information in the data.
• The key to estimating the ICA model is nongaussianity.
• For nongaussian data, independence gives much more information
than uncorrelatedness and ICA is possible.
• In fact, uncorrelatedness implies independence in the gaussian case.
• But in the general case, independence implies uncorrelatedness of any
nonlinear transformations (see above). In fact, nongaussianity
(“higher-order structure”) enables estimation of ICA model.
89
Illustration of whitening of nongaussian data
• Assume that there are just two pixels, and the si have uniform
distributions (this is not true at all, but it makes a simple illustration):
Distributions of:
independent components si, pixels xi, whitened pixels.
90
Whitening of gaussian data
• Now assume the si have gaussian distributions:
Distributions of:
independent components si, pixels xi, whitened pixels.
Whitening gives the original distribution!
91
Basic intuitive principle of ICA estimation.
• Instead of variance, look at other properties of the linear combination
wT x = ∑i wixi.
• In fact, we ignore here the variance completely by constraining wT x to have constant variance (equal to one).
• Note that the linear combination is also a linear combination of the
independent components: ∑j wj xj = wT x = ∑j qj sj = qT s
• Our approach is motivated by the Central Limit Theorem, a
fundamental theorem in probability theory.
92
Central limit theorem (CLT)
• Basic idea: average of many independent random variables will have
a distribution that is close(r) to gaussian
• In the limit of an infinite number of random variables, the distribution
tends to gaussian:
lim_{N→∞} (1/N) ∑_{n=1}^{N} sn = gaussian    (33)
• Some technical restrictions are necessary for this results to hold
exactly. (E.g. the sn all have the same distribution which has finite
moments.)
93
Application of CLT to ICA
• CLT says something like: mixture of independent components is
more gaussian than the original independent components (at least if
they all have the same distribution.)
• That is: qi si + qj sj is more gaussian than si, for qi, qj ≠ 0
• Maximizing the nongaussianity of qT s, we can find si.
94
Illustration of changes in nongaussianity (1)
Marginal and joint densities, uniform distributions.
Marginal and joint densities, whitened mixtures of uniform ICs
95
Illustration of changes in nongaussianity (2)
Marginal and joint densities, supergaussian distributions.
Whitened mixtures of supergaussian ICs
96
Kurtosis as nongaussianity measure.
• Problem: how to measure nongaussianity?
• Classic solution: take square (or absolute value) of kurtosis
• zero for gaussian random variable, non-zero for most nongaussian
random variables.
• positive vs. negative kurtosis have typical forms of pdf.
97
Classic super- and subgaussian densities
Left: Laplacian pdf (1/√2) exp(−√2 |x|), positive kurt (“supergaussian”).
Right: Uniform pdf, negative kurt (“subgaussian”).
(Gaussian probability density function given for comparison as
dashed line)
98
The extrema of kurtosis give independent components
• by the properties of kurtosis:
kurt(wT x) = kurt(qT s) = q1⁴ kurt(s1) + q2⁴ kurt(s2)    (34)
• constrain variance to equal unity
E{(wT x)²} = E{(qT s)²} = q1² + q2² = 1    (35)
• for simplicity, consider kurtoses equal to one.
• maxima of kurtosis give independent components (see next slides)
• general result: absolute value of kurtosis is maximized when wT x = si
• Note: extrema are orthogonal due to whitening.
99
Optimization landscape for kurtosis
Thick curve is unit sphere, thin curves are contours where kurtosis is
constant.
100
Illustration of optimization of kurtosis
Kurtosis as a function of the direction of projection. For positive
kurtosis, kurtosis (and its absolute value) are maximized in the
directions of the independent components.
101
ICA and sparseness
• Now we see that ICA and sparse coding are very closely related.
• Maximization of sparseness is the same as maximization of
nongaussianity for sparse independent components
– E.g. max of square of kurtosis is max of kurtosis for positive
kurtosis
• Now we have a properly defined probability model and distribution to
be used in Bayesian inference
– Estimate the marginal distributions of the si, or approximate them
by well-known distribution (Laplacian), and you have a pdf.
• But how does ICA help with the other problems we had with sparse
coding?
• Answer lies in formulating maximum likelihood estimation for ICA.
102
Maximum likelihood estimation of ICA
• Maximum likelihood: find parameter values that give maximum
probability for the observed data sample x(1), . . . ,x(T ).
• Assume data is whitened, so A can be constrained to be orthogonal
• Log-pdf can be formulated as:
log p(x(1), ..., x(T)) = ∑_{t=1}^{T} ∑_{i=1}^{n} log psi(aiT x(t))    (36)
where psi is pdf of each independent component, and ai is one
column of A (one basis vector).
• This is sum of sparsenesses!
103
Maximum likelihood estimation of ICA (2)
• Now we know how to estimate a set of features:
– Maximize this sum of sparsenesses under orthogonality constraint
• An optimal measure of sparseness (according to this criterion) is
given by
h(si²) = log psi(si)    (37)
Typically, taking h(u) = −√u is not far from the truth
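A hedged toy sketch (added, not from the slides, and not the FastICA algorithm) of this estimation principle: maximize the sum of sparsenesses, with h(u) = −√u, over an orthogonal filter matrix on whitened data by projected gradient ascent:

```python
import numpy as np

def ica_sparseness(Z, n_iter=200, lr=0.1, seed=0):
    """Toy ICA on whitened data Z (samples in rows): maximize sum of E{h(s^2)},
    h(u) = -sqrt(u), over an orthogonal filter matrix W (filters in rows)."""
    rng = np.random.default_rng(seed)
    n = Z.shape[1]
    W = np.linalg.qr(rng.standard_normal((n, n)))[0]     # random orthogonal start
    for _ in range(n_iter):
        S = Z @ W.T                                      # component estimates s_i = w_i^T x
        grad = -(np.sign(S).T @ Z) / Z.shape[0]          # gradient of (1/T) sum_t -|s_i(t)|
        W = W + lr * grad                                # ascent step
        U, _, Vt = np.linalg.svd(W)                      # project back onto orthogonal matrices
        W = U @ Vt
    return W

# Example: unmix whitened mixtures of two Laplacian (sparse) sources
rng = np.random.default_rng(1)
S = rng.laplace(size=(5000, 2))
X = S @ rng.standard_normal((2, 2)).T                    # observed mixtures
Xc = X - X.mean(0)
d, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ (E / np.sqrt(d)) @ E.T                          # whiten the mixtures
W = ica_sparseness(Z)
```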
104
Independent Component Analysis of image patches
ICA basis vectors for image patches, after reducing dimension with
PCA. (Note that basis vectors are not quite the same as RF’s)
105
Sparse coding
• Sparse coding properly speaking means: For random vector x, find
linear representation:
x = As (38)
so that the components si are as sparse as possible.
• Important property: a given data point is represented using only a
limited number of “active” (clearly non-zero) components si.
• In contrast to PCA, the active components change from one image patch
to another.
• Cf. vocabulary of a language which can describe many different
things by combining a small number of active words.
106
10. Inhibition and normalization
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
107
Dependence of Gabor filters
• The “independent components” of images are not really independent.
• The statistical structure of image data is much more complicated (of
course).
• In fact, independent components can not be found for most kinds of
data.
• More sophisticated statistical models of images consider some
dependencies of the simple cell outputs.
108
Correlation of squares
• What is the most important form of dependencies that remains after
linear ICA transformation?
• Answer: the squares of simple cell outputs si are correlated (i.e. their
covariance is not zero).
• This means “simultaneous activation”.
• General activity levels are correlated.
109
Illustration of correlation of squares
Two signals that are uncorrelated but whose squares are correlated.
110
A statistical model of square correlations
• Assume that the variances of the components change from patch to patch:
I(x,y) = ∑_{i=1}^{m} Ai(x,y) (v si) = v ∑_{i=1}^{m} Ai(x,y) si    (39)
• Here, v is a random variable.
• Doing ICA, what we observe is components of the form vsi, all
multiplied by v.
• In this model, the components vsi have square correlations. The
variable v implies “simultaneous activation”.
111
Normalization as model estimation
• We want to get rid of v. We can estimate v and divide the image patch
by v:
I(x,y) ← I(x,y) / v    (40)
• A very simple estimator:
v = c √( ∑x,y I(x,y)² )    (41)
• Rather similar to a divisive normalization model!
• Normalization can be justified by structure of natural images.
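A minimal sketch (added, not from the slides) of this normalization step for a single patch, under the assumption that the estimator (41) is a scaled norm of the patch; c and the small constant that guards against division by zero are arbitrary:

```python
import numpy as np

def normalize_patch(I, c=1.0, eps=1e-8):
    """Divide an image patch by an estimate of its overall contrast level v."""
    I = np.asarray(I, dtype=float)
    v = c * np.sqrt(np.sum(I**2))          # estimate of v, eq. (41)
    return I / (v + eps)                   # I(x,y) <- I(x,y) / v, eq. (40)
```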
112
Physical interpretation
• The same scene can be observed under very different lighting
conditions: daylight, dusk, indoors. Single objects can receive
different amounts of light.
• The light coming onto the retina is a function of the reflectances of
the surfaces in the scene (R), and the light (illumination) level (L):
I(x,y) = L(x,y)R(x,y) (42)
• Changes in the illumination can change the observed data drastically.
113
Normalization models and natural images (2)
• Basically, one is interested in the objects (R), and not in the
illumination conditions (L)
• Estimation of R can be accomplished by estimating L (similar to v a
couple of slides ago), and dividing I by the estimate
• This is possible because L is more or less constant for small image
patches.
• In reality, such operations are done in several parts of the visual
system:
– Retinal cells adapt to the mean luminance
– Cells in V1 adapt to the mean contrast
114
Another viewpoint on normalization
• Contrasts can have very different values in different contexts, but
neurons have a limited output range due to saturation (and threshold)
• Problem: in high-contrast environments, the cell outputs could be
saturated almost all the time, in low-contrast environments they could
be zero most of the time.
• Divisive normalization looks at the general contrast level, and
normalizes the cell outputs so that they transmit more information,
staying in the useful range.
115
11. Complex cells and estimation of energy detectors
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
116
Dependence of Gabor filters after normalization
• Even after divisive normalization, the outputs of simple cell models
are not independent.
• A nice thing, we can hope to model further properties!
• Basic models group components (cells) by combining two ideas:
– Division of components into subspaces, and
– Spherically symmetric distributions.
117
Division into independent subspaces
• A very basic approach to modelling dependencies.
• Assumption: the si can be divided into groups of n components, such
that
– the si in the same group may be dependent on each other
– dependencies between different groups are not allowed.
• Every group corresponds to a subspace (spanned by the
corresponding basis vectors)
• We also need to specify the distributions inside the subspaces...
118
Invariant-feature subspaces
• An abstract approach to representing invariant features:
Generalization of the idea of complex cell models.
• Linear filters (like in ICA) necessarily lack any invariance.
• Principle: invariant feature is a linear subspace in a feature space.
• The value of the invariant feature is given by the norm of the projection
on that subspace:
√( ∑_{i=1}^{k} ( ∑x,y Wi(x,y) I(x,y) )² )    (43)
• (Sum of) squares is also called “energy” especially in Fourier
analysis, so these are “energy detectors”
119
Graphical illustration
(Figure: network diagram of an energy detector: the linear filter outputs ⟨wi, I⟩ are squared and summed.)
120
Independent Subspace Analysis (ISA)
• Combination of independent subspaces and invariant-feature subspaces.
• The probability density inside each subspace is spherically symmetric, i.e. depends only on the norm of the projection (the value of the invariant feature).
• The invariant features have sparse distributions, i.e. exponential.
• For whitened data, the log-pdf of the data equals
log p(x) = ∑k h( ∑_{i∈S(k)} (wiT x)² )    (44)
where S(k) gives the indices of the components in the k-th subspace (e.g. S(1) = {1,2,3,4} for a subspace size of four).
• Nonlinear function h is chosen as with ICA to measure sparseness (e.g. negative of square root).
121
Estimation of ISA
• Estimation can be done by maximizing likelihood. Given
observations x(1), ..., x(T), maximize
∑_{t=1}^{T} ∑k h( ∑_{i∈S(k)} (wiT x(t))² )    (45)
with respect to the wi
• Estimation equivalent to maximizing sparsenesses of invariant
feature detectors.
• Receptive fields wi estimated at the same time as their grouping.
• Nonlinear function h could also be estimated from the data by
maximization of likelihood, but fixing it to the negative of the square root is
not far from optimal.
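A brief sketch (added, not from the slides) of evaluating the invariant features and the objective (45) for given filters, with subspaces of equal size and h(u) = −√u; the filter matrix W and its grouping are assumptions for the illustration, e.g. obtained from an optimization like the ICA sketch above extended with this objective:

```python
import numpy as np

def invariant_features(Z, W, subspace_size=4):
    """Energy detector outputs: norm of the projection onto each subspace.

    Z : whitened data, samples in rows.  W : filters in rows, grouped so that
    consecutive blocks of `subspace_size` rows form one subspace.
    """
    S = Z @ W.T                                          # linear filter outputs w_i^T x
    E = S**2                                             # squared outputs
    n_groups = W.shape[0] // subspace_size
    pooled = E.reshape(Z.shape[0], n_groups, subspace_size).sum(axis=2)
    return np.sqrt(pooled)                               # one "complex cell" output per subspace

def isa_objective(Z, W, subspace_size=4):
    """Likelihood-style objective (45) with h(u) = -sqrt(u): sum of negated invariant features."""
    return -np.sum(invariant_features(Z, W, subspace_size))
```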
122
Dependencies further analyzed
• What kind of dependencies of linear filter outputs are actually
modelled by independent subspace analysis?
• Actually, it is the same kind of square correlations that were found to
be dominant after a linear transformation
• For typical convex h, components in the same subspace have square
correlations, whereas components in different subspaces have none.
• In practice, divisive normalization cannot cancel all square
correlations, and the RF’s are estimated so that the correlations are
maximized inside subspaces, and minimized between subspaces.
123
Application on natural image data
• Applied on image data, independent subspace analysis shows
emergence of complex cell properties.
• Each subspace can be interpreted as a complex cell.
• We have phase invariance,
as well as orientation and frequency selectivity.
• Similar to complex Gabor models for complex cells
124
Independent subspaces of natural image data
Each group of 4 basis vectors corresponds to one complex cell.
125
Tuning curves
(Figure: tuning curves of the learned subspaces.)
Change one of the parameters of an optimal grating stimulus and plot the response as a function of that parameter.
Left: frequency. Middle: orientation. Right: phase.
126
Independent Subspace Analysis: Conclusions
• A simple way of modelling some of the dependencies of simple cell
outputs, after divisive normalization or without it.
• Instead of simple cell outputs, only complex cell outputs are
statistically independent.
• Learns groups of simple cells that resemble quadrature-phase filter
pairs.
127
12. Conclusion
Modelling of Vision
Course #582450 (UH), S-114.4204 (TKK)
Aapo Hyvarinen, spring 2006
128
Gabor models
• Classic descriptive models of receptive fields
• Widely used in different simulations and experiments
• Closely related to wavelet models, which provide a whole basis but
with a poorer choice of parameters.
• Do not satisfactorily answer the normative “Why” question.
129
Statistical models
• We can estimate statistical models of natural images.
• Taking into account nongaussian structure mostly amounts to
different kinds of sparse coding.
• Models based on ICA show clear similarities to simple cells,
complex cells etc.
• A modern normative model of why the receptive fields are as they are:
They provide a statistical model or code of ecologically valid stimuli.
• Statistical models can be used as priors in Bayesian inference.
• An open field of research, new models developed all the time.
• Actual applications for these models developed mainly in image
processing, often using (fixed) wavelet bases.
130