Population Coding
Alexandre Pouget
Okinawa Computational Neuroscience Course
Okinawa, Japan, November 2004
Outline
• Definition
• The encoding process
• Decoding population codes
• Quantifying information: Shannon and Fisher information
• Basis functions and optimal computation
Receptive field
s: direction of motion
[Figure: a stimulus moving in direction s and the evoked spike-train response; the code is the number of spikes (e.g., 10, 10, 7, 8, 4), which varies from trial to trial (trials 1-4)]
Tuning curves and noise
[Figure: mean activity fi(s), the tuning curve, plotted against the encoded variable s, with error bars marking the variance of the noise σi(s)²; the variance can depend on the input]
Examples of tuning curves:
Retinal location, orientation, depth, color, eye movements, arm movements, numbers… etc.
Population Codes
[Figure: left, the tuning curves of the population (activity vs. direction, deg); right, the pattern of activity r they produce (activity vs. preferred direction, deg) in response to an unknown stimulus s?]
Bayesian approach
We want to recover P(s|r). Using Bayes theorem, we have:
P(s|r) = P(r|s) P(s) / P(r)
Bayesian approach
Bayes rule:
P(r, s) = P(r|s) P(s) = P(s|r) P(r)
⟹ P(s|r) = P(r|s) P(s) / P(r)
Bayesian approach
We want to recover P(s|r). Using Bayes theorem, we have:
P(s|r) = P(r|s) P(s) / P(r)
where P(s|r) is the posterior distribution over s, P(r|s) is the likelihood of s, P(s) is the prior distribution over s, and P(r) is the prior distribution over r.
Bayesian approach
If we are to do any type of computation with population codes, we need a probabilistic model of how the activity is generated, p(r|s), i.e., we need to model the encoding process.
Activity distribution
[Figure: distributions of activity P(ri|s = -60), P(ri|s = 0), and P(ri|s = 60)]
Tuning curves and noise
The activity (# of spikes per second) of a neuron can be written as:
ri = fi(s) + ni(s)
where fi(s) is the mean activity of the neuron (the tuning curve) and ni is a noise with zero mean. If the noise is Gaussian, then:
ni(s) ~ N(0, σi(s))
Probability distributions and activity
• The noise is a random variable which can be characterized by a conditional probability distribution, P(ni|s).
• The distributions of the activity, P(ri|s), and the noise differ only by their means (E[ni]=0, E[ri]=fi(s)).
Examples of activity distributions
Gaussian noise with fixed variance:
P(ri|s) = 1/√(2πσ²) · exp( −(fi(s) − ri)² / (2σ²) )
Gaussian noise with variance equal to the mean:
P(ri|s) = 1/√(2πσ²fi(s)) · exp( −(fi(s) − ri)² / (2σ²fi(s)) )
Poisson distribution:
P(ri|s) = e^(−fi(s)) fi(s)^(ri) / ri!
The variance of a Poisson distribution is equal to its mean.
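To make the three noise models concrete, here is a minimal numpy sketch; the function names, tuning-curve parameters, and population layout are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def tuning(s, prefs, amp=50.0, width=20.0):
    """Gaussian tuning curves f_i(s) for neurons with preferred values `prefs`."""
    return amp * np.exp(-(s - prefs) ** 2 / (2 * width ** 2))

prefs = np.linspace(-100, 100, 21)   # preferred directions (deg), as in the figures
f = tuning(10.0, prefs)              # mean activities f_i(s) for a stimulus s = 10 deg

# The three noise models above:
r_fixed  = f + rng.normal(0.0, 5.0, size=f.shape)  # Gaussian, fixed variance (sigma = 5)
r_scaled = f + rng.normal(0.0, np.sqrt(f))         # Gaussian, variance equal to the mean
r_pois   = rng.poisson(f)                          # Poisson (variance = mean)
```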
Comparison of Poisson vs Gaussian noise with variance equal to the mean
[Figure: probability vs. activity (spikes/sec) for the two distributions]
Population of neurons
Gaussian noise with fixed variance:
P(r|s) = Πi P(ri|s) = Πi 1/√(2πσ²) · exp( −(fi(s) − ri)² / (2σ²) )
Population of neurons
Gaussian noise with arbitrary covariance matrix Σ:
P(r|s) = (1/Z) exp( −½ (r − f(s))ᵀ Σ⁻¹ (r − f(s)) )
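As a sketch of the full-covariance case, the log-likelihood log P(r|s) can be evaluated directly with scipy; the covariance structure below is made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

f = np.array([5.0, 20.0, 45.0, 20.0, 5.0])                   # f(s): mean activities at some s
Sigma = np.diag(f) + 0.2 * np.outer(np.sqrt(f), np.sqrt(f))  # assumed correlated noise

r = multivariate_normal.rvs(mean=f, cov=Sigma, random_state=0)  # one draw of r given s
logp = multivariate_normal.logpdf(r, mean=f, cov=Sigma)         # log P(r|s)
```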
Outline
• Definition
• The encoding process
• Decoding population codes
• Quantifying information: Shannon and Fisher information
• Basis functions and optimal computation
Population Codes
[Figure: tuning curves (activity vs. direction, deg) and the pattern of activity r (activity vs. preferred direction, deg) evoked by an unknown stimulus s?]
Nature of the problem
In response to a stimulus with unknown value s, you observe a pattern of activity r. What can you say about s given r?
Bayesian approach: recover p(s|r) (the posterior distribution)
Estimation theory: come up with a single-value estimate ŝ from r
Estimation Theory
[Figure: s → Encoder (nervous system) → activity vector r (activity vs. preferred orientation) → Decoder → ŝ]
[Figure: across trials 1, 2, …, 200, the same s is encoded into different activity patterns r1, r2, …, r200 (activity vs. preferred retinal location), and the decoder returns a different estimate ŝ1, ŝ2, …, ŝ200 on each trial]
Estimation Theory
If E[ŝ|s] = s, the estimate is said to be unbiased.
If σŝ² is as small as possible, the estimate is said to be efficient.
Estimation theory
• A common measure of decoding performance is the mean square error between the estimate and the true value:
MSE = E[ (ŝ − s)² | s ]
• This error can be decomposed as:
MSE = E[ (ŝ − E[ŝ|s])² | s ] + ( E[ŝ|s] − s )²
    = variance of ŝ + bias²
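The decomposition is an algebraic identity, which a quick numerical check makes tangible; the shrunk-mean estimator here is an arbitrary example chosen to have a nonzero bias:

```python
import numpy as np

rng = np.random.default_rng(1)
s_true = 2.0

# An intentionally biased estimator: the sample mean of 10 noisy observations, shrunk by 0.9
estimates = 0.9 * (s_true + rng.normal(0.0, 1.0, size=(100_000, 10)).mean(axis=1))

mse  = np.mean((estimates - s_true) ** 2)
var  = np.var(estimates)
bias = np.mean(estimates) - s_true
assert np.isclose(mse, var + bias ** 2)   # MSE = variance + bias^2, exactly
```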
Efficient Estimators
The smallest achievable variance for an unbiased estimator is known as the Cramér-Rao bound, σCR².
An efficient estimator is such that σ²(ŝ|s) = σCR².
In general: σ²(ŝ|s) ≥ σCR².
Estimation Theory
[Figure: s → Encoder (nervous system) → activity vector r → Decoder → ŝ]
Examples of decoders
Voting Methods
Optimal Linear Estimator:
ŝ = Σi wi ri
Linear Estimators
Given paired data (x1, y1), …, (xn, yn), fit y* = ax + b by minimizing the squared error:
E = Σi (a xi + b − yi)²
Setting ∂E/∂b = 0:
Σi (a xi + b − yi) = 0  ⟹  b = ȳ − a x̄
So with centered data (x̄ = ȳ = 0), b = 0 and the fit reduces to y* = ax.
Linear Estimators
With centered data, minimize E = Σi (a xi − yi)². Setting ∂E/∂a = 0:
Σi xi (a xi − yi) = 0  ⟹  a = Σi xi yi / Σi xi² = Cxy / σx²
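A small check that the covariance formula matches ordinary least squares on centered data; the slope of 3 and the noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(size=1000)   # data generated with slope 3 plus noise
x, y = x - x.mean(), y - y.mean()     # center the data, so the intercept b vanishes

a = np.sum(x * y) / np.sum(x ** 2)    # a = C_xy / sigma_x^2
a_ls = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
assert np.isclose(a, a_ls)            # identical to the least-squares solution
```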
Linear Estimators
In the multivariate case, X is an m×n matrix holding n samples of an m-dimensional input (one column per sample) and Y is the corresponding p×n output matrix. We seek the p×m matrix W minimizing:
E = Σi ||W xi − yi||²
The solution is:
W = CYX CXX⁻¹
where CXX is the covariance matrix of X and CYX is the cross-covariance of Y and X. For a single output and independent inputs, each weight reduces to:
wi = Cxiy / σxi²
X and Y must be zero mean
Trust cells that have small variances and large covariances
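A sketch of the multivariate solution on synthetic data (the dimensions and noise level are arbitrary); W = CYX CXX⁻¹ recovers the generating weights:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 5000, 4, 2                      # samples, input dim, output dim
X = rng.normal(size=(m, n))               # zero-mean inputs, one column per sample
W_true = rng.normal(size=(p, m))
Y = W_true @ X + 0.1 * rng.normal(size=(p, n))

C_xx = X @ X.T / n                        # input covariance C_XX
C_yx = Y @ X.T / n                        # cross-covariance C_YX
W = C_yx @ np.linalg.inv(C_xx)            # optimal linear estimator
assert np.allclose(W, W_true, atol=0.05)  # close to the generating weights
```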
Voting Methods
Optimal Linear Estimator:
ŝ = Σi wi ri = Wᵀ r, with W = Crr⁻¹ Crs
Voting Methods
Optimal Linear Estimator: ŝ = Σi wi ri = Wᵀ r, with W = Crr⁻¹ Crs
Center of Mass:
ŝ = Σi ri si / Σj rj
Linear in ri / Σj rj, with the weights set to the preferred values si.
Center of Mass/Population Vector
• The center of mass is optimal (unbiased and efficient) iff the tuning curves are Gaussian with a zero baseline and uniformly distributed, and the noise follows a Poisson distribution
• In general, the center of mass has a large bias and a large variance
Voting Methods
Optimal Linear Estimator: ŝ = Σi wi ri = Wᵀ r, with W = Crr⁻¹ Crs
Center of Mass: ŝ = Σi ri si / Σi ri
Population Vector:
P̂ = Σi ri Pi
ŝ = angle(P̂)
where Pi is the unit vector pointing in neuron i's preferred direction.
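A minimal sketch of both voting decoders, assuming cosine tuning with a baseline and Poisson noise (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
prefs = np.deg2rad(np.linspace(-180, 180, 64, endpoint=False))  # preferred directions
s = np.deg2rad(37.0)                                            # true direction

f = 20.0 + 15.0 * np.cos(s - prefs)       # cosine tuning with a baseline
r = rng.poisson(f)                        # Poisson spike counts

# Center of mass (only sensible for a linear variable; shown for comparison)
s_com = np.sum(r * prefs) / np.sum(r)

# Population vector: activity-weighted sum of preferred-direction unit vectors
P = np.sum(r[:, None] * np.stack([np.cos(prefs), np.sin(prefs)], axis=1), axis=0)
s_pv = np.arctan2(P[1], P[0])
print(np.rad2deg(s_pv))                   # close to 37 deg
```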
Population Vector
[Figure: each neuron contributes a vector ri Pi along its preferred direction; their sum P̂ points in the estimated direction s]
Voting Methods
Optimal Linear Estimator: ŝ = Σi wi ri = Wᵀ r, with W = Crr⁻¹ Crs
Population Vector: P̂ = Σi ri Pi, ŝ = angle(P̂)
Linear in ri with weights set to Pi, followed by a nonlinear step (the angle).
Population Vector
P̂ = Σi ri Pi = Wᵀ r, with W = [P1 … Pn]ᵀ
Is W equal to Crr⁻¹ Crs? Typically not: the population vector is not the optimal linear estimator.
Population Vector
• Population vector is optimal iff the tuning curves are cosine-shaped and uniformly distributed, and the noise follows a normal distribution with fixed variance
• In most cases, the population vector is biased and has a large variance
Maximum Likelihood
The maximum likelihood estimate is the value of s maximizing the likelihood P(r|s). Therefore, we seek ŝML such that:
ŝML = argmaxs P(r|s)
ŝML is unbiased and efficient (asymptotically).
[Figure: the noise distribution determines ŝML]
Maximum Likelihood
[Figure: the tuning curves (activity vs. direction, deg) define a template; the observed pattern of activity r (activity vs. preferred direction, deg) is compared against it]
Maximum Likelihood
[Figure: the template (activity vs. preferred direction, deg) is slid across the observed pattern; the position that best matches the pattern gives ŝML]
ML and template matching
Maximum likelihood is a template-matching procedure, BUT the metric used is not always the Euclidean distance: it depends on the noise distribution.
Maximum Likelihood
The maximum likelihood estimate is the value of s maximizing the likelihood P(r|s). Therefore, we seek ŝML such that:
ŝML = argmaxs P(r|s)
Maximum Likelihood
If the noise is Gaussian and independent:
P(r|s) ∝ Πi exp( −(ri − fi(s))² / (2σ²) )
Therefore:
log P(r|s) = −Σi (ri − fi(s))² / (2σ²) + const
and the estimate is given by:
ŝ = argmins Σi (ri − fi(s))² / (2σ²)
Distance measure: template matching with the Euclidean distance.
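A sketch of ML decoding as template matching under independent Gaussian noise with fixed variance; the tuning-curve shape and search grid are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
prefs = np.linspace(-180, 180, 64)

def template(s):
    """Expected population pattern f(s): Gaussian tuning over preferred direction."""
    return 40.0 * np.exp(-(prefs - s) ** 2 / (2 * 30.0 ** 2))

r = template(25.0) + rng.normal(0.0, 3.0, size=prefs.shape)  # noisy pattern, s_true = 25

# Slide the template over a grid of candidate s and minimize the Euclidean distance
grid = np.linspace(-90, 90, 1801)
s_ml = grid[np.argmin([np.sum((r - template(s)) ** 2) for s in grid])]
print(s_ml)   # close to 25
```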
Maximum Likelihood
[Figure: sliding the template to minimize Σi (ri − fi(s))² over the pattern of activity yields ŝML]
Gaussian noise with variance proportional to the mean
If the noise is Gaussian with variance proportional to the mean, the distance being minimized changes to:
ŝ = argmins Σi (ri − fi(s))² / (2σ²fi(s))
Data points with small variances are weighted more heavily.
Bayesian approach
We want to recover P(s|r). Using Bayes theorem, we have:
P(s|r) = P(r|s) P(s) / P(r)
Bayesian approach
• The prior P(s) corresponds to any knowledge we may have about s before we get to see any activity.
• Note: the Bayesian approach does not reduce to the use of a prior…
Bayesian approach
Once we have P(s|r), we can proceed in two different ways. We can keep this distribution for Bayesian inferences (as we would do in a Bayesian network), or we can make a decision about s. For instance, we can estimate s as the value that maximizes P(s|r): this is known as the maximum a posteriori (MAP) estimate. For a flat prior, ML and MAP are equivalent.
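A sketch of the full Bayesian computation on a grid, assuming independent Poisson noise and a flat prior (so the MAP coincides with ML); the tuning parameters are illustrative:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(6)
prefs = np.linspace(-180, 180, 64)
tuning = lambda s: 2.0 + 30.0 * np.exp(-(prefs - s) ** 2 / (2 * 25.0 ** 2))

r = rng.poisson(tuning(-40.0))                       # observed counts, s_true = -40

grid = np.linspace(-90, 90, 721)
log_post = np.array([poisson.logpmf(r, tuning(s)).sum() for s in grid])  # flat prior
post = np.exp(log_post - log_post.max())
post /= post.sum()                                   # posterior P(s|r) on the grid
s_map = grid[np.argmax(post)]                        # MAP estimate (= ML here)
```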
Bayesian approach
Limitations: the Bayesian approach and ML require a lot of data (for the Gaussian model below, estimating P(r|s) requires at least n + n(n+1)/2 parameters: the means plus the covariance entries)…
P(r|s) = (1/Z) exp( −½ (r − f(s))ᵀ Σ⁻¹ (r − f(s)) )
Bayesian approach
Limitations: the Bayesian approach and ML require a lot of data (estimating P(r|s) requires at least O(n²) parameters; for n = 100, n² = 10,000)…
Alternative: estimate P(s|r) directly using a nonlinear estimate (if s is a scalar and P(s|r) is Gaussian, we only need to estimate two parameters!).
Outline
• Definition
• The encoding process
• Decoding population codes
• Quantifying information: Shannon and Fisher information
• Basis functions and optimal computation
Fisher Information
Fisher information, I, sets the Cramér-Rao bound:
σCR² = 1/I
and it is equal to:
I = E[ −∂² ln P(r|s) / ∂s² ]
where P(r|s) is the distribution of the neuronal noise.
Fisher Information
For independent Poisson noise:
P(r|s) = Πi e^(−fi(s)) fi(s)^(ri) / ri!
ln P(r|s) = Σi [ ri ln fi(s) − fi(s) − ln ri! ]
∂ ln P(r|s)/∂s = Σi [ ri fi′(s)/fi(s) − fi′(s) ]
∂² ln P(r|s)/∂s² = Σi [ ri ( fi″(s) fi(s) − fi′(s)² ) / fi(s)² − fi″(s) ]
Taking the expectation with E[ri] = fi(s), the fi″ terms cancel and:
I(s) = E[ −∂² ln P(r|s)/∂s² ] = Σi fi′(s)² / fi(s)
Fisher Information
• For one neuron with Poisson noise:
I(s) = f′(s)² / f(s)
• For n independent neurons:
I(s) = Σi fi′(s)² / fi(s)
The more neurons, the better! Small variance is good! Large slope is good!
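The formula is easy to evaluate numerically; a sketch for a population of Gaussian tuning curves with Poisson noise (parameters assumed), using central differences for the slopes fi′(s):

```python
import numpy as np

prefs = np.linspace(-180, 180, 64)
amp, width = 40.0, 30.0

def f(s):
    """Gaussian tuning curves f_i(s)."""
    return amp * np.exp(-(prefs - s) ** 2 / (2 * width ** 2))

def fisher_poisson(s, ds=1e-3):
    """I(s) = sum_i f_i'(s)^2 / f_i(s), slopes via central differences."""
    fprime = (f(s + ds) - f(s - ds)) / (2 * ds)
    return np.sum(fprime ** 2 / f(s))

I = fisher_poisson(0.0)
print(I, 1.0 / np.sqrt(I))   # Fisher information and the CR bound on the std of s-hat
```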
Fisher Information and Tuning Curves
• Fisher information is maximum where the slope is maximum
• This is consistent with adaptation experiments
• Fisher information adds up for independent neurons (unlike Shannon information!)
Fisher Information
• In 1D, Fisher information decreases as the width of the tuning curves increases
• In 2D, Fisher information does not depend on the width of the tuning curve
• In 3D and above, Fisher information increases as the width of the tuning curves increases
• WARNING: this is true for independent Gaussian noise.
Ideal observer
The discrimination threshold of an ideal observer, δs, is proportional to σCR, the standard deviation set by the Cramér-Rao bound:
δs ∝ σCR
In other words, an efficient estimator is an ideal observer.
• An ideal observer is an observer that can recover all the Fisher information in the activity (easy link between Fisher information and behavioral performance)
• If all distributions are Gaussian, Fisher information is the same as Shannon information.
Population Vector and Fisher Information
[Figure: 1/Fisher information (the CR bound) vs. the variance of the population vector estimate; the population vector variance lies well above the bound]
Population vector should NEVER be used to estimate information content! The indirect method is prone to severe problems…
Outline
• Definition
• The encoding process
• Decoding population codes
• Quantifying information: Shannon and Fisher information
• Basis functions and optimal computation
• So far we have only talked about decoding from the point of view of an experimentalist.
• How is that relevant to neural computation? Neurons do not decode, they compute!
• What kind of computation can we perform with population codes?
Computing functions
• If we denote the sensory input as a vector S and the motor command as M, a sensorimotor transformation is a mapping from S to M:
M=f(S)
where f is typically a nonlinear function
Example
• 2-joint arm: the hand position X = (x, y) is a nonlinear function of the joint angles θ = (θ1, θ2):
x = L1 cos θ1 + L2 cos(θ1 + θ2)
y = L1 sin θ1 + L2 sin(θ1 + θ2)
X = f(θ), and the inverse mapping θ = f⁻¹(X) is nonlinear as well.
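A direct transcription of the forward kinematics; the link lengths L1, L2 are made-up values:

```python
import numpy as np

def forward_kinematics(theta1, theta2, L1=0.30, L2=0.25):
    """Hand position X = (x, y) of a 2-joint arm with link lengths L1, L2 (meters)."""
    x = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    y = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return x, y

print(forward_kinematics(np.pi / 4, np.pi / 6))   # a nonlinear map from angles to position
```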
Basis functions
Most nonlinear functions can be approximated by linear combinations of basis functions:
Ex: Fourier Transform
y = f(x) ≈ Σi ci sin(ωi x)
Ex: Radial Basis Functions
y = f(x) ≈ Σi ci exp( −(x − xi)² / (2σi²) )
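Because the approximation is linear in the coefficients ci, fitting them is just least squares; a sketch with Gaussian (radial) basis functions and an arbitrary target function:

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 200)
y = np.sin(2 * x) * np.exp(-x ** 2 / 2)        # an arbitrary nonlinear target f(x)

centers = np.linspace(-np.pi, np.pi, 15)       # basis centers x_i (assumed spacing)
sigma = 0.5
B = np.exp(-(x[:, None] - centers) ** 2 / (2 * sigma ** 2))   # basis matrix B[k, i]

c = np.linalg.lstsq(B, y, rcond=None)[0]       # coefficients c_i: a linear problem
print(np.max(np.abs(B @ c - y)))               # small approximation error
```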
Basis Functions
[Figure: left, a target function (activity vs. direction, deg); right, Gaussian basis functions (activity vs. preferred direction, deg) whose weighted sum approximates it]
y = f(x) ≈ Σi ci exp( −(x − xi)² / (2σi²) )
Basis Functions
• A basis function decomposition is like a three-layer network x → h → y. The intermediate units are the basis functions:
y = Σi ci hi = Σi ci g( Σj wij xj ) ≈ f(x)
hi = g( Σj wij xj )
Basis Functions
• Networks with sigmoidal units are also basis function networks:
hi = s( Σj wij xj )
where s(·) is a sigmoid.
Basis Function Layer
[Figure: inputs X and Y feed a layer of basis function units; different linear combinations of the same basis layer yield different output functions Z of X and Y]
Basis Functions
• Decompose the computation of M = f(S,P) in two stages:
1. Compute basis functions of S and P
2. Combine the basis functions linearly to obtain the motor command:
M = Σi ci Bi(S, P)
Basis Functions
• Note that M can be a population code, e.g. the components of that vector could correspond to units with bell-shaped tuning curves:
Mj = Gj(M) = Σi cji Bi(S, P)
Example: computing the head-centered location of an object from its retinal location
[Figure: fixation point, gaze, retinal location Xr, eye position Xe, head-centered location Xa]
Xa = Xr + Xe
Basis Functions
ai = Gi(xa) = Gi(xr + xe) = Σj cij hj(xr, xe)
where the hj(xr, xe) are basis functions of retinal location and eye position.
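A sketch of the two-stage scheme for xa = xr + xe, with a product-of-Gaussians basis layer (a common choice, assumed here rather than taken from the slides) and a simple readout along the diagonal Ri + Ej:

```python
import numpy as np

R = np.linspace(-60, 60, 25)      # preferred retinal locations R_i
E = np.linspace(-40, 40, 17)      # preferred eye positions E_j
sigma = 15.0

def basis(x_r, x_e):
    """Basis layer h_ij(x_r, x_e) = g(x_r - R_i) * g(x_e - E_j): gain-modulated units."""
    g_r = np.exp(-(x_r - R) ** 2 / (2 * sigma ** 2))
    g_e = np.exp(-(x_e - E) ** 2 / (2 * sigma ** 2))
    return np.outer(g_r, g_e)

h = basis(x_r=20.0, x_e=-10.0)
H = R[:, None] + E[None, :]       # head-centered location H_k = R_i + E_j per unit
print(np.sum(h * H) / np.sum(h))  # center-of-mass readout, close to 20 + (-10) = 10
```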
Hk = Ri + Ej
[Figure: basis function units combine a map of preferred retinal locations (Ri) with a map of preferred eye positions (Ej) to produce a map of preferred head-centered locations]
Gain Field
[Figure: eye-centered tuning curves (activity vs. eye-centered location) whose amplitude is modulated by eye position, E = −20°, 0°, 20°: a gain field]
Hk = Ri + Ej
[Figure: the same basis function units also produce partially shifting receptive fields: tuning curves (activity vs. eye-centered location) that shift partway with eye position, E = −20°, 0°, 20°]
Visual receptive fields in VIP are partially shifting with the eye (Duhamel, Bremmer, Ben Hamed and Graf, 1997)
[Figure: fixation point, head-centered location, and retinotopic location on a screen]
Summary
• Definition: population codes involve the concerted activity of large populations of neurons
• The encoding process: the activity of the neurons can be formalized as the sum of a tuning curve plus noise
Summary
• Decoding population codes: optimal decoding can be performed with maximum likelihood estimation (ŝML) or Bayesian inference (p(s|r))
• Quantifying information: Fisher information provides an upper bound on the amount of information available in a population code
Summary
• Basis functions and optimal computation
• Population codes can be used to perform arbitrary nonlinear transformations because they provide basis sets.
Where do we go from here?
Computation and Bayesian inferences
• Knill, Koerding, Todorov: experimental evidence for Bayesian inferences in humans
• Shadlen: neural basis of Bayesian inferences
• Latham, Olshausen: Bayesian inferences in recurrent neural nets
Where do we go from here?
Other encoding hypotheses: probabilistic interpretations
• Zemel, Rao: instead of ri = fi(s) + ni, the activity encodes a distribution, e.g. ri = log P(s) + C