BCS547
Neural Decoding
Population Code
[Figure: two panels. Left: tuning curves (activity vs. direction, deg). Right: pattern of activity r (activity vs. preferred direction, deg) evoked by an unknown stimulus s.]
Nature of the problem
In response to a stimulus with unknown orientation s, you observe a pattern of activity r. What can you say about s given r?
Bayesian approach: recover p(s|r) (the posterior distribution)
Estimation theory: come up with a single-value estimate $\hat{s}$ from r
Maximum Likelihood
[Figure: tuning curves (activity vs. direction, deg) and the resulting pattern of activity r (activity vs. preferred direction, deg).]
Maximum Likelihood
[Figure: a template overlaid on the pattern of activity (activity vs. preferred direction, deg), with the estimate $\hat{s}_{ML}$ marked.]
Maximum Likelihood
The maximum likelihood estimate is the value of s that maximizes the likelihood p(r|s). We therefore seek $\hat{s}_{ML}$ such that:
$\hat{s}_{ML} = \arg\max_s P(\mathbf{r}|s)$
where $P(\mathbf{r}|s)$ is the noise distribution.
[Figure: activity distributions of single neurons, e.g. $P(r_i|s=0)$ and $P(r_i|s=-60)$.]
Maximum Likelihood
The maximum likelihood estimate is the value of s that maximizes the likelihood p(r|s), where p(r|s) is the noise distribution:
$\hat{s}_{ML} = \arg\max_s P(\mathbf{r}|s)$
$\hat{s}_{ML}$ is unbiased and efficient.
Estimation Theory
[Figure: s → Encoder (nervous system) → activity vector r (plotted against preferred orientation) → Decoder → $\hat{s}$. Repeating this over trials 1, 2, ..., 200 gives activity vectors $\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_{200}$ and estimates $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_{200}$.]
Estimation Theory
If $E[\hat{s}|s] = s$, the estimate is said to be unbiased.
If $\sigma^2_{\hat{s}|s}$ is as small as possible, the estimate is said to be efficient.
[Figure: s → Encoder (nervous system) → activity vector r (plotted against preferred orientation) → Decoder → $\hat{s}$.]
Estimation theory
• A common measure of decoding performance is the mean square error between the estimate and the true value:
$\mathrm{MSE} = E[(\hat{s} - s)^2 \,|\, s]$
• This error can be decomposed as:
$\mathrm{MSE} = \sigma^2_{\hat{s}|s} + (E[\hat{s}|s] - s)^2 = \sigma^2_{\hat{s}|s} + \mathrm{bias}^2$
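The decomposition of the mean square error into variance plus squared bias can be checked numerically. The sketch below is only an illustration: the estimator (a shrunken, noisy copy of the true value) and all numbers are assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
s_true = 10.0

# Hypothetical estimator: shrinks the true value by 10% and adds Gaussian noise.
s_hat = 0.9 * s_true + rng.normal(0.0, 2.0, size=100_000)

mse = np.mean((s_hat - s_true) ** 2)   # E[(s_hat - s)^2 | s]
variance = np.var(s_hat)               # sigma^2_{s_hat|s}
bias = np.mean(s_hat) - s_true         # E[s_hat | s] - s

print("MSE:", mse)
print("variance + bias^2:", variance + bias ** 2)
```

With sample averages in place of expectations the two printed numbers agree up to floating-point rounding, because the identity is algebraic.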
Efficient Estimators
The smallest achievable variance for an unbiased estimator is known as the Cramér-Rao bound, $\sigma^2_{CR}$.
In general:
$\sigma^2_{\hat{s}|s} \geq \sigma^2_{CR}$
An efficient estimator is such that:
$\sigma^2_{\hat{s}|s} = \sigma^2_{CR}$
Fisher Information
Fisher information is defined as:
$I(s) = -E\left[\frac{\partial^2 \ln P(\mathbf{r}|s)}{\partial s^2}\right]$
where $P(\mathbf{r}|s)$ is the distribution of the neuronal noise. The Cramér-Rao bound is equal to:
$\sigma^2_{CR} = \frac{1}{I(s)}$
Fisher Information
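$I(s) = -E\left[\frac{\partial^2}{\partial s^2} \ln P(\mathbf{r}|s)\right]$
For n independent neurons with Poisson noise and tuning curves $f_i(s)$:
$P(\mathbf{r}|s) = \prod_{i=1}^{n} P(r_i = k_i|s) = \prod_{i=1}^{n} \frac{f_i(s)^{k_i} e^{-f_i(s)}}{k_i!}$
$\ln P(\mathbf{r}|s) = \sum_{i=1}^{n} \left[ k_i \ln f_i(s) - f_i(s) - \ln k_i! \right]$
$\frac{\partial}{\partial s} \ln P(\mathbf{r}|s) = \sum_{i=1}^{n} \left[ k_i \frac{f_i'(s)}{f_i(s)} - f_i'(s) \right]$
$\frac{\partial^2}{\partial s^2} \ln P(\mathbf{r}|s) = \sum_{i=1}^{n} \left[ k_i \frac{f_i''(s) f_i(s) - f_i'(s)^2}{f_i(s)^2} - f_i''(s) \right]$
Taking the expectation with $E[k_i] = f_i(s)$:
$E\left[\frac{\partial^2}{\partial s^2} \ln P(\mathbf{r}|s)\right] = \sum_{i=1}^{n} \left[ f_i(s) \frac{f_i''(s) f_i(s) - f_i'(s)^2}{f_i(s)^2} - f_i''(s) \right] = -\sum_{i=1}^{n} \frac{f_i'(s)^2}{f_i(s)}$
$I(s) = \sum_{i=1}^{n} \frac{f_i'(s)^2}{f_i(s)}$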
Fisher Information
• For one neuron with Poisson noise:
$I(s) = \frac{f_i'(s)^2}{f_i(s)}$
• For n independent neurons:
$I(s) = \sum_{i=1}^{n} \frac{f_i'(s)^2}{f_i(s)}$
The more neurons, the better! Small variance is good! Large slope is good!
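As an illustration of the Poisson result, the sketch below evaluates $I(s) = \sum_i f_i'(s)^2 / f_i(s)$ for two populations that differ only in the number of neurons. The Gaussian tuning curves and their parameters (`gain`, `width`, the preferred directions) are assumptions chosen for the example.

```python
import numpy as np

def tuning(s, preferred, gain=50.0, width=20.0):
    """Hypothetical Gaussian tuning curves f_i(s)."""
    return gain * np.exp(-(s - preferred) ** 2 / (2 * width ** 2))

def tuning_slope(s, preferred, gain=50.0, width=20.0):
    """Analytic derivative f_i'(s) of the tuning curves above."""
    return -(s - preferred) / width ** 2 * tuning(s, preferred, gain, width)

def fisher_information(s, preferred):
    """I(s) = sum_i f_i'(s)^2 / f_i(s) for independent Poisson neurons."""
    f = tuning(s, preferred)
    df = tuning_slope(s, preferred)
    return np.sum(df ** 2 / f)

preferred_small = np.linspace(-100, 100, 21)   # 21 neurons
preferred_large = np.linspace(-100, 100, 41)   # 41 neurons

I_small = fisher_information(0.0, preferred_small)
I_large = fisher_information(0.0, preferred_large)

print("I(s=0), 21 neurons:", I_small, " CR bound:", 1.0 / I_small)
print("I(s=0), 41 neurons:", I_large, " CR bound:", 1.0 / I_large)
```

A neuron whose preferred direction equals s sits at the peak of its tuning curve, where the slope is zero, so it contributes nothing to $I(s)$; the information comes from the flanks.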
Fisher Information and Tuning Curves
• Fisher information is maximum where the slope is maximum
• This is consistent with adaptation experiments
Fisher Information
• In 1D, Fisher information decreases as the width of the tuning curves increases
• In 2D, Fisher information does not depend on the width of the tuning curve
• In 3D and above, Fisher information increases as the width of the tuning curves increases
• WARNING: this is true for independent Gaussian noise.
Ideal observer
The discrimination threshold of an ideal observer, $\delta s$, is proportional to the standard deviation given by the Cramér-Rao bound:
$\delta s \propto \sigma_{CR}$
In other words, an efficient estimator is an ideal observer.
• An ideal observer is an observer that can recover all the Fisher information in the activity (easy link between Fisher information and behavioral performance)
• If all distributions are Gaussian, Fisher information is the same as Shannon information.
Estimation theory
Other examples of decoders
[Figure: s → Encoder (nervous system) → activity vector r (plotted against preferred orientation) → Decoder → $\hat{s}$.]
Voting Methods
Optimal Linear Estimator
$\hat{s} = \sum_i w_i r_i$
Linear Estimators
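Data: $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$
Model: $y^* = ax + b$
$E = \frac{1}{2} \sum_{i=1}^{n} (y_i^* - y_i)^2 = \frac{1}{2} \sum_{i=1}^{n} (a x_i + b - y_i)^2$
$\frac{\partial E}{\partial b} = \sum_{i=1}^{n} (a x_i + b - y_i) = 0$
$b = \frac{1}{n} \sum_{i=1}^{n} y_i - a \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{y} - a \bar{x}$
$y^* - \bar{y} = a (x - \bar{x})$
For zero-mean data this reduces to $y^* = ax$.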
Linear Estimators
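$y^* = ax$
$E = \frac{1}{2} \sum_{i=1}^{n} (y_i^* - y_i)^2 = \frac{1}{2} \sum_{i=1}^{n} (a x_i - y_i)^2$
$\frac{\partial E}{\partial a} = \sum_{i=1}^{n} x_i (a x_i - y_i) = 0$
$a = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \frac{C_{xy}}{\sigma_x^2}$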
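The one-dimensional result, slope equals covariance over variance for zero-mean data, can be checked against a standard least-squares fit. The data below are synthetic and the true slope of 3 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): y is a noisy linear function of x.
x = rng.normal(0.0, 2.0, size=5000)
y = 3.0 * x + rng.normal(0.0, 1.0, size=5000)

# The derivation assumes zero-mean data, so center both variables first.
x = x - x.mean()
y = y - y.mean()

# a = sum_i x_i y_i / sum_i x_i^2 = C_xy / sigma_x^2
a = np.sum(x * y) / np.sum(x ** 2)

# Reference: least-squares line fit; the first coefficient is the slope.
a_ref = np.polyfit(x, y, 1)[0]

print("closed form:", a, " polyfit:", a_ref)
```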
Linear Estimators
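$X = [\mathbf{x}_1 \ldots \mathbf{x}_n]$ is an m × n matrix and $Y = [\mathbf{y}_1 \ldots \mathbf{y}_n]$ is a p × n matrix (one column per data point).
$\mathbf{y}^* = W^T \mathbf{x}$, where $W^T$ is p × m
$E = \frac{1}{2} \sum_{i=1}^{n} \|\mathbf{y}_i^* - \mathbf{y}_i\|^2$
$W = C_{XX}^{-1} C_{XY}$ (m × p = (m × m)(m × p)), with $C_{XX} = X X^T$ and $C_{XY} = X Y^T$
When y is a scalar and the inputs are uncorrelated, this reduces to:
$y^* = \sum_{i=1}^{m} \frac{C_{x_i y}}{\sigma^2_{x_i}} x_i$
X and Y must be zero mean.
Trust cells that have small variances and large covariances.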
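A minimal sketch of the matrix solution $W = C_{XX}^{-1} C_{XY}$ on synthetic data. The dimensions, the ground-truth weights `W_true` and the noise level are all assumptions made for the example; `np.linalg.solve` is used instead of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, n = 5, 2, 20000   # input dimension, output dimension, number of data points

# Synthetic data (hypothetical): Y is a noisy linear function of X.
W_true = rng.normal(size=(m, p))
X = rng.normal(size=(m, n))
Y = W_true.T @ X + 0.1 * rng.normal(size=(p, n))

# X and Y must be zero mean.
X = X - X.mean(axis=1, keepdims=True)
Y = Y - Y.mean(axis=1, keepdims=True)

C_XX = X @ X.T / n                 # m x m
C_XY = X @ Y.T / n                 # m x p
W = np.linalg.solve(C_XX, C_XY)    # W = C_XX^{-1} C_XY, m x p

Y_hat = W.T @ X                    # y* = W^T x for every data point
print("max weight error:", np.max(np.abs(W - W_true)))
```

Dividing both covariance matrices by n does not change W, so the result matches the unnormalized form $C_{XX} = XX^T$, $C_{XY} = XY^T$.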
Voting Methods
Optimal Linear Estimator
$\hat{s} = \sum_i w_i r_i = W^T \mathbf{r}$, with $W = C_{rr}^{-1} C_{rs}$
Voting Methods
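Optimal Linear Estimator
$\hat{s} = \sum_i w_i r_i = W^T \mathbf{r}$, with $W = C_{rr}^{-1} C_{rs}$
Center of Mass
$\hat{s} = \frac{\sum_i r_i s_i}{\sum_j r_j} = \sum_i \frac{r_i}{\sum_j r_j} s_i$
Linear in $r_i / \sum_j r_j$
Weights set to $s_i$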
Center of Mass/Population Vector
• The center of mass is optimal (unbiased and efficient) iff the tuning curves are Gaussian with a zero baseline and uniformly distributed, and the noise follows a Poisson distribution
• In general, the center of mass has a large bias and a large variance
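A minimal sketch of the center of mass decoder, run under the favorable conditions listed above (Gaussian tuning curves with zero baseline, evenly spaced preferred values, Poisson noise). The gain, width and stimulus value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

preferred = np.linspace(-100, 100, 41)   # preferred values s_i, evenly spaced
s_true = 20.0

# Hypothetical Gaussian tuning curves with zero baseline.
rates = 50.0 * np.exp(-(s_true - preferred) ** 2 / (2 * 20.0 ** 2))
r = rng.poisson(rates)                   # Poisson spike counts

# Center of mass: s_hat = sum_i r_i s_i / sum_j r_j
s_com = np.sum(r * preferred) / np.sum(r)
print("true s:", s_true, " center of mass estimate:", s_com)
```

Adding a constant baseline to `rates` pulls the estimate toward the middle of the preferred-value range, which is one way to see the bias mentioned in the second bullet.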
Voting Methods
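Optimal Linear Estimator
$\hat{s} = \sum_i w_i r_i = W^T \mathbf{r}$, with $W = C_{rr}^{-1} C_{rs}$
Center of Mass
$\hat{s} = \frac{\sum_i r_i s_i}{\sum_i r_i}$
Population Vector
$\hat{\mathbf{P}} = \sum_i r_i \mathbf{P}_i$ (linear in $r_i$, weights set to $\mathbf{P}_i$)
$\hat{s} = \mathrm{angle}(\hat{\mathbf{P}})$ (nonlinear step)
Population Vector
[Figure: the vectors $r_i \mathbf{P}_i$ sum to the population vector $\hat{\mathbf{P}}$, shown together with the stimulus direction s.]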
Population Vector
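$\hat{\mathbf{P}} = \sum_{i=1}^{m} r_i \mathbf{P}_i = \begin{pmatrix} p_{11} & \cdots & p_{1m} \\ p_{21} & \cdots & p_{2m} \end{pmatrix} \begin{pmatrix} r_1 \\ \vdots \\ r_m \end{pmatrix} = W^T \mathbf{r}$
$W = C_{rr}^{-1} C_{rP}$ ?
Typically, the population vector is not the optimal linear estimator.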
Population Vector
• The population vector is optimal iff the tuning curves are cosine and uniformly distributed, and the noise follows a normal distribution with fixed variance
• In most cases, the population vector is biased and has a large variance
• The variance of the population vector estimate does not reflect Fisher information
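A minimal sketch of the population vector decoder, run under the conditions in which it is stated to be optimal (cosine tuning, evenly spaced preferred directions, Gaussian noise with fixed variance). The baseline, gain, noise level and number of cells are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 64
pref = np.linspace(0.0, 2 * np.pi, n, endpoint=False)   # preferred directions
P = np.stack([np.cos(pref), np.sin(pref)])              # 2 x n, columns are P_i

s_true = np.deg2rad(60.0)
f = 30.0 + 20.0 * np.cos(s_true - pref)      # hypothetical cosine tuning curves
r = f + rng.normal(0.0, 3.0, size=n)         # Gaussian noise, fixed variance

P_hat = P @ r                                # linear step: P_hat = sum_i r_i P_i
s_pv = np.arctan2(P_hat[1], P_hat[0])        # nonlinear step: angle of P_hat

print("true direction (deg):", 60.0)
print("population vector estimate (deg):", np.rad2deg(s_pv))
```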
Population Vector
[Figure: population vector compared with the CR bound.]
The population vector should NEVER be used to estimate information content! The indirect method is prone to severe problems.
Population Vector
[Figure: population vector estimate $\hat{s}_{PV}$.]
Maximum Likelihood
[Figure: a template overlaid on the pattern of activity (activity vs. preferred direction, deg), with the estimate $\hat{s}_{ML}$ marked.]
Maximum Likelihood
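If the noise is Gaussian and independent:
$P(\mathbf{r}|s) \propto \exp\left(-\sum_i \frac{(r_i - f_i(s))^2}{2\sigma^2}\right)$
Therefore:
$\log P(\mathbf{r}|s) = -\sum_i \frac{(r_i - f_i(s))^2}{2\sigma^2} + \mathrm{const}$
and the estimate is given by:
$\hat{s} = \arg\min_s \sum_i \frac{(r_i - f_i(s))^2}{2\sigma^2}$
Distance measure: template matching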
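With independent Gaussian noise, maximum likelihood amounts to finding the template $f(s)$ closest to r in Euclidean distance. The sketch below does this by brute force over a grid of candidate values; the tuning curves, noise level and grid are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
preferred = np.linspace(-100, 100, 41)
sigma = 5.0
s_true = -30.0

def templates(s_grid):
    """Rows: candidate s values; columns: neurons. Hypothetical Gaussian tuning."""
    return 50.0 * np.exp(-(s_grid[:, None] - preferred[None, :]) ** 2
                         / (2 * 20.0 ** 2))

# Observed activity: the template at s_true plus independent Gaussian noise.
r = templates(np.array([s_true]))[0] + rng.normal(0.0, sigma, size=preferred.size)

# Distance between r and every template: sum_i (r_i - f_i(s))^2 / (2 sigma^2)
s_grid = np.linspace(-90, 90, 721)
distance = np.sum((r[None, :] - templates(s_grid)) ** 2, axis=1) / (2 * sigma ** 2)

s_ml = s_grid[np.argmin(distance)]   # best-matching template
print("true s:", s_true, " ML estimate:", s_ml)
```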
Gradient descent for ML
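• To maximize the likelihood with respect to s, i.e. to minimize the distance L(s) (the negative log likelihood), one can use a gradient descent technique in which s is updated according to:
$s_{t+1} = s_t - \epsilon \left.\frac{\partial L}{\partial s}\right|_{s = s_t}$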
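A minimal sketch of gradient descent on the Gaussian-noise distance $L(s) = \sum_i (r_i - f_i(s))^2 / (2\sigma^2)$. The tuning curves, step size `eps`, starting point and iteration count are assumptions, and the procedure is only expected to work when the starting point is reasonably close to the true value.

```python
import numpy as np

rng = np.random.default_rng(6)
preferred = np.linspace(-100, 100, 41)
sigma, width, gain = 5.0, 20.0, 50.0

def f(s):
    """Hypothetical Gaussian tuning curves f_i(s)."""
    return gain * np.exp(-(s - preferred) ** 2 / (2 * width ** 2))

def df(s):
    """Derivative f_i'(s)."""
    return -(s - preferred) / width ** 2 * f(s)

def grad_L(s, r):
    """dL/ds for L(s) = sum_i (r_i - f_i(s))^2 / (2 sigma^2)."""
    return -np.sum((r - f(s)) * df(s)) / sigma ** 2

s_true = -30.0
r = f(s_true) + rng.normal(0.0, sigma, size=preferred.size)

s, eps = -10.0, 0.5            # starting guess and step size (assumed)
for _ in range(200):
    s = s - eps * grad_L(s, r) # s_{t+1} = s_t - eps * dL/ds
s_gd = s

print("true s:", s_true, " gradient descent estimate:", s_gd)
```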
Gaussian noise with variance proportional to the mean
If the noise is Gaussian with variance proportional to the mean, the distance being minimized changes to:
$\hat{s} = \arg\min_s \sum_i \frac{(r_i - f_i(s))^2}{2 f_i(s)}$
Data points with small variance are weighted more heavily.
Poisson noise
If the noise is Poisson then:
$p(r_i|s) = \frac{f_i(s)^{r_i} e^{-f_i(s)}}{r_i!}$
And:
$p(\mathbf{r}|s) = \prod_i p(r_i|s) = \prod_i \frac{e^{-f_i(s)} f_i(s)^{r_i}}{r_i!}$
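For Poisson noise the log likelihood is $\sum_i [r_i \ln f_i(s) - f_i(s)]$ plus a term that does not depend on s, so the maximum can be found on a grid. The tuning curves (with a small baseline added so that the logarithm is always finite) and the grid are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
preferred = np.linspace(-100, 100, 41)

def rates(s_grid):
    """Rows: candidate s values; columns: neurons. Hypothetical tuning curves."""
    return 1.0 + 50.0 * np.exp(-(s_grid[:, None] - preferred[None, :]) ** 2
                               / (2 * 20.0 ** 2))

s_true = 40.0
r = rng.poisson(rates(np.array([s_true]))[0])   # observed spike counts

s_grid = np.linspace(-90, 90, 721)
f = rates(s_grid)

# log p(r|s) = sum_i [r_i ln f_i(s) - f_i(s)] - sum_i ln(r_i!)
# The last term does not depend on s and is dropped.
log_lik = np.sum(r[None, :] * np.log(f) - f, axis=1)

s_ml = s_grid[np.argmax(log_lik)]
print("true s:", s_true, " Poisson ML estimate:", s_ml)
```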
ML and template matching
Maximum likelihood is a template matching procedure BUT the metric used is not always the Euclidean distance, it depends on the noise distribution.
Bayesian approach
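We want to recover p(s|r). Using Bayes' theorem, we have:
$p(s|\mathbf{r}) = \frac{p(\mathbf{r}|s)\, p(s)}{p(\mathbf{r})}$
where p(s|r) is the posterior distribution over s, p(r|s) is the likelihood of s, p(s) is the prior distribution over s, and p(r) is the prior distribution over r.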
Bayesian approach
What is the likelihood of s, p(r|s)? It is the distribution of the noise. It is the same distribution we used for maximum likelihood.
Bayesian approach
• The prior p(s) corresponds to any knowledge we may have about s before we get to see any activity.
• Ex: prior for smooth and slow motions
Bayesian approach
Once we have p(s|r), we can proceed in two different ways. We can keep this distribution for Bayesian inferences (as we would do in a Bayesian network) or we can make a decision about s. For instance, we can estimate s as being the value that maximizes p(s|r). This is known as the maximum a posteriori estimate (MAP). For a flat prior, ML and MAP are equivalent.
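The relation between ML and MAP can be illustrated on a grid of candidate values. The sketch assumes independent Gaussian noise, Gaussian tuning curves and a Gaussian prior centered on 0; all of these choices and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
preferred = np.linspace(-100, 100, 21)
sigma = 10.0
s_grid = np.linspace(-90, 90, 361)

def f(s_values):
    """Rows: candidate s values; columns: neurons. Hypothetical Gaussian tuning."""
    return 50.0 * np.exp(-(s_values[:, None] - preferred[None, :]) ** 2
                         / (2 * 20.0 ** 2))

s_true = 15.0
r = f(np.array([s_true]))[0] + rng.normal(0.0, sigma, size=preferred.size)

# Log likelihood under independent Gaussian noise (up to a constant).
log_lik = -np.sum((r[None, :] - f(s_grid)) ** 2, axis=1) / (2 * sigma ** 2)

def posterior(log_prior):
    """p(s|r) on the grid: likelihood times prior, normalized to sum to 1."""
    log_post = log_lik + log_prior
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

flat_prior = np.zeros_like(s_grid)
gauss_prior = -s_grid ** 2 / (2 * 10.0 ** 2)   # assumed prior: mean 0, s.d. 10

s_ml = s_grid[np.argmax(log_lik)]
s_map_flat = s_grid[np.argmax(posterior(flat_prior))]
s_map_prior = s_grid[np.argmax(posterior(gauss_prior))]

print("ML:", s_ml, " MAP (flat prior):", s_map_flat,
      " MAP (Gaussian prior):", s_map_prior)
```

With the flat prior the MAP estimate coincides with the ML estimate; with the prior centered on 0 the MAP estimate can only move toward 0.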
Bayesian approach
Limitations: the Bayesian approach and ML require a lot of data (estimating p(r|s) requires at least n + n(n+1)/2 parameters for a multivariate Gaussian: n means plus n(n+1)/2 covariance terms).
Alternatives:
1- Naïve Bayes: assume independence and hope for the best.
2- Use a clever method for fitting p(r|s).
3- Estimate p(s|r) directly using a nonlinear estimate.
4- Hope the brain uses likelihood functions that have only N free parameters, e.g., the exponential family with linear sufficient statistics.
Bayesian approach:logistic regression
Example: Decoding finger movements in M1. On each trial, we observe 100 cells and we want to know which one of the 5 fingers is being moved.
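[Figure: network with 100 input units (cells 1, 2, 3, ..., 100; activity r) feeding 5 category units (fingers 1 to 5). Inset: the output nonlinearity g(x), which rises from 0 to 1, shown for P(F5|r).]
$P(F_i|\mathbf{r}) = g(W_i^T \mathbf{r})$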
Bayesian approach:logistic regression
Example: 5N free parameters instead of $O(N^2)$
[Figure: same network, 100 input units (activity r) and 5 categories.]
$P(F_i|\mathbf{r}) = \sigma(W_i^T \mathbf{r})$
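A minimal sketch of the finger-decoding example: 100 cells, 5 fingers, one logistic output unit per finger, trained by plain gradient descent on a cross-entropy loss. The data are synthetic and every numerical choice (activity patterns, noise, learning rate, number of iterations) is an assumption; the lecture does not specify a training procedure.

```python
import numpy as np

rng = np.random.default_rng(10)
n_cells, n_fingers, n_trials = 100, 5, 500

# Synthetic data (hypothetical): each finger evokes its own mean activity pattern.
patterns = rng.normal(size=(n_fingers, n_cells))
labels = rng.integers(0, n_fingers, size=n_trials)
R = patterns[labels] + rng.normal(size=(n_trials, n_cells))   # trials x cells
Y = np.eye(n_fingers)[labels]                                  # one-hot targets

def g(x):
    """Logistic function, with clipping to avoid overflow in exp."""
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

W = np.zeros((n_cells, n_fingers))            # 5N free parameters
for _ in range(300):
    P = g(R @ W)                              # P(F_i | r) for every trial
    W -= 0.1 * R.T @ (P - Y) / n_trials       # cross-entropy gradient step

predicted = np.argmax(g(R @ W), axis=1)       # most probable finger per trial
accuracy = np.mean(predicted == labels)
print("free parameters:", W.size, " training accuracy:", accuracy)
```

The weight matrix has 5 × 100 = 500 entries, in line with the 5N count on the slide.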
Bayesian approach:multinomial distributions
Example: Decoding finger movements in M1. Each finger can take 3 mutually exclusive states: no movement, flexion, extension.
Outputs for each digit: probability of no movement, probability of flexion, probability of extension
[Figure: activity of the N M1 neurons → weights W → softmax → Digit 1, Digit 2, Digit 3, Digit 4, Digit 5, Wrist.]
Decoding time varying signals
[Figure: stimulus s(t) and spike train $\rho(t)$.]
Decoding time varying signals
$\hat{s}(t - \tau_o) = (k * \rho)(t) = \int k(\tau)\, \rho(t - \tau)\, d\tau$
Note the time shift $\tau_o$.
Decoding time varying signals
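$\hat{s}(t - \tau_o) = \rho(t) * k(t) = \int k(\tau)\, \rho(t - \tau)\, d\tau = \int k(\tau) \sum_{i=1}^{n} \delta(t - \tau - t_i)\, d\tau = \sum_{i=1}^{n} k(t - t_i)$
Discrete sum of templates centered on spikes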
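The equivalence between convolving the spike train with the kernel and summing kernel copies placed at the spike times can be checked in discrete time. The spike times, the exponential kernel and the time step are assumptions, and the time shift $\tau_o$ is left out for simplicity.

```python
import numpy as np

dt = 0.001                                   # time step in seconds
t = np.arange(1000) * dt                     # 1 s of time
spike_idx = np.array([100, 350, 400, 800])   # hypothetical spike times (bins)

# Spike train as a sum of discrete delta functions (height 1/dt).
rho = np.zeros_like(t)
rho[spike_idx] = 1.0 / dt

# Hypothetical causal kernel k(tau): exponential decay over 100 ms.
tau = np.arange(100) * dt
k = np.exp(-tau / 0.02)

# Convolution: integral of k(tau) rho(t - tau) dtau
s_conv = np.convolve(rho, k)[: t.size] * dt

# Explicit sum of templates centered on the spikes: sum_i k(t - t_i)
s_sum = np.zeros_like(t)
for i in spike_idx:
    end = min(i + k.size, t.size)
    s_sum[i:end] += k[: end - i]

print("max difference:", np.max(np.abs(s_conv - s_sum)))
```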
Decoding time varying signals
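• Finding the optimal kernel (similar to OLE)
$s_{est}(t - \tau_0) = \int_0^\infty d\tau \left( \sum_{i=1}^{n} \delta(t - \tau - t_i) - \langle r \rangle \right) K(\tau) = \int_0^\infty d\tau\, (\rho(t - \tau) - \langle r \rangle)\, K(\tau)$
$E = \frac{1}{T} \int_0^T dt\, \left( s_{est}(t - \tau_0) - s(t - \tau_0) \right)^2 = \frac{1}{T} \int_0^T dt \left( \int_0^\infty d\tau\, (\rho(t - \tau) - \langle r \rangle)\, K(\tau) - s(t - \tau_0) \right)^2$
Minimizing E with respect to K gives:
$\int_0^\infty d\tau'\, Q_{\rho\rho}(\tau - \tau')\, K(\tau') = Q_{rs}(\tau - \tau_0)$
where $Q_{\rho\rho}(\tau - \tau') = \frac{1}{T} \int_0^T dt\, (\rho(t - \tau) - \langle r \rangle)(\rho(t - \tau') - \langle r \rangle)$ is the autocorrelation function of the spike train (Appendix A chapter 2) and $Q_{rs}$ is the correlation of the firing rate and stimulus.
If $Q_{\rho\rho}(\tau) = \langle r \rangle\, \delta(\tau)$, then:
$K(\tau) = \frac{1}{\langle r \rangle} Q_{rs}(\tau - \tau_0) = \frac{1}{\langle n \rangle} \left\langle \sum_{i=1}^{n} s(t_i + \tau - \tau_0) \right\rangle = C(\tau_0 - \tau)$
where C is the spike triggered average of the stimulus. If the spike train is uncorrelated, the optimal kernel is the spike triggered average of the stimulus.
Otherwise:
$K(\tau) = \frac{1}{2\pi} \int d\omega\, \tilde{K}(\omega) \exp(-i\omega\tau)$, with $\tilde{K}(\omega) = \frac{\tilde{Q}_{rs}(\omega) \exp(i\omega\tau_0)}{\tilde{Q}_{\rho\rho}(\omega)}$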
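$Q_{rs}(\tau') = \frac{1}{T} \int_0^T dt\, (\rho(t) - \langle r \rangle)\, s(t + \tau')$
$= \frac{1}{T} \int_0^T dt \left( \sum_{i=1}^{N} \delta(t - t_i) - \langle r \rangle \right) s(t + \tau')$
$= \frac{1}{T} \sum_{i=1}^{N} \int_0^T dt\, \delta(t - t_i)\, s(t + \tau') - \frac{\langle r \rangle}{T} \int_0^T dt\, s(t + \tau')$
$= \frac{1}{T} \sum_{i=1}^{N} s(t_i + \tau')$
(the second term vanishes when the stimulus has zero time average).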
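The last step, that the correlation of the firing rate and stimulus reduces to a sum of stimulus values at the spike times, can be checked in discrete time. The stimulus (white noise with its mean removed), the random spike train, the lag and the periodic boundary used for the shift are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(9)
dt, n_bins = 0.001, 5000
T = n_bins * dt

# Hypothetical stimulus with zero time average.
s = rng.normal(size=n_bins)
s -= s.mean()

# Hypothetical spike train: roughly 20 spikes per second, one spike per bin at most.
spikes = rng.random(n_bins) < 0.02
rho = spikes / dt                      # discrete delta functions
r_mean = spikes.sum() / T              # <r>

lag = 25                               # tau' expressed in bins
s_shift = np.roll(s, -lag)             # s(t + tau'), periodic boundary assumed

# Q_rs(tau') = (1/T) integral dt (rho(t) - <r>) s(t + tau')
Q_integral = np.sum((rho - r_mean) * s_shift) * dt / T

# (1/T) sum_i s(t_i + tau')
Q_sta = np.sum(s_shift[spikes]) / T

print("integral form:", Q_integral, " sum over spikes:", Q_sta)
```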