Normative Models for Sensory Representation


  • 7/29/2019 Normative Models for Sensory Representation


Homework Topic: A Concise Review of Normative Models for Sensory Representation in the Early Visual System

    Andrew Worzella

    January 30, 2013

    I. Introduction

The purpose of this paper is to review a normative theory which can be used to formulate models of representation in neural sensory systems. A normative model for the visual sensory system can be defined as a paradigm which attempts to explain why neural representation takes a particular structural form by investigating what the representation should be if it were optimal for processing the types of stimuli occurring in natural circumstances[17]. The specific adaptation depends on the tendencies of a particular organism, because behaviour and environment are ultimately the forces which determine the sensory signals frequenting the system. This form of perceptual optimization can occur on evolutionary, developmental and behavioural time scales, and the result of such an adaptation should yield a neural substrate that is useful for the survival of the organism.

In general, normative models for representation are derived by identifying some computational aspect of visual processing and formulating it into an objective function subject to a set of appropriate constraints. The objective can then be optimized, using methods from optimization theory, to yield an estimate of the optimal sensory representation. If the result of such an optimization process yields a sensory representation that is qualitatively similar to that of the visual system, then it is reasonable to assume that the brain has solved a very similar problem according to analogous principles.

Normative models provide a tractable method for understanding some fundamental aspects of visual processing by assuming optimality; however, the incredulous reader might wonder whether the brain is actually optimal. Furthermore, if the brain is indeed optimal, how should optimality be defined? In actuality, the brain probably does not have a representation which is optimal for all types of stimuli, given the high degree of complexity and variability in the natural world. Moreover, the definition of optimality is rather ambiguous, because the notion depends entirely on one's ability to identify all of the variables and intricacies that completely describe the evolution of a neural system.


In this respect, it is more useful to view normative models in computational vision as a tool for unveiling the most efficient representation given a set of constraints and assumptions about the computational tasks performed by the visual system. The hope for this strategy is that understanding efficient representation can provide some insight into explaining how the brain encodes visual sensory information.

The next section is dedicated to discussing the notion of efficiency in the brain and how it can be quantified. In the following sections, these ideas are applied to models which consider the roles that natural scenes might have played in the formation of the early visual system. The intention is to provide a survey of the theoretical background and seminal works which provide compelling evidence for the structural symbiosis of the visual system and the natural world. Many of the topics borrow language from statistics and information theory, and concise explanations of the essential concepts and definitions from these disciplines will be included; however, it will be assumed that the reader has some basic knowledge of Fourier analysis, statistics, and the anatomy of the visual system.

II. The Efficient Coding Hypothesis for the Visual System

Information from the environment arrives at the gateway of the visual system in the form of light as retinal images are cast upon the back surface of the eye. Knowledge about the source of the light (i.e. objects in the environment) cannot be easily discerned, because this information is embedded within the complicated activity patterns of photoreceptors which are responding to different spectral and intensity contrasts. There are an estimated 1.5 million ganglion cells in each eye, which implies that at any given moment the retina takes one of 2^1,500,000 possible states[17]. Encoding the entire set of activity states in a trivial manner, with a unitary response for every state, is physiologically implausible. Therefore, it is reasonable to assume that the visual system has discovered a non-trivial method for encoding sensory information.

In general, non-trivial methods for encoding information exist whenever the information elements (e.g. the set of possible activity states in the retina) occur with unequal frequencies. Claude Shannon discovered this principle by considering the optimal ways of encoding and transmitting messages across a noisy channel. He found that messages (signals) could be transmitted more efficiently if frequently occurring signals were encoded with a short bit-length and infrequently occurring signals were encoded with a longer bit-length[25]. This coding strategy is a form of compression that effectively eliminates redundancy in the set of signals to be transmitted, viz. more information on average is conveyed for every bit transmitted across the channel. Non-trivial representation for message encoding was only possible with proper knowledge of the statistical regularities of the English language.
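Shannon's variable-length coding idea can be sketched concretely (a toy illustration, not from this paper): a Huffman code for a skewed four-symbol source assigns short codewords to frequent symbols, and its average bit-length matches the source entropy, beating a fixed-length code.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Build a Huffman code and return the codeword length of each symbol."""
    # Heap items: (probability, tie-breaker, list of symbol indices)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, t, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, t, s1 + s2))
    return lengths

# A skewed source: frequent symbols get short codes, rare ones long codes.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)
avg_bits = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * log2(p) for p in probs)
fixed_bits = 2                     # 4 symbols -> 2 bits each with a trivial code
print(lengths, avg_bits, entropy, fixed_bits)
```

For this source the code lengths are (1, 2, 3, 3), giving 1.75 bits per symbol on average, which equals the source entropy and beats the 2-bit fixed-length code.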


In Shannon's view, knowledge of the statistical redundancies in messages was useful for designing a code which could effectively eliminate useless information that would otherwise waste channel space. In the visual system, it is possible that the retina encounters a very similar source-channel problem. Signals from retinal ganglion cells are passed to the lateral geniculate nucleus (LGN) via the optic nerve, whose limited bandwidth could be fully utilized with a proper choice of neural code. However, compressive coding does not seem to be an accurate depiction of how the brain processes information in higher regions of the visual pathway, such as the primary visual cortex (V1)[7]. Nevertheless, redundancy reduction is not limited to compressive coding, and knowledge of redundant information is still useful in deriving efficient sensory representations.

Attneave (1954) was the first to explicitly point out that a crucial part of visual perception involves identifying redundancies in visual scenes[2]. He created a demonstrative guessing game in which human subjects were presented the pixels of an image in a serial fashion, revealing one pixel after another. Before each pixel was revealed, subjects were asked to infer the value of the next consecutive pixel. Predictions were relatively successful for pixel values in regions of the image that contained redundant color or contour information, but were largely unsuccessful in areas where either the color or the contour made an abrupt change. This was a simple but effective demonstration showing that inferences could be facilitated with a proper understanding of the spatial redundancies of an image. Moreover, Attneave proposed that the task of visual perception was to find an economic representation of any redundancies found within the visual field. These ideas were later formulated in a neurophysiological context with Barlow's proposal stating that early sensory neurons could be encoding visual information so that redundancies in the sensory input are reduced[21]. The propositions in these seminal works lay the foundational elements of what is now known as the efficient coding hypothesis.

The efficient coding hypothesis says that knowledge of the statistical redundancies in sensory input can be used to explain and derive a representation which efficiently codes for the stimuli observed in an organism's natural habitat. Since the set of stimuli actually observed is different from the entire set of stimuli which could be observed, the visual system could have adopted a representation that exploits this fact, making it especially adept at dealing with the types of stimuli ordinarily sensed (but perhaps poorly suited for some other types of stimuli which occur infrequently). This type of coding provides prior information about the structure of a habitat, which increases an organism's ability to make better inferences within the set of circumstances associated with its behavior and environment.

    Measures of Efficiency

The general metrics used to quantify the notion of efficiency are entropy and mutual information. Given two random variables, e.g. S and R, these tools can be used to measure a code's ability to convey information about the joint instances of these variables. The variables R and S were chosen for convenience, as they relate directly to experimental paradigms where a stimulus (S) is presented and a neuronal response (R) is observed. Equations 1 and 2 give the definitions of entropy and mutual information for a discrete set of stimuli and responses, where P_r is the probability of observing a particular neuronal response, P_{r|s} is the conditional probability of observing a neuronal response given a stimulus, and P_{r,s} is the corresponding joint distribution.

H(R) = -\sum_{r} P_r \log(P_r)    (1)

MI(R; S) = -\sum_{r} P_r \log(P_r) + \sum_{s,r} P_{r,s} \log(P_{r|s})    (2)

The first term on the right-hand side of the expression for mutual information (equation 2) is the total response entropy, and the second term is the conditional response entropy averaged over all stimuli, which is sometimes referred to as the noise entropy. In general, a neuronal response is described most accurately by the sequence of times at which action potentials were fired; however, the standard convention for normative models is to assume that the firing rate can sufficiently characterize the response of a neuron. This simplification makes characterizing the distribution of responses a more tractable problem. For the remainder of this paper, it can be assumed that all models equate neuronal responses with firing rates.
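Equations 1 and 2 can be evaluated directly for a small discrete stimulus-response table; the joint distribution below is an arbitrary illustrative choice, not data from any experiment.

```python
import numpy as np

# Joint distribution P(r, s) over 3 responses x 2 stimuli (made-up numbers).
P_rs = np.array([[0.30, 0.05],
                 [0.10, 0.10],
                 [0.05, 0.40]])

P_r = P_rs.sum(axis=1)                # marginal response distribution
P_s = P_rs.sum(axis=0)                # marginal stimulus distribution

# Equation 1: total response entropy H(R), in bits.
H_R = -np.sum(P_r * np.log2(P_r))

# Equation 2: MI(R;S) = H(R) - noise entropy, where the noise entropy is the
# conditional entropy of R given S averaged over stimuli.
P_r_given_s = P_rs / P_s              # columns are P(r|s)
noise_entropy = -np.sum(P_rs * np.log2(P_r_given_s))
MI = H_R - noise_entropy
print(H_R, noise_entropy, MI)
```

Because the two conditional response distributions differ markedly from the marginal, the mutual information comes out positive but still smaller than H(R), as it must.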

Observing a neuronal response to a given stimulus is analogous to the source-channel problem studied by Shannon. A stimulus is a signal which is transmitted into the visual system, and a neuronal response is the message received by the experimenter. The responses measured (after repeated presentation of the stimuli) constitute the conditional distribution, P_{r|s}, which can be used to find out whether a neuron has an affinity for any of the stimuli. If a neuron is selective for a certain type of stimulus, then it should have little response variability over repeated presentations of the favored stimulus and large variability across the set of all stimuli, viz. response variability is indicative of a neuron's affinity for a particular stimulus[10].

It is reasonable to assume that the brain could benefit from a representation of sensory input in which neuronal responses convey maximum information about the stimuli frequenting the system. Using the definition of mutual information, a neuron is said to be efficient if the entropic difference between response and noise is maximized. If the noise entropy for a single neuron is negligible, then the task of maximizing the mutual information reduces to the problem of maximizing the total response entropy subject to an appropriate constraint. Theoretical predictions can be made about the optimal response distributions one should expect under special constraints. For example, it can be shown that restricting the firing rate to the interval between zero and a maximum firing rate yields an optimal distribution of responses that is uniform[10]. Similarly, if instead the variance of the firing rate is constrained, the optimal distribution is Gaussian[26].
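The maximum-entropy claim for a bounded firing-rate interval can be checked numerically; the sketch below compares a uniform distribution over a fixed set of rate bins with a peaked alternative on the same support (all numbers are illustrative).

```python
import numpy as np

def entropy_bits(p):
    """Discrete entropy in bits, ignoring zero-probability bins."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n = 32  # number of allowed firing-rate bins between zero and the maximum rate

uniform = np.full(n, 1.0 / n)

# A peaked alternative on the same support (truncated Gaussian shape).
x = np.linspace(0, 1, n)
peaked = np.exp(-((x - 0.5) ** 2) / (2 * 0.1 ** 2))
peaked /= peaked.sum()

print(entropy_bits(uniform))   # log2(32) = 5 bits, the maximum possible
print(entropy_bits(peaked))    # strictly less
```

Any distribution over the 32 bins other than the uniform one yields fewer than log2(32) bits, consistent with the uniform distribution being the entropy maximizer on a bounded interval.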

When considering the most efficient response for a population of neurons, each neuron should respond optimally, and the collective activity of the entire population should be optimal so that two or more neurons do not convey overlapping information. In this case, the probability of responses becomes a multivariate probability distribution, P_r, where each element of the vector r is the response of a particular neuron in the population. The response entropy of the ith neuron and the total population response entropy are given by equations 3 and 4 (see footnote 1).

H_i(R) = -\sum_{r_i} P_{r_i} \log(P_{r_i})    (3)

H_{population}(R) = -\int P_r \log(P_r) \, dr    (4)

Since the entropy of a population response cannot be greater than the sum of the entropies of each neuronal response within the population, a natural definition for the redundancy of a population response is the difference between the summation of the individual response entropies and the joint response entropy, \sum_i H_i - H. It can be shown that this quantity is greater than or equal to zero, which implies that minimizing the redundancy of the population response is equivalent to minimizing the ratio \sum_i H_i / H[10]. This ratio is equal to unity if and only if the responses of the neurons are independent[26]. Therefore, the criteria defining an efficient representation for a population of neurons are (1) factorization (i.e. independence of the responses of different neurons) and (2) probability distribution equalization.
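The redundancy measure can be illustrated with a pair of toy two-neuron response tables (made-up numbers): a correlated joint distribution has \sum_i H_i strictly greater than the joint entropy H, while a factorized one has zero redundancy.

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def redundancy(P):
    """Sum of the individual response entropies minus the joint entropy."""
    H_joint = entropy_bits(P.ravel())
    H_sum = entropy_bits(P.sum(axis=1)) + entropy_bits(P.sum(axis=0))
    return H_sum - H_joint

# Joint response table for two binary neurons (made-up numbers): the two
# neurons tend to fire together, so they convey overlapping information.
correlated = np.array([[0.45, 0.05],
                       [0.05, 0.45]])

# Factorized table with the same marginals: zero redundancy.
independent = np.outer(correlated.sum(axis=1), correlated.sum(axis=0))

r_corr = redundancy(correlated)
r_ind = redundancy(independent)
print(r_corr, r_ind)    # positive for the correlated pair, ~0 when factorized
```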

    III. Efficient Coding for Natural Scenes

If the efficient coding hypothesis is correct, then neural systems should adopt an efficient sensory representation which is adapted to the stimuli frequenting the system. Since the criteria defining an efficient representation are stated in terms of information-theoretic metrics, it is necessary to describe the distribution of stimuli in an organism's natural habitat. Most of the literature on this subject has focused on investigating the statistics of natural scenes and the possible roles they may have played in determining the representational form of visual sensory systems. Therefore, much of the remaining paper will be dedicated to reviewing how the statistics of sensory signals within natural environments can explain the representational form of the early visual system; other influences, such as the imaging process and behavior, will be discussed in section IV.

    Probabilistic Models of Natural Images

Much of what is known about the statistics of natural scenes comes from studying the statistics of static natural images. A set of natural images can be defined as visual stimuli that are similar in some regard to the types of stimuli that have presumably influenced the development and evolution of neural systems[17]. Some examples of images that could qualify as natural are given in Fig.1. Although these examples can give one some intuition as to which types of images can be considered natural, a more precise description of natural images can be given with a probabilistic model.

Footnote 1: The more precise definition of entropy for a continuous-valued response is \lim_{\Delta r \to 0} [-\sum_r P_r \log(P_r) + \log_2(\Delta r)], where \Delta r must be finite so that the entropy is finite. See Chapter 4 of [10] for more details.

The first step towards creating a statistical description of natural images necessarily involves determining the joint probability distribution of the pixels comprising an image space. An image space consists of a large set of identically-sized image vectors, where each image is an observation of a single statistical event and each pixel cell contains a random variable. The number of pixels in a typical image is often quite large, and running statistical analysis on such data is beyond the computational capabilities of modern computers. Thus, the convention for creating an image space usually involves sampling smaller image patches (e.g. 20x20 pixels) randomly and uniformly from the entire set of whole images. In addition to decreasing the size of each image vector, another simplification used in most of the literature is to consider only intensity statistics and to ignore different spectral contrasts. This restricts the number of possible states that can be taken by each pixel to a scalar quantity on the interval of grayscale values. From this point forward, the words natural images and images will refer to vectorized grayscale image patches, unless otherwise mentioned.

Second-order Statistics of Static Natural Images

The simplest description of the multivariate distribution of natural images is given in terms of second-order statistics, viz. with the expected values of the first and second moments. Second-order analysis can be performed by evaluating the autocorrelation function of neighboring pixels within a given image. Using the Wiener-Khinchin theorem, this function can be directly related to the average power spectrum of spatial frequencies (f). When performed on natural images, second-order analysis reveals a ubiquitous structure inherent to natural scenes, namely, that the power spectrum falls off with increasing frequency in a way that resembles a power law (power spectrum \propto 1/f^2). The implication of the power law is that the spatial relationships of natural images are scale-invariant[17] (see footnote 2). Fig.1c shows the power spectra corresponding to the images from Fig.1a-b.
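The power-law analysis can be sketched as follows. Since no natural-image data set accompanies this review, the snippet synthesizes an image with a known 1/f amplitude spectrum as a stand-in and recovers the expected slope of roughly -2 from its radially averaged power spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128

# Synthesize an image with a 1/f amplitude spectrum (1/f^2 power spectrum)
# as a stand-in for a natural image.
fx = np.fft.fftfreq(n)[:, None]
fy = np.fft.fftfreq(n)[None, :]
f = np.hypot(fx, fy)
f[0, 0] = 1.0                                # avoid dividing by zero at DC
img = np.fft.ifft2(np.fft.fft2(rng.standard_normal((n, n))) / f).real

# Radially averaged power spectrum of the synthesized image.
power = np.abs(np.fft.fft2(img)) ** 2
f_flat, p_flat = f.ravel(), power.ravel()
mask = (f_flat > 0.02) & (f_flat < 0.4)      # skip DC and corner frequencies
bins = np.logspace(np.log2(0.02), np.log2(0.4), 12, base=2)
idx = np.digitize(f_flat[mask], bins)
fc = np.array([f_flat[mask][idx == i].mean() for i in range(1, len(bins))])
pc = np.array([p_flat[mask][idx == i].mean() for i in range(1, len(bins))])

# Slope of log-power versus log-frequency should be close to -2.
slope, _ = np.polyfit(np.log(fc), np.log(pc), 1)
print(slope)
```

The same binning-and-fitting procedure applied to genuine natural images is what yields the approximately -2 slopes reported in the literature.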

Principal component analysis (PCA) provides an alternative method for examining the second-order statistical regularity of natural images. Principal components are eigenvectors of the covariance matrix of the image space. The symmetry of the covariance matrix ensures that the components constitute a linear, orthogonal basis, which can be viewed as features of the image space corresponding to maximally different variances[17]. The principal components of a set of natural images have been plotted in Fig.2a, with the corresponding spectrum of eigenvalues given in Fig.2b. A well-documented result obtained by performing PCA on natural images is that the first principal component is commonly horizontal, which reflects the fact that natural scenes tend to have more variance in the top-bottom direction. Said another way, pixels of natural images tend to contain stronger horizontal pairwise correlations relative to vertical ones. This anisotropic signature of natural scenes is likely a consequence of the combined effects of gravity and depth perspective[5].

Footnote 2: Scale-invariance in natural images is thought to be a consequence of the distribution and positioning of objects in the environment[17].
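A minimal version of this analysis, using synthetic anisotropically correlated noise in place of real natural images, samples patches, forms the covariance matrix, and extracts its eigenvectors (principal components) and eigenvalue spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "image": smooth 2-D noise with stronger horizontal correlations,
# mimicking the anisotropy described in the text.
big = rng.standard_normal((256, 256))
kernel = np.ones((3, 7)) / 21.0          # wider than tall -> horizontal smoothing
big = np.fft.ifft2(np.fft.fft2(big) * np.fft.fft2(kernel, big.shape)).real

# Sample 8x8 patches uniformly at random and vectorize them.
patches = []
for _ in range(2000):
    r, c = rng.integers(0, 256 - 8, size=2)
    patches.append(big[r:r + 8, c:c + 8].ravel())
X = np.array(patches)
X -= X.mean(axis=0)                      # center the image space

# Principal components: eigenvectors of the covariance matrix.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals[:3])                       # eigenvalue spectrum falls off quickly
first_pc = eigvecs[:, 0].reshape(8, 8)   # an 8x8 feature, cf. Fig.2a
```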

    Time-varying Natural Images

The world and the stimuli originating from within it are anything but static. Up to this point, the statistical methods mentioned have been dedicated exclusively to describing the spatial statistics of environmental stimuli. Normative models for representation built upon purely spatial statistical models of natural images have been successful in explaining some fundamental aspects of the representational form of the visual sensory system, but theoretical predictions are usually reserved only for special cases[11]. Since environmental stimuli are generally non-static, a time-varying probability distribution of stimuli is necessary to provide a more complete theoretical description of efficient representation for natural scenes.

Time-varying natural images have strong correlations in space and time. Second-order spatiotemporal analysis of time-varying natural images (i.e. the sequential frames of a video) reveals that the spatiotemporal power spectrum is generally not separable[11]. In (Dong and Atick, 1995), pairwise correlations were sampled for a large number of time-varying images while varying the distance of separation in time and space. Spatial frequencies were assumed to be radially symmetric so that the spatiotemporal power spectrum could be expressed as a scalar function of spatial frequency, f, and temporal frequency, \omega. To be consistent with the notation used by Dong and Atick, R(f, \omega) will be defined as the spatiotemporal power spectrum. Their measurements of spatial and temporal power are given in Fig.3. In Fig.3a, temporal frequency increases going from the highest to the lowest curve, and in Fig.3b spatial frequency increases going from the highest curve to the lowest. The non-trivial changes in the power spectra seen when temporal or spatial frequency is varied indicate that the spatiotemporal power spectrum is not separable.

Furthermore, Dong and Atick were able to derive an analytic expression for the spatiotemporal power spectrum of time-varying natural images by assuming that stimuli have planar movement between frames and that the distribution of object velocities (D_v) in each scene is given by an inverse power law. This information is summarized in equations 5-7.

R(f, \omega) \propto f^{-(m+1)} \, Y(\omega/f)    (5)

Y(\omega/f) = \int_{r_1}^{r_2} D_v((\omega/f) \, r) \, r \, dr    (6)

D_v(v) = (v + 0.6)^{-2.3}    (7)

Equation 5 gives the analytical expression for the spatiotemporal power spectrum. It consists of two terms: the left term, which depends exclusively on spatial frequency, and the right term, which contains the temporal dependence given by the function Y(\omega/f). Equation 6 gives the definition of Y(\omega/f), which depends upon the distribution of velocities (D_v) for objects in a scene located at a distance r away from the observer. The interval defined by r_1 and r_2 gives the minimal and maximal distances at which objects can be located away from the observer. Finally, equation 7 gives the definition of the distribution of object velocities, where the average velocity was chosen to be 0.6 m/s and the power of the distribution of velocities was set to -2.3.

Interestingly, the power m (see equation 5) which was in best agreement with empirical measurements of the spatiotemporal power was identical to the value estimated for the purely spatial power spectrum of the same data. The equivalence of the estimated values of m in the spatial and spatiotemporal power spectra suggests that the spatiotemporal power is governed largely by the motion of objects with a static spatial power spectrum. Finally, in the limiting case where r_1 \to 0 and r_2 \to \infty, the spatiotemporal power spectrum becomes separable (i.e. R(f, \omega) \propto 1/(f^{m-1} \omega^2)).
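The separable limit can be checked numerically. The sketch below integrates equation 6 on a wide but finite interval [r_1, r_2] (assumed values) and confirms that Y scales as (\omega/f)^{-2}, which is what makes R(f, \omega) factor into 1/(f^{m-1} \omega^2).

```python
import numpy as np

def Dv(v):
    """Equation 7: inverse power-law distribution of object velocities."""
    return (v + 0.6) ** -2.3

def Y(w_over_f, r1=1e-6, r2=1e6, n=200_000):
    """Equation 6, integrated by the trapezoid rule on a log-spaced grid."""
    r = np.logspace(np.log10(r1), np.log10(r2), n)
    vals = Dv(w_over_f * r) * r
    return np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(r))

# Substituting u = (w/f) r in equation 6 gives Y = (f/w)^2 * int D_v(u) u du
# as r1 -> 0 and r2 -> infinity, so Y should scale as (w/f)^-2, making
# R(f, w) ∝ f^-(m+1) Y(w/f) separable: R ∝ 1/(f^(m-1) w^2).
ratio = Y(2.0) / Y(1.0)
print(ratio)          # close to 2^-2 = 0.25
```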

    Representation of Second-order Statistics in Natural Scenes

Second-order statistical information about natural scenes provides the basis for explaining the representational form of the center-surround receptive fields in the retina and LGN. The neurons in these areas have been predicted to perform rudimentary processing of sensory input via an efficient representation which exploits second-order statistical information in natural scenes. The primary purpose of retinal processing seems to be to decorrelate spatially redundant sensory information, whereas neurons in the LGN tend to reduce the temporal redundancies persisting past the retina[1][12]. The theoretical models which lead to these conclusions can be explained by considering the criteria defining an efficient population response given statistical knowledge of the natural world. Recall that the process of finding a population response which maximizes the information conveyed about stimuli is approximately equivalent to maximizing the total response entropy if the noise entropy is relatively small. Response entropy is maximized when the distribution of responses is independent across all neurons in the population. It will be seen that the receptive fields of the retina and LGN have forms strikingly similar to models of efficient representation which consider only the second-order statistics of natural images.

    Spatial Decorrelation with P-type Retinal Ganglion Cells

Atick and Redlich hypothesized that the primary objective of retinal processing could be understood as an attempt to decorrelate spatial sensory input so that the retinal output travelling along each axon in the retinofugal projection could be as close to independent as possible[1]. Their proposition was motivated by considering what the expected power spectrum of retinal output would be if the input had the 1/f^2 spatial power spectrum of natural scenes. The output of a retinal ganglion cell is defined in equation 8 as the convolution of the linear response kernel (K) with each image (I). An expression for the power spectrum of retinal output can be obtained by performing the convolution of equation 8 in Fourier space and then applying the Wiener-Khinchin theorem. This result is given in equation 9.

O(x_0, y_0) = \int K(x_0 - x, y_0 - y) \, I(x, y) \, dx \, dy = K * I    (8)

|\tilde{O}(f_x, f_y)|^2 = |\tilde{K}(f_x, f_y)|^2 \, |\tilde{I}(f_x, f_y)|^2    (9)

Empirical measurements of the amplitude spectrum of the response kernel (i.e. the square root of the power spectrum) for ganglion cells of a monkey are shown in Fig.4a. The amplitude spectrum for the convolution of the kernel with natural images is given in Fig.4b. The flattening of the output spectrum at low input frequencies implies that decorrelation (a.k.a. whitening) is performed at this end of the spectrum. Another observation which can be made when comparing Fig.4a and Fig.4b is that the amplitude spectrum begins to taper off after a critical frequency. Evidently, the retina does not whiten stimuli at high spatial frequencies. In this region of the spectrum, the signal-to-noise ratio becomes small, which increases the chance of amplifying noise in the system. Whitening in this range would be counterproductive for encoding information, because if all retinal output were statistically uncorrelated, then the LGN would be unable to decipher which signals are noise and which are useful. This is why the retina is said to represent environmental stimuli in a basis which is as close to independent as possible.
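The qualitative shape of such a kernel can be sketched by combining a whitening term with a Wiener-style noise-suppression factor (a common textbook construction, not Atick and Redlich's exact derivation); with a 1/f signal amplitude spectrum and flat noise of assumed amplitude, the resulting gain is band-pass: it rises at low frequencies and tapers off where the signal-to-noise ratio becomes poor.

```python
import numpy as np

f = np.linspace(0.01, 2.0, 400)   # spatial frequency (arbitrary units)

A = 1.0 / f                       # natural-scene amplitude spectrum (~1/f)
N = 1.0                           # flat noise amplitude (assumed value)

# Whitening filter combined with Wiener-style noise suppression:
# 1/A(f) flattens the signal spectrum, while the factor A^2 / (A^2 + N^2)
# attenuates frequencies where the signal-to-noise ratio is poor.
K = (1.0 / A) * (A**2 / (A**2 + N**2))

peak = f[np.argmax(K)]
print(peak)                       # band-pass: gain rises, peaks, then falls
```

The gain simplifies to f / (1 + f^2) here, peaking where the signal and noise amplitudes cross, which reproduces the low-frequency whitening and high-frequency taper seen in Fig.4a.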

Retinal whitening is the first form of signal processing that occurs in the visual system, and a similar transformation can be performed on natural image patches. An example of a whitened image is given in Fig.5b. Two general methods exist for deriving linear decorrelating filters from the image space: patch-based and filter-based whitening[17]. The patch-based method will be reviewed here because it makes fewer assumptions about the data (see footnote 3). A review of filter-based methods can be found in [17].

The objective of patch-based whitening can be stated as deriving a set of orthogonal bases which transform images into a space where second-order statistical dependencies are removed. This is equivalent to finding a set of linear transformations which maximize the information conveyed about second-order moments in a population response. PCA yields components that are orthogonal; however, the variance is not equally distributed across the components (see Fig.2b). Thus, the principal components can be used as the set of orthogonal bases if they are scaled to unit variance. This method results in center-surround spatial decorrelating filters, which can be seen in Fig.5a.

Footnote 3: Filter-based methods assume translation invariance and that noise is not amplified unduly[17].
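A compact sketch of patch-based whitening on a toy image space (correlated synthetic vectors standing in for natural image patches): rotate into the PCA basis and rescale each component to unit variance, after which the covariance is the identity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy image space: 500 samples of 16-dimensional "patches" with strong
# neighbor correlations (stand-ins for vectorized natural image patches).
base = rng.standard_normal((500, 16))
X = base + 0.9 * np.roll(base, 1, axis=1)    # correlate neighboring pixels
X -= X.mean(axis=0)

# Patch-based whitening: rotate into the PCA basis, then rescale each
# component to unit variance.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T   # symmetric whitening
Xw = X @ W

# The whitened covariance is the identity: second-order dependencies
# between pixels have been removed.
cov_w = Xw.T @ Xw / len(X)
print(np.abs(cov_w - np.eye(16)).max())
```

When the rows of W are reshaped back into patches, this construction is what yields the center-surround decorrelating filters of Fig.5a.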

    Temporal Decorrelation in the LGN

Many X-cells (see footnote 4) in the feline LGN receive input from a small number of retinal ganglion cells. Each retinal ganglion receptive field is one of two possible types (i.e. either ON or OFF center-surround), and each geniculate cell typically receives input from only one ganglion type. As a consequence, most of the spatially decorrelated information leaving the retina is preserved within the receptive fields of the LGN[13]. For each type of retinal input, geniculate cells can have one of two possible temporal responses: nonlagged (X_N) and lagged (X_L). Thus, there are a total of four different types of responses which can be observed in the LGN: ON X_N, ON X_L, OFF X_N and OFF X_L (see Saul & Humphrey, 1990).

Footnote 4: X-cells are analogous to P-cells in monkeys.

Temporal redundancies exist beyond the retina because of the sustained responses characteristic of the retinal ganglion cells forming synapses with X-geniculate cells[9]. In contrast to their retinal counterparts, the responses of cells in the LGN are transient[18]. This type of processing is beneficial to the organism because transient responses require less metabolic activity. These facts suggest that the purpose of LGN processing is to reduce the temporal redundancies persisting past the retina with an efficient representation.

Dong and Atick formulated a normative model for LGN processing which decorrelates the second-order moments of the population response to naturalistic stimuli (see [12] for a description of the methods). The optimal LGN response kernels obtained from their model were non-linear due to rectified input and output. Rectification of the input corresponds to the ON/OFF-type signal received from ganglion cells, and rectification of the output represents lagged and nonlagged responses. The physiological and theoretical response kernels of ON-X cells are visualized in Fig.6. Responses were measured by flashing a light on the center regions of ON-type retinal ganglion cells while sinusoidally modulating the luminance of the spot. The results show that the response of X_N cells grows larger when the luminance increases above zero (i.e. the mean luminance of the stimuli) and attenuates when the signal begins to decrease in luminance. In contrast, the response of X_L cells increases while the luminance is decreasing and decreases while the luminance is increasing. OFF-X cells have similar responses, only with the relationships inverted to the temporal regions where the modulated stimulus is negative. The reciprocal relationship between ON- and OFF-X cells ensures that the time evolution of the stimulus is preserved.

    Neural Representation in V1

The representational form of receptive fields in the retina and LGN seems to be explained largely by the second-order statistics of natural scenes. Efficient representation was derived by maximizing the information conveyed about the distribution of stimuli in natural scenes through the second-order moments of a population response. This is equivalent to assuming that the causes of the stimuli in natural images have a Gaussian distribution[10]. The non-Gaussian character of natural image statistics is evident in Fig.7, which shows the response distribution of natural images convolved with Gabor filters (see footnote 5). The response distribution (solid line) is more sharply peaked around zero, meaning that the likelihood of a zero response is greater than for a Gaussian distribution of equal variance. In addition, the response distribution has relatively heavy tails, so that very high responses are also more probable than under the Gaussian distribution. These types of response distributions are said to be sparse.

Footnote 5: In frequency space, the phase-independent response between a Gabor and a Gaussian is also Gaussian.
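The peaked-plus-heavy-tailed signature of sparsity can be reproduced with samples: below, a Laplacian distribution (an assumed stand-in for Gabor-filter responses, not data from the paper) is compared against a Gaussian of equal variance via excess kurtosis and center/tail probabilities.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Gaussian samples vs. Laplacian samples of equal (unit) variance.
gauss = rng.normal(0.0, 1.0, n)
sparse = rng.laplace(0.0, 1.0 / np.sqrt(2.0), n)   # variance = 2*b^2 = 1

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

# A sparse distribution is both more peaked at zero AND heavier-tailed.
near_zero = lambda x: np.mean(np.abs(x) < 0.1)
far_out = lambda x: np.mean(np.abs(x) > 3.0)
print(excess_kurtosis(gauss))    # ~0 for a Gaussian
print(excess_kurtosis(sparse))   # ~3 for a Laplacian
print(near_zero(sparse) > near_zero(gauss), far_out(sparse) > far_out(gauss))
```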

The representational form of the mammalian visual system has been shown to be precisely the type of coding scheme which would be expected if the visual sensory system were trying to find a sparse, distributed representation of the statistics of natural scenes[14]. If the cortex were using sparse coding, a small number of active units should encode a large amount of the stimulus variance present in any given scene. Having fewer active units at any given moment would be beneficial for an organism because it conserves the metabolic resources consumed by the brain.

    The classic solution for finding an efficient, distributed sparse representation of natural scenes was presented by Olshausen and Field (1996). By assuming that an image space consists of a set of structural primitives from which all images can be synthesized, finding sparse features is equivalent to finding a linear combination of a set of latent causes which best explains the statistics of the image space in the presence of additive white Gaussian noise, ν, with zero mean and variance σ². This computation can be formulated as an unsupervised machine learning algorithm by attempting to match the probability distribution of observing images from the model (P_{I|model}) to the actual distribution of natural images (P_I). This is equivalent to minimizing the Kullback–Leibler divergence between P_{I|model} and P_I. The model for the image space is defined in equations 10 and 11, and the Kullback–Leibler divergence is given in equation 12.


    I(x_i) = \sum_{j=1}^{N_B} s_j A_j(x_i) + \nu(x_i)    (10)

    P_{I|model} = \int P_{I|s,A} P_s \, ds    (11)

    KL(P_I; P_{I|model}) = \int P_I \log \frac{P_I}{P_{I|model}} \, dI    (12)
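    For discrete distributions, the divergence in equation 12 reduces to a sum over states. A minimal numerical sketch, with hypothetical example distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete analogue of equation 12: sum_i p_i * log(p_i / q_i).
    Zero iff p == q; grows as the model distribution q departs from p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                           # 0 * log 0 is taken as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])              # "true" distribution, playing P_I
q_good = np.array([0.65, 0.25, 0.10])      # close model -> small divergence
q_bad = np.array([1 / 3, 1 / 3, 1 / 3])    # poor model -> larger divergence

assert kl_divergence(p, p) == 0.0
assert kl_divergence(p, q_good) < kl_divergence(p, q_bad)
```

Note that the divergence is asymmetric in its arguments, which is why the direction KL(P_I; P_{I|model}) matters in the model-fitting objective.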

    As can be seen in equation 10, the coefficient s_j determines how much of the basis function A_j(x_i) is present in a particular image. Each basis function is the size of an image patch and consists of luminance values which correspond to the magnitude of each unit response in the population. Equation 11 defines the probability distribution of observing an image given a set of basis functions and coefficients. An efficient population response requires that every unit in the population have identical and independent distributions. If s is a vector containing the coefficients for all the basis functions, then independence implies that P_s = P_{s_1} P_{s_2} ... P_{s_{N_B}}, where each P_{s_j} is given by a pre-specified sparse distribution. The sparse distribution used in the Olshausen and Field model is the Cauchy distribution (P_{s_j} \propto e^{-\log(1+s_j^2)}).

    The purpose of the sparse coding algorithm is to minimize the KL divergence (equation 12) by finding the basis functions which best represent randomly selected subsets of the image space. Since an over-complete basis implies that no unique solution exists for each image, an estimate is made for the coefficients which corresponds to the sparsest solution. This means that the algorithm will search for the representation of the image space which is given by the fewest active bases for any given image. Again, this is equivalent to assuming that any given image in the image space consists of a small number of structural primitives. The basis functions that emerge from applying a sparse coding algorithm are localized in space, bandpass and oriented, analogous to the receptive fields of simple cells in V1 (see Fig.8). The remarkable fact about the model is that the optimization process was unsupervised and began with random initializations for the basis functions. The results suggest that sensory-driven simple cells in V1 have these types of receptive fields in order to efficiently represent a distributed sparse code for the stimuli of natural scenes6.
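    The coefficient-inference step can be sketched on synthetic data. The toy example below (an illustration of the energy function, not Olshausen and Field's published implementation; all sizes, seeds and learning rates are arbitrary choices) infers a sparse code for a single synthetic "image" by gradient descent on the reconstruction error plus the log(1 + s²) sparseness penalty induced by the Cauchy prior:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy energy function: E(s) = ||I - A s||^2 + lam * sum_j log(1 + s_j^2),
# where the second term is the negative log of the Cauchy prior.
n_pixels, n_basis = 16, 8
lam = 0.1

A = rng.normal(size=(n_pixels, n_basis))
A /= np.linalg.norm(A, axis=0)                 # unit-norm basis functions

s_true = np.zeros(n_basis)
s_true[1], s_true[5] = 3.0, -2.0               # a sparse set of latent causes
image = A @ s_true + 0.01 * rng.normal(size=n_pixels)  # eq. 10 with small noise

def energy(s):
    return np.sum((image - A @ s) ** 2) + lam * np.sum(np.log(1.0 + s ** 2))

def infer(steps=500, lr=0.1):
    """Gradient descent on E(s), starting from the all-zero code."""
    s = np.zeros(n_basis)
    for _ in range(steps):
        grad = -2.0 * A.T @ (image - A @ s) + lam * 2.0 * s / (1.0 + s ** 2)
        s -= lr * grad
    return s

s_hat = infer()
print(np.round(s_hat, 2))   # two large coefficients, the rest near zero
```

In the full algorithm this inference step alternates with a learning step that adjusts the basis functions A themselves; only the inner loop is sketched here.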

    IV. Other Aspects Which Influence Sensory Representation

    Up until this point, the models reviewed in this paper have offered explanations for the representational form of receptive fields in the visual sensory system by likening the task of finding an efficient representation for sensory signals to finding an efficient representation for the environmental stimuli observed in natural scenes. Although natural scene statistics can provide some insight into what

    6See [27] and [3] for spatiotemporal models of simple cells.


    information is being fed into the visual system, it cannot provide a complete description of the statistics of the sensory input. This follows from the fact that the types of environmental stimuli frequenting the system may vary across the visual field. For example, V1 neurons responding to the upper visual field will have a higher probability of exposure to images of objects located at large distances from the observer (e.g. clouds, tree tops, etc.) when compared to the cells responding to the lower visual field, where objects tend to be located on the ground and much closer to the observer. Thus, the statistics of natural scenes are not isotropic and, as a consequence, neither is vision. Moreover, different types of behavior can influence the distribution of stimuli frequenting each region of the visual field and ultimately determine the criteria which constitute an efficient representation. In other words, an efficient response for a population of neurons should be specially tailored to the redundant information which occurs in the regions of the visual field to which that population responds. The goal of this section is to consider regional and behavioral aspects which can lead to different realizations of efficient representation.

    The anisotropies of the environment can be readily examined by individually analyzing different regions of the image plane. In (Rothkopf et al., 2009), nine different localized regions of the image plane were analyzed with a sparse coding algorithm to examine how regional differences in the statistics of the environment could lead to differences in the tuning properties of the resulting basis functions. The basis functions which emerged are given in Fig.9b, where each set of basis functions is enclosed by a different colored square corresponding to the various regions of the image plane which were sampled in Fig.9a. There are conspicuous regional differences in the tuning properties of the basis functions. For instance, basis functions from the top-left portion of the image plane tend to be more aligned with radial meridians of 135 degrees when compared to the basis functions learned in the center or top-right regions of the image plane. The histograms of the orientation preferences are shown in Fig.9c. These results are in agreement with the oblique and meridional effects, two psychophysical phenomena relating to an organism's ability to discern contours at different orientations. The oblique effect refers to an organism's superior performance in discrimination tasks involving stimuli oriented along cardinal axes relative to tasks involving obliquely oriented stimuli7. This is consistent with the histogram of orientation selectivity at the center of the visual field (Fig.9c). The meridional effect refers to a subject's increased ability to detect radially oriented stimuli at oblique peripheral locations in the visual field[23], which is also observed in the corner regions of Fig.9c. Physiological measurements of the oblique effect have been shown to originate from V1[15], and the results from [22] suggest that the physiological origins of the meridional effect can be attributed to this cortical

    region as well.

    In addition to showing that tuning properties depend on the location of the visual field being encoded, Rothkopf et al. (2009) also showed how the distribution of tuning properties depends on the camera orientation, i.e. where

    7The oblique effect is associated only with foveal vision.


    the camera is looking. The results of the previous paragraph were obtained with images of the forest where the observer's line of sight was parallel to the ground plane. In this second experiment, photos of the forest were obtained with the camera facing more towards the ground. The tuning properties of basis functions learned under these circumstances are not in agreement with the oblique and meridional effects (see the histogram of orientation preferences in Fig.9d). These results suggest that an observer who tends to look at the ground can have receptive fields with radically different tuning properties than an observer whose line of sight is parallel to the ground. Gaze allocation was the only difference between these two cases, and so this study is an example of how different types of behavior can lead to different neural representations in the visual system.
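    Orientation-preference histograms of the kind reported by Rothkopf et al. require assigning a preferred orientation to each learned basis function. One way this could be computed (a hypothetical sketch, not necessarily their method) is from the peak of each filter's Fourier power spectrum; below, a synthetic Gabor of known orientation stands in for a learned basis function, and the frequency and size parameters are arbitrary choices:

```python
import numpy as np

def gabor(size, theta, freq, sigma=4.0):
    """Gabor patch; theta is the direction of luminance modulation (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * freq * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * carrier

def preferred_orientation(patch):
    """Orientation (in [0, pi)) of the spectral peak of a zero-mean patch."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(patch - patch.mean()))) ** 2
    half = patch.shape[0] // 2
    power[half, half] = 0.0                        # ignore any residual DC term
    i, j = np.unravel_index(np.argmax(power), power.shape)
    return np.arctan2(i - half, j - half) % np.pi  # rows ~ y frequency, cols ~ x

theta_true = np.deg2rad(135.0)
patch = gabor(size=33, theta=theta_true, freq=4.0 * np.sqrt(2.0) / 33.0)
theta_hat = preferred_orientation(patch)
print(np.rad2deg(theta_hat))   # close to 135 degrees
```

Applied to every basis function in a region, such estimates could then be binned into the kind of orientation histograms shown in Fig.9c and Fig.9d.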

    Discussion

    The brain is faced with the arduous task of interpreting a high degree of variability in its sensory input. A conceivable goal for neural sensory systems could be to adapt their representations to the statistics of the sensory input so that stereotypical circumstances are processed efficiently and perceptual faculties are reserved for making judgements about unpredictable circumstances. The efficient coding hypothesis suggests that knowledge about redundancies in sensory signals can be exploited in order to construct efficient sensory representations, and it has been shown that normative models can provide some insight into which computational principles underlie sensory processing in the early visual system.

    Because normative models reveal the representational form that a population of neurons should take if it were to optimally perform a computational task, metrics from information theory were used to give a definitive mathematical description of the notion of efficiency. This approach should be met with some caution, since information theory was developed for communication systems rather than neural systems. Indeed, the limited bandwidth of the retinofugal projection presents an interesting parallel to message transmission over a noisy channel; however, bandwidth does not seem to be a limitation in higher cortical areas. For example, LGN input is over-sampled by cells in V1, and as a result the ratio of V1 output to LGN input is substantially large (e.g. 25:1 in felines and 50:1 in the brain of the macaque)[20]. Instead, it could be that metabolic constraints limit the capacity of a code, which would mean that there is no longer a straightforward correlate between the compression coding schemes used for the message-passing problem and the types of encoding which are expected in the cortex[1].

    Normative theory can explain why the visual system takes a particular representational form; however, one obvious limitation of this approach is that the theory ignores implementation. For instance, a sparse code is metabolically efficient because it maximizes the information conveyed about the stimuli of natural scenes subject to a mean firing rate constraint[4], but the model does not specify how this may be achieved. One suggestion for how a population could realize


    such a code was given by the Olshausen and Field model for sparse coding, where information was represented by a few active elements in the population with high firing rates and a large number of inactive elements. An alternative way that a neural population could realize such a sparse code is by maximizing information transmission, where all firing rates, as opposed to just high rates, convey information[4].

    Another limitation of the normative approach is that these models ignore the specific time scales on which the normative optimization process operates. The times at which neural systems adapt are crucial for proper development of the visual system. For example, receptive fields do not develop properly if the spontaneous activity occurring in the pre-natal retina is disrupted[8]. Furthermore, knowledge about which adaptations are innate and which are learned from the environment is useful for understanding the extent to which organisms (and models) can adapt to the statistics of the input.

    Despite these limitations, the efficient coding hypothesis for the visual sensory system seems to provide reasonable accounts of some fundamental computational properties of the retina, LGN and V1. The models presented in this review underline the important roles that environment and behavior play in shaping the visual system; however, they are only the beginning. Statistically oriented theories of the early visual system have proven their potential to increase our knowledge of early vision, and they provide hope that we are on the right track towards finding a better explanation of higher visual processes and a complete theory of the brain.

    References

    [1] J. Atick and N. Redlich. What does the retina know about natural scenes? Neural Comput., 4(2):196–210, 1992.

    [2] F. Attneave. Some informational aspects of visual perception. Psychological Review, 61(3):183–193, May 1954.

    [3] O. Ba. Learning sparse overcomplete representations of time-varying natural images. 2003.

    [4] R. Baddeley, L. F. Abbott, M. C. Booth, F. Sengpiel, T. Freeman, E. A. Wakeman, and E. T. Rolls. Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proceedings. Biological Sciences / The Royal Society, 264(1389):1775–1783, Dec. 1997. PMID: 9447735.

    [5] R. Baddeley and P. J. B. Hancock. A Statistical Analysis of Natural Images Matches Psychophysically Derived Orientation Tuning Curves. 1991.

    [6] H. Barlow. Possible principles underlying the transformation of sensory messages. Sensory Communication, pages 217–234, 1961.

    [7] H. Barlow. Redundancy reduction revisited. Network: Computation in Neural Systems, 12(3):241–253, 2001.


    [8] J. Cang, R. C. Rentería, M. Kaneko, X. Liu, D. R. Copenhagen, and M. P. Stryker. Development of precise maps in visual cortex requires patterned spontaneous activity in the retina. Neuron, 48(5):797–809, Dec. 2005. PMID: 16337917.

    [9] B. G. Cleland, M. W. Dubin, and W. R. Levick. Sustained and transient neurones in the cat's retina and lateral geniculate nucleus. The Journal of Physiology, 217(2):473–496, Sept. 1971. PMID: 5097609.

    [10] P. Dayan and L. F. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. The MIT Press, 1 edition, Sept. 2005.

    [11] D. Dong and J. Atick. Statistics of natural time-varying images. Network, 6(3):345–358, 1995.

    [12] D. W. Dong and J. J. Atick. Temporal decorrelation: A theory of lagged and nonlagged responses in the lateral geniculate nucleus. Network, 6:159–178, 1995.

    [13] E. Kaplan, K. Purpura, and R. M. Shapley. Contrast affects the transmission of visual information through the mammalian lateral geniculate nucleus. The Journal of Physiology, 391:267–288, 1987.

    [14] D. Field. What is the goal of sensory coding? Neural Computation, 6(4):559–601, July 1994.

    [15] C. Furmanski and S. Engel. An oblique effect in human primary visual cortex. Nat Neurosci, 3(6):535–536, June 2000.

    [16] P. Hancock, R. Baddeley, and L. Smith. The principal components of natural images. Network: Computation in Neural Systems, 3(1):61–70, Feb. 1992.

    [17] A. Hyvärinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer, softcover reprint of hardcover 1st ed. 2009 edition, Dec. 2010.

    [18] M. W. Levine and J. B. Troy. The variability of the maintained discharge of cat dorsal lateral geniculate cells. The Journal of Physiology, 375:339–359, June 1986. PMID: 3795062.

    [19] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, June 1996.

    [20] B. Olshausen and D. Field. What is the other 85% of V1 doing? 2004.

    [21] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37(23):3311–3325, Dec. 1997. PMID: 9425546.


    [22] C. A. Rothkopf, T. H. Weisswange, and J. Triesch. Learning independent causes in natural images explains the space-variant oblique effect. In Proceedings of the 2009 IEEE 8th International Conference on Development and Learning, DEVLRN '09, pages 1–6, Washington, DC, USA, 2009. IEEE Computer Society.

    [23] J. Rovamo, V. Virsu, P. Laurinen, and L. Hyvärinen. Resolution of gratings oriented along and across meridians in peripheral vision. Investigative Ophthalmology & Visual Science, 23(5):666–670, Nov. 1982. PMID: 7129811.

    [24] S. Schwartz. Visual Perception: A Clinical Orientation, Fourth Edition. McGraw-Hill Medical, 4 edition, Nov. 2009.

    [25] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, 1971.

    [26] E. P. Simoncelli and B. A. Olshausen. Natural image statistics and neural representation. Annual Review of Neuroscience, 24:1193–1216, 2001. PMID: 11520932.

    [27] J. H. van Hateren and D. L. Ruderman. Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings. Biological Sciences / The Royal Society, 265(1412):2315–2320, Dec. 1998. PMID: 9881476.
