
Chapter III. The phenomenology of seeing



Biological and Computer Vision, Gabriel Kreiman, Chapter 3, 2019
http://klab.tch.harvard.edu/academia/classes/Neuro230/AdditionalMaterials/Chapter3.html

We want to understand the neural mechanisms responsible for visual cognition, and we want to instantiate these mechanisms in computational algorithms that resemble and perhaps even surpass human performance. In order to build such biologically inspired visually intelligent machines, we first need to define visual cognition capabilities at the behavioral level. What shapes can be recognized, and when, and how? Under what conditions do people make mistakes? How much experience, and what type of experience with the world, is required to learn to recognize objects? To answer these questions, we need to quantify human performance in well-controlled visual tasks. A discipline with the picturesque and attractive name of psychophysics aims to rigorously characterize, quantify, and understand behavior during cognitive tasks.

3.1. What you get ain’t what you see

As already alluded to in Chapter 2, what we end up perceiving is a significantly transformed version of the pattern of photons impinging on the retina. Our brains filter and process visual inputs to understand the physical world by constructing an interpretation that is consistent with our experiences. The notion that our brains make up stuff may seem counterintuitive at first: our perception is a sufficiently reasonable representation of the outside world to allow us to navigate, to grasp objects, to interpret where things are going, and to discern whether a friend is happy or not. It is therefore extremely tempting to assume that our visual system captures a perfect, literal rendering of the outside world.

Visual illusions constitute strong examples of the dissociation between what is in the real world and what we end up perceiving. A simple example of the dissociation between inputs and percepts is given by the blind spot (Chapter 2): our brains fill in the missing information.

Figure III-1. Our brains make up stuff. A. The brain creates a white triangle from the incomplete information provided by the pacmen in the figure. The illusion is broken by closing the circles (B) or rotating the pacmen (C).


Visual illusions are not the exception to the rule; they illustrate the fundamental principle that our perception is a construct, a confabulation, inspired by the visual inputs. There is a lot of information in the world that we simply do not see. For example, our eyes do not register information in the ultraviolet portion of the light spectrum (but the eyes of other animals, like mice, do). Another simple example is our percept while watching a movie. A movie is nothing more and nothing less than a sequence of frames, typically presented at a rate of 30 frames per second or more. Our brains do not perceive this sequence of discrete frames; instead, we interpret objects as moving on the screen.

Our brains miss a lot of visual information from the world, such as UV light; they interpolate to fill in holes, as in the blind spot; they join adjacent frames to create the illusion of movement in a movie; and they can also completely invent things that do not exist at all in the outside world. Consider, for example, the Kanizsa triangle illustrated in Figure III-1. We perceive a large white triangle in the center of the image, and we can trace each of the sides of that triangle. Yet those edges are composed of illusory contours: between the edge of one pacman and the edge of the adjacent pacman, there is no white edge.

3.2. Perception depends on adequately grouping parts of an image through specific rules

Our brains are confabulators, pretty useful confabulators that follow systematic rules to create our perceptual worlds. One of the early and founding attempts at establishing basic principles of visual perception originated with German philosophers and experimental psychologists in the late nineteenth century. The so-called Gestalt laws (in German, "Gestalt" means shape) provide basic constraints on how patterns of light are integrated into perceptual sensations (Regan, 2000). These rules arose from attempts to understand the basic perceptual principles that lead to interpreting objects as wholes rather than as the constituent isolated lines or elements that give rise to them. These grouping laws are usually summarized by pointing out that forms (Gestalten) are more than the mere sum of their component parts.

Figure III-2. Figure-ground segregation. We tend to separate the figure, here a person running, from the background, here uniform black.


• Figure-ground segregation. We readily separate the figure from the background based on their relative contrast, size, color, etc. (Figure III-2). The famous artist M.C. Escher capitalized on this aspect of cognition to render ambiguous images where figure and background merge back and forth in different regions. Evolution probably discovered the importance of separating figure from ground when detecting prey, leading to the phenomenon of camouflage, whereby the figure blends into the background, making it difficult to spot.

• Closure. We complete lines and extrapolate to complete known patterns or regular figures. We tend to put together different parts of the image to make a single, recognizable shape. For example, our brain creates a triangle in the middle of the Kanizsa image from incomplete information (Figure III-1).
• Similarity. We tend to group similar objects together. Similarity can be defined by shape, color, size, brightness, etc. (Figure III-3).

• Proximity. We tend to group objects based on their relative distances (Figure III-4). Proximity is a very powerful cue, and it can often trump the other grouping criteria (a minimal computational sketch of this rule appears after this list).
• Symmetry. We tend to group symmetrical images.
• Continuity. We tend to continue regular patterns (Figure III-5).
• Common fate. Elements moving in the same direction tend to be grouped together. Movement is one of the strongest and most important cues for grouping and segmentation of an image, superseding the other criteria.

Figure III-4. Grouping by proximity. We perceive this figure as vertical lines.

Figure III-5. Grouping by continuity. We tend to assume that the dark gray circles form a continuous line.

Figure III-3. Grouping by similarity. We tend to group objects that share some properties. A. We perceive horizontal lines composed of black squares interleaved with horizontal lines composed of white squares, grouping the items by their color. B. We perceive several groups based on their shapes.


Because of this, an animal that wants to camouflage itself against the background should stay very still.
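To make the proximity rule concrete, here is a toy sketch (my illustration, not an experimental procedure from the text) that groups one-dimensional dot positions into perceptual "lines" whenever neighboring dots are close:

```python
# Toy illustration of grouping by proximity: cluster sorted 1-D dot positions
# whenever the gap to the previous dot is small relative to the gaps between
# clusters. All numbers below are made up for illustration.

def group_by_proximity(xs, gap_threshold):
    """Group sorted 1-D positions into clusters separated by large gaps."""
    xs = sorted(xs)
    groups = [[xs[0]]]
    for prev, curr in zip(xs, xs[1:]):
        if curr - prev <= gap_threshold:
            groups[-1].append(curr)   # small gap: same perceptual group
        else:
            groups.append([curr])     # large gap: start a new group
    return groups

# Dots spaced 1 apart within columns, with gaps of 4 between columns,
# roughly like the display in Figure III-4:
dots = [0, 1, 2, 6, 7, 8, 12, 13, 14]
print(group_by_proximity(dots, gap_threshold=2))
# [[0, 1, 2], [6, 7, 8], [12, 13, 14]] -- three perceptual "lines"
```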

3.3. The whole can be more than the parts

The Gestalt grouping rules dictate the organization of elements in an image into higher-order structures, new interpretable combinations of simple elements. An interesting demonstration of the processing and interpretation of a combination of elements beyond what can be discerned from the individual components is referred to as holistic processing. A particularly well-studied form of holistic processing is the interpretation of faces. Three main observations have been put forward to document the holism of face processing. First, the inversion effect (Yin, 1969; Valentine, 1988) describes how difficult it is to distinguish local changes in a face when it is turned upside down. An illusion known as the "Thatcher effect" illustrates this point: distorted images of Britain's prime minister can be easily distinguished from the original when they are right side up but not when they are upside down. The second observation suggestive of holistic processing is the composite face illusion: putting together the upper part of a given face A and the bottom part of another face B, one can create a novel face that appears to be perceptually distinct everywhere from the two original ones (Young et al., 1987). A third argument for holistic processing is the parts-and-wholes effect: changing a local aspect of a face distorts the overall perception of the entire face (Tanaka and Farah, 1993). The observation that the whole can be more than the parts is not restricted to faces; expertise in other domains, including fingerprint identification and recognition of novel arbitrary shapes, also leads to similar holistic effects (Gauthier et al., 1999; Busey and Vanderkolk, 2005).

3.4. The visual system tolerates large image transformations

The observation that the interpretation of the whole object is not merely a list of object parts makes it challenging to build models of object recognition that are based on a checklist of specific object segments.

Figure III-6. Tolerance in visual recognition. The lighthouse can be readily recognized despite very large changes in the appearance of the image.


Another serious challenge for this type of checklist model of recognition is that, in many cases, several of the parts may not be visible or may be severely distorted. A hallmark of visual recognition is the ability to identify and categorize objects in spite of large transformations of the image. An object can cast an infinite number of projections onto the retina due to changes in position, scale, rotation, illumination, color, etc. This invariance to image transformations is critical to recognition, constitutes one of the fundamental challenges in vision (Chapter 1), and is therefore one of the key goals for computational models (Chapter 8). Visual recognition capabilities would be quite useless without the ability to abstract away image changes.

To further illustrate the critical role of tolerance to image transformations in visual recognition, consider a very simple algorithm that we will refer to as "the rote memorization machine" (Figure I-4). This algorithm receives inputs from a digital camera and perfectly remembers every single pixel. It can remember the Van Gogh sunflowers; it can remember a selfie taken two weeks ago on Monday at 2:30pm; it can remember exactly what your car looked like three years ago on a Saturday at 5:01pm. While such extraordinary memory might seem quite remarkable at first, it turns out to be a rather brittle approach to recognition. This algorithm would not be able to recognize your car in the parking lot today, because you may see it under a different illumination, from a different angle, and with different amounts of dust than in any of the memorized photographs. The problem with the rote memorization machine is beautifully illustrated in a short story by the Argentinian fiction writer Jorge Luis Borges titled "Funes the Memorious". The story relates the misadventures of a character who acquires infinite memory after a brain accident. Funes' initial enthusiasm for his extraordinary memory soon fades when he cannot achieve visual invariance, as manifested, for example, by failing to understand that a dog seen at 3pm is the same dog at 3:01pm when seen from a slightly different angle. Borges concludes: "To think is to forget differences, generalize, make abstractions".
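A minimal sketch makes the brittleness of the rote memorization machine explicit. This is my illustration of the idea, not code from the book; the image and label are made up:

```python
# Sketch of the "rote memorization machine": store every image pixel by pixel
# and recognize only exact matches. A one-pixel shift is enough to break it.
import numpy as np

rng = np.random.default_rng(0)
memory = {}  # maps raw pixel bytes -> remembered label

def memorize(image, label):
    memory[image.tobytes()] = label

def recognize(image):
    return memory.get(image.tobytes(), "unknown")

car = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)  # stand-in photo
memorize(car, "my car")

print(recognize(car))              # "my car": pixels match exactly
shifted = np.roll(car, 1, axis=1)  # the same car, shifted by one pixel
print(recognize(shifted))          # "unknown": rote memorization fails
```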

Our visual system is able to abstract away many image transformations to recognize objects (Figure III-6), demonstrating a degree of robustness to changes in many image properties:
• Tolerance to scale changes, i.e., recognizing an object at different sizes. In vision, object sizes are typically measured in degrees of visual angle, that is, the angle that the object subtends on the eye (a short computation illustrating this geometry appears after this list). If you extend one arm in front of you and the other one to the side, the angle between them corresponds to 90 degrees. One degree of visual angle is approximately the size of your thumb at arm's length. Now consider again the sketch of a person running in Figure III-2. If you are holding the page approximately at arm's length, the person subtends approximately two degrees of visual angle. Moving the page closer and closer will lead to a multiple-fold increase in its size, mostly without affecting recognition. There are limits to recognition imposed by visual acuity (if the page is moved too far away), and there are also limits at the other end, if the image becomes too large. However, there is a very large range of scales over which we can recognize objects (Cooper et al., 1992).
• Position with respect to fixation, i.e., recognizing an object placed at different distances from the fixation point. For example, fixate on a given point, say your right thumb. Make sure not to move your eyes or your thumb. Then move the running man in Figure III-2 to different positions. You can still recognize the image at different locations with respect to the fixation point. As discussed in Chapter 2, acuity decreases rather sharply as we move away from the fixation point. Therefore, if you keep moving the page away from fixation (and then stop moving it, because motion is easily detected in the periphery), eventually the image of the running man will become unrecognizable. Yet there is a wide range of positions where recognition still works.
• 2D rotation, i.e., recognizing an object that is rotated within the image plane (Figure III-6G). You can recognize the running man even if you rotate the page or tilt your head. Recognition performance is not completely invariant to 2D rotation, as mentioned above for upside-down faces in the Thatcher illusion.
• 3D rotation, i.e., recognizing an object from different viewpoints. Recognition shows some degree of tolerance to 3D rotation of an object, but it is not completely invariant to viewpoint changes. Rotation in the three-dimensional world is a particularly challenging transformation because the types of features revealed about the object can depend quite strongly on the viewpoint. In particular, some objects are much easier to recognize from a certain canonical viewpoint than from other viewing angles.
• Color. In many cases, objects can be readily recognized in a photograph whether it is in color, sepia, grayscale, etc. (Figure III-6E). Color can certainly add important information and can enhance recognition, yet recognition abilities are quite robust to color changes.
• Illumination. In most cases, objects can be readily identified regardless of whether they are illuminated from the left, right, top, or bottom.

These are but a few of the myriad transformations an object can undergo with minimal impact on recognition; many other examples are illustrated in Figure III-6. The visual system can also tolerate many types of non-rigid transformations, such as recognizing faces despite changes in expression, aging, make-up, or shaving. The examples in Figure III-6 all depend on identifying the lighthouse based on its sharp contrast edges, but objects can be readily identified even without such edges. For example, motion cues can be used to define an object's shape.
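The visual angle arithmetic referenced in the first bullet above follows from simple trigonometry. Here is a minimal sketch; the thumbnail width and arm length are rough assumptions for illustration:

```python
# Visual angle subtended by an object: theta = 2 * atan(size / (2 * distance)).
# Sizes below are rough assumptions (thumbnail ~1 cm, arm's length ~60 cm).
import math

def visual_angle_deg(object_size, viewing_distance):
    """Angle (in degrees) subtended at the eye by an object of a given size."""
    return math.degrees(2 * math.atan(object_size / (2 * viewing_distance)))

arm_length = 60.0                             # cm, assumed viewing distance
print(visual_angle_deg(1.0, arm_length))      # ~1 degree: thumbnail at arm's length
print(visual_angle_deg(2.1, arm_length))      # ~2 degrees: the running-man sketch
print(visual_angle_deg(2.1, arm_length / 2))  # halving distance ~doubles the angle
```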

A particularly intriguing example of tolerance is given by the capability to recognize caricatures and line drawings (Figure III-7). At the pixel level, these images bear little resemblance to the actual objects, and yet we can recognize them quite efficiently, sometimes even better than the real images. It is likely that the ability to interpret such line drawings depends on specifically learning to identify symbols and certain conventions about how to sketch those objects, rather than on visual shape similarity with the objects represented by those drawings.


In all of these cases, recognition is robust to image changes, but it is not perfectly invariant to those changes. It is possible to break recognition by changing the image. Thus, although many investigators speak of invariant visual recognition, a better term is probably transformation-tolerant visual recognition, to emphasize that we do not expect complete invariance to arbitrary amounts of image change.

3.5. Pattern completion: inferring the whole from visible parts

A particularly challenging image transformation that is ubiquitous during natural vision is occlusion. If you look at the objects around you, you will realize that for many of them you only have direct access to partial information, due to poor illumination or because another object is in front. Deciphering what an object is when only parts of it are visible requires extrapolating to complete patterns. A crude example of occlusion is shown in Figure III-6A. It is easy to identify the lighthouse even though less than half of its pixels are visible. The visual system has a remarkable ability to make inferences from partial information, to put Humpty Dumpty together again and thus solve the problem that challenged all the king's men. The ability to make inferences from partial information is not exclusive to vision and is apparent in many other modalities, including understanding speech corrupted by noise, and even in higher domains of cognition, such as imagining a story from a few words printed on a page or deciphering social interactions from sparse encounters.

Vision is an ill-posed problem because the solution is not unique. In general, there could be infinitely many interpretations of the world that are consistent with a given image. The infinity of solutions is easy to appreciate in the case of occlusion: there are infinitely many ways to complete contours from partial information. Despite these infinite possible solutions, the visual system typically lands on a single interpretation of the image, which in most cases is the correct one. Investigators refer to amodal completion when there is an explicit occluder (e.g., Figure III-8A) and modal completion when illusory contours are created to complete the object without an occluder (e.g., Figure III-1A). The presence of an occluder leads to inferring depth between the occluder shape and the occluded object, which in turn leads to a surface-based representation of the scene (Nakayama et al., 1995).

Figure III-7. Recognition of line drawings. We can identify the objects in these line drawings despite the extreme simplicity of the traces and their minimal resemblance to the actual objects.


The occluder helps interpret the occluded object, as demonstrated in the famous illusion by Bregman with rotated B letters (Figure III-8).

The visual system can work with very small amounts of information. It is possible to occlude up to 80% of the pixels of an object with only a small deterioration in recognition performance (Tang et al., 2014). Recognition depends on which specific object features are occluded: certain parts of an object are more diagnostic than others. One approach to investigating which object parts are more diagnostic is to present objects through "bubbles" randomly positioned in the image, controlling which parts of the object are visible and which are not. By averaging performance over multiple recognition experiments, it is possible to estimate which object features lead to enhanced recognition and which object features provide less useful information. Instead of presenting an image through an occluder, or revealing features through bubbles, another approach to studying pattern completion is to reduce an image by cropping or blurring until it becomes unrecognizable. Using this approach, investigators have described minimal images, which can be readily recognized but which are rendered unrecognizable upon further reduction in size.
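The logic of the bubbles approach can be sketched in a few lines. The Gaussian apertures, the simulated observer, and all parameters below are assumptions for illustration, not the procedure of any specific experiment:

```python
# Sketch of the "bubbles" logic: reveal random patches of an image, record
# whether recognition succeeds, and average the masks of correct trials to
# estimate which regions are diagnostic. The observer here is a toy stand-in.
import numpy as np

rng = np.random.default_rng(1)
H = W = 64

def bubble_mask(n_bubbles=5, sigma=5.0):
    """Sum of Gaussian apertures at random locations, clipped to [0, 1]."""
    yy, xx = np.mgrid[0:H, 0:W]
    mask = np.zeros((H, W))
    for _ in range(n_bubbles):
        cy, cx = rng.integers(0, H), rng.integers(0, W)
        mask += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    return np.clip(mask, 0, 1)

def observer_correct(mask, diagnostic_region):
    """Toy observer: recognition succeeds if enough diagnostic pixels show."""
    return (mask * diagnostic_region).sum() > 0.25 * diagnostic_region.sum()

diagnostic = np.zeros((H, W))
diagnostic[20:40, 20:40] = 1          # assumed informative region (e.g., eyes)

mask_sum, n_correct = np.zeros((H, W)), 0
for _ in range(2000):
    m = bubble_mask()
    if observer_correct(m, diagnostic):
        mask_sum += m
        n_correct += 1

diagnostic_map = mask_sum / n_correct  # high values mark informative regions
```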


3.6. Visual recognition is very fast

To recap, what we perceive is a subjective construct created by our brains, which follow a series of phenomenological rules to group elements in the image and make inferences to arrive at a unique solution for an ill-posed problem, giving rise to a representation that allows us to interpret a scene and identify objects despite the infinite number of possible projections of an object onto the retina. Given the complexity of this process, one might imagine that it would take a long time to do vision. Quite the contrary: vision seems almost instantaneous. As discussed in Chapter 2, there is no such thing as truly instantaneous vision: even the conversion of incoming light signals into the output of the retina takes time, on the order of 40 milliseconds, and subsequent processing of the image by the rest of the brain takes additional time. What is quite remarkable is that all of this processing of sensory inputs, tolerance to transformations, and inference from partial information can be accomplished in a small fraction of a second. Arguably, vision would be far less useful if it took many seconds to arrive at an answer.

Figure III-8. Pattern completion. It is possible to recognize the rotated B letters despite partial information. Recognition is easier when an explicit occluder is present (A) than when the same object parts are shown without the occluder (B).


One of the original studies illustrating the speed of vision consisted of showing images in a rapid sequence (known in the field as rapid serial visual presentation). Subjects could interpret each of the individual images even when objects were presented at rates of 8 per second (Potter and Levy, 1969). Nowadays, it is relatively easy to present stimuli on a screen for short periods of time spanning tens of milliseconds or even shorter timescales. In earlier days, investigators had to resort to ingenious maneuvers to ensure that stimuli were presented only briefly; one device invented in 1859 to accomplish rapid exposure to light signals was the tachistoscope, which uses a projector and a shutter similar to the ones in single-lens reflex cameras. This device was subsequently used during World War II to train pilots to rapidly discriminate silhouettes of aircraft. Complex objects can be recognized when presented tachistoscopically for less than 50 milliseconds, even in the absence of any prior expectation or other knowledge (Vernon, 1954).

Reaction times measured in response to visual stimuli take much longer than 50 milliseconds. Emitting any type of response (pressing a button, producing a verbal response, or moving the eyes) requires several steps beyond visual processing, including decision-making processes and the neural steps required to prepare and execute the behavioral response. In an attempt to constrain the amount of time required for visual recognition, Thorpe and colleagues recorded event-related potentials from scalp electroencephalographic (EEG) signals while subjects performed a go/no-go animal categorization task (Thorpe et al., 1996). What exactly these EEG signals measure remains unclear. However, it is possible to measure minute voltage signals on the order of a few microvolts at the level of the scalp and detect changes that are evoked by visual stimuli. The investigators found an EEG signal at about 150 milliseconds after stimulus onset that differed between trials in which an animal was shown and trials in which no animal was present. It is not known whether this EEG measurement constitutes a visual signal, a decision signal, a motor signal, or some combination of these. Regardless, these results impose an upper bound for this specific recognition task: the investigators argued that visual discrimination of animals versus non-animals embedded in natural scenes must happen before 150 milliseconds. Consistent with this temporal bound, in another study subjects were asked to make a saccade as soon as possible to one of two alternative locations to report the presence of a face versus a non-face stimulus. Saccades are appealing for measuring behavioral reaction times because they are faster than pressing buttons or producing verbal responses. It took subjects on average 140 milliseconds from stimulus onset to initiate an eye movement in this task (Kirchner and Thorpe, 2006). These observations place a strong constraint on the mechanisms that underlie visual processing.

Such speed in object recognition also suggests that the mechanisms that integrate information over time must operate rather rapidly. Under normal viewing conditions, all parts of an object reach the eye more or less simultaneously (in the absence of occlusion and object movement).


By disrupting such synchronous access to the parts of an object, it is possible to probe the speed of temporal integration in vision. In a behavioral experiment designed to quantify this speed, investigators presented different parts of an object asynchronously (Figure III-9), like breaking Humpty Dumpty apart and trying to put the pieces back together again. In between the presentation of object parts, subjects were shown noise for a given amount of time, known as the stimulus onset asynchrony (SOA). The researchers conjectured that if there was a long interval between the presentation of different object parts (long SOA), subjects would be unable to interpret what the object was, whereas if the parts were presented in close temporal proximity, the brain would be able to integrate them back into a unified percept of the object. The results showed that subjects are able to integrate information up to asynchronies of about 30 ms.
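The trial structure of such an experiment can be written out schematically. The 500 ms fixation and the five-alternative choice follow the description in Figure III-9; the frame duration, category list, and function names are illustrative assumptions:

```python
# Schematic of one trial in an asynchronous object-presentation experiment:
# fixation, then object fragments separated by noise frames lasting SOA ms,
# then a forced-choice response. Durations and names are assumptions.
import random

def run_trial(fragments, soa_ms, categories, present, get_response):
    present("fixation", duration_ms=500)
    for i, fragment in enumerate(fragments):
        present(fragment, duration_ms=25)         # brief fragment frame
        if i < len(fragments) - 1:
            present("noise", duration_ms=soa_ms)  # asynchrony between parts
    return get_response(categories)               # 5-alternative forced choice

# Toy stand-ins for a display and a subject, just to make the sketch runnable:
log = []
present = lambda stim, duration_ms: log.append((stim, duration_ms))
respond = lambda cats: random.choice(cats)

choice = run_trial(["camel_front", "camel_back"], soa_ms=30,
                   categories=["camel", "dog", "car", "face", "chair"],
                   present=present, get_response=respond)
print(log, choice)
```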

Figure III-9. Spatiotemporal pattern completion: subjects can integrate asynchronously presented object information. Subjects were presented with different parts of an object in an asynchronous fashion (in this example, a camel). The middle part of the diagram shows the sequence of steps in the experiment. Subjects fixated for 500 ms, and then observed a sequence of frames in which the object fragments were separated by a stimulus onset asynchrony (SOA). Subjects performed a 5-alternative forced choice categorization task. Subjects could integrate information up to asynchronies of about 30 ms.


Another striking example of temporal integration is the phenomenon known as anorthoscopic perception, defined as the perception of a whole object in cases where only a part of it is seen at any given time. In classical experiments, an image is viewed through a slit. The image moves rapidly, allowing the viewer to catch only a small part of the whole at any given time. The brain integrates all the snapshots and puts them together to create the perception of a whole moving object.

The power of temporal integration is also nicely illustrated in experiments where an actor wearing black attire is placed in a completely dark room with only a few sources of light placed along his body. With just a handful of light points, it is possible to infer the actor's motion patterns (Johansson, 1973). Related studies have shown that it is possible to dynamically group and segment information purely based on temporal integration (Anstis, 1970; Kellman and Cohen, 1984).

Not all visual tasks are so fast. Finding a needle in a haystack is famously challenging. Searching for Waldo can be rather infuriating and can take many seconds, during which the observer will typically move his or her eyes multiple times, sequentially scrutinizing different parts of the image. Even without eye movements, certain visual tasks require more time. In general, there is a trade-off between speed and accuracy. One example of a task that requires more time is the pattern completion problem described above. Different types of experiments in which subjects have limited time to process an image show that completion of heavily occluded objects requires more time than recognition of the fully visible counterparts. The simplest of these experiments are time-forced tasks, in which identification of heavily occluded objects typically lags recognition of fully visible objects by 50 to 150 milliseconds. Another type of experiment in which subjects have limited computational time is the priming experiment. Priming refers to a form of temporal contextual modulation whereby an image A is preceded in time by another image P, called the prime; P is said to prime the perception of A. For example, the presentation of the prime P might influence how well or how fast subjects recognize the stimulus A. Priming is not restricted to the visual domain. For example, consider the planets in the solar system: Mercury, Venus, Earth, Mars, Jupiter. Now try to complete the following word: M _ _ N. It is quite likely that you thought about "moon", although the word "mean" would be at least as good an answer (in fact, according to Google's Ngrams, the word "mean" is three times more frequent in the English language than the word "moon"). When the prime P is a heavily occluded object, the magnitude of the priming effect depends on the time interval between P and A. If this interval is too short, typically less than 50 milliseconds, the priming effect vanishes, suggesting that there was not enough time to complete the pattern and thus influence subsequent recognition. Finally, another common tool in the psychophysicist's arsenal for limiting processing time is backward masking. In backward masking experiments, a stimulus A is closely followed by a noise pattern B. If the interval between A and B is very short, typically less than about 20 milliseconds, the initial stimulus A is essentially invisible. With longer intervals, subjects can still see the initial stimulus A, but recognition is impaired.
When A is a heavily occluded object and a noise pattern B is introduced about 50 milliseconds after A, it becomes very difficult to complete the pattern in A (Tang et al., 2018).


Investigators argue that the noise pattern interrupts the computations required for pattern completion. If the interval between A and the noise pattern is longer than approximately 100 milliseconds, the effect of backward masking disappears. These different types of experiments provide converging evidence that putting together the parts to infer the whole, within a single fixation, requires additional computational steps that manifest as longer reaction times.

3.7. Spatial context matters

In addition to temporal integration, visual recognition also exploits the possibility of integrating spatial information. You miss important aspects of recognition if you take vision out of context.

Several visual illusions demonstrate strong contextual effects in visual recognition. In a simple yet elegant demonstration, the perceived size of a circle can be strongly influenced by the size of the neighboring stimuli (Figure III-10). Another simple example is the Müller-Lyer illusion: the perceived length of a line with arrows at its two ends depends on the directions of the two arrows. Several entertaining examples of contextual effects have been reported (e.g., Sinha and Poggio, 1996; Eagleman, 2001). These strong contextual dependencies illustrate that the visual system spatially integrates information and that the perception of local features may depend on global surrounding properties.

Such contextual effects are not restricted to visual illusions and psychophysics demonstrations like the one in Figure III-10. Everyday vision capitalizes on contextual information. Consider Figure III-11: what is the object in the white box? It is typically very hard to answer this question with any degree of certainty. Now turn your attention to Figure III-12. What is the object in the white box? This is a much easier question! Even though the pixels inside the white box are identical in both figures, the surrounding contextual information dramatically changes the probability of correctly detecting the object.

Figure III-10. Context matters. The dark circle in the center appears to be larger on the right than on the left, but they are actually the same size.

Figure III-11. Context matters in the real world too. What is the object in the white box?


One could imagine that the observer examines multiple different parts of the image before fixating on the white box to deduce what the object is. However, in laboratory experiments where eye gaze is precisely monitored, subjects show a notable and rapid improvement in recognition performance even when they fixate only on the white box and the image disappears before they can move their eyes. These contextual effects are fast, depend on the amount of context, and can be at least partly triggered by presenting an even simpler, blurred version of the background information. These effects also emphasize that perception constitutes an interpretation of the input in the light of temporal and spatial context.

3.8. The value of experience

Our percepts are influenced by previous visual experience at multiple temporal scales. The phenomena that we have described so far, including the ability to discriminate animals from non-animals, to detect faces, the integration of spatially discontinuous object fragments, and the consequences of priming and backward masking, take place over temporal scales of tens to hundreds of milliseconds. Several visual illusions show the powerful effects of temporal context on the scale of seconds; one of these is visual adaptation. A famous example of visual adaptation is the waterfall effect: after staring at a waterfall for about 30 seconds and then shifting one's gaze to static objects, those objects appear to move upward. The visual system has adapted to downward movement, and things that are not moving appear to move upward. Adaptation is not restricted to motion. Similar after-effects can be observed after adapting to colors, textures, or objects such as faces. For example, fixate on the x in the center of Figure III-13 for about 30 seconds, then move your eyes to a white surface. You will experience an aftereffect: the white surface will appear to show blobs of color at approximately the same retinal positions as the circles in Figure III-13, but in complementary colors.

The role of experience in perception extends well beyond the scale of seconds and minutes. Even lifelong expertise can have a dramatic effect on how we perceive the visual world. For example, the interpretation of an image can strongly depend on whether one has seen that particular image before. In the so-called Dalmatian dog illusion, the first-time observer thinks that the image consists of a smudge of black and white spots. However, after recognizing the dog, subjects can immediately interpret the scene and spot the dog again the next time. Several similar images created by Craig Mooney are commonly used to assess the role of experience in perceptual grouping (Mooney, 1957).

Figure III-12. Context matters in the real world too. What is the object in the white box?


There are many other situations where images may seem unintelligible to the novice observer. You may have seen clinical images such as X-rays or magnetic resonance images. In many cases, those images reveal nothing beyond strange grayscale surfaces and textures to the untrained brain (note that it is the brain that is trained to interpret images, not the eyes; it makes no sense to speak of an untrained eye!). However, the experienced clinician can rapidly interpret the image and arrive at a diagnosis. Similarly, if you do not read Chinese, Chinese text may look like a collection of picturesque hieroglyphs.

Another aspect of how our experience with the world influences our perceptions is the interpretation of 3D structure from 2D images. Many visual illusions are based on intriguing 3D interpretations. For example, street artists create striking illusions that convey a stunning 3D scene when a 2D image painted on the street is seen from the right angle. Even when we know that these are illusions, they are so powerful that our brains, laden with years of experience, cannot help but send top-down cognitive influences to enforce a strong perceptual experience.

Another example of how our preconceived, experience-dependent notions of the 3D world influence what we see is the hollow-face illusion. A 3D face mask is rotated in such a way that at certain angles it appears convex, protruding toward the viewer, whereas at other angles it should be concave and appear hollow. However, the concave version is still perceived as a convex face. There is a strong top-down bias to interpret the face as convex, probably because we rarely encounter concave, hollow versions of a face.

Faces have always been a particularly fascinating domain of study for psychologists. Understanding and identifying faces is prone to the same experience-dependent effects as any other visual stimulus. For example, psychologists have characterized the "other-race" effect, whereby it is much harder for people to identify faces from races that they are not used to seeing. For example, imagine someone born in Asia who has had no contact with the Western world, whether in person or through movies or TV.

Figure III-13. Color aftereffect. Fixate on the center x without moving your eyes and count slowly to 30. Then move your eyes to a white surface. What do you see?


To such a person, Western faces would all look very similar. The converse is also true: Western people who have not been exposed to many Asian faces may find it difficult to discriminate among them. And if you are a shepherd who has spent years tending sheep, you may be quite good at identifying individual sheep, whereas they may all look alike to the naïve observer.

3.9. People are the same wherever you go

In the previous sections, we discussed properties of human vision, imagining a generic adult individual as the prototypical subject. To a very good first-order approximation, the basic observations described so far hold regardless of a person's gender, skin color, religion, cultural background, and even age, except for the first few years of life. People see the world in approximately the same way wherever you go.

There are exceptions to this rule. One exception, discussed in the previous section, is due to the role of experience: doctors may see structure when examining an X-ray image, and a shepherd can identify individual sheep. Other obvious exceptions include cases where the hardware is different or malfunctions. For example, as discussed in Chapter 2, many males have only two types of cones. Several other distinct eye conditions have been described, including amblyopia (reduced vision in one eye), nystagmus (repetitive uncontrolled eye movements), etc. Many people require corrective glasses to fix problems in accommodation by the eye's lens. Albinism also leads to vision challenges under bright lighting conditions. As we will discuss in Chapter 4, there are also brain lesions that lead to abnormal visual perception.

Age matters too. Many people develop cataracts or a condition known as macular degeneration as they age. Infants and very young children also see the world differently, not only because of their limited experience with the world, but also because their visual system is not fully mature; humans are not born with a fully developed visual system. The visual acuity of a newborn is approximately between 20/200 and 20/400, which means that what they see at 20 feet is comparable to what adults see at 200 or 400 feet. In the US, a person with a visual acuity of 20/200 or less is considered legally blind.

More recently, there has been increased interest in understanding differences in vision among normally sighted individuals. Although the general principles outlined in this chapter apply broadly, there remains an interesting amount of variation in perceptual abilities. An example of such variation was recently brought to the forefront during the rather passionate discussion about the color of a dress (Chapter 1): there was an approximately bimodal distribution of the color names used by people to describe the dress. There have also been studies documenting variability in other visual domains. For example, there is considerable variability in the ability to recognize faces, with some people being particularly good and others particularly bad.


Moving into higher psychological territory, beauty is in the brain of the beholder: there is considerable variation in visual aesthetic preferences.

3.10. Animals excel at vision too

In the next chapters, we will delve into the brain to inquire about the neural computations responsible for visual perception. It is easier to investigate the inside of non-human animals' (henceforth, animals) brains than the human brain. Therefore, a large fraction of the discussion in the next chapter will focus on other species. The converse is true of behavior: it is much easier to study visual behavior in humans than in other animals. This chapter has focused on human visual behavioral measurements. Before we delve into the brain, it is important to ask whether animals share all the amazing properties of vision described so far.

Almost every existing animal species has capitalized on the advantages of visual processing, from flies to fish to birds to rodents and primates. Diversity rules in biology: there is an extraordinary repertoire of variation in visual systems, and we cannot do justice here to the flamboyant arsenal of visual capabilities displayed in the animal kingdom. Animals adapted to their own niches and survival needs have developed their own specialized uses for visual processing.

Some properties of animal vision are distinct from human visual capabilities. Humans are limited to the visible part of the spectrum (defined as visible by humans!), whereas other species can sense ultraviolet light (e.g., mice, dogs, and many types of birds) or infrared light (e.g., many types of snakes). Whereas humans have three types of cone photoreceptors (Chapter 2), the number of cone types varies quite widely across the animal kingdom, with some species containing only one type of cone (e.g., various bats and the common raccoon), others containing two types (e.g., cats and dogs), and even species with sixteen different types of cones (the mantis shrimp) or up to thirty types (some species of dragonflies). Cuttlefish can also sense light polarization, which humans cannot.

Even the number and position of eyes show rather wide variation. Spiders have between eight and twelve eyes, five-armed starfish have five eyes, and horseshoe crabs have ten eyes. The position of the eyes dictates what information is accessible to the visual system. Snails have eyes on their tentacles; starfish have one in each arm. Even for species with two eyes, the position of the eyes plays a critical role. Approximately forward-facing eyes imply that the central part of the visual field is accessible to both eyes, enabling the estimation of depth from stereopsis (the small difference in sampling between the two eyes). At the same time, more laterally facing eyes provide a wider field of view. In an extreme example of laterally positioned eyes, rabbits have a blind spot at the center of the visual field. Humans have approximately 120 degrees of binocular field and a total visual field of approximately 180 degrees. In contrast, other animals with two eyes positioned to face the sides can have more than 300 degrees of visual field (e.g., cows, goats, horses).


In most cases, the two eyes in humans are essentially yoked together, so that their positions are strongly correlated. In contrast, other species, like the chameleon, can move each eye independently and focus on two completely different locations in the visual field.

The resolution of the visual system also shows enormous diversity, from animals like the starfish, which represents the entire visual scene with approximately 200 pixels, to species such as birds of prey, which surpass human acuity. Nocturnal predators (e.g., owls, tigers, lions, jaguars, leopards, geckos) have higher sensitivity than humans under low-light conditions.

The ability to detect movement is perhaps one of the few universal properties of visual systems, probably a testament to the importance of responding to moving predators and prey, as well as to other imminent looming dangers. Many species are particularly specialized to rapidly detect motion changes. For example, visually triggered wing movements can be evoked in dragonflies within about 30 ms of stimulus onset, faster than the time it takes for information to get out of the human retina.

Thus, the human visual system, as amazing as it is, is not unique. Multiple species display "better" vision in terms of the ability to detect ecologically relevant features, where what is ecologically relevant depends on the species. Our sense of vision largely dictates how we perceive the world around us. Without the aid of other tools, we are confined to an interpretation of the world based on our senses, and we are often arrogant or unimaginative enough to think that the world is exactly as we see it. The short list of visual system properties outlined above emphasizes that our view of the world is but one limited representation: we can see things that other species cannot, and vice versa. We are missing a lot of exciting action in the world.

What about the perceptual properties described earlier in this chapter? What do the Gestalt rules look like for other species? Can animals also perform pattern completion? Deciphering what animals perceive is not an easy task; it requires well-designed experiments and careful training. Monkeys, particularly rhesus macaque monkeys, constitute one of the main species of interest for studying the visual system. Their eyes are quite similar to human eyes, and it is possible to train them to perform rather sophisticated visual tasks. Chimpanzees and bonobos have even more similar visual systems, but they have been less explored, particularly in terms of their brain properties. Monkeys can be trained to perform multiple visual tasks, including discriminating the presence or absence of visual stimulation, reporting the direction of a moving stimulus, and detecting same/different in color, texture, shape, and other stimulus properties. Monkeys have been trained to discriminate complex objects, including faces as well as numeric symbols. Monkeys have been trained to trace lines and contours. Monkeys can even learn that the symbol 7 corresponds to 7 items on the screen and is larger than the number 3. Monkeys can also learn to play simple video games.


How well can monkeys and other animals extrapolate to novel stimuli on which they have not been trained? For example, to what extent are their recognition abilities tolerant to the types of image transformations described earlier in this chapter? We can define multiple levels of increasingly sophisticated abstraction in the ability to perform visual discrimination tasks (Herrnstein, 1990): (1) discrimination, as in evaluating the presence or absence of a light source; (2) rote categorization, as in the ability to memorize a few exemplars within a class of objects and distinguish those exemplars from a few exemplars of a different class; (3) open-ended categorization, extending the previous ability to situations with a large and perhaps continuous number of exemplars within a category; (4) concepts, where animals have the ability to draw inferences across different exemplars; and (5) abstract relations, dealing with relationships between exemplars as well as relations between concepts. Macaque monkeys do seem to be capable of a fairly sophisticated level of abstraction, including transformation-tolerant visual categorization. After training with a set of visual object categories, their performance, as well as their pattern of mistakes, resembles that of humans tested under the same conditions (Rajalingham et al., 2018). Yet there are also tasks that call into question how abstract those representations are. For example, monkeys excelling at a visual discrimination task in the upper left visual hemifield may have to be retrained extensively if asked to perform the same task in the bottom right visual hemifield, whereas humans can rapidly transfer their learning across stimulus locations.

Over the last decade, there has also been increased interest in using rodents, particularly mice and rats, to investigate visual function. Rodents offer multiple exciting advantages and opportunities, including the number of individuals that can be examined and the availability of an extraordinary repertoire of molecular tools. While the visual discrimination tasks that rodents have been trained on are rather limited compared to the behavioral skills of macaque monkeys, rats do seem able to perform basic comparisons between visual shapes, even with some degree of extrapolation to novel renderings of the objects in terms of size, rotation, and illumination (Zoccolan et al., 2009).

3.11. Summary

• Psychophysics refers to the quantification of behavior, including reaction time metrics, performance metrics, and eye movements.

• Brains make up stuff. Subjective perception is a construct that is constrained by sensory information in the light of previous experience. Visual illusions illustrate the dissociation between sensory inputs and perception.



• The Gestalt rules of perception describe how we normally group image parts to construct objects. Such rules include closure, proximity, similarity, figure/ground separation, continuity, and common fate.

• Visual recognition performance shows tolerance to large transformations of an image.

• It is possible to make inferences from partial information, for example, during recognition of occluded objects.

• Visual recognition is very fast. Many recognition questions can be answered in approximately 150 ms.

• Subjects can integrate information that is presented asynchronously, but only over a few tens of milliseconds.

• Contextual information can help recognize objects.

• Humans are quite consistent in their visual recognition abilities, yet there is inter-individual variability, particularly in tasks requiring extensive prior experience.

• Animals excel at vision too, and it is essential to study them in order to elucidate the mechanisms of vision.

3.12. References

Anstis SM (1970) Phi movement as a subtraction process. Vision Research 10:1411-1430.
Busey TA, Vanderkolk JR (2005) Behavioral and electrophysiological evidence for configural processing in fingerprint experts. Vision Research 45:431-448.
Cooper EE, Biederman I, Hummel JE (1992) Metric invariance in object recognition: a review and further evidence. Can J Psychol 46:191-214.
Eagleman DM (2001) Visual illusions and neurobiology. Nat Rev Neurosci 2:920-926.
Gauthier I, Behrmann M, Tarr M (1999) Can face recognition really be dissociated from object recognition? Journal of Cognitive Neuroscience 11:349-370.
Herrnstein RJ (1990) Levels of stimulus control: a functional approach. Cognition 37:133-166.
Johansson G (1973) Visual perception of biological motion and a model for its analysis. Perception and Psychophysics 14:201-211.
Kellman PJ, Cohen MH (1984) Kinetic subjective contours. Perception and Psychophysics.
Kirchner H, Thorpe SJ (2006) Ultra-rapid object detection with saccadic eye movements: visual processing speed revisited. Vision Research 46:1762-1776.
Mooney CM (1957) Age in the development of closure ability in children. Can J Psychol 11:219-226.
Nakayama K, He Z, Shimojo S (1995) Visual surface representation: a critical link between lower-level and higher-level vision. In: Visual Cognition (Kosslyn S, Osherson D, eds). Cambridge: The MIT Press.
Potter M, Levy E (1969) Recognition memory for a rapid sequence of pictures. Journal of Experimental Psychology 81:10-15.
Rajalingham R, Issa EB, Bashivan P, Kar K, Schmidt K, DiCarlo JJ (2018) Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience 38:7255-7269.
Regan D (2000) Human Perception of Objects. Sunderland, MA: Sinauer.
Sinha P, Poggio T (1996) I think I know that face. Nature 384:404.
Tanaka JW, Farah MJ (1993) Parts and wholes in face recognition. The Quarterly Journal of Experimental Psychology A 46:225-245.
Tang H, Buia C, Madhavan R, Madsen J, Anderson W, Crone N, Kreiman G (2014) Spatiotemporal dynamics underlying object completion in human ventral visual cortex. Neuron 83:736-748.
Tang H, Lotter W, Schrimpf M, Paredes A, Ortega J, Hardesty W, Cox D, Kreiman G (2018) Recurrent computations for visual pattern completion. PNAS 115:8835-8840.
Thorpe S, Fize D, Marlot C (1996) Speed of processing in the human visual system. Nature 381:520-522.
Valentine T (1988) Upside-down faces: a review of the effect of inversion upon face recognition. Br J Psychol 79 (Pt 4):471-491.
Vernon M (1954) Visual Perception. Cambridge: Harvard University Press.
Yin R (1969) Looking at upside-down faces. Journal of Experimental Psychology 81:141-145.
Young AW, Hellawell D, Hay DC (1987) Configurational information in face perception. Perception 16:747-759.
Zoccolan D, Oertelt N, DiCarlo JJ, Cox DD (2009) A rodent model for the study of invariant visual object recognition. Proceedings of the National Academy of Sciences of the United States of America 106:8748-8753.