A neglected problem in the Representational Theory of Mind: Object Tracking and the Mind-World Connection


  • Slide 1
  • A neglected problem in the Representational Theory of Mind: Object Tracking and the Mind-World Connection
  • Slide 2
  • Before I begin I would like you to see a video game to which I will refer later. The demonstration shows a task called Multiple Object Tracking: track the initially-distinct (flashing) items through the trial (here 10 secs) and indicate at the end which items are the targets. After each example I'd like you to ask yourself, "How do I do it?" If you are like most of our subjects you will have no idea, or a false idea.
  • Slide 3
  • Keep track of the objects that flash
  • Slide 4
  • Slide 5
  • How did you do it? What properties of individual objects did you use in order to track them? Did you use some grouping or chunking heuristic? Does your introspection reveal how you tracked the targets? Does your introspection ever reveal what processes go on in your mind?
  • Slide 6
  • How do we do it? What properties of individual objects do we use?
  • Slide 7
  • Going behind occluding surfaces does not disrupt tracking Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.
  • Slide 8
  • Not all well-defined features can be tracked: track the endpoints of these lines. The endpoints move exactly as the squares did!
  • Slide 9
  • Slide 10
  • What determines our behavior is not how the world is, but how we represent it as being. As Chomsky pointed out in his review of Skinner, if we describe behavior in relation to the objective properties of the world, we would have to conclude that behavior is essentially stimulus-independent. Every naturally-occurring behavioral regularity is cognitively penetrable: any information that changes beliefs can systematically and rationally change behavior. This is the basic problem of cognitive science.
  • Slide 11
  • Representation and Mind: why representations are essential. Do representations only come into play in higher-level mental activities, such as reasoning? Even at early stages of perception, many of the states that must be postulated are representations (i.e., their content, or what they are about, plays a role in explanations).
  • Slide 12
  • Examples from vision (1): Intrapercept constraints Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.
  • Slide 13
  • Another example of a classical representation
  • Slide 14
  • Other forms of representation. a) Lines FG, BC are parallel and equal. b) Lines EH, AD are parallel and equal. c) Lines FB, GC are parallel and equal. d) Lines EA, HD are parallel and equal. e) Vertices EF, HG, DC and AB are joined.... f) Part-Of{Cube, Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)} g) Part-Of{Top-Face(Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG)}
  • Slide 15
  • What's wrong with these representations? What's wrong is that the CTM is incomplete: it does not address a number of fundamental questions. It fails to specify how representations connect with what they represent. It's not enough to use English words in the representation (that's been a common confusion in AI) or to draw pictures (a common confusion in theories of mental imagery). English labels and pictures may help the theorist recall which objects are being referred to, but what makes it the case that a particular mental symbol refers to one thing rather than another? Or, how are concepts grounded? (the Symbol Grounding Problem)
  • Slide 16
  • Another way to look at what the Computational Theory of Mind lacks: the missing function in the CTM is a mechanism that allows perception to refer to individual things in the visual field directly, without appealing to their properties, i.e., nonconceptually: not as "whatever has properties P1, P2, P3, ...", but as a singular term that refers directly to an individual and does not appeal to a representation of the individual's properties. Such a reference is like a proper name, or like a demonstrative term ("this" or "that") in natural language, or like a pointer in a computer data structure. There is more to come on the mechanism of visual indexing.
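
To make the pointer analogy concrete, here is a minimal Python sketch (purely illustrative; the scene, the `pick_by_description` helper, and the `finst` variable are my own, not part of the theory) contrasting a descriptive reference, which works only through encoded properties, with a bare handle that refers to a token directly:

```python
# Illustrative toy scene; the dictionaries stand in for visual objects.
scene = [
    {"shape": "square", "color": "red"},
    {"shape": "circle", "color": "green"},
]

def pick_by_description(objs, **props):
    """Descriptive reference: 'whatever has properties P1, P2, ...'.
    It can only succeed through an encoding of the objects' properties."""
    return [o for o in objs if all(o.get(k) == v for k, v in props.items())]

# Demonstrative reference: an opaque handle bound directly to a token
# individual; holding it encodes nothing about the object, yet grants access.
finst = scene[1]

print(pick_by_description(scene, color="red"))  # [{'shape': 'square', 'color': 'red'}]
print(finst["shape"])                           # properties queried *through* the handle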
  • Slide 17
  • An example from personal history: why we need to pick out individual things without referring to their properties. We wanted to develop a computer system that would reason about geometry by actually drawing a diagram and noticing adventitious properties of the diagram, from which it would conjecture lemmas to prove. We wanted the system to be as psychologically realistic as possible, so we assumed that it had a narrow field of view and noticed only limited, spatially-restricted information as it examined the drawing. This immediately raised the problem of coordinating noticings, and led us to the idea of visual indexes to keep track of previously encoded parts of the diagram.
  • Slide 18
  • Begin by drawing a line. L1
  • Slide 19
  • Now draw a second line. L2
  • Slide 20
  • And draw a third line. L3
  • Slide 21
  • What do we have so far? We know there are three lines, but we don't know the spatial relations between them. That requires: 1. Seeing several of them together (at least in pairs). 2. Knowing which object seen at time t+1 corresponds to a particular object that was seen at time t. Establishing (2) requires solving one form of the correspondence problem. This problem is ubiquitous in perception. Solving it over time is called tracking.
  • Slide 22
  • For example, suppose you recall noticing two intersecting lines such as these: you know that there is an intersection of two lines, but which of the two lines you drew earlier are they? There is no way to indicate which individual things are seen again without a way to refer to individual token things. (Labels: L1, L2)
  • Slide 23
  • Look around some more to see what is there. Here is another intersection of two lines. Is it the same intersection as the one seen earlier? Without a special way to keep track of individuals, the only way to tell would be to encode unique properties of each of the lines. Which properties should you encode? (Labels: L2, L5, V12)
  • Slide 24
  • In examining a geometrical figure one only gets to see a sequence of local glimpses
  • Slide 25
  • A note about the use of labels in this example. There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex, ...). The other is to specify which individual it is, in order to keep track of it and in order to bind it to the argument of a predicate. The second of these is what I am concerned with, because indicating which individual it is is essential in vision. Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by "tags", but that won't do, since one cannot literally place a tag on an object, and even if we could it would not obviate the need to individuate and index, just as labels don't help. Labeling things in the world is not enough, because to refer to the line labeled L1 you would have to be able to think "this is line L1", and you could not think that unless you had a way of first picking out the referent of "this".
  • Slide 26
  • The Correspondence Problem. A frequent task in perception is to establish a correspondence between proximal tokens that arise from the same distal token. Apparent motion: tokens at different times may correspond to the same object that has moved. Constructing a representation over time (and over eye fixations) requires determining the correspondence between tokens at different stages in constructing the representation. Tracking token individuals over time/space: to distinguish "here it is again" from "here is another one", and so to maintain the identity of objects. Stereo vision requires establishing a correspondence between two proximal (retinal) tokens, one in each eye.
  • Slide 27
  • Apparent motion solves a correspondence problem. Dawson Configuration (Dawson & Pylyshyn, 1988). Linear trajectory? Curved trajectory? Which criterion does the visual module prefer?
  • Slide 28
  • Dawson Configuration (animated)
  • Slide 29
  • Apparent motion solves a correspondence problem. Dawson Configuration (Dawson & Pylyshyn, 1988). Nearest mean distance? Nearest vector distance? Nearest configural distance? Which criterion does the visual module prefer?
  • Slide 30
  • Dawson Configuration (animated)
  • Slide 31
  • Colors & shapes are ignored
  • Slide 32
  • Dawson Configuration: different properties ignored
  • Slide 33
  • Yantis's use of the Ternus Configuration to demonstrate the early visual effect of objecthood: short time delays result in element motion (the middle object persists as the same object, so it does not appear to move)
  • Slide 34
  • Long time delays result in group motion because the middle object does not persist but is perceived as a new object each time it reappears
  • Slide 35
  • Relevance to the present theme: these different examples illustrate the need to keep track of objects' numerical identity (or their same-individuality) in a primitive, non-conceptual way (and of putting their token representations in correspondence). In each case the correspondence is computed without any conscious awareness by the early vision module. The examples (apparent motion, stereo vision, incremental construction of representations, and keeping track of individuality over time/space) are on different time scales, so it is an empirical matter whether they involve the same mechanism, but they do address the same problem: tracking individuals without using their unique properties.
  • Slide 36
  • The incremental construction of visual representations requires solving a correspondence problem over time. We have to determine whether a particular individual element seen at time t is identical to another individual element seen at a previous time t − Δt. This is one manifestation of the correspondence problem. Solving the correspondence problem is equivalent to picking out and tracking the identity of token individuals as they change their appearance, their location, or the way they are encoded or conceptualized. To do that we need the capacity to refer to token individuals (I will call them "objects") without doing so by appealing to their properties. This requires a special form of demonstrative reference I call a Visual Index.
  • Slide 37
  • The difference between a direct (demonstrative) and a descriptive way of picking something out has produced many "You are here" cartoons. It is also illustrated in this recent New Yorker cartoon.
  • Slide 38
  • The difference between descriptive and demonstrative ways of picking something out (illustrated in this New Yorker cartoon by Sipress)
  • Slide 39
  • Picking out. Picking out entails individuating, in the sense of separating something from a background (what Gestalt psychologists called a figure-ground distinction). This sort of picking out has been studied in psychology under the heading of focal or selective attention. Focal attention appears to pick out and adhere to objects rather than places. In addition to a unitary focal attention, there is also evidence for a mechanism of multiple references (about 4 or 5), which I have called a visual index or a FINST. Indexes are different from focal attention in many ways that we have studied in our laboratory (I will mention a few later). A visual index is like a pointer in a computer data structure: it allows access but does not itself tell you anything about what is being pointed to. Note that the English word "pointer" is misleading because it suggests that vision picks out objects by pointing to their location.
  • Slide 40
  • The requirements for picking out and keeping track of several individual things reminded me of an early comic book character called Plastic Man
  • Slide 41
  • Imagine being able to place several of your fingers on things in the world without recognizing their properties while doing so. You could then refer to those things (e.g., what finger #2 is touching) and could move your attention to them. You would then be said to possess FINgers of INSTantiation (FINSTs).
  • Slide 42
  • FINST Theory postulates a limited number of pointers in early vision that are elicited by certain events in the visual field and that index the objects associated with the event. These enable vision to refer to those objects without doing so under a concept/description
  • Slide 43
  • This idea is intriguing, but it is missing one or two details, as well as some distinctions. We need to distinguish the mechanisms of early vision (inside the vision module) from those of general cognition. We need to distinguish different types of information in different parts of vision (e.g., representations vs physical states, conceptual vs nonconceptual, as well as personal vs subpersonal). Closely related to these, we need to distinguish the processes of vision from those of belief fixation. Finally, we need to provide a motivated proposal for what the modular (subpersonal?) part of vision hands off to the rest of the cognitive mind. This is a difficult problem and will occupy some of our time in the rest of this class.
  • Slide 44
  • Returning to the FINST Theory
  • Slide 45
  • First Approximation: FINSTs and Object Files and the link between the world and its conceptualization. [Diagram labels: Object File contents are conceptual!; information (causal) link; FINST demonstrative reference link.] The only nonconceptual contents in this picture are the FINST indexes!
  • Slide 46
  • Summarizing the theory so far. A FINST index is a primitive mechanism of reference that refers to individual visible objects in the world. There are a small number (~4-5) of indexes available at any one time. Indexes refer to individual objects without referring to them under conceptual categories, so they provide nonconceptual reference. Q: Is this a case of seeing without seeing as? Indexing objects is prior to encoding any of their properties, so objects are picked out and referred to without using any encoding of their properties. This does not mean that object properties are irrelevant to the grabbing of indexes or to the subsequent process of tracking. The claim that we initially refer to objects without having encoded their location is surprising to many people (why?). What may be even more surprising is that we can index and refer to objects without knowing what they are!
  • Slide 47
  • Summarizing the theory so far. An important function of these indexes is to bind the arguments of visual predicates to the things in the world to which they refer. Only predicates with bound arguments can be evaluated. Since predicates are quintessential concepts, an index serves as a bridge from objects to conceptual representations. Indexes can also bind the arguments of motor commands, including the command to move focal attention or gaze to the indexed object: e.g., MoveGaze(x). Some hard problems that Fodor and I will discuss in a later lecture: getting information about a particular object into its Object File. How and when does this happen? Who can use the information in an object file? Can it be used to track objects by checking whether a candidate object has the same properties as a particular previous object?
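
As a toy illustration of argument binding (my own sketch: the `Token` class and `move_gaze` stub are hypothetical; only the form MoveGaze(x) appears on the slide), a visual predicate becomes evaluable only once its argument slots are bound to indexed tokens:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """A token individual picked out by an index (hypothetical stand-in)."""
    x: float
    y: float

def collinear(p, q, r, tol=1e-9):
    """A 3-place visual predicate: evaluable only once p, q, r are bound."""
    return abs((q.x - p.x) * (r.y - p.y) - (r.x - p.x) * (q.y - p.y)) < tol

def move_gaze(target):
    """Sketch of a motor command, MoveGaze(x), whose argument is an index."""
    print(f"orienting gaze to the object at ({target.x}, {target.y})")

# Bindings assumed given; in the theory they are grabbed by events in the field.
a, b, c = Token(0, 0), Token(1, 1), Token(2, 2)
print(collinear(a, b, c))  # True: the predicate is evaluated over bound arguments
move_gaze(b)
```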
  • Slide 48
  • Some hard problems and some open empirical questions, to be discussed at various later lectures. How and when does information about a particular object get into its Object File? Who can use the information in an object file? Is the Object File inside the vision module or outside? Can the information in the file be used to determine the correspondence between objects by checking whether they have the same properties? Is this how tracking is accomplished? Is information in the Object File used to solve the many-properties binding problem? Is this done during tracking?
  • Slide 49
  • A note on terminology. I sometimes refer to an index as a pointer, or as a demonstrative, or even as the name of an object. All of these are misleading, because unlike a demonstrative an index is grabbed by things in the world independent of the intentions of a perceiver. An index is like a proper name, except that it can only refer to objects with which the perceiver is in sensory contact (in what Fodor calls the "perceptual circle"). Notice a strange consequence of the assumption that indexes do not pick out or refer to objects as members of a person-level equivalence class: when we have indexed something, we initially know neither where it is nor what it is!
  • Slide 50
  • Part 2
  • Slide 51
  • Slide 52
  • Some notes on how indexes might be implemented. A thought experiment: how might one implement an indexing system? The attempt might clarify how it is possible to index an object without having explicit access to the coordinates or other properties of objects. I will sketch a network model, but will only describe how it looks functionally to a user who pushes buttons and notices which lights come on. The model takes as input an activation map (on the proximal stimulus) with a set of sensors at each point (each pixel). Based on the relative activity at each point it indexes a number of active objects and illuminates a light for each. The user chooses one of the illuminated objects (nobody knows where they are) by pressing a button beside one of the lights.
  • Slide 53
  • Some notes on how indexes might be implemented. The person then presses a button on a property-detection panel marked with a property name. If the light beside the button illuminates, then we know that the object indexed in panel 2 has the property indicated in panel 3. The way this model is wired up is simple. The first panel feeds a Winner-Take-All network which inhibits every input unit but the most active one (a classical Darwinian or capitalist world). This enables a circuit from the button next to the illuminated light to the input unit which led to the light being on (that's the index). Pressing the button sends a unit of activity to that input unit, which now has a property transducer and an activity selector on (two out of the required 3 before it sends out a general tremor of activity). Now you press the button for a property inquiry (panel 3), which activates all P detectors. If the selected input unit, the property transducer, and the property inquiry signal are all on, that input fires.
  • Slide 54
  • Slide 55
  • Moral: It is trivial to design a circuit that allows one to check whether a particular place on a proximal stimulus that has grabbed an index has a particular property. All it takes are some threshold units and some "and" and "or" units. Although the simple black box I showed you can only detect one static input place at a time, it can inquire about several properties. Extending this to moving objects is easy using the same ideas: you partially activate regions near each selected input unit, thus increasing the likelihood that it will be selected at the next cycle, and so on.
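
A minimal sketch of the button-and-light thought experiment, assuming a crude winner-take-all stage over an activation map (the array shapes, thresholding, and function names are mine, not the original circuit):

```python
import numpy as np

N_INDEXES = 4  # FINST theory posits roughly 4-5 available indexes

def winner_take_all(activation, n=N_INDEXES):
    """Suppress all but the n most active input units and hand back their
    unit ids as opaque index handles (no coordinates are exposed)."""
    return list(np.argsort(activation.ravel())[-n:])

def query(handle, property_map):
    """Fires iff the indexed unit's property transducer is on: the conjunction
    of 'unit selected', 'transducer on', and 'inquiry signal' from the slide."""
    return bool(property_map.ravel()[handle])

activity = np.zeros((8, 8))
activity[1, 2], activity[5, 5], activity[6, 0] = 0.9, 0.8, 0.7  # three "objects"
is_red = np.zeros((8, 8), dtype=bool)
is_red[5, 5] = True

handles = winner_take_all(activity, n=3)
print([query(h, is_red) for h in handles])  # which indexed objects are red?
```

Extending this to motion, as the slide suggests, would amount to adding activation near each selected unit before the next winner-take-all cycle.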
  • Slide 56
  • Some evidence for indexes and Object Files The correspondence problem The binding problem Evaluating multi-place visual predicates Recognizing shapes by their part-whole relations Operating over several visual elements at once without having to search for them first Subitizing Subset search Multiple-Object Tracking Imagining space without requiring a spatial display in the head {This is a large topic beyond the scope of this class, but see Things and Places, Chapter 5}
  • Slide 57
  • A quick tour of some evidence for FINSTs The correspondence problem (mentioned earlier) The binding problem Evaluating multi-place visual predicates (recognizing multi-element patterns) Operating over several visual elements at once without having to search for them first Subitizing Subset selection Multiple-Object Tracking Imagining space without requiring a spatial display in the head
  • Slide 58
  • Pandemonium. An early architecture for vision, called Pandemonium, was proposed by Oliver Selfridge in 1959. This idea continues to be at the heart of many psychological models, including ones implemented in contemporary connectionist or neural net models. It is also the basic idea in what are called Blackboard Architectures in AI (e.g., the Hearsay speech recognition systems). These architectures have no way to represent that some of the features detected actually belong with other features detected.
  • Slide 59
  • Introduction to the Binding Problem: encoding conjunctions of properties. Experiments show the special difficulty that vision has in detecting conjunctions of several properties. It seems that items have to be attended (i.e., individuated and selected) in order for their property-conjunction to be encoded. When a display is not attended, conjunction errors are frequent.
  • Slide 60
  • Read the vertical line of digits in this display. What were the letters and their colors?
  • Slide 61
  • This is what you saw briefly. Under these conditions Conjunction Errors are very frequent.
  • Slide 62
  • Encoding conjunctions requires attention. One source of evidence is from search experiments: single-feature search is fast and appears to be independent of the number of items searched through (suggesting it is automatic and pre-attentive). Conjunction search is slower, and the time increases with the number of items searched through (suggesting it requires serial scanning of attention).
  • Slide 63
  • Rapid visual search (Treisman) Find the following simple figure in the next slide:
  • Slide 64
  • This case is easy and the time is independent of how many nontargets there are because there is only one red item. This is called a popout search
  • Slide 65
  • This case is also easy and the time is independent of how many nontargets there are because there is only one right-leaning item. This is also a popout search.
  • Slide 66
  • Rapid visual search (conjunction) Find the following simple figure in the next slide:
  • Slide 67
  • Slide 68
  • Constraints on nonconceptual representation of visual information (and the binding problem). Because early (nonconceptual) vision must not lose the conjunctive grouping of properties, visual properties can't just be represented as being present in the scene, because then the binding problem could not be solved! What else is required? The most common answer is that each property must be represented as being at a particular location. According to Peter Strawson and Austin Clark, the basic unit of sensory representation is Feature-F-at-location-L. This is the so-called "feature placing" proposal. This proposal fails for interesting empirical reasons. But if feature placing is not the answer, what is?
  • Slide 69
  • The role of attention to location in Treisman's Feature Integration Theory
  • Slide 70
  • Individual objects and the binding problem. We can distinguish scenes that differ by conjunctions of properties, so early vision must somehow keep track of how properties co-occur; the conjunction must not be obscured. How to do this is called the binding problem. The most common proposal is that vision keeps track of properties according to their location and binds together colocated properties.
  • Slide 71
  • The proposal of binding conjunctions by the location of conjuncts does not work when feature location is not punctate, and it becomes even more problematic if the features are colocated, e.g., if their relation is "inside".
  • Slide 72
  • Binding as object-based. The proposal that properties are conjoined by virtue of their common location has many problems. In order to assign a location to a property you need to know its boundaries, which requires distinguishing the object that has those properties from its background (figure-ground individuation). Properties are properties of objects, not of locations, which is why properties move when objects move; empty locations have no causal properties. The alternative to conjoining-by-location is conjoining-by-object. According to this view, solving the binding problem requires first selecting individual objects and then keeping track of each object's properties (in its object file, or OF). If only properties of selected objects are encoded, and if those properties are recorded in each object's OF, then all conjoined properties will be recorded in the same object file, thus solving the binding problem. (A sketch of this bookkeeping follows below.)
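
A minimal sketch of object-based binding, assuming integer index handles and a plain dictionary standing in for each object file (all names here are illustrative):

```python
from collections import defaultdict

object_files = defaultdict(dict)  # index handle -> property record ("OF")

def encode(handle, prop, value):
    """Encoding a property of a selected object writes into *its* file, so
    co-instantiated properties end up conjoined automatically."""
    object_files[handle][prop] = value

encode(1, "color", "red");  encode(1, "letter", "T")
encode(2, "color", "blue"); encode(2, "letter", "L")

# A "red T and blue L" scene is now distinct from a "blue T and red L" scene:
print(object_files[1], object_files[2])
```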
  • Slide 73
  • Attention spreads over perceived objects. Using a priming method, Egly, Driver and Rafal (1994) showed that the effect of a prime spreads to other parts of the same visual object, compared to equally distant parts of different objects. [Panel labels: spreads to B and not C; spreads to C and not B.]
  • Slide 74
  • A quick tour of some evidence for FINSTs The correspondence problem (mentioned earlier) The binding problem Evaluating multi-place visual predicates (recognizing multi-element patterns) Operating over several visual elements at once without having to search for them first Subitizing Subset selection Multiple-Object Tracking Cognizing space without requiring a spatial display in the head
  • Slide 75
  • Being able to refer to individual objects or object-parts is essential for recognizing patterns. Encoding relational predicates, e.g., Collinear(x,y,z,...), Inside(x,C), Above(x,y), Square(w,x,y,z), requires simultaneously binding the arguments of n-place predicates to n elements* in the visual scene. Evaluating such visual predicates requires individuating and referring to the objects over which the predicate is evaluated: i.e., the arguments in the predicate must be bound to individual elements in the scene. *Note: "elements" is used to refer to objects that serve as parts of other objects.
  • Slide 76
  • Several objects must be picked out at once in making relational judgments. When we judge that certain objects are collinear, we must first pick out the relevant objects while ignoring their properties.
  • Slide 77
  • Several objects must be picked out at once in making relational judgments. The same is true for other relational judgments like "inside" or "on-the-same-contour" etc. We must pick out the relevant individual objects first. Are the dots inside the same contour? On the same contour? *Note: Ullman (1984) has shown that some patterns cannot be recognized without doing so in a serial manner, where the serial elements must be indexed first. And that is yet another reason why Connectionist architectures cannot work!
  • Slide 78
  • A quick tour of some evidence for FINSTs The correspondence problem The binding problem Evaluating multi-place visual predicates (recognizing multi-element patterns) Operating over several visual elements at once without first having to search for them Subitizing Subset selection Multiple-Object Tracking Cognizing space without requiring a spatial display in the head
  • Slide 79
  • More functions of FINSTs: further experimental explorations. Recognizing the cardinality of small sets of things: subitizing vs counting (Trick, 1994). Searching through subsets: selecting items to search through (Burkell, 1997). Selecting subsets and maintaining the selection during a saccade (Currie, 2002). Application of FINST index theory to infant cardinality studies (Carey, Spelke, Leslie, Uller, etc). Indexes may explain how children are able to acquire words for objects by ostension without suffering Quine's "Gavagai" problem.
  • Slide 80
  • Signature subitizing phenomena only appear when objects are automatically individuated and indexed Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
  • Slide 81
  • Subitizing results. There is evidence that a different mechanism is involved in enumerating small (n ≤ 4) numbers of items (even different brain mechanisms: Dehaene & Cohen, 1994). Rapid small-number enumeration (subitizing) only occurs when items are first (automatically) individuated.* Unlike counting, subitizing is not enhanced by precuing location.* Subitizing is insensitive to distance among items.* Our account of what is special about subitizing is that once FINST indexes are assigned to n ≤ 4 individual objects, the objects can be enumerated without first searching for them. In fact they might be enumerated simply by counting active indexes, which is fast and accurate because it does not require visual scanning. *Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
  • Slide 82
  • Subset selection for search Burkell, J., & Pylyshyn, Z. W. (1997). Searching through subsets: A test of the visual indexing hypothesis. Spatial Vision, 11(2), 225-258.
  • Slide 83
  • Subset search results: only properties of the subset matter, but note that properties of the entire subset must be taken into account simultaneously (since that is what distinguishes a feature search from a conjunction search). If the subset constitutes a single-feature search, it is fast and the slope (RT vs number of items) is shallow. If the subset is a conjunction-search set, it takes longer, is more error-prone, and is more sensitive to the set size. As with subitizing, the distance between targets does not matter, so observers don't seem to be scanning the display looking for the target.
  • Slide 84
  • The stability of the visual world entails the capacity to track some individuals after a saccade. There is no problem about how the tactile sense can provide a stable world when you move around while keeping your fingers on the same objects, because in that case retaining individual identity is automatic. With FINSTs the same can be true in vision, at least for a small number of visual objects. This is compatible with the fact that one appears to retain the relative location of only about 4 elements during saccadic eye movements (Irwin, 1996). [Irwin, D. E. (1996). Integrating information across saccadic eye movements. Current Directions in Psychological Science, 5(3), 94-100.]
  • Slide 85
  • The selective search experiment with a saccade induced between the late-onset cues and the start of search: even with a saccade between selection and access, items can be accessed efficiently. The onset of new objects grabs indexes.
  • Slide 86
  • A quick tour of some evidence for FINSTs The correspondence problem (mentioned earlier) The binding problem Evaluating multi-place visual predicates (recognizing multi-element patterns) Operating over several visual elements at once without having to search for them first Subitizing Subset selection Multiple-Object Tracking Imagining space without requiring a spatial display in the head
  • Slide 87
  • Demonstrating the function of FINSTs with Multiple Object Tracking (MOT). In a typical experiment, 8 simple identical objects are presented on a screen and 4 of them are briefly distinguished in some visual manner, usually by flashing them on and off. After these 4 targets are briefly identified, all objects resume their identical appearance and move randomly. The observer's task is to keep track of the ones that had been designated as targets at the start. After a period of 5-10 seconds the motion stops and observers must indicate, using a mouse, which objects are the targets. (A skeleton of such a trial is sketched below.)
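
A bare-bones sketch of the trial structure just described (parameter values follow the slide: 8 objects, 4 targets; the field size, speed, and random-walk motion are my own simplifications, not the published displays):

```python
import random

def mot_trial(n_objects=8, n_targets=4, steps=300, speed=5.0, size=500.0):
    """One trial: identical objects, a briefly flagged target subset,
    a random-motion phase, then a report phase."""
    pos = [[random.uniform(0, size), random.uniform(0, size)]
           for _ in range(n_objects)]
    targets = set(random.sample(range(n_objects), n_targets))  # flashed at start
    for _ in range(steps):  # motion phase: identical-looking random walks
        for p in pos:
            p[0] = min(size, max(0.0, p[0] + random.uniform(-speed, speed)))
            p[1] = min(size, max(0.0, p[1] + random.uniform(-speed, speed)))
    return pos, targets  # observer must now indicate which items are targets

final_positions, true_targets = mot_trial()
print(sorted(true_targets))
```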
  • Slide 88
  • Another example of MOT: with self-occlusion
  • Slide 89
  • Self-occlusion does not seriously impair tracking
  • Slide 90
  • Some findings with Multiple Object Tracking. Basic finding: most people can track at least 4 targets that move randomly among identical non-target objects (even some 5-year-old children can track 3 objects). Object properties do not appear to be recorded during tracking, and tracking is not improved if no two objects have the same color, shape or size (asynch vs synch changes). How is tracking done? We showed that it is unlikely that the tracking is done by keeping a record of the targets' locations and updating them by serially visiting the objects (Pylyshyn & Storm, 1988). Other strategies may be employed (e.g., tracking a single deforming pattern), but they do not explain tracking. Hypothesis: FINST indexes are grabbed by the blinking targets. At the end of the trial these indexes can be used to move attention to the targets and hence to select them in making the response.
  • Slide 91
  • What role do visual properties play in MOT? Certain properties must be present in order for an index to be grabbed, and certain properties (probably different properties) must be present in order for the index to keep track of the object, but this does not mean that such properties are encoded, stored, or used in tracking. Is there something special about location? Do we record and track properties-at-locations? Location in time & space may be essential for individuating or clustering objects, but metrical coordinates need not be encoded or made cognitively available. The fact that an object is actually at some location or other does not mean that it is represented as such: representing property P (where P happens to be at location L) is not the same as representing property P-at-L.
  • Slide 92
  • A way of viewing what goes on in MOT. An object file may contain information about the object to which it is bound. But according to FINST Theory, keeping track of the object's identity does not require the use of this information. The evidence suggests that in MOT, little or nothing is stored in the object file. Occasionally some information may get encoded and entered in the Object File (e.g., when an object appears or disappears), but this is not used in the tracking process itself.* *We will see later that this has to be stated with care, since location may be stored in the object file and used, in a certain sense, when the usual continuous tracking does not work.
  • Slide 93
  • Another way of viewing MOT. What makes something the same object over time is that it remains connected to the same object-file by the same index. Thus, for something to be the same enduring object, no appeal to properties or concepts is needed; the only requirement is that it be trackable. Another view of tracking is that it is the basis of objecthood: an object is something that can be perceptually tracked (Fodor). There seems to be growing evidence that tracking is a reflex: it proceeds without interference from other attentive tasks.* Franconeri et al.** showed that the apparent sensitivity of tracking performance to such properties as speed is due to a confound of speed with object density. Distance between objects is critical to MOT performance, which is predicted by parallel tracking models. *Although tracking feels effortful, many secondary tasks (e.g., search) do not interfere with tracking. **Franconeri, S., Lin, J., Pylyshyn, Z., Fisher, B., & Enns, J. (2008). Evidence against a speed limit in multiple-object tracking. Psychonomic Bulletin & Review, 15(4), 802-808.
  • Slide 94
  • Why is this relevant to foundational questions in the philosophy of mind? According to Quine, Strawson, and most philosophers, you cannot pick out or track individuals without concepts (sortals). But you also cannot pick out individuals with only concepts: sooner or later you have to pick out individuals using non-conceptual causal connections between things and thoughts. The present proposal is that FINSTs provide the needed non-conceptual mechanism for individuating objects and for tracking their (numerical) identity, which works most of the time in our kind of world. It relies on some natural constraints (Marr). FINST indexes provide the right sort of connection to allow the arguments of predicates to be bound to objects prior to the predicates being evaluated. They may also be the basis for learning nouns by ostension.
  • Slide 95
  • But there must be some properties that cause indexes to be grabbed! Of course there are properties that are causally responsible for indexes being grabbed, and also properties (probably different ones) that make it possible for objects to be tracked; but these properties need not be represented (encoded) and used in tracking. The distinction between properties that cause indexes to be grabbed and those that are represented (in Object Files) is similar to Kripke's distinction between the properties that are needed to name an object (by baptism) and those that constitute its meaning.
  • Slide 96
  • Effect of target properties on MOT. Changes of object properties are not noticed during MOT. Keeping all targets different in color, size, or shape does not improve tracking. Observers do not use target speed or direction in tracking (e.g., they do not track by anticipating where the targets will be when they reappear after occlusion). Targets can go behind an opaque screen and come out the other side transformed in color, shape, speed or direction of motion (up to 60° from the pre-occlusion direction), without affecting tracking, but also without observers noticing the change! What affects tracking is the distance travelled while behind the occluding screen: the closer the reappearance to the point of disappearance, the better the tracking, even if the closer location happens to be in the middle of the occluding screen!
  • Slide 97
  • Some open questions. We have arrived at the view that only properties of selected (indexed) objects enter into subsequent conceptualization and perception-based thought (i.e., only information in object files is made available to cognition). So what happens to the rest of the visual information? Visual information seems rich and fine-grained, while this theory says that properties of only 4 or 5 objects are encoded! The present view also leaves no room for representations whose content corresponds to the content of conscious experience. According to the present view, the only content that modular nonconceptual representations have is the demonstrative content of indexes that refer to perceptual objects. Question: Why do we need any more than that?
  • Slide 98
  • An intriguing possibility: maybe the theoretically relevant information we take in is less than (or at least different from) what we experience. This possibility has received attention recently with the discovery of various "blindnesses" (e.g., change blindness, inattentional blindness, blindsight) as well as the discovery of independent vision systems (e.g., recognition and motor control). The qualitative content of conscious experience may not play a role in explanations of cognitive processes. Even if detailed quantitative information enters into causal processes (e.g., motor control), it may not be represented, not even as nonconceptual representation. For something to be a representation, its content must figure in explanations: it must capture generalizations, and it must have truth conditions and therefore allow for misrepresentation. It is an empirical question whether current proposals (e.g., the primal sketch, scenarios) do. cf. Devitt: "Pylyshyn's Razor".
  • Slide 99
  • An alternative view of reference by indexes. This provisional revised theory responds to Fodor's argument that there is no seeing without seeing-as. According to Fodor, the visual module must do more than the current theory assumes, because its output must provide the basis for induction over what something is seen as. This is not the traditional argument that percepts have a finer grain than most theories provide for, especially theories that assume a symbolic output, like this one; that argument relies too much on our phenomenology, which more often than not leads us astray. So the vision module must contain more than object files. It must be able to classify objects by their visual properties alone, or to compute for each object a particular appearance-class to which it belongs (see the black swan example).
  • Slide 100
  • An alternative view of reference by indexes. Since the vision module is encapsulated, it must have a mechanism for assigning each object x to an equivalence class based solely on what x looks like. It must do this for a large number of such classes, based both on its innate mechanisms and its visual experience [Look of x = L(x)]. L(x) is thus an equivalence class induced by the sensorium which includes the current token x. The L(x) associated with each token x must be sufficiently distinctive to allow the cognitive system to recognize x unambiguously as a token of something it knows about (e.g., L(x) => "looks like a cow" & "this is a farm" => "x is likely a cow"). The sequence from x to recognition must be correct most of the time in our kind of world (so it must embody a natural constraint).
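
One way to make the L(x) idea concrete, under heavy simplification: treat a token's sensory evidence as a feature vector and map it to the nearest stored "look". The prototypes, feature vectors, and nearest-neighbor rule below are all hypothetical illustration, not any published implementation:

```python
import math

# Hypothetical stored "looks": prototype feature vectors for appearance classes.
prototypes = {
    "cow-look":   (0.9, 0.1, 0.7),
    "horse-look": (0.8, 0.3, 0.9),
}

def L(x):
    """Map a token's sensory feature vector to the nearest appearance class,
    consulting only sensory information; no general knowledge is used."""
    return min(prototypes, key=lambda c: math.dist(prototypes[c], x))

token = (0.88, 0.15, 0.72)
look = L(token)  # -> "cow-look"

# Recognition proper happens *outside* the module, where knowledge applies:
if look == "cow-look" and "this is a farm" in {"this is a farm"}:
    print("x is likely a cow")
```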
  • Slide 101
  • An alternative view of reference by indexes. This idea of an appearance class L(x) has been explored in computational vision, where a number of different functions have been proposed, many of them based on mathematical compression or encoding functions. An early idea, which has implications for the present discussion, is a proposal by David Marr called the Multiple-View proposal. He wrote: "The Multiple View representation is based on the insight that if one chooses one's primitives correctly, the number of qualitatively different views of an object may be quite small", and Marr cites Minsky as speculating that the representation of a 3D shape might consist of a catalog of different appearances of that shape, and that the catalog may not need to be very large (Marr & Nishihara, 1976). The search for the most general form of representation has yielded many proposals, many of which have been tested in psychology labs, e.g., generalized cylinders and part-decomposition: Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-148.
  • Slide 102
  • Seeing without Seeing As? It's true that instances of visual encounters deliver an equivalence class to which the object belongs by virtue of its appearance, as mapped by the function L(x). It is an appearance class because it can only use information from the sensorium and the natural constraints built into the modular vision system. So in that respect one might say that seeing is always a "seeing as", where the relevant category is L(x). But this is unlikely to be the category under which the object enters into thought. So the kind of seeing-as category L(x) is not the same category as the one under which the object is contemplated in thought, where its category would depend on background knowledge and personal history. The appearance L(x) is then replaced by familiar categories of thought (e.g., card table, Ford car, Coca-Cola bottle, Warhol Brillo Box, and so on, categories rich in their interconnections).
  • Slide 103
  • More on the structure of the Visual Module. In order to compute L(x), the vision module must possess enough machinery to map a token object x onto an equivalence class designated by L(x), using only sensory information and module-specific processes and representations, without appealing to general knowledge. The module must also have some 4-5 Object Files, because it needs those to solve the binding problem as well as to bind predicate arguments to objects (and also to use the proposed Recognition-by-Components process for recognizing complex objects).
  • Slide 104
  • An alternative view of what's in the module. The alternative view of what goes on inside the visual module would furnish it with more processes for the catalog and lookup of object shape-types L(x). Our assumptions would seem to require that this augmented machinery also be barred from accessing cognitive memories and general inference capacity. Does this conflict with Fodor's requirement that the output be right for belief fixation? [Diagram: a modular vision computer whose input is sensory information and whose output is a standard form for the appearance of objects, L(x). Which functions are in the visual module? Three options: Minimal (just indexes), Original (indexes and files), Maximal (computing L(x)).]
  • Slide 105
  • Summary of the current FINST model. Up to 5 indexes can be grabbed, based on local properties. Active indexes bind objects to object files (initially empty). Bound objects can then be queried* and salient properties encoded in their Object File (*does this require voluntary attention?). Indexes stay bound to the objects that grabbed them even as the objects change any of their properties, including briefly disappearing behind an occluding screen. When the objects change their location, the result is tracking, which is automatic/reflexive. We also have evidence that objects can be tracked through other continuously changing properties (Blaser, Pylyshyn & Holcombe, 2000). The only factor that impairs tracking performance is spacing: too close yields item-ambiguity and tracking errors.
  • Slide 106
  • Tracking and spatial proximity. Many experiments show that the only factor that affects tracking performance is inter-item spacing: when items are too close there is item-ambiguity, resulting in tracking errors. Other factors that allegedly impair tracking (e.g., speed) do so only because they affect average spacing. The very process of tracking, which requires something like smooth continuous movement, makes use of proximity. So does the process of Gestalt individuation, which must collect nearby pixels and features (regardless of type). We have many results showing that when objects disappear, their only recalled property is where they were at the time, and the only thing that determines how well they continue to be tracked when they reappear is how far away they have moved. Franconeri, S., Pylyshyn, Z. W., & Scholl, B. J. (2012). A simple proximity heuristic allows tracking of multiple objects through occlusion. Attention, Perception, & Psychophysics, 72(4).
  • Slide 107
  • How is location stored and used? It is possible that location is stored in object files, since it is one of the more important properties of moving objects. Object location is a property that must be used in tracking, since to track smoothly moving objects just is to solve the correspondence problem by taking the nearest object. Many experiments show that the correspondence problem in this case does not involve choosing the most similar object, or the one moving with the same speed or in the same direction, but the closest one to the locus of disappearance. Does this mean that object location is stored and used in tracking, contrary to my earlier claim? Maybe, but that depends on whether location is in this case a conceptual property and tracking is a process involving conceptual representations, and there is evidence that it is not. (A sketch of the nearest-object rule follows below.)
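
A minimal sketch of the nearest-object rule just described (function and variable names are mine; the slides commit only to "take the closest item to the locus of disappearance, ignoring similarity, speed, and direction"):

```python
import math

def proximity_correspondence(last_seen, reappeared):
    """For each vanished target, pick the closest reappearing candidate,
    ignoring color, shape, speed and heading entirely."""
    return {tid: min(reappeared, key=lambda c: math.dist(p, c))
            for tid, p in last_seen.items()}

print(proximity_correspondence(
    {"T1": (100, 40), "T2": (300, 40)},
    [(110, 45), (290, 50), (500, 400)],
))
```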
  • Slide 108
  • Is location a conceptual property? Is location in this case a conceptual property, and is tracking a process involving conceptual representations? Computing correspondence and tracking are prototypical automatic and cognitively impenetrable processes, likely computed by local parallel processes, which suggests that tracking is subpersonal, modular and nonconceptual, since most automatic processes are nonconceptual. Location plays a critical part in all motor control, and there is reason to believe that it plays this role in a different way than conceptual information does: it typically involves a different visual system, the dorsal pathway. A great deal of evidence is now available showing that only the ventral pathway contributes to object recognition, while the dorsal pathway is specialized for motor control (Milner & Goodale, 1995; 2004). All in all, it seems more likely that location is used in MOT and other visual processes, but that its use is not a conceptual process at all. If you accept that location is conceptual, you pay a high price: you lose the goal of finding a nonconceptual link between cognition and the world!
  • Slide 109
  • A note on top-down vs bottom-up flow. This dichotomy has been the source of a great deal of misunderstanding of what goes on in a module. Information does flow in both directions, but our claim is that it does so only within the visual module, not across the capsule boundary. A more interesting distinction is alluded to here under the phrase "the visual system queries o", as opposed to passively receiving information. This is a question of where control resides. An appropriate event in the visual scene grabs an index: the responsibility here rests with an external event. In computer talk this is called an interrupt, because the ambient process is interrupted by an external event. But to say the system interrogates the scene through the index is to say that the initiative belongs with the internal process. In computer talk this corresponds to a test operation.
  • Slide 110
  • A note on top-down vs bottom-up flow. What is interesting about the distinction between interrupt and test is that only an interrupt can be open-ended. Things can be set up so that it is not known in advance what sort of event will cause an interrupt. On the other hand, you can't have a test operation unless you specify what you are testing for, which, like "see as" or "select for", is an intentional act, whereas an interrupt can be a causal event. So in our case "grabbing" is a causal event whereas "querying" is an intentional, representation-governed event. Similarly, "selecting" is intentional, so scanning or switching visual attention would be intentional, whereas attention can also be elicited, in which case the event would be causal. There may also be combinations of the two, as when you decide to track certain targets: to do that, according to this theory, you combine the intentional act of moving your focal attention to a particular object with the causal event whereby some object in the scope of the attention is enabled and can then grab an index.
  • Slide 111
  • Summary of the augmented FINST model. So far the only visual information that is available to the mind is contained in the Object Files in the visual module. The index mechanism discussed so far also makes it possible to use additional currently perceived information (see Things & Places, Chapter 5). Information in the module is in a symbolic form very similar to the subsequent conceptual representation, except: it is encoded in the vocabulary of modular (subpersonal) categories (which many would call nonconceptual), not in person-level conceptual vocabulary. Construction of the intramodular representation cannot use general knowledge, so all relevant representations must reside in the module. The intramodular representation uses information in Object Files and preserves its bindings. The Object Files are the only mechanisms for dealing with the general binding problem, as well as the problem of binding predicate arguments to objects in the world.
  • Slide 112
  • Open questions about the augmented FINST model. The modular processes must somehow recover the relations between objects, and these may or may not be encoded in OFs. Since information in the module may serve a number of subsequent functions, including visual-motor coordination and multimodal perceptual integration, it will have to represent metrical information, very likely in a nonconceptual form. The question of representing metrical information is one we leave for the future, since little is known about how analogue representation might function in cognition. We now arrive at a central question of considerable importance to the view we are promoting: what form is the visual representation in when it is handed on to cognition?
  • Slide 113
  • END
  • Slide 114
  • Vision science has always been deeply ambivalent about the role of conscious experience. Isn't how things appear one of the things that our theories must explain? Answer: there is no a priori "must explain"! The content of subjective experience is a major type of evidence, but it may turn out not to be the most reliable source for inferring the relevant functional states; it competes with other types of evidence. How things appear cannot be taken at face value: it carries substantive theoretical assumptions, and it draws on many levels of processing. It was a serious obstacle to early theories of vision (Kepler). It has been a poor guide in the case of theories of mental imagery (e.g., color mixing, image size, image distances); "reading X off an image" is an illusion. It seems likely that vision science will use evidence of conscious experience the way linguistics uses evidence of grammatical intuitions: only as it is filtered through developing theories. The questions a science is expected to answer cannot be set in advance; they change as the science develops.
  • Slide 115
  • What next? This picture leaves many unanswered questions, but it does provide a mechanism for solving the binding problem and also explaining how mental representations could have a nonconceptual connection with objects in the world (something required if mental representations are to connect with actions)
  • Slide 116
  • Schema for how FINSTs function in hockey
  • Slide 117
  • For a copy of these slides see: http://ruccs.rutgers.edu/faculty/pylyshyn/SelectionReference.ppt Or the MIT Press paperback
  • Slide 118
  • Index capacity and training. Daphne Bavelier's lab (Rochester) has shown that videogame players can track a larger number of objects in MOT. Jose Rivest (York) has shown that some athletes can track more targets than non-athletes. Within individuals, the main determiner of the number of targets that can be tracked is the spacing between them.
  • Slide 119
  • [Map cartoon: "You are now here (X)" ... "But you are also here"]
  • Slide 120
  • Slide 121
  • Additional examples of MOT: MOT with occlusion; MOT with virtual occluders; MOT with matched nonoccluding disappearance; track endpoints of lines; track rubber-band linked boxes; track and remember ID by location; track and remember ID by name (number); track while everything briefly disappears ( sec) and goes on moving while invisible; track while everything briefly disappears and reappears where the items were when they disappeared