A neglected problem in the Representational Theory of Mind
Object Tracking and the Mind-World Connection
Slide 2
Before I begin, I would like you to see a video game to which I
will refer later. The demonstration shows a task called Multiple
Object Tracking. Track the initially distinct (flashing) items
through the trial (here 10 secs) and indicate at the end which
items are the targets. After each example I'd like you to ask
yourself, "How do I do it?" If you are like most of our subjects,
you will have no idea, or a false idea.
Slide 3
Keep track of the objects that flash
Slide 4
Slide 5
How did you do it? What properties of individual objects did
you use in order to track them? Did you use some grouping or
chunking heuristic? Does your introspection reveal how you tracked
the targets? Does your introspection ever reveal what processes go
on in your mind?
Slide 6
How do we do it? What properties of individual objects do we
use?
Slide 7
Going behind occluding surfaces does not disrupt tracking.
Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple
items through occlusion: Clues to visual objecthood. Cognitive
Psychology, 38(2), 259-290.
Slide 8
Not all well-defined features can be tracked: track the endpoints
of these lines. The endpoints move exactly as the squares did!
Slide 9
Slide 10
What determines our behavior is not how the world is, but how
we represent it as being. As Chomsky pointed out in his review of
Skinner, if we describe behavior in relation to the objective
properties of the world, we would have to conclude that behavior is
essentially stimulus-independent. Every naturally-occurring
behavioral regularity is cognitively penetrable. Any information
that changes beliefs can systematically and rationally change
behavior.
The basic problem of cognitive science
Slide 11
Representation and Mind: why representations are essential. Do
representations only come into play in higher level mental
activities, such as reasoning? Even at early stages of perception
many of the states that must be postulated are representations
(i.e., their content, or what they are about, plays a role in
explanations).
Slide 12
Examples from vision (1): Intrapercept constraints. Epstein, W.
(1982). Percept-percept couplings. Perception, 11, 75-83.
Slide 13
Another example of a classical representation
Slide 14
Other forms of representation.
a) Lines FG, BC are parallel and equal.
b) Lines EH, AD are parallel and equal.
c) Lines FB, GC are parallel and equal.
d) Lines EA, HD are parallel and equal.
e) Vertices EF, HG, DC and AB are joined....
f) Part-Of{Cube, Top-Face(EFGH), Bottom-Face(ABCD),
   Front-Face(FGCB), Back-Face(EHDA)}
g) Part-Of{Top-Face(Front-Edge(FG), Back-Edge(EH),
   Left-Edge(EF), Right-Edge(HG)},
Slide 15
What's wrong with these representations? What's wrong is that the
CTM is incomplete: it does not address a number of fundamental
questions. It fails to specify how representations connect with what
they represent. It's not enough to use English words in the
representation (that's been a common confusion in AI) or to draw
pictures (a common confusion in theories of mental imagery). English
labels and pictures may help the theorist recall which objects are
being referred to, but what makes it the case that a particular
mental symbol refers to one thing rather than another? Or, how are
concepts grounded? (The Symbol Grounding Problem.)
Slide 16
Another way to look at what the Computational Theory of Mind
lacks. The missing function in the CTM is a mechanism that allows
perception to refer to individual things in the visual field
directly, without appealing to their properties, i.e.,
nonconceptually: not as "whatever has properties P1, P2, P3, ...",
but as a singular term that refers directly to an individual and
does not appeal to a representation of the individual's properties.
Such a reference is like a proper name, or like a demonstrative
term (like "this" or "that") in natural language, or like a pointer
in a computer data structure. There is more to come on the
mechanism of visual indexing.
Slide 17
An example from personal history: why we need to pick out
individual things without referring to their properties. We wanted
to develop a computer system that would reason about geometry by
actually drawing a diagram and noticing adventitious properties of
the diagram, from which it would conjecture lemmas to prove. We
wanted the system to be as psychologically realistic as possible, so
we assumed that it had a narrow field of view and noticed only
limited, spatially-restricted information as it examined the
drawing. This immediately raised the problem of coordinating
noticings and led us to the idea of visual indexes to keep track of
previously encoded parts of the diagram.
Slide 18
Begin by drawing a line. L1
Slide 19
Now draw a second line. L2
Slide 20
And draw a third line. L3
Slide 21
What do we have so far? We know there are three lines, but we
don't know the spatial relations between them. That requires:
1. Seeing several of them together (at least in pairs).
2. Knowing which object seen at time t+1 corresponds to a
   particular object that was seen at time t.
Establishing (2) requires solving one form of the correspondence
problem. This problem is ubiquitous in perception. Solving it over
time is called tracking.
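One form of the correspondence problem can be made concrete with a
small sketch. The Python below only illustrates the problem; it is
not a claim about how vision solves it. It assumes that tokens are
2-D points and pairs each token seen at time t with the closest
unclaimed token seen at time t+1 (a greedy nearest-neighbor rule).
The function name `match_tokens` and the matching rule are
assumptions introduced here for illustration.

```python
import math


def match_tokens(prev, curr):
    """Greedy nearest-neighbor correspondence between two frames.

    prev, curr: lists of (x, y) token positions at time t and t+1.
    Returns a dict mapping each position in `prev` (by list index)
    to the list index of its partner in `curr`.
    Hypothetical helper, for illustration only.
    """
    # Every candidate pairing, closest pairs first.
    pairs = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(prev)
        for j, c in enumerate(curr)
    )
    mapping, used = {}, set()
    for _, i, j in pairs:
        # Accept a pairing only if neither token is already matched.
        if i not in mapping and j not in used:
            mapping[i] = j
            used.add(j)
    return mapping


frame_t = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
# The same three objects, slightly moved and listed in another order.
frame_t1 = [(5.5, 8.2), (0.4, 0.3), (9.8, -0.5)]
print(match_tokens(frame_t, frame_t1))
```

Note that this rule relies on an encoded property (location) of
every token, which is exactly the kind of dependence the rest of the
lecture argues early vision can do without.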
Slide 22
For example, suppose you recall noticing two intersecting lines
such as these. You know that there is an intersection of two lines.
But which of the two lines you drew earlier are they? There is no
way to indicate which individual things are seen again without a
way to refer to individual token things. L1 L2
Slide 23
Look around some more to see what is there. Here is another
intersection of two lines. Is it the same intersection as the one
seen earlier? Without a special way to keep track of individuals,
the only way to tell would be to encode unique properties of each
of the lines. Which properties should you encode? L5 L2 V12
Slide 24
In examining a geometrical figure one only gets to see a
sequence of local glimpses
Slide 25
A note about the use of labels in this example. There are two
purposes for figure labels. One is to specify what type of
individual it is (line, vertex, ...). The other is to specify which
individual it is, in order to keep track of it and in order to bind
it to the argument of a predicate. The second of these is what I am
concerned with, because indicating which individual it is is
essential in vision. Many people (e.g., Marr, Yantis) have
suggested that individuals may be marked by tags, but that won't do,
since one cannot literally place a tag on an object, and even if we
could it would not obviate the need to individuate and index, just
as labels don't help. Labeling things in the world is not enough
because to refer to the line labeled L1 you would have to be able
to think "this is line L1", and you could not think that unless you
had a way of first picking out the referent of "this".
Slide 26
The Correspondence Problem. A frequent task in perception is to
establish a correspondence between proximal tokens that arise from
the same distal token.
Apparent motion: tokens at different times may correspond to the
same object that has moved.
Constructing a representation over time (and over eye fixations)
requires determining the correspondence between tokens at different
stages in constructing the representation.
Tracking token individuals over time/space: distinguishing "here it
is again" from "here is another one", and so maintaining the
identity of objects.
Stereo vision requires establishing a correspondence between two
proximal (retinal) tokens, one in each eye.
Slide 27
Apparent motion solves a correspondence problem. Dawson
Configuration (Dawson & Pylyshyn, 1988). Linear trajectory? Curved
trajectory? Which criterion does the visual module prefer?
Slide 28
Dawson Configuration (animated)
Slide 29
Apparent motion solves a correspondence problem. Dawson
Configuration (Dawson & Pylyshyn, 1988). Nearest mean distance?
Nearest vector distance? Nearest configural distance? Which
criterion does the visual module prefer?
Slide 30
Dawson Configuration (animated)
Slide 31
Colors & shapes are ignored
Slide 32
Dawson Configuration: different properties are ignored
Slide 33
Yantis's use of the Ternus Configuration to demonstrate the early
visual effect of objecthood. Short time delays result in element
motion (the middle object persists as the same object, so it does
not appear to move).
Slide 34
Long time delays result in group motion because the middle
object does not persist but is perceived as a new object each time
it reappears
Slide 35
Relevance to the present theme. These different examples
illustrate the need to keep track of objects' numerical identity (or
their same-individuality) in a primitive, non-conceptual way (and
of putting their token representations in correspondence). In each
case the correspondence is computed without any conscious awareness
by the early vision module. The examples (apparent motion,
stereovision, incremental construction of representations, and
keeping track of individuality over time/space) are on different
time scales, so it is an empirical matter whether they involve the
same mechanism, but they do address the same problem: tracking
individuals without using their unique properties.
Slide 36
The incremental construction of visual representations requires
solving a correspondence problem over time. We have to determine
whether a particular individual element seen at time t is identical
to another individual element seen at a previous time, before t.
This is one manifestation of the correspondence problem. Solving the
correspondence problem is equivalent to picking out and tracking
the identity of token individuals as they change their appearance,
their location, or the way they are encoded or conceptualized. To do
that we need the capacity to refer to token individuals (I will
call them objects) without doing so by appealing to their
properties. This requires a special form of demonstrative reference
I call a Visual Index.
Slide 37
The difference between a direct (demonstrative) and a
descriptive way of picking something out has produced many "You are
here" cartoons. It is also illustrated in this recent New Yorker
cartoon.
Slide 38
The difference between descriptive and demonstrative ways of
picking something out (illustrated in this New Yorker cartoon by
Sipress)
Slide 39
Picking out. Picking out entails individuating, in the sense of
separating something from a background (what Gestalt psychologists
called a figure-ground distinction). This sort of picking out has
been studied in psychology under the heading of focal or selective
attention. Focal attention appears to pick out and adhere to
objects rather than places. In addition to a unitary focal attention
there is also evidence for a mechanism of multiple references
(about 4 or 5) that I have called a visual index or a FINST.
Indexes are different from focal attention in many ways that we
have studied in our laboratory (I will mention a few later). A
visual index is like a pointer in a computer data structure: it
allows access but does not itself tell you anything about what is
being pointed to. Note that the English word "pointer" is misleading
because it suggests that vision picks out objects by pointing to
their location.
Slide 40
The requirements for picking out and keeping track of several
individual things reminded me of an early comic book character
called Plastic Man
Slide 41
Imagine being able to place several of your fingers on things
in the world without recognizing their properties while doing so.
You could then refer to those things (e.g., "what finger #2 is
touching") and could move your attention to them. You would then be
said to possess FINgers of INSTantiation (FINSTs).
Slide 42
FINST Theory postulates a limited number of pointers in early
vision that are elicited by certain events in the visual field and
that index the objects associated with the event. These enable
vision to refer to those objects without doing so under a
concept/description
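The idea of a limited pool of pointers can be sketched in a few
lines of Python. This is a minimal illustration under stated
assumptions, not an implementation of the theory: the class names
`IndexPool` and `Thing`, the capacity of 4, and the rule that the
first free index is the one grabbed are all hypothetical. What the
sketch shows is that an index holds a reference to an object and
nothing else, so even objects with identical properties are kept
apart.

```python
class IndexPool:
    """A small, fixed pool of FINST-like indexes (illustrative only)."""

    def __init__(self, capacity=4):
        # Each slot is either free (None) or refers directly to one object.
        self.slots = [None] * capacity

    def grab(self, obj):
        """Called when an event (e.g. an onset) occurs at `obj`.

        Returns the number of the index now bound to `obj`,
        or None if every index is already in use.
        """
        for i, held in enumerate(self.slots):
            if held is None:
                self.slots[i] = obj  # store the reference, not a description
                return i
        return None

    def referent(self, i):
        """Follow index i to the object it currently refers to."""
        return self.slots[i]

    def release(self, i):
        """Free index i so that a later event can grab it."""
        self.slots[i] = None


class Thing:
    """Stand-in for a distal object; its properties stay with the object."""

    def __init__(self, color, shape):
        self.color, self.shape = color, shape


pool = IndexPool(capacity=4)
things = [Thing("red", "square") for _ in range(6)]  # six identical objects
bound = [pool.grab(t) for t in things]
print(bound)                          # [0, 1, 2, 3, None, None]
print(pool.referent(2) is things[2])  # True
```

Only four of the six objects obtain an index, and the pool never
consults `color` or `shape`: access to properties comes afterwards,
by following the index to its referent.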
Slide 43
This idea is intriguing, but it is missing one or two details as
well as some distinctions. We need to distinguish the mechanisms of
early vision (inside the vision module) from those of general
cognition. We need to distinguish different types of information in
different parts of vision (e.g., representations vs physical
states, conceptual vs nonconceptual, as well as personal vs
subpersonal). Closely related to these, we need to distinguish
the processes of vision from those of belief fixation.
Finally, we need to provide a motivated proposal for what the
modular (subpersonal?) part of vision hands off to the rest of the
cognitive mind. This is a difficult problem and will occupy some of
our time in the rest of this class.
Slide 44
Returning to the FINST Theory
Slide 45
First Approximation: FINSTs and Object Files and the link
between the world and its conceptualization.
Object File contents are conceptual!
[Diagram labels: Information (causal) link; FINST; Demonstrative
reference link]
The only nonconceptual contents in this picture are FINST indexes!
Slide 46
Summarizing the theory so far. A FINST index is a primitive
mechanism of reference that refers to individual visible objects in
the world. There is a small number (~4-5) of indexes available at
any one time. Indexes refer to individual objects without referring
to them under conceptual categories, so they provide nonconceptual
reference. Q: Is this a case of "seeing" without "seeing as"?
Indexing objects is prior to encoding any of their properties. So
objects are picked out and referred to without using any encoding of
their properties. This does not mean that object properties are
irrelevant to the grabbing of indexes or the subsequent process of
tracking. The claim that we initially refer to objects without
having encoded their location is surprising to many people (why?).
What may be even more surprising is that we can index and refer to
objects without knowing what they are!
Slide 47
Summarizing the theory so far. An important function of these
indexes is to bind arguments of visual predicates to things in the
world to which they refer. Only predicates with bound arguments can
be evaluated. Since predicates are quintessential concepts, an
index serves as a bridge from objects to conceptual
representations. Indexes can also bind arguments of motor commands,
including the command to move focal attention or gaze to the
indexed object: e.g., MoveGaze(x).
Some hard problems that Fodor and I will discuss at a later lecture:
Getting information about a particular object into its Object File.
How and when does this happen? Who can use the information in an
object file? Can it be used to track objects by checking whether a
candidate object has the same properties as a particular previous
object?
Slide 48
Some hard problems and some open empirical questions, to be
discussed at various later lectures. How and when does information
about a particular object get into its Object File? Who can use the
information in an object file? Is the Object File inside the vision
module or outside? Can the information in the file be used to
determine the correspondence between objects by checking whether
they have the same properties? Is this how tracking is
accomplished? Is information in the Object File used to solve the
many-properties binding problem? Is this done during tracking?
Slide 49
A note on terminology. I sometimes refer to an index as a
pointer, or as a demonstrative, or even as the name of an object.
All of these are misleading because, unlike a demonstrative, an
index is grabbed by things in the world independent of the
intentions of a perceiver. And although an index is like a proper
name, it can only refer to objects with which the perceiver is in
sensory contact (in what Fodor calls the perceptual circle). Notice
a strange consequence of the assumption that indexes do not pick out
or refer to objects as members of a person-level equivalence class:
when we have indexed something, we initially do not know where it is
nor what it is!
Slide 50
Part 2
Slide 51
Slide 52
Some notes on how indexes might be implemented. A thought
experiment: how might one implement an indexing system? The attempt
might clarify how it is possible to index an object without having
explicit access to the coordinates or other properties of objects.
I will sketch a network model but will only describe how it looks
functionally to a user who pushes buttons and notices which lights
come on. The model takes as input an activation map (on the
proximal stimulus) with a set of sensors at each point (each
pixel). Based on the relative activity at each point, it indexes a
number of active objects and illuminates a light for each. The user
chooses one of the illuminated objects (by name; nobody knows where
they are) by pressing a button beside one of the lights.
Slide 53
Some notes on how indexes might be implemented. The person then
presses a button on a property detection panel marked with a
property name. If the light beside the button illuminates, then we
know that the object indexed in panel 2 has the property indicated
on panel 3. The way this model is wired up is simple. The first
panel feeds a Winner-Take-All network which inhibits every input
unit but the most active one (a classical Darwinian or capitalist
world). This enables a circuit from the button next to the
illuminated light to the input unit which led to the light being on
(that's the index). Pressing the button sends a unit of activity to
that input unit, which now has a property transducer and an activity
selector on (two of the three signals required before it sends out a
general tremor of activity). Now you press the button by a property
inquiry (panel 3), which activates all P detectors. If the selected
input unit, the property transducer, and the property inquiry signal
are all on, that input unit fires.
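The wiring described above can be sketched functionally in Python.
This is a simplified illustration, not the network itself: the
Winner-Take-All stage is replaced by a plain maximum over the
activation map, the property transducers are modeled as a set of
property names per input unit, and the names `winner_take_all`,
`unit_fires` and `has_property` are hypothetical. The caller gets a
yes/no answer about the indexed place without ever being told where
that place is.

```python
def winner_take_all(activity):
    """Return the position of the most active input unit.

    Stands in for a Winner-Take-All network that inhibits every
    input unit except the most active one.
    """
    return max(range(len(activity)), key=lambda i: activity[i])


def unit_fires(selected, transducer_on, inquiry_on):
    """Three-way AND: a unit fires only if all three signals are on."""
    return selected and transducer_on and inquiry_on


def has_property(activity, transducers, prop):
    """Ask whether the indexed place has property `prop`.

    activity:    activation level of each input unit
    transducers: transducers[i] is the set of property names
                 detected at unit i
    Only a boolean leaves this function; the winner's position
    is never reported to the caller.
    """
    winner = winner_take_all(activity)
    return any(
        unit_fires(i == winner, prop in transducers[i], True)
        for i in range(len(activity))
    )


activity = [0.1, 0.9, 0.3, 0.2]
transducers = [{"red"}, {"green", "vertical"}, {"red"}, set()]
print(has_property(activity, transducers, "green"))  # True
print(has_property(activity, transducers, "red"))    # False
```

The second query returns False even though two other places are
red, because only the selected unit has all three signals on.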
Slide 54
Slide 55
Moral. It is trivial to design a circuit that allows one to
check whether a particular place on a proximal stimulus that has
grabbed an index has a particular property. All it takes are some
threshold units and some AND and OR units. Although the simple
black box I showed you can only detect one static input place at a
time, it can inquire about several properties. Extending this to
moving objects is easy using the same ideas: you partially activate
regions near each selected input unit, thus increasing the
likelihood that it will be selected at the next cycle, and so
on.
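The extension to moving objects can be sketched in the same
functional style. The Python below is a toy illustration under
stated assumptions: the activation map is one-dimensional, a single
place is followed, and the parameters `radius` and `boost` (how far
and how strongly the neighborhood of the last winner is
pre-activated) are hypothetical values chosen for the example.

```python
def track(frames, start, radius=1, boost=0.3):
    """Follow one indexed place across successive activation maps.

    frames: list of activation maps (one list of levels per cycle)
    start:  position that grabbed the index in the first frame
    Units within `radius` of the last selected unit receive `boost`
    before the most active unit is chosen for the next cycle.
    """
    path = [start]
    for levels in frames[1:]:
        last = path[-1]
        primed = [
            a + (boost if abs(i - last) <= radius else 0.0)
            for i, a in enumerate(levels)
        ]
        path.append(max(range(len(primed)), key=lambda i: primed[i]))
    return path


def make_frame(size, active):
    """Build an activation map from a {position: activity} dict."""
    return [active.get(i, 0.0) for i in range(size)]


frames = [
    make_frame(10, {2: 0.6}),
    make_frame(10, {3: 0.6, 8: 0.7}),  # a stronger distractor appears at 8
    make_frame(10, {4: 0.6, 8: 0.7}),
    make_frame(10, {5: 0.6, 8: 0.7}),
]
print(track(frames, start=2))             # [2, 3, 4, 5]
print(track(frames, start=2, boost=0.0))  # [2, 8, 8, 8]
```

With the neighborhood pre-activated, selection stays with the
moving object; without it, selection jumps to the more active
distractor.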
Slide 56
Some evidence for indexes and Object Files: the correspondence
problem; the binding problem; evaluating multi-place visual
predicates; recognizing shapes by their part-whole relations;
operating over several visual elements at once without having to
search for them first; subitizing; subset search; Multiple-Object
Tracking; imagining space without requiring a spatial display in the
head {This is a large topic beyond the scope of this class, but see
Things and Places, Chapter 5}
Slide 57
A quick tour of some evidence for FINSTs: the correspondence
problem (mentioned earlier); the binding problem; evaluating
multi-place visual predicates (recognizing multi-element patterns);
operating over several visual elements at once without having to
search for them first; subitizing; subset selection; Multiple-Object
Tracking; imagining space without requiring a spatial display in the
head
Slide 58
Pandemonium. An early architecture for vision, called
Pandemonium, was proposed by Oliver Selfridge in 1959. This idea
continues to be at the heart of many psychological models,
including ones implemented in contemporary connectionist or neural
net models. It is also the basic idea in what are called Blackboard
Architectures in AI (e.g., Hearsay speech recognition systems).
These architectures have no way to represent that some of the
features detected actually belong with other features
detected.
Slide 59
Introduction to the Binding Problem: encoding conjunctions of
properties. Experiments show the special difficulty that vision has
in detecting conjunctions of several properties. It seems that items
have to be attended (i.e., individuated and selected) in order for
their property-conjunction to be encoded. When a display is not
attended, conjunction errors are frequent.
Slide 60
Read the vertical line of digits in this display. What were the
letters and their colors?
Slide 61
This is what you saw briefly. Under these conditions conjunction
errors are very frequent.
Slide 62
Encoding conjunctions requires attention. One source of evidence
is from search experiments. Single feature search is fast and
appears to be independent of the number of items searched through
(suggesting it is automatic and pre-attentive). Conjunction search
is slower, and the time increases with the number of items searched
through (suggesting it requires serial scanning of attention).
Slide 63
Rapid visual search (Treisman) Find the following simple figure
in the next slide:
Slide 64
This case is easy and the time is independent of how many
nontargets there are because there is only one red item. This is
called a popout search
Slide 65
This case is also easy and the time is independent of how many
nontargets there are because there is only one right-leaning item.
This is also a popout search.
Slide 66
Rapid visual search (conjunction) Find the following simple
figure in the next slide:
Slide 67
Slide 68
Constraints on nonconceptual representation of visual
information (and the binding problem). Because early (nonconceptual)
vision must not fuse the conjunctive grouping of properties, visual
properties can't just be represented as being present in the scene,
because then the binding problem could not be solved! What else is
required? The most common answer is that each property must be
represented as being at a particular location. According to Peter
Strawson and Austin Clark, the basic unit of sensory representation
is Feature-F-at-location-L. This is the so-called feature placing
proposal. This proposal fails for interesting empirical reasons. But
if feature placing is not the answer, what is?
Slide 69
The role of attention to location in Treisman's Feature
Integration Theory
Slide 70
Individual objects and the binding problem. We can distinguish
scenes that differ by conjunctions of properties, so early vision
must somehow keep track of how properties co-occur: conjunction must
not be obscured. How to do this is called the binding problem. The
most common proposal is that vision keeps track of properties
according to their location and binds together colocated
properties.
Slide 71
The proposal of binding conjunctions by the location of
conjuncts does not work when feature location is not punctate, and
it becomes even more problematic if the features are colocated,
e.g., if their relation is "inside".
Slide 72
Binding as object-based. The proposal that properties are
conjoined by virtue of their common location has many problems. In
order to assign a location to a property you need to know its
boundaries, which requires distinguishing the object that has those
properties from its background (figure-ground individuation).
Properties are properties of objects, not of locations, which is why
properties move when objects move. Empty locations have no causal
properties. The alternative to conjoining-by-location is conjoining
by object. According to this view, solving the binding problem
requires first selecting individual objects and then keeping track
of each object's properties (in its object file or OF). If only
properties of selected objects are encoded, and if those properties
are recorded in each object's OF, then all conjoined properties
will be recorded in the same object file, thus solving the binding
problem.
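The contrast between recording which properties are present and
recording them per object can be shown with a small sketch. The
Python below is only an illustration: scenes are modeled as lists of
property dictionaries, and the function names `feature_set` and
`object_files` are hypothetical. Two scenes that differ only in how
properties are conjoined look identical when properties are merely
registered as present, but differ once each selected object's
properties are kept in its own file.

```python
def feature_set(scene):
    """Record only which properties occur somewhere in the scene."""
    return {value for obj in scene for value in obj.values()}


def object_files(scene):
    """Record properties per selected object, one file per object.

    Files are sorted so that the comparison ignores the order in
    which the objects happen to be listed.
    """
    return sorted(tuple(sorted(obj.items())) for obj in scene)


scene_a = [{"color": "red", "shape": "square"},
           {"color": "green", "shape": "circle"}]
scene_b = [{"color": "red", "shape": "circle"},
           {"color": "green", "shape": "square"}]

print(feature_set(scene_a) == feature_set(scene_b))    # True
print(object_files(scene_a) == object_files(scene_b))  # False
```

Nothing in `object_files` refers to location: the conjunction is
preserved simply because conjoined properties sit in the same file.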
Slide 73
Attention spreads over perceived objects. Using a priming method,
Egly, Driver & Rafal (1994) showed that the effect of a prime
spreads to other parts of the same visual object more than to
equally distant parts of different objects.
[Figure panel labels: "Spreads to B and not C"; "Spreads to C and
not B"]
Slide 74
A quick tour of some evidence for FINSTs: the correspondence
problem (mentioned earlier); the binding problem; evaluating
multi-place visual predicates (recognizing multi-element patterns);
operating over several visual elements at once without having to
search for them first; subitizing; subset selection; Multiple-Object
Tracking; cognizing space without requiring a spatial display in the
head
Slide 75
Being able to refer to individual objects or object-parts is
essential for recognizing patterns. Encoding relational predicates,
e.g., Collinear(x,y,z,..); Inside(x,C); Above(x,y);
Square(w,x,y,z), requires simultaneously binding the arguments of
n-place predicates to n elements* in the visual scene. Evaluating
such visual predicates requires individuating and referring to the
objects over which the predicate is evaluated: i.e., the arguments
in the predicate must be bound to individual elements in the scene.
*Note: "elements" is used to refer to objects that serve as parts of
other objects.
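The binding step can be illustrated with a short sketch. In the
Python below, a predicate is evaluated only after each of its
arguments has been bound, through an index number, to an individual
element. The helper `evaluate`, the lookup table `indexed`, and the
use of (x, y) coordinates as stand-ins for the elements are
assumptions made for the example; the coordinates are there only so
that the predicate has something to compute, not as a claim that
indexes encode location.

```python
def collinear(p, q, r, tol=1e-9):
    """Three-place predicate over points given as (x, y) tuples."""
    # Twice the signed area of triangle pqr is zero for collinear points.
    area2 = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return abs(area2) <= tol


def evaluate(predicate, indexes, referent):
    """Evaluate `predicate` with its arguments bound through indexes.

    indexes:  index numbers serving as the predicate's arguments
    referent: function from an index number to what it picks out
    If any argument is unbound (referent returns None), the predicate
    cannot be evaluated and None is returned.
    """
    args = [referent(i) for i in indexes]
    if any(a is None for a in args):
        return None
    return predicate(*args)


# Hypothetical scene: four indexed elements, looked up by index number.
indexed = {0: (0, 0), 1: (1, 1), 2: (2, 2), 3: (2, 0)}
print(evaluate(collinear, [0, 1, 2], indexed.get))  # True
print(evaluate(collinear, [0, 1, 3], indexed.get))  # False
print(evaluate(collinear, [0, 1, 7], indexed.get))  # None
```

The third call returns None rather than False: with one argument
unbound there is nothing for the predicate to be true or false of.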
Slide 76
Several objects must be picked out at once in making relational
judgments. When we judge that certain objects are collinear, we must
first pick out the relevant objects while ignoring their
properties.
Slide 77
Several objects must be picked out at once in making relational
judgments. The same is true for other relational judgments like
inside or on-the-same-contour, etc. We must pick out the relevant
individual objects first. Are the dots inside the same contour? On
the same contour?
* Note: Ullman (1984) has shown that some patterns cannot
be recognized without doing so in a serial manner, where the serial
elements must be indexed first. And that is yet another
reason why Connectionist architectures cannot work!
Slide 78
A quick tour of some evidence for FINSTs: the correspondence
problem; the binding problem; evaluating multi-place visual
predicates (recognizing multi-element patterns); operating over
several visual elements at once without first having to search for
them; subitizing; subset selection; Multiple-Object Tracking;
cognizing space without requiring a spatial display in the head
Slide 79
More functions of FINSTs: further experimental explorations.
Recognizing the cardinality of small sets of things: subitizing vs
counting (Trick, 1994).
Searching through subsets: selecting items to search through
(Burkell, 1997).
Selecting subsets and maintaining the selection during a saccade
(Currie, 2002).
Application of FINST index theory to infant cardinality studies
(Carey, Spelke, Leslie, Uller, etc.).
Indexes may explain how children are able to acquire words for
objects by ostension without suffering Quine's Gavagai problem.
Slide 80
Signature subitizing phenomena only appear when objects are
automatically individuated and indexed Trick, L. M., &
Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated
differently? A limited capacity preattentive stage in vision.
Psychological Review, 101(1), 80-102.
Slide 81
Subitizing results. There is evidence that a different mechanism
is involved in enumerating small (n ≤ 4) numbers of items (even
different brain mechanisms: Dehaene & Cohen, 1994).
Rapid small-number enumeration (subitizing) only occurs when items
are first (automatically) individuated.*
Unlike counting, subitizing is not enhanced by precuing location.*
Subitizing is insensitive to distance among items.*
Our account for what is special about subitizing is that once FINST
indexes are assigned to n ≤ 4 individual objects, the objects can be
enumerated without first searching for them. In fact they might be
enumerated simply by counting active indexes, which is fast and
accurate because it does not require visual scanning.
* Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large
numbers enumerated differently? A limited capacity preattentive
stage in vision. Psychological Review, 101(1), 80-102.
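The account of subitizing as counting active indexes can be put in
a few lines. The Python below is a toy sketch, not a model of the
data: the function name `enumerate_items`, the cutoff of 4 indexes,
and the way serial counting is simulated are assumptions made for
illustration. It shows only the structural difference between
reading off the number of bound indexes and visiting items one at a
time.

```python
def enumerate_items(items, n_indexes=4):
    """Return (count, method) for a list of individuated items.

    If every item can hold an index, the count is read off the
    number of active indexes ("subitize"). Otherwise the items are
    visited one at a time ("count"). The cutoff n_indexes=4 is an
    illustrative assumption.
    """
    if len(items) <= n_indexes:
        active = [True] * len(items)  # one index grabbed per item
        return sum(active), "subitize"
    visited = 0
    for _ in items:                   # serial scan, one item per step
        visited += 1
    return visited, "count"


print(enumerate_items(["a", "b", "c"]))   # (3, 'subitize')
print(enumerate_items(list("abcdefg")))   # (7, 'count')
```

In the first branch nothing has to be searched for, which is the
sense in which distance among items and location precues would be
irrelevant on this account.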
Slide 82
Subset selection for search Burkell, J., & Pylyshyn, Z. W.
(1997). Searching through subsets: A test of the visual indexing
hypothesis. Spatial Vision, 11(2), 225-258.
Slide 83
Subset search results: only properties of the subset matter, but
note that properties of the entire subset must be taken into
account simultaneously (since that is what distinguishes a
feature search from a conjunction search). If the subset is a
single-feature search set, it is fast and the slope (RT vs number of
items) is shallow. If the subset is a conjunction search set, it
takes longer, is more error prone, and is more sensitive to the set
size. As with subitizing, the distance between targets does not
matter, so observers don't seem to be scanning the display looking
for the target.
Slide 84
The stability of the visual world entails the capacity to track
some individuals after a saccade. There is no problem about how the
tactile sense can provide a stable world when you move around while
keeping your fingers on the same objects, because in that case
retaining individual identity is automatic. But with FINSTs the same
can be true in vision, at least for a small number of visual
objects. This is compatible with the fact that it appears that one
retains the relative location of only about 4 elements during
saccadic eye movements (Irwin, 1996). [Irwin, D. E. (1996).
Integrating information across saccadic eye movements. Current
Directions in Psychological Science, 5(3), 94-100.]
Slide 85
The selective search experiment with a saccade induced between
the late onset cues and start of search. Even with a saccade between
selection and access, items can be accessed efficiently. Onset of
new objects grabs indexes.
Slide 86
A quick tour of some evidence for FINSTs: the correspondence
problem (mentioned earlier); the binding problem; evaluating
multi-place visual predicates (recognizing multi-element patterns);
operating over several visual elements at once without having to
search for them first; subitizing; subset selection; Multiple-Object
Tracking; imagining space without requiring a spatial display in the
head
Slide 87
Demonstrating the function of FINSTs with Multiple Object
Tracking (MOT). In a typical experiment, 8 simple identical objects
are presented on a screen and 4 of them are briefly distinguished
in some visual manner, usually by flashing them on and off. After
these 4 targets are briefly identified, all objects resume their
identical appearance and move randomly. The observer's task is to
keep track of the ones that had been designated as targets at the
start. After a period of 5-10 seconds the motion stops and observers
must indicate, using a mouse, which objects are the targets.
Slide 88
Another example of MOT: with self occlusion
Slide 89
Self occlusion does not seriously impair tracking
Slide 90
Basic finding: most people can track at least 4 targets that
move randomly among identical non-target objects (even some 5 year
old children can track 3 objects). Object properties do not appear
to be recorded during tracking, and tracking is not improved if no
two objects have the same color, shape or size (asynch vs synch
changes). How is tracking done? We showed that it is unlikely that
the tracking is done by keeping a record of the targets' locations
and updating them by serially visiting the objects (Pylyshyn &
Storm, 1998). Other strategies may be employed (e.g., tracking a
single deforming pattern), but they do not explain tracking.
Hypothesis: FINST indexes are grabbed by blinking targets. At the
end of the trial these indexes can be used to move attention to the
targets and hence to select them in making the response.
Some findings with Multiple Object Tracking
Slide 91
What role do visual properties play in MOT? Certain properties
must be present in order for an index to be grabbed, and certain
properties (probably different properties) must be present in order
for the index to keep track of the object, but this does not mean
that such properties are encoded, stored, or used in tracking. Is
there something special about location? Do we record and track
properties-at-locations? Location in time and space may be
essential for individuating or clustering objects, but metrical
coordinates need not be encoded or made cognitively available. The
fact that an object is actually at some location or other does not
mean that it is represented as such. Representing property P (where
P happens to be at location L) is not the same as representing
property P-at-L.
Slide 92
A way of viewing what goes on in MOT An object file may contain
information about the object to which it is bound. But according to
FINST Theory, keeping track of the object's identity does not
require the use of this information. The evidence suggests that in
MOT, little or nothing is stored in the object file. Occasionally
some information may get encoded and entered in the Object File
(e.g., when an object appears or disappears) but this is not used
in the tracking process itself.* *We will see later that this has
to be stated with care since location may be stored in the object
file and used in a certain sense when the usual continuous tracking
does not work.
Slide 93
Another way of viewing MOT What makes something the same object
over time is that it remains connected to the same object-file by
the same Index. Thus, for something to be the same enduring object
no appeal to properties or concepts is needed. The only requirement
is that it be trackable. Another view of tracking is that it is the
basis of objecthood: An object is something that can be
perceptually tracked (Fodor). There seems to be growing evidence
that tracking is a reflex -- it proceeds without interference from
other attentive tasks.* Franconeri et al.** showed that the
apparent sensitivity of tracking performance to such properties as
speed is due to a confound of speed with object density. Distance
between objects is critical to MOT performance, which is predicted
by parallel tracking models. *Although tracking feels effortful,
many secondary tasks (e.g., search) do not interfere with tracking. **
Franconeri, S., Lin, J., Pylyshyn, Z., Fisher, B., & Enns, J.
(2008). Evidence against a speed limit in multiple-object tracking.
Psychonomic Bulletin & Review, 15(4), 802-808.
Slide 94
Why is this relevant to foundational questions in the
philosophy of mind? According to Quine, Strawson, and most
philosophers, you cannot pick out or track individuals without
concepts (sortals). But you also cannot pick out individuals with
only concepts. Sooner or later you have to pick out individuals
using non-conceptual causal connections between things and
thoughts. The present proposal is that FINSTs provide the needed
non-conceptual mechanism for individuating objects and for
tracking their (numerical) identity, which works most of the time
in our kind of world. It relies on some natural constraints (Marr).
FINST indexes provide the right sort of connection to allow the
arguments of predicates to be bound to objects prior to the
predicates being evaluated. They may also be the basis for learning
nouns by ostension.
Slide 95
But there must be some properties that cause indexes to be
grabbed! Of course there are properties that are causally
responsible for indexes being grabbed, and also properties
(probably different ones) that make it possible for objects to be
tracked. But these properties need not be represented (encoded) and
used in tracking. The distinction between properties that cause
indexes to be grabbed and those that are represented (in Object
Files) is similar to Kripke's distinction between properties that
are needed to name an object (by baptism) and those that
constitute its meaning.
Slide 96
Effect of target properties on MOT. Changes of object properties
are not noticed during MOT. Giving every target a different color,
size, or shape does not improve tracking. Observers do not use
target speed or direction in tracking (e.g., they do not track by
anticipating where the targets will be when they reappear after
occlusion). Targets can go behind an opaque screen and come out the
other side transformed in color, shape, speed or direction of
motion (up to 60 degrees from pre-occlusion direction), without
affecting tracking, but also without observers noticing the change!
What affects tracking is the distance travelled while behind the
occluding screen. The closer the reappearance to the point of
disappearance the better the tracking, even if the closer location
happens to be in the middle of the occluding screen!
Slide 97
Some open questions. We have arrived at the view that only
properties of selected (indexed) objects enter into subsequent
conceptualization and perception-based thought (i.e., only
information in object files is made available to cognition). So what
happens to the rest of the visual information? Visual information
seems rich and fine-grained while this theory says that properties
of only 4 or 5 objects are encoded! The present view also leaves no
room for representations whose content corresponds to the content
of conscious experience. According to the present view, the only
content that modular nonconceptual representations have is the
demonstrative content of indexes that refer to perceptual objects.
Question: Why do we need any more than that?
Slide 98
An intriguing possibility. Maybe the theoretically relevant
information we take in is less than (or at least different from)
what we experience. This possibility has received attention recently
with the discovery of various blindnesses (e.g., change-blindness,
inattentional blindness, blindsight) as well as the discovery of
independent vision systems (e.g., recognition and motor control).
The qualitative content of conscious experience may not play a role
in explanations of cognitive processes. Even if detailed
quantitative information enters into causal processes (e.g., motor
control) it may not be represented, not even as a nonconceptual
representation. For something to be a representation its content
must figure in explanations: it must capture generalizations. It
must have truth conditions and therefore allow for
misrepresentation. It is an empirical question whether current
proposals do (e.g., primal sketch, scenarios). Cf. Devitt:
Pylyshyn's Razor.
Slide 99
An alternative view of reference by Indexes. This provisional
revised theory responds to Fodor's argument that there is no seeing
without seeing-as. According to Fodor, the visual module must do
more than the current theory assumes, because its output must
provide the basis for induction over what something is seen as.
This is not the traditional argument that percepts have a finer
grain than most theories provide for, especially theories that
assume a symbolic output like this one. That argument relies too
much on our phenomenology, which more often than not leads us
astray. So the vision module must contain more than object files.
It must be able to classify objects by their visual properties
alone, or to compute for each object a particular appearance-class
to which it belongs (see black swan example).
Slide 100
An alternative view of reference by Indexes. Since the vision
module is encapsulated it must have a mechanism for assigning each
object x to an equivalence class based solely on what x looks like.
It must do this for a large number of such classes, based both on
its innate mechanisms and its visual experience [Look of x = L(x)].
L(x) is thus an equivalence class induced by the sensorium
which includes the current token x. The L(x) associated with each
token x must be sufficiently distinctive to allow the cognitive
system to recognize x unambiguously as a token of something it
knows about (e.g., L(x) => looks like a cow & this is a
farm => x is likely a cow). The sequence from x to recognition
must be correct most of the time in our kind of world (so it must
embody a natural constraint).
Slide 101
An alternative view of reference by Indexes. This idea of an
appearance class L(x) has been explored in computational vision,
where a number of different functions have been proposed, many of
them based on mathematical compression or encoding functions. An
early idea, which has implications for the present discussion, is a
proposal by David Marr called a Multiple-View proposal. He wrote:
"The Multiple View representation is based on the insight that if
one chooses one's primitives correctly, the number of qualitatively
different views of an object may be quite small", and Marr cites
Minsky as speculating that the representation of a 3D shape might
consist of a catalog of different appearances of that shape, and
that catalog may not need to be very large (Marr & Nishihara,
1976). The search for the most general form of representation has
yielded many proposals, many of which have been tested in
psychology labs, e.g., generalized cylinders and
part-decomposition: Biederman, I. (1987).
Recognition-by-components: A theory of human image understanding.
Psychological Review, 94(2), 115-147.
Slide 102
Seeing without Seeing As? It's true that instances of visual
encounters deliver an equivalence-class to which the object belongs
by virtue of its appearance as mapped by the function L(x). It is
an appearance class because it can only use information from the
sensorium and the natural constraints built into the modular vision
system. So in that respect one might say that seeing is always a
"seeing as" where the relevant category is L(x). But this is
unlikely to be the category under which the object enters into
thought. So the "seeing as" category, L(x), is not the same
category as the one under which the object is contemplated in
thought, where its category would depend on background knowledge
and personal history. The appearance L(x) is now replaced by
familiar categories of thought (e.g., card table, Ford car, Coca
Cola bottle, Warhol Brillo Box, and so on, categories rich in their
interconnections).
Slide 103
More on the structure of the Visual Module. In order to compute
L(x), the vision module must possess enough machinery to map a
token object x onto an equivalence class designated by L(x) using
only sensory information and module-specific processes and
representations, without appealing to general knowledge. The module
must also have some 4-5 Object Files, because it needs those to
solve the binding problem as well as to bind predicate arguments to
objects (and also to use the proposed Recognition-By-Parts process
for recognizing complex objects).
Slide 104
Alternative view of what's in the module. The alternative view of
what goes on inside the visual module would furnish it with more
processes to catalog and look up object shape-types L(x). Our
assumptions would seem to require that this augmented machinery
also be barred from accessing cognitive memories and general
inference capacity. Does this conflict with Fodor's requirement that
the output be right for belief fixation? [Diagram: modular vision
computer. Input is sensory information, output is a standard form
for the appearance of objects, L(x).] Which functions are in the
visual module? Minimal (just indexes); Original (indexes and
files); Maximal (computing L(x)).
Slide 105
Summary of the current FINST model. Up to 5 indexes can be
grabbed based on local properties. Active indexes bind objects to
object files (initially empty). Bound objects can then be queried*
and salient properties encoded in their Object File. *Does this
require voluntary attention? Indexes stay bound to the objects
that grabbed them even as the objects change any of their
properties, including briefly disappearing behind an occluding
screen. When the objects change their location, the result is
tracking, which is automatic / reflexive. We also have evidence that
objects can be tracked through other continuously changing
properties (Blaser, Pylyshyn & Holcombe 2000). The only factor
that impairs tracking performance is spacing: too close yields
item-ambiguity and tracking errors.
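The summary above can be read as a small data structure. The following is a toy sketch only, not an implementation of the theory: the class names `ObjectFile` and `FinstPool`, the dictionary representation of the scene, and the fixed capacity of 5 are assumptions chosen to mirror the wording of the slide.

```python
from dataclasses import dataclass, field

MAX_INDEXES = 5  # assumed capacity, following "up to 5 indexes" on the slide


@dataclass
class ObjectFile:
    """Store for properties encoded about one indexed object (starts empty)."""
    properties: dict = field(default_factory=dict)


class FinstPool:
    """Toy pool of indexes; each active index binds one object to one file."""

    def __init__(self, capacity=MAX_INDEXES):
        self.capacity = capacity
        self.bindings = {}  # object id -> ObjectFile

    def grab(self, object_id):
        """An object grabs a free index; returns False if none is left."""
        if object_id in self.bindings:
            return True
        if len(self.bindings) >= self.capacity:
            return False
        self.bindings[object_id] = ObjectFile()
        return True

    def query(self, object_id, scene, prop):
        """Encode one property of a bound object into its object file."""
        if object_id not in self.bindings:
            raise KeyError(f"{object_id!r} is not indexed")
        value = scene[object_id][prop]
        self.bindings[object_id].properties[prop] = value
        return value
```

The point of the sketch is that the binding is keyed by the object itself, not by any of its properties: changing an entry in `scene` leaves the binding intact, and an object file stays empty until something is explicitly queried.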
Slide 106
Tracking and spatial proximity. Many experiments show that the
only factor that affects tracking performance is inter-item
spacing: when items are too close there is item-ambiguity resulting
in tracking errors. Other factors that allegedly impair tracking
(e.g., speed) do so only because they affect average spacing. The
very process of tracking, which requires something like smooth
continuous movement, makes use of proximity. So does the process of
Gestalt individuation which must collect nearby pixels and features
(regardless of type). We have many results showing that when
objects disappear their only recalled property is where they were
at the time and the only thing that determines how well they
continue to be tracked when they reappear is how far away they have
moved. Franconeri, S., Pylyshyn, Z. W., & Scholl, B. J. (2012).
A simple proximity heuristic allows tracking of multiple objects
through occlusion. Attention, Perception, & Psychophysics,
74(4).
Slide 107
How is location stored and used? It is possible that location
is stored in object files since it is one of the more important
properties of moving objects. Object location is a property that
must be used in tracking, since to track smoothly moving objects
just is to solve the correspondence problem by taking the nearest
object. Many experiments show that the correspondence problem in
this case does not involve choosing the most similar object or the
one moving with the same speed or in the same direction, but the
closest one to the locus of disappearance. Does this mean that
object location is stored and used in tracking, contrary to my
earlier claim? Maybe, but that depends on whether location is in
this case a conceptual property and tracking is a process involving
conceptual representations, and there is evidence that it is
not.
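The nearest-object rule described above can be written down in a few lines. This is a hedged sketch of a proximity heuristic in general, not the model fitted in the cited experiments; the function name `track_step` and the representation of positions as (x, y) tuples are assumptions for illustration.

```python
import math


def track_step(indexed_positions, frame_positions):
    """Update each index to the item nearest its last known position.

    indexed_positions: dict mapping an index name to its last (x, y).
    frame_positions: list of (x, y) positions of all items now visible.
    Only distance is consulted; color, shape, speed and direction are
    deliberately not part of the computation.
    """
    return {
        index: min(frame_positions, key=lambda p: math.dist(last, p))
        for index, last in indexed_positions.items()
    }
```

Because each index independently takes its nearest item, two indexes can end up on the same item when targets pass close to one another, which is one way to picture the item-ambiguity that close spacing produces.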
Slide 108
Is location a conceptual property? Is location in this case a
conceptual property and is tracking a process involving conceptual
representations? Computing correspondence and tracking are
prototypical automatic and cognitively impenetrable processes,
likely computed by local parallel processes, which suggests that
tracking is subpersonal, modular and nonconceptual, since most
automatic processes are nonconceptual. Location plays a critical
part in all motor control and there is reason to believe that it
plays this role in a different way than the way conceptual
information does. It typically involves a different visual system,
the dorsal pathway. A great deal of evidence is now available
showing that only the ventral pathway contributes to object
recognition while the dorsal pathway is specialized for motor
control (Milner & Goodale, 1995; 2004). All in all it seems more
likely that location is used in MOT and other visual processes but
that it is not a conceptual process at all. If you accept that
location is conceptual, you pay a high price: you lose the goal of
finding a nonconceptual link between cognition and the world!
Slide 109
A note on top-down vs bottom-up flow. This dichotomy has been
the source of a great deal of misunderstanding of what goes on in a
module. Information does flow in both directions, but our claim is
that it does so only within the visual module, not across the
capsule boundary. A more interesting distinction is alluded to here
under the phrase "the visual system queries", as opposed to
passively receiving information. This is a question of where
control resides. An appropriate event in the visual scene grabs an
index: the responsibility here rests with an external event. In
computer talk this is called an interrupt because the ambient
process is interrupted by an external event. But to say the system
interrogates the scene through the index is to say that the
initiative belongs with the internal process. In computer talk this
corresponds to a test operation.
Slide 110
A note on top-down vs bottom-up flow. What is interesting about
the distinction between interrupt and test is that only an
interrupt can be open ended. Things can be set up so that it is not
known in advance what sort of event will cause an interrupt. On the
other hand, you can't have a test operation unless you specify what
you are looking for: you have to "test for", which, like "see as" or
"select for", is an intentional act, whereas an interrupt can be a
causal event. So in our case grabbing is a causal event whereas
querying is an intentional, representation-governed event. Similarly
selecting is intentional, so scanning or switching visual attention
would be intentional, whereas attention can also be elicited, in
which case the event would be causal. There may also be
combinations of the two, as when you decide to track certain
targets. To do that, according to this theory, you combine the
intentional act of moving your focal attention to a particular
object with the causal event whereby some object in the scope of
the attention is enabled and can then grab an index.
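The interrupt/test contrast can be made concrete with a small programming analogy. The sketch below is only an illustration of the two control styles; the class `VisualScene` and its methods `on_interrupt`, `event`, and `test` are hypothetical names invented for this example.

```python
class VisualScene:
    """Toy scene supporting event-driven (interrupt) and query-driven (test) access."""

    def __init__(self, objects):
        self.objects = objects  # object id -> dict of properties
        self._handlers = []

    def on_interrupt(self, handler):
        """Register a handler without saying in advance which events will occur."""
        self._handlers.append(handler)

    def event(self, object_id, kind):
        """An external event fires: control starts outside the system."""
        for handler in self._handlers:
            handler(object_id, kind)

    def test(self, object_id, prop, expected):
        """A query: the caller must specify what it is looking for."""
        return self.objects[object_id].get(prop) == expected
```

The handler passed to `on_interrupt` is open ended, since it receives whatever kind of event happens to occur, while `test` cannot even be called without naming a property and an expected value. That asymmetry is the one the slide draws between grabbing an index and querying an indexed object.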
Slide 111
Summary of augmented FINST model. So far the only visual
information that is available to the mind is contained in the
Object Files in the visual module. The index mechanism discussed so
far also makes it possible to use additional currently perceived
information (see Things & Places, Chapter 5). Information in the
module is in a symbolic form very similar to the subsequent
conceptual representation, except: It is encoded in the vocabulary
of modular (subpersonal) categories (that many would call
nonconceptual), not in person-level conceptual vocabulary.
Construction of the intramodular representation cannot use general
knowledge, so all relevant representations must reside in the
module. The intramodular representation uses information in Object
Files and preserves its bindings. The Object Files are the only
mechanisms for dealing with the General Binding Problem, as well as
the problem of binding predicate arguments to objects in the
world.
Slide 112
Open Questions about the augmented FINST model. The modular
processes must somehow recover the relations between objects, and
these may or may not be encoded in OFs. Since information in the
module may serve a number of subsequent functions, including
visual-motor coordination and multimodal perceptual integration, it
will have to represent metrical information, very likely in a
nonconceptual form. The question of representing metrical
information is one we leave for the future, since little is known
about how analogue representation might function in cognition. We
now arrive at a central question of considerable importance to the
view we are promoting: What form is the visual representation in
when it is handed on to Cognition?
Slide 113
END
Slide 114
Vision science has always been deeply ambivalent about the role of
conscious experience. Isn't how things appear one of the things that
our theories must explain? Answer: There is no a priori "must
explain"! The content of subjective experience is a major type of
evidence. But it may turn out not to be the most reliable source
for inferring the relevant functional states. It competes with
other types of evidence. How things appear cannot be taken at face
value: it carries substantive theoretical assumptions. It also
draws on many levels of processing. It was a serious obstacle to
early theories of vision (Kepler). It has been a poor guide in the
case of theories of mental imagery (e.g., color mixing, image size,
image distances). Reading X off an image is an illusion. It seems
likely that vision science will use evidence of conscious
experience the way linguistics uses evidence of grammatical
intuitions: only as it is filtered through developing theories. The
questions a science is expected to answer cannot be set in advance;
they change as the science develops.
Slide 115
What next? This picture leaves many unanswered questions, but
it does provide a mechanism for solving the binding problem and
also explaining how mental representations could have a
nonconceptual connection with objects in the world (something
required if mental representations are to connect with
actions).
Slide 116
Schema for how FINSTs function in hockey
Slide 117
For a copy of these slides see:
http://ruccs.rutgers.edu/faculty/pylyshyn/SelectionReference.ppt
Or MIT Press Paperback
Slide 118
Index capacity and training. Daphne Bavelier's lab (Rochester)
has shown that videogame players can track a larger number of
objects in MOT. Jose Rivest (York) has shown that some athletes can
track more targets than non-athletes. Within individuals the main
determiner of number of targets that can be tracked is the spacing
between them.
Slide 119
X You are now here But you are also here
Slide 120
Slide 121
Additional examples of MOT
MOT with occlusion
MOT with virtual occluders
MOT with matched nonoccluding disappearance
Track endpoints of lines
Track rubber-band linked boxes
Track and remember ID by location
Track and remember ID by name (number)
Track while everything briefly disappears ( sec) and goes on moving while invisible
Track while everything briefly disappears ( sec) and goes on moving while invisible
Track while everything briefly disappears and reappears where they were when they disappeared
Track while everything briefly disappears and reappears where they were when they disappeared