[IEEE 2009 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON 2009) - Potsdam, Germany (2009.05.4-2009.05.6)] 2009 3DTV Conference: The True


MULTISENSORY INTEGRATION OF A SOUND WITH STEREO 3-D VISUAL EVENTS

Kenzo Sakurai* and Philip M. Grove**

*Tohoku Gakuin University, Japan **The University of Queensland, Australia

ABSTRACT

The stream/bounce effect is an example of audio/visual interaction in which two identical luminance-defined targets in a 2-D display move toward one another from opposite sides of a display, coincide, and continue past one another along collinear trajectories. The targets can be perceived to either stream past or bounce off of one another. Streaming is the dominant perception in visual only displays while bouncing predominates when an auditory transient tone is presented at the point of coincidence.

We extended previous findings on audio/visual interactions, using 3-D displays, and found the following two points. First, the sound-induced bias towards bouncing persists in spite of the introduction of spatial offsets in depth between the trajectories, which reduce the probability of motion reversals. Second, audio/visual interactions are similar for luminance-defined and disparity-defined displays, indicating that audio/visual interaction occurs at or beyond the visual processing stage where disparity-defined form is recovered.

Index Terms—Stream/bounce effect, multimodal perception, multisensory integration, stereo 3D vision, audio/visual interactions

1. INTRODUCTION

Investigating how the brain integrates audio/visual information about moving objects in 3-D space is one way to guide the design of future multisensory 3-D TV that offers realism, naturalness and a high sense of presence. Our understanding of audio/visual interactions has grown significantly over the last decade. A classical view of audio/visual interactions is that vision dominates conflicting auditory information. A typical example is the ventriloquist effect, in which observers perceive the ventriloquist’s voice as if it comes from the puppet’s mouth.

A contrary modern view of audio/visual interactions is that there are instances where auditory information influences vision. A seminal finding was Sekuler, Sekuler and Lau’s (1997) [1] report of the so-called ‘stream/bounce’ effect, a transient-induced shift in perceptual bias when resolving an ambiguous motion display. In a typical display, two identical targets (usually dots or squares) move toward one another from opposite sides of a display at a constant speed, superimpose at the center (the point of coincidence), and continue past one another to the other object’s starting point. This visual sequence is equally consistent with the two objects “streaming” past one another, with their individual motions unchanged for the duration of the sequence, or “bouncing” off of one another, with the targets reversing their motion after superimposing (Figure 1). Streaming is the dominant perception in visual-only displays, though bouncing is occasionally reported (Bertenthal, Banton and Bradbury, 1993 [2]; Sekuler and Sekuler, 1999 [3]). Interestingly, however, a brief auditory tone presented at or near the point of coincidence alters this bias from predominantly streaming to predominantly bouncing (Sekuler et al., 1997) [1].

The stream/bounce effect has been successfully replicated by several subsequent investigators (e.g. Sanabria, Correa, Lupiáñez and Spence, 2004) [4]. All previous studies on the stream/bounce effect have employed ambiguous 2-D motion displays. Therefore, little is known about the influence of transient auditory signals on the resolution of motion sequences depicting objects moving in 3-D space. Bertenthal et al. (1993) [2] employed dynamic random dot stereograms in which the motion targets were defined by binocular disparity. They varied the stereoscopic depth of the target trajectories, but did not present any additional transients in their motion sequences. Bertenthal et al. found that, when the motion targets occupied different depth planes, the bias towards streaming, present when the targets were at the same depth, was eliminated in favor of the veridical perception of bouncing. In another experiment, these authors demonstrated that the bias towards streaming was strong enough to override veridical bouncing even when the two motion targets were rendered visually distinguishable by varying their relative dot density.

Two issues not previously addressed are examined in the present report. First, how does transient auditory information affect the resolution of motion sequences in which the trajectories occupy different depth planes? Second, does the stream/bounce effect generalize to disparity-defined displays, generating behavioral data similar to those observed with luminance-defined displays?

Figure 1. Two possible percepts generated in a typical stream/bounce display.

978-1-4244-4318-5/09/$25.00 ©2009 IEEE 3DTV-CON 2009

We conducted two experiments to examine these two issues in detail. In the first experiment, we objectively measured the organizing strength of auditory inputs on the resolution of visual motion sequences when the probability of a collision, and therefore a motion reversal, was reduced by introducing 3-D offsets between the targets (Grove and Sakurai, in press) [5]. In this situation, the trajectories of the two objects would be vertically aligned in the retinal images while the actual motion trajectories, occupying different depth planes, would be appreciated with binocular viewing via stereopsis. Progressively displacing the trajectories of the motion targets in depth renders the motion sequences more and more consistent with streaming. The 3-D displacement at which an auditory tone fails to promote bouncing provides an objective measure of the organizing strength of an auditory tone at coincidence.

In the second experiment we identified the earliest functional stage of visual processing at which audio/visual interactions could take place. This is the point where disparity-defined form is extracted. To do this, we employed luminance-defined and disparity-defined motion sequences, and determined whether or not similar stream/bounce effects would be obtained in the two types of displays. If auditory influences on the resolution of visual motion sequences were due to interactions early in visual processing, before binocular combination in the primary visual cortex (V1), a substantial difference in behavioral data between luminance-defined and disparity-defined stream/bounce displays would be expected. On the other hand, if luminance-defined and disparity-defined displays generate similar behavioral data, we can conclude that at least some of the audio/visual interactions underlying this phenomenon are taking place at or beyond the point of binocular combination.

2. EXPERIMENT 1

This experiment investigated the effect of the presence/absence of an auditory tone, at the point of coincidence, on stream/bounce perception as a function of offset in depth introduced between the targets and in the presence or absence of an occluder at the point of coincidence in a stereo 3-D display.

2.1. Methods

Six people participated in this experiment. All demonstrated normal or corrected-to-normal visual acuity and stereopsis and had no known hearing anomalies.

Computer generated stereo-pair stimuli were presented via a mirror stereoscope at an optical distance of 100 cm. Observers sat with their head in a chin and forehead rest and viewed the stimuli through small apertures close to the eyes. Black disc targets (diameter: 25.6 min arc) moved across a grey background, and their trajectories were also displaced stereoscopically in depth. Random dot patches were presented in the upper and lower halves of the display to ensure binocular fusion. Figure 2 illustrates the conditions and possible perceptions for this experiment.
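As a rough aid to reading the stimulus geometry, the following sketch converts a visual angle into a physical extent at the stated 100 cm optical distance. The function name `visual_angle_to_size` is hypothetical and not from the paper; the calculation is the standard relation between visual angle, distance and size.

```python
import math


def visual_angle_to_size(angle_arcmin, distance_cm):
    """Physical extent (cm) that subtends `angle_arcmin` at `distance_cm`."""
    angle_rad = math.radians(angle_arcmin / 60.0)  # arcmin -> degrees -> radians
    return 2.0 * distance_cm * math.tan(angle_rad / 2.0)


# Disc diameter of 25.6 min arc at the 100 cm optical distance from the text.
disc_cm = visual_angle_to_size(25.6, 100.0)
print(f"{disc_cm:.3f} cm")  # about 0.745 cm
```

At these small angles the result is essentially the small-angle approximation (angle in radians times distance), so the same function can be used to gauge the on-screen extent of the disparity offsets.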

Figure 2. Oblique view illustrating possible perceived trajectories of the targets in experiment 1. In a) no occluder is present. Observers could perceive the targets to either stream past (left panel) or to reverse their trajectory after coincidence (right panel). In b) an occluder is present. Observers could perceive streaming (left panel) and bouncing (right panel) when the targets coincide behind a central occluder. Dashed lines in the left panels represent the objective path of the targets while those in the right panels represent a possible perception.

Figure 3. Mean percentage of trials for which bounce was reported for six observers as a function of binocular disparity between the targets with no occluder present (circles) and with an occluder (squares). Open symbols: no tone; Filled symbols: tone presented at coincidence. Error bars represent ±1 SEM.

There were six disparity conditions (0, 2.6, 5.1, 10.2, 17.9 and 25.6 min arc) in the no-occluder condition and four disparity conditions (0, 2.6, 5.1 and 10.2 min arc) in the occluder-present condition, that is, 10 motion sequences in total. These disparity values indicate the relative disparity between the targets. The black occluder was wide enough to occlude both targets completely at the point of coincidence, and had a crossed disparity of 7.68 min arc relative to the fusion lock.
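The condition counts for experiment 1 can be tallied with a short sketch. It assumes that the tone present/absent factor is fully crossed with every motion sequence and uses the 20 observations per condition given in the procedure; the trial dictionaries, the `build_block` helper and the seed are hypothetical illustrations, not the authors' software.

```python
import random

# Relative disparities (min arc) listed in the text for each occluder condition.
NO_OCCLUDER_DISPARITIES = [0, 2.6, 5.1, 10.2, 17.9, 25.6]
OCCLUDER_DISPARITIES = [0, 2.6, 5.1, 10.2]
REPEATS = 20  # observations per condition, as stated in the procedure


def build_block(disparities, occluder, rng):
    """Return one randomized block: disparity x tone x repeats (assumed crossing)."""
    trials = [
        {"occluder": occluder, "disparity": d, "tone": tone}
        for d in disparities
        for tone in (False, True)
        for _ in range(REPEATS)
    ]
    rng.shuffle(trials)  # disparity and tone randomized within the block
    return trials


rng = random.Random(1)  # arbitrary seed, for reproducibility of the sketch only
no_occ = build_block(NO_OCCLUDER_DISPARITIES, False, rng)
occ = build_block(OCCLUDER_DISPARITIES, True, rng)

n_sequences = len(NO_OCCLUDER_DISPARITIES) + len(OCCLUDER_DISPARITIES)
print(n_sequences, len(no_occ), len(occ))  # 10 240 160
```

Under these assumptions each observer would complete 400 trials, with the order of the two occluder blocks alternating across observers to achieve counterbalancing.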

Observers viewed the motion sequences while fixating a central cross when no occluder was present or the center of the occluder when present and reported whether the targets appeared to stream past or bounce off of one another. Observers completed 20 observations in each condition. Disparity and tone conditions were randomized within each block. Occluder conditions were blocked and the block order was counterbalanced across observers.

2.2. Results and Discussion

The group mean percentages of bounce trials as a function of relative disparity of the target trajectories when no occluder was present and when an occluder was present are shown in Figure 3. The percentage of bounce responses was higher in the tone present condition than in the no tone condition. The percentage of bounce reports decreased in the tone condition as disparity increased but remained virtually constant around zero in the no tone condition.

Auditory-induced bounce responses persisted for trajectory displacements in depth of up to 10.2 min arc, the largest disparity tested, when the occluder was present and at least 25.6 min arc when the occluder was absent. Little difference is observed between the psychometric functions in Figure 3 for corresponding disparities between the occluder conditions. These data indicate that auditory-induced perceived bouncing persists even though the probability of a motion reversal is reduced by introducing a spatial offset in depth between the trajectories of the motion targets by as much as 25.6 min arc.
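For readers who want to reproduce the kind of summary plotted in Figure 3, the sketch below computes a group mean percentage of bounce reports and its standard error of the mean across observers. The bounce counts are placeholder values chosen only to exercise the function; they are not data from this study, and `mean_and_sem` is a hypothetical helper name.

```python
import math
import statistics


def mean_and_sem(bounce_counts, n_trials=20):
    """Group mean (%) and SEM of bounce reports, one count per observer."""
    percentages = [100.0 * c / n_trials for c in bounce_counts]
    mean = statistics.mean(percentages)
    # SEM = sample standard deviation / sqrt(number of observers)
    sem = statistics.stdev(percentages) / math.sqrt(len(percentages))
    return mean, sem


# Placeholder counts for six observers in one condition (NOT the study's data).
example_counts = [10, 12, 8, 14, 10, 6]
mean, sem = mean_and_sem(example_counts)
print(f"{mean:.1f} % +/- {sem:.2f}")
```

One such mean and SEM would be computed per disparity, tone and occluder condition to build the psychometric functions.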

3. EXPERIMENT 2

This experiment investigated the effect of an auditory tone at or near the point of coincidence on the perceptual resolution of ambiguous motion sequences containing luminance-defined or binocular disparity-defined targets.

3.1. Methods

Eleven people, including the two authors, participated. All demonstrated normal or corrected-to-normal visual acuity and stereopsis and had no known hearing anomalies.

The same computer-controlled apparatus as in experiment 1 was used to generate and present the stereoscopic motion stimuli and sounds. The motion targets subtended 0.85° × 0.85° and moved across a random dot background subtending 8.5° horizontally and 4.25° vertically. A black fixation dot subtending 0.3° was positioned 0.47° below the center of the random dot background.

In the luminance-defined motion sequences, targets consisting of 50% density black and white random dots were superimposed upon a background of light (39.9 cd/m²) and dark (11.8 cd/m²) grey dots, 50% each. Frame one of the luminance-defined motion sequence is illustrated in Figure 4A. In disparity-defined motion sequences, both the targets and the background consisted of 50% density black and white random dots. A crossed horizontal disparity of 10.2 min arc relative to the background was introduced by shifting each eye’s target bodily to the nasal side by two dot widths. Targets and their motion were only visible after binocular fusion. A stereogram depicting one frame of this motion sequence is illustrated in Figure 4B.
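The construction of a disparity-defined target can be illustrated with a minimal random dot stereogram sketch. It assumes a small binary dot grid, places the same random target patch two dot widths to the right in the left eye's image and two dot widths to the left in the right eye's image (both nasal shifts, giving crossed disparity), and leaves the background identical in the two eyes. Grid size, target position and the function name `make_stereogram` are hypothetical. If the two opposite shifts add to the stated 10.2 min arc, the implied dot width is about 2.55 min arc; that figure is an inference, not a value given in the text.

```python
import random


def make_stereogram(width=40, height=20, target=8, x0=16, y0=6, shift=2, seed=0):
    """Return (left, right, patch) dot arrays for one frame; 1 = white, 0 = black."""
    rng = random.Random(seed)
    background = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    patch = [[rng.randint(0, 1) for _ in range(target)] for _ in range(target)]
    left = [row[:] for row in background]
    right = [row[:] for row in background]
    for r in range(target):
        for c in range(target):
            # Left eye: target shifted rightward (nasal); right eye: leftward (nasal).
            left[y0 + r][x0 + shift + c] = patch[r][c]
            right[y0 + r][x0 - shift + c] = patch[r][c]
    return left, right, patch


left, right, patch = make_stereogram()

# Total disparity is the sum of the two opposite shifts: 4 dot widths here.
implied_dot_width_arcmin = 10.2 / (2 * 2)
print(implied_dot_width_arcmin)
```

Because each eye's image is a uniform field of random dots, the target is not visible monocularly; it is defined only by the horizontal offset between the two images, which matches the statement that targets were visible only after binocular fusion.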

Auditory stimuli were transient 800 Hz tones (duration: 8 ms) with a maximum sound pressure level of 63 dB at the participant’s ear. Sounds were presented via loudspeakers on either side of the display at 0, ±100 or ±250 ms relative to the point of coincidence.
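A tone with the stated frequency and duration can be sketched as follows. The 48 kHz sample rate, the unit amplitude and the absence of onset/offset ramps are assumptions, since the text does not specify them; calibration to 63 dB at the ear is outside the scope of the sketch. `make_tone` and `tone_onset_ms` are hypothetical helper names.

```python
import math

SAMPLE_RATE_HZ = 48_000  # assumed; not stated in the text

# Tone timings relative to the point of coincidence (ms), from the text.
TONE_OFFSETS_MS = [-250, -100, 0, 100, 250]


def make_tone(freq_hz=800.0, duration_s=0.008, sample_rate=SAMPLE_RATE_HZ):
    """Return samples in [-1, 1] of a pure tone (no ramp applied)."""
    n_samples = round(duration_s * sample_rate)
    return [
        math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def tone_onset_ms(coincidence_ms, offset_ms):
    """Tone onset time given the coincidence time and a signed offset."""
    return coincidence_ms + offset_ms


tone = make_tone()
print(len(tone))  # 384 samples at 48 kHz
```

With the no-tone condition added to the five offsets, this yields the six auditory conditions tested for each display type.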

Observers viewed the stimuli holding their gaze on the fixation dot and reported whether the targets appeared to stream past or bounce off of one another after each sequence. Observers completed 20 observations in each condition. Display conditions (luminance-defined/disparity-defined) were blocked and counterbalanced across observers. Other conditions were randomized within each block.

Figure 4. Reduced versions of the stimuli used in Experiment 2. A) Luminance-defined targets. B) Disparity-defined targets. Stereograms are for cross fusion.

3.2. Results and Discussion

Group mean percentages of reported bounces as a function of the presence/absence and timing of the auditory tone are shown in Figure 5 for luminance-defined and disparity-defined stimuli.

Figure 5. Mean percentage of trials in which bounces were reported for luminance-defined and disparity-defined motion targets for six conditions: no transient; transient tone at 0, ±100 or ±250 ms relative to the time of coincidence. Error bars represent 95% confidence intervals.

Experiment 2 replicated and expanded on the findings of Sekuler et al. (1997) [1], showing a similar effect of auditory stimulation on the perceptual outcome of ambiguous luminance-defined dynamic random dot motion displays when an auditory tone was presented at least 250 ms before, at, or up to 100 ms after the point of coincidence. Moreover, this experiment shows that the temporal aspects of audio/visual interactions are strikingly similar for luminance-defined and disparity-defined motion targets.

In addition, we found that an auditory tone presented 250 ms after the point of coincidence failed to significantly promote bounce perception in luminance-defined displays but continued to do so for disparity-defined displays. This reveals a wider tolerance for temporal asynchrony between auditory and disparity-defined visual events.

4. DISCUSSION

The results of experiment 1 show that a single tone at the point of coincidence will continue to promote bounce perception as increasing 3-D trajectory offsets reduce the probability of a motion reversal. This means that an auditory transient tone has powerful organizing strength that is not restricted to ambiguous motion displays with 2-D overlapping trajectories but extends to unambiguous motion displays with stereo 3-D trajectories located in different depth planes. The demonstrated persistence of perceived bouncing in spite of rendering the motion sequences more consistent with streaming provides an objective metric for the organizing strength of auditory stimulation on the resolution of visual motion sequences.

The results of experiment 2 show that a transient sound at or near the point of coincidence promotes bounce perception similarly in motion sequences with luminance-defined targets and those with disparity-defined targets. This is the first psychophysical study to generalize the stream/bounce effect to disparity-defined stimuli in which an auditory transient tone was employed. Although the temporal window within which sounds presented after the point of coincidence still promoted bouncing was slightly wider in disparity-defined displays than in luminance-defined displays, the strikingly similar pattern of results between the two types of displays suggests that the audio/visual interactions take place at a visual processing stage beyond the point at which disparity-defined form is extracted.

The inputs from the two eyes combine for the first time in layer 4 of primary visual cortex (V1). The audio/visual interactions observed for disparity-defined targets could not occur prior to this stage because the disparity-defined motion targets would be invisible. It is quite likely, however, that the site of interaction is even later in visual processing than this early stage. Recent reports (e.g. Cumming and Parker, 1997) [6] have shown that V1 does not directly support the perception of disparity-defined form though it may provide a substrate for later processing. Likely candidate sites for audio/visual interaction underlying perception in these motion sequences are V5 or MT, areas associated with motion processing and complex form perception.

We conclude that the influence of an auditory tone, presented at the point of coincidence, on the resolution of motion sequences can be given higher status than has been previously suggested, biasing the resolution of unambiguous as well as ambiguous motion sequences. Furthermore, the similar patterns of behavioral data gathered from the luminance-defined and disparity-defined motion sequences indicate that audio/visual interactions investigated here take place at a visual processing stage beyond the point at which disparity-defined form is extracted. Display engineers might exploit the marked perceptual biases we have reported here to enhance viewers’ sense of presence in dynamic 3-D virtual reality environments. [Supported by Grant-in-Aid of MEXT for Specially Promoted Research (no. 19001004).]

5. REFERENCES

[1] R. Sekuler, A. B. Sekuler, and R. Lau, “Sound alters visual motion perception,” Nature, 385, p. 308, 1997.
[2] B. I. Bertenthal, T. Banton, and A. Bradbury, “Directional bias in the perception of translating patterns,” Perception, 22, pp. 193-207, 1993.
[3] R. Sekuler and A. B. Sekuler, “Collisions between moving visual targets: what controls alternative ways of seeing an ambiguous display?” Perception, 28, pp. 415-432, 1999.
[4] D. Sanabria, Á. Correa, J. Lupiáñez, and C. Spence, “Bouncing or streaming? Exploring the influence of auditory cues on the interpretation of ambiguous visual motion,” Experimental Brain Research, 157, pp. 537-541, 2004.
[5] P. M. Grove and K. Sakurai, “Auditory induced bounce perception persists as the probability of a motion reversal is reduced,” Perception, in press.
[6] B. G. Cumming and A. J. Parker, “Responses of primary visual cortical neurons to binocular disparity without depth perception,” Nature, 389, pp. 280-283, 1997.