Effects of Spectro-Temporal Asynchrony on Speech Processing (Grant et al.)
Effects of Spectro-Temporal Asynchrony in Auditory and Auditory-Visual Speech Processing
Ken W. Grant, Ph.D. Auditory-Visual Speech Recognition Laboratory, Walter Reed Army Medical Center, Army
Audiology and Speech Center, Washington, DC 20307-5001 (work) 202-782-8596 (fax) 202-782-9228 (email) [email protected]
Steven Greenberg, Ph.D.
International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704
David Poeppel, Ph.D. Cognitive Neuroscience of Language Laboratory, Neuroscience and Cognitive Science Program
(NACS), Department of Biology and Department of Linguistics, University of Maryland, College Park, MD 20742
Virginie van Wassenhove, Ph.D.
Cognitive Neuroscience of Language Laboratory, Neuroscience and Cognitive Science Program (NACS), Department of Biology and Department of Linguistics, University of Maryland,
College Park, MD 20742
Introduction
Auditory events can be described as a series of temporal patterns ranging from the simple
to the complex. Such patterns include steady-state sinusoids (simple tones), to amplitude- and
frequency-modulated tones, as well as sequences of sounds that occur either alone or in
combination with concurrently presented signals [1]. For example, as suggested by Lauter [2], the
speech sound [pa] can be viewed as a series of rapidly sequenced events: a brief broadband
noise-like burst, followed by a brief period of silence, then a complex, quasi-harmonic signal
whose initial onset contains frequency glides, and which ends with a steady-state harmonic
signal characteristic of the vowel /ɑ/. During portions of this relatively simple speech sequence,
multiple acoustic events occur concurrently across a fairly wide range of frequencies, some of
which are quasi steady-state in nature while other portions of the spectrum undergo rapid
Submitted to Seminars in Hearing
transitions in amplitude and frequency. Despite the complex, dynamic nature of the signal,
listeners are able to group its many frequency components together, treating them as belonging
to a single sound. A variety of mechanisms have been proposed as the glue binding disparate
parts of the acoustic signal into a coherent perceptual entity. Most of the proposals follow many
basic Gestalt principles, such as (a) figure and ground, (b) similarity, (c) proximity, (d) common
fate or continuity, (e) symmetry, and (f) closure [3]. Specifically, sounds forming a single,
coherent object possess one or more of the following characteristics [4]: (1) proximity in
frequency (components in the same frequency range are grouped together), (2) harmonicity
(components whose frequencies occur in integer multiples are grouped together), (3) common
time course (harmonics with common onset or common temporal fluctuations are grouped
together), and (4) common spatial location (components coming from the same location in space
are grouped together). Although each of these cues may serve to group components into auditory
objects, temporal properties are probably among the most important [5-8]. Components that begin
and end together are usually perceived as part of the same auditory object. Because of the
inherent temporal nature of sound (i.e., sounds evolve and change over time), questions related to
the auditory perception of simultaneity and synchrony are among the oldest subjects in auditory
psychophysics [9]. The perception of rhythmic patterns, like those in speech and music, presupposes
that a series of auditory objects has been detected and ordered in time. And yet, many questions remain as to
how listeners organize the acoustic environment, separating one voice from another, and
isolating one target stream from an interfering background.
Hirsh and colleagues recognized the importance of the dynamic qualities of speech and
music, embarking on a series of experiments, beginning in the late 1950s, to address some of the
basic questions pertaining to the perception of sequences and temporal patterns. Much of this
literature is reviewed by Watson [10] in this volume and therefore will only be mentioned in
passing.
It should be noted that Hirsh was among the first to conclude that the auditory system
uses time to decode acoustic signals in many different ways and at many different levels of
abstraction. In Hirsh's view, time is an inherent property of all auditory events insofar as all
acoustic signals have some durational property. A signal's duration interacts with its detection, as
well as its loudness and pitch. In addition to duration's impact on various psychoacoustic
phenomena, Hirsh also was interested in how dynamic characteristics of acoustic events affect
auditory perception. Sound localization, the discrimination of one versus two sounds, rising
versus falling pitches, as well as temporal relations in forward and backward recognition
masking, all depend on the temporal relationship between two or more auditory events. Building
further on this theme, Hirsh then went on to consider how timing plays a role in auditory
sequential patterns, and in particular, the perception of temporal order of two or more acoustic
events.
Summarizing this work [9], Hirsh identified three basic categories of temporal phenomena:
1. basic psychophysical information pertaining to detection of acoustic events and the
psychological dimensions of loudness and pitch;
2. temporal relations of non-continuous acoustic signals that result in different effects
for forward and backward masking, and judgments of succession versus simultaneity
(or fusion) of events;
3. auditory pattern perception where the focus is on properties of sequences rather than
on isolated acoustic events.
The time window associated with temporal acuity and simultaneity for simple stimuli
(Hirsh's Category 1) is on the order of 1-2 ms [11]. Time windows associated with masking intervals
and with monaural discrimination of two rapidly successive events (Hirsh's Category 2) are
somewhat longer, approximately 14 ms [12]. Identification, or labeling, of temporal order (Hirsh's
Category 3) requires intervals ranging from approximately 20-25 ms for relatively simple events
(e.g., tones, clicks, light flashes) [9] to 200 ms when the events to be ordered are both complex and
qualitatively very different [13].
The experiments and ideas in this paper expand on this earlier work by Hirsh and
colleagues [14,15]. Our focus, however, is on the impact of temporal asynchrony across spectral
bands of speech and sensory modalities for speech intelligibility. Specifically, we sought to
determine the limits of temporal integration over which speech information is combined, both
within- and across-sensory channels. These limits are important because they govern the time
frame within which information from different spectral regions, and from the auditory and visual
modalities, is combined for successful decoding of the speech signal.
Two principal questions are addressed in the studies that follow:
1. What are the effects of spectro-temporal asynchrony on speech recognition for
auditory-alone and auditory-visual speech inputs?
2. What is the maximum degree of temporal asynchrony for which non-simultaneous
events are perceived as simultaneous?
Methodologically, the integration processes were studied using three different paradigms:
(1) speech intelligibility, (2) detection of synchrony, and (3) synchrony discrimination. Stimulus
materials consisted of nonsense syllables and meaningful sentences. Both congruent (naturally
occurring) and incongruent (as in the McGurk paradigm where the audio segment of one
utterance is dubbed onto the video segment of another utterance) audiovisual materials also were
used.
Spectro-Temporal Integration Derived Exclusively from Auditory Processes
In recent studies of the effects of spectro-temporal asynchrony on speech intelligibility,
Greenberg and colleagues [16-18] have shown that the auditory system is extremely sensitive to
changes made in the relative timing among different spectral bands of speech. The basic
paradigm is displayed in Figures 1 and 2. TIMIT (Texas Instruments – Massachusetts Institute of
Technology) sentence materials (e.g., "She had your dark suit in greasy wash water all year" [19]),
spoken by an equal number of male and female talkers, were filtered into four discrete
non-overlapping 1/3-octave bands and presented in various combinations, either synchronously or
asynchronously.
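As a sketch of this filtering step, the following band-passes a signal into four non-overlapping 1/3-octave slits. The center frequencies, filter order, and use of white noise in place of a TIMIT sentence are illustrative assumptions, not the parameters of the original studies.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def third_octave_band(signal, fs, fc, order=4):
    """Band-pass `signal` into the 1/3-octave band centered at `fc` Hz."""
    lo = fc * 2 ** (-1 / 6)   # lower band edge, 1/6 octave below center
    hi = fc * 2 ** (1 / 6)    # upper band edge, 1/6 octave above center
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)   # zero-phase filtering

fs = 16000
t = np.arange(fs) / fs
speech = np.random.randn(fs)          # stand-in for a TIMIT sentence

centers = [330, 850, 2135, 5400]      # hypothetical slit centers (Hz)
slits = [third_octave_band(speech, fs, fc) for fc in centers]
four_band = sum(slits)                # synchronous four-band stimulus
```

Summing any subset of `slits` reproduces the band-combination conditions of Figure 1.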
INSERT FIGURE 1 ABOUT HERE
Figure 1 shows the intelligibility scores for various band combinations presented
synchronously. For individual bands presented alone, scores for correctly recognized words were
between 2% and 9%. When the bands were combined, however, recognition scores were
significantly higher, often exceeding what one might expect from the simple addition of two
independent bands. For example, combining bands 2 and 3 resulted in an average recognition
score of 60% (versus individual-band scores of 9% each). When all four bands were
combined, the score was 89%, showing that in quiet, as few as four non-contiguous (but widely
spaced) frequency channels are sufficient for near perfect speech recognition (despite the
omission of nearly 80% of the spectrum).
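One way to make this super-additivity concrete is to compare the observed two-band score against a probability-summation benchmark for independent channels; the benchmark formula is our assumption, since the text does not specify an additivity model.

```python
# Predicted word recognition if two bands contributed independently,
# versus the observed combined score reported for bands 2 + 3.
def independent_combination(p1, p2):
    """Probability-summation prediction for two independent channels."""
    return 1 - (1 - p1) * (1 - p2)

p_band2 = p_band3 = 0.09            # individual-band scores from the text
predicted = independent_combination(p_band2, p_band3)
observed = 0.60                     # combined score for bands 2 + 3
print(f"independence predicts {predicted:.0%}, observed {observed:.0%}")
# → independence predicts 17%, observed 60%
```

The observed score far exceeds the independence prediction, which is the sense in which band combination is synergistic rather than additive.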
INSERT FIGURE 2 ABOUT HERE
Figure 2 shows the basic paradigm for ascertaining the impact of spectral asynchrony on
intelligibility. The signals were similar to those shown in Figure 1 except that one pair of
channels was desynchronized relative to the other in the following fashion. Either bands 2 and 3
(as a unit) preceded bands 1 and 4 (Figure 2, left panel) or followed bands 1 and 4 (Figure 2,
right panel). Recall that when all four bands were presented synchronously, intelligibility was
nearly 90% correct. When the middle bands were presented asynchronously relative to the fringe
bands, intelligibility decreased dramatically (for leads and lags of 50 ms or greater),
demonstrating that auditory-based decoding of speech depends on relative synchrony across
spectral channels. Moreover, when asynchrony across spectral channels exceeded 50 ms,
intelligibility declined below baseline performance for the two central bands presented alone,
suggesting some form of interference between the fringe and central channels.
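The desynchronization manipulation itself amounts to a signed time shift of one band pair relative to the other. A minimal sketch, with noise standing in for the filtered bands:

```python
import numpy as np

def shift(signal, delay_ms, fs):
    """Delay (positive) or advance (negative) a signal, padding with silence."""
    n = int(round(abs(delay_ms) * fs / 1000))
    if delay_ms >= 0:                      # band pair lags: pad front, trim end
        return np.concatenate([np.zeros(n), signal[:len(signal) - n]])
    return np.concatenate([signal[n:], np.zeros(n)])  # band pair leads

fs = 16000
fringe = np.random.randn(fs)               # stand-in for bands 1 + 4
mid = np.random.randn(fs)                  # stand-in for bands 2 + 3
stimulus = fringe + shift(mid, 50.0, fs)   # mid bands lag fringe by 50 ms
```

Sweeping `delay_ms` over positive and negative values generates the family of conditions plotted in Figure 3.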
INSERT FIGURE 3 ABOUT HERE
The impact of spectral asynchrony is illustrated in Figure 3 for a broad range of channel
delays. Intelligibility is highest when all bands are presented synchronously and progressively
declines as the degree of asynchrony increases. The effect of asynchrony on intelligibility is
relatively symmetric in that performance is roughly similar for conditions in which the middle
bands lead the fringe bands and vice versa.
Intelligibility is not always so sensitive to cross-spectral asynchrony. In instances where
the spectral bandwidth of the speech signal is broad and continuous (e.g., no spectral gaps),
listeners are fairly tolerant of cross-spectral asynchrony up to about 140 ms [20,21] before
intelligibility appreciably declines.
One interpretation of these data is that there appear to be multiple time intervals over
which speech is decoded in the auditory system. These range from short analysis windows (1-40
ms), possibly reflecting various aspects of phonetic detail at the articulatory feature level (e.g.,
voicing), through mid-range analysis windows (40-120 ms), possibly reflecting segmental
processing, to long analysis windows (beyond 120 ms), possibly reflecting the importance of prosodic cues,
such as stress accent and syllable number, in the perception of running speech [20,22-24].
Spectro-Temporal Integration Derived from Auditory-Visual Interaction
In noisy and reverberant environments speech recognition is often severely compromised,
particularly at low signal-to-noise ratios. Under such conditions, visual speech cues can provide
sufficient information to restore intelligibility to a level characteristic of speech in quiet [25,26].
Thus, under normal listening conditions, listeners often integrate information not only across the
acoustic spectrum, but also across different sensory modalities [27].
Moreover, the relative timing of acoustic and visual input in auditory-visual speech
perception can have a pronounced effect on intelligibility, as it does on speech presented solely
to the auditory modality. The potential for auditory-visual asynchrony is not insignificant.
Because the bandwidth required for high-fidelity video transmission is much broader than that
required for audio transmission, there is considerable opportunity for the two sources of
information to become desynchronized. For example, in certain news broadcasts where foreign
correspondents are shown as well as heard, the audio component often precedes the video,
resulting in a combined transmission that is out of sync and difficult to understand.
INSERT FIGURE 4 ABOUT HERE
In a series of recent experiments [27-30], we have begun to explore the temporal parameters
governing auditory-visual integration for speech processing. The basic paradigm used in these
studies is displayed in Figure 4. Video-recorded speech materials (sentences, words, and
nonsense syllables) were digitized and edited to provide a means of desynchronizing separately
the audio and video components of each utterance. The top trace in Figure 4 displays the natural
condition, where a neutral face is presented for about a third of a second before the talker begins
speaking. At the end of the utterance, a neutral face is again presented for roughly the same
period of time. Note that in the natural case, the onset of video motion always precedes the onset
of the audio signal. This is a consequence of normal articulation (e.g., taking a breath before
the start of production). The second and third traces in Figure 4 show instances where the audio
waveform has been displaced either forwards or backwards in time by approximately 350 ms.
INSERT FIGURE 5 ABOUT HERE
In an earlier study, Grant and Seitz [27] tested conditions in which the audio was delayed
relative to the video. The speech signal [31] was mixed with a broadband speech-shaped noise to
lower the overall auditory-visual performance for each individual subject to
approximately 80% correct word recognition for synchronously presented speech. Consistent
with previous reports of speech intelligibility with asynchronous auditory-visual sentence
materials [32,33], most subjects were relatively unaffected by audio delay until about 200 ms,
beyond which intelligibility fell precipitously. More recently, Grant and Greenberg [28]
extended these results to include conditions of both audio delay and lead. In addition, rather than
using broadband noise to reduce overall auditory-visual performance, the speech signal was
filtered into two narrow slits (Slit 1 + Slit 4; see Figure 1), consistent with studies by Greenberg
and colleagues. These data are illustrated in Figure 5. Similar to the earlier study of Grant and
Seitz [27], intelligibility was relatively unaffected by audio delay until about 200 ms (but declined
significantly for longer delays). However, recognition performance was adversely affected by
even the smallest amount of asynchrony (40 ms) when the audio signal preceded the video
signal.
INSERT FIGURE 6 ABOUT HERE
More recently, van Wassenhove et al. [30] used similar methods to evaluate temporal
integration in the McGurk effect. The McGurk effect [34] refers to the illusion of perceiving a
third, unique speech token as a result of dubbing the video portion of one consonant onto the
audio portion of another. For example, when an acoustic /p/ is dubbed onto a video /k/, the
result is often /t/. This effect has been used in a number of studies of auditory-visual speech
recognition and stands as a fairly compelling example of the influence of visual speech cues on
speech perception [35-39]. Even when instructed to ignore what they see and report only what they
hear, subjects in these McGurk experiments find themselves unable to "turn off" the visual
channel. Thus, the McGurk effect is interpreted as a natural consequence of the integration
process whereby individuals cannot help but use all the available information at their disposal to
interpret speech. In the study by van Wassenhove et al. [29], two discrepant McGurk stimulus tokens
were created: auditory /b/ combined with video /ɡ/ (leading to the illusory percepts /d/ or
/ð/) and auditory /p/ combined with video /k/ (leading to the illusory percept /t/). Figure 6
shows an example of the dubbing procedure, where an acoustic /pɑ/ is aligned to a video /k/.
Note that the alignment is based entirely on the timing relations between the two acoustic signals
and not on the video. In this example, the original acoustic /k/ is replaced by an acoustic /p/
positioned in time such that the consonant bursts are aligned.
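A rough sketch of this alignment step follows, assuming a simple energy-threshold burst detector; the text does not specify how burst onsets were actually located, so the detector and its parameters are illustrative.

```python
import numpy as np

def burst_onset(signal, fs, frame_ms=5.0, threshold=0.1):
    """Sample index of the first 5-ms frame whose RMS exceeds a
    fraction of the peak frame RMS (a crude burst detector)."""
    n = int(fs * frame_ms / 1000)
    frames = signal[:len(signal) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return int(np.argmax(rms > threshold * rms.max()) * n)

def align_to_burst(replacement, original, fs):
    """Shift `replacement` so its burst onset coincides with that of
    `original`, padding with silence and trimming to equal length."""
    offset = burst_onset(original, fs) - burst_onset(replacement, fs)
    out = np.zeros_like(original)
    src = replacement[max(-offset, 0):]
    dst = max(offset, 0)
    m = min(len(src), len(out) - dst)
    out[dst:dst + m] = src[:m]
    return out
```

Dubbing the aligned audio onto the other token's video then yields the McGurk stimulus.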
INSERT FIGURE 7 ABOUT HERE
To study the effects of temporal asynchrony on the McGurk effect, normal-hearing
subjects were presented with auditory-visual consonant-vowel (CV) tokens consisting of either
video /k/ combined with acoustic /p/ (ApVk) or video /ɡ/ combined with acoustic /b/
(AbVg), and asked to report what they heard while looking at the face. Informal tests of the
audio-alone tokens (audio /p/ and /b/) demonstrated that the subjects were able to label these tokens
with near-perfect accuracy. The alignment of the audio and video portions of each stimulus was
adjusted in steps of roughly 33 ms over the range from -467 ms (auditory lead) to +467 ms
(auditory lag). Subjects were given three response choices for each McGurk token. For ApVk,
subjects could respond either /k/ (visual response), /p/ (auditory response), or /t/ (fusion
response). For AbVg, subjects could respond either /ɡ/ (visual response), /b/ (auditory
response), or /d/ or /ð/ (fusion response). The probabilities for giving the auditory, visual, or
fusion response for the two McGurk tokens are presented in Figures 7 (ApVk) and 8 (AbVg),
respectively. Note that for each McGurk token the tendency to give the auditory response was
greatest for long asynchronies and least for a region roughly between -50 ms and 200 ms.
Conversely, the tendency to give the fusion response to the McGurk tokens was greatest when
the audio and video portions of the stimuli were nearly aligned. Subjects rarely gave the visual
response, indicating an inherent ambiguity in subjects' ability to speechread nonsense syllables.
INSERT FIGURE 8 ABOUT HERE
Perception of Simultaneity Within- and Across-Modality
The data presented thus far show a clear difference between spectro-temporal integration
performed by the ears alone and integration across auditory and visual modalities. For acoustic-
only inputs presented monaurally or diotically, speech cues residing in different spectral bands
must be presented in fairly tight synchrony in order to obtain maximum recognition performance
(Figure 3). In contrast, spectro-temporal integration for audio-video speech recognition is very
robust and can proceed with maximal performance over a wide range of audio-video
asynchronies (Figures 5, 7, and 8). One question that arises from these studies, and others like
them [32,33,39], is whether subjects are even aware of the audio-video asynchrony inherent in the
signals presented, for audio delays corresponding to the plateau region where intelligibility
remains relatively high. In other words, does the temporal window of integration (TWI) derived
from studies of speech intelligibility correspond to the limits of synchrony detection or
discrimination? Or are subjects perceptually aware of small amounts of asynchrony that have
little or no effect on intelligibility? We addressed these questions using two different
experimental paradigms. The first was a simultaneity judgment task used with auditory-visual
CV materials. Subjects were asked to indicate whether the presented token was "simultaneous"
or "successive", irrespective of stimulus identity and temporal order. Both natural (audiovisual
/t/ and /d/) and McGurk (AbVg and ApVk) speech tokens were used. No feedback was
provided. The results are shown in Figure 9.
INSERT FIGURE 9 ABOUT HERE
The most striking property of the data in this figure is the difference between natural
speech audiovisual tokens and McGurk (incongruent) speech tokens. The congruent tokens were
more likely to be judged as simultaneous than the McGurk tokens, even for temporal
asynchronies in the range where fusion responses were most probable. At 0 ms (i.e., auditory-
visual synchrony), the two natural tokens (AdVd and AtVt) were perceived as simultaneous 95%
of the time, whereas the McGurk stimuli (AbVg and ApVk) were perceived as simultaneous only
75-80% of the time. Clearly, subjects noticed that something was not quite right with the
incongruent stimuli, even though the tokens were able to elicit strong fusion responses in the
labeling task. Second, the window over which simultaneity judgments remained high was
significantly wider for the congruent tokens relative to the McGurk stimuli.
The second paradigm used to address the issue of perceived simultaneity was a two-
interval, forced-choice discrimination task where the degree of spectral asynchrony or cross-
modality asynchrony was controlled adaptively. Specifically, CV syllables and sentences were
presented either audio-alone or audiovisually. For audio-alone presentations, sentence stimuli
were filtered into four spectral slits (Figure 1). In one interval all four slits were presented
synchronously. In the other interval, the two mid-frequency slits (Slit 2 + Slit 3) were delayed or
advanced relative to the fringe slits (Slit 1 + Slit 4). The amount of asynchrony was controlled
adaptively according to a 2-down, 1-up rule which tracked the 71% point on the psychometric
function [40]. The subjects' task was to indicate which interval contained speech with components
out of sync. For audiovisual presentations, the same basic procedure was used to control the
stimulus onset asynchrony of the audio signal relative to the video signal. Both CV syllables
(congruent and incongruent McGurk-like audio-visual pairings) and sentence materials were
tested. The results are shown in Table I (synchrony discrimination). Like the intelligibility data,
synchrony discrimination thresholds showed that the temporal window (last column in Table I) is
much narrower (roughly 20 ms) for auditory-alone events than for auditory-visual events
(roughly 200 ms). In addition, temporal integration for auditory-visual speech input derived from
the synchrony-discrimination task is highly asymmetric, strongly favoring conditions where the
visual signal leads the auditory signal. Finally, a comparison between natural, congruent CV
tokens and McGurk CV tokens reveals a wider temporal window for the congruent tokens
(similar to that shown in Figure 9) relative to the McGurk stimuli. This difference is most likely
related to the fact that in natural speech the visible articulatory dynamics and the acoustic
output are more highly correlated than they are for incongruent speech tokens. This, along with
the other differences noted for natural versus McGurk stimuli, raises the question as to whether
the McGurk paradigm can be reliably employed as a measure of auditory-visual speech
integration [27]. Whether the integration processes involved in fusing incongruent audio and video
components are different in degree and/or kind from the integration processes involved in natural
speech processing is an open question that will likely require electrophysiological measures,
such as electroencephalography and magnetoencephalography, to resolve [30].
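The 2-down, 1-up adaptive rule described above can be sketched as follows; the step size, starting asynchrony, and simulated listener are illustrative placeholders for the actual experimental values.

```python
import math
import random

def staircase_2down_1up(respond, start=240.0, step=20.0, trials=300):
    """Estimate the asynchrony yielding ~70.7% correct discrimination."""
    x, n_correct, track = start, 0, []
    for _ in range(trials):
        track.append(x)
        if respond(x):
            n_correct += 1
            if n_correct == 2:          # two correct in a row -> harder
                x, n_correct = max(0.0, x - step), 0
        else:                           # any error -> easier
            x, n_correct = x + step, 0
    return sum(track[-100:]) / 100      # average the late, converged trials

random.seed(1)

def simulated_listener(asynchrony_ms):
    """2AFC observer: probability correct rises from chance (0.5) toward 1
    around a hypothetical 100-ms detection point."""
    p = 0.5 + 0.5 / (1.0 + math.exp(-(asynchrony_ms - 100.0) / 15.0))
    return random.random() < p

threshold = staircase_2down_1up(simulated_listener)
```

Two consecutive correct responses lower the asynchrony and a single error raises it, so the track equilibrates where the probability of two successive correct responses is one half, i.e., at the 70.7% point [40].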
INSERT TABLE I ABOUT HERE
Table I also shows the results of an analysis of the data obtained from speech recognition
and synchrony judgment tasks (synchrony identification). This analysis was performed on each
of the empirical functions shown in Figure 3 (audio-alone sentence identification), Figure 5
(audiovisual sentence identification), Figures 7 and 8 (McGurk fusion identification), and Figure
9 (synchrony identification). The basic idea behind these analyses was to determine the temporal
window over which performance was not significantly different from that obtained with
synchronous input. Strictly speaking, the temporal window defined in this manner does not
indicate the range of temporal asynchronies over which integration may be said to occur. Rather,
it indicates the range of asynchronies that are statistically indistinguishable from the synchronous
case. This more restricted range was determined by fitting each of the functions with an
asymmetric double sigmoidal (ADS) curve. A confidence interval of 95% was chosen to
determine the asynchrony values at which performance was significantly different from
synchrony. Results are shown in Table I and displayed graphically in Figure 10.
INSERT FIGURE 10 ABOUT HERE
The data illustrated in Table I and Figure 10 show that TWIs for audio-alone input (18-53
ms) are much narrower than TWIs for audiovisual input (163-266 ms). In addition, TWIs for
audio-alone input are fairly symmetric, with very little difference in intelligibility regardless of
which spectral bands lead or lag. In contrast, TWIs for audiovisual input are highly asymmetric,
strongly favoring visual leads over audio leads. Furthermore, TWIs for natural, congruent
audiovisual input are wider than TWIs for incongruent McGurk stimuli. Finally, audiovisual
TWIs derived from either synchrony identification or synchrony discrimination tasks are roughly
equivalent to TWIs derived from speech identification tasks, suggesting that intelligibility is
affected as soon as listeners notice asynchronous components within the stimulus ensemble.
This is not the case for audio TWIs where synchrony discrimination thresholds led to
significantly narrower TWIs than those derived from speech recognition tasks.
Discussion
Hirsh noted over 40 years ago that the processing of auditory events, sequences, and
patterns operates in many different time realms (see Watson, this volume, for a more detailed
discussion). The temporal integration studies described in this paper reveal a range of temporal-
processing windows close to what Hirsh anticipated. For speech recognition based solely on
auditory input, different frequency bands, presented monaurally or diotically, must be in fairly
tight temporal register for spectrally distinct speech cues to bind and form a coherent object. It is
within this context that we see the ubiquitous 20 ms range that Hirsh [14] and Hirsh and Sherrick
[15] described as the minimal interval required to perceive temporal order. However, if subjects
are asked to simply compare two different acoustic speech signals, one with all component parts
presented in synchrony, the other with select spectral bands displaced in time, then the temporal
window shrinks to about 10 ms. In this case, subjects are likely to be discriminating synchrony
from asynchrony based on what Hirsh called figural properties, such as phase relations or the
predominance of different frequencies in the beginning or end of the event complex [41].
In auditory-visual speech recognition, the temporal window is much wider (roughly 250
ms) and highly asymmetric. The asymmetry (auditory lags favored over auditory leads) is an
essential property that cannot easily be explained by resorting to differences in the speed of light
versus the speed of sound, either in air or within the nervous system. While such transmission
differences across modality certainly exist, they do not lead readily to the prediction of an
asymmetric plateau region as seen in Figures 5, 7, 8, and 9. Rather, such transmission-time
considerations might suggest a bias for preferring one modality to lead the other (i.e., a peak in
the performance function rather than plateau), if, for example, the neural excitation associated
with both modalities (auditory and visual) needs to reach a common central location at the same
time. To account for the asymmetry of the plateau region, however, other factors must be
invoked. One possibility is that the observed asymmetry could be related to the information
content carried by the two modalities. Auditory speech information is typically far more robust
than visual speech information and is capable of signaling correct recognition without additional
support. In contrast, speechreading is limited to mostly place-of-articulation information [42,43]
and requires additional input from other sources (usually acoustic) to resolve ambiguities in the
data stream. Thus, an auditory-visual decision process that receives the acoustic signal first
might only be subject to cross-modality influences for the first 60 ms or so following signal onset
(i.e., the approximate time when voicing information can be accurately registered in the primary
auditory cortex [44,45]). However, in the case where the visual signal is received first, acoustic
information might influence the decision process over much longer time intervals because of the
visual channel's inherent ambiguity that persists throughout much of its duration [46].
Another possible explanation for the observed asymmetry in audiovisual TWIs is one that
appeals to the natural timing relations between audio and visual events, especially when it comes
to speech. In nature, visible byproducts of speech articulation, including posturing and breath,
almost always occur before acoustic output. This is also true for most non-speech events where
visible movement precedes sound (e.g., a hammer moving and then striking a nail). It is
reasonable to assume that any learning network (such as our brains) exposed to repeated
occurrences of visually leading events would adapt its processing to anticipate and tolerate
multisensory events where visual input leads auditory input while maintaining the perception that
the two events are bound together. Conversely, because acoustic cues rarely precede visual cues
in the real world, the learning network might become fairly intolerant and unlikely to bind
acoustic and visual input where acoustic cues lead visual cues.
An important aspect of the data is the difference in overall window length between
auditory and auditory-visual TWIs. The temporal window for auditory-alone speech recognition,
within which recognition performance is relatively unaffected, is about 40-50 ms (~30 ms
mid-frequency audio lead to ~20 ms mid-frequency audio lag), whereas the temporal window for
auditory-visual speech recognition is about 250 ms (~50 ms audio lead to ~200 ms visual lead).
This corresponds roughly to the resolution needed for temporally fine-grained phonemic analysis
on the one hand and course-grain syllabic analysis on the other, which we interpret as reflecting
the different roles played by the auditory and auditory-visual speech processing.
When speech is processed by eye (i.e., speechreading), it is likely to be advantageous to
integrate over long time windows of roughly syllabic length (200-250 ms) because visual
speech cues are rather coarse [46]. At the segmental level, visual recognition of voicing and
manner of articulation is generally poor [43], and although some prosodic cues (e.g., syllabic
stress and phrase-boundary location) are decoded at better-than-chance levels, accuracy is not
very high [47]. In contrast, acoustic processing of speech is much more robust and capable of
much finer-grained analyses, using temporal window intervals between 10 and 40 ms [18,48].
Interestingly, when acoustic and visual cues are combined asynchronously, the data suggest
that whichever modality is presented first determines the operating characteristics of the
speech processor. That is, when visual cues lead acoustic cues, a long temporal window seems to
dominate whereas when acoustic cues lead visual cues, a short temporal window dominates.
As Hirsh and colleagues suggested more than 40 years ago, auditory perception,
especially of speech and music, cannot be fully understood without considering the role of
temporal processing, as well as the perception of time in general. Hirsh's contributions to the
perception of acoustic events, sequences, and patterns have provided the inspiration and
motivation for decades of subsequent research as well as models upon which future work has
been based. Our current efforts to understand the specific mechanisms with which spectral cues
are integrated across different frequency channels, and how acoustic and visual speech cues are
integrated in time, are but one example of Ira Hirsh's influence on modern speech science.
ACKNOWLEDGMENTS
This research was supported by the Clinical Investigation Service, Walter Reed Army
Medical Center, under Work Unit #00-2501 and by grant numbers DC 000792-01A1 from the
National Institute on Deafness and Other Communication Disorders to Walter Reed Army
Medical Center, SBR 9720398 from the Learning and Intelligent Systems Initiative of the
National Science Foundation to the International Computer Science Institute, and DC 004638-
01 and DC 005660-01 from the National Institute on Deafness and Other Communication
Disorders to the University of Maryland. The opinions or assertions contained herein are the
private views of the authors and should not be construed as official or as reflecting the views of
the Department of the Army or the Department of Defense.
REFERENCES
1. Hirsh IJ. Auditory perception and speech. In: Atkinson RC, Herrnstein RJ, Lindzey G, Luce
RD, eds. Stevens' Handbook of Experimental Psychology, Vol. 1. New York: Wiley;
1988:377-408.
2. Ellis WD. A Source Book of Gestalt Psychology. New York: Harcourt, Brace and World;
1938.
3. Lauter JL. Stimulus characteristics and relative ear advantages: A new look at old data. J
Acoust Soc Am 1983;74:1-17.
4. Bregman AS. Auditory Scene Analysis: the perceptual organization of sound. Cambridge,
Mass: Bradford Books, MIT Press; 1990.
5. Darwin CJ. Perceiving vowels in the presence of another sound: constraints on formant
perception. J Acoust Soc Am 1984;76:1636-1647.
6. Darwin CJ, Sutherland NS. Grouping frequency components of vowels: when is a harmonic
not a harmonic? Quart J Exp Psychol 1984;36A:193-208.
7. Summerfield Q, Culling JF. Auditory segregation of competing voices: absence of effects of
FM or AM coherence. Philos Trans R Soc Lond B Biol Sci 1992;336:357-365.
8. Hukin RW, Darwin CJ. Comparison of the effect of onset asynchrony on auditory grouping
in pitch matching and vowel identification. Percept Psychophys 1995;57:191-196.
9. Hirsh IJ. Temporal aspects of hearing. In: Tower DB, ed. Human Communication and Its
Disorders. New York: Raven Press; 1975:157-162.
10. Watson CS. Temporal acuity and the judgment of temporal order: Related but distinct
auditory abilities. Seminars in Hearing 2004;(this volume).
11. Green DM. Temporal auditory acuity. Psychol Rev 1971;78:540-551.
12. Penner MJ, Robinson CE, Green DM. The critical masking interval. J Acoust Soc Am
1972;48:894-905.
13. Warren RM, Obusek CJ, Farmer RM. Auditory sequences: Confusion of patterns other than
speech or music. Science 1969;164:586-587.
14. Hirsh IJ. Auditory perception of temporal order. J Acoust Soc Am 1959;31:759-767.
15. Hirsh IJ, Sherrick CE. Perceived order in different sense modalities. J Exp Psych
1961;62:423-432.
16. Greenberg S, Arai T, Silipo R. Speech intelligibility derived from exceedingly sparse spectral
information. In: Proceedings of the International Conference of Spoken Language
Processing. Sydney, Australia: ICSLP; 1998:74-77.
17. Silipo R, Greenberg S, Arai T. Temporal constraints on speech intelligibility as deduced from
exceedingly sparse spectral representations. In: Proceedings of Eurospeech 1999. Budapest,
Hungary; 1999:2687-2690.
18. Greenberg S, Arai T. The relation between speech intelligibility and the complex modulation
spectrum. In: Proceedings of the 7th European Conference on Speech Communication and
Technology (Eurospeech-2001). Aalborg, Denmark: 2001:473-476.
19. Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL. The DARPA
TIMIT acoustic-phonetic continuous speech corpus. 1993: CDROM produced by the
National Institute of Standards and Technology (NIST).
20. Greenberg S. Understanding speech understanding: Towards a unified theory of speech
perception. In: Proceedings of the ESCA Workshop on the Auditory Basis of Speech
Perception. Keele University; 1996:1-8.
21. Arai T, Greenberg S. Speech intelligibility in the presence of cross-channel spectral
asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing,
Seattle, WA: 1998:933-936.
22. Huggins AWF. On the perception of temporal phenomena in speech. J Acoust Soc Am
1972;51:1279-1290.
23. Greenberg S. On the origins of speech intelligibility in the real world. In: Proceedings of the
ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels.
Pont-a-Mousson, France: 1997:23-32.
24. Poeppel D. The analysis of speech in different temporal integration windows: Cerebral
lateralization as 'asymmetric sampling in time'. Speech Communication 2003;41:245-255.
25. Sumby WH, Pollack I. Visual contribution to speech intelligibility in noise. J Acoust Soc Am
1954;26:212-215.
26. Grant KW, Braida LD. Evaluating the Articulation Index for audiovisual input. J Acoust Soc
Am 1991;89:2952-2960.
27. Grant KW, Seitz PF. Measures of auditory-visual integration in nonsense syllables and
sentences. J Acoust Soc Am 1998;104:2438-2450.
28. Grant KW, Greenberg S. Speech intelligibility derived from asynchronous processing of
auditory-visual information. In: Proceedings Auditory-Visual Speech Processing (AVSP
2001), Scheelsminde, Denmark: 2001:132-137.
29. van Wassenhove V, Grant KW, Poeppel D. Timing of Auditory-Visual Integration in the
McGurk Effect. Presented at the Society of Neuroscience Annual Meeting, San Diego, CA:
2001:488.
30. van Wassenhove V, Grant KW, Poeppel D. Temporal Integration in the McGurk Effect.
Presented at the annual meeting of the Cognitive Neuroscience Society, San Francisco, CA:
2002:146.
31. Institute of Electrical and Electronic Engineers (IEEE). IEEE recommended practice for
speech quality measures. IEEE, New York:1969.
32. McGrath M, Summerfield Q. Intermodal timing relations and audio-visual speech
recognition by normal-hearing adults. J Acoust Soc Am 1985;77:678-685.
33. Pandey PC, Kunov H, Abel SM. Disruptive effects of auditory signal delay on speech
perception with lipreading. J Aud Res 1986;26:27-41.
34. McGurk H, MacDonald J. Hearing lips and seeing voices. Nature 1976;264:746-747.
35. Massaro DW. Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry.
Hillsdale, NJ: Lawrence Erlbaum Assoc; 1987.
36. Walden BE, Montgomery AA, Prosek RA, Hawkins DB. Visual biasing of normal and
impaired auditory speech perception. J Speech Hear Res 1990;33:163-173.
37. Green KP, Kuhl PK, Meltzoff AN, Stevens EB. Integrating speech information across
talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect.
Percep Psychophys 1991;50:524-536.
38. Munhall K, Gribble P, Sacco L, Ward M. Temporal constraints on the McGurk effect. Percep
Psychophys 1996;58:351-362.
39. Massaro DW, Cohen MM, Smeele PM. Perception of asynchronous and conflicting visual
and auditory speech. J Acoust Soc Am 1996;100:1777-1786.
40. Levitt H. Transformed up-down methods in psychoacoustics. J Acoust Soc Am 1971;49:467-
477.
41. Hirsh IJ. Temporal order and auditory perception. In: Moskowitz HR, Scharf B, Stevens JC,
eds. Sensation and Measurement. Dordrecht-Holland: D. Reidel Publishing Company;
1974:251-258.
42. Grant KW, Walden BE. Evaluating the articulation index for auditory-visual consonant
recognition. J Acoust Soc Am 1996a;100:2415-2424.
43. Grant KW, Walden BE, Seitz PF. Auditory-visual speech recognition by hearing-impaired
subjects: Consonant recognition, sentence recognition, and auditory-visual integration. J
Acoust Soc Am 1998;103:2677-2690.
44. Steinschneider M, Schroeder CE, Arezzo JC, Vaughan HG Jr. Speech evoked activity in
primary auditory cortex: effects of voice onset time. Electroencephalography and Clinical
Neurophysiology 1994;92:30-43.
45. Steinschneider M, Volkov IO, Noh MD, Garell PC, Howard MA. Temporal encoding of the
voice onset time phonetic parameter by field potentials recorded directly from human
auditory cortex. J Neurophys 1999;82:2346-2357.
46. Seitz PF, Grant KW. Modality, perceptual encoding speed, and time course of phonetic
information. In: Massaro DW. ed. Proceedings of Auditory-Visual Speech Processing
(AVSP '99), Santa Cruz, CA: 1999. (CDROM).
47. Grant KW, Walden BE. The spectral distribution of prosodic information. J Speech Hear
Res 1996b;39:228-238.
48. Stevens KN, Blumstein SE. Invariant cues for place of articulation in stop consonants. J
Acoust Soc Am 1978;64:1358-1368.
FIGURE CAPTIONS
Figure 1. Average intelligibility of TIMIT (Texas Instruments – Massachusetts Institute of
Technology) sentences filtered into four discrete spectral regions and presented in various
combinations. Each spectral "slit" is 1/3-octave wide with center frequency (CF) as indicated.
Adapted from [16].
Figure 2. The effect of spectral slit asynchrony on the intelligibility of TIMIT (Texas
Instruments – Massachusetts Institute of Technology) sentences. Adapted from [16].
Figure 3. Same as Figure 2 but showing an expanded range of asynchronies for slits 1 + 4
leading slits 2 + 3. Baseline performance for slits 1 + 4 alone and for slits 2 + 3 alone is
indicated by solid and dashed lines, respectively. Adapted from [17].
Figure 4. Audio-visual alignment procedure showing examples of audio lag (middle waveform)
and audio lead (bottom waveform). All alignments are made relative to the natural unedited
audiovisual speech token (top waveform). Dashed vertical lines show temporal positions (from
left to right) of 1) neutral face onset, 2) video motion onset, 3) video motion offset, and 4)
neutral face offset. Note that video motion precedes acoustic onset in the natural, unedited
production (top waveform).
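The alignment manipulation described in the Figure 4 caption amounts to shifting the audio track relative to the video while preserving overall token duration. A minimal sketch of such a shift, assuming a 44.1-kHz sampling rate and zero-padding of the vacated region (both our assumptions, not details taken from the original stimulus-preparation procedure):

```python
import numpy as np

FS = 44100  # assumed audio sampling rate (Hz); illustrative only

def shift_audio(audio, asynchrony_ms, fs=FS):
    """Impose an audio-video asynchrony by shifting the audio track.

    Positive asynchrony_ms delays the audio relative to the video (audio
    lag); negative values advance it (audio lead). Shifted-out samples are
    discarded and the vacated region is zero-padded so that the overall
    token duration is unchanged.
    """
    n = int(round(abs(asynchrony_ms) * fs / 1000))
    if n == 0:
        return audio.copy()
    pad = np.zeros(n, dtype=audio.dtype)
    if asynchrony_ms > 0:
        # audio lag: prepend silence, trim the tail
        return np.concatenate([pad, audio[:-n]])
    # audio lead: trim the head, append silence
    return np.concatenate([audio[n:], pad])

# Example: a 1-s placeholder "token" delayed by 200 ms (roughly the right
# edge of the audiovisual plateau) and advanced by 50 ms (the left edge)
token = np.sin(2 * np.pi * 440 * np.arange(FS) / FS)
lagged = shift_audio(token, 200)   # audio lag
led = shift_audio(token, -50)      # audio lead
```

The same routine covers both branches of the procedure shown in Figure 4 (middle and bottom waveforms), with the sign of the asynchrony selecting lag versus lead.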
Figure 5. Average intelligibility of IEEE sentences as a function of audio-video asynchrony.
Note the substantial plateau region, extending from 50 ms of audio lead to 200 ms of audio
delay, where intelligibility scores are high relative to the audio-alone or video-alone conditions.
Adapted from [28].
Figure 6. Example of waveform alignment for McGurk stimulus presentation. This example
shows the creation of visual /ka/ and acoustic /pa/, leading to the illusory percept /ta/. Shown
are the original acoustic waveform for the naturally produced audiovisual token /ka/ (top) and
the dubbed acoustic waveform /pa/ (bottom). The two waveforms are aligned at the onset of the
consonant burst.
Figure 7. Labeling functions for the McGurk stimulus visual /ka/ and acoustic /pa/ as a function
of audiovisual asynchrony (audio delay). Circles = probability of responding with the fusion
response /ta/; squares = probability of responding with the acoustic stimulus /pa/; triangles =
probability of responding with the visual response /ka/. Note the relatively long temporal
window (-50 ms audio lead to 200 ms audio lag) in which fusion responses are likely to occur.
Adapted from [29].
Figure 8. Same as Figure 7 but for the McGurk stimulus visual /ga/ and acoustic /ba/. The fusion
response is /da/ or /ða/. Adapted from [29].
Figure 9. Simultaneity judgments for congruent audiovisual consonant-vowel tokens /da/ and
/ta/ (filled symbols) and incongruent audiovisual McGurk tokens visual /ka/ - acoustic /pa/ and
visual /ga/ - acoustic /ba/ (open symbols) as a function of audiovisual asynchrony (audio delay).
Adapted from [29].
Figure 10. Summary of temporal windows obtained from the three different tasks (speech
intelligibility, synchrony identification, and synchrony discrimination) for audio and audiovisual
speech tokens.
ABSTRACT
Throughout his career, Ira Hirsh studied and published articles and books pertaining to
many aspects of the auditory system. These included sound conduction in the ear, cochlear
mechanics, masking, auditory localization, psychoacoustic behavior in animals, speech
perception, medical and audiological applications, coupling between psychophysics and
physiology, and ecological acoustics. However, it is Hirsh's work on the auditory timing of
simple and complex rhythmic patterns, the backbone of speech and music, that lies at the heart
of his more recent work. In this paper, we report on several aspects of temporal processing of speech
signals, both within and across sensory systems. Data are presented on perceived simultaneity
and intelligibility of auditory and auditory-visual speech stimuli where stimulus components are
presented either synchronously or asynchronously. Differences in the symmetry and shape of
temporal windows derived from these data sets are highlighted. Results show two distinct ranges
of temporal integration for speech processing: one relatively short window, about 40 ms, and the
other much longer, around 250 ms. In the case of auditory-visual speech processing, the temporal
window is highly asymmetric, strongly favoring conditions where the visual stimulus precedes
the acoustic stimulus.
LEARNING OBJECTIVES
1) To show the connection between Hirsh's work on the perception of temporal order and
recent work on the processing of asynchronous speech, both cross-spectrally and cross-
modally.
2) To compare and contrast the effects of temporal misalignment in auditory-alone and
auditory-visual speech processing.
KEY WORDS
Ira J. Hirsh, Temporal Order, Spectro-Temporal Integration, Temporal Speech Processing
ABBREVIATIONS
ms - milliseconds
CV - Consonant-Vowel stimuli
AxVy - An incongruent auditory-visual speech token in which the subject hears the speech token
x and simultaneously sees the speech token y.
TWI - Temporal Window of Integration
ADS - an Asymmetric Double Sigmoidal curve fit through the data.
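The ADS fit mentioned above can be illustrated as the product of a rising and a falling logistic with independent centers and slopes, which is what lets a single curve capture a temporal window whose audio-lead and audio-lag flanks differ. The following sketch is ours; the function name and example parameter values are illustrative, not the fitted values reported in these studies:

```python
import numpy as np

def ads(soa, amplitude, c1, w1, c2, w2):
    """Asymmetric double sigmoid: the product of a rising logistic
    (center c1, slope w1) and a falling logistic (center c2, slope w2).
    Because the two flanks are parameterized independently, the fitted
    window boundaries need not be symmetric about zero asynchrony."""
    rise = 1.0 / (1.0 + np.exp(-(soa - c1) / w1))
    fall = 1.0 - 1.0 / (1.0 + np.exp(-(soa - c2) / w2))
    return amplitude * rise * fall

# Example: a window with a steep flank near -50 ms (audio lead) and a
# shallower flank near +200 ms (audio lag); values chosen for illustration
soa = np.linspace(-500.0, 500.0, 1001)   # audio delay (ms)
p = ads(soa, 1.0, -50.0, 15.0, 200.0, 30.0)
```

In practice the window boundaries would be read off the fitted curve (e.g., where the function crosses some criterion level), with the left and right crossings free to fall at different distances from synchrony.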
CEU QUESTIONS
1) What was the area of Hirsh’s research that occupied his primary interest in the latter part of
his career?
A) Measurement of hearing
B) Sound reproduction
C) Auditory physiology
D) Temporal processing of simple and complex sounds
E) Animal psychophysics
2) The temporal window of integration for auditory-visual speech stimuli when the visual
stimulus leads the acoustic stimulus is roughly
A) 500 ms
B) 250 ms
C) 20 ms
D) 40 ms
E) 100 ms
3) For auditory-visual McGurk stimuli, as the acoustic and visual signals become more and
more “out of sync” the perceived stimulus is dominated by the
A) Visual stimulus
B) Acoustic stimulus
C) A noise-like stimulus that is part acoustic, part visual
D) There is no dominant response once the auditory and visual signals are “out of sync”
E) The subject’s response depends on their auditory and visual acuity
4) Spectro-temporal integration windows for auditory-only speech recognition are
A) Symmetrical and long (about 250 ms)
B) Symmetrical and short (about 40 ms)
C) Asymmetrical and long (about 250 ms)
D) Asymmetrical and short (about 40 ms)
E) Symmetrical, but the length varies between 100 and 200 ms depending on the subject.
5) The magnitude and shape of the temporal window of integration for auditory-visual speech
A) Depends on whether the audio or visual signal is leading
B) Is asymmetrical with a broad plateau region for visual leading conditions
C) Is about 40 ms when the audio signal leads the visual signal and about 250 ms when the
visual signal leads the audio signal
D) Is similar for both speech identification and synchrony detection
E) All of the above
AUTHOR BIOGRAPHIES
Ken Grant, Auditory-Visual Speech Recognition Laboratory, Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, DC, USA
Dr. Grant has focused on audio-visual speech processing for nearly 20 years. He received his B.A. from Washington University (Philosophy of Science) in 1976, an M.S. in Speech Science from the University of Washington (1980), and a Ph.D. in Communication Science from Washington University (1986). He was a post-doc in the Department of Electrical Engineering at the Massachusetts Institute of Technology between 1986 and 1990, and has been a research scientist at the Walter Reed Army Medical Center since 1990. His most recent work has focused on temporal properties of speech perception, with an emphasis on interactions between the auditory and visual modalities. http://www.wramc.amedd.army.mil/departments/aasc/avlab

Steven Greenberg, The Speech Institute, Oakland, CA, USA
Dr. Greenberg is a scientist based in California who has worked at the International Computer Science Institute, the University of California, Berkeley, and the University of Wisconsin. He received his A.B. in Linguistics from the University of Pennsylvania (1974) and a Ph.D. in Linguistics (1980) from the University of California, Los Angeles. His research has focused on auditory mechanisms underlying the processing of speech, as well as on machine-learning-based statistical methods for automatic phonetic and prosodic annotation of spontaneous American English dialogue material. He has organized numerous conferences and conference special sessions. http://www.icsi.berkeley.edu/~steveng

David Poeppel, Cognitive Neuroscience of Language Laboratory, University of Maryland, College Park, MD, USA
David Poeppel is from Munich, Germany. He studied neurophysiology and cognitive science at Bowdoin College and then MIT (B.S. 1989). He stayed at MIT and received a Ph.D. in cognitive neuroscience (1995), focusing on linguistics and the neural basis of speech perception and language comprehension. From 1995 to 1997 he was a post-doc at UCSF, learning the imaging techniques MEG and fMRI. Since 1998 he has been on the faculty at the University of Maryland, College Park, where he is an Associate Professor in the Department of Linguistics and the Department of Biology. The work in his lab focuses on problems ranging from temporal sensitivity and coding in human auditory cortex to speech perception and lexical semantics. http://www.ling.umd.edu/poeppel

Virginie van Wassenhove, Cognitive Neuroscience of Language Laboratory, University of Maryland, College Park, MD, USA
Virginie van Wassenhove received her B.S. in Neurophysiology from the University of Maryland, College Park in 1998, and her Ph.D. in the Neuroscience and Cognitive Science program from the University of Maryland, College Park in 2004. Her dissertation was conducted under the supervision of Dr. David Poeppel and Dr. Ken W. Grant at the Walter Reed Army Medical Center, Washington, DC. Her thesis work focused on the neural mechanisms underlying the integration of auditory-visual speech and on defining the distal stimulus parameters that constrain auditory-visual speech integration and the perceptual binding of cross-modal information in general. Her work used a psychophysical approach to investigate the temporal constraints of auditory-visual speech integration, whereas her electrophysiology work focused on the neural correlates of auditory-visual speech integration in the time domain, using electroencephalography and magnetoencephalography brain recording techniques. http://www.wam.umd.edu/%7Evvw/
Table I. Temporal window for efficient integration derived from speech identification, synchrony identification, and synchrony discrimination tasks. For audio conditions, window boundaries refer to the temporal onset of the mid-frequency channels (Slits 2 + 3) relative to the high- and low-frequency channels (Slits 1 + 4). For audiovisual conditions, window boundaries refer to the temporal onset of the audio signal relative to the video signal (negative values indicate audio leading conditions, positive values indicate video leading conditions).
Stimulus Materials    Task                      Modality     Left Boundary (ms)  Right Boundary (ms)  Temporal Window (ms)
TIMIT Sentences       Speech Identification     Audio        -30.9               22.1                 53.0
IEEE Sentences        Synchrony Discrimination  Audio        -7.7                10.4                 18.1
IEEE Sentences        Speech Identification     Audiovisual  -36.1               230.3                266.4
IEEE Sentences        Synchrony Discrimination  Audiovisual  -33.5               165.3                198.8
CV Syllables          Synchrony Identification  Audiovisual  -84.5               137.3                221.8
CV Syllables          Synchrony Discrimination  Audiovisual  -91.3               122.8                214.1
McGurk CV Syllables   Fusion Identification     Audiovisual  -27.3               152.4                179.7
McGurk CV Syllables   Synchrony Identification  Audiovisual  -47.9               115.4                163.3
McGurk CV Syllables   Synchrony Discrimination  Audiovisual  -67.6               115.0                182.6
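The Temporal Window column in Table I is simply the span between the two boundary values. A short sketch recomputing a few of the window widths (boundary values transcribed from Table I; the helper name is ours):

```python
# Boundary values (ms) transcribed from Table I: negative = audio (or
# mid-frequency channel) leading, positive = video (or flanking channel
# lag) leading.
boundaries_ms = {
    "TIMIT sentences, speech identification (audio)":         (-30.9, 22.1),
    "IEEE sentences, synchrony discrimination (audio)":       (-7.7, 10.4),
    "IEEE sentences, speech identification (audiovisual)":    (-36.1, 230.3),
    "IEEE sentences, synchrony discrimination (audiovisual)": (-33.5, 165.3),
}

def window_ms(left, right):
    """Temporal window of integration: right boundary minus left boundary."""
    return right - left

for condition, (left, right) in boundaries_ms.items():
    print(f"{condition}: {window_ms(left, right):.1f} ms")
```

The computed spans reproduce the table's rightmost column and make the auditory/audiovisual contrast discussed in the text explicit: roughly 20-50 ms for audio-alone conditions versus roughly 200-270 ms for audiovisual conditions.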
FIG 1 [figure residue: four 1/3-octave slits (CFs 334, 841, 2120, 5340 Hz); intelligibility values for slit combinations: 89%, 60%, 13%, 2%, 9%, 9%, 4%]

FIG 2 [figure residue: slit delays of 225, 250, and 275 ms; intelligibility values: 72%, 62%, 53%, 80%, 55%, 41%]

FIG 3 [figure residue: x-axis Slit Asynchrony (ms), -100 to 600; y-axis TIMIT Word Recognition (%), 0 to 100; curves for Bands 2 + 3 and Bands 1 + 4]

FIG 5 [figure residue: x-axis Audio Delay (ms), -500 to 500; y-axis IEEE Word Recognition (%), 0 to 100; reference lines for Audio Alone and Visual Alone]

FIG 7 [figure residue: x-axis Audio Delay (ms), -500 to 500; y-axis Response Probability, 0 to 1; response categories /pa/, /ta/, /ka/]

FIG 8 [figure residue: x-axis Audio Delay (ms), -500 to 500; y-axis Response Probability, 0 to 1; response categories /ba/, /da/ or /ða/, /ga/]

FIG 9 [figure residue: x-axis Audio Delay (ms), -500 to 500; y-axis Probability of Simultaneity, 0 to 1; tokens AdVd, AtVt, ApVk, AbVg]

FIG 10 [figure residue: temporal windows of integration; x-axis Stimulus Onset Asynchrony (ms), -100 to 300; Audio conditions: TIMIT Word Identification, IEEE Synchrony Discrimination; Audiovisual conditions: IEEE Synchrony Discrimination, IEEE Word Identification, McGurk Synchrony Discrimination, McGurk Synchrony Identification, McGurk Fusion Identification, CV Synchrony Discrimination, CV Synchrony Identification]