Effects of Spectro-Temporal Asynchrony on Speech Processing (Grant et al.)
Effects of Spectro-Temporal Asynchrony in Auditory and Auditory-Visual Speech Processing
Ken W. Grant, Ph.D. Auditory-Visual Speech Recognition Laboratory, Walter Reed Army Medical Center, Army
Audiology and Speech Center, Washington, DC 20307-5001 (work) 202-782-8596 (fax) 202-782-9228 (email) [email protected]
Steven Greenberg, Ph.D.
International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704
David Poeppel, Ph.D. Cognitive Neuroscience of Language Laboratory, Neuroscience and Cognitive Science Program
(NACS), Department of Biology and Department of Linguistics, University of Maryland, College Park, MD 20742
Virginie van Wassenhove, Ph.D.
Cognitive Neuroscience of Language Laboratory, Neuroscience and Cognitive Science Program (NACS), Department of Biology and Department of Linguistics, University of Maryland,
College Park, MD 20742
Introduction
Auditory events can be described as a series of temporal patterns ranging from the simple
to the complex. Such patterns include steady-state sinusoids (simple tones), to amplitude- and
frequency-modulated tones, as well as sequences of sounds that occur either alone or in
combination with concurrently presented signals [1]. For example, as suggested by Lauter [2], the
speech sound [pa] can be viewed as a series of rapidly sequenced events: a brief broadband
noise-like burst, followed by a brief period of silence, then a complex, quasi-harmonic signal
whose initial onset contains frequency glides, and which ends with a steady-state harmonic
signal characteristic of the vowel /ɑ/. During portions of this relatively simple speech sequence,
multiple acoustic events occur concurrently across a fairly wide range of frequencies, some of
which are quasi steady-state in nature while other portions of the spectrum undergo rapid
Submitted to Seminars in Hearing
transitions in amplitude and frequency. Despite the complex, dynamic nature of the signal,
listeners are able to group its many frequency components together, treating them as belonging
to a single sound. A variety of mechanisms have been proposed as the glue binding disparate
parts of the acoustic signal into a coherent perceptual entity. Most of the proposals follow many
basic Gestalt principles, such as (a) figure and ground, (b) similarity, (c) proximity, (d) common
fate or continuity, (e) symmetry, and (f) closure [3]. Specifically, sounds forming a single,
coherent object possess one or more of the following characteristics [4]: (1) proximity in
frequency (components in the same frequency range are grouped together), (2) harmonicity
(components whose frequencies occur in integer multiples are grouped together), (3) common
time course (harmonics with common onset or common temporal fluctuations are grouped
together), and (4) common spatial location (components coming from the same location in space
are grouped together). Although each of these cues may serve to group components into auditory
objects, temporal properties are probably among the most important [5-8]. Components that begin
and end together are usually perceived as part of the same auditory object. Because of the
inherent temporal nature of sound (i.e., sounds evolve and change over time), questions related to
the auditory perception of simultaneity and synchrony are among the oldest subjects in auditory
psychophysics [9]. The perception of rhythmic patterns, like those in speech and music, presupposes
that a series of auditory objects has been detected and ordered in time. And yet, many questions remain as to
how listeners organize the acoustic environment, separating one voice from another, and
isolating one target stream from an interfering background.
Hirsh and colleagues recognized the importance of the dynamic qualities of speech and
music, embarking on a series of experiments, beginning in the late 1950s, to address some of the
basic questions pertaining to the perception of sequences and temporal patterns. Much of this
literature is reviewed by Watson [10] in this volume and therefore will only be mentioned in
passing.
It should be noted that Hirsh was among the first to conclude that the auditory system
uses time to decode acoustic signals in many different ways and at many different levels of
abstraction. In Hirsh's view, time is an inherent property of all auditory events insofar as all
acoustic signals have some durational property. A signal's duration interacts with its detection, as
well as its loudness and pitch. In addition to duration's impact on various psychoacoustic
phenomena, Hirsh also was interested in how dynamic characteristics of acoustic events affect
auditory perception. Sound localization, the discrimination of one versus two sounds, rising
versus falling pitches, as well as temporal relations in forward and backward recognition
masking, all depend on the temporal relationship between two or more auditory events. Building
further on this theme, Hirsh then went on to consider how timing plays a role in auditory
sequential patterns, and in particular, the perception of temporal order of two or more acoustic
events.
Summarizing this work [9], Hirsh identified three basic categories of temporal phenomena:
1. basic psychophysical information pertaining to detection of acoustic events and the
psychological dimensions of loudness and pitch;
2. temporal relations of non-continuous acoustic signals that result in different effects
for forward and backward masking, and judgments of succession versus simultaneity
(or fusion) of events;
3. auditory pattern perception where the focus is on properties of sequences rather than
on isolated acoustic events.
The time window associated with temporal acuity and simultaneity for simple stimuli
(Hirsh's Category 1) is on the order of 1-2 ms [11]. Time windows associated with masking intervals
and with monaural discrimination of two rapidly successive events (Hirsh's Category 2) are
somewhat longer, approximately 14 ms [12]. Identification, or labeling, of temporal order (Hirsh's
Category 3) requires intervals ranging from approximately 20-25 ms for relatively simple events
(e.g., tones, clicks, light flashes) [9] to 200 ms when the events to be ordered are both complex and
qualitatively very different [13].
The experiments and ideas in this paper expand on this earlier work by Hirsh and
colleagues [14,15]. Our focus, however, is on the impact of temporal asynchrony across spectral
bands of speech and sensory modalities for speech intelligibility. Specifically, we sought to
determine the limits of temporal integration over which speech information is combined, both
within- and across-sensory channels. These limits are important because they govern the time
frame within which information from different spectral regions, and from the auditory and visual
modalities, is combined for successful decoding of the speech signal.
Two principal questions are addressed in the studies that follow:
1. What are the effects of spectro-temporal asynchrony on speech recognition for
auditory-alone and auditory-visual speech inputs?
2. What is the maximum degree of temporal asynchrony for which non-simultaneous
events are perceived as simultaneous?
Methodologically, the integration processes were studied using three different paradigms:
(1) speech intelligibility, (2) detection of synchrony, and (3) synchrony discrimination. Stimulus
materials consisted of nonsense syllables and meaningful sentences. Both congruent (naturally
occurring) and incongruent (as in the McGurk paradigm where the audio segment of one
utterance is dubbed onto the video segment of another utterance) audiovisual materials also were
used.
Spectro-Temporal Integration Derived Exclusively from Auditory Processes
In recent studies of the effects of spectro-temporal asynchrony on speech intelligibility,
Greenberg and colleagues [16-18] have shown that the auditory system is extremely sensitive to
changes made in the relative timing among different spectral bands of speech. The basic
paradigm is displayed in Figures 1 and 2. TIMIT (Texas Instruments – Massachusetts Institute of
Technology) sentence materials (e.g., "She had your dark suit in greasy wash water all year" [19]),
spoken by an equal number of male and female talkers, were filtered into four discrete
non-overlapping 1/3-octave bands and presented in various combinations, either synchronously or
asynchronously.
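As a sketch of this filtering step, the following band-passes a signal into four non-overlapping 1/3-octave slits. The center frequencies, filter order, and use of white noise in place of a TIMIT sentence are illustrative assumptions, not the parameters of the original studies.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def third_octave_band(signal, fs, fc, order=4):
    """Band-pass `signal` into the 1/3-octave band centered at `fc` Hz."""
    lo = fc * 2 ** (-1 / 6)   # lower band edge, 1/6 octave below center
    hi = fc * 2 ** (1 / 6)    # upper band edge, 1/6 octave above center
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)   # zero-phase filtering

fs = 16000
t = np.arange(fs) / fs
speech = np.random.randn(fs)          # stand-in for a TIMIT sentence

centers = [330, 850, 2135, 5400]      # hypothetical slit centers (Hz)
slits = [third_octave_band(speech, fs, fc) for fc in centers]
four_band = sum(slits)                # synchronous four-band stimulus
```

Summing any subset of `slits` reproduces the band-combination conditions of Figure 1.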
INSERT FIGURE 1 ABOUT HERE
Figure 1 shows the intelligibility scores for various band combinations presented
synchronously. For individual bands presented alone, scores for correctly recognized words were
between 2% and 9%. When the bands were combined, however, recognition scores were
significantly higher, often exceeding what one might expect from the simple addition of two
independent bands. For example, combining bands 2 and 3 resulted in an average recognition
score of 60% (versus individual-band scores of 9% each). When all four bands were
combined, the score was 89%, showing that in quiet, as few as four non-contiguous (but widely
spaced) frequency channels are sufficient for near perfect speech recognition (despite the
omission of nearly 80% of the spectrum).
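One way to make this super-additivity concrete is to compare the observed two-band score against a probability-summation benchmark for independent channels; the benchmark formula is our assumption, since the text does not specify an additivity model.

```python
# Predicted word recognition if two bands contributed independently,
# versus the observed combined score reported for bands 2 + 3.
def independent_combination(p1, p2):
    """Probability-summation prediction for two independent channels."""
    return 1 - (1 - p1) * (1 - p2)

p_band2 = p_band3 = 0.09            # individual-band scores from the text
predicted = independent_combination(p_band2, p_band3)
observed = 0.60                     # combined score for bands 2 + 3
print(f"independence predicts {predicted:.0%}, observed {observed:.0%}")
# → independence predicts 17%, observed 60%
```

The observed score far exceeds the independence prediction, which is the sense in which band combination is synergistic rather than additive.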
INSERT FIGURE 2 ABOUT HERE
Figure 2 shows the basic paradigm for ascertaining the impact of spectral asynchrony on
intelligibility. The signals were similar to those shown in Figure 1 except that one pair of
channels was desynchronized relative to the other in the following fashion. Either bands 2 and 3
(as a unit) preceded bands 1 and 4 (Figure 2, left panel) or followed bands 1 and 4 (Figure 2,
right panel). Recall that when all four bands were presented synchronously, intelligibility was
nearly 90% correct. When the middle bands were presented asynchronously relative to the fringe
bands, intelligibility decreased dramatically (for leads and lags of 50 ms or greater),
demonstrating that auditory-based decoding of speech depends on relative synchrony across
spectral channels. Moreover, when asynchrony across spectral channels exceeded 50 ms,
intelligibility declined below baseline performance for the two central bands presented alone,
suggesting some form of interference between the fringe and central channels.
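The desynchronization manipulation itself amounts to a signed time shift of one band pair relative to the other. A minimal sketch, with noise standing in for the filtered bands:

```python
import numpy as np

def shift(signal, delay_ms, fs):
    """Delay (positive) or advance (negative) a signal, padding with silence."""
    n = int(round(abs(delay_ms) * fs / 1000))
    if delay_ms >= 0:                      # band pair lags: pad front, trim end
        return np.concatenate([np.zeros(n), signal[:len(signal) - n]])
    return np.concatenate([signal[n:], np.zeros(n)])  # band pair leads

fs = 16000
fringe = np.random.randn(fs)               # stand-in for bands 1 + 4
mid = np.random.randn(fs)                  # stand-in for bands 2 + 3
stimulus = fringe + shift(mid, 50.0, fs)   # mid bands lag fringe by 50 ms
```

Sweeping `delay_ms` over positive and negative values generates the family of conditions plotted in Figure 3.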
INSERT FIGURE 3 ABOUT HERE
The impact of spectral asynchrony is illustrated in Figure 3 for a broad range of channel
delays. Intelligibility is highest when all bands are presented synchronously and progressively
declines as the degree of asynchrony increases. The effect of asynchrony on intelligibility is
relatively symmetric in that performance is roughly similar for conditions in which the middle
bands lead the fringe bands and vice versa.
Intelligibility is not always so sensitive to cross-spectral asynchrony. In instances where
the spectral bandwidth of the speech signal is broad and continuous (e.g., no spectral gaps),
listeners are fairly tolerant of cross-spectral asynchrony up to about 140 ms [20,21] before
intelligibility appreciably declines.
One interpretation of these data is that there appear to be multiple time intervals over
which speech is decoded in the auditory system. These range from short analysis windows (1-40
ms), possibly reflecting various aspects of phonetic detail at the articulatory feature level (e.g.,
voicing), through mid-range analysis windows (40-120 ms), possibly reflecting segmental
processing, to long analysis windows (beyond 120 ms), possibly reflecting the importance of prosodic cues,
such as stress accent and syllable number, in the perception of running speech [20,22-24].
Spectro-Temporal Integration Derived from Auditory-Visual Interaction
In noisy and reverberant environments speech recognition is often severely compromised,
particularly at low signal-to-noise ratios. Under such conditions, visual speech cues can provide
sufficient information to restore intelligibility to a level characteristic of speech in quiet [25,26].
Thus, under normal listening conditions, listeners often integrate information not only across the
acoustic spectrum, but also across different sensory modalities [27].
Moreover, the relative timing of acoustic and visual input in auditory-visual speech
perception can have a pronounced effect on intelligibility, as it does on speech presented solely
to the auditory modality. The potential for auditory-visual asynchrony is not insignificant.
Because the bandwidth required for high-fidelity video transmission is much broader than that
required for audio transmission, there is considerable opportunity for the two sources of
information to become desynchronized. For example, in certain news broadcasts where foreign
correspondents are shown as well as heard, the audio component often precedes the video,
resulting in a combined transmission that is out of sync and difficult to understand.
INSERT FIGURE 4 ABOUT HERE
In a series of recent experiments [27-30], we have begun to explore the temporal parameters
governing auditory-visual integration for speech processing. The basic paradigm used in these
studies is displayed in Figure 4. Video-recorded speech materials (sentences, words, and
nonsense syllables) were digitized and edited to provide a means of desynchronizing separately
the audio and video components of each utterance. The top trace in Figure 4 displays the natural
condition, where a neutral face is presented for about a third of a second before the talker begins
speaking. At the end of the utterance, a neutral face is again presented for roughly the same
period of time. Note that in the natural case, the onset of video motion always precedes the onset
of the audio signal. This is a consequence of normal articulation (e.g., taking a breath before
the start of production). The second and third traces in Figure 4 show instances where the audio
waveform has been displaced either forwards or backwards in time by approximately 350 ms.
INSERT FIGURE 5 ABOUT HERE
In an earlier study, Grant and Seitz [27] tested conditions in which the audio was delayed
relative to the video. The speech signal [31] was mixed with a broadband speech-shaped noise to
lower the overall auditory-visual performance for each individual subject to
approximately 80% correct word recognition for synchronously presented speech. Consistent
with previous reports of speech intelligibility with asynchronous auditory-visual sentence
materials [32,33], most subjects were relatively unaffected by audio delay until about 200 ms,
beyond which intelligibility fell precipitously. More recently, Grant and Greenberg [28]
extended these results to include conditions of both audio delay and lead. In addition, rather than
using broadband noise to reduce overall auditory-visual performance, the speech signal was
filtered into two narrow slits (Slit 1 + Slit 4; see Figure 1), consistent with studies by Greenberg
and colleagues. These data are illustrated in Figure 5. Similar to the earlier study of Grant and
Seitz [27], intelligibility was relatively unaffected by audio delay until about 200 ms (but declined
significantly for longer delays). However, recognition performance was adversely affected by
even the smallest amount of asynchrony (40 ms) when the audio signal preceded the video
signal.
INSERT FIGURE 6 ABOUT HERE
More recently, van Wassenhove et al. [30] used similar methods to evaluate temporal
integration in the McGurk effect. The McGurk effect [34] refers to the illusion of perceiving a
third, unique speech token as a result of dubbing the video portion of one consonant onto the
audio portion of another. For example, when an acoustic /p/ is dubbed onto a video /k/, the
result is often /t/. This effect has been used in a number of studies of auditory-visual speech
recognition and stands as a fairly compelling example of the influence of visual speech cues on
speech perception [35-39]. Even when instructed to ignore what they see and report only what they
hear, subjects in these McGurk experiments find themselves unable to "turn off" the visual
channel. Thus, the McGurk effect is interpreted as a natural consequence of the integration
process whereby individuals cannot help but use all the available information at their disposal to
interpret speech. In the study by van Wassenhove et al. [29], two discrepant McGurk stimulus tokens
were created: auditory /b/ combined with video /ɡ/ (leading to the illusory percepts /d/ or
/ð/) and auditory /p/ combined with video /k/ (leading to the illusory percept /t/). Figure 6
shows an example of the dubbing procedure, where an acoustic /pɑ/ is aligned to a video /k/.
Note that the alignment is based entirely on the timing relations between the two acoustic signals
and not on the video. In this example, the original acoustic /k/ is replaced by an acoustic /p/
positioned in time such that the consonant bursts are aligned.
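A rough sketch of this alignment step follows, assuming a simple energy-threshold burst detector; the text does not specify how burst onsets were actually located, so the detector and its parameters are illustrative.

```python
import numpy as np

def burst_onset(signal, fs, frame_ms=5.0, threshold=0.1):
    """Sample index of the first 5-ms frame whose RMS exceeds a
    fraction of the peak frame RMS (a crude burst detector)."""
    n = int(fs * frame_ms / 1000)
    frames = signal[:len(signal) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return int(np.argmax(rms > threshold * rms.max()) * n)

def align_to_burst(replacement, original, fs):
    """Shift `replacement` so its burst onset coincides with that of
    `original`, padding with silence and trimming to equal length."""
    offset = burst_onset(original, fs) - burst_onset(replacement, fs)
    out = np.zeros_like(original)
    src = replacement[max(-offset, 0):]
    dst = max(offset, 0)
    m = min(len(src), len(out) - dst)
    out[dst:dst + m] = src[:m]
    return out
```

Dubbing the aligned audio onto the other token's video then yields the McGurk stimulus.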
INSERT FIGURE 7 ABOUT HERE
To study the effects of temporal asynchrony on the McGurk effect, normal-hearing
subjects were presented with auditory-visual consonant-vowel (CV) tokens consisting of either
video /k/ combined with acoustic /p/ (ApVk) or video /ɡ/ combined with acoustic /b/
(AbVg), and asked to report what they heard while looking at the face. Informal tests of the
audio-alone tokens (audio /p/ and /b/) demonstrated that the subjects were able to label these tokens
with near-perfect accuracy. The alignment of the audio and video portions of each stimulus was
adjusted in steps of roughly 33 ms over the range from -467 ms (auditory lead) to +467 ms
(auditory lag). Subjects were given three response choices for each McGurk token. For ApVk,
subjects could respond either /k/ (visual response), /p/ (auditory response), or /t/ (fusion
response). For AbVg, subjects could respond either /ɡ/ (visual response), /b/ (auditory
response), or /d/ or /ð/ (fusion response). The probabilities for giving the auditory, visual, or
fusion response for the two McGurk tokens are presented in Figures 7 (ApVk) and 8 (AbVg),
respectively. Note that for each McGurk token the tendency to give the auditory response was
greatest for long asynchronies and least for a region roughly between -50 ms and 200 ms.
Conversely, the tendency to give the fusion response to the McGurk tokens was greatest when
the audio and video portions of the stimuli were nearly aligned. Subjects rarely gave the visual
response, indicating an inherent ambiguity in subjects' ability to speechread nonsense syllables.
INSERT FIGURE 8 ABOUT HERE
Perception of Simultaneity Within- and Across-Modality
The data presented thus far show a clear difference between spectro-temporal integration
performed by the ears alone and integration across auditory and visual modalities. For acoustic-
only inputs presented monaurally or diotically, speech cues residing in different spectral bands
must be presented in fairly tight synchrony in order to obtain maximum recognition performance
(Figure 3). In contrast, spectro-temporal integration for audio-video speech recognition is very
robust and can proceed with maximal performance over a wide range of audio-video
asynchronies (Figures 5, 7, and 8). One question that arises from these studies, and others like
them [32,33,39], is whether subjects are even aware of the audio-video asynchrony inherent in the
signals presented, for audio delays corresponding to the plateau region where intelligibility
remains relatively high. In other words, does the temporal window of integration (TWI) derived
from studies of speech intelligibility correspond to the limits of synchrony detection or
discrimination? Or are subjects perceptually aware of small amounts of asynchrony that have
little or no effect on intelligibility? We addressed these questions using two different
experimental paradigms. The first was a simultaneity judgment task used with auditory-visual
CV materials. Subjects were asked to indicate whether the presented token was "simultaneous"
or "successive", irrespective of stimulus identity and temporal order. Both natural (audiovisual
/t/ and /d/) and McGurk (AbVg and ApVk) speech tokens were used. No feedback was
provided. The results are shown in Figure 9.
INSERT FIGURE 9 ABOUT HERE
The most striking property of the data in this figure is the difference between natural
speech audiovisual tokens and McGurk (incongruent) speech tokens. The congruent tokens were
more likely to be judged as simultaneous than the McGurk tokens, even for temporal
asynchronies in the range where fusion responses were most probable. At 0 ms (i.e., auditory-
visual synchrony), the two natural tokens (AdVd and AtVt) were perceived as simultaneous 95%
of the time, whereas the McGurk stimuli (AbVg and ApVk) were perceived as simultaneous only
75-80% of the time. Clearly, subjects noticed that something was not quite right with the
incongruent stimuli, even though the tokens were able to elicit strong fusion responses in the
labeling task. Second, the window over which simultaneity judgments remained high was
significantly wider for the congruent tokens relative to the McGurk stimuli.
The second paradigm used to address the issue of perceived simultaneity was a two-
interval, forced-choice discrimination task where the degree of spectral asynchrony or cross-
modality asynchrony was controlled adaptively. Specifically, CV syllables and sentences were
presented either audio-alone or audiovisually. For audio-alone presentations, sentence stimuli
were filtered into four spectral slits (Figure 1). In one interval all four slits were presented
synchronously. In the other interval, the two mid-frequency slits (Slit 2 + Slit 3) were delayed or
advanced relative to the fringe slits (Slit 1 + Slit 4). The amount of asynchrony was controlled
adaptively according to a 2-down, 1-up rule which tracked the 71% point on the psychometric
function [40]. The subjects' task was to indicate which interval contained speech with components
out of sync. For audiovisual presentations, the same basic procedure was used to control the
stimulus onset asynchrony of the audio signal relative to the video signal. Both CV syllables
(congruent and incongruent McGurk-like audio-visual pairings) and sentence materials were
tested. The results are shown in Table I (synchrony discrimination). Like the intelligibility data,
synchrony discrimination thresholds showed that the temporal window (last column in Table I) is
much narrower (roughly 20 ms) for auditory-alone events than for auditory-visual events
(roughly 200 ms). In addition, temporal integration for auditory-visual speech input derived from
the synchrony-discrimination task is highly asymmetric, strongly favoring conditions where the
visual signal leads the auditory signal. Finally, a comparison between natural, congruent CV
tokens and McGurk CV tokens reveals a wider temporal window for the congruent tokens
(similar to that shown in Figure 9) relative to the McGurk stimuli. This difference is most likely
related to the fact that in natural speech the visible articulatory dynamics and the acoustic
output are more highly correlated than they are for incongruent speech tokens. This, along with
the other differences noted for natural versus McGurk stimuli, raises the question as to whether
the McGurk paradigm can be reliably employed as a measure of auditory-visual speech
integration [27]. Whether the integration processes involved in fusing incongruent audio and video
components are different in degree and/or kind from the integration processes involved in natural
speech processing is an open question that will likely require electrophysiological measures,
such as electroencephalography and magnetoencephalography, to resolve [30].
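The 2-down, 1-up adaptive rule described above can be sketched as follows; the step size, starting asynchrony, and simulated listener are illustrative placeholders for the actual experimental values.

```python
import math
import random

def staircase_2down_1up(respond, start=240.0, step=20.0, trials=300):
    """Estimate the asynchrony yielding ~70.7% correct discrimination."""
    x, n_correct, track = start, 0, []
    for _ in range(trials):
        track.append(x)
        if respond(x):
            n_correct += 1
            if n_correct == 2:          # two correct in a row -> harder
                x, n_correct = max(0.0, x - step), 0
        else:                           # any error -> easier
            x, n_correct = x + step, 0
    return sum(track[-100:]) / 100      # average the late, converged trials

random.seed(1)

def simulated_listener(asynchrony_ms):
    """2AFC observer: probability correct rises from chance (0.5) toward 1
    around a hypothetical 100-ms detection point."""
    p = 0.5 + 0.5 / (1.0 + math.exp(-(asynchrony_ms - 100.0) / 15.0))
    return random.random() < p

threshold = staircase_2down_1up(simulated_listener)
```

Two consecutive correct responses lower the asynchrony and a single error raises it, so the track equilibrates where the probability of two successive correct responses is one half, i.e., at the 70.7% point [40].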
INSERT TABLE I ABOUT HERE
Table I also shows the results of an analysis of the data obtained from speech recognition
and synchrony judgment tasks (synchrony identification). This analysis was performed on each
of the empirical functions shown in Figure 3 (audio-alone sentence identification), Figure 5
(audiovisual sentence identification), Figures 7 and 8 (McGurk fusion identification), and Figure
9 (synchrony identification). The basic idea behind these analyses was to determine the temporal
window over which performance was not significantly different from that obtained with
synchronous input. Strictly speaking, the temporal window defined in this manner does not
indicate the range of temporal asynchronies over which integration may be said to occur. Rather,
it indicates the range of asynchronies that are statistically indistinguishable from the synchronous
case. This more restricted range was determined by fitting each of the functions with an
asymmetric double sigmoidal (ADS) curve. A confidence interval of 95% was chosen to
determine the asynchrony values at which performance was significantly different from
synchrony. Results are shown in Table I and displayed graphically in Figure 10.
INSERT FIGURE 10 ABOUT HERE
The data illustrated in Table I and Figure 10 show that TWIs for audio-alone input (18-53
ms) are much narrower than TWIs for audiovisual input (163-266 ms). In addition, TWIs for
audio-alone input are fairly symmetric, with very little difference in intelligibility regardless of
which spectral bands lead or lag. In contrast, TWIs for audiovisual input are highly asymmetric,
strongly favoring visual leads over audio leads. Furthermore, TWIs for natural, congruent
audiovisual input are wider than TWIs for incongruent McGurk stimuli. Finally, audiovisual
TWIs derived from either synchrony identification or synchrony discrimination tasks are roughly
equivalent to TWIs derived from speech identification tasks, suggesting that intelligibility is
affected as soon as listeners notice asynchronous components within the stimulus ensemble.
This is not the case for audio TWIs where synchrony discrimination thresholds led to
significantly narrower TWIs than those derived from speech recognition tasks.
Discussion
Hirsh noted over 40 years ago that the processing of auditory events, sequences, and
patterns operates in many different time realms (see Watson, this volume, for a more detailed
discussion). The temporal integration studies described in this paper reveal a range of temporal-
processing windows close to what Hirsh anticipated. For speech recognition based solely on
auditory input, different frequency bands, presented monaurally or diotically, must be in fairly
tight temporal register for spectrally distinct speech cues to bind and form a coherent object. It is
within this context that we see the ubiquitous 20 ms range that Hirsh [14] and Hirsh and Sherrick
[15] described as the minimal interval required to perceive temporal order. However, if subjects
are asked to simply compare two different acoustic speech signals, one with all component parts
presented in synchrony, the other with select spectral bands displaced in time, then the temporal
window shrinks to about 10 ms. In this case, subjects are likely to be discriminating synchrony
from asynchrony based on what Hirsh called figural properties, such as phase relations or the
predominance of different frequencies in the beginning or end of the event complex [41].
In auditory-visual speech recognition, the temporal window is much wider (roughly 250
ms) and highly asymmetric. The asymmetry (auditory lags favored over auditory leads) is an
essential property that cannot easily be explained by resorting to differences in the speed of light
versus the speed of sound, either in air or within the nervous system. While such transmission
differences across modality certainly exist, they do not lead readily to the prediction of an
asymmetric plateau region as seen in Figures 5, 7, 8, and 9. Rather, such transmission-time
considerations might suggest a bias for preferring one modality to lead the other (i.e., a peak in
the performance function rather than plateau), if, for example, the neural excitation associated
with both modalities (auditory and visual) needs to reach a common central location at the same
time. To account for the asymmetry of the plateau region, however, other factors must be
invoked. One possibility is that the observed asymmetry could be related to the information
content carried by the two modalities. Auditory speech information is typically far more robust
than visual speech information and is capable of signaling correct recognition without additional
support. In contrast, speechreading is limited to mostly place-of-articulation information [42,43]
and requires additional input from other sources (usually acoustic) to resolve ambiguities in the
data stream. Thus, an auditory-visual decision process that receives the acoustic signal first
might only be subject to cross-modality influences for the first 60 ms or so following signal onset
(i.e., the approximate time when voicing information can be accurately registered in the primary
auditory cortex [44,45]). However, in the case where the visual signal is received first, acoustic
information might influence the decision process over much longer time intervals because of the
visual channel's inherent ambiguity that persists throughout much of its duration [46].
Another possible explanation for the observed asymmetry in audiovisual TWIs is one that
appeals to the natural timing relations between audio and visual events, especially when it comes
to speech. In nature, visible byproducts of speech articulation, including posturing and breath,
almost always occur before acoustic output. This is also true for most non-speech events where
visible movement precedes sound (e.g., a hammer moving and then striking a nail). It is
reasonable to assume that any learning network (such as our brains) exposed to repeated
occurrences of visually leading events would adapt its processing to anticipate and tolerate
multisensory events where visual input leads auditory input while maintaining the perception that
the two events are bound together. Conversely, because acoustic cues rarely precede visual cues
in the real world, the learning network might become fairly intolerant and unlikely to bind
acoustic and visual input where acoustic cues lead visual cues.
An important aspect of the data is the difference in overall window length between
auditory and auditory-visual TWIs. The temporal window for auditory-alone speech recognition,
within which recognition performance is relatively unaffected, is about 40-50 ms (~30 ms
mid-frequency audio lead to ~20 ms mid-frequency audio lag), whereas the temporal window for
auditory-visual speech recognition is about 250 ms (~50 ms audio lead to ~200 ms visual lead).
This corresponds roughly to the resolution needed for temporally fine-grained phonemic analysis
on the one hand and course-grain syllabic analysis on the other, which we interpret as reflecting
the different roles played by the auditory and auditory-visual speech processing.
When speech is processed by eye (i.e., speechreading), it is likely to be advantageous to
integrate over long time windows of roughly syllabic length (200-250 ms) because visual
speech cues are rather coarse [46]. At the segmental level, visual recognition of voicing and
manner of articulation is generally poor [43], and although some prosodic cues (e.g., syllabic
stress and phrase-boundary location) are decoded at better-than-chance levels, accuracy is not
very high [47]. In contrast, acoustic processing of speech is much more robust and capable of
much finer-grained analyses, using temporal window intervals between 10 and 40 ms [18,48].
Interestingly, when acoustic and visual cues are combined asynchronously, the data suggest
that whichever modality is presented first determines the operating characteristics of the
speech processor. That is, when visual cues lead acoustic cues, a long temporal window seems to
dominate whereas when acoustic cues lead visual cues, a short temporal window dominates.
As Hirsh and colleagues suggested more than 40 years ago, auditory perception,
especially of speech and music, cannot be fully understood without considering the role of
temporal processing, as well as the perception of time in general. Hirsh's contributions to the
perception of acoustic events, sequences, and patterns have provided the inspiration and
motivation for decades of subsequent research as well as models upon which future work has
been based. Our current efforts to understand the specific mechanisms with which spectral cues
are integrated across different frequency channels, and how acoustic and visual speech cues are
integrated in time, are but one example of Ira Hirsh's influence on modern speech science.
ACKNOWLEDGMENTS
This research was supported by the Clinical Investigation Service, Walter Reed Army
Medical Center, under Work Unit #00-2501 and by grant numbers DC 000792-01A1 from the
National Institute on Deafness and Other Communication Disorders to Walter Reed Army
Medical Center, SBR 9720398 from the Learning and Intelligent Systems Initiative of the
National Science Foundation to the International Computer Science Institute, and DC 004638-
01 and DC 005660-01 from the National Institute on Deafness and Other Communication
Disorders to the University of Maryland. The opinions or assertions contained herein are the
private views of the authors and should not be construed as official or as reflecting the views of
the Department of the Army or the Department of Defense.
REFERENCES
1. Hirsh IJ. Auditory perception and speech. In: Atkinson RC, Herrnstein RJ, Lindzey G, Luce
RD, eds. Stevens' Handbook of Experimental Psychology, Vol. 1. New York: Wiley;
1988:377-408.
2. Ellis WD. A Source Book of Gestalt Psychology. New York: Harcourt, Brace and World;
1938.
3. Lauter JL. Stimulus characteristics and relative ear advantages: A new look at old data. J
Acoust Soc Am 1983;74:1-17.
4. Bregman AS. Auditory Scene Analysis: the perceptual organization of sound. Cambridge,
Mass: Bradford Books, MIT Press; 1990.
5. Darwin CJ. Perceiving vowels in the presence of another sound: constraints on formant
perception. J Acoust Soc Am 1984;76:1636-1647.
6. Darwin CJ, Sutherland NS. Grouping frequency components of vowels: when is a harmonic
not a harmonic? Quart J Exp Psychol 1984;36A:193-208.
7. Summerfield Q, Culling JF. Auditory segregation of competing voices: absence of effects of
FM or AM coherence. Philos Trans R Soc Lond B Biol Sci 1992;336:357-365.
8. Hukin RW, Darwin CJ. Comparison of the effect of onset asynchrony on auditory grouping
in pitch matching and vowel identification. Percept Psychophys 1995;57:191-196.
9. Hirsh IJ. Temporal aspects of hearing. In: Tower DB, ed. Human Communication and Its
Disorders. New York: Raven Press; 1975:157-162.
10. Watson CS. Temporal acuity and the judgment of temporal order: Related but distinct
auditory abilities. Seminars in Hearing 2004;(this volume).
11. Green DM. Temporal auditory acuity. Psychol Rev 1971;78:540-551.
12. Penner MJ, Robinson CE, Green DM. The critical masking interval. J Acoust Soc Am
1972;48:894-905.
13. Warren RM, Obusek CJ, Farmer RM. Auditory sequences: Confusion of patterns other than
speech or music. Science 1969;164:586-587.
14. Hirsh IJ. Auditory perception of temporal order. J Acoust Soc Am 1959;31:759-767.
15. Hirsh IJ, Sherrick CE. Perceived order in different sense modalities. J Exp Psych
1961;62:423-432.
16. Greenberg S, Arai T, Silipo R. Speech intelligibility derived from exceedingly sparse spectral
information. In: Proceedings of the International Conference of Spoken Language
Processing. Sydney, Australia: ICSLP; 1998:74-77.
17. Silipo R, Greenberg S, Arai T. Temporal constraints on speech intelligibility as deduced from
exceedingly sparse spectral representations. In: Proceedings of Eurospeech 1999. Budapest,
Hungary; 1999:2687-2690.
18. Greenberg S, Arai T. The relation between speech intelligibility and the complex modulation
spectrum. In: Proceedings of the 7th European Conference on Speech Communication and
Technology (Eurospeech-2001). Aalborg, Denmark: 2001:473-476.
19. Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL. The DARPA
TIMIT acoustic-phonetic continuous speech corpus. 1993: CDROM produced by the
National Institute of Standards and Technology (NIST).
20. Greenberg S. Understanding speech understanding: Towards a unified theory of speech
perception. In: Proceedings of the ESCA Workshop on the Auditory Basis of Speech
Perception. Keele University; 1996:1-8.
21. Arai T, Greenberg S. Speech intelligibility in the presence of cross-channel spectral
asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing,
Seattle, WA: 1998:933-936.
22. Huggins AWF. On the perception of temporal phenomena in speech. J Acoust Soc Am
1972;51:1279-1290.
23. Greenberg S. On the origins of speech intelligibility in the real world. In: Proceedings of the
ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels.
Pont-a-Mousson, France: 1997:23-32.
24. Poeppel D. The analysis of speech in different temporal integration windows: Cerebral
lateralization as 'asymmetric sampling in time'. Speech Communication 2003;41:245-255.
25. Sumby WH, Pollack I. Visual contribution to speech intelligibility in noise. J Acoust Soc Am
1954;26:212-215.
26. Grant KW, Braida LD. Evaluating the Articulation Index for audiovisual input. J Acoust Soc
Am 1991;89:2952-2960.
27. Grant KW, Seitz PF. Measures of auditory-visual integration in nonsense syllables and
sentences. J Acoust Soc Am 1998;104:2438-2450.
28. Grant KW, Greenberg S. Speech intelligibility derived from asynchronous processing of
auditory-visual information. In: Proceedings Auditory-Visual Speech Processing (AVSP
2001), Scheelsminde, Denmark: 2001:132-137.
29. van Wassenhove V, Grant KW, Poeppel D. Timing of Auditory-Visual Integration in the
McGurk Effect. Presented at the Society of Neuroscience Annual Meeting, San Diego, CA:
2001:488.
30. van Wassenhove V, Grant KW, Poeppel D. Temporal Integration in the McGurk Effect.
Presented at the annual meeting of the Cognitive Neuroscience Society, San Francisco, CA:
2002:146.
31. Institute of Electrical and Electronic Engineers (IEEE). IEEE recommended practice for
speech quality measures. IEEE, New York:1969.
32. McGrath M, Summerfield Q. Intermodal timing relations and audio-visual speech
recognition by normal-hearing adults. J Acoust Soc Am 1985;77:678-685.
33. Pandey PC, Kunov H, Abel SM. Disruptive effects of auditory signal delay on speech
perception with lipreading. J Aud Res 1986;26:27-41.
34. McGurk H, MacDonald J. Hearing lips and seeing voices. Nature 1976;264:746-747.
35. Massaro DW. Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry.
Hillsdale, NJ: Lawrence Erlbaum Assoc; 1987.
36. Walden BE, Montgomery AA, Prosek RA, Hawkins DB. Visual biasing of normal and
impaired auditory speech perception. J Speech Hear Res 1990;33:163-173.
37. Green KP, Kuhl PK, Meltzoff AN, Stevens EB. Integrating speech information across
talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect.
Percep Psychophys 1991;50:524-536.
38. Munhall K, Gribble P, Sacco L, Ward M. Temporal constraints on the McGurk effect. Percep
Psychophys 1996;58:351-362.
39. Massaro DW, Cohen MM, Smeele PM. Perception of asynchronous and conflicting visual
and auditory speech. J Acoust Soc Am 1996;100:1777-1786.
40. Levitt H. Transformed up-down methods in psychoacoustics. J Acoust Soc Am 1971;49:467-
477.
41. Hirsh IJ. Temporal order and auditory perception. In: Moskowitz HR, Scharf B, Stevens JC,
eds. Sensation and Measurement. Dordrecht-Holland: D. Reidel Publishing Company;
1974:251-258.
42. Grant KW, Walden BE. Evaluating the articulation index for auditory-visual consonant
recognition. J Acoust Soc Am 1996a;100:2415-2424.
43. Grant KW, Walden BE, Seitz PF. Auditory-visual speech recognition by hearing-impaired
subjects: Consonant recognition, sentence recognition, and auditory-visual integration. J
Acoust Soc Am 1998;103:2677-2690.
44. Steinschneider M, Schroeder CE, Arezzo JC, Vaughan HG Jr. Speech evoked activity in
primary auditory cortex: effects of voice onset time. Electroencephalography and Clinical
Neurophysiology 1994;92:30-43.
45. Steinschneider M, Volkov IO, Noh MD, Garell PC, Howard MA. Temporal encoding of the
voice onset time phonetic parameter by field potentials recorded directly from human
auditory cortex. J Neurophys 1999;82:2346-2357.
46. Seitz PF, Grant KW. Modality, perceptual encoding speed, and time course of phonetic
information. In: Massaro DW. ed. Proceedings of Auditory-Visual Speech Processing
(AVSP '99), Santa Cruz, CA: 1999. (CDROM).
47. Grant KW, Walden BE. The spectral distribution of prosodic information. J Speech Hear
Res 1996b;39:228-238.
48. Stevens KN, Blumstein SE. Invariant cues for place of articulation in stop consonants. J
Acoust Soc Am 1978;64:1358-1368.
FIGURE CAPTIONS
Figure 1. Average intelligibility of TIMIT (Texas Instruments – Massachusetts Institute of
Technology) sentences filtered into four discrete spectral regions and presented in various
combinations. Each spectral "slit" is 1/3-octave wide with center frequency (CF) as indicated.
Adapted from [16].
Figure 2. The effect of spectral slit asynchrony on the intelligibility of TIMIT (Texas
Instruments – Massachusetts Institute of Technology) sentences. Adapted from [16].
Figure 3. Same as Figure 2 but showing an expanded range of asynchronies for slits 1 + 4
leading slits 2 + 3. Baseline performance for slits 1 + 4 alone and for slits 2 + 3 alone is
indicated by solid and dashed lines, respectively. Adapted from [17].
Figure 4. Audio-visual alignment procedure showing examples of audio lag (middle waveform)
and audio lead (bottom waveform). All alignments are made relative to the natural unedited
audiovisual speech token (top waveform). Dashed vertical lines show temporal positions (from
left to right) of 1) neutral face onset, 2) video motion onset, 3) video motion offset, and 4)
neutral face offset. Note that video motion precedes acoustic onset in the natural, unedited
production (top waveform).
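The alignment manipulation described in the Figure 4 caption amounts to shifting the audio track relative to the video while preserving overall token duration. A minimal sketch of such a shift, assuming a 44.1-kHz sampling rate and zero-padding of the vacated region (both our assumptions, not details taken from the original stimulus-preparation procedure):

```python
import numpy as np

FS = 44100  # assumed audio sampling rate (Hz); illustrative only

def shift_audio(audio, asynchrony_ms, fs=FS):
    """Impose an audio-video asynchrony by shifting the audio track.

    Positive asynchrony_ms delays the audio relative to the video (audio
    lag); negative values advance it (audio lead). Shifted-out samples are
    discarded and the vacated region is zero-padded so that the overall
    token duration is unchanged.
    """
    n = int(round(abs(asynchrony_ms) * fs / 1000))
    if n == 0:
        return audio.copy()
    pad = np.zeros(n, dtype=audio.dtype)
    if asynchrony_ms > 0:
        # audio lag: prepend silence, trim the tail
        return np.concatenate([pad, audio[:-n]])
    # audio lead: trim the head, append silence
    return np.concatenate([audio[n:], pad])

# Example: a 1-s placeholder "token" delayed by 200 ms (roughly the right
# edge of the audiovisual plateau) and advanced by 50 ms (the left edge)
token = np.sin(2 * np.pi * 440 * np.arange(FS) / FS)
lagged = shift_audio(token, 200)   # audio lag
led = shift_audio(token, -50)      # audio lead
```

The same routine covers both branches of the procedure shown in Figure 4 (middle and bottom waveforms), with the sign of the asynchrony selecting lag versus lead.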
Figure 5. Average intelligibility of IEEE sentences as a function of audio-video asynchrony.
Note the substantial plateau region, extending from 50 ms of audio lead to 200 ms of audio
delay, where intelligibility scores are high relative to the audio-alone or video-alone conditions.
Adapted from [28].
Figure 6. Example of waveform alignment for McGurk stimulus presentation. This example
shows the creation of visual /ka/ and acoustic /pa/, leading to the illusory percept /ta/. Shown
are the original acoustic waveform for the naturally produced audiovisual token /ka/ (top) and
the dubbed acoustic waveform /pa/ (bottom). The two waveforms are aligned at the onset of the
consonant burst.
Figure 7. Labeling functions for the McGurk stimulus visual /ka/ and acoustic /pa/ as a function
of audiovisual asynchrony (audio delay). Circles = probability of responding with the fusion
response /ta/; squares = probability of responding with the acoustic stimulus /pa/; triangles =
probability of responding with the visual response /ka/. Note the relatively long temporal
window (-50 ms audio lead to 200 ms audio lag) in which fusion responses are likely to occur.
Adapted from [29].
Figure 8. Same as Figure 7 but for the McGurk stimulus visual /ga/ and acoustic /ba/. The fusion
response is /da/ or /ða/. Adapted from [29].
Figure 9. Simultaneity judgments for congruent audiovisual consonant-vowel tokens /da/ and
/ta/ (filled symbols) and incongruent audiovisual McGurk tokens visual /ka/ - acoustic /pa/ and
visual /ga/ - acoustic /ba/ (open symbols) as a function of audiovisual asynchrony (audio delay).
Adapted from [29].
Figure 10. Summary of temporal windows obtained from the three different tasks (speech
intelligibility, synchrony identification, and synchrony discrimination) for audio and audiovisual
speech tokens.
ABSTRACT
Throughout his career, Ira Hirsh studied and published articles and books pertaining to
many aspects of the auditory system. These included sound conduction in the ear, cochlear
mechanics, masking, auditory localization, psychoacoustic behavior in animals, speech
perception, medical and audiological applications, coupling between psychophysics and
physiology, and ecological acoustics. However, it is Hirsh's work on the auditory timing of
simple and complex rhythmic patterns, the backbone of speech and music, that lies at the heart
of his more recent work. In this paper, we report on several aspects of temporal processing of speech
signals, both within and across sensory systems. Data are presented on perceived simultaneity
and intelligibility of auditory and auditory-visual speech stimuli where stimulus components are
presented either synchronously or asynchronously. Differences in the symmetry and shape of
temporal windows derived from these data sets are highlighted. Results show two distinct ranges
of temporal integration for speech processing: one relatively short window, about 40 ms, and the
other much longer, around 250 ms. In the case of auditory-visual speech processing, the temporal
window is highly asymmetric, strongly favoring conditions where the visual stimulus precedes
the acoustic stimulus.
LEARNING OBJECTIVES
1) To show the connection between Hirsh's work on the perception of temporal order and
recent work on the processing of asynchronous speech, both cross-spectrally and cross-
modally.
2) To compare and contrast the effects of temporal misalignment in auditory-alone and
auditory-visual speech processing.
KEY WORDS
Ira J. Hirsh, Temporal Order, Spectro-Temporal Integration, Temporal Speech Processing
ABBREVIATIONS
ms - milliseconds
CV - Consonant-Vowel stimuli
AxVy - An incongruent auditory-visual speech token in which the subject hears the speech token
x and simultaneously sees the speech token y.
TWI - Temporal Window of Integration
ADS - an Asymmetric Double Sigmoidal curve fit through the data.
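The ADS fit mentioned above can be illustrated as the product of a rising and a falling logistic with independent centers and slopes, which is what lets a single curve capture a temporal window whose audio-lead and audio-lag flanks differ. The following sketch is ours; the function name and example parameter values are illustrative, not the fitted values reported in these studies:

```python
import numpy as np

def ads(soa, amplitude, c1, w1, c2, w2):
    """Asymmetric double sigmoid: the product of a rising logistic
    (center c1, slope w1) and a falling logistic (center c2, slope w2).
    Because the two flanks are parameterized independently, the fitted
    window boundaries need not be symmetric about zero asynchrony."""
    rise = 1.0 / (1.0 + np.exp(-(soa - c1) / w1))
    fall = 1.0 - 1.0 / (1.0 + np.exp(-(soa - c2) / w2))
    return amplitude * rise * fall

# Example: a window with a steep flank near -50 ms (audio lead) and a
# shallower flank near +200 ms (audio lag); values chosen for illustration
soa = np.linspace(-500.0, 500.0, 1001)   # audio delay (ms)
p = ads(soa, 1.0, -50.0, 15.0, 200.0, 30.0)
```

In practice the window boundaries would be read off the fitted curve (e.g., where the function crosses some criterion level), with the left and right crossings free to fall at different distances from synchrony.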
CEU QUESTIONS
1) What was the area of Hirsh’s research that occupied his primary interest in the latter part of
his career?
A) Measurement of hearing
B) Sound reproduction
C) Auditory physiology
D) Temporal processing of simple and complex sounds
E) Animal psychophysics
2) The temporal window of integration for auditory-visual speech stimuli when the visual
stimulus leads the acoustic stimulus is roughly
A) 500 ms
B) 250 ms
C) 20 ms
D) 40 ms
E) 100 ms
3) For auditory-visual McGurk stimuli, as the acoustic and visual signals become more and
more “out of sync” the perceived stimulus is dominated by the
A) Visual stimulus
B) Acoustic stimulus
C) A noise-like stimulus that is part acoustic, part visual
D) There is no dominant response once the auditory and visual signals are “out of sync”
E) The subject’s response depends on their auditory and visual acuity
4) Spectro-temporal integration windows for auditory-only speech recognition are
A) Symmetrical and long (about 250 ms)
B) Symmetrical and short (about 40 ms)
C) Asymmetrical and long (about 250 ms)
D) Asymmetrical and short (about 40 ms)
E) Symmetrical, but the length varies between 100 and 200 ms depending on the subject.
5) The magnitude and shape of the temporal window of integration for auditory-visual speech
A) Depends on whether the audio or visual signal is leading
B) Is asymmetrical with a broad plateau region for visual leading conditions
C) Is about 40 ms when the audio signal leads the visual signal and about 250 ms when the
visual signal leads the audio signal
D) Is similar for both speech identification and synchrony detection
E) All of the above
AUTHOR BIOGRAPHIES
Ken Grant, Auditory-Visual Speech Recognition Laboratory, Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, DC, USA
Dr. Grant has focused on audio-visual speech processing for nearly 20 years. He received his B.A. from Washington University (Philosophy of Science) in 1976, an M.S. in Speech Science from the University of Washington (1980), and a Ph.D. in Communication Science from Washington University (1986). He was a post-doc in the Department of Electrical Engineering at the Massachusetts Institute of Technology between 1986 and 1990, and has been a research scientist at the Walter Reed Army Medical Center since 1990. His most recent work has focused on temporal properties of speech perception, with an emphasis on interactions between the auditory and visual modalities. http://www.wramc.amedd.army.mil/departments/aasc/avlab

Steven Greenberg, The Speech Institute, Oakland, CA, USA
Dr. Greenberg is a scientist based in California who has worked at the International Computer Science Institute, the University of California, Berkeley, and the University of Wisconsin. He received his A.B. in Linguistics from the University of Pennsylvania (1974) and a Ph.D. in Linguistics (1980) from the University of California, Los Angeles. His research has focused on auditory mechanisms underlying the processing of speech, as well as on machine-learning-based statistical methods for automatic phonetic and prosodic annotation of spontaneous American English dialogue material. He has organized numerous conferences and conference special sessions. http://www.icsi.berkeley.edu/~steveng

David Poeppel, Cognitive Neuroscience of Language Laboratory, University of Maryland, College Park, MD, USA
David Poeppel is from Munich, Germany. He studied neurophysiology and cognitive science at Bowdoin College and then MIT (B.S. 1989). He stayed at MIT and received a Ph.D. in cognitive neuroscience (1995), focusing on linguistics and the neural basis of speech perception and language comprehension. From 1995 to 1997 he was a post-doc at UCSF, learning the imaging techniques MEG and fMRI. Since 1998 he has been on the faculty at the University of Maryland, College Park, where he is an Associate Professor in the Department of Linguistics and the Department of Biology. The work in his lab focuses on problems ranging from temporal sensitivity and coding in human auditory cortex to speech perception and lexical semantics. http://www.ling.umd.edu/poeppel

Virginie van Wassenhove, Cognitive Neuroscience of Language Laboratory, University of Maryland, College Park, MD, USA
Virginie van Wassenhove received her B.S. in Neurophysiology from the University of Maryland, College Park in 1998, and her Ph.D. in the Neuroscience and Cognitive Science program from the University of Maryland, College Park in 2004. Her dissertation was conducted under the supervision of Dr. David Poeppel and Dr. Ken W. Grant at the Walter Reed Army Medical Center, Washington, DC. Her thesis work focused on the neural mechanisms underlying the integration of auditory-visual speech and on defining the distal stimulus parameters that constrain auditory-visual speech integration and the perceptual binding of cross-modal information in general. Her work used a psychophysical approach to investigate the temporal constraints of auditory-visual speech integration, whereas her electrophysiology work focused on the neural correlates of auditory-visual speech integration in the time domain, using electroencephalography and magnetoencephalography brain recording techniques. http://www.wam.umd.edu/%7Evvw/
Table I. Temporal window for efficient integration derived from speech identification, synchrony identification, and synchrony discrimination tasks. For audio conditions, window boundaries refer to the temporal onset of the mid-frequency channels (Slits 2 + 3) relative to the high- and low-frequency channels (Slits 1 + 4). For audiovisual conditions, window boundaries refer to the temporal onset of the audio signal relative to the video signal (negative values indicate audio leading conditions, positive values indicate video leading conditions).
Stimulus Materials    Task                      Modality     Left Boundary (ms)  Right Boundary (ms)  Temporal Window (ms)
TIMIT Sentences       Speech Identification     Audio        -30.9               22.1                 53.0
IEEE Sentences        Synchrony Discrimination  Audio        -7.7                10.4                 18.1
IEEE Sentences        Speech Identification     Audiovisual  -36.1               230.3                266.4
IEEE Sentences        Synchrony Discrimination  Audiovisual  -33.5               165.3                198.8
CV Syllables          Synchrony Identification  Audiovisual  -84.5               137.3                221.8
CV Syllables          Synchrony Discrimination  Audiovisual  -91.3               122.8                214.1
McGurk CV Syllables   Fusion Identification     Audiovisual  -27.3               152.4                179.7
McGurk CV Syllables   Synchrony Identification  Audiovisual  -47.9               115.4                163.3
McGurk CV Syllables   Synchrony Discrimination  Audiovisual  -67.6               115.0                182.6
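The Temporal Window column in Table I is simply the span between the two boundary values. A short sketch recomputing a few of the window widths (boundary values transcribed from Table I; the helper name is ours):

```python
# Boundary values (ms) transcribed from Table I: negative = audio (or
# mid-frequency channel) leading, positive = video (or flanking channel
# lag) leading.
boundaries_ms = {
    "TIMIT sentences, speech identification (audio)":         (-30.9, 22.1),
    "IEEE sentences, synchrony discrimination (audio)":       (-7.7, 10.4),
    "IEEE sentences, speech identification (audiovisual)":    (-36.1, 230.3),
    "IEEE sentences, synchrony discrimination (audiovisual)": (-33.5, 165.3),
}

def window_ms(left, right):
    """Temporal window of integration: right boundary minus left boundary."""
    return right - left

for condition, (left, right) in boundaries_ms.items():
    print(f"{condition}: {window_ms(left, right):.1f} ms")
```

The computed spans reproduce the table's rightmost column and make the auditory/audiovisual contrast discussed in the text explicit: roughly 20-50 ms for audio-alone conditions versus roughly 200-270 ms for audiovisual conditions.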
FIG 1 [figure residue: four 1/3-octave slits (CFs 334, 841, 2120, 5340 Hz); intelligibility values for slit combinations: 89%, 60%, 13%, 2%, 9%, 9%, 4%]

FIG 2 [figure residue: slit delays of 225, 250, and 275 ms; intelligibility values: 72%, 62%, 53%, 80%, 55%, 41%]

FIG 3 [figure residue: x-axis Slit Asynchrony (ms), -100 to 600; y-axis TIMIT Word Recognition (%), 0 to 100; curves for Bands 2 + 3 and Bands 1 + 4]

FIG 5 [figure residue: x-axis Audio Delay (ms), -500 to 500; y-axis IEEE Word Recognition (%), 0 to 100; reference lines for Audio Alone and Visual Alone]

FIG 7 [figure residue: x-axis Audio Delay (ms), -500 to 500; y-axis Response Probability, 0 to 1; response categories /pa/, /ta/, /ka/]

FIG 8 [figure residue: x-axis Audio Delay (ms), -500 to 500; y-axis Response Probability, 0 to 1; response categories /ba/, /da/ or /ða/, /ga/]

FIG 9 [figure residue: x-axis Audio Delay (ms), -500 to 500; y-axis Probability of Simultaneity, 0 to 1; tokens AdVd, AtVt, ApVk, AbVg]

FIG 10 [figure residue: temporal windows of integration; x-axis Stimulus Onset Asynchrony (ms), -100 to 300; Audio conditions: TIMIT Word Identification, IEEE Synchrony Discrimination; Audiovisual conditions: IEEE Synchrony Discrimination, IEEE Word Identification, McGurk Synchrony Discrimination, McGurk Synchrony Identification, McGurk Fusion Identification, CV Synchrony Discrimination, CV Synchrony Identification]