Journal of Memory and Language 44, 274–296 (2001)
doi:10.1006/jmla.2000.2753, available online at http://www.academicpress.com

How Listeners Compensate for Disfluencies in Spontaneous Speech

Susan E. Brennan
State University of New York at Stony Brook

and

Michael F. Schober
New School for Social Research

Listeners often encounter disfluencies (like uhs and repairs) in spontaneous speech. How is comprehension affected? In four experiments, listeners followed fluent and disfluent instructions to select an object on a graphical display. Disfluent instructions included mid-word interruptions (Move to the yel- purple square), mid-word interruptions with fillers (Move to the yel- uh, purple square), and between-word interruptions (Move to the yellow- purple square). Relative to the target color word, listeners selected the target object more quickly, and no less accurately, after hearing mid-word interruptions with fillers than after hearing comparable fluent utterances as well as utterances that replaced disfluencies with pauses of equal length. Hearing less misleading information before the interruption site led listeners to make fewer errors, and fillers allowed for more time after the interruption for listeners to cancel misleading information. The information available in disfluencies can help listeners compensate for disruptions and delays in spontaneous utterances. © 2001 Academic Press

Key Words: repairs; disfluencies; pauses; fillers; spontaneous speech; comprehension; parsing; paralinguistic cues.

This material is based on work supported by the National Science Foundation under Grants IRI9402167, IRI9711974, and SBR9730140. We are grateful to Elizabeth Shriberg, Arthur Samuel, Fred Conrad, Richard Gerrig, Charles Metzing, and three anonymous reviewers for their comments on an earlier draft as well as to Jonathan Bloom, Ricardo Carrion, Julia Kung, Angela Lawrence, Maria Malzone, Kimiko Ryokai, Leora Schefres, Darron Vanaria, and Maurice Williams for their assistance in preparing the stimuli and running the experiments.

Address correspondence and reprint requests to Susan E. Brennan, Department of Psychology, State University of New York, Stony Brook, NY, 11794-2500 or Michael F. Schober, Department of Psychology, AL-303, New School for Social Research, 65 Fifth Avenue, New York, NY, 10003. E-mail: [email protected] or schober@newschool.edu.

0749-596X/01 $35.00 Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.

Spontaneous human speech is notoriously disfluent. Speakers hesitate, interrupt themselves mid-phrase or mid-word, repeat or replace words, abandon phrases to start afresh, and season their talk with expressions like um, uh, or, I mean, and oh. A conservative estimate (excluding silent hesitations) for the rate of disfluencies in spontaneous speech is 6 words per 100 (Fox Tree, 1995). A similar rate of 5.97 per 100 words was found in a corpus of speech by young, middle-aged, and older people, married and strangers, about familiar and unfamiliar topics (Bortfeld, Leon, Bloom, Schober, & Brennan, in press). Disfluency rates may be even higher in certain content domains (Schachter, Christenfeld, Ravina, & Bilous, 1991), although they appear to be lower in speech directed at machines (Oviatt, 1995).

Disfluent speech poses a continuation problem for listeners (Levelt, 1989), who must edit out disfluencies in order to make sense of speakers’ utterances. Consider a hypothetical listener who hears Call Don Harw- I mean, Harrison. Recovering a fluent version requires determining that there is a problem with the utterance, what the problem is, and how to repair it (including how far to back up to replace corrected information); to do this, the parser must identify three adjoining intervals, the reparandum, the edit interval, and the repair interval (see Levelt, 1983; Nakatani & Hirschberg, 1994). The reparandum contains fluent speech up until the interruption site, where the speaker leaves off speaking fluently. In the case of an overt error



repair, this interval contains material that the speaker finds problematic (in our example, Harw-). The edit interval begins at the interruption site and ends with the onset of the repair (here, I mean serves as an editing expression or explicit comment that the speaker has made an error; Hockett, 1967; Levelt, 1989). The repair interval may or may not retrace material from the reparandum (in our simple example, the repair Harrison does not include any retracing). Determining these intervals is not a trivial task, as developers of machine speech recognition systems have discovered (Core & Schubert, 1999; Nakatani & Hirschberg, 1994; Shriberg, Bear, & Dowding, 1992).
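The three-interval description can be made concrete with a small sketch. The `Repair` class and its field names are hypothetical, introduced only for illustration; the segmentation of the example is given by hand, which sidesteps the hard part (finding the intervals automatically).

```python
from dataclasses import dataclass


@dataclass
class Repair:
    """One self-repair, hand-segmented into its adjoining intervals.

    All field names are illustrative assumptions, not terms from any library.
    """
    before: list[str]      # fluent words preceding the problematic material
    reparandum: list[str]  # material the speaker finds problematic
    edit: list[str]        # editing expression or filler (may be empty)
    repair: list[str]      # material that replaces the reparandum
    after: list[str]       # fluent words following the repair

    def heard(self) -> str:
        """The utterance as the listener actually hears it."""
        parts = self.before + self.reparandum + self.edit + self.repair + self.after
        return " ".join(parts)

    def fluent(self) -> str:
        """The fluent version: drop the reparandum and the edit interval."""
        return " ".join(self.before + self.repair + self.after)


# The hypothetical utterance from the text, segmented by hand.
example = Repair(
    before=["Call", "Don"],
    reparandum=["Harw-"],
    edit=["I", "mean,"],
    repair=["Harrison"],
    after=[],
)

print(example.heard())
print(example.fluent())
```

Recovering the fluent version is trivial once the intervals are known; the listener's (or parser's) real problem is locating them in the first place.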

Evidence about how people process spontaneous speech (with all its natural disfluencies) is scant, as most studies of human speech comprehension have focused on fluent, idealized utterances. The tacit assumptions of these studies seem to be that (1) disfluencies uniformly present obstacles to comprehension and (2) disfluencies need to be excluded in order to study comprehension in its “purest” form. We question both of these assumptions. Regarding the first assumption, some disfluencies may actually present information that listeners can use to help compensate for what might otherwise hinder processing, as we discuss shortly. As H. H. Clark (1994, 1996) has argued, people have a number of resources for managing the process as well as the content of conversation. That is, not only do they produce and interpret utterances that address the main purposes at hand, but they produce informative secondary (or paralinguistic) signals about the utterances themselves.

As for the second assumption, studying comprehension under normal “noisy” conditions should be just as important as studying it in “pure” conditions. As Fox Tree (1995) has noted, many current approaches to parsing the structure and interpreting the meaning of incoming utterances are built on data about sanitized utterances and cannot handle disfluencies at all. Successful comprehension of spontaneous, disfluent speech is particularly intriguing given the fact that much of the time, listeners don’t experience disfluencies as disruptive, and when they do detect disfluencies, they have trouble categorizing or locating them precisely (see Bond & Small, 1984; Cooper, Tye-Murray, & Nelson, 1987; Ferber, 1991; Fromkin, 1973; Laver, 1973; Martin & Strange, 1968; Tent & Clark, 1980). Instead, listeners make the appropriate parsing decisions, solve the continuation problem, and interpret speakers’ intentions without much apparent difficulty. In the studies that follow, we examine how disfluencies, ubiquitous as they are in spontaneous speech, affect and inform comprehension processes.

HOW MIGHT SPEECH DISFLUENCIES INFORM COMPREHENSION?

As many speech production studies have shown (e.g., Dell, 1986; Fromkin, 1973; Garrett, 1975; Levelt, 1989; Smith & Clark, 1993), spontaneous speech is systematically shaped by the problems speakers encounter while planning messages, retrieving lexical items, and articulating a speech plan. For instance, analyses of corpora show that interruptions are usually located very close to the word that is the source of the trouble, since speakers monitor their speech closely and tend to interrupt themselves just as soon as they detect trouble (Levelt, 1989; Nooteboom, 1980; although see Blackmer & Mitton, 1991, who argue that when interruptions are followed immediately by repairs, the trouble must have been detected before the interruption). If the problem is detected after a troublesome word has already been uttered, the speaker is somewhat more likely to finish the current word (but not the current phrase) before stopping to repair it (Levelt, 1983, 1989). Levelt found that when an interruption occurs mid-word, the problem is with the interrupted word itself (Levelt, 1983, 1989); so an interrupted word may signal what a speaker does not mean.

Another systematicity within speech disfluencies is in the occasional use of editing expressions like I mean, sorry, or that is. Sixty-two percent of the repairs in Levelt’s (1983) corpus of spontaneous task-oriented utterances included some type of editing expression. The most common of these was er (corresponding to what many American English speakers pronounce as uh), occurring in 30% of repairs. Er seemed to mark those repairs made very early; the most frequent use of er in Levelt’s corpus was in covert repairs (that is, er filled a hesitation in which no problematic word was actually uttered). Often, er was used immediately after an interruption, as in Left- er- right in front of me (Levelt, 1989, p. 484), and its use declined as the interruption occurred further away from the source of the problem. In other words, er was more likely to be used the earlier a problem was detected.

While linguists and psycholinguists have considered disfluencies largely from a production standpoint, computational linguists have considered them from a recognition standpoint, often with the goal of improving machine recognition of spontaneous speech. This work has focused on characterizing disfluencies in speech corpora, identifying potential cues that disfluencies have occurred, identifying repairs and reparanda, and comparing the performance of algorithms based on various combinations of cues. Hindle (1983), for instance, originally suggested that an “edit signal” serves as a cue that fluent speech has been interrupted. Although no evidence for a single such cue has been found (Bear, Dowding, & Shriberg, 1992; Lickley & Bard, 1998; Nakatani & Hirschberg, 1994), several corpus studies have found combinations of cues that could be used by algorithms to identify disfluencies and repairs with reasonable success (Bear, Dowding, & Shriberg, 1992; Nakatani & Hirschberg, 1994; Shriberg et al., 1992; Shriberg, 1999). These potential cues include word or syllable lengthening, interrupted words, glottalization, laryngealization, silent pauses in the edit interval, fillers and other editing expressions, and increased stress on a repair word versus a reparandum word.

Our question is how human listeners cope with disfluent input. A handful of linguists and psycholinguists have looked at disfluencies from the listener’s point of view, proposing that cues such as interruptions, hesitations, and prosody in spontaneous speech may help listeners solve the continuation problem and that perhaps listeners can infer speakers’ intentions by exploiting regularities in the distributions of types of speech errors. Lickley and Bard (1998) used a word gating paradigm (increasing the length of utterances one word at a time while playing them over and over to listeners) to discover how much information is necessary for detecting disfluency. They found that nearly 80% of the disfluencies in their corpus were detectable (rated “disfluent” or “maybe disfluent”) at the first word gate in a repair, sometimes even before lexical access of that word had occurred. Fox Tree’s (1995) studies of listeners in Dutch and English demonstrated that while some fresh starts slowed word-monitoring performance, repetitions did not. Brennan and Williams’ (1993) studies, using as stimuli spontaneous answers to general knowledge questions, showed that listeners were able to use the information available in pauses of various lengths and in fillers (um and uh) vs silent pauses to correctly judge how confident the speakers were in their answers to the questions. And in a study of synthesized speech played to listeners, pauses before repairs and stress on repair words led to judgments of higher comprehensibility and helped listeners initiate repetitions of the disfluent utterances more quickly (Howell & Young, 1991).

While these studies represent first steps toward discovering how listeners use the information in spontaneous disfluent speech, the tasks (gating and judging if disfluencies were present, monitoring for words, rating answers, choosing which of two disfluent utterances with and without a particular feature was more comprehensible, and repeating a fluent version of a just-heard disfluent utterance) differ from what listeners do when they hear spontaneous speech. One goal of the present studies was to examine the processing of disfluencies in a comprehension task that is closer to what people do in everyday language use.

If the forms of some disfluencies bear useful information, there should be situations in which a target word in a disfluent utterance is more quickly comprehended (without loss of accuracy) than in a comparable utterance in which the disfluency, or one of its features, is absent. Consider a situation in which there are two objects mutually known to a speaker and a listener (say, the purple mug and the yellow mug). If the speaker begins to name one and then stops and names the other (hand me the pur- uh, yellow mug), the interruption (pur- uh), which displays what the speaker did not intend (purple), might tip the listener off early as to what the speaker does intend (yellow). In this situation the listener may be faster to recognize the speaker’s intentions to refer to a target object in a disfluent utterance than in a comparable, more fluent utterance (for instance, hand me the <pause> yellow mug).

For the studies that follow, we set up just such a situation: Listeners tried to pick out a unique referent on a display in response to fluent and disfluent versions of the same utterances. They heard disfluent utterances spontaneously produced by one speaker, like Move to the orange- purple circle or Move to the pur- uh, yellow square, and we measured their comprehension by seeing how quickly and accurately they could press a key corresponding to the target square. We used spontaneously produced disfluencies because of the possibility that artificially created or performed ones would sound unnatural (Fox Tree, 1995). Listeners also heard three kinds of relatively fluent comparison utterances. The most obvious comparison is with (1) spontaneous fluent utterances by the same speaker, as in Move to the yellow square. Because such a fluent utterance may differ from a disfluent utterance in prosody or other uncontrolled characteristics, we also created (2) disfluency-excised controls in which the disfluency (in this case, pur- uh) was electronically excised from a copy of the disfluent utterance and the remaining parts “zipped up” to create Move to the yellow square. Because both the fluent and disfluency-excised utterances are shorter in duration than the disfluent utterances, we also created (3) controls in which the disfluency was removed from a copy of the utterance and replaced with a pause of equal length, as in Move to the <pause> yellow square. Any comprehension difference for a disfluency may be due to the form of the disfluency or it could be due simply to timing. The disfluency-replaced-by-pause controls thus provide the most stringent comparison because they hold time constant. These and the disfluency-excised controls have the additional advantage that they vary disfluency within the same token of an utterance.

In Experiments 1 and 2 we examine two hypotheses (not mutually exclusive) about how certain features of disfluencies may support the repair process in comprehension such that the net effects of disfluencies are not harmful. Hypothesis 1 is that mid-word interruptions (like yell- orange) are better signals than between-word interruptions (like yellow- orange) that a word was produced in error and that the speaker intends to replace it. This follows Levelt’s (1989, p. 481) proposal that “by interrupting a word, a speaker signals to the addressee that that word is an error. If a word is completed, the speaker intends the listener to interpret it as correctly delivered.” Hypothesis 2 is that interruptions marked by the filler uh (like yell- uh orange) are better error signals than interruptions without uh (like yell- orange). This follows Levelt’s (1989, p. 481) proposal that an editing expression like er or uh may “warn the addressee that the current message is to be replaced.” These two hypotheses are not in competition; they simply specify two kinds of features that may help in monitoring and repair.

The logic of our studies, then, is that certain cues in disfluent speech may be informative to listeners (as opposed to being simply noise to be filtered out). If so, then there should be faster and no less accurate comprehension of target words following disfluencies than target words in utterances in which these cues are absent. On the other hand, if disfluencies lack compensatory features, then listeners should choose targets more slowly and less accurately after disfluent utterances than after the controls. If Hypothesis 1 is correct, mid-word interruptions should facilitate comprehension more (or harm comprehension less) than between-word interruptions. If Hypothesis 2 is correct, interruptions marked by uh should facilitate comprehension more (or harm comprehension less) than those that are not. Experiment 1 compares response times and error rates for three types of disfluent utterances to three types of edited and natural controls in a logically constrained context, in which listeners select objects from two-object referent arrays. Experiment 2 decreases the informativeness of the disfluency by increasing the number of potential referents. Experiment 3 examines whether the disfluency advantage found in Experiments 1 and 2 is due to the phonological or temporal characteristics of fillers, and Experiment 4 examines the effect when an interruption does not always predict a replacement.

COLLECTION OF SPONTANEOUS FLUENT AND DISFLUENT UTTERANCES FOR EXPERIMENTS 1–4

We began by collecting a database of spontaneous utterances to use as stimuli. A volunteer who was a male native speaker of English and naive to our purposes produced the utterances. He viewed a series of simple computer displays of geometric objects (orange, yellow, purple, red, green, and blue squares and circles) shown three at a time and arranged horizontally on the screen against a white background. After a display of three objects appeared, there was an orienting tone and one object was highlighted. The speaker’s task was to say out loud: Move to the <highlighted object>. Between trials, the screen was blank. After 10 practice trials, the speaker described 150 displays in each of three sessions.

Meanwhile, the experimenter who had instructed the speaker sat at a similar display in the same room and carried out his instructions by moving her cursor to the target object (neither could see the other’s display). She interacted freely with the speaker (e.g., by saying “okay” when she had moved). This minimal feedback proved necessary in order to elicit spontaneous and natural-sounding utterances from the speaker. (In our pilot attempts to collect instructions with no addressee present, speakers appeared to fall into a routine and produced utterances with the mechanical-sounding intonation characteristic of reading a list.)

We elicited disfluent utterances using a procedure modified from van Wijk and Kempen (1987). Our speaker was aware that the highlight might jump to another object and was instructed to update his instructions in those cases as quickly as he could. In about two-fifths of the trials, the highlight suddenly jumped to another object, and the rest of the time it remained on the same object. When the highlight remained the same, about one-third of the time it flickered slightly (as if it might jump). Meanwhile, a second experimenter monitored the speaker’s utterances in another room and controlled the timing of the highlights using a remote keyboard attached to the speaker’s computer in order to elicit disfluencies and self-corrections from the speaker. The speaker was unaware of the presence of the second experimenter until he was debriefed. The second experimenter hit a key during each trial, usually while the speaker was speaking, which sometimes made the highlight jump to another object. The speaker produced disfluencies or self-corrections in about 34% of the trials. These were produced not only in response to the changing highlight, but also apparently because of interference; the displays appeared in rapid succession and were quite similar. About three-quarters of the displays involved objects with two-syllable colors (yellow, purple, or orange) to increase the likelihood of the speaker interrupting himself mid-word.

Since our goal in this project was to study the effects of disfluencies upon listeners, we used only one speaker so that listeners would not need to calibrate to different voices (see Nygaard, Sommers, & Pisoni, 1994) and so that their task would be less unnatural than if they had several (virtual) partners.

Stimuli

The audiotapes of fluent and disfluent spontaneous utterances were transcribed and checked. The utterances were then digitized using Signalyze for the Macintosh. Those that mentioned objects with two-syllable colors (yellow, purple, or orange) were categorized as (1) 27 mid-word interruptions followed immediately by replacement of a color word, such as Move to the yel- purple square; (2) 12 mid-word interruptions followed by uh and then the replacement of a color word, such as Move to the yel- uh, purple circle; (3) 30 between-word interruptions followed immediately by replacement of color words, such as Move to the yellow- purple square; (4) 195 fluent utterances; and (5) other utterances, which included all directions that did not fall into the first four categories. These consisted of restarts that involved repeating parts of phrases, such as Move to the yel- to the purple square, as well as utterances with more than one restart or with a restart of the shape word, as in Move to the yel- orange cir- square. Fluent and disfluent utterances involving one-syllable color words (red, green, and blue) were not categorized or used. Both authors and a research assistant categorized the utterances. Disfluent utterances from the three disfluency categories that had been difficult to categorize or that were associated with any background noise were eliminated from further consideration. This left 14 between-word interruptions, 14 mid-word interruptions, and 6 mid-word interruptions with fillers, for a total of 34 Naturally Disfluent utterances. These served both as the critical stimuli and as the basis for the digitally edited controls. Two involved the speaker’s replacing yellow with orange, 12 orange with yellow, 8 yellow with purple, 5 purple with yellow, 5 orange with purple, and 3 purple with orange. We chose a total of 68 utterances from the 195 fluent utterances to serve as fillers and as natural controls. These Fluent utterances were chosen randomly, but such that their color words were in roughly the same relative proportions as the second (target) color words in the Naturally Disfluent utterances (there were 12 orange, 24 purple, and 32 yellow objects mentioned in the Fluent controls).

In order to vary fluency within utterances, we made digital copies of each of the 34 spontaneous disfluencies and used SoundEdit Pro for the Macintosh to edit them. For each disfluency, we created a Disfluency-replaced-by-pause version by replacing the first color word (and any filler associated with it) with a silent pause of equal length. The silent pause contained the same ambient room sound as the material it was replacing. Recall that the purpose of these items was to enable us to examine whether an interrupted word is more informative about a speaker’s intention than a silent pause. In addition, we created a Disfluency-excised version by removing the interrupted color word and any associated filler, shortening the utterance by an equal amount (exactly equal to the length of the pause inserted into the Disfluency-replaced-by-pause version and with the same edit points).
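The two edits share one pair of edit points and differ only in what fills the gap. The sketch below illustrates this on a toy "waveform" in which each sample is just a label; the function names and the toy data are assumptions made for illustration and say nothing about how SoundEdit Pro was actually operated.

```python
def replace_with_pause(samples, start, end, ambient):
    """Swap samples[start:end] for ambient room sound of equal length."""
    span = end - start
    # Tile the ambient recording so it covers the edited span exactly.
    fill = [ambient[i % len(ambient)] for i in range(span)]
    return samples[:start] + fill + samples[end:]


def excise(samples, start, end):
    """Delete samples[start:end] and join the remaining parts."""
    return samples[:start] + samples[end:]


# Toy utterance: "Move to the pur- uh, yellow square", a few samples per chunk.
utterance = ["move_to_the"] * 3 + ["pur-"] * 2 + ["uh"] * 2 + ["yellow"] * 3
start, end = 3, 7            # edit points around "pur- uh" (hypothetical)
ambient = ["room_tone"]      # stand-in for recorded room sound

paused = replace_with_pause(utterance, start, end, ambient)
excised = excise(utterance, start, end)

# The pause version keeps the original duration; the excised one is shorter
# by exactly the length of the edited span.
print(len(utterance), len(paused), len(excised))
# The target word starts at the same position in the disfluent and pause
# versions, and earlier in the excised version.
print(utterance.index("yellow"), paused.index("yellow"), excised.index("yellow"))
```

Because both controls are cut at the same points, any difference between them and the disfluent original can be attributed to the removed material or to elapsed time, not to where the cut fell.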

Finally, for each utterance, a reference point for the target color word was determined using SoundEdit Pro so that response latencies relative to the onsets of the target word could be calculated. These points were located as precisely as possible at the earliest point where information about the onset of the target color word (yellow, purple, or orange) was available. For each trio based on the same utterance (Naturally Disfluent, Disfluency-excised, and Disfluency-replaced-by-pause), the reference points were exactly the same with respect to the target color word, since the members of a trio were based on the same spontaneously disfluent utterance. Reference points for the onsets of each of the 68 Fluent controls were measured independently.
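A minimal sketch of the latency computation follows, assuming times are measured in milliseconds from the start of each sound file. The function names, version labels, and all numbers are hypothetical; only the logic (one reference point per trio, shifted by the excised span in the shortened version) comes from the description above.

```python
def target_onset_ms(original_onset_ms, edit_span_ms, version):
    """Reference point for the target color word, from utterance start.

    The trio shares one reference point relative to the target word itself;
    measured from the start of the file, only the excised version shifts.
    """
    if version in ("naturally_disfluent", "replaced_by_pause"):
        return original_onset_ms
    if version == "excised":
        return original_onset_ms - edit_span_ms
    raise ValueError(f"unknown version: {version}")


def latency_ms(keypress_ms, onset_ms):
    """Response latency relative to the onset of the target word."""
    return keypress_ms - onset_ms


# Hypothetical trio: target word begins 1400 ms into the disfluent original,
# and the edited span (interrupted word plus filler) lasts 600 ms.
onsets = {
    v: target_onset_ms(1400, 600, v)
    for v in ("naturally_disfluent", "replaced_by_pause", "excised")
}

# A keypress 1850 ms after the start of the disfluent original.
print(latency_ms(1850, onsets["naturally_disfluent"]))
```

Measuring from the target word rather than from utterance onset is what allows utterances of different total length to be compared on the same scale.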

In all, the main set of stimulus utterances consisted of 34 spontaneously disfluent utterances, 34 edited Disfluency-replaced-by-pause versions, 34 edited Disfluency-excised versions, and 68 spontaneous Fluent utterances, for a total of 170 utterances. So one-fifth of the items contained an overt lexical disfluency and four-fifths did not. Since each utterance (e.g., move to the orange square) consisted of 5–6 words (counting the word fragments in the disfluencies), this amounts to a disfluency rate of less than 4 per 100 words (less than Fox Tree’s, 1995, and Bortfeld et al.’s, in press, estimated rate of approximately 6 disfluencies per 100 words). Three-fifths of the utterances in the set were spontaneously produced and two-fifths were digitally edited. In Experiments 1 and 2, only 6 utterances (< 4% of all utterances) contained fillers; in Experiments 3 and 4, only 12 and 13, respectively, contained fillers (< 7%).
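The proportions in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the counts stated above; the function name is an illustrative assumption, and the words-per-utterance values are the two ends of the stated 5–6 word range rather than measured lengths.

```python
def rate_per_100_words(disfluencies, utterances, words_per_utterance):
    """Disfluencies per 100 words, assuming a uniform utterance length."""
    return 100 * disfluencies / (utterances * words_per_utterance)


disfluent = 34
edited = 34 + 34            # replaced-by-pause and excised versions
fluent = 68
total = disfluent + edited + fluent

print(total)                # size of the main stimulus set
print(disfluent / total)    # share of items with an overt lexical disfluency
print(edited / total)       # share of items that were digitally edited

# The rate falls between these two bounds, depending on utterance length.
for words in (5, 6):
    print(words, round(rate_per_100_words(disfluent, total, words), 2))

# Share of utterances containing fillers in Experiments 1 and 2.
print(round(100 * 6 / total, 2))
```

Under these assumptions the rate lies between roughly 3.3 and 4.0 per 100 words, reaching 4.0 only if every utterance had exactly five words, which is consistent with the stated figure of less than 4.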

All the disfluent utterances had their interruption sites within or immediately following the word to be repaired. These sorts of disfluencies are relatively frequent in spontaneous speech; for instance, 69% of the interruption sites in Levelt’s (1983) corpus occurred within (18%) or immediately after (51%) problem words. Interrupted words appear to be even more common in repairs of speech directed at machines (Bear et al., 1992, found 60% of repairs in their corpus contained word fragments, and Nakatani & Hirschberg, 1994, found 73%).

Prosodic Characteristics of the Stimuli

Pitch accents increase the prominence of words in speech; they can convey that something is new rather than given or that it contrasts with other information mentioned in an utterance (Pierrehumbert & Hirschberg, 1990). Repair words that supply new semantic content are stressed (with a higher F0) relative to the words they are correcting (Bear et al., 1992; Howell & Young, 1991; Levelt & Cutler, 1983; O’Shaughnessy, 1992; Shriberg et al., 1992), while repeated words are not (Bear et al., 1992; O’Shaughnessy, 1992). To characterize our stimuli, we measured F0 peaks on each color word (see Table 1). Target color words in Naturally Disfluent utterances had higher F0s than those in Fluent utterances, t(100) = 8.04, p < .001. And in the Naturally Disfluent utterances, the F0 of (target) color words in repairs averaged 19.2 Hz higher than the F0 of color words in reparanda, t(34) = 9.47, p < .001. This difference was consistent for all three types of disfluencies. There was no difference in F0 between the first (reparandum) color words in Naturally Disfluent utterances and those in Fluent utterances.

TABLE 1

Stimuli: F0 of Color Words (in Hertz)

Stimuli                             Reparandum color word    Target color word
Fluent instructions                                          104.0
Naturally disfluent instructions
  Between-word                      102.8                    125.1
  Mid-word                          101.3                    117.6
  Mid-word w/filler                 102.8                    124.7
  Mean (naturally disfluent)        102.8                    122.0

Ratings of Edited and Nonedited Stimuli

Before presenting the stimuli to listeners as instructions to follow in a comprehension task, we needed to determine whether listeners could hear the electronic edit points in the edited controls. We also wanted to determine whether the Disfluency-excised and Fluent utterances were perceptually distinguishable (since lexically and syntactically they were exactly the same). Sixteen undergraduate Stony Brook students (11 women and 7 men) listened to the set of 170 utterances and judged the degree to which they sounded natural or electronically edited. The students, native speakers of English, volunteered in exchange for research credit in a psychology course.

Each utterance was played once in a different random order for each listener, using SuperLab on a Macintosh computer. After each utterance, listeners took as much time as necessary to make a rating on a 7-point scale where the left endpoint was labeled (1) Edited, the midpoint was labeled (4) ????, and the right endpoint was labeled (7) Natural. Listeners participated alone and made their ratings using the top row of number keys (1–7) on the keyboard. Particular care was taken to instruct listeners about what was meant by Natural and Edited, with the instructions: “Some of the utterances will be played naturally, that is, exactly as the speaker spoke them, and others have been electronically edited, that is, changed using the computer afterward.” Since we wanted to know whether the electronic edit points were detectable, we needed listeners to judge whether utterances sounded as if they had been edited electronically as opposed to whether the speaker himself corrected (or “edited”) them as he was speaking. Mean naturalness ratings were calculated on a scale of 1 (edited) to 7 (natural).

Ratings are shown in Table 2. There was no evidence that listeners were able to detect the electronic edits. Disfluency-replaced-by-pause utterances were rated as no more edited than Naturally Disfluent (nonedited) utterances, t1(15) = .67, ns; t2(33) = 1.21, ns. This suggests that Disfluency-replaced-by-pause utterances are appropriate controls for testing the effects of the presence and absence of a disfluency within the same utterance. Although they were worded identically, Fluent and Disfluency-excised utterances were not rated alike by listeners; Fluent utterances were rated as more natural than Disfluency-excised ones, t1(15) = 5.04, p < .001; t2(100) = 14.90, p < .001. Disfluency-excised utterances may have sounded less natural because of interrupted intonation contours (as opposed to any audible edits).

Even though none of the Naturally Disfluent utterances contained electronic edits, mid-word
interruptions with fillers were rated as less edited/more natural than the other two types considered together, planned comparison, F1(1, 15) = 9.53, p < .007; t2(31) = 4.69, p < .001. Fluent utterances were rated as only slightly (.77) more natural than mid-word interruptions with fillers, significant by items but not by subjects, t1(15) = 1.28, p = .22; t2(72) = 4.28, p < .001.

TABLE 2

Stimuli: Ratings of Edited and Nonedited Instructions (1 = Edited, 7 = Natural)

Disfluency type              Naturally disfluent    Disfluency-replaced-by-pause    Disfluency-excised    Fluent
Between-word (n = 14)        3.88                   3.80                            3.75
Mid-word (n = 14)            3.67                   3.62                            3.50
Mid-word w/filler (n = 6)    4.86                   4.22                            4.23
Mean                         3.97                   3.80                            3.75                  5.63

Disfluency-replaced-by-pause utterances appear to be the most appropriate edited controls for Naturally Disfluent utterances for three reasons: they vary disfluency within items, there are identical F0s and delays before the target words as with Naturally Disfluent utterances, and both these types of utterances were judged as equally electronically nonedited. However, we included the Disfluency-excised and Fluent utterances as additional controls in case the pauses within Disfluency-replaced-by-pause controls had any detrimental effects on comprehension. Neither Disfluency-excised nor Fluent controls had any lexical disfluency or hesitation; target words in Disfluency-excised utterances had higher F0s than those in Fluent utterances, while Fluent
utterances were judged as less edited.

EXPERIMENT 1

Methods

Participants. The participants were 50 undergraduate students (37 women and 13 men) from the State University of New York at Stony Brook who volunteered in exchange for research credit in a psychology class. None had participated in generating or rating the stimuli, and all identified themselves as native speakers of English (those who were bilingual had learned English before the age of 8). Data from 5 students who produced a large total number of errors (placing them distinctly outside the distribution of the other participants) were replaced. Those replaced made errors approximately 2 SDs (or more) above the mean of the others. One additional student was replaced for not following instructions.

Design and stimuli. Each listener heard the same set of Naturally Disfluent, Disfluency-replaced-by-pause, Disfluency-excised, and Fluent utterances (two of the 68 Fluent controls were omitted due to a programming error). So approximately 80% of the utterances contained no lexical disfluency. Of the 20% that did, 14 contained between-word interruptions before the repair, 14 contained a mid-word interruption, and 6 contained a mid-word interruption and a filler. There were two versions of the experiment (A and B) for counterbalancing purposes; if the target object for an utterance appeared on the left-hand side in List A, it appeared on the right in List B and vice versa. Half the listeners received each list.

Procedure. Stimuli were presented on a Macintosh Quadra computer by the SuperLab program and a Sony speaker. Two keys on the keyboard were used for input. Participants viewed a series of displays of pairs of geometric objects (either two squares or two circles) arranged horizontally on the screen; with each display, they heard an utterance instructing them about a target object. They were instructed to press the key on the same side as the object intended as the target by the speaker. When a key was pressed, the object corresponding to that key was highlighted on the screen. Listeners were told that they should press the correct key as quickly as possible, and if they pressed the wrong one, they should then press the correct one as quickly as possible (and that the next trial would not begin until they pressed the correct one). One second after the correct key was pressed, the next trial began. Half of the time, the target object appeared on the right and half on the left. After 15 practice trials, the 170 experimental and control trials began, appearing in a different random order for each listener. Listeners were told to respond as quickly and as accurately as possible.

Responses and response times were collected for each trial. Time was measured from the start of each utterance until the correct key was pressed. Response times were then calculated relative to the correct target color words by subtracting the time of onset of the target color word (established earlier in order to provide a reference point) from the response time for that trial. Response times from trials in which responses were incorrect or times were greater than approximately 3 SDs from the mean (1400 ms) were discarded.

Results and Discussion

For this experiment and the ones that follow, we first computed a difference score for each item within which disfluency was varied (via an edited and intact version of the same token). We did this by subtracting the item’s Naturally Disfluent response time from its Disfluency-replaced-by-pause response time whenever both responses were correct. These difference scores represent a disfluency advantage whenever they were positive, that is, when the presence of a disfluency led to faster response than its absence (in the form of a pause of equal length) in the same utterance. We examined these difference scores using a repeated-measures ANOVA whose factors were disfluency type (between-word interruption, mid-word interruption, and mid-word interruption with filler) and side of the target object.1 Planned comparisons were used to examine the impact of the interruption point (between vs mid-word) as well as the impact of the filler (mid-word interruptions with vs without fillers). The same planned comparisons were used to examine error rates in an additional repeated-measures ANOVA comparing the three kinds of disfluency types. An additional ANOVA was used to compare responses for Naturally Disfluent utterances to those for nonedited Fluent controls. Table 3 shows response times and error rates for each type of utterance.

TABLE 3

Experiment 1 (Two-Object Displays): Response Times in Milliseconds from Onset of Target Color Word (Error Rates in Parentheses)

Disfluency type      Naturally disfluent    Disfluency-replaced-by-pause    Disfluency-excised    Fluent
Between-word         682 (.19)              667 (.02)                       702 (.02)
Mid-word             673 (.11)              675 (.03)                       707 (.01)
Mid-word w/filler    617 (.05)              695 (.03)                       726 (.02)
Mean                 666 (.13)              675 (.02)                       708 (.02)             719 (.02)

1 Target side did not affect response times for Naturally Disfluent, Disfluency-replaced-by-pause, and Disfluency-excised utterances. For Fluent utterances, right-side targets were 30 ms slower than left-side targets, F1(1, 49) = 20.00, p < .001, F2(1, 65) = 15.69, p < .001. This may be due to a tendency for listeners to scan the display from left to right. Since side was counterbalanced across the two versions of the experiment, we do not report it further.

Hypothesis 1 predicted a comprehension advantage for target words after mid-word versus between-word interruptions. This hypothesis was supported by a difference in error rates: mid-word interruptions without fillers led to lower error rates than between-word interruptions, F1(1, 49) = 17.67, p < .001; t2(31) = 3.35, p = .002. Interrupting the reparandum mid-word prevented listeners from committing themselves to the wrong interpretation. However, Hypothesis 1 was not supported by the response-time difference scores: those for target
words after between-word interruptions were no different (15 ms slower than Disfluency-replaced-by-pause controls) than those for target words after mid-word interruptions (2 ms faster than Disfluency-replaced-by-pause controls), planned comparison, F1(1, 48) = .03, ns, t2(31) = .20, ns.
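
The difference-score logic can be illustrated with the cell means reported in Table 3. The sketch below works from condition means rather than from the per-item scores that entered the analysis, so it only reproduces the mean differences mentioned in the text; the dictionary and function names are illustrative.

```python
# Mean response times (ms) from Table 3, Experiment 1.
NATURALLY_DISFLUENT = {
    "between-word": 682,
    "mid-word": 673,
    "mid-word w/filler": 617,
}
REPLACED_BY_PAUSE = {
    "between-word": 667,
    "mid-word": 675,
    "mid-word w/filler": 695,
}


def disfluency_advantage(pause_rt: int, disfluent_rt: int) -> int:
    """Positive values mean the disfluent version was answered faster."""
    return pause_rt - disfluent_rt


advantages = {
    kind: disfluency_advantage(REPLACED_BY_PAUSE[kind], NATURALLY_DISFLUENT[kind])
    for kind in NATURALLY_DISFLUENT
}
for kind, adv in advantages.items():
    print(f"{kind}: {adv:+d} ms")
```

The resulting values (-15, +2, and +78 ms) match the figures cited for between-word interruptions, mid-word interruptions, and mid-word interruptions with fillers.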

As for Hypothesis 2, which predicted that marking interrupted words with fillers would be helpful, there was a clear response time disfluency advantage. Difference scores for mid-word interruptions with fillers (78 ms) were higher than those without (2 ms), supporting the idea that fillers may signal the replacement of an interrupted word, planned comparison, F1(1, 48) = 15.76, p < .001; t2(31) = 3.75, p = .001. This response-time advantage is particularly interesting because it occurred with relatively little cost in accuracy: Listeners were inaccurate only .05 of the time after mid-word interruptions marked with fillers, much less than with mid-word interruptions not so marked, F1(1, 49) = 13.53, p = .001; t2(31) = 2.05, p < .05, and at a rate only marginally higher than after Fluent utterances, F1(1, 49) = 3.47, p < .07; t2(96) = 1.62, ns. The implication is that while disfluent utterances can be misleading, those with fillers are less so.

Listeners chose correct targets after Naturally Disfluent utterances 53 ms faster than after Fluent ones, F1(1, 49) = 20.83, p < .001; F2(1, 98) = 13.63, p < .001. Responses to all three types of disfluent utterances were significantly faster than responses to Fluent utterances (between-word interruptions vs Fluent utterances, F1(1, 48) = 12.80, p = .001; t2(96) = 1.95, p < .07; mid-word vs Fluent, F1(1, 48) = 12.67, p = .001; t2(96) = 2.43, p < .02; mid-word with filler vs Fluent, F1(1, 48) = 38.12, p < .001; t2(96) = 3.58, p = .001). Although listeners must wait longer to hear the target color word in a disfluent utterance than in a fluent utterance, their speeded response to the target word in a disfluent utterance partially compensates for this delay. As for accuracy, listeners responded no less accurately to mid-word interruptions with fillers than to Fluent utterances. However, they did respond less accurately to both of the other types of disfluencies than to Fluent utterances [between-word interruptions vs Fluent utterances, F1(1, 49) = 36.27, p < .001; t2(96) = 15.05, p < .001, and mid-word interruptions vs Fluent utterances, F1(1, 49) = 24.66, p < .001; t2(96) = 8.28, p < .001].

That responding to orange after a between-word interruption like yellow- orange could be faster than responding to orange in a Fluent utterance seems surprising at first. This advantage could be due to the target color word in a Naturally Disfluent utterance receiving contrastive stress relative to the first color word (as in Move to the yellow- ORANGE square) (see Cutler, 1983; Levelt & Cutler, 1983; Shriberg et al., 1992). Converging support for this possibility comes from the finding of marginally speeded response times for Disfluency-excised over Fluent utterances (different by subjects but not by items), F1(1, 49) = 4.07, p < .05; F2(1, 98) = .80, ns.

Overall, responses to Fluent, Disfluency-excised, and Disfluency-replaced-by-pause utterances were equally (and highly) accurate; none of these utterances contained unintended color word information that could have caused interference or false alarms in the task. The response time means in Table 3 show the same pattern of differences among the three types of disfluencies for both Disfluency-excised and Disfluency-replaced-by-pause controls. That the times for Disfluency-replaced-by-pause controls were no longer than for Disfluency-excised controls suggests that silent pauses may not be harmful to comprehension (we consider pauses further in Experiments 3 and 4).

These results show that some features of disfluencies enable listeners to partially compensate for any potential disruption or delay in comprehension. When a speaker interrupts an unintended word rather than completing it before repairing it, this disadvantages the listener less. And when the speaker marks the interruption with a filler, the listener is not only more accurate but also faster to recognize the correct target word than when the speaker does not do so. These results support speculations by Clark (1994, 1996) and Levelt (1989) that a word fragment lets listeners know that the speaker is having difficulty with that word.

Experiment 1 used a logically constrained context in which there were only two objects to choose from. Interruptions may have signaled to listeners that the speaker was having trouble with a word or, more specifically, that he or she meant to cancel that word. With only two possibilities, if one is cancelled, then the other is certain. Next, we examined the disfluency effect in a context where knowing that a word is troublesome is
somewhat less informative.

EXPERIMENT 2

In Experiment 2, we expanded the set of possible referents to reduce the information value of the disfluencies: Listeners chose from three objects rather than two. With three objects, when the speaker interrupts himself with yell- uh, orange, the information in the reparandum (yell- uh) may signal an intent to cancel yellow, but if it does, this no longer uniquely determines which of the other two objects the speaker must mean. So if the disfluency advantage in Experiment 1 depended entirely on an inference enabled by the unusual logical constraints of the context, then the response time advantage should be eliminated with three objects. On the other hand, if an advantage still occurs (perhaps in attenuated form, from having to choose among three objects rather than two), this supports the idea that a disfluency can aid comprehension by signaling what a speaker is having trouble with.

Methods

Participants. Fifty native speakers of English (31 women and 19 men) volunteered to participate in exchange for research credit in an introductory undergraduate psychology course. None had participated previously. One additional subject was replaced because of technical difficulties and another was replaced for making a large number of errors (> 2 SDs above the mean).

Design, stimuli, and procedure. Experiment 2 used the same software, hardware, and stimuli as described earlier (68 Fluent utterances and 34 Naturally Disfluent utterances matched to 34 Disfluency-replaced-by-pause and 34 Disfluency-excised versions). The procedure, utterances, and practice trials were the same as those in Experiment 1, with the following minor differences: Listeners were told that if they pressed the wrong key, they would hear a sound (rather than getting a visual highlight as in Experiment 1) and that the next trial would begin automatically shortly after they made their selection. The important difference from Experiment 1 was that each visual display showed three rather than two objects (either an orange, a purple, and a yellow square or an orange, a purple, and a yellow circle, arranged horizontally in varying order). To create the three-object displays, we modified the displays from Experiment 1 by adding a third (irrelevant) object to the same position (left, middle, or right) on both the List A and List B displays for a particular utterance so that for all Naturally Disfluent items, when the object mentioned in the reparandum appeared to the left of the object mentioned in the repair in one list, it appeared to the right in the other. Irrelevant objects appeared equally often in the left, middle, and right positions. Each listener experienced the trials in either List A or List B in a different random order. For input, listeners used both index fingers and the middle finger of their dominant hand, poised over three adjacent keys. Incorrect or extremely slow response times (>1600 ms, ∼3 SDs from the mean) were discarded.

TABLE 4

Experiment 2 (Three-Object Displays): Response Times in Milliseconds from Onset of Target Color Word (Error Rates in Parentheses)

Disfluency type      Naturally disfluent    Disfluency-replaced-by-pause    Disfluency-excised    Fluent
Between-word         781 (.08)              743 (.01)                       762 (.02)
Mid-word             761 (.04)              749 (.01)                       751 (.01)
Mid-word w/filler    741 (.01)              769 (.00)                       789 (.02)
Mean                 765 (.05)              750 (.01)                       762 (.02)             786 (.01)
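
The scoring rule used in Experiments 1 and 2 (response time measured from the onset of the target color word, with incorrect or extremely slow responses discarded) might be sketched as follows. The Trial record, its field names, and the sample trials are hypothetical and serve only as illustration; the 1600 ms cutoff for Experiment 2 (1400 ms in Experiment 1) is the only value taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    key_press_ms: float     # utterance start to correct key press
    target_onset_ms: float  # onset of the target color word in the utterance
    correct: bool           # assumed: False if the first key pressed was wrong


def scored_rt(trial: Trial, cutoff_ms: float = 1600.0) -> Optional[float]:
    """RT relative to target-word onset, or None if the trial is discarded."""
    if not trial.correct:
        return None
    rt = trial.key_press_ms - trial.target_onset_ms
    return rt if rt <= cutoff_ms else None


def mean_rt(trials: List[Trial], cutoff_ms: float = 1600.0) -> float:
    """Mean of the retained RTs; NaN if every trial was discarded."""
    kept = [rt for rt in (scored_rt(t, cutoff_ms) for t in trials) if rt is not None]
    return sum(kept) / len(kept) if kept else float("nan")


# Hypothetical trials: one kept, one incorrect, one too slow.
sample_trials = [
    Trial(2100, 1350, True),
    Trial(2300, 1500, False),
    Trial(3400, 1400, True),
]
print(mean_rt(sample_trials))
```

Passing cutoff_ms=1400 would apply the Experiment 1 criterion instead.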

Results and Discussion

Response times along with error rates are shown in Table 4. Listeners were noticeably slower to choose a referent out of a set of three than out of a set of two; response times for Fluent utterances were 67 ms slower than in Experiment 1. Although listeners received the same instructions to try to be both fast and accurate in both experiments, three-object displays presented a harder task than did two-object displays, and this seems to have made listeners trade off speed for accuracy. Experiment 2’s listeners took longer to respond but got more trials correct, making on average 3.5 errors to Experiment 1’s 7.3.

As before, we compared within-item response-time difference scores (Disfluency-replaced-by-pause minus Naturally Disfluent) to test for a relative advantage for mid-word as opposed to between-word interruptions (Hypothesis 1). Just as before, there was no difference, planned contrast, F1(1, 49) = 2.27, ns, t2(31) = .91, ns, but as before, listeners were more accurate after mid-word than between-word interruptions, F1(1, 49) = 12.43, p < .001, t2(31) = 2.33, p < .05. Once again, difference scores for mid-word interruptions with fillers were greater than those without (Hypothesis 2), planned contrast, F1(1, 49) = 15.32, p < .001, t2(31) = 2.39, p < .05. As before, this response-time disfluency advantage incurred little cost in accuracy; errors were less common with mid-word interruptions with fillers than without (by subjects but not by items), F1(1, 49) = 5.62, p = .02; t2(31) = .97, ns. In fact, errors occurred no more often to utterances with fillers than to Fluent utterances (1% of which incurred errors), F1(1, 49) = 0.05, ns; t2(98) = .14, ns. So listeners responded to target words faster and more accurately after mid-word interruptions marked with fillers than those not so marked, and this disfluency advantage for response times did not occur for between-word interruptions or mid-word interruptions not marked with uh. This pattern of results across the three types of disfluencies is the same as that for two-object displays.

FIG. 1. Disfluency advantage (Disfluency-replaced-by-pause minus Naturally Disfluent response times) for the three kinds of disfluencies in Experiments 1 and 2.

The important difference between responses to two- and three-object displays is that the whole pattern of difference scores shifted downward with three objects (see Fig. 1). The difference score for utterances with fillers was only 28 ms (down from 78 ms in Experiment 1), reliably different from zero by subjects but not by items, F1(1, 49) = 8.86, p = .005; F2(1, 5) =
2.71, p = .161, ns. That the filler’s value as a cue is attenuated but not eliminated when listeners choose between three objects as opposed to two suggests that knowing what word the speaker is having trouble with helps listeners cancel the effect of hearing the unintended word.

Although listeners balanced speed and accuracy somewhat differently with three- than with two-object displays, making fewer errors with three-object displays, the pattern of errors remained the same. Mid-word interruptions with fillers were the only kind of disfluency with an error rate as low as that of Fluent utterances. Listeners were less accurate after between-word interruptions than after Fluent utterances, F1(1, 49) = 29.62, p < .001; t2(98) = 7.74, p < .001. They were also less accurate after mid-word interruptions than after Fluent utterances, F1(1, 49) = 10.74, p < .005; t2(98) = 2.84, p < .01.

As before, stress (in the form of elevated F0) appeared to contribute to response times. Listeners responded to target words in Disfluency-excised utterances 24 ms faster than in Fluent ones, planned contrast, F1(1, 49) = 14.11, p < .001; F2(1, 98) = 2.82, p < .10. Target side had no effect.

Apart from the finding that increasing the set of possible referents attenuated the disfluency advantage for mid-word interruptions, the patterns of response time and error data parallel those in Experiment 1. When listeners must choose from a set of three rather than two objects, the advantage of an interruption marked with uh is attenuated, but not eradicated. The response time advantage for the disfluency may diminish because the information value of the cue is lower, because it takes longer to choose from three alternatives than two, or for both reasons. Whether the disfluency is marked with a filler still contributes strongly to the error rates; responses to utterances containing mid-word interruptions marked with uh were just as accurate
as responses to Fluent utterances.

EXPERIMENT 3

So far, listeners’ responses to target color words after mid-word interruptions marked by fillers have been faster than when these disfluencies were absent (in both Disfluency-replaced-by-pause and naturally Fluent utterances). What is still unclear is why. What features of the utterance are responsible for the speeded responses and lower error rates, and what is the underlying process that takes advantage of these features? The disfluency advantage could be due to the phonological form of the filler alone; certain fillers could heighten listeners’ vigilance to an upcoming target word (Fox Tree, 1993). If so, then a version minus the interrupted word and containing only the filler should be processed just as quickly as the nonedited Naturally Disfluent versions and faster than the Disfluency-replaced-by-pause versions. Alternatively, the response time advantage could be due to the phonological form of the filler in combination with the interrupted word. If the filler acts as an editing expression signaling that an interrupted word is being cancelled, then the Naturally Disfluent utterances should be processed faster than versions in which either the interrupted word or the filler is edited out. Finally, the advantage could be due to the extra time that elapses during the filler, together with the information about the cancelled target word in the reparandum. If this last hypothesis is true, then a signal that a word is to be replaced may help only when there is sufficient time to process such information; whether the interrupted word is followed by a filler or a silent pause of equivalent length should make no difference. Experiment 3 teased apart these three possibilities.

Although listeners in Experiments 1 and 2 were instructed to respond quickly, they sometimes waited until well after the end of the utterance before responding. Since we were interested in the incremental effects of disfluencies, Experiment 3 encouraged listeners to respond more quickly by placing a moderate deadline on their responses.

Methods

Participants. Forty-eight undergraduate psychology students (25 women and 23 men) participated in exchange for research credit in a psychology course. None had participated previously, and all identified themselves as native speakers of English. Data from 3 students who produced a large total number of errors and timeouts (placing them distinctly outside the distribution of the other participants) were replaced. Those replaced made errors and timeouts approximately 2 SDs (or more) above the mean.

Design and stimuli. Experiment 3 used the same stimuli as the previous experiments plus 12 additional items (see Table 5) that were edited from duplicates of the 6 Naturally Disfluent mid-word interruptions with fillers for a total of 182 stimuli. Six of the new items had the filler replaced with a pause of equal length (Filler-removed items), and 6 had the interrupted word replaced with a pause of equal length (Word-removed items). All displays consisted of two objects as in Experiment 1, and these were counterbalanced for target side using two versions of the displays as before.

TABLE 5

Intervals of Interest in the Stimuli for Experiments 3 and 4

Stimuli                            Reparandum           Edit interval    Repair (target)
1. Disfluent                       Move to the ye-      uh,              orange square
2. Filler-removed                  Move to the ye-      |---|            orange square
3. Word-removed                    Move to the |---|    uh,              orange square
4. Disfluency-replaced-by-pause    Move to the |---     ---|             orange square
                                   Target
5. Disfluency-excised              Move to the | orange square
6. Fluent                          Move to the orange square

Note. Utterances are aligned to symbolize relative time elapsed. Vertical bars show location of electronic edits and horizontal lines show relative extent of silent pauses.

Procedure. The same software, hardware, and input method were used as in the previous experiments. When listeners did not respond within 1000 ms after the onset of the target word, the trial timed out and a “too slow” message appeared before the next trial proceeded automatically.

Results and Discussion

The critical comparisons for Experiment 3 were those that contrasted features of the disfluencies within utterances, using four versions of each mid-word interruption with filler: Naturally Disfluent, Filler-removed, Word-removed, and Disfluency-replaced-by-pause. Correct response times to these four versions (as opposed to difference scores, which represent a comparison between only two versions) were compared in an ANOVA that included display side, with planned comparisons to test for relative contribution of fillers and mid-word interruptions. The results were clear: With respect to both response times (Fig. 2) and error rates, Filler-removed items were similar to Naturally Disfluent (nonedited) items and Word-removed items were similar to Disfluency-replaced-by-pause items. Listeners responded about 80 ms more slowly to utterances in which mid-word interruptions were replaced by silent pauses (leaving only the filler) than to the nonedited versions, F1(1, 47) = 25.11, p < .001; F2(1, 5) = 64.92, p < .001. And listeners responded just as quickly to utterances that kept the interrupted word but replaced the filler with a silent pause as they did to the nonedited versions, F1(1, 47) = .004, ns, F2(1, 5) = .036, ns. While removing the filler from Naturally Disfluent utterances did not harm response times, the presence of the filler in the Word-removed utterances led to faster response times than the Disfluency-replaced-by-pause versions by subjects but not by items, F1(1, 47) = 4.16, p < .05; F2(1, 5) = 2.86, p = .15. So it appears that some remodeling is in order for Levelt’s (1989) original idea that a filler may be an informative cue, at least in our task situation. The facilitating aspect of a filler after a mid-word interruption appears to be not the phonological form of the filler, but the extra time that elapses after the interruption, while the filler is being produced.

FIG. 2. Effects of interrupted words and fillers in Experiments 3 and 4.

Error rates, which were extremely low for all four versions of the mid-word interruptions with fillers, were lowest when misleading information in the form of interrupted color words was removed, planned comparison of Word-removed and Naturally Disfluent versions, F1(1, 47) = 5.96, p < .02; F2(1, 5) = 18.95, p < .01. Disfluency-replaced-by-pause and Word-removed versions, the two versions with no misleading color information, had equally low error rates, F1(1, 47) = .02, ns; F2(1, 5) = .00, ns. And removing the filler from Naturally Disfluent versions made no difference, planned comparison, F1(1, 47) = .20, ns; F2(1, 5) = .43, ns. Error rates and response times are displayed in Table 6.

TABLE 6

Experiments 3 and 4 (Two-Object Displays): Response Times in Milliseconds from Onset of Target Color Word (Error Rates in Parentheses)

Stimuli                                   Experiment 3    Experiment 4

Fluent                                    628 (.01)       612 (.01)

Naturally Disfluent
  Replaced color words
    Between-word                          525 (.14)       532 (.19)
    Mid-word                              541 (.09)       539 (.09)
    Mid-word w/filler                     490 (.03)       498 (.02)
  Repeated color words                                    548 (.03)

Disfluent edited
  Filler-removed                          489 (.03)       473 (.03)
  Word-removed                            571 (.00)       574 (.02)
  Both word and filler removed
    (Disfluency-replaced-by-pause)        595 (.00)       589 (.02)

Difference scores (Disfluency-replaced-by-pause minus Naturally Disfluent response times) showed the same pattern as in Experiments 1 and 2. Difference scores for mid-word interruptions were no higher than for between-word interruptions, F1(1, 47) = 1.74, ns; t2(31) = .64, ns. As before, mid-word interruptions led to fewer errors than did between-word interruptions (by subjects but not by items), F1(1, 47) = 8.21, p = .006; t2(31) = 1.62, p < .116. Difference scores for mid-word interruptions with fillers were higher than for those without, F1(1, 47) = 20.82, p < .001; t2(31) = 3.36, p = .002. As before, there was no penalty to the disfluency advantage in response times: Mid-word interruptions with fillers led to fewer errors than did those without fillers (by subjects but not by items), F1(1, 47) = 10.14, p = .003; t2(31) = 1.48, p = .148.
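As a rough check on the arithmetic, the response-time contrasts for the mid-word interruptions with fillers can be recomputed from the cell means in Table 6. The sketch below is illustrative only: the dictionary keys and function names are assumptions introduced here, and the reported analyses were run on per-subject and per-item data rather than on these cell means.

```python
# Cell means in ms, copied from Table 6 for the mid-word interruptions with fillers
# and their three edited versions. The key names are hypothetical labels chosen for
# this sketch; they are not the authors' variable names.
TABLE6_MEAN_RT_MS = {
    "Experiment 3": {
        "naturally_disfluent": 490,
        "filler_removed": 489,
        "word_removed": 571,
        "disfluency_replaced_by_pause": 595,
    },
    "Experiment 4": {
        "naturally_disfluent": 498,
        "filler_removed": 473,
        "word_removed": 574,
        "disfluency_replaced_by_pause": 589,
    },
}


def difference_score(cells):
    """Disfluency-replaced-by-pause minus Naturally Disfluent response time (ms)."""
    return cells["disfluency_replaced_by_pause"] - cells["naturally_disfluent"]


def word_removal_cost(cells):
    """Word-removed minus Naturally Disfluent response time (ms)."""
    return cells["word_removed"] - cells["naturally_disfluent"]


if __name__ == "__main__":
    for experiment, cells in TABLE6_MEAN_RT_MS.items():
        print(experiment, difference_score(cells), word_removal_cost(cells))
```

On these cell means, replacing the interrupted word with a pause costs 81 ms in Experiment 3, in line with the difference of about 80 ms reported for that planned comparison.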

However, under moderate time pressure, listeners made more use of the information in disfluencies: Difference scores were positive not only for the mid-word interruptions with fillers, but for the other two types of disfluencies as well (unlike in Experiments 1 and 2). So this time, listeners responded more quickly to Naturally Disfluent between- and mid-word interruptions (without fillers) such as Move to the purple- orange square than they did to the Disfluency-replaced-by-pause versions of the same utterances such as Move to the <pause> orange square. Figure 3 shows that while the relative pattern of difference scores for the three types of disfluencies remained the same across Experiments 1, 2, and 3, the difference scores shifted upward with the earlier responses in Experiment 3. If we assume that recognition of the (highly predictable) color word in the repair took well under 200 ms (see Marslen-Wilson & Tyler, 1981) and the key press another 200 ms, then this places the listener's decision point after the color word in the repair (mean response times ranged from nearly 500 ms in Experiment 3 to the high 700s in Experiment 2). So by the time listeners made their decisions, they had had ample evidence that the reparandum word or word fragment was being cancelled from both the semantic and the prosodic contrasts between the repair and reparandum. Those under moderate time pressure (Experiment 3) benefited more from hearing cues in the disfluencies than those under less time pressure (Experiments 1 and 2). In the Disfluency-replaced-by-pause and Word-removed versions, without a good account for the hesitation before the target word, listeners could tell only that the speaker may have had some (covert) trouble.

FIG. 3. Disfluency advantage for Experiments 1, 2, and 3.

So far, we have found that error rates are lowest for utterances with no misleading color word information, next lowest for those containing mid-word interruptions marked by a filler, next lowest for those not so marked, and finally, highest for between-word interruptions. However, we have not yet distinguished two competing explanations for this effect: Are the lower error rates for mid-word interruptions (compared to between-word interruptions) due to having a discrete cue in the form of the interrupted word, or are they simply due to hearing less information that is misleading (a continuously varying cue)?

To address this question, we conducted a post hoc analysis of the stimuli. We measured the lengths of two intervals of interest: the reparandum and the editing interval (or delay before the repair). The reparandum in a disfluent utterance began with the onset of the unintended color word and ended with the interruption site. Mean reparandum lengths averaged 328, 276, and 175 ms for between-word, mid-word, and mid-word interruptions with fillers, respectively. The editing interval began with the interruption site and ended with the onset of the intended color word (corresponding to the beginning of the repair); this averaged 71, 94, and [illegible] ms for between-word, mid-word, and mid-word interruptions with fillers, respectively.² For each disfluent utterance, we computed correlations between each of these two intervals and the two dependent measures of interest: mean error rate and mean response time difference score (Disfluency-replaced-by-pause minus Naturally Disfluent) for each disfluent utterance in Experiment 3. To determine whether any reliable associations were carried solely by the items with fillers (which, after all, yielded the largest effects), we computed each correlation both with and without the six filler items.

² The lengths of these intervals fall within the range of those reported by Blackmer and Mitton (1991) for their corpus of natural examples from talk-show conversations. Their overt repairs had editing intervals that averaged [illegible] ms but that were sometimes as short as 0 ms. In Nakatani and Hirschberg's (1994) corpus, edit intervals averaged [illegible] ms after word fragments and 481 after nonfragments. In Lickley's (1996) sample of 30 disfluent utterances, silent pauses at the interruption site ranged from 34 to 1134 ms. That the edit intervals after nonfragments in our stimuli were so short suggests that the speaker in our limited domain did not need to take much time for replanning repairs. For the purpose of algorithmically identifying pauses associated with edit intervals, O'Shaughnessy (1992) proposed that these pauses should range from 80 to 400 ms; this is consistent with our edit intervals.

Error rates were correlated with the lengths of the reparanda (or amount of misleading information), r(33) = .462, p < .001, but not with the lengths of the edit intervals (or delay before repair), r(33) = −.305, ns. This is consistent with the interpretation that listeners make fewer errors when they hear less misleading information rather than depend on an interrupted word as a discrete cue. On the other hand, the magnitude of disfluency advantages in response times was correlated with the lengths of the edit intervals, r(33) = .560, p < .001, but not with the lengths of the reparanda, r(33) = −.228, ns. So the longer the delay before the repair, the more hearing the information in the disfluency may have speeded response times. These results are not due simply to the mid-word interruptions with fillers (which have especially short reparanda and long editing intervals); when those utterances are removed from the correlations, the pattern is the same (although slightly attenuated). This evidence converges with our conclusion from the comparison of Filler-removed vs Naturally Disfluent items: It is not the phonological form of the filler that speeds recognition of the target color word in our task, but the processing time that elapses during the filler.
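The item-level correlations described above can be sketched as a plain Pearson coefficient computed over per-utterance values. This is a minimal illustration, not the authors' analysis script: the function names are assumptions, and the example lists are made-up placeholders rather than measurements from the stimuli. Under the usual convention that a Pearson correlation has n − 2 degrees of freedom, the reported r(33) would correspond to 35 utterances.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return sxy / math.sqrt(sxx * syy)


def degrees_of_freedom(n_items):
    """Degrees of freedom conventionally reported with a Pearson r."""
    return n_items - 2


if __name__ == "__main__":
    # Hypothetical placeholder values for five utterances (NOT data from the study):
    # reparandum length in ms and mean error rate per utterance.
    reparandum_ms = [150.0, 210.0, 280.0, 330.0, 400.0]
    error_rate = [0.02, 0.05, 0.04, 0.12, 0.15]
    print(round(pearson_r(reparandum_ms, error_rate), 3))
    print(degrees_of_freedom(35))
```

Running the same function once on all disfluent utterances and once with the filler items dropped mirrors the with-and-without comparison described in the text.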

EXPERIMENT 4

The disfluencies in Experiments 1–3 were 100% informative that the color word before the interruption would be replaced by another color word. But interrupted utterances can also be continued with repeated material; an interruption does not always signal that a word will be cancelled. In a study of spontaneous conversations during which pairs of speakers did a matching task (Bortfeld et al., in press), the speakers interrupted themselves and then repeated (without correcting) portions of their utterances fairly frequently (1.47 times every 100 words), although less often than interrupting themselves and replacing reparanda with new material (1.94 times every 100 words). When we elicited the disfluencies for the current experiments, we had expected to collect many tokens of repetitions such as Move to the yell- yellow square (in fact, that is why we set up about one-fifth of our elicitation trials so that the highlight of the object flickered but then remained on the same object). But the speaker in our corpus produced only seven instances of immediately repeated color words, not nearly enough to provide comparison tokens for our three types of interruptions (between-word, mid-word, and mid-word with filler).


That the stimuli in Experiments 1–3 included no repetitions raises the possibility that the speeded response times may have resulted from some special strategy on the part of listeners. Listeners may have learned in our experiments that interruptions always signaled replacement. If this were the case, then reducing the informativeness of interruptions should make the disfluency advantage go away. Experiment 4 ruled this possibility out by including some interruptions followed by repetitions.

Methods

Participants. Fifty Stony Brook undergraduate native speakers of English (29 women and 21 men) participated in exchange for research participation credit in their psychology courses. None had participated in any of the previous experiments. Data from 7 students who produced a large total number of errors and timeouts (2 SDs or more above the mean of the other participants) were replaced.

Design, stimuli, and procedure. For the experiments so far, the informativeness of interruptions in the task context has remained the same. In Experiments 1 and 2, of the 170 experimental trials and 15 practice trials, 37 (or 20%) included reparanda containing all or part of a color word, and 100% of the time these corresponded to a change of color word in the repair. In Experiment 3, the percentage of replaced color words increased slightly, to 22%, due to the addition of six items with interrupted color words for which the fillers were removed. It is possible that listeners adjusted to this predictability and responded via an unnatural strategy such as deciding what key to press before they heard the repair. Of the seven tokens of utterances with repeated color words in our database, one was a between-word interruption, four were mid-word interruptions, one was a mid-word interruption with a filler, and one was a mid-word interruption that recycled a short stretch of the previous utterance (Move to the yel- to the yellow square). These were too varied and too rare to use for controlled comparisons in Experiment 4; however, we included them in order to reduce the informativeness of interruptions [illegible]ing a color word. So the informativeness of interruptions was reduced from 100 to 82%; that is, about one time in seven, the speaker did not cancel what was in the reparandum. Given our corpus, this was the most by which we could reduce the informativeness of interruptions without resorting to electronically manufacturing disfluent utterances. Note that, unlike the other disfluent utterances in which the color words were repaired, the utterances with repeated color words did not have the target color word contrastively stressed; F0s were the same for the first and second color words, t(6) = 1.04, ns.

One of the seven utterances with a repeated color word was presented in the practice trials so that listeners would be exposed at the outset to the possibility that color words could be repeated as well as replaced; the other six were included among the experimental trials, presented as before to listeners in different random orders. Apart from the seven new utterances, Experiment 4 used the same stimuli, procedure, and analyses as Experiment 3.

Results and Discussion

Table 6 shows response times and error rates. Even though interrupting an utterance after or within the color word no longer predicted that this word would be cancelled 100% of the time, results were consistent with those in Experiment 3: Response times for Filler-removed utterances were most similar to those for Naturally Disfluent utterances, and response times for Word-removed utterances were most similar to those for Disfluency-replaced-by-pause utterances (see Fig. 2). As before, replacing the interrupted word with a silent pause slowed response times, planned comparison, F1(1, 49) = 62.89, p < .001; F2(1, 5) = 107.99, p < .001. Once again, it seems that it is not the phonological form of fillers that speeds processing; in fact, responses were slightly faster (but only marginally) after interrupted words when fillers had been replaced with pauses of equal length, planned comparison, F1(1, 49) = 3.07, p = .09; F2(1, 5) = 6.87, p < .05.

As before, error rates with utterances that included misleading interrupted color words (Naturally Disfluent mid-word interruptions with fillers and Filler-removed utterances) were no different, planned comparison, F1(1, 49) = .23, ns; F2(1, 5) = .31, ns. And utterances without misleading information (Word-removed and Disfluency-replaced-by-pause versions) were no different in either response times, F1(1, 49) = 2.74, ns; F2(1, 5), ns, or error rates, F1(1, 49) = .53, ns; F2(1, 5) = .60, ns. Unlike in Experiment 3, however, there was no decrease in the (already quite low) error rate when the interrupted color word was replaced with a pause, planned comparison, F1(1, 49) = .04, ns; F2(1, 5) = .90, ns. Finally, error rates for the nonedited versions (mid-word interruptions with fillers) were no different than for Fluent utterances, F1(1, 49) = .20, ns; t(2) = .25, ns.

For the different types of nonedited disfluent utterances (between-word, mid-word, and mid-word interruptions with fillers), response-time difference scores (Disfluency-replaced-by-pause minus Naturally Disfluent response times) and error rates followed the same pattern as in Experiments 1–3. Difference scores for between-word interruptions were the same as for mid-word interruptions without fillers, F1(1, 49) = .51, ns; t2(31) = .59, ns. Mid-word interruptions incurred fewer errors than did between-word interruptions, F1(1, 49) = 35.25, p < .001; t2(31) = 2.97, p < .01. Difference scores for mid-word interruptions with fillers were higher than for those without, F1(1, 49) = 17.56, p < .001; t2(31) = 3.26, p = .003, and mid-word interruptions with fillers incurred fewer errors, F1(1, 49) = 29.98, p < .001; t2(31) = 1.68, p = .10. As in Experiment 3, difference scores for all three types of disfluencies were positive. That is, listeners responded to target words in all Naturally Disfluent utterances more quickly than to target words in Disfluency-replaced-by-pause utterances.

As for the six trials with repeated color words (or word fragments) that were included to break the perfect correlation between interruptions and replacing words, listeners made errors only 3% of the time. If listeners were treating an interrupted color word as a cue that the word would be replaced, then listeners should have made more errors with repeated color words. The fact that this error rate was so low is consistent with the proposal that the less misleading information in the reparandum, the less likely listeners are to choose the wrong object (utterances with repetitions contained disfluencies but no misleading color information). Correct responses to instructions with repeated color words were relatively fast, averaging 453 ms from the onset of the final color word (see Table 6). However, in 28 cases (9.6%) the correct response came before the onset of the final (repeated) color word, suggesting that in these cases listeners responded to the color word in the reparandum. In contrast, early correct responses were virtually nonexistent for disfluent instructions in which color words were replaced (because the first color word was always incorrect). If we adjust the means for repetitions by counting the 28 early responses as incorrect, then correct response times would average 548 ms and error rates 13% (similar to the data for between- and mid-word interruptions with replacements in Table 6). So even though instructions with repeated color words were relatively rare in our stimuli compared to those with replaced color words, they didn't appear to slow listeners down or lead them astray any more than did disfluencies in which color words were replaced. This result suggests that listeners did not adopt a strategy of simply treating an interrupted color word as a cue by itself that the word would be replaced.

GENERAL DISCUSSION

Across all four experiments, listeners responded to target words after disfluencies that had long edit intervals (e.g., Move to the pur- uh yellow square) faster than when disfluencies were absent (e.g., Move to the yellow square, with or without a silent pause before yellow). This was true whether the display included two objects or three (with three objects, the disfluency advantage for utterances with fillers was attenuated but still reliably positive). Together, the results show that there is information in disfluencies that partially compensates for any disruption in processing. This disfluency advantage was even clearer when listeners were under time pressure (Experiments 3 and 4); they were also able to use the information in disfluencies without fillers (with shorter edit intervals, e.g., Move to the purple- yellow square) to respond faster than when disfluencies were replaced by pauses of equal length.

In Experiments 1–4, regardless of how listeners traded off speed and accuracy, the relative pattern of error rates was the same (between-word interruptions led to more errors than mid-word interruptions, which led to more errors than mid-word interruptions with fillers). Of course, the reason measurable errors occurred at all was because our task required speakers to make a commitment; in more natural comprehension settings, reaching the wrong interpretation seldom leads to an irrevocable commitment. For future work we are eager to use an online technique, such as eye tracking, that is sensitive to the incremental decisions that are made (as well as unmade or postponed) during the interpretation of spontaneous speech.

From these data, we can construct an incipient account of how listeners are able to compensate for the disfluencies they hear in spontaneous speech. We began with predictions inspired by some ideas of Levelt's (1989); however, our data support an account that is not as simple as expected. Hypothesis 1, that mid-word interruptions should be easier for listeners to recover from than between-word interruptions, was supported by the consistent pattern of error data across all the experiments. However, the interrupted word does not appear to act as a discrete cue; rather, the less misleading information listeners hear, the less likely they are to make a commitment to the wrong interpretation. Listeners responded equally quickly to the target (repair) word whether the preceding reparandum was interrupted mid- or between-word.

As for the idea that fillers may help listeners process disfluent speech (Hypothesis 2), listeners responded consistently fastest to target words preceded by a mid-word interruption with a filler. Strikingly, these speeded responses incurred no error penalty; mid-word interruptions with fillers consistently led to fewer errors than did disfluencies without fillers, and in fact error rates were equal to or nearly as low as for Fluent items. But Experiments 3 and 4 showed quite clearly that it was not the phonological form of the filler that was driving the faster responses and lower error rates, but the extra time that elapsed during the filler before the repair. When the time between the interrupted word and repair was held constant, whether or not there was a filler in this interval made no difference in response times or error rates. And when the interrupted word was removed and the time interval before the repair was again held constant, it again made little difference whether a filler preceded the repair. It appears, then, that with a longer editing interval (allowing more time to process the evidence that there is some trouble), listeners are better able to process the repair and select the correct target word more quickly.

As we mentioned before, disfluencies can be detected quite early, by the first word in a repair (Lickley & Bard, 1992, 1998). But what is it that serves as a cue to a listener that there is a problem with an utterance? Two pieces of evidence suggest that our listeners did not conclude that a word was necessarily being cancelled when an utterance was interrupted: First, even when the referent array was constrained to two objects, the time course of listeners' responses indicates that they waited to hear at least part of the repair after the interruption and edit interval. Second, responses to the utterances with repeated color words (e.g., Move to the yel- yellow square; Experiment 4) indicate that an interruption by itself did not signal the cancellation of the color word in the reparandum. If it had, then listeners should have made many errors after repetitions, and they did not. The fact that replaced color words received contrastive stress and repeated color words did not suggests that the cue that a word was being cancelled may consist of a three-part combination: the presence of the interruption,³ the semantic content of the material after the interruption (differing from that in the reparandum), and the stress characteristics (with F0 either elevated or not) of the color word in the repair.

³ While we have not manipulated the presence of phonological cues to interruptions here, others have considered them; for instance, Nakatani and Hirschberg (1994) found that 30% of interruption sites in their corpus were characterized by what they call interruption glottalization (p. 1608).

So we suggest an account that goes as follows: the parser monitors for an interruption in fluent speech, which signals that the speaker is having some difficulty. When the edit interval is long enough, this difficulty is confirmed. When the following repair word is pronounced with the same stress as the reparandum word, a repetition is signaled; on the other hand, when it has greater stress, this helps the listener suppress the material the speaker means to cancel. When the interruption comes early enough, such as within the troublesome word, this helps prevent the listener from committing him- or herself to the wrong interpretation (as shown by the lower error rates after mid-word interruptions in all four experiments). This could happen for two reasons: either the interrupted word is a relatively poor cue for activating the (unintended) color word or the fact that this word is interrupted may help suppress any potential interference that it may cause (leading to fewer incorrect interpretations after interrupted words than after completed words). The first possibility seems somewhat unlikely; in our task, it should take very little information to activate (highly predictable) color words. The second possibility seems more likely, particularly with the help of contrastive stress on the repair word.
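The cue combination in this account can be caricatured as a small rule table. The sketch below is only a toy rendering under strong assumptions: the class, the function, the category labels, and especially the 80 ms threshold are hypothetical choices made for illustration (the account says only that the edit interval must be "long enough"), and nothing here is a model that was fitted or tested in these experiments.

```python
from dataclasses import dataclass

# Hypothetical threshold for a "long enough" edit interval; the account itself
# does not give a number, so this value is an assumption for the sketch.
MIN_EDIT_INTERVAL_MS = 80.0


@dataclass
class Cues:
    interrupted: bool                # was fluent speech interrupted?
    edit_interval_ms: float          # delay from interruption site to repair onset
    repair_matches_reparandum: bool  # does the repair word repeat the reparandum word?
    repair_stress_elevated: bool     # is the repair word stressed more than the reparandum?


def interpret(cues: Cues) -> str:
    """Toy mapping from the three cues named in the account to an interpretation."""
    if not cues.interrupted:
        return "fluent"
    # Different word plus contrastive stress: cancel the reparandum.
    if cues.repair_stress_elevated and not cues.repair_matches_reparandum:
        return "replacement"
    # Same word with parallel stress: treat as a repetition, cancel nothing.
    if cues.repair_matches_reparandum and not cues.repair_stress_elevated:
        return "repetition"
    # Cues conflict: fall back on the edit interval as evidence of trouble.
    if cues.edit_interval_ms >= MIN_EDIT_INTERVAL_MS:
        return "trouble confirmed"
    return "possible trouble"


if __name__ == "__main__":
    print(interpret(Cues(True, 300.0, False, True)))
    print(interpret(Cues(True, 90.0, True, False)))
```

Keeping the stress and semantic cues ahead of the edit-interval check reflects the account's claim that an interruption alone does not signal cancellation; a different ordering would be equally easy to express.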

Our experiments raise as many questions as they answer; here are three. First, precisely what is it that announces an interruption in fluent speech? The current studies say more about what the cues are not than about what they are. What we know is that neither mid-word interruptions nor the phonological forms of fillers act as cues by themselves. The length of the editing interval (which is longer when there is a filler) plays a significant role, as does the amount of misleading information that precedes the interruption site; contrastive vs parallel stress is a promising candidate as well. Future work should focus on the role of stress in repetitions and replacements (perhaps including edited examples with appropriate and inappropriate F0s).

Second, by what mechanism does the information available in disfluencies aid the parsing process? As Fox Tree suggested, it may be that "listeners do not automatically enter a repair mode when they hear an incongruity. If the incongruity consists of words identical to something just heard, the repair process is inactive. The process may only begin when the incongruity consists of different words" (Fox Tree, 1995, p. 730). This idea is consistent with Lickley and Bard's (1996) finding that word recognition is hindered for words in the reparandum before a fresh start, whereas word recognition is not hindered by repetition.

We propose that listeners continuously monitor spontaneous speech for cues that the speaker is having trouble. After all, speakers monitor their own internal speech prior to articulation, which enables them to make covert repairs; they can also monitor what they say after they say it (Levelt, 1983, 1989) at the same point at which listeners have access to the utterance. Levelt (1983, p. 50) proposed for speakers that "if some mismatch is detected which surpasses certain criteria, the monitor makes the speaker aware of this, or in other words: An alarm signal is sent to working memory. The speaker can then take action on the information received." Similarly, listeners may use a continuous monitoring process to detect problems in the speech they hear. Unlike speakers, they don't have access to the intention behind the utterance; however, they do have access to the surface form and the semantic-pragmatic context. They may be able to use paralinguistic cues such as the ones we have considered as well as detectable inconsistencies with the structure of the ongoing parse to make repairs incrementally. As our data show, the cues in a disfluency can help the listener compensate for its potentially harmful effects (such as any interference, misinterpretation, or delay caused by having heard the unintended information in a reparandum).

A third question is this: What are the effects of other types of disfluencies on comprehension? For this set of studies, we have limited ourselves to looking at the informativeness of mid-word interruptions with and without fillers because we began with Levelt's suggestions and because we were able to collect sufficient tokens of these types (we wanted to avoid creating disfluent utterances by editing). We further limited ourselves to tokens in which the reparandum ended during or immediately after the color word in order to gain as much control as possible. However, speakers often back up further for their repairs, retracing material before the problem word, as in Move to the purp- to the orange square. The current findings would lead us to predict that retracing part of a grammatical phrase before a problem word would establish at least as much of a disfluency advantage as the extra time that elapses during a filler.

Our finding of a disfluency advantage does not suggest that it is better for speakers to be disfluent than fluent. In absolute terms, comprehending Naturally Disfluent utterances in their entirety would not be faster than comprehending Fluent utterances; recall that the response time advantages were relative only to the onset of the target color word. Since this word always occurred later in Naturally Disfluent utterances than in Fluent utterances, and because Fluent utterances had lower error rates overall, fluency is still desirable from a listener's perspective. But certain features of disfluencies do appear to compensate for mishaps in speaking. That is, the earlier the speaker interrupts a reparandum, the better for the listener. And not only is pausing a bit before a repair (in the editing interval) not harmful, but it buys time for the listener to cancel the unintended part of the message. Of course, these findings are not meant to be desiderata for speakers, for we have no evidence that speakers make such choices deliberately.

We began with the observation that, although spontaneous speech contains many disfluencies, most studies of the comprehension of spoken language ignore this fact and use only idealized, constructed, or read utterances. We view the current experiments, along with those of Fox Tree (1993, 1995) and Lickley and Bard (1992, 1998), as early steps in what is largely an uncharted territory: the comprehension of spontaneous speech. Our experiments offer a methodological contribution which, we believe, has the dual benefits of increasing both control and validity for studies of spoken language processing. In addition to basing stimuli on spontaneous (not read or performed) utterances, the approach we have taken combines two additional elements: (1) digital editing to create two kinds of controls that vary disfluency within utterances (in addition to nonedited controls) and (2) a comprehension task that, while logically constrained, approximates something people do in ordinary language use: understanding references to objects.

In conclusion, listeners make fewer commitments to an unintended word in a reparandum when the utterance is interrupted earlier, when there is more time to cancel material in the reparandum, and when the repair word's stress contrasts with the reparandum's. That the information available in disfluencies is useful is consistent with an incremental interpretation process that continually monitors, recognizes, and compensates for flaws in the delivery of spontaneous spoken utterances.

REFERENCES

Bear, J., Dowding, J., & Shriberg, E. (1992). Integrating multiple knowledge sources for detection and correction of repairs in human-computer dialog. In Proceedings, 30th Annual Meeting of the Association for Computational Linguistics, Newark, DE (pp. 56–63).

Blackmer, E. R., & Mitton, J. L. (1991). Theories of monitoring and the timing of repairs in spontaneous speech. Cognition, 39, 173–194.

Bond, Z. S., & Small, L. H. (1984). Detecting and correcting mispronunciations: A note on methodology. Journal of Phonetics, 12, 279–283.

Bortfeld, H., Leon, S. D., Bloom, J. E., Schober, M. F., & Brennan, S. E. (in press). Disfluency rates in spontaneous speech: Effects of age, relationship, topic, role, and gender. Language and Speech.

Brennan, S. E., & Williams, M. (1995). The feeling of another’s knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers. Journal of Memory and Language, 34, 383–398.

Clark, H. H. (1994). Managing problems in speaking. Speech Communication, 15, 243–250.

Clark, H. H. (1996). Using language. Cambridge, UK: Cambridge Univ. Press.

Cooper, W. E., Tye-Murray, N., & Nelson, L. J. (1987). Detection of missing words in spoken text. Journal of Psycholinguistic Research, 16, 233–240.

Core, M. G., & Schubert, L. K. (1999). A model of speech repairs and other disruptions. In Proceedings, AAAI Fall Symposium on Psychological Models of Communication in Collaborative Systems, American Association for Artificial Intelligence, North Falmouth, MA (pp. 48–53).

Cutler, A. (1983). Speakers’ conceptions of the functions of prosody. In A. Cutler & D. R. Ladd (Eds.), Prosody: Models and measurements (pp. 79–91). Berlin: Springer-Verlag.

Dell, G. S. (1986). A spreading activation theory of retrieval in sentence production. Psychological Review, 93, 283–321.

Ferber, R. (1991). Slip of the tongue or slip of the ear? On the perception and transcription of naturalistic slips of the tongue. Journal of Psycholinguistic Research, 20, 105–122.

Fox Tree, J. E. (1993). Comprehension after speech disfluencies. Unpublished doctoral dissertation, Stanford University, Stanford, CA.

Fox Tree, J. E. (1995). The effects of false starts and repetitions on the processing of subsequent words in spontaneous speech. Journal of Memory and Language, 34, 709–738.

Fromkin, V. A. (Ed.) (1973). Speech errors as linguistic evidence. The Hague: Mouton.

Garrett, M. F. (1975). The analysis of sentence production. In G. Bower (Ed.), The psychology of learning and motivation (Vol. 9, pp. 133–177). New York: Academic Press.

Hindle, D. (1983). Deterministic parsing of syntactic non-fluencies. In Proceedings, 21st Annual Meeting of the Association for Computational Linguistics, Cambridge, MA (pp. 123–128).

Hockett, C. F. (1967). Where the tongue slips there slip I. In To honor Roman Jakobson (Vol. 2, pp. 910–936). The Hague: Mouton.

Howell, P., & Young, K. (1991). The use of prosody in highlighting alterations in repairs from unrestricted speech. Quarterly Journal of Experimental Psychology, 43A, 733–758.

Laver, J. D. M. (1973). The detection and correction of slips of the tongue. In V. A. Fromkin (Ed.), Speech errors as linguistic evidence (pp. 132–143). The Hague: Mouton.

Levelt, W. J. M. (1983). Monitoring and self-repair in speech. Cognition, 14, 41–104.

Levelt, W. J. M. (1989). Speaking: From intention to articulation. Cambridge, MA: The MIT Press.

Levelt, W. J. M., & Cutler, A. (1983). Prosodic marking in speech repair. Journal of Semantics, 2, 205–217.

Lickley, R. (1996). Juncture cues to disfluency. In Proceedings, International Conference on Spoken Language Processing (ICSLIP ’96), Philadelphia (pp. 2478–2481).

Lickley, R., & Bard, E. (1992). Processing disfluent speech: Recognising disfluency before lexical access. In Proceedings, International Conference on Spoken Language Processing (ICSLIP ’92), Banff (pp. 935–938).

Lickley, R., & Bard, E. (1996). On not recognizing disfluencies in dialog. In Proceedings, International Conference on Spoken Language Processing (ICSLIP ’96), Philadelphia (pp. 1876–1879).

Lickley, R., & Bard, E. (1998). When can listeners detect disfluency in spontaneous speech? Language and Speech, 41, 203–226.

Marslen-Wilson, W., & Tyler, L. (1981). Central processes in speech understanding. Philosophical Transactions of the Royal Society London, P259, 317–332.


Martin, J. G., & Strange, W. (1968). The perception of hesitation in spontaneous speech. Perception and Psychophysics, 3, 427–438.

Nakatani, C. H., & Hirschberg, J. (1994). A corpus-based study of repair cues in spontaneous speech. Journal of the Acoustical Society of America, 95, 1603–1616.

Nooteboom, S. G. (1980). Speaking and unspeaking: Detection and correction of phonological and lexical errors in spontaneous speech. In V. Fromkin (Ed.), Errors in linguistic performance (pp. 87–95). New York: Academic Press.

Nygaard, L. C., Sommers, M. S., & Pisoni, D. B. (1994). Speech perception as a talker-contingent process. Psychological Science, 5, 42–46.

Oviatt, S. (1995). Predicting spoken disfluencies during human-computer interaction. Computer Speech and Language, 9, 19–35.

O’Shaughnessy, D. (1992). Analysis of false starts in spontaneous speech. In Proceedings, International Conference on Spoken Language Processing (ICSLIP), Banff (pp. 931–934).

Pierrehumbert, J., & Hirschberg, J. (1990). The meaning of intonational contours in the interpretation of discourse. In P. R. Cohen, J. Morgan, & M. E. Pollack (Eds.), Intentions in communication (pp. 271–311). Cambridge, MA: MIT Press.

Schachter, S., Christenfeld, N., Ravina, B., & Bilous, F. (1991). Speech disfluency and the structure of knowledge. Journal of Personality and Social Psychology, 60, 362–367.

Shriberg, E. E. (1999). Phonetic consequences of speech disfluency. Proceedings, International Congress of Phonetic Sciences (San Francisco), 619–622.

Shriberg, E., Bear, J., & Dowding, J. (1992). Automatic detection and correction of repairs in human-computer dialog. In M. Marcus (Ed.), Fifth DARPA Speech and Natural Language Workshop (pp. 419–424). San Mateo, CA: Morgan Kaufmann.

Smith, V. L., & Clark, H. H. (1993). On the course of answering questions. Journal of Memory and Language, 32, 25–38.

Tent, J., & Clark, J. E. (1980). An experimental investigation into the perception of slips of the tongue. Journal of Phonetics, 8, 317–325.

van Wijk, C., & Kempen, G. (1987). A dual system for producing self-repairs in spontaneous speech: Evidence from experimentally elicited corrections. Cognitive Psychology, 19, 403–440.

(Received March 24, 2000) (Revision received July 5, 2000)