

Chapter 31

Speechreading: What’s MISS-ing?

Ruth Campbell

Introduction

In this chapter I hope to demonstrate that watching speech as well as listening to it is very much part of what we do when we look at people, and that observing someone talking can relate to other aspects of face processing. The perception of speaking faces can offer insights into how human communication skills develop, and the role of the face and its actions in relation to those skills.

In developing these themes, I propose that cross-modal processing occurs at a relatively early cognitive processing stage in relation to speech processing and to identifying aspects of the talker. To date, the cognitive processes involved in face processing and voice/speech processing have been considered as essentially separate unimodal systems: faces being processed wholly within a visual stream, speech within an acoustic stream. In considering speech processing, the role of vision — i.e. of seeing speech being spoken — has hardly been considered at all in most models, yet seeing speech impacts on auditory speech processing in a variety of ways in hearing people, as well as being the prime mode of access to speech for deaf people (Campbell, 2008). In relation to face processing, previous models suggest that cross-modal information comes into play only at a relatively late stage of processing, when we form representations of known individuals (e.g. recognizing someone we know from their voice: a processing stage that reflects semantic associations). Here, I propose that multimodal (audio-visual) processing is apparent early in cognitive processing for faces and for speech: a perceptual information processing stage (multimodal indexical speech structure: MISS) operates prior to obtaining information about who someone is or what they are saying. In turn, this suggests that some aspects of face processing can be considered as inherently multimodal: research on seeing speech highlights this assertion.

Preamble: seeing speech is special

Purely in terms of its signal qualities, speech is unlike other perceived face actions. Speech inheres in a continuously changing auditory or audio-visual signal that delivers recognizable language in structured and meaningful segments. Natural speech can be processed efficiently despite variations between or within talkers, and carries semantic import. In viewing a speaking face we attend to what it is saying; we "listen" to the talker. When we watch someone speak we notice small, fast movements of the face. While these include mouth movements they are not limited to those movements — hence "speechreading" rather than "lip-reading" (Campbell, 2006). What do we see? We might catch "ah" being spoken here, "p" or "mm" there. It's these segments, often called "visemes" since they can be readily mapped onto the phonemes of heard speech, that carry the message. The perception of non-speech facial actions, such as detecting and identifying emotional expression, does not require that we identify each small, fast-changing component of the face act over a time-frame of tenths of a second. What is more, the meaning of a spoken message — whether it is heard or speechread — depends crucially on the words that have been perceived to be spoken.

By contrast, meaning gleaned from an expression on a face seems to be holistic and largely determined by the context in which the expression is seen (de Gelder and Van den Stock, Chapter 27; Ambady and Weisbuch, Chapter 24, this volume).

Where researchers have been interested in speechreading, therefore, they have followed a speech processing perspective, and the critical questions are: which components of speech can be read from facial actions? What are the cognitive and cortical processes involved in audio-visual speech perception? Briefly addressing these questions will open up the main part of this chapter, which is to explore speechreading in the context of face processing and, in particular, how seeing speech might relate to other aspects of face processing.

Clearly, fewer speech segments can be distinguished by eye than by ear. Some estimates suggest that only a quarter to a third of the critical speech phonemes are visemically distinctive (Campbell, 2006). However, this is not always a bar to identifying the content of silent seen speech. Not only can context often provide a strong cue (the footballer swearing on the football pitch), but also many spoken words can be identified on the basis of a subset of their identifiable phonemes: that is, many words are over-determined when their constituent phonemes are considered. In this respect, speechreading silent speech is similar to listening to speech that is degraded, e.g. speech heard in noise. We may have had the experience of hearing words emerge from otherwise hard-to-hear speech in noise. Sometimes these are words which one has been primed to recognize (your own name, or a predictable end to a sentence), but another class of words also tends to emerge from the noise. These are words which have a unique segmental structure in relation to all other words: those whose phoneme combinations are unique, so that they could only lead to a single identification. The word "umbrella," for instance, is hard to mistake for any other word, even when some parts of it may be lost in noise when heard, or invisible when seen to be spoken. Words such as "airella" or "ubrenner" do not exist, so the target must be "umbrella." So, as well as sensitivity to the segmental structure of seen speech, successful speechreading requires sensitivity to the statistical structure of speech, insofar as that leads to word identification. Good speechreaders, especially deaf and deafened people who rely on speechreading, make use of this (Auer and Bernstein, 1997), while other verbal abilities, especially verbal working memory skills, can boost speechreading abilities further (Rönnberg, 2003).
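The "over-determined" point can be made concrete with a small sketch. The following toy example is my own illustration, not material from the chapter: the miniature lexicon and pseudo-phoneme spellings are invented. It checks whether a partially perceived phoneme pattern is consistent with more than one word; when only one candidate survives, the word is identified even though several segments were never perceived.

```python
# Toy sketch (illustrative only): why some words are "over-determined".
# A degraded percept is a sequence of phonemes in which some positions
# were lost in noise or were not visible on the lips (None).

LEXICON = {  # hypothetical mini-lexicon with made-up phoneme codes
    "umbrella": ["ah", "m", "b", "r", "eh", "l", "ah"],
    "umbrage":  ["ah", "m", "b", "r", "ih", "jh"],
    "unwell":   ["ah", "n", "w", "eh", "l"],
    "regalia":  ["r", "ih", "g", "ey", "l", "ih", "ah"],
}

def candidates(percept):
    """Return every lexical entry consistent with a partial phoneme percept."""
    matches = []
    for word, phones in LEXICON.items():
        if len(phones) != len(percept):
            continue
        if all(p is None or p == q for p, q in zip(percept, phones)):
            matches.append(word)
    return matches

# Several segments are missing, yet only one lexical candidate survives,
# so the word is identified anyway.
degraded = ["ah", "m", None, "r", None, "l", None]
print(candidates(degraded))   # ['umbrella']
```

The same logic applies whether the surviving segments were heard through noise or speechread from the face: it is the statistical structure of the lexicon, not the completeness of the signal, that does the work.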

Of course, for hearing people, silent speech is not the usual way in which we encounter seen speech. We generally experience speech audio-visually. Audio-visual integration of seen and heard speech occurs readily, and can be demonstrated at all levels of speech structure, from the subphonemic to the phrase (Campbell, 2008, for review). As far as the cortical substrates of seen speech are concerned, while many regions are implicated (and we will encounter these later in the chapter), two findings have been repeatedly demonstrated. Firstly, seen silent speech can activate parts of auditory cortex that had been thought to be dedicated to hearing: these are within Heschl's gyrus in the temporal plane (planum temporale) of the upper part of the lateral temporal lobe (Calvert et al., 1997; Pekkola et al., 2005). This suggests that at least some parts of this region may be specialized for the amodal processing of speech-like signals, rather than complex acoustic signals only. "Hearing by eye" is not just a figurative description of speechreading: "auditory" parts of the brain are distinctively and selectively responsive to seen speech. Secondly, posterior parts of the superior temporal gyrus in the lateral temporal lobes are especially sensitive to audio-visual speech congruence and synchronization, and are thought to play a key role in cross-modal integration of speech (see Figure 31.1).

These well-established findings set the scene for more recent research which focuses less on what the talker is saying and more on the characteristics of the talker herself. These are considered the indexical aspects of speech processing: those idiosyncratic characteristics of the talker which tell us about who they are, rather than what they are saying. One reason for this shift within speech processing research is to try to understand how speech content can be processed irrespective of who speaks it — and, conversely, to identify talker characteristics such as region of origin, age or gender from a sample of speech.

Fig. 31.1 Two cortical regions critically involved in processing seen speech: fMRI activation. (a) Posterior superior temporal regions are especially sensitive to the audio-visual binding of seen and heard speech. This fMRI group analysis shows the region that was activated more strongly when observing synchronized audio-visual speech than non-synchronized audio-visual speech, or auditory or visual speech alone. The cluster of voxels from the group-averaged data, shown in axial and coronal sections, localized to the ventral bank of the left superior temporal sulcus (x = −49, y = −50, z = 9). The images are displayed in radiological convention, so that the left of the image corresponds to the right hemisphere. Reprinted from Calvert, Campbell, and Brammer (2000), Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex, Current Biology, with permission from Elsevier. (b) Watching speech activates primary auditory cortex (a study at 3T). Individual activation patterns within the planum temporale (as determined by structural scanning) when watching silent monosyllables being spoken. Significant activations (Z > 2.3, p < 0.01, corrected) within Heschl's gyri (HG) are shown, overlaid on axial magnetic resonance images. The yellow line outlines Heschl's gyri, the medial parts of which accommodate primary auditory cortex. The statistical maps and the Heschl's gyrus outline are collapsed into a 2D image. The middle column displays coronal, axial, and left sagittal high-resolution MR images of subject 6 with overlaid activations. Reprinted from Pekkola, Ojanen, Autti et al., Primary auditory cortex activation by visual speech: an fMRI study at 3T, NeuroReport, 16:2, with permission from Wolters Kluwer Health.

For visual speech, this last point might sometimes seem trivial — after all, if we can see the talker, we can see if it is a child or an old man. But other questions persist: can regional and non-native accents be discriminated from facial speech actions? Can we reliably match a seen, silent talker (as on a video-clip) to a separate audio record of their speech? Does knowing what the talker's speech patterns look like affect our ability to process heard speech? These questions link readily with the concerns of the researcher in face processing. I will try to show some of those links in the sections that follow. In order to do this, I need to convince you that while few people can speechread efficiently, just about everyone shows some sensitivity to speech that they see, as well as speech that they hear.

Visual speech sensitivity: developmental and comparative evidence

The first place to look for evidence that we take account of what we see when we perceive a talker is in prelingual infants. Here, some very early acquired multimodal processing capacities are apparent. By the age of 8 weeks, hearing babies prefer to watch synchronized audio-visual speech and are disturbed when face and voice seem to slip out of synch (Dodd, 1979). Behavioral sensitivity to the visible shape and heard sound of a (native) vowel can be detected in 18-week-olds (Kuhl and Meltzoff, 1982; Patterson and Werker, 1999). Electrophysiological studies confirm that discriminating (mismatch) responses to a heard vowel followed by a seen facial action can be recorded from the scalps of 10-week-old infants (Bristow et al., 2009).

What function could be served by such multimodal sensitivity to speech in the prelingual infant? One possibility is that seeing what a speech sound looks like can help the infant learn to develop appropriate phoneme categories for their native spoken language. A recent study (Teinonen et al., 2008) presented 6-month-olds with stimuli from a synthesized auditory continuum spanning the heard speech sounds "ba" to "da." Stimulus items in the mid-range of the continuum were ambiguous between the two forms. Adults and older children typically show a categorical response to such ambiguous items and readily classify these ambiguous speech sounds as either "ba" or "da." They show categorical perception for these speech sounds. However, prelingual infants tend to respond randomly to the ambiguous mid-range stimuli, failing to show sharp categorical responses. In the Teinonen et al. study, when the 6-month-old infants were provided with an appropriate visual "anchor" of a seen "ba" and a seen "da" accompanying the relevant tokens within the ambiguous auditory mid-range, they quickly learned the appropriate auditory categorization.

In Teinonen et al.'s study, infants were required to make discriminative responses to auditorily ambiguous tokens, and were helped when they saw "ba" or "da" on the talker's face. What happens when auditory and visual syllables are quite different, and individually clearly perceived, but then combined in a synchronized audio-visual presentation? Depending on the material presented, people may report that they "hear" something that is different from the acoustic stimulus. The most dramatic of these are syllabic illusions involving consonants. For instance, hearing "ba" when "ga" is what appears on the talker's face leads to the perception that "da" was uttered: the experience is of hearing "da," not the actual "ba" (McGurk effects; McGurk and MacDonald, 1976; see Campbell, 2008 for review). McGurk effects can be elicited in prelingual infants from around 20 weeks of age (Rosenblum et al., 1997; Burnham and Dodd, 2004). This suggests that some (multimodal) components of the appropriate phonological categories are already in place in very young infants.

There is much more evidence that the human infant with normal sensory capacities uses both vision and audition to develop an appropriate perceptual repertoire of speech-related information to guide their own developing language competencies in the first year of life (Lewkowicz, 2010). One feature of development in the second half of the first year of life is "perceptual narrowing." While a 6-month-old infant can distinguish between two similar heard speech sounds (e.g. an aspirated "pa," where the "p" requires a plosive, cheek-puffing action and the following vowel is voiced late in the utterance, and a "pa" spoken with an unvoiced "p" — no cheek puff and early voice onset), by the age of 8 to 10 months she will have "tuned in" to her native speech sound categories.

In English, there is no distinction between the two variants of "pa," and by that age the child raised among English speakers loses the ability to distinguish those sounds. An infant reared in a language such as Hindi, where the aspirated and non-aspirated versions of "pa" can distinguish words, does not lose the ability to distinguish between them. So, from an early stage in the child's development when she is sensitive to all or any possible distinctions in the speech that she hears, she soon becomes especially sensitive to the important contrasts within the language that surrounds her, losing some of her previous, more general, discriminatory powers (Werker and Tees, 2005; see Pascalis and Wirth, Chapter 37, this volume, for similar "perceptual narrowing" as a characteristic of the infant's ability to discriminate own- and other-race/species faces).

Pons et al. (2009) reported perceptual narrowing for audio-visual speech processing in the second half of the first year of life. "ba" and "va" are easy to distinguish visually for English speakers — "ba" is produced with both lips touching (bi-labial), while "va" is made with the top teeth touching the bottom lip (labio-dental). In English, these are contrastive gestures: "ban" and "van" are different words. In Spanish, however, the distinction is not contrastive: "baño" may be pronounced with a bi-labial or a labio-dental gesture on the first consonant. The question then is, do Spanish infants show a different pattern of sensitivity to the visual forms of these utterances than English infants, and if so at what age? Pons et al. report that, when infants were trained to match a heard syllable to one which then appeared as a silent, seen speech gesture, the looking preference patterns of both English and Spanish 6-month-olds showed appropriate cross-modal matching for "ba" and for "va." However, while English infants aged eleven months retained this skill, Spanish infants of the same age showed reduced audio-visual discrimination between "ba" and "va."

While these studies clearly show a role for seen speech in early human speech perception, when considered in evolutionary terms these competencies may be based on broader multimodal abilities related to vocalization. The ability to detect correspondences between seen facial actions and heard vocalizations occurs in other primate species. "Coo" and "threat" calls of rhesus monkeys are produced with distinct facial actions and mouth shapes. Ghazanfar and Logothetis (2003) showed that monkeys were sensitive to the congruence between the acoustic signal qualities and those of the face image corresponding to these different calls. Monkeys are also able to match the visual size/age of a monkey "caller" to an appropriate heard vocalization. This is assumed to reflect the mapping of acoustic formant characteristics that deliver the characteristic resonance and pitch qualities of the caller's vocal tract (Ghazanfar et al., 2007). To date, no studies have explored the perceptual bases for face-call matching in non-human primates. For example, the effects of structural changes in the facial image (Parr and Hecht, Chapter 35, this volume), and of "other-species effects" when monkeys view other primate species' faces and calls (Pascalis and Wirth, Chapter 37, this volume), remain to be investigated.

The studies with rhesus monkeys suggest that identifying individuals in the group is one reason for this multimodal sensitivity — in addition to the utility of matching vision to audition in identifying the meaning of a call. It remains to be established whether domesticated animals, which show surprising abilities in processing human faces (sheep: Tate et al., 2006; Kendrick and Feng, Chapter 34, this volume; horses: Stone, 2010) and intentions (dogs: Hare and Tomasello, 2005), are able to match faces to voices, and whether they can distinguish different language users on the basis of their visible speech patterns.

Speechreading facial actions: movement of the face and point-light seen speech

What of normal adult capabilities? First of all, is a full facial image required to process who someone is and what they are saying?

Point-light stimuli, generated by illuminated points distributed on the skin of the face, which retain facial motion characteristics without facial forms or features, can be sufficient to identify familiar faces (Lander et al., 1999). Characteristic facial motion patterns are quickly learned for new faces (Lander and Davies, 2007). Familiar people can also be identified solely from their point-light seen speech patterns (Rosenblum et al., 2007b). Point-light faces can generate McGurk effects and enhance the comprehension of speech in noise (Rosenblum and Saldaña, 1996; Rosenblum et al., 1996). That is, the visible dynamic speech signature, alone, carries sufficient information to enable the identification of speech content and speech carrier. These demonstrations have been made with people who have been exposed to natural speaking faces, and (until it has been shown otherwise) we should assume that it is exposure to natural faces throughout the lifespan that has developed the processing systems and representations that support these skills, and which allow strong correlations to emerge between the patterns of movement of face parts and phonetic classification of heard speech segments (Jiang et al., 2002, 2007).

Speechreading and facial identity processing

While point-light seen speech can be "read" for identity and content, a still image of a face, even if it is just a photographic snapshot, also provides speech cues. In Figure 31.2, we can easily see that Madonna might have been snapped saying "ah" — and not "mm" or "ff." Which cognitive systems are involved in identifying the image in that figure as Madonna, and which are involved in identifying the speech gesture that she seems to be making? Speech processing and face identification were characterized as distinct cognitive modules (Fodor, 1983), with no (low-level) interaction between the two processes. In apparent support of this discrete modular view came clear neuropsychological evidence: two patients were described with doubly dissociated abilities. One could recognize faces, but could not identify any images of seen speech. For instance, in one task the patients were each given a pile of photographs, each of which showed one of six different individuals. The models had been photographed from different viewpoints and posed different speech gestures, such as "oo," "mm," "ff," "th," "sh," "ee." One patient was able to sort the cards into separate piles on the basis of identity, but not on the basis of the speech gesture. That patient had a left (posterior) temporal lesion. The other patient could sort and identify speech patterns from the cards, but not faces. She had a right (posterior) temporal lesion (Campbell et al., 1986). This finding was incorporated into the now-classical model of face processing (Bruce and Young, 1986), which proposed independence of processing streams for the two tasks. Figure 31.2 illustrates the cognitive stages in processing (familiar) faces, as described in the Bruce and Young model. It shows that while face recognition requires access to face recognition units (FRUs), seen speech does not. Similarly, facial expression processing is, on this model, independent of face identification. 1

Not all people with face recognition problems show spared speechreading abilities. Some people have been identified with impaired face recognition abilities and lessened sensitivity to visual influence in speech perception — that is, with associated deficits in each domain (e.g. de Haan and Campbell, 1991; de Gelder et al., 1998). The impairments shown by these patients, who showed no other visual deficits, could be attributed to structural encoding problems specific to faces — whatever the task. This is consistent with the Bruce and Young model and suggests, moreover, that speechreading requires some special face-related processing — that is, it appears to be a skill which falls within the face-processing domain, rather than something extrinsic to it.

1 But see Calder, Chapter 22, this volume, for another perspective.

[Figure 31.2 diagram: parallel routes from visual processing through structural encoding (view-centred and expression-independent descriptions) to expression analysis, facial speech analysis, directed visual processing, and face recognition units (FRU); an auditory route through auditory processing and structural encoding for speech processing to voice recognition units (VRU); both recognition routes converge on person identity nodes (PIN), the cognitive system, and name generation.]

Fig. 31.2 Modeling the cognitive stages in face recognition and in face processing: independent modality-specific routes. This figure replicates Bruce and Young's (1986) model of face processing, showing independent visual processing streams for watching speech, expression processing, and face recognition. A similar modality-specific model has been proposed for voice recognition (Belin et al., 2004); parts of it are sketched here. The two (unimodal) recognition systems are assumed to interact at the PIN level.

More problematic for the model are findings suggesting that speechreading may be sensitive to identity-related processing. Walker et al. (1995) showed that sensitivity to audio-visual fusions was affected by personal familiarity with the talker. They used McGurk material (see "ga" and hear "ba": think-you-hear "da"), and found that the influence of vision on audition was greatest when the face of the talker was unfamiliar. For familiar faces, dubbed to an unfamiliar voice, McGurk effects were much reduced. In the Bruce and Young model, the processing of a facial identity (FRU) and the identification of the person (multimodal person identification node: PIN) reflect discrete processing stages, and the model assumes that speechreading reflects processes at the structural encoding stage, independent of familiarity with the face or the person. If this were the case then every face, whether familiar or unfamiliar, should generate similar patterns of McGurk sensitivity when dubbed to an (unfamiliar) voice. Since the familiarity of the face affected McGurk sensitivity, we must assume at least some "backflow" of information about a familiar person (PIN-level) to structural processing stages to account for the Walker et al. finding. This is not an isolated example of effects of talker familiarity in processing speech from faces. Using still photographs and a repetition priming paradigm, Campbell et al. (1996) found face-identity judgments were speeded when that individual had been seen earlier in a speech-matching phase of the study: that is, when respondents had become familiarized with the faces in the course of the experiment.

Schweinberger and Soukup (1998) used a Garner (1974) interference paradigm, systematically changing possible sources of variation in the face image when sequences of images were presented for speeded speech gesture classification. Each image was of a face producing a different vowel — such as "ee," "oo," or "ah." Thus, in one phase of the study a single face was seen throughout, while in another phase different faces appeared. Varying the identity/familiarity of the face affected the speed of identifying the seen vowel shape. Also, matching face images on the basis of the vowel shape was faster for personally familiar than unfamiliar faces.

From Schweinberger and Soukup's study, two effects related to familiarity appear to be important: familiarity with a person helps identify their speech pattern, but it is also easier to identify the speech pattern when the face is the same from trial to trial, even when the face is unfamiliar. It is most efficient when one talker is seen throughout a series of presentations (priming), least efficient when many different facial instances of a talker are seen (interference). Similar effects have also been demonstrated in studies using more naturalistic material than photographs of speech. Several studies have shown a processing cost to speechreading words or connected speech when multiple talkers rather than a single talker are seen (Yakel et al., 2000; Kaiser et al., 2003; Kaufmann and Schweinberger, 2005). These findings confirm clinical reports that people with hearing loss may find it easier to speechread just one person, and have more difficulty in speechreading when different talkers speak. Observers tune in to the speech patterns of a model, so that switching between talkers carries a processing cost. These data echo findings for heard speech. It is harder to understand heard words (for example, in a lexical decision task where the task is to respond to each spoken item) when they are spoken by different (unfamiliar) talkers compared with a single talker (Nygaard et al., 1994). Indexical properties (who is talking) affect seen speech processing just as they affect heard speech processing.

All these studies suggest that whether we consider faces, voices or speech content, processing is more sensitive to influences related to discriminating or identifying the talker than is acknowledged in many cognitive models. One way to accommodate these influences is to introduce or emphasize top-down effects. Thus, activation at the supramodal PIN level can in turn influence FRU or VRU activation, which in turn influences processes "downstream" within structural encoding itself (Figure 31.2). However, as we will see, this may not be sufficient to account for all phenomena related to how we match talkers from their face or voice.

Indexical properties of the voice: identifying the talker across different modalities and matching face to voice

We can often identify a familiar person from their voice — though not usually as accurately or automatically as from the face (Belin et al., 2004). As Figure 31.2 shows, in the Bruce and Young (1986) model, the person identification system (PIN) describes the level at which all information relevant to identifying a familiar person may be represented and accessed. It is at this level that semantic knowledge about a person (their name, their profession and so forth) is associated with their facial appearance, and also their voice, gait and other identifying aspects of the individual person. It is also at the PIN level that multimodal information is accessible — for example, matching face and voice for a known person. However, empirical research has only recently explored the implications of this part of the model. For example, can we classify a voice as familiar or unfamiliar more efficiently when it is accompanied by the "right" face, and what if the "wrong" face accompanies the voice?

Schweinberger et al. (2007) presented familiar and unfamiliar voices to be discriminated. The voice, speaking a standard sentence, was synchronized to the face of the corresponding talker, or to another talking face, which could be familiar or unfamiliar.

For familiar voices, the presence of the corresponding talking face improved familiarity decisions, while a non-corresponding talking face impaired them, when compared with voice-alone decisions. Thus, making a decision about identity (indexical processing) from the voice is affected by seeing the face. This outcome occurred only when the faces were shown as dynamic, natural talking sequences: photographs of talkers had no effect on voice familiarity judgment (Schweinberger et al., 2007). Further evidence that the dynamic signature of the seen talker is crucial comes from a study by Rosenblum and colleagues. Rosenblum et al. (2007b) showed that familiar people can be identified from point-light displays of the face speaking. In a separate study, this team showed that one hour of (full) video familiarization with a previously unfamiliar talker enabled accurate matching of (natural) silent speech to voice (Rosenblum et al., 2007a; and see von Kriegstein et al., 2008). What all these studies suggest is that there appears to be a reasonably robust dynamic signature for speech that can discriminate talkers and which is available by eye as well as by ear. These studies add to those which demonstrate that the movement characteristics of a known person's face are stored as part of the representation of that individual, and that the dynamic signature of facial expressiveness and movement can be sufficient to identify familiar people (Lander et al., 1999, 2004; O'Toole et al., 2002).

But what of unfamiliar talkers? In Schweinberger et al.'s (2007) study, only weak improvements in unfamiliar voice identification occurred when watching the "correct" (unfamiliar) talker's face, and the interference effects of splicing such voices to non-corresponding faces were slight and non-significant. However, using different methodologies, cross-modal identity matches can be demonstrated for unfamiliar people. Typically, the perceiver first experiences the talker in just one modality and then tries to match that target talker when she is presented among distractors in the other modality. For example, if one hears a voice uttering a sentence and then sees silent videotapes of two unfamiliar talkers of similar gender, age and language background, identification of the talker is better than chance — even when the utterance is different for presentation and match conditions (Kamachi et al., 2003). The important clues that support this matching skill reside in the dynamic signature of the talker's speech, since matching a voice to a photograph, or to a backward-running videotape of the talker, was no better than chance (Kamachi et al., 2003; Lachs and Pisoni, 2004a). Further studies established that such matching was most effective when the acoustic signal preserved all the spectral and temporal formant frequency information (although some matching was possible with reduced signal quality; see Lachs and Pisoni, 2004b), and confirmed the importance of the dynamic visual properties of the talker (i.e. by point-light displays — Lachs and Pisoni, 2004c; Rosenblum et al., 2006).

The role of dynamic information in learning to identify an unfamiliar face is nugatory (Christie and Bruce, 1998), so how can people match voice and face for individuals whom they do not know? Neither simple talking speed nor extent of movement of the articulators is a reliable cue (Lander et al., 2007; Rosenblum et al., 2006). While structural aspects of natural voice quality (pitch and timbre) can provide cues to the size of the talker's vocal tract, these are usually insufficient for forensic identification purposes (Rose, 2002). Indeed, natural voice quality may not always be needed to identify a talker, as sine-wave speech experiments have shown (see Remez et al., 1997; Sheffert et al., 2002). Talker-specific information is to be found in a combination of cues, some of which are linguistic. For example, phonetic variation in the way that vowels are produced characterizes many regional accents and idiosyncratic speech patterns, available to the hearer as variation in the acoustic formant frequency patterns (Mullenix et al., 1989). Variations in how a word is spoken can also be seen as changes in mouth shape or other visible aspects of speech (McGrath et al., 1984; Green, 1998). Because of this, it is possible to distinguish regional accents simply from watching the talker (Irwin et al., 2007).


Language-specific aspects of speechreading

Cross-modal matching for talker identity appears to be most effective when (auditory) words can be recognized (Lachs and Pisoni, 2004b). This is probably because representations of known spoken words are sufficiently phonetically detailed to allow idiosyncratic deviations from the norm to be readily detected and identified. In turn, this suggests that familiarity with the spoken language is important in processing talker variations. How are processes involving seen speech engaged when the language is unfamiliar and linguistic processing cannot proceed automatically? The language of the talker and the expectations of the viewer about the language being spoken affect susceptibility to visual influences and speechreading. McGurk effects, which are robust for North American users of spoken English, need not be so for users of another language, even when tested in their own language. For example, users of Japanese or Chinese show lowered susceptibility to the effects of vision on audition (Sekiyama and Tohkura, 1991; Sekiyama, 1994, 1997). It has been suggested that both phonological and cultural factors are important here (Sekiyama, 1994, 1997). Intriguing effects of language emerge when auditory speech is masked by noise and the participant is required to detect when speech has occurred. Under these conditions, seeing the talker improves detection when speech is in a familiar language, but can impair it when the language is unfamiliar (Kim and Davis, 2003).

However, we are not "blind" to seen speech in an unfamiliar language. Indeed, being able to see the talker can help the novice language learner learn distinctions that may be hard to make by ear alone (Hazan et al., 2006; Navarra and Soto, 2007). Distinguishing between languages can also be done by eye, though language familiarity affects its accuracy. Spanish-Catalan bilinguals were able to distinguish these two familiar languages by eye alone, but the task was impossible for English or Italian people who knew neither language, while Spanish monolinguals could perform the discrimination, but less accurately than the bilinguals (Soto-Faraco et al., 2007).

Moreover, the skill of distinguishing two languages by eye is evident prelingually in infants raised in bilingual homes. At 8 months old, bilingual (French–English) infants can distinguish talkers of the different languages when these are shown speaking silently (Weikum et al., 2007). We have already noted that monolingual infants "tune in" to their native language "by eye" in the second half of the first year of life, losing some of their earlier sensitivity in matching seen and heard utterances that may not be phonologically salient in the language to which they are exposed (Pons et al., 2009). It seems that bilingual children maintain sensitivity to auditory-visual speech utterances, which they then apply discriminatively to the talkers and languages that surround them. The representations (phonological, lexical) of familiar languages, and of languages that the child is learning, find their way into processes that require or reflect visual speech processing.

All these recent studies are moving us towards more specific characterizations of the indexical signature for (seen) speech. Language-specific and talker-specific processes are implicated, and arise early in life. Recognizing individual voice characteristics (i.e. for matching to a face) is most accurate when the language or languages are familiar. However, the actual cues used by observers to match seen speech to the talker, and when processing audio-visual combinations, are still under-researched, although the dynamic signature of the talking face appears to be important. Unfamiliar, as well as familiar, faces can be matched to their voices — not perfectly, but better than chance. What do these recent studies suggest about the cognitive structures involved in identifying a talker from their visible and acoustic speech patterns?

Multimodal indexical speech structure (MISS)

Taken together, these findings suggest a model such as that proposed in Figure 31.3.

[Figure 31.3 diagram: vision and hearing each feed structural encoding; the visual route leads to face recognition units (FRU) and the auditory route to voice recognition units (VRU); a multimodal indexical speech structure component and a content-based speech structure component sit between the two routes, which converge on person identity nodes (PIN) and semantic systems.]
Fig. 31.3 MISS: a multimodal indexical speech structure system. This figure shows the relationship of MISS (see text) with content-based normative speech processing and with sensory-specific and multimodal familiarity detection systems: face recognition units (FRU), voice recognition units (VRU), and person identification nodes (PIN) (Bruce and Young, 1986; and see Figure 31.2). Connector lines are black where they are salient to indexical processing, blue in relation to content processing and to interactions between content and indexical processing.

In distinction to other models (see Figure 31.2, and, e.g., Belin et al., 2004; Campanella and Belin, 2007; von Kriegstein et al., 2008), which posit separate visual and auditory structural analysis systems leading to modality-specific voice (VRU, voice recognition units) and face (FRU) activation, a multimodal indexical speech structure encoding component (MISS) takes inputs from both visual and auditory systems in order to derive a multimodal speech-based representation. The contribution of each sensory modality to the structural analysis of seen-and-heard speech will differ depending on speechreading sensitivity and hearing status (see below). One function of MISS is to deliver a coherent signature of the potential identifying characteristics of any talker. MISS is conceived to work interactively with structural analysis for speech content, allowing content and carrier information to influence each other. 2
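To give a concrete, if crude, picture of what such a shared encoding stage could do, the following toy sketch is my own illustration and not an implementation of the model: the feature choices, function names and matching rule are invented assumptions. It projects a visual dynamic signature (lip aperture over time) and candidate auditory signatures (amplitude envelopes) into the same simple feature space and matches an unfamiliar seen talker to a voice by proximity, loosely echoing the unfamiliar-talker matching paradigm described earlier.

```python
# Purely illustrative toy: MISS pictured as a shared indexical space into
# which both modalities project, so an unfamiliar seen talker can be
# matched to a voice without first identifying the face, voice or words.
import numpy as np

def project_visual(lip_aperture):
    """Map a lip-aperture time series to two shared indexical features:
    how fast the mouth moves, and how far it moves."""
    x = np.asarray(lip_aperture, dtype=float)
    return np.array([np.mean(np.abs(np.diff(x))), np.std(x)])

def project_auditory(amplitude_envelope):
    """Map an acoustic amplitude envelope to the same two features."""
    x = np.asarray(amplitude_envelope, dtype=float)
    return np.array([np.mean(np.abs(np.diff(x))), np.std(x)])

def match_face_to_voice(seen_lip_aperture, candidate_envelopes):
    """Nearest-neighbour cross-modal match in the shared feature space."""
    target = project_visual(seen_lip_aperture)
    dists = [np.linalg.norm(target - project_auditory(env))
             for env in candidate_envelopes]
    return int(np.argmin(dists))

# Example: a silently seen talker with a ~3 syllables/s rhythm should match
# the voice whose envelope carries a similar rhythm.
t = np.linspace(0, 1, 100)
seen = np.abs(np.sin(2 * np.pi * 3 * t))
voices = [0.9 * np.abs(np.sin(2 * np.pi * 3 * t)),   # talker A: similar rhythm
          np.abs(np.sin(2 * np.pi * 6 * t))]          # talker B: much faster
print(match_face_to_voice(seen, voices))              # 0 (talker A)
```

In the sketch, cross-modal matching needs no prior identification of the face, the voice or the words, which is the property the MISS proposal emphasizes.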

Although not indicated in the figure, MISS is also likely to accommodate inputs from action and somatosensory systems. Some theorists suggest that it is the articulatory plan for making a speech gesture, not its auditory or visual outcome, that constitutes the primary speech representation (motor theory of speech perception: Liberman et al., 1967). Articulatory motor plans are "mirrored" in one's own speech as imitative speaking styles (Kerzel and Bekkering, 2000; Sams et al., 2005).

2 The intention of Figure 31.3 is to highlight how the indexical part of a multimodal speech processing system may operate in relation to the components of a face processing system. As indicated in the previous sections, language (speech content) can affect indexical processing. That is, a single speech encoding system encompasses both indexical and normative content cue information. Both indexical and content analysis can proceed from a single cue or set of cues, with processing being distributed depending on task requirements.

From all these inputs, features characterizing idiosyncratic speech patterns can be derived. Activation within MISS allows unfamiliar voices (auditory input) and face actions (visual input) to be matched via their distinctive (unimodal) components. However, MISS also receives "top-down" activation from familiarity detection systems within the unimodal visual (FRU) and auditory (VRU) systems, both directly and via the amodal PIN. This allows knowledge of a known person's face or voice to constrain and inform the processing of new input (i.e. detecting a mismatched (known) face and voice), and accounts for the findings that familiarity with the individual makes face-voice matching more efficient. When, for example, a child hears the voice of someone who is out of view, and two people then come into view, MISS processing will have been involved in identifying the voice and matching the face to it. She will be alerted to the "face that owns the voice," whether that is a familiar or an unfamiliar one. Matching unknown faces to voices has forensic import, too: audiotapes of individuals may need to be matched to (silent) CCTV video footage.

One reason for proposing MISS, rather than distinct modality-specific structural processing systems, is that acoustic, visible and articulatory characteristics are tightly correlated (Munhall and Vatikiotis-Bateson, 1998; Yehia et al., 1998, 2002; Jiang et al., 2002; Davis and Kim, 2004). The affordances of the dynamic signature of speech across the different signal channels appear to be a primary aspect of speech processing, and operate efficiently at the segmental (consonant-vowel) level (Jiang et al., 2007). Given these correspondences, researchers are increasingly suggesting that conceptualizing speech processing as intrinsically multimodal may give more powerful insights than models that focus exclusively on the auditory pathway for speech (see Ghazanfar and Schroeder, 2006). Some other structural features are also apparent when speech is conceptualized as multimodal. For example, the relatively constant characteristics of voice such as timbre and pitch, which vary with the size and shape of the vocal tract, can be mapped onto visible indicators of the size and shape of the talker, such as their gender or age. Gender matching of face and voice is apparent in prelingual infants (Patterson and Werker, 2002). These mappings are unlikely to depend on learned knowledge of faces and speaking styles and may, rather, reflect more general tendencies reflecting sensitivity to the acoustic properties of visible objects (small things tend to make high-pitched, low-energy sounds; large things low-pitched, high-energy sounds). All these features are more readily accommodated in a model that proposes a single multimodal structural system in relation to speech processing, rather than separate systems interacting only at a relatively late processing stage.
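As a rough illustration of what "tightly correlated" means in practice, consider the simplest possible measurement: the visible opening of the mouth and the acoustic amplitude envelope of the same utterance rise and fall together at the syllabic rate, so a correlation between the two traces is high. The sketch below uses synthetic signals of my own devising, not data from the studies cited above.

```python
# Toy illustration (synthetic data): correlating a mouth-opening trace with
# the acoustic amplitude envelope of the "same" utterance.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2.0, 200)                       # 2 s of "speech" at 100 frames/s

# Mouth opening rises and falls with each syllable (about 4 syllables/s).
mouth_opening = np.clip(np.sin(2 * np.pi * 4 * t), 0, None)

# The acoustic envelope tracks the same syllabic rhythm, slightly lagged and noisy.
envelope = np.roll(mouth_opening, 3) + 0.2 * rng.standard_normal(t.size)

r = np.corrcoef(mouth_opening, envelope)[0, 1]
print(f"correlation between lip aperture and acoustic envelope: r = {r:.2f}")
```

Real studies of this kind work with measured articulatory, optical and acoustic parameters rather than idealized traces, but the logic of relating the signal channels is the same.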

MISS cannot capture all aspects of the structural processing of speech/talker. Some remain stubbornly unimodal. Auditory adaptation effects demonstrate this. These occur when the repeated auditory presentation of a heard syllable shifts the categorization of an ambiguous syllable away from the direction of the "adapted" sound. Such effects are unaffected by the sight of the talker, whether for speech content (i.e. syllable category shift for McGurk stimuli; Roberts and Summerfield, 1981; Saldaña and Rosenblum, 1994) or voice identification (Schweinberger et al., 2008).

While audio-visual speech processing may rest on multimodal processing skills, this does not mean that audio-visual speech processing is completely "automatic." Attentional processes are implicated in audio-visual speech processing, even for such apparently effortless perceptions as McGurk effects (Alsius et al., 2005). One aspect of how attention is allocated across modalities is that, normally, temporal synchronization of what is seen and what is heard leads to the perception of a unitary event. A recent neuropsychological case study suggests what happens when this "binding function" is missing (Hamilton et al., 2006). This patient, in distinction to most people, was worse at reporting audio-visual than auditory-alone speech and showed reduced susceptibility to McGurk illusions. The patient had bilateral parietal lesions, suggesting that parietal function needs to be intact to effect the seamless and apparently automatic perception that characterizes natural audio-visual speech processing (de Gelder and Bertelson, 2003). Input from different cortical systems needs to be effectively coordinated in order for MISS to function effectively.

Logically, the accurate identification of a spoken utterance needs to take account of indexical features such as accent or style of speech, embedded in language knowledge. Empirically, the auditory processing of speech (e.g. lexical decision) is more efficient when a single talker, rather than several talkers, is heard (Nygaard et al., 1994). For seen and heard speech, the advantages of a single talker over multiple talkers occur when training is in one modality and testing in the other (e.g. Rosenblum et al., 2007a; Lander and Davies, 2008). Here MISS is hypothesized to provide a means for "setting" speech content analysis to the perceived speech parameters of the talker. Memorial functions (VRU, FRU and PIN) then come into play when observers are exposed to specific talkers. Training participants to recognize voices paired with talkers' facial actions can enhance the later recognition of those voices when they are heard without facial actions (Sheffert and Olson, 2004; von Kriegstein et al., 2006, 2008).

To summarize: a multimodal structural processing stage is proposed which would allow visual and auditory speech (and possibly inputs from somesthetic and action systems too) to start to be analyzed prior to the identification of heard words or of talkers' faces. Various research results point to the possibility of this structural multimodal stage. That is, structural aspects of face processing and structural aspects of speech processing may, independently, input into MISS. However, it is possible that MISS also enjoys more direct multimodal activation. The strongly correlated auditory (acoustic formants) and visual (face movement) consequences of speaking suggest more powerful multimodal mechanisms that detect correspondences between visual and auditory signal streams, whether or not face processing or voice analysis has been undertaken. This is an empirical question that awaits resolution.

Neuropsychological correlates: carrier and content in speech processing, a MISSing piece?

In addition to its cross-modal properties, MISS, as proposed here, makes a distinction between identifying a talker across modalities (matching a talker's voice to their face) and identifying the content of speech (e.g. from noisy audio-visual speech). That is, carrier and content may be distinguishable at this level. If a double dissociation were to be reported between content and carrier processing, whether for familiar or unfamiliar speech, that would constitute a telling line of evidence in support of MISS. Generally, of course, it is easier to identify a particular talker by face than by voice, and to identify a particular word by ear rather than by eye: modality itself offers asymmetric cues to the carrier and content of speech. Nevertheless, we have seen that neuropsychological dissociations between carrier and content can be identified in the visual modality as dissociated skills for speechreading and for identifying faces or facial expressions (see Campbell et al., 1990). Is there neuropsychological evidence for a dissociation between content and carrier for speech? This has not yet been demonstrated in the multimodal domain, but several studies suggest a neuropsychological dissociation in auditory speech processing between content and voice identification, with right hemisphere mechanisms implicated in voice recognition and left hemisphere ones in content (Assal et al., 1976; Landis et al., 1982; Kreiman and Van Lancker, 1988; see Belin, 2006, for further discussion). Neuroimaging research with intact respondents supports the distinction: voice identification usually activates right-hemisphere regions to a greater extent, while speech-content identification shows greater left-sided activation (Belin and Zatorre, 2003; von Kriegstein et al., 2003). Future research could usefully address these questions multimodally. "Temporary lesion" techniques such as repetitive transcranial magnetic stimulation (Pitcher et al., Chapter 19, this volume) could be used to explore the contribution of localized cortical systems to natural speech processing, and would rule out explanations of anomalous perception in patients, where compensatory and long-term plasticity mechanisms may obscure interpretation.

Neuroimaging studies of speechreading

I have already mentioned two important regions involved in observing speaking faces. These are both in the superior parts of the temporal lobe and can include "auditory cortex" within Heschl's gyrus in the planum temporale, as well as multimodal regions lateral and posterior to it (see Figure 31.1). But, of course, these are not the only regions involved. Regions sensitive to face processing, including occipital, occipitotemporal and inferior temporal regions, are also activated when watching a speaking face, but watching speech implicates other regions too. These include inferior frontal, insular and inferior parietal regions, and even the auditory brainstem. These may show differently lateralized sensitivity depending on task. It should also be noted that the processing of moving speaking faces and the processing of stilled facial speech forms can dissociate, following a "dorsal" and a "ventral" route, respectively (Calvert and Campbell, 2003; Campbell, 2008).

A superior temporal “hub” for multimodal processing of carrier and content?

The region comprising mid to posterior parts of the superior temporal gyrus and the posterior superior temporal sulcus inferior to it, bordering the middle temporal gyrus (pSTG/STS), is reliably activated in most studies of speechreading and audio-visual speech processing (e.g. Calvert et al., 1997; Möttönen et al., 2002; Wright et al., 2003; Capek et al., 2004, 2008; Hall et al., 2005; Murase et al., 2008; Stevenson and James, 2009). Posteriorly, this extends to the temporoparietal junction around the supramarginal gyrus (see Figure 31.1a). This highly multimodal region receives projections not only from auditory, visual and somatosensory cortical regions, but also from motor and action planning regions and more medial structures, too. What is its role in silent speechreading? To answer this question we need a model of information flow through the cortical areas involved.

In a simple feedforward model, seeing speech first activates visual regions of the occipital and inferior temporal lobes, then projects to multimodal p-STS — which can then generate activation within more “purely” auditory speech areas such as primary and secondary auditory cortex via back-projection. Depending on the task, activation may also flow into frontal regions (e.g. when a speech gesture has to be repeated). A MEG study (Nishitani and Hari, 2002) provided a close fit to this simple model. That is, on this model, the role of p-STS is to serve as a multimodal hub for processing speech, allowing projections both to higher order centers related to speech comprehension and production and also in the reverse direction to “sensory-specific” regions corresponding to the modality other than that in which the signal was experienced, as well as to the regions in which the signal was experienced.
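As a purely schematic illustration of this verbal model (a sketch, not a quantitative or fitted account), the proposed flow can be written down as a small directed graph and traversed to list the order in which regions would become active from a visual speech input; the region labels below simply paraphrase the text.

```python
# Schematic of the simple feedforward account of silent speechreading:
# visual cortex -> p-STS hub -> auditory cortex (back-projection), with
# optional, task-dependent flow on to frontal regions. Region names and
# ordering follow the verbal model in the text; this is not a neural model.
from collections import deque

FLOW = {
    "occipital/inferior-temporal (visual)": ["p-STS (multimodal hub)"],
    "p-STS (multimodal hub)": [
        "auditory cortex (back-projection)",
        "inferior frontal (task-dependent)",
    ],
    "auditory cortex (back-projection)": [],
    "inferior frontal (task-dependent)": [],
}

def activation_order(start):
    """Breadth-first traversal: the order in which regions are reached."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        region = queue.popleft()
        order.append(region)
        for nxt in FLOW[region]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(activation_order("occipital/inferior-temporal (visual)"))
```

On this sketch, auditory cortex is reached only after the p-STS hub; the timing data discussed later in the chapter do not always respect that ordering.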

It is important to note that p-STS specialization is not restricted to audio-visual and visual speech: it appears to be much more broadly based. Functionally, p-STS has been implicated in a broad range of processes relating to effective communication and intention processing (Brothers, 1990; Hein and Knight, 2008). So, as far as audio-visual and visual speech are concerned, the role of p-STS may reflect the convergence of at least three distinct functional specializations:

Firstly, p-STS is especially sensitive to the perception of moving faces and their parts (e.g. Puce et al., 1998; Allison et al., 2000; Haxby et al., 2002; Santi et al., 2003, and see Calder, Chapter 22, this volume). Secondly, p-STS/STG is implicated in auditory speech processing, where its preference for speech-like over non-speech-like, but similarly complex, acoustic patterns has been amply demonstrated (e.g. Rauschecker, 1997; Scott and Johnsrude, 2003; Binder et al., 2004). Finally, these regions are highly sensitive to the processing of perceived intentional actions and dispositions of others, however delivered (e.g. Saxe and Kanwisher, 2003; Pelphrey et al., 2005; Ghazanfar and Schroeder, 2006; Redcay, 2008).

Although p-STS is implicated in speech-content processing “by eye,” its involvement in the identification of individuals from their facial movement patterns (i.e. indexical speech processing from visual displays such as point-light displays, where face features are absent) has not yet been explored. STG/STS must be a “prime suspect” for a key role in indexical processing of seen speech, given its involvement in identifying speech from point-light displays (Santi et al., 2003) and in auditory voice processing (Belin, 2006).

The extensive connectivity of p-STS with many other cortical regions underlines its role as a key cortical site for the cross-modal integration of auditory and visual speech, including projections to ostensibly sensory-specific regions (back-projection). Thus just listening to a familiar person’s voice can generate activation in the fusiform gyrus, a region within the inferior temporal lobe specialized for processing faces. The fusiform gyrus is usually activated when looking at a face, but not when listening to sounds that are not like human speech. When listening to someone talk, activation in the fusiform gyrus is correlated with p-STS activation (von Kriegstein et al., 2003, 2008). Similarly, just watching someone speak (silent speech observation) leads to activation in (primary) auditory cortex, as we have seen (Figure 31.1; Calvert et al., 1997; Pekkola et al., 2005). Activation in auditory cortex by silent speech is also dependent on p-STS activation (e.g. Calvert et al., 2000). Watching silent speech can even lead to neural potentials being recorded in the auditory brainstem, a subcortical, early “auditory” processing region (Musacchia et al., 2006, 2007). Again, it is likely that this reflects back-projection and is associated with p-STS activation.

Because p-STS has been implicated in most of these patterns, and because it is in p-STS that effects of seeing the talker on hearing speech can be robustly and reliably observed (Reale et al., 2007), it may follow that its role in silent speechreading is secondary to audio-visual speech processing and the perception of a unified audio-visual speech event (the “binding function”: see Calvert et al., 2000; Miller and D’Esposito, 2005). However, it should be noted that people who are deaf from birth show greater activation in p-STS than hearing people when they are speechreading. At the very least, this suggests that acoustic specification of voice does not have to be experienced in order for p-STS to show distinctive activation for silent speechreading.3

While p-STS may play a key role in speechreading and in audio-visual speech processing, some studies show no activation in p-STS, with key sites distributed in other regions, for both audio-visual (Bernstein et al., 2008) and visual (Campbell, 2008) speech observation. However, these demonstrations typically use special experimental material or require special responses which may not be relevant to natural speechreading. That is, they suggest that regions other than p-STS may be implicated in some distinctive aspects of seen speech analysis.

3 Elsewhere in this handbook, it is proposed that p-STS supports another binding function: that of the (wholly visual) dynamic and non-dynamic aspects of facial expression perception (Calder, Chapter 22, this volume). A similar proposal has been made with respect to the integration of dynamic and non-dynamic visual information for processing silent speech (Calvert and Campbell, 2003 ; Campbell, 2006 ). This may underlie its special role in seen speech processing, especially in deaf people.


More critically, in relation to the simple feedforward model for silent and audio-visual speech processing, the timing of the activations in sensory and supramodal regions does not always fit the proposal that p-STS is the critical processing hub. For example, we would not expect silent speechreading to activate secondary auditory regions of the superior temporal lobe prior to activation in p-STS: yet this has been reported in independent studies (Möttönen et al., 2004; Besle et al., 2008), and is discussed further below.

What of the distinction between carrier and content which motivated much of the discussion in this chapter? While posterior superior temporal regions can show distinctively lateralized sensitivity to carrier (right hemisphere) and content (bilateral/left hemisphere) for auditory speech (e.g. Belin and Zatorre, 2003), a similar pattern has not yet been shown for seen speech where, to date, studies have used a single talker and so are mute with respect to this issue. An anterior/posterior gradient along the superior temporal gyrus may reflect further functional specialization in relation to content/carrier. More anterior parts of STS/STG can be associated with perception of intentions (Hein and Knight, 2008), and with the extraction of meaning from heard speech (left anterior temporal regions: Scott et al., 2006). A further role for anterior parts of the (right) superior temporal gyrus is in identifying known conspecifics by voice (Belin, 2006; Petkov et al., 2008).

The role of the right anterior superior temporal gyrus in distinguishing different talkers was confirmed in a further neuroimaging study with humans. Formisano et al. (2008) used a multivariate statistical learning algorithm to “train” the discrimination of voice (three auditory identities) and content (three auditory vowels) from multi-voxel fMRI activation patterns within primary auditory cortex, and then examined which further temporal regions were associated with these “labeled-and-learned” discriminations. In the first place, the algorithm distinguished voice and content from the multi-voxel activation patterns within primary auditory cortex — and generalized accurately to new samples of talkers and content. The extended temporal regions associated with each of these patterns were distinctive. Right anterior temporal regions were implicated in voice recognition, while content discrimination was associated with bilateral activation along the length of the superior temporal gyrus.
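The logic of this decoding approach can be made concrete with a minimal sketch. The code below is not the published analysis pipeline: it simulates multi-voxel patterns in which weak talker and vowel signatures are mixed with noise, then asks whether cross-validated linear classifiers can recover “who” and “what” from the same patterns. All dimensions, signal strengths and labels are invented for illustration.

```python
# Illustrative sketch of pattern-based decoding of "who" vs. "what" from
# multi-voxel activity (after the logic of Formisano et al., 2008).
# All data are simulated; this is not the published analysis.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_voxels, n_trials_per_cell = 200, 20
talkers, vowels = [0, 1, 2], [0, 1, 2]        # three talkers x three vowels

# Each talker and each vowel gets its own (weak) spatial signature.
talker_maps = rng.normal(0, 1, (3, n_voxels))
vowel_maps = rng.normal(0, 1, (3, n_voxels))

X, y_talker, y_vowel = [], [], []
for t in talkers:
    for v in vowels:
        for _ in range(n_trials_per_cell):
            pattern = (0.5 * talker_maps[t] + 0.5 * vowel_maps[v]
                       + rng.normal(0, 2.0, n_voxels))   # trial noise dominates
            X.append(pattern)
            y_talker.append(t)
            y_vowel.append(v)
X = np.array(X)

# Decode talker identity ("carrier") and vowel identity ("content")
# from the same voxel patterns, using cross-validated linear classifiers.
for label, y in [("talker", y_talker), ("vowel", y_vowel)]:
    acc = cross_val_score(LinearSVC(dual=False), X, y, cv=5).mean()
    print(f"decoding {label}: mean accuracy = {acc:.2f} (chance = 0.33)")
```

In the study described above, the critical result was generalization to new samples of talkers and content; that test would be better approximated by holding out whole talkers or vowels, rather than random trials, in the cross-validation scheme.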

The novelty of this approach, and its relevance to MISS, is that the algorithm worked efficiently in the absence of complex symbolic processes, such as phonetic feature recognition or specific spectrotemporal acoustic features of a particular voice: the only information given to the “pattern recognizer” was either voice or content identity. Because the auditory stimulus patterns came from unfamiliar talkers and were limited to monosyllables, one possible inference from this study is that (right) anterior superior temporal regions need not be conceptualized exclusively as “PIN” regions, intimately linked to conceptual and semantic knowledge about individuals and associations between known faces and their voices. Rather, it suggests that right anterior superior temporal regions may have a special role in processing stylistic, idiosyncratic voice features which may be particularly important for discriminating the speech carrier, whether talkers are familiar or not (and see Belin and Zatorre, 2003; von Kriegstein and Giraud, 2004).

The superior temporal gyrus and sulcus appear to be good candidate regions underpinning crucial aspects of processing seen speech, both in terms of mapping heard to seen speech, and identifying the talker from speech patterns. While specialized sub-regions may be implicated differentially (right anterior–carrier; left posterior–speech content), it is likely that processing is distributed along the gyrus and sulcus so that content and indexical information can be mutually informative. However, no study to date has explicitly explored this.

Other cortical regions implicated in speechreading: the role of frontal systems

Functional activation patterns indicate a network of interconnected regions supporting speechreading. Activation in insular cortex and in subcortical sub-Sylvian regions is reliably reported (e.g. Calvert et al., 1997; Hertrich et al., 2009). Seen speech can make special demands on regions associated with speech production. In particular, activation in inferior frontal regions, including the partes opercularis/triangularis of the inferior frontal gyrus (BA 44/45), accompanies visual or audio-visual speech processing. Frontal activation often extends dorsally to the supplementary motor area (e.g. Campbell et al., 2006; Calvert and Campbell, 2003; Paulesu et al., 2003; Santi et al., 2003; Watkins et al., 2003; Miller and D’Esposito, 2005; Saito et al., 2005; Skipper et al., 2005; Hasson et al., 2007; Fridriksson et al., 2009). This suggests that access to articulatory representations of speech is important for speechreading and could mean that these regions are more accessible for speech that is seen (or seen-and-heard) than for speech that is simply heard. Seeing speech could then be considered to have a special role not only in relation to the binding of auditory and visual modalities in perception, but in the binding of speech perception and its production, whereby action-based representations refine the perceptual processing of speech (Studdert-Kennedy, 1983). The activation of representations based on articulatory properties of speech has further consequences — for example, it can lead to subvocal articulation of seen speech (Watkins et al., 2003) with concomitant activation in the observer’s somatosensory cortex (Möttönen et al., 2005).

Timing and models of flow of activation

It is not yet clear to what extent regions implicated in the perception of seen speech exert their effects early or late in processing. Simple feedforward models for speechreading (Nishitani and Hari, 2002; Campbell, 2008) will probably require modification. In particular, there are indications from electrophysiological and MEG studies that there may be “super-fast” connections from occipital regions sensitive to dynamic visual events directly to auditory cortex, since responses within auditory cortex appear to be modulated by visual speech events at an earlier stage of processing than would be predicted if their source were polymodal p-STS (Möttönen et al., 2004; van Wassenhove et al., 2005; Besle et al., 2008; Hertrich et al., 2009).

Now, visible mouth movements often precede actual vocalization, and observers judge as synchronous a seen speech action that can be up to half a second in advance of the heard sound (Campbell and Dodd, 1980; Dixon and Spitz, 1980). Therefore it seems likely that seeing a particular speech gesture (temporarily) “sets” auditory processing strategy, and this is reflected in cortical “preparedness.” One way to conceptualize this is as a temporary prioritization of forward connections from visual face processing regions (inferior temporal) to secondary auditory areas in more anterior parts of the temporal lobe. That is, local plastic changes in connection strengths come into play as seeing a face action prepares the observer for a (corresponding) auditory event (van Wassenhove et al., 2005; Hertrich et al., 2009). Such preparatory re-setting does not seem to involve p-STS — at least not in relation to early processing stages. The specificity and duration of such re-setting of the forward connections for audio-visual speech will be a fruitful area for further research.
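The tolerance described here is easy to state as a toy decision rule. In the sketch below, only the 500 ms visual-lead figure comes from the text; the much tighter tolerance assumed for sound leading vision is a placeholder value chosen for illustration.

```python
# Toy illustration of the asymmetric audio-visual synchrony window described
# in the text: a seen speech gesture can lead the sound by up to ~500 ms and
# still be judged synchronous. The tolerance for sound leading vision is much
# tighter; the 80 ms figure below is an assumed placeholder, not a value taken
# from the chapter.
VISUAL_LEAD_TOLERANCE_S = 0.500   # from the text: up to half a second
AUDIO_LEAD_TOLERANCE_S = 0.080    # assumed placeholder for illustration

def judged_synchronous(visual_onset_s, audio_onset_s):
    """Return True if the AV offset falls inside the (asymmetric) window."""
    lead = audio_onset_s - visual_onset_s     # positive: vision leads audio
    if lead >= 0:
        return lead <= VISUAL_LEAD_TOLERANCE_S
    return -lead <= AUDIO_LEAD_TOLERANCE_S

print(judged_synchronous(0.00, 0.40))   # vision leads by 400 ms -> True
print(judged_synchronous(0.40, 0.00))   # audio leads by 400 ms  -> False
```

An asymmetry of this kind is what one would expect if a seen gesture can usefully prepare auditory processing for a sound that has not yet arrived, as the paragraph above suggests.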

The many regions implicated in speechreading, and the elaboration of the time course of events allowing a visual or audio-visual stimulus to be perceived as speech, suggest not only that processing of carrier and of content is distributed across multiple brain sites, but that coherent, sustained activation of multiple sites is required for audio-visual percepts such as McGurk effects to be experienced as unitary events. Gamma-band oscillatory activity across multiple brain regions is a likely candidate for such temporal cohesion (Fingelkurts et al., 2003). This type of activity is associated with the detection of asynchrony in audio-visual speech (Doesburg et al., 2008) and with synchronized McGurk syllable perception (Kaiser et al., 2006). It is likely that future research will find more evidence that multiple perceptual and action systems contribute to the unitary experience of a speech event.
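To make “gamma-band synchrony across regions” concrete, the sketch below band-pass filters two synthetic signals in a nominal 30–80 Hz gamma range and computes their phase-locking value; the sampling rate, band edges and signals are illustrative choices and are not taken from the studies cited above.

```python
# Minimal sketch of a measure behind "gamma-band synchrony across regions":
# band-pass two signals in a gamma range and compute their phase-locking
# value (PLV). Signals are synthetic; parameters are illustrative only.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0                                   # Hz, assumed sampling rate
t = np.arange(0, 2.0, 1.0 / fs)
rng = np.random.default_rng(1)

shared = np.sin(2 * np.pi * 40 * t)           # a common 40 Hz component
sig_a = shared + 0.5 * rng.standard_normal(t.size)
sig_b = shared + 0.5 * rng.standard_normal(t.size)

b, a = butter(4, [30 / (fs / 2), 80 / (fs / 2)], btype="band")
phase_a = np.angle(hilbert(filtfilt(b, a, sig_a)))
phase_b = np.angle(hilbert(filtfilt(b, a, sig_b)))

plv = np.abs(np.mean(np.exp(1j * (phase_a - phase_b))))
print(f"gamma-band phase-locking value: {plv:.2f}  (1 = perfect synchrony)")
```

A value near 1 indicates that the two signals maintain a consistent gamma-band phase relation over time, which is the kind of sustained cross-site coordination the paragraph above appeals to.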


Taken together, neuroimaging findings are opening up interesting avenues for identifying the processes implicated in speechreading and audio-visual speech perception, and their time course. While many “classical” speech perception regions are involved, especially within superior temporal regions, and these include auditory processing regions, speechreading may implicate frontal and somesthetic circuits too — possibly to a greater extent than for hearing speech. p-STS clearly performs crucial integration functions with respect to seeing speech; however, some fast feedforward connections may sometimes bypass this route. Investigations of voice as well as speech processing further suggest that content and carrier can be distinguished by different cortical subsystems within superior temporal regions. The cognitive stages models (Figures 31.2 and 31.3) proposed to account for findings in visual and audio-visual processing do not always fit readily with cortical activation findings, but cortical data can inform these models.

Summary and conclusion

In this chapter I have shown that when we listen to speech we are highly sensitive to the sight of the talker, and this is evident early in infancy, before the child produces spoken language. In this sense we all speechread, even when we may not be able to understand much of a silent speechread utterance. Moreover, we can match a speaker’s face to their voice and cannot ignore who someone is when we hear them talk. Watching someone talk involves both speech processing and face processing — and processing speech involves both processing speech content and managing aspects of the identity of the talker (indexical properties — the carrier). While we do this readily for familiar people, we can also discriminate between unfamiliar individuals by matching their (silent) talking face with their voice. Since a spoken utterance has correlated dynamic properties across the face and the voice (the timing and extent of the actions are closely synchronized), I have suggested that an early multimodal processing system (MISS) operates to integrate the talking face and the heard voice at a structural (pre-identification) level.

Fodor ( 1983 ) proposed that cognitive input systems are domain specific and input driven (bottom-up). To take the last of these first: one aspect of the chapter is to illustrate the extent to which the input-driven aspect may need to be moderated. Examples abound of “top-down” influences on “automatic” audio-visual speech processing, and have been provided in this chapter. But what about domain-specificity? What domain is occupied by speechreading? In contrast to a simple interpretation of Fodorian modularity, which might locate face processing in a purely visual domain and speech processing in a purely acoustic one (and so speechreading would be brushed out of sight), our sensitivity to the sight of a talker who is speaking tells us that there are crucial multimodal processes involved — and some of these operate at a relatively low (input) level.

One message of this chapter is that two domains — each of which is inherently multimodal — underpin much of our ability to process communicative information. These are a content domain (primarily speech) and a carrier domain (who is doing the communicating, and how?). The cortical correlates for each of these domains can be identified and each of their distinctive cognitive characteristics can be described. The details of how these domains interact await discovery.

Acknowledgments

Much work for this chapter was done during a visit to the MARCS laboratory, University of Western Sydney, Australia. I am happy to acknowledge the hospitality of Denis Burnham, who invited me to present at Summerfest2008 and was gracious enough to let me stay on for a while. MARCS colleagues, especially Jeesun Kim and Chris Davis, and also Harold Hill, provided the necessary inspiration.


References

Allison, T., Puce, A. and McCarthy, G. (2000). Social perception from visual cues: role of the STS region. Trends in Cognitive Sciences, 4, 267–278.

Alsius, A., Navarra, J., Campbell, R. and Soto-Faraco, S.S. (2005). Audiovisual integration of speech falters under high attention demands. Current Biology, 15, 839–843.

Assal, G., Zander, E., Kremin, H. and Buttet, J. (1976). Discrimination des voix lors des lésions du cortex cérébral. Archives Suisses de Neurologie, Neurochirurgie et Psychiatrie, 119, 307–315.

Auer, E.T., Jr. and Bernstein, L.E. (1997). Speechreading and the structure of the lexicon: computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness. Journal of the Acoustical Society of America, 102, 3704–3710.

Belin , P . ( 2006 ). Voice processing in human and non-human primates . Philosophical Transactions of the Royal Society of London B , 361 , 2091 – 2107 .

Belin , P. , Fecteau , S. and Bédard , C. ( 2004 ). Thinking the voice: neural correlates of voice perception . Trends in Cognitive Science , 8 , 129 – 136.

Belin , P. and Zatorre , R.J . ( 2003 ). Adaptation to speaker’s voice in right anterior temporal lobe . Neuroreport , 14 , 2105 – 2109 .

Bernstein, L.E., Auer, E.T. Jr., Wagner, M. and Ponton, C.W. (2008). Spatiotemporal dynamics of audiovisual speech processing. Neuroimage, 39, 423–435.

Besle, J., Fischer, C., Bidet-Caulet, A., Lecaignard, F., Bertrand, O. and Giard, M.H. (2008). Visual activation and audiovisual interactions in the auditory cortex during speech perception: intracranial recordings in humans. Journal of Neuroscience, 28, 14301–14310.

Binder, J.R., Rao, S.M., Hammeke, T.A., et al. (2004). Functional magnetic resonance imaging of human auditory cortex. Annals of Neurology, 35, 662–672.

Bristow, D., Dehaene-Lambertz, G., Mattout, J., et al. (2009). Hearing faces: how the infant brain matches the face it sees with the speech it hears. Journal of Cognitive Neuroscience, 21, 905–921.

Brothers, L. (1990). The social brain: a project for integrating primate behaviour and neurophysiology in a new domain. Concepts in Neuroscience, 1, 27–51.

Bruce , V. and Young , A.W. ( 1986 ). Understanding face recognition . British Journal of Psychology , 77 , 305 – 327 .

Burnham , D. and Dodd , B. ( 2004 ). Auditory-visual speech integration by prelinguistic infants: perception of an emergent consonant in the McGurk effect . Developmental Psychobiology , 45 , 204 – 220 .

Calvert , G. , Bullmore , E. , Brammer , M.J. , et al . ( 1997 ). Activation of auditory cortex during silent speechreading . Science , 276 , 593 – 596 .

Calvert , G. and Campbell , R. ( 2003 ). Reading speech from still and moving faces: the neural substrates of seen speech . Journal of Cognitive Neuroscience , 15 , 57 – 70.

Calvert , G.A. , Campbell , R. , and Brammer , M. ( 2000 ). Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex . Current Biology , 10 , 649 – 665.

Campanella , S. and Belin , P. ( 2007 ). Integrating face and voice in person perception . Trends in Cognitive Sciences , 11 , 535 – 543.

Campbell , R. ( 2006 ). Audio-visual speech processing . In K. Brown , A. Anderson , L. Bauer , M. Berns , G. Hirst , and J. Miller (eds.), The encyclopedia of language and linguistics , pp. 562 – 569 . Amsterdam : Elsevier.

Campbell , R. ( 2008 ). The processing of audio-visual speech: empirical and neural bases . Philosophical Transactions of the Royal Society of London B , 363 , 1001 – 1010.

Campbell , R. and Dodd , B. ( 1980 ). Hearing by eye. Quarterly Journal of Experimental Psychology , 32 , 85 – 99.

Campbell, R., Landis, T. and Regard, M. (1986). Face recognition and lipreading: a neurological dissociation. Brain, 109, 509–521.

Campbell, R., Garwood, J., Franklin, S., Howard, D., Landis, T. and Regard, M. (1990). Neuropsychological studies of auditory-visual fusion illusions: four case studies and their implications. Neuropsychologia, 28, 787–802.

Campbell , R. , Brooks , B. , de Haan , E. and Roberts , T. ( 1996 ). Dissociating face processing skills: decision about lip-read speech, expression, and identity . Quarterly Journal of Experimental Psychology A , 49 , 295 – 314 .


Capek , C.M. , Bavelier , D. , Corina , D. , Newman , A.J. , Jezzard , P. and Neville , H.J. ( 2004 ). The cortical organization of audio-visual sentence comprehension: an fMRI study at 4 Tesla . Brain Research, Cognitive Brain Research , 20 , 111 – 119 .

Capek , C.M. , MacSweeney , M. , Woll , B. , et al . ( 2008 ). Cortical circuits for silent speechreading in deaf and hearing people . Neuropsychologia , 46 , 1233 – 1241 .

Christie , F. and Bruce , V. ( 1998 ). The role of dynamic information in the recognition of unfamiliar faces . Memory and Cognition , 26 , 780 – 790 .

Davis , C. and Kim , J. ( 2004 ). Audio-visual interactions with intact clearly audible speech . Quarterly Journal of Experimental Psychology A , 57 , 1103 – 1121 .

De Gelder , B. and Bertelson , P. ( 2003 ). Multisensory integration, perception and ecological validity . Trends in Cognitive Science , 10 , 460 – 467 .

de Gelder , B. , Vroomen , J. , and Bachoud Levi , A.C. ( 1998 ). Impaired speechreading and audio-visual speech integration in prosopagnosia . In R. Campbell and B. Dodd (eds.) Hearing by eye II pp. 195 – 207 . Hove : Psychology Press/Erlbaum .

de Haan , E.H. and Campbell , R. ( 1991 ). A fifteen year follow-up of a case of developmental prosopagnosia . Cortex , 27 , 489 – 509.

Dixon , N.F. and Spitz , L. ( 1980 ). The detection of audio-visual desynchrony . Perception , 9 , 719 – 21.

Dodd , B. ( 1979 ). Lip reading in infants: attention to speech presented in- and out-of-synchrony . Cognitive Psychology , 11 , 478 – 484.

Doesburg , S.M. , Emberson , L.L. , Rahi , A. , Cameron , D. and Ward , L.M. ( 2008 ). Asynchrony from synchrony: long-range gamma-band neural synchrony accompanies perception of audiovisual speech asynchrony . Experimental Brain Research , 185 , 11 – 20.

Fingelkurts , A.A. , Fingelkurts , A.A. , Krause , C.M. , Möttönen , R. and Sams , M. ( 2003 ). Cortical operational synchrony during audio-visual speech integration . Brain and Language , 85 , 297 – 312 .

Fodor , J.A. ( 1983 ). Modularity of Mind: An Essay on Faculty Psychology . Cambridge, MA : MIT Press .

Formisano, E., De Martino, F., Bonte, M. and Goebel, R. (2008). “Who” is saying “what”? Brain-based decoding of human voice and speech. Science, 322, 970–973.

Fridriksson , J. , Moser , D. , Ryalls , J. , Bonilha , L. , Rorden , C. and Baylis , G. ( 2009 ). Modulation of frontal lobe speech areas associated with the production and perception of speech movements . Journal of Speech and Language and Hearing Research , 52 , 812 – 819 .

Garner, W.R. (1974). The Processing of Information and Structure. Potomac, MD: Erlbaum (distributed by Wiley, New York).

Ghazanfar , A.A. and Logothetis , N.K. ( 2003 ). Neuroperception: facial expressions linked to monkey calls . Nature , 423 , 937 – 938 .

Ghazanfar , A.A. and Schroeder , C.E. ( 2006 ). Is neocortex essentially multisensory? Trends in Cognitive Science , 10 , 278 – 285 .

Ghazanfar , A.A. , Turesson , H.K. , Maier , J.X. , van Dinther , R. , Patterson , R.D. and Logothetis , N.K. ( 2007 ). Vocal-tract resonances as indexical cues in rhesus monkeys . Current Biology , 17 , 425 – 430

Green , K.P. ( 1998 ). The use of auditory and visual information during phonetic processing; implications for theories of speech perception . In Campbell R , Dodd BM and Burnham D (eds.) Hearing By Eye II , pp. 3 – 26. Hove : Psychology Press .

Hall , D.A. , Fussell , C. and Summerfield , A.Q. ( 2005 ). Reading fluent speech from talking faces: typical brain networks and individual differences . Journal of Cognitive Neuroscience , 17 , 939 – 953 .

Hamilton , R.H. , Shenton , J.T. and Branch Coslett , H. ( 2006 ). An acquired deficit of audio-visual speech processing . Brain and Language , 98 , 66 – 73.

Hare , B. and Tomasello , M. ( 2005 ). Human-like social skills in dogs? Trends in Cognitive Science , 9 , 439 – 444 .

Hasson , U. , Skipper , J.I. , Nusbaum , H.C. and Small , S.L. ( 2007 ). Abstract coding of audiovisual speech: beyond sensory representation . Neuron , 56 , 1116 – 1126 .


Haxby , J.V. , Hoffman , E.A. and Gobbini , M.I. ( 2002 ). Human neural systems for face recognition and social communication . Biological Psychiatry , 51 , 59 – 67 .

Hazan , V. , Sennema , A. , Faulkner , A. , Ortega-Llebaria , M. , Iba , M. and Chunge , H. ( 2006 ). The use of visual cues in the perception of non-native consonant contrasts . Journal of the Acoustical Society of America , 119 , 1740 – 1751 .

Hein , G. and Knight , R.T. ( 2008 ). Superior temporal sulcus - It’s my area: Or is it? Journal of Cognitive Neuroscience , 20 , 2125 – 2136.

Hertrich, I., Mathiak, K., Lutzenberger, W. and Ackermann, H. (2009). Timecourse of early audiovisual interactions during speech and nonspeech central auditory processing: a magnetoencephalography study. Journal of Cognitive Neuroscience, 21, 259–274.

Irwin , A. , Thomas , S. and Pilling , M. ( 2007 ). Regional accent familiarity and speechreading performance . In J. Vroomen , M. Swarts and E. Kramer (eds. ) Proceedings of the international conference AVSP2007 ISCA available at : http://spitswww.uvt.nl/Fsw/Psychologie/AVSP2007/ .

Jiang , J. , Alwan , A. , Keating , P.A. , Auer , E.T. and Bernstein , L.E. ( 2002 ). On the relationship between face movements, tongue movements and speech acoustics . Journal of Applied Signal Processing , 11 , 1174 – 1188.

Jiang , J. , Auer , E.T. J r . , Alwan , A. , Keating , P.A. and Bernstein , L.E. ( 2007 ). Similarity structure in visual speech perception and optical phonetic signals . Perception and Psychophysics , 69 , 1070 – 1083 .

Kaiser , A.R. , Kirk , K.I. , Lachs , L. and Pisoni , D.B. ( 2003 ) Talker and lexical effects on audiovisual word recognition by adults with cochlear implants . Journal of Speech and Language and Hearing Research , 46 , 390 – 404.

Kaiser , J. , Hertrich , I. , Ackermann , H. and Lutzenberger , W. ( 2006 ) Gamma-band activity over early sensory areas predicts detection of changes in audiovisual speech stimuli . Neuroimage , 30 , 1376 – 1382.

Kamachi , M. , Hill , H. , Lander , K. and Vatikiotis-Bateson , E. ( 2003 ). Putting the face to the voice: matching identity across modality . Current Biology , 13 , 1709 – 1714.

Kaufmann , J.M. and Schweinberger , S.R. ( 2005 ). Speaker variations influence speechreading speed for dynamic faces . Perception , 34 , 595 – 610 .

Kerzel , D. and Bekkering , H. ( 2000 ). Motor activation from visible speech: evidence from stimulus response compatibility . Journal of Experimental Psychology: Human Perception and Performance , 26 , 634 – 647.

Kim , J. and Davis , C. ( 2003 ). Hearing foreign voices: does knowing what is said affect visual-masked-speech detection? Perception , 32 , 111 – 120 .

Kreiman , J. and Van Lancker , D. ( 1988 ). Hemispheric specialization for voice recognition: evidence from dichotic listening . Brain and Language , 34 , 246 – 252 .

Kuhl , P.K. and Meltzoff , A.N. ( 1982 ). The bimodal perception of speech in infancy . Science , 218 , 1138 – 1141.

Lachs , L. and Pisoni , D.B. ( 2004a ). Cross-modal source identification in speech perception . Ecological Psychology , 16 , 159 – 187.

Lachs , L. and Pisoni , D.B. ( 2004b ). Cross-modal source information and spoken word recognition . Journal of Experimental Psychology: Human Perception and Performance , 30 , 378 – 396.

Lachs , L. and Pisoni , D.B. ( 2004c ). Specification of cross-modal source information in isolated kinematic displays of speech . Journal of the Acoustical Society of America , 116 , 507 – 518 .

Lander, K. (2004). Repetition priming from moving faces. Memory and Cognition, 32, 640–647.

Lander, K. and Davies, R. (2007). Exploring the role of characteristic motion when learning new faces. Quarterly Journal of Experimental Psychology, 60, 519–526.

Lander , K. and Davies , R. ( 2008 ). Does face familiarity influence speechreadability? Quarterly Journal of Experimental Psychology , 61 , 961 – 967 .

Landis , T. , Buttet , J. , Assal , G., and Graves , R. ( 1982 ). Dissociation of ear preference in monaural word and voice recognition . Neuropsychologia , 20 , 501 – 504 .


Lander , K. , Christie , F. and Bruce , V. ( 1999 ). The role of movement in the recognition of famous faces . Memory and Cognition , 27 , 974 – 985 .

Lander , K. , Hill , H. , Kamachi , M. and Vatikiotis-Bateson , E. ( 2007 ). It’s not what you say but the way you say it: matching faces and voices . Journal of Experimental Psychology: Human Perception and Performance , 33 , 905 – 914 .

Lewkowicz, D.J. (2010). Infant perception of audio-visual speech synchrony. Developmental Psychology, 46, 66–77.

Liberman , A.M. , Cooper , F.S. , Shankweiler , D.P. and Studdert-Kennedy , M. ( 1967 ). Perception of the speech code . Psychological Review , 74 , 431 – 461 .

McGrath , M. , Summerfield , A.Q. and Brooke , N.M. ( 1984 ). Roles of lips and teeth in lipreading vowels , Proceedings of the Institute of Acoustics (Autumn Meeting, Windermere) , 6 , 401 – 408 .

McGurk , H. and MacDonald , J. ( 1976 ). Hearing lips and seeing voices . Nature , 264 , 746 – 748 .

Miller , L.M. and D’Esposito , M. ( 2005 ). Perceptual fusion and stimulus coincidence in the cross-modal integration of speech . Journal of Neuroscience , 25 , 5884 – 5893 .

Möttönen, R., Krause, C.M., Tiippana, K. and Sams, M. (2002). Processing of changes in visual speech in the human auditory cortex. Brain Research, Cognitive Brain Research, 13, 417–425.

Möttönen , R. , Schürmann , M. and Sams , M. ( 2004 ). Time course of multisensory interactions during audiovisual speech perception in humans: a magnetoencephalographic study . Neuroscience Letters, 363 , 112 – 115 .

Möttönen , R. , Järveläinen , J. , Sams , M. and Hari , R. ( 2005 ). Viewing speech modulates activity in the left SI mouth cortex . Neuroimage , 24 , 731 – 737 .

Mullenix, J.W., Pisoni, D.B. and Martin, C.S. (1989). Some effects of talker variability on spoken word recognition. Journal of the Acoustical Society of America, 85, 365–378.

Munhall , K.G. and Vatikiotis-Bateson , E. ( 1998) . The moving face during speech communication . In Campbell , R. , Dodd , B. and Burnham , D. (eds.) Hearing by eye II , pp. 123 – 139. Hove : Psychology Press Ltd .

Murase , M. , Saito , D.N. , Kochiyama , T. , et al . ( 2008 ). Cross-modal integration during vowel identification in audiovisual speech: a functional magnetic resonance imaging study . Neuroscience Letters , 434 , 71 – 76 .

Musacchia , G. , Sams , M. , Nicol , T. and Kraus , N. ( 2006 ). Seeing speech affects acoustic information processing in the human brainstem . Experimental Brain Research , 168, 1 – 10.

Musacchia, G., Sams, M., Skoe, E. and Kraus, N. (2007). Musicians have enhanced subcortical auditory and audiovisual processing of speech and music. Proceedings of the National Academy of Sciences, USA, 104, 15894–15898.

Navarra , J. and Soto-Faraco , S. ( 2007 ). Hearing lips in a second language: visual articulatory information enables the perception of second language sounds . Psychological Research , 71 , 4 – 12.

Nishitani , N. and Hari , R. ( 2002 ). Viewing lip forms: cortical dynamics . Neuron , 36 , 1211 – 1220.

Nygaard , L.C. , Sommers , M.S. , and Pisoni , D.B. ( 1994 ). Speech perception as a talker-contingent process . Psychological Science , 5 , 42 – 46 .

O’Toole , A.J. , Roark , D.A. and Abdi , H. ( 2002 ). Recognizing moving faces: a psychological and neural synthesis . Trends in Cognitive Science , 6 , 261 – 266 .

Patterson , M.L. and Werker , J.F. ( 1999 ). Matching phonetic information in lips and voice is robust in 4.5-month-old infants . Infant Behaviour and Development , 22 , 237 – 247.

Patterson, M.L. and Werker, J.F. (2002). Infants’ ability to match phonetic and gender information in the face and voice. Journal of Experimental Child Psychology, 81, 93–115.

Paulesu , E. , Perani , D. , Blasi , V. , Silani , G. , Borghese , N.A. et al . ( 2003 ). A functional anatomical model for lipreading . Journal of Neurophysiology , 90 , 2005 – 2013.

Pekkola , J. , Ojanen , V. , Autti , T. , et al . ( 2005 ). Primary auditory cortex activation by visual speech: an fMRI study at 3 T . NeuroReport , 16 , 125 – 128.


Pelphrey , K.A. , Morris , J.P. , Michelich , C.R. , Allison , T. , and McCarthy , G. ( 2005 ). Functional anatomy of biological motion perception in posterior temporal cortex: An FMRI study of eye, mouth and hand movements . Cerebral Cortex , 15 , 1866 – 1876 .

Petkov, C.I., Kayser, C., Steudel, T., Whittingstall, K., Augath, M. and Logothetis, N.K. (2008). A voice region in the monkey brain. Nature Neuroscience, 11, 367–374.

Pons, F., Lewkowicz, D.J., Soto-Faraco, S. and Sebastián-Gallés, N. (2009). Narrowing of intersensory speech perception in infancy. Proceedings of the National Academy of Sciences, USA, 106, 10598–10602.

Puce, A., Allison, T., Bentin, S., Gore, J.C. and McCarthy, G. (1998). Temporal cortex activation in humans viewing eye and mouth movements. Journal of Neuroscience, 18, 2188–2199.

Rauschecker , J.P. ( 1997 ). Processing of complex sounds in the auditory cortex of cat, monkey and man . Acta Otolaryngologica (Stockholm) Suppl . 532 , 34 – 38 .

Reale , R.A. , Calvert , G.A. , Thesen , T. , et al . ( 2007 ). Auditory-visual processing represented in the human superior temporal gyrus . Neuroscience , 145 , 162 – 184.

Redcay , E. ( 2008 ). The superior temporal sulcus performs a common function for social and speech perception: implications for the emergence of autism . Neuroscience Biobehavioral Reviews , 32 , 123 – 142.

Remez , R.E. , Fellowes , J.M. and Rubin , P.E. ( 1997 ). Talker identification based on phonetic information . Journal of Experimental Psychology: Human Perception and Performance , 23 , 651 – 666 .

Roberts , M. and Summerfield , Q. ( 1981 ). Audiovisual presentation demonstrates that selective adaptation in speech perception is purely auditory . Perception and Psychophysics , 30 , 309 – 314 .

Rönnberg , J. ( 2003 ). Cognition in the hearing impaired and deaf as a bridge between signal and dialogue: A framework and a model . International Journal of Audiology , 42 , S68 – S76 .

Rose , P. ( 2002 ). Forensic Speaker Identification . London: Taylor and Francis/CRC Press .

Rosenblum , L.D. and Saldaña H.M. ( 1996 ). An audiovisual test of kinematic primitives for visual speech perception . Journal of Experimental Psychology: Human Perception and Performance , 22 , 318 – 331 .

Rosenblum , L.D. , Johnson , J.A. and Saldaña , H.M. ( 1996 ). Point-light facial displays enhance comprehension of speech in noise . Journal of Speech and Hearing Research , 39 , 1159 – 1170 .

Rosenblum , L.D. , Schmuckler , M.A. and Johnson , J.A. ( 1997 ). The McGurk effect in infants . Perception and Psychophysics , 59 , 347 – 357.

Rosenblum , L.D. , Smith , N.M. , Nichols , S.M. , Hale , S. and Lee , J. ( 2006 ). Hearing a face: cross-modal speaker matching using isolated visible speech . Perception and Psychophysics , 68 , 84 – 93 .

Rosenblum , L.D. , Miller , R.M. and Sanchez , K. ( 2007a ). Lip-read me now, hear me better later: cross-modal transfer of talker-familiarity effects . Psychological Science , 18 , 392 – 396.

Rosenblum , L.D. , Niehus , R.P. and Smith , N.M. ( 2007b ). Look who’s talking: recognizing friends from visible articulation . Perception , 36 , 157 – 159 .

Saito, D.N., Yoshimura, K., Kochiyama, T., Okada, T., Honda, M. and Sadato, N. (2005). Cross-modal binding and activated attentional networks during audio-visual speech integration: a functional MRI study. Cerebral Cortex, 15, 1750–1760.

Saldaña, H.M. and Rosenblum, L.D. (1994). Selective adaptation in speech perception using a compelling audiovisual adaptor. Journal of the Acoustical Society of America, 95, 3658–3661.

Sams , M. , Möttönen , R. and Sihvonen , T. ( 2005 ). Seeing and hearing others and oneself talk . Brain Research, Cognitive Brain Research , 23 , 429 – 435 .

Santi , A. , Servos , P. , Vatikiotis-Bateson , E. , Kuratate , T. and Munhall , K. ( 2003 ). Perceiving biological motion: dissociating visible speech from walking . Journal of Cognitive Neuroscience , 15 , 800 – 809 .

Saxe , R. and Kanwisher , N. ( 2003 ). People thinking about thinking people . The role of the temporo-parietal junction in “theory of mind”. Neuroimage , 19 , 1835 – 1842 .

Schweinberger , S.R. , Casper , C. , Hauthal , N. , et al . ( 2008 ). Auditory adaptation in voice perception . Current Biology , 18 , 684 – 688 .

Schweinberger , S.R. , Robertson , D. , and Kaufmann , J.M. ( 2007 ). Hearing facial identities . Quarterly Journal of Experimental Psychology , 60 , 1446 – 1456 .

Schweinberger , S.R. and Soukup , G.R. ( 1998 ). Asymmetric relationships among perceptions of facial identity, emotion, and facial speech . Journal of Experimental Psychology: Human Perception and Performance , 24 , 1748 – 1765 .


Scott , S.K. and Johnsrude , I.S. ( 2003 ). The neuroanatomical and functional organization of speech perception . Trends in Neuroscience , 26 , 100 – 107.

Scott , S.K. , Rosen , S. , Lang , H., and Wise , R.J. ( 2006 ). Neural correlates of intelligibility in speech investigated with noise vocoded speech — a positron emission tomography study . Journal of the Acoustical Society of America , 120 , 1075 – 1083.

Sekiyama, K. (1994). Differences in auditory-visual speech perception between Japanese and Americans: McGurk effect as a function of incompatibility. Journal of the Acoustical Society of Japan, 15, 143–158.

Sekiyama , K. ( 1997 ). Cultural and linguistic factors in audiovisual speech processing: the McGurk effect in Chinese subjects . Perception and Psychophysics , 59 , 73 – 80 .

Sekiyama , K. and Tohkura , Y. ( 1991 ). McGurk effect in non-English listeners: few visual effects for Japanese subjects hearing Japanese syllables of high auditory intelligibility . Journal of the Acoustical Society of America , 90 , 1797 – 1805 .

Sheffert, S.M. and Olson, E. (2004). Audiovisual speech facilitates voice learning. Perception and Psychophysics, 66, 352–362.

Sheffert, S.M., Pisoni, D.B., Fellowes, J.M. and Remez, R.E. (2002). Learning to recognize talkers from natural, sinewave, and reversed speech samples. Journal of Experimental Psychology: Human Perception and Performance, 28, 1447–1469.

Skipper , J.I. , Nusbaum , H.C. , and Small , S.L. ( 2005 ). Listening to talking faces: motor cortical activation during speech perception . Neuroimage 25 , 76 – 89 .

Soto-Faraco , S. , Navarra , J. , Weikum , W.M. , Vouloumanos , A. , Sebastián-Gallés , N. and Werker , J.F. ( 2007 ). Discriminating languages by speech-reading . Perception and Psychophysics. 69 , 218 – 231 .

Stevenson , R.A. and James , T.W. ( 2009 ). Audiovisual integration in human superior temporal sulcus: Inverse effectiveness and the neural processing of speech and object recognition . Neuroimage . 44 , 1210 – 1223 .

Stone , S.M. ( 2010 ). Human facial discrimination in horses: can they tell us apart? Animal Cognition , 13 , 51 – 61 .

Studdert-Kennedy , M. ( 1983 ). Perceptual processing links to the motor system . In M. Studdert-Kennedy (ed.) The Psychobiology of Language , pp. 29 – 39. Cambridge MA : MIT Press .

Tate , A.J. , Fischer , H. , Leigh , A.E. and Kendrick , K.M. ( 2006 ). Behavioural and neurophysiological evidence for face identity and face emotion processing in animals . Philosophical Transactions of the Royal Society of London B , 361 , 2155 – 2172 .

Teinonen, T., Aslin, R.N., Alku, P. and Csibra, G. (2008). Visual speech contributes to phonetic learning in 6-month-old infants. Cognition, 108, 850–855.

van Wassenhove , V. , Grant , K.W., and Poeppel , D. ( 2005 ). Visual speech speeds up the neural processing of auditory speech . Proceedings of the National Academy of Sciences USA , 102 , 1181 – 1186 .

von Kriegstein , K.V. and Giraud , A.L. ( 2004 ). Distinct functional substrates along the right superior temporal sulcus for the processing of voices . Neuroimage , 22 , 948 – 955.

von Kriegstein , K. , Eger , E. , Kleinschmidt , A., and Giraud , A.L. ( 2003 ). Modulation of neural responses to speech by directing attention to voices or verbal content . Brain Research, Cognitive Brain Research , 17 , 48 – 55.

von Kriegstein, K., Kleinschmidt, A. and Giraud, A.L. (2006). Voice recognition and cross-modal responses to familiar speakers’ voices in prosopagnosia. Cerebral Cortex, 16, 1314–1322.

von Kriegstein , K. , Dogan , O. , Grüter , M. , et al . ( 2008 ) Simulation of talking faces in the human brain improves auditory speech recognition . Proceedings of the National Academy of Sciences USA , 105 , 6747 – 6752.

Walker , S. , Bruce , V. and O’Malley , C. ( 1995 ). Facial identity and facial speech processing: familiar faces and voices in the McGurk effect . Perception and Psychophysics , 57 , 1124 – 1133.

Watkins, K.E., Strafella, A.P. and Paus, T. (2003). Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia, 41, 989–994.

Weikum , W.M. , Vouloumanos , A. , Navarra , J. , Soto-Faraco , S. , Sebastián-Gallés , N., and Werker , J.F. ( 2007 ). Visual language discrimination in infancy . Science , 316 , 1159 .


Werker , J.F. and Tees , R.C. ( 2005 ). Speech perception as a window for understanding plasticity and commitment in language systems of the brain . Developmental Psychobiology 46 , 233 – 234 .

Wright , T.M. , Pelphrey , K.A. , Allison , T. , McKeown , M.J., and McCarthy , G. ( 2003 ). Polysensory Interactions along lateral temporal regions evoked by audiovisual speech . Cerebral Cortex , 13 , 1034 – 1043.

Yakel , D.A. , Rosenblum , L.D., and Fortier , M.A. ( 2000 ). Effects of talker variability on speechreading . Perception and Psychophysics. 62 , 1405 – 1412.

Yehia , H.C. , Rubin , P.E., and Vatikiotis-Bateson , E. ( 1998 ). Quantitative association of vocal-tract and facial behavior . Speech Communication 26 , 23 – 43.

Yehia , H.C. , Kuratate , T. , and Vatikiotis-Bateson , E. ( 2002 ). Linking facial animation, head motion and speech acoustics . Journal of Phonetics 30 , 555 – 568 .
