Chapter 14: Models and Theories of Speech Production and Perception

Chapter 14: Models and Theories of Speech Production

and Perception

Perry C. Hanavan, Au.D.

http://www.eucognition.org/wiki/index.php?title=Image:NeuralModelKroeger.jpg

TheoryA proposed description, explanation, or model of the manner of interaction of a set of natural phenomena, capable of predicting future occurrences or observations of the same kind, and capable of being tested through experiment or otherwise falsified through empirical observation.

Model

A model is a conceptual representation of some phenomenon

Speech Perception

• Puzzles and Speech Perception by Hawkins

http://www.ling.cam.ac.uk/sarah/docs/Hawkins_Puzzles-and-Patterns-kns50.ppt

Speech Production

• Serial-Order Issue– Order of phonemes determines how the word

will be perceived or recognized

k a t t a k

• Degrees of Freedom– Muscle movement for each sound can vary,

yet understanding of sound occurs

• Context-Sensitivity Problem– The context in which a sound is made can

have significant implication on meaning

Theories of Speech Production

• Target

• Feedback and

• Feedforward

• Dynamic Systems

• Connectionist

Target Models

• “process in which a speaker attempts to attain a sequence of targets corresponding to the speech sounds he is attempting to produce.”– (Borden et al., 1994)

Target Models

• Spatial?– Internal map of the vocal tract in the CNS– Coarticulation

• varient –movements of the articulator for a specific phoneme must change depending on starting point

– ki– ku

– Feedback information to brain to regulate fine movements and correct errors

Target Models

• Acoustic-auditory?– Goal is acoustic output (targets)

• Articulatory movements used to achieve goal– Variations in articulatory movements used when:

» Adjacent phonemes vary» Speaker rate varies» Stress variations in utterance

Target Models

• Perkell, Matthies, & Svirsky (1995) framework– Ultimate goal:

• Articulatory movements result in understandable acoustic events (speech)

Feedback

• Feedback was introduced The Basics of Cybernetics", and is especially important to speech production theory.

• Gracco and Abbs (1987) are among many to point out that continuous speech involves continuous feedback, that is to say, that the continuous execution of a motor program requires an equally continuous stream of sensory information from muscle and cutaneous senses throughout the respiratory, laryngeal, and orofacial regions.

http://www.smithsrisca.demon.co.uk/cybernetics.html

Feedback

• Levelt (1989) types of feedback .....• Am I saying what I meant to say?• Is this the way I meant to say it?• Is what I am saying socially appropriate?• Am I selecting the right words?• Am I using the right syntax and morphology?• Am I making any phonological errors?• Is my articulation at the right speed and pitch?

Feedback

•Successful speech production is a constant battle against error, and those errors can pop up anywhere. –The phrases we then use to interrupt and correct

ourselves (phrases such as "sorry", "I mean", "let me put that another way", etc.) are known generically as "editing expressions" (Hockett, 1967). Levelt (1989) summarised the issue thus .....

Feedback

• "The major feature of editor theories [of monitoring] is that production results are fed back through a device that is external to the production system.

–Such a device is called an editor or a monitor. –This device can be distributed in the sense that it can check in-between results at different levels of processing.

–The editor may, for instance, monitor the construction of the preverbal message, the appropriateness of lexical access, the well-formedness of syntax, or the flawlessness of phonological-form access.

–There is, so to speak, a watchful little homonculus connected to each processor." (Levelt, 1989, pp467-468; italics original; bold emphasis added)

Feedback• Lee interpreted these findings as evidence of a

multiple loop control hierarchy, with four levels of feedback, as follows:– The "Thought Loop": The top control level releases individual

thoughts for action, and then monitors that action for successful progress and completion. The highest level feedback loop then monitors the output for what would nowadays be termed its pragmatic appropriacy

– The "Word Loop": The second highest loop monitors speech production for word selection accuracy.

– The "Voice Loop": The third highest loop monitors speech production at whole-syllable level for morphological accuracy.

– The Articulating Loop": Finally, the lowest loop monitors speech production checking that the right phonemes have been used within each syllable.

Feedforward

• Feedback is used to detect and correct errors in speech output

• Feedforward signals are used to make articulatory adjustment online

Dynamic Systems Models

• Speech as a dynamic pattern of trajectories through articulatory– Groups of muscles link up together to perform a

particular task• The lip and jaw muscles function as a coordinative unit in

bilabial closure

Connectionist Models

• Parallel-distributed processing models

• Spreading activation models– Non-hierachial models

Speech Perception Issues

• Linearity

• Segmentation

• Speaker Normalization

• Basic Unit of Perception

• Specialization of speech perception

Categories of Speech Perception Theories

• Active vs. Passive

• Bottom-up vs. Top-down

• Autonomous vs. Interactive

Theories of Speech Perception

• Motor• Acoustic Invariance• Direct Realism• TRACE• Cohort• Fuzzy Logical• Logogen• Native Language

Magnet

Speech Perception Issues

• Linearity

• Segmentation

• Speaker normalization

• Basic unit of perception

• Specialization of speech perception

Linearity & Segmentation• Linearity Principle:

– A specific sound in a word corresponds to specific phoneme

• Segmentation– the ability to break the spoken language

signal into the parts that make up words • Thus, these two principles suggest

speech perception is based on a linear correspondence between the acoustic signal and the phoneme units

• Although we perceive speech as a series of separate and distinct phonemes and words, the acoustic boundaries between phonemes is blurred– eg. /ki/ vs. /ku/ (speech is not invariant)

http://www.psych.neu.edu/SpeechLab/spectr.JPG

Speaker Normalization

• How are listeners able to recognize speech sounds and words despite wide variations in speaker production?– Speaker variations– Gender differences– Age differences

• Normalization is the ability to perceive words spoken by different speakers, at different rates, and in different phonetic contexts as the same.

Basic Unit of Perception

• What is the basic unit of speech perceptions?– Acoustic-phoneme features– Allophones– Phonemes– Syllables– Words

• Listening in noise (focus on smaller units)• Young children focus on syllables and formant

transitions

Specialization of Speech Perception

• Is speech perception a specialized function/process in humans– However, animals have been able to

demonstrate categorical perception

Specialization of Speech Perception

• Perceptual magnet effect not demonstrated in animals (e.g., whereby `good' variants in F1/F2 coordinate space are poorly discriminated from typical vowel prototypes)

(A) Formant frequencies of vowels surrounding an American/i/prototype (red) and a Swedish/y/prototype (blue). (B) Results of tests on American and Swedish infants indicating an effect of linguistic experience. Infants showed greater generalization when tested with the native-language prototype. PME, Perceptual magnet effect. [American Association for the Advancement of Science]

Categories of Speech Perception Theories

• Active vs. Passive

• Bottom up Top Down

• Autonomous vs. Interactive

Active vs. Passive• Active theories suggests that speech

perception and production are closely related– Listener knowledge of how sounds are produced

facilitates recognition of sounds

• Passive theories emphasizes the sensory aspects of speech perception– Listeners utilize internal filtering mechanisms– Knowledge of vocal tract characteristics plays a minor

role, for example when listening in noise conditions

Bottom up Top Down• Top-down processing works with

knowledge a listener has about a language, context, experience, etc.– Listeners use stored information

about language and the world to make sense of the speech

• Bottom-up processing works in the absence of a knowledge base providing top-down information– listeners receive auditory information,

convert it into a neural signal and process the phonetic feature information

Autonomous vs. Interactive• Autonomous theories posit

feed-forward processing with lexical influence restricted to post-perceptual decision processes (uni-directional)

• Interactive theories posit information and knowledge from many sources available to the listener a re involved at any or all stages of the processing of the signal (bi-directional)

Speech Perception Theories

• Motor Theory• Acoustic Invariance Theory• Direct Realism• Trace Model• Logogen Theory• Cohort Theory• Fuzzy Logic Model of Perception• Native Language Magnet Theory

Question

This theory postulates speech is perceived by reference to how it is produced:

A.Motor TheoryB.Acoustic Invariance TheoryC.Direct RealismD.Trace ModelE.Logogen TheoryF.Cohort TheoryG.Fuzzy Logic Model of PerceptionH.Native Language Magnet Theory

Senteo QuestionTo set the properties right click and select Senteo Question Object->Properties...

Motor Theory• Postulates speech is perceived by

reference to how it is produced– when perceiving speech, listeners access

their own knowledge of how phonemes are articulated

– Articulatory gestures (such as rounding or pressing the lips together) are units of perception that directly provide the listener with phonetic information

Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967

Motor Theory

Three main claims:(1) speech processing is special,

(2) perceiving speech is perceiving gestures, and

(3) the motor system is recruited for perceiving speech.

Today, the theory is more closely connected with research and theorizing in the broad context of cognitive science than with research and theorizing in the field of speech.

QuestionThe acoustic properties of the landmarks

constitute the basis for establishing the distinctive features:

A.Motor TheoryB.Acoustic Invariance TheoryC.Direct RealismD.Trace ModelE.Logogen TheoryF.Cohort TheoryG.Fuzzy Logic Model of PerceptionH.Native Language Magnet Theory

Acoustic Invariance Theory• Listeners inspect the incoming signal for the so-

called acoustic landmarks which are particular events in the spectrum carrying information about gestures which produced them.

• Gestures are limited by the capacities of humans’ articulators and listeners are sensitive to their auditory correlates, the lack of invariance simply does not exist in this model.

• The acoustic properties of the landmarks constitute the basis for establishing the distinctive features.

• Bundles of the distinctive features uniquely specify phonetic segments (phonemes, syllables, words).

Stevens, K.N. (2002). "Toward a model of lexical access based on acoustic landmarks and distinctive features" (PDF). Journal of the Acoustical Society of America 111 (4): 1872–1891.

http://linguistics.berkeley.edu/~kjohnson/ling210/stevens2002.pdf

http://linguistics.berkeley.edu/~kjohnson/ling210/stevens2002.pdf

Acoustic Invariance Theory

Two principal claims:

1.There are invariant acoustic patterns in the speech signal which correspond to phonetic features and which remain invariant across speakers phonetic contexts, and languages.

2.That human perceivers use these properties (invariant acoustic patterns) to provide the phonetic framework for natural language and to process the sounds of speech in ongoing perception

Question

Hypothesizes that perception allows listeners to have direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived.

A. Motor TheoryB. Acoustic Invariance TheoryC.Direct RealismD.Trace ModelE. Logogen TheoryF. Cohort TheoryG.Fuzzy Logic Model of PerceptionH.Native Language Magnet Theory

Senteo QuestionTo set the properties right click and select Senteo Question Object->Properties...

Direct Realism• Hypothesizes that perception allows listeners to have

direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived.

• Asserts that the objects of perception are actual vocal tract movements, or gestures, and not abstract phonemes or (as in the Motor Theory) events that are causally antecedent to these movements, i.e. intended gestures.

• Listeners perceive gestures not by means of a specialized decoder (as in the motor theory) but because information in the acoustic signal specifies the gestures that form it.

• Suggests that the actual articulatory gestures that produce different speech sounds are themselves the units of speech perception. Fowler, C. A. (1986). "An event approach to the study of speech perception from a direct-realist perspective". Journal of Phonetics14: 3–28.

Direct Realism

• In essence, the object of our perceptions (ie., a sound, words, etc.), known as a ‘percept’, is the distal (causal) stimulus, not the proximal (sensory) stimulus

• If listening to a question, you don’t necessarily explicitly perceive the rise in the Fo of the auditory signal; instead, you perceive that a question has been asked

• Claims there is a strong, top-down effect on perception

• Supports a connectionist view of cognition

TRACE Model• Assumes there is a cognitive unit for each feature (for

example, nasality) at the feature level, for each phoneme at the phoneme level, and for each word at the word level.

• At any given time, all of these units are activated to a greater or lesser extent, as opposed to all or none.

• When units are activated above a certain threshold, they may influence other units at the same or different levels. – These effects may be either excitatory or inhibitory;

that is, they may increase or decrease the activation of other units.

• The entire network of units is referred to as the trace, because “the pattern of activation left by a spoken input is a trace of the analysis of the input at each of the three processing levels”

• The network is active and changes with subsequent input.McClelland, J.L., & Elman, J.L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86

TRACE Model• For example, a listener hears the

beginning of bald, and the words bald, ball, bad, bill become active in memory. Then, soon after, only bald and ball remain in competition (bad, bill have been eliminated because the vowel sound doesn't match the input).

– Soon after, bald is recognized.

• TRACE simulates this process by representing the temporal dimension of speech, allowing words in the lexicon to vary in activation strength, and by having words compete during processing.

– Figure 1 shows a line graph of word activation in a simple TRACE simulation.

TRACE Model

Neural net model– Aims to identify single words– Account for categorical perception, Ganong

effect and other traditional phonetic findings that were considered important in 1970s- 1980s

– Connectionist model of speech perception

(McClelland and Elman, 1986)

Logogen Theory• Model designed to explain word recognition using a new type of unit known as a “logogen"

• A critical element, lexicons, or specialized aspects of memory that include semantic and phonemic information about each item that is contained in memory.

• A given lexicon consists of many smaller, abstract items known as logogens.

• Logogens contain a variety of properties about given word such as their appearance, sound, and meaning.

• Logogens do not store words within themselves, they store information that is specifically necessary for retrieval of whatever word is being searched for.Morton, J. (1969). Interaction of information in word recognition. Psychological Review, 76, 165-178

Logogen Theory• A given logogen will become activated by stimuli or contextual information (words) consistent with the properties of that specific logogen and when the logogen's activation level rises to or above its threshold level, the pronunciation of the given word is sent to output system.

• Certain stimuli can affect the activation levels of more than one word at a time, usually involving words similar to one another.

• When this occurs, whichever words' activation levels reaches the threshold level, it is that word sent to the output system with the listener remaining unaware of any partially excited logogens.

Cohort Theory

• Designed specifically to account for auditory word recognition.

• Breaks word down. • Model posits that when a word is heard, all

words beginning with first sound of target word are activated.

• This set of words is considered the Cohort. • Once first cohort has been activated, other

information, or sounds in word narrow down choices.

• Listener recognizes word when left with a single choice; considered "recognition point." Marslen-Wilson, W. (1987). "Functional parallelism in spoken word recognition." Cognition, 25, 71-102.

Cohort Model• Designed specifically to account for auditory

word recognition• Works by breaking a word down when a word is

heard--all words that begin with the first sound of the target word are activated– This set of words is considered the “first” cohort

• Once the first cohort activated, other sounds in the word narrow down the choices.

• Listener recognizes the word when left with a single choice– this is considered the "recognition point".

Cohort Theory

Fuzzy Logic Model of Perception• Proposes that people remember speech sounds

in a probabilistic, or graded, way.• Suggests people remember descriptions of the

perceptual units of language, called prototypes. • Within each prototype various features may

combine.• Features are not binary (true or false) -- there is

a fuzzy value corresponding to how likely it is that a sound belongs to a particular speech category.

Massaro D. The logic of the fuzzy logical model of perception Behavioral and Brain Sciences (1989), 12: 778-794 Cambridge University Press

Fuzzy Logic Model of Perception• When perceiving a speech signal, decision about

what is actually heard is based on the relative goodness of the match between the stimulus information and values of particular prototypes.

• The final decision is based on multiple features or sources of information, even visual information (this explains the McGurk effect).

• Computer models of the fuzzy logical theory demonstrate that the theory's predictions of how speech sounds are categorized correspond to the behavior of human listeners

• iBaldiMassaro D. The logic of the fuzzy logical model of perception Behavioral and Brain Sciences (1989), 12: 778-794 Cambridge University Press

Native Language Magnet Theory• According to Kuhl’s (1994) Native Language Magnet

(NLM) theory the phonetic perceptual space is organized in terms of prototypes.

• Prototypes are defined as particularly good category exemplars that function as perceptual references for linguistic phonetic units (mental representations or perceptual maps of the speech).

• Prototypes function as “perceptual-magnets” and exert an attracting force on neighboring auditory representations which tend to be assimilated by (attracted towards) the prototype that conforms to phonetic categories in the language that is heard

• Thus, the perceptual space appears to be warped in the neighborhood of a prototype because a prototype attracts exemplars that fall within its zone of influence.Patricia K Kuhl, Barbara T Conboy, Sharon Coffey-Corina, Denise Padden, Maritza Rivera-Gaxiola, and Tobey Nelson Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e) Phil Trans R Soc B 2008 363: 979-1000.

Native Language Magnet TheoryData show infants perceptually “map” critical aspects

of ambient language in the first year of life before they can speak.

Statistical properties of speech are picked up through exposure to ambient language.

Linguistic experience alters infants' perception of speech, warping perception in the service of language.

Infants are neither the tabula rasas Skinner described nor innate grammarians Chomsky envisioned.

Infants have inherent perceptual biases that segment phonetic units without providing innate descriptions of them.

Patricia K Kuhl, Barbara T Conboy, Sharon Coffey-Corina, Denise Padden, Maritza Rivera-Gaxiola, and Tobey Nelson Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e) Phil Trans R Soc B 2008 363: 979-1000.

Native Language Magnet TheoryInfants use inherent learning strategies that were not

expected, ones thought to be too complex and difficult for infants to use.

Adults addressing infants unconsciously modify speech in ways that assist the brain mapping of language.

In combination, these factors provide a powerful discovery procedure for language. Patricia K Kuhl, Barbara T Conboy, Sharon Coffey-Corina, Denise Padden, Maritza Rivera-Gaxiola, and Tobey Nelson Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e) Phil Trans R Soc B 2008 363: 979-1000.

Native Language Magnet TheorySix tenets of a new view of language acquisition are offered: (i)infants' initially parse the basic units of speech allowing them to

acquire higher-order units created by their combinations; (ii)the developmental process is not a selectionist one in which

innately specified options are selected on the basis of experience;

(iii)Perceptual learning process, unrelated to Skinnerian learning, commences with exposure to language, during which infants detect patterns, exploit statistical properties, and are perceptually altered by that experience;

(iv)Vocal imitation links speech perception and production early, and auditory, visual, and motor information are coregistered for speech categories

(v)adults addressing infants unconsciously alter speech to match infants' learning strategies, and this is instrumental in supporting infants' initial mapping of speech; and

(vi)Critical period for language is influenced not only by time, but by the neural commitment that results from experience.Patricia K Kuhl, Barbara T Conboy, Sharon Coffey-Corina, Denise Padden, Maritza Rivera-Gaxiola, and Tobey Nelson Phonetic learning as a pathway to language: new data and native language magnet theory expanded (NLM-e) Phil Trans R Soc B 2008 363: 979-1000.

Documents

Chapter 14: Models and Theories of Speech Production and Perception