Charlotte Wollermann*, Eva Lasarcyk**
*Institute of Communication Sciences, University of Bonn
**Institute of Phonetics, Saarland University
[email protected], [email protected]

Modeling and Perceiving of (Un)Certainty in Articulatory Speech Synthesis
Overview
1. Introduction
 1.1 Emotion/attitude and speech synthesis
 1.2 Previous studies on (un)certainty
 1.3 Goal of the current study
2. Modeling of (un)certainty in articulatory speech synthesis
 2.1 Acoustical criteria
 2.2 The articulatory speech synthesis system
3. Perception studies
 3.1 Experiment 1
 3.2 Experiment 2
4. Conclusions
1. Introduction
1.1 Emotion and speech synthesis
• The modeling of emotion and attitude has gained extensive importance in the last few years
 – Goal: generating synthetic speech that is as natural and human-like as possible
 – Multimodal speech synthesis systems as possible applications: Talking Heads (Beskow 2003), Embodied Conversational Agents (Cassell et al. 1999)
• Most emotional speech synthesis systems are based on prototypical emotions according to Ekman (1972): happiness, sadness, anger, fear, surprise, disgust
1. Introduction
Emotion and speech synthesis:
• A different approach from emotion psychology: using evaluation, activation and power as basic dimensions for representing emotional states (Wundt 1896)
 – Ex.: EmoSpeak as part of the TTS system MARY (Schröder 2004; Schröder, Trouvain 2003)

Why investigate emotion and attitude in articulatory speech synthesis?
• The modeling of attitude has barely been investigated
• 3D articulatory synthesizer (Birkholz 2005): great degree of freedom and, at the same time, precise adjustment of single parameters
• (Un)certainty as a non-prototypical emotion
1. Introduction
1.2 Previous studies: Production and perception of (un)certainty in natural speech – acoustic domain
• Smith, Clark (1993): studying memory processes in question answering
 – Feeling of Knowing (FOK) paradigm (Hart 1965)
 – Uncertainty prosodically marked by rising intonation, delay, and linguistic hedges like "I guess"
• Brennan, Williams (1995): perception of the uncertainty of another speaker (Feeling of Another's Knowing, FOAK)
 – Intonation, form and latency of the answer; fillers like "hm", "uh" as relevant cues
1. Introduction
Previous studies: Production and perception of (un)certainty in natural speech – audiovisual domain
• Swerts et al. (2003, 2005): production and perception of (un)certainty in audiovisual speech
 – Delay, pause and fillers
 – Smiles, funny faces etc.

1.3 Goal of the current study
• Investigating the perception of uncertainty in human-machine interaction using articulatory synthesis
2. Modeling of uncertainty in articulatory speech synthesis
• General setting: stimuli are embedded into a context
 – Telephone dialog between a caller and a weather expert system
 Caller: "Wie wird das Wetter nächste Woche in ...?" (How is the weather going to be next week in ...?)
 System: "Eher kalt." (Rather cold.)
• Different levels of uncertainty indicated by the presence of
 – High intonation
 – Delay
 – Fillers
2. Modeling of uncertainty in articulatory speech synthesis

Stimulus design (caller's question: "Wie wird das Wetter nächste Woche in X?"; system's answers: "Ziemlich kühl", "Relativ heiss", "Eher kalt"):

ID   Level of certainty   Intonation high   Delay   Filler
1*   C                    -                 -       -
2    U1                   +                 -       -
3*   U2                   +                 +       -
4    U3                   +                 +       +
5*   C                    -                 -       -
6*   U(2)                 +                 +       -
7*   C                    -                 -       -
8    U1                   +                 -       -
9*   U2                   +                 +       -
10   U3                   +                 +       +
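The factorial design of the stimuli can be encoded programmatically, e.g. when generating stimulus lists. A minimal sketch, with condition names and cue flags taken from the table; the helper function itself is hypothetical, not part of the study's materials:

```python
# Each certainty level is defined by which acoustic cues are present.
# C = certain baseline; U1-U3 add cues cumulatively.
CONDITIONS = {
    "C":  {"high_intonation": False, "delay": False, "filler": False},
    "U1": {"high_intonation": True,  "delay": False, "filler": False},
    "U2": {"high_intonation": True,  "delay": True,  "filler": False},
    "U3": {"high_intonation": True,  "delay": True,  "filler": True},
}

def cues_for(level):
    """Return the list of active uncertainty cues for a condition."""
    return [cue for cue, on in CONDITIONS[level].items() if on]
```

Encoding the cues as independent flags keeps the cumulative structure of the design explicit: each uncertainty level differs from the previous one by exactly one added cue.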
Intonation
– Variation takes place on the last word
– Either rising or falling contour

Delay and filler structure
Experiment 1:
 "Wie wird das Wetter nächste Woche in ...?" → 1000 ms → "Eher kalt"
 "Wie wird das Wetter nächste Woche in ...?" → 2200 ms → "Eher kalt"
Experiment 2:
 "Wie wird das Wetter nächste Woche in ...?" → 1000 ms → "Eher kalt"
 "Wie wird das Wetter nächste Woche in ...?" → 1000 ms → "Hmm" → 1500 ms → "Eher kalt"
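The delay/filler timelines above can be represented as segment lists with explicit pause durations in milliseconds. A sketch under the assumption of the reconstructed orderings; segment names and the helper are illustrative:

```python
# Each stimulus is a sequence of (segment, duration_ms) pairs.
# Question, filler and answer durations depend on the synthesis,
# so they are left symbolic (None) here.
STIMULI = {
    "delay_1000":   [("question", None), ("silence", 1000), ("answer", None)],
    "delay_2200":   [("question", None), ("silence", 2200), ("answer", None)],
    "delay_filler": [("question", None), ("silence", 1000), ("filler", None),
                     ("silence", 1500), ("answer", None)],
}

def total_silence_ms(timeline):
    """Sum the explicit pause durations in a stimulus timeline."""
    return sum(d for seg, d in timeline if seg == "silence" and d is not None)
```

For example, the filler condition contains 1000 ms + 1500 ms = 2500 ms of explicit silence around the "Hmm".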
2. Modeling of uncertainty in articulatory speech synthesis – overview of the articulatory synthesizer (Birkholz 2005)
[Diagram: gestural score → vocal tract (one-dimensional tube model) → aerodynamic-acoustic simulation → speech signal]
[Figure: gestural score for "ziemlich kühl"]
[Figure: gestural score for the filler "… hm …"]
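A gestural score in this style of articulatory synthesis specifies overlapping gestures on parallel articulator tiers, each with an onset and offset time. A minimal data-structure sketch; the tier names, targets and times below are illustrative and do not reproduce the synthesizer's actual file format:

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    tier: str       # e.g. "glottis", "lips", "velum", "f0"
    target: str     # articulatory target or value on that tier
    onset_ms: int
    offset_ms: int

# Toy score for a nasal filler like "hm": voiced glottis, closed lips,
# lowered velum (open velopharyngeal port), and a late low f0 target.
score = [
    Gesture("glottis", "voiced", 0, 600),
    Gesture("lips", "closed", 0, 600),
    Gesture("velum", "open", 0, 600),
    Gesture("f0", "low", 100, 600),
]

def active_at(score, t_ms):
    """Return the gestures active at time t_ms."""
    return [g for g in score if g.onset_ms <= t_ms < g.offset_ms]
```

Because gestures on different tiers overlap freely in time, prosodic manipulations such as a lengthened filler or a changed f0 contour only require adjusting the relevant tier, leaving the rest of the score untouched.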
3. Perception studies
3.1 Experiment 1
Goal
Are subjects able to recognize intended certain/uncertain utterances in articulatory speech synthesis? Does certainty influence intelligibility?
Method
• 38 students from the University of Bonn and Saarland University
• Audio presentation in a group experiment and in individual testing (two different random orders of the stimuli)
• Judging the certainty and intelligibility of each answer of the expert system on a 5-point Likert scale (1 = uncertain/unintelligible, 5 = certain/intelligible)
• Wilcoxon signed-rank test
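The Wilcoxon signed-rank comparison of paired Likert ratings can be run with a standard implementation such as `scipy.stats.wilcoxon`. The ratings below are invented for illustration only, not the study's data:

```python
from scipy.stats import wilcoxon

# Hypothetical paired 5-point Likert certainty ratings, one pair per
# listener, for a "certain" vs. an "uncertain" version of a stimulus.
certain_ratings = [5, 4, 5, 4, 5, 4, 5, 3, 4, 5]
uncertain_ratings = [2, 1, 2, 3, 1, 2, 2, 2, 1, 2]

# The Wilcoxon signed-rank test suits paired ordinal data, where the
# normality assumption of a paired t-test is not warranted.
stat, p = wilcoxon(certain_ratings, uncertain_ratings)
print(f"W = {stat}, p = {p:.4f}")
```

Since every listener in this toy data rates the certain version higher, the smaller rank sum is 0 and the difference comes out significant.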
3. Perception studies
3.1 Experiment 1
Results
[Bar charts: median ratings (scale 0–5) for certain vs. uncertain audio stimuli; panels: perception of certainty and perception of intelligibility. Certainty judgments differ significantly between certain and uncertain stimuli (mostly ** ), while the intelligibility differences are mostly not significant. Legend: * p < 0.05, ** p < 0.001, n.s. not significant.]
Discussion
– Technical problems: reason for the relatively low intelligibility of "relativ heiss"
– Which role do fillers play in perceiving uncertainty?
3. Perception studies
3.2 Experiment 2
Goal
To what extent do different combinations of acoustic cues affect the perception of uncertainty?
Method
• Subjects: 34 students from the University of Bonn
• Audio presentation in three group experiments (three different random orders of the stimuli)
• Same procedure as in Experiment 1, but this time *only* judging the certainty of each answer of the system on a 5-point Likert scale (1 = uncertain, 5 = certain)
• Wilcoxon signed-rank test
3. Perception studies
3.2 Experiment 2
Results
[Bar chart: median certainty ratings (scale 0–5) per audio stimulus for "ziemlich kühl" (pretty chilly) and "eher kalt" (rather cold), across the conditions Certain, Uncertain 1 (high intonation), Uncertain 2 (intonation, delay) and Uncertain 3 (intonation, delay, filler). Two condition pairs do not differ significantly; all other pairs differ significantly with p < 0.001.]
3. Perception studies
3.2 Experiment 2
Discussion
• Results from Experiment 1 are generally confirmed: intended certain utterances can be clearly distinguished from uncertain ones (variation of intonation and delay)
Levels of uncertainty: what do our data suggest?
• Signaling uncertainty by high intonation alone is sufficient for perceiving uncertainty
• Delay as an additional acoustic cue does not yield a higher degree of uncertainty
• The combination of fillers, delay and high intonation has the strongest effect
• BUT: the role of delay and fillers per se cannot be inferred from our data
4. Conclusions
The study presents a first step towards modeling certainty and different degrees of uncertainty by means of articulatory speech synthesis.
Perception:
• Intonation by itself contributes to a higher degree of perceived uncertainty in our data; the combination of all three acoustic cues yields the highest degree of perceived uncertainty
Open questions:
• Influence of fillers and of delay per se
• Problems with judging a machine's meta-cognitive state
• Influence of the choice of wordings
Future work:
• Testing audiovisual stimuli for different degrees of uncertainty, finally making use of the 3-D vocal tract provided by the articulatory synthesizer
Literature
• Beskow, J. (2003). Talking Heads – Models and Applications for Multimodal Speech Synthesis. Doctoral dissertation, KTH, Stockholm, Sweden.
• Birkholz, P. (2005). 3D-Artikulatorische Sprachsynthese. Berlin: Logos Verlag.
• Brennan, S. E. and Williams, M. (1995). "The feeling of another's knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers". Journal of Memory and Language, 34, 383-398.
• Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjálmsson, H. and Yan, H. (1999). "Embodiment in conversational interfaces: REA". Proceedings of ACM CHI 99, 520-527.
• Ekman, P. (1972). "Universals and cultural differences in facial expressions of emotion". In Cole, J. (ed.), Nebraska Symposium on Motivation 1971, vol. 19, 207-283. Lincoln, NE: University of Nebraska Press.
• Hart, J. T. (1965). "Memory and the feeling-of-knowing experience". Journal of Educational Psychology, 56, 208-216.
• Lasarcyk, E. (2007). "Investigating larynx height with an articulatory speech synthesizer". Proceedings of the 16th ICPhS, Saarbrücken, August 2007.
• Lasarcyk, E. and Trouvain, J. (2007). "Imitating conversational laughter with an articulatory speech synthesizer". To appear in Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, Saarbrücken, August 2007.
• Schröder, M. (2004). Speech and Emotion Research: An Overview of Research Frameworks and a Dimensional Approach to Emotional Speech Synthesis. PhD thesis, PHONUS 7, Research Report of the Institute of Phonetics, Saarland University.
• Schröder, M. and Trouvain, J. (2003). "The German text-to-speech synthesis system MARY: A tool for research, development and teaching". International Journal of Speech Technology, 6, 365-377.
• Smith, V. and Clark, H. (1993). "On the course of answering questions". Journal of Memory and Language, 32, 25-38.
• Swerts, M., Krahmer, E., Barkhuysen, P. and van de Laar, L. (2003). "Audiovisual cues to uncertainty". Proceedings of the ISCA Workshop on Error Handling in Spoken Dialog Systems, Chateau-d'Oex, Switzerland, August/September 2003.
• Swerts, M. and Krahmer, E. (2005). "Audiovisual prosody and feeling of knowing". Journal of Memory and Language, 53(1), 81-94.
• Wundt, W. (1896). Grundriss der Psychologie. Leipzig: Verlag von Wilhelm Engelmann.