Charlotte Wollermann*, Eva Lasarcyk**
*Institute of Communication Sciences, University of Bonn
**Institute of Phonetics, Saarland University
[email protected], [email protected]

Modeling and Perceiving of (Un)Certainty in Articulatory Speech Synthesis
Overview
1. Introduction
 1.1 Emotion/attitude and speech synthesis
 1.2 Previous studies on (un)certainty
 1.3 Goal of the current study
2. Modeling of (un)certainty in articulatory speech synthesis
 2.1 Acoustical criteria
 2.2 The articulatory speech synthesis system
3. Perception studies
 3.1 Experiment 1
 3.2 Experiment 2
4. Conclusions
1. Introduction
1.1 Emotion and speech synthesis
• The modeling of emotion and attitude has gained extensive importance in the last few years
 – Goal: generating synthetic speech that is as natural and human-like as possible
 – Multimodal speech synthesis systems as possible applications: Talking Heads (Beskow 2003), Embodied Conversational Agents (Cassell et al. 1999)
• Most emotional speech synthesis systems are based on prototypical emotions according to Ekman (1972): happiness, sadness, anger, fear, surprise, disgust
1. Introduction
Emotion and speech synthesis:
• A different approach from emotion psychology: using evaluation, activation and power as basic dimensions for representing emotional states (Wundt 1896)
 – Ex.: EmoSpeak as part of the TTS system MARY (Schröder 2004; Schröder, Trouvain 2003)

Why investigate emotion and attitude in articulatory speech synthesis?
• The modeling of attitude has barely been investigated
• 3D articulatory synthesizer (Birkholz 2005): great degree of freedom and, at the same time, precise adjustment of single parameters
• (Un)certainty as a non-prototypical emotion
1. Introduction
1.2 Previous studies: Production and perception of (un)certainty in natural speech – acoustic domain
• Smith, Clark (1993): studying memory processes in question answering
 – Feeling of Knowing (FOK) paradigm (Hart 1965)
 – Uncertainty prosodically marked by rising intonation, delay, and linguistic hedges like "I guess"
• Brennan, Williams (1995): perception of the uncertainty of another speaker (Feeling of Another's Knowing, FOAK)
 – Intonation, form and latency of the answer; fillers like "hm", "uh" as relevant cues
1. Introduction
Previous studies: Production and perception of (un)certainty in natural speech – audiovisual domain
• Swerts et al. (2003, 2005): production and perception of (un)certainty in audiovisual speech
 – Delay, pause and fillers
 – Smiles, funny faces etc.

1.3 Goal of the current study
• Investigating the perception of uncertainty in human-machine interaction using articulatory synthesis
2. Modeling of uncertainty in articulatory speech synthesis
• General setting: stimuli are embedded into a context
 – Telephone dialog between a caller and a weather expert system
 Caller: "Wie wird das Wetter nächste Woche in ...?" (How is the weather going to be next week in ...?)
 System: "Eher kalt." (Rather cold.)
• Different levels of uncertainty indicated by the presence of
 – High intonation
 – Delay
 – Fillers
2. Modeling of uncertainty in articulatory speech synthesis

Stimulus design (caller's question: "Wie wird das Wetter nächste Woche in X?"; system's answers: "Ziemlich kühl", "Relativ heiss", "Eher kalt"):

ID   Level of certainty   Intonation high   Delay   Filler
1*   C                    -                 -       -
2    U1                   +                 -       -
3*   U2                   +                 +       -
4    U3                   +                 +       +
5*   C                    -                 -       -
6*   U(2)                 +                 +       -
7*   C                    -                 -       -
8    U1                   +                 -       -
9*   U2                   +                 +       -
10   U3                   +                 +       +
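The factorial design of the stimuli can be encoded programmatically, e.g. when generating stimulus lists. A minimal sketch, with condition names and cue flags taken from the table; the helper function itself is hypothetical, not part of the study's materials:

```python
# Each certainty level is defined by which acoustic cues are present.
# C = certain baseline; U1-U3 add cues cumulatively.
CONDITIONS = {
    "C":  {"high_intonation": False, "delay": False, "filler": False},
    "U1": {"high_intonation": True,  "delay": False, "filler": False},
    "U2": {"high_intonation": True,  "delay": True,  "filler": False},
    "U3": {"high_intonation": True,  "delay": True,  "filler": True},
}

def cues_for(level):
    """Return the list of active uncertainty cues for a condition."""
    return [cue for cue, on in CONDITIONS[level].items() if on]
```

Encoding the cues as independent flags keeps the cumulative structure of the design explicit: each uncertainty level differs from the previous one by exactly one added cue.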
Intonation
– Variation takes place on the last word
– Either rising or falling contour

Delay and filler structure
Experiment 1:
 "Wie wird das Wetter nächste Woche in ...?" → 1000 ms → "Eher kalt"
 "Wie wird das Wetter nächste Woche in ...?" → 2200 ms → "Eher kalt"
Experiment 2:
 "Wie wird das Wetter nächste Woche in ...?" → 1000 ms → "Eher kalt"
 "Wie wird das Wetter nächste Woche in ...?" → 1000 ms → "Hmm" → 1500 ms → "Eher kalt"
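The delay/filler timelines above can be represented as segment lists with explicit pause durations in milliseconds. A sketch under the assumption of the reconstructed orderings; segment names and the helper are illustrative:

```python
# Each stimulus is a sequence of (segment, duration_ms) pairs.
# Question, filler and answer durations depend on the synthesis,
# so they are left symbolic (None) here.
STIMULI = {
    "delay_1000":   [("question", None), ("silence", 1000), ("answer", None)],
    "delay_2200":   [("question", None), ("silence", 2200), ("answer", None)],
    "delay_filler": [("question", None), ("silence", 1000), ("filler", None),
                     ("silence", 1500), ("answer", None)],
}

def total_silence_ms(timeline):
    """Sum the explicit pause durations in a stimulus timeline."""
    return sum(d for seg, d in timeline if seg == "silence" and d is not None)
```

For example, the filler condition contains 1000 ms + 1500 ms = 2500 ms of explicit silence around the "Hmm".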
2. Modeling of uncertainty in articulatory speech synthesis – overview of the articulatory synthesizer (Birkholz 2005)
[Diagram: gestural score → vocal tract (one-dimensional tube model) → aerodynamic-acoustic simulation → speech signal]
[Figure: gestural score for "ziemlich kühl"]
[Figure: gestural score for the filler "… hm …"]
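A gestural score in this style of articulatory synthesis specifies overlapping gestures on parallel articulator tiers, each with an onset and offset time. A minimal data-structure sketch; the tier names, targets and times below are illustrative and do not reproduce the synthesizer's actual file format:

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    tier: str       # e.g. "glottis", "lips", "velum", "f0"
    target: str     # articulatory target or value on that tier
    onset_ms: int
    offset_ms: int

# Toy score for a nasal filler like "hm": voiced glottis, closed lips,
# lowered velum (open velopharyngeal port), and a late low f0 target.
score = [
    Gesture("glottis", "voiced", 0, 600),
    Gesture("lips", "closed", 0, 600),
    Gesture("velum", "open", 0, 600),
    Gesture("f0", "low", 100, 600),
]

def active_at(score, t_ms):
    """Return the gestures active at time t_ms."""
    return [g for g in score if g.onset_ms <= t_ms < g.offset_ms]
```

Because gestures on different tiers overlap freely in time, prosodic manipulations such as a lengthened filler or a changed f0 contour only require adjusting the relevant tier, leaving the rest of the score untouched.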
3. Perception studies
3.1 Experiment 1
Goal
Are subjects able to recognize intended certain/uncertain utterances in articulatory speech synthesis? Does certainty influence intelligibility?
Method
• 38 students from the University of Bonn and Saarland University
• Audio presentation in a group experiment and in individual testing (two different random orders of the stimuli)
• Judging the certainty and intelligibility of each answer of the expert system on a 5-point Likert scale (1 = uncertain/unintelligible, 5 = certain/intelligible)
• Wilcoxon signed-rank test
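The Wilcoxon signed-rank comparison of paired Likert ratings can be run with a standard implementation such as `scipy.stats.wilcoxon`. The ratings below are invented for illustration only, not the study's data:

```python
from scipy.stats import wilcoxon

# Hypothetical paired 5-point Likert certainty ratings, one pair per
# listener, for a "certain" vs. an "uncertain" version of a stimulus.
certain_ratings = [5, 4, 5, 4, 5, 4, 5, 3, 4, 5]
uncertain_ratings = [2, 1, 2, 3, 1, 2, 2, 2, 1, 2]

# The Wilcoxon signed-rank test suits paired ordinal data, where the
# normality assumption of a paired t-test is not warranted.
stat, p = wilcoxon(certain_ratings, uncertain_ratings)
print(f"W = {stat}, p = {p:.4f}")
```

Since every listener in this toy data rates the certain version higher, the smaller rank sum is 0 and the difference comes out significant.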
3. Perception studies
3.1 Experiment 1
Results
[Bar charts: median ratings (scale 0–5) for certain vs. uncertain audio stimuli; panels: perception of certainty and perception of intelligibility. Certainty judgments differ significantly between certain and uncertain stimuli (mostly ** ), while the intelligibility differences are mostly not significant. Legend: * p < 0.05, ** p < 0.001, n.s. not significant.]
Discussion
– Technical problems: reason for the relatively low intelligibility of "relativ heiss"
– Which role do fillers play in perceiving uncertainty?
3. Perception studies
3.2 Experiment 2
Goal
To what extent do different combinations of acoustic cues affect the perception of uncertainty?
Method
• Subjects: 34 students from the University of Bonn
• Audio presentation in three group experiments (three different random orders of the stimuli)
• Same procedure as in Experiment 1, but this time *only* judging the certainty of each answer of the system on a 5-point Likert scale (1 = uncertain, 5 = certain)
• Wilcoxon signed-rank test
3. Perception studies
3.2 Experiment 2
Results
[Bar chart: median certainty ratings (scale 0–5) per audio stimulus for "ziemlich kühl" (pretty chilly) and "eher kalt" (rather cold), across the conditions Certain, Uncertain 1 (high intonation), Uncertain 2 (intonation, delay) and Uncertain 3 (intonation, delay, filler). Two condition pairs do not differ significantly; all other pairs differ significantly with p < 0.001.]
3. Perception studies
3.2 Experiment 2
Discussion
• Results from Experiment 1 are generally confirmed: intended certain utterances can be clearly distinguished from uncertain ones (variation of intonation and delay)
Levels of uncertainty: what do our data suggest?
• Signaling uncertainty by high intonation alone is sufficient for perceiving uncertainty
• Delay as an additional acoustic cue does not yield a higher degree of uncertainty
• The combination of fillers, delay and high intonation has the strongest effect
• BUT: the role of delay and fillers per se cannot be inferred from our data
4. Conclusions
The study presents a first step towards modeling certainty and different degrees of uncertainty by means of articulatory speech synthesis.
Perception:
• Intonation by itself contributes to a higher degree of perceived uncertainty in our data; the combination of all three acoustic cues yields the highest degree of perceived uncertainty
Open questions:
• Influence of fillers and of delay per se
• Problems with judging a machine's meta-cognitive state
• Influence of the choice of wordings
Future work:
• Testing audiovisual stimuli for different degrees of uncertainty, finally making use of the 3-D vocal tract provided by the articulatory synthesizer
Literature
• Beskow, J. (2003). Talking Heads – Models and Applications for Multimodal Speech Synthesis. Doctoral dissertation, KTH, Stockholm, Sweden.
• Birkholz, P. (2005). 3D-Artikulatorische Sprachsynthese. Berlin: Logos Verlag.
• Brennan, S. E. and Williams, M. (1995). "The feeling of another's knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers". Journal of Memory and Language, 34, 383-398.
• Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjálmsson, H. and Yan, H. (1999). "Embodiment in conversational interfaces: REA". Proceedings of ACM CHI 99, 520-527.
• Ekman, P. (1972). "Universals and cultural differences in facial expressions of emotion". In Cole, J. (ed.), Nebraska Symposium on Motivation 1971, vol. 19, 207-283. Lincoln, NE: University of Nebraska Press.
• Hart, J. T. (1965). "Memory and the feeling-of-knowing experience". Journal of Educational Psychology, 56, 208-216.
• Lasarcyk, E. (2007). "Investigating larynx height with an articulatory speech synthesizer". Proceedings of the 16th ICPhS, Saarbrücken, August 2007.
• Lasarcyk, E. and Trouvain, J. (2007). "Imitating conversational laughter with an articulatory speech synthesizer". To appear in Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, Saarbrücken, August 2007.
• Schröder, M. (2004). Speech and Emotion Research: An Overview of Research Frameworks and a Dimensional Approach to Emotional Speech Synthesis. PhD thesis, PHONUS 7, Research Report of the Institute of Phonetics, Saarland University.
• Schröder, M. and Trouvain, J. (2003). "The German text-to-speech synthesis system MARY: A tool for research, development and teaching". International Journal of Speech Technology, 6, 365-377.
• Smith, V. and Clark, H. (1993). "On the course of answering questions". Journal of Memory and Language, 32, 25-38.
• Swerts, M., Krahmer, E., Barkhuysen, P. and van de Laar, L. (2003). "Audiovisual cues to uncertainty". Proceedings of the ISCA Workshop on Error Handling in Spoken Dialog Systems, Chateau-d'Oex, Switzerland, August/September 2003.
• Swerts, M. and Krahmer, E. (2005). "Audiovisual prosody and feeling of knowing". Journal of Memory and Language, 53(1), 81-94.
• Wundt, W. (1896). Grundriss der Psychologie. Leipzig: Verlag von Wilhelm Engelmann.