Multi-modal expression of Swedish prominence
Björn Granström
Centre for Speech Technology, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
Centrum för talteknologi


Historical background

• Prosody for speech synthesis at KTH, together with Rolf Carlson

• The Lund intonation model – Gösta Bruce et al.

Several joint projects

Profs – Prosodic phrasing in Swedish, ~1989-1992
Gösta Bruce, Björn Granström and more
First reference: G. Bruce and B. Granström. Modelling Swedish intonation in a text-to-speech system. STL-QPSR, 30(1):17-21, 1989. (on the KTH web)

Potentially ambiguous sentences, varying in phrase boundary location

Entering Count Piper's humble residence


Several joint projects, cont.

Prosodiag - Prosodic Segmentation and Structuring of Dialogue (HSFR + NUTEK) 1993-1996

Gösta Bruce, Björn Granström, Kjell Gustafson, David House, Paul Touati

Project Description
The object of study is the prosody of dialogue in a language technology framework. The primary goal of the project is to increase our understanding of how prosodic aspects of speech are exploited interactively in dialogue and, on the basis of this increased knowledge, to be able to create a more powerful prosody model.

Late reference: Gösta Bruce, Johan Frid, Björn Granström, Kjell Gustafson, Merle Horne, and David House. Prosodic segmentation and structuring of dialogue. TMH-QPSR, 37(3):1-6, 1996.

More than 20 joint publications – and then?

Much in the context of the annual phonetics meetings – next:

Project meetings in inspiring surroundings

... probing many different cultures

Is prosody more than sound?

• Our bias: communication is multi-modal
• Traditionally prosodic functions are signaled by “gestures”, perceived by “eye and ear”
• This concerns both body and face gestures
• Preliminary hypothesis: F0 ~ eyebrow height
  – e.g. Cavé et al. (1996)
• Easy to put to a test with multimodal speech synthesis

Eyebrow vs intonation

“Jag heter Axel, inte Axell” (translation: “My name is Axel, not Axell”). In Sweden Axel is a first name as opposed to Axell, which is a family name.

1 No eyebrow motion

2 Eyebrow motion controlled by the fundamental frequency of the voice

3 Eyebrow motion at focal accents +

4 Eyebrow motion at the first focal accent +
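
Condition 2 above ties eyebrow motion to the fundamental frequency. As a rough illustration only, the sketch below shows one way such a coupling could be set up: F0 is normalised within an assumed speaker range and scaled linearly to an eyebrow raise value. The function name, the F0 range and the hold-during-unvoiced behaviour are all assumptions made for this example, not the control scheme of the KTH synthesiser.

```python
def f0_to_eyebrow_raise(f0_hz, f0_min=80.0, f0_max=200.0, max_raise=1.0):
    """Map a sequence of F0 samples (Hz) to eyebrow raise values in [0, max_raise].

    Hypothetical mapping: linear in F0, clipped to the assumed speaker range.
    Unvoiced frames (None or 0) hold the previous value; the track starts at 0.
    """
    raises = []
    previous = 0.0
    for f0 in f0_hz:
        if f0:  # voiced frame
            norm = (f0 - f0_min) / (f0_max - f0_min)
            norm = min(1.0, max(0.0, norm))  # clip to [0, 1]
            previous = norm * max_raise
        raises.append(previous)
    return raises


# Example: one unvoiced frame in the middle, one frame above the assumed range.
track = f0_to_eyebrow_raise([80, 140, None, 200, 250])
print(track)
```

Conditions 3 and 4 would instead trigger a fixed eyebrow gesture at accent locations, independent of the F0 values.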

Goals and research context

• How are visual expressions used to convey and strengthen prosodic functions?

• Understand interactions between visual expressions, dialog functions and speech acoustics

• Context: animated talking agent
  – Realistic communicative behavior using multimodal speech synthesis

Visual prosodic functions

• Prominence
  – stress
  – focus
• Phrasing
• Utterance type
  – question
  – statement
• Dialogue functions
  – back channeling
  – turntaking
• Attitudes
• Emotions

Visual prosody cont.

• What is underlying?
• How tight is the AV connection?
• What are the important visual gestures?
• More optional than acoustic prosodic parameters?
• Individual and cultural variation
• Reinforcing or qualifying acoustics?

Formal experiment

Prominence due to eyebrow rise
5 content words: ”När pappa fiskar stör piper Putte” (When dad is fishing sturgeon, Putte is whimpering)

Example of stimuli

Task: “which word is most prominent” (identical acoustics – varied location of eyebrow movement)

No eyebrow movement (neutral)

Eyebrow movement

Prominence increase due to eyebrow movement

Influence on judged prominence by eyebrow movement

[Bar chart. Y axis: % prominence due to eyebrow movement, 0 to 50. Groups: Swedish, Foreign, All]

Feedback experiment

• Mini dialogues (two turns)
• Travel agent application
• Both visual and acoustic feedback cues
• Affirmative cues – agent understands/accepts the request
• Negative cues – agent is unsure about the request (seeks confirmation)
• Six cues hypothesised

Granström, House & Swerts (2002)

Pos/Neg feedback experiment

Cue            Affirmative setting      Negative setting
Smile          Head smiles              Head has neutral expression
Head movement  Head nods                Head leans back
Eyebrows       Eyebrows rise            Eyebrows frown
Eye closure    Eyes close a bit         Eyes open widely
F0 contour     Declarative intonation   Interrogative intonation
Delay          Immediate reply          Slow reply

(Granström, House & Swerts 2002)

Cue strength

[Bar chart. Y axis: average response value, 0 to 3]

Recording of communicative interactions

Automatic tracking of reflective spots in 3D (Qualisys)

Interactions: emotion and articulation (resynthesis)

(from AV speech database – EU/PF_STAR project)

Measurement points for lip coarticulation analysis

Lateral distance

Vertical distance

left mouth corner
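
To make the two measures concrete, here is a minimal sketch of how lateral and vertical lip distances might be computed from 3D marker positions. It assumes that the lateral distance is taken between the two mouth-corner markers and the vertical distance between an upper-lip and a lower-lip marker; the slide does not spell out the exact marker pairs, so both the pairing and the function name are assumptions made for this example.

```python
import math


def lip_distances(left_corner, right_corner, upper_lip, lower_lip):
    """Return (lateral, vertical) lip distances for one frame.

    Each argument is an (x, y, z) marker position in the same unit (e.g. mm).
    The marker pairing is a hypothetical choice for illustration.
    """
    lateral = math.dist(left_corner, right_corner)  # mouth width
    vertical = math.dist(upper_lip, lower_lip)      # lip opening
    return lateral, vertical


# One made-up frame, coordinates in mm.
lateral, vertical = lip_distances(
    left_corner=(-25.0, 0.0, 0.0),
    right_corner=(25.0, 0.0, 0.0),
    upper_lip=(0.0, 8.0, 0.0),
    lower_lip=(0.0, -4.0, 0.0),
)
print(lateral, vertical)  # 50.0 12.0
```

Applied frame by frame to the tracked markers, this gives one time series per measure that can be compared across vowels and expressive modes.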

The expressive mouth

• All vowels (sentences)
  – Encouraging
  – Happy
  – Angry
  – Sad
  – Neutral

”left mouth corner”

(Svanfeldt et al. 2003)

Prompted read speech database

• Expressive modes:
  – Confirming, questioning, certain, uncertain, happy, (angry)
• 39 short, content neutral sentences with three possible focal accent positions each, e.g.
  – Båten seglade förbi (The boat sailed by)
  – Dom flyttade möblerna (They moved the furniture)
• Nonsense words (VCV, VCCV, CVC)
• Digits

Mean eyebrow positions for one speaker

Nose marker traces with automatic (blue) and two human (red) annotated head nods (adapted from Cerrato & Svanfeldt 2006)

Examples from the database

[Example stimuli. Rows: Confirming, Happy]

Focal accent on: Båten seglade förbi

Exploitation of visual parameters

• Visual cues exploited at focal accent
• Mouth cues
  – Happy, encouraging
• Eyebrow cues
  – Happy, questioning
• Vertical head nods
  – Confirming

Analysis in terms of FAP and FMQ

MPEG-4 Facial Animation Parameter (FAP): a subset of 31 FAPs out of the 68 FAPs defined in the MPEG-4 standard, including only the ones that we were able to calculate directly from our measured point data

Focal Motion Quotient, FMQ, defined as the standard deviation of a FAP parameter taken over a word in focal position, divided by the average standard deviation of the same FAP in the same word in non-focal position.
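
The FMQ definition above can be written out as a small computation. The sketch below is only one reading of it: it takes a FAP track (a list of samples) for the word in focal position and several tracks of the same FAP for the same word in non-focal position, and divides the focal standard deviation by the mean non-focal standard deviation. The function name and the use of the population standard deviation are assumptions; the slides do not say which estimator was used.

```python
from statistics import mean, pstdev


def focal_motion_quotient(focal_track, nonfocal_tracks):
    """FMQ for one FAP and one word.

    focal_track: FAP samples over the word when it carries the focal accent.
    nonfocal_tracks: list of sample lists for the same word, same FAP,
        in non-focal position.
    Uses the population standard deviation (an assumption).
    """
    focal_sd = pstdev(focal_track)
    nonfocal_sd = mean([pstdev(track) for track in nonfocal_tracks])
    if nonfocal_sd == 0:
        raise ValueError("no movement in non-focal tracks; FMQ is undefined")
    return focal_sd / nonfocal_sd


# Made-up numbers: the focal reading moves twice as much as the non-focal ones.
fmq = focal_motion_quotient(
    focal_track=[0, 2, 0, 2],
    nonfocal_tracks=[[0, 1, 0, 1], [1, 2, 1, 2]],
)
print(fmq)
```

Under this reading, an FMQ above 1 means the parameter varies more when the word is in focus than when it is not, and an FMQ near 1 means focus makes no difference for that parameter.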

[Bar chart. Y axis: 0 to 4.5. X axis: FAP. Series: Angry, Happy, Confirming, Questioning, Certain, Uncertain, Neutral. FAP groups marked on the chart: articulation | smile | brows | head]

FAPs shown: 3: open jaw; 14: thrust jaw; 15: shift jaw; 18: depress chin; 39: puff left cheek; 40: puff right cheek; 41: lift left cheek; 42: lift right cheek; 16: push bottom lip; 52: raise bottom midlip; 57: raise bottom lip lm; 58: raise bottom lip rm; 17: push top lip; 51: lower top midlip; 55: lower top lip left mid; 56: lower top lip rm; 53: stretch left cornerlip; 54: stretch right cornerlip; 59: raise left cornerlip; 60: raise right cornerlip; 31: raise left inner eyebrow; 32: raise right inner eyebrow; 33: raise left mid eyebrow; 34: raise right mid eyebrow; 35: raise left outer eyebrow; 36: raise right outer eyebrow; 37: squeeze left eyebrow; 38: squeeze right eyebrow; 48: head pitch; 49: head yaw; 50: head roll

The focal motion quotient, FMQ, averaged across all sentences, for all measured MPEG-4 FAPs for several expressive modes

The effect of focus on the variation of several groups of MPEG-4 FAP parameters, for different expressive modes

[Bar chart. Y axis: FMQ (Focal Motion Quotient), 0 to 3. X axis: Angry, Happy, Confirming, Questioning, Certain, Uncertain, Neutral. Series: articulation, smile, brows, head]

The effect of focal accent on selected parameter variations in Certain and Uncertain readings

[Bar chart. Y axis: FMQ (Focal Motion Quotient), 0 to 4. X axis: Certain, Uncertain. Series: 31: raise left inner eyebrow; 32: raise right inner eyebrow; 33: raise left mid eyebrow; 34: raise right mid eyebrow; 48: head pitch; 49: head yaw]

What's next?

• Better recordings
• Detailed analysis of the eye region: ”Gaze and wrinkles”
• Use in applications, e.g. spoken dialogue systems
• And more audible prosody…

New cooperative project

SIMULEKT - Simulering av svenskans prosodiska dialekttyper (Simulating intonational varieties of Swedish)

VR 2007-2009

And finally………..

Congratulations!

Well done Gösta!