
Page 1: Multimodal Affective Computing


Louis-Philippe Morency

Jeffrey Girard

Originally developed with help from

Stefan Scherer and Tadas Baltrušaitis

Multimodal Affective Computing

Lecture 5: Vocal Messages

Page 2: Multimodal Affective Computing

Outline of this week’s lectures

▪ Multiple Layers of Vocal Messages
▪ What we can convey with speech

▪ Fundamentals of speech production and hearing
▪ Anatomy of the vocal tract and the physiology of hearing

▪ Fundamental speech measures (direct vs. perceptual measures)

▪ Prosodic manipulation and its meaning

▪ Use and detection of varying voice quality

▪ Nonverbal vocal expressions
▪ Laughter, pause fillers (e.g. uh, um), and moans

▪ Practical tools for speech signal processing

▪ Automatic Techniques for visual processing

Page 3: Multimodal Affective Computing

Upcoming Schedule

▪ Week 5

▪ Tuesday 2/12: Lecture on Vocal Messages

▪ Thursday 2/14: Discussion (visual & vocal messages)

▪ Week 6:

▪ Tuesday 2/19: Lecture on Verbal Messages

▪ Thursday 2/21: Proposal presentations

▪ Sunday 2/24: Due date for proposal reports

▪ Week 7:

▪ Tuesday 2/26: Lecture on Statistical Analysis

▪ Thursday 2/28: Discussion (verbal messages)

Page 4: Multimodal Affective Computing

“A story is told as much by silence as by speech.” – Susan Griffin

Page 5: Multimodal Affective Computing

Multiple Layers of Vocal Messages

▪ Linguistic layer
▪ Carries the semantic information
▪ Language, grammar, phonemes, parts of prosody

▪ Paralinguistic layer
▪ Non-linguistic and nonverbal layer
▪ Conveys information about current affective state, mood, attitude etc.
▪ Voice quality, prosody, nonverbal vocal expressions (e.g. laughter, moans, sighs)
▪ Main topic of this lecture!

▪ Extralinguistic layer
▪ Identifies the speaker (e.g. age, gender, pitch range, habitual characteristics)

Page 6: Multimodal Affective Computing

Fundamentals of Speech

Page 7: Multimodal Affective Computing

Anatomy of the vocal tract

▪ Speech production involves multiple organs that shape the sound:

▪ Lungs

▪ Vocal folds

▪ Mouth/tongue

▪ Nasal cavity

▪ Lips

Pictures from Gray’s anatomy 1918

Page 8: Multimodal Affective Computing

Anatomy of the vocal tract (2)

▪ Vocal folds vibrate and produce the fundamental frequency (f0)

▪ Vibration is based on muscular tension and air pressure

Source: youtube.com

Page 9: Multimodal Affective Computing

Anatomy of the vocal tract (3)

▪ Vowels
▪ Tongue position influences the tone
▪ Tongue modulates the length of the cavities in the mouth

Source: wikipedia.org

Page 10: Multimodal Affective Computing

Anatomy of the vocal tract (4)

▪ Mouth opening, lip rounding, and teeth influence the sound

▪ Other sounds:
▪ Nasal sounds (e.g. [m], [n])
▪ Stops (e.g. [k], [t], …)

Ladefoged, A Course in Phonetics, 2004

Page 11: Multimodal Affective Computing

Anatomy of the hearing organ

▪ Outer ear collects sound waves
▪ Filters the sound and allows directionality detection

▪ In the middle ear, the sound waves are transferred via the eardrum and three little bones called ossicles

▪ The inner ear transforms physical waves into nervous signals

Pictures from Gray’s anatomy 1918

Page 12: Multimodal Affective Computing

Pictures from Gray’s anatomy 1918

Anatomy of the hearing organ (2)

▪ The transformation between sound waves and the nervous signals sent to the brain is not linear

▪ There are perceptual differences:
▪ Not every frequency is perceived with the same intensity or accuracy

Page 13: Multimodal Affective Computing


Anatomy of the hearing organ (3)

▪ Perceptual measures

▪ Loudness vs. intensity

Try it out: http://tinyurl.com/px5g7cf

Page 14: Multimodal Affective Computing

Anatomy of the hearing organ (4)

▪ Perceptual measures
▪ Loudness vs. intensity
▪ Pitch vs. fundamental frequency

Try it out: http://tinyurl.com/px5g7cf

Page 15: Multimodal Affective Computing

Representation of Speech

Page 16: Multimodal Affective Computing

Representations of speech

Different domain representations:
▪ Time
▪ Frequency

▪ A signal can be transferred from one domain to the other using the Fourier transform

▪ The Fourier transform measures the amount of “presence” of a sine wave with a certain frequency in the original signal

▪ Two types: broadband and narrowband spectrogram
▪ Bandwidth is defined by the width of the analysis window
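The window/bandwidth trade-off can be sketched with a plain FFT-based spectrogram. This is a minimal illustration; the 30 ms (narrowband) and 5 ms (broadband) window lengths and the synthetic 200 Hz tone are assumptions, not values from the slides:

```python
import numpy as np

def spectrogram(signal, sr, win_len_s, hop_s=0.005):
    """Magnitude spectrogram via short-time Fourier transform.
    A long window (~30 ms) gives a narrowband spectrogram (fine
    frequency resolution); a short window (~5 ms) gives a broadband
    one (fine time resolution)."""
    win = int(win_len_s * sr)
    hop = int(hop_s * sr)
    window = np.hanning(win)
    frames = np.array([signal[i:i + win] * window
                       for i in range(0, len(signal) - win, hop)])
    # one spectral slice (magnitude of the real FFT) per frame
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t)                # synthetic 200 Hz tone

narrow = spectrogram(x, sr, win_len_s=0.030)   # narrowband: 480-sample window
broad = spectrogram(x, sr, win_len_s=0.005)    # broadband: 80-sample window
# frequency resolution is sr / win, so it is ~33 Hz vs. ~200 Hz here
```

The longer window resolves individual harmonics; the shorter one smears them but tracks fast temporal changes such as stop bursts.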

Page 17: Multimodal Affective Computing

Fundamental Frequency

Fundamental frequency (f0) is the basic rate of vibration of the vocal folds

▪ Harmonics are integer multiples of f0

Source: wikipedia.org
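A common way to estimate f0 from a voiced frame is to find the strongest autocorrelation peak within a plausible pitch-period range. A minimal sketch; the 50-400 Hz search range and the synthetic 120 Hz frame are illustrative assumptions:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=400.0):
    """Estimate f0 as the lag of the strongest autocorrelation peak
    within the plausible pitch-period range [sr/fmax, sr/fmin]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sr / fmax)            # shortest period considered
    hi = int(sr / fmin)            # longest period considered
    period = lo + np.argmax(ac[lo:hi])
    return sr / period

sr = 16000
t = np.arange(int(0.04 * sr)) / sr
# synthetic voiced frame: 120 Hz fundamental plus its 2nd harmonic
frame = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
f0 = estimate_f0(frame, sr)
```

Restricting the lag range is what keeps the estimator from locking onto a harmonic (octave errors), a classic failure mode of pitch trackers.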

Page 18: Multimodal Affective Computing

Formant Frequencies

Formant frequencies are resonance frequencies that depend on the length of the vocal tract cavities

▪ The vocal tract is manipulated by the tongue etc.

Source: wikipedia.org
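Formants are typically estimated by fitting an all-pole (linear prediction) model to a frame and reading the resonances off the roots of the LPC polynomial. The sketch below is not a production formant tracker; the absence of pre-emphasis, the fixed model order, and the synthetic single-resonance test signal are all simplifying assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(signal, order):
    """Linear-prediction coefficients via the Levinson-Durbin recursion."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / e   # reflection coefficient
        a[:i + 1] += k * a[:i + 1][::-1]
        e *= (1 - k * k)
    return a

def formants(signal, sr, order=8):
    """Formant candidates = angles of the complex LPC roots (upper half plane)."""
    roots = np.roots(lpc(signal, order))
    roots = roots[np.imag(roots) > 0]       # keep one root of each conjugate pair
    return np.sort(np.angle(roots) * sr / (2 * np.pi))

# synthetic test: a single vocal-tract-like resonance at 700 Hz
sr = 8000
rng = np.random.default_rng(0)
pole = 0.97 * np.exp(1j * 2 * np.pi * 700 / sr)
a_true = np.poly([pole, np.conj(pole)]).real
x = lfilter([1.0], a_true, rng.standard_normal(4000))
f = formants(x, sr, order=4)                # one candidate should be near 700 Hz
```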

Page 19: Multimodal Affective Computing

Example Use of Formant Frequencies

Vowel space

▪ Gender and age define vowel space size.

What could be a reason for a reduced vowel space?

Page 20: Multimodal Affective Computing

Example Use of Formant Frequencies

Vowel space

▪ Medical relevance: Parkinson’s Disease, ALS, Depression

[Figure: subject vs. reference vowel space for /i/, /u/, /a/ in the Formant 1 (Hz) × Formant 2 (Hz) plane]

Page 21: Multimodal Affective Computing

Mel frequency cepstral coefficients (MFCC)

▪ Compact representation of the spectrum
▪ Emulates human hearing
▪ Popular for speech recognition
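The MFCC pipeline (frame → spectrum → mel filter bank → log → DCT) can be sketched end to end. The frame/hop sizes and filter counts below are common defaults, not values prescribed by the lecture:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, n_filters=26, n_coeffs=13, win=400, hop=160):
    """Toy MFCC pipeline: frame -> |FFT|^2 -> mel filter bank -> log -> DCT."""
    window = np.hamming(win)
    frames = np.array([signal[i:i + win] * window
                       for i in range(0, len(signal) - win, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    n_bins = power.shape[1]
    # triangular filters spaced evenly on the mel scale (mimics hearing)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor(mel_to_hz(mel_pts) / (sr / 2) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT decorrelates the log filter-bank energies; keep the low coefficients
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_coeffs]

sr = 16000
t = np.arange(sr) / sr
coeffs = mfcc(np.sin(2 * np.pi * 300 * t), sr)
# one 13-dimensional vector per 10 ms hop
```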

Page 22: Multimodal Affective Computing

Mel frequency cepstral coefficients (MFCC)

[Figure: MFCC extraction pipeline — waveform → spectrogram (time × frequency) → frame spectrum → mel filter bank]

Page 23: Multimodal Affective Computing

Mel frequency cepstral coefficients (MFCC)

[Figure: MFCC extraction pipeline (continued) — frame spectrum → mel filter bank → filter bank output → mel frequency cepstral coefficients]

Page 24: Multimodal Affective Computing


Prosody

Page 25: Multimodal Affective Computing


Prosody can strongly influence the meaning

Page 26: Multimodal Affective Computing

Elements of prosody

▪ Prosody, the suprasegmental envelope of an utterance, is composed of:
▪ Syllable length
▪ Loudness
▪ Pitch
▪ Pauses

▪ Prosody influences and defines:
▪ Prosodic boundaries
▪ Question or statement
▪ Sarcasm
▪ Emotional state
▪ Meaning of words (e.g. in Chinese)

50 shades of “yeah”

Page 27: Multimodal Affective Computing


Prosodic Boundaries

I met Mary and Elena’s mother at the mall yesterday.

Sally saw % the man with the binoculars.

Sally saw the man % with the binoculars.

When Madonna sings % the song is a hit.

When Madonna sings the song % it’s a hit.

Can you hear the difference?

Page 28: Multimodal Affective Computing

Same ‘tune’, different alignment

[Figure: pitch contour, 50-400 Hz]

LEGUMES are a good source of vitamins

What is a good source of vitamins?

Page 29: Multimodal Affective Computing

Same ‘tune’, different alignment

[Figure: pitch contour, 50-400 Hz]

Legumes are a GOOD source of vitamins

Are legumes a source of vitamins?

Page 30: Multimodal Affective Computing

Same ‘tune’, different alignment

legumes are a good source of VITAMINS

[Figure: pitch contour, 50-400 Hz]

What are legumes a good source of?

Page 31: Multimodal Affective Computing

Other Uses of Pitch Contour Analysis

▪ A rising pitch contour towards the end of an utterance indicates a yes-no question

▪ WH-questions have a falling pitch contour

▪ Doubt or uncertainty can be signaled by a rising contour at the end of an utterance
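These contour cues suggest a trivially simple rise/fall classifier: fit a line to the final portion of the f0 contour and look at the sign of its slope. A sketch with synthetic contours; the 40% tail fraction is an arbitrary assumption:

```python
import numpy as np

def contour_direction(f0_contour, tail_fraction=0.4):
    """Classify the final pitch movement of an utterance by fitting a
    line to the last part of the f0 contour (voiced frames only)."""
    voiced = np.asarray(f0_contour, dtype=float)
    voiced = voiced[voiced > 0]                     # drop unvoiced frames (f0 = 0)
    tail = voiced[int(len(voiced) * (1 - tail_fraction)):]
    slope = np.polyfit(np.arange(len(tail)), tail, 1)[0]
    return "rising" if slope > 0 else "falling"

# synthetic contours in Hz, one value per frame (0 = unvoiced)
question = [110, 112, 0, 115, 118, 130, 150, 170]   # yes-no question: final rise
statement = [140, 135, 0, 130, 120, 110, 100, 95]   # declarative: final fall
```

A real system would also need a pitch tracker and voicing detector in front of this, plus smoothing of octave errors.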

Page 32: Multimodal Affective Computing

Yes-No question tune

are LEGUMES a good source of vitamins

Rise from the main accent to the end of the sentence.

[Figure: pitch contour, 50-550 Hz]

Page 33: Multimodal Affective Computing

Yes-No question tune

are legumes a GOOD source of vitamins

Rise from the main accent to the end of the sentence.

[Figure: pitch contour, 50-550 Hz]

Page 34: Multimodal Affective Computing

Yes-No question tune

are legumes a good source of VITAMINS

Rise from the main accent to the end of the sentence.

[Figure: pitch contour, 50-550 Hz]

Page 35: Multimodal Affective Computing

Rising statements

[Figure: rising pitch contour, 50-550 Hz, for “legumes are a good source of vitamins”]

High-rising statements can signal that the speaker is seeking approval.

[... does this statement qualify?]

Page 36: Multimodal Affective Computing


Emotional state

▪ Prosody can be used to express emotion

Page 37: Multimodal Affective Computing


Emotional state: Neutral

▪ Low pitch

▪ Low pitch variation

▪ Moderate loudness

Page 38: Multimodal Affective Computing


Emotional state: Angry

▪ High pitch

▪ High pitch variation

▪ High intensity

Page 39: Multimodal Affective Computing


Emotional state: Fear

▪ Average pitch

▪ Average pitch variation

▪ Average intensity

Page 40: Multimodal Affective Computing

Emotional state (5)

▪ QUIZ (Who knows the Germans best?)

▪ Allowed answers:

▪ Happiness

▪ Anger

▪ Sadness

▪ Disgust

▪ Fear

▪ Neutral

Page 41: Multimodal Affective Computing


Voice Quality

Page 42: Multimodal Affective Computing

Voice Quality

▪ Does not refer to fidelity or “goodness”
▪ Refers to the timbre or coloring of a voice

▪ Functions:
▪ Meaning or disambiguation of words (e.g. in Gujarati) (http://www.phonetics.ucla.edu/vowels/chapter12/gujarati.html)
▪ Paralinguistic signal
▪ Attitude
▪ Mood
▪ Social factors (e.g. standing, (inter-)personality)
▪ Affective state
▪ Turn-management (e.g. in Finnish)

Page 43: Multimodal Affective Computing

Voice Quality

▪ Can be seen as the signal residue after removing the effects of the vocal tract filter

▪ The phonation type and the manner in which the vocal folds vibrate play a major role (exceptions: e.g. whisper)

Page 44: Multimodal Affective Computing

Vocal Folds: Phonation Gestures

▪ Adductive tension: movement toward the midline of the vocal folds

▪ Medial compression: adductive force on the vocal processes

▪ Longitudinal pressure: tension of the vocal folds

Page 45: Multimodal Affective Computing

Modal Voice

▪ “Neutral” mode
▪ Moderate muscular adjustments
▪ Periodic vibration of the vocal folds, full closing of the glottis, no audible friction
▪ Frequency of vibration and loudness in the low to mid range for conversational speech

Page 46: Multimodal Affective Computing

Harsh/Tense Voice

▪ Very strong tension of the vocal folds, very high tension in the vocal tract

Page 47: Multimodal Affective Computing

Whispery Voice

▪ Little or no vocal fold vibration
▪ Very low adductive tension
▪ Moderately high medial compression
▪ Moderately high longitudinal tension
▪ Turbulence generated by friction of the air in and above the larynx, with the vocal folds not vibrating

Page 48: Multimodal Affective Computing

Creaky Voice (Vocal Fry)

▪ Vocal fold vibration at low frequency, irregular
▪ Low tension
▪ Vocal folds strongly adducted
▪ Weak longitudinal tension
▪ Moderately high medial compression

Page 49: Multimodal Affective Computing

Breathy Voice

▪ Low tension
▪ Minimal adductive tension
▪ Weak medial compression
▪ Medium longitudinal vocal fold tension
▪ Vocal folds do not come together completely, leading to frication

Page 50: Multimodal Affective Computing

Creaky Voice (Vocal Fry)

Page 51: Multimodal Affective Computing

Examples of Voice Quality Measures

▪ Open Quotient (OQ)

▪ Normalized Amplitude Quotient (NAQ)

▪ Peak Slope
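OQ, NAQ, and Peak Slope all require a glottal-flow estimate (e.g. via the inverse filtering implemented in COVAREP), which is beyond a short sketch. A related spectral correlate of phonation type that can be computed directly is H1-H2, the level difference between the first two harmonics, which tends to be larger for breathier voices. This is an illustrative stand-in, not one of the measures listed above, and the synthetic “breathy”/“tense” frames below are crude caricatures:

```python
import numpy as np

def h1_h2(frame, sr, f0):
    """H1-H2 in dB: spectral amplitude of the first harmonic minus the
    second. Breathier phonation tends to give larger (more positive)
    values; tenser phonation gives smaller or negative values."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    h1 = spec[np.argmin(np.abs(freqs - f0))]        # bin nearest f0
    h2 = spec[np.argmin(np.abs(freqs - 2 * f0))]    # bin nearest 2*f0
    return 20 * np.log10(h1 / h2)

sr, f0 = 16000, 120.0
t = np.arange(int(0.05 * sr)) / sr
# "breathy-like": strong fundamental relative to the 2nd harmonic
breathy = np.sin(2 * np.pi * f0 * t) + 0.2 * np.sin(2 * np.pi * 2 * f0 * t)
# "tense-like": more energy in the higher harmonic
tense = 0.5 * np.sin(2 * np.pi * f0 * t) + 1.0 * np.sin(2 * np.pi * 2 * f0 * t)

h1h2_breathy = h1_h2(breathy, sr, f0)
h1h2_tense = h1_h2(tense, sr, f0)
```

In practice H1-H2 is measured on real speech with formant correction; this raw version only illustrates the idea.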

Page 52: Multimodal Affective Computing

Suicide Prevention

▪ Nonverbal indicators of suicidal ideation
▪ Dataset: 30 suicidal adolescents / 30 non-suicidal adolescents
▪ Suicidal teenagers use more breathy tones

Source: CDC

[Figure: bar charts of Open Quotient (OQ), Std. Open Quotient (OQ std.), and Std. Norm. Amplitude Quotient (NAQ std.) for non-suicidal vs. suicidal groups; ** and **** mark significant differences]

[ICASSP 2013]

Page 53: Multimodal Affective Computing

Nonverbal Vocal Expressions

Page 54: Multimodal Affective Computing

Nonverbal vocal expressions

▪ Nonverbal vocal expressions are paralinguistic utterances
▪ Backchannels (e.g. uhu, hm, yeh)
▪ Laughter
▪ Moans/sighs
▪ Pause fillers (e.g. um, uh)

▪ Varying functions and possible interpretations

▪ Distinct prosodic and vocal characteristics

▪ Often accompanied by distinct facial expressions (e.g. laughing)

Page 55: Multimodal Affective Computing

Backchannels

▪ Backchannels are important fragments of speech

▪ Functions:
▪ Turn-taking management
▪ Signals agreement and attention

▪ …

Page 56: Multimodal Affective Computing

Laughter

▪ Important social communicative expression
▪ Understood by all cultures
▪ Varying meanings:
▪ Humorous laughter
▪ Signals agreement
▪ Uncertainty (e.g. social laughter, nervous laughter)
▪ Multimodal!

“Laughter […] is performed almost exclusively during social encounters; solitary laughter seldom occurs except in response to media, a source of vicarious social stimulation.” – Provine and Yong

Page 57: Multimodal Affective Computing


Laughter

Page 58: Multimodal Affective Computing


Laughter

Page 59: Multimodal Affective Computing

Laughter

▪ Acoustically very distinct but variable:
▪ Snort-like
▪ Inhaled
▪ Exhaled

▪ Often arranged in bouts consisting of single calls
▪ Calls are often about 200 ms long, but quite variable in length

Page 60: Multimodal Affective Computing

Hesitations/Pause fillers

▪ Hesitations (e.g. pause fillers: um, uh)

▪ Multiple functions:
▪ Signal anxiety or nervousness
▪ Proficiency in speaking
▪ …

▪ Characterized by prolonged vowels and a static spectrum
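The “prolonged vowel, static spectrum” characterization suggests a simple cue for spotting filled pauses: sustained regions of low spectral flux, i.e. little frame-to-frame spectral change. A minimal sketch; the frame sizes and the synthetic steady-tone vs. chirp comparison are illustrative assumptions:

```python
import numpy as np

def spectral_flux(signal, win=400, hop=160):
    """Frame-to-frame change of the normalized magnitude spectrum.
    Sustained low flux suggests a prolonged, spectrally static sound
    such as a filled pause ("um", "uh")."""
    window = np.hanning(win)
    frames = np.array([signal[i:i + win] * window
                       for i in range(0, len(signal) - win, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    spec /= spec.sum(axis=1, keepdims=True) + 1e-10   # normalize each frame
    return np.sqrt((np.diff(spec, axis=0) ** 2).sum(axis=1))

sr = 16000
t = np.arange(sr) / sr
steady = np.sin(2 * np.pi * 150 * t)                 # "uh"-like steady sound
changing = np.sin(2 * np.pi * (150 + 400 * t) * t)   # sweeping chirp

flux_steady = spectral_flux(steady).mean()
flux_changing = spectral_flux(changing).mean()
# the static sound yields much lower mean flux than the changing one
```

A real filled-pause detector would combine this with a voicing check and a minimum-duration constraint so ordinary short vowels are not flagged.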

Page 61: Multimodal Affective Computing

Practical Tools for Speech Processing

Page 62: Multimodal Affective Computing

Useful Software

COVAREP – A Cooperative Voice Analysis Repository for Speech Technologies (Matlab and Octave)

https://github.com/covarep/covarep/

Page 63: Multimodal Affective Computing

Useful Software

openSMILE – Speech & Music Interpretation by Large Space Extraction

▪ openSMILE is a fast, real-time (audio) feature extraction utility (C++)

http://opensmile.sourceforge.net/

Page 64: Multimodal Affective Computing

Useful Software

▪ Praat
▪ http://www.fon.hum.uva.nl/praat/
▪ Useful for feature extraction/visualization and more!

Page 65: Multimodal Affective Computing

Upcoming Conferences

▪ Affective Computing & Intelligent Interaction (ACII)
▪ http://acii-conf.org/2019/
▪ Deadline: April 12, 2019

▪ International Conference on Multimodal Interaction (ICMI)
▪ https://icmi.acm.org/2019/
▪ Deadline: May 7th, 2019

▪ Affective Computing Pre-Conference at ISRE (International Society for Research on Emotion)
▪ https://www.isre2019.org/program/pre-conferences/affective-computing
▪ Deadline: May 7th, 2019

▪ Empirical Methods in Natural Language Processing (EMNLP)
▪ https://www.emnlp-ijcnlp2019.org/
▪ Deadline: May 21st, 2019