Abstract
In music information retrieval (MIR) research, developing a computational model that comprehends the affective content of music signals and using such a model to organize music collections has been an essential topic. Emotion perception in music is by nature subjective. Consequently, a general emotion recognition system that performs equally well for every user could be insufficient. It would be more desirable for one's personal computer or device to understand his/her perception of music emotion. In our previous work, we developed the acoustic emotion Gaussians (AEG) model, which can learn the broad emotion perception of music from general users. Such a general music emotion model, called the background AEG model in this paper, can recognize the perceived emotion of unseen music from a general point of view. In this paper, we go one step further and realize personalized music emotion modeling by adapting the background AEG model with a limited number of emotion annotations provided by a target user in an online and dynamic fashion. A novel maximum a posteriori (MAP)-based algorithm is proposed to achieve this in a probabilistic framework. We carry out quantitative evaluations on a well-known emotion-annotated corpus, MER60, to validate the effectiveness of the proposed method for personalized music emotion recognition.
Personalized Music Emotion Recognition via Model Adaptation
Ju-Chiang Wang, Yi-Hsuan Yang, Hsin-Min Wang, and Shyh-Kang Jeng
Academia Sinica, National Taiwan University, Taipei, Taiwan
Outline
• Introduction
• The Acoustic Emotion Gaussians (AEG) Model
• Personalization via MAP Adaptation
• Music Emotion Recognition using AEG
• Evaluation and Results
• Conclusion
Introduction
• Developing a computational model that comprehends the affective content of musical audio signals, for automatic music emotion recognition and content-based music retrieval
• Emotion perception in music is by nature subjective (fairly user-dependent)
– A general music emotion recognition (MER) system could be insufficient
– It is desirable for one's personal device to understand his/her perception of music emotion
– An adaptive MER method should be efficient and effective
Basic Idea
• The UBM-GMM approach to speaker adaptation
– State of the art for speaker recognition
– A large background GMM (the universal background model, UBM) represents the speaker-independent distribution of acoustic features
– A speaker-dependent GMM is obtained via model adaptation with the speech data of a specific speaker
• Adaptive MER method for personalization
– A probabilistic background emotion model learns the broad emotion perception of music from general users
– The background emotion model is personalized via model adaptation in an online and dynamic fashion
Multi-Dimensional Emotion
• Emotions are treated as numerical values (instead of discrete labels) over two emotion dimensions, i.e., Valence and Arousal (Activation)
• Good visualization, a unified model
[Figure: the Mr. Emo system, developed by Yang and Chen]
The Valence-Arousal Annotations
• Different emotions may be elicited from a song
• Assumption: the VA annotations of a song can be drawn from a Gaussian distribution, as observed
• Learn from the multiple annotations and the acoustic features of the corresponding song
• Predict the emotion as a single Gaussian
The Acoustic Emotion Gaussians Model
• Represent the acoustic features of a song by a probabilistic histogram vector
• Develop a model that comprehends the relationship between acoustic features and VA annotations
– Wang et al. (2012), “The acoustic emotion Gaussians model for emotion-based music annotation and retrieval,” Proc. ACM Multimedia (full paper)

[Figure: acoustic GMM posterior distributions]
Construct Feature Reference Model
[Diagram: frame-based features are extracted from the music tracks (audio signals) of a universal music database; a global set of frame vectors, randomly selected from each track, is used for EM training of the acoustic GMM, whose components A_1, …, A_K each represent a specific acoustic pattern.]
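The EM training sketched in the diagram could look as follows; this is a minimal NumPy sketch (diagonal covariances, hypothetical data and function names), not the authors' implementation:

```python
import numpy as np

def fit_diag_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a diagonal-covariance GMM -- a sketch of how the
    acoustic codebook (A_1..A_K) could be trained on pooled frame vectors."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]   # init means from data points
    var = np.tile(X.var(axis=0), (K, 1))      # init variances from global data
    w = np.full(K, 1.0 / K)                   # mixture weights
    for _ in range(n_iter):
        # E-step: responsibilities via log-domain Gaussian densities
        logp = (np.log(w)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances from responsibilities
        Nk = r.sum(axis=0)
        w = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        var = (r.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

In practice a library implementation (e.g., a standard GMM trainer) would be used; the loop above only illustrates the EM mechanics behind the feature reference model.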
Represent a Song into Probabilistic Space
[Diagram: each frame vector of a song is evaluated against the acoustic GMM (codewords A_1, …, A_K); the resulting posterior probabilities over the K components are averaged into a histogram, the acoustic GMM posterior representation of the song.]
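The song-level probabilistic histogram could be computed from the frame vectors as follows; a minimal sketch assuming the diagonal-covariance GMM parameters (`w`, `mu`, `var`) from the previous step (names hypothetical):

```python
import numpy as np

def gmm_posterior_histogram(X, w, mu, var):
    """Map a song's frame vectors X (T x D) to a K-dim probabilistic
    histogram: the average posterior over the K acoustic Gaussians."""
    # Log-density of every frame under every component (diagonal covariance)
    logp = (np.log(w)
            - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
            - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
    logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)   # per-frame posteriors
    return post.mean(axis=0)                  # song-level histogram
```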
Generative Process of VA GMM
• Key idea: each component in the acoustic GMM generates a corresponding component Gaussian in the VA space
[Diagram: the audio signal of each clip is mapped through the acoustic GMM, viewed as a set of acoustic codewords A_1, …, A_K; each codeword generates a Gaussian in the VA space, yielding a mixture of Gaussians over valence and arousal.]
The Likelihood Function of VA GMM
• Each training clip is annotated by multiple users {uj}, indexed by j
• An annotated corpus: assume each annotation eij of clip si can be generated by a weighted VA GMM with {qik}!
• Generating the Corpus-level likelihood and maximize it using the EM algorithm
Annotation-level likelihood (the acoustic GMM posterior $q_{ik}$ weights the $K$ latent VA Gaussians, whose parameters $\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ are to be learned):

$$p(\mathbf{e}_{ij} \mid s_i) = \sum_{k=1}^{K} q_{ik}\,\mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

Corpus-level likelihood (each user contributes equally to the clip-level likelihood):

$$p(\mathbf{E} \mid \theta) = \prod_{i=1}^{N} p(\mathbf{E}_i \mid \theta) = \prod_{i=1}^{N} \prod_{j=1}^{U_i} \sum_{k=1}^{K} q_{ik}\,\mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
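The annotation-level likelihood can be evaluated directly; a minimal sketch for 2-D VA annotations, with hypothetical variable names:

```python
import numpy as np

def clip_likelihood(e, q, mus, Sigmas):
    """p(e | s_i) = sum_k q_ik N(e | mu_k, Sigma_k): likelihood of one
    2-D VA annotation e under the VA GMM weighted by acoustic posteriors q."""
    lik = 0.0
    for qk, mu, S in zip(q, mus, Sigmas):
        d = e - mu
        det = np.linalg.det(S)
        quad = d @ np.linalg.solve(S, d)               # Mahalanobis term
        lik += qk * np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(det))
    return lik
```

For example, a single standard-normal component evaluated at the origin gives the 2-D Gaussian peak density 1/(2π).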
Personalizing the VA GMM via MAP
• Apply Maximum A Posteriori (MAP) adaptation
• Suppose we have a set of personally annotated songs {e_i, q_i}, i = 1, …, M
• Compute the posterior probability of each component z_k for e_i, and the expected sufficient statistics weighted by these posteriors
Posterior probability of component $z_k$ given annotation $\mathbf{e}_i$:

$$p(z_k \mid \mathbf{e}_i, \theta) = \frac{q_{ik}\,\mathcal{N}(\mathbf{e}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{q=1}^{K} q_{iq}\,\mathcal{N}(\mathbf{e}_i \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)}$$

Expected sufficient statistics, weighted by the posteriors:

$$E_k(\boldsymbol{\mu}) = \frac{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta)\,\mathbf{e}_i}{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta)}, \qquad E_k(\boldsymbol{\Sigma}) = \frac{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta)\,\mathbf{e}_i \mathbf{e}_i^T}{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta)}$$
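The posteriors and expected sufficient statistics could be computed as follows; a vectorized NumPy sketch assuming 2-D VA annotations `E` (M x 2) and per-song acoustic posteriors `Q` (M x K), with hypothetical names:

```python
import numpy as np

def map_sufficient_stats(E, Q, mus, Sigmas):
    """Compute p(z_k | e_i), effective counts M_k, and the expected
    first/second-moment statistics E_k(mu), E_k(Sigma) for MAP adaptation."""
    M, K = Q.shape
    dens = np.zeros((M, K))
    for k in range(K):
        d = E - mus[k]
        S = Sigmas[k]
        quad = np.einsum('md,md->m', d @ np.linalg.inv(S), d)
        dens[:, k] = np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(S)))
    post = Q * dens
    post /= post.sum(axis=1, keepdims=True)            # p(z_k | e_i, theta)
    Mk = post.sum(axis=0)                              # effective counts M_k
    E_mu = (post.T @ E) / Mk[:, None]                  # E_k(mu)
    E_Sigma = np.einsum('mk,md,me->kde', post, E, E) / Mk[:, None, None]  # E_k(Sigma)
    return post, Mk, E_mu, E_Sigma
```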
MAP for GMM: Parameter Interpolation
• The updated parameters for the personalized VA GMM are derived by interpolation between the expected statistics and the background model:

$$\boldsymbol{\mu}'_k \leftarrow \alpha_k E_k(\boldsymbol{\mu}) + (1 - \alpha_k)\,\boldsymbol{\mu}_k,$$

$$\boldsymbol{\Sigma}'_k \leftarrow \alpha_k E_k(\boldsymbol{\Sigma}) + (1 - \alpha_k)\left(\boldsymbol{\Sigma}_k + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^T\right) - \boldsymbol{\mu}'_k {\boldsymbol{\mu}'_k}^T.$$

• The effective number of component $z_k$ for the target user:

$$M_k = \sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, \theta)$$

• The data-dependent interpolation factors can be set as $\alpha_k = M_k / M$.
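The interpolation step might look like this in NumPy; a sketch assuming α_k = M_k / M as read off the slide (note that Σ_k M_k = M, since the component posteriors sum to one per annotation), with hypothetical names:

```python
import numpy as np

def map_update(Mk, E_mu, E_Sigma, mus, Sigmas):
    """MAP interpolation between the expected sufficient statistics and
    the background VA GMM, with data-dependent alpha_k = M_k / M."""
    alpha = Mk / Mk.sum()                       # Mk.sum() equals M
    new_mus = alpha[:, None] * E_mu + (1 - alpha)[:, None] * mus
    # Covariance update: interpolate second moments, then re-center
    new_Sigmas = (alpha[:, None, None] * E_Sigma
                  + (1 - alpha)[:, None, None]
                    * (Sigmas + np.einsum('kd,ke->kde', mus, mus))
                  - np.einsum('kd,ke->kde', new_mus, new_mus))
    return new_mus, new_Sigmas
```

A useful sanity check: if the expected statistics exactly match the background model's moments, the update leaves the parameters unchanged.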
Graphical Interpretation – MAP Adaptation
[Diagram: the component VA Gaussians of the background model are shifted toward the personal annotations, with the acoustic GMM posterior and the interpolation factors controlling how much each component moves. The personal annotations can come from clips exclusive to (i.e., not included in) the background training set.]
Music Emotion Recognition
• Given the acoustic GMM posterior of a test song, predict the emotion as a single VA Gaussian
[Diagram: acoustic GMM posterior $\{\hat{q}_1, \ldots, \hat{q}_K\}$ → learned VA GMM → predicted single Gaussian $\{\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\}$]

$$p(\hat{\mathbf{e}} \mid s) = \sum_{k=1}^{K} \hat{q}_k\,\mathcal{N}(\hat{\mathbf{e}} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)$$
Find the Representative Gaussian
• Minimize the cumulative weighted relative entropy
– The representative Gaussian has the minimal cumulative KL distance from all the component VA Gaussians
• The optimal parameters of the Gaussian are
$$\{\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\} = \arg\min_{\{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}} \sum_{k=1}^{K} \hat{q}_k\, D_{\mathrm{KL}}\!\left( \mathcal{N}(\mathbf{e} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)\, \big\|\, \mathcal{N}(\mathbf{e} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \right)$$

$$\boldsymbol{\mu}^* = \sum_{k=1}^{K} \hat{q}_k\, \hat{\boldsymbol{\mu}}_k, \qquad \boldsymbol{\Sigma}^* = \sum_{k=1}^{K} \hat{q}_k \left( \hat{\boldsymbol{\Sigma}}_k + (\hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}^*)(\hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}^*)^T \right)$$
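The optimal parameters above amount to moment matching of the mixture; a minimal NumPy sketch (names hypothetical):

```python
import numpy as np

def representative_gaussian(theta, mus, Sigmas):
    """Collapse the predicted VA GMM (weights theta, components mus/Sigmas)
    into the single Gaussian minimizing the cumulative weighted KL distance."""
    mu_star = np.einsum('k,kd->d', theta, mus)          # weighted mean
    diff = mus - mu_star
    # Weighted within- plus between-component covariance
    Sigma_star = np.einsum('k,kde->de', theta,
                           Sigmas + np.einsum('kd,ke->kde', diff, diff))
    return mu_star, Sigma_star
```

For instance, two equally weighted unit-covariance components at valence ±1 collapse to a Gaussian at the origin whose valence variance is inflated by the spread between the component means.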
Evaluation – Dataset and Acoustic Features
• MER60
– 60 music clips, each 30 seconds long
– 99 users in total; each clip annotated by 40 subjects
– 6 users annotated all the clips
– Personalization is evaluated on the basis of these 6 users
• Bag-of-frames representation; the analysis of emotion is performed at the clip level instead of the frame level
– 70 dimensions: dynamic, spectral, timbre (13 MFCCs, 13 delta MFCCs, and 13 delta-delta MFCCs), and tonal features
Evaluation – Incremental Setting
• Incremental adaptation experiment per target user
– Randomly split all the clips (with annotations) into 6 folds
– Perform 6-fold cross-validation:
• Hold out one fold for testing
• Use the remaining 5 folds, with all annotations except the target user's, to train the background VA GMM
• In each of P = 5 iterations, add one fold of the target user's annotations to the adaptation pool
– Use the adaptation pool to adapt the background VA GMM
– Evaluate prediction performance on the test fold
Evaluation – Result
• Metric (ALL_i): compute the log-likelihood of the target user's ground-truth annotations under the predicted Gaussian
Conclusion and Future Work
• The AEG model provides a principled probabilistic framework that is technically sound, and flexible for adaptation
• We have presented a novel MAP-based adaptation technique which is very efficient for personalizing the AEG model
• Demonstrated the effectiveness of the proposed method for personalizing MER in an incremental learning manner
• We will investigate maximum likelihood linear regression (MLLR), which learns a linear transformation over the parameters of the AEG model