
Bachelor’s Thesis

Sensitivity in evaluating classifications

Name H. Ling (Hansen)

Student ID s678825

Title Sensitivity in evaluating classifications

University Tilburg University

Faculty Faculty of Humanities

Major Business Communication and Digital Media

Bachelor or Master Bachelor

Place Tilburg

Date 28 January 2011

Supervisor dr. M.M. van Zaanen

Second reader dr. T. Gaustad


Preface

Writing this Bachelor's thesis on the evaluation of the automatic mood classification system has been an exciting and interesting journey. It concludes my studies in the Bachelor's program Business Communication and Digital Media at the Faculty of Humanities of Tilburg University.

Many thanks are in order to those who have aided me in writing this thesis. Firstly, I

would especially like to thank my supervisor Menno van Zaanen for his patience, guidance, and

support throughout the entire process. My gratitude also goes to Tanja Gaustad for her time to

co-assess this thesis.

I would like to thank my supervisor again for providing the data from the automatic mood classifier, a system that was built by Pieter Kanters and Menno van Zaanen. Furthermore, I thank Crayon Room for providing the dataset with the Moody Counts that was used for the experiment as well.

Of course, I would also like to thank my family for their enduring support.

Hansen Ling

Tilburg, January 2011


Abstract

This thesis is on incorporating weighted distance in the evaluation process of classifiers.

We have studied the effects on the evaluation process of the automatic mood classification

system by Van Zaanen and Kanters (2010). This classifier assigns a mood to music based on its

lyrics. We have focused on two research questions in this study.

The first question concerns incorporating the distance between mood classes in the evaluation of the system. Currently, the system's evaluation is too strict because it uses a binary metric in which only perfect matches with a song's Moody Tag are counted as a success. However, wrong predictions are not necessarily completely wrong, because of nuances between moods. To incorporate this sensitivity, two standard distance metrics were considered: the Euclidean and the Taxicab metric. The results have shown that with the weighted accuracy we can evaluate the results more in-depth and discover a difference between the retrieval metrics used. Based on the new metrics, it was concluded that the tf*idf metric was more accurate than the tf+tf*idf retrieval metric, which according to the original evaluation performed identically.

Secondly, the Moody Tags are the current gold standard. We studied how replacing this gold standard with data that is obtained directly from social tagging, known as the Moody Counts, influences the evaluation of the system. The Moody Tags are derived from the Moody Counts and are therefore an indirect source. The results have shown that the new standard evaluates in a more fine-grained manner, but did not show a significant difference between the tf*idf and tf+tf*idf features.


Table of contents

- Preface
- Abstract
- Table of contents

1 Introduction
1.1 Audio Compression Formats
1.2 The story of MP3
1.3 Sharing music
1.4 Metadata of music files
1.5 Music and mood

2 Background information
2.1 Web 2.0
2.1.1 Social Tagging
2.1.2 Metadata of MP3
2.2 Music collection
2.2.1 Creating playlists
2.2.2 Automatic playlist generation
2.2.3 Mood based playlist
2.3 Music, moods, and emotions
2.3.1 Thayer’s model of mood
2.3.2 Language and mood
2.3.3 Music, lyrics and mood

3 Background information for research
3.1 Moody’s mood framework
3.1.2 System’s mood classes
3.2 tf*idf weighting
3.3 Automatic mood classification system using tf*idf based on lyrics
3.4 Confusion Matrix
3.5 Distance Metrics

4 Research questions
4.1 Research purpose

5 Methodology
5.1 Data
5.2 Method
5.2.1 Method for RQ1
5.2.2 Method for RQ2

6 Results
6.1 Results RQ1: Distance metrics
6.2 Results RQ2: Moody Counts

7 Conclusions and discussion
7.1 Answer to RQ1
7.2 Answer to RQ2
7.3 Sensitivity and new standard

8 Future research

9 References


1. Introduction

This Bachelor’s thesis is about incorporating a weighted distance in the evaluation of classifiers. It is a follow-up study on Van Zaanen and Kanters’ (2010) “Automatic mood classification system using tf*idf based on lyrics”, henceforth also referred to as the system, with the focus on the evaluation of this classifier. Their system automatically categorizes music tracks into specific mood classes based on the lingual aspect of music. This was achieved using word-oriented metrics such as tf*idf and tf+tf*idf, standard metrics from the field of information retrieval that are based on term frequency and inverse document frequency. Their study has shown that the words in lyrics contain valuable information on the mood that the songwriter or artist wants to transfer to his or her audience. Building on their study, we can examine the effects of a weighted distance.

This section gives an introduction from digital music files to the mood in music. It

describes how large digital music collections came about and how we can organize them by

creating playlists that are based on specific criteria such as the mood that music gives people.

1.1 Audio Compression Formats

Nowadays almost everyone has a (portable) device with which they can play digital

audio files to listen to music. There are numerous devices enabling you to enjoy your favorite

songs. It can be for instance a mobile phone, a portable game console, a portable media player,

or a computer.

The introduction of audio compression formats such as Windows Media Audio [WMA],

Advanced Audio Coding [AAC], Vorbis, and the most popular MPEG-1 Audio Layer 3 [MP3] has

allowed people to carry along more music occupying less space than music compact discs [CD].

In addition, it has allowed us to share music more rapidly and conveniently, due to the relatively small file sizes and the convenience of the internet and data sticks. To give an example, a four-minute song would be about 40 megabytes [MB] in size without any compression. When you convert the file to a standard-quality MP3 format, the same four-minute song takes up approximately 4 MB of space.


1.2 The story of MP3

The lossy audio coding technique MP3 was developed by the German institute Fraunhofer IIS (http://www.iis.fraunhofer.de/). The researchers at Fraunhofer IIS found a way to make audio files about ten times smaller by leaving out auditory elements that cannot, or can hardly, be heard by the human ear, a principle known as auditory masking.

According to Fraunhofer IIS, the MP3 standard first appeared as part of MPEG-1 in 1992

and they decided on the name MP3 for layer 3 in 1995. In 1998, the first portable MP3-players

became available: “Rio 100” by United States’ Diamond Multimedia and “MPMAN” by Korea’s

Saehan Information Systems. These players used flash memory to store and play MP3-files.

MP3-files were either downloaded from the internet or encoded from music CD’s. This has

allowed people to start possessing a music collection on their computer, create their own

playlists, and carry music around on MP3-players.

MP3 and MP3-players rapidly gained massive popularity and started to gain preference

over compact discs. CD-players take up more physical space than MP3-players or multimedia

players such as an iPod. Furthermore, CD-players require a music CD in order to play the music

you wish to hear. Unless you want to listen to one CD, limited to approximately 18 songs, you

will need to carry along extra music CD’s that take up physical space as well. In contrast, MP3

files can be directly stored on the storage space that is usually implemented in your media

player. This has brought much convenience as you do not have to take out the CD and put in

another to listen to other artists. In addition, a music CD can store about 700 MB of data, which is 74 minutes worth of music, while MP3-players depend on their storage capacity, which nowadays can be several gigabytes [GB] (1 GB ≈ 1000 MB). To complete the comparison, assume that you have a portable device that can play MP3-files and has a storage capacity of 1 GB; this equals 250 four-minute songs, totalling 16.7 hours of music. Therefore, it is not difficult to conclude that MP3-players are currently preferred over CD-players.
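To make the arithmetic behind these figures explicit, the short sketch below recomputes them. It is an illustration only and relies on the rough rules of thumb used in this chapter (about 10 MB per minute of uncompressed audio, about 1 MB per minute for a standard-quality MP3, and 1 GB ≈ 1000 MB); exact sizes depend on the bitrate.

# Back-of-the-envelope storage arithmetic for the figures used above.
# Assumptions: ~10 MB per minute uncompressed, ~1 MB per minute as MP3, 1 GB ~ 1000 MB.
SONG_MINUTES = 4

uncompressed_mb = SONG_MINUTES * 10        # ~40 MB per song without compression
mp3_mb = SONG_MINUTES * 1                  # ~4 MB per song as a standard-quality MP3

player_capacity_mb = 1000                  # a 1 GB portable player
songs_on_player = player_capacity_mb // mp3_mb
hours_of_music = songs_on_player * SONG_MINUTES / 60

print(f"Uncompressed: ~{uncompressed_mb} MB, MP3: ~{mp3_mb} MB per song")
print(f"A 1 GB player holds ~{songs_on_player} songs, ~{hours_of_music:.1f} hours of music")
# prints ~250 songs and ~16.7 hours, matching the comparison above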

1.3 Sharing music

During the introduction of the MP3 most people had analogue internet access via the

telephone line. This dial-up internet access allowed access by establishing a dialed connection

to the internet service provider [ISP]. It did not matter if you were surfing the World Wide Web


or downloading files, every kilobyte mattered due to limited transfer speeds of up to 5 to 7

kilobytes per second. Therefore, file sizes were of great importance during the dial-up

generation for a faster internet experience. In addition, since you were essentially calling your ISP to gain internet access, you had to pay telephone costs as well, which emphasized the importance of file sizes even more. However, transfer speeds have gone up since then with the introduction of broadband internet access, a high-data-rate connection, although dial-up users are still present in many parts of the world. Hence, MP3 was and still is a welcome development.

Since 1999, file sharing systems such as Napster came into existence (Liebowitz, 2004).

These peer-to-peer file sharing programs linked users’ computers together so that they could share their files with each other, disregarding any copyright infringement. People now had access to music collections and could build up a large collection for free. With the ability to obtain

numerous songs via the internet and being able to share them with others, for instance

through data sticks, the need to buy CD’s has declined. Unlike with CD’s, you can easily add and

delete songs from your computer or portable media-player. This has removed the necessity to

spend money on compact discs that contain songs of your preference as well as songs that you

do not enjoy. Overall, research has shown that the purchase of music CD’s has decreased since

people started to share copyrighted music using the MP3-format illegally (Liebowitz, 2004).

1.4 Metadata of music files

With the songs digitally stored on our computer, we can organize them according to

specific criteria such as artist or genre. In the case of MP3-files, it is possible to store this kind of

information in ID3-tags. This tag is integrated in the MP3-file and contains several text fields in

which you can enter information such as artist, title, album title, and genre; metadata.

Assuming that the ID3-tags of all songs are complete, you can automatically create playlists

according to artist, album, or title.


1.5 Music and mood

It is commonly known that music not only contains but also gives a certain mood and

emotion to the listeners (e.g. Juslin & Sloboda, 2001; Meyer, 1956). Radocy and Boyle (1988)

stated that in a common culture when people listen to the same song, they tend to agree with

one another about the elicited mood of the particular song (as cited in Liu, Lu, & Zhang, 2003,

p.13).

Nowadays, many people have a large music collection stored digitally on their

computer or portable device. According to Voong and Beale (2007), music listeners are

interested in creating playlists that suit the listener’s mood. However, it is cumbersome to

remember what mood each song in your music collection elicits. You could manually tag each

of them, but this is a very time consuming and tedious activity that requires you to have

listened to all of your songs.

2. Background information

This section introduces the Web 2.0 phenomenon that has caused an enormous

increase in interactivity on the World Wide Web. This also resulted in a collective provision and

exchange of data such as music track’s metadata and music files.

Furthermore, we discuss how we can organize our large music collection now that a

number of sources are available to obtain music from. As Van Zaanen and Kanters’ automatic

classifier categorizes music by mood, we discuss the mood aspect of music.

2.1 Web 2.0

The term Web 2.0 suggests that the World Wide Web received an update at some point. In fact, it refers to the changed ways of using the World Wide Web by software

developers and end-users. It is a follow-up on Web 1.0, which mainly consisted of static web

pages that provided users with a one-way stream of information. To differentiate the two web


generations, O’Reilly (2005) listed seven main features to describe Web 2.0 which have been

listed below:

- Providing services with cost-effective scalability, and not limited to packaged software

(O’Reilly, 2005); e.g. Google started out with just the search service that expanded with

services such as Google Analytics, Google’s advertising service, Google Mail, etc.

- Having control over unique data sources that get richer when more users make use of

them (O’Reilly, 2005); e.g. BitTorrent is a system that allows files to be shared among

users which are provided by users. The more users who share a (fragment of a) file, the

more sources you have to obtain the complete file from, and the faster your download

will be. The users provide matters such as the bandwidth, the availability of files, and

the files they are willing to share.

- Allowing and treating users as co-developers (O’Reilly, 2005); e.g. open source software

that is software of which the source code is free to be used and altered by others.

Software developers are then able to learn from, improve, and/or make use of each

other’s work. This results in software with direct input from users and therefore more

suitable to the users’ needs.

- Making use of users’ combined intelligence (O’Reilly, 2005). For instance, Flickr

(http://www.flickr.com) is a company that allows people to share pictures online. Users

can share, view, and search for photos. In addition, users can attach related keywords

to a photo; also known as tagging. This allows other users to search and find photos

based on keywords. Another example is Wikipedia, an online encyclopedia on which

entries can be added and edited by different users.

- Benefitting from the power of smaller websites or users that make up most of the

Web (O’Reilly, 2005), The Long Tail (Anderson, 2004), which are combined through a

service such as Google’s advertising service. This allows advertisers to reach visitors

that are part of The Long Tail.

- Providing software that can be used on different devices (O’Reilly, 2005).

- Using “lightweight user interfaces, development models, and business models” (O’Reilly,

2005, p. 37); simple for users to use, up to the users to decide what to do with the

obtained data, and easy for others to re-use and remix data.

Web 2.0 is therefore seen as the next generation web experience. It shifted from the

static Web 1.0 to the dynamic and interactive Web 2.0 that provides a rich user experience.


Some of these features can be related to the automatic mood classification system. The

system can be introduced as a service for music listeners to gain access to music of a specific mood and, in the process, to discover a mass of previously unknown songs.

Furthermore, the classification system is employable in various settings which give users a new

way of interacting with music; e.g. making recommendations, automatic playlist generation,

and as an organizing tool for large music collections. In addition, the evaluation of the system’s

results is currently against data that is acquired through social tagging which is a valuable way

of gaining data directly from the users. It should have direct access to this data for its

evaluation because these data give a realistic view of the mood that people experience.

2.1.1 Social Tagging

Web 2.0 has meant a massive increase in interactivity on the World Wide Web (section

2.1). This new environment has brought new possibilities and techniques for people to “offer,

find, and interact with online music content” (Kanters, 2009, p. 7). One of the techniques is

tagging which is known under a variety of names such as collaborative tagging, social

classification, social indexing, folksonomy, and social tagging (Tonkin et al., 2008; Voss, 2007).

Tagging is when end-users assign keywords to items after which the tags are immediately

available for other users to see and to use. Social tagging can be seen as a way of indexing

done by end-users instead of experts (Voss, 2007). The keywords are freely chosen instead of

using a controlled vocabulary (Tonkin et al., 2008). The objects being tagged usually concern

digital items such as photos, music, videos, blog posts, or documents. This tagging technique

has become a popular feature for a rapidly increasing number of websites such as Flickr, reddit

(http://www.reddit.com/), and del.icio.us (http://www.delicious.com/).

According to Vander Wal (2005), there are two types of folksonomy (a combination of

folk and taxonomy): broad folksonomy and narrow folksonomy. In broad folksonomy, such as

on del.icio.us, someone created an object and every other user is allowed to freely tag the

object in their own words. In contrast, in narrow folksonomy, such as used on Flickr, people

can only tag their own objects (once). In both cases of folksonomy, tags are used to describe

and organize objects which can be retrieved and accessed by using keywords in your search.

In the field of music, social tags have become an important information source for

online music recommendation systems (Eck, Lamere, Bertin-Mahieux, & Green, 2007). These


systems allow users to automatically generate playlists according to their input terms. These

terms will be matched against the tags of songs which were given by users. Last.fm is an

example of such a recommendation site that uses social tags as their information source to

recommend music to listeners.

2.1.2 Metadata of MP3

Digital audio files such as those in the popular MP3-format can contain audio-related

information such as related text, song title, graphical information, and genre. This information

is stored in a data container embedded in the MP3-file, called an ID3-tag. When you play back an audio file, the audio software may read the metadata and display information such

as artist name, title of track, album name, year and genre. Most people are familiar with this

because many audio devices display this kind of information.

When the Fraunhofer IIS developers originally decided on the .mp3 file-name extension

for the MPEG Layer 3, it did not yet contain the ID3-tag. However, people preferred to know

this kind of information. The same preference occurs when you are listening to a music CD and

come across an unknown song or when you would like to know the content of an album. You

do this by scanning the cover on the back of its jewel CD case for the song title behind the

corresponding track number. It was Eric Kemp who had the idea in 1996 to implement

information as a tag at the end of the audio file (http://www.id3.org/). This information would

include title, artist, album, year, genre, and a comment field. Even though this was a nice

addition to audio files, it has a fixed size of 128 bytes, which means that there are at most 128 characters available for all the information combined. In addition, the number of characters per field is fixed. For instance, “song title” can contain a maximum of 30 characters and so can the field for “artist”, but “year” can only have 4. This tag is known as an ID3-tag, the first version;

ID3v1.

Shortly after ID3v1 came a small improvement that was made by Michael Mutschler in

1997 (http://www.id3.org/). He took 2 bytes from the comment field, and added a new field

where the track number is to be stored. This new version of the tag is known as ID3v1.1.
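As a sketch of this fixed layout, the snippet below reads the last 128 bytes of an MP3 file and splits them into the ID3v1 fields described above, including the ID3v1.1 track-number convention. It is a minimal illustration, not part of the classification system discussed in this thesis.

import struct

def _text(raw):
    # Strip padding (NUL bytes and spaces) and decode a fixed-size text field.
    return raw.rstrip(b"\x00 ").decode("latin-1")

def read_id3v1(path):
    # An ID3v1/ID3v1.1 tag occupies the final 128 bytes of the file.
    with open(path, "rb") as f:
        f.seek(-128, 2)
        tag = f.read(128)
    if tag[:3] != b"TAG":                      # no ID3v1 tag present
        return None
    # Fixed-size fields: title(30) artist(30) album(30) year(4) comment(30) genre(1)
    title, artist, album, year, comment, genre = struct.unpack("30s30s30s4s30sB", tag[3:])
    info = {"title": _text(title), "artist": _text(artist), "album": _text(album),
            "year": _text(year), "genre": genre}
    # ID3v1.1 convention: the last two comment bytes hold a zero byte plus the track number
    if comment[28] == 0 and comment[29] != 0:
        info["track"] = comment[29]
        comment = comment[:28]
    info["comment"] = _text(comment)
    return info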

However, despite the appreciation of the ID3-tag, both ID3v1 and ID3v1.1 still had their

limitations such as the low character limit per field and the placement at the end of the file


which makes it the last piece of information to receive when you are streaming the song.

Therefore, Martin Nilsson and Michael Mutschler came up with the idea for ID3v2 in 1998. Their approach, however, is radically different from ID3v1. ID3v2 is not limited by a fixed size or field limits and is stored at the beginning of an audio file. Furthermore, it supports Unicode, which means that foreign characters are possible too. The new ID3-tag has been improved in such a way that it can even store pictures, lyrics, equalizer presets, and more, because it stores information in a variable number of flexibly sized fields (http://www.id3.org/). Therefore, as the

automatic mood classifier (Van Zaanen & Kanters, 2010) bases its classification on lyrics, it

could automatically implement the mood class in the ID3-tags without searching the World

Wide Web for those lyrics.

ID3-tags can be edited by the users themselves, and shared with others. In other words,

metadata can be found on the internet and automatically assigned to the specific audio files by

audio software or other software provided that sufficient input is given to retrieve the correct

data. It is possible to manually tag songs or edit incorrect tags but there are online music

libraries from which you can get (most of) your information such as All Music

(http://www.allmusic.com) and Last.fm (http://www.last.fm/). In addition, you can add your

contribution to the database and therefore help other users which is one of the Web 2.0

characteristics.

With the information in these tags, we can ease the process of retrieving, and

organizing large quantities of music tracks based on criteria such as artist, genre, and mood.

The addition of mood class would increase the granularity and therefore the interaction with

music.

2.2 Music collection

With the introduction of the MP3, the availability of music via the World Wide Web and

file-sharing has increased immensely whether it is through legal or illegal means. This has been

felt by record labels via their sales (Liebowitz, 2004; Oberholzer-Gee & Strumpf, 2007). An

illegal means would be, for instance, obtaining music via The Pirate Bay (http://thepiratebay.org/), a website where millions of users freely exchange music, software, movies, and other files on the internet, thereby infringing copyrights. An example of a legal music resource is

iTunes (http://www.apple.com/itunes/), an online music service where users can preview and


buy digital music files for roughly one US Dollar per song of which a percentage goes to the

artists. Aside from online sources, music lovers exchange these digital files amongst each other as

well.

These resources enable us to acquire a massive, yet personally selected, music collection on our computers and portable music players (Voong & Beale, 2007). People’s music

collections have grown due to the increased approachability of music on and off the internet. It

is therefore of importance to organize this mass of music in order to retrieve relevant

information in a time-efficient manner.

2.2.1 Creating Playlists

Kanters (2009) states that: “Playlists and music collections have to be organized in the

best possible way in order to find the relevant information effectively.” In resemblance to

arranging your tangible music records, there are various approaches in organizing your files,

but they all require some kind of criterion. Unlike records, digital music files require far less

physical space and they are much easier to rearrange to fit other categories such as artist or

album. Moreover, when your digital collection is loaded into a library system it will be even

easier to sort and retrieve your music files according to different criteria, assuming that the

necessary information is present in filenames or metadata (ID3-tags). The flexibility of today’s

music allows us to arrange them in any order or combination by making a playlist (Andric &

Haus, 2006; Kanters, 2009).

People create and prepare playlists for various occasions such as jogging or driving a car.

In general, music listeners create a playlist in a few ways and usually play digital music on the

computer. The first method is to load your entire or partial music collection into a playlist and

activate the shuffle mode. This is an easy and quick solution, but it is not based on a specific

criterion. Moreover, you have no or limited control over the contents of the playlist. This

results in a randomly generated list that might not suit your current mood, depending on how

diverse your collection is. Another method is to manually arrange songs in folders or playlists

according to criteria such as artist, song title, album, genre, or mood. Vignoli (2004) found that

music listeners tend to use a hierarchical structure to manually organize their music with

folders and subfolders. He also found that users who own a CD collection tend to organize their

digital music collection according to the way their CD’s are organized. This allows some


flexibility in creating playlists by loading a main group or subgroup(s) into the list. However,

there are many disadvantages to this approach: especially when you are in possession of a large music collection, it becomes difficult to navigate your way through the masses. It is necessary to at least listen to some passages of a music track in order to place the song in a category, and your personal judgment of the song is also required. Creating playlists as

previously described is time consuming, tedious, and static. Pauws and Eggen (2002) state: “It

is hard to arrive at an optimal playlist as music has personal appeal to the listener and is judged

on many subjective criteria.” If your criterion changes, you would need to re-arrange your

collection. In addition, the supply of music continues to grow which makes it important to have

a reliable and effective information access and retrieval (Kanters, 2009). This clearly shows a

need for specialized tools that give users the ability to conveniently fulfill their music demands.

These tools must be able to classify and retrieve files in accordance to the users’ input such as

genre, or mood. The result could mean a custom generated playlist or an organized music

collection that suits the user’s criterion (Kanters, 2009; Meyers, 2007).

2.2.2 Automatic playlist generation

As music collections have grown immensely, people tend to forget parts of their

collection or get lost in the masses. Therefore, new tools are required that allow them to, for

instance, reach forgotten music or easily generate specified playlists; a new way to interact

with music (Vignoli & Pauws, 2005). Vignoli and Pauws evaluated a music retrieval system that

sorts songs based on a seed song and found that it took users less effort and time to make a

more satisfying playlist according to criteria such as mood, rhythm, or artist. Vignoli and

Pauws concluded that “providing users with complete control on their personal definition of

music similarity is found to be more useful and preferred than no control.”

Automatic playlist generators have become available in recent years. For instance, the Smart Playlist feature in the popular music software iTunes automatically creates playlists based on the user’s criteria. However, the number of criteria is limited and they require the metadata to contain the necessary information; e.g. artist, year, album, rating, newest song,

and play count. Moreover, some criteria require users’ input such as “rating” which is a score

that is given by users to a song, and “play count” which is the number of times each song has

been listened to on iTunes. This approach tends to skip the majority of the songs that have no


rating or have never been played before; The Long Tail (Anderson, 2004). You could set the

search for “newest songs” and “no ratings”, but this would include useless results.

2.2.3 Mood based playlist

When you create playlists according to criteria such as artist, album, or title, you

require the necessary information on the music which is usually found in its metadata. With

the complete information it becomes easier to sort and create specific playlists. According to

Liu et al. (2003), with the growing music collection on computers and the internet, it has

become apparent how important (semantic) metadata are for easy access to music.

We can also add additional information that specifies the mood of the music track. By

adding mood tags to the metadata we can view our music collection from a new perspective.

Currently, these tags are manually applied by listeners or obtained through social tagging by

using tagging tools such as Moody (http://www.moodyapp.com/). We can identify a mood in

music tracks the same way we do with moods in our daily life. “Mood tagging and tagging in

general is a relatively new way of expressing one’s feelings or thoughts” (Kanters, 2009, p. 15).

These tags can be used to create mood based playlists. If users want to create a list with

joyful songs they can sort the music files according to mood tags, select all joyful-tagged music

tracks and load them in their playlist. Moreover, it is common knowledge that music affects us

in many ways and there has been much research on musical influences (e.g. Koopman & Davies,

2001; Milliman, 1982; Thompson, Schellenberg, & Husain, 2001). As a mood based playlist

contains songs of the same mood and because people are affected by music, it can be used to

change the mood of the listener over a period of time (Kanters, 2009).

2.3 Music, moods, and emotions

As human beings we are constantly facing and being influenced by emotions and

moods in our daily life. Even though the concepts emotion and mood are related as they are

feelings, there are differences.

As is described by Watson (2000), emotions such as surprise and fear “represent an

organized, highly structured reaction to an event that is relevant to the needs, goals, or


survival of the organism.” Watson furthermore writes that they contain interrelated

components: a prototypical form of expression, a consistent autonomic change, a distinct

feeling state, and an adaptive behavior. These can be best explained with an example. You can

recognize the emotion of anger by observing the person’s facial expressions that are universally

typical for anger (Ekman, 1971): the eyebrows are pulled inward and downwards with wrinkles

above the center of the eyes, the upper and lower eyelids pull together a little which make it

appear as though the person is squinting, the lips are pressed tightly together or in other cases

they open which shapes the mouth squared with raised lips that may point forward, and the

teeth might show (Ekman, 1971). An angry person usually experiences autonomic changes such

as increasing of heart rate, getting a hot or flushed face, and tensing of muscles (Shields, 1984).

This emotion is also known to give the person an annoyed and irritated feeling (Watson, 2000).

Lastly, anger is often an emotional reaction with the purpose of defending oneself, maintaining one’s personal integrity, or correcting social injustice, and it is at times a reaction to protect oneself in dangerous situations in order to survive (Izard, 1991).

Moods and emotions are very much alike because they are periods of feeling that come

and go. However, the episodes of moods have a relatively longer duration compared to

emotions which are usually intense but brief states. Depending on the level of intensity, an

emotion can last from seconds at lower intensity to a few hours when dealing with high

intensity (Izard, 1991). A mood on the other hand, can last for hours or a few days (Thayer,

1989; Watson, 2000). Furthermore, both emotions and moods are influenced by external

events and experiences, but with the difference that internal processes play a role as well

when dealing with moods. In addition, this concept includes all transient feeling states. To

elaborate on the latter matter, it includes states that are milder versions of emotions such as

annoyance for anger and nervousness for fear. It also includes states that require low levels of

activation and arousal like fatigue and serenity which occur frequently in daily life. Watson

(2000) finishes the definition with the notion that in order to define the nuances of affective

experience we deal with in our everyday life we need several states that do not clearly point to

one classically defined emotion.

Kanters (2009) assumes in his study that an emotion is part of the mood, the mood being a basic state that someone finds themselves in, with occurrences of energetic emotional

outbursts. As previously discussed in this section, emotions are intense but brief states while

moods go on for longer and reside internally. These fluctuations in intensity can also be


experienced in music tracks. We sense an overall feeling (mood) in which emotional outbursts

can occur during the length of the song.

2.3.1 Thayer’s model of mood

Thayer (1989) views mood as an experience of biological as well as psychological nature.

According to Thayer, our moods are made up of energy and tension. These are the dimensions

of his mood model, with valence on the x-axis of the plane, which is derived from tension, and

arousal (energy) on the y-axis. Valence is furthermore divided into negative valence (negative

moods) on the left side and positive valence (positive moods) on the right side. On the vertical

axis (arousal), there is at the top high energy arousal (energetic moods) and at the bottom

there is low energy arousal (calm moods). This divides the Valence-Arousal space into four

quadrants of moods as is shown in figure 1: positive valence and high arousal, positive valence

and low arousal, negative valence and high arousal, and negative valence and low arousal. For

instance, a high arousal and negative valence can be an angry mood while a high arousal and

positive valence are moods such as happy and excited.

Figure 1: Thayer Mood Model with examples of mood classes. The horizontal axis is valence (negative to positive) and the vertical axis is arousal (low to high energy). Example moods: angry, annoyed, and nervous (high arousal, negative valence); excited, happy, and content (high arousal, positive valence); gloomy, sad, and bored (low arousal, negative valence); relaxed, satisfied, and calm (low arousal, positive valence).


2.3.2 Language and mood

People experience mood on a regular basis. It is an emotional state people find

themselves in that has a relatively long duration. Certain moods can be expressed through

verbal and non-verbal language such as speech, written texts, body language, and facial

expressions.

Among other studies on the relation between mood, and words and language use

(Bower, 1981; Stenius, 1967; Teasdale & Russell, 1983), a study by Beukeboom and Semin

(2006) showed how our current mood state is reflected in our language to describe a social

event. People in a negative mood have a different word choice and language use than those

who are in a positive mood. When someone is in a negative mood, they tend to describe an

event to the point, providing concrete information. The same event described by someone in a positive mood will be described with more active interpretation and enriched information. Furthermore, it can

also work the opposite way where the mood of the speaker can be derived from their language

use. Other cues such as emotional tone of voice and facial expressions also help expressing

their mood.

Therefore, people are aware of their mood which can be expressed through language

and people can sense the mood from the speaker through communication. Although aspects

such as body language and tone of voice help determine the mood, language can contain

information on mood as well.

2.3.3 Music, lyrics and mood

We listen to music as it gives us a certain feeling that musical aspects such as melody,

key, and rhythm can emphasize. However, Kanters (2009) states that you can also transfer a

certain mood onto the listeners by using mood related words in the lyrics. Music lyrics are texts

that contain words and phrases that are written by song writers and spoken (sung) by artists.

These can express a certain mood and emotion, just as in normal communication. If you want to express happiness in your song, you typically use words that carry that emotional load, such as “happy” and “joyful”, and not “darkness” and “hell”. Furthermore, the intensity

of a song can be achieved through choice of words as well.


The focus of Kanters’ study lies in the linguistic aspects of music. He assumes that lyrics

contain lexical items which express a certain mood that is transferred by the sender (artist or

songwriter) to the receiver (listener or reader). The mood of a song can be detected via its

lyrics without the presence of aspects such as the musical tracks, key, and tempo. However,

these musical aspects do emphasize the emotional load in the lyrics. Even though the vocal

aspect is of importance in detecting the mood, there are words of which the emotional load is

clear in written form such as “joy” or “misery” (Kanters, 2009).

Omar Ali and Peynircioğlu (2006) have shown “that lyrics can influence the overall

emotional valence of music”. Negative emotions (sad and angry music) were more easily

detected by listeners with the presence of lyrics while the absence of lyrics allows music to

more easily express positive emotions (happy and calm music). Furthermore, it was shown that

melodies are more dominant than lyrics in determining the emotional valence of music.

Nevertheless, lyrics do aid listeners in discovering the mood of music.

Kanters tested in his study whether lyrics provided the necessary information to assign

main moods to music. He stated that even though it is possible for a song to contain several

emotional events, they are all linked together by one overall mood, which is usually carried by the chorus.

3. Background information for research

This section firstly provides information for a better understanding of the automatic

mood classification system by Van Zaanen and Kanters (2010). Secondly, it provides

information on the distance metrics that will be used in this study.

3.1 Moody’s mood framework

Moody is a third-party software application for iTunes that is created by the company

Crayon Room (http://www.crayonroom.com/). Moody uses a color scheme that is associated

with mood; known as Moody Colors. Users can assign a mood to a song by selecting a Moody

Color from the framework. This information on mood, known as the Moody Tag, is then stored in


the comment or composer field of ID3-tags. The Moody Tags are also stored in the iTunes

database which allows other users to make use of this information.

A study by Voong and Beale (2007) showed that users found associating mood with color a useful approach. Moody has sixteen colors to select from, and therefore sixteen different moods can be used to classify a music track. The standard settings of color codes for moods are displayed in figure 2. From left to right the colors go from sad to happy (1 to 4 on the horizontal axis), and from the bottom to the upper row they go from calm to intense music (D to A on the vertical axis). Therefore, each color is represented by a letter and a number, also known as the “Moody Tag”. The tag or coordinate D1 stands for a calm and sad mood, while A4 represents intense and happy.

Figure 2: Moody’s color coding and respective moods (horizontal axis from sad to happy, vertical axis from calm to intense). Retrieved 8 December, 2010, from http://www.moodyapp.com/help/

By using Moody to tag songs in iTunes, users can generate playlists based on their mood

preference. For instance, if you are in a good mood you might feel like listening to a collection

of happy songs. When you have confirmed your mood choice, the application will generate a

playlist that consists of the songs that are tagged as happy.


3.1.2 System’s mood classes

Kanters (2009) adopted the Thayer mood model (Thayer, 1989) for the classification

system. Moody’s framework uses the arousal and valence dimensions as well and therefore

integrates well into Thayer’s model as is shown in figure 3. The difference is that Moody uses

hue colors instead of keywords to distinguish moods. Another difference with the Thayer

Model is that instead of four quadrants, there are sixteen different colors to choose from. This

means that the data that Van Zaanen and Kanters received from Crayon Room has that range

of moods as well.

Figure 3: Moody’s framework integrated into Thayer’s mood model (Kanters, 2009)

In order to work with a fine grained set of classes, Van Zaanen and Kanters divided the

Valence-Arousal plane into sixteen parts (Van Zaanen & Kanters, 2010) by dividing each

dimension into four areas. Similar to the Moody framework, the arousal segments are named

A to D and the valence segments 1 to 4. The fine-grained division now resembles Moody’s

framework, as is shown in figure 4. However, this division is one of four class divisions that were studied. The different class divisions were:

- Fine-grained: A1 – D4 (16 classes)

- Arousal: A – D (4 classes)

- Valence: 1 – 4 (4 classes)



- Thayer: the original four quadrants in Thayer’s mood model (4 classes)

                 Valence (1-4)
                 1    2    3    4
Arousal (A-D)  A A1   A2   A3   A4
               B B1   B2   B3   B4
               C C1   C2   C3   C4
               D D1   D2   D3   D4

Figure 4: Class divisions as used by Van Zaanen and Kanters (2010)
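To make the relation between these class divisions concrete, the sketch below maps a fine-grained tag such as "B3" onto the arousal, valence, and Thayer divisions. The assumption that A/B form the high-arousal half and 3/4 the positive-valence half follows the axis descriptions above; it is an illustration, not Van Zaanen and Kanters' actual implementation.

def class_divisions(tag):
    # Map a fine-grained tag (e.g. "B3") onto the four class divisions.
    arousal, valence = tag[0], int(tag[1])
    high_arousal = arousal in ("A", "B")          # A/B: high energy, C/D: low energy
    positive_valence = valence in (3, 4)          # 1/2: negative, 3/4: positive
    quadrant = (("high" if high_arousal else "low") + " arousal, "
                + ("positive" if positive_valence else "negative") + " valence")
    return {
        "fine-grained": tag,      # one of 16 classes (A1 - D4)
        "arousal": arousal,       # one of 4 classes (A - D)
        "valence": valence,       # one of 4 classes (1 - 4)
        "thayer": quadrant,       # one of the four original quadrants
    }

print(class_divisions("B3"))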

3.2 tf*idf weighting

The tf*idf metric is a standard information retrieval metric that consists of the

components term frequency [tf] and the inverse document frequency [idf]. The tf measures

the number of occurrences of term t in document d, which is denoted as tf_t,d:

tf_i,j = n_i,j / ∑_k n_k,j

The tf-formula shows the importance of term t_i in document d_j by dividing the number of occurrences of the specific term in document d_j (n_i,j) by the total occurrences of all terms (∑_k n_k,j) in that document (Kanters, 2009).

With the term frequency, all terms are viewed as equally important. Therefore there is the inverse document frequency, which measures the importance of term t_i in a collection of documents D, denoted as:

idf_i = log( |D| / |{d_j : t_i ∈ d_j}| )


The idf-formula gives the logarithm of the total number of documents |D| divided by the number of documents in which a specific term is present (Kanters, 2009). It tells us the uniqueness of a

term in the collection of documents.

In the tf*idf metric the components are multiplied with each other. This metric is a

method for weighting the importance of each term in each document (Manning, Raghavan, &

Schütze, 2009). Van Zaanen and Kanters (2010) also used tf+tf*idf to measure the importance

of words in lyrics in each mood document.

According to Manning, Raghavan, and Schütze, the tf*idf-based metrics assign a high

weight to term t in document d when the term is frequently found in a few documents, such as

certain mood specific words like “happy” and “heartache”. The weight is lower when term t is

found fewer times in a document or found in many documents such as “not”. The weight for

term t is the lowest when the word is present in nearly all documents such as function words.
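To illustrate how these weights behave, the small sketch below computes tf, idf, and tf*idf for a toy collection of mood documents. It is a minimal example of the standard formulas from this section, with made-up lyrics, and not a reproduction of Van Zaanen and Kanters' implementation.

import math
from collections import Counter

# Toy "mood documents": all lyrics of a mood class combined into one document.
docs = {
    "happy": "happy joyful sunshine love happy dance".split(),
    "sad":   "heartache tears rain love sad lonely".split(),
    "angry": "rage fire hate scream love".split(),
}

def tf(term, doc):
    # occurrences of the term divided by the total number of terms in the document
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    containing = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("happy", docs["happy"], docs))   # high weight: occurs in only one mood document
print(tf_idf("love", docs["happy"], docs))    # 0.0: occurs in every mood document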

3.3 Automatic mood classification system using tf*idf based on lyrics

As is mentioned in the introduction of this thesis, the automatic mood classification

system (Van Zaanen & Kanters, 2010) automatically labels music with a mood based on the

lingual aspect of music tracks. The different mood classes that were used to classify the songs

are discussed in section 3.1.2. The data that Crayon Room provided consisted of the Moody Tags and the artist and song title to which each tag was assigned. This dataset was used as the gold

standard for Van Zaanen and Kanters’ system.

Furthermore, a set of information retrieval features was selected in order to describe properties of lyrics. Of these features, the word-based ones showed the best performance, specifically the tf*idf and tf+tf*idf metrics. These information retrieval metrics are generally used to measure the importance of terms in documents in a large document collection, as is explained in more detail in section 3.2. Every mood class was represented by a single document, which was a combination of all lyrics used in their study that carried the same mood tag. The combined lyrics then appear as though they were one document.

Hence, determining the mood depended on looking at lyrics in a word-based manner. With the tf*idf metric, only words that do not occur in all mood classes remain, and these were valued more because they can determine in which mood class they are most relevant. Hence, words with high tf*idf values show a high importance of those words to a mood.

In their study the retrieval metrics were used to show how relevant words in lyrics are

with regard to the mood classes. Their results have shown that the lingual part of music does

provide information on the overall mood of a song (Van Zaanen & Kanters, 2010).

The evaluation part of the automatic mood classification system uses a binary distance metric to determine the accuracy of the system, resulting in values of 0 and 1. The value 0 is assigned whenever the system’s predicted mood does not correspond to the gold standard, and the value 1 represents a perfect match between the prediction and the actual mood class that a song was supposed to be in. The accuracy is then calculated by dividing the sum of all values by the total number of elements in the dataset.
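Written out as a small sketch (with made-up example tags), this binary evaluation amounts to the following:

def binary_accuracy(predicted, gold):
    # 1 for a perfect match with the gold-standard Moody Tag, 0 otherwise
    scores = [1 if p == g else 0 for p, g in zip(predicted, gold)]
    return sum(scores) / len(scores)

# Hypothetical example: only exact matches count, so A2 versus gold A1 scores 0.
print(binary_accuracy(["A1", "A2", "D4"], ["A1", "A1", "D4"]))   # 0.666...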

3.4 Confusion Matrix

To get an overview of how many mood classes were tagged correctly, how many were

mistaken, and in which classes it went wrong, a confusion matrix (Kohavi & Provost, 1998) can

provide the necessary information. It is a tool that is used to share information about actual

and predicted classifications done by a classification system such as in machine learning.

Table 1 shows an example of a confusion matrix. It shows that of the nine that should

actually be classed “happy”, eight predictions were correct, and one was classed “angry” and

therefore mistaken. Another way to view this example is that of the predictions that were

classed as “angry”, only one of the predictions was wrong. In addition, we can also conclude

that “angry” contains the most mistakes and “sad” the least. The accuracy of the classifier can

therefore also be determined.

Table 1: Example of a confusion matrix

                        Gold standard
                    Happy   Angry   Sad   Total
Predicted   Happy       8       1     0       9
            Angry       1       6     0       7
            Sad         0       2     9      11
Total                   9       9     9      27
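A confusion matrix such as Table 1 also gives the overall accuracy directly: the correct classifications lie on the diagonal. The sketch below encodes the counts of Table 1 and derives the accuracy; it is an illustration of the idea, not code from the evaluated system.

# Counts from Table 1 (rows: predicted class, columns: gold standard)
matrix = {
    "happy": {"happy": 8, "angry": 1, "sad": 0},
    "angry": {"happy": 1, "angry": 6, "sad": 0},
    "sad":   {"happy": 0, "angry": 2, "sad": 9},
}

correct = sum(matrix[c][c] for c in matrix)                  # the diagonal: 8 + 6 + 9
total = sum(sum(row.values()) for row in matrix.values())    # all 27 classifications
print(correct / total)                                       # accuracy of about 0.85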


3.5 Distance Metrics

Moods are not necessarily complete opposites of each other because there are levels of intensity to consider; compare excitement with being pleased, for instance. If the system labels a

song with a mood that is close to the actual mood, it is not as wrong as if it were classified with

a completely opposite mood. The mood classification system uses two dimensions to classify

mood, namely arousal and valence (Figure 4). With the help of a distance metric, it is possible

to measure how far off the mark a class is from the actual mood class or other classes.

The form of a simple distance metric for the distance between point A and point B is d(A, B) (Kochanski, 2009). However, this metric has to comply with the following conditions:

- d(A, B) ≥ 0 (non-negativity: the distance from one mood to another cannot be negative)
- d(A, B) = 0 if and only if A = B (identity of indiscernibles: the distance is zero only if mood A is equal to mood B)
- d(A, B) = d(B, A) (symmetry: the distance from mood A to B is the same as from mood B to A)
- d(A, C) ≤ d(A, B) + d(B, C) (triangle inequality: the distance from A to C is always smaller than or equal to the distance from A to B and B to C combined)

In this section we discuss the two most common distance metrics: the Taxicab and the Euclidean metric.

The Taxicab metric is known under variations of names such as rectilinear distance, city

block distance, and Manhattan distance. As the various names might suggest and as it is shown

in figure 5, it measures the distance between two points by counting the steps on a city road

grid (Krause, 1986).



Figure 5: Different ways from point A to B using the Taxicab metric. Each way has a distance of 6 steps.

The number of steps needed to go from coordinate A(0,0) to coordinate B(3,3) is the same for figures 5a, 5b, and 5c; in this case it is 6 steps. By using a function, the process of counting the steps can be automated. This distance metric says that the distance between A and B is the sum of the absolute differences between the coordinates of A and B (Krause, 1986). Hence, |0 - 3| + |0 - 3| = 3 + 3 = 6.

The Euclidean metric is represented in figure 6. It describes the shortest route available

by drawing a straight line from point A to point B.


Figure 6: Shortest distance from point A to B

The distance can be calculated with the Pythagorean Theorem. This theorem can be explained with the help of figure 7. It states that in any right-angled triangle, the surface of area C is equal to the sum of area A and area B. The areas are squares, which means that the length of each side of a square is equal; therefore surface B is equal to b × b, or b². According to the Pythagorean Theorem, c can be calculated as follows:


C = a² + b²
c = √C
c = √(a² + b²)

The Euclidean metric can be denoted as (Krause, 1986): d(A, B) = √((A_x - B_x)² + (A_y - B_y)²)

Figure 7: Pythagoras Theorem

By applying this method to figure 6, we can calculate the distance from A to B by

looking at their coordinates; A(0,0) and B(3,3). We can create an imaginary right-angled

triangle (such as in figure 6) with line AB as one of the sides (corresponding to c in figure 7). The lengths of the other two sides follow from the coordinate differences between A(0,0) and B(3,3). Side a would be the difference from A to B on the horizontal axis, therefore 3. Side b would also be 3, because on the vertical axis the difference is |0-3|. Now we calculate c by filling in the formula:

c = √(a² + b²)
AB = √(3² + 3²)
AB = √(9 + 9)
AB = √18 ≈ 4.24

The difference between the Taxicab and the Euclidean metric is that the Taxicab metric

follows the grid horizontally and vertically to go from one point to the other. Furthermore, with the Taxicab metric it is possible to have various ways to reach a destination, as is shown in figure 5. The Euclidean distance gives the shortest route by drawing a straight line from point A to point B, and thus offers no alternative ways to reach point B. However, both metrics give



a clear indication of distance by using a simple method that can be applied on a (multi-

dimensional) grid.
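As a concrete sketch of the two metrics, the snippet below computes both distances between the points A(0,0) and B(3,3) used in the figures; the same functions can be applied to coordinates of mood classes on the Valence-Arousal grid.

import math

def taxicab(a, b):
    # sum of the absolute coordinate differences (city-block distance)
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # straight-line distance via the Pythagorean Theorem
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A, B = (0, 0), (3, 3)
print(taxicab(A, B))     # 6
print(euclidean(A, B))   # 4.242640...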

4. Research questions

The first question that we ask ourselves is on the precision of the automatic mood

classification system. Currently the system evaluates a proposed tag as correct when it

perfectly matches the Moody Tag that was assigned to a song. Their results (Van Zaanen &

Kanters, 2010) have shown that the fine-grained division does not show a difference in

accuracy for the tf+tf*idf and tf*idf metric. Therefore, it is not known which metric is more

useful. We need to get an overview of which mood classes go wrong and, based on that, find a way to indicate how far off the wrong classifications are.

RQ1: How does incorporating the distance between mood classes affect the evaluation

of the automatic classifier?

Secondly, the Moody Tags are the result of a social tagging process, which is recorded in a dataset that we refer to as the Moody Counts. It was originally not exactly known by Van Zaanen and Kanters (2010) how the songs received a Moody Tag. Therefore, it is possible that the mood range of a music track is broader than just one mood class. For instance, the Moody Tag A4 may have received the arousal A as a result of 20 votes for arousal A and 19 votes for B. In that case, the difference is too small to say with confidence that the song elicits only arousal A. Therefore, we ask whether the proposed tags fall within the range of classes that the users generally found acceptable.

RQ2: How does replacing the Moody Tags with the Moody Counts affect the evaluation

of the automatic classifier?

4.1 Research purpose

The current system is too strict in its evaluation against Moody’s data because a perfect

match has to occur in order to be counted as a success; it does not incorporate nuances


between mood classes. The main mood of a song does not necessarily have to be described with one exact class, because listeners can have different perceptions or standards. There is also the issue of judging the level of intensity of an experienced mood. For instance, if the system detects the mood A1 for a song but the gold standard shows the class A2, this should be assessed as better than if the outcome D4 had been proposed. A more fine-grained approach is necessary to give a more realistic outcome by counting such a near miss as partially correct.

Furthermore, the automatic mood classification system currently evaluates its output against Crayon Room’s Moody Tags. The tag of each song in Moody’s dataset is the result of the highest frequency counts in arousal and valence, the Moody Counts. The counts are

the result of social tagging and therefore a spread of counts over the arousal and valence

values is very likely. It is therefore unknown how well the remaining arousal and valence values

scored with respect to the Moody Tag. Not every listener of a song might assign the exact

same mood tag to the song as another listener. One person might experience songs differently

or perhaps the mood they had influenced their judgment. Furthermore, as is discussed in

section 2.3, music consists of a main mood with emotional outbursts. These outbursts can

influence the judging of a song’s mood as well. The system does not consider the possibility

that there is a range of moods into which the users feel the music can be categorized.

With Van Zaanen and Kanters’ system we can classify music tracks based on mood and

therefore have the possibility to create mood based playlists. However, a more sensitive

approach is preferred, one that considers nuances in mood classes that should not be regarded as being completely off target. Since Crayon Room has agreed to provide the raw

data (Moody Counts), we can explore how the system fares when it is evaluated against this

raw data. By taking into account that some mood classes are more closely related than others,

we can partially include in the results those songs for which the system’s prediction is closely related to the song’s specified mood class. This could prove that the system is not necessarily wrong in certain cases, but that perhaps the processed data should not be seen as

the standard against which the system should be evaluated.

This study may provide tools to give the automatic mood classifier a more fine-grained

evaluation approach. This allows music listeners to effectively and efficiently create mood

based playlists and it gives them the ability to reach throughout their music collection;

providing them with a new way to interact with music.


5. Methodology

5.1 Data

As this is a follow-up study on Van Zaanen and Kanters' (2010) classification system, the data for this thesis are the results from their system and the Moody Tags. We will be focusing on the tf+tf*idf and tf*idf results of the fine-grained division. This division gives the most precise results because it concerns all sixteen mood classes, whereas the other class divisions leave out part of the information. For instance, the arousal division only distinguishes the arousal values and does not incorporate the valence dimension.

In addition, we asked Crayon Room for their unprocessed data to see how users actually tag. The reason is that the first dataset received from Crayon Room (Moody Data A) is based on the second, unprocessed set (Moody Counts).

Moody Data A consists of the Moody Tags of songs. The tags were integrated as the gold standard in the system's dataset to evaluate the tags that were given by the system. Both Moody Tags and system tags are needed to create a confusion matrix, on which we base our distance calculations from system tag to Moody Tag.

The second part of this study is to look at the distribution of the Moody Counts. Crayon Room acquires its data through social tagging, which is a typical Web 2.0 characteristic. Their system keeps track of how a song is rated by each user. However, it does not register a selected Moody Tag as is (e.g. D4, C1, A3), but separates the value into two dimensions: arousal and valence. Therefore, the data consist of ten fields. After artist and title, there are eight fields: four for arousal (A, B, C, D) and four for valence (1, 2, 3, 4). When, for instance, a song is rated C2, the corresponding item receives one additional count for C and one additional count for 2. Based on this data, the Moody Tag for each song is determined by combining the highest rated arousal value with the highest rated valence value. To illustrate the latter, if the highest arousal value is B and the highest valence is 1, the song will be tagged


as B1 which then results in Moody Data A. Our request to Crayon Room for the Moody Counts

resulted in a dataset with data on 4,503 out of the requested 5,631 songs.
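To make this derivation concrete, the following minimal sketch (in Python; the dictionaries and the function name are illustrative and not Crayon Room's own code) combines the highest-scoring arousal value with the highest-scoring valence value into a Moody Tag:

    # Illustrative sketch, not Crayon Room's actual implementation: derive a
    # Moody Tag by combining the highest-scoring arousal value with the
    # highest-scoring valence value in a song's Moody Counts.
    AROUSAL = ["A", "B", "C", "D"]
    VALENCE = ["1", "2", "3", "4"]

    def moody_tag(arousal_counts, valence_counts):
        # The counts are dicts, e.g. {"A": 11, "B": 20, "C": 4, "D": 0}.
        # max() returns the first maximal value, which matches the tie
        # behaviour suggested in section 7.2 (the first highest value wins).
        top_arousal = max(AROUSAL, key=lambda a: arousal_counts[a])
        top_valence = max(VALENCE, key=lambda v: valence_counts[v])
        return top_arousal + top_valence

    # Example with the counts that also appear in table 3: the result is "B2".
    print(moody_tag({"A": 11, "B": 20, "C": 4, "D": 0},
                    {"1": 10, "2": 17, "3": 8, "4": 0}))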

With the Moody Counts, we attempt to see how broad the mood range for each song is

according to the users. Hence, we can compare the system’s results with this data and

conclude whether the system was too strict. In other words, did the system give a tag that was

half right or even accepted by music listeners as an alternative?

5.2 Method

An exploratory study will be carried out with the automatic classification system by Van Zaanen and Kanters (2010) as the starting point. The Moody dataset consists of 4,503 entries. During the processing of the results, it turned out that some of the Moody Tags previously used in Kanters' study had changed over time; specific songs had been given a new Moody Tag based on new Moody Counts. This meant that the Moody Tags in the system's dataset had to be updated to correspond with the latest Moody dataset, so that both sets contained the same entries together with the data belonging to each entry.

Furthermore, the Moody data contained entries whose Moody Tag was based on null values, which was unexpected because these songs apparently did not receive any tags and therefore should not have been registered in the Moody Counts. In total, this concerned 38 entries, which were automatically classed as A1 and have been removed from the dataset. The final dataset consists of 4,465 entries with the fields: Moody Tag, system tag, and Moody Counts for arousal A to D and valence 1 to 4. The distribution is shown in table 2.

Table 2: Distribution of 4,465 entries

                      Arousal
                  A      B      C      D    Total
  Valence  1    104    199    196    152      719
           2    316    474    474    208     1456
           3    303    530    405    167     1410
           4    198    326    269    144      880
  Total         921   1529   1344    671     4465


Then the accuracy of the current system needs to be calculated on the newly acquired dataset (4,465 entries). The new accuracy of the tf+tf*idf and tf*idf approaches remains equal for both: 70.97%. The original study showed an accuracy of 70.89% for both approaches in the fine-grained division (Van Zaanen and Kanters, 2010). This accuracy is needed for comparison when we take the distance metrics into consideration for RQ1 and when we substitute the gold standard in RQ2.
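For reference, the binary accuracy on the updated dataset can be computed as in the following minimal sketch (Python; the list of tag pairs is an assumed input format, not the system's own):

    def binary_accuracy(pairs):
        # pairs: a list of (system tag, Moody Tag) tuples, one per entry.
        # A prediction only counts as correct when it matches the Moody Tag exactly.
        correct = sum(1 for system_tag, moody_tag in pairs if system_tag == moody_tag)
        return correct / len(pairs)

    # Hypothetical example: 2 of the 3 entries match exactly, so the accuracy is about 0.67.
    print(binary_accuracy([("B2", "B2"), ("A1", "A1"), ("C3", "D4")]))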

5.2.1 Method for RQ1

The first research question requires the system tags and Moody Tags, over which the distance metrics will be applied based on a confusion matrix. In the confusion matrix, the classes given by the system are the predicted values, which are set against the gold standard. This tells us how frequently a correct tag was given and, for the errors, which wrong tag was given and how often.

Then we use the Euclidean and Taxicab metrics to see how far off each given system tag was from the actual class. The mood classes first need to be converted to coordinates, after which we apply the distance metrics to them. To avoid confusion we set A1, the top left corner, equal to coordinate (1,1) and D4 equal to (4,4), the bottom right corner. Both Moody Tags and system tags need to be separated into the two dimensions before we translate them to coordinates. The resulting grid is shown in figure 8.

                    Valence
                 1     2     3     4
  Arousal   1   A1    A2    A3    A4
            2   B1    B2    B3    B4
            3   C1    C2    C3    C4
            4   D1    D2    D3    D4

Figure 8: Mood Tags translated to coordinates
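The conversion and the two distance metrics can be sketched as follows (a minimal Python sketch under the coordinate scheme of figure 8; the function and variable names are illustrative and not taken from the original system):

    import math

    # Sketch of the coordinate conversion from figure 8 and of the two
    # distance metrics; names are illustrative.
    AROUSAL_COORD = {"A": 1, "B": 2, "C": 3, "D": 4}

    def to_coordinate(tag):
        # Convert a mood tag such as "B3" into a coordinate, here (2, 3).
        return (AROUSAL_COORD[tag[0]], int(tag[1]))

    def euclidean(tag1, tag2):
        (a1, v1), (a2, v2) = to_coordinate(tag1), to_coordinate(tag2)
        return math.sqrt((a1 - a2) ** 2 + (v1 - v2) ** 2)

    def taxicab(tag1, tag2):
        (a1, v1), (a2, v2) = to_coordinate(tag1), to_coordinate(tag2)
        return abs(a1 - a2) + abs(v1 - v2)

    print(euclidean("A1", "D4"))  # 4.24..., the largest possible Euclidean distance
    print(taxicab("A1", "D4"))    # 6, the largest possible Taxicab distance
    print(euclidean("B2", "D4"))  # 2.83..., as used in the worked example below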


Afterwards, the outcomes are normalized against the maximum distance a class can have. For example, the maximum distance for A1 is to D4, and for D2 it is to A4. The normalized results range from zero to one and indicate the degree of error. The outcome zero indicates that the system provided the same tag as Moody; when the outcome is one, the system was completely wrong (maximum distance between classes). The normalization is done by dividing the distance from system tag to Moody Tag by the maximum distance that the Moody Tag can have:

    normalized distance = d(system tag, Moody Tag) / d(Moody Tag, furthest mood class)

To calculate a weighted accuracy that takes the distance into consideration, we need a weighted distance score, which is computed as 1 - normalized distance. By doing this, we allow wrong instances to be partially correct; perfect matches get the score 1. For instance, when the actual mood class is B3 and the system classified the song as B1, the prediction is still partially right because the system at least predicted the arousal correctly. This score is then multiplied by the number of identical instances in the confusion matrix. Lastly, the product is divided by the total number of elements in the confusion matrix (4,465 entries).

In other words, the weighted accuracy is the sum of:

    (elements in class) * (1 - normalized distance) / (total number of elements)

For instance, if the system’s prediction of a music track is B3 but the gold standard

shows B2, the distance between these two mood classes will be 1 with the Taxicab metric or 1

when we apply the Euclidean metric. The furthest Moody Tag that is possible for B2 is D4

which gives a maximum distance of 4 (Taxicab) or 2.83 (Euclidean). When we normalize the

distances the outcome will be 0.25 for the Taxicab method or 0.35 for the Euclidean method.


With this information we can calculate the weighted accuracy contribution for this case. Assume that this particular error (system tag B3 against Moody Tag B2) occurred 60 times according to the confusion matrix; its weighted distance score (1 - normalized distance) is then multiplied by this frequency. Lastly, the outcome of the multiplication is divided by the total number of entries, 4,465. Hence, with the distance acquired from the Taxicab metric, the contribution to the weighted accuracy in this case is:

    60 * (1 - 0.25) / 4465 ≈ 0.010

With the Euclidean distance, the contribution is:

    60 * (1 - 0.35) / 4465 ≈ 0.009

Finally, when these weighted contributions are known for all cells of the confusion matrix, the total weighted accuracy is calculated as their sum.
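Putting these steps together, a minimal sketch of the weighted accuracy calculation could look as follows (Python; the confusion matrix format, a dictionary from tag pairs to frequencies, is an assumption rather than the original system's data structure):

    # Minimal sketch of the weighted accuracy; the confusion matrix format
    # (a dict mapping (system tag, Moody Tag) pairs to frequencies) is an
    # assumption, not the original system's data structure.
    AROUSAL_COORD = {"A": 1, "B": 2, "C": 3, "D": 4}
    TAGS = [a + v for a in "ABCD" for v in "1234"]

    def taxicab(tag1, tag2):
        # Taxicab distance on the grid of figure 8; the Euclidean variant is analogous.
        a1, v1 = AROUSAL_COORD[tag1[0]], int(tag1[1])
        a2, v2 = AROUSAL_COORD[tag2[0]], int(tag2[1])
        return abs(a1 - a2) + abs(v1 - v2)

    def max_distance(moody_tag, distance):
        # Distance from the Moody Tag to the furthest possible mood class.
        return max(distance(moody_tag, other) for other in TAGS)

    def weighted_accuracy(confusion, distance, total):
        score = 0.0
        for (system_tag, moody_tag), count in confusion.items():
            normalized = distance(system_tag, moody_tag) / max_distance(moody_tag, distance)
            score += count * (1 - normalized)
        return score / total

    # The worked example: 60 occurrences of system tag B3 against Moody Tag B2
    # contribute 60 * (1 - 0.25) / 4465 under the Taxicab metric.
    print(weighted_accuracy({("B3", "B2"): 60}, taxicab, 4465))  # about 0.01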

5.2.2 Method for RQ2

For the second research question the system tags, Moody Tags, and Moody Counts are

required. The Moody Tags are based on the Moody Counts which were acquired through social

tagging. The users categorize a song by selecting a Moody Color which is linked to a Moody Tag.

This tag is registered in the Moody Counts as two separate values: arousal and valence. We

want to see whether the system tag lies within the range of arousal and valence values that were entered by the users.

As mentioned in section 5.1, the Moody Counts consist of counts for each separate dimension and its values. Due to the nature of this data, the dimensions of the system tags and Moody Tags need to be separated as well. This is required in order to compare the dimensions of the system tag (predicted tag) with the counts for the dimensions of the corresponding Moody Tag (actual tag) in the Moody Counts. To indicate the degree of relevance, the count of the arousal value that matches the system tag's arousal is divided by the count of the arousal value that matches the Moody Tag's arousal. The same is done for the valence dimension, for each of the 4,465 items. Afterwards, the results of both


dimensions are added together and divided by two. In other words, the mean relevance is:

    mean relevance = ( (counts system arousal / counts Moody arousal) + (counts system valence / counts Moody valence) ) / 2

This weighs the relevance of the system tag's dimensions against the dimensions of the Moody Tag. The weighted accuracy is then calculated by dividing the sum of all mean relevance values by the total number of elements. For instance, a song with the Moody Tag B2 could be the result of the distribution of Moody Counts shown in table 3. However, based on its analysis, the system classified the song as B1.

Table 3: Example of Moody Counts’ distribution for Moody Tag B2

                             Moody Counts
                             Arousal                Valence
  System tag   Moody Tag     A    B    C    D       1    2    3    4
  B1           B2           11   20    4    0      10   17    8    0

To see how the proposed tag compares to the Moody Counts with regard to the Moody

Tag, we calculate the mean relevance. As the system tag’s arousal is B and the Moody Tag’s

arousal is B, we divide the respective counts by each other; 20/20. The valences differ in this

case; the system gives a valence of 1 while Moody gives valence 2 which gives us the division of

10/17. Therefore, when we fill in the entire formula, it gives us the mean relevance for this

example which is:

    mean relevance = (20/20 + 10/17) / 2 ≈ 0.79
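A minimal sketch of this calculation (Python; the dictionaries mirror the counts in table 3 and the function name is illustrative):

    def mean_relevance(system_tag, moody_tag, arousal_counts, valence_counts):
        # Ratio of the count for the system tag's value to the count for the
        # Moody Tag's value, per dimension, averaged over the two dimensions.
        arousal_ratio = arousal_counts[system_tag[0]] / arousal_counts[moody_tag[0]]
        valence_ratio = valence_counts[system_tag[1]] / valence_counts[moody_tag[1]]
        return (arousal_ratio + valence_ratio) / 2

    # The example from table 3: system tag B1 against Moody Tag B2.
    print(mean_relevance("B1", "B2",
                         {"A": 11, "B": 20, "C": 4, "D": 0},
                         {"1": 10, "2": 17, "3": 8, "4": 0}))  # (20/20 + 10/17) / 2, about 0.79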


6. Results

The accuracy of the current system was measured with a binary approach, which means that results were either exactly right or wrong. When the system tag was exactly right, it received the value one; otherwise it received zero. According to this binary metric, the tf+tf*idf approach resulted in an accuracy of 71.0%, and for the tf*idf approach the accuracy was also 71.0%. As there is no noticeable difference, both approaches can be seen as equal in terms of performance.

6.1 Results RQ1: Distance metrics

Table 4 shows the accuracies that were achieved for the tf+tf*idf and the tf*idf approach when the distance from the proposed class to the actual class was taken into consideration. When the Euclidean metric was applied to tf+tf*idf it showed a weighted accuracy of 91.9% (n=4,465), and for tf*idf this was 92.1%. The Taxicab metric resulted in an accuracy of 92.3% for tf+tf*idf and 92.4% for tf*idf.

The highest percentage was found when the Taxicab metric was applied to the results of tf*idf, and the lowest when the Euclidean metric was applied to the results of the tf+tf*idf approach.

When we look at the columns of table 4, we see that tf+tf*idf has percentages of 91.9% and 92.3% with the Euclidean and Taxicab metric respectively. The tf*idf column scored slightly higher with 92.1% and 92.4%.

Table 4: Accuracy for tf+tf*idf and tf*idf results after incorporating distance (n=4,465)

                     tf+tf*idf   tf*idf
  Euclidean metric     91.9%      92.1%
  Taxicab metric       92.3%      92.4%


6.2 Results RQ2: Moody Counts

When we sum the Moody Counts for each class, as is done in table 5, we can see that for each Moody Tag the highest counts are found in the corresponding arousal and valence columns; in general the Moody Tags are chosen correctly. We also see that there is a distribution of votes across the remaining arousal and valence values.

Table 5: Sum of Moody Counts against Moody Tags

                        Arousal                       Valence
  Moody Tag      A      B      C      D        1      2      3      4
  A1          2737    103     53     57     2729     84     62     75
  A2          5960    339    127     90      168   5963    284    101
  A3          6531    317    153     57      133    176   6500    247
  A4          3399    163     58     30       29     41    142   3438
  B1            94   3579    127     75     3510    170    127     68
  B2           249   8716    249    150      154   8715    317    178
  B3           219   9231    235     90      153    198   9177    247
  B4           159   5454    152     70       65    132    200   5438
  C1            58     94   3635    185     3642    169    122     39
  C2           107    258   8637    195      220   8578    265    134
  C3            84    244   6253    135      118    163   6218    217
  C4            72    131   4332     96       73     56    207   4294
  D1            19     79    183   2870     2809    231     64     47
  D2            35     41    182   3147      115   3178     85     27
  D3            17     54    135   3170       60     88   3163     66
  D4            12     23     64   1683       28     39     52   1664

Table 6 shows the score and accuracy that result from evaluating the system tags against the Moody Counts. The score is the sum of all mean relevance values. It shows that tf+tf*idf had a score of 3276.96 and tf*idf a score of 3278.38. This corresponds to an accuracy of 73.4% for tf+tf*idf and 73.4% for tf*idf. As can be seen in the scores, there is a very small difference in favor of tf*idf, but it goes unnoticed due to the rounding of the accuracy figures.


Table 6: Results based on examination of system tags in comparison to Moody Counts (n=4,465)

              tf+tf*idf    tf*idf
  Score        3276.96    3278.38
  Accuracy       73.4%      73.4%

7. Conclusions and discussion

7.1 Answer to RQ1

The original results showed an accuracy of 71.0% for both the tf+tf*idf and the tf*idf approach. By taking the weighted distance into account, classes close to the actual mood class were considered partially correct in the evaluation process. Both distance metrics gave a fine-grained evaluation of the results, which revealed a noticeable difference between tf+tf*idf and tf*idf. It can be concluded that the tf*idf feature performed better than tf+tf*idf: tf*idf had more predictions that were close to the gold standard in terms of distance. This is in contrast to the original results, where both features showed no difference in accuracy. The distance tools provided the necessary information on the relation between the system tag and the gold standard. However, it is not yet known which of these evaluation metrics is more suitable or accurate. Furthermore, we did not consider the strength of the relations between moods. Should the relation between moods A2 and A3 be considered as strong as the relation between A2 and B2?

7.2 Answer to RQ2

Upon examination of the Moody Counts, the results have shown that the given Moody Tags were not always accurate in determining the mood of a song. This is due to how the Moody Tags are generated. The counts show that people do not always experience the exact same mood; there was a noticeable spread of counts across the dimensions, as can be seen in table 5. There are (small) variations in which mood class gets chosen for each music track. By considering this range of mood classes that each song acquires from the users, we can


conclude that Van Zaanen and Kanters' (2010) system actually performed better than was presented.

The current gold standard consists of the Moody Tags. Our results have shown that the Moody Tags are not perfectly reliable, because those tags rely on the highest arousal and valence values in the Moody Counts. When a song receives 23 votes for arousal A and 23 votes for arousal B, Moody appears to select the first highest value that occurs in that dimension; in this case arousal A will be selected. Consequently, when a song's counts contain only null values, it is automatically assigned A1. Therefore, this study shows that the unprocessed data is more reliable as a standard.

Even though the new standard allows us to evaluate in a more fine-grained manner, it did not show a clear difference in accuracy between the tf+tf*idf and the tf*idf approach. The scores did, however, show a very slight difference in favor of tf*idf. The usefulness of the new standard can therefore be questioned, but given the mentioned issues with Moody's algorithm for deciding Moody Tags, the Moody Counts remain the more reliable source.

7.3 Sensitivity and new standard

The weighted distance has proven to give the system a more fine-grained evaluation, as it takes more information into account during the process. The distance tools have given a better view of the difference in performance between the information retrieval features. In contrast, the study on the Moody Counts did not clearly show this difference between tf+tf*idf and tf*idf. However, the counts did provide the range of moods that the users considered acceptable for each song. If we implement a distance tool and substitute the gold standard in the current system, we will most likely improve the evaluation of the system as desired. However, future research is required to determine the weights of the distance metrics and which metric is more suitable.

Even though this study was carried out with the mood classifier as our starting point, the distance tools can also be applied in other classification fields with multiple dimensions where the boundaries are not clear-cut. In this thesis we looked at mood classes, but this could just as well have been about, for example, classifying personalities or other multidimensional classes.


8. Future research

In this study, the two most common distance metrics were used to see the effect of incorporating distance in the evaluation of the automatic classification system by Van Zaanen and Kanters (2010). However, other distance metrics could have been used as well, and it is not clear which metric is more suitable or gives better outcomes. In future research, a user evaluation could be performed to compare the two distance metrics.

Furthermore, this study does not take into account that some adjacent mood classes may weigh more than others in relation to the actual mood. For instance, is the relation between moods A2 and A3 stronger than that between A3 and B3? We would need to study the relations between mood classes and adjust the weighting of the distance metrics, or possibly introduce a new metric.


9. References

Anderson, C. (2004). The Long Tail. Wired, 12(10), 1-30.

Andric, A., & Haus, G. (2006). Automatic playlist generation based on tracking user’s listening habits. Multimedia

Tools and Applications, 29(2), 127-151.

Beukeboom, C. J., & Semin, G. R. (2006). How mood turns on language. Journal of Experimental Social Psychology,

42(5), 553-566.

Bower, G. H. (1981). Mood and memory. American Psychologist, 36(2), 129-148.

Confusion matrix. (n.d.). Retrieved 8 November, 2010, from

http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html/

Daelemans, W., Zavrel, J., Van der Sloot, K., & Van den Bosch, A. (2010). TiMBL: Tilburg Memory-Based Learner,

version 6.3, Reference Guide. ILK Technical Report 10-01, available from

http://ilk.uvt.nl/downloads/pub/papers/ilk.1001.pdf/

Eck, D., Lamere, P., Bertin-Mahieux, T., & Green, S. (2007). Automatic Generation of Social Tags for Music

Recommendation. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis, (Eds.), Advances in Neural Information

Processing Systems 20, 20, 1-8. MIT Press.

Ekman, P. (1971). Universals and Cultural Differences in Facial Expressions of Emotion. In J. K. Cole (Ed.), Nebraska

Symposium On Motivation: Vol. 19. (pp. 207-283). Lincoln: University of Nebraska Press.

Izard, C. E. (1977). Human emotions. New York: Plenum Press.

Juslin, P. N., & Sloboda, J. A. (2001). Music and emotion: Theory and research. Oxford: Oxford University Press.

Kanters, P.W.M. (2009). Automated Mood Classification for Music. (Master Thesis, Tilburg University, 2009).

Retrieved from http://arno.uvt.nl/show.cgi?fid=95615/

Kochanski, G. (2009). Distance Metrics. Retrieved 8 November, 2010, from

http://kochanski.org/gpk/research/misc/2004/distance-metric/dist.pdf/

Kohavi, R., & Provost, F. (1998). Glossary of terms. Editorial for the Special Issue on Applications of Machine

Learning and the Knowledge Discovery Process, 30(2-3).

Koopman, C., & Davies, S. (2001). Musical Meaning in a Broader Perspective. The Journal of Aesthetics and Art

Criticism, 59(3), 261-273.

Krause, E. F. (1986). Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Mineola: Dover

Publications, Inc.

Liebowitz, S. (2004). Will MP3 downloads annihilate the record industry? The evidence so far. Advances in the

Study of Entrepreneurship Innovation and Economic Growth, 15, 229-260.


Liu, D., Lu, L., & Zhang, H. (2003). Automatic Mood Detection from Acoustic Music Data. Proceedings of 4th

International Symposium on Music Information Retrieval, 4, 81-87.

Loomis, E. S. (1968). The Pythagorean Proposition: Its Demonstration Analyzed and Classified and Bibliography of

Sources for Data of the Four Kinds of “Proofs”. Washington D.C.: National Council of Teachers of

Mathematics.

Manning, C. D., Raghavan, P., & Schütze, H. (2009). An introduction to information retrieval. Cambridge University

Press.

Meyer, L. B. (1956). Emotion and Meaning in Music. Chicago: University of Chicago Press.

Meyers, O. C. (2007). A mood-based music classification and exploration system. Retrieved 20 November, 2010, from

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.3440&rep=rep1&type=pdf/

Milliman, R. E. (1982). Using Background Music to Affect the Behavior of Supermarket Shoppers. The Journal of

Marketing, 46(3), 86–91.

O’Reilly, T. (2007). What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software.

Communications & Strategies, 65(1), 17-37. Retrieved from http://ssrn.com/abstract=1008839/

Oberholzer-Gee, F., & Strumpf, K. (2007). The Effect of File Sharing on Record Sales: An Empirical Analysis. Journal

of Political Economy, 115(1), 1-42.

Omar Ali, S., & Peynircioğlu, Z. F. (2006). Songs and emotions: are lyrics and melodies equal partners? Psychology

of Music, 34(4), 511-534.

Pauws, S., & Eggen, B. (2002). PATS: Realization and user evaluation of an automatic playlist generator.

Proceeding of the 3rd International Conference on Music Information Retrieval, 222-230.

Shields, S. A. (1984). Reports of bodily change in anxiety, sadness, and anger. Motivation and Emotion, 8, 1-21.

Stenius, E. (1967). Mood and Language-Game. Synthese, 17, 254-274.

Teasdale, J. D., & Russell, M. L. (1983). Differential effects of induced mood on the recall of positive, negative and

neutral words. The British journal of clinical psychology the British Psychological Society, 22(3), 163-171.

Thayer, R.E. (1989). The biopsychology of mood and arousal. New York: Oxford University Press.

The Story of MP3. (n.d.). Retrieved 20 November, 2010, from

http://www.iis.fraunhofer.de/en/bf/amm/mp3geschichte/mp3blicklabor/

Thompson, W. F., Schellenberg, E. G., & Husain, G. (2001). Arousal, mood, and the Mozart effect. Psychological

Science, 12(3), 248-251.


Tonkin, E., Corrado, E. M., Moulaison, H. L., Kipp, M. E., Resmini, A., Pfeiffer, H., & Zhang, Q. (2008). Collaborative

and Social Tagging Networks. Ariadne, 54(54), 1-20.

Van Zaanen, M., & Kanters, P.W.M. (2010). Automatic mood classification using tf*idf based on lyrics. 11th

International Society for Music Information Retrieval Conference, 11, 75-80.

Vander Wal, T. (2005). Explaining and Showing Broad and Narrow Folksonomies. Personal InfoCloud. Retrieved

20 December, 2010, from http://www.personalinfocloud.com/2005/02/explaining_and_.html/

Vignoli, F. (2004). Digital Music Interaction concepts: a user study. Proceedings of International Conference on

Music Information Retrieval, 415-420.

Vignoli, F., & Pauws, S. (2005). A music retrieval system based on user-driven similarity and its evaluation.

Proceedings of International Conference on Music Information Retrieval, 272-279.

Voong, M., & Beale, R. (2007). Music organisation using colour synaesthesia. CHI ‘07 extended abstracts on

Human factors in computing systems, 1869-1874.

Voss, J. (2007). Tagging, Folksonomy & Co - Renaissance of Manual Indexing? International Symposium for

Information Science, 10, 1-12.

Watson, D. (2000). Mood and Temperament. New York: The Guilford Press.