
Affective Image Retrieval via Multi-Graph Learning

Sicheng Zhao†, Hongxun Yao†, You Yang‡, Yanhao Zhang†

†School of Computer Science and Technology, Harbin Institute of Technology, China. ‡Department of Electronics and Information Engineering, Huazhong University of Science and Technology, China.

{zsc, h.yao, yhzhang}@hit.edu.cn; [email protected]

ABSTRACT
Images can convey rich emotions to viewers. Recent research on image emotion analysis has mainly focused on affective image classification, trying to find features that classify emotions better. We concentrate on affective image retrieval and investigate the performance of different features on different kinds of images in a multi-graph learning framework. Firstly, we extract commonly used features of different levels for each image. Generic features and features derived from elements-of-art are extracted as low-level features. Attributes and interpretable principles-of-art based features are viewed as mid-level features, while semantic concepts described by adjective noun pairs and facial expressions are extracted as high-level features. Secondly, we construct a single graph for each kind of feature to test the retrieval performance. Finally, we combine the multiple graphs in a regularization framework to learn the optimal weight of each graph and efficiently exploit the complementarity of different features. Extensive experiments are conducted on five datasets and the results demonstrate the effectiveness of the proposed method.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.4.7 [Image Processing and Computer Vision]: Feature Measurement

General Terms
Algorithms, Experimentation, Performance

Keywords
Affective Image Retrieval; Image Emotion; Multi-Graph Learning

1. INTRODUCTION
With the rapid development of social networks, such as Twitter and Weibo, people tend to express and share their opinions and emotions online using text and images. Understanding the emotions implied in this growing repository of data is of vital importance to the behavioral sciences [12] and enables a wide range of applications, such as human decision making and brand monitoring [1].


Figure 1: The emotions of different kinds of images are dominated by different features: (a) Fear: aesthetic features (low saturation, cool color, low contrast of hue); (b) Excitement: attributes (snow, skiing); (c) Sadness: semantic concepts described by adjective noun pairs (broken car); (d) Contentment: facial expressions (happiness).


While sentiment analysis in text is relatively mature [12], emotion analysis in images is still in its infancy, due to subjective evaluation and the "affective gap" [4]. Recent research has mainly focused on affective image classification, trying to find features that classify emotions better. Lu et al. [8] investigated the computability of emotion through shape features. Inspired by psychology and art theory, Machajdik et al. [10] extracted features of different levels, including color and texture as low-level features, composition as mid-level features, and human faces and skin as high-level features. Zhao et al. [21] proposed to extract principles-of-art based emotion features, which are more interpretable by humans and have a stronger link to emotions than elements-of-art based ones. Mid-level scene attributes have been used for binary sentiment classification [19]. A large-scale visual sentiment ontology and detectors have been proposed to detect 1,200 high-level adjective noun pairs [1]. Besides, social correlation features have been extracted for social network images [6].

Meanwhile, low-level visual features, such as SIFT, are widely used for content-based image retrieval. However, they do not perform well for image retrieval at the affective level [16].

In this paper, we concentrate on affective image retrieval and investigate the performance of different features on different kinds of images, since emotions can be conveyed by various features, as shown in Figure 1. For example, principles-of-art features, such as balance and contrast, mainly contribute to the emotions of artistic and abstract images [21]. Facial expressions may determine the emotion of images containing faces, while semantic concepts described by adjective noun pairs, such as beautiful flower, play an important role in expressing the emotions of many natural images.



Table 1: Summary of the extracted features of different levels. '#' indicates the dimension of each feature.

  Level | Generality | Short Description                                                                      | #
  Low   | Generic    | GIST, HOG2x2, self-similarity and geometric context color histogram features [17][13] | 512+6300+6300+3920
  Low   | Special    | Color: mean saturation, brightness and hue, PAD, colorfulness and color names [10]    | 1+1+4+3+1+11
        |            | Texture: Tamura, wavelet and GLCM [10]                                                | 3+12+12
  Mid   | Generic    | Attributes [13][19]                                                                    | 102
  Mid   | Special    | Principles-of-art: balance, emphasis, harmony, variety, gradation, movement [21]      | 60+18+2+60+9+16
  High  | Generic    | Semantic concepts based on adjective noun pairs [1]                                   | 1200
  High  | Special    | Facial expressions [18]                                                                | 8

[Figure 2: framework diagram. Each image in the dataset yields six kinds of features: generic and elements-of-art based (low-level), attributes and principles-of-art based (mid-level), adjective noun concepts and facial expressions (high-level). Each feature kind builds one of Graphs 1-6; the graphs are fused into a single graph, which returns the retrieval result for a query.]

Figure 2: The framework of the proposed affective image retrieval.

Features of three different levels and two generalities are extracted for each image, based on related research in multimedia and computer vision. We then combine these features in a multi-graph learning framework [15] for the affective image retrieval task. The framework of the proposed method is illustrated in Figure 2.

2. METHOD
In this section, we introduce the proposed method, including the extracted emotion features and the multi-graph learning model.

2.1 Emotion Features

2.1.1 Low-level Features
Low-level features are difficult for humans to interpret. For generic features, we choose HOG2x2, self-similarity, GIST and geometric context color histogram features as in [17][13], because each is individually powerful and they describe distinct visual phenomena from a scene perspective.

We also extract special features derived from elements of art, including color and texture [10]. Low-level color features include mean saturation and brightness, vector-based mean hue, emotional coordinates (pleasure, arousal and dominance) based on brightness and saturation, colorfulness, and color names. Low-level texture features include Tamura texture, wavelet textures and gray-level co-occurrence matrix (GLCM) based texture [10].
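As a rough illustration of the simplest of these special color features, the sketch below computes mean saturation, mean brightness and a colorfulness score with OpenCV and NumPy. It is only a sketch under stated assumptions: the colorfulness measure shown is the Hasler-Süsstrunk variant, which may differ from the exact measure used in [10], and the hue, PAD and color-name features are omitted.

```python
import cv2
import numpy as np

def basic_color_features(image_path):
    """Mean saturation, mean brightness and a colorfulness score (illustrative subset of [10])."""
    bgr = cv2.imread(image_path)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    mean_saturation = hsv[..., 1].mean() / 255.0   # S channel, rescaled to [0, 1]
    mean_brightness = hsv[..., 2].mean() / 255.0   # V channel, rescaled to [0, 1]

    # Hasler-Suesstrunk colorfulness; assumed here, [10] may use a different measure.
    b, g, r = [c.astype(np.float32) for c in cv2.split(bgr)]
    rg, yb = r - g, 0.5 * (r + g) - b
    colorfulness = np.sqrt(rg.std() ** 2 + yb.std() ** 2) \
                   + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return np.array([mean_saturation, mean_brightness, colorfulness])
```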

2.1.2 Mid-level Features
Mid-level features are more semantic, more interpretable by humans, and have a stronger link to emotions than low-level features [21]. For generic features, we choose five types of attributes that people tend to use when describing scenes: (1) materials (metal), (2) surface properties (dirty), (3) functions or affordances (reading), (4) spatial envelope attributes (cluttered) and (5) object presence (flowers). Based on the criteria of decent detection accuracy, potential correlation with an emotion and ease of understanding, we choose 102 attributes as in [13]. All four generic low-level features are used to train attribute classifiers on 14,340 images from the SUN database [17]. Consensus labels (2 or 3 votes) are taken as positive and all others as negative. SVMs implemented in the Liblinear toolbox (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) are used as classifiers for efficiency. Finally, we obtain a 102-dimensional binary vector of attribute features.
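To make the attribute stage concrete, here is a minimal sketch using scikit-learn's LinearSVC (which wraps Liblinear) as a stand-in for the toolbox above; the feature matrix, the 102-column consensus label matrix and the choice of C are assumptions, not the authors' exact setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_attribute_classifiers(X_train, A_train, C=1.0):
    """Train one linear SVM per attribute.

    X_train: (num_images, dim) concatenated generic low-level features.
    A_train: (num_images, 102) binary consensus labels (>= 2 votes -> 1).
    """
    classifiers = []
    for a in range(A_train.shape[1]):
        clf = LinearSVC(C=C)
        clf.fit(X_train, A_train[:, a])
        classifiers.append(clf)
    return classifiers

def attribute_vector(classifiers, x):
    """102-dimensional binary attribute vector for one image's feature vector x."""
    return np.array([clf.predict(x.reshape(1, -1))[0] for clf in classifiers])
```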

Features inspired by the principles of art, including balance, contrast, harmony, variety, gradation and movement, are extracted as special features; please refer to [21] for details.

2.1.3 High-level Features
High-level features are the semantic contents contained in images. People can easily understand the emotions conveyed in images by recognizing these semantics. We choose concepts described by 1,200 adjective noun pairs (ANPs) [1] as generic features. The ANPs are detected by SentiBank, a large detector library [1] trained on about 500k images downloaded from Flickr using GIST, a 3 × 256 dimensional color histogram, a 53-dimensional LBP descriptor, a Bag-of-Words descriptor quantized with a 1,000-word dictionary, a 2-layer spatial pyramid and max pooling, and a 2,000-dimensional attribute vector. Liblinear SVMs are used as classifiers and early fusion is adopted. The advantage of ANPs is that, compared with nouns, they carry stronger emotions, and compared with adjectives, they are more detectable. A 1,200-dimensional binary vector is obtained.

Meanwhile, we extract 8 kinds of facial expressions (anger, contempt, disgust, fear, happiness, sadness, surprise, neutral) [9] as special high-level features. Compositional features built from local Haar appearance features by a minimum-error-based optimization strategy are embedded into an improved AdaBoost algorithm [18][20]. Trained on the CK+ database [9], the method performs well even on low-intensity expressions [18]. As some images do not contain faces, we first run face detection using the Viola-Jones algorithm [14] implemented in OpenCV; the expression of an image in which no face is detected is set to neutral. Finally, we obtain an 8-dimensional vector, each element of which counts the occurrences of the corresponding facial expression in the image.
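The face-related part of this pipeline can be sketched as below. The OpenCV Viola-Jones cascade is used for detection as in the text; the expression classifier of [18] is not reimplemented here and is passed in as a hypothetical `classify_expression` function returning an index into the eight categories.

```python
import cv2
import numpy as np

EXPRESSIONS = ["anger", "contempt", "disgust", "fear",
               "happiness", "sadness", "surprise", "neutral"]

def expression_histogram(image_path, classify_expression):
    """8-dimensional vector counting the facial expressions detected in one image.

    classify_expression(face_patch) -> index into EXPRESSIONS is a placeholder for
    the boosted compositional-feature classifier of [18]. Images with no detected
    face are treated as neutral, as described above.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    hist = np.zeros(len(EXPRESSIONS), dtype=int)
    if len(faces) == 0:
        hist[EXPRESSIONS.index("neutral")] = 1
        return hist
    for (x, y, w, h) in faces:
        hist[classify_expression(gray[y:y + h, x:x + w])] += 1
    return hist
```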

2.2 Multi-Graph Learning
Multi-graph learning [15] is an extension of single-graph learning and is widely used for reranking in various domains. In each graph, the vertices represent images and the edges reflect the similarities between image pairs.

Suppose we have N graphs G_1, G_2, ..., G_N, where each G_n is an affinity matrix and G_{n,ij} is the similarity between the i-th and j-th images in the n-th graph.




Given the labels Y and the similarity matrices, we aim to obtain the relevance scores r. The N graphs can be integrated into a regularization framework [15],

\[ Q(r) = \sum_{n=1}^{N} \alpha_n \left( \sum_{i,j} G_{n,ij} \left| \frac{r_i}{\sqrt{\Lambda_{n,ii}}} - \frac{r_j}{\sqrt{\Lambda_{n,jj}}} \right|^2 + \mu_n \sum_i \left| r_i - Y_i \right|^2 \right), \qquad r = \arg\min_r Q(r), \tag{1} \]

where α = [α_1, ..., α_N] is a weight vector with α_n > 0 and Σ_{n=1}^{N} α_n = 1, and Λ_n is a diagonal matrix with Λ_{n,ii} = Σ_j G_{n,ij}. The closed-form solution of Eq. (1) is

\[ r = \left( I + \frac{\sum_{n=1}^{N} \alpha_n L_n}{\sum_{n=1}^{N} \alpha_n \mu_n} \right)^{-1} Y, \tag{2} \]

where L_n = Λ_n^{-1/2} (Λ_n - G_n) Λ_n^{-1/2} is the normalized graph Laplacian. For computational efficiency, α is treated as a variable and Eq. (1) is optimized with respect to both r and α [15]. To avoid selecting only extreme points, a relaxation is made as follows [15]:

\[ Q(r, \alpha) = \sum_{n=1}^{N} \alpha_n^{\lambda} \left( \sum_{i,j} G_{n,ij} \left| \frac{r_i}{\sqrt{\Lambda_{n,ii}}} - \frac{r_j}{\sqrt{\Lambda_{n,jj}}} \right|^2 + \mu_n \sum_i \left| r_i - Y_i \right|^2 \right), \qquad [r, \alpha] = \arg\min_{r, \alpha} Q(r, \alpha). \tag{3} \]

Based on the fact that Q is convex with respect to both r and α [15], we can iteratively update r and α to minimize Q(r, α). The detailed iterative solution algorithm and its convergence proof can be found in [15]. The optimal choice of λ (λ > 1) depends on the complementarity of the features; in practice, λ is chosen by cross-validation.

When dealing with a single kind of feature, multi-graph learning reduces to single-graph learning: N = 1 and α = 1, so there is no need to optimize α and we can directly use the closed-form solution for r in Eq. (2) with α = 1.
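To make the optimization concrete, below is a minimal NumPy sketch of Eqs. (2) and (3). It is not the authors' code: the per-graph cost Q_n, the λ-based α update (the standard Lagrangian solution of the relaxed simplex-constrained problem, following the general scheme of [15]), the µ_n values and the iteration count are all assumptions.

```python
import numpy as np

def normalized_laplacian(G):
    """L = Lambda^{-1/2} (Lambda - G) Lambda^{-1/2}, with Lambda_ii = sum_j G_ij."""
    d = G.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.diag(d_inv_sqrt) @ (np.diag(d) - G) @ np.diag(d_inv_sqrt)

def relevance_scores(graphs, weights, mu, Y):
    """Closed form of Eq. (2): r = (I + sum_n w_n L_n / sum_n w_n mu_n)^{-1} Y."""
    Y = np.asarray(Y, dtype=float)
    L = sum(w * normalized_laplacian(G) for w, G in zip(weights, graphs))
    denom = sum(w * m for w, m in zip(weights, mu))
    return np.linalg.solve(np.eye(len(Y)) + L / denom, Y)

def fuse_graphs(graphs, Y, mu, lam=2.0, iters=10):
    """Alternating minimization of Eq. (3): closed-form r step, then alpha step."""
    Y = np.asarray(Y, dtype=float)
    N = len(graphs)
    alpha = np.full(N, 1.0 / N)
    for _ in range(iters):
        # r step: Eq. (3) is Eq. (1) with weights alpha_n^lambda, so reuse Eq. (2).
        r = relevance_scores(graphs, alpha ** lam, mu, Y)
        # Per-graph cost (up to a constant factor on the smoothness term).
        Q = np.array([r @ normalized_laplacian(G) @ r + m * np.sum((r - Y) ** 2)
                      for G, m in zip(graphs, mu)])
        # alpha step: minimize sum_n alpha_n^lam * Q_n s.t. sum_n alpha_n = 1 (assumed form).
        alpha = Q ** (-1.0 / (lam - 1.0))
        alpha /= alpha.sum()
    return r, alpha
```

A query then simply supplies Y as defined in Eq. (4) below and ranks the images by the returned relevance scores r.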

As our goal in this paper is to retrieve images that are emotionally similar to the query image, i.e. a retrieval task, we set Y_i as

\[ Y_i = \begin{cases} 1, & \text{if image } i \text{ is the query image}, \\ 0, & \text{otherwise}. \end{cases} \tag{4} \]

As described in Section 2.1, we adopt 6 kinds of features, so we can construct 6 single graphs (see Figure 2). As each kind of feature (e.g. low-level special) is composed of different components (e.g. color, texture), we use the average similarity as the overall similarity. For the m-th component (m = 1, ..., M_n) of the n-th kind of features (n = 1, ..., 6), the sub-graph is generated by

\[ G^{m}_{n,ij} = \begin{cases} \exp\!\left( -\dfrac{d(x^{m}_{i,n},\, x^{m}_{j,n})}{\sigma^{m}_{n}} \right), & \text{if } i \neq j, \\ 0, & \text{otherwise}, \end{cases} \tag{5} \]

where d(·, ·) is a specified distance function, x^m_{i,n} is the feature vector of the m-th component of the n-th kind of features for image i, and σ^m_n is set to the average distance in the distance matrix of this component. Here the Euclidean distance is used for d(·, ·), and the final n-th graph is G_n = (1/M_n) Σ_{m=1}^{M_n} G^m_n.
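A small sketch of this graph construction, assuming NumPy and SciPy; the component feature matrices are placeholders for the features of Section 2.1.

```python
import numpy as np
from scipy.spatial.distance import cdist

def subgraph(features):
    """Affinity matrix of Eq. (5) for one feature component.

    features: (num_images, dim) matrix for the m-th component of the n-th feature kind.
    sigma is the average pairwise Euclidean distance, as stated above.
    """
    D = cdist(features, features, metric="euclidean")
    sigma = max(D.mean(), 1e-12)
    G = np.exp(-D / sigma)
    np.fill_diagonal(G, 0.0)   # G_ij = 0 when i == j
    return G

def feature_kind_graph(components):
    """G_n: average of the M_n component sub-graphs of one feature kind (e.g. color, texture)."""
    return sum(subgraph(F) for F in components) / len(components)
```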

3. EXPERIMENTS
To evaluate the effectiveness of the proposed affective image retrieval method, we carry out experiments on several datasets.

3.1 Datasets
Subset A of the IAPS dataset (IAPSa) includes 395 images [11], selected from a standard emotion-evoking image set in psychology, the International Affective Picture System (IAPS) [7].

The artistic dataset (ArtPhoto) consists of 806 artistic photographs retrieved from a photo sharing site by searching for emotion categories [10].

The abstract dataset (Abstract) includes 228 peer-rated abstract paintings without contextual content [10].

The images in these three datasets (IAPSa, ArtPhoto, Abstract) are categorized into eight discrete categories [11]: Anger, Disgust, Fear, Sad, Amusement, Awe, Contentment, and Excitement.

The Geneva affective picture database (GAPED) consists of 520 negative images, 121 positive images and 89 neutral images [2].

The tweet dataset (Tweet) includes 470 positive tweets and 133 negative tweets [1], crawled from PeopleBrowsr with 21 hashtags.

3.2 Evaluation Criteria
We employ several widely used measurements, as in [3].

(1) Nearest neighbor rate (NN) is the percentage of queries for which the first returned image and the query image express the same emotion, indicating performance as a nearest neighbor classifier.

(2) First tier (FT) and second tier (ST) are defined as the recall of the top N retrieval results. For a class with m relevant images in the whole dataset, N = m for FT and N = 2m for ST. That is, FT = n_m/m and ST = n_{2m}/m, where n_m and n_{2m} are the numbers of relevant images among the top m and 2m retrieval results.

(3) Precision-recall curve. Precision and recall are defined as Pre = n/q and Rec = n/m, where n is the number of relevant images among the first q retrieval results.

(4) F1 score (F1). F1 is an accuracy measure based on precision and recall, defined as F1 = 2·Pre·Rec/(Pre + Rec). Here we consider the top 20 retrieval results.

(5) Discounted cumulative gain (DCG) [5] weights relevant results by their rank, assuming that users pay more attention to results near the top of the list.

(6) Average normalized modified retrieval rank (ANMRR) [3] is a rank-based metric that takes into account the ranking positions of the relevant images among the retrieved images.

All six measurements range from 0 to 1. A higher value indicates better performance for the first five measurements, while a lower value indicates better performance for ANMRR.
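For concreteness, a minimal sketch of how the simpler measurements could be computed for a single query is given below; function names are illustrative, ANMRR is omitted (its definition follows [3]), and a normalized DCG variant is assumed so that the value lies in [0, 1] as stated above.

```python
import numpy as np

def retrieval_metrics(ranked_labels, query_label, num_relevant, top_k=20):
    """NN, FT, ST and F1@top_k for one query (query image excluded from the ranking)."""
    relevant = np.asarray(ranked_labels) == query_label
    nn = float(relevant[0])
    ft = relevant[:num_relevant].sum() / num_relevant        # recall in top m
    st = relevant[:2 * num_relevant].sum() / num_relevant    # recall in top 2m
    pre = relevant[:top_k].sum() / top_k
    rec = relevant[:top_k].sum() / num_relevant
    f1 = 0.0 if pre + rec == 0 else 2 * pre * rec / (pre + rec)
    return nn, ft, st, f1

def ndcg(ranked_labels, query_label):
    """Normalized DCG with binary gains and a log2 discount (assumed variant of [5])."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    ideal = np.sort(rel)[::-1]
    denom = (ideal * discounts).sum()
    return float((rel * discounts).sum() / denom) if denom > 0 else 0.0
```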

3.3 Results and Discussion
Each image in the dataset is selected in turn as the query image and the average performance is computed. The precision-recall curves of the different methods are shown in Figure 4, where 'LowGen', 'LowArt', 'MidAtt', 'MidArt', 'HighANP' and 'HighExp' denote single-graph learning based on low-level generic, low-level special, mid-level generic, mid-level special, high-level generic and high-level special features, respectively, while 'Fusion' is the proposed multi-graph learning method. The comparison on the other evaluation measurements is illustrated in Figure 3. Some detailed retrieval results of the proposed method are shown in Figure 5.

From these results, we can conclude that: (1) Different features perform differently on different kinds of images. Although they work well for traditional CBIR, low-level generic features are not a good choice for affective image retrieval. (2) Generally, fusing different kinds of features outperforms any single-feature based method, because multi-graph learning can exploit the complementarity of different features. (3) For abstract images, features inspired by art theory outperform the others, as few attributes, concepts or faces are detected. (4) The emotions in the IAPS, GAPED and Tweet datasets are often evoked by the presence of a certain object or by specific semantics; therefore, the high-level features, especially the semantic concepts described by ANPs, perform well. (5) The overall performance of all methods is still not satisfying; there is still a long way to go before we can accurately understand image emotions and retrieve emotionally similar images.



[Figure 3: bar charts of NN, FT, ST, F1, DCG and ANMRR for LowGen, LowArt, MidAtt, MidArt, HighANP, HighExp and Fusion on (a) IAPSa, (b) Abstract, (c) ArtPhoto, (d) GAPED, (e) Tweet.]

Figure 3: Average performance comparison of different methods for several measurements on different datasets.

[Figure 4: precision-recall curves of LowGen, LowArt, MidAtt, MidArt, HighANP, HighExp and Fusion on (a) IAPSa, (b) Abstract, (c) ArtPhoto, (d) GAPED, (e) Tweet.]

Figure 4: Average precision-recall curve comparison of different methods on different datasets.

[Figure 5 panels: (a) Amusement, (b) Sadness, (c) Contentment, (d) Fear]

Figure 5: Some retrieval results of the proposed method. The first image of each group is the query image. Correct retrieval results are surrounded by thick green frames and incorrect ones by dashed red frames.


4. CONCLUSION
In this paper, we proposed to use multi-graph learning for affective image retrieval, based on features of different levels and different generalities that may contribute to expressing image emotions. We tested the performance of the individual features and combined them by multi-graph learning. Experiments conducted on five datasets demonstrated that the proposed method is effective in retrieving emotionally similar images. It should be noted that the method can easily be extended to more kinds of features and learning methods. In future work we will first classify images into different types and then select the best features for affective image retrieval for each type. A comparison with the state of the art will be added in an extended version.

5. ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (No. 61071180) and its Key Program (No. 61133003). Sicheng Zhao was also supported by the Ph.D. Short-Term Overseas Visiting Scholar Program of Harbin Institute of Technology.

6. REFERENCES
[1] D. Borth et al. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In ACM MM, 2013.
[2] E. S. Dan-Glauser and K. R. Scherer. The Geneva affective picture database (GAPED): a new 730-picture database focusing on valence and normative significance. Behavior Research Methods, 43(2):468-477, 2011.
[3] Y. Gao et al. 3-D object retrieval and recognition with hypergraph analysis. IEEE TIP, 21(9):4290-4303, 2012.
[4] A. Hanjalic. Extracting moods from pictures and sounds: Towards truly personalized TV. IEEE Signal Processing Magazine, 23(2):90-100, 2006.
[5] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM TOIS, 20(4):422-446, 2002.
[6] J. Jia et al. Can we understand van Gogh's mood? Learning to infer affects from images in social networks. In ACM MM, 2012.
[7] P. Lang et al. International Affective Picture System (IAPS): Affective ratings of pictures and instruction manual. NIMH, Center for the Study of Emotion & Attention, 2005.
[8] X. Lu et al. On shape and the computability of emotions. In ACM MM, 2012.
[9] P. Lucey et al. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In CVPR Workshops, 2010.
[10] J. Machajdik and A. Hanbury. Affective image classification using features inspired by psychology and art theory. In ACM MM, 2010.
[11] J. Mikels et al. Emotional category data on images from the International Affective Picture System. Behavior Research Methods, 37(4):626-630, 2005.
[12] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.
[13] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[14] P. Viola et al. Robust real-time face detection. IJCV, 57(2):137-154, 2004.
[15] M. Wang et al. Unified video annotation via multigraph learning. IEEE TCSVT, 19(5):733-746, 2009.
[16] W. Wang et al. A survey on emotional semantic image retrieval. In ICIP, 2008.
[17] J. Xiao et al. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[18] P. Yang et al. Exploring facial expressions with compositional features. In CVPR, 2010.
[19] J. Yuan et al. Sentribute: Image sentiment analysis from a mid-level perspective. In WISDOM, 2013.
[20] S. Zhao et al. Video indexing and recommendation based on affective analysis of viewers. In ACM MM, 2011.
[21] S. Zhao et al. Exploring principles-of-art features for image emotion recognition. In ACM MM, 2014.
