
MINING VISUALNESS

Zheng Xu1∗, Xin-Jing Wang2, Chang Wen Chen3

1University of Science and Technology of China, Hefei, 230027, P.R. China
2Microsoft Research Asia, Beijing, 100080, P.R. China

3State University of New York at Buffalo, Buffalo, NY, 14260, U.S.A.

[email protected], [email protected], [email protected]

ABSTRACT

To understand which concepts are visualizable and to what extent they can be visualized is worthwhile for multimedia and computer vision research. Unfortunately, few previous works have ever touched such topics. In this paper, we propose a unified model to automatically identify visual concepts and estimate their visual characteristics, or visualness, from a large-scale image dataset. To this end, an image heterogeneous graph is first built to integrate various visual features, and then a simultaneous ranking and clustering algorithm is introduced to generate visually and semantically compact image clusters, named visualsets. Based on the visualsets, visualizable concepts are discovered and their visualness scores are estimated. The experimental results demonstrate the effectiveness of the proposed scheme.

Index Terms— visualness, visualsets, image heterogeneous graph, ranking, clustering

1. INTRODUCTION

Despite decades of successful research on multimedia and computer vision, the semantic gap between low-level visual features and high-level semantic concepts remains a problem. Instead of generating more powerful features or learning more intelligent models, researchers have started to investigate which concepts can be more easily modeled by existing visual features [1, 2, 3].

Understanding to what extent a concept has visual characteristics, i.e. "visualness", is valuable in many ways. For instance, it can benefit recent research efforts on constructing image databases [4, 5]. These efforts generally attempt to attach images onto pre-defined lexical ontologies, yet existing ontologies were built without taking visual characteristics into consideration. Knowing which concepts are more likely to find relevant images will help save labor and control noise in database construction. Visualness estimation is also useful for image-to-text [2, 6] and text-to-image [7] translation; e.g., words of more visualizable concepts are potentially better annotations for an image.

∗This work was performed at Microsoft Research Asia.

Fig. 1. Examples of simple concepts (left) and compound concepts (right) with estimated visualness (e.g., "Cordless phone" 5.01, "Egypt Sphinx" 4.06, "Offshore oilfield" 4.00, "Eagle" 1.90, "Pink" 0.75).

Despite its usefulness, a general solution to visualness estimation faces many challenges: 1) It is unknown which concepts or which types of concepts are visualizable, i.e. whether representative images can be found to visualize their semantics. For instance, "dignity" and "fragrant" are both abstract words, but the former is more difficult to visualize, as "fragrant" is closely related to visual concepts such as flowers and fruits; 2) Different visual concepts have diverse visual compactness and consistency, especially for collective nouns (e.g., "animal") and ambiguous concepts (e.g., "apple", which may represent a kind of fruit or a company); and 3) Even though a concept is highly visualizable, it may still be difficult to capture its visual characteristics due to the semantic gap, e.g., "tiny teacup chihuahua".

Few previous works in the literature have touched this research topic. To the best of our knowledge, Yanai et al. [1] were the first to propose the visualness of a concept; they estimate the visualness of adjectives with image region entropy. Zhu et al. [7] measure the "picturability" of a keyword by the ratio of the number of images to that of webpages retrieved by commercial search engines. Lu et al. [2] attempt to identify a concept lexicon with small semantic gaps by clustering images based on visual and textual features and then extracting the most frequent keywords occurring in the image clusters. Berg et al. [8] utilize SVM classifiers to measure the visualness of attributes.


Fig. 2. The proposed framework of visualness mining: a) build an image heterogeneous graph from web images with multi-type features; b) mine visualsets from the image heterogeneous graph by iteratively performing i) ranking nodes, ii) posterior estimation, and iii) cluster refinement; c) estimate visualness with the visualsets.

Sun et al. [9] investigate whether a tag well represents its annotated images in the social tagging scenario, by leveraging the visual compactness of the annotated images and their visual distinctness from the entire image database. Jeong et al. [3] quantify the visualness of complex concepts of the form "adjective (attribute) + noun (concept)" by integrating the intra-cluster purity, inter-cluster purity, and entropy of images retrieved from commercial search engines.

There are two major disadvantages of previous works: 1) the concept lists are pre-defined, and 2) concepts are simply assumed to be prototypical. Two notable works are reported in [2] and [3]. Lu et al. [2] mine concepts with small semantic gaps automatically from an image dataset, but assume each concept to be prototypical. In contrast, Jeong et al. [3] focus on "complex concepts" that imply reliable prototypical concepts, but their concept vocabulary is pre-defined. In fact, many concepts are semantically ambiguous (e.g., "apple" as a fruit or a company) or have multiple senses (e.g., "snake" as a noun meaning the reptile or as a verb suggesting a snake-like pattern).

In this paper, we attempt to discover and quantify the visualness of concepts automatically from a large-scale dataset. The quantitative measure of a concept is based on visual and semantic synsets (which we call "visualsets"), rather than on a single image cluster or keyword as in previous works. Visualsets disambiguate the semantics of a concept and ensure visual compactness and consistency, inspired by the synsets in ImageNet [4] and Visual Synsets [6]. In our approach, a visualset is a group of visually similar images (which we call "member images") and related words, both scored by their membership probabilities. Visualsets contain prototypical visual cues as well as prototypical semantic concepts. Given the visualsets, the visualness of a concept is modeled as a mixture distribution over its corresponding visualsets. Moreover, we discover both simple concepts (keywords) and compound concepts (combinations of unique keywords) simultaneously from the generated visualsets; see Fig. 1.

2. VISUALNESS MINING

In this section, we describe our approach to quantifying the visualness of concepts.

2.1. The Framework

Our approach contains three steps: 1) build an image heterogeneous graph with attribute nodes generated from multi-type features; 2) mine visualsets from the heterogeneous graph with an iterative clustering-ranking algorithm; and 3) estimate the visualness of concepts with the visualsets. Fig. 2 illustrates the entire process.

Given a (noisily) tagged image dataset such as a web image collection, we connect the images into a graph to facilitate the clustering approach for visualsets mining, as shown in Fig. 2(a). Specifically, we extract multiple types of visual features and textual features for the images to generate attribute nodes (see Fig. 2(a1)). The edges of the graph are defined by links between images and attribute nodes (see Fig. 2(a2)) instead of by image similarities, which are generally adopted in previous works [2]. That is, images are implicitly connected if they share some attribute nodes.

Then, an iterative ranking-clustering approach is applied to form visual and textual synsets, i.e. visualsets, as shown in Fig. 2(b). Each iteration starts with a guess of the image clusters (see Fig. 2(b1)). Based on this guess, we score and rank each image as well as the attribute nodes in each visualset (Fig. 2(b2)). Images are then mapped to the feature space defined by the visualset mixture model (Fig. 2(b3)). Clusters are refined based on the estimated posteriors, which gives the guess of the image clusters for the next iteration.

After the clustering-ranking approach converges, we estimate the visualness of concepts (simple and compound) from the visualsets based on the final scores of images and attribute nodes, as shown in Fig. 2(c).

2.2. Building An Image Heterogeneous Graph

Given images $\mathcal{X} = \{x_1, x_2, \ldots, x_{|\mathcal{X}|}\}$, the visual features GIST, color histogram, and SIFT are extracted for each image.


Fig. 3. Examples of image clusters generated by different visual features: (a) color histogram clusters, (b) GIST clusters. They suggest how different visual features favor distinct visual properties: color histogram captures intensity and contrast, while GIST prefers structure and edges.

Fig. 4. A heterogeneous sub-graph centered on one image. Each image has only one set-based attribute node for each type of global feature (h, g), and multiple word-based attribute nodes from local or textual features (w, t).

We exploit multiple types of visual features because they capture different perspectives of visual properties that compensate for each other (Fig. 3). Our proposed visualness mining method is independent of the visual features selected.

We perform k-means clustering on each of the three types of visual features and treat the cluster centers as attribute nodes. We represent each image with one attribute node each from GIST and color histogram (so-called set-based attribute nodes), and with bags of words from SIFT and text (so-called word-based attribute nodes). Fig. 4 illustrates the star structure centered on an image.

Denote the attribute nodes from GIST, color histogram, SIFT, and text as $\mathcal{G} = \{g_1, g_2, \ldots, g_{|\mathcal{G}|}\}$, $\mathcal{H} = \{h_1, h_2, \ldots, h_{|\mathcal{H}|}\}$, $\mathcal{W} = \{w_1, w_2, \ldots, w_{|\mathcal{W}|}\}$, and $\mathcal{T} = \{t_1, t_2, \ldots, t_{|\mathcal{T}|}\}$ respectively. The set of attribute nodes is $\mathcal{A} = \mathcal{G} \cup \mathcal{H} \cup \mathcal{W} \cup \mathcal{T}$.

We define the heterogeneous image graph $G = \langle \mathcal{V}, \Omega \rangle$ as a weighted undirected graph, where $\mathcal{V} = \mathcal{X} \cup \mathcal{A}$ are the nodes and $\Omega = \{\omega_{x_i n_j} \mid x_i \in \mathcal{X}, n_j \in \mathcal{V}\}$ defines the edges, with

$$\omega_{x_i n_j} = \begin{cases} 1, & n_j \in \mathcal{G} \cup \mathcal{H},\ x_i \sim n_j \\ c, & n_j \in \mathcal{W} \cup \mathcal{T},\ \mathrm{freq}(x_i, n_j) = c \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

where $x_i \sim n_j$ denotes a link between $x_i$ and $n_j$, and $\mathrm{freq}(x_i, n_j)$ represents the frequency with which (visual) word $n_j$ occurs in image $x_i$.
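As a concrete illustration of Eq. 1, the following is a minimal sketch (not the authors' code) that assembles the image-to-attribute edge weights into a sparse matrix; the per-image fields `gist_id`, `color_id`, `visual_words`, and `tags` are hypothetical names for the quantized features described above.

```python
from scipy.sparse import lil_matrix

def build_edge_weights(images, n_attr):
    """Edge weights of Eq. 1: weight 1 for the linked set-based nodes,
    the occurrence count c for word-based nodes, and 0 otherwise.
    `images`: list of dicts with hypothetical fields
        'gist_id', 'color_id'   -> indices of set-based attribute nodes
        'visual_words', 'tags'  -> lists of word-based attribute node indices
    `n_attr`: total number of attribute nodes |A| (all types share one index space).
    """
    W = lil_matrix((len(images), n_attr))
    for i, img in enumerate(images):
        # set-based attribute nodes (GIST, color histogram): weight 1
        W[i, img['gist_id']] = 1.0
        W[i, img['color_id']] = 1.0
        # word-based attribute nodes (SIFT visual words, tags): frequency c
        for j in img['visual_words'] + img['tags']:
            W[i, j] += 1.0
    return W.tocsr()
```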

2.3. Mining Visualsets

Algorithm 1 The Visualsets Mining Algorithm
Input: $G = \langle \mathcal{V}, \Omega \rangle$
Output: $\mathcal{C}$
Initialization: obtain $\{C_k^{(0)}\}_{k=1}^{K}$ by randomly partitioning the images
Iteration:
for each $t \in [1, N]$ do
    Ranking:
        compute $P(S \mid C_k^{(t)})$ and $P(A \mid C_k^{(t)})$ by Eq. 2 and Eq. 3
        compute $P(\mathcal{X} \mid C_k^{(t)})$ by Eq. 4
    Clustering:
        estimate $p(C_k^{(t)})$ and $p(C_k^{(t)} \mid x_i)$ by Eq. 7
        refine $C_k^{(t)}$ by Section 2.3.3
end for
return $\mathcal{C} = \{C_k^{(N)} \mid k = 1, \ldots, K\}$

Denote the set of $K$ visualsets as $\mathcal{C} = \{C_k \mid k = 1, \ldots, K\}$. Topologically $C_k \subset G$, i.e. a subgraph defined on the member images of $C_k$ and the related attribute nodes. Each node of $C_k$ is associated with a score indicating its importance to $C_k$. The images in $C_k$ are assumed to be consistent in visual appearance, while the texts in $C_k$ form a synset with disambiguated semantics; together they construct a visually and semantically compact and disambiguated visualset. $\mathcal{C}$ defines a generative mixture model for images on which visualness can be measured.

We adopt the ranking-clustering framework [10] to generate $\mathcal{C}$. The framework provides a method that can simultaneously partition a heterogeneous graph and weight the nodes in each cluster, and hence is suitable for visualsets mining. Meanwhile, estimating the authority of nodes by ranking during image clustering helps handle the noisy web data. In cluster refinement, we choose a representative image instead of the mean as the center to further facilitate the process.

Algorithm 1 outlines our method. We initialize the algorithm with a random partition of $\mathcal{X}$ and assign a uniform score to each image $x_i \in C_k^{(0)}$. We propagate the scores via the edges $\Omega$ of the graph $G$ to rank the attribute nodes $n_j \in \mathcal{A}$, which in turn updates the scores of the $x_i$ in each visualset $C_k$. This is the ranking step. Images are then mapped to the measurement space defined by the mixture model of visualsets, and the partition of $\mathcal{X}$ is refined according to the posteriors. This is the clustering step. The process iterates until convergence, which gives the final visualsets $\mathcal{C}$. We detail the iteration process below.

2.3.1. Ranking nodes

Denote $\mathcal{X}_{C_k}$ as the member images of $C_k$ and $P(\mathcal{X}_{C_k})$ as the authority of the member images in $C_k$. Denote set-based attribute nodes as $S \in \{\mathcal{G}, \mathcal{H}\}$ and word-based attribute nodes as $A \in \{\mathcal{W}, \mathcal{T}\}$. Let $P(S \mid C_k)$ and $P(A \mid C_k)$ be the authority of the elements of $S$ and $A$ in $C_k$. Let $W_{S C_k}$ and $W_{A C_k}$ be the weight matrices of the edges between the attribute nodes $S$, $A$ and the member images $\mathcal{X}_{C_k}$ respectively. We have $W_{S C_k} = W_{C_k S}^{T}$ and $W_{A C_k} = W_{C_k A}^{T}$.

Assuming that different types of attribute nodes are independent, we first compute the authority scores of each type of attribute node respectively. We adopt a HITS-like [11] iterative algorithm on the bipartite graph between each type of word-based attribute node $A$ and $\mathcal{X}_{C_k}$ to get $P(A \mid C_k)$ as in Eq. 2. We introduce a smoothing term $\lambda_A$ into $P(A \mid C_k)$ to tolerate errors in the clustering used to generate the attribute nodes.

$$P(\mathcal{X}_{C_k}) = \frac{W_{C_k A} \cdot P(A \mid C_k)}{\| W_{C_k A} \cdot P(A \mid C_k) \|_1}, \qquad
P(A \mid C_k) = (1 - \lambda_A)\,\frac{W_{A C_k} \cdot P(\mathcal{X}_{C_k})}{\| W_{A C_k} \cdot P(\mathcal{X}_{C_k}) \|_1} + \lambda_A \frac{1}{|A|} \quad (2)$$
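A minimal NumPy sketch of the smoothed HITS-like update in Eq. 2, under the assumption that `W_kA` is the dense sub-matrix of edge weights between the member images of $C_k$ and one type of word-based attribute node; the variable names are ours, not the authors'.

```python
import numpy as np

def rank_word_nodes(W_kA, lam_A=0.1, n_iter=50, eps=1e-12):
    """HITS-like iteration of Eq. 2 on the image / word-node bipartite graph.
    W_kA: |X_Ck| x |A| edge-weight matrix for one word-based node type.
    Returns (P(X_Ck), P(A|Ck)) as L1-normalized score vectors."""
    n_img, n_attr = W_kA.shape
    p_attr = np.full(n_attr, 1.0 / n_attr)   # uniform initialization
    p_img = np.full(n_img, 1.0 / n_img)
    for _ in range(n_iter):
        p_img = W_kA @ p_attr
        p_img /= p_img.sum() + eps           # L1 normalization
        prop = W_kA.T @ p_img
        p_attr = (1.0 - lam_A) * prop / (prop.sum() + eps) + lam_A / n_attr
    return p_img, p_attr
```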

On the other hand, the authority $P(S \mid C_k)$ of each type of set-based attribute node $S$ is estimated with the help of the word-based attribute nodes $A$, bridged by $\mathcal{X}_{C_k}$. We involve $A$ to rank $S$ because word-based attribute nodes are either semantic keywords or visual words that indicate semantic and visual similarity. The scoring function is given in Eq. 3 below.

$$P(A \mid C_k) = \frac{W_{A C_k} \cdot W_{C_k S} \cdot P(S \mid C_k)}{\| W_{A C_k} \cdot W_{C_k S} \cdot P(S \mid C_k) \|_1}, \qquad
P(S \mid C_k) = (1 - \lambda_S)\,\frac{W_{S C_k} \cdot W_{C_k A} \cdot P(A \mid C_k)}{\| W_{S C_k} \cdot W_{C_k A} \cdot P(A \mid C_k) \|_1} + \lambda_S \frac{1}{|S|} \quad (3)$$

We then estimate the authority of each image $x_i \in \mathcal{X}$ in $C_k$, i.e. $p(x_i \mid C_k)$. Since the attribute node types $Q \in \{\mathcal{G}, \mathcal{H}, \mathcal{W}, \mathcal{T}\}$ are assumed independent, we have

$$p(x_i \mid C_k) = \frac{1}{Z} \prod_{Q} p(x_i \mid Q, C_k) \quad (4)$$

where

$$p(x_i \mid Q, C_k) = \frac{\sum_{n_j \in Q} \omega_{x_i n_j}\, p(n_j \mid C_k)}{\sum_{n_j \in Q} \omega_{x_i n_j}} \quad (5)$$

and $Z = \sum_{x_i \in \mathcal{X}} p(x_i \mid C_k)$ is the normalization factor.
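The per-image authority of Eqs. 4-5 then combines the node scores across the four attribute types; below is a minimal sketch where `weights` maps each type to its image-by-node weight matrix and `node_scores` to the corresponding $p(n_j \mid C_k)$ vector (a data layout we assume purely for illustration).

```python
import numpy as np

def image_authority(weights, node_scores, eps=1e-12):
    """Eqs. 4-5: combine per-type node scores into p(x_i|C_k).
    weights:     dict type -> |X| x |Q| edge-weight matrix (NumPy arrays)
    node_scores: dict type -> |Q| vector of p(n_j|C_k)
    Returns the L1-normalized |X| vector p(x_i|C_k)."""
    p = None
    for q, W_q in weights.items():
        num = W_q @ node_scores[q]            # sum_j w_ij * p(n_j|C_k)
        den = W_q.sum(axis=1)                 # sum_j w_ij
        p_q = num / (den + eps)               # Eq. 5
        p = p_q if p is None else p * p_q     # product over types, Eq. 4
    return p / (p.sum() + eps)                # 1/Z normalization
```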

2.3.2. Posterior estimation

We then estimate the posterior $p(C_k \mid x_i)$, $C_k \in \mathcal{C}$, $x_i \in \mathcal{X}$, based on which the image clusters are refined. We use the EM algorithm to maximize the likelihood (Eq. 6) of generating the images from the mixture model built on the visualsets.

$$\log L = \sum_{i=1}^{|\mathcal{X}|} \log p(x_i) = \sum_{i=1}^{|\mathcal{X}|} \log \Big( \sum_{k=1}^{K} p(x_i \mid C_k)\, p(C_k) \Big) \quad (6)$$

Introducing the $p(C_k)$ as hidden variables, $p(C_k \mid x_i)$ and $p(C_k)$ are estimated iteratively as

$$p(C_k \mid x_i) = \frac{p(x_i \mid C_k)\, p(C_k)}{\sum_{k=1}^{K} p(x_i \mid C_k)\, p(C_k)}, \qquad
p(C_k) = \frac{1}{|\mathcal{X}|} \sum_{i=1}^{|\mathcal{X}|} p(C_k \mid x_i) \quad (7)$$
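The updates of Eq. 7 are a standard E/M alternation; here is a minimal sketch assuming `p_x_given_c` is the |X| x K matrix of image authorities $p(x_i \mid C_k)$ produced by the ranking step (our naming).

```python
import numpy as np

def em_posteriors(p_x_given_c, n_iter=20, eps=1e-12):
    """Eq. 7: iteratively estimate the posteriors p(C_k|x_i) and priors p(C_k).
    p_x_given_c: |X| x K matrix of p(x_i|C_k)."""
    n_img, K = p_x_given_c.shape
    prior = np.full(K, 1.0 / K)                                  # initial p(C_k)
    post = np.full((n_img, K), 1.0 / K)
    for _ in range(n_iter):
        joint = p_x_given_c * prior                              # p(x_i|C_k) p(C_k)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)  # p(C_k|x_i)
        prior = post.mean(axis=0)                                # p(C_k) update
    return post, prior
```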

2.3.3. Cluster refinement

After posterior estimation, each image $x_i$ is represented as a $K$-dimensional vector $p(x_i) = (p(C_1 \mid x_i), p(C_2 \mid x_i), \ldots, p(C_K \mid x_i))$. The cluster refinement reassigns $x_i$ to the nearest visualset as a member image. The distance of $x_i$ to visualset $C_k$ is measured against the cluster center $x_{C_k}$ by

$$\mathrm{dist}(p(x_i), p(x_{C_k})) = 1 - \frac{p(x_i)^{\prime} \cdot p(x_{C_k})}{\| p(x_i) \|_2 \cdot \| p(x_{C_k}) \|_2} \quad (8)$$

where $x_{C_k}$ is the most representative image of $C_k$, chosen according to the posterior by Eq. 9:

$$x_{C_k} = \arg\max_{x \in \mathcal{X}_{C_k}} p(C_k \mid x) \quad (9)$$

We then assign $x_i \in C_k$ if $\forall k' \in \{1, 2, \ldots, K\}$, $k' \neq k$, $\mathrm{dist}(p(x_i), p(x_{C_k})) < \mathrm{dist}(p(x_i), p(x_{C_{k'}}))$.
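Cluster refinement by Eqs. 8-9 thus picks the highest-posterior image of each visualset as its representative and reassigns every image by cosine distance over the posterior vectors; below is a minimal sketch where `post` is the |X| x K posterior matrix and `members` lists the current member-image indices per visualset (our own layout, assumed for illustration).

```python
import numpy as np

def refine_clusters(post, members, eps=1e-12):
    """Eqs. 8-9: reassign each image to the visualset whose representative
    image is nearest in cosine distance.
    post:    |X| x K matrix of p(C_k|x_i)
    members: list of K integer arrays of current member-image indices."""
    # Eq. 9: representative image of each visualset (highest posterior)
    centers = np.stack([post[idx[np.argmax(post[idx, k])]]
                        for k, idx in enumerate(members)])
    # Eq. 8: cosine distance between every image and every center
    a = post / (np.linalg.norm(post, axis=1, keepdims=True) + eps)
    b = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + eps)
    dist = 1.0 - a @ b.T                      # |X| x K distance matrix
    return np.argmin(dist, axis=1)            # new hard assignment of images
```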

2.4. Visualness estimation

After the visualsets converge, we compute the final scores of the textual words $\mathcal{T}$ by

$$P(\mathcal{T} \mid C_k) = \frac{W_{\mathcal{T}\mathcal{X}} \cdot P(\mathcal{X} \mid C_k)}{\| W_{\mathcal{T}\mathcal{X}} \cdot P(\mathcal{X} \mid C_k) \|_1} \quad (10)$$

$P(\mathcal{T} \mid C_k)$ measures the importance of words in a semantically and visually compact visualset.

We then discover concepts and estimate their visualness in the following way. We form a concept as $t = \{t_l\}_{l=1}^{L}$, $t_l \in \mathcal{T}$; $L = 1$ indicates a simple concept, whereas $L > 1$ identifies a compound concept. The associated score of concept $t$ with visualset $C_k$ is $p(t \mid C_k) = \min_{t_l \in t} p(t_l \mid C_k)$. The visualness $V(t)$ of concept $t$ is measured on the highly relevant visualsets, i.e. those with $p(t \mid C_k)$ larger than a threshold $\tau$. Define the indicator function as

$$I(t, C_k) = \begin{cases} 1, & p(t \mid C_k) > \tau \\ 0, & \text{otherwise} \end{cases} \quad (11)$$

The visualness score $V(t)$ of concept $t$ is a combination of the associated scores on the relevant visualsets:

$$V(t) = \frac{\sum_{k=1}^{K} p(t \mid C_k) \cdot p(C_k) \cdot I(t, C_k)}{\sum_{k=1}^{K} I(t, C_k)} \quad (12)$$
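The concept scoring of Eqs. 11-12 reduces to a thresholded weighted average; a minimal sketch follows, where `p_t_given_c` holds the word scores $p(t_l \mid C_k)$ from Eq. 10 and `prior` the converged visualset priors $p(C_k)$ (our naming).

```python
import numpy as np

def visualness(concept, p_t_given_c, prior, tau=0.04):
    """Eqs. 11-12: visualness of a simple (one word) or compound concept.
    concept:     list of tag-word indices t_l
    p_t_given_c: |T| x K matrix of p(t_l|C_k) from Eq. 10
    prior:       length-K vector of visualset priors p(C_k)."""
    p_concept = p_t_given_c[concept].min(axis=0)   # p(t|C_k) = min over its words
    relevant = p_concept > tau                     # indicator I(t, C_k), Eq. 11
    if not relevant.any():
        return 0.0                                 # no relevant visualset
    return float((p_concept * prior)[relevant].sum() / relevant.sum())  # Eq. 12
```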


Fig. 5. Examples of visualsets; the top-ranked images and terms are shown (e.g., visualset 59: alley, street, building, light, Italy; 91: canyon, antelope, Arizona, page, slot; 406: butterfly, flower, nature; 530: ceiling 12.77, glass 12.63, British 10.59, museum 10.45, London 10.45; 886: red, orange; 1120: foals, horses, pony; 1223: giant, panda, zoo, California, bear; 1338: gymnastics 85.40, senior 4.76; 1404: herd, horse, gaucho, criollo, Argentina).

3. EXPERIMENT

3.1. Dataset

The NUS-WIDE dataset [12], containing 269,648 images and 5,018 unique tag words from Flickr, is used in our experiments. Two types of global features, a 64-D color histogram and 512-D GIST, are extracted. Each type of global feature is further clustered into 2000 clusters by k-means clustering, whose centers form the set-based attribute nodes of the image heterogeneous graph. Local SIFT features are also extracted and clustered into 2000 visual words by k-means clustering, based on which the word-based attribute nodes are generated.
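For reference, this attribute-node generation step can be sketched with scikit-learn's k-means; the sketch below assumes the global descriptors and per-image SIFT descriptors have already been extracted into NumPy arrays (our illustration, not the authors' exact pipeline).

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_features(global_feats, local_feats, n_clusters=2000, seed=0):
    """Cluster features into attribute nodes as in Section 3.1.
    global_feats: |X| x d matrix of one global feature (e.g. 512-D GIST)
    local_feats:  list of n_i x d SIFT descriptor matrices, one per image."""
    # Set-based attribute nodes: one cluster id per image
    km_g = KMeans(n_clusters=n_clusters, random_state=seed).fit(global_feats)
    set_node_ids = km_g.labels_                        # length-|X| vector
    # Word-based attribute nodes: a bag of visual words per image
    km_l = KMeans(n_clusters=n_clusters, random_state=seed).fit(np.vstack(local_feats))
    bags = [km_l.predict(d) for d in local_feats]      # list of word-id arrays
    return set_node_ids, bags
```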

3.2. Visualsets

First, we evaluate the performance of our visualset generation approach presented in Section 2.3. Setting the number of visualsets to $K = 3000$ and the smoothing parameters to $\lambda_S = 0.3$ for set-based attribute nodes and $\lambda_A = 0.1$ for word-based attribute nodes (the optimal parameters found via experiments), visualsets mining finishes in fewer than 30 iterations.

Fig. 5 shows some randomly selected visualsets. For each visualset, the top-ranked images and top-ranked words are displayed. Generally, the images in a visualset are visually compact and consistent, and are semantically related to the top-ranked words. Meanwhile, the associated scores of the words illustrate the effectiveness of visualsets. For example, the five words of visualset 530 are highly relevant to the member images and have close scores. In contrast, there is a big jump between the scores of the top two words in visualset 1338, as "gymnastics" is apparently much more relevant to the images than "senior". Moreover, visualsets 1120 and 1404 suggest that learning synsets is valuable: both visualsets are about horses, but 1120 focuses on foals whereas 1404 is about a herd of horses.

Visualsets also reveal relationships between concepts. For example, visualset 59 may mean that an "alley" is a narrow "street" between "buildings". Visualset 91 turns out to suggest "antelope canyon" in "Arizona". Visualset 406, on the other hand, implies frequent co-occurrences of "butterfly" and "flower", while visualset 1223 has very close scores for "giant" and "panda", which suggests the compound concept "giant panda" for the panda images.

Fig. 6. Visualness of simple concepts (top: Saw 11.01, Tollbooth 9.18, Foal 8.61, Cordless 8.28, Cheerleader 8.19, Cellphone 8.06, Firefighter 7.57, Tennis 6.90, Donkey 6.83, Pavilion 6.40, Telephone 6.33, Beard 6.30, Gymnastics 6.03, Foxhole 5.88, Camel 5.77; bottom: Asian 3.10E-2, October 2.64E-2, Colorful 2.53E-2, Ladybug 4.89E-2, Available 1.01E-4). Top images and words from the corresponding visualsets are also shown.

3.3. Visualness

Now we evaluate the effectiveness of discovering simple and compound concepts as well as estimating their visualness. We set the threshold $\tau = 0.04$ and normalize all visualness scores.

3.3.1. Simple concepts

We find that 2,424 simple concepts out of the 5,018 unique tag words in NUS-WIDE are visualizable, i.e., their visualness scores exceed $\tau = 0.04$. Fig. 6 illustrates the top and bottom simple concepts, their visualness scores, and the top images and words from the corresponding visualsets. A concept may relate to multiple visualsets; e.g., the two visualsets of "tollbooth" show its semantics in daytime and at night respectively. From Fig. 6, we can see that concepts with more compact visualsets generally have higher visualness scores; e.g., "Asian" has a very low visualness score because its relevant visualsets (we show only two due to space limits) are quite different in their visual and textual features.


Fig. 7. Visualness of animals (Camel 5.77, Goat 4.02, Moose 4.02, Rabbit 3.58, Bear 3.35, Giraffe 3.31, Lion 3.12, Antelope 2.63, Animal 2.59).

Fig. 8. Visualness of compound concepts (top: London graffiti 5.82, Cordless phone 5.01, Wedding ceremony 4.66, Wedding bride 4.24, Egypt sphinx 4.06, Offshore oilfield 4.00, Sky flags 3.86, Car pavilions 3.76, Sport tennis 3.68, Horses foals 3.61, Snow avalanche 3.57; bottom: Cute animal 0.98, Vacation outdoors 2.30E-2, Pretty things 1.01E-4, Surprise things 1.01E-4).

Though "ladybug" is intuitively a highly visualizable concept, it is scored low because there are few ladybug images in the NUS-WIDE dataset.¹ We also found the following concepts to be non-visualizable: "positive", "planar", "reflected", "fund", "sold", "vernacular", "grief". Despite the dataset effect, generally speaking, abstract concepts are difficult to visualize.

Fig. 7 shows more results on animals. A specific concept like "moose" is more visualizable than a general concept like "animal". "Antelope" is ranked in between as it is an ambiguous concept which can indicate either a landmark or an animal.

3.3.2. Compound concepts

We discovered 26,378 visualizable compound concepts from NUS-WIDE. Fig. 8 shows a few examples. Each compound concept is a combination of two unique tag words. The top visualizable compound concepts are composed of closely related terms rather than following the restricted "adjective+noun" pattern studied in [3]; e.g., the landmark "Sphinx" is closely related to the country where it is located, "Egypt", and the event "wedding" co-occurs with the character "bride". A compound concept may relate to multiple visualsets as well, e.g., "London graffiti". The bottom-ranked compound concepts are generally combinations of general or abstract words. For instance, it is difficult to capture the compact visual properties of "pretty things" and "surprise things".

¹Our method is able to take the popularity of concepts into consideration through the priors $p(C_k)$ of the visualsets in Eq. 12, which is reasonable because, in the web scenario, a concept is less visualizable if it has fewer relevant images.

4. CONCLUSION

To answer the question "how well can a concept be represented in the real visual world?", we present in this paper a novel visualness mining method to automatically discover and measure visualizable concepts. The method consists of three steps: 1) build an image heterogeneous graph with multiple types of visual and textual features; 2) generate visualsets with a simultaneous ranking and clustering algorithm; and 3) discover visualizable simple and compound concepts with visualness estimation based on the visualsets. Evaluated on a large-scale benchmark dataset, our method achieves promising results.

5. REFERENCES

[1] K. Yanai and K. Barnard, "Image region entropy: A measure of 'visualness' of web images associated with one concept," in ACM Multimedia, 2005.
[2] Y. Lu, L. Zhang, Q. Tian, and W. Y. Ma, "What are the high-level concepts with small semantic gaps?," in CVPR, 2008.
[3] J. W. Jeong, X. J. Wang, and D. H. Lee, "Towards measuring the visualness of a concept," in CIKM, 2012.
[4] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[5] X. J. Wang, Z. Xu, L. Zhang, C. Liu, and Y. Rui, "Towards indexing representative images on the web," in ACM Multimedia Brave New Idea Track, 2012.
[6] D. Tsai, Y. Jing, Y. Liu, H. A. Rowley, S. Ioffe, and J. M. Rehg, "Large-scale image annotation using visual synset," in ICCV, 2011.
[7] X. Zhu, A. B. Goldberg, M. Eldawy, C. R. Dyer, and B. Strock, "A text-to-picture synthesis system for augmenting communication," in AAAI, 2007.
[8] T. L. Berg, A. C. Berg, and J. Shih, "Automatic attribute discovery and characterization from noisy web data," in ECCV, 2010.
[9] A. Sun and S. S. Bhowmick, "Quantifying tag representativeness of visual content of social images," in ACM Multimedia, 2010.
[10] Y. Sun, Y. Yu, and J. Han, "Ranking-based clustering of heterogeneous information networks with star network schema," in KDD, 2009.
[11] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol. 46, no. 5, pp. 604-632, 1999.
[12] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in CIVR, 2009.